[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170110T0000).
[00:00:04] odder: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:01:34] Operations, Discovery, Maps, WMF-Legal, Interactive-Sprint: Define tile usage policy - https://phabricator.wikimedia.org/T141815#2929477 (debt) Thanks for the update, @Slaporte !
[00:02:50] Operations, Traffic, Wikimedia-Mailing-lists: convert lists.wikimedia.org certificate to LetsEncrypt (deadline:2017-03-02) - https://phabricator.wikimedia.org/T154917#2929479 (RobH) Chatted with Faidon and indeed, the file is in modules/role/templates/exim/exim4.conf.mailman.erb: # TLS tls_certifica...
[00:03:02] jouncebot: You're such a smart bot :*
[00:04:09] Hi. odder: I'll take the swat in 10 minutes if nobody is there before
[00:04:51] Dereckson: Cool!
[00:13:52] (PS2) Tim Landscheidt: mailman: Indent @ssl_settings in Apache configuration [puppet] - https://gerrit.wikimedia.org/r/329742
[00:23:18] (PS4) Dereckson: Add Collection namespace to the Polish Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/331240 (https://phabricator.wikimedia.org/T154711) (owner: Odder)
[00:24:12] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/331240 (https://phabricator.wikimedia.org/T154711) (owner: Odder)
[00:24:46] (Merged) jenkins-bot: Add Collection namespace to the Polish Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/331240 (https://phabricator.wikimedia.org/T154711) (owner: Odder)
[00:25:02] (CR) jenkins-bot: Add Collection namespace to the Polish Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/331240 (https://phabricator.wikimedia.org/T154711) (owner: Odder)
[00:25:17] * odder waiting...
[00:26:09] on the task I see Nemo_bis offers "This seems similar to the usage of the "Portal" namespace in several Wikisources, so it could use 106 (and have an alias from "Portal")."
[00:26:53] could you write a note on the task why 124/125 is more suitable?
[00:27:56] it's not, it's just as arbitrary as picking 106/107
[00:28:38] k
[00:29:06] note it's still worthwhile to reply that
[00:31:45] Dereckson: this also needs https://www.mediawiki.org/wiki/Manual:UpdateArticleCount.php to be run
[00:32:43] namespacesDupe you mean?
[00:32:54] (CR) Dzahn: tendril: use Letsencrypt for SSL cert (1 comment) [puppet] - https://gerrit.wikimedia.org/r/330829 (https://phabricator.wikimedia.org/T133717) (owner: Dzahn)
[00:33:35] that as well if you wish
[00:33:54] ok
[00:37:10] (CR) Dzahn: tendril: use Letsencrypt for SSL cert (1 comment) [puppet] - https://gerrit.wikimedia.org/r/330829 (https://phabricator.wikimedia.org/T133717) (owner: Dzahn)
[00:43:26] odder: live on mwdebug1002
[00:44:29] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[00:44:30] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0
[00:45:54] odder: you can ask on wiki to check these pages also: https://phabricator.wikimedia.org/T154711#2929541
[00:46:09] (PS6) Dzahn: tendril: use Letsencrypt for SSL cert [puppet] - https://gerrit.wikimedia.org/r/330829 (https://phabricator.wikimedia.org/T133717)
[00:46:27] odder: an issue with links in 4 pages, all fixed
[00:47:18] Yeah, this looks weird to me.
[00:48:10] They had WS as an alias to NS_PROJECT anyway in there.
[00:48:40] ok
[00:51:25] (CR) Dzahn: tendril: use Letsencrypt for SSL cert (3 comments) [puppet] - https://gerrit.wikimedia.org/r/330829 (https://phabricator.wikimedia.org/T133717) (owner: Dzahn)
[00:51:59] PROBLEM - puppet last run on analytics1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:52:38] odder: so it looks good on mwdebug1002 ?
[00:57:43] Dereckson: Yup! Looks ok.
[00:57:48] * Dereckson nods
[00:58:11] Syncing
[00:58:44] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Add Collection namespace to the Polish Wikisource (T154711) (duration: 00m 41s)
[00:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:58:48] T154711: Add Kolekcja: namespace to Polish Wikisource - https://phabricator.wikimedia.org/T154711
[00:59:18] `mwscript namespaceDupes.php plwikisource` gives 0 pages to fix, 0 were resolvable. 0 links to fix, 0 were resolvable.
[00:59:38] !log Fixed links with namespaceDupes on pl.wikisource
[00:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:54] now the other update script for content namespaces
[01:00:03] Yup, just so the stats are correct.
[01:00:30] That will be 491 100 articles
[01:01:02] !log Updated articles count on pl.wikisource: 491 100 (T154711)
[01:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:20] SWAT done.
[01:01:41] * odder bows
[01:01:43] Thanks for the config change odder, and sorry for the delay
[01:01:54] No worries, thanks for staying up this late.
[01:03:03] You're welcome.
[01:03:10] * odder off to bed, night all
[01:03:19] 'night
[01:05:20] I sure hope odder's config change didn't cause https://m.wikipedia.org/
[01:07:01] enterprisey: not in fatalmonitor
[01:07:07] ah
[01:07:08] no it didn't
[01:07:15] cool
[01:07:19] this error was in fatalmonitor at SWAT start
[01:07:29] I was trying to figure what it was
[01:07:39] oh, cool
[01:08:02] Uncaught exception 'ConfigException' with message 'GlobalVarConfig::get: undefined option: 'ServiceWiringFiles'' in /srv/mediawiki/php-1.29.0-w
[01:08:10] mf.7/includes/config/GlobalVarConfig.php:53
[01:09:07] * Dereckson tries something on mwdebug1002
[01:11:37] (PS2) Dzahn: ganglia: display deprecation banner [puppet] - https://gerrit.wikimedia.org/r/331097 (https://phabricator.wikimedia.org/T145659) (owner: Filippo Giunchedi)
[01:13:35] Operations, netops: cr2-esams<->cr2-eqiad link down - https://phabricator.wikimedia.org/T154952#2929438 (faidon) p:High>Low Level3 hasn't responded yet but this came back up at 00:44.
[01:16:00] enterprisey: could be caused by 6481031722540cf6231d9be5aef6f1c4f62d9a06 Move MimeAnalyzer params to ServiceWiring, or by cf6931f83be7ce "MWServices load new ServiceWiringFiles after ExtRegistry load"
[01:16:45] Dereckson: unfortunately, I'm not a MW dev :) - I'm just relaying a complaint from #wikipedia
[01:16:48] enterprisey: well but that's 23 and 24 November
[01:17:21] (PS7) Dzahn: tendril: use Letsencrypt for SSL cert [puppet] - https://gerrit.wikimedia.org/r/330829 (https://phabricator.wikimedia.org/T133717)
[01:17:21] enterprisey: they say when it started to be down?
[01:17:58] 2017-01-09
[01:17:58] 23:38 reedy@tin: Synchronized wmf-config/CommonSettings.php: wfLoadExtension (duration: 00m 40s)
[01:18:01] 23:34 reedy@tin: Synchronized wmf-config/extension-list: More to extension.json (duration: 00m 40s)
[01:18:04] nah, I'll ask
[01:18:05] oh
[01:18:11] Reedy: ^
[01:18:22] Hm?
[01:18:25] Was it me?
:O
[01:18:35] Had someone not deployed something?
[01:18:53] Dereckson: they don't remember when the error started
[01:19:05] heck, I don't think anybody goes to that URL on a regular basis
[01:19:07] Reedy: the last deployment before the entry pops in fatalmonitor is yours
[01:19:10] 23:38 reedy@tin: Synchronized wmf-config/CommonSettings.php: wfLoadExtension (duration: 00m 40s)
[01:19:13] 23:34 reedy@tin: Synchronized wmf-config/extension-list: More to extension.json (duration: 00m 40s)
[01:19:34] Well, the commonsettings change could've deployed someone else's code
[01:20:10] Looking at that patch Dereckson I really dont think that should have caused an issue that looks like this
[01:20:15] Reedy: afterwards, I deployed an IS change for the odder patch, *but* the error was already in fatalmonitor when I called the console at start of the SWAT
[01:20:54] (PS1) Jdlrobson: Minerva should apply known template hacks in production [mediawiki-config] - https://gerrit.wikimedia.org/r/331425 (https://phabricator.wikimedia.org/T94102)
[01:21:27] (PS8) Dzahn: tendril: use Letsencrypt for SSL cert [puppet] - https://gerrit.wikimedia.org/r/330829 (https://phabricator.wikimedia.org/T133717)
[01:21:49] * Dereckson checks on logstash the start date
[01:21:59] RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[01:22:00] 23:35 UC
[01:22:02] UTC
[01:22:15] 23:34 reedy@tin: Synchronized wmf-config/extension-list: More to extension.json (duration: 00m 40s)
[01:22:18] 23:29 demon@tin: Synchronized multiversion: Final batch of MWVersion cleanup (in song form) (duration: 00m 56s)
[01:22:31] so CS change is indeed out of the equation
[01:23:12] Can you open a task, and dump the full stack trace?
[01:23:16] yes
[01:23:45] Dereckson: thanks for following this uP!
[01:23:47] *up
[01:25:25] (CR) Dzahn: [C: 1] "ok, now it also compiles http://puppet-compiler.wmflabs.org/5063/" [puppet] - https://gerrit.wikimedia.org/r/330829 (https://phabricator.wikimedia.org/T133717) (owner: Dzahn)
[01:25:54] https://phabricator.wikimedia.org/T154960
[01:26:35] I didn't realise it was m.wikipedia.org
[01:26:41] I guess it's chads change
[01:26:54] (CR) Dzahn: [C: 2] tendril: use Letsencrypt for SSL cert [puppet] - https://gerrit.wikimedia.org/r/330829 (https://phabricator.wikimedia.org/T133717) (owner: Dzahn)
[01:26:55] let's see
[01:30:54] it seems it's when it does wfLoadExtension( 'PagedTiffHandler' );
[01:31:12] I'm just gonna revert chads patch
[01:31:24] !log reedy@tin Synchronized multiversion: revert 0a2a0966011932bb503ee8a292fd75161fe029f7 (duration: 00m 56s)
[01:32:00] PHP fatal error: Call to undefined method MWMultiVersion::getMediaWiki()
[01:32:06] (PS1) Dzahn: delete ganglia SSL cert, not needed anymore [puppet] - https://gerrit.wikimedia.org/r/331427 (https://phabricator.wikimedia.org/T154939)
[01:32:08] When loading https://commons.wikimedia.org/wiki/Commons:Village_pump#Structured_data_on_Commons_Funding
[01:32:27] Yes
[01:32:28] !log reedy@tin Synchronized w: revert 0a2a0966011932bb503ee8a292fd75161fe029f7 (duration: 00m 40s)
[01:32:45] Dereckson: Fixed
[01:32:49] It was chads patch
[01:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:33:09] Thanks for the revert
[01:33:24] (PS1) Reedy: Revert "Remove MWVersion, fold its two functions into MWMultiVersion" [mediawiki-config] - https://gerrit.wikimedia.org/r/331428
[01:33:28] Was that "Call to undefined method: MWMultiVersion::getMediaWiki()"?
[01:33:31] sorry for joining late
[01:33:31] (CR) Reedy: [C: 2] Revert "Remove MWVersion, fold its two functions into MWMultiVersion" [mediawiki-config] - https://gerrit.wikimedia.org/r/331428 (owner: Reedy)
[01:33:35] ah, k
[01:34:18] (Merged) jenkins-bot: Revert "Remove MWVersion, fold its two functions into MWMultiVersion" [mediawiki-config] - https://gerrit.wikimedia.org/r/331428 (owner: Reedy)
[01:34:32] (CR) jenkins-bot: Revert "Remove MWVersion, fold its two functions into MWMultiVersion" [mediawiki-config] - https://gerrit.wikimedia.org/r/331428 (owner: Reedy)
[01:34:35] * Josve05a is always late to the party. Either that, or ten months before any one else... :p
[01:34:45] !log reedy@tin Synchronized rpc/RunJobs.php: revert (duration: 00m 40s)
[01:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:59] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[01:36:59] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:37:22] Dereckson: carry on
[01:37:59] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[01:37:59] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[01:37:59] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[01:38:59] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:40:51] "Call to undefined method: MWMultiVersion::getMediaWiki()"
[01:40:57] That's a lie, unless a server is out of sync
[01:41:57] I BLAME EVERYTHING ELSE AND NOT MY CHANGE
[01:41:59] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[01:41:59] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[01:41:59] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:42:06] ostriches: on fatalmonitor it decreases slowly (from 240 to 193), and it's this: Call to undefined method MWMultiVersion::getMediaWiki() in /srv/mediawiki/rpc/RunJobs.php on line 31
[01:42:13] 186 now
[01:42:23] I swear the function exists.
[01:42:30] I blame out of sync servers or something
[01:42:32] Operations, Traffic, Patch-For-Review: convert tendril to use Letsencrypt for SSL cert (deadline 2017-03-17) - https://phabricator.wikimedia.org/T154938#2929669 (Dzahn) {F5263826} @jcrespo Tendril is now switched over to Letsencrypt. I double-checked the LDAP auth part after the Apache changes as y...
[01:42:34] * ostriches shrugs
[01:42:40] Whatever, we're reverting I guess.
[01:42:48] No cleanup ever!
[01:42:51] Techdebt 4 life
[01:43:10] Operations, Traffic, Patch-For-Review: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2929672 (Dzahn)
[01:43:12] Operations, Traffic, Patch-For-Review: convert tendril to use Letsencrypt for SSL cert (deadline 2017-03-17) - https://phabricator.wikimedia.org/T154938#2929671 (Dzahn) Open>Resolved
[01:43:19] Operations, Traffic: convert tendril to use Letsencrypt for SSL cert (deadline 2017-03-17) - https://phabricator.wikimedia.org/T154938#2929008 (Dzahn)
[01:43:28] Operations, DBA, Traffic: convert tendril to use Letsencrypt for SSL cert (deadline 2017-03-17) - https://phabricator.wikimedia.org/T154938#2929008 (Dzahn)
[01:43:45] * ostriches fumes
[01:43:52] That's super dumb
[01:43:55] Operations, Traffic, Patch-For-Review: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2240497 (Dzahn)
[01:43:59] RECOVERY - Eqiad HTTP 5xx reqs/min
on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:44:18] Few things enrage me more than our shitty entry points.
[01:44:26] Fucking multiversion
[01:44:29] And MediaWiki
[01:44:34] Can't do anything right
[01:44:54] and life
[01:45:59] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:45:59] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:46:34] (CR) Dzahn: [C: 2] delete ganglia SSL cert, not needed anymore [puppet] - https://gerrit.wikimedia.org/r/331427 (https://phabricator.wikimedia.org/T154939) (owner: Dzahn)
[01:50:29] PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[acme-setup-acme-tendril]
[01:50:55] that would be because dbmonitor2001 is using the tendril class which i changed
[01:51:17] but that server is not in production yet, tendril is still on einsteinium as of now
[01:51:24] but meh anyways
[01:51:58] like with so many other 2001-things need to add something in Hiera about the "active" server
[01:53:41] ACKNOWLEDGEMENT - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[acme-setup-acme-tendril] daniel_zahn T154938
[01:55:59] did extension registry break the site?
[01:56:01] I'm here now.
[01:56:09] PROBLEM - puppet last run on dbmonitor1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[acme-setup-acme-tendril]
[01:56:58] legoktm: Either that or my MWVersion refactor
[01:56:58] ACKNOWLEDGEMENT - puppet last run on dbmonitor1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 seconds ago with 1 failures.
Failed resources (up to 3 shown): Exec[acme-setup-acme-tendril] daniel_zahn T154938
[01:57:11] But I'm having trouble seeing how my refactor would cause that stacktrace
[01:57:21] #2 /srv/mediawiki/php-1.29.0-wmf.7/includes/registration/ExtensionRegistry.php(97): MediaWiki\MediaWikiServices::getInstance()
[01:57:21] #3 /srv/mediawiki/php-1.29.0-wmf.7/includes/registration/ExtensionRegistry.php(87): ExtensionRegistry->__construct()
[01:57:24] (the count for the fatal, while bad, indicates an out-of-sync apache more than anything to me)
[01:57:41] were all the fatals from one host?
[01:57:50] I don't know I wasn't in front of my keyboard at the time
[01:57:53] Just caught up
[01:57:55] I think ExtensionRegistry depending upon MWServices is probably a bad idea
[01:57:59] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:58:10] legoktm: And I think MWMultiVersion relies on neither!
[01:58:15] at least that early in setup process
[02:00:09] Operations, DBA, Traffic: convert tendril to use Letsencrypt for SSL cert (deadline 2017-03-17) - https://phabricator.wikimedia.org/T154938#2929707 (Dzahn) So everything is fine on einsteinium, which is still prod, just (unsurprisingly) it doesn't work on dbmonitor1001/2001 which are going to replace...
[02:00:57] Operations, Traffic, Patch-For-Review: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2929712 (Dzahn)
[02:00:59] Operations, DBA, Traffic: convert tendril to use Letsencrypt for SSL cert (deadline 2017-03-17) - https://phabricator.wikimedia.org/T154938#2929711 (Dzahn) Resolved>Open
[02:02:09] legoktm: issue apparently occurred on m.wikipedia.org when MediaWiki tried to load PagedTiffHandler
[02:02:18] Operations, DBA, Traffic: convert tendril to use Letsencrypt for SSL cert (deadline 2017-03-17) - https://phabricator.wikimedia.org/T154938#2929008 (Dzahn) just reopened for that follow-up task on dbmonitor. still resolved on einsteinium, can be considered done for tracking task, old cert can be remo...
[02:02:27] I think PagedTiffHandler is a red herring, that was just the first extension to be loaded
[02:02:38] hence
[02:02:39] #3 /srv/mediawiki/php-1.29.0-wmf.7/includes/registration/ExtensionRegistry.php(87): ExtensionRegistry->__construct()
[02:03:25] I checked CS, you're right, PagedTiffHandler is the first wfLoadExtension call
[02:03:48] Operations, DNS, Domains, Traffic: Donate wiktionary.pl to the Foundation - https://phabricator.wikimedia.org/T154826#2929715 (Dzahn) a:CRoslof>None @tomasz yea, i guess let me add it to DNS repo and link to parking
[02:04:18] Operations, DNS, Domains, Traffic: Donate wiktionary.pl to the Foundation - https://phabricator.wikimedia.org/T154826#2929717 (Dzahn) a:Dzahn
[02:05:37] (PS1) Dzahn: add wiktionary.pl as parked domain [dns] - https://gerrit.wikimedia.org/r/331429 (https://phabricator.wikimedia.org/T154826)
[02:07:12] (PS2) Dzahn: add wiktionary.pl as parked domain [dns] - https://gerrit.wikimedia.org/r/331429 (https://phabricator.wikimedia.org/T154826)
[02:07:25] Operations, DBA, Traffic: convert tendril to use Letsencrypt for SSL cert (deadline 2017-03-17) -
https://phabricator.wikimedia.org/T154938#2929008 (Krenair) what about doing the same thing as https://gerrit.wikimedia.org/r/#/c/331242/ ?
[02:11:03] (CR) Dzahn: [C: 2] add wiktionary.pl as parked domain [dns] - https://gerrit.wikimedia.org/r/331429 (https://phabricator.wikimedia.org/T154826) (owner: Dzahn)
[02:11:16] (PS3) Dzahn: add wiktionary.pl as parked domain [dns] - https://gerrit.wikimedia.org/r/331429 (https://phabricator.wikimedia.org/T154826)
[02:13:12] Operations, DNS, Domains, Traffic, Patch-For-Review: Donate wiktionary.pl to the Foundation - https://phabricator.wikimedia.org/T154826#2929759 (Dzahn) a:Dzahn>CRoslof done from the ops side
[02:18:20] (PS1) Dzahn: disable Letsencrypt cert (do_acme: false) on dbmonitor* [puppet] - https://gerrit.wikimedia.org/r/331430 (https://phabricator.wikimedia.org/T154938)
[02:18:50] Operations, DNS, Domains, Traffic, Patch-For-Review: Donate wiktionary.pl to the Foundation - https://phabricator.wikimedia.org/T154826#2929783 (Dzahn) @tomasz thanks for the donation
[02:22:25] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.7) (duration: 07m 46s)
[02:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:25:59] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[02:26:42] Operations, DBA, Traffic, Patch-For-Review: convert tendril to use Letsencrypt for SSL cert (deadline 2017-03-17) - https://phabricator.wikimedia.org/T154938#2929786 (Dzahn) >>! In T154938#2929727, @Krenair wrote: > what about doing the same thing as https://gerrit.wikimedia.org/r/#/c/331242/ ?...
[02:26:47] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Jan 10 02:26:47 UTC 2017 (duration 4m 22s)
[02:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:27:07] (CR) Dzahn: [C: 2] disable Letsencrypt cert (do_acme: false) on dbmonitor* [puppet] - https://gerrit.wikimedia.org/r/331430 (https://phabricator.wikimedia.org/T154938) (owner: Dzahn)
[02:30:09] RECOVERY - puppet last run on dbmonitor1001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[02:30:29] RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[02:30:50] Krenair: ^ thanks, i forgot "do_acme" was already there
[02:30:55] that was a good point
[02:31:07] wanted to reinvent that
[02:31:48] Operations, Traffic, Patch-For-Review: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2929796 (Dzahn)
[02:31:50] Operations, DBA, Traffic, Patch-For-Review: convert tendril to use Letsencrypt for SSL cert (deadline 2017-03-17) - https://phabricator.wikimedia.org/T154938#2929795 (Dzahn) Open>Resolved
[02:32:29] np
[02:33:46] Operations, Traffic: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2240497 (Dzahn)
[02:34:00] Operations, DBA, Traffic: convert tendril to use Letsencrypt for SSL cert (deadline 2017-03-17) - https://phabricator.wikimedia.org/T154938#2929800 (Dzahn)
[02:34:10] goes afk ..laters
[02:34:17] cya
[03:04:09] RECOVERY - Check systemd state on restbase-test1001 is OK: OK - running: The system is fully operational
[03:04:26] (Abandoned) Alex Monk: Redirect most noc.wikimedia.org/conf URLs to Diffusion [puppet] - https://gerrit.wikimedia.org/r/224214 (owner: Alex Monk)
[03:07:09] PROBLEM - Check systemd state on restbase-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more
units failed.
[03:07:27] (PS1) Dereckson: Add recent dblist files on noc. [mediawiki-config] - https://gerrit.wikimedia.org/r/331436
[03:26:27] (CR) Alex Monk: [C: -1] "With I738836c6 and mitaka it should work" [puppet] - https://gerrit.wikimedia.org/r/329021 (https://phabricator.wikimedia.org/T104575) (owner: Alex Monk)
[03:26:40] (CR) Alex Monk: [C: -1] "With I738836c6 and mitaka it should work" [puppet] - https://gerrit.wikimedia.org/r/328609 (https://phabricator.wikimedia.org/T104575) (owner: Alex Monk)
[03:26:45] (CR) Alex Monk: [C: -1] "With I738836c6 and mitaka it should work" [puppet] - https://gerrit.wikimedia.org/r/328608 (https://phabricator.wikimedia.org/T104575) (owner: Alex Monk)
[03:27:29] RECOVERY - Check systemd state on elastic2034 is OK: OK - running: The system is fully operational
[03:27:38] (CR) Alex Monk: [C: -1] "With I738836c6 and mitaka, I think" [puppet] - https://gerrit.wikimedia.org/r/328611 (owner: Alex Monk)
[04:16:00] (PS1) Alex Monk: Add a test to prevent people from making new dblist files without appropriate noc symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/331441
[04:16:31] (PS1) Dereckson: Fix Dayanand College Solapur event throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/331442 (https://phabricator.wikimedia.org/T154312)
[04:16:40] (CR) Alex Monk: "Yep. This has gone on for way too long, so I've made a way to stop it happening in the first place: If982a125" [mediawiki-config] - https://gerrit.wikimedia.org/r/331436 (owner: Dereckson)
[04:16:59] Krenair or reedy: could you deploy this throttle rule please?
This is a fix for an event starting in 17 minutes to set the right IP ^^
[04:17:20] ^^ points to 331442 so, not 331436
[04:19:25] "Dayanand COllege IP address checked just now 10th Jan is 117.200.216.15 Please do needful at the earliest"
[04:20:10] college without static IP, cool world :/
[04:20:53] Dereckson, hey don't you have deployment access these days?
[04:22:21] er yes, but outside SWAT hours...
[04:22:33] you still have access?
[04:22:37] Yes.
[04:23:01] it's past 4 in the morning here I'm not much better
[04:24:11] Krinkle, legoktm, ostriches ?
[04:24:23] needs to be done in 5 minutes. . .
[04:24:30] hm
[04:24:37] just do it?
[04:25:33] Dereckson: do you feel uncomfortable doing it?
[04:26:03] No, but I was discouraged to deploy outside hours.
[04:26:11] so is everyone
[04:26:13] (CR) Legoktm: [C: 1] Fix Dayanand College Solapur event throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/331442 (https://phabricator.wikimedia.org/T154312) (owner: Dereckson)
[04:26:17] doesn't mean we don't do it when necessary
[04:26:24] k
[04:27:03] (CR) Dereckson: [C: 2] "Emergency deployment for ongoing event, per #wikimedia operations discussion" [mediawiki-config] - https://gerrit.wikimedia.org/r/331442 (https://phabricator.wikimedia.org/T154312) (owner: Dereckson)
[04:27:18] (CR) Alex Monk: [C: 1] Fix Dayanand College Solapur event throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/331442 (https://phabricator.wikimedia.org/T154312) (owner: Dereckson)
[04:27:35] (Merged) jenkins-bot: Fix Dayanand College Solapur event throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/331442 (https://phabricator.wikimedia.org/T154312) (owner: Dereckson)
[04:27:45] Dereckson: If you think it's worth doing an out of time deploy, I'd just do it.
[04:27:46] (CR) jenkins-bot: Fix Dayanand College Solapur event throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/331442 (https://phabricator.wikimedia.org/T154312) (owner: Dereckson)
[04:27:59] But really we need to take throttles out of wmf-config, ugh.
[04:28:53] legoktm: Reedy wants to work on this, with bd808 if I remember well
[04:29:07] Yeah, I've talked to both :)
[04:29:29] the ThrottleOverride thing?
[04:29:52] yes
[04:30:57] !log dereckson@tin Synchronized wmf-config/throttle.php: Fix Dayanand College Solapur event throttle rule (T154312) (duration: 00m 44s)
[04:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:31:01] T154312: Request for a temporary lift of account creation cap on IPs (2017-01-04,2017-01-06,2017-01-10,2017-01-12) - https://phabricator.wikimedia.org/T154312
[04:42:23] ostriches: I'm really not sure
[04:42:36] Unless the refactoring just highlighted the error for some reason
[04:44:40] IF, my wfLoadExtension() changes had been the one that was erroring, sure...
[04:44:44] But when its not
[04:46:09] PROBLEM - puppet last run on db1050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:49:18] Reedy: still looking at that issue?
[04:49:26] addshore: Which?
[04:49:52] the config mobile landing page thing?
[04:49:59] Not particularly
[04:50:14] Looks like it was debugged a bit while we were out
[05:04:09] RECOVERY - Check systemd state on restbase-test1001 is OK: OK - running: The system is fully operational
[05:07:09] PROBLEM - Check systemd state on restbase-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:09:29] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures.
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[05:14:09] RECOVERY - puppet last run on db1050 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[05:38:29] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[05:58:35] (CR) VolkerE: "@Catrope @Jforrester How to move on here?" [mediawiki-config] - https://gerrit.wikimedia.org/r/319968 (https://phabricator.wikimedia.org/T147219) (owner: Catrope)
[06:21:09] PROBLEM - puppet last run on seaborgium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:34:09] RECOVERY - Check systemd state on restbase-test1001 is OK: OK - running: The system is fully operational
[06:37:09] PROBLEM - Check systemd state on restbase-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:46:09] PROBLEM - puppet last run on db1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:47:09] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:47:39] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:48:09] PROBLEM - puppet last run on elastic1041 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): Package[ethtool]
[06:48:29] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[06:50:09] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:52:09] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:54:19] PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:57:09] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:10:46] (Restored) Chad: (bug 38114) gerrit: alternate change list row colors [puppet] - https://gerrit.wikimedia.org/r/49993 (owner: Hashar)
[07:14:09] RECOVERY - puppet last run on db1033 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[07:15:09] RECOVERY - Check systemd state on elastic2033 is OK: OK - running: The system is fully operational
[07:17:09] RECOVERY - puppet last run on elastic1041 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[07:17:49] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:18:39] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy
[08:04:09] RECOVERY - Check systemd state on restbase-test1001 is OK: OK - running: The system is fully operational
[08:07:09] PROBLEM - Check systemd state on restbase-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:25:29] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [08:41:09] RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK [08:53:29] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [09:30:51] (03PS4) 10Hashar: puppet parse from rake [puppet] - 10https://gerrit.wikimedia.org/r/331239 (https://phabricator.wikimedia.org/T154894) [09:30:53] (03PS1) 10Hashar: kafka: fix Unrecognized escape sequence '\.' [puppet] - 10https://gerrit.wikimedia.org/r/331451 [09:31:54] (03CR) 10jerkins-bot: [V: 04-1] puppet parse from rake [puppet] - 10https://gerrit.wikimedia.org/r/331239 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [09:34:09] RECOVERY - Check systemd state on restbase-test1001 is OK: OK - running: The system is fully operational [09:37:09] PROBLEM - Check systemd state on restbase-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:39:29] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [10:07:29] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [10:12:59] (03PS1) 10Hashar: dataset: fix Unrecognized escape sequence '\?' [puppet] - 10https://gerrit.wikimedia.org/r/331457 [10:13:10] (03PS2) 10Hashar: kafka: fix Unrecognized escape sequence '\.' 
[puppet] - 10https://gerrit.wikimedia.org/r/331451 [10:13:19] RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK [10:13:28] (03PS1) 10Hashar: mailman: fix Unrecognized escape sequence '\;' [puppet] - 10https://gerrit.wikimedia.org/r/331458 [10:13:42] (03PS1) 10Hashar: nagios_common: fix erroneous contacts generation [puppet] - 10https://gerrit.wikimedia.org/r/331459 [10:17:18] (03PS5) 10Hashar: puppet parse validate from rake [puppet] - 10https://gerrit.wikimedia.org/r/331239 (https://phabricator.wikimedia.org/T154915) [10:17:20] (03PS1) 10Hashar: Octopus merge of linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/331460 [10:18:05] (03CR) 10Hashar: [C: 04-1] "Do not submit. That is merely to act as if the parent four commits have been merged so I can run https://gerrit.wikimedia.org/r/#/c/331239" [puppet] - 10https://gerrit.wikimedia.org/r/331460 (owner: 10Hashar) [10:18:25] (03CR) 10jerkins-bot: [V: 04-1] puppet parse validate from rake [puppet] - 10https://gerrit.wikimedia.org/r/331239 (https://phabricator.wikimedia.org/T154915) (owner: 10Hashar) [10:22:58] (03PS6) 10Hashar: puppet parse validate from rake [puppet] - 10https://gerrit.wikimedia.org/r/331239 (https://phabricator.wikimedia.org/T154915) [10:24:59] PROBLEM - puppet last run on db1063 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:25:09] RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK [10:29:30] (03CR) 10Hashar: "Ariel that might be completely wrong. I am not sure which server to use to run the puppet compiler against this change :\" [puppet] - 10https://gerrit.wikimedia.org/r/331457 (owner: 10Hashar) [10:31:02] (03CR) 10Hashar: "I could have double escaped the anti slash but I am not quite sure it will work as intended. 
Using an intermediate variable seems easier t" [puppet] - 10https://gerrit.wikimedia.org/r/331451 (owner: 10Hashar) [10:32:18] (03CR) 10Hashar: "I am not sure which server to use to run the puppet compiler against this change :\" [puppet] - 10https://gerrit.wikimedia.org/r/331458 (owner: 10Hashar) [10:33:23] (03CR) 10Hashar: "I am not sure which server to use to run the puppet compiler against this change :\" [puppet] - 10https://gerrit.wikimedia.org/r/331459 (owner: 10Hashar) [10:36:09] RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK [10:40:47] 06Operations, 07Puppet, 10Continuous-Integration-Config: Get rid of "import realm.pp" in manifests/site.pp - https://phabricator.wikimedia.org/T154915#2930254 (10hashar) [10:45:47] (03PS4) 10Hashar: gerrit: alternate change list row colors [puppet] - 10https://gerrit.wikimedia.org/r/49993 (https://phabricator.wikimedia.org/T40114) [10:46:04] (03CR) 10Hashar: "Rebased." [puppet] - 10https://gerrit.wikimedia.org/r/49993 (https://phabricator.wikimedia.org/T40114) (owner: 10Hashar) [10:51:59] RECOVERY - puppet last run on db1063 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [10:55:19] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 3 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2930327 (10Revent) Just to update, the transcoders are still (as of now) mainly processing old multi-GB files. [11:20:44] (03PS2) 10Juniorsys: Add missing trailing commas to Puppet resources [puppet] - 10https://gerrit.wikimedia.org/r/331247 (https://phabricator.wikimedia.org/T93645) [11:25:09] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:53:09] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [12:40:16] 06Operations, 10DBA, 10Traffic: convert tendril to use Letsencrypt for SSL cert (deadline 2017-03-17) - https://phabricator.wikimedia.org/T154938#2930402 (10jcrespo) Thank you for working on this. [13:04:09] RECOVERY - Check systemd state on restbase-test1001 is OK: OK - running: The system is fully operational [13:05:34] (03CR) 10Alexandros Kosiaris: [C: 04-1] "This is a pretty large patch. It touches > 600 lines and a really big number of files and is thus quite difficult (and hence improbable) t" [puppet] - 10https://gerrit.wikimedia.org/r/331247 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [13:07:09] PROBLEM - Check systemd state on restbase-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:09:19] RECOVERY - Check systemd state on elastic2027 is OK: OK - running: The system is fully operational [13:19:56] (03PS1) 10Alexandros Kosiaris: Remove non-existing parameter from documentation [puppet] - 10https://gerrit.wikimedia.org/r/331485 [13:20:20] (03CR) 10ArielGlenn: "Hashar: Cron jobs like this one always run on snapshot1007. But when you are unsure you can always compile against the whole batch: snapsh" [puppet] - 10https://gerrit.wikimedia.org/r/331457 (owner: 10Hashar) [13:31:48] (03CR) 10Alexandros Kosiaris: [C: 032] Remove non-existing parameter from documentation [puppet] - 10https://gerrit.wikimedia.org/r/331485 (owner: 10Alexandros Kosiaris) [13:32:19] RECOVERY - Check systemd state on elastic2036 is OK: OK - running: The system is fully operational [13:41:07] (03CR) 10Alexandros Kosiaris: [C: 032] "The hosts would be einsteinium and tegmen. 
A PCC run on them shows https://puppet-compiler.wmflabs.org/5065/ a noop so merging" [puppet] - 10https://gerrit.wikimedia.org/r/331459 (owner: 10Hashar) [13:43:04] (03Abandoned) 10Juniorsys: Add missing trailing commas to Puppet resources [puppet] - 10https://gerrit.wikimedia.org/r/331247 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [13:45:51] (03PS2) 10Alexandros Kosiaris: nagios_common: fix erroneous contacts generation [puppet] - 10https://gerrit.wikimedia.org/r/331459 (owner: 10Hashar) [13:45:56] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] nagios_common: fix erroneous contacts generation [puppet] - 10https://gerrit.wikimedia.org/r/331459 (owner: 10Hashar) [13:46:00] akosiaris: I am surprised it is a noop [13:46:08] the ruby conditional in the template looks wrong [13:46:37] hashar: running puppet on einsteinium right now [13:46:44] might be the dummy data is crap [13:48:10] hashar: nope. It's a noop even on einsteinium [13:48:11] Or I failed to understand what the template is doing :d [13:50:44] akosiaris: I think I found out the reason [13:50:58] template is used by nagios_common::contactgroups and nagios_common::contacts [13:51:13] they each accept a "source" parameter which default to use the template [13:51:19] but we specific a cfg file directly [13:51:31] puppet:///modules/nagios_common/contactgroups.cfg and secret('nagios/contacts.cfg') [13:51:40] effectively skipping the template :} [13:53:10] jouncebot: next [13:53:10] In 0 hour(s) and 6 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170110T1400) [13:53:26] hashar: looks like there is nothing for eu swat [13:54:32] hashar: a yes indeed [13:54:55] zeljkof: great :-} [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. 
Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170110T1400). [14:07:15] akosiaris: definitely confirmed it was broken by using rspec :} [14:08:12] @tto I actually have an update request for interwiki from before NY, we can just prod for it to be run [14:09:29] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 20 failures. Last run 2 minutes ago with 20 failures. Failed resources (up to 3 shown): Service[puppet],Service[rsyslog],Exec[ip addr add 2620:0:860:102:10:192:16:30/64 dev eth0],Service[ferm] [14:09:40] sDrewth, indeed, sounds good to me. Just make sure you fix the mistake I pointed out :) [14:10:44] where did you comment? [14:10:49] @tto where did you comment? [14:10:50] sDrewth, on the phab task [14:10:55] k [14:14:27] fixed, and gave the ticket T154225 a tickle [14:14:28] T154225: Update interwiki map, following edit - https://phabricator.wikimedia.org/T154225 [14:16:46] sDrewth: thanks :) [14:17:04] (03PS1) 10Alexandros Kosiaris: Zotero: Try to limit RSS memory in upstart [puppet] - 10https://gerrit.wikimedia.org/r/331486 [14:17:31] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Zotero: Try to limit RSS memory in upstart [puppet] - 10https://gerrit.wikimedia.org/r/331486 (owner: 10Alexandros Kosiaris) [14:19:29] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [14:23:42] let's see if zotero continues causing problems to the rest of the processes on SCA after than [14:24:01] not that there are any other services over there. 
But it messes up with puppet [14:27:44] heh, crap software ..https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Service+Cluster+A+eqiad&m=cpu_report&s=by+name&mc=2&g=mem_report [14:35:00] 06Operations, 10Traffic, 13Patch-For-Review: convert wikitech.wikimedia.org from globalsign to letsencrypt certificate (deadline 2017-02-24) - https://phabricator.wikimedia.org/T154913#2928272 (10akosiaris) If the patch tested above works, it looks a good alternative to me. We 've had that issue in the past... [14:40:49] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2638.70 Read Requests/Sec=875.40 Write Requests/Sec=597.60 KBytes Read/Sec=28814.40 KBytes_Written/Sec=6160.00 [14:48:07] <_joe_> akosiaris: lol [14:48:09] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:55:49] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=5.60 Read Requests/Sec=0.10 Write Requests/Sec=2.50 KBytes Read/Sec=0.40 KBytes_Written/Sec=46.40 [15:04:10] RECOVERY - Check systemd state on restbase-test1001 is OK: OK - running: The system is fully operational [15:07:09] PROBLEM - Check systemd state on restbase-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:08:09] (03PS1) 10Hashar: nagios_common: basic spec for contacts.cfg [puppet] - 10https://gerrit.wikimedia.org/r/331490 [15:08:52] (03CR) 10jerkins-bot: [V: 04-1] nagios_common: basic spec for contacts.cfg [puppet] - 10https://gerrit.wikimedia.org/r/331490 (owner: 10Hashar) [15:17:09] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [15:28:44] 06Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2751310 (10MelodyKramer) Notes from this session: https://etherpad.wikimedia.org/p/devsummit17-asynchronous-processing [15:32:08] (03CR) 10Hashar: [V: 031] "Thanks Ariel! I was shooting in the dark not knowing what was going to be the impact. The puppet compiler claims it is a noop on all thre" [puppet] - 10https://gerrit.wikimedia.org/r/331457 (owner: 10Hashar) [16:22:29] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [16:23:29] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2800135 keys, up 71 days 8 hours - replication_delay is 0 [16:27:09] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:34:09] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:37:39] jouncebot: next [16:37:39] In 1 hour(s) and 22 minute(s): Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170110T1800) [16:55:09] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:55:26] (03PS3) 10Andrew Bogott: Puppetmaster: Remove remnants of ldap node definitions. [puppet] - 10https://gerrit.wikimedia.org/r/330959 (https://phabricator.wikimedia.org/T148781) [16:58:43] 06Operations, 10Traffic, 13Patch-For-Review: convert wikitech.wikimedia.org from globalsign to letsencrypt certificate (deadline 2017-02-24) - https://phabricator.wikimedia.org/T154913#2931057 (10Krenair) labtestwikitech already uses LE so we can test it on that: * curl to show the system has no issue with... [16:58:59] (03CR) 10Andrew Bogott: [C: 032] Puppetmaster: Remove remnants of ldap node definitions. [puppet] - 10https://gerrit.wikimedia.org/r/330959 (https://phabricator.wikimedia.org/T148781) (owner: 10Andrew Bogott) [17:02:09] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [17:03:16] (03CR) 10Andrew Bogott: "Correct, this will only work with Mitaka packages. 
But, I think we can switch the shinkengen instance to use Mitaka immediately to make t" [puppet] - 10https://gerrit.wikimedia.org/r/328611 (owner: 10Alex Monk) [17:03:58] (03CR) 10Andrew Bogott: [C: 032] Openstack clientlib: Include python3 packages if version is post-liberty [puppet] - 10https://gerrit.wikimedia.org/r/331105 (owner: 10Andrew Bogott) [17:04:07] (03PS6) 10Andrew Bogott: Openstack clientlib: Include python3 packages if version is post-liberty [puppet] - 10https://gerrit.wikimedia.org/r/331105 [17:05:05] (03PS7) 10Andrew Bogott: Openstack clientlib: Include python3 packages if version is post-liberty [puppet] - 10https://gerrit.wikimedia.org/r/331105 [17:05:07] (03PS8) 10Andrew Bogott: Move shinkengen from using LDAP to the OpenStack APIs [puppet] - 10https://gerrit.wikimedia.org/r/328611 (owner: 10Alex Monk) [17:10:39] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:13:13] (03CR) 10Andrew Bogott: designate-sink nova_ldap: set l to the correct site (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/312115 (owner: 10Alex Monk) [17:17:34] (03CR) 10Andrew Bogott: [C: 031] mwyaml: Accept existing, but empty "Hiera:" pages as well [puppet] - 10https://gerrit.wikimedia.org/r/325131 (https://phabricator.wikimedia.org/T152142) (owner: 10Tim Landscheidt) [17:19:44] (03CR) 10Andrew Bogott: "Would it be better to have git-sync-upstream run as puppetmaster::git_user" [puppet] - 10https://gerrit.wikimedia.org/r/324727 (https://phabricator.wikimedia.org/T152059) (owner: 10Tim Landscheidt) [17:19:55] (03CR) 10Alex Monk: designate-sink nova_ldap: set l to the correct site (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/312115 (owner: 10Alex Monk) [17:21:00] (03CR) 10Andrew Bogott: [C: 031] Tools: Undo obsolete /var/mail customization [puppet] - 10https://gerrit.wikimedia.org/r/326306 (owner: 10Tim Landscheidt) [17:21:56] (03PS3) 10Andrew Bogott: puppetmaster: 
Specify $group for all repositories [puppet] - 10https://gerrit.wikimedia.org/r/329595 (https://phabricator.wikimedia.org/T152060) (owner: 10Tim Landscheidt) [17:22:52] 06Operations, 10OCG-General, 10Reading-Community-Engagement, 06Reading-Web-Backlog, and 3 others: [EPIC] Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#2931096 (10JKatzWMF) [17:24:34] (03CR) 10Andrew Bogott: [C: 032] puppetmaster: Specify $group for all repositories [puppet] - 10https://gerrit.wikimedia.org/r/329595 (https://phabricator.wikimedia.org/T152060) (owner: 10Tim Landscheidt) [17:26:01] (03CR) 10Andrew Bogott: designate-sink nova_ldap: set l to the correct site (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/312115 (owner: 10Alex Monk) [17:28:31] (03PS1) 10Hashar: base: fix pick_initscript spec [puppet] - 10https://gerrit.wikimedia.org/r/331494 [17:38:02] (03CR) 10Dzahn: "server is fermium.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/331458 (owner: 10Hashar) [17:38:39] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:40:04] 06Operations, 10ArticlePlaceholder, 10Traffic, 10Wikidata: Performance and caching considerations for article placeholders accesses - https://phabricator.wikimedia.org/T142944#2931122 (10hoo) 05Open>03Resolved a:03hoo We discussed this at the developer summit with @BBlack and we decided to go for 24h... [17:40:22] (03CR) 10Dzahn: "compiling .. job 5067" [puppet] - 10https://gerrit.wikimedia.org/r/331458 (owner: 10Hashar) [17:42:41] (03CR) 10Dzahn: [C: 032] "sometimes can't trust the compiler saying noop, but it is. 
searched for "heldmsg" in http://puppet-compiler.wmflabs.org/5067/fermium.wikim" [puppet] - 10https://gerrit.wikimedia.org/r/331458 (owner: 10Hashar) [17:42:51] (03PS2) 10Dzahn: mailman: fix Unrecognized escape sequence '\;' [puppet] - 10https://gerrit.wikimedia.org/r/331458 (owner: 10Hashar) [17:46:42] _joe_: so you think we can re-enable video2commons ? [17:46:58] <_joe_> matanya: lemme check for a sec [17:47:03] thanks [17:48:59] <_joe_> matanya: uhm seems like some people added a ton of new videos since yesterday [17:49:07] <_joe_> let me check a few more things [17:49:15] mutante: Guten Tag. I am not sure of the impact of those patches :/ [17:49:16] (03CR) 10Dzahn: "yep, hashar, nothing happens. doesn't matter for the puppet run." [puppet] - 10https://gerrit.wikimedia.org/r/331458 (owner: 10Hashar) [17:49:26] hashar: there was no impact, i checked [17:49:34] \O/ [17:49:36] i only saw one patch [17:49:53] <_joe_> nah it's just backlog [17:50:17] <_joe_> matanya: I think we can, with the caveat we might turn it off again [17:50:22] hashar, are you coming to SF? [17:50:24] mutante: the couple others are for analytics :) [17:50:26] <_joe_> also, please limit the number of workers [17:50:34] rfarrand: nop [17:50:37] hashar: ok :) [17:50:47] :( [17:51:02] _joe_: i prefer waiting if this is the case, enable and disable is confusing for users [17:51:06] mutante: still have to deal with the zuul/contint puppet manifest. Planning to do follow up on your reviews by end of the week [17:51:09] <_joe_> ok [17:51:15] <_joe_> let's re-assess tomorrow [17:51:18] mutante: there is no hurry though. It is merely sugar on top of the cake [17:51:25] ok, cool, thanks [17:51:30] hashar: that all sounds good :) [17:51:32] rfarrand: but maybe I will fly to SF at some point this year. 
Just out of an event :} [17:51:39] RECOVERY - Check systemd state on elastic2026 is OK: OK - running: The system is fully operational [17:52:17] <_joe_> matanya: we're still processing those damn white house press briefings :) [17:52:19] PROBLEM - puppet last run on labvirt1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:52:45] can we blacklist the white house :) [17:52:47] thanks Obama? [17:53:20] <_joe_> matanya: in a week or so, prolly [17:53:23] <_joe_> ;) [17:53:34] lol [17:53:38] <_joe_> just joking, ofc [17:54:45] when i saw all the special magazines about Obama i thought to myself.. is he always going to stay the test page? [17:55:19] RECOVERY - puppet last run on labvirt1005 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [17:55:23] <_joe_> I guess so [17:55:27] or are we always using the current president page :/ [17:55:30] for the record, the test page has not always been 'current us president' so lets not make that a new one [17:55:32] heh [17:55:38] heh [17:55:40] mutante: get out of my headddd [17:55:45] heh [17:55:48] what rob said [17:55:49] <_joe_> obama is a benchmark for a series of good reasons [17:55:54] <_joe_> that's not gonna change [17:55:57] sounds good [17:56:09] <_joe_> we didn't pick it for any reason that has to do with the fact he's POTUS [17:56:29] if he only knew he is a benchmark [17:56:40] <_joe_> it was used by ori, tim and me to benchmark HHVM because it's a template intensive page [17:57:01] <_joe_> ori: we miss you :((( [17:57:24] <_joe_> and it took like 15 seconds to render on the php 5 appservers [17:58:02] yup picked that Barrack obama page because of all the references the page has [17:59:41] so how much are they uploading? years of press briefings in HD ? 
[18:00:04] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170110T1800). Please do the needful. [18:00:08] something like that mutante [18:00:12] <_joe_> mutante: yes [18:00:33] oh, heh, that sounds massive [18:00:56] <_joe_> mutante: have you seen videoscalers usage rate in the last month? [18:01:18] i saw some of the ticket comments about the backlog there [18:01:30] havent really compared the rate to before though [18:01:48] <_joe_> https://ganglia.wikimedia.org/latest/?r=year&cs=9%2F14%2F2016+4%3A5&ce=1%2F11%2F2017+0%3A50&c=Video+scalers+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [18:02:03] yeah its a huge difference [18:02:14] <_joe_> they're running at 75-85% utilization rate, and there are 4 servers now [18:03:01] oh, yea, and we already added mw1168/1169 [18:03:04] <_joe_> the issue is those are large files [18:03:09] remembers reinstalling those [18:03:10] <_joe_> https://commons.wikimedia.org/wiki/File:6-9-15-_White_House_Press_Briefing.webm for example [18:04:26] ah, and it shows Encode time on the wiki page [18:06:44] <_joe_> yes [18:08:06] (03PS6) 10Dzahn: Stop using package=>latest for standard packages [puppet] - 10https://gerrit.wikimedia.org/r/314270 (https://phabricator.wikimedia.org/T115348) (owner: 10Muehlenhoff) [18:08:32] (03PS7) 10Dzahn: Stop using package=>latest for standard packages [puppet] - 10https://gerrit.wikimedia.org/r/314270 (https://phabricator.wikimedia.org/T115348) (owner: 10Muehlenhoff) [18:14:11] (03CR) 10Dzahn: [C: 032] Stop using package=>latest for standard packages [puppet] - 10https://gerrit.wikimedia.org/r/314270 (https://phabricator.wikimedia.org/T115348) (owner: 10Muehlenhoff) [18:16:51] (03CR) 10Andrew Bogott: [C: 031] "lgtm but I'd like someone who has touched logstash before to merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/329020 (https://phabricator.wikimedia.org/T151422) (owner: 10BryanDavis) [18:25:13] (03PS1) 10Dzahn: ganglia/gerrit/planet: missing trailing commas [puppet] - 10https://gerrit.wikimedia.org/r/331513 [18:26:14] (03CR) 10Chad: [C: 031] "Gerrit change is fine" [puppet] - 10https://gerrit.wikimedia.org/r/331513 (owner: 10Dzahn) [18:26:16] (03CR) 10Jforrester: "> composer 1.3.0 seems to change the STDOUT/STDERR file descriptor to pipe and PHPUnit output ends up with no colors. Again can live with" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331093 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [18:27:42] (03CR) 10Dzahn: "@Juniorsys thanks for the contribution but like Alex said it's hard to merge if it touches so many modules at once. Here i'm fixing three " [puppet] - 10https://gerrit.wikimedia.org/r/331247 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [18:33:23] (03CR) 10Dzahn: [C: 032] ganglia/gerrit/planet: missing trailing commas [puppet] - 10https://gerrit.wikimedia.org/r/331513 (owner: 10Dzahn) [18:33:39] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/5069/" [puppet] - 10https://gerrit.wikimedia.org/r/331513 (owner: 10Dzahn) [18:34:19] PROBLEM - puppet last run on ocg1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:34:19] RECOVERY - Check systemd state on restbase-test1001 is OK: OK - running: The system is fully operational [18:35:10] * legoktm pokes Reedy [18:35:16] legoktm@terbium:~$ mwscript extensions/CentralAuth/maintenance/migrateAccount.php --wiki=metawiki --username="USERNAME" --attachbroken=1 --attachmissing=1 --suppressrc=1 [18:35:16] The following extensions are required to be installed for this script to run: CentralAuth. Please enable them and then try again. 
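Note on the migrateAccount.php error just above and the "name": "Central Auth" line quoted right after it: a minimal sketch of why a one-character difference is enough to trip a required-extension check. It assumes the check is an exact string lookup against the registered extension names; the registry contents and the helper name are hypothetical, not MediaWiki code.

```python
# Hypothetical stand-in for a registry of loaded extensions, keyed by the
# "name" each extension declares. "Wikibase" is only an illustrative entry.
LOADED = {"Central Auth", "Wikibase"}


def missing_extensions(required, loaded):
    """Return the required names that have no exact match among loaded names."""
    return [name for name in required if name not in loaded]


if __name__ == "__main__":
    # The script asks for "CentralAuth"; the registry only knows "Central Auth".
    print(missing_extensions(["CentralAuth"], LOADED))
```

Under that assumption the extension is installed but still reported as missing until the declared name and the required name agree.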
[18:35:25] > "name": "Central Auth", [18:35:31] looool [18:35:41] legoktm: That might explain the problem someone else was reporting [18:35:46] legoktm: Guess you need to install CentralAuth on the cluster ;-) [18:36:06] legoktm: run update.php? [18:36:19] Reedy: That's just silly, it's disabled :p [18:36:26] legoktm: Imma fix [18:36:33] ok, I'll just live hack it for now [18:36:52] Reedy: member when someone tried and wondered why it wasn't working? :D :D :D [18:36:54] i member [18:37:03] i member [18:37:06] pepperidge farm remembers [18:37:07] I meeemmbeeerr [18:37:19] PROBLEM - Check systemd state on restbase-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:40:06] (03PS1) 10Dzahn: icinga: missing trailing commas [puppet] - 10https://gerrit.wikimedia.org/r/331516 [18:41:34] (03CR) 10Dzahn: "also https://gerrit.wikimedia.org/r/#/c/331516/" [puppet] - 10https://gerrit.wikimedia.org/r/331247 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [18:41:54] !log re-attached User:Fuu5tgsrygr / T154983 [18:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:26] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/5070/ (failure on einsteinium = compiler bug)" [puppet] - 10https://gerrit.wikimedia.org/r/331516 (owner: 10Dzahn) [18:47:30] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/5071/" [puppet] - 10https://gerrit.wikimedia.org/r/327592 (https://phabricator.wikimedia.org/T148494) (owner: 10Dzahn) [18:48:53] (03PS4) 10Dzahn: icinga/CI: give all shell scripts a file extension [puppet] - 10https://gerrit.wikimedia.org/r/327592 (https://phabricator.wikimedia.org/T148494) [18:49:30] ostriches, Reedy: https://gerrit.wikimedia.org/r/331518 [18:49:38] I just saw <3 [18:49:52] (03CR) 10Dzahn: "adding file extensions here makes jenkins run new tests we didn't run before. 
so we are getting "00:51:51 ./modules/icinga/files/check_key" [puppet] - 10https://gerrit.wikimedia.org/r/327592 (https://phabricator.wikimedia.org/T148494) (owner: 10Dzahn) [18:50:17] (03CR) 10jerkins-bot: [V: 04-1] icinga/CI: give all shell scripts a file extension [puppet] - 10https://gerrit.wikimedia.org/r/327592 (https://phabricator.wikimedia.org/T148494) (owner: 10Dzahn) [18:54:39] (03PS5) 10Dzahn: icinga/CI: give all shell scripts a file extension [puppet] - 10https://gerrit.wikimedia.org/r/327592 (https://phabricator.wikimedia.org/T148494) [18:55:39] (03CR) 10jerkins-bot: [V: 04-1] icinga/CI: give all shell scripts a file extension [puppet] - 10https://gerrit.wikimedia.org/r/327592 (https://phabricator.wikimedia.org/T148494) (owner: 10Dzahn) [18:57:49] (03PS6) 10Dzahn: icinga/CI: give all shell scripts a file extension [puppet] - 10https://gerrit.wikimedia.org/r/327592 (https://phabricator.wikimedia.org/T148494) [18:58:54] (03CR) 10jerkins-bot: [V: 04-1] icinga/CI: give all shell scripts a file extension [puppet] - 10https://gerrit.wikimedia.org/r/327592 (https://phabricator.wikimedia.org/T148494) (owner: 10Dzahn) [18:59:26] (03CR) 10Dzahn: "every round is fixing one error and then getting the next :)" [puppet] - 10https://gerrit.wikimedia.org/r/327592 (https://phabricator.wikimedia.org/T148494) (owner: 10Dzahn) [19:00:00] (03CR) 10Dzahn: ""E302 expected 2 blank lines, found 1" :p" [puppet] - 10https://gerrit.wikimedia.org/r/327592 (https://phabricator.wikimedia.org/T148494) (owner: 10Dzahn) [19:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170110T1900). Please do the needful. [19:00:32] @seen apergos [19:00:32] mutante: Last time I saw apergos they were quitting the network with reason: Quit: Leaving. 
N/A at 1/10/2017 3:46:09 PM (3h14m22s ago) [19:03:19] RECOVERY - puppet last run on ocg1003 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [19:04:23] (03CR) 10Paladox: "I have deployed this change onto https://gerrit.git.wmflabs.org/r/#/dashboard/self" [puppet] - 10https://gerrit.wikimedia.org/r/49993 (https://phabricator.wikimedia.org/T40114) (owner: 10Hashar) [19:13:33] (03CR) 10Chad: [C: 031] gerrit: alternate change list row colors [puppet] - 10https://gerrit.wikimedia.org/r/49993 (https://phabricator.wikimedia.org/T40114) (owner: 10Hashar) [19:13:53] (03PS7) 10Dzahn: icinga/CI: give all shell scripts a file extension [puppet] - 10https://gerrit.wikimedia.org/r/327592 (https://phabricator.wikimedia.org/T148494) [19:20:29] (03CR) 10Dzahn: "@AndrewBogott only touching your check_keystone scripts because after giving them a .py extension we got warnings from jenkins about too m" [puppet] - 10https://gerrit.wikimedia.org/r/327592 (https://phabricator.wikimedia.org/T148494) (owner: 10Dzahn) [19:23:54] (03PS1) 10Reedy: Add wikimania2018 [dns] - 10https://gerrit.wikimedia.org/r/331520 [19:26:15] (03CR) 10Dzahn: [C: 032] Add wikimania2018 [dns] - 10https://gerrit.wikimedia.org/r/331520 (owner: 10Reedy) [19:26:46] (03CR) 10Dzahn: "no ticket ?:)" [dns] - 10https://gerrit.wikimedia.org/r/331520 (owner: 10Reedy) [19:26:52] https://lists.wikimedia.org/pipermail/wikimedia-l/2017-January/085897.html [19:26:53] ;P [19:27:12] ah :) [19:27:16] Shark diving! [19:27:58] nice ! [19:28:46] DebConf16 was Cape Town, 17 Montreal; Wikimania17 is Montreal, 18 Cape Town [19:28:52] are they doing it on purpose [19:30:25] paravoid: No. :-) [19:30:36] heh [19:31:00] Where was DebConf15? Need to work out where Wikimania19 will be… ;-) [19:31:06] heheh [19:31:08] (03PS1) 10Chad: scap prep: use check_call, which is a real function name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331521 [19:31:15] so we immediately making a new wiki? 
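Note on the "scap prep: use check_call" patch uploaded at 19:31:08: the log does not show the code in scap/plugins/prep.py, so the following is only a sketch of how Python's subprocess.check_call behaves. The helper name run_checked and the commands are assumptions made for illustration.

```python
import subprocess
import sys


def run_checked(cmd):
    """Run cmd (a list of arguments) and return True if it exits with status 0.

    subprocess.check_call raises CalledProcessError on a non-zero exit
    status; that exception is caught here and reported as False.
    """
    try:
        subprocess.check_call(cmd)
    except subprocess.CalledProcessError as err:
        print("command failed with exit status", err.returncode)
        return False
    return True


if __name__ == "__main__":
    # The running interpreter is used as a portable stand-in command.
    run_checked([sys.executable, "-c", "raise SystemExit(0)"])
    run_checked([sys.executable, "-c", "raise SystemExit(3)"])
```

A misspelled attribute on the subprocess module would only fail with AttributeError when that line actually runs, which is one way such a typo can go unnoticed until the code path is exercised.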
[19:31:16] Heidelberg, Germany [19:31:20] Krenair: Sure. [19:31:27] okay [19:31:31] we're* [19:32:31] * bd808 would really like to see them all in ONE wikimania wiki [19:32:37] At least we don't have to swap apache config anymore. Yay wikimania -> wikimania%{YEAR} redirect :D [19:32:45] isn't there a task for that somewhere? [19:32:59] bd808: pros and cons [19:32:59] or maybe it was just some random mailing list/IRC discussion [19:33:08] paravoid: certainly [19:33:25] AFAIK, we usually do it fairly soon after it's announced [19:33:30] bd808: to use the same example https://wiki.debconf.org/ is a single (media)wiki, it can get /really/ confusing [19:33:42] pages from the previous year appear when searching for something [19:33:55] one idea: current year in main namespace and namespace for each past/future year [19:34:03] Krenair: T155038 :-) [19:34:03] T155038: Create Wikimania 2018 wiki - https://phabricator.wikimedia.org/T155038 [19:34:17] yeah, I've proposed that for the debconf wiki [19:34:19] yeah... [19:34:29] or maybe better static namespace per year [19:34:49] first thing to create wiki will be the labs db thing again [19:35:00] "tell dba about replicas" [19:35:26] bd808: static namespace per year, and just change the config for default search namespaces each year? [19:35:40] might work, year [19:36:03] s/year/yeah/ [19:36:09] still trying to get the fiwikivoyage and tcywiki and olowiki creation tickets closed [19:37:05] hrmm.. 
[19:37:10] always so hard [19:37:20] (03CR) 10Chad: [C: 032] scap prep: use check_call, which is a real function name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331521 (owner: 10Chad) [19:37:21] but still there is progress [19:37:23] (03PS1) 10Dzahn: restbase: add wikimania2018 [puppet] - 10https://gerrit.wikimedia.org/r/331523 (https://phabricator.wikimedia.org/T155038) [19:37:56] (03Merged) 10jenkins-bot: scap prep: use check_call, which is a real function name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331521 (owner: 10Chad) [19:38:07] (03CR) 10jenkins-bot: scap prep: use check_call, which is a real function name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331521 (owner: 10Chad) [19:39:10] !log demon@tin Synchronized scap/plugins/prep.py: prod no-op, for completeness (duration: 00m 40s) [19:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:00] (03PS1) 10Dzahn: labs dnsrecursor: add wikimania2018 [puppet] - 10https://gerrit.wikimedia.org/r/331526 (https://phabricator.wikimedia.org/T155038) [19:40:04] I've sent out the ops mail [19:40:17] and made the labs/dba ticket [19:40:48] Is someone taking care of apache? [19:40:59] yes [19:41:02] can do [19:41:11] it'll need to be added to the wikimania ServerAlias [19:41:16] yes [19:41:18] Krenair: I wonder if we can wildcard it... [19:41:25] ServerAlias wikimania*.wikimedia.org [19:41:48] ostriches, I think we'd still want wikimaniateam to be separate, perhaps? [19:41:56] Oh phooey [19:42:17] It might not matter though [19:42:22] don't :) [19:43:17] ostriches: Do we want to backport/deploy legoktm's fix? 
[19:43:22] Probably ya [19:43:49] (03CR) 10Mobrovac: [C: 031] restbase: add wikimania2018 [puppet] - 10https://gerrit.wikimedia.org/r/331523 (https://phabricator.wikimedia.org/T155038) (owner: 10Dzahn) [19:44:02] (03PS1) 10Dzahn: apache: add wikimania2018 server alias [puppet] - 10https://gerrit.wikimedia.org/r/331527 (https://phabricator.wikimedia.org/T155038) [19:45:04] Reedy, are you volunteering to do the mediawiki parts? [19:45:08] 19:40:07 ( ! ) Notice: Cannot find site mywiki in sites table [Called from Wikibase\Client\WikibaseClient::newSiteGroup in /home/jenkins/workspace/mediawiki-extensions-qunit-jessie/src/extensions/Wikidata/extensions/Wikibase/client/includes/WikibaseClient.php at line 716] in /home/jenkins/workspace [19:45:08] /mediawiki-extensions-qunit-jessie/src/includes/debug/MWDebug.php on line 309 [19:45:09] Rofl [19:45:35] (03PS1) 10Chad: scap prep: Shell and shell aren't the same. Silly python [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331528 [19:46:17] (03CR) 10Chad: [C: 032] scap prep: Shell and shell aren't the same. Silly python [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331528 (owner: 10Chad) [19:46:51] (03Merged) 10jenkins-bot: scap prep: Shell and shell aren't the same. Silly python [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331528 (owner: 10Chad) [19:47:47] (03CR) 10Dzahn: [C: 032] "confirmed on mwdebug1001" [puppet] - 10https://gerrit.wikimedia.org/r/331527 (https://phabricator.wikimedia.org/T155038) (owner: 10Dzahn) [19:47:57] (03CR) 10jenkins-bot: scap prep: Shell and shell aren't the same. 
Silly python [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331528 (owner: 10Chad) [19:48:29] Krenair: Yeah, can do [19:48:37] the restbase part needs deploy, right [19:49:09] remember we are supposed to have the db replicas done before running addwiki though [19:49:33] I wasn't gonna rush into creating the wiki [19:49:39] !log demon@tin Synchronized scap/plugins/prep.py: another no-op (duration: 00m 41s) [19:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:05] we were trying to lower the wiki creation time record though :) j/k [19:50:23] lol [19:50:29] blocked on other people :P [19:50:36] that's the challenge :p [19:51:23] Hi. I summarized the discussion at https://phabricator.wikimedia.org/T155038#2931734. As there are suggestions to address concerns (one namespace per year, search configured to use current/next edition), we could create wikimaniawiki and use it for 2018 and 2019. After 2019, we'll know if it's a good idea or not. [19:52:28] There would be huge benefits for the wiki maintainers too: they wouldn't have to reimport the different templates, prepare a discussion page, etc. [19:53:08] (03CR) 10jenkins-bot: [V: 04-1] labs dnsrecursor: add wikimania2018 [puppet] - 10https://gerrit.wikimedia.org/r/331526 (https://phabricator.wikimedia.org/T155038) (owner: 10Dzahn) [19:53:15] last time this came up there were maintainers who wanted their own wiki afair [19:53:30] and did not want to mix the content, have custom style and logo [19:53:43] trade offs :) [19:54:33] hi. it's still SWAT time, right? can i have a thing deployed?
https://gerrit.wikimedia.org/r/#/c/331531/ [19:55:03] MatmaRex: No, nothing for you [19:55:07] ;_; [19:55:09] Hi MatmaRex, yes you can add it to the table [19:55:12] !next [19:55:14] @next [19:55:17] BOT [19:55:18] jouncebot: now [19:55:18] For the next 0 hour(s) and 4 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170110T1900) [19:55:21] jouncebot: next [19:55:21] In 4 hour(s) and 4 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170111T0000) [19:55:29] lol [19:55:40] i'll schedule it [19:57:26] (03PS2) 10Dzahn: labs dnsrecursor: add wikimania2018 [puppet] - 10https://gerrit.wikimedia.org/r/331526 (https://phabricator.wikimedia.org/T155038) [19:57:39] RECOVERY - Check systemd state on elastic2035 is OK: OK - running: The system is fully operational [19:58:11] (03PS3) 10Dzahn: labs dnsrecursor: add wikimania2018 [puppet] - 10https://gerrit.wikimedia.org/r/331526 (https://phabricator.wikimedia.org/T155038) [20:02:22] 06Operations, 13Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#2931756 (10Dzahn) [20:03:00] 06Operations, 13Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#1722037 (10Dzahn) After the merge above, I wonder if this ticket is now resolved. I also edited the task description just now that 2 that were marked as pending have been merged meanwhile. [20:04:04] 06Operations, 13Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#2931786 (10Dzahn) p:05Triage>03Normal [20:06:19] Dereckson: deploying? deployed? should i ask Reedy? [20:07:19] Reedy: you deploy it or I do? 
[20:08:38] MatmaRex: okay, I can deploy it if Reedy is busy with another thing [20:08:42] Not fussed either way :) [20:09:15] 06Operations, 13Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#2931794 (10MoritzMuehlenhoff) I'll have a final review of the current puppet repo before closing this. [20:09:35] There is an undeployed change by rlot for SemanticMediaWiki [20:10:00] this: https://gerrit.wikimedia.org/r/#/c/331514/ [20:10:05] Heh [20:10:08] There'll be a couple [20:10:11] And those are my fault [20:11:08] MatmaRex: live on mwdebug1002 [20:11:48] Dereckson: checking [20:11:58] Dereckson: Please just update the extension and sync dir [20:12:25] Reedy: for SemanticMediawiki? [20:12:26] Dereckson: looks good [20:12:32] Please [20:13:25] MatmaRex: syncing [20:14:03] !log dereckson@tin Synchronized php-1.29.0-wmf.7/extensions/UploadWizard/resources/transports/mw.FormDataTransport.js: mw.FormDataTransport: Don't remove Unicode characters from temp filename (T155039) (duration: 00m 41s) [20:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:07] T155039: Files named with cyrillic characters rejected with "The filename is too short." on Upload step - https://phabricator.wikimedia.org/T155039 [20:14:46] Dereckson: thanks!
[20:15:00] you're welcome [20:16:13] !log dereckson@tin Synchronized php-1.29.0-wmf.7/extensions/SemanticMediaWiki: Remove deprecated function usages (T147924) (duration: 00m 49s) [20:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:18] T147924: Identify/Cleanup ContentHandler deprecated calls (and hook subscribers) in Wikitech specific extensions branches - https://phabricator.wikimedia.org/T147924 [20:24:47] 06Operations, 10ops-codfw: decommission mw2075-2089 to make room for new systems - https://phabricator.wikimedia.org/T154621#2931831 (10Papaul) @RobH Joe mentioned "3 servers to replace 4 imagescalers mw2086-2090 so in row B" but we didn't decommissioned mw2090 [20:25:35] 06Operations, 10ops-codfw: decommission mw2075-2089 to make room for new systems - https://phabricator.wikimedia.org/T154621#2931832 (10RobH) correct, that was a typo on his part. He mentioned imagescalers, and 4 of them, but that is 5 systems listed. mw2090 is not an image server, and hsould be left in ser... 
[20:28:23] papaul: I left mw2090 alone and in service, so i wouldnt touch that if i were you =] [20:28:31] (03PS1) 10Dzahn: delete tendril.wikimedia.org SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/331534 [20:30:50] robh: i just asked since joe said it was coming out too like the other once [20:30:59] hrmm [20:31:06] but mw2090 is not an image scaler [20:31:24] and he lists to take down 4 image scalers [20:31:29] but then a range of 5 systems [20:31:40] we can add mw2090 later, but for now its still up in service and in use [20:32:00] since its not going to block the rest going in, i'd leave it and we can have him clarify, but it seems odd [20:32:21] robh:ok [20:32:34] cool, the rest were good to be wiped though [20:32:38] just not mw290 [20:32:41] 2090 even [20:32:41] (03CR) 10Dzahn: [C: 032] delete tendril.wikimedia.org SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/331534 (owner: 10Dzahn) [20:32:45] wipe is complete [20:32:49] (and its totally understood and appreciated that you checked!) [20:33:22] 06Operations, 10ops-codfw: decommission mw2075-2089 to make room for new systems - https://phabricator.wikimedia.org/T154621#2931840 (10RobH) [20:33:34] i realized still had mw2090 listed in task description, fixed. [20:34:19] RECOVERY - Check systemd state on restbase-test1001 is OK: OK - running: The system is fully operational [20:54:20] (03CR) 10Dzahn: [C: 032] "works on https://gerrit.git.wmflabs.org/r/#/q/status:merged" [puppet] - 10https://gerrit.wikimedia.org/r/49993 (https://phabricator.wikimedia.org/T40114) (owner: 10Hashar) [20:54:37] (03PS5) 10Dzahn: gerrit: alternate change list row colors [puppet] - 10https://gerrit.wikimedia.org/r/49993 (https://phabricator.wikimedia.org/T40114) (owner: 10Hashar) [20:55:29] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:58:39] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:04:57] !log gerrit restarting for config change 49993 (T40114) [21:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:03] T40114: [upstream] alternate change list row colors - https://phabricator.wikimedia.org/T40114 [21:05:28] should close bug from 2012 :) [21:05:35] ah [21:06:12] done. gerrit normal [21:06:20] alternating row colors on https://gerrit.wikimedia.org/r/#/q/status:merged [21:08:03] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:08:19] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [21:10:39] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service] [21:13:15] (03CR) 10Dzahn: [C: 032] "only source => lines are changed,nothing on the target file systems, and here's the compiler run for tegmen http://puppet-compiler.wmflabs" [puppet] - 10https://gerrit.wikimedia.org/r/327592 (https://phabricator.wikimedia.org/T148494) (owner: 10Dzahn) [21:13:23] (03PS8) 10Dzahn: icinga/CI: give all shell scripts a file extension [puppet] - 10https://gerrit.wikimedia.org/r/327592 (https://phabricator.wikimedia.org/T148494) [21:14:26] runs puppet on sca2004 and no error [21:14:29] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [21:15:20] !log sca2004, labsdb1003 - ran puppet (they wanted to git clone during gerrit restart) [21:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:19] RECOVERY - 
puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [21:18:39] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [21:20:29] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [21:22:15] !log cp3048 - labservices1001 - ran puppet, in this case it wasn't about gerrit, but recovered too [21:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:17] (03CR) 10Dzahn: "no-op on einsteinium/tegmen besides the newline changes in check_keystone*" [puppet] - 10https://gerrit.wikimedia.org/r/327592 (https://phabricator.wikimedia.org/T148494) (owner: 10Dzahn) [21:27:18] PROBLEM - puppet last run on fermium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_mailman_queue] [21:30:23] (03CR) 10Hashar: "Thank you Paladox :}" [puppet] - 10https://gerrit.wikimedia.org/r/49993 (https://phabricator.wikimedia.org/T40114) (owner: 10Hashar) [21:30:37] (03CR) 10Hashar: "And others as well obviously! 
;-}" [puppet] - 10https://gerrit.wikimedia.org/r/49993 (https://phabricator.wikimedia.org/T40114) (owner: 10Hashar) [21:30:43] (03CR) 10Paladox: "Your welcome :)" [puppet] - 10https://gerrit.wikimedia.org/r/49993 (https://phabricator.wikimedia.org/T40114) (owner: 10Hashar) [21:33:18] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [22:05:11] 06Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 07WorkType-Maintenance: Jenkins master / client ssh connection fails due to missing ssh algorithm - https://phabricator.wikimedia.org/T100509#2932111 (10Paladox) [22:06:14] !log demon@tin Synchronized php-1.29.0-wmf.7/includes/libs/objectcache/WANObjectCache.php: Silence obnoxious replag errors (duration: 00m 42s) [22:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:48] AaronSchulz: ^^ [22:07:16] (03PS2) 10Florianschmidtwelzow: Enable sitenotice banners for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327253 (https://phabricator.wikimedia.org/T152826) [22:12:00] !log reedy@tin Synchronized php-1.29.0-wmf.7/extensions/Wikidata: (no message) (duration: 02m 21s) [22:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:24] !log reedy@tin Synchronized php-1.29.0-wmf.7/includes/registration/ExtensionRegistry.php: (no message) (duration: 00m 43s) [22:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:53] I think hhvm on mw1201 is flipping out [22:20:14] Lots of LightProcess (lame, I know), Shutting down due to failure(s) to bind in HttpServer::runAndExitProcess, Unable to start page server [22:20:32] E0110 22:16:46.877539 935 fastcgi-server.cpp:105] 98failed to bind to async server socket: 127.0.0.1:9000: Address already in use [22:20:35] etc etc etc [22:22:17] (03PS1) 10Reedy: Revert "Revert "Remove MWVersion, fold its two functions into MWMultiVersion"" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/331552 [22:22:21] (03PS2) 10Reedy: Revert "Revert "Remove MWVersion, fold its two functions into MWMultiVersion"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331552 [22:22:40] (03PS3) 10Reedy: Reinstate "Remove MWVersion, fold its two functions into MWMultiVersion" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331552 [22:26:04] (03PS1) 10Krinkle: gerrit: Reduce tableEven/OddRowColor contrast. [puppet] - 10https://gerrit.wikimedia.org/r/331553 (https://phabricator.wikimedia.org/T40114) [22:26:19] mutante: ^ [22:27:06] haha, that didnt take long [22:27:10] (03PS2) 10Krinkle: gerrit: Reduce tableEven/OddRowColor contrast. [puppet] - 10https://gerrit.wikimedia.org/r/331553 (https://phabricator.wikimedia.org/T40114) [22:27:44] ok [22:35:09] (03PS1) 10Dzahn: mediawiki videoscaler: include mediawiki::common role [puppet] - 10https://gerrit.wikimedia.org/r/331555 (https://phabricator.wikimedia.org/T150160) [22:36:00] <_joe_> ostriches: interesting state on mw1201 [22:36:57] (03CR) 10Chad: [C: 031] gerrit: Reduce tableEven/OddRowColor contrast. [puppet] - 10https://gerrit.wikimedia.org/r/331553 (https://phabricator.wikimedia.org/T40114) (owner: 10Krinkle) [22:37:11] _joe_: Yeah, that was a new one for me [22:37:28] <_joe_> ostriches: basically systemd lost track of the hhvm process there [22:43:15] (03CR) 10Paladox: [C: 031] gerrit: Reduce tableEven/OddRowColor contrast. [puppet] - 10https://gerrit.wikimedia.org/r/331553 (https://phabricator.wikimedia.org/T40114) (owner: 10Krinkle) [22:44:32] (03CR) 10Dzahn: [C: 032] gerrit: Reduce tableEven/OddRowColor contrast. 
[puppet] - 10https://gerrit.wikimedia.org/r/331553 (https://phabricator.wikimedia.org/T40114) (owner: 10Krinkle) [22:45:13] Thanks mutante [22:46:03] no problem, just have to restart [22:46:48] !log gerrit restarting for config change 331553 [22:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:58] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:47:39] restart done [22:50:28] PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [22:53:11] !log demon@tin Synchronized php-1.29.0-wmf.7/extensions/VisualEditor/ApiVisualEditor.php: T154962 logspam (duration: 00m 41s) [22:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:15] T154962: Undefined index: l in ApiVisualEditor::getLangLinks - https://phabricator.wikimedia.org/T154962 [22:53:18] James_F: ^ sync'd [22:53:43] Thank you! 
[22:55:08] PROBLEM - check_mysql on fdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1495 [22:55:08] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1495 [22:55:08] PROBLEM - check_mysql on frdb1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1493 [22:55:28] PROBLEM - Nginx local proxy to apache on mw1182 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:55:38] PROBLEM - HHVM rendering on mw1182 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:56:18] PROBLEM - Apache HTTP on mw1182 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:57:14] (03PS1) 10Krinkle: contint: Re-add dir.php to doc.wm.org DirectoryIndex [puppet] - 10https://gerrit.wikimedia.org/r/331558 (https://phabricator.wikimedia.org/T150727) [22:58:57] James_F: Thanks for the quick fix. Errors have dropped off so I resolved the task [22:59:03] Yay, only 142 #logspam bugs open now! [22:59:05] ostriches: Awesome. [22:59:20] ostriches: Is there an obvious big Editing problem next to fix? ;-) [23:00:08] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1795 [23:00:08] PROBLEM - check_mysql on frdb1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1793 [23:00:08] PROBLEM - check_mysql on fdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1483 [23:00:21] hrmm [23:00:35] James_F: Nothing on top of my logstash dashboards, but feel free to skim https://phabricator.wikimedia.org/tag/wikimedia-log-errors/ and see if any still need fixing / already fixed :) [23:00:41] slow slaves on fundraising db.... [23:00:45] i think i'll drop jeff a txt [23:00:48] Hmm. Is T64896 still real? 
[23:00:48] T64896: Fatal error: Call to a member function preSaveTransform() on a non-object in EditPage.php on line 3260 - https://phabricator.wikimedia.org/T64896 [23:01:12] Grepped for "edit" in "production impact". :-) [23:01:55] ostriches: still going in the wrong direction :/ https://phabricator.wikimedia.org/maniphest/report/burn/?project=PHID-PROJ-4uc7r7pdosfsk55qg7f6 [23:03:30] (03CR) 10Krinkle: [C: 032] build: Update PHPUnit from 3.7 to 4.8, add phplint to composer-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331093 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [23:03:59] James_F: Assuming it was possible in 2014 when it was filed :p [23:04:04] Now, is it still? Who knows! [23:04:06] (03Merged) 10jenkins-bot: build: Update PHPUnit from 3.7 to 4.8, add phplint to composer-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331093 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [23:04:20] (03CR) 10jenkins-bot: build: Update PHPUnit from 3.7 to 4.8, add phplint to composer-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331093 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [23:04:29] ostriches: If there's nothing in the logs in the past 7 days we could at least move it to "backlog". :-) [23:04:38] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [23:04:46] Backlog is unsorted stuff. Should probably have a "maybe fixed?" column [23:04:47] Heheh [23:04:55] * James_F grins. 
[23:05:08] PROBLEM - check_mysql on frdb1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2093 [23:05:08] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2095 [23:05:08] RECOVERY - check_mysql on fdb2001 is OK: Uptime: 1671266 Threads: 1 Questions: 54954024 Slow queries: 9033 Opens: 6710 Flush tables: 2 Open tables: 540 Queries per second avg: 32.881 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [23:05:27] * ostriches gets rid of Fix available & Done columns [23:05:38] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2789862 keys, up 71 days 14 hours - replication_delay is 0 [23:05:38] Fix available is transient and unused. Done columns are redundant [23:05:59] yeah, I'm pro not keeping status tracked in 99.99% of workboard columns [23:06:07] (kanban/sprint boards are the exceptions) [23:06:56] ostriches: /srv/mediawiki is dirty btw due to untracked scap/log/ - just noticed, no worries. [23:07:08] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:07:34] Hmm, should probably add scap/log to .gitignore in the flattened repo in /srv/mediawiki [23:07:43] cc thcipriani ^ [23:08:33] (03CR) 10Krinkle: "(don't mind me - Testing the new Jenkins job)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331552 (owner: 10Reedy) [23:08:36] (03CR) 10Krinkle: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331552 (owner: 10Reedy) [23:08:45] ostriches: yarp.
[23:09:07] !log Ran DELETE FROM wbc_entity_usage WHERE eu_row_id IN(1714177, 1714178, 1714179, 1714180, 1714181, 1714182, 1714183, 1714184, 3914375); on s5 master (T147630) [23:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:11] T147630: wbc_entity_usage contains invalid entity ids (numeric parts to large) - https://phabricator.wikimedia.org/T147630 [23:09:30] (03PS2) 10Volans: mediawiki videoscaler: include mediawiki::common role [puppet] - 10https://gerrit.wikimedia.org/r/331555 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [23:09:58] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [23:10:08] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2395 [23:10:08] PROBLEM - check_mysql on frdb1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2393 [23:12:13] Jeff_Green: So the icinga alerts have been growing in seconds behind master pretty quickly =P [23:12:24] (03CR) 10Volans: [C: 04-1] "Spurious file in code review" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/331555 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [23:12:34] sorry to bug ya! [23:12:45] no worries! 
we're just packing, we fly out in the AM [23:14:11] (03CR) 10Jforrester: [C: 031] contint: Re-add dir.php to doc.wm.org DirectoryIndex [puppet] - 10https://gerrit.wikimedia.org/r/331558 (https://phabricator.wikimedia.org/T150727) (owner: 10Krinkle) [23:14:59] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [23:15:08] PROBLEM - check_mysql on frdb1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2693 [23:15:08] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2695 [23:15:48] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [23:15:55] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#2932505 (10Volans) The summary of the session with all the relevant links (detailed notes, slid... [23:18:08] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:18:28] RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [23:18:58] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [23:20:08] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2995 [23:20:08] PROBLEM - check_mysql on frdb1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2993 [23:25:08] PROBLEM - check_mysql on frdb1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3293 [23:25:08] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3295 [23:28:46] ACKNOWLEDGEMENT - check_mysql on frdb1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3293 Jeff_Green ill-concieved large delete [23:30:08] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3595 [23:30:39] ACKNOWLEDGEMENT - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3595 Jeff_Green ill-concieved large delete [23:31:38] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service] [23:34:00] (03PS3) 10Dzahn: mediawiki videoscaler: include mediawiki::common role [puppet] - 10https://gerrit.wikimedia.org/r/331555 (https://phabricator.wikimedia.org/T150160) [23:34:54] (03CR) 10Dzahn: "yea, thanks. that cert file has nothing to do with it, already saw it and got distracted." 
[puppet] - 10https://gerrit.wikimedia.org/r/331555 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [23:38:42] (03CR) 10Krinkle: [C: 031] mediawiki videoscaler: include mediawiki::common role [puppet] - 10https://gerrit.wikimedia.org/r/331555 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [23:38:58] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:39:07] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "role::mediawiki::scaler includes role::mediawiki::common." [puppet] - 10https://gerrit.wikimedia.org/r/331555 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [23:40:08] RECOVERY - check_mysql on lutetium is OK: Uptime: 3746284 Threads: 2 Questions: 587087543 Slow queries: 20133 Opens: 94649235 Flush tables: 2 Open tables: 64 Queries per second avg: 156.711 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [23:40:08] RECOVERY - check_mysql on frdb1001 is OK: Uptime: 4954842 Threads: 1 Questions: 687692084 Slow queries: 42351 Opens: 27169 Flush tables: 1 Open tables: 596 Queries per second avg: 138.791 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [23:40:45] (03CR) 10Dzahn: "yea.. .. i just compiled and saw no difference.. uhm.. but why do i not get freeipmi installed from standard" [puppet] - 10https://gerrit.wikimedia.org/r/331555 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [23:42:22] (03CR) 10Dzahn: [C: 04-2] "right, this is not it.
i am trying to debug why freeipmi doesn't get installed on these seemingly random hosts https://phabricator.wikime" [puppet] - 10https://gerrit.wikimedia.org/r/331555 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [23:49:06] (03Abandoned) 10Dzahn: mediawiki videoscaler: include mediawiki::common role [puppet] - 10https://gerrit.wikimedia.org/r/331555 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [23:58:58] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures