[00:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160301T0000). [00:00:05] RoanKattouw schana: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:07] 6Operations, 6Labs, 13Patch-For-Review: Setup private docker registry with authentication support in tools - https://phabricator.wikimedia.org/T118758#2074234 (10yuvipanda) This now works properly, and I can push and pull! However, docker has decided to do incredibly braindead things and ties image names to... [00:00:46] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:00:50] I'll do it [00:00:52] schana: You here? [00:00:56] yes [00:01:12] (03CR) 10Catrope: [C: 032] Configure default Echo subscriptions user options on he.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246171 (https://phabricator.wikimedia.org/T114982) (owner: 10Dereckson) [00:01:20] Cool [00:01:55] (03Merged) 10jenkins-bot: Configure default Echo subscriptions user options on he.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246171 (https://phabricator.wikimedia.org/T114982) (owner: 10Dereckson) [00:07:34] Should I be able to ssh to misc-web-lb.eqiad.wikimedia.org? If not, is there some other way to see why it’s failing in my attempts to load https://labtesthorizon.wikimedia.org/? 
[00:08:17] (03PS1) 10Ori.livneh: XHGui: Use $_SERVER['SCRIPT_NAME'] as the URI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274024 [00:08:41] andrewbogott: looking [00:09:07] I only just set up that site, it’s surely broken because I missed a piece [00:09:09] andrewbogott: you can ssh into one of the misc varnishes and run varnishlog to see the request [00:09:35] (03CR) 10Catrope: [C: 032] Change rate for reader segmentation survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274005 (https://phabricator.wikimedia.org/T125946) (owner: 10Nschaaf) [00:09:38] so that would be… 'ssh misc-web-lb.eqiad.wikimedia.org’? [00:09:45] or is that not a misc varnish? [00:09:58] that's the load-balancer [00:10:06] (03Merged) 10jenkins-bot: Change rate for reader segmentation survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274005 (https://phabricator.wikimedia.org/T125946) (owner: 10Nschaaf) [00:10:17] andrewbogott: you can use cp1061, as an example [00:10:21] ah, ok [00:10:22] thanks [00:10:37] andrewbogott: when I need a host from $foo_cluster, I typically grab it from ganglia [00:10:42] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Configure default Echo subscriptions user options on hewiki (duration: 00m 48s) [00:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:11:11] andrewbogott: your VCL change is broken [00:11:27] andrewbogott: https://dpaste.de/pGPp/raw [00:11:40] so varnish refused to load the new vcl, which means it's still running with the old one [00:12:16] andrewbogott: for 'set req.backend = xxx', 'xxx' has to be a defined backend. A hostname is not enough. 
[00:12:31] ok, so, indeed, ‘missing a piece' [00:12:33] * andrewbogott looks [00:13:02] andrewbogott: you need to declare it in modules/role/manifests/cache/misc.pp [00:13:12] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Change rate for reader segmentation survey (duration: 00m 41s) [00:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:13:44] andrewbogott: you can copy any of the existing examples that map to a single hostname. like terbium or ytterbium or whatever. [00:14:17] once you do that, the VCL will compile, and your vhost will work [00:15:11] schana: Your change is deployed, please verify it went OK [00:15:17] (03CR) 10Aaron Schulz: [C: 031] swift: return 400 on UnicodeDecodeErrors [puppet] - 10https://gerrit.wikimedia.org/r/273431 (https://phabricator.wikimedia.org/T128081) (owner: 10Filippo Giunchedi) [00:15:25] thanks RoanKattouw [00:15:41] (03PS2) 10Andrew Bogott: Catch liberty up with some horizon apache config changes [puppet] - 10https://gerrit.wikimedia.org/r/274017 [00:15:43] (03PS1) 10Andrew Bogott: Define labtestweb2001 as a misc backend [puppet] - 10https://gerrit.wikimedia.org/r/274026 [00:15:47] ori: ^ ? [00:16:01] * ori looks [00:16:47] (03CR) 10Ori.livneh: [C: 031] Define labtestweb2001 as a misc backend [puppet] - 10https://gerrit.wikimedia.org/r/274026 (owner: 10Andrew Bogott) [00:16:50] andrewbogott: yes, that's right. [00:17:18] great, thank you! 
[00:18:03] (03CR) 10Andrew Bogott: [C: 032] Define labtestweb2001 as a misc backend [puppet] - 10https://gerrit.wikimedia.org/r/274026 (owner: 10Andrew Bogott) [00:19:20] Hmm, my config patch didn't work [00:19:33] echo-subscriptions-web-emailuser is still false on hewiki [00:20:48] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:22:25] Oh I see why [00:22:48] $wgDefaultUserOptions['echo-subscriptions-web-emailuser'] = false; in Echo.php [00:23:40] could move it from InitialiseSettings.php to CommonSettings.php, after the require_once for Echo [00:23:53] (03CR) 10Catrope: "It turns out modifying wgDefaultUserOptions from InitialiseSettings.php this way doesn't work, because Echo.php overrides a bunch of these" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246171 (https://phabricator.wikimedia.org/T114982) (owner: 10Dereckson) [00:24:06] (03PS1) 10Catrope: Revert "Configure default Echo subscriptions user options on he.wikipedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274027 [00:24:12] (03CR) 10Catrope: [C: 032] Revert "Configure default Echo subscriptions user options on he.wikipedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274027 (owner: 10Catrope) [00:25:05] ori: Yeah, PS1 of that change did that, but people complained [00:25:05] So I'm reverting and instead merging Ps1 [00:25:13] (03Merged) 10jenkins-bot: Revert "Configure default Echo subscriptions user options on he.wikipedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274027 (owner: 10Catrope) [00:25:14] nod [00:26:35] !log Cleaned up remnants of XHGui role on hafnium, now that XHGui is on tungsten [00:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:27:58] (03PS1) 10Catrope: Configure default Echo subscriptions user options on he.wikipedia (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274028 
(https://phabricator.wikimedia.org/T114982) [00:28:09] RoanKattouw: any chance could ride along as well? [00:28:16] not in the calendar, so feel free to decline [00:28:25] (03CR) 10Catrope: [C: 032] XHGui: Use $_SERVER['SCRIPT_NAME'] as the URI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274024 (owner: 10Ori.livneh) [00:28:31] thanks [00:29:24] (03Merged) 10jenkins-bot: XHGui: Use $_SERVER['SCRIPT_NAME'] as the URI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274024 (owner: 10Ori.livneh) [00:30:07] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Revert "Configure default Echo subscriptions user options on hewiki", doesnt work (duration: 00m 46s) [00:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:31:17] !log catrope@tin Synchronized wmf-config/StartProfiler.php: XHGui: Use SCRIPT_NAME as the URI (duration: 00m 46s) [00:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:31:36] thanks RoanKattouw [00:31:57] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:33:16] (03CR) 10Catrope: [C: 032] Configure default Echo subscriptions user options on he.wikipedia (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274028 (https://phabricator.wikimedia.org/T114982) (owner: 10Catrope) [00:33:46] (03Merged) 10jenkins-bot: Configure default Echo subscriptions user options on he.wikipedia (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274028 (https://phabricator.wikimedia.org/T114982) (owner: 10Catrope) [00:34:02] 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#1890778 (10RobH) This wasn't in the pending approval column, so I've moved it there and dropped @mark a note via IRC PM. (It may not have been part of his triage du... 
[00:35:44] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Configure default Echo subscriptions user options on hewiki (take 2, part 1) (duration: 00m 42s) [00:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:36:38] !log catrope@tin Synchronized wmf-config/CommonSettings.php: Configure default Echo subscriptions user options on hewiki (take 2, part 2) (duration: 00m 40s) [00:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:36:57] RECOVERY - puppet last run on cp1045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:39:22] !log catrope@tin Synchronized php-1.27.0-wmf.14/extensions/Echo/Hooks.php: Add debug logging for thank-you-edit notifications (duration: 00m 43s) [00:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:39:27] (03PS1) 10Alex Monk: phabricator: Send weekly mail every week instead of on certain monthdays [puppet] - 10https://gerrit.wikimedia.org/r/274033 [00:39:49] schana: Is that survey rate change behaving as expected? [00:40:08] yes RoanKattouw [00:40:25] Cool, thanks [00:45:46] !log upgrade elastic2009.codfw.wmnet to elasticsearch 1.7.5 [00:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:47:07] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [00:51:31] (03PS5) 10Mforns: Replace limn::data::generate by reportupdater [puppet] - 10https://gerrit.wikimedia.org/r/273487 (https://phabricator.wikimedia.org/T127327) [00:59:37] (03CR) 10Mforns: Replace limn::data::generate by reportupdater (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/273487 (https://phabricator.wikimedia.org/T127327) (owner: 10Mforns) [01:01:13] (03CR) 10Mforns: "Otto, I didn't have time to test that in labs, and it's late for me, sorry. I can do that next week. 
Anyway I think this is the code that " [puppet] - 10https://gerrit.wikimedia.org/r/273487 (https://phabricator.wikimedia.org/T127327) (owner: 10Mforns) [01:03:10] !log ori@tin Synchronized php-1.27.0-wmf.14/includes/user/User.php: I419f356b: Cache user data in memory (duration: 00m 42s) [01:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:05:01] RoanKattouw: there are a whole bunch of these in exception.log: Function: EchoEventMapper::insert Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.64.16.18) [01:05:49] Ugh, yes [01:05:58] Crap [01:05:58] That would be me [01:06:11] whole bunch = 70 [01:06:24] OK that's not /too/ terrible [01:06:34] 6Operations, 10MediaWiki-Uploading, 6Multimedia, 10Traffic, 10Wikimedia-Video: Uploading 1.2GB ogv results in 503 - https://phabricator.wikimedia.org/T128358#2074447 (10zhuyifei1999) >>! In T128358#2072888, @BBlack wrote: > We should confirm whether this is purely timeout related (due to slow upload spe... [01:06:37] Looking [01:07:48] In short it's because MySQL sucks [01:08:26] I'm cleaning up duplicate events from that table that were caused by T128249 [01:08:26] T128249: Multiple "You made your edit!" 
notifications - https://phabricator.wikimedia.org/T128249 [01:08:44] But the way I'm doing it is apparently slow and causing locks [01:08:48] I've aborted it, because on enwiki creating the tmp table for my query took >10 min [01:08:55] yikes [01:09:10] I thought it would be good for locks and replag and stuff to gather the IDs I needed to delete in a tmp table, then use that [01:09:14] With the whole thing wrapped in a transaction [01:09:16] But nooo [01:09:37] Maybe if I remove the transaction, I won't get such bad locking behavior [01:10:24] Alternatively, let's see if I can fake a temp table [01:10:47] The number of rows that need deleting is actually pretty low, and the SELECT to produce their IDs is pretty fast, I have no idea why INSERT...SELECTing the same data into a tmp table is so slow [01:13:57] RoanKattouw: google says INSERT INTO ... SELECT will read lock the SELECT table while a normal SELECT won't [01:14:44] so your query will have to wait until all write locks are cleared, and future write locks will have to wait for it [01:15:12] oh, good find [01:17:36] RoanKattouw: if it's not a lot of data and you don't care about consistency, it might be less trouble to just dump it to a file and then load the temporary table from that file [01:18:05] Yeah that would be better [01:18:20] It only takes 300ms to do the SELECT [01:19:28] I'm not sure if I can do SELECT INTO OUTFILE on our DB servers though [01:20:03] RoanKattouw: use tee? 
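The idea of splitting the read from the write can be sketched client-side: run the plain SELECT, keep the IDs in memory, then delete by primary key in small batches so each statement is short. This is only an illustration under assumptions: an in-memory SQLite database stands in for MySQL (its locking differs from InnoDB), the reduced `echo_event` schema is made up for the demo, and `delete_in_batches`, `chunks`, and `batch_size` are hypothetical names.

```python
import sqlite3


def chunks(ids, size):
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def delete_in_batches(conn, ids, batch_size=2):
    """Delete rows by primary key, one short transaction per batch."""
    deleted = 0
    for batch in chunks(ids, batch_size):
        marks = ",".join("?" * len(batch))  # "?,?" placeholders for this batch
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                f"DELETE FROM echo_event WHERE event_id IN ({marks})", batch
            )
            deleted += cur.rowcount
    return deleted


def build_demo():
    """Create a tiny stand-in table with two duplicated events (IDs 2 and 5)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE echo_event ("
        "event_id INTEGER PRIMARY KEY, event_agent_id INTEGER, event_extra TEXT)"
    )
    conn.executemany(
        "INSERT INTO echo_event VALUES (?, ?, ?)",
        [(1, 7, "a"), (2, 7, "a"), (3, 7, "b"), (4, 8, "a"), (5, 8, "a")],
    )
    conn.commit()
    return conn


def find_duplicates(conn):
    """Step 1: a plain SELECT that only reads; keeps the lowest ID per group."""
    rows = conn.execute(
        "SELECT event_id FROM echo_event WHERE event_id NOT IN ("
        "SELECT MIN(event_id) FROM echo_event "
        "GROUP BY event_agent_id, event_extra) ORDER BY event_id"
    )
    return [row[0] for row in rows]
```

Calling `delete_in_batches(conn, find_duplicates(conn))` on the demo database removes events 2 and 5. Whether batching like this actually avoids the lock waits seen on the production tables is not something this sketch can show.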
[01:21:35] OK I found a way to do it all in one query [01:21:43] Behold my ugly solution [01:21:56] delete from echo_event, echo_notification using echo_event join echo_notification on notification_user=event_agent_id and notification_event=event_id left join (select min(event_id) as m from echo_event where event_type='thank-you-edit' group by event_agent_id, event_extra) as t on t.m=event_id where m is null and event_type='thank-you-edit'; [01:22:34] I first wrote this with WHERE event_id NOT IN( SELECT MIN(event_id) FROM echo_event...) but that breaks because it doesn't want you to use the echo_event table in a subquery while the outer query deletes from that same table [01:22:41] But if you do it with a join it's apparently OK [01:23:56] 6Operations, 10MediaWiki-Uploading, 6Multimedia, 10Traffic, 10Wikimedia-Video: Uploading 1.2GB ogv results in 503 - https://phabricator.wikimedia.org/T128358#2074495 (10zhuyifei1999) I'm not sure if it's related, but action=purge on 3GB+ videos results in 503 as well. (Will file a separate task) [01:27:03] 6Operations, 10MediaWiki-Uploading, 6Multimedia, 10Traffic, 10Wikimedia-Video: Uploading 1.2GB ogv results in 503 - https://phabricator.wikimedia.org/T128358#2074496 (10Krenair) I don't really watch the clock (I tend to read email while those scripts are running), and importImages doesn't give a time. I... 
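The logic of that DELETE (keep the lowest `event_id` per agent and `event_extra`, remove the rest along with their notifications) can be checked on a toy database. This is a sketch under assumptions, not the query that was run: it uses SQLite in memory, which has no multi-table DELETE, so it issues two statements, and it uses the `NOT IN` form that MySQL rejected but SQLite accepts. The reduced schema and the `dedupe` and `build_demo` helpers are hypothetical.

```python
import sqlite3

# Subquery shared by both statements: the first event of each group is kept.
KEPT = """
    SELECT MIN(event_id) FROM echo_event
    WHERE event_type = :t
    GROUP BY event_agent_id, event_extra
"""

# Simplification: matches notifications on event ID only, whereas the
# original query also joins on notification_user = event_agent_id.
DELETE_NOTIFICATIONS = f"""
    DELETE FROM echo_notification
    WHERE notification_event IN (
        SELECT event_id FROM echo_event
        WHERE event_type = :t AND event_id NOT IN ({KEPT}))
"""

DELETE_EVENTS = f"""
    DELETE FROM echo_event
    WHERE event_type = :t AND event_id NOT IN ({KEPT})
"""


def dedupe(conn, event_type="thank-you-edit"):
    """Remove duplicate events of one type and their notifications."""
    with conn:  # both deletes commit together or not at all
        # Notifications first, while the duplicate events still exist.
        conn.execute(DELETE_NOTIFICATIONS, {"t": event_type})
        conn.execute(DELETE_EVENTS, {"t": event_type})


def build_demo():
    """Toy tables: events 2 and 5 duplicate events 1 and 4."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE echo_event (
            event_id INTEGER PRIMARY KEY, event_type TEXT,
            event_agent_id INTEGER, event_extra TEXT);
        CREATE TABLE echo_notification (
            notification_event INTEGER, notification_user INTEGER);
    """)
    events = [
        (1, "thank-you-edit", 7, "count=1"),
        (2, "thank-you-edit", 7, "count=1"),
        (3, "thank-you-edit", 7, "count=10"),
        (4, "thank-you-edit", 8, "count=1"),
        (5, "thank-you-edit", 8, "count=1"),
        (6, "mention", 7, "count=1"),  # other event type, must survive
    ]
    conn.executemany("INSERT INTO echo_event VALUES (?, ?, ?, ?)", events)
    conn.executemany(
        "INSERT INTO echo_notification VALUES (?, ?)",
        [(e[0], e[2]) for e in events],
    )
    conn.commit()
    return conn
```

After `dedupe(build_demo())`, the demo keeps events 1, 3, 4 and 6 and one notification for each.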
[01:36:33] !log upgrade elastic2010.codfw.wmnet to elasticsearch 1.7.5 [01:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:40:40] !log ori@tin Synchronized php-1.27.0-wmf.14/extensions/CentralAuth: Idc873134: Avoid using "new CentralAuthUser" since it avoids the cache (duration: 00m 51s) [01:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:08:52] (03PS2) 10Ori.livneh: varnish: report response age to StatsD [puppet] - 10https://gerrit.wikimedia.org/r/269086 [02:20:38] 6Operations, 10MediaWiki-Uploading, 6Multimedia, 10Traffic, 10Wikimedia-Video: Uploading 1.2GB ogv results in 503 - https://phabricator.wikimedia.org/T128358#2074619 (10zhuyifei1999) >>! In T128358#2074496, @Krenair wrote: > I don't really watch the clock (I tend to read email while those scripts are ru... [02:25:05] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.14) (duration: 10m 50s) [02:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:30:35] !log upgrade elastic2011.codfw.wmnet to elasticsearch 1.7.5 [02:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:32:41] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Mar 1 02:32:41 UTC 2016 (duration 7m 36s) [02:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:06:47] PROBLEM - ElasticSearch health check for shards on nobelium is CRITICAL: CRITICAL - elasticsearch inactive shards 1305 threshold =0.1% breach: status: red, number_of_nodes: 1, unassigned_shards: 1302, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 452, cluster_name: labsearch, relocating_shards: 0, active_shards: 452, initializing_shards: 3, number_of_data_nodes: 1, delayed_unassigne [03:12:20] ignore that [03:17:27] RECOVERY - ElasticSearch health check for shards on nobelium is OK: OK - elasticsearch 
status labsearch: status: green, number_of_nodes: 1, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 1757, cluster_name: labsearch, relocating_shards: 0, active_shards: 1757, initializing_shards: 0, number_of_data_nodes: 1, delayed_unassigned_shards: 0 [03:20:52] 6Operations, 10Phabricator, 6Release-Engineering-Team: just in case: set up a new oauth consumer on mediawiki.org that has oauth callback url checkbox enabled - https://phabricator.wikimedia.org/T96618#2074739 (10mmodell) [04:03:29] !log upgrade elastic2012.codfw.wmnet to elasticsearch 1.7.5 [04:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:54:21] (hours later) everything works now, thanks ori! [04:55:16] !log upgrade elastic2013.codfw.wmnet to elasticsearch 1.7.5 [04:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:56:38] (03PS3) 10Andrew Bogott: Catch liberty up with some horizon apache config changes [puppet] - 10https://gerrit.wikimedia.org/r/274017 [04:56:40] (03PS1) 10Andrew Bogott: Horizon: turn off COMPRESS_OFFLINE [puppet] - 10https://gerrit.wikimedia.org/r/274055 [04:58:26] (03CR) 10Andrew Bogott: [C: 032] Horizon: turn off COMPRESS_OFFLINE [puppet] - 10https://gerrit.wikimedia.org/r/274055 (owner: 10Andrew Bogott) [05:00:36] (03PS4) 10Andrew Bogott: Catch liberty up with some horizon apache config changes [puppet] - 10https://gerrit.wikimedia.org/r/274017 [05:04:14] (03CR) 10Andrew Bogott: [C: 032] Catch liberty up with some horizon apache config changes [puppet] - 10https://gerrit.wikimedia.org/r/274017 (owner: 10Andrew Bogott) [05:49:31] (03PS1) 10Andrew Bogott: Run Horizon version 'liberty' on labtestweb. [puppet] - 10https://gerrit.wikimedia.org/r/274057 [05:57:02] (03CR) 10Andrew Bogott: [C: 032] Run Horizon version 'liberty' on labtestweb. 
[puppet] - 10https://gerrit.wikimedia.org/r/274057 (owner: 10Andrew Bogott) [06:02:50] (03PS1) 10Andrew Bogott: Run Horizon version 'liberty' on labtestweb. [puppet] - 10https://gerrit.wikimedia.org/r/274059 [06:04:12] (03CR) 10Andrew Bogott: [C: 032] Run Horizon version 'liberty' on labtestweb. [puppet] - 10https://gerrit.wikimedia.org/r/274059 (owner: 10Andrew Bogott) [06:10:00] !log upgrade elastic2014.codfw.wmnet to elasticsearch 1.7.5 [06:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:30:57] PROBLEM - puppet last run on mw1086 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:27] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:37] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:26] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:27] 6Operations: Estimate hardware requirements for relevance lab elasticsearch servers - https://phabricator.wikimedia.org/T128433#2074818 (10Peachey88) [06:34:42] 6Operations, 6Labs, 10Labs-Infrastructure: Estimate hardware requirements for relevance lab elasticsearch servers - https://phabricator.wikimedia.org/T128433#2074820 (10Peachey88) [06:48:14] 6Operations, 6Discovery, 10hardware-requests: Refresh elastic10{01..16}.eqiad.wmnet servers - https://phabricator.wikimedia.org/T128000#2074847 (10EBernhardson) [06:48:40] 6Operations, 6Discovery, 10hardware-requests: Refresh elastic10{01..16}.eqiad.wmnet servers - https://phabricator.wikimedia.org/T128000#2060818 (10EBernhardson) after poking around i think this belongs in hardware-requests, a procurement ticket will be created later for quotes and such. 
[06:55:47] RECOVERY - puppet last run on mw1086 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:57:07] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:26] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:58:06] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:11:06] !log upgrade elastic2015.codfw.wmnet to elasticsearch 1.7.5 [07:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:00:31] 6Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: es2010 controller issue - https://phabricator.wikimedia.org/T127769#2074907 (10MoritzMuehlenhoff) a:3Papaul [08:03:08] 6Operations, 13Patch-For-Review: mc1014 fails to pxe-boot with jessie - https://phabricator.wikimedia.org/T128068#2074908 (10MoritzMuehlenhoff) @Elukey, since Faidon's patch has been merged, I'm assigning the task to you, so that you can close it if mc1014 now installs again. 
[08:03:23] 6Operations, 13Patch-For-Review: mc1014 fails to pxe-boot with jessie - https://phabricator.wikimedia.org/T128068#2074909 (10MoritzMuehlenhoff) a:3elukey [08:10:57] 6Operations, 10Traffic, 10Wikimedia-Blog, 7HTTPS: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#1454386 (10MoritzMuehlenhoff) Assigning to Tilman for the time being since he's currently running the discussion with Automattic [08:11:15] 6Operations, 10Traffic, 10Wikimedia-Blog, 7HTTPS: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2074916 (10MoritzMuehlenhoff) a:3Tbayer [08:12:18] 6Operations, 10media-storage: File not found after reupload - https://phabricator.wikimedia.org/T125140#2074917 (10MoritzMuehlenhoff) p:5Triage>3Normal a:3fgiunchedi [08:12:46] 6Operations, 10ops-codfw: es2009 degraded RAID - https://phabricator.wikimedia.org/T125442#2074919 (10MoritzMuehlenhoff) a:3Papaul [08:13:19] 6Operations: upgrade 15+4 swift servers from precise to trusty - https://phabricator.wikimedia.org/T125024#2074922 (10MoritzMuehlenhoff) a:3fgiunchedi [08:14:36] 6Operations, 6Services: Package npm 2.14 - https://phabricator.wikimedia.org/T124474#2074924 (10MoritzMuehlenhoff) p:5Triage>3Low [08:36:07] 6Operations, 10Ops-Access-Requests, 6Services, 13Patch-For-Review: Requesting restbase-roots access to RESTBase cluster for Petr Pchelko - https://phabricator.wikimedia.org/T126283#2074938 (10MoritzMuehlenhoff) [08:41:29] (03PS3) 10Jcrespo: Update haproxy default file, as it cannot be dynamic in jessie [puppet] - 10https://gerrit.wikimedia.org/r/273958 (https://phabricator.wikimedia.org/T125027) [08:43:31] (03PS4) 10Jcrespo: Update haproxy default file, as it cannot be dynamic in jessie [puppet] - 10https://gerrit.wikimedia.org/r/273958 (https://phabricator.wikimedia.org/T125027) [08:55:21] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/1889/" [puppet] - 10https://gerrit.wikimedia.org/r/273958 (https://phabricator.wikimedia.org/T125027) (owner: 
10Jcrespo) [08:57:19] (03CR) 10Jcrespo: "Also tested "manually" on dbproxy1005 (still not in production)." [puppet] - 10https://gerrit.wikimedia.org/r/273958 (https://phabricator.wikimedia.org/T125027) (owner: 10Jcrespo) [09:16:37] 6Operations, 13Patch-For-Review: mc1014 fails to pxe-boot with jessie - https://phabricator.wikimedia.org/T128068#2074990 (10elukey) 5Open>3Resolved [09:19:19] _joe_ o/ I am about to start the work to re-image mc1002 and mc1003, I'll need your reviews for the mediawiki-config changes :) [09:19:43] <_joe_> elukey: ok, let's sync in private [09:22:58] (03PS8) 10Giuseppe Lavagetto: Use wmfMasterDatacenter for picking the master redis config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266513 (https://phabricator.wikimedia.org/T114273) [09:26:25] (03PS1) 10Jcrespo: Add mysql grants for haproxy on m5-master [puppet] - 10https://gerrit.wikimedia.org/r/274064 (https://phabricator.wikimedia.org/T126251) [09:26:35] (03CR) 10Giuseppe Lavagetto: [C: 031] "If that is the only char to keep into account, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/273888 (https://phabricator.wikimedia.org/T128369) (owner: 10Filippo Giunchedi) [09:27:47] (03CR) 10Giuseppe Lavagetto: [C: 031] "I've seen you programmed this as a SWAT deploy, since it's just a repooling I think you're granted the right to deploy this change outside" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273488 (https://phabricator.wikimedia.org/T125084) (owner: 10Elukey) [09:28:28] (03PS1) 10Elukey: Remove mc1002 from redis/memcached pools for maintenance. 
[puppet] - 10https://gerrit.wikimedia.org/r/274065 (https://phabricator.wikimedia.org/T123711) [09:29:43] (03CR) 10Giuseppe Lavagetto: "@moritzm: No, we don't need to take precise into account anymore; actually I am plannig to remove all references to it from the mediawiki " [puppet] - 10https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) (owner: 10Dzahn) [09:30:53] (03CR) 10Elukey: [C: 032] Remove mc1002 from redis/memcached pools for maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/274065 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [09:32:26] !log removed mc1002 from the redis/memcached pools for maintenance [09:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:38:00] (03PS1) 10Elukey: Remove mc1002 from the Lock Manager pool for maintenance. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274068 (https://phabricator.wikimedia.org/T123711) [09:45:10] 6Operations, 13Patch-For-Review, 7user-notice: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#2075047 (10elukey) @Elitre, @Johan: Hi! Today I am going to work on the servers holding user sessions again, so I expect some intermittent issues like users forced to login... [09:46:29] !log elastic2016.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [09:46:31] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. 
- https://phabricator.wikimedia.org/T109101 [09:46:31] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [09:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:47:57] (03PS2) 10Filippo Giunchedi: ganglia: replace reserved characters in cluster name [puppet] - 10https://gerrit.wikimedia.org/r/273888 (https://phabricator.wikimedia.org/T128369) [09:48:04] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] ganglia: replace reserved characters in cluster name [puppet] - 10https://gerrit.wikimedia.org/r/273888 (https://phabricator.wikimedia.org/T128369) (owner: 10Filippo Giunchedi) [09:50:51] (03CR) 10Muehlenhoff: "Great, that should make it simpler, I'll push as PS14 based on that." [puppet] - 10https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) (owner: 10Dzahn) [09:53:44] 6Operations, 10Monitoring, 13Patch-For-Review: ganglia cluster name validation in puppet - https://phabricator.wikimedia.org/T128369#2075095 (10fgiunchedi) 5Open>3Resolved change deployed (currently a noop though) [09:58:46] (03CR) 10Giuseppe Lavagetto: [C: 031] Remove mc1002 from the Lock Manager pool for maintenance. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274068 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [09:59:44] (03PS14) 10Muehlenhoff: mediawiki: update font packages for jessie [puppet] - 10https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) (owner: 10Dzahn) [10:02:22] (03CR) 10Elukey: [C: 032] Remove mc1002 from the Lock Manager pool for maintenance. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/274068 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [10:05:08] !log elukey@tin Synchronized wmf-config/filebackend-production.php: Add mc1002 from the lock managers after maintenance (duration: 00m 56s) [10:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:06:13] !log Amended previous log - Remove mc1002 from the lock managers after maintenance [10:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:06:39] and i left "after" [10:06:52] a section of my brain is not working today [10:24:59] (03PS5) 10Giuseppe Lavagetto: standard: move to own module [puppet] - 10https://gerrit.wikimedia.org/r/273209 (https://phabricator.wikimedia.org/T119042) [10:30:39] (03CR) 10Giuseppe Lavagetto: [C: 032] standard: move to own module [puppet] - 10https://gerrit.wikimedia.org/r/273209 (https://phabricator.wikimedia.org/T119042) (owner: 10Giuseppe Lavagetto) [10:31:14] (03PS2) 10Jcrespo: Add mysql grants for haproxy on m5-master [puppet] - 10https://gerrit.wikimedia.org/r/274064 (https://phabricator.wikimedia.org/T126251) [10:32:31] 6Operations, 10Ops-Access-Requests: Requesting access to researchers for nschaaf - https://phabricator.wikimedia.org/T128381#2072331 (10MoritzMuehlenhoff) @schana : For which groups (as listed at https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups) do you want to get added? 
[10:34:46] (03PS3) 10Jcrespo: Add mysql grants for haproxy on m5-master [puppet] - 10https://gerrit.wikimedia.org/r/274064 (https://phabricator.wikimedia.org/T126251) [10:34:53] (03PS4) 10Giuseppe Lavagetto: role::ntp: rename standard::ntp, move to the standard module [puppet] - 10https://gerrit.wikimedia.org/r/273246 [10:35:39] (03PS4) 10Jcrespo: Add mysql grants for haproxy on m5-master [puppet] - 10https://gerrit.wikimedia.org/r/274064 (https://phabricator.wikimedia.org/T126251) [10:36:26] (03PS5) 10Jcrespo: Add mysql grants for haproxy on m5-master [puppet] - 10https://gerrit.wikimedia.org/r/274064 (https://phabricator.wikimedia.org/T126251) [10:36:58] (03CR) 10Jcrespo: [C: 032] Add mysql grants for haproxy on m5-master [puppet] - 10https://gerrit.wikimedia.org/r/274064 (https://phabricator.wikimedia.org/T126251) (owner: 10Jcrespo) [10:37:38] (03CR) 10Jcrespo: [V: 032] Add mysql grants for haproxy on m5-master [puppet] - 10https://gerrit.wikimedia.org/r/274064 (https://phabricator.wikimedia.org/T126251) (owner: 10Jcrespo) [10:42:38] !log elastic2017.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [10:42:40] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101 [10:42:40] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [10:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:43:47] (03CR) 10Muehlenhoff: [C: 032 V: 032] Regenerate rules/control files after configuration changes [debs/linux44] - 10https://gerrit.wikimedia.org/r/273934 (owner: 10Muehlenhoff) [10:44:03] (03PS5) 10Giuseppe Lavagetto: role::ntp: rename standard::ntp, move to the standard module [puppet] - 10https://gerrit.wikimedia.org/r/273246 [10:45:20] (03CR) 10Giuseppe Lavagetto: [C: 032] "Compiler shows a noop on both clients and servers." 
[puppet] - 10https://gerrit.wikimedia.org/r/273246 (owner: 10Giuseppe Lavagetto) [10:45:59] (03PS1) 10Elukey: Fix the mcXXXX partman config to allow fully automated PXE OS installs. [puppet] - 10https://gerrit.wikimedia.org/r/274071 (https://phabricator.wikimedia.org/T123711) [10:46:40] (03PS2) 10Elukey: Fix the mcXXXX partman config to allow fully automated PXE OS installs. [puppet] - 10https://gerrit.wikimedia.org/r/274071 (https://phabricator.wikimedia.org/T123711) [10:47:18] 6Operations, 13Patch-For-Review: Configure librenms to use LDAP for authentication - https://phabricator.wikimedia.org/T107702#2075216 (10mark) p:5Low>3Normal I just tried to add a user in the web interface, and surprisingly it doesn't seem to work. Perhaps instead of fixing that, we should just finish int... [10:48:49] (03CR) 10Filippo Giunchedi: [C: 04-1] Parameterize the git_server variable in global scap.cfg (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/272947 (https://phabricator.wikimedia.org/T126259) (owner: 1020after4) [10:50:43] 6Operations, 10ops-eqiad, 10DBA: Decommission pc1001-1003 - https://phabricator.wikimedia.org/T124962#1971085 (10jcrespo) Doing this now, labs hopefully soon, too. [10:51:30] (03CR) 10Elukey: [C: 04-1] "Might also be due to the absence of confirm_write_new_label, checking partman's config before final code review." [puppet] - 10https://gerrit.wikimedia.org/r/274071 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [10:53:30] !log disabling puppet and following steps to decommission pc100[123] [10:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:01:18] (03CR) 10Elukey: "Actually confirm_new_label is already there, so it might just be the absence of "boolean". 
I am going to test this patch on carbon for the" [puppet] - https://gerrit.wikimedia.org/r/274071 (https://phabricator.wikimedia.org/T123711) (owner: Elukey) [11:02:18] (PS7) Giuseppe Lavagetto: ntp: further reorg, split of client and server code [puppet] - https://gerrit.wikimedia.org/r/273247 [11:08:02] (CR) Giuseppe Lavagetto: [C: 2] "noop, confirmed with the compiler." [puppet] - https://gerrit.wikimedia.org/r/273247 (owner: Giuseppe Lavagetto) [11:08:18] (PS1) Jcrespo: Decommision pc100[123] [puppet] - https://gerrit.wikimedia.org/r/274076 (https://phabricator.wikimedia.org/T124962) [11:11:44] (PS7) Giuseppe Lavagetto: role::diamond: move to standard::diamond [puppet] - https://gerrit.wikimedia.org/r/273248 [11:12:10] I wonder if I should break up that last patch into smaller ones, so I can disable them on icinga now [11:14:52] <_joe_> jynus: no I think it's fair [11:15:04] <_joe_> I mean the change [11:15:10] _joe_, the change is fair [11:15:22] but there are scripts there I do not know about [11:15:34] <_joe_> then you can just stop puppet on those machines, remove the puppet facts and certs, and they will be removed from icinga as well [11:15:42] and I would need help, but it is not high priority [11:15:42] <_joe_> uhm ok I see [11:15:54] I wanted to remove only from site.pp [11:16:08] and leave the rest for thorough review [11:16:35] (PS1) Elukey: Add mc1002 back to the redis/memcached pools after maintenance.
[puppet] - https://gerrit.wikimedia.org/r/274077 (https://phabricator.wikimedia.org/T123711) [11:17:18] (PS1) Jcrespo: Decommision db100[123] 1/2 [puppet] - https://gerrit.wikimedia.org/r/274078 (https://phabricator.wikimedia.org/T124962) [11:17:31] ^something like this that I can confidently apply now [11:18:39] (PS2) Jcrespo: Decommision pc100[123] 2/2 [puppet] - https://gerrit.wikimedia.org/r/274076 (https://phabricator.wikimedia.org/T124962) [11:18:42] (CR) Elukey: [C: 2] Add mc1002 back to the redis/memcached pools after maintenance. [puppet] - https://gerrit.wikimedia.org/r/274077 (https://phabricator.wikimedia.org/T123711) (owner: Elukey) [11:19:34] (PS2) Jcrespo: Decommision db100[123] 1/2 [puppet] - https://gerrit.wikimedia.org/r/274078 (https://phabricator.wikimedia.org/T124962) [11:19:52] so it doesn't block the rest of the tasks (icinga, etc.) [11:20:08] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [1000.0] [11:20:26] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [11:20:27] !log mc1002.eqiad added back to the memcached/redis pools after maintenance [11:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:20:43] large number of 404 too [11:22:02] Operations, Ops-Access-Requests: Requesting access to researchers for nschaaf - https://phabricator.wikimedia.org/T128381#2072331 (Krenair) title says researchers...
[11:26:14] something happened at 11:18 [11:27:26] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:27:37] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:30:51] Operations, Ops-Access-Requests, Patch-For-Review: Access Request for mobrovac as ci-admin to mess with CI infrastructure - https://phabricator.wikimedia.org/T128175#2075268 (hashar) Looking at log files is unrelated. The request is to grant on labnodepool1001.eqiad.wmnet sudo right as nodepool to ge... [11:31:12] hashar: :) [11:31:19] that is messy :D [11:31:19] hashar: should come to the next ops meeting [11:31:22] yeah [11:31:39] ops meeting is at a terrible time for me. That is right in the middle of family rush hours :D [11:32:19] ideally the Nodepool image used to boot instances would be hooked with labs LDAP, but I never managed to handle the provisioning using the puppet classes :-/ [11:33:09] (PS2) Muehlenhoff: Decom berkelium/curium [puppet] - https://gerrit.wikimedia.org/r/273906 (https://phabricator.wikimedia.org/T125962) [11:34:05] (CR) Jcrespo: [C: 2] Decommision db100[123] 1/2 [puppet] - https://gerrit.wikimedia.org/r/274078 (https://phabricator.wikimedia.org/T124962) (owner: Jcrespo) [11:35:13] (PS3) Jcrespo: Decommision pc100[123] 2/2 [puppet] - https://gerrit.wikimedia.org/r/274076 (https://phabricator.wikimedia.org/T124962) [11:36:54] Operations, Analytics, Traffic: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2075280 (elukey) Adding a link that might be useful for the future if we want to move away from varnish-kafka written in C: https://github.com/xcir/python-varnishapi (ava...
[11:37:43] (CR) Mobrovac: [C: -1] "Time to abandon this, given that it's been superseded by If21429bd0db17924ac8c2eeb4e378f1701785f7c" [puppet] - https://gerrit.wikimedia.org/r/238432 (https://phabricator.wikimedia.org/T112644) (owner: Filippo Giunchedi) [11:38:23] hashar: ok, so basically hooking them up with LDAP would benefit us [11:38:52] (PS8) Giuseppe Lavagetto: role::diamond: move to standard::diamond [puppet] - https://gerrit.wikimedia.org/r/273248 [11:39:07] (CR) Giuseppe Lavagetto: [C: 2 V: 2] "noop again" [puppet] - https://gerrit.wikimedia.org/r/273248 (owner: Giuseppe Lavagetto) [11:39:23] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 61.90% of data above the critical threshold [5000000.0] [11:40:13] !log elastic2018.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [11:40:14] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101 [11:40:14] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [11:40:16] (PS1) Elukey: Revert "Remove mc1002 from the Lock Manager pool for maintenance." to add mc1002 back to the pool. [mediawiki-config] - https://gerrit.wikimedia.org/r/274080 [11:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:40:58] (CR) Elukey: [C: 2] Revert "Remove mc1002 from the Lock Manager pool for maintenance." to add mc1002 back to the pool.
[mediawiki-config] - https://gerrit.wikimedia.org/r/274080 (owner: Elukey) [11:41:32] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [11:42:34] (PS3) Muehlenhoff: Decom berkelium/curium [puppet] - https://gerrit.wikimedia.org/r/273906 (https://phabricator.wikimedia.org/T125962) [11:43:04] (CR) Muehlenhoff: [C: 2 V: 2] Decom berkelium/curium [puppet] - https://gerrit.wikimedia.org/r/273906 (https://phabricator.wikimedia.org/T125962) (owner: Muehlenhoff) [11:43:18] !log elukey@tin Synchronized wmf-config/filebackend-production.php: Add mc1002 back to the lock managers pool after maintenance (duration: 01m 01s) [11:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:44:41] !log shutting down pc100[123] [11:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:45:27] double checking I am shutting down the right hosts [11:47:05] (PS4) Giuseppe Lavagetto: role::mail::sender: move to standard [puppet] - https://gerrit.wikimedia.org/r/273444 [11:48:06] PROBLEM - puppet last run on restbase2002 is CRITICAL: CRITICAL: puppet fail [11:53:09] !log shutting down berkelium (decommissioned) [11:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:53:39] (CR) Giuseppe Lavagetto: [C: 2] "again a noop." [puppet] - https://gerrit.wikimedia.org/r/273444 (owner: Giuseppe Lavagetto) [11:59:33] Operations, ops-eqiad, DBA, Patch-For-Review: Decommission pc1001-1003 - https://phabricator.wikimedia.org/T124962#2075362 (jcrespo) a:jcrespo>None I have done all tasks related to puppet/salt/alerting and shut down the hosts. There are some pending tasks for cleaning up puppet, but they are...
[12:00:48] Operations, ops-eqiad, Patch-For-Review: Decommission pc1001-1003 - https://phabricator.wikimedia.org/T124962#2075367 (jcrespo) [12:04:54] (CR) Jcrespo: "Chase, can you give me some background about the user/groups managing?" [puppet] - https://gerrit.wikimedia.org/r/274076 (https://phabricator.wikimedia.org/T124962) (owner: Jcrespo) [12:05:47] (Abandoned) Filippo Giunchedi: restbase: override statsd metric prefix for restbase test cluster [puppet] - https://gerrit.wikimedia.org/r/238432 (https://phabricator.wikimedia.org/T112644) (owner: Filippo Giunchedi) [12:08:46] Operations: Some labvirt systems use qemu from "cloud archive" (which doesn't get security support) - https://phabricator.wikimedia.org/T127113#2075385 (MoritzMuehlenhoff) I asked them: The qemu package in the cloud archive is supposed to be covered by security support, but by a different team than the Ubuntu... [12:11:17] (PS1) BBlack: normalize_path: move to own include file [puppet] - https://gerrit.wikimedia.org/r/274083 (https://phabricator.wikimedia.org/T127387) [12:11:19] (PS1) BBlack: normalize_path: make it a C function [puppet] - https://gerrit.wikimedia.org/r/274084 (https://phabricator.wikimedia.org/T127387) [12:11:21] (PS1) BBlack: normalize_path: optional forward-slash, RB variant [puppet] - https://gerrit.wikimedia.org/r/274085 (https://phabricator.wikimedia.org/T127387) [12:11:23] (PS1) BBlack: normalize_path: stop on fragment marker [puppet] - https://gerrit.wikimedia.org/r/274086 (https://phabricator.wikimedia.org/T127387) [12:11:25] (PS1) BBlack: normalize_path: assert(url) [puppet] - https://gerrit.wikimedia.org/r/274087 (https://phabricator.wikimedia.org/T127387) [12:11:27] (PS1) BBlack: normalize_path: refactor control flow [puppet] - https://gerrit.wikimedia.org/r/274088 (https://phabricator.wikimedia.org/T127387) [12:11:29] (PS1) BBlack: normalize_path: fully parameterize the decoded set [puppet] -
https://gerrit.wikimedia.org/r/274089 (https://phabricator.wikimedia.org/T127387) [12:16:09] !log shutting down curium (decommissioned) [12:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:16:22] Operations, RESTBase, Services, Traffic, and 2 others: Split slash decoding from general percent normalization in Varnish VCL - https://phabricator.wikimedia.org/T127387#2075401 (BBlack) Having dug into this, the other problem with the existing patch is that it would still do a lot of unacceptable... [12:17:06] RECOVERY - puppet last run on restbase2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:17:12] Operations, Performance-Team, Performance: Update HHVM package to recent release - https://phabricator.wikimedia.org/T119637#2075404 (Joe) I built a 3.12.1 backport from the debian unstable git repository, and confirmed it does indeed absolve its basic functionalities. I am now in the process of buildi... [12:19:22] Operations, EventBus, hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#2075406 (mark) I assume we have no weaker boxes for this purpose? Looking at conf100x, they use no resources whatsoever... [12:25:18] (PS1) Yurik: Allow wikidata access to maps cluster [puppet] - https://gerrit.wikimedia.org/r/274094 [12:27:56] (PS1) Muehlenhoff: Remove DNS entries for berkelium/curium [dns] - https://gerrit.wikimedia.org/r/274095 (https://phabricator.wikimedia.org/T125962) [12:33:00] Operations, Patch-For-Review: Decom/reclaim berkelium/curium - https://phabricator.wikimedia.org/T125962#2075424 (MoritzMuehlenhoff) Reassigning to Chris for reclaiming or decommissioning. All steps from https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reclaim_or_Decommission up to the removal of the...
[12:33:14] Operations, Patch-For-Review: Decom/reclaim berkelium/curium - https://phabricator.wikimedia.org/T125962#2075425 (MoritzMuehlenhoff) a:MoritzMuehlenhoff>Cmjohnson [12:34:51] Operations: Some labvirt systems use qemu from "cloud archive" - https://phabricator.wikimedia.org/T127113#2075428 (MoritzMuehlenhoff) [12:37:09] Operations, Performance-Team, Performance: Update HHVM package to recent release - https://phabricator.wikimedia.org/T119637#2075431 (Nikerabbit) {T124163} is still present in 3.12.1 unless you change a config setting. [12:37:30] !log elastic2019.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [12:37:32] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101 [12:37:32] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [12:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:40:48] (PS1) ArielGlenn: jessie install settings for ms1001 [puppet] - https://gerrit.wikimedia.org/r/274096 [12:42:03] (PS1) Elukey: Remove mc1003 from the redis/memcached pools for maintenance. [puppet] - https://gerrit.wikimedia.org/r/274097 (https://phabricator.wikimedia.org/T123711) [12:42:20] (CR) ArielGlenn: [C: 2] jessie install settings for ms1001 [puppet] - https://gerrit.wikimedia.org/r/274096 (owner: ArielGlenn) [12:43:50] (PS2) Elukey: Remove mc1003 from the redis/memcached pools for maintenance. [puppet] - https://gerrit.wikimedia.org/r/274097 (https://phabricator.wikimedia.org/T123711) [12:44:20] (CR) Hashar: "If you feel this is ready, we can cherry pick it on the CI puppet master (integration-master).
It has slaves on Precise, Trusty and Jessie" [puppet] - https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) (owner: Dzahn) [12:46:13] (CR) Elukey: [C: 2] Remove mc1003 from the redis/memcached pools for maintenance. [puppet] - https://gerrit.wikimedia.org/r/274097 (https://phabricator.wikimedia.org/T123711) (owner: Elukey) [12:48:21] !log removed mc1003 from redis/memcached pools for maintenance [12:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:56:24] (PS1) Elukey: Removed mc1003 from the Lock Manager pools for maintenance. [mediawiki-config] - https://gerrit.wikimedia.org/r/274100 (https://phabricator.wikimedia.org/T123711) [12:58:15] _joe_: whenever you have time --^ [12:58:15] Operations, Ops-Access-Requests, Patch-For-Review: Access Request for mobrovac as ci-admin to mess with CI infrastructure - https://phabricator.wikimedia.org/T128175#2066176 (faidon) Asking everyone who wants to investigate test failures to go through an access request process does not scale and is on... [12:58:51] elukey: thx :) [12:59:51] paravoid: last mc standing with ubuntu [12:59:53] :D [13:00:06] :) [13:02:52] Operations, Wikimedia-Mailing-lists: Fwd: 7 Fd-advisorygroup moderator request(s) waiting - https://phabricator.wikimedia.org/T128406#2075495 (MoritzMuehlenhoff) Open>Resolved Password has been changed and sent to the two list admins. [13:05:01] (CR) Giuseppe Lavagetto: [C: 1] Removed mc1003 from the Lock Manager pools for maintenance. [mediawiki-config] - https://gerrit.wikimedia.org/r/274100 (https://phabricator.wikimedia.org/T123711) (owner: Elukey) [13:07:22] (CR) Elukey: [C: 2] Removed mc1003 from the Lock Manager pools for maintenance.
[mediawiki-config] - https://gerrit.wikimedia.org/r/274100 (https://phabricator.wikimedia.org/T123711) (owner: Elukey) [13:08:25] Operations, Discovery, hardware-requests: Refresh elastic10{01..16}.eqiad.wmnet servers - https://phabricator.wikimedia.org/T128000#2075511 (mark) As I specified in my recent private e-mail, using the refresh budget we can cover ~10 servers of the latest spec. More could be purchased using the remainin... [13:09:19] !log elukey@tin Synchronized wmf-config/filebackend-production.php: Remove mc1003 from the lock managers pool for maintenance (duration: 00m 40s) [13:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:20:43] (PS1) ArielGlenn: puppetize the install_console script [puppet] - https://gerrit.wikimedia.org/r/274101 [13:22:45] (CR) ArielGlenn: "Rob, you reference this script in https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Post-Install:_Get_puppet_running" [puppet] - https://gerrit.wikimedia.org/r/274101 (owner: ArielGlenn) [13:34:24] (CR) QChris: Avoid breaking full phabricator URLs (3 comments) [puppet] - https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) (owner: Thiemo Mättig (WMDE)) [13:36:54] Operations, MobileFrontend, Traffic, MW-1.27-release, and 5 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2075584 (BBlack) >>! In T124356#2075427, @BBlac...
[13:37:41] (PS1) ArielGlenn: re-enable rsyncs between dataset1001 and ms1001 [puppet] - https://gerrit.wikimedia.org/r/274102 [13:38:58] (CR) ArielGlenn: [C: 2] re-enable rsyncs between dataset1001 and ms1001 [puppet] - https://gerrit.wikimedia.org/r/274102 (owner: ArielGlenn) [13:43:48] Operations, Dumps-Generation, Patch-For-Review: Migrate dataset1001 and ms1001 to jessie - https://phabricator.wikimedia.org/T123724#2075603 (ArielGlenn) ms1001 now upgraded to jessie and back in service. [13:52:25] (PS3) Elukey: Fix the mcXXXX partman config to allow fully automated PXE OS installs. [puppet] - https://gerrit.wikimedia.org/r/274071 (https://phabricator.wikimedia.org/T123711) [13:53:43] !log elastic2020.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [13:53:45] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101 [13:53:45] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [13:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:54:33] Operations, EventBus, Services, service-deployment-requests: New Service Request - Change Propagation - https://phabricator.wikimedia.org/T128463#2075614 (mobrovac) [13:55:22] Operations, EventBus, Services, User-mobrovac, service-deployment-requests: New Service Request - Change Propagation - https://phabricator.wikimedia.org/T128463#2075630 (mobrovac) [13:55:58] (CR) Elukey: [C: 2] "Merging the change because it corrects the syntax, but it doesn't resolve the problem (installer asking for confirmation before partition)" [puppet] - https://gerrit.wikimedia.org/r/274071 (https://phabricator.wikimedia.org/T123711) (owner: Elukey) [13:56:33] Operations, EventBus, Services, User-mobrovac, service-deployment-requests: New Service Request - Change Propagation -
https://phabricator.wikimedia.org/T128463#2075614 (mobrovac) [13:56:41] Operations, Analytics, ArchCom-RfC, Discovery, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#2075637 (mobrovac) [13:57:15] (PS2) BBlack: normalize_path: move to own include file [puppet] - https://gerrit.wikimedia.org/r/274083 (https://phabricator.wikimedia.org/T127387) [13:57:17] (PS2) BBlack: normalize_path: optional forward-slash, RB variant [puppet] - https://gerrit.wikimedia.org/r/274085 (https://phabricator.wikimedia.org/T127387) [13:57:19] (PS2) BBlack: normalize_path: make it a C function [puppet] - https://gerrit.wikimedia.org/r/274084 (https://phabricator.wikimedia.org/T127387) [13:57:21] (PS2) BBlack: normalize_path: assert(url) [puppet] - https://gerrit.wikimedia.org/r/274087 (https://phabricator.wikimedia.org/T127387) [13:57:23] (PS2) BBlack: normalize_path: stop on fragment marker [puppet] - https://gerrit.wikimedia.org/r/274086 (https://phabricator.wikimedia.org/T127387) [13:57:25] (PS2) BBlack: normalize_path: fully parameterize the decoded set [puppet] - https://gerrit.wikimedia.org/r/274089 (https://phabricator.wikimedia.org/T127387) [13:57:27] (PS2) BBlack: normalize_path: refactor control flow [puppet] - https://gerrit.wikimedia.org/r/274088 (https://phabricator.wikimedia.org/T127387) [14:01:16] Operations, Dumps-Generation, Patch-For-Review: Migrate dataset1001 and ms1001 to jessie - https://phabricator.wikimedia.org/T123724#2075651 (ArielGlenn) Tomorrow's plan for dataset1001 looks much like today's for ms1001. Window set for 1 to 4 pm UTC. Record raid setup and what order disks have been...
[14:01:34] (PS1) Muehlenhoff: Drop one patch [debs/linux44] - https://gerrit.wikimedia.org/r/274103 [14:03:18] (PS1) Rush: labstore: create-dbusers niceness and logging modifications [puppet] - https://gerrit.wikimedia.org/r/274104 [14:08:23] (PS1) Filippo Giunchedi: add restbase101[0-5] cassandra instances [dns] - https://gerrit.wikimedia.org/r/274105 (https://phabricator.wikimedia.org/T128107) [14:08:39] (PS2) Rush: labstore: create-dbusers niceness and logging modifications [puppet] - https://gerrit.wikimedia.org/r/274104 [14:08:47] (CR) Rush: [C: 2 V: 2] labstore: create-dbusers niceness and logging modifications [puppet] - https://gerrit.wikimedia.org/r/274104 (owner: Rush) [14:10:15] (CR) Muehlenhoff: [C: 2 V: 2] Drop one patch [debs/linux44] - https://gerrit.wikimedia.org/r/274103 (owner: Muehlenhoff) [14:10:49] (PS1) Elukey: Added a partman option to mc.cfg to allow fully automated partitioning. [puppet] - https://gerrit.wikimedia.org/r/274106 (https://phabricator.wikimedia.org/T123711) [14:11:05] (PS2) Elukey: Added a partman option to mc.cfg to allow fully automated partitioning. [puppet] - https://gerrit.wikimedia.org/r/274106 (https://phabricator.wikimedia.org/T123711) [14:13:57] (PS3) Elukey: Added a partman option to mc.cfg to allow fully automated partitioning. [puppet] - https://gerrit.wikimedia.org/r/274106 (https://phabricator.wikimedia.org/T123711) [14:15:02] (CR) Elukey: [C: 2] "Tested manually in carbon, it finally works!"
[puppet] - https://gerrit.wikimedia.org/r/274106 (https://phabricator.wikimedia.org/T123711) (owner: Elukey) [14:18:33] (PS3) BBlack: normalize_path: move to own include file [puppet] - https://gerrit.wikimedia.org/r/274083 (https://phabricator.wikimedia.org/T127387) [14:18:41] (CR) BBlack: [C: 2 V: 2] normalize_path: move to own include file [puppet] - https://gerrit.wikimedia.org/r/274083 (https://phabricator.wikimedia.org/T127387) (owner: BBlack) [14:18:52] (PS3) BBlack: normalize_path: make it a C function [puppet] - https://gerrit.wikimedia.org/r/274084 (https://phabricator.wikimedia.org/T127387) [14:18:59] (CR) BBlack: [C: 2 V: 2] normalize_path: make it a C function [puppet] - https://gerrit.wikimedia.org/r/274084 (https://phabricator.wikimedia.org/T127387) (owner: BBlack) [14:19:31] (PS3) BBlack: normalize_path: optional forward-slash, RB variant [puppet] - https://gerrit.wikimedia.org/r/274085 (https://phabricator.wikimedia.org/T127387) [14:19:47] (CR) BBlack: [C: 2 V: 2] normalize_path: optional forward-slash, RB variant [puppet] - https://gerrit.wikimedia.org/r/274085 (https://phabricator.wikimedia.org/T127387) (owner: BBlack) [14:20:02] (CR) Filippo Giunchedi: [C: 2 V: 2] add restbase101[0-5] cassandra instances [dns] - https://gerrit.wikimedia.org/r/274105 (https://phabricator.wikimedia.org/T128107) (owner: Filippo Giunchedi) [14:23:39] Operations, RESTBase, Services, Traffic, and 2 others: Split slash decoding from general percent normalization in Varnish VCL - https://phabricator.wikimedia.org/T127387#2075690 (BBlack) I've merged the first 3 patches, which essentially gets us to "RB gets the same path decoding as MW, except for...
[14:24:04] PROBLEM - Host mc1003 is DOWN: PING CRITICAL - Packet loss = 100% [14:24:17] (PS1) Filippo Giunchedi: hieradata: add restbase101[0-5] instances to cassandra seeds [puppet] - https://gerrit.wikimedia.org/r/274107 (https://phabricator.wikimedia.org/T128107) [14:25:31] ^ elukey: icinga downtime for mc1003 expired? [14:25:42] ahhhhh yesss [14:25:52] I just rebooted it to check if it was ok... [14:25:54] grrrr [14:25:57] sorry! [14:26:27] * elukey also blames partman configs [14:26:31] (CR) Filippo Giunchedi: [C: 2 V: 2] hieradata: add restbase101[0-5] instances to cassandra seeds [puppet] - https://gerrit.wikimedia.org/r/274107 (https://phabricator.wikimedia.org/T128107) (owner: Filippo Giunchedi) [14:27:44] RECOVERY - Host mc1003 is UP: PING WARNING - Packet loss = 93%, RTA = 0.55 ms [14:31:54] PROBLEM - Check size of conntrack table on mc1003 is CRITICAL: Timeout while attempting connection [14:32:03] PROBLEM - salt-minion processes on mc1003 is CRITICAL: Timeout while attempting connection [14:32:05] PROBLEM - DPKG on mc1003 is CRITICAL: Timeout while attempting connection [14:32:13] PROBLEM - configured eth on mc1003 is CRITICAL: Timeout while attempting connection [14:32:14] PROBLEM - Disk space on mc1003 is CRITICAL: Timeout while attempting connection [14:32:23] PROBLEM - Memcached on mc1003 is CRITICAL: Connection timed out [14:32:43] PROBLEM - dhclient process on mc1003 is CRITICAL: Connection refused by host [14:33:05] PROBLEM - puppet last run on mc1003 is CRITICAL: Connection refused by host [14:33:15] PROBLEM - RAID on mc1003 is CRITICAL: Connection refused by host [14:33:42] ah snap [14:33:43] RECOVERY - Check size of conntrack table on mc1003 is OK: OK: nf_conntrack is 0 % full [14:33:44] RECOVERY - salt-minion processes on mc1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:33:54] RECOVERY - DPKG on mc1003 is OK: All packages OK [14:33:55] RECOVERY - configured eth on mc1003 is OK: OK
- interfaces up [14:34:03] !log elastic2021.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [14:34:04] RECOVERY - Disk space on mc1003 is OK: DISK OK [14:34:05] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101 [14:34:05] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [14:34:05] RECOVERY - Memcached on mc1003 is OK: TCP OK - 0.001 second response time on port 11211 [14:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:34:34] RECOVERY - dhclient process on mc1003 is OK: PROCS OK: 0 processes with command name dhclient [14:34:54] moritzm: please don't kill me :) --^ [14:35:03] RECOVERY - puppet last run on mc1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:35:14] RECOVERY - RAID on mc1003 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [14:36:15] (PS1) Muehlenhoff: Don't apply aufs patches [debs/linux44] - https://gerrit.wikimedia.org/r/274110 [14:36:36] I didn't even notice :-) [14:41:01] (PS2) Muehlenhoff: Don't apply aufs patches [debs/linux44] - https://gerrit.wikimedia.org/r/274110 [14:44:36] (CR) Muehlenhoff: [C: 2 V: 2] Don't apply aufs patches [debs/linux44] - https://gerrit.wikimedia.org/r/274110 (owner: Muehlenhoff) [14:44:49] (PS1) Elukey: Add mc1003 back to the redis/memcached pools after maintenance. [puppet] - https://gerrit.wikimedia.org/r/274111 (https://phabricator.wikimedia.org/T123711) [14:46:10] (CR) Elukey: [C: 2] Add mc1003 back to the redis/memcached pools after maintenance.
[puppet] - https://gerrit.wikimedia.org/r/274111 (https://phabricator.wikimedia.org/T123711) (owner: Elukey) [14:46:47] Operations, Analytics, Traffic: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2075747 (Ottomata) Sounds good to me! Re: python varnish C types: cool! I didn’t know there was an existing lib out there for that. Ori and I wrote some python varnish b... [14:47:39] !log mc1003.eqiad added back to the redis/memcached pool after maintenance. [14:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:51:15] PROBLEM - Disk space on labstore2001 is CRITICAL: DISK CRITICAL - free space: /dev 0 MB (0% inode=99%) [14:51:23] (PS1) Elukey: Revert "Removed mc1003 from the Lock Manager pools for maintenance." to add the host back to the pool after maintenance. [mediawiki-config] - https://gerrit.wikimedia.org/r/274113 [14:51:44] (PS2) Elukey: Revert "Removed mc1003 from the Lock Manager pools for maintenance." to add the host back to the pool. [mediawiki-config] - https://gerrit.wikimedia.org/r/274113 [14:53:03] (CR) Elukey: [C: 2] Revert "Removed mc1003 from the Lock Manager pools for maintenance." to add the host back to the pool. [mediawiki-config] - https://gerrit.wikimedia.org/r/274113 (owner: Elukey) [14:53:13] RECOVERY - Disk space on labstore2001 is OK: DISK OK [14:55:08] !log elukey@tin Synchronized wmf-config/filebackend-production.php: Add mc1003 to the lock managers pool after maintenance (duration: 00m 40s) [14:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:55:27] _joe_ ---^ all mc hosts completed :) [14:55:36] <_joe_> cool [14:55:38] <_joe_> :) [14:59:37] also https://gerrit.wikimedia.org/r/#/c/266514/6/wmf-config/ProductionServices.php +1 [14:59:40] very nice [15:00:09] (PS2) Elukey: Add kafka1012.eqiad.wmnet back to the media-wiki config.
[mediawiki-config] - https://gerrit.wikimedia.org/r/273488 (https://phabricator.wikimedia.org/T125084) [15:12:27] (CR) Thiemo Mättig (WMDE): Avoid breaking full phabricator URLs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) (owner: Thiemo Mättig (WMDE)) [15:15:12] Operations, Research-and-Data, The-Wikipedia-Library, Traffic, and 6 others: Set an explicit "Origin When Cross-Origin" referer policy via the meta referrer tag - https://phabricator.wikimedia.org/T87276#2075814 (Fuzheado) [15:26:11] (PS7) Giuseppe Lavagetto: Configure redis LockManager in both DCs, use the master everywhere. [mediawiki-config] - https://gerrit.wikimedia.org/r/266514 [15:26:13] (PS9) Giuseppe Lavagetto: Add references to wmfServices for Cirrusearch. [mediawiki-config] - https://gerrit.wikimedia.org/r/266512 (https://phabricator.wikimedia.org/T114273) [15:26:15] (PS9) Giuseppe Lavagetto: Use wmfMasterDatacenter for picking the master redis config [mediawiki-config] - https://gerrit.wikimedia.org/r/266513 (https://phabricator.wikimedia.org/T114273) [15:35:48] (PS1) Muehlenhoff: Add nschaaf to researchers [puppet] - https://gerrit.wikimedia.org/r/274118 (https://phabricator.wikimedia.org/T128381) [15:38:01] (PS1) ArielGlenn: dumps: create dump directory if it doesn't exist, before stashing settings [dumps] (ariel) - https://gerrit.wikimedia.org/r/274119 [15:38:14] (CR) ArielGlenn: [C: 2] dumps: create dump directory if it doesn't exist, before stashing settings [dumps] (ariel) - https://gerrit.wikimedia.org/r/274119 (owner: ArielGlenn) [15:46:59] (CR) Ottomata: "Hm, I don't see the changes. I still see $working_path, and the rsync is still in init.pp."
[puppet] - https://gerrit.wikimedia.org/r/273487 (https://phabricator.wikimedia.org/T127327) (owner: Mforns) [15:48:45] Operations, Services, procurement: Hardware request for SCA and SCB in codfw - https://phabricator.wikimedia.org/T128475#2075945 (mobrovac) [15:49:15] (CR) QChris: Avoid breaking full phabricator URLs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) (owner: Thiemo Mättig (WMDE)) [15:54:02] (PS2) Ottomata: Camus: specify latest schema for ApiAction [puppet] - https://gerrit.wikimedia.org/r/273558 (https://phabricator.wikimedia.org/T108618) (owner: BryanDavis) [15:54:10] (CR) Ottomata: [C: 2 V: 2] Camus: specify latest schema for ApiAction [puppet] - https://gerrit.wikimedia.org/r/273558 (https://phabricator.wikimedia.org/T108618) (owner: BryanDavis) [15:55:09] Operations, Ops-Access-Requests, Patch-For-Review: Access Request for mobrovac as ci-admin to mess with CI infrastructure - https://phabricator.wikimedia.org/T128175#2075971 (JanZerebecki) Yes in principle these can be unprivileged. In practice when they run gate-and-submit (pre merge) or any post mer... [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160301T1600). [16:00:04] elukey Luke081515 _joe_ Kelson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:02:03] I can SWAT today. folks with patches around? [16:05:45] <_joe_> thcipriani: aye [16:06:01] <_joe_> I wasn't the first so I was waiting for others to come around [16:06:16] _joe_: hiya [16:06:32] <_joe_> thcipriani: if no one's around, I am willing to go [16:06:50] I had one question about this https://gerrit.wikimedia.org/r/#/c/266512/ : will this work on beta?
[16:07:32] <_joe_> thcipriani: yes it should
[16:07:46] <_joe_> wmf-config/LabsServices.php has the definition
[16:08:48] <_joe_> thcipriani: https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/LabsServices.php#L18
[16:09:07] aah, yup. My local copy of the repo was out of date.
[16:09:29] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/266512 (https://phabricator.wikimedia.org/T114273) (owner: Giuseppe Lavagetto)
[16:10:23] (Merged) jenkins-bot: Add references to wmfServices for Cirrusearch. [mediawiki-config] - https://gerrit.wikimedia.org/r/266512 (https://phabricator.wikimedia.org/T114273) (owner: Giuseppe Lavagetto)
[16:11:35] <_joe_> are you already syncing it?
[16:11:53] (CR) Ottomata: "The mysql data size for this instance is currently around 40G. We have a task to enable pruning of old data, but we haven't actually trie" [puppet] - https://gerrit.wikimedia.org/r/273312 (https://phabricator.wikimedia.org/T127991) (owner: Ottomata)
[16:12:06] <_joe_> I have a doubt, thcipriani let me test one thing
[16:12:18] _joe_: ack. Holding.
[16:12:25] <_joe_> is it on tin?
[16:12:38] it is pulled down and ready to sync
[16:12:47] on tin, yes
[16:13:29] (PS1) ArielGlenn: dumps: fix cron job to use stagefile parameter for createdirs job [puppet] - https://gerrit.wikimedia.org/r/274124
[16:14:48] <_joe_> thcipriani: gtg I think
[16:15:01] _joe_: okie doke
[16:16:05] Operations, DBA, Patch-For-Review: Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on icinga - https://phabricator.wikimedia.org/T114752#2076046 (jcrespo) @Aaron, I have patched pt-heartbeat to create and update automatically a "shard" column: ``` mysql> SELECT * FROM heartb...
[16:16:56] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to researchers for nschaaf - https://phabricator.wikimedia.org/T128381#2072331 (Milimetric) I believe @schana needs to be added to **statistics-users** to be able to login to stat1003, and **researchers** to be able to access the pas...
[16:18:31] !log thcipriani@tin Synchronized wmf-config/CirrusSearch-production.php: SWAT: Add references to wmfServices for Cirrusearch [[gerrit:266512]] (duration: 00m 56s)
[16:18:33] ^ _joe_ check please
[16:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:18:40] <_joe_> thcipriani: ok
[16:18:43] (CR) Milimetric: "I believe this is a duplicate of https://gerrit.wikimedia.org/r/#/c/231144/" [puppet] - https://gerrit.wikimedia.org/r/270151 (owner: Tim Landscheidt)
[16:19:49] <_joe_> thcipriani: search is not broken so...
[16:20:01] that's good :D
[16:20:34] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/266513 (https://phabricator.wikimedia.org/T114273) (owner: Giuseppe Lavagetto)
[16:21:00] (Merged) jenkins-bot: Use wmfMasterDatacenter for picking the master redis config [mediawiki-config] - https://gerrit.wikimedia.org/r/266513 (https://phabricator.wikimedia.org/T114273) (owner: Giuseppe Lavagetto)
[16:21:54] Operations, Puppet, Patch-For-Review: Import vs autoload: the puppet parser is a bad joke that stopped being funny years ago. - https://phabricator.wikimedia.org/T119042#1816403 (Ottomata) We should just make a new special `modulepath` for roles!!!!!!! ``` modulepath = /etc/puppet/private/modules:/etc...
[16:22:40] Operations, MediaWiki-Authentication-and-authorization, Traffic, MW-1.27-release-notes, and 2 others: Logging out of a wiki leaves an XXwikiSession= Cookie behind - https://phabricator.wikimedia.org/T127436#2076059 (Anomie) Open>Resolved
[16:22:41] <_joe_> ottomata: sadly that wouldn't change a thing
[16:24:02] <_joe_> thcipriani: let me know when it's synced
[16:24:10] _joe_: wouldn't it just be a regular module then?
[16:24:12] _joe_: yup, syncing now.
[16:24:28] !log thcipriani@tin Synchronized wmf-config/redis.php: SWAT: Use wmfMasterDatacenter for picking the master redis config [[gerrit:266513]] (duration: 00m 39s)
[16:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:24:33] ^ _joe_ sync'd
[16:24:37] maybe role keyword would need to be adapted to know how to use that, but then autoload stuff should work like normal?
[16:24:48] <_joe_> ottomata: sorry, later :P
[16:24:55] <_joe_> I'm in the middle of SWAT here
[16:24:59] and i could do role/eventbus/manifests/init.pp, instead of modules/role/manifests/eventbus/eventbus.pp
[16:25:01] np!
[16:25:06] you pinged mE :P
[16:25:17] swat 'em good!
[16:25:25] (CR) Milimetric: "I'm thinking about this more and I realize puppet doesn't even run on limn1 anymore. So that's a dead instance anyway, and removing this " [puppet] - https://gerrit.wikimedia.org/r/231144 (owner: Faidon Liambotis)
[16:25:41] (CR) Milimetric: [C: 1] (WIP) Kill misc::limn & limn [puppet] - https://gerrit.wikimedia.org/r/231144 (owner: Faidon Liambotis)
[16:26:43] (CR) ArielGlenn: [C: 2] dumps: fix cron job to use stagefile parameter for createdirs job [puppet] - https://gerrit.wikimedia.org/r/274124 (owner: ArielGlenn)
[16:27:12] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/266514 (owner: Giuseppe Lavagetto)
[16:27:15] <_joe_> thcipriani: I see no errors related to this in kibana, so it must be good!
[16:27:47] _joe_: yarp.
[16:27:58] (Merged) jenkins-bot: Configure redis LockManager in both DCs, use the master everywhere. [mediawiki-config] - https://gerrit.wikimedia.org/r/266514 (owner: Giuseppe Lavagetto)
[16:28:50] (PS1) Muehlenhoff: Imported Upstream version 1.0.2g [debs/openssl] - https://gerrit.wikimedia.org/r/274127
[16:28:52] (PS1) Muehlenhoff: * Update to 1.0.2g * Drop handle-ssl-shutdown-while-in-init-more-appropriately-v2.patch (part of new upstream release) [debs/openssl] - https://gerrit.wikimedia.org/r/274128
[16:30:16] Operations, Puppet, Patch-For-Review: Import vs autoload: the puppet parser is a bad joke that stopped being funny years ago. - https://phabricator.wikimedia.org/T119042#2076070 (Ottomata) I suppose this would conflict with a lot of currently named role classes, since there are role classes that have s...
[16:30:37] (PS6) Mforns: Replace limn::data::generate by reportupdater [puppet] - https://gerrit.wikimedia.org/r/273487 (https://phabricator.wikimedia.org/T127327)
[16:30:50] !log thcipriani@tin Synchronized wmf-config/ProductionServices.php: SWAT: Configure redis LockManager in both DCs, use the master everywhere. PART I [[gerrit:266514]] (duration: 00m 40s)
[16:30:52] (PS7) Mforns: Replace limn::data::generate by reportupdater [puppet] - https://gerrit.wikimedia.org/r/273487 (https://phabricator.wikimedia.org/T127327)
[16:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:31:46] thcipriani: sorry I missed the ping for my patch, I'll reschedule it for another time!
[16:32:01] elukey: I can still get yours in if you're around now
[16:32:02] <_joe_> elukey: like now?
[16:32:07] <_joe_> :P
[16:32:13] <_joe_> thcipriani: we're done?
[16:32:15] I am in standupppp
[16:32:17] !log thcipriani@tin Synchronized wmf-config/filebackend-production.php: SWAT: Configure redis LockManager in both DCs, use the master everywhere. PART II [[gerrit:266514]] (duration: 00m 46s)
[16:32:20] thcipriani: Sorry, I'm a bit late. :-/ Can we deploy my two patches now or in a few minutes?
[16:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:32:22] _joe_: ^ now done
[16:32:22] <_joe_> eheh ok :P
[16:33:04] !log Bunch of Jenkins job got stall because I have killed threads in Jenkins to unblock integration-slave-trusty-1003 :-( Jenkins / Zuul is catching up.
[16:33:05] Luke081515: sure
[16:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:33:13] thanks
[16:33:16] thcipriani: if you have time in ~20 mins we can work on it, otherwise I'll reschedule :)
[16:33:58] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/273776 (https://phabricator.wikimedia.org/T123109) (owner: Luke081515)
[16:34:47] elukey: ping me when you've got time, nbd :)
[16:35:42] (Merged) jenkins-bot: Correct one Domain at $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/273776 (https://phabricator.wikimedia.org/T123109) (owner: Luke081515)
[16:37:13] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Correct one Domain at $wgCopyUploadsDomains [[gerrit:273776]] (duration: 00m 40s)
[16:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:37:23] (PS1) Jforrester: Enable VisualEditor Single Edit Tab on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/274129
[16:37:25] (PS1) Jforrester: Enable VisualEditor Single Edit Tab on the Polish Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/274130 (https://phabricator.wikimedia.org/T128477)
[16:37:26] ^ Luke081515 check please
[16:37:27] (CR) Mforns: "Sorry yesterday night I was totally asleep." [puppet] - https://gerrit.wikimedia.org/r/273487 (https://phabricator.wikimedia.org/T127327) (owner: Mforns)
[16:37:29] (PS1) Jforrester: Enable VisualEditor Single Edit Tab on the English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/274131 (https://phabricator.wikimedia.org/T128478)
[16:38:04] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/273828 (https://phabricator.wikimedia.org/T128205) (owner: Luke081515)
[16:38:07] wait a moment
[16:38:12] I will check
[16:38:14] okie doke
[16:38:17] thcipriani: jenkins is a little bit in troubles :/
[16:38:24] (CR) jenkins-bot: [V: -1] Enable VisualEditor Single Edit Tab on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/274129 (owner: Jforrester)
[16:38:46] (CR) jenkins-bot: [V: -1] Enable VisualEditor Single Edit Tab on the Polish Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/274130 (https://phabricator.wikimedia.org/T128477) (owner: Jforrester)
[16:38:46] hashar: I saw that zuul doesn't look too happy :(
[16:38:58] (CR) jenkins-bot: [V: -1] Enable VisualEditor Single Edit Tab on the English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/274131 (https://phabricator.wikimedia.org/T128478) (owner: Jforrester)
[16:39:04] I am just going to kill Jenkins
[16:39:08] !log restarting Jenkins
[16:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:39:29] (Merged) jenkins-bot: Enable rollbacker and suppressredirect group at cewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/273828 (https://phabricator.wikimedia.org/T128205) (owner: Luke081515)
[16:39:44] (PS2) Jforrester: Enable VisualEditor Single Edit Tab on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/274129
[16:40:23] thcipriani: The first one works, I will close the task now
[16:40:37] Luke081515: cool, thanks for checking!
[16:40:43] !log elastic2023.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
[16:40:44] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101
[16:40:45] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697
[16:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:41:30] (PS1) ArielGlenn: dumps: disable dump cron on snapshots, prep for dataset1001 upgrade [puppet] - https://gerrit.wikimedia.org/r/274132
[16:41:37] thcipriani: I have killed Jenkins :-}
[16:42:37] hashar: weee!
[16:43:17] thcipriani: and Zuul is retriggering jobs
[16:43:36] but the patch is already merged ;)
[16:43:38] (CR) ArielGlenn: [C: 2] dumps: disable dump cron on snapshots, prep for dataset1001 upgrade [puppet] - https://gerrit.wikimedia.org/r/274132 (owner: ArielGlenn)
[16:43:43] hashar: SWAT is unaffected so far
[16:43:57] everything's been merging. All mw-config stuff
[16:44:00] (PS2) Jforrester: Enable VisualEditor Single Edit Tab on the Polish Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/274130 (https://phabricator.wikimedia.org/T128477)
[16:44:04] thcipriani: \O/
[16:44:06] (PS2) Jforrester: Enable VisualEditor Single Edit Tab on the English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/274131 (https://phabricator.wikimedia.org/T128478)
[16:44:09] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable rollbacker and suppressredirect group at cewiki [[gerrit:273828]] (duration: 00m 41s)
[16:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:44:14] ^ Luke081515 check please
[16:44:17] (CR) Jforrester: [C: -1] "Not without notice. :-)" [mediawiki-config] - https://gerrit.wikimedia.org/r/274131 (https://phabricator.wikimedia.org/T128478) (owner: Jforrester)
[16:44:26] (CR) Jforrester: [C: -1] "Not without notice. :-)" [mediawiki-config] - https://gerrit.wikimedia.org/r/274130 (https://phabricator.wikimedia.org/T128477) (owner: Jforrester)
[16:44:49] thcipriani: works, thanks for SWATing ;)
[16:45:02] task closed
[16:45:03] Luke081515: awesome. thanks for the patches!
[16:46:21] (PS1) Filippo Giunchedi: cassandra: add restbase101[0-5] instances [puppet] - https://gerrit.wikimedia.org/r/274133 (https://phabricator.wikimedia.org/T128107)
[16:46:50] (CR) Giuseppe Lavagetto: "While technically correct, I'd like eithr bblack or ema to + this change before merging it" [puppet] - https://gerrit.wikimedia.org/r/274094 (owner: Yurik)
[16:47:28] (CR) Filippo Giunchedi: [C: -1] "to be merged once restbase1009-b bootstrap has finished" [puppet] - https://gerrit.wikimedia.org/r/274133 (https://phabricator.wikimedia.org/T128107) (owner: Filippo Giunchedi)
[16:47:32] (Abandoned) Tim Landscheidt: Remove unused type misc::limn::instance [puppet] - https://gerrit.wikimedia.org/r/270151 (owner: Tim Landscheidt)
[16:48:35] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:49:08] (PS1) Jcrespo: [WIP] Adding custom heartbeat script with "shard" additional column [puppet/mariadb] - https://gerrit.wikimedia.org/r/274134 (https://phabricator.wikimedia.org/T114752)
[16:49:28] (CR) BBlack: [C: 1] Allow wikidata access to maps cluster [puppet] - https://gerrit.wikimedia.org/r/274094 (owner: Yurik)
[16:49:30] (CR) jenkins-bot: [V: -1] [WIP] Adding custom heartbeat script with "shard" additional column [puppet/mariadb] - https://gerrit.wikimedia.org/r/274134 (https://phabricator.wikimedia.org/T114752) (owner: Jcrespo)
[16:50:14] (PS1) Ema: Preiliminary port to new VSL API [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/274135 (https://phabricator.wikimedia.org/T124278)
[16:52:06] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 65.22% of data above the critical threshold [5000000.0]
[16:52:21] (CR) Ema: [C: -1] "We still need to take care of the following:" [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/274135 (https://phabricator.wikimedia.org/T124278) (owner: Ema)
[16:53:31] (PS2) Jcrespo: [WIP] Adding custom heartbeat script with "shard" additional column [puppet/mariadb] - https://gerrit.wikimedia.org/r/274134 (https://phabricator.wikimedia.org/T114752)
[16:53:50] (CR) jenkins-bot: [V: -1] [WIP] Adding custom heartbeat script with "shard" additional column [puppet/mariadb] - https://gerrit.wikimedia.org/r/274134 (https://phabricator.wikimedia.org/T114752) (owner: Jcrespo)
[16:54:30] (PS3) Jcrespo: [WIP] Adding custom heartbeat script with "shard" additional column [puppet/mariadb] - https://gerrit.wikimedia.org/r/274134 (https://phabricator.wikimedia.org/T114752)
[16:54:52] (CR) jenkins-bot: [V: -1] [WIP] Adding custom heartbeat script with "shard" additional column [puppet/mariadb] - https://gerrit.wikimedia.org/r/274134 (https://phabricator.wikimedia.org/T114752) (owner: Jcrespo)
[16:54:52] Operations: upgrade 15+4 swift servers from precise to trusty - https://phabricator.wikimedia.org/T125024#2076187 (fgiunchedi) `ms-be1004` to `ms-be1008` upgraded
[16:54:55] (PS4) Jcrespo: [WIP] Adding custom heartbeat script with "shard" additional column [puppet/mariadb] - https://gerrit.wikimedia.org/r/274134 (https://phabricator.wikimedia.org/T114752)
[16:55:24] (CR) jenkins-bot: [V: -1] [WIP] Adding custom heartbeat script with "shard" additional column [puppet/mariadb] - https://gerrit.wikimedia.org/r/274134 (https://phabricator.wikimedia.org/T114752) (owner: Jcrespo)
[16:57:26] Operations, Project-Admins, DevRel-March-2016: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#2076203 (Aklapper)
[16:57:55] (PS5) Jcrespo: [WIP] Adding custom heartbeat script with "shard" additional column [puppet/mariadb] - https://gerrit.wikimedia.org/r/274134 (https://phabricator.wikimedia.org/T114752)
[16:58:05] Operations, Analytics, Traffic, Patch-For-Review: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2076213 (ema) So the trivial part is done, see https://gerrit.wikimedia.org/r/274135. Now we need to figure out the tricky one. :) Essentially, the...
[16:58:13] (CR) jenkins-bot: [V: -1] [WIP] Adding custom heartbeat script with "shard" additional column [puppet/mariadb] - https://gerrit.wikimedia.org/r/274134 (https://phabricator.wikimedia.org/T114752) (owner: Jcrespo)
[17:00:04] _joe_ jynus: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160301T1700). Please do the needful.
[17:00:04] yurik: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[17:00:12] yep
[17:00:29] <_joe_> yurik: hey
[17:00:47] seems quite uncontroversial
[17:00:50] <_joe_> so, I'm going to merge the patch in a couple of minutes, let me finish something
[17:01:08] are you planning to use it on wikidata?
[17:01:41] _joe_, I can manage it if you are busy
[17:01:43] jynus, there has been some discussion there
[17:01:49] _joe_, thx
[17:01:58] thcipriani I am ready if you have time!
[17:02:17] (PS3) Elukey: Add kafka1012.eqiad.wmnet back to the media-wiki config. [mediawiki-config] - https://gerrit.wikimedia.org/r/273488 (https://phabricator.wikimedia.org/T125084)
[17:02:24] jynus, wikidata ppl are considering to use it with some of the tools they are building on top of wikidata domain
[17:02:29] just rebased --^
[17:02:39] like query.wikidata.org, etc
[17:03:14] _joe_: can I sync out elukey 's change before puppetswat?
[17:03:19] (PS5) Tim Landscheidt: shinken: Only regenerate configuration when there are changes [puppet] - https://gerrit.wikimedia.org/r/267423
[17:03:22] <_joe_> thcipriani: go on
[17:03:24] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[17:03:34] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/273488 (https://phabricator.wikimedia.org/T125084) (owner: Elukey)
[17:03:39] <_joe_> look who just recovered btw :P
[17:04:05] (Merged) jenkins-bot: Add kafka1012.eqiad.wmnet back to the media-wiki config. [mediawiki-config] - https://gerrit.wikimedia.org/r/273488 (https://phabricator.wikimedia.org/T125084) (owner: Elukey)
[17:04:10] <_joe_> yurik: we're waiting for SWAT to finish
[17:04:16] no rush
[17:04:56] _joe_: yeah those lags are a bit noisy, there are a lot of improvements in kafka 0.9 that should improve the issue
[17:05:45] thcipriani: thanks!
[17:05:48] !log thcipriani@tin Synchronized wmf-config/ProductionServices.php: SWAT: Add kafka1012.eqiad.wmnet back to the media-wiki config [[gerrit:273488]] (duration: 00m 39s)
[17:05:50] ^ elukey sync'd
[17:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:05:56] \o/
[17:06:17] <_joe_> elukey: check connections to kafka1012 from the appservers, maybe?
[17:06:18] _joe_: i'll have 2 patches if that's ok for puppetswat
[17:06:26] <_joe_> mobrovac: let's see them
[17:06:27] _joe_ yes I was about to do it
[17:06:33] also checking https://logstash.wikimedia.org/#/dashboard/elasticsearch/hhvm just in case
[17:06:36] <_joe_> mobrovac: in general, please schedule them a bit earlier :)
[17:06:39] _joe_: don't you trust me? :D
[17:06:50] _joe_: i know, that's why i'm asking, but forgot puppetswat's today
[17:06:53] <_joe_> elukey: I'll wait for your ACK to puppetswat
[17:06:59] <_joe_> mobrovac: go on :)
[17:07:09] (PS6) Jcrespo: [WIP] Adding custom heartbeat script with "shard" additional column [puppet/mariadb] - https://gerrit.wikimedia.org/r/274134 (https://phabricator.wikimedia.org/T114752)
[17:07:38] (PS2) Giuseppe Lavagetto: Allow wikidata access to maps cluster [puppet] - https://gerrit.wikimedia.org/r/274094 (owner: Yurik)
[17:07:49] thx
[17:08:07] (CR) Mobrovac: [C: 1] Add purged_cache_control config variable [puppet] - https://gerrit.wikimedia.org/r/273974 (owner: GWicke)
[17:08:12] (CR) Giuseppe Lavagetto: [C: 2 V: 2] "PuppetSWAT" [puppet] - https://gerrit.wikimedia.org/r/274094 (owner: Yurik)
[17:08:27] <_joe_> elukey: can I go?
[17:09:13] <_joe_> mobrovac: which patches, btw?
[17:09:38] editing the deployment page as we speak
[17:09:40] _joe_ it looks good for the moment, only TIME_WAITs for kafka1012
[17:09:50] _joe_: https://gerrit.wikimedia.org/r/#/c/273974/
[17:09:52] (PS3) Gehel: Expose elasticsearch through HTTP [puppet] - https://gerrit.wikimedia.org/r/273254 (https://phabricator.wikimedia.org/T124444)
[17:10:02] (I am on a mw host of course)
[17:10:11] _joe_: for the second one, i'll create a patch now
[17:11:07] (CR) jenkins-bot: [V: -1] Expose elasticsearch through HTTP [puppet] - https://gerrit.wikimedia.org/r/273254 (https://phabricator.wikimedia.org/T124444) (owner: Gehel)
[17:11:47] (PS2) Giuseppe Lavagetto: restbase: add purged_cache_control config variable [puppet] - https://gerrit.wikimedia.org/r/273974 (owner: GWicke)
[17:12:45] _joe_: scratch that, i'll have only that one, the second one needs an rb deploy, which i can't do now
[17:12:51] <_joe_> mobrovac: I am a bit unsure: did you test this in labs?
[17:13:18] <_joe_> yurik: done
[17:13:19] _joe_: this is a no-op everywhere, because the code is not yet deployed for this
[17:13:29] _joe_, awesome, thanks!
[17:13:31] i.e. not present in the deploy repo
[17:13:54] <_joe_> mobrovac: oh ok
[17:14:06] (PS3) Giuseppe Lavagetto: restbase: add purged_cache_control config variable [puppet] - https://gerrit.wikimedia.org/r/273974 (owner: GWicke)
[17:14:40] (CR) Giuseppe Lavagetto: [C: 2 V: 2] "this is a no-op everywhere, because the code is not yet deployed for this" [puppet] - https://gerrit.wikimedia.org/r/273974 (owner: GWicke)
[17:14:55] <_joe_> mobrovac: so there is no need to restart restbase everywhere, right?
[17:15:06] (PS18) Ema: Maps VCL forward-port to Varnish 4 [puppet] - https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279)
[17:15:19] _joe_: not really, but i'll do it anyhow so that our minds are at easy
[17:15:25] s/easy/ease/
[17:15:50] <_joe_> mobrovac: I'm forcing a puppet run on the eqiad part of the cluster then
[17:16:01] cool
[17:16:13] (CR) jenkins-bot: [V: -1] Maps VCL forward-port to Varnish 4 [puppet] - https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) (owner: Ema)
[17:17:09] I suppose I will have to join next week's puppet swat, as joe doesn't let me do anything...
[17:17:39] <_joe_> jynus: oh thursday is all yours :P
[17:17:43] ok
[17:17:46] (PS19) Ema: Maps VCL forward-port to Varnish 4 [puppet] - https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279)
[17:18:11] <_joe_> mobrovac: {{done}}
[17:18:19] kk, restarting
[17:18:40] !log restbase rolling-restart restbase for https://gerrit.wikimedia.org/r/#/c/273974/
[17:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:20:05] thanks, mobrovac, for logging!
[17:20:21] my pleasure :)
[17:21:29] _joe_: jynus: rolling restart still in progress, but i've verified all's good on a subset of nodes
[17:21:35] so we call this {{done}}
[17:21:51] <_joe_> cool
[17:22:50] I see some errors on icinga, but I suppose they are transient
[17:23:07] they are
[17:23:19] (CR) Yuvipanda: "w00t!" [puppet] - https://gerrit.wikimedia.org/r/231144 (owner: Faidon Liambotis)
[17:25:05] PROBLEM - puppet last run on mw1017 is CRITICAL: CRITICAL: puppet fail
[17:35:12] Operations, Patch-For-Review: move RT off of magnesium - https://phabricator.wikimedia.org/T119112#2076335 (Dzahn) a:Dzahn
[17:37:00] _joe_: Is it too late to sneak in some minor patches to puppetswat?
[17:37:15] <_joe_> ostriches: depends...
[17:37:25] pep8 fixes for various python stuff.
[17:37:51] <_joe_> ostriches: let's see a couple :P
[17:38:17] It's these 3
[17:38:18] https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:pep8,n,z
[17:39:55] RECOVERY - puppet last run on mw1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:40:19] <_joe_> ostriches: can't we get rid of the ridiculous "line too long" warning instead?
[17:40:25] <_joe_> just sayin...
[17:41:24] I don't think it warns in jenkins actually but it's easily fixed for people's local pep8 runs :)
[17:41:44] <_joe_> ostriches: also, it doesn't look as valid python to me
[17:41:50] <_joe_> but let me check
[17:42:08] (PS1) Yuvipanda: labs: Add CNAMES for tools specific things [puppet] - https://gerrit.wikimedia.org/r/274148 (https://phabricator.wikimedia.org/T118758)
[17:42:10] Which one?
[17:42:23] andrewbogott: ^ can you take a look when you have time? :) Since you've been working on dns stuff recently :)
[17:42:25] <_joe_> let me check
[17:42:57] yuvipanda: yes, but I’m in the middle of something now
[17:43:14] andrewbogott: cool :)
[17:43:37] andrewbogott: hmm, can you instead just tell me if restarting pdns-recursor causes outages? :)
[17:43:49] <_joe_> ostriches: no it's actually ok, I'll merge the first one
[17:43:51] only a short one :)
[17:43:59] <_joe_> I'd let the other two for thursday though
[17:44:02] andrewbogott: :D ok
[17:44:06] _joe_: Okie dokie, thx
[17:44:08] andrewbogott: I'll wait for review then
[17:44:15] yuvipanda: it should be fine, that happens anytime puppet refreshes the floating ips
[17:44:31] (PS9) Yuvipanda: k8s: Add auth for docker client to authenticate to registry [puppet] - https://gerrit.wikimedia.org/r/274011
[17:45:12] (PS2) Giuseppe Lavagetto: postgresql.py ganglia: pep8 fixes, mostly line too long [puppet] - https://gerrit.wikimedia.org/r/273106 (owner: Chad)
[17:45:26] (CR) Giuseppe Lavagetto: [C: 2 V: 2] postgresql.py ganglia: pep8 fixes, mostly line too long [puppet] - https://gerrit.wikimedia.org/r/273106 (owner: Chad)
[17:46:50] (CR) Yuvipanda: [C: 2] k8s: Add auth for docker client to authenticate to registry [puppet] - https://gerrit.wikimedia.org/r/274011 (owner: Yuvipanda)
[17:47:05] (PS10) Yuvipanda: k8s: Add auth for docker client to authenticate to registry [puppet] - https://gerrit.wikimedia.org/r/274011
[17:47:17] andrewbogott: ah, ok. I'll just steam ahead then.
[17:47:21] (CR) Jcrespo: "Is the "daily" part that concerns me, do we really need daily backups?, not even production gets those.This is a total of 4-3 TB, and week" [puppet] - https://gerrit.wikimedia.org/r/273312 (https://phabricator.wikimedia.org/T127991) (owner: Ottomata)
[17:47:35] (CR) Yuvipanda: [V: 2] k8s: Add auth for docker client to authenticate to registry [puppet] - https://gerrit.wikimedia.org/r/274011 (owner: Yuvipanda)
[17:47:49] _joe_: I added that one to today's puppetswat list
[17:48:07] (PS1) Chad: Gerrit: install standard base on new server lead [puppet] - https://gerrit.wikimedia.org/r/274150
[17:48:25] <_joe_> ostriches: thanks :)
[17:50:25] (PS2) Yuvipanda: labs: Add CNAMES for tools specific things [puppet] - https://gerrit.wikimedia.org/r/274148 (https://phabricator.wikimedia.org/T118758)
[17:52:14] !log elastic2024.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
[17:52:15] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101
[17:52:15] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697
[17:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:54:51] andrewbogott: last bug - which is the primary dns recursor now?
[17:55:02] on labservices1001
[17:55:24] andrewbogott: thanks!
[17:57:04] (CR) Yuvipanda: [C: 2] labs: Add CNAMES for tools specific things [puppet] - https://gerrit.wikimedia.org/r/274148 (https://phabricator.wikimedia.org/T118758) (owner: Yuvipanda)
[17:58:03] Operations, Labs, Labs-Infrastructure: Estimate hardware requirements for relevance lab elasticsearch servers - https://phabricator.wikimedia.org/T128433#2076499 (TJones)
[17:58:06] Operations, Mail: [URGENT] New email address receiving bounceback - https://phabricator.wikimedia.org/T128485#2076501 (Aklapper)
[18:00:04] yurik gwicke cscott arlolra subbu: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160301T1800). Please do the needful.
[18:02:35] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: puppet fail
[18:03:27] that's me ^
[18:03:28] (PS1) Yuvipanda: labs: Put the dnsrecursor hiera value in right place [puppet] - https://gerrit.wikimedia.org/r/274155
[18:05:37] (CR) Yuvipanda: [C: 2 V: 2] labs: Put the dnsrecursor hiera value in right place [puppet] - https://gerrit.wikimedia.org/r/274155 (owner: Yuvipanda)
[18:05:40] (PS1) ArielGlenn: fix up wrong var name in datasets nfs role [puppet] - https://gerrit.wikimedia.org/r/274156
[18:06:49] Operations, Performance-Team, Patch-For-Review, Performance: Update HHVM package to recent release - https://phabricator.wikimedia.org/T119637#2076527 (Joe) Status update: # luasandbox works fine after tweaking debian/rules (as in - it's loaded and can execute a very simple lua script) # tidy needs...
[18:07:55] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 60.87% of data above the critical threshold [5000000.0]
[18:08:00] (CR) ArielGlenn: [C: 2] fix up wrong var name in datasets nfs role [puppet] - https://gerrit.wikimedia.org/r/274156 (owner: ArielGlenn)
[18:08:05] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:09:57] yurik, gwicke are you guys deploying anything now?
[18:11:23] Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #2003: Can't connect to MySQL server on 'm3-master.eqiad.wmnet' (99).
[18:11:38] seems to be a one-time glitch (works fine now), but is this normal?
[18:11:57] (PS1) Yuvipanda: labs: Set tools zonefile to be loaded [puppet] - https://gerrit.wikimedia.org/r/274160 (https://phabricator.wikimedia.org/T118758)
[18:12:53] SPF|Cloud: That was happening yesterday when we were doing some batch operations, but shouldn't be happening right now....
[18:13:08] (PS1) Dzahn: dynamicproxy: custom log schema (http/https) for tools [puppet] - https://gerrit.wikimedia.org/r/274161 (https://phabricator.wikimedia.org/T128409)
[18:13:10] (PS2) Yuvipanda: labs: Set tools zonefile to be loaded [puppet] - https://gerrit.wikimedia.org/r/274160 (https://phabricator.wikimedia.org/T118758)
[18:14:19] (PS2) Dzahn: dynamicproxy: custom log schema (http/https) for tools [puppet] - https://gerrit.wikimedia.org/r/274161 (https://phabricator.wikimedia.org/T128409)
[18:14:32] (CR) Yuvipanda: [C: 2] labs: Set tools zonefile to be loaded [puppet] - https://gerrit.wikimedia.org/r/274160 (https://phabricator.wikimedia.org/T118758) (owner: Yuvipanda)
[18:14:35] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [1000.0]
[18:14:51] subbu, nopeo
[18:14:53] (CR) Merlijn van Deen: "Is this just to run for a short while (days)? If not, we should probably add log rotation." [puppet] - https://gerrit.wikimedia.org/r/274161 (https://phabricator.wikimedia.org/T128409) (owner: Dzahn)
[18:15:25] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[18:16:14] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[18:16:40] ostriches: actually, it happened again
[18:16:41] (PS3) Dzahn: dynamicproxy: custom log schema (http/https) for tools [puppet] - https://gerrit.wikimedia.org/r/274161 (https://phabricator.wikimedia.org/T128409)
[18:17:04] ....hmmmm
[18:17:33] (PS1) Yuvipanda: labs: Make docker_registry cname be absolute [puppet] - https://gerrit.wikimedia.org/r/274163
[18:18:12] Operations: Reinstall redis servers (Job queues) with Jessie - https://phabricator.wikimedia.org/T123675#2076593 (elukey) a:elukey
[18:18:50] (PS2) Yuvipanda: labs: Make docker_registry cname be absolute [puppet] - https://gerrit.wikimedia.org/r/274163
[18:19:07] (CR) Yuvipanda: [C: 2 V: 2] labs: Make docker_registry cname be absolute [puppet] - https://gerrit.wikimedia.org/r/274163 (owner: Yuvipanda)
[18:19:15] (CR) Dzahn: "good question. yes, it's just for a limited time, but i don't yet how long exactly. how long do we need to identify all the tools? i mean," [puppet] - https://gerrit.wikimedia.org/r/274161 (https://phabricator.wikimedia.org/T128409) (owner: Dzahn)
[18:21:05] Operations: Reinstall redis servers (Job queues) with Jessie (NOTE: rdb1002 is special and is excluded!) - https://phabricator.wikimedia.org/T123675#2076639 (elukey)
[18:22:55] yea, confirmed. phabricator issues getting more
[18:23:31] well, again, as yesterday, inserts from phabricator have multiplied by 26x
[18:23:47] :o damn
[18:24:24] connections by 10x
[18:24:55] http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=iridium.eqiad.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 doesn't look well either
[18:25:19] (PS1) Yuvipanda: Revert "labs: Set tools zonefile to be loaded" [puppet] - https://gerrit.wikimedia.org/r/274164
[18:25:26] (PS2) Yuvipanda: Revert "labs: Set tools zonefile to be loaded" [puppet] - https://gerrit.wikimedia.org/r/274164
[18:25:30] 80 temporary tables per second
[18:25:35] (CR) Yuvipanda: [C: 2 V: 2] Revert "labs: Set tools zonefile to be loaded" [puppet] - https://gerrit.wikimedia.org/r/274164 (owner: Yuvipanda)
[18:25:37] Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #2003: Can't connect to MySQL server on 'm3-master.eqiad.wmnet' (99).?
[18:25:56] I need three attempts to create a task
[18:26:07] Luke081515: we're aware
[18:26:11] i'm reporting in -devtools
[18:26:15] ostriches, are you there?
[18:26:19] that's kind of the phab channel
[18:27:12] jynus: I am.
[18:27:15] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[18:27:17] And I'm not doing any phab work.
[18:27:39] phab seems to be doing it on its own :-)
[18:28:02] (CR) Alex Monk: "This isn't going to work without bastiononly as well" [puppet] - https://gerrit.wikimedia.org/r/274118 (https://phabricator.wikimedia.org/T128381) (owner: Muehlenhoff)
[18:28:09] retarded phd?
[18:28:30] someone doing something unknown? idk
[18:28:58] most issues coming from diffusion
[18:29:22] @iridium:~# grep fault /var/log/apache2/error.log Apache segfaults
[18:29:25] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [1000.0]
[18:29:42] rly?
[18:30:29] mutante: I'm wondering if traffic increased? [18:30:56] normaly restarting phd will not cause such issues, in all cases [18:31:14] Cloning into bare repository '/srv/phab/repos/GCMC'... [18:31:14] fatal: could not read Username for 'https://gerrit.wikimedia.org': No such device or address at [/src/future/exec/ExecFuture.php:416] [18:31:15] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 53.85% of data above the critical threshold [5000000.0] [18:32:08] Cloning? [18:32:12] dafuq? [18:32:14] things seem ok now [18:32:37] it cant execute the future [18:32:49] (815 more bytes) ... at [/src/future/exec/ExecFuture.php [18:33:18] ostriches: from /var/log/phd/daemons.log [18:34:02] #2 PhabricatorRepositoryPullLocalDaemon::resolveUpdateFuture [18:34:08] Looking. [18:34:09] I see [18:34:10] that "pull daemon" wants to pull stuff [18:34:12] apparently [18:34:48] Yeah I see it. [18:34:49] needs a username for gerrit.. [18:34:51] 'k [18:34:52] But the repo's missing on disk [18:34:55] It doesn't need a username. [18:34:58] Red herring [18:35:01] ok [18:37:05] AphrontConnectionQueryException [18:37:06] Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #2003: Can't connect to MySQL server on 'm3-master.eqiad.wmnet' (99). [18:37:18] phabricator error [18:37:51] irccloud at it [18:38:14] Waitttttt [18:38:18] I see what's going on with that stupid repo. [18:38:36] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [18:40:44] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [18:41:21] !log starting parsoid deploy [18:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:41:43] mutante: mine didn't timeout <:) [18:42:01] misc cluster is throwing 500s since about 30 minutes ago, hence the graphite thing above [18:42:07] I'm guessing phab... 
[18:42:10] mutante: Fixed the phd logspam. [18:42:13] Misconfigured repo. [18:42:24] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [18:42:30] bblack, see my comment on another channel [18:43:21] ostriches: :) cool! [18:43:44] SPF|Cloud: lucky [18:43:52] bblack: yea, i think phab [18:43:53] (03PS1) 10Andrew Bogott: Support totp auth in keystone [puppet] - 10https://gerrit.wikimedia.org/r/274167 (https://phabricator.wikimedia.org/T105690) [18:44:07] yes it is phab, we have some graphite data on that which shows iridium as the source [18:44:20] !log synced parsoid code; restarted parsoid on wtp1002 as a canary [18:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:44:55] (03CR) 10jenkins-bot: [V: 04-1] Support totp auth in keystone [puppet] - 10https://gerrit.wikimedia.org/r/274167 (https://phabricator.wikimedia.org/T105690) (owner: 10Andrew Bogott) [18:46:16] I can see it now, although there was other minor sources [18:47:25] parsoid on wtp1002 looking good. restarting on all nodes [18:47:46] (03PS1) 10ArielGlenn: fix up dataset nginx confs for jessie, ipv6only defaults to on now (!) 
[puppet] - 10https://gerrit.wikimedia.org/r/274168 [18:48:53] ostriches (or anyone else), if you have the time, check the space on iridium, / is a bit crowded, probably due to temporary increase in error logging [18:49:42] !log finished deploying parsoid sha 1f7ed5d0 [18:49:45] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:50:00] probably same or equivalent cause to T124651 [18:50:00] T124651: iridium:/var/log/phd/daemons.log is growing too much (took 20% of filesystem space) - https://phabricator.wikimedia.org/T124651 [18:51:06] no, I am wrong with that [18:51:16] !log iridium: apt-get clean for some more disk space [18:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:51:45] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:52:25] it's atop logging too [18:52:56] (03PS1) 10RobH: whitelisting equinix domain for spam assassin [puppet] - 10https://gerrit.wikimedia.org/r/274170 (https://phabricator.wikimedia.org/T128497) [18:53:10] !log iridium - gzip /var/log/atop/atop_20160* [18:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:54:19] we should bring down iridium, we use lvm in hosts that don't need it, and not in the ones that really need it [18:54:40] "bring down"? [18:54:45] reimage [18:54:48] :-) [18:55:09] bring down the machine, or vm, wharever it is [18:55:40] stop the phd service [18:55:42] i suppose [18:55:49] on the server iridium [18:55:50] Yeah, it's atop and account [18:55:54] that pages, as we learned yesterday :-P [18:55:55] The phd logs are only ~300m [18:56:09] yes [18:56:31] account is taking 2.1 GB [18:56:39] Pruned all the old phd logs, not that it was much. 
[18:56:44] apergos: unless notifications are disabled [18:56:44] 9.8m now [18:57:17] mutante: of course [18:57:30] in any case, I think those values are normal, we need a larger partition [18:57:38] ok, disk space is back to 92% for now, gzipped logs [18:57:40] what now [18:57:53] mukunda proposed just moving logs over to /srv [18:57:57] at one point [18:58:00] That's just phd logs [18:58:04] Which wasn't the problem here. [18:58:08] We should move /var to a diff partition [18:58:15] Or at least /var/log [18:59:12] so the repo thing you fixed, ostriches [18:59:19] was that the main issue [18:59:37] and we are good for now, besides the logging thing? [19:00:29] var/log/account is pretty full, any reason we can't clean that out ostriches? [19:00:37] <_joe_> are all bots down? [19:00:59] i guess because of labs issues [19:02:27] mutante: Yeah, that's what was causing phd to freak out and prolly the m3 connection spike [19:02:45] chasemp: No reason I know of we can't... [19:04:35] ostriches: ok, good! 
(i mean, 'we know the reason'-good) [19:05:44] I think yuvi had some issues w/ dns in tools and reverted https://gerrit.wikimedia.org/r/#/c/274164/ [19:05:52] but it's possible there are still bots in a bad state [19:09:20] (03PS2) 10Andrew Bogott: Support totp auth in keystone [puppet] - 10https://gerrit.wikimedia.org/r/274167 (https://phabricator.wikimedia.org/T105690) [19:09:22] (03PS1) 10Andrew Bogott: Support totp auth for horizon [puppet] - 10https://gerrit.wikimedia.org/r/274173 [19:09:48] (03PS1) 10Chad: Moving group0 to wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274174 [19:10:10] ostriches: I did a bit of cleanup so it's not as bad but if you want to move /var/log as a holdover for more drastic work I'm cool w/ it [19:10:47] (03CR) 10jenkins-bot: [V: 04-1] Support totp auth for horizon [puppet] - 10https://gerrit.wikimedia.org/r/274173 (owner: 10Andrew Bogott) [19:13:00] this partmon template was the generic "big srv partition for a web box" iirc but I wish I would have gone LVM now clearly [19:13:26] !log demon@tin Started scap: testwikis to wmf.15 and rebuild l10n [19:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:14:13] !log clean out /var/log/atop and /var/log/account on iridium [19:14:17] chasemp: I think the generic partman should probably put /var/log on a diff partition. 
Exploding logs should never kill the machine [19:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:14:39] ostriches: it seems odd it does not [19:14:46] !log demon@tin scap aborted: testwikis to wmf.15 and rebuild l10n (duration: 01m 19s) [19:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:14:59] (aborted on purpose, forgot something) [19:17:26] (03PS2) 10Andrew Bogott: Support totp auth for horizon [puppet] - 10https://gerrit.wikimedia.org/r/274173 [19:18:27] (03CR) 10jenkins-bot: [V: 04-1] Support totp auth for horizon [puppet] - 10https://gerrit.wikimedia.org/r/274173 (owner: 10Andrew Bogott) [19:19:38] !log testing heartbeat in m5 (db1009, db2030) [19:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:21:10] (03PS3) 10Andrew Bogott: Support totp auth for horizon [puppet] - 10https://gerrit.wikimedia.org/r/274173 [19:22:21] (03CR) 10jenkins-bot: [V: 04-1] Support totp auth for horizon [puppet] - 10https://gerrit.wikimedia.org/r/274173 (owner: 10Andrew Bogott) [19:24:20] (03PS4) 10Andrew Bogott: Support totp auth for horizon [puppet] - 10https://gerrit.wikimedia.org/r/274173 [19:24:55] (03CR) 10ArielGlenn: [C: 04-1] "don't merge right now, I'll do this tomorrow after dataset1001 update" [puppet] - 10https://gerrit.wikimedia.org/r/274168 (owner: 10ArielGlenn) [19:29:28] (03PS3) 10Andrew Bogott: Support totp auth in keystone [puppet] - 10https://gerrit.wikimedia.org/r/274167 (https://phabricator.wikimedia.org/T105690) [19:29:30] (03PS5) 10Andrew Bogott: Support totp auth for horizon [puppet] - 10https://gerrit.wikimedia.org/r/274173 [19:31:34] (03PS7) 10Jcrespo: Add custom heartbeat script with "shard" additional column [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/274134 (https://phabricator.wikimedia.org/T114752) [19:33:39] (03CR) 10Jcrespo: [C: 032] "Manually tested successfully on m5- I will commit this because there is no 
hosts currently using it in production (I will enable it progre" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/274134 (https://phabricator.wikimedia.org/T114752) (owner: 10Jcrespo) [19:37:38] 6Operations, 6Labs, 13Patch-For-Review: Setup private docker registry with authentication support in tools - https://phabricator.wikimedia.org/T118758#2077031 (10yuvipanda) Meh, that screwed up, reverting all the CNAME work... [19:39:36] (03CR) 10Jcrespo: [C: 032] Upgrade mariadb module to allow new heartbeat updates [puppet] - 10https://gerrit.wikimedia.org/r/274178 (https://phabricator.wikimedia.org/T111266) (owner: 10Jcrespo) [19:42:53] (03PS2) 10Yuvipanda: labs: Revert all work around CNAMEs for toollabs [puppet] - 10https://gerrit.wikimedia.org/r/274180 (https://phabricator.wikimedia.org/T118758) [19:43:33] (03PS6) 10Dzahn: mediawiki: split role classes, move to modules [puppet] - 10https://gerrit.wikimedia.org/r/256574 [19:43:53] yuvipanda, do you think a 1/2 TB of memory with RAID10 SSDs will be enough for labs? :-) [19:43:57] (03PS7) 10Dzahn: mediawiki: split role classes, move to modules [puppet] - 10https://gerrit.wikimedia.org/r/256574 [19:44:17] jynus: depends :) [19:44:25] jynus: labsdb? :) or all of labs? :D [19:45:07] (03CR) 10jenkins-bot: [V: 04-1] mediawiki: split role classes, move to modules [puppet] - 10https://gerrit.wikimedia.org/r/256574 (owner: 10Dzahn) [19:45:12] (03CR) 10Yuvipanda: [C: 032] labs: Revert all work around CNAMEs for toollabs [puppet] - 10https://gerrit.wikimedia.org/r/274180 (https://phabricator.wikimedia.org/T118758) (owner: 10Yuvipanda) [19:45:52] 1 labsdb, and as a temporary measure [19:46:35] but I want people to discuss [full automatic] usage limits before putting them in production [19:46:53] jynus: 'online' vs 'batch' usage? 
[19:47:32] to be fair, I do not "care" much- as in, let the users decide; but have something in written form [19:49:03] * yuvipanda nods [19:49:07] that sounds appropriate [19:49:15] jynus: I'll get a task going to discuss this soon [19:49:45] I can do that, do not worry, I am just pinging you and asking for your input on that [19:50:07] probably I will send an email to labs pointing to a task [19:52:03] jynus: +1 :) [19:52:34] some user already provided some good advice based on old toolserver [19:52:57] (03PS1) 10Dzahn: admin: pentesters need nmap with privileged options [puppet] - 10https://gerrit.wikimedia.org/r/274182 [19:53:56] (03CR) 10jenkins-bot: [V: 04-1] admin: pentesters need nmap with privileged options [puppet] - 10https://gerrit.wikimedia.org/r/274182 (owner: 10Dzahn) [19:54:05] (03PS4) 10Andrew Bogott: Support totp auth in keystone [puppet] - 10https://gerrit.wikimedia.org/r/274167 (https://phabricator.wikimedia.org/T105690) [19:54:07] (03PS6) 10Andrew Bogott: Support totp auth for horizon [puppet] - 10https://gerrit.wikimedia.org/r/274173 [19:54:09] (03CR) 10Jdlrobson: [C: 031] Reduce sampling rate for language switcher [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272724 (https://phabricator.wikimedia.org/T127212) (owner: 10Bmansurov) [19:54:43] (03PS2) 10Dzahn: admin: pentesters need nmap with privileged options [puppet] - 10https://gerrit.wikimedia.org/r/274182 [19:55:29] (03CR) 10Chad: [C: 032] Moving group0 to wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274174 (owner: 10Chad) [19:55:51] 6Operations, 10MobileFrontend, 10Traffic, 5MW-1.27-release, and 5 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2077114 (10Jdlrobson) Note I've seen pages on the... 
[19:56:18] (03Merged) 10jenkins-bot: Moving group0 to wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274174 (owner: 10Chad) [19:56:52] 6Operations, 10Traffic, 7HTTPS: SSL cert needed for benefactorevents.wikimedia.org - https://phabricator.wikimedia.org/T115028#2077115 (10RobH) [19:57:28] (03CR) 10ArielGlenn: [C: 031] "With the caveat that there should be a 'toss logs when done' note on the task, and that someone needs to babysit this when it goes live in" [puppet] - 10https://gerrit.wikimedia.org/r/274161 (https://phabricator.wikimedia.org/T128409) (owner: 10Dzahn) [19:58:13] !log demon@tin Started scap: group0 to wmf.15 [19:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:58:24] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0] [20:00:04] ostriches: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160301T2000). Please do the needful. [20:00:52] 6Operations, 10MobileFrontend, 10Traffic, 5MW-1.27-release, and 5 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2077122 (10Legoktm) >>! In T124356#2077114, @Jdlr... [20:00:57] already am joucebottttt [20:03:23] 6Operations, 6Labs, 13Patch-For-Review: Setup private docker registry with authentication support in tools - https://phabricator.wikimedia.org/T118758#2077142 (10scfc) Is @Joe's T123628 a duplicate of this task? AFAIUI, there the registry would be a container and the name issue solved like other containers?... [20:03:36] ori: When we lowered the timeout for redis to 0.2s, what was the previous value? 
[20:05:02] 6Operations, 6Labs, 13Patch-For-Review: Setup private docker registry with authentication support in tools - https://phabricator.wikimedia.org/T118758#2077151 (10yuvipanda) [20:05:17] <_joe_> ostriches: actually, it was 0 (no timeout) [20:05:31] <_joe_> ostriches: due to the hhvm bug with float timeouts being cast to int [20:06:00] <_joe_> so what we did was prev_value => infinity => 0.2 (after fixing the hhvm bug in late january) [20:06:08] Hmm. I wonder if 0.2 is too low. We get a *crapton* of redis timeouts in the error logs. I'm curious what moving it to just like 0.3 would do to that error rate [20:06:21] <_joe_> ostriches: try 0.5 [20:06:40] 6Operations, 6Labs, 13Patch-For-Review: Setup private docker registry with authentication support in tools - https://phabricator.wikimedia.org/T118758#1808509 (10yuvipanda) Indeed it's the same, I've merged it in. The reason it's not just a container is mostly because we don't have swift on the horizon yet... [20:07:05] _joe_: I'll work up a patch and try it later when we're not on the train [20:07:45] 6Operations, 6Labs, 13Patch-For-Review: Setup private docker registry with authentication support in tools - https://phabricator.wikimedia.org/T118758#2077164 (10yuvipanda) That ticket also has a far more complex setup for a ful PaaS system that we aren't doing yet (and when we do do it, we shouldn't be buil... [20:09:24] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [20:17:11] 6Operations, 10Analytics, 10ArchCom-RfC, 6Discovery, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#2077248 (10mobrovac) >>! In T114443#2076614, @RobLa-WMF wrote: > @mobrovac - I'm confused, why don't you think T120212 is a blocker for this? It is, but it's an indirect one: it is b... 
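[editor's note] The 20:05 to 20:06 exchange above says the old redis timeout was effectively 0 (no timeout), because an HHVM bug cast float timeouts to int. A minimal sketch of that failure mode follows. It is written in Python rather than HHVM, the function names are hypothetical, and treating 0 as "no timeout" is taken from _joe_'s remark, not from the actual client code.

```python
import math


def buggy_effective_timeout(configured: float) -> float:
    """Hypothetical model of the bug: the float is truncated to an int first.

    Assumption (from the chat): a timeout of 0 means "no timeout".
    """
    truncated = int(configured)  # int() drops the fractional part: 0.2 -> 0
    return math.inf if truncated == 0 else float(truncated)


def fixed_effective_timeout(configured: float) -> float:
    """Hypothetical model after the fix: the float value is used as given."""
    return math.inf if configured == 0 else configured


if __name__ == "__main__":
    # 0.2 is the current value; 0.3 and 0.5 are the values floated in the chat.
    for value in (0.2, 0.3, 0.5):
        print(value, buggy_effective_timeout(value), fixed_effective_timeout(value))
```

Under this model every sub-second value behaves as "wait forever" before the fix. That matches the prev_value => infinity => 0.2 sequence described in the chat, and would be one reason timeouts only began to show up in the error logs after the fix.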
[20:26:11] (03PS1) 10RobH: benefactorevents.wikimedia.org ssl certificate [puppet] - 10https://gerrit.wikimedia.org/r/274195 [20:27:54] (03CR) 10RobH: [C: 032] benefactorevents.wikimedia.org ssl certificate [puppet] - 10https://gerrit.wikimedia.org/r/274195 (owner: 10RobH) [20:29:37] !log demon@tin Finished scap: group0 to wmf.15 (duration: 31m 24s) [20:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:30:36] csteipp: have some time to spare today? I could use a review of https://gerrit.wikimedia.org/r/#/c/274167/ and help testing [20:31:51] (03PS3) 10Dzahn: admin: pentesters need nmap with privileged options [puppet] - 10https://gerrit.wikimedia.org/r/274182 (https://phabricator.wikimedia.org/T126012) [20:32:27] (03PS1) 10Dereckson: Site name configuration on wuu.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274196 (https://phabricator.wikimedia.org/T128354) [20:32:32] 6Operations, 10Traffic, 7HTTPS: SSL cert needed for benefactorevents.wikimedia.org - https://phabricator.wikimedia.org/T115028#2077352 (10RobH) So the purchase of the SSL cert is done. The private key is on the private repo, and the public cert in the public puppet repo, with the filename of benefactorevers... [20:32:50] andrewbogott: I think so-- are those patches against liberty? [20:33:07] csteipp: kilo keystone, liberty horizon [20:33:14] (horizon is backwards compatible) [20:33:48] Good times. Yeah, I'll try to take a look. [20:34:09] thanks. Meanwhile I’m applying that keystone patch by hand to labtestcontrol so we can see how things go [20:35:15] PROBLEM - puppet last run on analytics1050 is CRITICAL: CRITICAL: Puppet has 1 failures [20:37:58] (03CR) 10Jgreen: [C: 031] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/274182 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [20:38:11] csteipp: where did you find your python oath package? 
[20:38:25] andrewbogott: pip [20:38:31] :( [20:38:57] andrewbogott: I think i made a comment somewhere... if we need to, I can unroll that into a patch. [20:39:10] I can build a .deb probably [20:39:17] The oath code itself is a couple hundred lines. [20:40:39] csteipp: is that this? https://github.com/joestump/python-oauth2 [20:40:45] or is ‘oath’ a different package? [20:40:55] oath is different [20:41:37] https://pypi.python.org/pypi/oath [20:41:53] 6Operations, 10Traffic, 7HTTPS: SSL cert needed for benefactorevents.wikimedia.org - https://phabricator.wikimedia.org/T115028#2077372 (10RobH) I can send this onward, on the previous task T107059#1506289, @bblack sent @EWilfong_WMF the key using the public gpg key on that task. I've gpg encrypted the new k... [20:42:19] csteipp: ok, thanks [20:42:28] oh… does your ‘latest’ patch only work with liberty or does it work with both? [20:42:45] (03PS5) 10Andrew Bogott: Support totp auth in keystone [puppet] - 10https://gerrit.wikimedia.org/r/274167 (https://phabricator.wikimedia.org/T105690) [20:42:47] (03PS7) 10Andrew Bogott: Support totp auth for horizon [puppet] - 10https://gerrit.wikimedia.org/r/274173 [20:44:30] andrewbogott: The patch is just for libery [20:44:43] crap, ok, I’ll need to adjust then [20:44:54] I’ll have a new patch in a few minutes [20:45:19] yeah, the method should work fine. I just made the patch for liberty explicitly. [20:46:29] (03CR) 10Tim Landscheidt: [C: 04-1] "| nginx: [emerg] "log_format" directive is not allowed here in /etc/nginx/sites-enabled/proxy:39" [puppet] - 10https://gerrit.wikimedia.org/r/274161 (https://phabricator.wikimedia.org/T128409) (owner: 10Dzahn) [20:47:25] 6Operations, 10Traffic, 7HTTPS: SSL cert needed for benefactorevents.wikimedia.org - https://phabricator.wikimedia.org/T115028#2077377 (10CCogdill_WMF) Thanks. I need to talk to Trilogy about this since it's been off our plates for awhile. We have a meeting Thursday, so I'll update at the end of the week. 
[20:47:30] csteipp: ok, sorry, I’m still slightly confused. [20:47:35] You wrote a separate plugin for kilo [20:47:44] but hacked the existing ‘password’ plugin for liberty, right? [20:48:12] and I made yet a third thing, which took the hacked password plugin (from your liberty patch) and made it a standalone plugin instead, mwtotp.py [20:48:47] would you expect mwtotp to work 1) on liberty 2) on kilo? [20:59:17] andrewbogott: hmm... not entirely sure. The issue I was having with liberty is that my config was getting overridden under high load. [20:59:26] hm [20:59:42] ok, well, since I have something set up, let’s see how that works on kilo :) [21:02:15] RECOVERY - puppet last run on analytics1050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:07:23] (03PS1) 10BBlack: text VCL: move bits-events and normalization above PURGE [puppet] - 10https://gerrit.wikimedia.org/r/274282 (https://phabricator.wikimedia.org/T127387) [21:07:53] ()*&@#(%*&(*&!@*&($*& submodules, with a brick [21:09:14] (03PS2) 10BBlack: ext VCL: move bits-events and normalization above PURGE [puppet] - 10https://gerrit.wikimedia.org/r/274282 (https://phabricator.wikimedia.org/T127387) [21:11:41] (03CR) 10GWicke: [C: 031] ext VCL: move bits-events and normalization above PURGE [puppet] - 10https://gerrit.wikimedia.org/r/274282 (https://phabricator.wikimedia.org/T127387) (owner: 10BBlack) [21:12:36] (03CR) 10BBlack: [C: 032 V: 032] ext VCL: move bits-events and normalization above PURGE [puppet] - 10https://gerrit.wikimedia.org/r/274282 (https://phabricator.wikimedia.org/T127387) (owner: 10BBlack) [21:16:50] (03PS1) 10Madhuvishy: eventlogging: Allow processor format strings to be configurable via hiera [puppet] - 10https://gerrit.wikimedia.org/r/274286 [21:17:36] 6Operations, 10Analytics, 10ArchCom-RfC, 6Discovery, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#2077418 (10RobLa-WMF) I realize that the blocking relationship is transitive, but given
Otto's comment (T114443#2072426), it would seem that it would be clearer to make the blocking r... [21:29:36] !log elastic1001.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [21:29:38] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101 [21:29:38] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [21:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:32:40] 6Operations, 10Traffic, 7HTTPS: SSL cert needed for benefactorevents.wikimedia.org - https://phabricator.wikimedia.org/T115028#2077450 (10RobH) I've confirmed with @bblack about the above in IRC. The new key has been gpg encrypted with Eric's key on T107059. I've emailed it to Eric @ the email address tie... [21:32:49] 6Operations, 10Traffic, 7HTTPS: SSL cert needed for benefactorevents.wikimedia.org - https://phabricator.wikimedia.org/T115028#2077451 (10RobH) a:5BBlack>3None [21:33:20] (03PS1) 10BBlack: Revert "ext VCL: move bits-events and normalization above PURGE" [puppet] - 10https://gerrit.wikimedia.org/r/274288 [21:33:31] (03CR) 10BBlack: [C: 032 V: 032] Revert "ext VCL: move bits-events and normalization above PURGE" [puppet] - 10https://gerrit.wikimedia.org/r/274288 (owner: 10BBlack) [21:40:59] (03PS8) 10Dzahn: mediawiki: split role classes, move to modules [puppet] - 10https://gerrit.wikimedia.org/r/256574 [21:51:09] (03CR) 10Dzahn: [C: 04-1] "not yet. "Error: Could not find class role::scap::target"" [puppet] - 10https://gerrit.wikimedia.org/r/256574 (owner: 10Dzahn) [21:53:55] (03CR) 10Dzahn: "why? labs vs. prod issue?"
[puppet] - 10https://gerrit.wikimedia.org/r/256574 (owner: 10Dzahn) [22:03:51] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance, 7user-notice: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2077633 (10ori) [22:08:02] (03PS3) 1020after4: Parameterize the git_server variable in global scap.cfg [puppet] - 10https://gerrit.wikimedia.org/r/272947 (https://phabricator.wikimedia.org/T126259) [22:09:11] (03CR) 1020after4: Parameterize the git_server variable in global scap.cfg (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/272947 (https://phabricator.wikimedia.org/T126259) (owner: 1020after4) [22:17:40] (03PS9) 10Dzahn: mediawiki: split role classes, move to modules [puppet] - 10https://gerrit.wikimedia.org/r/256574 [22:23:13] (03PS6) 10Andrew Bogott: Support totp auth in keystone [puppet] - 10https://gerrit.wikimedia.org/r/274167 (https://phabricator.wikimedia.org/T105690) [22:23:14] (03PS8) 10Andrew Bogott: Support totp auth for horizon [puppet] - 10https://gerrit.wikimedia.org/r/274173 [22:24:32] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/1899/" [puppet] - 10https://gerrit.wikimedia.org/r/256574 (owner: 10Dzahn) [22:25:12] jouncebot: next [22:25:12] In 1 hour(s) and 34 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160302T0000) [22:27:03] RoanKattouw_away: matt_flaschen: Ping on https://gerrit.wikimedia.org/r/#/c/272929/ [22:27:49] !log temp. 
disabling puppet runs on mw appservers to be extra safe during mediawiki module change [22:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:28:35] (03CR) 10Dzahn: [C: 032] "tested noop in compiler on different roles and eqiad/codfw" [puppet] - 10https://gerrit.wikimedia.org/r/256574 (owner: 10Dzahn) [22:28:38] (03PS7) 10Andrew Bogott: Support totp auth in keystone [puppet] - 10https://gerrit.wikimedia.org/r/274167 (https://phabricator.wikimedia.org/T105690) [22:28:40] (03PS9) 10Andrew Bogott: Support totp auth for horizon [puppet] - 10https://gerrit.wikimedia.org/r/274173 [22:29:53] Krinkle, yeah, I will update that. [22:31:56] 6Operations, 10Analytics, 10ArchCom-RfC, 6Discovery, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#2077786 (10mobrovac) [22:32:10] 6Operations, 10Analytics, 10ArchCom-RfC, 6Discovery, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1710260 (10mobrovac) >>! In T114443#2077418, @RobLa-WMF wrote: > I realize that the blocking relationship is transitive, but given Otto's comment (T114443#2072426), it would seem that... 
[22:33:07] splitting the mediawiki module up [22:33:23] and moving to module/role/ , one file per class like the other things we moved [22:33:44] (03CR) 10Dzahn: "disabled puppet on all, ran on mw1033, mw1026, mw2007, no change, re-enabled puppet on all" [puppet] - 10https://gerrit.wikimedia.org/r/256574 (owner: 10Dzahn) [22:34:19] !log re-enabled puppet runs on all mw* servers, mediawiki roles now in modules/role/manifests/mediawiki/ [22:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:36:04] (03PS3) 10Dzahn: ci: split and move role classes to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/260939 [22:36:06] (03PS8) 10Andrew Bogott: Support totp auth in keystone [puppet] - 10https://gerrit.wikimedia.org/r/274167 (https://phabricator.wikimedia.org/T105690) [22:36:08] (03PS10) 10Andrew Bogott: Support totp auth for horizon [puppet] - 10https://gerrit.wikimedia.org/r/274173 [22:36:23] (03PS10) 10Ori.livneh: webperf: Create new navtiming metric with higher value limit [puppet] - 10https://gerrit.wikimedia.org/r/270725 (https://phabricator.wikimedia.org/T125381) (owner: 10Phedenskog) [22:36:45] (03CR) 10Ori.livneh: [C: 032 V: 032] webperf: Create new navtiming metric with higher value limit [puppet] - 10https://gerrit.wikimedia.org/r/270725 (https://phabricator.wikimedia.org/T125381) (owner: 10Phedenskog) [22:38:02] (03CR) 10jenkins-bot: [V: 04-1] Support totp auth in keystone [puppet] - 10https://gerrit.wikimedia.org/r/274167 (https://phabricator.wikimedia.org/T105690) (owner: 10Andrew Bogott) [22:38:11] 6Operations, 10Phabricator, 6Project-Admins, 6Triagers: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#2077829 (10mmodell) 5Resolved>3stalled I'm changing the status so that {T706} will be rendered without the ~~strikethrough~~ on #project-... 
[22:38:49] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance, 7user-notice: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2077833 (10ori) 5Open>3Resolved a:3ori Closing, as p75 is now close to pre-wmf12 levels: {F3503068 size=full} We still don't have anything re... [22:39:15] (03CR) 10Alex Monk: Support totp auth for horizon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/274173 (owner: 10Andrew Bogott) [22:39:34] (03PS9) 10Andrew Bogott: Support totp auth in keystone [puppet] - 10https://gerrit.wikimedia.org/r/274167 (https://phabricator.wikimedia.org/T105690) [22:39:37] (03PS11) 10Andrew Bogott: Support totp auth for horizon [puppet] - 10https://gerrit.wikimedia.org/r/274173 [22:41:17] "- Last run result for unit replicate-tools was exit-code " [22:41:34] Last run result for unit replicate-maps was exit-code [22:43:00] (03CR) 10Alex Monk: Support totp auth in keystone (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/274167 (https://phabricator.wikimedia.org/T105690) (owner: 10Andrew Bogott) [22:43:37] (03CR) 10Alex Monk: [C: 04-1] "see PS8 comments" [puppet] - 10https://gerrit.wikimedia.org/r/274167 (https://phabricator.wikimedia.org/T105690) (owner: 10Andrew Bogott) [22:44:04] 6Operations, 6Labs, 10Labs-Infrastructure, 10Monitoring: labstore monitoring - "Last run result for unit .. 
was exit-code" - https://phabricator.wikimedia.org/T128526#2077860 (10Dzahn) [22:44:36] ACKNOWLEDGEMENT - Last backup of the maps filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-maps was exit-code daniel_zahn https://phabricator.wikimedia.org/T128526 [22:44:37] ACKNOWLEDGEMENT - Last backup of the others filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-others was exit-code daniel_zahn https://phabricator.wikimedia.org/T128526 [22:44:37] ACKNOWLEDGEMENT - Last backup of the tools filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-tools was exit-code daniel_zahn https://phabricator.wikimedia.org/T128526 [22:45:07] 6Operations, 6Discovery, 6Labs, 10Labs-Infrastructure, and 3 others: labstore monitoring - "Last run result for unit .. was exit-code" - https://phabricator.wikimedia.org/T128526#2077887 (10Dzahn) [22:45:44] 6Operations, 6Discovery, 6Labs, 10Labs-Infrastructure, and 3 others: labstore monitoring - "Last run result for unit .. was exit-code" - https://phabricator.wikimedia.org/T128526#2077892 (10Dzahn) [22:46:48] (03PS4) 10Dzahn: ci: split and move role classes to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/260939 [22:47:13] (03CR) 10Dzahn: [C: 032] "one more refactor. noop on gallium and scandium. and re: labs no class names are changing. 
http://puppet-compiler.wmflabs.org/1877/" [puppet] - 10https://gerrit.wikimedia.org/r/260939 (owner: 10Dzahn) [22:47:47] (03PS10) 10Andrew Bogott: Support totp auth in keystone [puppet] - 10https://gerrit.wikimedia.org/r/274167 (https://phabricator.wikimedia.org/T105690) [22:47:49] (03PS12) 10Andrew Bogott: Support totp auth for horizon [puppet] - 10https://gerrit.wikimedia.org/r/274173 [22:50:18] (03PS4) 10Dzahn: admin: pentesters need nmap with privileged options [puppet] - 10https://gerrit.wikimedia.org/r/274182 (https://phabricator.wikimedia.org/T126012) [22:50:40] (03CR) 10Dzahn: "double checked, noop on gallium and scandium" [puppet] - 10https://gerrit.wikimedia.org/r/260939 (owner: 10Dzahn) [22:50:56] (03CR) 10Dzahn: [C: 032] admin: pentesters need nmap with privileged options [puppet] - 10https://gerrit.wikimedia.org/r/274182 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [22:54:44] (03CR) 10Andrew Bogott: "I've applied this by hand on labtestcontrol and it works in kilo. Since this is Chris's 'Liberty' patch I've added the same files to be in" [puppet] - 10https://gerrit.wikimedia.org/r/274167 (https://phabricator.wikimedia.org/T105690) (owner: 10Andrew Bogott) [22:56:12] (03PS13) 10Andrew Bogott: Support totp auth for horizon [puppet] - 10https://gerrit.wikimedia.org/r/274173 (https://phabricator.wikimedia.org/T105690) [22:56:53] (03CR) 10Andrew Bogott: "This is running now on labtestweb, plus or minus a few typos. 
I can now log in with a second factor, and cannot log in with a missing or " [puppet] - 10https://gerrit.wikimedia.org/r/274173 (https://phabricator.wikimedia.org/T105690) (owner: 10Andrew Bogott) [22:57:22] (03CR) 10Andrew Bogott: "https://labtesthorizon.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/274173 (https://phabricator.wikimedia.org/T105690) (owner: 10Andrew Bogott) [22:58:43] (03PS8) 10Dzahn: maps: move roles to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/249059 [22:58:55] PROBLEM - Disk space on labservices1001 is CRITICAL: DISK CRITICAL - free space: / 343 MB (3% inode=76%) [23:00:50] andrewbogott: ^ https://phabricator.wikimedia.org/T126572 again [23:01:35] mutante: ok, looking, thanks [23:01:54] * andrewbogott had a joyful day of coding and will now pay the price [23:02:26] hopefully that wasn't me [23:02:30] I stopped touching that host [23:02:39] 2.4G ./upstart [23:02:47] curious [23:03:01] andrewbogott: most likely mdns logs and you can move them to /srv/var/ [23:03:10] but needs a more permanent solution [23:04:21] verbose = False; debug = False <- should really not be generating GB logfiles [23:04:21] logrotate for designate-mdns.log maybe [23:04:26] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to researchers for nschaaf - https://phabricator.wikimedia.org/T128381#2077931 (10Milimetric) Works for me, thanks. I'll follow up if there are any problems. [23:07:05] mutante: it is rotated, and also usually small. I don’t know if I left logging on overnight, or what [23:07:50] andrewbogott: oh, ok. 
yea that was just from last time [23:08:06] RECOVERY - Disk space on labservices1001 is OK: DISK OK [23:08:27] mutante: I just erased one of the files, will watch to see if it is still huge tomorrow [23:08:52] andrewbogott: yep, cool [23:09:09] upstart captures stdout and stderr by default [23:09:31] (!log T126572 andrew deleted one of the files) [23:09:31] T126572: labservices1001 ran out of disk space - https://phabricator.wikimedia.org/T126572 [23:09:38] test [23:11:04] (03CR) 10Dzahn: [C: 032] "noop in compiler http://puppet-compiler.wmflabs.org/1868/" [puppet] - 10https://gerrit.wikimedia.org/r/249059 (owner: 10Dzahn) [23:11:40] csteipp: it works in kilo! So I will apply things in prod tomorrow after you give things a once-over. [23:12:44] (03CR) 10Dzahn: "being bold, removing Alex' review, it was from before i amended this and long time ago" [puppet] - 10https://gerrit.wikimedia.org/r/249059 (owner: 10Dzahn) [23:14:00] aww, wtf..maps [23:15:56] (03CR) 10Dzahn: "confirmed noop on maps-test2002. puppet run on maps-test2001/2003 broken since many days, not my change" [puppet] - 10https://gerrit.wikimedia.org/r/249059 (owner: 10Dzahn) [23:16:11] 6Operations, 10ops-ulsfo: ulsfo temperature-related exceptions - https://phabricator.wikimedia.org/T119631#2077982 (10RobH) UL has stated there are no temp issues in our rack. We now have a thermal camera to take readings onsite, and I'll be visiting onsite shortly. [23:17:20] !log maps-test2001 - could not find dependency for postgres class is NOT related to my recent change. icinga crit since a long time [23:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:17:50] (03PS1) 10Andrew Bogott: Increase Horizon session length by a lot [puppet] - 10https://gerrit.wikimedia.org/r/274309 [23:19:36] (03CR) 10CSteipp: "I haven't been able to test this, but I think this should work correctly. 
We'll want to test this once it's rolled out to make sure we're " [puppet] - 10https://gerrit.wikimedia.org/r/274167 (https://phabricator.wikimedia.org/T105690) (owner: 10Andrew Bogott) [23:20:06] (03Abandoned) 10Dzahn: add AAAA record for argon (irc,rc streams) [dns] - 10https://gerrit.wikimedia.org/r/214506 (https://phabricator.wikimedia.org/T105422) (owner: 10Dzahn) [23:22:03] (03Abandoned) 10Andrew Bogott: Move labtest to openstack liberty [puppet] - 10https://gerrit.wikimedia.org/r/273481 (owner: 10Andrew Bogott) [23:23:13] (03PS3) 10Dzahn: deactivate wicipediacymraeg.org [dns] - 10https://gerrit.wikimedia.org/r/254055 (https://phabricator.wikimedia.org/T128085) [23:23:30] (03CR) 10Dzahn: [C: 032] "the "Wales Manager, Wikimedia UK" confirmed this is OK :)" [dns] - 10https://gerrit.wikimedia.org/r/254055 (https://phabricator.wikimedia.org/T128085) (owner: 10Dzahn) [23:24:59] (03PS2) 10Andrew Bogott: Increase Horizon session length by a lot [puppet] - 10https://gerrit.wikimedia.org/r/274309 [23:25:01] (03PS1) 10Andrew Bogott: Move production Horizon to Liberty [puppet] - 10https://gerrit.wikimedia.org/r/274311 (https://phabricator.wikimedia.org/T105690) [23:25:26] 6Operations, 10Traffic, 10domains, 13Patch-For-Review: figure out if we can park wicipediacymraeg.org - https://phabricator.wikimedia.org/T128085#2078054 (10Dzahn) Robin says he agrees on his talk page. merged [23:25:40] 6Operations, 10Traffic, 10domains, 13Patch-For-Review: figure out if we can park wicipediacymraeg.org - https://phabricator.wikimedia.org/T128085#2078055 (10Dzahn) 5Open>3Resolved [23:25:51] 6Operations, 10Traffic, 10domains: figure out if we can park wicipediacymraeg.org - https://phabricator.wikimedia.org/T128085#2063521 (10Dzahn) [23:31:17] (03CR) 10Dzahn: "not sure after https://phabricator.wikimedia.org/T128381#2076048 are docs and puppet code not saying the same thing? 
fwiw i think it's be" [puppet] - 10https://gerrit.wikimedia.org/r/274118 (https://phabricator.wikimedia.org/T128381) (owner: 10Muehlenhoff) [23:32:16] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to researchers for nschaaf - https://phabricator.wikimedia.org/T128381#2078090 (10Dzahn) Does that mean the docs and actual puppet code don't say the same thing? [23:38:02] (03PS2) 10Dzahn: Add nschaaf to researchers [puppet] - 10https://gerrit.wikimedia.org/r/274118 (https://phabricator.wikimedia.org/T128381) (owner: 10Muehlenhoff) [23:38:19] (03PS3) 10Dzahn: Add nschaaf to researchers, bastiononly [puppet] - 10https://gerrit.wikimedia.org/r/274118 (https://phabricator.wikimedia.org/T128381) (owner: 10Muehlenhoff) [23:39:47] (03PS4) 10Dzahn: Add nschaaf to researchers, bastiononly [puppet] - 10https://gerrit.wikimedia.org/r/274118 (https://phabricator.wikimedia.org/T128381) (owner: 10Muehlenhoff) [23:40:04] (03CR) 10CSteipp: Support totp auth for horizon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/274173 (https://phabricator.wikimedia.org/T105690) (owner: 10Andrew Bogott) [23:40:07] (03CR) 10Dzahn: [C: 031] "like this, researchers+bastiononly will work" [puppet] - 10https://gerrit.wikimedia.org/r/274118 (https://phabricator.wikimedia.org/T128381) (owner: 10Muehlenhoff) [23:40:41] (03PS4) 10Tim Landscheidt: dynamicproxy: custom log schema (http/https) for tools [puppet] - 10https://gerrit.wikimedia.org/r/274161 (https://phabricator.wikimedia.org/T128409) (owner: 10Dzahn) [23:41:39] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to researchers for nschaaf - https://phabricator.wikimedia.org/T128381#2078132 (10Dzahn) apparently researchers has meanwhile been added to statistics-crunchers but not to bastion hosts. so what is needed is researchers and bastiononl... 
[23:42:07] (03CR) 10Andrew Bogott: Support totp auth for horizon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/274173 (https://phabricator.wikimedia.org/T105690) (owner: 10Andrew Bogott) [23:46:05] (03CR) 10Tim Landscheidt: "1. log_format must be outside of server." [puppet] - 10https://gerrit.wikimedia.org/r/274161 (https://phabricator.wikimedia.org/T128409) (owner: 10Dzahn) [23:47:20] (03CR) 10Jforrester: [C: 031] "We'll do this this afternoon, in the SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272926 (https://phabricator.wikimedia.org/T127881) (owner: 10Jforrester) [23:47:25] (03PS9) 10Dzahn: osm/maps/postgres: move tuning.conf out of /files/ [puppet] - 10https://gerrit.wikimedia.org/r/249056 [23:49:05] (03CR) 10Dzahn: "thank you Tim, looks and sounds all good. just why the additional "combined" log as well?" [puppet] - 10https://gerrit.wikimedia.org/r/274161 (https://phabricator.wikimedia.org/T128409) (owner: 10Dzahn) [23:49:43] (03PS14) 10Andrew Bogott: Support totp auth for horizon [puppet] - 10https://gerrit.wikimedia.org/r/274173 (https://phabricator.wikimedia.org/T105690) [23:53:45] (03PS1) 10JGirault: Bump portals to master. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274316 (https://phabricator.wikimedia.org/T128522) [23:55:24] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to researchers for nschaaf - https://phabricator.wikimedia.org/T128381#2078181 (10Krenair) >>! In T128381#2078090, @Dzahn wrote: > Does that mean the docs and actual puppet code don't say the same thing? I clarified the actual docs i... 
[23:55:45] (03CR) 10JGirault: "basically this commit https://github.com/wikimedia/wikimedia-portals/commit/864d5ebed066aa3267a17a5eb97409c69093e03b" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274316 (https://phabricator.wikimedia.org/T128522) (owner: 10JGirault) [23:55:56] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to researchers for nschaaf - https://phabricator.wikimedia.org/T128381#2078182 (10Dzahn) @Krenair cool, thank you [23:57:15] (03CR) 10Jforrester: [C: 04-1] "Scheduled for Tuesday 8 March." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271712 (https://phabricator.wikimedia.org/T127881) (owner: 10Jforrester) [23:57:37] !log upgrade elastic1002.eqiad.wmnet to elasticsearch 1.7.5 [23:57:40] (03CR) 10Jforrester: [C: 04-1] "Scheduled for Tuesday 15 March." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271713 (https://phabricator.wikimedia.org/T127881) (owner: 10Jforrester) [23:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
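A rough sketch relating to the labservices1001 disk-space exchange in this log (T126572), where `2.4G ./upstart` turned out to be the largest directory. It totals file sizes under each immediate child of a directory and ranks them, similar in spirit to running `du` and sorting. The function names `dir_sizes` and `largest` are hypothetical, and the numbers are apparent file sizes rather than allocated disk blocks, so they can differ from what `du` reports.

```python
import os


def dir_sizes(root):
    """Map each immediate child of root to the total size in bytes beneath it.

    Uses apparent file sizes (st_size), not allocated blocks, and does not
    follow symlinks.
    """
    sizes = {}
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                total = 0
                # os.walk does not follow directory symlinks by default.
                for dirpath, _dirnames, filenames in os.walk(entry.path):
                    for name in filenames:
                        try:
                            total += os.lstat(os.path.join(dirpath, name)).st_size
                        except OSError:
                            # File vanished or is unreadable (e.g. log rotated away).
                            continue
                sizes[entry.name] = total
            elif entry.is_file(follow_symlinks=False):
                sizes[entry.name] = entry.stat(follow_symlinks=False).st_size
    return sizes


def largest(root, n=5):
    """Return the n largest children of root as (name, bytes) pairs, biggest first."""
    ranked = sorted(dir_sizes(root).items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]
```

Calling `largest("/var/log")` on a host (the path is only an example) would show which subdirectory to look at first, which is how the oversized `upstart` directory stood out in the discussion.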