[00:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160205T0000). Please do the needful. [00:00:05] Addshore mafk matt_flaschen: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:10] *waves* [00:00:19] * mafk is here [00:01:50] (03PS3) 10BBlack: check_systemd_unit_state: support oneshot/RemainAfterExit [puppet] - 10https://gerrit.wikimedia.org/r/268596 [00:02:33] (03PS2) 10Aude: Set $wgArticlePlaceholderImageProperty for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268592 [00:02:38] Present [00:03:04] * aude would like to add https://gerrit.wikimedia.org/r/#/c/268592/ to swat, if possible [00:03:07] (it's beta only) [00:03:31] (03PS4) 10BBlack: check_systemd_unit_state: support oneshot/RemainAfterExit [puppet] - 10https://gerrit.wikimedia.org/r/268596 [00:03:33] (03PS1) 10BBlack: traffic-pool: s/true/yes/ for consistency [puppet] - 10https://gerrit.wikimedia.org/r/268599 [00:04:36] (03CR) 10Dzahn: "so it's normal this does not exist on https://www.wikidata.org/wiki/P964 yet?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/268592 (owner: 10Aude) [00:05:58] (03CR) 10Aude: "it's for beta: http://wikidata.beta.wmflabs.org/wiki/Property:P964" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268592 (owner: 10Aude) [00:07:56] (03PS3) 10MarcoAurelio: Adding museumvictoria.com.au domain to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267677 (https://phabricator.wikimedia.org/T125387) [00:09:07] (03PS4) 10Addshore: Revert "Revert "wgRCWatchCategoryMembership true on wikipedias & commons"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267646 [00:12:24] * aude wonders who is doing swat [00:12:32] *looks at the list* [00:12:46] Roan (RoanKattouw), Chad (ostriches), or Alex (Krenair) [00:12:54] not me today [00:12:57] That's right [00:13:06] Krenair: you're on the list ;) [00:13:16] yeah, forgot to remove myself [00:13:28] * mafk offers some $$$ to Krenair [00:14:29] mafk: stroopwaffles are a better currency around here [00:15:03] p858snake: $troopwaffle$ then [00:15:05] aude: wanna do swat/your swat? :) [00:15:20] :P [00:15:33] * aude looks at the list of patches [00:18:26] ok... [00:18:54] (03PS5) 10BBlack: check_systemd_unit_state: support oneshot/RemainAfterExit [puppet] - 10https://gerrit.wikimedia.org/r/268596 [00:18:56] (03PS2) 10BBlack: traffic-pool: s/true/yes/ for consistency [puppet] - 10https://gerrit.wikimedia.org/r/268599 [00:18:58] (03PS1) 10BBlack: check_systemd_unit_state: pep8 cleanup [puppet] - 10https://gerrit.wikimedia.org/r/268600 [00:19:12] aude: you have the necessary merge permissions, right? 
(I forget sometimes) [00:19:26] yep [00:19:32] i'll start with addshore's patch [00:19:39] awesome :) [00:19:48] !log restbase deploy start of 2aef1b67a0 on rb1001 [00:19:48] (03CR) 10Aude: [C: 032] Revert "Revert "wgRCWatchCategoryMembership true on wikipedias & commons"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267646 (owner: 10Addshore) [00:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:19:59] I'll be around for the shorturl-ones [00:20:22] (03Merged) 10jenkins-bot: Revert "Revert "wgRCWatchCategoryMembership true on wikipedias & commons"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267646 (owner: 10Addshore) [00:20:26] thanks mafk [00:20:41] greg-g: it's my obligation :) [00:20:57] tables should be created [00:21:02] indeed, I just appreciate it none-the-less :) [00:21:40] in the meanwhile, I'll issue a globalblock to a vandal [00:21:52] (03CR) 10BBlack: [C: 032] check_systemd_unit_state: pep8 cleanup [puppet] - 10https://gerrit.wikimedia.org/r/268600 (owner: 10BBlack) [00:22:22] syncing... 
[00:22:51] (03CR) 10BBlack: [C: 032] check_systemd_unit_state: support oneshot/RemainAfterExit [puppet] - 10https://gerrit.wikimedia.org/r/268596 (owner: 10BBlack) [00:23:04] (03CR) 10BBlack: [C: 032] traffic-pool: s/true/yes/ for consistency [puppet] - 10https://gerrit.wikimedia.org/r/268599 (owner: 10BBlack) [00:23:05] !log aude@mira Synchronized wmf-config/InitialiseSettings.php: Re-enable category watch on wikipedia and commons (duration: 01m 18s) [00:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:23:16] addshore: please check :) [00:23:23] Looks like its there :) [00:23:52] (03CR) 10Aude: [C: 032] Enabling Extension:ShortUrl on or.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267780 (https://phabricator.wikimedia.org/T124429) (owner: 10MarcoAurelio) [00:24:17] ok [00:24:24] i'll do the short url patches now [00:24:31] here to look [00:24:39] (03CR) 10Mobrovac: [C: 031] "That's it!" [puppet] - 10https://gerrit.wikimedia.org/r/268560 (https://phabricator.wikimedia.org/T122249) (owner: 10Milimetric) [00:24:40] meh I goofed the nrpe check [00:24:54] (03Merged) 10jenkins-bot: Enabling Extension:ShortUrl on or.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267780 (https://phabricator.wikimedia.org/T124429) (owner: 10MarcoAurelio) [00:25:02] (03CR) 10Aude: [C: 032] Enabling Extension:ShortUrl for bhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267783 (https://phabricator.wikimedia.org/T113348) (owner: 10MarcoAurelio) [00:25:14] (03CR) 10Aude: [C: 032] Enabling Ext:ShortURL for maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268414 (https://phabricator.wikimedia.org/T125802) (owner: 10MarcoAurelio) [00:25:41] (03CR) 10Milimetric: "Thanks Marko, I'll get ottomata to merge this tomorrow when he's around." 
[puppet] - 10https://gerrit.wikimedia.org/r/268560 (https://phabricator.wikimedia.org/T122249) (owner: 10Milimetric) [00:25:45] (03Merged) 10jenkins-bot: Enabling Extension:ShortUrl for bhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267783 (https://phabricator.wikimedia.org/T113348) (owner: 10MarcoAurelio) [00:25:54] RECOVERY - traffic-pool service on cp1060 is OK: OK - traffic-pool is active [00:25:55] (03PS1) 10BBlack: check_systemd_unit_state: fix silly syntax bug [puppet] - 10https://gerrit.wikimedia.org/r/268601 [00:26:06] !log restart apache on mendelevium.eqiad.wmnet. seems there's a memory leak, need to investigate tomorrow [00:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:26:13] (03CR) 10BBlack: [C: 032 V: 032] check_systemd_unit_state: fix silly syntax bug [puppet] - 10https://gerrit.wikimedia.org/r/268601 (owner: 10BBlack) [00:26:22] (03Merged) 10jenkins-bot: Enabling Ext:ShortURL for maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268414 (https://phabricator.wikimedia.org/T125802) (owner: 10MarcoAurelio) [00:26:32] (03CR) 10Aude: [C: 032] Remove testwiki from wmgUseShortUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267784 (owner: 10MarcoAurelio) [00:26:52] (03CR) 10Madhuvishy: [C: 04-1] "I think we should not change the template directly - and make this param configurable in the class, and pass in the value from hiera." 
[puppet] - 10https://gerrit.wikimedia.org/r/268594 (https://phabricator.wikimedia.org/T125916) (owner: 10Nuria) [00:27:21] (03Merged) 10jenkins-bot: Remove testwiki from wmgUseShortUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267784 (owner: 10MarcoAurelio) [00:27:45] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [00:28:09] looking at or.wikisource, looks shorturl is not there [00:28:43] they are syncing [00:29:26] !log aude@mira Synchronized wmf-config/InitialiseSettings.php: Enable ShortUrl on maiwiki, bhwiki and orwikisource (duration: 01m 17s) [00:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:29:34] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:29:34] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [00:29:35] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [00:29:35] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [00:29:35] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [00:29:35] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [00:29:35] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [00:29:36] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 56 minutes ago with 0 failures [00:29:38] :) [00:29:42] please check [00:29:57] then i'll do the other config patches [00:29:59] (03PS2) 10EBernhardson: Better mediawiki REPL [puppet] - 10https://gerrit.wikimedia.org/r/268541 [00:29:59] aude: everything looks great with the catwatch stuff, so 
going to head home now (finally) :D [00:30:08] I'll still be pingable if anything comes up! [00:30:17] aude: or.wikisource broke [00:30:23] RECOVERY - traffic-pool service on cp3015 is OK: OK - traffic-pool is active [00:30:31] (03CR) 10EBernhardson: "Ahh, most likely that is the case. I was trying out the macros on my local instance without the script itself." [puppet] - 10https://gerrit.wikimedia.org/r/268541 (owner: 10EBernhardson) [00:30:34] :( [00:30:46] MediaWiki internal error. Exception caught inside exception handler. Set $wgShowExceptionDetails = true; at the bottom of LocalSettings.php to show detailed debugging information. [00:30:51] Error: 1146 Table 'orwikisource.shorturls' doesn't exist (10.64.16.24) [00:30:55] did you create the extension tables? [00:31:04] oh [00:31:34] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [00:31:34] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [00:31:34] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [00:31:34] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [00:31:34] RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [00:31:35] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [00:32:16] Just on Special:ShortUrl though :) [00:32:20] aude: revert [00:32:38] oh and more.. 
[00:33:24] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [00:33:24] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [00:33:24] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [00:33:28] I imagine on page save also [00:33:29] doing [00:33:33] for new page [00:34:38] same for bhwiki and maiwiki [00:36:00] trying to add the tables, i get db error (read only) [00:36:06] doing the revert now [00:36:38] !log aude@mira Synchronized wmf-config/InitialiseSettings.php: revert shorturl changes (duration: 01m 17s) [00:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:36:41] are you doing it from mira? [00:36:52] you can't write to the dbs from mira, it's in codfw [00:36:55] errors gone :) [00:37:45] Krenair: ok, i can do from terbium [00:37:46] ? [00:37:51] yes [00:37:59] k [00:38:58] Grand total of 52 exceptions bubbled up to logstash for that ;) (Really going this time, night all) [00:39:10] confirmed: exceptions gone [00:39:28] good night addshore [00:40:40] mafk: suppose i have to populate the tables too? [00:40:58] or maybe can be done later? [00:41:04] aude: not sure, it's my first swat [00:41:27] There is a script for that: populateShortUrlTable.php [00:41:34] i see the script [00:41:47] During previous deployments, we ran it [00:41:52] but it were for small wikis. [00:41:56] For medium wikis, it could take time. [00:42:11] mai, bh and orwikisource are rather small [00:42:14] probably needs enabling now and then run the script [00:42:26] * aude proceeds [00:42:46] oh is the official url shortener today? 
[00:43:28] bblack: no, but mafk has decided to process configuration requests for 3 new wikis who wants Extension:ShortUrl now [00:43:44] oh ok cool [00:44:00] but if it is much of a problem, we can wait [00:44:13] what worries me is the patch for commons [00:44:14] PROBLEM - restbase endpoints health on restbase1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.221, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [00:44:19] no, no problem from my end, I was just curious [00:44:47] mobrovac: ^^ [00:44:49] mobrovac: [00:44:52] I'll schedule a logo change for monday, and EP namespace updates too [00:44:56] syncing again [00:45:04] what patch for commons? [00:45:05] looking into it [00:45:19] ah yes, sorry, my bad [00:45:29] shorturl on commons? [00:45:45] aude: nope, uploaddomain [00:46:01] !log aude@mira Synchronized wmf-config/InitialiseSettings.php: Re-enable ShortUrl on maiwiki, bhwiki and orwikisource, after creating db table (duration: 01m 18s) [00:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:46:05] oh [00:46:07] $wgCopyUploadsDomains [00:46:27] * mafk checks shorturl wikis now [00:46:56] shorturl looks ok on orwikisource [00:47:09] still need to populate but can do that at the end of swat [00:47:27] I see links to shorturl on sidebar too [00:47:44] RECOVERY - restbase endpoints health on restbase1002 is OK: All endpoints are healthy [00:47:50] mutante: gwicke: ^^ [00:47:52] the special pages work [00:48:01] just they don't find anything yet [00:48:02] mobrovac: what happened? 
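(Annotation on the ShortUrl exchange above: the first sync failed with "Table 'orwikisource.shorturls' doesn't exist" because the config flag went live before the extension table was created. The following is a minimal sketch of a pre-flight check that refuses to enable a flag until its table exists. It uses an in-memory SQLite database as a stand-in for the production databases; the helper names and the placeholder column are hypothetical, and only the flag name wmgUseShortUrl and the table name shorturls come from the log.)

```python
import sqlite3


def table_exists(conn, name):
    """Return True if a table with this name exists (SQLite stand-in)."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None


def enable_extension(conn, config, wiki, flag, required_table):
    """Hypothetical helper: set the flag only once the table is in place."""
    if not table_exists(conn, required_table):
        raise RuntimeError(
            f"{wiki}: table {required_table!r} is missing; "
            f"create it before enabling {flag}"
        )
    config.setdefault(flag, {})[wiki] = True


conn = sqlite3.connect(":memory:")
config = {}

# First attempt mirrors the failed sync: the table has not been created yet.
try:
    enable_extension(conn, config, "orwikisource", "wmgUseShortUrl", "shorturls")
except RuntimeError as err:
    print("refused:", err)

# Create the table (placeholder column, not the real schema), then retry.
conn.execute("CREATE TABLE shorturls (id INTEGER PRIMARY KEY)")
enable_extension(conn, config, "orwikisource", "wmgUseShortUrl", "shorturls")
print(config)
```

(The order matches what was eventually done in the log: create the table, re-enable the setting, then populate the table afterwards.)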
[00:48:07] mobrovac: alright :) [00:48:18] gwicke: forgot to enable puppet on that one [00:48:19] https://or.wikisource.org/s/4 works :) [00:48:29] i'll do https://gerrit.wikimedia.org/r/#/c/267677/ now [00:48:32] it looks ok to me [00:48:41] ah, the perils of lack of automation ;) [00:48:48] :) [00:48:53] ok aude, the worrying one for me [00:49:05] what is worrying abou tit? [00:49:21] it* [00:49:21] hope that it works, that's all [00:49:23] ok [00:49:33] looks good to me [00:49:46] (03CR) 10Aude: [C: 032] Adding museumvictoria.com.au domain to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267677 (https://phabricator.wikimedia.org/T125387) (owner: 10MarcoAurelio) [00:50:45] (03Merged) 10jenkins-bot: Adding museumvictoria.com.au domain to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267677 (https://phabricator.wikimedia.org/T125387) (owner: 10MarcoAurelio) [00:52:45] aude: you sysop at commons? I'm not sure I can test this works since it needs access to the gwstoolset [00:52:58] !log aude@mira Synchronized wmf-config/InitialiseSettings.php: Add museumvictoria.com.au to $wgCopyUploadsDomains (duration: 01m 17s) [00:53:15] mafk: i am though not entirely sure how to use gwtoolset :/ [00:53:30] maybe can try tomorrow, if you remind me [00:53:59] i'll do your last patch now [00:54:00] Dereckson: you there? [00:54:20] * aude sees nothing obvious broken with the last thing deployed [00:54:24] I regret having resigned the rights... I could have tested myself :D [00:54:54] (03CR) 10Aude: [C: 032] Set $wgEnotifMinorEdits = true on huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267558 (https://phabricator.wikimedia.org/T125351) (owner: 10MarcoAurelio) [00:55:44] (03Merged) 10jenkins-bot: Set $wgEnotifMinorEdits = true on huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267558 (https://phabricator.wikimedia.org/T125351) (owner: 10MarcoAurelio) [00:56:23] Testing. 
[00:57:47] Copy uploads are not available from this domain. [00:57:56] :/ [00:58:00] hmm [00:58:04] !log aude@mira Synchronized wmf-config/InitialiseSettings.php: Set $wgEnotifMinorEdits to true on huwiki (duration: 01m 16s) [00:58:05] *. issue, I tested http://museumvictoria.com.au/images/thumbnail.jpg?i=/pages/60807/ImageGallery/000009676c-39-web.jpg&resizewidth=true&w=475&h=238 [00:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:58:24] mafk: would you have a URL submitted when filing the task? [00:58:32] should I remove *. from the beginning then? [00:58:41] mafk: please check the minoredits thing [00:58:47] That depends, maybe they host images also on subdomains. [00:58:47] on it [00:59:13] matt_flaschen: around? [00:59:27] oh it should be webapi.aucklandmuseum.com [00:59:32] Testing again. [00:59:47] Email me also for minor edits of pages and files is on Special:Preferences aude, so I think it works. [01:00:02] mafk: ok [01:00:12] aude, here. [01:00:20] aude, mafk > works for webapi. 
so [01:00:33] (03CR) 10Aude: [C: 032] Set $wgArticlePlaceholderImageProperty for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268592 (owner: 10Aude) [01:00:37] !log restbase deploy end of 2aef1b67a0 [01:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:00:43] https://phabricator.wikimedia.org/T125387 <-- Dereckson [01:00:43] mutante: gwicke: ^^ [01:00:44] all good [01:00:53] url samples were provided there [01:00:54] nicer metrics NOW [01:01:01] collections.museumvictoria.com.au [01:01:03] mutante: thnx for your help, greatly appreciated [01:01:09] i'm adding a couple of my own things into swat [01:01:11] mobrovac: yw [01:01:21] (03CR) 10Aude: [C: 032] Remove WB_EXPERIMENTAL_FEATURES (was labs only) setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268472 (owner: 10Aude) [01:01:38] (03Merged) 10jenkins-bot: Set $wgArticlePlaceholderImageProperty for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268592 (owner: 10Aude) [01:01:59] and doing matt_flaschen's patch now [01:02:00] Woo, AP! [01:02:22] addshore: it's just something i missed and would work better with the correct setting on beta [01:02:22] (03Merged) 10jenkins-bot: Remove WB_EXPERIMENTAL_FEATURES (was labs only) setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268472 (owner: 10Aude) [01:02:39] :) [01:03:29] Testing a third time so. [01:03:50] Yes, works. [01:05:26] Dereckson: museumvictoria works then? 
[01:05:42] mafk: Dereckson if the commons setting needs tweaking, i suggest another patch at swat on monday [01:05:54] PROBLEM - puppet last run on ganeti2006 is CRITICAL: CRITICAL: puppet fail [01:05:57] sure [01:06:07] * aude should figure out how to use gwtoolset [01:06:22] !log aude@mira Synchronized wmf-config/: Sync wikidata config changes for beta (duration: 01m 15s) [01:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:06:51] waiting for jenkins on echo [01:07:47] mafk: yup [01:08:40] aude: I don't think that will be necessary, I first tested with an image shown on the museum website, but not one they intend to use for their GLAM batch upload. [01:08:48] Dereckson: ok, good [01:10:42] aude: don't forget to populate shorturl tables from terbium :) [01:10:49] mafk: yep [01:11:01] or fluorine, IDK, too many server names [01:11:09] * aude goes to do that while waiting for jenkins [01:11:11] * mafk = n00b [01:12:03] mafk: populated [01:12:12] Thanks [01:12:27] so I can go to sleep now [01:12:37] :) [01:12:44] good night [01:13:37] thank you and likewise for when you do the same :) [01:13:48] phab. tickets closed btw [01:14:07] ok [01:14:15] mafk, fluorine is where logs are kept [01:14:24] * aude patiently waits for jenkins [01:14:31] terbium was correct [01:14:34] Krenair: thanks :) [01:14:40] Thanks mafk for taking care of the deployment. 
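(Annotation on the $wgCopyUploadsDomains exchange above: the bare museumvictoria.com.au URL was rejected while collections.museumvictoria.com.au was accepted, which is what one would expect if a leading "*." stands for exactly one subdomain label. The sketch below models that rule for illustration only. The label-by-label matching is an assumption, the function names are hypothetical, and this is not MediaWiki's own code.)

```python
def host_matches(host, pattern):
    """Compare host and pattern label by label; '*' matches one label (assumed rule)."""
    host_labels = host.lower().split(".")
    pattern_labels = pattern.lower().split(".")
    if len(host_labels) != len(pattern_labels):
        return False
    return all(p == "*" or p == h for p, h in zip(pattern_labels, host_labels))


def is_allowed(host, allowed_patterns):
    """Return True if any pattern in the allow list matches the host."""
    return any(host_matches(host, pattern) for pattern in allowed_patterns)


# Hypothetical allow-list entry modelled on the "*." prefix discussed in the log.
ALLOWED = ["*.museumvictoria.com.au"]

print(is_allowed("museumvictoria.com.au", ALLOWED))              # False
print(is_allowed("collections.museumvictoria.com.au", ALLOWED))  # True
```

(Under this assumed rule, covering both the bare domain and its subdomains would take two entries, one with and one without the "*." prefix.)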
[01:14:55] It was a nice experience [01:19:03] matt_flaschen: deploying now [01:20:48] !log aude@mira Synchronized php-1.27.0-wmf.12/extensions/Echo: Re-add user rights messages in Echo (duration: 01m 20s) [01:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:21:47] (03PS2) 10Dzahn: cassandra: fix top-scope vars without namespaces [puppet] - 10https://gerrit.wikimedia.org/r/266975 [01:21:55] ah [01:21:58] this needs scap [01:22:10] * aude proceeds [01:22:33] !log aude@mira Started scap: Re-add user rights messages in Echo [01:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:22:58] hopefully doesn't take too long [01:23:01] (03CR) 10Dzahn: "where is $initial_token being set and in what scope? can't find it" [puppet] - 10https://gerrit.wikimedia.org/r/266975 (owner: 10Dzahn) [01:27:29] thanks for running swat aude [01:27:36] (03PS5) 10Tim Starling: For all apache access logs, use the WMF cache log format [puppet] - 10https://gerrit.wikimedia.org/r/268022 [01:27:53] Krenair: happy to do it [01:28:50] (03CR) 10Tim Starling: "PS5: document the log format on wikitech.wikimedia.org and use for the fallback instead of relying on configura" [puppet] - 10https://gerrit.wikimedia.org/r/268022 (owner: 10Tim Starling) [01:29:21] Yeah. I appreciate it aude. [01:30:38] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/1680/restbase1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/266975 (owner: 10Dzahn) [01:33:21] just a little question, am I allowed to logstash? 
It says to enter username/password of labs, but not sure I'll have access [01:33:24] RECOVERY - puppet last run on ganeti2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:34:18] mafk: use your wikitech credentials [01:35:05] tgr: keeps asking [01:35:12] I guess I'm not cleared [01:35:19] yeah [01:35:48] mafk: you need to be in the wmf ldap group [01:37:19] mafk: you can sign an NDA and ask bd808 to change the LDAP rule to wmf+nda [01:37:36] ...too late [01:37:52] hmm.. it might actually be +nda already [01:37:55] most such rules use wmf+nda+ops I think [01:38:18] * aude is not in wmf ldap group [01:38:22] but is nda [01:38:36] yeah, so we must have fixed that already [01:41:26] checked. it is ops + nda + wmf -- https://github.com/wikimedia/operations-puppet/blob/b0c41c9a09cc67f5d0fc3420cc897434d6840420/hieradata/role/common/logstash.yaml#L51-L54 [01:41:35] the message lists the groups allowed [01:41:51] for more information about the mess which is our ldap groups, see https://wikitech.wikimedia.org/wiki/LDAP_Groups [01:42:41] * bd808 knows Krenair is still chapped about ldap wmf granting +2 [01:47:10] !log aude@mira Finished scap: Re-add user rights messages in Echo (duration: 24m 37s) [01:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:47:39] matt_flaschen: ^ [01:47:56] looks good to me (the messages are there now) [01:49:36] Thanks, aude. [01:49:47] sure [01:49:52] swat is done :) [01:50:17] (03PS6) 10Tim Starling: For all apache access logs, use the WMF cache log format [puppet] - 10https://gerrit.wikimedia.org/r/268022 [01:50:23] (03CR) 10Tim Starling: [C: 032] For all apache access logs, use the WMF cache log format [puppet] - 10https://gerrit.wikimedia.org/r/268022 (owner: 10Tim Starling) [01:50:49] Can someone who is admin on MediaWiki.org add then remove autopatrolled on my account? I have that anyway as admin, but that will let me test the fix? 
[01:51:00] https://www.mediawiki.org/w/index.php?title=Special%3AUserRights&user=Mattflaschen-WMF [01:51:07] matt_flaschen: doing [01:51:42] added [01:51:47] (03CR) 10Alex Monk: [C: 04-1] mediawiki: Clean up beta sites Apache configs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268578 (owner: 10Krinkle) [01:51:48] removed [01:52:29] !log deploying apache log format change following successful test on deployment-prep [01:52:32] Thanks, aude. Got the email right away, and it looks right. :) [01:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:52:42] Emails rather [01:53:01] great :) [01:55:40] (03CR) 10Alex Monk: "Thanks for working on this, by the way." [puppet] - 10https://gerrit.wikimedia.org/r/268578 (owner: 10Krinkle) [02:10:14] PROBLEM - puppet last run on radium is CRITICAL: CRITICAL: puppet fail [02:10:35] PROBLEM - puppet last run on logstash1004 is CRITICAL: CRITICAL: Puppet has 1 failures [02:11:35] PROBLEM - puppet last run on mw1099 is CRITICAL: CRITICAL: Puppet has 1 failures [02:12:05] PROBLEM - puppet last run on mw2128 is CRITICAL: CRITICAL: Puppet has 1 failures [02:12:05] PROBLEM - puppet last run on mw2066 is CRITICAL: CRITICAL: Puppet has 1 failures [02:12:14] PROBLEM - puppet last run on mw2013 is CRITICAL: CRITICAL: Puppet has 1 failures [02:12:34] PROBLEM - puppet last run on mw1088 is CRITICAL: CRITICAL: Puppet has 1 failures [02:12:36] PROBLEM - puppet last run on mw1117 is CRITICAL: CRITICAL: Puppet has 1 failures [02:20:15] !log restbase deploy start of caae1f7 [02:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:22:46] RECOVERY - puppet last run on radium is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [02:27:24] !log restbase deploy end of caae1f7 [02:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:27:56] that run of puppet failures was apparently due to apache 
reloading on strontium [02:32:46] (03PS1) 10Mobrovac: Revert "Revert "RESTBase: enable metrics batching"" [puppet] - 10https://gerrit.wikimedia.org/r/268611 (https://phabricator.wikimedia.org/T121231) [02:34:24] RECOVERY - puppet last run on logstash1004 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [02:34:36] (03CR) 10Mobrovac: "Resurrected in I3d505061b3058837c183d8ce47c4a3c06980a63d" [puppet] - 10https://gerrit.wikimedia.org/r/267917 (owner: 10Mobrovac) [02:35:14] RECOVERY - puppet last run on mw1099 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [02:36:15] RECOVERY - puppet last run on mw1088 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [02:36:24] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [02:37:36] RECOVERY - puppet last run on mw2128 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:37:44] RECOVERY - puppet last run on mw2066 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:37:45] RECOVERY - puppet last run on mw2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:39:27] (03CR) 10Tulsi Bhagat: [C: 031] Namespaces configuration on mai.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268573 (https://phabricator.wikimedia.org/T125801) (owner: 10Dereckson) [03:44:54] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: puppet fail [04:01:27] (03PS1) 10BryanDavis: Add Tool namespace to wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268616 (https://phabricator.wikimedia.org/T122865) [04:12:05] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [04:36:11] (03PS1) 10Dzahn: grafana::dashboard, 'ensure' by default, move first [puppet] - 10https://gerrit.wikimedia.org/r/268619 [04:37:14] (03PS2) 
10Dzahn: grafana::dashboard, $ensure first, present by default [puppet] - 10https://gerrit.wikimedia.org/r/268619 [04:48:46] (03CR) 10Ori.livneh: [C: 032] grafana::dashboard, $ensure first, present by default [puppet] - 10https://gerrit.wikimedia.org/r/268619 (owner: 10Dzahn) [05:54:26] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 59.09% of data above the critical threshold [5000000.0] [06:01:56] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [06:29:24] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: puppet fail [06:31:05] PROBLEM - puppet last run on db1056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:05] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:35] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:05] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:36] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:38] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:56] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:26] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 2 failures [06:34:05] PROBLEM - puppet last run on ms-be2021 is CRITICAL: CRITICAL: Puppet has 1 failures [06:56:25] RECOVERY - puppet last run on db1056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:34] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:56:44] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:56:55] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:55] RECOVERY - 
puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:57:25] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:35] RECOVERY - puppet last run on ms-be2021 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:57:46] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:56] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:15] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:05:04] (03CR) 10Mobrovac: "It must be $::initial_token, otherwise Puppet would have complained" [puppet] - 10https://gerrit.wikimedia.org/r/266975 (owner: 10Dzahn) [07:13:03] (03PS1) 10Mobrovac: [NO BUENO NO NO] Cassandra: scope initial_token [puppet] - 10https://gerrit.wikimedia.org/r/268626 [07:13:34] ignore please ^^ [07:19:27] (03PS2) 10Mobrovac: [NO BUENO NO NO] Cassandra: scope initial_token [puppet] - 10https://gerrit.wikimedia.org/r/268626 [07:27:26] (03PS1) 10Dereckson: Use extension registration for CategoryTree [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268627 (https://phabricator.wikimedia.org/T119117) [07:29:03] (03CR) 10Mobrovac: "Ok, this is very strange. 
I scoped initial_token to $::cassandra::initial_token in Ied7e5c6851a986df17b88c3fbc8a11e81ac87d46 (unlike this " [puppet] - 10https://gerrit.wikimedia.org/r/266975 (owner: 10Dzahn) [07:35:05] (03Abandoned) 10Mobrovac: [NO BUENO NO NO] Cassandra: scope initial_token [puppet] - 10https://gerrit.wikimedia.org/r/268626 (owner: 10Mobrovac) [08:23:04] <_joe_> New Rule: [NO BUENO NO NO] is the new official tag for patches we don't want to merge [08:24:35] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [24.0] [08:26:55] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [08:28:16] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [08:30:56] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:31:05] mobrovac: (PS2) Mobrovac: [NO BUENO NO NO] ==> made my day [08:37:55] (03PS1) 10Giuseppe Lavagetto: New release [debs/pybal] - 10https://gerrit.wikimedia.org/r/268629 [08:52:32] (03PS1) 10Dereckson: Add Recherche: to wgContentNamespaces on fr.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268630 (https://phabricator.wikimedia.org/T125948) [08:52:47] (03CR) 10Ricordisamoa: "As I wrote on the task I'm not opposed to the namespace in general, but I'd wait for a strong interest by tool maintainers." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/268616 (https://phabricator.wikimedia.org/T122865) (owner: 10BryanDavis) [09:23:17] (03CR) 10Hoo man: [C: 031] Add $wgWBRepoSettings['sparqlEndpoint'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268467 (https://phabricator.wikimedia.org/T125353) (owner: 10Addshore) [09:27:44] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000000.0] [09:39:05] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [09:47:54] (03PS1) 10Ema: debian/rules: pass --no-start to varnish{log,ncsa} [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/268633 (https://phabricator.wikimedia.org/T122880) [09:51:09] (03CR) 10Ema: [C: 032 V: 032] debian/rules: pass --no-start to varnish{log,ncsa} [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/268633 (https://phabricator.wikimedia.org/T122880) (owner: 10Ema) [10:05:59] (03CR) 10Addshore: "Scheduled for deployment in SWAT on 8th Feb (along with something else I am doing and something else aude is doing)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268467 (https://phabricator.wikimedia.org/T125353) (owner: 10Addshore) [10:15:49] !log rolling reboot of ocg* cluster [10:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:28:15] (03PS1) 10Elukey: Add mw1228.eqiad back to the DSH list. 
Phab Task: T122005 [puppet] - 10https://gerrit.wikimedia.org/r/268636 [10:32:34] PROBLEM - Host ocg1001 is DOWN: PING CRITICAL - Packet loss = 100% [10:34:35] RECOVERY - Host ocg1001 is UP: PING OK - Packet loss = 0%, RTA = 1.56 ms [10:35:24] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 825 [10:36:38] ^ ocg is me, downtime too small [10:40:14] RECOVERY - check_mysql on db1008 is OK: Uptime: 1450914 Threads: 2 Questions: 8726538 Slow queries: 9776 Opens: 3366 Flush tables: 2 Open tables: 400 Queries per second avg: 6.014 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:46:18] (03CR) 10Filippo Giunchedi: "LGTM, but you should use Bug: T122005 in the commit message for it to be referenced in phabricator" [puppet] - 10https://gerrit.wikimedia.org/r/268636 (owner: 10Elukey) [10:48:04] <_joe_> yeah what godog said :) [10:48:07] (03PS2) 10Elukey: Add mw1228.eqiad back to the DSH list. Bug: T122005 [puppet] - 10https://gerrit.wikimedia.org/r/268636 (https://phabricator.wikimedia.org/T122005) [10:49:03] amended :) [10:49:14] _joe_: heh, still not seeing the message in the ticket, no idea how long it takes heh, not really that important [10:49:41] <_joe_> godog: nope, it will happen eventually I guess [10:49:55] <_joe_> oh ok I know why [10:50:09] <_joe_> elukey: no whitespace between Bug: and the change id [10:50:20] <_joe_> it should be the last manual line of the commit [10:51:28] (03PS3) 10Elukey: Add mw1228.eqiad back to the DSH list. Bug:T122005 [puppet] - 10https://gerrit.wikimedia.org/r/268636 (https://phabricator.wikimedia.org/T122005) [10:52:48] (03PS4) 10Elukey: Add mw1228.eqiad back to the DSH list. 
Bug:T122005 Change-Id: I3da9b7a2b26c494d38ecdffa2731eb8f4c0b34c0 Signed-off-by: elukey [puppet] - 10https://gerrit.wikimedia.org/r/268636 (https://phabricator.wikimedia.org/T122005) [10:53:11] (03PS1) 10Ema: Install man page: vmod-tbf [software/varnish/libvmod-tbf] (debian) - 10https://gerrit.wikimedia.org/r/268640 (https://phabricator.wikimedia.org/T124281) [10:54:02] _joe_: this one has a space but got referenced in phab: https://gerrit.wikimedia.org/r/#/c/260251/1//COMMIT_MSG [10:54:11] (03CR) 10Ema: [C: 032 V: 032] Install man page: vmod-tbf [software/varnish/libvmod-tbf] (debian) - 10https://gerrit.wikimedia.org/r/268640 (https://phabricator.wikimedia.org/T124281) (owner: 10Ema) [10:55:59] anyhow, I'll proceed :) [10:56:03] <_joe_> elukey: so, puppet is all good? [10:56:09] <_joe_> if so, proceed [10:56:37] (03CR) 10Elukey: [C: 032] Add mw1228.eqiad back to the DSH list. Bug:T122005 Change-Id: I3da9b7a2b26c494d38ecdffa2731eb8f4c0b34c0 Signed-off-by: elukey _joe_: I tried to run puppet agent -tv but it complains about /usr/bin/salt-call (Minion failed to authenticate), I thought that it was due to the absence of it in the dsh list [10:59:21] <_joe_> nope [10:59:27] <_joe_> that has nothing to do with it [10:59:40] <_joe_> you should accept the salt key on neodymium [11:00:30] goood one Luca, one assumption taken, one wrong :D [11:01:59] <_joe_> after that, you probably need to kick the salt minion on mw1228 (and hope that works :P), then re-run puppet [11:02:12] <_joe_> as salt is needed for scap amongst other things [11:02:33] <_joe_> so yeah, we provide scap on mediawiki hosts using trebuchet, the other deployment system we have [11:02:37] <_joe_> also deprecated [11:02:49] <_joe_> #somuchtechdebt [11:03:54] ah scap is provided through trebuchet?? 
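The commit-message exchange above (where the `Bug:` line belongs relative to `Change-Id:`) can be illustrated with a small checker. This is only a sketch in Python: it assumes that task references are picked up from the final paragraph of the commit message, as _joe_ suggests, and the name `find_bug_footers` and the regular expression are hypothetical, not the actual Gerrit or Phabricator logic. The log leaves open whether a space after `Bug:` matters, so the sketch accepts both forms.

```python
import re

# Hypothetical pattern: "Bug:" plus optional whitespace and a Phabricator task ID.
BUG_FOOTER = re.compile(r"^Bug:\s*(T\d+)$")


def find_bug_footers(commit_message: str) -> list[str]:
    """Return task IDs that sit on their own line in the last paragraph."""
    paragraphs = [p for p in commit_message.strip().split("\n\n") if p.strip()]
    if not paragraphs:
        return []
    found = []
    for line in paragraphs[-1].splitlines():
        match = BUG_FOOTER.match(line.strip())
        if match:
            found.append(match.group(1))
    return found


# Reference in its own footer paragraph, next to Change-Id: detected.
footer_style = (
    "Add mw1228.eqiad back to the DSH list\n"
    "\n"
    "Bug: T122005\n"
    "Change-Id: I3da9b7a2b26c494d38ecdffa2731eb8f4c0b34c0\n"
)
# Reference appended to the subject line: not detected by this sketch.
subject_style = "Add mw1228.eqiad back to the DSH list. Bug: T122005\n"

print(find_bug_footers(footer_style))
print(find_bug_footers(subject_style))
```

Under these assumptions, patch sets 1 to 3 above would not be detected because the reference is part of the subject line, while a message with the reference in the closing footer block would be.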
[11:04:11] * elukey feels in the movie inception [11:04:32] maybe feels like, better [11:16:15] RECOVERY - puppet last run on mw1228 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [11:17:27] \o/ [11:17:51] _joe_: proceeding with dsh [11:18:11] <_joe_> elukey: cool [11:36:01] !log start swiftrepl replication pass of common thumbs eqiad -> codfw [11:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:37:31] !log rebooting db2038 to db2040 for kernel update [11:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:42:05] RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 70871 bytes in 3.322 second response time [11:51:16] RECOVERY - mediawiki-installation DSH group on mw1228 is OK: OK [11:54:51] !log rebooting db2041 to db2044 for kernel update [11:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:59:26] (03PS1) 10Elukey: Re-Add mw1228.eqiad.wmnet to the api_appserver pool Bug: T122005 [puppet] - 10https://gerrit.wikimedia.org/r/268648 (https://phabricator.wikimedia.org/T122005) [12:05:40] !log l10nupdate@tin LocalisationUpdate failed: git clone of core failed [12:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:05:52] !log l10nupdate@tin LocalisationUpdate failed: git pull of core failed [12:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:07:27] !log reimporting nlwiktionary pages into labs [12:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:15:21] (03PS1) 10Ema: Install man page: vmod_vslp [software/varnish/libvmod-vslp] (debian-wmf) - 10https://gerrit.wikimedia.org/r/268651 (https://phabricator.wikimedia.org/T124281) [12:15:37] !log rebooting db2045 to db2049 for kernel update [12:15:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:16:19] (03CR) 10Ema: [C: 032 V: 032] Install man page: vmod_vslp [software/varnish/libvmod-vslp] (debian-wmf) - 10https://gerrit.wikimedia.org/r/268651 (https://phabricator.wikimedia.org/T124281) (owner: 10Ema) [12:34:52] !log rebooting db2050 to db2054 for kernel update [12:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:35:39] (03CR) 10Giuseppe Lavagetto: [C: 032] Re-Add mw1228.eqiad.wmnet to the api_appserver pool Bug: T122005 [puppet] - 10https://gerrit.wikimedia.org/r/268648 (https://phabricator.wikimedia.org/T122005) (owner: 10Elukey) [12:35:50] <_joe_> elukey: you can submit the patch whenever you want [12:36:03] thanks! [12:36:23] (03CR) 10Elukey: [C: 032] Re-Add mw1228.eqiad.wmnet to the api_appserver pool Bug: T122005 [puppet] - 10https://gerrit.wikimedia.org/r/268648 (https://phabricator.wikimedia.org/T122005) (owner: 10Elukey) [12:38:57] !log repooled mw1228.eqiad.wmnet [12:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:42:31] <_joe_> http://config-master.wikimedia.org/conftool/eqiad/api still reports it depooled, but I won't trust it too much [12:43:40] <_joe_> mw1228.eqiad.wmnet:disabled/up/not pooled yup [12:43:48] <_joe_> did you actually repool it with confctl? 
[12:45:14] I was checking the logs sorry :) [12:45:34] I used confctl --find --action set/pooled=active mw1228.eqiad.wmnet [12:45:40] on palladium [12:45:43] <_joe_> pooled=yes [12:47:01] mw1228.eqiad.wmnet: pooled changed no => yes [12:48:09] ah snap I missed it in https://wikitech.wikimedia.org/wiki/Conftool#Pooling.2Fdepooling_a_server_from_all_the_related_services [12:48:52] { 'host': 'mw1228.eqiad.wmnet', 'weight':10, 'enabled': True } [12:50:23] the logs on the host look good, much better now :) [12:50:59] !log rebooting db2055 to db2059 for kernel update [12:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:54:57] *waves at everyone* Does anyone know if there is a way to get nagios / icinga to check data from graphite? [12:55:36] I mean, a way that we already use somewhere. Or should I go and look for a nagios plugin! [13:03:22] addshore: yeah we have one [13:03:26] check_graphite probably [13:03:35] look for it in puppet.git [13:03:44] PROBLEM - Host elastic1021 is DOWN: PING CRITICAL - Packet loss = 100% [13:03:51] not sure whether it is available for labs Shinken / how to set it up for labs though [13:04:21] ahh yes I see a bunch of things in puppet now! [13:04:55] moritzm: I guess it's you ^ [13:05:12] yep, looking at it [13:05:13] hashar: shinken? wahhaa that's the first I have heard of shinken [13:05:40] addshore: looks like a fork of nagios or icinga with a different web GUI [13:05:41] ahh yes, found check_graphite in nagios_common now [13:05:54] https://en.wikipedia.org/wiki/Shinken_(software) [13:06:04] another fork with another UI? :P [13:06:06] well [13:06:15] no that one is written in python apparently [13:06:23] whereas Nagios is in C [13:06:46] so I guess Shinken has some kind of compatibility layer to be able to reuse the tons of Nagios plugins that have been written for the last 16+ years [13:06:51] monitoring::graphite_threshold [13:07:10] Right, think I have found what I need! Thanks hashar ! 
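The Graphite-backed alerts in this log ("60.00% of data above the critical threshold [24.0]", "Less than 50.00% above the threshold [16.0]") suggest how such a threshold check decides its state. The following Python sketch illustrates that idea only; it is not the real check_graphite plugin or the monitoring::graphite_threshold define. The function name, parameter names, the strict `>` comparison, and the handling of missing datapoints as `None` are all assumptions.

```python
from typing import Optional, Sequence


def graphite_threshold_state(
    datapoints: Sequence[Optional[float]],
    warning: float,
    critical: float,
    percentage: float,
) -> str:
    """Classify a series by the share of datapoints above each threshold."""
    # Assumption: missing samples arrive as None and are simply ignored.
    values = [v for v in datapoints if v is not None]
    if not values:
        return "UNKNOWN: no valid datapoints"
    # Check the critical threshold first, then the warning threshold.
    for label, threshold in (("CRITICAL", critical), ("WARNING", warning)):
        over = 100.0 * sum(v > threshold for v in values) / len(values)
        if over >= percentage:
            return (
                f"{label}: {over:.2f}% of data above the "
                f"{label.lower()} threshold [{threshold}]"
            )
    return f"OK: Less than {percentage:.2f}% above the threshold [{warning}]"


# Made-up load samples; thresholds mirror the labstore1001 alert text above.
busy = [30.0, 28.0, 26.0, 12.0, None, 10.0]
calm = [10.0, 12.0, 18.0, 11.0, 9.0]

print(graphite_threshold_state(busy, warning=16.0, critical=24.0, percentage=50.0))
print(graphite_threshold_state(calm, warning=16.0, critical=24.0, percentage=50.0))
```

Judging by a percentage of the window rather than a single sample keeps one short spike from paging anyone, which matters for the edit-rate and lag checks addshore goes on to propose.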
[13:07:20] yw! [13:17:21] (03PS1) 10Addshore: Add WDQS_Lag monitoring [puppet] - 10https://gerrit.wikimedia.org/r/268657 [13:19:35] addshore: :-} [13:19:40] :) [13:19:55] probably going to add a few more too [13:20:17] Ideally I need to work on getting the stuff that generates the numbers into puppet ish too [13:20:46] but right now I don't even know where to start with that.. [13:24:46] ACKNOWLEDGEMENT - Host elastic1021 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff Memory problem, T125973 [13:28:57] (03PS1) 10ArielGlenn: dumps: workaround for pgrep check if script already running [puppet] - 10https://gerrit.wikimedia.org/r/268659 [13:32:04] (03PS1) 10Addshore: Add wikidata.org high edit count monitoring [puppet] - 10https://gerrit.wikimedia.org/r/268662 [13:32:13] (03CR) 10ArielGlenn: [C: 032] dumps: workaround for pgrep check if script already running [puppet] - 10https://gerrit.wikimedia.org/r/268659 (owner: 10ArielGlenn) [13:34:04] (03CR) 10jenkins-bot: [V: 04-1] Add wikidata.org high edit count monitoring [puppet] - 10https://gerrit.wikimedia.org/r/268662 (owner: 10Addshore) [13:34:29] !log starting rolling reboots of cp* (traffic cache hosts) for kernel updates [13:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:35:15] (03PS2) 10Addshore: Add wikidata.org high edit count monitoring [puppet] - 10https://gerrit.wikimedia.org/r/268662 [13:37:20] (03PS1) 10Addshore: Add Addshore to wikidata contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/268664 (https://phabricator.wikimedia.org/T125975) [13:38:22] (03CR) 10Addshore: [C: 04-1] "Pending resolution of the bug" [puppet] - 10https://gerrit.wikimedia.org/r/268664 (https://phabricator.wikimedia.org/T125975) (owner: 10Addshore) [13:40:58] (03CR) 10Hashar: [C: 031] Add Addshore to wikidata contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/268664 (https://phabricator.wikimedia.org/T125975) (owner: 10Addshore) [13:57:14] PROBLEM - Router interfaces on 
cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave]BR [13:57:14] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave]BR [13:57:48] ^ is that the new wave? [13:57:56] no, old one [13:58:20] well "old" as in not the one turned up yesterday [13:58:21] (03CR) 10Aude: "+1" [puppet] - 10https://gerrit.wikimedia.org/r/268664 (https://phabricator.wikimedia.org/T125975) (owner: 10Addshore) [13:59:05] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [13:59:05] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 [13:59:20] - Maintenance window: [13:59:20] Start Date and Time: 2016-Feb-05 13:00 UTC [13:59:21] End Date and Time: 2016-Feb-05 14:00 UTC [13:59:28] ^ was planned maint window on that link [14:06:53] !log rebooting db2060 to db2064 for kernel update [14:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:13:46] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 165 not-conn: cp3003_v6 [14:14:48] ^ we'll probably have a number of ipsec alerts today, from race-condition windows on quick cpNNNN machine reboots vs icinga 3x check -> critical [14:14:58] not easy to avoid with our current ipsec monitoring, but not really an issue [14:15:22] (03CR) 10Addshore: "@aude is wikidata-monitoring a mailing list we control?" 
[puppet] - 10https://gerrit.wikimedia.org/r/268664 (https://phabricator.wikimedia.org/T125975) (owner: 10Addshore) [14:15:44] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 166 ESP OK [14:16:20] aude: how can I subscribe to wikidata-monitoring? [14:18:36] another stupid question: is there a consolidated list of public SSH keys from all servers that I could import in my known_hosts ? [14:20:49] !log reimporting nlwiktionary revision into labs (expect some temporary lag on labs-s3) [14:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:24:01] !log rebooting db2065 to db2070 for kernel update [14:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:30:59] gehel: yep every production machine has every other machine's host keys in /etc/ssh/ssh_known_hosts [14:31:15] addshore: ask Tobi_WMDE_SW_rem [14:31:34] godog: but not labs it seems ... [14:32:32] gehel: ah indeed, not on labs no [14:34:28] godog: thx for the info [14:35:32] (03PS1) 10Addshore: Rename wdqs-admins contact groun to wdqs [puppet] - 10https://gerrit.wikimedia.org/r/268673 [14:35:35] (03PS1) 10Addshore: Add wikidata-monitoring to wdqs contact group [puppet] - 10https://gerrit.wikimedia.org/r/268674 [14:35:53] aude: ^^ [14:36:11] (03PS2) 10Addshore: Rename wdqs-admins contact group to wdqs [puppet] - 10https://gerrit.wikimedia.org/r/268673 [14:36:19] (03PS2) 10Addshore: Add wikidata-monitoring to wdqs contact group [puppet] - 10https://gerrit.wikimedia.org/r/268674 [14:36:39] (03PS1) 10Ema: Install man page: vmod_header [software/varnish/libvmod-header] (debian) - 10https://gerrit.wikimedia.org/r/268675 (https://phabricator.wikimedia.org/T124281) [14:39:43] (03CR) 10Ema: [C: 031] VCL: do not use illegal "trusted" XFF values for XCIP [puppet] - 10https://gerrit.wikimedia.org/r/266486 (https://phabricator.wikimedia.org/T120121) (owner: 10BBlack) [14:41:10] (03CR) 10Ema: [C: 032 V: 032] Install man page: vmod_header 
[software/varnish/libvmod-header] (debian) - 10https://gerrit.wikimedia.org/r/268675 (https://phabricator.wikimedia.org/T124281) (owner: 10Ema) [14:41:58] !log confctl mw1228.eqiad.wmnet: weight changed 10 => 20 [14:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:52:02] jynus: please ping e.g. me if you feel a wikidata related production problem doesn't get enough attention [14:52:31] addshore: looks ok, though don't think i have permissions yet to restart query service or do anything to fix it [14:52:58] aude: it would probably still be good to know if it goes down (imo) [14:53:11] sure [14:53:14] I think jzerebecki has a little access there [14:53:23] unless that was some old dead group [14:53:24] jzerebecki, sure [14:53:27] will do [14:53:50] jynus: we are everywhere ;) I am now even in #wikimedia-databases :D [14:54:32] (03PS2) 10Filippo Giunchedi: Revert "Revert "RESTBase: enable metrics batching"" [puppet] - 10https://gerrit.wikimedia.org/r/268611 (https://phabricator.wikimedia.org/T121231) (owner: 10Mobrovac) [14:54:44] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "Revert "RESTBase: enable metrics batching"" [puppet] - 10https://gerrit.wikimedia.org/r/268611 (https://phabricator.wikimedia.org/T121231) (owner: 10Mobrovac) [14:54:46] aude, addshore: yea I should have access to restart wdqs, but never tried [14:55:05] one thing I want to point all of you, addshore, is a recently approved RFC [14:55:06] it is documented https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual [14:55:24] urandom: merged ^ let me know what node you want to use as canary [14:55:40] aude: you should request access to the wdqs-admins group [14:55:48] * aude might set it up locally to try it out and hack etc. [14:55:53] addshore, https://www.mediawiki.org/wiki/Development_policy#Database_policy and https://phabricator.wikimedia.org/T112637 [14:55:55] godog: i'm on 1001; do you want me to apply it there? 
[14:55:59] and then request access [14:56:04] hope you can give it a read [14:56:22] urandom: yup that works! [14:56:48] !log forcing puppet run and bouncing restbase on restbase1001.eqiad.wmnet (https://gerrit.wikimedia.org/r/#/c/268611/) [14:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:57:04] jynus: *reads* [14:57:27] not now, as in read it, share it, try to apply it short-term [14:57:44] godog: ok; done [14:58:47] urandom: yup, LGTM looks like it is batching as expected [14:59:20] godog: cool, i'll give it some minutes before proceeding [14:59:29] urandom: I'll keep an eye on graphite/statsd if something comes up as you proceed [14:59:56] (03PS1) 10Yurik: Set CSP to false [puppet] - 10https://gerrit.wikimedia.org/r/268677 [15:00:04] andrewbogott moritzm: Dear anthropoid, the time has come. Please deploy Wikitech maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160205T1500). [15:00:41] (03PS1) 10Giuseppe Lavagetto: scap: clone repositories for l10update [puppet] - 10https://gerrit.wikimedia.org/r/268678 [15:01:29] (03PS2) 10Yurik: For Kartotherian maps, set CSP to false [puppet] - 10https://gerrit.wikimedia.org/r/268677 [15:04:59] (03PS2) 10Giuseppe Lavagetto: scap: clone repositories for l10update [puppet] - 10https://gerrit.wikimedia.org/r/268678 [15:05:10] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] scap: clone repositories for l10update [puppet] - 10https://gerrit.wikimedia.org/r/268678 (owner: 10Giuseppe Lavagetto) [15:05:51] godog: still looks good, ready to continue? 
[15:07:40] urandom: yup [15:08:28] !log performing rolling restbase restart to apply config change (https://gerrit.wikimedia.org/r/#/c/268611/) [15:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:13:35] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 3 failures [15:14:35] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:15:52] !log restbase rolling restart complete [15:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:15:57] godog: ^^^ [15:16:18] looking into labmon, apt process apparently stalled [15:17:26] I overslept, sorry, I’m here now [15:17:55] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3005_v4, cp3005_v6 [15:18:05] PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3005_v4, cp3005_v6 [15:18:14] urandom: ack, I'm looking at https://grafana.wikimedia.org/dashboard/db/graphite-eqiad there was a spike in udp errors/rcvbuf, waiting to see if it recovers or if it is too much traffic for statsdlb to process [15:18:24] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3005_v4, cp3005_v6 [15:18:25] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3005_v4, cp3005_v6 [15:18:25] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3005_v4, cp3005_v6 [15:18:32] godog: ok [15:18:35] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 164 not-conn: cp3005_v4, cp3005_v6 [15:18:55] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3005_v4, cp3005_v6 [15:19:04] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 164 not-conn: cp3005_v4, cp3005_v6 [15:19:06] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 164 not-conn: cp3005_v4, cp3005_v6 [15:19:25] PROBLEM - IPsec on kafka1014 is CRITICAL: 
Strongswan CRITICAL - ok: 164 not-conn: cp3005_v4, cp3005_v6 [15:19:25] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 164 not-conn: cp3005_v4, cp3005_v6 [15:19:34] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3005_v4, cp3005_v6 [15:19:36] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3005_v4, cp3005_v6 [15:20:34] ^ ipsec alerts are just fallout of rolling reboots going slower / race conditions, sometimes [15:22:27] !log rebooting labnet1001 for kernel update [15:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:22:58] <_joe_> !log initializing mediawiki repos on tin [15:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:45] PROBLEM - Host cp3005 is DOWN: PING CRITICAL - Packet loss = 100% [15:24:13] !log cp3005 didn't come back online during rolling reboot, investigating (remains depooled) [15:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:24:23] godog: seems to have recovered [15:24:57] (03PS1) 10Giuseppe Lavagetto: scap: correct clone directives [puppet] - 10https://gerrit.wikimedia.org/r/268681 [15:25:15] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2008_v4, cp2008_v6 [15:25:15] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2008_v4, cp2008_v6 [15:25:15] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2008_v4, cp2008_v6 [15:25:15] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2008_v4, cp2008_v6 [15:25:16] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2008_v4, cp2008_v6 [15:25:24] (03PS2) 10Giuseppe Lavagetto: scap: correct clone directives [puppet] - 10https://gerrit.wikimedia.org/r/268681 [15:25:35] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 58 
not-conn: cp2008_v4, cp2008_v6 [15:25:44] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2008_v4, cp2008_v6 [15:25:54] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2008_v4, cp2008_v6 [15:25:55] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2008_v4, cp2008_v6 [15:26:05] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2008_v4, cp2008_v6 [15:26:18] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] scap: correct clone directives [puppet] - 10https://gerrit.wikimedia.org/r/268681 (owner: 10Giuseppe Lavagetto) [15:26:25] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2008_v4, cp2008_v6 [15:26:25] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2008_v4, cp2008_v6 [15:26:34] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2008_v4, cp2008_v6 [15:26:40] maybe I can detune the ipsec check during this work [15:26:48] (make icinga fail more times in a row before alerting, for today) [15:26:49] (03PS1) 10Elukey: Adding a new email template for Burrow lag alerts. [puppet] - 10https://gerrit.wikimedia.org/r/268682 [15:27:32] !log rebooting labcontrol1002 for kernel update [15:27:35] RECOVERY - Host cp3005 is UP: PING OK - Packet loss = 0%, RTA = 86.30 ms [15:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:45] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 58 ESP OK [15:27:46] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 58 ESP OK [15:27:46] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 58 ESP OK [15:28:05] (03CR) 10jenkins-bot: [V: 04-1] Adding a new email template for Burrow lag alerts. 
[puppet] - 10https://gerrit.wikimedia.org/r/268682 (owner: 10Elukey) [15:28:16] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 58 ESP OK [15:28:55] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 58 ESP OK [15:29:05] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 58 ESP OK [15:29:05] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 58 ESP OK [15:29:25] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 58 ESP OK [15:29:47] !log rebooting holmium for kernel update [15:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:35] PROBLEM - Host cp2008 is DOWN: PING CRITICAL - Packet loss = 100% [15:30:37] (03PS1) 10BBlack: temporarily de-sensitize ipsec icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/268683 [15:31:35] PROBLEM - Host 208.80.154.20 is DOWN: CRITICAL - Host Unreachable (208.80.154.20) [15:31:42] (03PS1) 10Ema: Display a message in motd if puppet agent is disabled [puppet] - 10https://gerrit.wikimedia.org/r/268684 [15:32:35] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:32:42] (03PS2) 10Elukey: Adding a new email template for Burrow lag alerts. 
[puppet] - 10https://gerrit.wikimedia.org/r/268682 [15:33:14] RECOVERY - Host 208.80.154.20 is UP: PING OK - Packet loss = 0%, RTA = 2.08 ms [15:33:16] !log re-restarting restbase on restbase1002.eqiad.wmnet,restbase1005.eqiad.wmnet,restbase1006.eqiad.wmnet,restbase1009.eqiad.wmnet (prior restarts may have happened before puppet run) [15:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:10] (03PS3) 10Jcrespo: Testing db jessie installer problems on db2030 [puppet] - 10https://gerrit.wikimedia.org/r/267681 (https://phabricator.wikimedia.org/T125256) [15:34:27] (03CR) 10BBlack: [C: 032] temporarily de-sensitize ipsec icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/268683 (owner: 10BBlack) [15:34:54] (03CR) 10Hoo man: [C: 04-1] Add WDQS_Lag monitoring (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/268657 (owner: 10Addshore) [15:34:55] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 60 ESP OK [15:34:55] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 60 ESP OK [15:34:56] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 60 ESP OK [15:34:56] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 60 ESP OK [15:35:05] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 60 ESP OK [15:35:06] RECOVERY - Host cp2008 is UP: PING OK - Packet loss = 0%, RTA = 36.80 ms [15:35:12] !log rebooting silver for kernel update - wikitech outage will ensue [15:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:15] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [15:35:35] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 60 ESP OK [15:35:44] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 60 ESP OK [15:35:45] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 166 ESP OK [15:35:46] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [15:36:03] (03CR) 10Alex Monk: wgRCWatchCategoryMembership true everywhere except wikisource [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/264734 (owner: 10Addshore) [15:36:05] running puppet on install2001 too, as I am to reinstall a codfw server [15:36:14] (03CR) 10Alex Monk: wgRCWatchCategoryMembership true on wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264735 (owner: 10Addshore) [15:36:14] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK [15:36:15] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [15:36:15] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 166 ESP OK [15:36:15] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 60 ESP OK [15:36:18] (03CR) 10Addshore: [C: 04-1] Add WDQS_Lag monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268657 (owner: 10Addshore) [15:36:24] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 166 ESP OK [15:36:36] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 166 ESP OK [15:36:44] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 166 ESP OK [15:37:19] (03PS4) 10Jcrespo: Testing db jessie installer problems on db2030 [puppet] - 10https://gerrit.wikimedia.org/r/267681 (https://phabricator.wikimedia.org/T125256) [15:38:25] <_joe_> !log launched l10update cronjob manually, was not running since tin's reimaging [15:39:06] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 60 ESP OK [15:39:09] huh, l10nupdate didn't get moved to mira? [15:39:27] (03CR) 10Jcrespo: [C: 032] Testing db jessie installer problems on db2030 [puppet] - 10https://gerrit.wikimedia.org/r/267681 (https://phabricator.wikimedia.org/T125256) (owner: 10Jcrespo) [15:39:30] (03PS2) 10Addshore: Add WDQS_Lag monitoring [puppet] - 10https://gerrit.wikimedia.org/r/268657 [15:39:39] (03CR) 10Hoo man: [C: 04-1] "It's tempting to set these thresholds to very low values as that gives a feeling of control." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/268662 (owner: 10Addshore) [15:40:02] (03CR) 10Jcrespo: "It was installer dependent, I am just verifying it now works." 
[puppet] - 10https://gerrit.wikimedia.org/r/267681 (https://phabricator.wikimedia.org/T125256) (owner: 10Jcrespo) [15:40:36] (Cannot access the database: Can't connect to MySQL server on '208.80.154.136' (111) (208.80.154.136)) [15:40:58] Looks like wikitechwiki's database went down [15:41:11] wikitech is supposed to be down at the moment SPF|Cloud [15:41:28] shouldn't that be in the topic here then? :) [15:41:35] (or whatever) [15:42:12] 15:35 < andrewbogott> !log rebooting silver for kernel update - wikitech outage will ensue [15:42:16] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit nfs-exports is failed [15:42:19] no SPF|Cloud, it was announced to wikitech-l and labs-l a couple of days ago [15:42:24] oh, sorry [15:42:40] I missed that message because of the icinga-wm spam [15:43:32] although, I do wonder if mysql is supposed to be stopped [15:43:35] krenair@silver:~$ service mysql status [15:43:35] mysql stop/waiting [15:43:54] would expect it to start on boot.. [15:44:03] wiki seems to work again? 
[15:44:50] yep, back up [15:44:56] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3006_v4, cp3006_v6 [15:45:05] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3006_v4, cp3006_v6 [15:45:15] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3006_v4, cp3006_v6 [15:45:26] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 164 not-conn: cp3006_v4, cp3006_v6 [15:46:46] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 58 ESP OK [15:47:04] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 58 ESP OK [15:47:05] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 58 ESP OK [15:47:12] !log performing rolling restbase restart in staging env [15:47:15] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 166 ESP OK [15:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:48:16] (03CR) 10Addshore: "So the numbers came from the data over the last 3 months." [puppet] - 10https://gerrit.wikimedia.org/r/268662 (owner: 10Addshore) [15:49:37] wikitech is back up, but I’m resetting user sessions so probably best to not log in for a few minutes. [15:50:56] (03CR) 10JanZerebecki: [C: 031] Add WDQS_Lag monitoring [puppet] - 10https://gerrit.wikimedia.org/r/268657 (owner: 10Addshore) [15:52:08] (03CR) 10Hoo man: "Once we improve dispatching (or add more dispatcher power), the edit rate will grow again, We should probably keep that in mind… let's onl" [puppet] - 10https://gerrit.wikimedia.org/r/268662 (owner: 10Addshore) [15:52:51] (03CR) 10JanZerebecki: "I think Lydia does." 
[puppet] - 10https://gerrit.wikimedia.org/r/268664 (https://phabricator.wikimedia.org/T125975) (owner: 10Addshore) [15:53:10] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2011_v6 [15:53:46] !log oblivian@tin sync-l10n completed (1.27.0-wmf.12) (duration: 00m 08s) [15:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:53:53] (03PS3) 10Elukey: Adding a new email template for Burrow lag alerts. [puppet] - 10https://gerrit.wikimedia.org/r/268682 [15:55:54] (03PS1) 10JanZerebecki: icinga contactgroups: add jzerebecki and irc-wikidata to wdqs [puppet] - 10https://gerrit.wikimedia.org/r/268687 [15:57:46] (03PS4) 10Elukey: Adding a new email template for Burrow lag alerts. [puppet] - 10https://gerrit.wikimedia.org/r/268682 [15:58:19] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 60 ESP OK [16:01:12] ipsec alerts should be quieter/gone now from the rolling reboots [16:01:19] unless a host gets stuck mid-reboot again [16:03:36] !log reimaging db2030 to test jessie installer [16:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:10:44] wikitech should now be completely back to normal. Everyone will need to log in afresh. 
[16:18:33] (03PS1) 10Ottomata: Reenable kafka1012 broker [puppet] - 10https://gerrit.wikimedia.org/r/268689 (https://phabricator.wikimedia.org/T125199) [16:19:44] (03CR) 10Ottomata: [C: 032 V: 032] Reenable kafka1012 broker [puppet] - 10https://gerrit.wikimedia.org/r/268689 (https://phabricator.wikimedia.org/T125199) (owner: 10Ottomata) [16:20:43] !log reenabling kafka1012 in analytics-eqiad kafka cluster [16:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:24:00] RECOVERY - Kafka Broker Server on kafka1012 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [16:24:11] <_joe_> and it paged :P [16:25:09] :) [16:26:05] (03PS5) 10Elukey: Adding a new email template for Burrow lag alerts. [puppet] - 10https://gerrit.wikimedia.org/r/268682 [16:26:39] elukey: just make lag-alert-email.tmpl be the default for $email_template in the module [16:26:39] also, it needs to be [16:26:41] 'burrow/lag-alert-email-tmpl' [16:26:53] and don't pass it in role::analytics::burrow [16:27:07] you are making the burrow module default email template be this [16:27:15] overriding the package [16:27:36] the user (role class) shouldn't need to think about it unless it wants to actually override the module's setting [16:28:07] (03PS6) 10Elukey: Adding a new email template for Burrow lag alerts. [puppet] - 10https://gerrit.wikimedia.org/r/268682 [16:29:55] ottomata: I thought that the point was to put the choice of the email template to use outside the module [16:30:05] and the role seemed the right choice [16:30:07] no?
[16:30:38] you are parameterizing $email_template so that if someone wanted to override your choice with their own, they could [16:30:44] but, you are providing a default template in the module [16:30:48] so you might as well make that the default [16:30:59] and not force users to specify unless they want to change and use their own [16:31:16] all right, adding the last change [16:31:40] otherwise it will just be copy/paste the same param value in every usage of the module, [16:32:58] (03PS7) 10Elukey: Adding a new email template for Burrow lag alerts. [puppet] - 10https://gerrit.wikimedia.org/r/268682 [16:33:12] oops wrong chat but oh well :) [16:37:04] (03CR) 10Ottomata: "Agree, we should parameterize the value." [puppet] - 10https://gerrit.wikimedia.org/r/268594 (https://phabricator.wikimedia.org/T125916) (owner: 10Nuria) [16:37:15] (03CR) 10Ottomata: "s/value/variable" [puppet] - 10https://gerrit.wikimedia.org/r/268594 (https://phabricator.wikimedia.org/T125916) (owner: 10Nuria) [16:38:58] (03PS8) 10Elukey: Adding a new email template for Burrow lag alerts. [puppet] - 10https://gerrit.wikimedia.org/r/268682 [16:40:24] (03CR) 10jenkins-bot: [V: 04-1] Adding a new email template for Burrow lag alerts. [puppet] - 10https://gerrit.wikimedia.org/r/268682 (owner: 10Elukey) [16:46:38] (03PS3) 10Dzahn: Add WDQS_Lag monitoring [puppet] - 10https://gerrit.wikimedia.org/r/268657 (owner: 10Addshore) [16:47:06] 6operations, 7Graphite, 5Patch-For-Review: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#2002536 (10fgiunchedi) [16:47:08] 6operations, 10RESTBase, 7Graphite, 5Patch-For-Review, 7service-runner: restbase should send metrics in batches - https://phabricator.wikimedia.org/T121231#2002534 (10fgiunchedi) 5Open>3Resolved @gwicke AFAICT yes, there's ~8k udp packets/s less being received by graphite after the switch [16:47:24] (03PS9) 10Elukey: Adding a new email template for Burrow lag alerts. 
[puppet] - 10https://gerrit.wikimedia.org/r/268682 [16:47:43] 6operations, 10RESTBase, 7Graphite, 5Patch-For-Review, 7service-runner: restbase should send metrics in batches - https://phabricator.wikimedia.org/T121231#2002538 (10GWicke) @fgiunchedi: Great, thanks! [16:47:45] (03CR) 10Dzahn: [C: 032] Add WDQS_Lag monitoring [puppet] - 10https://gerrit.wikimedia.org/r/268657 (owner: 10Addshore) [16:48:16] 6operations, 7Monitoring, 5Patch-For-Review: switch diamond to use graphite line protocol - https://phabricator.wikimedia.org/T121861#2002540 (10fgiunchedi) [16:48:18] 6operations, 7Graphite, 5Patch-For-Review: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#1330890 (10fgiunchedi) [16:49:38] 6operations, 7Icinga, 5Patch-For-Review: Add contact for Addshore in icinga - https://phabricator.wikimedia.org/T125975#2002544 (10Dzahn) a:3Dzahn [16:51:11] 6operations, 10RESTBase, 7Graphite, 5Patch-For-Review, 7service-runner: restbase should send metrics in batches - https://phabricator.wikimedia.org/T121231#2002551 (10Addshore) Decrease can be seen here below: https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?from=1454681397639&to=1454689589408&p... [16:51:59] 6operations, 5Patch-For-Review: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#2002552 (10elukey) After a chat with Joe with decided to postpone the start of the work to Monday morning to avoid big changes on a Friday (especially done by me). I'll start Monday morning... [16:52:06] (03CR) 10Ottomata: [C: 032] "Cool, looks good. Will merge later this afternoon." [puppet] - 10https://gerrit.wikimedia.org/r/268682 (owner: 10Elukey) [16:52:32] (03Abandoned) 10Addshore: Add Addshore to wikidata contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/268664 (https://phabricator.wikimedia.org/T125975) (owner: 10Addshore) [16:53:16] (03CR) 10Dzahn: "eh, i am creating your contact right now.. 
and was about to merge this in a minute" [puppet] - 10https://gerrit.wikimedia.org/r/268664 (https://phabricator.wikimedia.org/T125975) (owner: 10Addshore) [16:53:23] 6operations, 7Icinga, 5Patch-For-Review: Add contact for Addshore in icinga - https://phabricator.wikimedia.org/T125975#2002558 (10Addshore) I have been added to the wikidata-monitoring contact email list now. However I would still appreciate my contact being added so that I can use it in the future! [16:54:01] (03CR) 10Addshore: "See https://phabricator.wikimedia.org/T125975#2002558" [puppet] - 10https://gerrit.wikimedia.org/r/268664 (https://phabricator.wikimedia.org/T125975) (owner: 10Addshore) [16:54:47] 6operations, 7Icinga, 5Patch-For-Review: Add contact for Addshore in icinga - https://phabricator.wikimedia.org/T125975#2002561 (10Dzahn) Added the contact as requested above. done [16:55:00] 6operations, 7Icinga, 5Patch-For-Review: Add contact for Addshore in icinga - https://phabricator.wikimedia.org/T125975#2002562 (10Addshore) 5Open>3Resolved [16:56:06] (03PS2) 10Dzahn: icinga contactgroups: add jzerebecki and irc-wikidata to wdqs [puppet] - 10https://gerrit.wikimedia.org/r/268687 (owner: 10JanZerebecki) [16:57:00] (03Abandoned) 10Addshore: Add wikidata-monitoring to wdqs contact group [puppet] - 10https://gerrit.wikimedia.org/r/268674 (owner: 10Addshore) [16:57:26] (03PS1) 10ArielGlenn: allow dataset servers to have rsync access to /srv/dumps on labstore [puppet] - 10https://gerrit.wikimedia.org/r/268692 (https://phabricator.wikimedia.org/T117180) [17:00:13] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures [17:01:18] (03PS9) 10Krinkle: [WIP] Implement /w/static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) [17:04:04] RECOVERY - DPKG on labmon1001 is OK: All packages OK [17:04:10] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 4 others: Decom 
parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#2002590 (10BBlack) Took another log of all traffic today, for ~1 hour. Excluding our own healthcheck/monitoring... [17:05:13] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures [17:07:59] !log rolling cpNNNN reboots are 27% complete, only two hosts so far failed to reboot on their own (but came up fine after manual racadm powercycle) [17:07:59] (03PS1) 10Addshore: WDQS_Lag monitoring increase timespan [puppet] - 10https://gerrit.wikimedia.org/r/268695 [17:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:08:56] mutante ^^ [17:10:00] addshore: ok, anything that avoids "not enough data point" failed monitoring checks :) [17:10:10] xD [17:10:13] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures [17:10:17] yeh, 30 mins should be plenty [17:10:25] its my first time using the graphite nagios check thing [17:10:43] (03CR) 10Dzahn: [C: 032] WDQS_Lag monitoring increase timespan [puppet] - 10https://gerrit.wikimedia.org/r/268695 (owner: 10Addshore) [17:12:31] addshore: they can just be a bit obscure, like "crit - 22% of datapoints above threshold" and then you dont know what it actually means.. or yes, fails because not enough data points available [17:13:17] heh, well, hopefully it behaves! 
If not you might be seeing a patch to remove it ;) [17:13:26] the threshold on this is massive anyway ;) [17:13:41] yes, we'll just see, it's not like it's paging us [17:14:14] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1013 is OK: OK: Less than 50.00% above the threshold [1.0] [17:15:14] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures [17:15:27] addshore: you knew https://gerrit.wikimedia.org/r/#/c/268687/ too i assume [17:15:47] so that will create IRC output but in the wikidata channel [17:16:06] yep I spotted that, thus I abandoned https://gerrit.wikimedia.org/r/#/c/268674/ [17:16:11] icinga-wm is configured per channel/logfile [17:16:15] 'k [17:16:27] (03CR) 10Addshore: [C: 031] icinga contactgroups: add jzerebecki and irc-wikidata to wdqs [puppet] - 10https://gerrit.wikimedia.org/r/268687 (owner: 10JanZerebecki) [17:17:07] Not sure if the wikidata-monitoring one needs to be in there right now, I'm sure we will decide over the next days / weeks.. 
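[editor's note] The Graphite-backed check discussed above (17:10 to 17:13) reports statuses such as "crit - 22% of datapoints above threshold", and it can fail outright when there are not enough data points. A minimal Python sketch of that idea follows. It is only an illustration: the function name, the default percentage and the "more than half undefined" cutoff are assumptions made for this note, not the code of the plugin actually running in icinga.

```python
from typing import Optional, Sequence


def evaluate_series(
    datapoints: Sequence[Optional[float]],
    threshold: float,
    crit_percent: float = 50.0,        # hypothetical: CRITICAL at or above this share
    max_undefined_ratio: float = 0.5,  # hypothetical: UNKNOWN above this share of gaps
) -> str:
    """Classify a window of datapoints; None marks a missing sample."""
    if not datapoints:
        return "UNKNOWN: no datapoints returned"

    defined = [p for p in datapoints if p is not None]
    undefined_ratio = 1 - len(defined) / len(datapoints)

    # Too many gaps: refuse to decide instead of alerting on thin data.
    if not defined or undefined_ratio > max_undefined_ratio:
        return "UNKNOWN: too many undefined datapoints"

    above = sum(1 for p in defined if p > threshold)
    percent_above = 100.0 * above / len(defined)

    if percent_above >= crit_percent:
        return f"CRITICAL: {percent_above:.0f}% of datapoints above threshold"
    return f"OK: {percent_above:.0f}% of datapoints above threshold"


if __name__ == "__main__":
    # A freshly created metric: mostly gaps, so the check cannot decide yet.
    print(evaluate_series([None, None, None, 1.0], threshold=10))
    # A populated metric with a generous threshold.
    print(evaluate_series([1, 2, 3, 40], threshold=10))
```

Under these assumptions, widening the time window (as in the "increase timespan" patch) gives the check more samples to work with, so a metric that reports only occasionally is less likely to fall into the undecidable case.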
[17:17:09] (03PS3) 10Dzahn: icinga contactgroups: add jzerebecki and irc-wikidata to wdqs [puppet] - 10https://gerrit.wikimedia.org/r/268687 (owner: 10JanZerebecki) [17:17:31] ok [17:17:45] (03CR) 10Dzahn: [C: 032] icinga contactgroups: add jzerebecki and irc-wikidata to wdqs [puppet] - 10https://gerrit.wikimedia.org/r/268687 (owner: 10JanZerebecki) [17:18:43] PROBLEM - puppet last run on labstore1002 is CRITICAL: Timeout while attempting connection [17:19:14] PROBLEM - Host labstore1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:19:59] ^ not really and I got it [17:20:13] RECOVERY - check_puppetrun on americium is OK: OK: Puppet is currently enabled, last run 142 seconds ago with 0 failures [17:20:46] (03CR) 10BryanDavis: "Call for consensus sent to labs-l () with slowvote poll on Phabric" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268616 (https://phabricator.wikimedia.org/T122865) (owner: 10BryanDavis) [17:21:27] (03CR) 10Madhuvishy: "Since email.tmpl is not really using any variables from puppet and is not erb template - May be it belongs in a files/ directory rather th" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268682 (owner: 10Elukey) [17:21:29] (03CR) 10BryanDavis: [C: 04-2] "Waiting for conclusion of community poll" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268616 (https://phabricator.wikimedia.org/T122865) (owner: 10BryanDavis) [17:21:42] (03CR) 10Dzahn: [C: 031] "+1, +Reedy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268004 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [17:22:17] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#2002627 (10GWicke) While breaking parsoid-prod.wmflab.org's favicon is a heavy price to pay, all those sacrifice... 
[17:25:42] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#2002633 (10BBlack) Sometime between now and the 22nd, I'll try to get a full capture for a period of multiple da... [17:26:07] mutante: looks like I might have to poke it into submission a bit more (perhaps actually from the other side, and make sure data is sent for each host every minute asap) [17:28:01] addshore: i was about to paste the link in icinga, but you see it, right? :) yea, that's the common thing with those checks. UNKNOWN: More than half of the datapoints are undefined [17:28:21] it might be normal for some time, if they are also new in graphite [17:28:22] Leaving it there for 24 hours or so shouldn't be an issue, should it? [17:28:32] no issue [17:28:33] I have a flight in an hour ;) [17:28:39] cool, filing a ticket to remind myself! [17:28:42] don't worry at all [17:28:43] ok [17:30:09] (03CR) 10Addshore: "it looks like I might have to poke it into submission a bit more (perhaps actually from the other side, and make sure data is sent for eac" [puppet] - 10https://gerrit.wikimedia.org/r/268695 (owner: 10Addshore) [17:31:06] !log trouble shooting elastic1021 [17:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:32:10] 6operations, 10DBA, 10procurement: es2011-es2020 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2002670 (10RobH) 3NEW a:3RobH [17:32:16] (03PS10) 10Elukey: Adding a new email template for Burrow lag alerts. [puppet] - 10https://gerrit.wikimedia.org/r/268682 [17:32:27] (03CR) 10Elukey: Adding a new email template for Burrow lag alerts.
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268682 (owner: 10Elukey) [17:32:34] 6operations, 10ops-codfw: es2011-es2020 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2002670 (10RobH) [17:33:35] addshore: it already turned OK one of the 2 servers [17:36:24] 6operations, 10ops-codfw: es2011-es2020 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2002689 (10RobH) My proposed racking plan above should be reality checked by both @jcrespo (for ES configuration and review) and @papaul (for power and outlet availability in each rack.) @Papaul:... [17:36:52] 6operations, 10ops-codfw: es2011-es2020 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2002693 (10RobH) [17:37:02] 6operations, 10ops-codfw: es2011-es2020 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2002670 (10RobH) [17:39:29] 6operations, 10ops-eqiad: Hardware problem (probably memory) on elastic1021 - https://phabricator.wikimedia.org/T125973#2002716 (10Cmjohnson) The original error produced this message during post Error: Memory initialization warning detected. MEMBIST Memory Test failure DIMM A3 Swapped DIMM to B3 and now the e... 
[17:41:24] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1022 is OK: OK: Less than 50.00% above the threshold [1.0] [17:41:39] (03PS2) 10Dzahn: Add sarin to DHCP Bug:T125752 [puppet] - 10https://gerrit.wikimedia.org/r/268425 (https://phabricator.wikimedia.org/T125752) (owner: 10Papaul) [17:44:22] (03CR) 10Dzahn: [C: 032] Add sarin to DHCP Bug:T125752 [puppet] - 10https://gerrit.wikimedia.org/r/268425 (https://phabricator.wikimedia.org/T125752) (owner: 10Papaul) [17:48:30] (03CR) 10Dzahn: "oops, missing bracket in dhcpd config on carbon..fixing" [puppet] - 10https://gerrit.wikimedia.org/r/268425 (https://phabricator.wikimedia.org/T125752) (owner: 10Papaul) [17:48:52] 6operations, 10ops-codfw: es2011-es2020 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2002774 (10jcrespo) +1 with this proposal, the cloning will be like this: ``` es2001 -> es2011 b1 1 es2002 -> es2012 c1 1 es2004 -> es2013 d1 1 es2005 -> es2014 a1 2 es2006 -> es2015 c1 2 es2007... 
[17:51:14] (03PS1) 10Dzahn: dhcp: fix syntax error, missing bracket [puppet] - 10https://gerrit.wikimedia.org/r/268705 [17:51:54] (03CR) 10Dzahn: [C: 032] dhcp: fix syntax error, missing bracket [puppet] - 10https://gerrit.wikimedia.org/r/268705 (owner: 10Dzahn) [17:53:55] (03PS6) 10Alex Monk: Initial configuration for ady.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268004 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [17:54:04] (03PS1) 10Dzahn: dhcp: replace tabs with spaces [puppet] - 10https://gerrit.wikimedia.org/r/268706 [17:58:45] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: Puppet has 1 failures [18:00:05] (03PS1) 10Thcipriani: Reload keyholder-agent on keyholder-auth change [puppet] - 10https://gerrit.wikimedia.org/r/268708 (https://phabricator.wikimedia.org/T125992) [18:00:14] PROBLEM - check_puppetrun on pay-lvs2001 is CRITICAL: CRITICAL: Puppet has 1 failures [18:00:30] (03CR) 10Jforrester: "Should probably add to wgContentNamespaces too." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/268616 (https://phabricator.wikimedia.org/T122865) (owner: 10BryanDavis) [18:00:52] (03CR) 10Ori.livneh: [C: 032 V: 032] Reload keyholder-agent on keyholder-auth change [puppet] - 10https://gerrit.wikimedia.org/r/268708 (https://phabricator.wikimedia.org/T125992) (owner: 10Thcipriani) [18:01:34] (03PS1) 10Jcrespo: Depool db1018 for reimaging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268709 (https://phabricator.wikimedia.org/T125215) [18:03:38] (03PS2) 10Dzahn: dhcp: fix syntax error, missing bracket [puppet] - 10https://gerrit.wikimedia.org/r/268705 [18:03:45] (03CR) 10Dzahn: [V: 032] dhcp: fix syntax error, missing bracket [puppet] - 10https://gerrit.wikimedia.org/r/268705 (owner: 10Dzahn) [18:04:05] (03PS2) 10Jcrespo: Depool db1018 for reimaging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268709 (https://phabricator.wikimedia.org/T125215) [18:05:13] PROBLEM - check_puppetrun on pay-lvs2001 is CRITICAL: CRITICAL: Puppet has 1 failures [18:05:37] (03PS3) 10Jcrespo: Depool db1018 for reimaging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268709 (https://phabricator.wikimedia.org/T125215) [18:06:09] (03CR) 10Subramanya Sastry: "Any chance this can get resolved today or is this waiting for Monday's meeting?" [puppet] - 10https://gerrit.wikimedia.org/r/268438 (https://phabricator.wikimedia.org/T125435) (owner: 10Jcrespo) [18:06:24] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:07:07] (03CR) 10Jcrespo: [C: 032] Depool db1018 for reimaging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268709 (https://phabricator.wikimedia.org/T125215) (owner: 10Jcrespo) [18:08:24] 6operations, 10Mathoid, 6Services, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare mathoid for the codfw switchover - https://phabricator.wikimedia.org/T125058#2002875 (10Physikerwelt) ... is there additional information what this is about? 
I honestly do not understand the context. [18:10:13] RECOVERY - check_puppetrun on pay-lvs2001 is OK: OK: Puppet is currently enabled, last run 295 seconds ago with 0 failures [18:10:30] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Depool db1018 for maintenance (duration: 02m 12s) [18:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:12:30] 6operations, 10Mathoid, 6Services, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare mathoid for the codfw switchover - https://phabricator.wikimedia.org/T125058#2002885 (10Dzahn) @Physikerwelt the context is that the operations team wants to switch all services from one datacenter, the one in (eqiad... [18:13:37] Who knows something about /srv/mediawiki/fonts? [18:13:45] (03CR) 10Jcrespo: "I was waiting for you ok, but it seems you are ok with all of this." [puppet] - 10https://gerrit.wikimedia.org/r/268438 (https://phabricator.wikimedia.org/T125435) (owner: 10Jcrespo) [18:13:59] Is there a particular reason we don't version it? 
[18:15:27] !log stopping mysql@db1018 and starting to clone it for reimaging [18:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:16:09] (03CR) 10Subramanya Sastry: Add access to m5-master:testreduce* dbs for ssastry on ruthenium (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268438 (https://phabricator.wikimedia.org/T125435) (owner: 10Jcrespo) [18:20:22] (03CR) 10Chad: [C: 032] Fix typo and sort by alphabetical order wgExtraSignatureNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267994 (owner: 10Dereckson) [18:20:44] 10Ops-Access-Requests, 6operations: let datacenter-ops read server logfiles - https://phabricator.wikimedia.org/T126018#2002928 (10Dzahn) 3NEW [18:21:22] (03Merged) 10jenkins-bot: Fix typo and sort by alphabetical order wgExtraSignatureNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267994 (owner: 10Dereckson) [18:23:48] !log demon@mira Synchronized wmf-config/InitialiseSettings.php: comment stuff, gerrit 267994 (duration: 01m 19s) [18:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:24:20] (03CR) 10Ori.livneh: [WIP] Implement /w/static.php (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [18:27:53] (03CR) 10Dduvall: [C: 04-1] "Left some comments, some of which are moot since we're going to use `deploy-local`." 
(034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/262742 (owner: 10Alexandros Kosiaris) [18:29:33] PROBLEM - Host cp3032 is DOWN: PING CRITICAL - Packet loss = 100% [18:30:13] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:33] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3032_v4, cp3032_v6 [18:32:34] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3032_v4, cp3032_v6 [18:32:34] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3032_v4, cp3032_v6 [18:32:34] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3032_v4, cp3032_v6 [18:32:45] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 164 not-conn: cp3032_v4, cp3032_v6 [18:32:45] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 164 not-conn: cp3032_v4, cp3032_v6 [18:32:55] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3032_v4, cp3032_v6 [18:33:14] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3032_v4, cp3032_v6 [18:33:23] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3032_v4, cp3032_v6 [18:33:23] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3032_v4, cp3032_v6 [18:33:29] (03PS1) 10JGirault: Bump portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268713 [18:33:35] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3032_v4, cp3032_v6 [18:33:35] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3032_v4, cp3032_v6 [18:33:43] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3032_v4, cp3032_v6 [18:33:44] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 164 not-conn: cp3032_v4, cp3032_v6 [18:33:54] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan 
CRITICAL - ok: 58 not-conn: cp3032_v4, cp3032_v6 [18:33:55] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 164 not-conn: cp3032_v4, cp3032_v6 [18:33:55] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3032_v4, cp3032_v6 [18:34:04] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 164 not-conn: cp3032_v4, cp3032_v6 [18:34:46] (03PS1) 10Papaul: Add sarin to autoinstall Bug:T125752 [puppet] - 10https://gerrit.wikimedia.org/r/268714 (https://phabricator.wikimedia.org/T125752) [18:35:13] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures [18:35:33] (03CR) 10Jcrespo: "Will add it tomorrow (something more urgent has required my attention), but please make sure *you do not use m5 as a testing machine*- it " [puppet] - 10https://gerrit.wikimedia.org/r/268438 (https://phabricator.wikimedia.org/T125435) (owner: 10Jcrespo) [18:37:45] 6operations, 6Mobile-Apps, 10RESTBase, 10Traffic: Enable RESTBase for mobile sites, or support zero headers in text varnishes - https://phabricator.wikimedia.org/T102524#1366601 (10bearND) I would like to see this implemented. T89177 is not public though. [18:39:07] I will come back later, after backup has finished [ETA:1h30] [18:40:13] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures [18:41:57] (03CR) 10Subramanya Sastry: "--safe-updates default is fine." [puppet] - 10https://gerrit.wikimedia.org/r/268438 (https://phabricator.wikimedia.org/T125435) (owner: 10Jcrespo) [18:43:26] 6operations, 6Mobile-Apps, 10RESTBase, 10Traffic: Enable RESTBase for mobile sites, or support zero headers in text varnishes - https://phabricator.wikimedia.org/T102524#2003122 (10BBlack) @BearND - you'd like to see what implemented, that isn't already? 
[18:44:11] (03PS2) 10Papaul: Add sarin to autoinstall Bug:T125752 [puppet] - 10https://gerrit.wikimedia.org/r/268714 (https://phabricator.wikimedia.org/T125752) [18:44:18] 6operations, 7Availability, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: swiftrepl replication pass for thumbnails eqiad -> codfw - https://phabricator.wikimedia.org/T125791#2003130 (10fgiunchedi) re: requested sizes, after ~1.5h of requests these are the results ``` $ sort ~/thumbs_requests | sort | uniq... [18:45:13] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures [18:45:38] (03CR) 10Dzahn: [C: 04-1] "there needs to be a ")" after the hostname (and tabs vs spaces)" [puppet] - 10https://gerrit.wikimedia.org/r/268714 (https://phabricator.wikimedia.org/T125752) (owner: 10Papaul) [18:46:26] mutante: working on it thanks [18:46:59] papaul: i'd like to replace all those tabs with spaces for once, so we dont have that issue anymore (because we did that in puppet and dns as well) [18:47:18] doing that as a separate change [18:47:49] otherwise we keep switching and mixing [18:47:56] (03PS10) 10Krinkle: [WIP] Implement /w/static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) [18:48:19] (03CR) 10Krinkle: [WIP] Implement /w/static.php (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [18:50:13] RECOVERY - check_puppetrun on americium is OK: OK: Puppet is currently enabled, last run 142 seconds ago with 0 failures [18:50:17] (03PS11) 10Krinkle: [WIP] Implement /w/static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) [18:50:29] (03CR) 10Krinkle: "Changed url prefix from /w/static to /w per Phabricator task." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [18:50:53] (03PS2) 10Dzahn: dhcp: replace tabs with spaces [puppet] - 10https://gerrit.wikimedia.org/r/268706 [18:53:39] 6operations, 6Phabricator, 6Project-Creators, 6Triagers: Requests for addition to the #project-creators group (in comments) - https://phabricator.wikimedia.org/T706#2003221 (10Luke081515) Please add me, I want help to cleanup the backlog of #project-creators, more than 60 tasks at the moment.... I read the... [18:54:47] (03PS1) 10Krinkle: [DONT MERGE] Set $wgResourceBasePath to "/w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268715 (https://phabricator.wikimedia.org/T99096) [18:56:06] (03PS1) 10Dzahn: netboot.cfg - replace tabs with spaces [puppet] - 10https://gerrit.wikimedia.org/r/268717 [18:56:35] (03PS3) 10Papaul: Add sarin to autoinstall Bug:T125752 [puppet] - 10https://gerrit.wikimedia.org/r/268714 (https://phabricator.wikimedia.org/T125752) [18:58:24] PROBLEM - Host cp3033 is DOWN: PING CRITICAL - Packet loss = 100% [19:03:05] 6operations: setup/deploy oresrdb1001-oresrdb1002 - https://phabricator.wikimedia.org/T125562#2003272 (10RobH) [19:13:03] PROBLEM - Host cp2006 is DOWN: PING CRITICAL - Packet loss = 100% [19:15:44] PROBLEM - IPsec on cp1057 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp2006_v4, cp2006_v6 [19:15:55] PROBLEM - IPsec on cp1056 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp2006_v4, cp2006_v6 [19:16:04] PROBLEM - IPsec on cp1070 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp2006_v4, cp2006_v6 [19:16:04] PROBLEM - IPsec on cp1069 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp2006_v4, cp2006_v6 [19:20:12] 6operations, 10RESTBase, 10hardware-requests: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842#2003333 (10RobH) [19:25:07] 6operations, 10Deployment-Systems, 6Performance-Team, 10Traffic, 
5Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#2003345 (10Krinkle) [19:27:24] PROBLEM - Host cp3034 is DOWN: PING CRITICAL - Packet loss = 100% [19:28:36] !log halted rolling cache reboots, we seem to be having problems with a batch of them coming back... [19:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:29:24] 6operations, 6Mobile-Apps, 10RESTBase, 10Traffic: Enable RESTBase for mobile sites, or support zero headers in text varnishes - https://phabricator.wikimedia.org/T102524#2003358 (10bearND) @BBlack I'd like to see the W0 header (X-CS) added to API responses for regular domains, in addition to the m-dot doma... [19:31:18] 6operations, 10Deployment-Systems, 6Performance-Team, 10Traffic, 5Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#2003363 (10Krinkle) [19:35:31] (03PS1) 10Dzahn: admin: add shell users for frack pentest [puppet] - 10https://gerrit.wikimedia.org/r/268722 (https://phabricator.wikimedia.org/T126012) [19:36:42] (03PS2) 10Dzahn: admin: add shell users for frack pentest [puppet] - 10https://gerrit.wikimedia.org/r/268722 (https://phabricator.wikimedia.org/T126012) [19:38:20] (03CR) 10jenkins-bot: [V: 04-1] admin: add shell users for frack pentest [puppet] - 10https://gerrit.wikimedia.org/r/268722 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [19:39:17] (03PS4) 10Papaul: Add sarin to autoinstall Bug:T125752 [puppet] - 10https://gerrit.wikimedia.org/r/268714 (https://phabricator.wikimedia.org/T125752) [19:39:25] 6operations, 6Mobile-Apps, 10RESTBase, 10Traffic: Enable RESTBase for mobile sites, or support zero headers in text varnishes - https://phabricator.wikimedia.org/T102524#2003398 (10BBlack) I'll make a separate task about that. 
In theory it's trivial to enable, but it probably needs some coordination with... [19:41:55] now that's interesting jenkins, [19:42:05] Verified +2 , followed by verified -1 [19:42:10] without an action in between ? [19:43:03] ERROR: unknown environment 'pep8' [19:43:55] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/268722 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [19:46:42] RECOVERY - Host cp2006 is UP: PING OK - Packet loss = 0%, RTA = 37.17 ms [19:46:51] RECOVERY - IPsec on cp1069 is OK: Strongswan OK - 24 ESP OK [19:47:07] 6operations, 10MobileFrontend, 10Traffic, 6Zero: Enable X-CS headers for non-mobile domains - https://phabricator.wikimedia.org/T126053#2003432 (10BBlack) 3NEW [19:47:22] RECOVERY - Host cp3034 is UP: PING OK - Packet loss = 0%, RTA = 86.50 ms [19:48:12] RECOVERY - IPsec on cp1056 is OK: Strongswan OK - 24 ESP OK [19:48:21] RECOVERY - IPsec on cp1057 is OK: Strongswan OK - 24 ESP OK [19:48:21] RECOVERY - IPsec on cp1070 is OK: Strongswan OK - 24 ESP OK [19:50:23] 6operations, 6Phabricator, 6Project-Creators: Allow the use of team projects as representation of teams - https://phabricator.wikimedia.org/T126055#2003456 (10greg) 3NEW [19:50:49] gah, didn't mean to tag operations there [19:51:47] bearND, bblack, dr0ptp4kt what can i help with with the app header issue? [19:51:51] ah, you did it to yourself: https://phabricator.wikimedia.org/herald/rule/16/ [19:54:25] (03PS5) 10Dzahn: Add sarin to autoinstall Bug:T125752 [puppet] - 10https://gerrit.wikimedia.org/r/268714 (https://phabricator.wikimedia.org/T125752) (owner: 10Papaul) [19:54:25] yurik: my preference would be that we apply tagging to desktop wikipedia if it won't destroy things. but do you know whether the apps can then still call zeroconfig on .wikipedia.org and get a legit response? i can't remember if the api endpoint processes stuff (whereas the page lifecycle code i think only runs in the context of mfe). 
bblack, what do [19:54:25] you think? [19:55:06] yurik: i guess by "destroy things", a "thing" might be the zero portal graphs! lol [19:55:22] well there's two separate issues here from my limited POV [19:55:41] (03PS6) 10Dzahn: auto-install: add sarin to netboot [puppet] - 10https://gerrit.wikimedia.org/r/268714 (https://phabricator.wikimedia.org/T125752) (owner: 10Papaul) [19:56:24] (03PS7) 10Dzahn: auto-install: add sarin to netboot [puppet] - 10https://gerrit.wikimedia.org/r/268714 (https://phabricator.wikimedia.org/T125752) (owner: 10Papaul) [19:56:30] 1) on mobile: varnish sets X-CS and X-CS2 on the inbound request, delivering those headers to mediawiki, and mediawiki can/will in turn vary on X-CS/X-CS2 as appropriate [19:56:37] (03CR) 10Dzahn: [C: 032 V: 032] auto-install: add sarin to netboot [puppet] - 10https://gerrit.wikimedia.org/r/268714 (https://phabricator.wikimedia.org/T125752) (owner: 10Papaul) [19:57:02] 2) on mobile: varnish also echoes X-CS2 (the one that's always a real carrier value, as opposed to X-CS being just "ON" for most URLs) back to the user as a response header [19:57:14] bblack, MW does not care at all about X-CS2 (just clarifying) [19:57:26] good to know! 
:) [19:57:42] bblack, or to be even more exact - it is used for debug logging [19:57:51] in other words - not very important [19:58:23] ok, so [19:58:46] dr0ptp4kt, i see no problem with the api getting proper headers all the time - i think it won't affect anything at all [19:58:59] in VCL land, on the inbound request, basically if we detect a carrier, we set X-CS2: carrierid, and then X-CS="ON" [19:59:02] with the exception: [19:59:02] if (req.url ~ "(action=zeroconfig|:ZeroRatedMobileAccess)($|&|\?)" || req.http.host ~ "^(zero|m)\.") { [19:59:05] set req.http.X-CS = req.http.X-CS2; [19:59:06] bblack, i think the quick solution is to allow all api to get headers for everything [19:59:08] } else { [19:59:10] set req.http.X-CS = "ON"; [19:59:13] } [19:59:23] yurik: yeah it's easy to do in VCL terms, I just want to step through what it means first [19:59:45] are desktop requests going to Vary:X-CS? [20:00:09] hmm, good question. I'm not sure if api calls are properly varying... checking... [20:00:16] should they even? [20:02:07] bblack, api calls for config & message vary on X-CS, X-Subdomain, X-Forwarded-By, X-Forwarded-Proto [20:02:14] regardless of the domain, etc [20:03:42] so yes, you can enable X-CS to be sent to the api regardless of which domain its for [20:04:40] bblack, and yes, they should vary on X-CS -- because you always send the real X-CS to the api's action=zeroconfig [20:04:46] (03PS1) 10Dzahn: cygnus/technetium: install nmap, include standard [puppet] - 10https://gerrit.wikimedia.org/r/268726 (https://phabricator.wikimedia.org/T126012) [20:06:03] yurik: does it matter for the rest of the API is mainly what I worry about, not just those zero ones we already mention in VCL [20:06:07] yurik and bblack, for the apps they look first at whether the response bears an X-CS header before coming back to the server to get the config. 
here's android [20:06:09] https://github.com/wikimedia/apps-android-wikipedia/blob/f862f5f5d43f35000d9b3c671d5a02ad50a4a463/app/src/main/java/org/wikipedia/zero/WikipediaZeroHandler.java#L91 [20:06:22] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [20:06:22] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [20:06:27] I mean like, is some other random /api/ call going to Vary:X-CS, or worse, will it vary its content on X-CS but then fail to vary on X-CS? [20:07:12] (03PS2) 10Dzahn: cygnus/technetium: install nmap, include standard [puppet] - 10https://gerrit.wikimedia.org/r/268726 (https://phabricator.wikimedia.org/T126012) [20:07:27] !log cygnus - reboot VM [20:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:07:37] yurik and bblack you'll notice they currently call hostSupportsZeroHeaders and constrain that to ".m.wikipedia.org", but if the desktop domains have it they'll want to unconstrain it further so it's ".wikipedia.org" [20:08:01] mdholloway bearND niedzielski dbrant see fun conversation above ^ [20:08:09] (03CR) 10Dzahn: [C: 032] cygnus/technetium: install nmap, include standard [puppet] - 10https://gerrit.wikimedia.org/r/268726 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [20:08:13] and gwicke ^^^ [20:08:40] caches are hard. [20:09:44] dr0ptp4kt: meta.wikipedia.org redirects to meta.wikimedia.org. would W0 cover it? [20:09:57] * yurik reads back [20:11:16] niedzielski: unsure about that, but would the app be accessing meta.wikipedia.org ('p', not 'm')? 
[20:11:42] (03PS1) 10BBlack: set X-CS and similar for all requests [puppet] - 10https://gerrit.wikimedia.org/r/268729 (https://phabricator.wikimedia.org/T126053) [20:11:49] ^ this is the patch I think everyone is asking about, and it's easy [20:11:54] dr0ptp4kt: we could use either but are either covered? [20:11:57] I just don't know what all the fallout might be [20:11:57] niedzielski: in terms of actual operator billing, generally speaking they're all using ip address whitelists, so that's how they ensure people aren't charged. but as far as banners and croutons and api results and such, that requires stuff on our side [20:12:28] dr0ptp4kt: ah, ok cool :) [20:12:45] niedzielski: it looks like the ip addresses, at least for ipv4, are the same for meta.wikipedia.org and meta.wikimedia.org, so on the billing front it should be the same. [20:13:10] dr0ptp4kt: excellent, thank you [20:14:43] RECOVERY - SSH on cygnus is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0) [20:15:22] RECOVERY - Host cp3033 is UP: PING OK - Packet loss = 0%, RTA = 86.06 ms [20:15:53] RECOVERY - puppet last run on cygnus is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:16:30] right now, I'm pretty sure mobile domain requests output Vary:X-CS, and desktop domain requests do not, for example [20:17:01] so that different HTML contents can be cached for whether to put the X-CS:ON stuff at the top of the page or not [20:17:21] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:17:21] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:17:37] if desktop requests in mediawiki start caring about X-CS:ON to vary content and don't output Vary:X-CS (on *all* responses from mediawiki, for anything, basically), then we're in trouble [20:17:54] bblack: the current REST API end points all don't vary on carrier, but if there was one eventually it 
certainly would need to add the vary [20:18:17] but what about everything else on the desktop domain? [20:18:26] e.g. /wiki/Foo, /w/api.php?... [20:19:46] maybe a better way to put this in generic terms: [20:19:50] bblack: your concern is that MW looks at the tag in general, and will start to vary its output (possibly without sending appropriate vary headers) once they are set? [20:20:01] yes [20:20:58] to restate in a better way than I have: "If any applayer thing (MW or a service) varies its content on some input other than hostname+URI, it had better be setting an appropriate Vary: header" [20:21:11] looking through https://github.com/search?q=%40wikimedia+X-CS&ref=searchresults&type=Code&utf8=%E2%9C%93 [20:21:14] (and always sending that Vary header in responses for that hostname+URI regardless of input) [20:21:17] cp3032 ? [20:21:22] mutante: I'm on it [20:21:27] 'k [20:23:04] gwicke: right... so what concerns me there is someone might turn on the ZeroPortal extension that pays attention to X-CS for content variance, on desktop wikis, and it doesn't include the Vary:X-CS being set in mediawiki-config's mobile-landing.php [20:23:11] or something along those lines [20:23:55] I really don't know where in what extensions we vary content output for wiki pages on X-CS [20:24:33] zerobanner and zeroportal are prominent [20:25:20] and MobileFrontend [20:25:26] https://github.com/wikimedia/mediawiki-extensions-MobileFrontend/blob/b6af61fc3357efad52ee6bcb3bdd6070fb580ae4/includes/MobileFrontend.hooks.php#L176 [20:25:31] basically the structure of code/config doesn't match up well with how Vary works....
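To make the Vary concern above concrete, here is a minimal sketch of a shared cache that keys on hostname + URI plus whatever request headers the last response listed in Vary. It is a toy model, not Varnish's actual lookup; `ToyCache`, `careless_backend` and `careful_backend` are hypothetical names invented for illustration, and header-name case handling is ignored.

```python
class ToyCache:
    """Toy shared cache: key is host + URI plus the request headers named in Vary."""

    def __init__(self):
        self._vary = {}     # (host, uri) -> header names from the last Vary seen
        self._objects = {}  # full cache key -> cached body

    def _key(self, host, uri, req_headers):
        names = self._vary.get((host, uri), ())
        return (host, uri) + tuple((n, req_headers.get(n, "")) for n in names)

    def fetch(self, host, uri, req_headers, backend):
        key = self._key(host, uri, req_headers)
        if key in self._objects:
            return self._objects[key]
        # Cache miss: ask the application layer, remember what it varies on.
        body, vary = backend(req_headers)
        self._vary[(host, uri)] = tuple(sorted(vary))
        self._objects[self._key(host, uri, req_headers)] = body
        return body


def careless_backend(req_headers):
    # Hypothetical extension: output depends on X-CS, but no Vary is sent.
    body = "zero banner" if req_headers.get("X-CS") == "ON" else "no banner"
    return body, []


def careful_backend(req_headers):
    # Same output, but Vary: X-CS is sent on every response for this URI.
    body = "zero banner" if req_headers.get("X-CS") == "ON" else "no banner"
    return body, ["X-CS"]


bad = ToyCache()
bad.fetch("en.wikipedia.org", "/wiki/Foo", {"X-CS": "ON"}, careless_backend)
leaked = bad.fetch("en.wikipedia.org", "/wiki/Foo", {}, careless_backend)

good = ToyCache()
good.fetch("en.wikipedia.org", "/wiki/Foo", {"X-CS": "ON"}, careful_backend)
separate = good.fetch("en.wikipedia.org", "/wiki/Foo", {}, careful_backend)
```

In this model the careless backend's zero-rated page is served to a non-zero user (`leaked`), while the careful backend gets a separate cache object per X-CS value (`separate`).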
[20:25:38] which is scary [20:26:16] yeah, there are implicit dependencies, and it's not entirely clear if those are all satisfied in mediawiki-config [20:26:37] in an ideal world, if extensions-Foo causes the contents of /wiki/.* to vary based on incoming X-CS header, it should also signal back to the core to set Vary:X-CS on all /wiki/.* outputs, regardless of whether the extension did anything to those outputs in content terms. [20:27:29] https://github.com/wikimedia/operations-mediawiki-config/blob/ec1f3687d886a832094aea688f522db8cb94cfa8/wmf-config/CommonSettings.php#L2852 [20:28:16] zeroportal is the configuration interface I think, as opposed to the Zero that users hit [20:30:18] bblack, the rest of the api does not vary on X-CS. In theory, action=parse could produce different HTML if the request was marked with X-CS... looking further [20:30:21] zerobanner is also referenced at https://github.com/wikimedia/operations-mediawiki-config/blob/74f643c4126f5965f7bfa1d1d6d309edc77cc6b8/wmf-config/mobile.php#L39 [20:31:27] 6operations, 10ops-esams, 10Traffic: cp3032 is dead - https://phabricator.wikimedia.org/T126062#2003595 (10BBlack) 3NEW [20:31:59] so to me it looks like the code looking at X-CS would only run if a) it's the mobile site, or b) $wmgZeroPortal is true [20:32:54] is part of this plan for "X-CS for all", though, to get zero-rating and zero banners on the desktop site? [20:33:10] if so, someone's going to make some change related to that, and will that carry Vary:X-CS when they do? 
:) [20:34:27] it sounds unlikely given the bandwidth implications, but I'm not the right person to answer this [20:34:32] dr0ptp4kt: ^^ [20:34:53] ACKNOWLEDGEMENT - Host cp3032 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black T126062 [20:35:56] * gwicke is still wondering how $wmgZeroPortal is actually set [20:36:21] !log resuming rolling cache reboots [20:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:36:28] bblack, we don't really care about the portal - that's just the zero.wikimedia.org [20:36:34] bblack: gwicke eventually, maybe, treatment for the desktop will be created. but for now the main reason this comes up is in the context of the app, which had introduced the notion of hitting desktop domains, whereas previously it had used mdot, for its API calls [20:36:45] zero rating is already enabled by many partners for desktop (due to https) [20:36:58] right, they're just not "getting credit" for it [20:37:03] exactly [20:37:06] which they should [20:37:27] so this all inter-relates with an email thread I've had with dfoy lately about the IP diffs between mdot and desktop too [20:37:48] * yurik feels its a hairball [20:37:56] yeah it is [20:38:07] * yurik starts coughing [20:38:20] yurik: oh actually, you're CC on that thread [20:38:24] yep [20:38:29] interesting re desktop [20:38:30] it's called "Dallas IP range" [20:38:38] 6operations, 10Traffic, 7Performance: Estimate effective cache time for text - https://phabricator.wikimedia.org/T126063#2003621 (10ori) 3NEW [20:39:21] so basically we should soon enough enable X-CS to all desktops, which would start rewriting all HTML to all partners with magical links [20:39:25] (03PS1) 10Dzahn: wikitech: wikitech.m.wikimedia.org -> CNAME silver [dns] - 10https://gerrit.wikimedia.org/r/268734 (https://phabricator.wikimedia.org/T120527) [20:39:33] 6operations, 10Traffic, 7Performance: Estimate effective cache time for text - 
https://phabricator.wikimedia.org/T126063#2003635 (10BBlack) See also: T124954 [20:39:35] meow [20:39:48] 6operations, 6Labs, 10Labs-Infrastructure, 10Reading-Web, and 3 others: https://wikitech.m.wikimedia.org/ serves wikimedia.org portal - https://phabricator.wikimedia.org/T120527#2003637 (10Dzahn) >>! In T120527#1870974, @Krenair wrote: > That domain would need to be changed into a CNAME for silver for this... [20:39:53] yurik: ha ha ha ha ha [20:40:04] sadly it wasn't a joke :(((( [20:40:11] * yurik moves to kill zero [20:40:17] yurik: be nice now [20:40:21] rather than deal with it :-P [20:40:33] i like zero, i just don't want to break my head over it :))) [20:40:48] yurik: but seriously, i think a javascript only driven approach for link rewriting seems fair game on desktop. it probably covers 99% of usage. [20:40:52] (on desktop) [20:41:13] the exception to the rule obviously is compression proxy users who tap the desktop url at the bottom of the page [20:41:45] dr0ptp4kt, the js rewrite at the moment is based on the fact that the external URL is already mangled. So we would need to start handling "regular" links in JS [20:41:56] 6operations, 10Traffic, 7Performance: Estimate effective cache time for text - https://phabricator.wikimedia.org/T126063#2003639 (10GWicke) Put differently, what would the performance impact be if we reduced `s-maxage` to a) 2 weeks, b) 1 week, c) days? [20:42:27] yurik: i feel like on desktop that would be the sanest approach. one simple rl module to rule them all on desktop :) [20:42:37] yurik: of course no one can be signed up just right now to do that work [20:43:15] dr0ptp4kt, the problem is that now we have to vary HTML output based on X-CS + mobile vs desktop [20:45:13] yurik: are you saying that's the current case or are you saying that a JS-driven (i.e., always included RL module) approach for desktop would introduce such an issue? 
[20:45:21] yurik: we really need to use video, don't we :) [20:45:44] yurik and bblack speaking of which lmk if you want to hop on video and discuss. it would be like we're in the same office [20:46:21] (03PS2) 10Nuria: Increase length of lag window to 100 [puppet] - 10https://gerrit.wikimedia.org/r/268594 (https://phabricator.wikimedia.org/T125916) [20:46:28] 7Puppet, 6operations, 10RESTBase-Cassandra: cassandra - puppet compiler fail on test/staging hosts - https://phabricator.wikimedia.org/T125943#2003664 (10Dzahn) >>! In T125943#2001270, @mobrovac wrote: > So how is it possible that something compiles for `restbase1001`, but not for `restbase-test2001`. I t... [20:47:25] (03PS1) 10Thcipriani: Beta: Rebase mw-config submodules [puppet] - 10https://gerrit.wikimedia.org/r/268737 (https://phabricator.wikimedia.org/T126061) [20:48:26] (03CR) 10Dzahn: "adding ottomata, hey, do you know where this $initial_token comes from? (since Ariel pointed out it's been like this since you created the " [puppet] - 10https://gerrit.wikimedia.org/r/266975 (owner: 10Dzahn) [20:49:29] (03CR) 10Ottomata: Increase length of lag window to 100 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/268594 (https://phabricator.wikimedia.org/T125916) (owner: 10Nuria) [20:50:21] dr0ptp4kt: I can't today really, but I think we should hash all of this out before we blindly turn on sending X-CS to the desktopwiki applayer and just see what happens :) [20:50:47] bblack: isn't that called a "pilot"? production in lieu of testing? [20:51:12] :P another approach is of course we can deprecate X-CS[2] as legacy and have new code / desktop use the newer version of the header they're already getting...
[20:51:25] but I fear even mentioning that in case someone runs with it without thinking through Vary lol [20:51:29] PROBLEM - Apache HTTP on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:52:09] PROBLEM - HHVM rendering on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:52:19] (but, putting fear aside for the moment: the applayer is already received "X-Carrier" and "X-Carrier-Meta", which tell you the same or more than X-CS/X-CS2 did, for all requests) [20:52:26] s/received/receiving/ [20:52:54] it's not being echoed back to the user in responses though like X-CS [20:52:58] bblack, i'm not exactly sure what X-carrier does or who handles it [20:53:17] varnish generates X-Carrier and sends that to the applayer for all requests [20:53:24] it also generates X-CS[2] from X-Carrier [20:53:29] it all comes from the zero metadata [20:53:29] PROBLEM - DPKG on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:32] 6operations, 10Traffic, 7Performance: Estimate effective cache time for text - https://phabricator.wikimedia.org/T126063#2003690 (10ori) Script to get cached HTML ages: ```lang=python # -*- coding: utf-8 -*- """ cache_age ~~~~~~~~~ Retrieve random pages from random Wikimedia projects and scrape thei... [20:53:38] PROBLEM - dhclient process on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:39] that's been in place for a while now [20:53:40] PROBLEM - Disk space on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:48] PROBLEM - Host cp3030 is DOWN: PING CRITICAL - Packet loss = 100% [20:53:49] PROBLEM - nutcracker port on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:50] PROBLEM - HHVM processes on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:59] PROBLEM - salt-minion processes on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:54:00] PROBLEM - nutcracker process on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:54:18] PROBLEM - RAID on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:54:19] PROBLEM - puppet last run on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:54:25] if traditional X-CS2 would have "123-456|wap", X-Carrier is "123-456" and X-Carrier-Meta is "wap" [20:54:39] PROBLEM - SSH on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:54:50] PROBLEM - Check size of conntrack table on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:54:50] bblack, but who processes / varies on those headerS? [20:55:00] PROBLEM - configured eth on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:55:07] bblack: yeah,thar be dragons. in terms of inbound header enrichment, for the apps it should actually be sufficient to whitelist the specific path for the config call (i.e., the action=zeroconfig bearing URLs), but the outbound api.php responses would generally need to have post-processed X-CS outbound headers on desktop. that would let us isolate where the [20:55:08] enrichment is occurring. 
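The relationship between the legacy and newer headers described above ("123-456|wap" versus "123-456" plus "wap") can be sketched as follows. This is only an illustration of the string mapping, not the Varnish code; the function names are hypothetical, and the behaviour when there is no meta value (X-CS2 being just the carrier id) is an assumption.

```python
def split_legacy_x_cs2(x_cs2):
    """Split a legacy X-CS2 value into the newer header pair."""
    carrier, _sep, meta = x_cs2.partition("|")
    return {"X-Carrier": carrier, "X-Carrier-Meta": meta}


def derive_legacy_x_cs2(x_carrier, x_carrier_meta=""):
    """Rebuild the legacy X-CS2 value from the primary headers.

    Assumption: with an empty meta value, X-CS2 is the bare carrier id.
    """
    if x_carrier_meta:
        return f"{x_carrier}|{x_carrier_meta}"
    return x_carrier


headers = split_legacy_x_cs2("123-456|wap")
round_trip = derive_legacy_x_cs2(headers["X-Carrier"], headers["X-Carrier-Meta"])
```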
[20:55:24] that is to say, for the desktop case [20:55:25] we're still doing the same things we always did before with X-CS and X-CS2, but in the world of Varnish we're actually generating X-Carrier and X-Carrier-Meta as primary data we send with all applayer requests, and the legacy/existing X-CS and X-CS2 are derived from them [20:55:36] no applayer stuff is actually paying attention to or varying on X-Carrier that I know of [20:55:51] (03PS2) 10BryanDavis: Add Tool namespace to wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268616 (https://phabricator.wikimedia.org/T122865) [20:55:59] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3030_v4, cp3030_v6 [20:55:59] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3030_v4, cp3030_v6 [20:56:02] the X-CS[2] derivation only happens for mobile/zero, but X-Carrier exists for all [20:56:30] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3030_v4, cp3030_v6 [20:56:30] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3030_v4, cp3030_v6 [20:56:38] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3030_v4, cp3030_v6 [20:56:40] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3030_v4, cp3030_v6 [20:56:42] right. i think for desktop domains for the time being we'd only need to Vary for action=zeroconfig. but we could just apply outbound X-CS header enrichment for any api.php response (in the desktop wikipedia subdomains) [20:56:49] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3030_v4, cp3030_v6 [20:57:08] bblack: yurik maybe that's what you guys were saying. sorry, so many communication channels! [20:57:09] PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3030_v4, cp3030_v6 [20:57:21] but the apps really don't care about the actual html. 
they're all api [20:57:25] bblack, so I guess we can 1) always send X-Carrier with all responses -- this way the app can always detect when its on zero network. and 2) always send proper x-cs to the api for action=zeroconfig [20:57:44] dr0ptp4kt, this way you can switch the app to only react to x-carrier [20:57:45] yurik: the apps inspect the response for x-cs [20:57:54] ^^ [20:57:57] well we also send improper X-CS on all mobile requests, to vary their html output for X-CS:ON [20:58:28] (03CR) 10MaxSem: [C: 031] wikitech: wikitech.m.wikimedia.org -> CNAME silver [dns] - 10https://gerrit.wikimedia.org/r/268734 (https://phabricator.wikimedia.org/T120527) (owner: 10Dzahn) [20:58:35] yurik: oh, i see. so the "newer" versions of the apps could look at x-carrier. the older versions already rely upon mdot responses, where it already works. [20:59:09] there are lots of dimensions to this problem [20:59:22] not the least of which is we don't want to pointlessly vary the cache where it doesn't need to be [20:59:31] caching and variable naming [20:59:36] bblack, right, but for the app we can avoid this problem outright - the app doesn't care about magical html that zero generates [20:59:36] two hardest problems [20:59:46] yurik: right [21:00:14] yurik: bblack i'll schedule something with you dudes and mdholloway :) [21:00:20] yurik: so you're saying that, for today, if I just echo X-Carrier data to the client, that's enough without worrying about MediaWiki-side effects [21:00:30] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 58 ESP OK [21:00:39] RECOVERY - Host cp3030 is UP: PING OK - Packet loss = 0%, RTA = 87.18 ms [21:00:53] bblack, i think so. And also, if you pass through the X-CS to api's zeroconfig for all requests, that should be ok as well. 
[21:00:58] PROBLEM - Host cp3035 is DOWN: PING CRITICAL - Packet loss = 100% [21:00:58] RECOVERY - dhclient process on mw1136 is OK: PROCS OK: 0 processes with command name dhclient [21:00:59] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 58 ESP OK [21:00:59] RECOVERY - Disk space on mw1136 is OK: DISK OK [21:01:08] RECOVERY - nutcracker port on mw1136 is OK: TCP OK - 0.000 second response time on port 11212 [21:01:08] RECOVERY - HHVM processes on mw1136 is OK: PROCS OK: 6 processes with command name hhvm [21:01:16] yurik: as opposed to today, where we only do that for zeroconfig on certain mobile domains? [21:01:38] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 58 ESP OK [21:01:38] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 58 ESP OK [21:01:40] actually that's not really true [21:01:41] the rule is: [21:01:42] if (req.url ~ "(action=zeroconfig|:ZeroRatedMobileAccess)($|&|\?)" || req.http.host ~ "^(zero|m)\.") { set req.http.X-CS = req.http.X-CS2; [21:01:50] 6operations, 10MobileFrontend, 10Traffic, 6Zero, and 3 others: Enable X-CS headers for non-mobile domains - https://phabricator.wikimedia.org/T126053#2003728 (10bearND) [21:02:00] (03PS3) 10Nuria: Increase length of lag window to 100 [puppet] - 10https://gerrit.wikimedia.org/r/268594 (https://phabricator.wikimedia.org/T125916) [21:02:09] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 58 ESP OK [21:02:09] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 58 ESP OK [21:02:11] ^ that sets "real" X-CS (as opposed to X-CS:ON) to action=zeroconfig for all domains, and to all requests that are zero|m without a language subdomain [21:02:18] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 58 ESP OK [21:02:19] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 58 ESP OK [21:02:38] bblack: yurik mdholloway niedzielski i set a meeting to discuss X-CS, X-Carrier and such [21:02:40] monday! 
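The VCL rule quoted above can be mirrored in a few lines of Python to check which requests would get the "real" X-CS. This is a sketch of the condition only, assuming a carrier has already been detected and X-CS2 is set; `x_cs_for_request` is a hypothetical name, and the `"ON"` fallback is taken from the else branch pasted earlier in the log. It does not model whether the surrounding VCL is reached at all for a given domain.

```python
import re

# Same patterns as the VCL condition; VCL's "~" is an unanchored regex match,
# so re.search is the closest equivalent.
ZERO_URL = re.compile(r"(action=zeroconfig|:ZeroRatedMobileAccess)($|&|\?)")
ZERO_HOST = re.compile(r"^(zero|m)\.")


def x_cs_for_request(host, url, x_cs2):
    """Return the X-CS value the rule would send to the application layer."""
    if ZERO_URL.search(url) or ZERO_HOST.search(host):
        return x_cs2  # real carrier value
    return "ON"       # generic "this is a zero request" marker


carrier = "123-456|wap"
api_call = x_cs_for_request(
    "en.m.wikipedia.org", "/w/api.php?action=zeroconfig&type=config", carrier)
bare_zero = x_cs_for_request("zero.wikipedia.org", "/wiki/Foo", carrier)
lang_mobile = x_cs_for_request("en.m.wikipedia.org", "/wiki/Foo", carrier)
```

Note that `^(zero|m)\.` only matches hosts that start with `zero.` or `m.`, which is why a language-prefixed host such as `en.m.wikipedia.org` falls through to `"ON"` unless the URL itself matches.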
[21:03:11] ok [21:03:17] 6operations, 10MobileFrontend, 10Traffic, 6Zero, and 3 others: Enable X-CS headers for non-mobile domains - https://phabricator.wikimedia.org/T126053#2003432 (10bearND) [21:03:47] bblack, but doesn't that code only get reached if its on *.m.* or *.zero.* ? [21:04:13] dr0ptp4kt: 👍 [21:04:20] RECOVERY - configured eth on mw1136 is OK: OK - interfaces up [21:04:39] RECOVERY - DPKG on mw1136 is OK: All packages OK [21:05:09] RECOVERY - salt-minion processes on mw1136 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:05:09] RECOVERY - nutcracker process on mw1136 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [21:05:24] 6operations, 10Traffic, 7Performance: Estimate effective cache time for text - https://phabricator.wikimedia.org/T126063#2003745 (10BBlack) My point in the other ticket is it's not really about the percentage of pages which are cached longer than X, it's about the percentage of requests. In the extreme poss... [21:05:53] (03CR) 10Madhuvishy: [C: 031] "Looks great! Just one trailing space comment - and then needs ops to merge." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268594 (https://phabricator.wikimedia.org/T125916) (owner: 10Nuria) [21:06:20] yurik: yeah [21:06:39] 10Ops-Access-Requests, 6operations: Allow mobrovac to start/stop/restart services on SCx - https://phabricator.wikimedia.org/T125879#2003765 (10Dzahn) Ok, so from looking at the role classes: sca means: apertium, cxserver and zotero scb means: mobileapps, mathoid, graphoid and citoid I'm not sure if mathoid... [21:07:47] bblack, right, so I think this IF should be done on ALL requests -- if (req.url ~ "(action=zeroconfig|:ZeroRatedMobileAccess)($|&|\?)" || req.http.host ~ "^(zero|m)\.") { set req.http.X-CS = req.http.X-CS2; } [21:08:15] but else should only be done for zero & m subdomains [21:08:44] does action=zeroconfig even exist on desktop domains presently? 
[21:09:21] while we're at it, we should decide whether we can kill that ZeroTLS cookie heh [21:09:21] bblack, sure, the extension is enabled per language, not per domain [21:09:29] it's the only other strange zero complication at the cache layer [21:09:41] dr0ptp4kt, ^ [21:09:43] (03PS4) 10Ottomata: Increase length of lag window to 100 [puppet] - 10https://gerrit.wikimedia.org/r/268594 (https://phabricator.wikimedia.org/T125916) (owner: 10Nuria) [21:09:49] PROBLEM - configured eth on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:09:53] it was dr0ptp4kt who understands that beast [21:09:59] RECOVERY - Host cp3035 is UP: PING OK - Packet loss = 0%, RTA = 85.70 ms [21:10:44] (03CR) 10jenkins-bot: [V: 04-1] Increase length of lag window to 100 [puppet] - 10https://gerrit.wikimedia.org/r/268594 (https://phabricator.wikimedia.org/T125916) (owner: 10Nuria) [21:11:18] yurik: no, you understand it best. bblack, yurik knows it all. [21:11:33] yurik: and yet i can't claim total ignorance [21:11:45] (03CR) 10Nuria: [C: 031] "Cool, friendlier templates will be nice. I will let madhuvishy address puppet conventions/code standards as I am not familiar with those." [puppet] - 10https://gerrit.wikimedia.org/r/268682 (owner: 10Elukey) [21:11:50] PROBLEM - DPKG on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:12:05] dr0ptp4kt, the tls thingy - i never understood it, so its all yours :-P [21:12:09] PROBLEM - nutcracker port on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:12:10] PROBLEM - HHVM processes on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:12:19] PROBLEM - salt-minion processes on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:12:19] PROBLEM - nutcracker process on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:12:38] dr0ptp4kt: my gut feeling was that the whole ZeroTLS thing was a transitional thing per-carrier, and we eventually transitioned them all, and it can just go away now [21:12:47] dr0ptp4kt, even the name implies that it was you - i would have called it https or something :-P [21:13:28] the two hints at that state are (1) It's about TLS, and there is no choice on TLS anymore, all requests are TLS and (2) The code was expecting other VCL carrier-specific blocks to set req.http.X-ZeroTLS, and there's no such settings anymore [21:13:48] PROBLEM - dhclient process on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:13:49] PROBLEM - Disk space on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:14:20] yurik: don't make me git -S ;) [21:14:28] so my feeling tends to be we can just gut that code, and optionally put in a temporary cookie-delete to clean up the now-unused cookies in clients just to be nice [21:14:36] but I'm just not 100% confident of that [21:14:40] * yurik welcomes the challenge :-P [21:14:56] yurik: i know :) [21:16:22] (03CR) 10jenkins-bot: [V: 04-1] Increase length of lag window to 100 [puppet] - 10https://gerrit.wikimedia.org/r/268594 (https://phabricator.wikimedia.org/T125916) (owner: 10Nuria) [21:17:20] 10Ops-Access-Requests, 6operations: add mobrovac to mathoid admins (was: Allow mobrovac to start/stop/restart services on SCx) - https://phabricator.wikimedia.org/T125879#2003797 (10Dzahn) [21:18:29] RECOVERY - SSH on mw1136 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [21:18:38] RECOVERY - Check size of conntrack table on mw1136 is OK: OK: nf_conntrack is 0 % full [21:18:40] RECOVERY - configured eth on mw1136 is OK: OK - interfaces up [21:18:42] 10Ops-Access-Requests, 6operations: add mobrovac to mathoid admins (was: Allow mobrovac to start/stop/restart services on SCx) - https://phabricator.wikimedia.org/T125879#1999706 (10Dzahn) Of the services on scb, you are an admin of: 
zotero, citoid, graphoid, mobileapps you are not an admin of: mathoid, cxs... [21:19:00] RECOVERY - DPKG on mw1136 is OK: All packages OK [21:19:00] RECOVERY - dhclient process on mw1136 is OK: PROCS OK: 0 processes with command name dhclient [21:19:08] RECOVERY - Disk space on mw1136 is OK: DISK OK [21:19:18] RECOVERY - nutcracker port on mw1136 is OK: TCP OK - 0.000 second response time on port 11212 [21:19:18] RECOVERY - HHVM processes on mw1136 is OK: PROCS OK: 6 processes with command name hhvm [21:19:29] RECOVERY - salt-minion processes on mw1136 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:19:29] RECOVERY - nutcracker process on mw1136 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [21:19:40] RECOVERY - RAID on mw1136 is OK: OK: no RAID installed [21:19:49] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 59 minutes ago with 0 failures [21:20:27] dr0ptp4kt: thanks for the ping to this interesting conversation, btw [21:21:09] 10Ops-Access-Requests, 10Ops-Access-Reviews, 6operations, 3Discovery-Search-Sprint, 5Patch-For-Review: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#2003838 (10Dzahn) @Gehel Do you need anything else from us? membership on mailing lists? access to private... [21:23:18] (03PS5) 10Nuria: Increase length of lag window to 100 [puppet] - 10https://gerrit.wikimedia.org/r/268594 (https://phabricator.wikimedia.org/T125916) [21:23:43] dr0ptp4kt, https://gerrit.wikimedia.org/r/#/c/113655/8/includes/PageRenderingHooks.php [21:24:20] git -s wouldn't have found it ;P [21:24:34] its my patch that you added stuff to :) [21:25:09] PROBLEM - puppet last run on mw1136 is CRITICAL: CRITICAL: puppet fail [21:25:14] bblack, ^ not sure if its that useful, but might bring out the reason for the whole tls ^^ [21:25:33] bearND: sure thing. 
bearND i didn't invite you or dbrant to the meeting as i think two android engineers is probably enough (plus you're busy at the slot 1135-1225 sf time), but lmk if you become interested in attending or feel a strong need to attend [21:25:55] dr0ptp4kt, bblack another useful patch - https://gerrit.wikimedia.org/r/#/c/115669/ [21:26:34] yurik: :) on the git [21:26:47] yurik: bblack et al ok, i gotta eat some lunch. talk to you all later [21:28:17] (03CR) 10Ottomata: "I think leaving in templates/ is fine, even though there are no ERb blocks. Since a user of the burrow module could change the email_temp" [puppet] - 10https://gerrit.wikimedia.org/r/268682 (owner: 10Elukey) [21:29:46] (03PS10) 10Dduvall: Puppet provider for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [21:30:52] (03PS6) 10Ottomata: Increase length of lag window to 100 [puppet] - 10https://gerrit.wikimedia.org/r/268594 (https://phabricator.wikimedia.org/T125916) (owner: 10Nuria) [21:30:54] (03CR) 10Dduvall: "I've refactored the patch to make use of `deploy-local` and removed all the Trebuchet cruft." [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [21:32:22] (03PS11) 10Dduvall: Puppet provider for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [21:34:28] aude, so basically wikidataclient.dblist is all - special - wikimedia - wiktionary - wikiversity + public group 0 wikis + meta/wikidata.org/species/commons? [21:34:36] 6operations, 10Traffic: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2003887 (10ori) > (which implies that a lot of cache hits there don't send Age:, so I probably need to find better ways to look at this) Should Apache or MediaWiki add a header with the current timestamp?... 
[21:36:11] (03PS12) 10Dduvall: Puppet provider for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [21:37:08] PROBLEM - Host cp2018 is DOWN: PING CRITICAL - Packet loss = 100% [21:37:15] (03PS1) 10Merlijn van Deen: scap: remove unnecessary unicode em-dash [puppet] - 10https://gerrit.wikimedia.org/r/268800 [21:37:18] twentyafterfour: ^ [21:37:19] RECOVERY - Host cp2018 is UP: PING OK - Packet loss = 0%, RTA = 36.30 ms [21:37:37] that's the non-ascii bit in the manifest, but I'm still confused why it complains [21:38:05] dr0ptp4kt: that's fine. I don't need to attend if mdholloway is attending [21:38:22] (03CR) 10Dzahn: [C: 032] scap: remove unnecessary unicode em-dash [puppet] - 10https://gerrit.wikimedia.org/r/268800 (owner: 10Merlijn van Deen) [21:39:09] uh, i forgot to go eat. ok, seriously, soon i must get food [21:39:44] (03PS4) 10Dzahn: creation of parsoid-test-admins group [puppet] - 10https://gerrit.wikimedia.org/r/266632 (https://phabricator.wikimedia.org/T124701) (owner: 10RobH) [21:39:54] 10Ops-Access-Requests, 6operations, 6Parsing-Team, 5Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#2003913 (10Dzahn) >>! In T124701#1987469, @RobH wrote: > We could also rename the new group parso... 
[21:39:59] 10Ops-Access-Requests, 6operations, 6Parsing-Team, 5Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#2003914 (10Dzahn) a:3Dzahn [21:41:22] (03CR) 10jenkins-bot: [V: 04-1] creation of parsoid-test-admins group [puppet] - 10https://gerrit.wikimedia.org/r/266632 (https://phabricator.wikimedia.org/T124701) (owner: 10RobH) [21:42:00] valhallasw`cloud: ah good find [21:42:36] chasemp: there's a few more [21:42:44] https://www.irccloud.com/pastebin/pHkZqOlV/ [21:42:45] ^ [21:43:11] but I'm off to bed, so I'm not going to fix them now :-) [21:43:24] 6operations, 10Traffic: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2003929 (10BBlack) Yeah. Ideally, in *nix epoch seconds, because it's much easier to do math on that. It would also be handy for emergency invalidations of all object mediawiki emitted from timestamp X to... [21:43:52] 6operations, 10Salt, 5Patch-For-Review: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#2003933 (10Papaul) a:5RobH>3Papaul [21:43:54] later on man, good night [21:44:08] PROBLEM - Host cp3031 is DOWN: PING CRITICAL - Packet loss = 100% [21:46:09] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3031_v4, cp3031_v6 [21:46:10] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3031_v4, cp3031_v6 [21:46:38] PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3031_v4, cp3031_v6 [21:47:09] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3031_v4, cp3031_v6 [21:47:09] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3031_v4, cp3031_v6 [21:47:48] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3031_v4, cp3031_v6 [21:47:49] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan 
CRITICAL - ok: 56 not-conn: cp3031_v4, cp3031_v6 [21:47:49] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp3031_v4, cp3031_v6 [21:47:58] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 58 ESP OK [21:47:59] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 58 ESP OK [21:48:19] RECOVERY - Host cp3031 is UP: PING OK - Packet loss = 0%, RTA = 86.22 ms [21:48:20] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 58 ESP OK [21:48:35] uh bblack is that you^? [21:48:59] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 58 ESP OK [21:48:59] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 58 ESP OK [21:49:11] yeah [21:49:24] ok cool just makin' sure [21:49:30] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 58 ESP OK [21:49:30] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 58 ESP OK [21:49:39] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 58 ESP OK [21:51:36] (03CR) 10Dzahn: [C: 04-1] "Duplicate group GIDs" [puppet] - 10https://gerrit.wikimedia.org/r/266632 (https://phabricator.wikimedia.org/T124701) (owner: 10RobH) [21:52:02] mutante: it wasnt dupe back when the patch was first created ;] [21:52:27] (03PS5) 10Dzahn: creation of parsoid-test-admins group [puppet] - 10https://gerrit.wikimedia.org/r/266632 (https://phabricator.wikimedia.org/T124701) (owner: 10RobH) [21:52:49] (03PS1) 10Krinkle: [DONT MERGE] mediawiki: Rewrite /w/{skins,resources,extensions} to /w/static.php [puppet] - 10https://gerrit.wikimedia.org/r/268802 (https://phabricator.wikimedia.org/T99096) [21:53:01] yes, it's after the rebase [21:53:07] a new group was added meanwhile [21:53:08] (03PS2) 10Krinkle: [DONT MERGE] Set $wgResourceBasePath to "/w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268715 (https://phabricator.wikimedia.org/T99096) [21:53:43] (03PS6) 10Dzahn: creation of parsoid-test-admins group [puppet] - 10https://gerrit.wikimedia.org/r/266632 (https://phabricator.wikimedia.org/T124701) (owner: 10RobH) [21:53:52] (03CR) 10Dzahn: [C: 032] 
creation of parsoid-test-admins group [puppet] - 10https://gerrit.wikimedia.org/r/266632 (https://phabricator.wikimedia.org/T124701) (owner: 10RobH) [21:58:01] (03PS1) 10JGirault: Bump portals to master (Move inlined JS to a separate file) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268804 [21:58:54] (03PS1) 10BBlack: disable ipsec for cp3032, dead [puppet] - 10https://gerrit.wikimedia.org/r/268805 [21:59:09] (03CR) 10BBlack: [C: 032 V: 032] disable ipsec for cp3032, dead [puppet] - 10https://gerrit.wikimedia.org/r/268805 (owner: 10BBlack) [22:00:07] 6operations, 10Traffic: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2003974 (10GWicke) @bblack, do we know why the `Age` header would be missing? And does it matter / would it change the overall result? [22:01:22] (03PS11) 10Aklapper: Adding a new email template for Burrow lag alerts. [puppet] - 10https://gerrit.wikimedia.org/r/268682 (https://phabricator.wikimedia.org/T126008) (owner: 10Elukey) [22:01:29] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 164 ESP OK [22:01:30] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 164 ESP OK [22:01:50] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 58 ESP OK [22:01:58] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 164 ESP OK [22:01:59] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 58 ESP OK [22:02:19] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 58 ESP OK [22:02:24] (03PS1) 10Dzahn: admin: add arlolra,cscott,gwicke to parsoid-test-admins [puppet] - 10https://gerrit.wikimedia.org/r/268808 (https://phabricator.wikimedia.org/T124701) [22:02:30] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 58 ESP OK [22:02:40] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 164 ESP OK [22:02:49] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 58 ESP OK [22:02:59] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 164 ESP OK [22:03:08] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 58 ESP OK [22:03:09] 
RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 58 ESP OK [22:03:10] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 58 ESP OK [22:03:20] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 58 ESP OK [22:03:20] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 58 ESP OK [22:03:20] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 58 ESP OK [22:03:32] (03PS1) 10Dzahn: admin: add parsoid-test-admins to ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/268809 (https://phabricator.wikimedia.org/T124701) [22:03:38] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 58 ESP OK [22:03:49] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 58 ESP OK [22:05:02] (03PS2) 10Dzahn: admin: add parsoid-test-admins to ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/268809 (https://phabricator.wikimedia.org/T124701) [22:05:31] (03PS3) 10Dzahn: admin: add parsoid-test-admins to ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/268809 (https://phabricator.wikimedia.org/T124701) [22:05:39] (03CR) 10Dzahn: [V: 032] admin: add parsoid-test-admins to ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/268809 (https://phabricator.wikimedia.org/T124701) (owner: 10Dzahn) [22:05:51] (03CR) 10Dzahn: [C: 032] admin: add parsoid-test-admins to ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/268809 (https://phabricator.wikimedia.org/T124701) (owner: 10Dzahn) [22:10:22] !log cache rolling reboots stopped for the weekend, can pick up the other half monday [22:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:11:27] (03PS2) 10Dzahn: admin: add arlolra,cscott,gwicke to parsoid-test-admins [puppet] - 10https://gerrit.wikimedia.org/r/268808 (https://phabricator.wikimedia.org/T124701) [22:11:56] 10Ops-Access-Requests, 6operations, 6Parsing-Team, 5Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#2003999 (10Dzahn) - merged the new 
empty group created by rob, just renamed to parsoid-test-admin... [22:13:42] (03PS5) 10Dzahn: Remove absented uuid-generator script [puppet] - 10https://gerrit.wikimedia.org/r/214625 (owner: 10Alexandros Kosiaris) [22:14:17] (03PS1) 10Krinkle: multiversion: Create getAvailableBranchDirs() method [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268813 [22:14:37] (03PS12) 10Krinkle: [WIP] Implement /w/static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) [22:14:41] (03PS3) 10Krinkle: [DONT MERGE] Set $wgResourceBasePath to "/w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268715 (https://phabricator.wikimedia.org/T99096) [22:19:38] (03PS3) 10Dzahn: apache: rotate logs daily, default to 30d [puppet] - 10https://gerrit.wikimedia.org/r/266480 (owner: 10ArielGlenn) [22:20:09] RECOVERY - HHVM rendering on mw1136 is OK: HTTP OK: HTTP/1.1 200 OK - 70586 bytes in 9.051 second response time [22:21:09] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.051 second response time [22:22:10] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:22:10] (03PS13) 10Dduvall: Puppet provider for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [22:22:33] (03CR) 10Dzahn: "jenkins doesn't get it. "Users assigned that do not exist" the users are created in this change as well" [puppet] - 10https://gerrit.wikimedia.org/r/268722 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [22:22:57] 6operations, 10Traffic: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2004045 (10BBlack) @gwicke - I think I had a bad filter or I misinterpreted results, one or the other. In another run I just did (also just 10 minutes on 1x cache_text, it seems like virtually all cache hi... 
[22:26:54] (03PS3) 10Dzahn: admin: add shell users for frack pentest [puppet] - 10https://gerrit.wikimedia.org/r/268722 (https://phabricator.wikimedia.org/T126012) [22:29:10] 6operations, 10Traffic: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2004072 (10BBlack) The stats like the above (if taken on a broader and longer scale) still don't really simulate what would happen if we capped the TTL lower. It's not the case that we'd see a 0.7% increas... [22:29:27] 10Ops-Access-Requests, 10Ops-Access-Reviews, 6operations, 3Discovery-Search-Sprint, 5Patch-For-Review: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#2004073 (10Gehel) @Dzahn: only thing I know of at the moment is the access to graphite / icinga / grafana.... [22:32:24] Krenair: Hi! Sorry I hadn't followed up before - but I see this is fixed - https://phabricator.wikimedia.org/T121602 but I still can't edit those pages. Should I reopen this or make a new ticket? [22:32:53] I saw your comment madhuvishy [22:33:05] Krenair: not wikiversity yet and probably not all special [22:33:20] 6operations, 10Traffic: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2004096 (10GWicke) Yeah, long time survivors should likely be hot, and forcing them to be refreshed after 1-2 weeks shouldn't significantly alter the hit rate for those objects (or overall). To me it looks... [22:33:34] Krenair: ah yes - okay. [22:33:46] but wikiversity will eventually (actually soon) get wikibase [22:33:54] madhuvishy, looks like I did miss a bit - we need to enable subpage recognition in the Hiera namespace [22:34:34] (03CR) 10Mobrovac: "Nice! I've got just one ruby-nit-pick in-lined :)" [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [22:34:48] Krenair: ah alright. Would it help if I made a new ticket? 
[22:34:58] nah [22:35:02] okay [22:35:03] and not wiktionary yet, but also eventually they will get wikibase [22:37:21] (03PS1) 10Alex Monk: Enable subpages in wikitech's Hiera namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268818 (https://phabricator.wikimedia.org/T121602) [22:38:36] (03CR) 10Mobrovac: "Hm, the patch changed from beneath my feet :) One question in-lined." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [22:39:16] (03CR) 10Alex Monk: [C: 032] Enable subpages in wikitech's Hiera namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268818 (https://phabricator.wikimedia.org/T121602) (owner: 10Alex Monk) [22:39:43] (03Merged) 10jenkins-bot: Enable subpages in wikitech's Hiera namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268818 (https://phabricator.wikimedia.org/T121602) (owner: 10Alex Monk) [22:41:48] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/268818/ (duration: 01m 22s) [22:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:41:55] madhuvishy, try now? ^ [22:42:36] Krenair: yay! works. thanks so much :) [22:43:01] np [22:43:12] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2004154 (10Tgr) Per SAL, the script was started at 2016-01-31 05:35, and Legoktm says at 2016-02-04 11:04 it was at 10467742 users out of 45463166 (23%). That's 10467... [22:45:31] 6operations, 10EventBus, 6Services, 10hardware-requests: 4 more Kafka brokers, 2 in eqiad and 2 codfw - https://phabricator.wikimedia.org/T124469#2004164 (10RobH) a:5Ottomata>3RobH [22:45:59] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2004169 (10greg) >>! 
In T124440#2004154, @Tgr wrote: > Per SAL, the script was started at 2016-01-31 05:35, and Legoktm says at 2016-02-04 11:04 it was at 10467742 us... [22:48:44] !log restarting slave on m2/codfw (db2011) [22:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:50:45] (03CR) 10Mobrovac: "FTR, you need to:" [puppet] - 10https://gerrit.wikimedia.org/r/268560 (https://phabricator.wikimedia.org/T122249) (owner: 10Milimetric) [22:51:26] 6operations, 10EventBus, 6Services, 10hardware-requests: 4 more Kafka brokers, 2 in eqiad and 2 codfw - https://phabricator.wikimedia.org/T124469#2004185 (10GWicke) I have some questions on this decision: 1) Why do we need separate clusters at this point, considering the relatively limited message volumes... [22:54:09] (03PS1) 10Ori.livneh: Add 'Backend-Timing' response header on all Apaches [puppet] - 10https://gerrit.wikimedia.org/r/268821 [22:55:54] 6operations, 10Traffic: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2004198 (10BBlack) @Gwicke - rather than lowering s-maxage in app code, IMHO we should lower it in the VCL cap we have here: https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/te... [22:56:09] !log reimaging db1018 [22:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:57:06] (03PS1) 10Dzahn: admin: add akumar, mnoushad to pentesters [puppet] - 10https://gerrit.wikimedia.org/r/268823 (https://phabricator.wikimedia.org/T126012) [22:58:25] 6operations, 10Traffic: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2004204 (10BBlack) In any case, before this goes from idea to action, I need to get finer-grained stats over a broader set of requests, and even then we'll probably want to slowly drop in stages Just In Cas... 
[22:58:34] (03CR) 10jenkins-bot: [V: 04-1] admin: add akumar, mnoushad to pentesters [puppet] - 10https://gerrit.wikimedia.org/r/268823 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [22:58:41] (03CR) 10Dzahn: "what's the best way to proof this is noop in beta?" [puppet] - 10https://gerrit.wikimedia.org/r/260937 (owner: 10Dzahn) [22:59:46] wait, am I installing jessie or trusty? [23:00:02] jessie now that it works ?:) [23:00:23] the question is what am I *actually* installing, not what I want [23:01:15] trusty [23:01:16] (03CR) 10BBlack: [C: 031] Add 'Backend-Timing' response header on all Apaches [puppet] - 10https://gerrit.wikimedia.org/r/268821 (owner: 10Ori.livneh) [23:01:26] when looking at dhcp config [23:01:38] (03PS1) 10Jcrespo: Revert "Revert "Revert "Revert "Install Jessie on db1018"""" [puppet] - 10https://gerrit.wikimedia.org/r/268826 [23:01:46] at some point we can change the default installer [23:01:56] greg-g, am I fine to push the agreed upon portals change right now? [23:02:08] Platonides, can you see ^that, at the end, it just works [23:02:19] (03PS2) 10Jcrespo: Revert "Revert "Revert "Revert "Install Jessie on db1018"""" [puppet] - 10https://gerrit.wikimedia.org/r/268826 [23:02:25] OuKB: sure [23:02:48] OuKB: ya'll figured out the Beta Cluster issue? [23:03:24] (03CR) 10Dzahn: "no diff on gallium and scandium http://puppet-compiler.wmflabs.org/1687/" [puppet] - 10https://gerrit.wikimedia.org/r/260939 (owner: 10Dzahn) [23:04:01] 6operations: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2004220 (10chasemp) 3NEW [23:04:19] (03CR) 10Jcrespo: [C: 032] Revert "Revert "Revert "Revert "Install Jessie on db1018"""" [puppet] - 10https://gerrit.wikimedia.org/r/268826 (owner: 10Jcrespo) [23:04:38] (03CR) 10Dzahn: "where does this actually run please?" [puppet] - 10https://gerrit.wikimedia.org/r/260187 (owner: 10Dzahn) [23:05:19] greg-g: we had cached 404s in varnish, but now they’re gone and images are fine. 
debt also confirmed the page is working as expected. (we limited the change to just moving out the inline JS). Beta is all good. We can deploy [23:06:01] (03CR) 10MaxSem: [C: 032] Bump portals to master (Move inlined JS to a separate file) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268804 (owner: 10JGirault) [23:06:39] jgirault: word, thanks [23:07:05] (03CR) 10Dzahn: "where should i run this to show it's no change? labsdb servers, right?" [puppet] - 10https://gerrit.wikimedia.org/r/260611 (owner: 10Dzahn) [23:07:41] (03CR) 1020after4: Puppet provider for scap3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: 10Alexandros Kosiaris) [23:07:55] 6operations, 10Incident-Labs-NFS-20151216: Reinstall labstore1002 to ensure consistency with labstore1001 - https://phabricator.wikimedia.org/T121905#2004244 (10chasemp) We have been exploring the possibility of more/different servers so this has been stalled. [23:07:55] (03CR) 10Dzahn: "no, it does not affect hiera lookups, the class names don't change, just the location where they are declared" [puppet] - 10https://gerrit.wikimedia.org/r/260610 (owner: 10Dzahn) [23:09:09] 6operations, 6Labs: evaluate possibility for nscd use with useldap - https://phabricator.wikimedia.org/T124991#2004248 (10chasemp) [23:09:17] 6operations, 10Incident-Labs-NFS-20151216: Reinstall labstore1002 to ensure consistency with labstore1001 - https://phabricator.wikimedia.org/T121905#2004253 (10chasemp) [23:09:21] 6operations, 10Incident-Labs-NFS-20151216: Add step in start-nfs to ask operator to consider dropping some snapshots - https://phabricator.wikimedia.org/T121890#2004254 (10chasemp) [23:09:25] 6operations, 6Labs: One instance hammering on NFS should not make it unavailable to everyone else - https://phabricator.wikimedia.org/T95766#2004257 (10chasemp) [23:09:33] 6operations: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2004247 
(10chasemp) [23:10:05] (03PS2) 10Dzahn: swift: move roles to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/260610 [23:10:58] (03CR) 10Dzahn: "of course this needs 10 manual rebases meanwhile, i dont think i'll ever get it merged" [puppet] - 10https://gerrit.wikimedia.org/r/256574 (owner: 10Dzahn) [23:11:49] (03Merged) 10jenkins-bot: Bump portals to master (Move inlined JS to a separate file) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268804 (owner: 10JGirault) [23:12:14] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2004274 (10csteipp) @Tgr, you lost me a little in this, but let me try to respond. > About the last point: the two bugs that made us run the script leaked user token... [23:13:57] 6operations: change nfs-exports job to only run on changes to /etc/exports.d - https://phabricator.wikimedia.org/T126085#2004284 (10chasemp) 3NEW [23:14:27] (03PS5) 10Dzahn: mediawiki: move roles into separate files [puppet] - 10https://gerrit.wikimedia.org/r/256574 [23:14:53] (03CR) 10Alex Monk: "Don't they need bastiononly?" [puppet] - 10https://gerrit.wikimedia.org/r/268823 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [23:20:44] Krenair, didja merge the last commit on mira? I still see it in the diff [23:20:49] (03CR) 10Dzahn: [C: 04-1] "meanwhile: Could not find class role::scap::target" [puppet] - 10https://gerrit.wikimedia.org/r/256574 (owner: 10Dzahn) [23:20:58] (03PS2) 10Dzahn: admin: add akumar, mnoushad to pentesters, bastionly [puppet] - 10https://gerrit.wikimedia.org/r/268823 (https://phabricator.wikimedia.org/T126012) [23:22:16] OuKB, it looks like it was merged? 
[23:22:30] krenair@mira:/srv/mediawiki-staging (master)$ git log --oneline -3 [23:22:30] 695ec6b multiversion: Create getAvailableBranchDirs() method [23:22:30] bb0ac90 Enable subpages in wikitech's Hiera namespace [23:22:30] 34c75ab Merge "Fix typo and sort by alphabetical order wgExtraSignatureNamespaces" [23:22:41] "git log HEAD..origin/master" shows your commit [23:22:46] now try git fetch origin && git diff HEAD origin [23:22:56] (03CR) 10Dzahn: "giving up, no reviews in 2 months and constantly changes" [puppet] - 10https://gerrit.wikimedia.org/r/256574 (owner: 10Dzahn) [23:22:57] why are you using diff [23:23:11] cuz that's the standard procedure ;) [23:23:30] https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment#Change_wiki_configuration [23:23:42] (03PS3) 10Dzahn: deactivate wikimediacommons.[co.uk|eu|info|jp.net|mobi|net|org] [dns] - 10https://gerrit.wikimedia.org/r/244092 [23:23:51] (03PS13) 10Krinkle: [WIP] Implement /w/static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) [23:23:52] 6operations: new labstore hardware for eqiad - https://phabricator.wikimedia.org/T126089#2004344 (10chasemp) 3NEW [23:23:55] it's not on https://wikitech.wikimedia.org/wiki/How_to_deploy_code OuKB [23:24:00] (03CR) 10Krinkle: "Fixed Access-Control-Allow-Origin." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [23:24:16] anyway, nothing of mine is in "git diff HEAD origin" [23:24:28] (03CR) 10Dzahn: [C: 032] "traffic checked in https://phabricator.wikimedia.org/T124237" [dns] - 10https://gerrit.wikimedia.org/r/244092 (owner: 10Dzahn) [23:24:33] (03CR) 10jenkins-bot: [V: 04-1] admin: add akumar, mnoushad to pentesters, bastionly [puppet] - 10https://gerrit.wikimedia.org/r/268823 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [23:24:43] Krinkle did the getAvailableBranchDirs commit [23:25:06] Yeah [23:25:11] Currently staging [23:25:19] confirmed on testwiki so ok to roll out, but haven't done that part yet [23:25:20] undone now [23:25:25] 10Ops-Access-Requests, 6operations: add mobrovac to mathoid admins (was: Allow mobrovac to start/stop/restart services on SCx) - https://phabricator.wikimedia.org/T125879#2004353 (10mobrovac) >>! In T125879#2003797, @Dzahn wrote: > Of the services on scb, you are an admin of: > > zotero, citoid, graphoid, mob... 
[23:25:41] 6operations: change labstore1001/1002 to cfq io scheduler - https://phabricator.wikimedia.org/T126090#2004354 (10chasemp) 3NEW [23:25:49] Krinkle, then I'll deploy my part and leave the rest to you [23:25:57] Portals patch is pulled but submodule update not applied it seems [23:26:24] yup, I'm figuring out if the tree is sane [23:26:37] I think puppet is faster on jessie, but I may be imagining things [23:28:18] 6operations, 7domains: traffic stats for typo domains - https://phabricator.wikimedia.org/T124237#2004361 (10Dzahn) I merged https://gerrit.wikimedia.org/r/244092 and deactivated the wikimediacommons.* domains --> parking [23:29:03] !log maxsem@mira Synchronized portals/prod/wikipedia.org/assets: (no message) (duration: 01m 19s) [23:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:30:22] !log maxsem@mira Synchronized portals: (no message) (duration: 01m 18s) [23:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:32:14] (03CR) 10Krinkle: [C: 032] multiversion: Create getAvailableBranchDirs() method [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268813 (owner: 10Krinkle) [23:32:24] (03PS2) 10Ori.livneh: Add 'Backend-Timing' response header on all Apaches [puppet] - 10https://gerrit.wikimedia.org/r/268821 [23:32:46] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2004378 (10Tgr) >>! In T124440#2004274, @csteipp wrote: > I think it is "worth it" to log everyone out Sure. My point is, setting `$wgAuthenticationTokenVersion` *wi... 
[23:32:56] (03CR) 10Ori.livneh: [C: 032 V: 032] Add 'Backend-Timing' response header on all Apaches [puppet] - 10https://gerrit.wikimedia.org/r/268821 (owner: 10Ori.livneh) [23:33:07] (03Merged) 10jenkins-bot: multiversion: Create getAvailableBranchDirs() method [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268813 (owner: 10Krinkle) [23:33:27] 10Ops-Access-Requests, 10Ops-Access-Reviews, 6operations, 3Discovery-Search-Sprint, 5Patch-For-Review: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#2004384 (10Dzahn) @Gehel Alright, so the LDAP group "wmf' was missing. I just added you to that, so you sh... [23:34:15] (03PS4) 10Krinkle: [DONT MERGE] Set $wgResourceBasePath to "/w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268715 (https://phabricator.wikimedia.org/T99096) [23:34:52] 6operations, 10Deployment-Systems, 6Performance-Team, 10Traffic, 5Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#2004391 (10Krinkle) [23:37:57] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2004414 (10csteipp) > Sure. My point is, setting $wgAuthenticationTokenVersion *will* log everyone out, but you can log back in by manipulating your token cookie the... 
[23:38:28] PROBLEM - puppet last run on strontium is CRITICAL: CRITICAL: Puppet has 1 failures [23:39:38] PROBLEM - puppet last run on mw1152 is CRITICAL: CRITICAL: Puppet has 1 failures [23:39:48] PROBLEM - puppet last run on mw1193 is CRITICAL: CRITICAL: Puppet has 1 failures [23:40:19] PROBLEM - puppet last run on mw2154 is CRITICAL: CRITICAL: Puppet has 1 failures [23:40:38] PROBLEM - puppet last run on mw2171 is CRITICAL: CRITICAL: Puppet has 2 failures [23:40:49] that's the puppetmaster getting graceful'd [23:40:55] these are ephemeral [23:41:19] PROBLEM - puppet last run on mw1071 is CRITICAL: CRITICAL: Puppet has 1 failures [23:41:37] PROBLEM - puppet last run on mw2148 is CRITICAL: CRITICAL: Puppet has 1 failures [23:41:47] PROBLEM - puppet last run on mw2115 is CRITICAL: CRITICAL: Puppet has 2 failures [23:42:08] PROBLEM - puppet last run on mw2104 is CRITICAL: CRITICAL: Puppet has 2 failures [23:42:27] PROBLEM - puppet last run on mw2013 is CRITICAL: CRITICAL: Puppet has 1 failures [23:42:49] (03PS1) 10Dzahn: admin: add mobrovac to mathoid and cxserver admins [puppet] - 10https://gerrit.wikimedia.org/r/268839 (https://phabricator.wikimedia.org/T125879) [23:42:50] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: add mobrovac to mathoid and cxserver admins (was: Allow mobrovac to start/stop/restart services on SCx) - https://phabricator.wikimedia.org/T125879#2004435 (10Dzahn) [23:45:18] RECOVERY - puppet last run on mw2115 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:45:39] RECOVERY - puppet last run on mw2104 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [23:47:44] 6operations: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2004459 (10chasemp) labs admins are trying to shape NFS traffic to make it more reliable and predictable Some data from the nfsiostat collector I added about a week ago write for exec nodes http://graphite.wmflabs.org/... 
[23:48:07] (03PS4) 10Ottomata: [WIP] Refactor manifests/role/analytics/* into modules/role, use hiera to configure [puppet] - 10https://gerrit.wikimedia.org/r/267797 (https://phabricator.wikimedia.org/T109859) [23:49:13] (03PS1) 10Yuvipanda: labstore: Add error checking + logging info to create-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/268841 [23:49:25] chasemp: ^ for create-dbusers [23:49:26] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Refactor manifests/role/analytics/* into modules/role, use hiera to configure [puppet] - 10https://gerrit.wikimedia.org/r/267797 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [23:49:36] I'll take care of it [23:49:41] ok thanks man [23:51:16] !log dropped old nfs snapshots from labstore1001 [23:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:51:44] (03CR) 10Yuvipanda: [C: 032] labstore: Add error checking + logging info to create-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/268841 (owner: 10Yuvipanda) [23:51:48] PROBLEM - puppet last run on cp2014 is CRITICAL: CRITICAL: Puppet has 1 failures [23:52:07] (03PS14) 10Krinkle: [WIP] Implement /w/static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) [23:52:14] 10Ops-Access-Requests, 6operations: let datacenter-ops read server logfiles - https://phabricator.wikimedia.org/T126018#2004471 (10Dzahn) a:3Dzahn [23:52:31] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: add mobrovac to mathoid and cxserver admins (was: Allow mobrovac to start/stop/restart services on SCx) - https://phabricator.wikimedia.org/T125879#2004473 (10Dzahn) a:3Dzahn [23:53:07] 6operations, 6Labs: One instance hammering on NFS should not make it unavailable to everyone else - https://phabricator.wikimedia.org/T95766#2004476 (10chasemp) on vm usage patterns for write https://phabricator.wikimedia.org/T126083#2004459 I am looking at shaping traffic using tc to some extent (though 
pre... [23:53:09] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [23:54:22] !log tc to shape some nfs read traffic in tools for labs (also logged there) can be cancelled with: /sbin/tc qdisc del dev eth0 root [23:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:54:33] 1 [23:54:36] 2 [23:54:36] chasemp: the createdb-users issue seems to have been a transient LDAP issue. I put in some additional error checking and logging and is ok now [23:54:38] 3 [23:54:41] !log nfs shaping is really writes :) [23:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:54:56] YuviPanda: ok I did start it up (or had puppet start it) and it died soon after [23:55:09] just fyi as I don't know more than that [23:55:13] chasemp: yeah, I just saw a run happen and it looked ok [23:55:19] kk [23:55:28] let's keep the logging and see? [23:55:54] all the previous crashes were the same error I put in handling for and added logging [23:55:56] yeah [23:56:39] 10Ops-Access-Requests, 6operations: let datacenter-ops read server logfiles - https://phabricator.wikimedia.org/T126018#2004485 (10Dzahn) p:5Triage>3Normal [23:56:43] sounds good [23:56:59] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: add mobrovac to mathoid and cxserver admins (was: Allow mobrovac to start/stop/restart services on SCx) - https://phabricator.wikimedia.org/T125879#2004489 (10Dzahn) p:5Triage>3High [23:57:21] 10Ops-Access-Requests, 6operations, 6Parsing-Team, 5Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#2004492 (10Dzahn) 5stalled>3Open