[00:00:00] Krenair: are you looking at any logs in particular? [00:00:03] no [00:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160120T0000). Please do the needful. [00:00:04] ebernhardson RoanKattouw Kaldari bd808 urandom: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:15] (03PS3) 10Reedy: Add lots of wfMsg*() for wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264023 (https://phabricator.wikimedia.org/T123583) [00:00:19] o/ [00:00:57] https://secure.php.net/ldap_start_tls [00:01:03] This function is currently not documented; only its argument list is available. [00:01:06] great. [00:01:53] \o [00:02:55] oops, now I’m banned [00:03:13] andrewbogott, banned? [00:03:22] 'Visitors to this wiki using your IP address have created 6 accounts in the last day [00:03:23] ' [00:03:36] haha, oh dear [00:03:49] will figure out how to unban you once I've done this swat [00:05:49] ebernhardson, ping [00:06:05] kaldari, urandom: ping [00:06:17] where is Roan... [00:06:35] Krenair: preemptive pong [00:06:58] I saw you wave earlier, and I just realised ebernhardson did too :P [00:07:16] his name has the exact same number of characters as andrew [00:07:49] Krenair: pong [00:07:51] me vs. Kr inkle must be so much worse for everyone else, on my screen we are distinguished by colour... 
[00:08:38] at least me and yuri.k have different char counts [00:08:55] grr [00:09:07] my notify is set to yuri :-P [00:09:18] :D [00:10:42] Krenair: here [00:10:58] !log krenair@tin Synchronized php-1.27.0-wmf.10/extensions/CirrusSearch/includes/ElasticsearchIntermediary.php: https://gerrit.wikimedia.org/r/#/c/264989/ (duration: 00m 32s) [00:11:01] ebernhardson, ^ [00:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:12:29] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: puppet fail [00:12:56] Krenair: Sorry, meeting went long, here now [00:12:58] Krenair: looks sane [00:13:25] There are still Avro errors coming through on fluorine :( [00:13:41] Krenair: i would expect they are backed up in syslog, have to wait a bit [00:13:48] true [00:13:56] actually, interesting pattern [00:14:08] all the ones coming through now are 'repeated x times' [00:14:16] Krenair: those would be the backed up ones :) [00:14:25] where x is usually > 1000 [00:14:31] so this is definitely counting backed up ones, yeah [00:14:49] but i'm seeing traffic into kafka, so i think its working as expected again [00:15:00] (03CR) 10Madhuvishy: apache: Add http to https redirection for simplestatic role (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/265162 (owner: 10Madhuvishy) [00:15:04] (03PS2) 10Madhuvishy: apache: Add http to https redirection for simplestatic role [puppet] - 10https://gerrit.wikimedia.org/r/265162 [00:16:32] next CirrusSearch one looks safe for sync-dir [00:17:13] !log krenair@tin Synchronized php-1.27.0-wmf.10/extensions/CirrusSearch: https://gerrit.wikimedia.org/r/#/c/265146/ (duration: 00m 33s) [00:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:17:25] ebernhardson, ^ [00:17:29] please check [00:17:57] (03PS3) 10Madhuvishy: apache: Add http to https redirection for simplestatic role [puppet] - 10https://gerrit.wikimedia.org/r/265162 [00:18:04] (03PS3) 10Alex Monk: 
Add wgScriptPath to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264260 (owner: 10Catrope) [00:18:20] (03CR) 10Alex Monk: [C: 032] Add wgScriptPath to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264260 (owner: 10Catrope) [00:18:28] Krenair: nothing seems out of place, will have to wait a bit to see what comes out of graphite [00:18:48] (03Merged) 10jenkins-bot: Add wgScriptPath to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264260 (owner: 10Catrope) [00:21:39] !log krenair@tin Synchronized wmf-config/InitialiseSettings-labs.php: https://gerrit.wikimedia.org/r/#/c/264260/ (duration: 00m 32s) [00:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:21:57] RoanKattouw, syncing [00:22:25] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/264260/ (duration: 00m 32s) [00:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:24:19] RoanKattouw, all good? [00:25:07] PROBLEM - puppet last run on mw2193 is CRITICAL: CRITICAL: puppet fail [00:29:40] RoanKattouw, ... [00:30:05] Krenair, it's not showing up yet. I imagine he's troubleshooting. [00:31:17] (i.e. the "Enhanced notifications" betafeature should now be visible at https://test2.wikipedia.org/wiki/Special:Preferences#mw-prefsection-betafeatures but is not) [00:31:52] (unless that's a different patch!) [00:33:21] (ah, it is. nvm me.) [00:33:30] I was about to say.. 
[00:35:14] I'll leave it on for now, skipping his other patches [00:35:17] kaldari, ping [00:35:27] PROBLEM - puppet last run on db2066 is CRITICAL: CRITICAL: puppet fail [00:35:42] here and ready to test [00:36:04] (03PS2) 10Alex Monk: Disable active user gadget stats on testwiki (in preparation for enwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264961 (https://phabricator.wikimedia.org/T121949) (owner: 10Kaldari) [00:36:17] (03CR) 10Alex Monk: [C: 032] Disable active user gadget stats on testwiki (in preparation for enwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264961 (https://phabricator.wikimedia.org/T121949) (owner: 10Kaldari) [00:36:41] (03Merged) 10jenkins-bot: Disable active user gadget stats on testwiki (in preparation for enwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264961 (https://phabricator.wikimedia.org/T121949) (owner: 10Kaldari) [00:37:36] Krenair: Yup the wgScriptPath patch is working [00:37:37] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [00:37:38] Sorry for the delay [00:37:38] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/264961/ (duration: 00m 33s) [00:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:38:04] and my absent-mindedness this afternoon in general [00:38:17] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/264961/ (duration: 00m 31s) [00:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:39:28] Krenair: Looks good [00:40:16] thanks [00:43:50] (03PS2) 10Alex Monk: Add cross-wiki notifications to beta features whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264920 (owner: 10Catrope) [00:44:30] (03CR) 10Alex Monk: [C: 032] Add cross-wiki notifications to beta features whitelist [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/264920 (owner: 10Catrope) [00:48:51] bd808, syncing [00:49:05] !log restbase deploy start of d621b76 [00:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:49:20] !log krenair@tin Synchronized php-1.27.0-wmf.10/extensions/MobileFrontend/includes/api/ApiMobileView.php: https://gerrit.wikimedia.org/r/#/c/264973/ (duration: 00m 32s) [00:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:49:30] andrewbogott: I've made the patches to the 3 extensions to fix the wfMsg() stuff.. Will get someone (Florian?) to review it, then we can update the branches used in wmf.10/wmf.11 and carry on with Wikitech on a newer version [00:49:45] Reedy: ok, thank you. [00:49:49] And also: I’m sorry. [00:50:01] Krenair: Looks like it worked :) My error reproduction url is returning good results now [00:50:05] indeed [00:50:13] logs look better too [00:51:55] waiting for jenkins... [00:52:07] (03Merged) 10jenkins-bot: Add cross-wiki notifications to beta features whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264920 (owner: 10Catrope) [00:53:04] (03Abandoned) 10Reedy: Add lots of wfMsg*() for wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264023 (https://phabricator.wikimedia.org/T123583) (owner: 10Reedy) [00:53:07] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/264920/ (duration: 00m 33s) [00:53:08] (03PS3) 10Alex Monk: Enable cross-wiki notifications beta feature on testwiki and test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264917 (owner: 10Catrope) [00:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:53:24] RoanKattouw, ^ [00:53:33] RoanKattouw, there's nothing to test until after this next patch, right? 
[00:53:56] Krenair: Yeah that's right [00:54:08] RECOVERY - puppet last run on mw2193 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:54:13] (03CR) 10Alex Monk: [C: 032] Enable cross-wiki notifications beta feature on testwiki and test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264917 (owner: 10Catrope) [00:54:32] Is it just me, or is jenkins much slower than usual for this repo? [00:55:56] (03Merged) 10jenkins-bot: Enable cross-wiki notifications beta feature on testwiki and test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264917 (owner: 10Catrope) [00:56:34] !log delete from localuser where lu_name ="Αντώνης Μανιός" and lu_wiki ="mediawikiwiki" limit 1 on centralauth db for T119736 [00:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:57:05] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/264917/ (duration: 00m 32s) [00:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:58:54] RoanKattouw, quiddity: ^ [00:59:33] woo! works. 
:) [00:59:46] (03PS1) 10Yuvipanda: toollabs: Stop specifying $gridmaster in role [puppet] - 10https://gerrit.wikimedia.org/r/265175 [01:00:08] RECOVERY - puppet last run on db2066 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [01:00:22] (03PS2) 10Yuvipanda: toollabs: Stop specifying $gridmaster in role [puppet] - 10https://gerrit.wikimedia.org/r/265175 [01:01:16] !log restbase deploy end of d621b76 [01:01:19] (03CR) 10Yuvipanda: [C: 032] apache: Add http to https redirection for simplestatic role [puppet] - 10https://gerrit.wikimedia.org/r/265162 (owner: 10Madhuvishy) [01:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:02:02] (03PS3) 10Yuvipanda: toollabs: Stop specifying $gridmaster in role [puppet] - 10https://gerrit.wikimedia.org/r/265175 [01:02:10] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: Stop specifying $gridmaster in role [puppet] - 10https://gerrit.wikimedia.org/r/265175 (owner: 10Yuvipanda) [01:03:20] oh, there's another patch for swat [01:03:39] though the user never showed up [01:03:55] and: krenair@tin:/srv/mediawiki-staging (master)$ date [01:03:55] Wed Jan 20 01:03:47 UTC 2016 [01:04:24] (03CR) 10Alex Monk: Enable EventBus extension (post-deploy) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265142 (https://phabricator.wikimedia.org/T116786) (owner: 10Eevans) [01:05:41] (03PS1) 10Yuvipanda: toollabs: Do not use inheritance for submit role [puppet] - 10https://gerrit.wikimedia.org/r/265176 [01:05:51] (03PS2) 10Yuvipanda: toollabs: Do not use inheritance for submit role [puppet] - 10https://gerrit.wikimedia.org/r/265176 [01:06:10] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: Do not use inheritance for submit role [puppet] - 10https://gerrit.wikimedia.org/r/265176 (owner: 10Yuvipanda) [01:08:26] Krenair: any idea how I can reset the account creation limit? 
[01:08:35] oh, right [01:08:36] Oh, oops, you’re swatting still [01:08:40] knew there was something I forgot [01:08:41] nope [01:08:48] (03PS1) 10Aaron Schulz: Set $wgCentralAuthUseSlaves for loginwiki, mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265178 [01:08:57] (03PS2) 10Aaron Schulz: Set $wgCentralAuthUseSlaves for loginwiki, mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265178 (https://phabricator.wikimedia.org/T119689) [01:09:31] apparently the standard solution is to use an account with the account creater priv [01:09:35] creator [01:09:38] which won’t help me much [01:10:55] Use throttle.php! [01:11:29] (03CR) 10Eevans: Enable EventBus extension (post-deploy) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265142 (https://phabricator.wikimedia.org/T116786) (owner: 10Eevans) [01:11:38] throttle.php does the opposite of throttling? [01:11:48] (03PS5) 10Eevans: Enable EventBus extension (post-deploy) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265142 (https://phabricator.wikimedia.org/T116786) [01:11:49] It's for adding throttle exceptions [01:12:06] publicly, typically for academic etc. IPs [01:12:17] * YuviPanda commits a revert to revert a commit [01:13:42] andrewbogott, I'm just going to make it forget you tried to create so many accounts. mind PMing me your IP, or writing it to a file somewhere or something? [01:14:21] just restart memcached/redis on teh box? 
:P [01:14:24] (03PS5) 10Aaron Schulz: Configure $wgCdnReboundPurgeDelay [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258365 (https://phabricator.wikimedia.org/T113192) [01:14:27] lol [01:15:25] (03PS1) 10Yuvipanda: toollabs: Don't setup exec_environ in submit hosts [puppet] - 10https://gerrit.wikimedia.org/r/265180 (https://phabricator.wikimedia.org/T124014) [01:16:08] Certainly the quickest way [01:16:39] (03PS2) 10Yuvipanda: toollabs: Don't setup exec_environ in submit hosts [puppet] - 10https://gerrit.wikimedia.org/r/265180 (https://phabricator.wikimedia.org/T124014) [01:16:51] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: Don't setup exec_environ in submit hosts [puppet] - 10https://gerrit.wikimedia.org/r/265180 (https://phabricator.wikimedia.org/T124014) (owner: 10Yuvipanda) [01:16:53] Krenair: when I try to create an account, ldap says: [01:17:06] https://www.irccloud.com/pastebin/bMOZR5AK/ [01:17:10] which seems fine [01:17:21] Reedy, as opposed to $wgMemc->delete( wfMemcKey( 'acctcreate', 'ip', $ip ) ) ? [01:17:56] well, or not ‘fine’ but it seems not to be a tls issue [01:18:36] maybe it just can’t write to the db [01:18:53] guess I'll have to dig into the ldap_start_tls source code [01:19:16] Krenair: why? Why do you think it has anything to do with ldap? [01:19:43] the message I saw earlier in the ldap debug log was that it failed to start TLS [01:20:00] I think that paste is mw authorizing with ldap, succeeding, then failing at something else so closing the ldap session in despair [01:20:29] ok, let me try again but with the mw logs as well as the slapd logs [01:22:05] yeah, same. bind and unbind in slapd, no log at all on fluorine [01:22:09] It was this: [01:22:10] 31<Krenair>30 2016-01-19 23:40:10 labtestweb2001 labtestwiki ldap INFO: 2.1.0 Failed to start TLS. 
[01:22:31] according to the Extension:LdapAuthentication source code, that's ldap_start_tls returning something false-y
[01:22:39] (PS1) Yuvipanda: toollabs: Add cronrunner role [puppet] - https://gerrit.wikimedia.org/r/265183 (https://phabricator.wikimedia.org/T123873)
[01:22:58] (CR) jenkins-bot: [V: -1] toollabs: Add cronrunner role [puppet] - https://gerrit.wikimedia.org/r/265183 (https://phabricator.wikimedia.org/T123873) (owner: Yuvipanda)
[01:22:58] what logfile did that come from?
[01:24:39] (CR) Aaron Schulz: [C: 2] Configure $wgCdnReboundPurgeDelay [mediawiki-config] - https://gerrit.wikimedia.org/r/258365 (https://phabricator.wikimedia.org/T113192) (owner: Aaron Schulz)
[01:25:07] (Merged) jenkins-bot: Configure $wgCdnReboundPurgeDelay [mediawiki-config] - https://gerrit.wikimedia.org/r/258365 (https://phabricator.wikimedia.org/T113192) (owner: Aaron Schulz)
[01:27:40] !log aaron@tin Synchronized wmf-config: Configure $wgCdnReboundPurgeDelay (duration: 00m 32s)
[01:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:27:50] andrewbogott, this was the ldap debug thing...
where you go into wikitech.php and uncomment the if (false)
[01:27:54] sorry
[01:27:57] comment the if(false)
[01:28:00] or change it to if(true)
[01:28:03] or whatever
[01:28:31] (PS2) Yuvipanda: toollabs: Add cronrunner role [puppet] - https://gerrit.wikimedia.org/r/265183 (https://phabricator.wikimedia.org/T123873)
[01:28:46] ah, I see, ok
[01:28:52] * andrewbogott tries one more time
[01:29:19] yeah, ok, I see it
[01:29:22] hm
[01:30:08] (PS1) Kaldari: Disable active gadget user stats on enwiki since it takes too long [mediawiki-config] - https://gerrit.wikimedia.org/r/265185 (https://phabricator.wikimedia.org/T121949)
[01:30:19] (CR) Yuvipanda: [C: 2] toollabs: Add cronrunner role [puppet] - https://gerrit.wikimedia.org/r/265183 (https://phabricator.wikimedia.org/T123873) (owner: Yuvipanda)
[01:30:50] (PS2) Kaldari: Disable active gadget user stats on enwiki since it takes too long [mediawiki-config] - https://gerrit.wikimedia.org/r/265185 (https://phabricator.wikimedia.org/T121949)
[01:31:05] $this->printDebug( "Using TLS", SENSITIVE );
[01:31:05] if ( !ldap_start_tls( $this->ldapconn ) ) {
[01:31:05] $this->printDebug( "Failed to start TLS.", SENSITIVE );
[01:31:09] From LdapAuthentication
[01:31:23] Now that ldap_start_tls function is (very helpfully) undocumented
[01:31:42] you’re right. I’m surprised that the server side reports success
[01:31:44] According to your server logs, it's actually connecting and then unbinding, right?
[01:31:47] yeah
[01:32:06] let me look at the cert again
[01:32:13] okay, which leads me to suspect that there's undocumented (okay, all behaviour of that function is undocumented, but this is totally unexpected) behaviour in that function
[01:33:40] Having looked at puppet, I think the issue is just that the cert is misnamed.
I installed the labs-ldap.codfw.wikimedia.org cert
[01:33:46] which doesn’t match the hostname, so
[01:33:50] probably the client just rejects it
[01:34:13] so, hm, can I do this with a self-signed cert? I suppose not :(
[01:36:43] Krenair: sorry, this is a pretty obvious mistake
[01:38:04] bd808, do we get a log of PHP warnings?
[01:42:51] (PS1) Yuvipanda: toollabs: Include toollabs init in cronrunner [puppet] - https://gerrit.wikimedia.org/r/265187
[01:43:05] (PS2) Yuvipanda: toollabs: Include toollabs init in cronrunner [puppet] - https://gerrit.wikimedia.org/r/265187
[01:43:45] No, it's far too spammy
[01:43:49] So we disable it
[01:46:16] (CR) Yuvipanda: [C: 2] toollabs: Include toollabs init in cronrunner [puppet] - https://gerrit.wikimedia.org/r/265187 (owner: Yuvipanda)
[01:46:53] operations: Google Webmaster Tools - 1000 domain limit - https://phabricator.wikimedia.org/T99132#1946755 (Tbayer) Update: @dbrant was able to successfully add the Wikipedia app to the Webmaster Tools of the "android@wikimedia.org" group. (It's not yet clear what this means for Search Console features that re...
[01:47:30] andrewbogott, do ops have a CA that could be added to the trusted list and then used to sign the cert?
[01:48:00] If we do, I don’t think I’ve ever done anything like that.
[01:48:10] It’s not a bad idea for internal services though.
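(Editor's sketch, not part of the log.) The diagnosis above is that the installed cert is named labs-ldap.codfw.wikimedia.org while the client connects to a different hostname, so the TLS start probably fails on the client side even though slapd logs a clean bind and unbind. The Python below is a rough illustration of that kind of name check, not the code path PHP's ldap_start_tls actually takes. The function names and CONNECT_HOST are hypothetical; only the cert name comes from the log. The wildcard rule is a simplified assumption (a '*' is honoured only as the whole left-most label).

```python
from typing import Iterable


def name_matches(pattern: str, hostname: str) -> bool:
    """Return True if one certificate name covers the hostname.

    Simplified rule: case-insensitive comparison, and '*' is only
    accepted as the entire left-most label (e.g. '*.example.org').
    """
    p_labels = pattern.lower().rstrip(".").split(".")
    h_labels = hostname.lower().rstrip(".").split(".")
    if len(p_labels) != len(h_labels):
        return False
    if p_labels[0] == "*":
        # A wildcard stands in for exactly one label; the remaining
        # labels must match literally, and '*.tld' is refused.
        return len(p_labels) > 2 and p_labels[1:] == h_labels[1:]
    return p_labels == h_labels


def cert_covers_host(cert_names: Iterable[str], hostname: str) -> bool:
    """Return True if any name on the certificate covers the hostname."""
    return any(name_matches(name, hostname) for name in cert_names)


# Name on the installed cert, as stated in the log.
CERT_NAMES = ["labs-ldap.codfw.wikimedia.org"]
# Hypothetical hostname the client connects to (not given in the log).
CONNECT_HOST = "ldap-test.example.org"

print(cert_covers_host(CERT_NAMES, CONNECT_HOST))
print(cert_covers_host(CERT_NAMES, "labs-ldap.codfw.wikimedia.org"))
```

With these names the first check fails and the second passes, which fits the suggested fix of issuing a cert for the right hostname, signed by a CA the client trusts.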
[01:54:47] (03PS1) 10Dduvall: Establish `group2` and `exempt` dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265189 [01:55:18] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CRITICAL: Puppet has 1 failures [01:56:27] (03CR) 10Dduvall: Establish `group2` and `exempt` dblists (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265189 (owner: 10Dduvall) [02:00:54] (03CR) 10Alex Monk: Establish `group2` and `exempt` dblists (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265189 (owner: 10Dduvall) [02:12:29] (03PS1) 10Yuvipanda: toollabs: Puppetize jlocal [puppet] - 10https://gerrit.wikimedia.org/r/265191 [02:12:44] (03PS2) 10Yuvipanda: toollabs: Puppetize jlocal [puppet] - 10https://gerrit.wikimedia.org/r/265191 [02:12:49] (03CR) 10jenkins-bot: [V: 04-1] toollabs: Puppetize jlocal [puppet] - 10https://gerrit.wikimedia.org/r/265191 (owner: 10Yuvipanda) [02:13:44] (03PS3) 10Yuvipanda: toollabs: Puppetize jlocal [puppet] - 10https://gerrit.wikimedia.org/r/265191 [02:14:04] (03CR) 10jenkins-bot: [V: 04-1] toollabs: Puppetize jlocal [puppet] - 10https://gerrit.wikimedia.org/r/265191 (owner: 10Yuvipanda) [02:14:15] (03PS4) 10Yuvipanda: toollabs: Puppetize jlocal [puppet] - 10https://gerrit.wikimedia.org/r/265191 [02:15:03] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: Puppetize jlocal [puppet] - 10https://gerrit.wikimedia.org/r/265191 (owner: 10Yuvipanda) [02:20:27] RECOVERY - puppet last run on snapshot1002 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [02:22:55] (03PS1) 10Yuvipanda: toollabs: Use hiera to figure out where cron runner host is [puppet] - 10https://gerrit.wikimedia.org/r/265192 (https://phabricator.wikimedia.org/T123873) [02:23:16] (03CR) 10jenkins-bot: [V: 04-1] toollabs: Use hiera to figure out where cron runner host is [puppet] - 10https://gerrit.wikimedia.org/r/265192 (https://phabricator.wikimedia.org/T123873) (owner: 10Yuvipanda) [02:23:27] (03PS2) 10Yuvipanda: 
toollabs: Use hiera to figure out where cron runner host is [puppet] - 10https://gerrit.wikimedia.org/r/265192 (https://phabricator.wikimedia.org/T123873) [02:24:01] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: Use hiera to figure out where cron runner host is [puppet] - 10https://gerrit.wikimedia.org/r/265192 (https://phabricator.wikimedia.org/T123873) (owner: 10Yuvipanda) [02:33:07] (03PS1) 10Yuvipanda: toollabs: Move bigbrother to services nodes [puppet] - 10https://gerrit.wikimedia.org/r/265193 (https://phabricator.wikimedia.org/T123873) [02:35:11] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 11m 20s) [02:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:39:47] (03PS2) 10Yuvipanda: toollabs: Move bigbrother to services nodes [puppet] - 10https://gerrit.wikimedia.org/r/265193 (https://phabricator.wikimedia.org/T123873) [02:39:49] (03PS1) 10Yuvipanda: toollabs: Backup crontabs from cronrunner [puppet] - 10https://gerrit.wikimedia.org/r/265194 (https://phabricator.wikimedia.org/T123873) [02:42:16] (03PS1) 10Yuvipanda: toollabs: Move updatetools to run on services host [puppet] - 10https://gerrit.wikimedia.org/r/265195 (https://phabricator.wikimedia.org/T123873) [02:48:42] (03CR) 10Yuvipanda: [C: 032] toollabs: Backup crontabs from cronrunner [puppet] - 10https://gerrit.wikimedia.org/r/265194 (https://phabricator.wikimedia.org/T123873) (owner: 10Yuvipanda) [03:02:48] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 10m 06s) [03:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:09:52] (03PS1) 10Yuvipanda: toollabs: Remove unneeded inheritance in checker role [puppet] - 10https://gerrit.wikimedia.org/r/265197 [03:09:54] (03PS1) 10Yuvipanda: toollabs: Remove inheritance from services role [puppet] - 10https://gerrit.wikimedia.org/r/265198 [03:09:56] (03PS1) 10Yuvipanda: toollabs: Point shadow to correct master host [puppet] - 
10https://gerrit.wikimedia.org/r/265199 [03:11:57] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: puppet fail [03:25:13] (03PS1) 10Yuvipanda: tools: Remove role inheritance from static hosts [puppet] - 10https://gerrit.wikimedia.org/r/265202 [03:25:15] (03PS1) 10Yuvipanda: toollabs: Remove inheritance from bastion role [puppet] - 10https://gerrit.wikimedia.org/r/265203 [03:25:17] (03PS1) 10Yuvipanda: toollabs: Remove inheritance in role from compute [puppet] - 10https://gerrit.wikimedia.org/r/265204 [03:25:19] (03PS1) 10Yuvipanda: toollabs: Remove inheritance from mailrelay [puppet] - 10https://gerrit.wikimedia.org/r/265205 [03:29:41] (03PS1) 10Yuvipanda: toollabs: Move toolwatcher to services [puppet] - 10https://gerrit.wikimedia.org/r/265206 (https://phabricator.wikimedia.org/T123873) [03:33:24] (03PS1) 10Yuvipanda: toollabs: Remove inheritance from gridengine master role [puppet] - 10https://gerrit.wikimedia.org/r/265207 [03:33:26] (03PS1) 10Yuvipanda: toollabs: Remove role inheritance from gridengine shadow [puppet] - 10https://gerrit.wikimedia.org/r/265208 [03:37:19] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.11) (duration: 16m 21s) [03:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:39:09] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:44:48] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Jan 20 03:44:48 UTC 2016 (duration 7m 29s) [03:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:43:28] PROBLEM - puppet last run on mw1162 is CRITICAL: CRITICAL: Puppet has 62 failures [04:54:08] PROBLEM - RAID on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:54:18] PROBLEM - SSH on mw1162 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:10:08] PROBLEM - nutcracker port on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[05:14:17] PROBLEM - configured eth on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:16:28] RECOVERY - nutcracker port on mw1162 is OK: TCP OK - 0.000 second response time on port 11212 [05:21:18] PROBLEM - DPKG on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:23:08] PROBLEM - nutcracker process on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:28:58] PROBLEM - nutcracker port on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:30:07] PROBLEM - salt-minion processes on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:34:18] RECOVERY - salt-minion processes on mw1162 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [05:35:17] PROBLEM - dhclient process on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:40:37] PROBLEM - salt-minion processes on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:44:38] PROBLEM - Disk space on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:59:17] RECOVERY - Disk space on mw1162 is OK: DISK OK [06:01:37] RECOVERY - salt-minion processes on mw1162 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [06:05:48] PROBLEM - Disk space on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:06:47] RECOVERY - dhclient process on mw1162 is OK: PROCS OK: 0 processes with command name dhclient [06:06:48] RECOVERY - nutcracker port on mw1162 is OK: TCP OK - 0.000 second response time on port 11212 [06:07:48] RECOVERY - Disk space on mw1162 is OK: DISK OK [06:09:28] RECOVERY - nutcracker process on mw1162 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [06:09:58] PROBLEM - salt-minion processes on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:12:58] PROBLEM - dhclient process on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:12:58] PROBLEM - nutcracker port on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:15:47] PROBLEM - nutcracker process on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:17:07] RECOVERY - dhclient process on mw1162 is OK: PROCS OK: 0 processes with command name dhclient [06:17:08] RECOVERY - nutcracker port on mw1162 is OK: TCP OK - 0.000 second response time on port 11212 [06:18:17] RECOVERY - salt-minion processes on mw1162 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [06:22:07] RECOVERY - DPKG on mw1162 is OK: All packages OK [06:23:18] PROBLEM - dhclient process on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:23:27] PROBLEM - nutcracker port on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:24:25] !log codfw elasticsearch cluster stopped responding during load test, idling test to see if it recovers [06:24:28] PROBLEM - salt-minion processes on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:26:38] RECOVERY - salt-minion processes on mw1162 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [06:28:27] PROBLEM - DPKG on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:29:37] RECOVERY - dhclient process on mw1162 is OK: PROCS OK: 0 processes with command name dhclient [06:29:47] RECOVERY - nutcracker port on mw1162 is OK: TCP OK - 0.000 second response time on port 11212 [06:30:38] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:08] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: puppet fail [06:31:19] PROBLEM - puppet last run on mw1112 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:19] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:19] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:47] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:17] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 4 failures [06:32:18] PROBLEM - puppet last run on nobelium is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:18] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 3 failures [06:32:19] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:29] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:48] RECOVERY - DPKG on mw1162 is OK: All packages OK [06:33:09] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: puppet fail [06:33:47] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 3 failures [06:36:07] PROBLEM - dhclient process on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:36:08] PROBLEM - nutcracker port on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:40:08] RECOVERY - nutcracker port on mw1162 is OK: TCP OK - 0.000 second response time on port 11212 [06:40:08] RECOVERY - dhclient process on mw1162 is OK: PROCS OK: 0 processes with command name dhclient [06:40:58] PROBLEM - DPKG on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:43:29] PROBLEM - salt-minion processes on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:43:29] PROBLEM - Disk space on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:46:28] PROBLEM - nutcracker port on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:46:28] PROBLEM - dhclient process on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:49:47] RECOVERY - Disk space on mw1162 is OK: DISK OK [06:53:57] RECOVERY - salt-minion processes on mw1162 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [06:56:18] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:57:09] RECOVERY - puppet last run on nobelium is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:57:17] RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:57:28] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:57:47] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:17] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:38] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:48] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:17] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:47] RECOVERY - nutcracker process on mw1162 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [06:59:47] RECOVERY - DPKG on mw1162 is OK: All packages OK [07:00:08] RECOVERY - puppet last run on cp4013 
is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:00:28] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:01:28] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:02:38] RECOVERY - puppet last run on mw1112 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:03:19] RECOVERY - nutcracker port on mw1162 is OK: TCP OK - 0.000 second response time on port 11212 [07:03:19] RECOVERY - dhclient process on mw1162 is OK: PROCS OK: 0 processes with command name dhclient [07:03:19] RECOVERY - configured eth on mw1162 is OK: OK - interfaces up [07:09:38] PROBLEM - configured eth on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:13:48] PROBLEM - nutcracker port on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:28] PROBLEM - nutcracker process on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:14:57] PROBLEM - Disk space on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:16:38] PROBLEM - DPKG on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:20:07] PROBLEM - dhclient process on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:21:08] PROBLEM - salt-minion processes on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:22:47] RECOVERY - nutcracker process on mw1162 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [07:22:47] RECOVERY - DPKG on mw1162 is OK: All packages OK [07:23:17] RECOVERY - salt-minion processes on mw1162 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [07:23:17] RECOVERY - Disk space on mw1162 is OK: DISK OK [07:24:03] !log ebernhardson@tin Synchronized php-1.27.0-wmf.10/extensions/CirrusSearch/includes/DataSender.php: stop checking for frozen indices while codfw elasticsearch recovers (duration: 01m 42s) [07:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:24:17] RECOVERY - nutcracker port on mw1162 is OK: TCP OK - 0.000 second response time on port 11212 [07:24:17] RECOVERY - dhclient process on mw1162 is OK: PROCS OK: 0 processes with command name dhclient [07:28:18] RECOVERY - configured eth on mw1162 is OK: OK - interfaces up [07:29:37] RECOVERY - SSH on mw1162 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0) [07:35:58] PROBLEM - SSH on mw1162 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:41:08] PROBLEM - configured eth on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:48:07] PROBLEM - nutcracker process on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:48:07] PROBLEM - DPKG on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:49:28] PROBLEM - nutcracker port on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:54:47] PROBLEM - salt-minion processes on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:54:47] PROBLEM - Disk space on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:56:11] <_joe_> !log powercycling mw1162, unable to login from console, memory exhaustion [07:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:57:55] _joe_: why does memory growth make the machine unresponsive? isn't it the job of the oom killer to prevent that from happening? do we have a kernel parameter set to the wrong value? [07:58:27] PROBLEM - puppet last run on mw2193 is CRITICAL: CRITICAL: Puppet has 1 failures [07:58:33] of course, it is also a problem that memory leaks. but it should not require manual intervention like this. [07:58:50] RECOVERY - SSH on mw1162 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0) [07:58:54] the OOM killer should kill HHVM and upstart should restart it [07:58:57] <_joe_> ori: as for the oom killer, unluckily it's not always effective [07:59:22] <_joe_> it might be we have it set to kill processes just when too little memory is available [07:59:31] <_joe_> also, we use no swap (on purpose) [07:59:47] RECOVERY - nutcracker port on mw1162 is OK: TCP OK - 0.000 second response time on port 11212 [07:59:47] RECOVERY - configured eth on mw1162 is OK: OK - interfaces up [07:59:52] Kernel tries OOM SIGKILL! It's not very effective. 
[08:00:27] RECOVERY - nutcracker process on mw1162 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [08:00:27] RECOVERY - DPKG on mw1162 is OK: All packages OK [08:00:58] RECOVERY - salt-minion processes on mw1162 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:00:58] RECOVERY - Disk space on mw1162 is OK: DISK OK [08:00:58] RECOVERY - RAID on mw1162 is OK: OK: no RAID installed [08:02:10] <_joe_> AFAICT, the defaults of the oom killer should work for us [08:04:13] i think you're right, but there is still room for tweaking [08:04:17] so taking mw1160 as the example [08:05:49] <_joe_> uhm also, mw1162 has a cronjob that should've killed hhvm [08:06:18] not if the process hasn't been up for very long [08:06:29] something could have restarted it in the past 24h [08:06:37] the cronjob checks the process age [08:06:38] <_joe_> nope, I changed the logic [08:06:53] <_joe_> it checks the percentage of memory used by the hhvm process now [08:07:02] <_joe_> just on those two machines [08:07:07] <_joe_> dedicated to gwt jobs [08:07:20] well, it must not work [08:07:22] <_joe_> and it runs hourly [08:07:47] <_joe_> well, it worked quite well in the past, probably just need to tune down the percentage of memory at which the kill happens [08:08:01] <_joe_> https://ganglia.wikimedia.org/latest/?r=week&cs=1%2F14%2F2016+4%3A47&ce=1%2F20%2F2016+5%3A49&m=cpu_report&c=Jobrunners+eqiad&h=mw1162.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=NOGROUPS [08:08:17] <_joe_> the two preceding drops are kills from the cron AFAICT [08:08:46] ok, so that's one thing to try (tuning down the percentage) [08:08:48] <_joe_> since it just takes 2 hours from 0 to 100%, I should definitely tune it down [08:09:33] we should really upgrade [08:09:38] on mw1160 i see that vm.overcommit_memory is 0. 
the important thing is that it is not 2, so the oom killer is not disabled [08:09:51] <_joe_> yeah checked that first thing :) [08:10:13] <_joe_> but yeah, we're almost at the point of upgrading hhvm [08:10:28] (03CR) 10Alexandros Kosiaris: [V: 04-1] "Puppet compiler complains on this https://puppet-compiler.wmflabs.org/1628/sca1002.eqiad.wmnet/change.sca1002.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/264719 (https://phabricator.wikimedia.org/T123906) (owner: 10KartikMistry) [08:10:28] /proc/`pidof -s hhvm`/oom_adj is 0, which means HHVM is killable [08:10:40] <_joe_> checked as well [08:11:04] <_joe_> oom kills hhvm plenty of time btw [08:11:12] and /proc/`pidof -s hhvm`/oom_score is 61; the fact that it is nonzero means it can get killed [08:11:15] <_joe_> it's just not a very reliable mechanism [08:12:02] we don't set vm.panic_on_oom=1 [08:12:08] (03CR) 10Alexandros Kosiaris: [C: 04-1] cxserver: Add all available source languages for Russian in Yandex MT (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/264719 (https://phabricator.wikimedia.org/T123906) (owner: 10KartikMistry) [08:12:09] so we don't get automatic reboots [08:12:10] <_joe_> no, and we should not [08:12:29] why not? 
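For reference, the values checked by hand above (vm.overcommit_memory, vm.panic_on_oom, and the per-process oom_score) can be collected in one pass. What follows is a minimal Python sketch, assuming a Linux /proc layout. The helper names read_int, oom_report and is_oom_killable, and the proc_root parameter, are made up for illustration and are not part of any tooling mentioned in this channel. It reads oom_score_adj rather than oom_adj, since oom_adj is the older, deprecated interface.

```python
from pathlib import Path


def read_int(path):
    """Return the integer stored in a /proc-style file, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def oom_report(pid, proc_root="/proc"):
    """Collect the OOM-related settings discussed above for one process.

    proc_root is a hypothetical parameter so the function can be pointed
    at a fake directory tree when testing.
    """
    root = Path(proc_root)
    return {
        "overcommit_memory": read_int(root / "sys/vm/overcommit_memory"),
        "panic_on_oom": read_int(root / "sys/vm/panic_on_oom"),
        "oom_kill_allocating_task": read_int(root / "sys/vm/oom_kill_allocating_task"),
        "oom_score": read_int(root / str(pid) / "oom_score"),
        "oom_score_adj": read_int(root / str(pid) / "oom_score_adj"),
    }


def is_oom_killable(report):
    """True when oom_score_adj is known and above -1000.

    An oom_score_adj of -1000 exempts a process from the OOM killer.
    """
    adj = report["oom_score_adj"]
    return adj is not None and adj > -1000


if __name__ == "__main__":
    import os

    # Inspect the current process; on a real host one would pass the HHVM pid.
    print(oom_report(os.getpid()))
```

Run against the pid from `pidof -s hhvm`, this would show the same numbers quoted above in a single dictionary. It only reads values and changes nothing.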
[08:12:37] <_joe_> most of the times oom works fine, and saves us a full reboot [08:12:59] <_joe_> an unclean one, too [08:13:06] <_joe_> (not really a problem though) [08:13:27] i don't vm.panic_on_oom means the OOM killer doesn't try to sigkill some processes first [08:13:32] *don't think [08:13:52] <_joe_> uhm maybe I recall wrong [08:14:16] <_joe_> but panic_on_oom was intended to cause a kernel panic instead of activating the oom killer [08:14:21] * _joe_ looks back at the manual [08:14:31] i might be wrong [08:15:17] <_joe_> oom_kill_allocating_task [08:15:23] <_joe_> this could be tweaked maybe [08:15:38] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: puppet fail [08:15:46] <_joe_> https://www.kernel.org/doc/Documentation/sysctl/vm.txt [08:15:57] <_joe_> "If this is set to 1, the kernel panics when out-of-memory happens. [08:16:00] <_joe_> However, if a process limits using nodes by mempolicy/cpusets, [08:16:02] <_joe_> and those nodes become memory exhaustion status, one process [08:16:05] <_joe_> may be killed by oom-killer. No panic occurs in this case. [08:16:07] <_joe_> Because other nodes' memory may be free. This means system total status [08:16:10] <_joe_> may be not fatal yet. [08:16:13] <_joe_> " [08:16:15] <_joe_> If this is set to 2, the kernel panics compulsorily even on the [08:16:15] <_joe_> above-mentioned. Even oom happens under memory cgroup, the whole [08:16:15] <_joe_> system panics. [08:16:26] i'm not sure i agree that having the oom killer try to fix the memory situation is cleaner than a full reboot; quite the contrary [08:16:29] <_joe_> since we don't use a cgroup to run hhvm [08:17:49] <_joe_> ori: if we were running something else than a stateless appserver in a pool of hundreds, maybe. It's a fact that most of the times, oom works just fine for long-lasting leaks [08:19:05] the oom killer could kill another process. also, hhvm is already in bad shape if conditions are such that the oom killer has been activated. 
a reboot means pybal depools it right away [08:19:30] a server could be in a bad state but still be pooled and thus still have traffic sent to it [08:19:46] a reboot seems way better [08:19:50] <_joe_> really? with the kind of checks we do, it seems unlikely [08:20:07] <_joe_> or you have data to support the idea of a server in a bad state still being pooled? [08:20:08] if it's a single purpose server, chances are the offending process will be killed. [08:20:15] which is the service [08:20:22] <_joe_> which is what happens 99% of the times ofc [08:20:24] but that's adjustable as well [08:20:57] <_joe_> akosiaris: I think the cases of irrecoverable-oom we've seen recently on the jobrunners are a classical case of "we don't use swap and memory grows very fast" [08:21:16] <_joe_> so the problem is the kernel trying to do its heuristics before killing a process [08:21:19] i recall cases in which a server became unresponsive and would not accept new ssh connection attempts, but an existing session would still work, albeit sluggishly [08:21:40] that suggests to me that the server could be in bad shape indeed but still keep pybal's idle connection open [08:21:43] that's usually the case in servers with swap ori [08:21:47] <_joe_> I'm pretty sure oom_kill_allocating_task would do what you want, ori :) [08:21:57] <_joe_> ori: and the proxyfetch? [08:22:00] but I haven't read the entire backlog [08:22:07] <_joe_> we do render a mw page every N seconds [08:22:14] what are we discussing here ? whether a manual reboot is better than OOM ? [08:22:30] <_joe_> akosiaris: whether panic_on_oom is a good idea [08:22:35] <_joe_> to trigger an automatic reboot [08:22:39] <_joe_> in case of oom [08:22:48] <_joe_> I don't think it's a good idea [08:22:50] it is not [08:23:04] first and foremost, not for technical reasons [08:23:37] but social ones. 
Having our boxes reboot unpredictably (at least to an unfamiliar eye) is a bad idea [08:23:37] RECOVERY - puppet last run on mw2193 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:24:11] <_joe_> well, I didn't even get to _that_ [08:24:59] <_joe_> anyways, lemme fix the issue at hand [08:25:44] (03PS2) 10Hoo man: Support bzip2 compression format [puppet] - 10https://gerrit.wikimedia.org/r/262423 (https://phabricator.wikimedia.org/T118397) (owner: 10Lokal Profil) [08:26:22] (03CR) 10Hoo man: [C: 031] "Rebased, should be fine to deploy at any time." [puppet] - 10https://gerrit.wikimedia.org/r/262423 (https://phabricator.wikimedia.org/T118397) (owner: 10Lokal Profil) [08:26:49] <_joe_> hoo: schedule it for puppetSWAT [08:27:24] _joe_: I'll just ping Ariel when I really need it :P [08:27:53] <_joe_> uhm, it's not how it's supposed to be nowadays, but fair enough :P [08:28:05] (03CR) 10Hoo man: [C: 032 V: 032] "Tested with the configuration in https://gerrit.wikimedia.org/r/262423 and the real data." 
[dumps/dcat] - https://gerrit.wikimedia.org/r/262422 (https://phabricator.wikimedia.org/T118397) (owner: Lokal Profil) [08:29:34] (PS1) Giuseppe Lavagetto: jobrunner: lower the kill threshold for gwt-enabled jobrunners [puppet] - https://gerrit.wikimedia.org/r/265224 [08:33:46] (CR) Giuseppe Lavagetto: [C: 2] jobrunner: lower the kill threshold for gwt-enabled jobrunners [puppet] - https://gerrit.wikimedia.org/r/265224 (owner: Giuseppe Lavagetto) [08:34:33] (PS1) Alexandros Kosiaris: Revert "admin: add new group mobileapps-admins" [puppet] - https://gerrit.wikimedia.org/r/265226 (https://phabricator.wikimedia.org/T123540) [08:36:18] (PS2) Alexandros Kosiaris: Revert "admin: add new group mobileapps-admins" [puppet] - https://gerrit.wikimedia.org/r/265226 (https://phabricator.wikimedia.org/T123540) [08:36:29] (CR) Alexandros Kosiaris: [C: 2 V: 2] Revert "admin: add new group mobileapps-admins" [puppet] - https://gerrit.wikimedia.org/r/265226 (https://phabricator.wikimedia.org/T123540) (owner: Alexandros Kosiaris) [08:40:39] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [08:42:34] (PS5) KartikMistry: cxserver: Add all available source languages for Russian in Yandex MT [puppet] - https://gerrit.wikimedia.org/r/264719 (https://phabricator.wikimedia.org/T123906) [08:42:38] Blocked-on-Operations, operations, ops-eqiad, Patch-For-Review: reclaim erbium, gadolinium into spares - https://phabricator.wikimedia.org/T123029#1947243 (akosiaris) Yay!!! This was the last blocker on fully decomissioning the netapps @ EQIAD. Thank you! [08:43:48] operations: asw-b-eqiad:ge-5/0/1(nas1001-a:e0a) port saturation - https://phabricator.wikimedia.org/T106181#1947244 (akosiaris) Open>Invalid a:akosiaris Closing as invalid. nas1001-{a,b} are being decomissioned [08:44:15] <_joe_> wow, we're really decommissioning the netapps?
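A note on the "lower the kill threshold" change merged above: the earlier discussion mentioned an hourly cron check and memory going from 0 to 100% in about 2 hours. Below is a rough Python sketch of the arithmetic behind picking a threshold, assuming memory grows linearly (an assumption, the real growth curve is not in this log). The function names are hypothetical, and the actual threshold values used in the puppet change are not shown here.

```python
def max_safe_threshold(check_interval_h, hours_to_exhaust):
    """Highest kill threshold (percent of RAM) that still leaves one full
    check interval of headroom, assuming memory grows linearly."""
    growth_per_interval = 100.0 * check_interval_h / hours_to_exhaust
    return max(0.0, 100.0 - growth_per_interval)


def should_kill(rss_kb, mem_total_kb, threshold_pct):
    """True when the process uses at least threshold_pct percent of RAM."""
    return 100.0 * rss_kb / mem_total_kb >= threshold_pct


if __name__ == "__main__":
    # Hourly check, roughly 2 hours from empty to full: prints 50.0
    print(max_safe_threshold(check_interval_h=1, hours_to_exhaust=2))
```

Under those assumptions, a process sitting just below the threshold at one check gains about 50 points of memory before the next one, so any threshold above roughly 50% can be overrun between two cron runs.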
[08:44:27] for like 1 and a half now [08:44:35] but that was the final share being mounted somewhere [08:44:54] * _joe_ cheers [08:44:54] <_joe_> yeah I meant, we're ACTUALLY decommissioning them now [08:51:13] (CR) Alexandros Kosiaris: "change seems good, minor comment inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/264943 (https://phabricator.wikimedia.org/T124024) (owner: Giuseppe Lavagetto) [08:52:43] (CR) Alexandros Kosiaris: [C: -1] role::deployment: make it possible to switch between different servers (2 comments) [puppet] - https://gerrit.wikimedia.org/r/264944 (https://phabricator.wikimedia.org/T124024) (owner: Giuseppe Lavagetto) [08:53:13] (CR) Alexandros Kosiaris: [C: 1] deployment: activate redis replica between the masters [puppet] - https://gerrit.wikimedia.org/r/264945 (https://phabricator.wikimedia.org/T124024) (owner: Giuseppe Lavagetto) [08:55:00] (CR) Alexandros Kosiaris: [C: 1] role::deployment::server: syncronize /srv/deployment [puppet] - https://gerrit.wikimedia.org/r/264954 (owner: Giuseppe Lavagetto) [08:55:18] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [08:57:09] (PS1) Elukey: Added some comments for the node stat1001 (analytics) [puppet] - https://gerrit.wikimedia.org/r/265227 [08:57:18] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[08:58:49] (03CR) 10Giuseppe Lavagetto: scap: use logical names for the rsync master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/264943 (https://phabricator.wikimedia.org/T124024) (owner: 10Giuseppe Lavagetto) [09:00:43] (03CR) 10Giuseppe Lavagetto: role::deployment: make it possible to switch between different servers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/264944 (https://phabricator.wikimedia.org/T124024) (owner: 10Giuseppe Lavagetto) [09:03:38] <_joe_> akosiaris: your change wasn't merged on strontium [09:04:43] yeah and I got kicked out of internet 5 secs later [09:04:59] it's raining [09:05:14] and my area is not exactly known to be very water proof [09:05:19] <_joe_> rain determines your internet connection speed? [09:05:21] <_joe_> oh, ok [09:05:29] not the speed, the quality [09:05:35] it's a very badly connected area [09:05:37] <_joe_> I'm waiting on a tech guy to come to my house since 9 AM [09:05:46] I am thrilled that I actually have internet [09:05:53] <_joe_> because of the issues we had in the last few days [09:07:37] so, latency is around 1sec right now [09:07:50] if you see me lagging out, that's why [09:07:56] <_joe_> wow [09:07:58] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [09:08:22] (03CR) 10Yuvipanda: [C: 032] "More docs are always good! 
<3" [puppet] - 10https://gerrit.wikimedia.org/r/265227 (owner: 10Elukey) [09:08:35] YuviPanda: go to sleep [09:08:50] but I Feel so awake [09:08:58] and if I wake up early in the morning [09:08:59] <_joe_> I'm not sure I agree yuvi [09:09:00] I end up being mean to people [09:09:05] all the time [09:09:06] <_joe_> comments and docs should not be in site.pp [09:09:19] YuviPanda: ah, you are on the right path to sysadmin mastery after all ;-) [09:09:22] akosiaris: I merged your change [09:09:30] akosiaris: at some point during a meeting at dev summit [09:09:32] thanks [09:09:38] _joe_: said 'ah, so now he is become real ops' [09:09:45] as I went on a tirade about something or the other [09:09:49] it's true and makes me sad [09:09:54] * YuviPanda doesn't want to be grumpy [09:10:02] <_joe_> why? [09:10:07] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [09:10:12] <_joe_> life is so much better when you're grumpy [09:10:48] PROBLEM - dhclient process on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:11:08] PROBLEM - configured eth on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:11:15] http://i.imgur.com/0o0JbPs.jpg [09:11:17] PROBLEM - Check size of conntrack table on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:11:17] PROBLEM - salt-minion processes on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:11:19] PROBLEM - RAID on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:11:28] PROBLEM - Disk space on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:11:28] PROBLEM - puppet last run on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:11:29] PROBLEM - DPKG on cygnus is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:11:29] YuviPanda: there you go ^ [09:11:33] _joe_: hello! 
I was about to ask a comment for the CR above, I was trying to send my first one and I was wondering what was the best practice for site.pp [09:11:43] since I saw comments etc.. [09:11:47] akosiaris: hahah [09:12:01] <_joe_> elukey: most comments are by your partner in crime, ottomata :) [09:12:45] * YuviPanda likes comments in code too [09:12:48] ah, down to 39ms latency for the 2nd hop [09:12:52] ahhhh so for the moment my mistakes are shielded by ottomata :D [09:13:00] seems like someone fixed something somewhere [09:13:45] <_joe_> YuviPanda: I do like comments, I'd just leave site.pp as terse as possible [09:15:11] !log unexport /vol/fr_archive on nas1001-a [09:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:18:04] (03PS2) 10Giuseppe Lavagetto: role::deployment: make it possible to switch between different servers [puppet] - 10https://gerrit.wikimedia.org/r/264944 (https://phabricator.wikimedia.org/T124024) [09:18:07] !log offline fr_archive volume on nas1001-a [09:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:21:10] (03CR) 10Alexandros Kosiaris: [C: 032] cxserver: Add all available source languages for Russian in Yandex MT [puppet] - 10https://gerrit.wikimedia.org/r/264719 (https://phabricator.wikimedia.org/T123906) (owner: 10KartikMistry) [09:21:18] (03PS6) 10Alexandros Kosiaris: cxserver: Add all available source languages for Russian in Yandex MT [puppet] - 10https://gerrit.wikimedia.org/r/264719 (https://phabricator.wikimedia.org/T123906) (owner: 10KartikMistry) [09:21:24] (03CR) 10Alexandros Kosiaris: [V: 032] cxserver: Add all available source languages for Russian in Yandex MT [puppet] - 10https://gerrit.wikimedia.org/r/264719 (https://phabricator.wikimedia.org/T123906) (owner: 10KartikMistry) [09:21:30] <_joe_> akosiaris: I see you're murdering the netapp, can I consider your -1 to be just about that whitespace? 
(re https://gerrit.wikimedia.org/r/264944) [09:23:46] (03CR) 10Filippo Giunchedi: [C: 031] admin: add datacenter-ops to install-server role [puppet] - 10https://gerrit.wikimedia.org/r/264994 (https://phabricator.wikimedia.org/T123681) (owner: 10Dzahn) [09:24:45] I am a bit torn about the deployment thing but I 'll survie [09:24:49] survive [09:24:53] yeah, lemme +1 it [09:25:10] (03CR) 10Alexandros Kosiaris: [C: 031] role::deployment: make it possible to switch between different servers [puppet] - 10https://gerrit.wikimedia.org/r/264944 (https://phabricator.wikimedia.org/T124024) (owner: 10Giuseppe Lavagetto) [09:28:08] <_joe_> akosiaris: well, did you read my response? [09:28:16] <_joe_> about the deployment thing [09:28:24] <_joe_> we do that with cnames already [09:28:33] <_joe_> and use .svc just for VIPs [09:28:37] yes and I have a comment that using svc just for VIP is not really true [09:28:40] <_joe_> I do agree it's somewhat lame [09:28:51] <_joe_> uhm I checked wmnet [09:28:56] look at apertium for example [09:29:10] although I will fix that with the move to SCB [09:29:24] <_joe_> isn't apertium a VIP? [09:29:35] nope :-( [09:29:39] <_joe_> OH [09:29:45] but the idea is it will become [09:29:49] <_joe_> that's the exception [09:30:06] yes, which is why I did not answer with that [09:30:54] akosiaris: thanks. [09:31:00] kart_: yw [09:31:31] akosiaris: now over to bug we encouter with using registry from cxserver. [09:31:31] so, I 'll give it a week for the netapps being offline and after that I will send them to oblivion [09:31:38] akosiaris: \o/ [09:31:51] oblivion being chris's loving hands ;-) [09:32:04] kart_: which bug ? 
[09:32:06] <_joe_> for some value of "loving" [09:32:15] :-D [09:32:30] <_joe_> when we decommission servers, I imagine chris going at them with a viking axe [09:33:04] (03PS2) 10Giuseppe Lavagetto: Add deployment virtual name [dns] - 10https://gerrit.wikimedia.org/r/264932 [09:33:33] (03CR) 10Giuseppe Lavagetto: [C: 032] Add deployment virtual name [dns] - 10https://gerrit.wikimedia.org/r/264932 (owner: 10Giuseppe Lavagetto) [09:34:27] RECOVERY - configured eth on cygnus is OK: OK - interfaces up [09:34:28] RECOVERY - salt-minion processes on cygnus is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:34:28] RECOVERY - Check size of conntrack table on cygnus is OK: OK: nf_conntrack is 0 % full [09:34:38] RECOVERY - RAID on cygnus is OK: OK: no RAID installed [09:34:48] RECOVERY - Disk space on cygnus is OK: DISK OK [09:34:48] RECOVERY - puppet last run on cygnus is OK: OK: Puppet is currently enabled, last run 29 minutes ago with 0 failures [09:34:48] RECOVERY - DPKG on cygnus is OK: All packages OK [09:35:06] I am trying a "fix" for the KVM bug that we 've been having on cygnus btw [09:36:03] !log gnt-instance modify -H disk_aio=native cygnus.codfw.wmnet [09:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:36:18] RECOVERY - dhclient process on cygnus is OK: PROCS OK: 0 processes with command name dhclient [09:38:07] akosiaris: See: https://phabricator.wikimedia.org/T122498 [09:38:13] Also PM :) [09:50:25] kart_: that looks like a config loading problem.... btw I see I2c1500a9cfbd5ae8fedbe3ff98d41a34f078f88c applied on deployment-puppetmaster. 
That patch fails in the compiler for that I see in my comments in https://gerrit.wikimedia.org/r/#/c/263550/ [09:50:34] s/for that/from what/ [09:50:56] Error: (): mapping values are not allowed in this context at line 23 column 21 at /mnt/jenkins-workspace/puppet-compiler/1576/change/src/modules/service/manifests/node.pp:128 on node sca1001.eqiad.wmnet [09:51:22] kart_: what do we need it anyway btw ? [09:51:52] a valid json document is always a valid yaml document anyways (the inverse is not true btw) [09:56:02] (03PS2) 10Giuseppe Lavagetto: scap: use logical names for the rsync master [puppet] - 10https://gerrit.wikimedia.org/r/264943 (https://phabricator.wikimedia.org/T124024) [09:56:51] kart_: I 've removed that patch in deployment-prep btw, it was a noop anyway given beta does not have a registry anyway [10:05:55] (03CR) 10Giuseppe Lavagetto: [C: 032] scap: use logical names for the rsync master [puppet] - 10https://gerrit.wikimedia.org/r/264943 (https://phabricator.wikimedia.org/T124024) (owner: 10Giuseppe Lavagetto) [10:08:08] 6operations, 10Fundraising-Backlog, 6Security, 10fundraising-tech-ops: Delete gadolinium:/a/log/fundraising/ - https://phabricator.wikimedia.org/T92336#1947339 (10akosiaris) a:5Ottomata>3akosiaris T123029 tracked the above. Now erbium, gadolinium are no longer active. The unmount has taken place and th... [10:08:56] 6operations, 10Fundraising-Backlog, 6Security, 10fundraising-tech-ops: Delete gadolinium:/a/log/fundraising/ - https://phabricator.wikimedia.org/T92336#1947344 (10akosiaris) 5Open>3Resolved Resolving this now, actually deletion is tracked in T118535 [10:10:05] akosiaris: "a valid json document is always a valid yaml doc" I have never thought about that. 
Nice trick :-} [10:10:07] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 796 [10:10:44] hashar: :-) [10:11:01] * hashar yamlize everything [10:12:22] (03PS2) 10Alexandros Kosiaris: add url-downloader in codfw on alsafi [puppet] - 10https://gerrit.wikimedia.org/r/264205 (https://phabricator.wikimedia.org/T122134) (owner: 10Dzahn) [10:20:17] RECOVERY - check_mysql on db1008 is OK: Uptime: 67315 Threads: 2 Questions: 614135 Slow queries: 434 Opens: 487 Flush tables: 2 Open tables: 333 Queries per second avg: 9.123 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:24:18] (03PS1) 10Alexandros Kosiaris: puppet-facts-export: Set an 8 byte uniqueid [puppet] - 10https://gerrit.wikimedia.org/r/265233 (https://phabricator.wikimedia.org/T122909) [10:30:41] akosiaris: lol @ "topic" ^ [10:31:36] :D [10:31:59] (03CR) 10Alexandros Kosiaris: [C: 032] puppet-facts-export: Set an 8 byte uniqueid [puppet] - 10https://gerrit.wikimedia.org/r/265233 (https://phabricator.wikimedia.org/T122909) (owner: 10Alexandros Kosiaris) [10:34:16] (03PS3) 10Giuseppe Lavagetto: role::deployment: make it possible to switch between different servers [puppet] - 10https://gerrit.wikimedia.org/r/264944 (https://phabricator.wikimedia.org/T124024) [10:35:56] (03PS1) 10ArielGlenn: dumps: bump mwbzutils version to 0.0.5 [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/265234 [10:37:10] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: bump mwbzutils version to 0.0.5 [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/265234 (owner: 10ArielGlenn) [10:39:39] (03CR) 10Giuseppe Lavagetto: [C: 032] role::deployment: make it possible to switch between different servers [puppet] - 10https://gerrit.wikimedia.org/r/264944 (https://phabricator.wikimedia.org/T124024) (owner: 10Giuseppe Lavagetto) [10:40:07] (03CR) 10Alexandros Kosiaris: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/260190 (owner: 10Dzahn) [10:41:47] akosiaris: right. 
so what's issue :/ [10:41:57] I can't debug further. [10:42:14] kart_: er, why ? [10:42:34] what is it that you need to debug it further ? [10:42:58] akosiaris: registry is still null, and problem exists. [10:43:24] so the problem is in cxserver then ? [10:43:36] the config loading part ? [10:44:17] akosiaris: I also symlinked config.yaml -> config.prod.yaml in cxserver/deploy, but still. [10:46:59] kart_: well, I suppose some verbose/debug logging from cxserver at the time it is reading the configuration might help [10:53:05] 7Puppet, 6operations, 5Patch-For-Review: puppet compiler runs fail when backup::host is included on host - https://phabricator.wikimedia.org/T122909#1947408 (10akosiaris) 5Open>3Resolved a:3akosiaris and now that the above change was merged and the facts updated (manually as always), https://puppet-com... [10:54:22] <_joe_> uhm something funny going on on mira [10:55:03] define funny [10:56:17] (03CR) 10Alexandros Kosiaris: [C: 032] "https://puppet-compiler.wmflabs.org/1632/alsafi.wikimedia.org/ says ok, merging" [puppet] - 10https://gerrit.wikimedia.org/r/264205 (https://phabricator.wikimedia.org/T122134) (owner: 10Dzahn) [10:56:23] (03PS3) 10Alexandros Kosiaris: add url-downloader in codfw on alsafi [puppet] - 10https://gerrit.wikimedia.org/r/264205 (https://phabricator.wikimedia.org/T122134) (owner: 10Dzahn) [10:56:58] akosiaris: btw https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet3-diffs/Documentation now contains copy-pastable commands to update facts, still manual but ok-ish [10:57:20] (03PS4) 10Alexandros Kosiaris: add url-downloader in codfw on alsafi [puppet] - 10https://gerrit.wikimedia.org/r/264205 (https://phabricator.wikimedia.org/T122134) (owner: 10Dzahn) [10:57:37] (03PS5) 10Alexandros Kosiaris: add url-downloader in codfw on alsafi [puppet] - 10https://gerrit.wikimedia.org/r/264205 (https://phabricator.wikimedia.org/T122134) (owner: 10Dzahn) [10:57:43] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] add url-downloader in 
codfw on alsafi [puppet] - 10https://gerrit.wikimedia.org/r/264205 (https://phabricator.wikimedia.org/T122134) (owner: 10Dzahn) [11:04:29] (03PS1) 10Alexandros Kosiaris: Add url-downloader.{eqiad,codfw}.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/265237 (https://phabricator.wikimedia.org/T122134) [11:08:04] (03PS1) 10Giuseppe Lavagetto: redis: do not 'expect fork' if daemonize = false in the settings. [puppet] - 10https://gerrit.wikimedia.org/r/265238 [11:09:04] 6operations, 5Patch-For-Review: url-downloader should be set up more redundantly - https://phabricator.wikimedia.org/T122134#1947444 (10akosiaris) I think the part where we have a backup for this service in `codfw` is ready. However that only alleviates the fact that url-downloader is a SPOF in both datacenter... [11:09:26] (03CR) 10Alexandros Kosiaris: [C: 032] Add url-downloader.{eqiad,codfw}.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/265237 (https://phabricator.wikimedia.org/T122134) (owner: 10Alexandros Kosiaris) [11:11:00] godog: nice [11:13:38] 6operations, 10fundraising-tech-ops: make sure netapp fundraising share gets wiped - https://phabricator.wikimedia.org/T118535#1947459 (10akosiaris) volume has been fully offlined. 
Waiting for about a week in case something comes up (I honestly expect nothing to come up) and then I will wipe it [11:20:29] operations, ops-eqiad: decomission the netapps in EQIAD: nas1001-a, nas1001-b - https://phabricator.wikimedia.org/T124156#1947460 (akosiaris) NEW a:akosiaris [11:21:02] operations, ops-eqiad: decomission the netapps in EQIAD: nas1001-a, nas1001-b - https://phabricator.wikimedia.org/T124156#1947470 (akosiaris) [11:21:05] operations, fundraising-tech-ops: make sure netapp fundraising share gets wiped - https://phabricator.wikimedia.org/T118535#1802733 (akosiaris) [11:31:16] (PS3) Alexandros Kosiaris: url_downloader: Add port as a hierable parameter [puppet] - https://gerrit.wikimedia.org/r/191379 [11:32:11] (CR) jenkins-bot: [V: -1] url_downloader: Add port as a hierable parameter [puppet] - https://gerrit.wikimedia.org/r/191379 (owner: Alexandros Kosiaris) [11:34:00] (PS4) Alexandros Kosiaris: url_downloader: Add port as a hierable parameter [puppet] - https://gerrit.wikimedia.org/r/191379 [11:36:08] hashar: around ? [11:37:01] hashar: I am looking at https://gerrit.wikimedia.org/r/#/c/191379/ and just noticed (for the Nth time) that operations-puppet-tox-pep8 is ran. It fails because pep8 does not exist anymore as an environment in our tox.ini and that makes sense [11:37:11] what do I need to update for that job to not be run anymore ? [11:37:57] zuul/layout.yaml and jjb/operations-puppet.yaml I suppose ? [11:39:21] i am there [11:39:36] ah yeah I forgot about that set of changes ... [11:39:57] (PS2) Giuseppe Lavagetto: redis: do not 'expect fork' if daemonize = false in the settings. 
[puppet] - 10https://gerrit.wikimedia.org/r/265238 [11:40:36] akosiaris: since the .py files were barely passing pep8, we went with a custom job that runs pep8 in each directory so we can ignore different type of errors via a .pep8 file [11:40:52] so if say modules/ganglia/files/plugins/ files are known to fail, we would stuck a .pep8 file in there [11:41:00] yeah I remember that [11:41:07] the script wrapper is integration/jenkins.git which is not convenient to run on a local machine [11:41:21] so I fixed almost all pep8 errors so one can just run pep8 from the root of the repo [11:41:29] and then get rid of all the .pep8 exceptions [11:41:42] the operations-puppet-tox-pep8 is run on any change and really : tox -e pep8 [11:41:46] i.e. run pep8 form root of repo [11:42:13] it fails right now because the pep8 env is not defined in /tox.ini [11:42:23] (03CR) 10Giuseppe Lavagetto: [C: 032] redis: do not 'expect fork' if daemonize = false in the settings. [puppet] - 10https://gerrit.wikimedia.org/r/265238 (owner: 10Giuseppe Lavagetto) [11:42:36] the entry point is defined via https://gerrit.wikimedia.org/r/#/c/244148/ [11:43:16] but then we are wondering whether the puppet submodules should be linted or ignored [11:43:54] linted I 'd say [11:44:06] very few are third party anyway [11:44:09] I posted a reasoning for ignoring them [11:44:19] which is that we would need to have all submodules pass [11:44:25] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: puppet fail [11:44:31] then when changing operations/puppet rule we will need to bump submodules as well [11:45:05] i.e. if we bump pep8 version, we have to bump it on all submodules first [11:45:50] (03PS1) 10Giuseppe Lavagetto: role::deployment::server: fixup for I4e632e66f [puppet] - 10https://gerrit.wikimedia.org/r/265242 [11:45:53] ah, so all submodules will be tested in their own repo ? 
[11:46:00] that sounds fine [11:46:02] we would need that yeah [11:46:34] (03CR) 10Giuseppe Lavagetto: [C: 032] role::deployment::server: fixup for I4e632e66f [puppet] - 10https://gerrit.wikimedia.org/r/265242 (owner: 10Giuseppe Lavagetto) [11:47:42] (03CR) 10Alexandros Kosiaris: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) (owner: 10Hashar) [11:48:22] hashar: well, if we have the same rules apply to all submodules (in an incremental basis) I am fine with ignoring them. lemme comment on that [11:48:51] that would be easier this way [11:49:05] we can then bump the pep8 version independently on each submodules [11:49:22] as well as bump it on operations/puppet.git regardless of the submodules lint version [11:49:25] might be easier to handle [11:49:40] but you are right we will want all submodules to have some version of pep8 as well [11:49:45] (03CR) 10Alexandros Kosiaris: [C: 031] "If we say that we will apply the exact same rules on all submodule repos as in the main puppet repo, I would be fine with excluding all su" [puppet] - 10https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) (owner: 10Hashar) [11:50:26] then the only submodules having python files are kafkatee and varnishkafka :-} [11:50:45] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [11:50:49] yeah, but others might get as well at some point in time [11:50:54] can someone help KartikMistry here? https://phabricator.wikimedia.org/T120815 [11:51:15] it is about deployment, and I should not be the right person to help him [11:51:21] funnily, I have found that one of the repo had a python script with unit tests which were not triggered on Jenkins :-} [11:51:43] (mediawiki deployment) [11:52:33] jynus: the way I read your comments I understand it should be part of the usual deployment process ? [11:52:47] so, ops not needed ? 
[11:52:50] yes, it is a create table I already reviewd and agreed [11:52:54] for a table creation that would be self served [11:52:59] yes [11:53:14] I think he doesn't know to move around in tin, etc. [11:53:23] something like: mwscript --wiki=metawiki maintenance/sql.php < extensions/ContentTranslation/maintenance/create-whatever.sql [11:53:24] use the mediawiki sql script and that [11:53:35] jynus: I am taking it [11:53:40] aaa that [11:53:40] maybe he doesn't have access? [11:53:42] ok then [11:53:47] thanks hashar [11:59:18] it is not that I could not do it, it is that it is better to teach someone the proper way than continue doing it all the time [12:00:00] and I would be doing it "the ops way" not the mediawiki way [12:00:44] Krenair: Where are we with the bumps for wikitech? [12:02:43] Replication to github is broken, so I dunno :P [12:03:58] jynus: I fully agree and did just that "i.e. explain stuff / point to doc" [12:04:17] there is no point in having you deal with basic changes :D [12:04:27] !log trying schema change on wikidata (wb_terms) [12:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:04:42] ^speaking on non-trivial changes [12:05:46] PROBLEM - puppet last run on mw2132 is CRITICAL: CRITICAL: Puppet has 1 failures [12:06:47] not sure if I should abort or not [12:08:18] (03PS2) 10Giuseppe Lavagetto: deployment: activate redis replica between the masters [puppet] - 10https://gerrit.wikimedia.org/r/264945 (https://phabricator.wikimedia.org/T124024) [12:08:28] lunch & [12:09:03] (03CR) 10Giuseppe Lavagetto: [C: 032] deployment: activate redis replica between the masters [puppet] - 10https://gerrit.wikimedia.org/r/264945 (https://phabricator.wikimedia.org/T124024) (owner: 10Giuseppe Lavagetto) [12:09:16] jynus: KartikMistry -> kart_ :) [12:09:31] jynus: Thanks. I can schedule deployment today or tomorrow. 
[12:11:50] PROBLEM - MariaDB Slave SQL: s5 on db1026 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1176, Errmsg: Error Key PRIMARY doesnt exist in table wb_terms on query. Default database: wikidatawiki. Query: INSERT LOW_PRIORITY IGNORE INTO wikidatawiki._wb_terms_new (term_row_id, term_entity_id, term_entity_type, term_language, term_type, term_text, term_search_key, term_weight) SELECT term_row_id, term_entity_id, term_e [12:12:07] broken replication on db1026 because it lacks of a primary key [12:12:11] uh [12:12:34] nice [12:12:40] uhoh [12:13:13] we have service, I am not sure we hace recentchanges [12:14:17] how did that go through the master ? [12:14:41] every other server has a primary key! [12:14:49] including the master and the other slaves [12:14:56] wat ? eee [12:15:01] how did that happen ? [12:15:18] let me deploy a mw role change to get recetchanges back [12:17:46] (03PS1) 10Jcrespo: Setting s5 master as recentchanges role [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265244 [12:18:24] (03CR) 10Jcrespo: [C: 032 V: 032] Setting s5 master as recentchanges role [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265244 (owner: 10Jcrespo) [12:18:48] (03Merged) 10jenkins-bot: Setting s5 master as recentchanges role [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265244 (owner: 10Jcrespo) [12:19:11] PROBLEM - MariaDB Slave Lag: s5 on db1045 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 336 [12:20:06] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Setting s5 master as recentchanges role (duration: 00m 32s) [12:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:21:30] rc is back [12:21:35] db1049 is going to page too btw [12:21:41] yep [12:21:42] lag is increasing over there as well [12:22:18] while we have 3 servers, everithing is good, I count with mediawikis automatic load balancing [12:23:10] (03CR) 10John Vandenberg: [C: 031] tox entry point to run 
pep8==1.4.6 [puppet] - 10https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) (owner: 10Hashar) [12:23:53] it is the problem of having 2 different batches of hardware, only the one with more memory can keep up under stress [12:25:12] PROBLEM - MariaDB Slave Lag: s5 on db1049 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 312 [12:26:37] we definitelly should isolate wikidata on its own shard [12:27:10] BTW, this is not an outage [12:27:33] it is a scheduled maintenance, I just didn't know it was going to be so bad [12:29:35] RECOVERY - puppet last run on mw2132 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [12:29:58] there is not service loss, however, only server loss [12:30:20] that is why I am continuing with it [12:31:10] jynus: replag is growing though on db104{5,9} [12:31:16] 9 and 5 min respectively [12:31:43] yep, I know [12:32:01] I count those as server loss [12:32:33] mediawiki depools those automatically [12:33:47] I see eswiki issues, those concern me more, as it is on s7, not s5 [12:35:45] they are ApiQueryContributions queries, so, unrelated? [12:36:52] I think things are stable, and only service problem I saw was recentchanges freezing for a few seconds [12:36:58] which is now ok [12:37:19] I will try to fix db1026 [12:40:10] 6operations, 6Project-Creators: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#1947582 (10faidon) @Aklapper et al., any comments? Shall we proceed? 
[12:41:16] (03PS1) 10ArielGlenn: make neodymium primary salt master [puppet] - 10https://gerrit.wikimedia.org/r/265245 [12:41:32] <_joe_> \o/ [12:43:17] !log performing alter table on db1026 (ETA: 5 hours) [12:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:43:55] PROBLEM - Apache HTTP on mw1130 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.006 second response time [12:43:59] (03PS1) 10Elukey: Added notify Service for Kafka's Burrow for https://phabricator.wikimedia.org/T123942 [puppet] - 10https://gerrit.wikimedia.org/r/265246 [12:44:17] PROBLEM - HHVM rendering on mw1130 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.014 second response time [12:44:50] (03CR) 10Elukey: "Next step would be to apply this to Labs" [puppet] - 10https://gerrit.wikimedia.org/r/265246 (owner: 10Elukey) [12:50:06] (03PS1) 10ArielGlenn: move git-deploy to neodymium (primary salt master) [puppet] - 10https://gerrit.wikimedia.org/r/265248 [12:51:30] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1947629 (10ArielGlenn) Starting the move to neodymium as primary and then soon only salt master. https://gerrit.wikimedia.org/r/#/c/265245/ and https://gerrit.wikimedia.org/r/#/c/265248/ [12:54:35] RECOVERY - Apache HTTP on mw1130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.764 second response time [12:54:56] RECOVERY - HHVM rendering on mw1130 is OK: HTTP OK: HTTP/1.1 200 OK - 65739 bytes in 1.401 second response time [12:55:04] !log restart hhvm on mw1130 [12:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:55:39] akosiaris, to answer your questions (sorry I ignored you, as I was focusing on monitoring) [12:56:21] ah yes I forgot to update that. 
thanks mark [12:56:27] :) [12:56:52] a difficult schema change was done, I expected lag on some servers- we got that, but that is handled automatically, no action required [12:56:54] (03PS1) 10Hashar: Add .gitreview [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/265250 [12:56:56] (03PS1) 10Hashar: Introduce tox as a test entry point [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/265251 [12:56:58] (03PS1) 10Hashar: Pass flake8 and add it to tox envlist [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/265252 [12:57:46] the unexpected part was recentchanges having a non-expected schema difference- one without a primary key- that was solved right away [12:57:55] (03PS2) 10Elukey: Adding notify Service for Kafka's Burrow. Bug: T123942 [puppet] - 10https://gerrit.wikimedia.org/r/265246 (https://phabricator.wikimedia.org/T123942) [12:58:29] rc slaves have different optimized schema, but usually they still have a PK. I am creating it now. [12:58:55] jynus: yeah I saw... 5 hours ETA [12:59:21] current state is schema change ongoing, ETA 5 hours, no service impact (only small degradation on HA, but I counted with that) [12:59:39] the 5 hours are for fixing db1026, too [12:59:42] (03CR) 10Hashar: "I added tox support to operations/puppet/kafkatee and operations/puppet/varnishkafka via a few patches. https://gerrit.wikimedia.org/r/#/" [puppet] - 10https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) (owner: 10Hashar) [12:59:43] which is now depooled [13:00:01] (03PS3) 10Elukey: Add notify Service for Kafka's Burrow. 
Bug: T123942 [puppet] - 10https://gerrit.wikimedia.org/r/265246 (https://phabricator.wikimedia.org/T123942) [13:01:54] I know some pages went off- it is difficult to predict load impact [13:02:23] yeah, that makes sense [13:02:29] but I would sign right now this impact of any other online 500 million-row-table primary key schema change [13:03:14] 10Ops-Access-Requests, 6operations, 6Services, 3Mobile-Content-Service, 5Patch-For-Review: Allow mobrovac to restart MobileApps - https://phabricator.wikimedia.org/T123540#1947670 (10akosiaris) p:5Triage>3Normal [13:04:11] (03PS1) 10Alexandros Kosiaris: Add mobrovac to mobileapps-admin group [puppet] - 10https://gerrit.wikimedia.org/r/265253 (https://bugzilla.wikimedia.org/123540) [13:05:49] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: add datacenter-ops to dhcp /install-server - https://phabricator.wikimedia.org/T123681#1947673 (10akosiaris) p:5Triage>3Normal The change misses the "you need to run puppet there" part though [13:05:55] (03CR) 10Alexandros Kosiaris: [C: 031] admin: add datacenter-ops to install-server role [puppet] - 10https://gerrit.wikimedia.org/r/264994 (https://phabricator.wikimedia.org/T123681) (owner: 10Dzahn) [13:06:23] (03PS2) 10Alex Monk: Add mobrovac to mobileapps-admin group [puppet] - 10https://gerrit.wikimedia.org/r/265253 (https://phabricator.wikimedia.org/T123540) (owner: 10Alexandros Kosiaris) [13:10:48] 6operations, 5Patch-For-Review: url-downloader should be set up more redundantly - https://phabricator.wikimedia.org/T122134#1947682 (10akosiaris) p:5High>3Normal Lowering priority since this is partially done. 
Also assigning it to myself [13:10:55] 6operations, 5Patch-For-Review: url-downloader should be set up more redundantly - https://phabricator.wikimedia.org/T122134#1947684 (10akosiaris) a:3akosiaris [13:11:03] (03PS6) 10ArielGlenn: dumps: set up but don't enable script for dumps to run from cron [puppet] - 10https://gerrit.wikimedia.org/r/263807 (https://phabricator.wikimedia.org/T107750) [13:12:42] RECOVERY - MariaDB Slave Lag: s5 on db1049 is OK: OK slave_sql_lag Seconds_Behind_Master: 53 [13:14:27] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5codfw-rollout-Jan-Mar-2016: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1947687 (10faidon) [13:15:11] 6operations, 5codfw-rollout-Jan-Mar-2016: Be able to switch programmatically between deployment servers in codfw and eqiad - https://phabricator.wikimedia.org/T124024#1947690 (10faidon) [13:15:29] 6operations, 5codfw-rollout-Jan-Mar-2016: url-downloader should be set up more redundantly - https://phabricator.wikimedia.org/T122134#1947692 (10faidon) [13:15:38] 6operations, 5codfw-rollout-Jan-Mar-2016: Scale up and out our puppetmaster infrastructure - https://phabricator.wikimedia.org/T98128#1947694 (10faidon) [13:16:36] 6operations, 5codfw-rollout-Apr-Jun-2015, 5codfw-rollout-Jan-Mar-2016: Document what is left for having a full cluster installation in codfw - https://phabricator.wikimedia.org/T97322#1947704 (10faidon) [13:17:02] 6operations, 10Traffic, 5Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#1947708 (10elukey) Adding another note: http://nginx.org/en/docs/http/ngx_http_upstream_module.html#server The server directive has a max_conns value, but... "limits the... 
[13:21:21] (03PS7) 10ArielGlenn: dumps: set up but don't enable script for dumps to run from cron [puppet] - 10https://gerrit.wikimedia.org/r/263807 (https://phabricator.wikimedia.org/T107750) [13:22:09] (03PS8) 10ArielGlenn: dumps: set up but don't enable script for dumps to run from cron [puppet] - 10https://gerrit.wikimedia.org/r/263807 (https://phabricator.wikimedia.org/T107750) [13:29:06] (03PS1) 10Giuseppe Lavagetto: deployment::redis: use ipv4 address when replicating [puppet] - 10https://gerrit.wikimedia.org/r/265258 [13:30:11] (03CR) 10Giuseppe Lavagetto: [C: 032] deployment::redis: use ipv4 address when replicating [puppet] - 10https://gerrit.wikimedia.org/r/265258 (owner: 10Giuseppe Lavagetto) [13:31:03] (03PS9) 10ArielGlenn: dumps: set up but don't enable script for dumps to run from cron [puppet] - 10https://gerrit.wikimedia.org/r/263807 (https://phabricator.wikimedia.org/T107750) [13:31:58] (03CR) 10jenkins-bot: [V: 04-1] dumps: set up but don't enable script for dumps to run from cron [puppet] - 10https://gerrit.wikimedia.org/r/263807 (https://phabricator.wikimedia.org/T107750) (owner: 10ArielGlenn) [13:35:24] (03PS10) 10ArielGlenn: dumps: set up but don't enable script for dumps to run from cron [puppet] - 10https://gerrit.wikimedia.org/r/263807 (https://phabricator.wikimedia.org/T107750) [13:37:05] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: puppet fail [13:37:20] <_joe_> this is me ^^ [13:37:24] (03PS1) 10Giuseppe Lavagetto: deployment::redis: re-fix syntax [puppet] - 10https://gerrit.wikimedia.org/r/265259 [13:37:55] (03CR) 10Giuseppe Lavagetto: [C: 032] deployment::redis: re-fix syntax [puppet] - 10https://gerrit.wikimedia.org/r/265259 (owner: 10Giuseppe Lavagetto) [13:38:18] (03CR) 10jenkins-bot: [V: 04-1] deployment::redis: re-fix syntax [puppet] - 10https://gerrit.wikimedia.org/r/265259 (owner: 10Giuseppe Lavagetto) [13:39:40] 6operations, 10MediaWiki-General-or-Unknown, 10MobileFrontend-Feature-requests, 10Traffic: Fix mobile 
purging - https://phabricator.wikimedia.org/T124165#1947764 (10BBlack) 3NEW a:3BBlack [13:39:56] PROBLEM - puppet last run on mw2011 is CRITICAL: CRITICAL: puppet fail [13:39:57] (03PS2) 10Giuseppe Lavagetto: deployment::redis: re-fix syntax [puppet] - 10https://gerrit.wikimedia.org/r/265259 [13:40:39] (03PS1) 10DCausse: Remove deprecated properties and minor comment cleanups [puppet] - 10https://gerrit.wikimedia.org/r/265260 [13:41:03] !log Revert migration of mobile traffic to text cluster in codfw https://phabricator.wikimedia.org/T109286 [13:41:49] 6operations, 10Traffic, 5Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#1947774 (10elukey) http://nginx.org/en/docs/http/ngx_http_upstream_module.html#server A combination of max_fails (defaul to 1) and fail_timeout (default to 10 seconds) cou... [13:42:08] (03CR) 10Giuseppe Lavagetto: [C: 032] deployment::redis: re-fix syntax [puppet] - 10https://gerrit.wikimedia.org/r/265259 (owner: 10Giuseppe Lavagetto) [13:45:36] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [14:03:32] <_joe_> Reedy: where? [14:03:35] <_joe_> ouch [14:03:36] from tin [14:03:47] The last one I ran, worked fine [14:03:53] <_joe_> oh ok [14:04:05] We still have one DB server pooled that shouldn't be pooled [14:04:08] <_joe_> I guess it was while I was applying the change that restarted rsync :P [14:04:19] lol [14:04:20] 14:03:53 sync-dir failed: /srv/mediawiki-staging/php-1.27.0-wmf.11/vendor/pear/mail/Mail/smtpmx.php has content before opening ffs [14:05:09] https://github.com/wikimedia/mediawiki-vendor/blob/master/pear/mail/Mail/smtpmx.php [14:05:11] seriously? 
[14:06:46] looks like it [14:06:58] I presume there's some invisible shit [14:07:10] !log reedy@tin Synchronized php-1.27.0-wmf.10/: consistency (duration: 02m 38s) [14:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:07:11] well, that worked alright [14:07:17] need to go get some food. Back in a few [14:07:39] there isn't invisible shit there. [14:07:46] i guess our check is case-sensitive? :D [14:07:47] BOM? [14:08:20] hoo, I do not see which one, both db1026 and db1045 are depooled [14:08:30] no BOM. it's just uppercase (03PS1) 10Giuseppe Lavagetto: deployment::rsync: fix cron [puppet] - 10https://gerrit.wikimedia.org/r/265267 [14:08:56] jynus: Still appears in MediaWiki [14:09:06] <_joe_> yeah what MatmaRex said [14:09:46] PROBLEM - Apache HTTP on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:09:51] hoo, where? [14:09:58] it's not synced out [14:09:59] one sec [14:10:16] RECOVERY - puppet last run on mw2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:10:29] (03CR) 10Giuseppe Lavagetto: [C: 032] deployment::rsync: fix cron [puppet] - 10https://gerrit.wikimedia.org/r/265267 (owner: 10Giuseppe Lavagetto) [14:10:30] jynus: https://www.wikidata.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb= [14:10:46] PROBLEM - HHVM rendering on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:10:51] Started the sync [14:11:11] !log hoo@tin Synchronized wmf-config/db-eqiad.php: Has not been synced before (duration: 00m 32s) [14:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:11:33] <_joe_> hoo: any errors? you shouldn't see any [14:11:41] db1049 getting replag now [14:11:42] very fast [14:11:45] revert [14:11:46] ? [14:11:52] 24s by now [14:13:08] yep, revert [14:13:11] you do it? 
[14:13:13] yeah [14:14:06] (03PS1) 10Hoo man: Revert "Setting db1049 as the API node, as db1045 is lagged" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265268 [14:14:14] (03CR) 10Hoo man: [C: 032] Revert "Setting db1049 as the API node, as db1045 is lagged" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265268 (owner: 10Hoo man) [14:14:30] <_joe_> !log syncronizing /srv/deployment manually between the two deployment servers for the first time [14:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:14:42] godog: hello, the Swift cluster for beta-cluster is going to be build on top of instances in deployment-prep isn't it ? [14:14:43] (03CR) 10Jcrespo: [C: 032] Revert "Setting db1049 as the API node, as db1045 is lagged" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265268 (owner: 10Hoo man) [14:14:45] (03Merged) 10jenkins-bot: Revert "Setting db1049 as the API node, as db1045 is lagged" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265268 (owner: 10Hoo man) [14:14:53] hashar_: yep [14:15:09] godog: just asking because some wondered whether we could set up a labs wide Swift cluster based on real hardware [14:15:15] i.e. 
a shared / common service for all labs [14:15:23] which is most probably another can of worms ;-} [14:15:40] !log hoo@tin Synchronized wmf-config/db-eqiad.php: Re-Pool lagged db1045 (duration: 00m 35s) [14:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:15:56] I suppose I forgot to rebase before deploying that [14:15:58] hashar_: yeah out of scope for this, there's a task for that [14:16:39] in any case, it didn't work, as I expected- that can be either the load or the optimization for API [14:19:30] (03PS1) 10Hoo man: Set db1045 load to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265270 [14:19:48] (03CR) 10Hoo man: [C: 032] "Per Jynus" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265270 (owner: 10Hoo man) [14:20:21] (03Merged) 10jenkins-bot: Set db1045 load to 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265270 (owner: 10Hoo man) [14:20:21] hoo: MatmaRex: It also questions why scap let it out before... Not sync-dir doesn't like it [14:21:15] huh... someone holding the scap lock? [14:21:21] yup, me [14:21:42] shouldn't take it long to fail or pass [14:21:54] (03CR) 10Ottomata: Add notify Service for Kafka's Burrow. 
Bug: T123942 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/265246 (https://phabricator.wikimedia.org/T123942) (owner: 10Elukey) [14:22:41] (03PS2) 10Ottomata: Remove deprecated properties and minor comment cleanups [puppet] - 10https://gerrit.wikimedia.org/r/265260 (owner: 10DCausse) [14:22:52] he says, hopefully [14:23:05] MatmaRex: Yup, doesn't like !log reedy@tin Synchronized php-1.27.0-wmf.11/: consistency (duration: 02m 38s) [14:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:23:48] hoo: should be good now [14:23:53] Already on it :D [14:24:20] !log hoo@tin Synchronized wmf-config/db-eqiad.php: Set db1045 load to 0 (duration: 00m 32s) [14:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:27:24] (03PS1) 10BBlack: text VCL: delay mobile hostname rewrites [puppet] - 10https://gerrit.wikimedia.org/r/265271 (https://phabricator.wikimedia.org/T124166) [14:27:53] (03PS1) 10Giuseppe Lavagetto: deployment::rsync: the inactive server should sync with the master [puppet] - 10https://gerrit.wikimedia.org/r/265272 [14:28:22] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] deployment::rsync: the inactive server should sync with the master [puppet] - 10https://gerrit.wikimedia.org/r/265272 (owner: 10Giuseppe Lavagetto) [14:29:34] hoo or MatmaRex mind C+2 https://gerrit.wikimedia.org/r/265273 please? 
I filed bugs for the scap side of things [14:30:27] (03PS3) 10Ottomata: Remove deprecated properties and minor comment cleanups [puppet] - 10https://gerrit.wikimedia.org/r/265260 (owner: 10DCausse) [14:30:36] (03CR) 10Ottomata: [C: 032 V: 032] Remove deprecated properties and minor comment cleanups [puppet] - 10https://gerrit.wikimedia.org/r/265260 (owner: 10DCausse) [14:30:38] Reedy: +2ed [14:30:44] 6operations, 10MediaWiki-General-or-Unknown, 10MobileFrontend-Feature-requests, 10Traffic: Fix mobile purging - https://phabricator.wikimedia.org/T124165#1947932 (10BBlack) a:5BBlack>3None [14:30:46] But feels very dirty [14:31:01] yeah [14:31:10] I'm gonna make a patch to fix scap [14:31:44] DAMN IT [14:31:49] WHY DOES IT HAVE TO BE IN PHAB [14:32:45] Reedy: Because we want you to try Arkanist :P (Or is Arcanist?) [14:32:57] I think I made a scap patch before in it [14:33:10] # Best case scenario to begin with the php open tag [14:33:10] if text.startswith(' if text.lower().startswith() [14:34:37] _joe_: yeah, what I was going to do [14:34:46] else, regex, but that probably seems slower unless compiled [14:34:55] <_joe_> well, that's assuming your code is not in utf-8 fromat [14:35:07] <_joe_> (scap still runs on python 2.x right?) [14:35:21] Not sure [14:35:21] Reedy: scap has been deemed an early adopter of Phabricator differential on the basis the main authors are all in #releng :} [14:35:36] ok, more people awake now! need some systemd advice... [14:36:01] I'm trying to write a unit for kafka mirror maker, previously we had an init.d and a default file [14:36:23] the default and init.d scripts were able to programmatically do something that I can't seem to do with systemd [14:36:42] kafka miror-maker takes a variable list of consumer.property files [14:36:44] so like [14:37:00] --consumer.config c1.properties --consumer.config c2.properties ... 
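A note on the open-tag check quoted at 14:33: the snippet is cut off in this log, so the following is only a minimal sketch of the case-insensitive comparison being proposed, not scap's actual code. The function name `starts_with_php_open_tag` and the constant `PHP_OPEN_TAG` are hypothetical. It lowercases only the first few characters rather than the whole file, which keeps the check cheap and sidesteps part of the encoding concern raised at 14:34.

```python
# Hypothetical helper; scap's real check may be structured differently.
PHP_OPEN_TAG = "<?php"


def starts_with_php_open_tag(text):
    """Return True if text begins with the PHP open tag, ignoring case.

    Only the leading characters are lowercased, so a large file is not
    copied just to inspect its first five characters. Anything before the
    tag (whitespace, a byte order mark) makes the check fail, because PHP
    would emit that content as output.
    """
    return text[:len(PHP_OPEN_TAG)].lower() == PHP_OPEN_TAG
```

With this comparison a file starting with an uppercase `<?PHP`, as in the smtpmx.php case above, would pass, while a file with leading content would still be rejected.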
[14:37:43] the .deb packages default behavior was to have a env var in /etc/default/kafka-mirror [14:37:53] KAFKA_MIRROR_CONSUMER_CONFIGS=${KAFKA_CONFIG}/mirror/consumer*.properties [14:38:20] so any consumer.properties file there would automatically be included when the daemon started [14:38:29] (this was configurable via the default file) [14:40:24] this is handy, but i could get over it and just not provide a working default for the .deb package [14:40:30] isn't there a systemd file where various values for the service can be set? I thought [14:40:40] you can do EnvironmentFile [14:40:50] but, it won't evaulate in a shell [14:40:53] it just sets variables [14:41:03] [Service] [14:41:04] EnvironmentFile=-/etc/default/kafka-mirror [14:41:05] ? [14:41:06] PROBLEM - Apache HTTP on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:08] ja, but [14:41:16] it won't expand wildcards [14:41:21] or do any normal shell stuff [14:41:22] groan [14:41:29] you can only set static variables [14:41:30] no shell. right [14:41:45] so ja, i could just forget about a sane default in the .deb package [14:41:45] PROBLEM - HHVM rendering on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:50] and do it all in puppet [14:41:52] except [14:42:32] previously i puppetized individual consumer.properties file templates as a define [14:42:54] made each of those before the kafka-mirror service [14:42:55] PROBLEM - Check size of conntrack table on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:43:03] and then kafka -mirror could just go [14:43:05] well you can still set in the default: KAFKA_MIRROR_CONSUMER_CONFIGS=${KAFKA_CONFIG}/mirror/consumer*.properties [14:43:06] but [14:43:07] PROBLEM - DPKG on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:43:09] (03PS2) 10BBlack: text VCL: delay mobile hostname rewrites [puppet] - 10https://gerrit.wikimedia.org/r/265271 (https://phabricator.wikimedia.org/T124166) [14:43:12] halfak: no [14:43:13] oops [14:43:14] sorry [14:43:18] (autocomplete) [14:43:19] hashar_: naw [14:43:23] and wrap your daemon in a shel script that will bash/ expand KAFKA_MIRROR_CONSUMER_CONFIGS [14:43:26] PROBLEM - puppet last run on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:43:27] oh [14:43:31] pff [14:43:35] hm [14:43:37] PROBLEM - SSH on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:43:37] that is not a bad idea [14:43:41] kinda ugly but maybe that would work... [14:43:45] PROBLEM - HHVM processes on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:43:48] :P ottomata [14:43:59] hmmmm [14:44:03] lame pseudo code would be something like run-daemon-with-env.sh: eval KAFKA_MIRROR_CONSUMER_CONFIGS && my-daemon [14:44:06] PROBLEM - RAID on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:44:22] well, hm, i actually do control a shell wrapper in the package anyway.... [14:44:24] so the default file would have the raw string having a wildcard literal star [14:44:25] PROBLEM - configured eth on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:44:29] that would be launched by systemd [14:44:33] taking a look at what's up with mw1123 [14:44:34] which is passed to the shell wrapper around the daemon which actually does the expansion [14:44:35] PROBLEM - dhclient process on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:44:39] instead of trying to get it done in systemd [14:44:40] YEAHHHhhh i thikn this will work [14:44:52] good idea! ok gonna try it, thanks hashar_ [14:44:56] PROBLEM - nutcracker port on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:45:34] doesn't EnvironmentFile let you use something that the shell runs? 
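The wrapper idea above (keep the raw wildcard string in the default file, let the wrapper expand it) can be sketched roughly as follows. This is an illustration in Python of the expansion step only, under assumptions: the function name `consumer_config_args` is hypothetical, and the real wrapper in the package is a shell script whose contents are not shown in this log. The `--consumer.config` flag and the `consumer*.properties` pattern are taken from the messages above.

```python
import glob
import os


def consumer_config_args(pattern):
    """Turn a wildcard pattern into repeated --consumer.config flags.

    systemd's EnvironmentFile only assigns static strings, so the pattern
    arrives unexpanded. Here ${VAR} references are expanded from the
    environment first, then the wildcard is expanded against the
    filesystem. Paths are sorted so the argument order is stable.
    """
    expanded = os.path.expandvars(pattern)
    args = []
    for path in sorted(glob.glob(expanded)):
        args.extend(["--consumer.config", path])
    return args
```

A wrapper along these lines would read the pattern from `KAFKA_MIRROR_CONSUMER_CONFIGS`, append the resulting arguments to the daemon command line, and then exec the daemon. If no file matches, the list is empty, so the wrapper would need to decide whether that is an error.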
[14:45:51] ottomata: or since EnvironmentFile can be passed a wildcard , you could generate a default file per consumer property -:D [14:46:46] PROBLEM - nutcracker process on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:46:46] PROBLEM - Disk space on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:46:56] PROBLEM - salt-minion processes on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:47:13] !log Finished reverting migration of mobile traffic to text cluster in codfw https://phabricator.wikimedia.org/T109286 [14:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:47:18] apergos: no [14:47:25] its explicitly only static variable assignment [14:47:26] not a shell [14:50:05] well then wrap the daemon either in the service file or by itself is about all I see [14:50:05] !log powercycle mw1123, hhvm oom [14:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:50:09] like hashar says [14:50:22] yeah, gonna try it [14:50:27] i think that will work [14:50:36] its actually already wrapped by a shell script [14:50:42] i can just do the file expansion manually there [14:50:48] there ya go [14:51:36] RECOVERY - DPKG on mw1123 is OK: All packages OK [14:51:57] \O/ [14:52:06] RECOVERY - SSH on mw1123 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0) [14:52:15] RECOVERY - HHVM processes on mw1123 is OK: PROCS OK: 6 processes with command name hhvm [14:52:17] !log reedy@tin Synchronized php-1.27.0-wmf.11/vendor/: Fix ?PHP properly from commit (duration: 00m 36s) [14:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:52:36] RECOVERY - RAID on mw1123 is OK: OK: no RAID installed [14:52:46] RECOVERY - configured eth on mw1123 is OK: OK - interfaces up [14:52:56] RECOVERY - dhclient process on mw1123 is OK: PROCS OK: 0 processes with command name dhclient [14:53:00] 6operations, 
5codfw-rollout-Jan-Mar-2016: Be able to switch programmatically between deployment servers in codfw and eqiad - https://phabricator.wikimedia.org/T124024#1948002 (10Joe) a:3Joe [14:53:05] RECOVERY - nutcracker process on mw1123 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [14:53:05] RECOVERY - Disk space on mw1123 is OK: DISK OK [14:53:15] RECOVERY - salt-minion processes on mw1123 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:53:25] RECOVERY - nutcracker port on mw1123 is OK: TCP OK - 0.000 second response time on port 11212 [14:53:31] 6operations, 5codfw-rollout-Jan-Mar-2016: Be able to switch programmatically between deployment servers in codfw and eqiad - https://phabricator.wikimedia.org/T124024#1943475 (10Joe) Everything is in place for a switchover test, now scheduled for Monday 25th of january. [14:53:35] RECOVERY - Check size of conntrack table on mw1123 is OK: OK: nf_conntrack is 9 % full [14:53:55] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.331 second response time [14:54:26] RECOVERY - HHVM rendering on mw1123 is OK: HTTP OK: HTTP/1.1 200 OK - 65519 bytes in 1.407 second response time [14:56:15] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:56:49] (03CR) 10Hashar: "recheck" [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/264010 (owner: 10Hashar) [14:56:53] (03CR) 10Hashar: "recheck" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/265250 (owner: 10Hashar) [14:58:37] (03CR) 10Hashar: "check experimental" [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/264011 (owner: 10Hashar) [14:58:41] (03CR) 10Hashar: "check experimental" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/265251 (owner: 10Hashar) [14:59:29] (03CR) 10Hashar: "check experimental" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/265252 (owner: 10Hashar) 
[14:59:35] (03CR) 10Hashar: "check experimental" [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/264012 (owner: 10Hashar) [15:00:57] ottomata: I got a bunch of changes for puppet/varnishkafka and puppet/kafkatee . Merely adding .gitreview , tox as a test entry point and enabling flake8 [15:01:16] ottomata: if you could add them to your review backlog, once happy with them I will make CI to run tox for both repos [15:08:04] 6operations, 5Patch-For-Review: rack/setup pc1004-1006 - https://phabricator.wikimedia.org/T121888#1948040 (10Cmjohnson) [15:33:16] (03CR) 10BBlack: [C: 031] text VCL: delay mobile hostname rewrites [puppet] - 10https://gerrit.wikimedia.org/r/265271 (https://phabricator.wikimedia.org/T124166) (owner: 10BBlack) [15:35:26] 6operations, 10ops-eqiad: mw1228 reporting readonly file system - https://phabricator.wikimedia.org/T122005#1948108 (10faidon) a:3Cmjohnson Ping? [15:35:36] (03PS3) 10Ema: text VCL: delay mobile hostname rewrites [puppet] - 10https://gerrit.wikimedia.org/r/265271 (https://phabricator.wikimedia.org/T124166) (owner: 10BBlack) [15:36:28] (03CR) 10Ema: [C: 032 V: 032] text VCL: delay mobile hostname rewrites [puppet] - 10https://gerrit.wikimedia.org/r/265271 (https://phabricator.wikimedia.org/T124166) (owner: 10BBlack) [15:39:13] Warning: file_get_contents(/dev/fd/63): failed to open stream: No such file or directory [15:39:14] pfff [15:39:17] stupid php [15:39:48] 6operations, 10ops-eqiad: mw1228 reporting readonly file system - https://phabricator.wikimedia.org/T122005#1948123 (10Cmjohnson) p:5Normal>3High [15:41:13] 6operations, 10ops-eqiad, 10Analytics: Possible bad mem chip or slot on dbproxy1004 - https://phabricator.wikimedia.org/T123546#1948124 (10Cmjohnson) @ottomata @nuria let's coordinate a time that we can get this done. [15:41:31] 6operations, 10RESTBase: Reduce log spam by removing non-operational cassandra IPs from seeds - https://phabricator.wikimedia.org/T123869#1948125 (10Eevans) > ... 
To avoid issues in production, I think it also makes sense to initially only apply this to staging. Though it should be safe to remove all of the i... [15:41:36] (03PS1) 10Alexandros Kosiaris: servermon: Increase HOST_TIMEOUT to 45 minutes [puppet] - 10https://gerrit.wikimedia.org/r/265280 [15:41:52] 6operations, 10ops-eqiad, 10Analytics: Possible bad mem chip or slot on dbproxy1004 - https://phabricator.wikimedia.org/T123546#1948126 (10Ottomata) I think we need to coordinate with @jcrespo. This box is more than just eventlogging db proxy. [15:43:00] (03CR) 10Alexandros Kosiaris: [C: 032] servermon: Increase HOST_TIMEOUT to 45 minutes [puppet] - 10https://gerrit.wikimedia.org/r/265280 (owner: 10Alexandros Kosiaris) [15:43:06] (03PS2) 10Alexandros Kosiaris: servermon: Increase HOST_TIMEOUT to 45 minutes [puppet] - 10https://gerrit.wikimedia.org/r/265280 [15:43:17] (03CR) 10Alexandros Kosiaris: [V: 032] servermon: Increase HOST_TIMEOUT to 45 minutes [puppet] - 10https://gerrit.wikimedia.org/r/265280 (owner: 10Alexandros Kosiaris) [15:47:29] 6operations, 10ops-eqiad, 10Analytics: Possible bad mem chip or slot on dbproxy1004 - https://phabricator.wikimedia.org/T123546#1948150 (10jcrespo) No I think dbproxy1004 only serves m4/eventlogging. But we can failover to another machine without needing downtime, I just need time to setup another proxy temp... [15:56:10] godog: congratulations on swift@beta-cluster :-} [15:56:20] godog: I guess we will want to nuke a bunch of images [15:57:30] hashar_: hehe thanks, yeah what's not used can be nuked before upload [15:58:30] godog: I found some nice candidate. A few thousands huge files that got uploaded to test a GLAM related tool [15:58:42] some utility to let museum and library to bulk upload their material [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160120T1600). 
[16:00:04] Dereckson Kaldari: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:01:36] 6operations, 10Wikimedia-Mailing-lists: Fwd: usergroup-applications admin password? - https://phabricator.wikimedia.org/T124176#1948203 (10Dzahn) a:3Dzahn [16:01:39] 6operations, 10Wikimedia-Mailing-lists: Fwd: usergroup-applications admin password? - https://phabricator.wikimedia.org/T124176#1948197 (10Dzahn) [16:01:41] I can SWAT. Dereckson kaldari ping! [16:01:48] cool [16:02:08] I'm ready for deployment and testing [16:02:16] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265185 (https://phabricator.wikimedia.org/T121949) (owner: 10Kaldari) [16:03:11] (03Merged) 10jenkins-bot: Disable active gadget user stats on enwiki since it takes too long [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265185 (https://phabricator.wikimedia.org/T121949) (owner: 10Kaldari) [16:05:06] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0] [16:05:10] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Disable active gadget user stats on enwiki since it takes too long [[gerrit:265185]] (duration: 00m 32s) [16:05:14] ^ kaldari check please [16:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:05:45] thcipriani: Looks good [16:05:51] kaldari: cool, thank you! [16:07:01] Dereckson: ping me when you're around and I'll get your changes out. 
[16:10:09] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1948215 (10BBlack) [16:10:11] 6operations, 10Traffic, 5Patch-For-Review: Fix varnish handling of mobile hostname rewriting - https://phabricator.wikimedia.org/T124166#1948214 (10BBlack) 5Open>3Resolved [16:11:27] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [16:12:23] 6operations, 10Traffic, 5Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#1948219 (10BBlack) > The server directive has a max_conns value, but... > > "limits the maximum number of simultaneous active connections to the proxied server (1.5.9). D... [16:13:08] !log bounce hhvm on mw1191 and syntaxlight runaway processes [16:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:13:16] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.043 second response time [16:13:34] thcipriani: ping [16:13:46] sorry, I was distracted by something on #wikipedia-fr [16:13:53] PROBLEM - MariaDB Slave SQL: s5 on db1026 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1176, Errmsg: Error Key PRIMARY doesnt exist in table wb_terms on query. Default database: wikidatawiki. Query: INSERT LOW_PRIORITY IGNORE INTO wikidatawiki._wb_terms_new (term_row_id, term_entity_id, term_entity_type, term_language, term_type, term_text, term_search_key, term_weight) SELECT term_row_id, term_entity_id, term_e [16:13:57] Dereckson: pong, np, merging stuff now :) [16:14:35] RECOVERY - HHVM rendering on mw1191 is OK: HTTP OK: HTTP/1.1 200 OK - 65535 bytes in 1.336 second response time [16:14:54] mh same thing as this morning, I guess schema migration isn't finished yet jynus ?
[16:14:59] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265163 (https://phabricator.wikimedia.org/T124080) (owner: 10Dereckson) [16:15:32] <_joe_> thcipriani: not sure it's advisable to release now, we have a db page [16:15:40] whats the latest page? [16:15:43] <_joe_> jynus: are you already working on it? [16:15:54] <_joe_> db1026 replication broken again [16:15:55] (03Merged) 10jenkins-bot: Add *.archives.gov to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265163 (https://phabricator.wikimedia.org/T124080) (owner: 10Dereckson) [16:16:20] yeah, hoo convinced me on trying something for some API problems, but it didn't work [16:16:33] _joe_: ack. I can hold. [16:16:41] I will downtime it if it flaps again [16:16:46] <_joe_> jynus: can deployments go on? [16:17:01] yes, it should not affect mediawiki in general [16:17:09] <_joe_> thcipriani: go on then :) [16:17:12] jynus: _joe_ kk, thanks :) [16:17:24] it is just s5 more stressed than usual [16:17:31] but in fully read-write mode [16:17:37] will log when finished [16:19:02] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265165 (https://phabricator.wikimedia.org/T121779) (owner: 10Dereckson) [16:19:06] PROBLEM - puppet last run on mw2109 is CRITICAL: CRITICAL: puppet fail [16:19:14] PROBLEM - MariaDB Slave Lag: s5 on db1045 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 722 [16:19:25] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Add *.archives.gov to wgCopyUploadsDomains [[gerrit:265163]] (duration: 00m 32s) [16:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:19:29] (03Merged) 10jenkins-bot: Add *.bodleian.ox.ac.uk to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265165 (https://phabricator.wikimedia.org/T121779) (owner: 10Dereckson) [16:21:12] !log thcipriani@tin Synchronized 
wmf-config/InitialiseSettings.php: SWAT: Add *.bodleian.ox.ac.uk to wgCopyUploadsDomains [[gerrit:265165]] (duration: 00m 33s) [16:21:15] ^ Dereckson check please [16:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:21:56] I'm checking whether a rebase is needed for 259003. [16:22:25] Dereckson: could you manually rebase https://gerrit.wikimedia.org/r/#/c/259003/ ? gerrit is mad at me [16:24:04] Rebased. I'm pushing to Gerrit. [16:24:13] kk, thanks. [16:26:35] godog: I am going to mass clear beta-cluster uploaded files [16:26:42] godog: and purge deleted/archived as well [16:26:48] godog: will keep you informed :-} [16:27:05] (03PS3) 10Dereckson: Add davidabian.com to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259003 (https://phabricator.wikimedia.org/T121383) [16:27:53] (03PS3) 10Giuseppe Lavagetto: lvs: use etcd for pybal config for ulsfo backups [puppet] - 10https://gerrit.wikimedia.org/r/263847 [16:29:11] Dereckson: 259003 looks ready to me, are you ready to get it out? [16:30:57] hashar_: ok thanks! btw I'll be VAC 21-28 but should be already accessible from deployment-prep [16:31:12] Yes. [16:31:34] 265163 tested. [16:31:45] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259003 (https://phabricator.wikimedia.org/T121383) (owner: 10Dereckson) [16:32:11] (03Merged) 10jenkins-bot: Add davidabian.com to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259003 (https://phabricator.wikimedia.org/T121383) (owner: 10Dereckson) [16:33:28] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Add davidabian.com to wgCopyUploadsDomains [[gerrit:259003]] (duration: 00m 32s) [16:33:30] ^ Dereckson check please [16:34:15] (03PS1) 10Yurik: Change default graph version param [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265282 [16:34:31] 265165 tested.
[16:35:38] thcipriani, sorry i forgot to add ^^ earlier, could you also deploy that minor config change? [16:35:51] * thcipriani looks [16:37:35] (03PS1) 10ArielGlenn: bump version to 0.0.5 [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/265284 [16:37:52] yurik: yup, np [16:37:59] thanks! [16:38:05] Hmmmm... davidabian.com uses http://davidabian.com too, I'm going to need to amend that. [16:38:24] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265282 (owner: 10Yurik) [16:39:06] (03Merged) 10jenkins-bot: Change default graph version param [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265282 (owner: 10Yurik) [16:40:40] (03PS1) 10Dereckson: Add davidabian.com to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265286 [16:42:01] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Change default graph version param. Part I [[gerrit:265282]] (duration: 00m 36s) [16:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:42:04] (03CR) 10ArielGlenn: [C: 032 V: 032] bump version to 0.0.5 [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/265284 (owner: 10ArielGlenn) [16:42:23] thcipriani, i added the patch to SWAT list [16:42:26] thanks! [16:42:40] thcipriani: could you also deploy 265286? [16:42:43] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: Change default graph version param. Part II [[gerrit:265282]] (duration: 00m 32s) [16:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:42:59] ^ yurik check please. Thanks for adding to SWAT list.
[16:43:07] Dereckson: looking [16:44:33] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265286 (owner: 10Dereckson) [16:44:57] (03Merged) 10jenkins-bot: Add davidabian.com to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265286 (owner: 10Dereckson) [16:45:07] RECOVERY - puppet last run on mw2109 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:45:17] I've updated [[Deployments]] to add it. [16:46:12] cool, thank you [16:46:35] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Add davidabian.com to wgCopyUploadsDomains [[gerrit:265286]] (duration: 00m 32s) [16:46:41] ^ Dereckson check please [16:47:59] Tested. Works fine. [16:48:03] Thank you for the deployment. [16:48:27] 6operations, 7Tracking: Improve access to and control over incident and metrics monitoring infrastructure - https://phabricator.wikimedia.org/T124179#1948329 (10akosiaris) 3NEW [16:48:31] 6operations, 6Performance-Team, 7Availability, 7Epic, 5codfw-rollout-Jan-Mar-2016: Cleanup active-DC based MW config code and make it more robust and easy to change - https://phabricator.wikimedia.org/T114273#1948336 (10Joe) [16:48:45] Dereckson: thank you for checking, appreciated. [16:49:11] 6operations, 10Analytics-Cluster, 10EventBus, 6Services: Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters. - https://phabricator.wikimedia.org/T123954#1948340 (10aaron) [16:49:41] 6operations, 7Monitoring, 7Tracking: Improve access to and control over incident and metrics monitoring infrastructure - https://phabricator.wikimedia.org/T124179#1948342 (10akosiaris) [16:52:14] 10Ops-Access-Requests, 6operations: Access for new Analytics Opsen: Luca Toscano - https://phabricator.wikimedia.org/T122925#1948354 (10Joe) Actually during the ops meeting it was decided luca should have full access to the cluster, as @mark confirmed. 
[16:52:43] (03PS1) 10Giuseppe Lavagetto: admin: add Luca (elukey) to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/265287 (https://phabricator.wikimedia.org/T122925) [16:53:06] (03PS1) 10Ottomata: Add special opt --consumer.configs to kafka mirror-maker command [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/265288 (https://phabricator.wikimedia.org/T124077) [16:53:10] thcipriani, not good :( breaks in a few places. Could you mark the default as 1 for now (no need to rollback). I will submit a patch in a sec [16:54:08] (03CR) 10Ottomata: [C: 031] admin: add Luca (elukey) to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/265287 (https://phabricator.wikimedia.org/T122925) (owner: 10Giuseppe Lavagetto) [16:54:29] 6operations, 10Traffic: test ticket - https://phabricator.wikimedia.org/T124182#1948370 (10Krenair) 3NEW [16:54:32] (03PS1) 10Yurik: Set default graph vega version back to 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265289 [16:54:33] thcipriani, ^ [16:54:41] 6operations, 10Traffic: test ticket - https://phabricator.wikimedia.org/T124182#1948378 (10Krenair) 5Open>3Invalid a:3Krenair [16:54:56] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265289 (owner: 10Yurik) [16:55:16] thanks, sorry for the trouble [16:55:29] (03Merged) 10jenkins-bot: Set default graph vega version back to 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265289 (owner: 10Yurik) [16:55:54] (03CR) 10Giuseppe Lavagetto: [C: 032] admin: add Luca (elukey) to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/265287 (https://phabricator.wikimedia.org/T122925) (owner: 10Giuseppe Lavagetto) [16:56:19] yurik: np, thanks for the followup. 
[16:56:41] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Set default graph vega version back to 1 [[gerrit:265289]] (duration: 00m 32s) [16:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:56:46] ^ yurik check please [16:56:51] * yurik looks [16:57:07] thcipriani, all's good, thx [16:57:18] yurik: kk, thanks for checking. [16:57:41] anybody wants to eval.php something for me, on enwiki? [16:57:44] echo ObjectCache::getMainWANInstance()->get( wfMemcKey( 'unpatrollable-page', 45686529 ) ); [16:57:52] the output is probably '1'. or null. [16:58:39] (03PS3) 10Dzahn: admin: add dc-ops to install-server, allow puppet cmds [puppet] - 10https://gerrit.wikimedia.org/r/264994 (https://phabricator.wikimedia.org/T123681) [16:58:40] 6operations, 7Monitoring: Evaluate alternative web interfaces to icinga 1 core - https://phabricator.wikimedia.org/T124185#1948409 (10akosiaris) 3NEW [16:58:56] 6operations, 7Monitoring: Evaluate alternative web interfaces to icinga 1 core - https://phabricator.wikimedia.org/T124185#1948409 (10akosiaris) [16:58:58] 6operations, 7Monitoring, 7Tracking: Improve access to and control over incident and metrics monitoring infrastructure - https://phabricator.wikimedia.org/T124179#1948415 (10akosiaris) [16:59:05] 7Puppet, 6Analytics-Kanban, 10Analytics-Wikimetrics, 5Patch-For-Review: Cleanup Wikimetrics puppet module so it can run puppet continuously without own puppetmaster {dove} [21 pts] - https://phabricator.wikimedia.org/T101763#1948417 (10Nuria) 5Open>3Resolved [16:59:22] 7Puppet, 6Analytics-Kanban, 10Analytics-Wikimetrics, 5Patch-For-Review: Cleanup Wikimetrics puppet module so it can run puppet continuously without own puppetmaster {dove} [21 pts] - https://phabricator.wikimedia.org/T101763#1347110 (10Nuria) [17:04:16] Krenair: could you eval.php something for me, on enwiki? 
`echo ObjectCache::getMainWANInstance()->get( wfMemcKey( 'unpatrollable-page', 45686529 ) );` the output is probably '1', or null. [17:04:40] MatmaRex: 1 [17:04:53] RECOVERY - MariaDB Slave Lag: s5 on db1045 is OK: OK slave_sql_lag Seconds_Behind_Master: 43 [17:05:15] 6operations, 10MediaWiki-General-or-Unknown, 10MobileFrontend-Feature-requests, 10Traffic: Fix mobile purging - https://phabricator.wikimedia.org/T124165#1948456 (10BBlack) ping to check bot [17:06:48] Krenair: thanks, i quoted you on https://phabricator.wikimedia.org/T123747 :) [17:07:34] Krenair: We can bump wikitech to .10 or .11 now [17:07:44] cool [17:08:01] want to just do it? [17:08:09] I think so [17:08:21] Reedy, uh, what about https://gerrit.wikimedia.org/r/#/c/265196/ ? [17:08:22] We can easily see if it majorly breaks anything [17:08:38] Krenair: I already did it [17:08:46] rebasing yours made it a noop [17:09:33] !log restbase cassandra: set DTCS max_window_size_seconds to 70736000, large enough to accommodate a two-year window [17:09:34] k [17:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:09:49] I did have a look, but apparently didn't see the bump commit [17:11:42] milimetric: Could you review https://gerrit.wikimedia.org/r/#/c/264700/ and https://gerrit.wikimedia.org/r/#/c/264698/3 , or recommend someone who can? 
[17:12:01] * milimetric looks while in a meeting [17:18:56] (03PS1) 10Bmansurov: Add sampling rates for mobile web language switcher [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) [17:19:24] (03CR) 10Dzahn: [C: 031] Add mobrovac to mobileapps-admin group [puppet] - 10https://gerrit.wikimedia.org/r/265253 (https://phabricator.wikimedia.org/T123540) (owner: 10Alexandros Kosiaris) [17:20:49] 6operations, 7Availability, 5codfw-rollout-Jan-Mar-2016: Figure out a replication strategy for Swift - https://phabricator.wikimedia.org/T91869#1948513 (10fgiunchedi) [17:21:52] 6operations, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1948514 (10Dzahn) [17:22:23] 6operations, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1931694 (10Dzahn) reduced count from 92 to 86 (mostly by decom'ing and shutting down unused servers) [17:28:49] 6operations, 7Graphite: Wes Moran not able to log into Graphite - https://phabricator.wikimedia.org/T123796#1948560 (10eliza) @Dzahn Just connected with Wes and he mentioned this user name: westonnh Is it possible to use that? Eliza [17:30:52] (03CR) 10Mobrovac: [C: 04-1] "The puppet compiler is happy with this change - https://puppet-compiler.wmflabs.org/1635/ . 
However, that's a fallacy because this patch i" [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) (owner: 10Subramanya Sastry) [17:33:18] 6operations, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1948569 (10fgiunchedi) note from techops meeting: main blocker for bastion conversion is ganglia upstart -> systemd conversion [17:40:18] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: add datacenter-ops to dhcp /install-server and allow to run puppet commands - https://phabricator.wikimedia.org/T123681#1948598 (10Dzahn) [17:41:04] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: add datacenter-ops to dhcp /install-server and allow to run puppet commands - https://phabricator.wikimedia.org/T123681#1935599 (10Dzahn) thanks @akosiaris i amended the change and edited the title here to add the ability to run puppet commands. i chang... [17:43:13] (03PS6) 10Eevans: Enable EventBus extension on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265142 (https://phabricator.wikimedia.org/T116786) [17:48:49] !log restarting replication on db1026 after schema change [17:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:49:33] RECOVERY - MariaDB Slave SQL: s5 on db1026 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:49:41] PROBLEM - MariaDB Slave Lag: s5 on db1026 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 20606 [17:55:35] 6operations, 10Fundraising-Backlog, 10Traffic, 10Unplanned-Sprint-Work, 5Patch-For-Review: Firefox SPDY-coalesces requests to geoiplookup over text-lb, causing GeoIP IPv6 failures - https://phabricator.wikimedia.org/T121922#1948639 (10BBlack) 5Open>3Resolved a:3BBlack Assuming this is no longer an... 
[17:55:50] (03CR) 10Mobrovac: [C: 031] Enable EventBus extension on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265142 (https://phabricator.wikimedia.org/T116786) (owner: 10Eevans) [17:56:13] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [24.0] [17:58:53] T124194 [18:05:13] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] [18:06:26] !log turning up BGP with Zayo in codfw [18:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:08:34] (03PS3) 10Dzahn: Add mobrovac to mobileapps-admin group [puppet] - 10https://gerrit.wikimedia.org/r/265253 (https://phabricator.wikimedia.org/T123540) (owner: 10Alexandros Kosiaris) [18:08:54] (03CR) 10Dzahn: [C: 032] "approved in meeting" [puppet] - 10https://gerrit.wikimedia.org/r/265253 (https://phabricator.wikimedia.org/T123540) (owner: 10Alexandros Kosiaris) [18:10:08] (03CR) 10Subramanya Sastry: "Yes, since ruthenium is now moving to jessie now. I'll have to update the already-merged parsoid-rt* stuff too besides the parsoid-vd* stu" [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) (owner: 10Subramanya Sastry) [18:10:15] jynus: hello (cc ottomata ) do you want to do the EL scheduled maintenance despite the lag issues? [18:10:33] what lag issues?
[18:11:43] jynus: replication lag [18:11:45] hi, BTW :-) [18:11:53] jynus: looks like we are still a couple of days behind [18:11:58] jynus: ay sorry [18:12:00] let me check [18:12:26] jynus: will be back in a sec [18:13:13] no, that is not true, there is a peak lag of 3 hours [18:13:24] most tables are under 1 hour [18:14:03] 10Ops-Access-Requests, 6operations, 6Services, 3Mobile-Content-Service, 5Patch-For-Review: Allow mobrovac to restart MobileApps - https://phabricator.wikimedia.org/T123540#1948697 (10Dzahn) Notice: /Stage[main]/Admin/Admin::Groupmembers[mobileapps-admin]/Exec[mobileapps-admin_ensure_members]/returns: exe... [18:14:12] jynus: when would you start, tomorrow? [18:14:14] 6operations, 6Services, 10Wikimedia-Developer-Summit-2016: Service Ownership and Maintenance - https://phabricator.wikimedia.org/T122825#1948700 (10Aklapper) Wikimedia Developer Summit 2016 ended two weeks ago. This task is still open. **If the session in this task took place**, please make sure 1) that the... [18:14:17] friday? [18:15:35] 10Ops-Access-Requests, 6operations, 6Services, 3Mobile-Content-Service, 5Patch-For-Review: Allow mobrovac to restart MobileApps - https://phabricator.wikimedia.org/T123540#1948727 (10Dzahn) 5Open>3Resolved a:3Dzahn @mobrovac you are now in the mobileapps group and that gives you this: ``` root@sc... [18:17:15] 6operations, 3Mobile-Content-Service: Improve operational documentation for the MobileApps extension - https://phabricator.wikimedia.org/T123852#1948765 (10Mholloway) a:3Mholloway [18:17:17] 10Ops-Access-Requests, 6operations, 6Services, 3Mobile-Content-Service: Allow mobrovac to restart MobileApps - https://phabricator.wikimedia.org/T123540#1932040 (10Dzahn) [18:18:21] PROBLEM - MariaDB Slave Lag: s5 on db1070 is CRITICAL: CRITICAL slave_sql_lag could not connect [18:18:33] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:18:39] <_joe_> what?
[18:18:48] <_joe_> jynus: ^^ known? [18:18:54] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:19:36] yep [18:19:50] discussing on wikipedia-databases [18:20:37] but it should be only transient [18:20:58] 6operations: compressed http responses without content-length not cached by varnish - https://phabricator.wikimedia.org/T124195#1948786 (10fgiunchedi) 3NEW [18:20:59] When a config change is deployed on the main cluster, how is it also propagated to Labs (here for Commons beta)? Is there an automatic sync or is it a manual operation? [18:21:04] 6operations, 10ops-codfw: ms-be2015.codfw.wmnet: slot=11 dev=sdl failed - https://phabricator.wikimedia.org/T123830#1948793 (10Papaul) a:5Papaul>3fgiunchedi Drive replacement complete [18:21:44] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [1000.0] [18:22:11] Dereckson: Automatic [18:22:37] 6operations, 3Mobile-Content-Service: Improve operational documentation for the MobileApps extension - https://phabricator.wikimedia.org/T123852#1948796 (10Mholloway) I can start writing this up after tomorrow morning's SWAT. [18:23:04] 6operations: compressed http responses without content-length not cached by varnish - https://phabricator.wikimedia.org/T124195#1948797 (10fgiunchedi) [18:23:05] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 2, dormant: 0, excluded: 1, unused: 0; xe-1/3/0: down - DISABLED; xe-0/0/3: down - DISABLED [18:24:31] PROBLEM - MariaDB Slave IO: s5 on db1070 is CRITICAL: CRITICAL slave_io_state could not connect [18:24:34] hoo: during the sync operation or later?
[18:24:37] 6operations, 3Mobile-Content-Service: Improve operational documentation for the mobileapps service - https://phabricator.wikimedia.org/T123852#1948805 (10Mholloway) [18:25:03] Dereckson: Later on [18:25:03] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [18:25:32] PROBLEM - MariaDB Slave SQL: s5 on db1070 is CRITICAL: CRITICAL slave_sql_state could not connect [18:25:44] 6operations, 10ops-eqiad, 10Analytics: Possible bad mem chip or slot on dbproxy1004 - https://phabricator.wikimedia.org/T123546#1948817 (10Nuria) @Cmjohson: the update should only be a few minutes right? If so let's do it today/tomorrow if possible. [18:25:52] Thanks. [18:26:04] PROBLEM - HHVM rendering on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:26:23] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:26:35] 6operations, 10Traffic: compressed http responses without content-length not cached by varnish - https://phabricator.wikimedia.org/T124195#1948818 (10Dzahn) [18:26:54] PROBLEM - Apache HTTP on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:02] 6operations, 10Traffic: compressed http responses without content-length not cached by varnish - https://phabricator.wikimedia.org/T124195#1948821 (10BBlack) Yeah copying a bit from IRC discussion at the time: basically anytime apache's configured to gzip stuff, it's going to have a deflate buffer size limit c... 
[18:27:09] operations: Port Ganglia aggregator setup to systemd - https://phabricator.wikimedia.org/T124197#1948822 (faidon) NEW
[18:27:12] operations, Traffic: compressed http responses without content-length not cached by varnish - https://phabricator.wikimedia.org/T124195#1948829 (Dzahn) you could call it just an Apache config issue instead of varnish, but tagged "Traffic" anyways, feel free to remove it again if you think it shouldn't be
[18:27:24] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:27:51] RECOVERY - MariaDB Slave SQL: s5 on db1070 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[18:28:04] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:28:23] RECOVERY - HHVM rendering on mw1145 is OK: HTTP OK: HTTP/1.1 200 OK - 65528 bytes in 4.494 second response time
[18:28:34] operations, codfw-rollout, codfw-rollout-Jan-Mar-2016: Be able to switch programmatically between deployment servers in codfw and eqiad - https://phabricator.wikimedia.org/T124024#1948830 (Aklapper)
[18:28:41] operations, codfw-rollout, codfw-rollout-Jan-Mar-2016: url-downloader should be set up more redundantly - https://phabricator.wikimedia.org/T122134#1948831 (Aklapper)
[18:28:47] operations, Performance-Team, Availability, Epic, and 2 others: Cleanup active-DC based MW config code and make it more robust and easy to change - https://phabricator.wikimedia.org/T114273#1948834 (Aklapper)
[18:28:49] operations, CirrusSearch, Discovery, codfw-rollout, codfw-rollout-Jul-Sep-2015: Implement multi-DC support in CirrusSearch - https://phabricator.wikimedia.org/T105709#1948836 (Aklapper)
[18:28:51] operations, Discovery, codfw-rollout, codfw-rollout-Jul-Sep-2015: Set up a CirrusSearch cluster in codfw (Dallas, Texas) - https://phabricator.wikimedia.org/T105703#1948837 (Aklapper)
[18:28:53] operations, codfw-rollout, codfw-rollout-Jan-Mar-2016: Scale up and out our puppetmaster infrastructure - https://phabricator.wikimedia.org/T98128#1948838 (Aklapper)
[18:28:55] operations, RESTBase, RESTBase-Cassandra, codfw-rollout, codfw-rollout-Jan-Mar-2016: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1948835 (Aklapper)
[18:28:59] operations, ops-codfw, codfw-rollout, codfw-rollout-Apr-Jun-2015: mw2050 has probably a faulty disk - https://phabricator.wikimedia.org/T93858#1948842 (Aklapper)
[18:29:01] operations, codfw-rollout, codfw-rollout-Apr-Jun-2015, codfw-rollout-Jan-Mar-2016: Document what is left for having a full cluster installation in codfw - https://phabricator.wikimedia.org/T97322#1948840 (Aklapper)
[18:29:03] operations, codfw-rollout, codfw-rollout-Apr-Jun-2015: Pybal RunCommand monitor doesn't work correctly on ubuntu trusty - https://phabricator.wikimedia.org/T94822#1948841 (Aklapper)
[18:29:05] operations, DBA, Patch-For-Review, codfw-rollout, codfw-rollout-Apr-Jun-2015: Grant access to the databases to codfw appserver networks - https://phabricator.wikimedia.org/T93211#1948844 (Aklapper)
[18:29:07] operations, ops-codfw, netops, codfw-rollout, codfw-rollout-Apr-Jun-2015: Codfw mediawiki appservers from any rows but row A can't communicate with the dhcp server - https://phabricator.wikimedia.org/T92815#1948845 (Aklapper)
[18:29:10] operations, Patch-For-Review, codfw-rollout, codfw-rollout-Apr-Jun-2015: Set up load balancing for appservers in dallas - https://phabricator.wikimedia.org/T92377#1948846 (Aklapper)
[18:29:12] operations, ops-codfw, codfw-rollout, codfw-rollout-Apr-Jun-2015: rack and setup rdb2001-2004 - https://phabricator.wikimedia.org/T92013#1948847 (Aklapper)
[18:29:14] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.297 second response time
[18:29:14] operations, Patch-For-Review, codfw-rollout, codfw-rollout-Apr-Jun-2015: setup & deploy rdb2001-2004 - https://phabricator.wikimedia.org/T92011#1948848 (Aklapper)
[18:29:15] can someone tell what was the real impact to users?
[18:29:16] operations, Wikimedia-Apache-configuration, Wikimedia-Site-Requests, codfw-rollout, codfw-rollout-Apr-Jun-2015: Configure mediawiki to operate in the Dallas DC - https://phabricator.wikimedia.org/T91754#1948850 (Aklapper)
[18:29:19] operations, Availability, codfw-rollout, codfw-rollout-Jan-Mar-2016: Figure out a replication strategy for Swift - https://phabricator.wikimedia.org/T91869#1948849 (Aklapper)
[18:29:21] operations, hardware-requests, codfw-rollout, codfw-rollout-Apr-Jun-2015: setup deployment server in codfw (tin equivalent) - https://phabricator.wikimedia.org/T91678#1948851 (Aklapper)
[18:29:22] RECOVERY - MariaDB Slave IO: s5 on db1070 is OK: OK slave_io_state Slave_IO_Running: Yes
[18:29:23] operations, ops-codfw, codfw-rollout, codfw-rollout-Apr-Jun-2015: Configure mw2001-2134 correctly - https://phabricator.wikimedia.org/T91238#1948852 (Aklapper)
[18:29:25] operations, ops-codfw, codfw-rollout, codfw-rollout-Apr-Jun-2015: Console on mc2013 is unresponsive - https://phabricator.wikimedia.org/T90580#1948855 (Aklapper)
[18:29:27] operations, ops-codfw, codfw-rollout, codfw-rollout-Apr-Jun-2015: Console on mc2001 is unresponsive - https://phabricator.wikimedia.org/T90559#1948856 (Aklapper)
[18:29:29] operations, ops-codfw, codfw-rollout, codfw-rollout-Apr-Jun-2015: PXE doesn't work on mc2017-18 - https://phabricator.wikimedia.org/T90586#1948854 (Aklapper)
[18:29:31] operations, ops-codfw, codfw-rollout, codfw-rollout-Apr-Jun-2015: mc2004 console is unreadable remotely - https://phabricator.wikimedia.org/T90883#1948853 (Aklapper)
[18:29:33] operations, ops-codfw, codfw-rollout, codfw-rollout-Apr-Jun-2015: Move network cable to the other port on codfw memcached hosts - https://phabricator.wikimedia.org/T90456#1948857 (Aklapper)
[18:29:35] operations, Patch-For-Review, codfw-rollout, codfw-rollout-Apr-Jun-2015: deploy wtp2001-2020 - https://phabricator.wikimedia.org/T90271#1948858 (Aklapper)
[18:29:37] operations, codfw-rollout, codfw-rollout-Apr-Jun-2015: deploy services on rbf2001-2002 - https://phabricator.wikimedia.org/T88309#1948860 (Aklapper)
[18:29:39] operations, ops-codfw, codfw-rollout, codfw-rollout-Apr-Jun-2015: Rebalance mc locations & update mgmt addresses for mc2001-mc2018 memcached servers in codfw - https://phabricator.wikimedia.org/T88693#1948859 (Aklapper)
[18:29:41] Puppet, operations, Patch-For-Review, codfw-rollout, codfw-rollout-Apr-Jun-2015: Check that the redis roles can be applied in codfw, set up puppet. - https://phabricator.wikimedia.org/T86898#1948861 (Aklapper)
[18:29:41] oh my
[18:29:43] operations, codfw-rollout, codfw-rollout-Apr-Jun-2015: Setup the main appservers cluster in codfw - https://phabricator.wikimedia.org/T86893#1948864 (Aklapper)
[18:29:44] ^ andre is testing mass edits and the bot does _not_ get kicked anymore, that's the new part
[18:29:45] operations, codfw-rollout, codfw-rollout-Apr-Jun-2015: Set up the mediawiki application layer in codfw - https://phabricator.wikimedia.org/T86894#1948863 (Aklapper)
[18:29:47] operations, ops-codfw, hardware-requests, Patch-For-Review, and 2 others: Procure rdb2001-2004 - onsite pending racking - https://phabricator.wikimedia.org/T86896#1948862 (Aklapper)
[18:29:49] operations, codfw-rollout, codfw-rollout-Apr-Jun-2015: Setup the api appservers cluster in codfw - https://phabricator.wikimedia.org/T86892#1948865 (Aklapper)
[18:29:51] operations, codfw-rollout, codfw-rollout-Apr-Jun-2015: Setup videoscalers cluster in codfw - https://phabricator.wikimedia.org/T86891#1948866 (Aklapper)
[18:29:53] operations, codfw-rollout, codfw-rollout-Apr-Jun-2015: Setup imagescalers cluster in codfw - https://phabricator.wikimedia.org/T86890#1948867 (Aklapper)
[18:29:55] operations, codfw-rollout, codfw-rollout-Apr-Jun-2015: Setup jobrunners cluster in codfw - https://phabricator.wikimedia.org/T86889#1948868 (Aklapper)
[18:29:57] operations, Patch-For-Review, codfw-rollout, codfw-rollout-Apr-Jun-2015: Setup redis clusters in codfw - https://phabricator.wikimedia.org/T86887#1948870 (Aklapper)
[18:29:59] operations, codfw-rollout, codfw-rollout-Apr-Jun-2015: Setup memcached cluster in codfw - https://phabricator.wikimedia.org/T86888#1948869 (Aklapper)
[18:30:03] operations, codfw-rollout, codfw-rollout-Apr-Jun-2015: install/deploy codfw appservers - https://phabricator.wikimedia.org/T85227#1948871 (Aklapper)
[18:30:09] mutante: I'm not **testing** mass edits. I just mass edit. ;)
[18:30:22] for https://phabricator.wikimedia.org/T124057
[18:30:31] well ok, cool, usually wikibugs would have died by now, and it's still alive
[18:30:43] Death isn't anymore what it was supposed to be.
[18:30:54] yea
[18:31:11] operations: Reimage hooft with jessie and rename to bast3001 - https://phabricator.wikimedia.org/T123712#1948877 (faidon) Note that hooft is the last of the legacy .esams.wikimedia.org hostnames. Whether we rename it to bast3001 or not, we should drop that suffix (look at older commits where I did the same wi...
[18:31:22] jynus: $ grep -c 10.64.48.25 exception.log
[18:31:22] 4771
[18:31:35] operations, Incident-Labs-NFS-20151216: Reinstall labstore1002 to ensure consistency with labstore1001 - https://phabricator.wikimedia.org/T121905#1948878 (chasemp) Thanks chris, and since T123740 is done I want to get to this sooner than later
[18:31:45] I see 700K errors on the logs
[18:32:11] but not sure how many of those reach the user, if queries are retried, etc.
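A side note on the `grep -c 10.64.48.25 exception.log` count above: in a grep pattern the dots match any character, and the pattern is also a prefix of addresses such as 10.64.48.250, so the figure can include a few stray lines. The following is only a rough Python sketch of a stricter count, not something that was run during the incident; `count_lines_for_ip` and the sample lines are hypothetical.

```python
import re

def count_lines_for_ip(lines, ip):
    """Count lines that mention exactly this IPv4 address."""
    # re.escape makes the dots literal; the lookarounds reject longer
    # numbers such as 10.64.48.250 or 110.64.48.25.
    pattern = re.compile(r"(?<!\d)" + re.escape(ip) + r"(?!\d)")
    return sum(1 for line in lines if pattern.search(line))

# Hypothetical sample lines, not taken from the real exception.log.
sample = [
    "db error from 10.64.48.25 (lock wait timeout)",
    "db error from 10.64.48.250 (lock wait timeout)",
    "request id 10764048125 failed",
    "unrelated line",
]
print(count_lines_for_ip(sample, "10.64.48.25"))  # prints 1
```

Like `grep -c`, this counts matching lines rather than individual occurrences, and it says nothing about how many of those errors actually reached users.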
[18:32:17] operations, Incident-Labs-NFS-20151216: Investigate need and candidate for labstore100(1|2) kernel upgrade - https://phabricator.wikimedia.org/T121903#1948883 (chasemp) What are you thinking, 4.x?
[18:32:21] RECOVERY - MariaDB Slave Lag: s5 on db1070 is OK: OK slave_sql_lag Seconds_Behind_Master: 0
[18:32:23] !log bounce stuck hhvm on mw1205
[18:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:33:00] jynus: What kind of errors?
[18:33:12] Thinking about it... there are many which just timed out, I guess
[18:33:18] those don't end up in exception.log
[18:33:28] there is a good spike here: https://grafana.wikimedia.org/dashboard/db/varnish-http-errors
[18:34:03] RECOVERY - HHVM rendering on mw1205 is OK: HTTP OK: HTTP/1.1 200 OK - 65527 bytes in 0.718 second response time
[18:34:04] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.112 second response time
[18:34:17] (PS4) Elukey: Add notify Service for Kafka's Burrow. Bug: T123942 [puppet] - https://gerrit.wikimedia.org/r/265246 (https://phabricator.wikimedia.org/T123942)
[18:34:30] !log restart hhvm on mw1206
[18:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:34:36] jynus: That's probably pretty accurate
[18:34:50] those are not too bad, once the server got depooled
[18:35:01] most of the log errors are probably probes
[18:35:11] graphite check says it could not even, because "Overall insertion rate from MySQL consumer"
[18:35:31] that generates around 200 errors/minute
[18:35:50] mutante, "it could not even"?
[18:36:23] it's coming back
[18:36:28] we just deployed eventlogging with a mysql consumer change
[18:36:39] mutante: ^
[18:36:45] jynus: sorry, that was wrong. the name of the check is "insertion rate" and it's "20.00% of data under the critical threshold [10.0]"
[18:37:05] ottomata: ah!
[18:37:05] mutante, ottomata you are talking about something different
[18:37:13] eventlogging is not a problem
[18:37:18] oh
[18:37:19] hahaha
[18:37:19] wikidata was
[18:37:21] indeed
[18:37:22] :)
[18:37:58] there was a server having fun with metadata locking
[18:38:45] I think it was only a 5 minute issue
[18:39:15] from 18:14 to 18:19
[18:39:37] which was worse than intended because we were already in degraded redundancy
[18:41:13] RECOVERY - puppet last run on ms-be2015 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[18:41:21] good news is that:
[18:41:47] !log schema change on wikidatawiki (wb_terms) finished- slaves already catching up
[18:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:44:31] operations, ops-eqiad, Analytics: Possible bad mem chip or slot on dbproxy1004 - https://phabricator.wikimedia.org/T123546#1948935 (Nuria) @Cmjohson: @madhuvishy is on ops duty this week and she can help coordinate this small maintenance window. We just need to: 1. communicate to list 2. stop el,...
[18:45:39] operations, ops-codfw: ms-be2015.codfw.wmnet: slot=11 dev=sdl failed - https://phabricator.wikimedia.org/T123830#1948945 (fgiunchedi) Open>Resolved disk replaced
[18:46:23] operations, Incident-Labs-NFS-20151216: Reinstall labstore1002 to ensure consistency with labstore1001 - https://phabricator.wikimedia.org/T121905#1948960 (faidon) (transcribing from IRC for posterity) Nothing specifically and no specific kernel recommendation (other than our standard). This really needs...
[18:48:13] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:48:36] operations, Project-Creators: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#1948970 (Aklapper)
[18:48:38] chasemp: I resized labstore2001, no problems. Do you want me to assign the bug to you for further research, or close it?
[18:49:45] (PS5) Elukey: Add subscribe config file for Kafka's Burrow service Bug: T123942 [puppet] - https://gerrit.wikimedia.org/r/265246 (https://phabricator.wikimedia.org/T123942)
[18:50:23] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[18:50:59] operations, Wikimedia-Mailing-lists: Fwd: usergroup-applications admin password? - https://phabricator.wikimedia.org/T124176#1948991 (Dzahn) Open>Resolved I reset the password using ./reset_list_admin_password.sh on fermium and mailed @slaporte the new password.
[18:51:14] (CR) Ottomata: [C: 2] Add subscribe config file for Kafka's Burrow service Bug: T123942 [puppet] - https://gerrit.wikimedia.org/r/265246 (https://phabricator.wikimedia.org/T123942) (owner: Elukey)
[18:53:11] operations, ops-eqiad, Analytics: Possible bad mem chip or slot on dbproxy1004 - https://phabricator.wikimedia.org/T123546#1948996 (madhuvishy) @jcrespo @Cmjohnson: EL can handle downtime - We will just stop the EL mysql consumers, and restart them after maintenance window - and data should get recons...
[18:54:45] !log manually triggering an ubuntu mirror update ("sudo -u mirror /usr/local/sbin/update-ubuntu-mirror" on carbon)
[18:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:55:24] PROBLEM - puppet last run on mw2053 is CRITICAL: CRITICAL: Puppet has 1 failures
[18:58:08] operations, Project-Creators: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#1949020 (Aklapper) > #media-storage (convert the old yellow #Swift tag) Could you clarify the relationship with #Wikimedia-Media-Storage please?
[18:58:39] operations, Project-Creators: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#1949023 (Aklapper) Further current tags that might be Ops-related (I assume they stay-as-is as non-changing stuff has not been explicitly mentioned in the task desc): #Diamond, #Gra...
[19:00:06] marxarelli: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160120T1900). Please do the needful.
[19:00:20] thanks, computer
[19:05:57] (CR) Muehlenhoff: "Can we wait with merging until I'm back on Monday? I'd like to also move the debdeploy master to neodymium along with that." [puppet] - https://gerrit.wikimedia.org/r/265245 (owner: ArielGlenn)
[19:06:32] andrewbogott: just double checking, i should hold back labswiki and labtestwiki when promoting group1 today, right?
[19:06:51] marxarelli: in theory it’s safe to upgrade now. Reedy put in a hack
[19:07:00] but it would be good to coordinate with him during the roll-out...
[19:07:02] (or me, I guess)
[19:07:08] got it
[19:07:41] andrewbogott, Reedy: we can wait to promote it until tomorrow if that would help in coordinating
[19:08:19] yes please, I’d like to get the final word from Reedy before we try
[19:08:25] great. will do
[19:09:28] !log starting promotion of group1, but holding back labswiki and labtestwiki until Jan 21 'all' promotion
[19:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:09:51] operations, Project-Creators: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#1949053 (faidon) >>! In T119944#1949020, @Aklapper wrote: >> #media-storage (convert the old yellow #Swift tag) > > Could you clarify the relationship with #Wikimedia-Media-Storage...
[19:10:33] andrewbogott: It's more than a hack
[19:10:42] It's the proper fix to support latest MW
[19:10:56] (CR) 20after4: [C: 2] Establish `group2` and `exempt` dblists (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/265189 (owner: Dduvall)
[19:10:59] andrewbogott: It should be fine to go, and no way to know for sure unless we do test it
[19:11:10] (CR) 20after4: [C: 1 V: -1] Establish `group2` and `exempt` dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/265189 (owner: Dduvall)
[19:11:10] So please, do upgrade them, to .10 or .11, either is fine
[19:11:19] Reedy: better yet!
[19:11:22] But if it breaks, let me know, and I'll see what needs fixing
[19:11:52] (CR) 20after4: "ugh. accidental clicking." [mediawiki-config] - https://gerrit.wikimedia.org/r/265189 (owner: Dduvall)
[19:12:24] (CR) 20after4: Establish `group2` and `exempt` dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/265189 (owner: Dduvall)
[19:12:56] Reedy: alright then, i'll include it
[19:13:44] "don't call my fix a hack, you hack!"
[19:13:58] !log including labswiki and labtestwiki in group1 promotion after all
[19:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:15:33] twentyafterfour: hmm, deploy-promote is telling me that it will promote everything in group1 from wmf.9 to wmf.10
[19:15:44] operations, Project-Creators: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#1949081 (faidon) >>! In T119944#1949023, @Aklapper wrote: > Further current tags that might be Ops-related (I assume they stay-as-is as non-changing stuff has not been explicitly men...
[19:15:51] but that doesn't seem right, as only labswiki and labtestwiki (a subset of group1) is still on .9
[19:16:21] marxarelli: I think that's a bug that thcipriani already encountered, relating to the way it sorts?
[19:16:30] thcipriani: how did you resolve it?
[19:16:53] yeah, looking at it now. it does a `sort --version-sort | head -1`
[19:17:07] marxarelli: do you have the latest version of mediawiki/tools/release? well hmm --version-sort should work
[19:17:23] let me see...
[19:17:30] twentyafterfour: i do
[19:18:29] i mean, i can just tweak wikiversions for labswiki and labtestwiki to .10, then promote the group to .11, then switch those two back to .10
[19:18:49] you can pass it an explicit version, IIRC
[19:19:31] ah, you're right
[19:19:45] but, yes, currently deploy-promote does make the assumption that we have two wikiversions by default.
[19:20:18] probably ought to be a --help flag on that script :P
[19:20:27] it's doing the wrong thing with head and tail I think
[19:20:54] it should probably prefer the newest not the second version in the list
[19:20:59] so tail before head?
[19:21:11] or just tail -1 instead of head -2 | tail -1
[19:21:24] RECOVERY - puppet last run on mw2053 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[19:21:35] also +1 to a help message. the script should really be python with docopt or something similar
[19:21:36] greg-g: My hack was putting the functions in wikitech.php :P
[19:22:11] no, i think thcipriani is right. it's the assumption about two versions, taking the least value as the oldversion and the second to least as the newversion
[19:22:43] marxarelli: that's what I mean...
[19:22:46] anyway, the explicit argument should work in this case
[19:23:17] yeah but I'm just suggesting that the default should prefer the newest not the second
[19:23:32] I think
[19:23:53] but yeah, I've always just used the explicit version arg in this situation
[19:24:32] the 3 live versions thing is just due to the SMW compat problems right?
[19:24:59] twentyafterfour: i see what you're saying. that would work. it would just have to report all oldversions instead of the oldest
[19:25:40] e.g. "Promote 1.27.0-wmf.9, 1.27.0-wmf.10 to 1.27.0-wmf.11"
[19:25:59] bd808: I think so. yeah it's a slightly unusual situation
[19:26:05] marxarelli: right
[19:26:12] bd808: yep, but Reedy fixed the compat issues so i'm going ahead with promoting labs wikis to .11
[19:26:53] w00t!
[19:27:38] (PS1) Dduvall: group1 wikis to 1.27.0-wmf.11 [mediawiki-config] - https://gerrit.wikimedia.org/r/265307
[19:29:34] thcipriani, twentyafterfour: can you double check the diff, please? ^
[19:29:42] * thcipriani looks
[19:29:49] specifically the symlink changes
[19:30:13] PROBLEM - check_puppetrun on betelgeuse is CRITICAL: CRITICAL: Puppet has 59 failures
[19:31:16] marxarelli: lgtm, that's how it looks doing group1 updates. See the .10 update: https://gerrit.wikimedia.org/r/#/c/263898/
[19:31:41] thcipriani: rad. thanks!
[19:32:03] (CR) Dduvall: [C: 2] group1 wikis to 1.27.0-wmf.11 [mediawiki-config] - https://gerrit.wikimedia.org/r/265307 (owner: Dduvall)
[19:32:14] operations, Project-Creators: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#1949155 (Krenair) >>! In T119944#1949053, @faidon wrote: >>>! In T119944#1949020, @Aklapper wrote: >>> #media-storage (convert the old yellow #Swift tag) >> >> Could you clarify the...
[19:32:34] (Merged) jenkins-bot: group1 wikis to 1.27.0-wmf.11 [mediawiki-config] - https://gerrit.wikimedia.org/r/265307 (owner: Dduvall)
[19:32:53] !log dduvall@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.11
[19:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:33:25] PROBLEM - puppet last run on labvirt1011 is CRITICAL: Connection refused by host
[19:33:34] PROBLEM - Disk space on labvirt1011 is CRITICAL: Connection refused by host
[19:33:45] PROBLEM - salt-minion processes on labvirt1011 is CRITICAL: Connection refused by host
[19:33:54] PROBLEM - kvm ssl cert on labvirt1011 is CRITICAL: Connection refused by host
[19:33:54] PROBLEM - configured eth on labvirt1011 is CRITICAL: Connection refused by host
[19:34:03] PROBLEM - dhclient process on labvirt1011 is CRITICAL: Connection refused by host
[19:34:33] PROBLEM - SSH on labvirt1011 is CRITICAL: Connection refused
[19:34:34] PROBLEM - RAID on labvirt1011 is CRITICAL: Connection refused by host
[19:35:04] PROBLEM - nova-compute process on labvirt1011 is CRITICAL: Connection refused by host
[19:35:23] PROBLEM - DPKG on labvirt1011 is CRITICAL: Connection refused by host
[19:36:05] andrewbogott: I'm ok w/ closing it atm if you are
[19:36:14] great
[19:36:15] by the time it's an issue we'll hopefully be thinking in other ways
[19:38:18] operations, Graphite: Wes Moran not able to log into Graphite - https://phabricator.wikimedia.org/T123796#1949200 (Dzahn) @Eliza @Wwes talked with Wes on IRC, we found out the "shell user name" associated with that user and then saw it had never been added to the "WMF" LDAP group. (should technically be p...
[19:39:00] operations, Graphite: Wes Moran not able to log into Graphite - https://phabricator.wikimedia.org/T123796#1949207 (Dzahn) Open>Resolved a:Dzahn Let us know if there is still an unexpected problem.
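The deploy-promote discussion above (19:15 to 19:26) comes down to how the old and new versions are chosen when three versions are live instead of two. Below is a minimal sketch of the behaviour twentyafterfour and marxarelli converge on, written in Python since that is what was suggested for a rewrite. It is not the real deploy-promote code: `version_key` and `pick_versions` are hypothetical names, and it assumes version strings shaped like `1.27.0-wmf.10`.

```python
import re

def version_key(version):
    """Numeric sort key, e.g. '1.27.0-wmf.10' -> (1, 27, 0, 10)."""
    return tuple(int(n) for n in re.findall(r"\d+", version))

def pick_versions(live_versions):
    """Return (old_versions, new_version) for a promotion.

    The newest live version is the target; every older live version
    is reported as a source, instead of only the oldest one.
    """
    ordered = sorted(set(live_versions), key=version_key)
    if len(ordered) < 2:
        raise ValueError("need at least two live versions to promote")
    return ordered[:-1], ordered[-1]

live = ["1.27.0-wmf.10", "1.27.0-wmf.9", "1.27.0-wmf.11"]
old, new = pick_versions(live)
print("Promote {} to {}".format(", ".join(old), new))
# prints: Promote 1.27.0-wmf.9, 1.27.0-wmf.10 to 1.27.0-wmf.11
```

The numeric key plays the role of `sort --version-sort`: a plain string sort would put wmf.10 and wmf.11 before wmf.9. Passing an explicit version, as was done here, sidesteps the question entirely.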
[19:40:51] (PS2) Dzahn: deactivate wikiepdia.[com|org] [dns] - https://gerrit.wikimedia.org/r/254049
[19:41:17] (CR) Dereckson: [C: -1] "This awaits a logo." [mediawiki-config] - https://gerrit.wikimedia.org/r/254471 (https://phabricator.wikimedia.org/T118491) (owner: Dereckson)
[19:42:05] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[19:42:16] Dereckson: isn't that change uploading the logo?
[19:42:28] eh, modifying
[19:43:32] Ops-Access-Requests, operations, Services, Mobile-Content-Service: Allow mobrovac to restart MobileApps - https://phabricator.wikimedia.org/T123540#1949227 (mobrovac) Cheers! Greatly appreciated!
[19:43:46] operations, OTRS, user-notice: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1949231 (Rjd0060) Rescheduled for February 3 at 0800 UTC.
[19:45:13] RECOVERY - check_puppetrun on betelgeuse is OK: OK: Puppet is currently enabled, last run 253 seconds ago with 0 failures
[19:45:27] (CR) Cmjohnson: [C: 1] admin: add dc-ops to install-server, allow puppet cmds [puppet] - https://gerrit.wikimedia.org/r/264994 (https://phabricator.wikimedia.org/T123681) (owner: Dzahn)
[19:47:51] (PS1) Chad: PacketLossLogtailer.py: pep8 fixes, mostly line too long [puppet] - https://gerrit.wikimedia.org/r/265315
[19:48:21] mutante: yes, but there is a glitch with this logo, and a request has been added to Wikimedia Commons graphics labs for a better one.
[19:48:33] Dereckson: ah, understand. thank you
[19:52:37] andrewbogott: https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request is still 500-ing :/
[19:53:33] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/).
[19:53:54] Reedy: so probably needs a rollback :(
[19:54:05] I’m in the midst of a security thing today so won’t have much time to pay attention to this
[19:57:47] andrewbogott: should i go ahead and roll it back?
[19:58:32] operations, ops-eqiad, Analytics: Possible bad mem chip or slot on dbproxy1004 - https://phabricator.wikimedia.org/T123546#1949298 (Ottomata) Since we are about to have an EL downtime anyway, can we fit this in as well?
[19:58:40] Reedy: ^ ?
[19:59:53] marxarelli: I'd prefer we get it fixed up
[20:00:00] If you give me a few minutes, I can dedicate some time to it
[20:00:08] Just need to find out what the errors are, and fix them
[20:00:33] "Just need to find out what the errors are, and fix them"
[20:01:01] i fixed all the functions that FlorianSW and I removed
[20:01:07] It could be something else, similar
[20:01:47] Where are the error logs on silver?
[20:01:57] Reedy: alrighty, no problem. fyi, i'll be around to help with rollbacks or patches for another hour
[20:02:10] ottomata, I was proposing that, but there are too many people involved there, just ping me when you want to proceed. Chris is the boss for anything related to hardware.
[20:02:35] Reedy: it logs centrally as far as I know
[20:03:05] jynus: you wanted to do this over the weekend, right?
[20:03:14] yes
[20:03:37] starting when, friday?
[20:03:53] but Chris has probably other priorities, I do not think the memory thing is very urgent
[20:04:16] ja, but it'll be nice to not have to do another EL downtime for it
[20:04:17] hm
[20:04:24] but, that mem downtime should be short compared to this.
[20:04:25] hm.
[20:04:26] starting tomorrow, initially
[20:04:40] ottomata: I only need 5 mins to swap the DIMM...want to do it tomorrow?
[20:04:46] there doesn't need to be downtime for the proxy, I can change the dns
[20:04:54] oo, to another proxy?
[20:05:23] just repoint the dns, or setup one of the spare proxies
[20:05:40] hm, ok. well, 5 mins isn't a big deal really.
[20:05:52] ok ok, i'll take on the mem thing and make it happen
[20:06:02] cmjohnson1: can we do it in our morning tomorrow?
[20:06:04] 9:30 say?
[20:06:16] ottomata let's make it 10
[20:06:19] if that's okay
[20:06:30] 11 then
[20:06:33] i got a meeting at 10
[20:06:36] ja?
[20:06:39] okay
[20:06:42] jynus: i forget, what's your TZ?
[20:07:25] madhuvishy: nuria, have we notified folks about the EL downtime?
[20:07:34] do they know they won't see any updates over the weekend?
[20:07:46] ottomata: i'm missing something
[20:07:53] Exception: Permission denied: user=root, access=READ, :p
[20:07:55] why do we need EL downtime?
[20:09:05] there are 2 things here
[20:09:14] 1. dbproxy reboot for mem chip switch
[20:09:18] that is fast, no big deal
[20:09:22] i'll take care of that with chris tomorrow
[20:09:33] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: puppet fail
[20:09:34] the other is the big tokudb conversion
[20:09:42] ottomata: aah
[20:09:44] okay
[20:09:45] which jynus and nuria want to do over this weekend
[20:09:51] for that, we need to stop mysql writes
[20:09:56] right
[20:10:11] so for both, we'll kill only mysql consumers right?
[20:10:19] people won't see data, but we got it?
[20:10:24] that is ~41 hours, give it some margin for extra unexpected time
[20:10:43] <_joe_> getting rid of tokudb or converting something to tokudb? :)
[20:10:46] but of course, doesn't need to be all at once
[20:10:59] _joe_, converting all to tokudb, isn't that great?
[20:12:53] jynus: if everything works as planned, 41 hours should be ok. it means that inserts will be large and frequent for a long time when we turn them back on
[20:13:05] as they catch back up
[20:13:24] but madhuvishy, correct
[20:13:35] okay
[20:13:40] they will just not see their data in mysql until downtime is over and consumers have time to insert everything
[20:14:00] i can send notification to list - should we send for both together?
[20:14:09] i wouldn't bother with the first
[20:14:11] the 5 minutes thing i doubt people will even notice
[20:14:12] folks won't notice
[20:14:13] operations, Services: Service Ownership and Maintenance - https://phabricator.wikimedia.org/T122825#1949380 (mobrovac)
[20:14:14] ya
[20:14:22] especially if the replag is sometimes 3ish hours :p
[20:14:22] operations, Services: Service Ownership and Maintenance - https://phabricator.wikimedia.org/T122825#1949382 (mobrovac) p:Triage>Normal
[20:14:26] ha ha
[20:14:28] right
[20:14:39] ok, madhuvishy nuria, jynus, let's do this
[20:14:47] the second one is over the weekend?
[20:14:53] ja, but starting tomorrow sometime
[20:14:55] jynus: , when?
[20:15:05] let's do after chris and I do the mem chip thing
[20:15:27] hm we can just say starting 11 EST tomorrow
[20:15:34] until monday at the latest
[20:15:46] okay
[20:15:49] i'll send email
[20:15:49] although, who knows, maybe we should ask?
[20:15:50] eef
[20:15:55] quarterly reviews friday
[20:15:59] maybe folks need stats?
[20:16:13] (CR) John Vandenberg: [C: 1] Add .gitreview [puppet/kafkatee] - https://gerrit.wikimedia.org/r/264010 (owner: Hashar)
[20:16:17] (CR) John Vandenberg: [C: 1] Add .gitreview [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/265250 (owner: Hashar)
[20:16:31] ottomata: hmmm
[20:16:40] ottomata: most slides are due today
[20:16:44] so i doubt it
[20:16:47] ok
[20:16:48] then let's just do it
[20:16:53] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:16:55] if someone yells we can reschedule
[20:16:58] but let's plan on it
[20:16:59] ya okay
[20:17:03] ottomata: also
[20:17:14] (CR) John Vandenberg: [C: 1] Introduce tox as a test entry point [puppet/kafkatee] - https://gerrit.wikimedia.org/r/264011 (owner: Hashar)
[20:17:28] how do you kill all consumers at once?
[20:17:42] just grep mysql | kill ..?
[20:17:45] (CR) John Vandenberg: [C: 1] Introduce tox as a test entry point [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/265251 (owner: Hashar)
[20:18:16] If your worry is killing mysql processes, do not worry, I can do that :-)
[20:18:39] operations, ops-eqiad, Analytics: Possible bad mem chip or slot on dbproxy1004 - https://phabricator.wikimedia.org/T123546#1949407 (Ottomata) @cmjohnson and I will do this Jan 21 16:00 UTC (11am EST). Should be a very short and unnoticeable downtime.
[20:18:53] hehe, naw, madhuvishy
[20:18:56] jynus: no - the kafka consumers
[20:19:06] ok ok
[20:19:29] I think mforns had problems the last time, you should ask him
[20:19:29] ottomata: doesn't puppet bring them back up if you kill that way? or do you have to merge puppet change commenting them out?
[20:19:32] i have a little script
[20:19:35] on eventlog1001
[20:19:38] ~otto/el-service
[20:19:43] ah
[20:19:47] but
[20:19:50] you can't do just all mysql consumers
[20:19:53] you have to do each one manually
[20:19:54] so
[20:21:31] madhuvishy:
[20:21:34] for f in /etc/eventlogging.d/consumers/mysql-m4-master*; do name=$(basename $f); ~otto/el-service consumer $name status; done
[20:22:43] ottomata: hmmm
[20:22:48] andrewbogott: can you give me access to the apache error log on silver please?
[20:23:17] madhuvishy: i just made that script because i could never remember the upstart invocation
[20:23:27] ottomata: aaah
[20:23:39] Reedy: what group?
[20:23:52] wikidev or deployment?
[20:24:05] or read to all, for temporary is ok
[20:24:09] Reedy: how’s that?
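ottomata's one-liner above walks every `mysql-m4-master*` file under the consumers directory and asks `~otto/el-service` for each consumer's status. The same loop could be sketched in Python roughly as follows. This only builds the command lines and does not run anything; `consumer_commands` is a hypothetical helper, the file names in the demo are made up, and whether `el-service` accepts actions other than `status` is an assumption not confirmed in the log.

```python
import tempfile
from pathlib import Path

def consumer_commands(config_dir, action="status",
                      prefix="mysql-m4-master",
                      service_script="~otto/el-service"):
    """Build one el-service command per matching consumer config file."""
    commands = []
    for path in sorted(Path(config_dir).glob(prefix + "*")):
        # Same argument order as the shell loop: consumer <name> <action>
        commands.append([service_script, "consumer", path.name, action])
    return commands

# Demo against a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("mysql-m4-master-00", "mysql-m4-master-01", "other-consumer"):
        (Path(demo_dir) / name).touch()
    for command in consumer_commands(demo_dir):
        print(" ".join(command))
```

Returning the commands instead of executing them keeps the sketch safe to run anywhere; on the real host the directory would be `/etc/eventlogging.d/consumers`, as in the shell loop.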
[20:24:45] jesus christ that file is noisy
[20:24:46] :D
[20:25:03] [Wed Jan 20 20:24:36.301415 2016] [:error] [pid 6048] [client 2620:0:861:118:2a80:23ff:fe9a:d4cc:39845] PHP Notice: Array to string conversion in /srv/mediawiki/php-1.27.0-wmf.11/extensions/SemanticMediaWiki/includes/storage/SQLStore/SMW_SQLStore3_Writers.php on line 384
[20:25:16] [Wed Jan 20 20:24:32.472442 2016] [:error] [pid 5556] [client 66.249.64.76:57354] PHP Warning: call_user_func_array() expects parameter 1 to be a valid callback, class 'Linker' does not have a method 'makeLinkObj' in /srv/mediawiki/php-1.27.0-wmf.11/includes/Linker.php on line 2367
[20:25:50] aha, got it
[20:25:51] [Wed Jan 20 20:25:40.248032 2016] [:error] [pid 6344] [client 2001:8b0:fafa:8023:1d0e:a9f8:28a0:38de:55845] PHP Fatal error: Call to a member function formHTML() on a non-object in /srv/mediawiki/php-1.27.0-wmf.11/extensions/SemanticForms/includes/SF_AutoeditAPI.php on line 933
[20:25:59] jynus: so, we are ok to start tomorrow
[20:26:05] what time is ok for you?
[20:26:14] (CR) John Vandenberg: [C: 1] Pass flake8 and add it to tox envlist (1 comment) [puppet/kafkatee] - https://gerrit.wikimedia.org/r/264012 (owner: Hashar)
[20:26:21] ottomata: what you wrote will display the status for all of them right?
[20:26:23] any time is ok, I just need to copy and paste to the console
[20:26:30] ok
[20:26:36] just after 16:00 ok?
[20:26:40] UTC
[20:26:41] (PS2) Chad: Update debian package for gerrit [debs/gerrit] - https://gerrit.wikimedia.org/r/263631
[20:26:41] ?
[20:26:43] ok
[20:26:48] when cmjohnson1 and I are done with the dbproxy thing?
[20:26:49] andrewbogott: fucking lol. That's an error on master
[20:27:44] operations, Traffic, Zero, Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1949446 (BBlack) The traffic move from mobile->text is now on hold (we did convert codfw, then we rolled back) due to purge-related issues that need to be addressed first, in b...
[20:27:56] jynus: ottomata so you wouldn't be able to get to the data on m4-master and analytics-store (old too?). Correct?
[20:28:58] (CR) John Vandenberg: [C: 1] "this looks very similar to I78ab5562d ; is there no way to avoid code duplication?" [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/265252 (owner: Hashar)
[20:29:04] operations, Wikipedia-iOS-App-Product-Backlog: Provide access to iOS team for piwik production server - https://phabricator.wikimedia.org/T124218#1949461 (Fjalapeno) NEW
[20:29:33] PROBLEM - puppet last run on db2034 is CRITICAL: CRITICAL: puppet fail
[20:29:42] madhuvishy: correct
[20:29:47] its m4-master we are doing maintenance on
[20:29:50] yeah
[20:29:53] (CR) Chad: "Gerrit 2.12-23-g5d0c4bf is based on upstream's stable-2.12 branch (instead of the released 2.12)." [debs/gerrit] - https://gerrit.wikimedia.org/r/263631 (owner: Chad)
[20:29:54] analytics-store is the slave
[20:30:00] okay. so you can't query existing data too
[20:30:01] wait
[20:30:01] i think
[20:30:05] jynus: is that correct?
[20:30:06] ^^^
[20:30:09] oh yes
[20:30:10] please note that m4-master dns points to dbproxy1004
[20:30:13] (yes
[20:30:14] )
[20:30:20] okay
[20:30:21] the proxy redirects mysql traffic to db1046
[20:30:25] (PS1) EBernhardson: Change CirrusSearch sharding values for codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/265372 (https://phabricator.wikimedia.org/T124215)
[20:30:28] jynus: the tokudb conversion is on m4-master?
[20:30:29] right?
[20:30:32] not on analytics-store?
[20:30:35] or both?
[20:30:38] on db1046 [20:30:40] ok [20:30:40] the master [20:30:45] so madhuvishy yes [20:30:50] they will be able to query existing data [20:30:52] so it is not confused with the proxy (that has no data) [20:30:52] on analytics-store [20:30:56] just won't get any updates [20:30:56] yep [20:31:01] ah okay [20:31:07] reads will not be affected [20:31:13] will we also migrate analytics-store at some point? [20:31:15] (CR) jenkins-bot: [V: -1] Change CirrusSearch sharding values for codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/265372 (https://phabricator.wikimedia.org/T124215) (owner: EBernhardson) [20:31:39] actually, most of analytics store is already migrated, because I can pause replication at any time [20:31:49] Blocked-on-Operations, operations, Wikipedia-iOS-App-Product-Backlog: Provide access to iOS team for piwik production server - https://phabricator.wikimedia.org/T124218#1949481 (Fjalapeno) [20:31:51] jynus: aah interesting [20:31:52] okay [20:32:23] problem is that some tables were so big that updates made the online conversion impossible [20:32:54] usually downtime is not needed, and in theory it will not be needed anymore, as toku allows easier online changes [20:33:14] so this is a 1 time thing (in theory) [20:33:45] in future cases, we will be able to use the proxy to failover (semi)automatically [20:34:43] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 26 failures [20:34:50] thanks jynus that makes sense [20:36:04] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [20:37:09] !log s#/dev/md1#/dev/mapper/tank-data# [20:37:14] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:37:28] there is something else going on, and this time it is not the schema change
[20:37:31] !log s#/dev/md1#/dev/mapper/tank-data# on labvirt1010, reverted by puppet with Notice: /Stage[main]/Role::Labs::Openstack::Nova::Compute/Mount[/var/lib/nova/instances]/device: device changed '/dev/mapper/tank-data' to '/dev/md1' [20:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:38:28] 503 in commons [20:38:53] PROBLEM - Host labvirt1011 is DOWN: PING CRITICAL - Packet loss = 100% [20:39:31] (03PS1) 10Chad: toolschecker.py: 1 minor pep8 fix [puppet] - 10https://gerrit.wikimedia.org/r/265373 [20:40:43] RECOVERY - SSH on labvirt1011 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0) [20:40:44] RECOVERY - RAID on labvirt1011 is OK: OK: no RAID installed [20:40:54] RECOVERY - Host labvirt1011 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [20:41:13] RECOVERY - nova-compute process on labvirt1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [20:41:24] RECOVERY - DPKG on labvirt1011 is OK: All packages OK [20:41:45] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [20:41:54] (03PS7) 10Mobrovac: Add the visualdiff module + instantiate visualdiffing services [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) (owner: 10Subramanya Sastry) [20:41:54] RECOVERY - Disk space on labvirt1011 is OK: DISK OK [20:42:03] akosiaris: you're playing on 1011 right, not 1010? :) [20:42:11] or are these alerts from elsewhere? 
[20:42:13] RECOVERY - salt-minion processes on labvirt1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:42:14] RECOVERY - kvm ssl cert on labvirt1011 is OK: Cert /etc/ssl/localcerts/labvirt-star.eqiad.wmnet.crt will not expire for at least 90 days [20:42:14] RECOVERY - dhclient process on labvirt1011 is OK: PROCS OK: 0 processes with command name dhclient [20:42:34] (03PS1) 10Chad: redis_monitoring.py: easy pep8 fixes [puppet] - 10https://gerrit.wikimedia.org/r/265374 [20:43:01] MaxSem: ping re https://gerrit.wikimedia.org/r/#/c/265316 ( https://phabricator.wikimedia.org/T124165 ) [20:43:07] 6operations, 6Project-Creators: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#1949543 (10Tgr) >>! In T119944#1949053, @faidon wrote: > I wasn't even aware of [Wikimedia-Media-storage]… From the description it does sound like the same thing, but looking at the ta... [20:43:21] or really anyone that knows about MobileFrontend and/or MW content-purging in general [20:43:26] my php-fu is weak too heh [20:43:34] YuviPanda: both actually [20:43:44] and these recoveries are fine [20:43:54] ah [20:43:56] ok [20:43:58] (03CR) 10Mobrovac: "I added modules/testreduce/files/parsoid-rt-client.systemd.service to give you a taste of how systemd service files are written. 
I've left" [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) (owner: 10Subramanya Sastry) [20:44:14] I 've tested the change on labvirt1010 first cause I had a nagging feeling [20:44:18] turns out I was correct [20:44:19] I am not seeing it, but something happened since 19:32 UTC [20:45:30] is eqiad text only [20:46:20] no, all text [20:47:25] gotcha: 19:32 logmsgbot: dduvall@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.11 [20:48:43] PROBLEM - Apache HTTP on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:49:13] PROBLEM - HHVM rendering on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:49:24] PROBLEM - DPKG on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:49:24] 6operations, 6Project-Creators: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#1949563 (10BBlack) >>! In T119944#1949081, @faidon wrote: > I doubt that #Varnish is of any usefulness and that the project tag #Traffic is enough. Maybe @BBlack has a differing opinio... [20:49:45] PROBLEM - HHVM processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:50:14] PROBLEM - dhclient process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:50:24] PROBLEM - salt-minion processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:50:24] PROBLEM - configured eth on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:50:33] PROBLEM - nutcracker port on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:50:34] PROBLEM - SSH on mw1135 is CRITICAL: Server answer [20:50:53] PROBLEM - RAID on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:50:54] PROBLEM - Check size of conntrack table on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:51:13] PROBLEM - nutcracker process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:51:24] PROBLEM - Disk space on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:54:12] akosiaris: Who handles security incidents these days? https://phabricator.wikimedia.org/T124224 [20:54:47] multichill: security team [20:55:03] RECOVERY - puppet last run on db2034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:55:08] I see them tagged, I suppose csteipp will pick it up [20:55:15] actually already did from what I see [20:55:53] Looks like csteip already got it [20:56:26] csteipp: If you want me to do some debug stuff, I'm still the other user :-) [20:57:55] multichill: Yeah, I'm looking into it... Did you login as yourself, and then edits showed up as the other person? Or have you logged in recently? [20:58:48] I was looking at my edits. https://www.wikidata.org/w/index.php?title=Q22116951&diff=prev&oldid=293519101 is my last widar edit. That was last night before I closed my laptop (suspend). [20:59:15] Now I started to do some edits and wondered why it was so slow and not showing up under my account. So I found this other account with my edits [20:59:40] multichill: What's an easy way to use Widar to make an edit. Seeing if I can reproduce, but never used those tools. [20:59:52] csteipp: multichill I've heard similar reports from Quarry users btw. [20:59:57] from a while ago [21:00:04] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160120T2100). [21:00:05] I chalked it up to redis session corruption on my end [21:00:12] and cleared redis and it went away [21:00:22] no mobileapps deployment today [21:00:30] andrewbogott: We have a fix [21:00:34] parsoid deploy waiting on some patches to merge .. 
will be another 10-15 mins before i am ready. [21:00:39] csteipp: https://tools.wmflabs.org/autolist/index.php?language=en&project=wikipedia&category=&depth=12&wdq=CLAIM%5B170%3A381248%5D%20AND%20CLAIM%5B195%3A632682%5D%20AND%20NOCLAIM%5B31%5D&pagepile=&statementlist=P31%3AQ3305213&run=Run&mode_manual=or&mode_cat=or&mode_wdq=not&mode_find=or&chunk_size=10000 [21:00:58] It asks you to login to Widar. You can add "P31:Q3305213". That was the thing I was doing [21:00:58] Reedy: great [21:01:10] YuviPanda: Is that a Widar tool too? [21:01:19] csteipp: noope [21:01:22] andrewbogott: Not MW cores fault [21:01:26] csteipp: it uses python-mwoauth [21:01:32] and redis backed session storage [21:02:12] bblack, pong. I can finish this change. basically, you need to purge $1.m.wikipedia.org/wiki/$2 for every $1.wikipedia.org/wiki/$2, right? [21:05:03] RECOVERY - dhclient process on mw1135 is OK: PROCS OK: 0 processes with command name dhclient [21:05:13] RECOVERY - salt-minion processes on mw1135 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:05:14] RECOVERY - configured eth on mw1135 is OK: OK - interfaces up [21:05:23] RECOVERY - nutcracker port on mw1135 is OK: TCP OK - 0.000 second response time on port 11212 [21:05:24] RECOVERY - SSH on mw1135 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0) [21:05:28] csteipp: I found the widar cookie [21:05:39] I wonder if someone else uses that if they use the other account too [21:05:44] RECOVERY - Check size of conntrack table on mw1135 is OK: OK: nf_conntrack is 0 % full [21:05:44] RECOVERY - RAID on mw1135 is OK: OK: no RAID installed [21:06:03] RECOVERY - nutcracker process on mw1135 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [21:06:23] RECOVERY - Disk space on mw1135 is OK: DISK OK [21:06:23] RECOVERY - DPKG on mw1135 is OK: All packages OK [21:06:43] RECOVERY - HHVM processes on mw1135 is OK: PROCS OK: 6 processes with command name hhvm 
[21:06:51] (03PS1) 10Andrew Bogott: Specify role::labs::openstack::nova::compute::instance_dev correctly for labvirt1010 anad 1011 [puppet] - 10https://gerrit.wikimedia.org/r/265383 [21:07:43] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.044 second response time [21:08:13] RECOVERY - HHVM rendering on mw1135 is OK: HTTP OK: HTTP/1.1 200 OK - 65518 bytes in 0.777 second response time [21:08:41] !log reedy@tin Synchronized php-1.27.0-wmf.11/extensions/SemanticForms/: Fix fatal on wikitech (duration: 00m 36s) [21:08:43] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [21:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:10:13] (03CR) 10Alexandros Kosiaris: [C: 031] Specify role::labs::openstack::nova::compute::instance_dev correctly for labvirt1010 anad 1011 [puppet] - 10https://gerrit.wikimedia.org/r/265383 (owner: 10Andrew Bogott) [21:10:14] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [21:10:34] 6operations, 6Parsing-Team, 10Parsoid, 6Services, 5Patch-For-Review: Update ruthenium to Debian jessie from Ubuntu 12.04 - https://phabricator.wikimedia.org/T122328#1949699 (10Dzahn) We have copied about 190G of the 340G .. ongoing ... 
[21:10:52] (03CR) 10Andrew Bogott: [C: 032] Specify role::labs::openstack::nova::compute::instance_dev correctly for labvirt1010 anad 1011 [puppet] - 10https://gerrit.wikimedia.org/r/265383 (owner: 10Andrew Bogott) [21:11:31] (03CR) 10Krinkle: [WIP] Implement /w/static.php (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [21:11:44] (03PS2) 10Krinkle: [WIP] Implement /w/static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) [21:11:52] 6operations, 10ops-codfw: ms-be2015.codfw.wmnet: slot=8 dev=sdi failed - https://phabricator.wikimedia.org/T124056#1949701 (10Papaul) Here is the Dispatch # for the 3Tb SAS HDD: 316136522 ETA 1/21/16 via FedEx. I will have the drive on site tomorrow [21:12:22] csteipp: Can you join #wikidata ? [21:14:22] !log rebooting labvirt1011 [21:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:14:33] there may be some kind of session issue with .11: https://logstash.wikimedia.org/#dashboard/temp/AVJg4zreptxhN1XaPbEd [21:14:58] shouldn't we revert? [21:15:34] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [21:16:47] possibly. this is a troubling increase in session errors, strongly correlated with the .11 deploy. 
[21:16:53] PROBLEM - Host labvirt1011 is DOWN: PING CRITICAL - Packet loss = 100% [21:17:19] thcipriani: your call [21:17:34] RECOVERY - Host labvirt1011 is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [21:17:43] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:18:50] yeah, looks like it started with https://tools.wmflabs.org/sal/log/AVJcKmyV3CEqUS3YC1BC [21:19:02] 23:13 dduvall@tin Finished scap: testwiki to php-1.27.0-wmf.11 and rebuild l10n cache (duration: 72m 03s) [21:19:23] thcipriani: revert group1, leave it on testwikis to debug [21:19:42] I just said it was your call, but that's what I would do [21:19:42] greg-g: kk, doing. [21:20:25] (PS2) EBernhardson: Change CirrusSearch sharding values for codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/265372 (https://phabricator.wikimedia.org/T124215) [21:20:27] now, who can we get help from to debug [21:21:07] is wikidata included in group1? [21:21:30] it is [21:21:38] what's up? [21:21:58] I didn't watch the deploy closely today as there's no new Wikidata code [21:22:20] but wikidata got new mediawiki-generic code, right? [21:22:26] greg-g: anomie and tgr? [21:22:35] jynus: yes [21:22:36] jynus: I'm just catching up, is there a task for this yet? [21:22:38] "Can neither load the session nor create an empty session" that? [21:22:52] MaxSem: ? [21:23:03] anomie: tgr we have a session issue that coincided with wmf.11 rolling out [21:23:15] (CR) Alexandros Kosiaris: admin: add dc-ops to install-server, allow puppet cmds (1 comment) [puppet] - https://gerrit.wikimedia.org/r/264994 (https://phabricator.wikimedia.org/T123681) (owner: Dzahn) [21:23:17] greg-g: two, at least [21:23:27] https://phabricator.wikimedia.org/T124143 [21:23:29] greg-g, I do not care so much about the 503 [21:23:42] https://phabricator.wikimedia.org/T124126 [21:23:48] but about the session issue [21:23:54] greg-g: Is it T124126 or T124143?
I have patches up for both of them. [21:24:01] I'm just about to merge anomie's fix for the first one [21:24:07] Ops-Access-Requests, operations, Patch-For-Review: add datacenter-ops to dhcp /install-server and allow to run puppet commands - https://phabricator.wikimedia.org/T123681#1949734 (akosiaris) Added a comment about 'puppet * ' allowing puppet apply [21:24:10] (PS1) Thcipriani: group1 wikis to 1.27.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/265387 [21:24:22] the second one does not actually break stuff from what I can see [21:24:37] sweet, thanks anomie [21:24:43] and tgr [21:25:13] ah, so...no rollback? [21:25:31] I'd prefer rollback, merge fixes, test on testwikis, unrollback [21:25:45] is that the session-related stuff? [21:25:45] +1 [21:25:46] can i start parsoid deploy now? [21:26:09] thcipriani: rollback [21:26:15] doing [21:26:23] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/265387 (owner: Thcipriani) [21:26:32] ^ "SWAT": force of habit [21:26:33] basically, I am only worried about T124224 (which I do not know if it is included in the upgrade) [21:26:51] ^greg-g [21:26:52] subbu: can it wait 15 minutes? [21:26:56] greg-g, can i start the parsoid deploy .. different cluster .. should be fine i assume .. but let me know if you want me to wait. [21:26:58] sure. :) [21:29:25] zuul not picking up my +2...
[21:29:34] muther [21:29:43] (03PS4) 10Dzahn: admin: add dc-ops to install-server, allow puppet agent -t -v [puppet] - 10https://gerrit.wikimedia.org/r/264994 (https://phabricator.wikimedia.org/T123681) [21:29:53] (03Merged) 10jenkins-bot: group1 wikis to 1.27.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265387 (owner: 10Thcipriani) [21:29:54] thcipriani: might be the l10n bot overloading it [21:29:56] there it goes [21:30:25] hmm [21:30:44] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.10 [21:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:30:52] guess some other changes where around [21:31:17] (03CR) 10Dzahn: admin: add dc-ops to install-server, allow puppet agent -t -v (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/264994 (https://phabricator.wikimedia.org/T123681) (owner: 10Dzahn) [21:32:07] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: add datacenter-ops to dhcp /install-server and allow to run puppet commands - https://phabricator.wikimedia.org/T123681#1949780 (10Dzahn) Amended to just allow 'puppet agent -t -v' specifically and not other puppet commands in response to Alex' comment. [21:32:09] !log reverted group1 wikis to 1.27.0-wmf.10 due to session errors. [21:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:33:58] (03CR) 10Dzahn: "hive stats:" [dns] - 10https://gerrit.wikimedia.org/r/254049 (owner: 10Dzahn) [21:34:20] subbu: I've finished up, FYI, if you want to do parsoid deploy [21:34:28] jynus: do you know how many events per second we can insert into mysql after the conversion - just a rough number [21:34:44] thcipriani, thanks. will do. [21:35:13] (03CR) 10Dzahn: "eh, wrong paste. 
just 3 hits in an hour, and 104 in a day" [dns] - 10https://gerrit.wikimedia.org/r/254049 (owner: 10Dzahn) [21:35:53] !log starting parsoid deploy [21:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:36:55] ok, anomie, I'll assume you're on point (or tgr, depending on your evening schedule) to merge/backport/test [21:37:34] greg-g: My evening schedule is "go away in 23 minutes or when I fix that one bug, whichever comes second" [21:38:43] thank you [21:40:08] 6operations, 7domains: traffic stats for typo domains - https://phabricator.wikimedia.org/T124237#1949803 (10Dzahn) 3NEW a:3Dzahn [21:41:23] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:41:25] !log synced new parsoid code; restarted parsoid on wtp1001 as a canary [21:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:41:44] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:42:05] (03PS1) 10Catrope: Enable Echo cross-wiki tracking table on all(*) wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265395 (https://phabricator.wikimedia.org/T124232) [21:44:03] PROBLEM - puppet last run on mw2194 is CRITICAL: CRITICAL: puppet fail [21:45:23] PROBLEM - Host labvirt1011 is DOWN: PING CRITICAL - Packet loss = 100% [21:45:44] RECOVERY - Host labvirt1011 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [21:48:03] !log finished deploying parsoid sha f1ddfb88 [21:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:50:24] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). 
[21:50:42] !log reedy@tin Synchronized php-1.27.0-wmf.11/extensions/SemanticMediaWiki: Fix wikitech log noise (duration: 00m 34s) [21:50:43] PROBLEM - puppet last run on mw1131 is CRITICAL: CRITICAL: Puppet has 1 failures [21:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:51:22] PROBLEM - MariaDB Slave Lag: s5 on db1026 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 18834 [21:51:24] !log reedy@tin Synchronized php-1.27.0-wmf.11/extensions/SemanticResultFormats: Fix wikitech log noise (duration: 00m 31s) [21:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:52:40] (03CR) 10Hoo man: [C: 032] "Changes look sane, output is unchanged." [dumps/dcat] - 10https://gerrit.wikimedia.org/r/263169 (owner: 10Lokal Profil) [21:52:43] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [21:52:45] (03CR) 10Hoo man: [V: 032] Reduce the number of parameters passed to functions [dumps/dcat] - 10https://gerrit.wikimedia.org/r/263169 (owner: 10Lokal Profil) [21:57:16] (03CR) 10Luke081515: [C: 031] Enable Echo cross-wiki tracking table on all(*) wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265395 (https://phabricator.wikimedia.org/T124232) (owner: 10Catrope) [22:02:27] 6operations, 7domains: traffic stats for typo domains - https://phabricator.wikimedia.org/T124237#1949879 (10Dzahn) | domain | period | date | hits | | wikiepdia.com | 1d | 2016-01-19 | 71 | | wikiepdia.org | 1d | 2016-01-19 | 104| | wikimediacommons.co.uk | 1d | 2016-01-19 | | | wikimediacommons.eu | 1d | 201... 
[22:04:43] !log anomie@tin Synchronized php-1.27.0-wmf.2/extensions/OAuth: Deploy fix for T124224 (duration: 00m 34s) [22:04:43] PROBLEM - Host labvirt1011 is DOWN: PING CRITICAL - Packet loss = 100% [22:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:05:44] RECOVERY - Host labvirt1011 is UP: PING OK - Packet loss = 0%, RTA = 1.45 ms [22:05:51] Hmm. It would help if I deployed the right directory... [22:06:12] !log anomie@tin Synchronized php-1.27.0-wmf.11/extensions/OAuth: Deploy fix for T124224 (duration: 00m 32s) [22:06:14] heh [22:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:09:18] wmf.2? :) [22:09:31] Last time I did a SWAT, apparently. [22:09:58] * anomie ^R-ed, then forgot to update the version [22:11:44] RECOVERY - puppet last run on mw2194 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:12:25] (03CR) 10Alex Monk: "Why can't wmgEchoUseCrossWikiTrackingTable be replaced with a check for wmgUseCentralAuth?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265395 (https://phabricator.wikimedia.org/T124232) (owner: 10Catrope) [22:12:55] (03CR) 10Catrope: "Good idea." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265395 (https://phabricator.wikimedia.org/T124232) (owner: 10Catrope) [22:16:28] (03PS1) 10Andrew Bogott: Build jessie image with the latest 3.19 kernel build. [puppet] - 10https://gerrit.wikimedia.org/r/265405 [22:16:34] (03PS2) 10Catrope: Enable Echo cross-wiki tracking table on all wikis with CentralAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265395 (https://phabricator.wikimedia.org/T124232) [22:18:24] RECOVERY - puppet last run on mw1131 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:26:34] (03PS2) 10Andrew Bogott: Build jessie image with the latest 3.19 kernel build. 
[puppet] - 10https://gerrit.wikimedia.org/r/265405 [22:28:00] looks like we're just waiting on some parser tests and then the backports/fixes for the session issues will be ready to go [22:28:21] I have 2 back-to-back 1:1s starting in 2 minutes, so, jfdi :) [22:28:45] greg-g: ack [22:34:34] Note to self, deploy when greg-g is busy [22:37:13] (03PS1) 10Catrope: Enable cross-wiki notifications beta feature on first wave of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265413 (https://phabricator.wikimedia.org/T124234) [22:37:35] (03CR) 10Catrope: [C: 04-2] "Not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265413 (https://phabricator.wikimedia.org/T124234) (owner: 10Catrope) [22:38:54] !log tgr@tin Synchronized php-1.27.0-wmf.11/includes/: T124143,T124126 (duration: 00m 36s) [22:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:43:49] bd808: the stack overflow errors are visible in tin's fatalmonitor but not in the logstash fatalmonitor [22:44:14] hmmm.. that's unexpected [22:45:29] tgr: are they old maybe? [22:45:44] ~5min [22:46:16] but it was the same yesterday too, when they were piling up on group0, I just didn't realize it [22:46:44] our hhvm logging pipeline to logstash is broken somehow [22:48:25] !log HHVM log messages not being recorded in Logstash; bd808 to investigate [22:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:53:34] bd808, greg-g: anomie's patches are live on group0, the errors are gone, no obvious breakage [22:54:02] I'm off to grab some food, back in 10m [22:54:05] !log no HHVM log events in logstash since 2015-12-31T23:59:44.000Z [22:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:54:17] * bd808 grumbles [22:54:38] from https://grafana.wikimedia.org/dashboard/db/authentication-metrics it seems something in login API broke as well [22:55:21] the need_token spike? 
[22:55:34] that would probably be session related right? [22:55:37] not sure if that's the same issue, the stack overflow thing had too long traces so the part that would have explained where it got triggered got truncated [22:56:12] 6operations, 6Project-Creators: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#1950162 (10matmarex) >>! In T119944#1949053, @faidon wrote: >>>! In T119944#1949020, @Aklapper wrote: >>> #media-storage (convert the old yellow #Swift tag) >> >> Could you clarify th... [23:01:55] WTF? Some portion of the logstash cluster thinks it's January 2015 and not January 2016 [23:04:34] http://e.lvme.me/swtmvsx.jpg [23:04:38] !log Logstash1001 went nuts and decided that instead of 2016 it would go back to the start of 2015 after 2015-12-31T23:59 [23:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:05:43] it was a very good year [23:06:28] !log Restarted logstash on logstash1001 [23:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:08:51] bd808: leap second :P [23:08:54] ? [23:19:02] tgr: sweet, thanks, I think marxarelli will push us forward once we're done with our 1:1 [23:21:54] (03PS8) 10Subramanya Sastry: Add the visualdiff module + instantiate visualdiffing services [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) [23:22:48] (03CR) 10Subramanya Sastry: "I've left behind the upstart files for ease of reviewing. Will remove them after review and everything looks fine." [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) (owner: 10Subramanya Sastry) [23:26:16] {{done with our 1:1}} [23:27:42] tgr: so group0 is looking good with the backport in wmf.11? 
[23:28:16] greg-g, marxarelli: two more issues, not sure if they are serious [23:28:37] there was a big spike in need_token errors in the login API with the group1 deployment [23:28:50] kk, will hold off [23:28:57] but the group0 deploy had no effect on it at all, not sure what to make of that [23:29:03] is there an eta on a fix? [23:29:08] surely we have bots running on mw.org? [23:29:08] hmm, ok [23:29:52] not sure how to track that down, we have a dashboard for login API response statuses so all I know is that the frequencies changed [23:30:07] might be a single client that did something weird [23:30:36] plus there is a hhvm warning about dropped sessions [23:30:49] I think that's expected but let me think about it a bit [23:31:00] tgr: could it have been related to the session issues? [23:31:00] thanks for fixing that bd808 by the way [23:31:10] ok, i'll hold off [23:31:37] it's almost certainly related, I'm just not sure if these are real errors or harmless side effects [23:31:47] friendly reminder, SWAT is in 29 minutes, so we should decide and push forward in the next 15, to give us some time to make sure things are ok [23:35:04] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1950387 (10mmodell) git-ssh.wikimedia.org has an ipv6 address in DNS, however, it's not yet active due to lack of time to work on this. We need t... 
[23:37:06] (PS1) Dzahn: comment IPv6 records for git-ssh [dns] - https://gerrit.wikimedia.org/r/265424 (https://phabricator.wikimedia.org/T100519) [23:39:14] (PS2) Dzahn: comment IPv6 records for git-ssh [dns] - https://gerrit.wikimedia.org/r/265424 (https://phabricator.wikimedia.org/T100519) [23:40:20] (CR) Reedy: [C: 1] comment IPv6 records for git-ssh [dns] - https://gerrit.wikimedia.org/r/265424 (https://phabricator.wikimedia.org/T100519) (owner: Dzahn) [23:40:47] (CR) Dzahn: [C: 2] comment IPv6 records for git-ssh [dns] - https://gerrit.wikimedia.org/r/265424 (https://phabricator.wikimedia.org/T100519) (owner: Dzahn) [23:43:48] operations, Phabricator, Release-Engineering-Team, Traffic, Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1950423 (Dzahn) >>! In T100519#1950387, @mmodell wrote: > git-ssh.wikimedia.org has an ipv6 address in DNS, however, it's not yet active due to... [23:44:06] tgr: any progress? [23:44:20] we should probably call it now, either way [23:45:15] That means I need to fix wmf.10 fully for wikitech then :P [23:45:19] marxarelli: the hhvm errors are expected [23:45:43] so, OK to have? :) [23:45:45] no idea about the need_token issue [23:46:15] the easiest way to figure that out might be to deploy and wait for someone to report it [23:46:39] there was about one of those every 5 sec, don't know what that means on group1 scale [23:46:42] tgr: what is the user-experience of that error? [23:46:58] API login fail, I assume [23:47:23] if it affects all bots, it should not be rolled out [23:47:50] if it's a single client, we need to roll it out eventually to figure out what's going on [23:48:03] and what are the effects of leaving wikitech on wmf.10 until tomorrow?
[23:48:23] https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request is broken again :P [23:48:42] operations, ops-eqiad, Analytics: Possible bad mem chip or slot on dbproxy1004 - https://phabricator.wikimedia.org/T123546#1950435 (Tbayer) [23:48:46] neat. rock and a hard place [23:48:57] Reedy: wasn't that being moved to form in phab? [23:48:57] 2 cherry picks/bumps to do [23:49:03] p858snake: Eventually [23:49:15] tgr: https://lists.wikimedia.org/pipermail/labs-l/2016-January/004241.html [23:49:16] Actually, I lie [23:49:24] marxarelli: 1 patch for full breakage [23:49:28] 2 for removing some noise [23:49:33] tgr: is that related? [23:49:48] most likely [23:50:38] perhaps we should rollback wikitech to wmf.9 and give tgr more time to figure out the need_token issue? [23:50:39] IMO let's stay on 10 then, if we know a bot that's affected, it should be possible to reproduce/fix on group 0 [23:50:54] * greg-g nods [23:50:55] agreed [23:51:14] marxarelli: do the needful on wikitech [23:51:27] werd [23:51:30] er... wilco [23:51:49] what's the next deploy window? Thursday's train to all wikis at the same time? [23:51:50] is there a ticket tracking this? [23:51:56] * YuviPanda wants to respond on labs-l [23:52:12] yes, 2 of them. [23:52:32] https://phabricator.wikimedia.org/T123583 [23:52:41] https://phabricator.wikimedia.org/T124248 [23:52:55] thanks!
[23:53:10] oh [23:53:13] YuviPanda: re the api login or the wikitech form stuff [23:53:17] I was talking about the api login stuff [23:53:19] not the wikitech form [23:53:21] right [23:53:27] YuviPanda: there is now: https://phabricator.wikimedia.org/T124252 [23:53:48] not sure if it's the same issue but it would make sense [23:54:24] (PS1) EBernhardson: [cirrus maint] redirect stderr to log and use full mwscript path [puppet] - https://gerrit.wikimedia.org/r/265427 (https://phabricator.wikimedia.org/T120843) [23:54:57] (PS1) Dduvall: Rollback labswiki and labtestwiki to 1.27.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/265428 [23:56:09] (CR) Dduvall: [C: 2] Rollback labswiki and labtestwiki to 1.27.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/265428 (owner: Dduvall) [23:56:19] marxarelli: nooooo [23:56:28] Reedy: ok [23:56:42] !log reedy@tin Synchronized php-1.27.0-wmf.10/extensions/SemanticForms/: fix wikitech again (duration: 00m 34s) [23:56:46] (Merged) jenkins-bot: Rollback labswiki and labtestwiki to 1.27.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/265428 (owner: Dduvall) [23:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:56:52] https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request [23:56:54] fixed :) [23:56:58] Reedy: :) [23:57:51] bloody hell. reverting [23:58:49] lol [23:58:56] operations, Gerrit, GitHub-Mirrors, ValueView, and 3 others: [Task] Redirect unused extensions/ValueView repository to data-values/value-view - https://phabricator.wikimedia.org/T123624#1950547 (JanZerebecki) [23:59:35] Should I backport my patch to fix some log noise? [23:59:36] operations, Availability, codfw-rollout, codfw-rollout-Jan-Mar-2016: Implement a replication strategy for Swift - https://phabricator.wikimedia.org/T91869#1950552 (aaron) [23:59:41] Meh, don't think it's necessary