[00:02:53] Hrmph I have to do the MessageBlobStore thing, hold on [00:02:58] RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active [00:03:13] the MessageBlobStore thing? [00:04:18] PROBLEM - NTP on restbase2004 is CRITICAL: NTP CRITICAL: Offset unknown [00:05:36] The value of the message is now cached as "" [00:05:40] ugh [00:05:46] It used to be that you could delete a row from the DB to fix that [00:05:53] But that cache doesn't live in the DB any more, so you need eval.php [00:06:06] And thanks to dependency injection, the right incantations are non-trivial [00:06:12] But I think I've got it, hold on [00:07:50] OK, yes I got it [00:07:56] I'll pastebin it so you can see [00:08:17] https://gist.github.com/catrope/f7a11078c02d20f7d247ca372b430b23 [00:08:17] how to do it in future I mean [00:09:34] so the trick is calling $store->getBlob($m, 'en'); twice? [00:09:41] oh, the updateMessage call [00:09:42] I see [00:09:44] Yeah [00:09:52] The getBlob before and after was to verify [00:10:05] for some reason I saw the echo above the updateMessage and ignored it :D [00:10:25] (And the first getBlob call is to verify that the problem is really in the place where I think it is) [00:10:41] OK, everything looks good now, I'm ready for the config change [00:10:58] Thanks for bearing with me on that one [00:11:06] (03PS3) 10Alex Monk: Show the cross-wiki notifications beta feature invitation on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286534 (https://phabricator.wikimedia.org/T117669) (owner: 10Catrope) [00:11:10] (03CR) 10Alex Monk: [C: 032] Show the cross-wiki notifications beta feature invitation on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286534 (https://phabricator.wikimedia.org/T117669) (owner: 10Catrope) [00:11:46] (03Merged) 10jenkins-bot: Show the cross-wiki notifications beta feature invitation on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286534 
(https://phabricator.wikimedia.org/T117669) (owner: 10Catrope) [00:12:41] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/286534/ (duration: 00m 30s) [00:12:46] RoanKattouw, ^ [00:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:13:08] Thanks [00:13:12] Will verify [00:13:18] RECOVERY - puppet last run on cp2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:13:46] on nlwiki, grumble grumble [00:13:56] I guess I need to do the MessageBlobStore thing on all wikis, not just test [00:14:57] * RoanKattouw chooses the nuclear option [00:15:15] I see that my refreshMessageBlobs.php script has been kept, except that it just calls $messageBlobStore->clear() now [00:15:32] I could run that on all wikis, that would obliterate the MBS everywhere [00:15:38] But ... maybe not [00:15:41] Let's not [00:16:08] maybe live hack it to do what you need then put it back before anyone else logs in and tries to use it? :) [00:17:15] echo '$config = ConfigFactory::getDefaultInstance()->makeConfig( 'main' ); $rl = new ResourceLoader( $config ); $store = new MessageBlobStore( $rl ); $store->updateMessage( "echo-popup-footer-beta-invitation" );' | mwscript eval.php nlwiki [00:17:21] I could run that for wiki in all.dblist [00:17:59] ....except it didn't work, wtf [00:18:57] No, it did work [00:19:10] I was just impatient [00:19:16] OK I'll just run that everywhere [00:21:15] (03PS7) 10Alex Monk: Initialize configuration for jam.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286258 (https://phabricator.wikimedia.org/T134017) (owner: 10Dzahn) [00:25:27] all ok RoanKattouw? [00:25:38] Yeah, it's still running but it seems to be working [00:28:36] And it finished [00:29:49] Krenair: Are you going to deploy that jamwiki thing now/today? 
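The "run that for wiki in all.dblist" idea from 00:17:21 can be sketched as a small shell loop around the eval.php one-liner pasted at 00:17:15. This is a minimal sketch under stated assumptions, not the command that was actually run: the dblist is assumed to hold one wiki database name per line, and `refresh_message_everywhere` and `DRY_RUN` are hypothetical names introduced here for illustration. The PHP calls are the ones from the log, with "main" double-quoted so it nests inside the shell single quotes.

```shell
#!/bin/sh
# Sketch only: repeat the eval.php one-liner for every wiki listed in a dblist.
# Assumptions not confirmed by the log: one wiki DB name per line in the dblist;
# DRY_RUN and refresh_message_everywhere are hypothetical names.

# PHP handed to eval.php on stdin (same calls as the logged one-liner).
PHP_SNIPPET='$config = ConfigFactory::getDefaultInstance()->makeConfig( "main" ); $rl = new ResourceLoader( $config ); $store = new MessageBlobStore( $rl ); $store->updateMessage( "echo-popup-footer-beta-invitation" );'

refresh_message_everywhere() {
    dblist="$1"
    # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
    while IFS= read -r wiki || [ -n "$wiki" ]; do
        # Skip blank lines and comment lines.
        case "$wiki" in
            ''|'#'*) continue ;;
        esac
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Default mode: only print what would be run.
            echo "would run: mwscript eval.php $wiki"
        else
            # mwscript reads the PHP from the pipe, not from the dblist.
            printf '%s\n' "$PHP_SNIPPET" | mwscript eval.php "$wiki"
        fi
    done < "$dblist"
}
```

With the defaults, `refresh_message_everywhere all.dblist` only prints the commands; setting `DRY_RUN=0` first would actually pipe the snippet into `mwscript eval.php` for each wiki. Defaulting to a dry run is a deliberate choice here, since the loop touches every wiki in the list.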
[00:29:55] yes [00:30:01] OK [00:30:05] I wanna do a manual l10nupdate run but that can wait until you're done with that [00:31:06] (03PS8) 10Alex Monk: Initialize configuration for jam.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286258 (https://phabricator.wikimedia.org/T134017) (owner: 10Dzahn) [00:31:10] (03CR) 10Alex Monk: [C: 032] Initialize configuration for jam.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286258 (https://phabricator.wikimedia.org/T134017) (owner: 10Dzahn) [00:31:41] (03Merged) 10jenkins-bot: Initialize configuration for jam.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286258 (https://phabricator.wikimedia.org/T134017) (owner: 10Dzahn) [00:32:48] think I'm gonna disable the search and wikidata parts and do them afterwards [00:33:51] OH COME ON [00:34:02] [ee64a1e2a4d674cc68a7bf86] [no req] MWException from line 2967 of /srv/mediawiki-staging/php-1.27.0-wmf.22/includes/db/Database.php: Could not open "/srv/mediawiki-staging/php-1.27.0-wmf.22/extensions/OAI/sql/update_table.sql". [00:34:12] How are we completely incapable of keeping addWiki working?! [00:36:37] Has anyone let Krinkle know that we're creating a jamwiki? 
;) [00:36:49] !log krenair@tin Synchronized dblists: create jamwiki - https://gerrit.wikimedia.org/r/286258 (duration: 00m 27s) [00:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:38:25] !log krenair@tin rebuilt wikiversions.php and synchronized wikiversions files: create jamwiki - https://gerrit.wikimedia.org/r/286258 [00:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:39:05] Krenair: the newprojects email looks fully correct this time :) [00:39:13] indeed [00:39:28] guessing there's no way to remove my --help screwup from the archive [00:39:42] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: create jamwiki - https://gerrit.wikimedia.org/r/286258 (duration: 00m 24s) [00:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:40:39] !log krenair@tin Synchronized static/images/project-logos: create jamwiki - https://gerrit.wikimedia.org/r/286258 (duration: 00m 24s) [00:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:41:12] !log krenair@tin Synchronized langlist: create jamwiki - https://gerrit.wikimedia.org/r/286258 (duration: 00m 24s) [00:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:42:56] it's up [00:43:14] setting up search [00:46:19] done [00:46:22] wikidata will need https://gerrit.wikimedia.org/r/#/c/286473/ deployed [00:48:02] (03PS1) 10Alex Monk: Interwiki cache update - mainly for jamwiki, but other map changes included [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286552 (https://phabricator.wikimedia.org/T134017) [00:48:33] legoktm: I was reminded of Krinkle's list just today because of https://gerrit.wikimedia.org/r/286281 [00:48:52] (03CR) 10Alex Monk: [C: 032] Interwiki cache update - mainly for jamwiki, but other map changes included [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286552 
(https://phabricator.wikimedia.org/T134017) (owner: 10Alex Monk) [00:48:58] :P [00:49:23] (03Merged) 10jenkins-bot: Interwiki cache update - mainly for jamwiki, but other map changes included [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286552 (https://phabricator.wikimedia.org/T134017) (owner: 10Alex Monk) [00:50:28] !log krenair@tin Synchronized wmf-config/interwiki.php: https://gerrit.wikimedia.org/r/286552 - interwiki cache update including jamwiki, horizon, and other things (duration: 00m 26s) [00:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:51:41] so, next... swift [00:52:50] done [00:53:32] I still don't think I've ever got addWiki running without a fatal exception the first time for each given new wiki [00:54:38] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 713 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5496398 keys - replication_delay is 713 [00:56:24] (03PS1) 10Alex Monk: Add jamwiki to restbase and labs DNS configs [puppet] - 10https://gerrit.wikimedia.org/r/286553 (https://phabricator.wikimedia.org/T134017) [00:58:29] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5482559 keys - replication_delay is 0 [01:11:24] actually, wikidata didn't need the update [01:11:28] the wmf.22 wasn't broken [01:11:41] (master was when I ran the script in beta earlier, but that's fixed now) [01:11:47] the wmf.22 branch* [01:15:01] thanks legoktm [01:20:23] (03CR) 10Dzahn: [C: 032] udpmxircecho: remove newlines from RC data [puppet] - 10https://gerrit.wikimedia.org/r/286544 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [01:20:51] (03PS2) 10Dzahn: udpmxircecho: fix utf-8 encoding issue [puppet] - 10https://gerrit.wikimedia.org/r/286546 (https://phabricator.wikimedia.org/T123729) [01:23:47] (03CR) 10Dzahn: [C: 032] udpmxircecho: fix utf-8 encoding issue [puppet] - 
10https://gerrit.wikimedia.org/r/286546 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [01:28:08] (03Abandoned) 10Dzahn: add AAAA record for argon (irc) [dns] - 10https://gerrit.wikimedia.org/r/214506 (https://phabricator.wikimedia.org/T105422) (owner: 10Dzahn) [01:31:59] 06Operations, 10Wikimedia-IRC-RC-Server, 13Patch-For-Review: Migrate irc.wikimedia.org to Jessie - https://phabricator.wikimedia.org/T123729#2258473 (10Dzahn) [01:32:59] 06Operations: decom argon - https://phabricator.wikimedia.org/T134223#2258481 (10Dzahn) [01:33:10] 06Operations: decom argon - https://phabricator.wikimedia.org/T134223#2258498 (10Dzahn) [01:33:12] 06Operations, 10Wikimedia-IRC-RC-Server, 13Patch-For-Review: Migrate irc.wikimedia.org to Jessie - https://phabricator.wikimedia.org/T123729#1936574 (10Dzahn) [01:33:46] (03PS2) 10Dzahn: switch irc.wm.org from argon to kraz [dns] - 10https://gerrit.wikimedia.org/r/286509 (https://phabricator.wikimedia.org/T123729) [01:36:37] (03PS3) 10Dzahn: switch irc.wm.org from argon to kraz [dns] - 10https://gerrit.wikimedia.org/r/286509 (https://phabricator.wikimedia.org/T123729) [01:37:15] (03CR) 10Dzahn: [C: 032] switch irc.wm.org from argon to kraz [dns] - 10https://gerrit.wikimedia.org/r/286509 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [01:39:35] !log switching irc.wikimedia.org from old server argon to new server kraz. old server still running untouched as argon.wikimedia.org. no clients are kicked. appservers are sending RC to both. 
[01:39:59] (03PS2) 10Tim Starling: Remove obsolete ocwiki hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285538 [01:40:01] !log irc.wm.org - see T123729 if any questions [01:40:02] T123729: Migrate irc.wikimedia.org to Jessie - https://phabricator.wikimedia.org/T123729 [01:40:07] (03CR) 10Tim Starling: [C: 032] Remove obsolete ocwiki hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285538 (owner: 10Tim Starling) [01:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:41:37] i got this again where gerrit web ui says "jenkins-bot Verified +2" as the last message.. yet when i add my CR +2 , still says "needs Verified" [01:42:23] (03CR) 10Dzahn: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/286509 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [01:42:32] mutante: That's because Jenkins V+2ed PS2, not PS3 [01:43:10] RoanKattouw: oh, now i see. thanks. because i edited the message [01:43:16] and a little delay there [01:43:51] In theory though the +2 should have kicked off a gate-and-submit job [01:43:59] But I see no activity in zuul, also none in response to your recheck [01:44:01] Or the new patchest [01:44:04] Very strange [01:44:09] that may be because this is the DNS repo [01:44:18] with different checks and config [01:44:33] Hmm [01:44:40] did the "recheck" thing [01:44:47] waits [01:44:48] (03PS4) 10Catrope: switch irc.wm.org from argon to kraz [dns] - 10https://gerrit.wikimedia.org/r/286509 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [01:45:38] mutante: I think you've triggered a Zuul bug and you need to remove and reapply your +1 [01:45:40] *+2 [01:45:50] Normally I'd add a +2 myself but I don't have +2 rights in operations/* [01:45:59] ok, let me try [01:46:16] (03CR) 10Dzahn: [C: 032] switch irc.wm.org from argon to kraz [dns] - 10https://gerrit.wikimedia.org/r/286509 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [01:46:23] needs Verified 
[01:46:51] Wait and now Jenkins suddenly V+2ed PS3? WTF?! [01:46:51] now it needs to check PS4 but is not done with PS3 yet [01:47:04] it's just ...slow [01:47:04] ? [01:47:10] That did not appear on the zuul status screen at all [01:47:17] Also that job only took 5s [01:48:16] Maybe just V+2 it yourself? If PS3 passed, PS4 certainly passes too because I only changed the commit message [01:48:23] (03CR) 10Dzahn: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/286509 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [01:49:17] yes, but i dont like to get in the habit. it's bad enough how much we do it on puppet repo and these are the DNS zones [01:50:27] (03CR) 10Dzahn: [V: 032] switch irc.wm.org from argon to kraz [dns] - 10https://gerrit.wikimedia.org/r/286509 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [01:50:42] (03Merged) 10jenkins-bot: Remove obsolete ocwiki hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285538 (owner: 10Tim Starling) [01:50:47] i already logged i would switch it a couple minutes ago .. so [01:51:21] RoanKattouw: thanks for looking. irc.wm.org is now jessie :p [01:51:53] well, at least for new clients and after cache refreshes [01:52:01] the old one is also running as before [01:53:37] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.22) (duration: 4303m 19s) [01:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:53:44] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.21) (duration: 12941m 41s) [01:53:47] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.21) (duration: 23019m 56s) [01:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:54:22] !log Killed ssh processes on tin that had been hanging for days [01:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Mr. 
Obvious [01:55:25] 06Operations, 10Wikimedia-IRC-RC-Server, 13Patch-For-Review: Migrate irc.wikimedia.org to Jessie - https://phabricator.wikimedia.org/T123729#2258554 (10Dzahn) a:03Dzahn [01:56:08] 06Operations, 10Wikimedia-IRC-RC-Server: Migrate irc.wikimedia.org to Jessie - https://phabricator.wikimedia.org/T123729#1936574 (10Dzahn) [01:56:53] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2258560 (10Dzahn) [01:56:55] 06Operations, 10Wikimedia-IRC-RC-Server: Migrate irc.wikimedia.org to Jessie - https://phabricator.wikimedia.org/T123729#1936574 (10Dzahn) 05Open>03Resolved 18:44 < mutante> !log switching irc.wikimedia.org from old server argon to new server kraz. old server still running untouched as argon.wikimedia.o... [01:57:55] (03CR) 10Dzahn: [C: 031] "please merge, irc.wm.org is switched and this was scheduled for May 2nd, we can still make it :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (https://phabricator.wikimedia.org/T122933) (owner: 10Faidon Liambotis) [01:58:06] (03CR) 10Legoktm: Apache redirects for w.wiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/285932 (https://phabricator.wikimedia.org/T108557) (owner: 10Dereckson) [01:58:41] is there a night deploy ?:P [01:58:49] haha [01:59:08] that irc link thing there. it has a comment "can be merged anytime. 
no impact" [01:59:16] and a scheduled deployment date of May 2nd [01:59:41] Yeah, if Krinkle says that I'll happily merge+deploy [01:59:42] it should have been in the earlier window but irc.wm wasn't ready due to other bugs [01:59:44] (03CR) 10Catrope: [C: 032] Remove obsolete 'https -> http' rewrite for IRC notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (https://phabricator.wikimedia.org/T122933) (owner: 10Faidon Liambotis) [01:59:48] thanks a lot :) [02:02:37] 06Operations, 10Traffic, 10Wikimedia-IRC-RC-Server, 07HTTPS, and 2 others: Remove the "HTTPS to HTTP" url filter in the IRC feed - https://phabricator.wikimedia.org/T122933#1916986 (10Dzahn) 19:04 < mutante> that irc link thing there. it has a comment "can be merged anytime. no impact" 19:04 < mutante> and... [02:03:10] Deployment date: Monday, May 2nd, 2016. on the ticket. and we did not fail it.. we are using PST timezone for that one, hehe [02:04:28] (03PS12) 10Catrope: Remove obsolete 'https -> http' rewrite for IRC notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (https://phabricator.wikimedia.org/T122933) (owner: 10Faidon Liambotis) [02:04:30] So that's marked as no impact because we realised someone else had already made the change we wanted [02:04:34] (03CR) 10Catrope: [C: 032] Remove obsolete 'https -> http' rewrite for IRC notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (https://phabricator.wikimedia.org/T122933) (owner: 10Faidon Liambotis) [02:04:35] this is just clearing up useless code [02:04:42] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Remove obsolete ocwiki hack (duration: 00m 37s) [02:04:47] OK [02:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:04:52] 06Operations, 10Wikimedia-IRC-RC-Server: Migrate irc.wikimedia.org to Jessie - https://phabricator.wikimedia.org/T123729#2258589 (10Dzahn) [02:04:54] 06Operations, 10Traffic, 
10Wikimedia-IRC-RC-Server, 07HTTPS, and 2 others: Remove the "HTTPS to HTTP" url filter in the IRC feed - https://phabricator.wikimedia.org/T122933#2258586 (10Dzahn) 05Open>03Resolved a:03Dzahn claiming resolved, but _please_ confirm [02:05:00] Well Tim already merged something into wmf-config without deploying earlier, so I might as well do both [02:05:03] (03Merged) 10jenkins-bot: Remove obsolete 'https -> http' rewrite for IRC notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (https://phabricator.wikimedia.org/T122933) (owner: 10Faidon Liambotis) [02:05:33] Krenair: ah, even better :) [02:05:46] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Remove obsolete https->http rewrite for IRC notifications (duration: 00m 28s) [02:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:07:15] 06Operations, 10Wikimedia-IRC-RC-Server, 07IPv6, 13Patch-For-Review: enable IPv6 on irc.wikimedia.org - https://phabricator.wikimedia.org/T105422#2258592 (10Dzahn) @kraz:~# host irc.wikimedia.org irc.wikimedia.org is an alias for kraz.wikimedia.org. kraz.wikimedia.org has address 208.80.153.44 kraz.wikimed... [02:09:24] 06Operations, 10Wikimedia-IRC-RC-Server: Migrate irc.wikimedia.org to Jessie - https://phabricator.wikimedia.org/T123729#2258596 (10Dzahn) [02:09:26] 06Operations, 10Wikimedia-IRC-RC-Server, 07IPv6, 13Patch-For-Review: enable IPv6 on irc.wikimedia.org - https://phabricator.wikimedia.org/T105422#2258593 (10Dzahn) 05Open>03Resolved a:03Dzahn ping6 irc.wikimedia.org PING irc.wikimedia.org(2620:0:860:2:208:80:153:44) 56 data bytes 64 bytes from 2620:0... [02:10:52] Krenair: ooh, and you even did jam.wp too.. thanks man! [02:14:05] eh. 
i mean of course "ya man" [02:14:16] in the jam language [02:15:33] (03CR) 10Dzahn: [C: 031] Add jamwiki to restbase and labs DNS configs [puppet] - 10https://gerrit.wikimedia.org/r/286553 (https://phabricator.wikimedia.org/T134017) (owner: 10Alex Monk) [02:15:45] haha [02:17:38] PROBLEM - configured eth on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:17:39] PROBLEM - salt-minion processes on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:17:39] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:17:49] PROBLEM - puppet last run on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:17:58] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:18:18] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:18:19] PROBLEM - graphoid endpoints health on scb2002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [02:18:28] PROBLEM - DPKG on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:18:29] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:18:30] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:18:36] Krenair: should be "Jamaican patois" [02:18:38] PROBLEM - Disk space on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:18:40] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:18:40] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:18:47] patois can be just a dialect of French [02:18:49] PROBLEM - graphoid endpoints health on scb2001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [02:18:49] https://en.wiktionary.org/wiki/patois [02:18:58] PROBLEM - RAID on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:18:58] PROBLEM - Check size of conntrack table on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:18:59] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:19:00] ok [02:19:09] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:19:09] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:19:18] PROBLEM - dhclient process on alsafi is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [02:19:28] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:21:53] !log ssh alsafi [02:21:59] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [02:22:01] [02:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:22:08] RECOVERY - DPKG on alsafi is OK: All packages OK [02:22:08] RECOVERY - graphoid endpoints health on scb2002 is OK: All endpoints are healthy [02:22:09] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [02:22:10] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [02:22:19] RECOVERY - Disk space on alsafi is OK: DISK OK [02:22:28] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [02:22:28] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [02:22:38] RECOVERY - Check size of conntrack table on alsafi is OK: OK: nf_conntrack is 0 % full [02:22:38] RECOVERY - RAID on alsafi is OK: OK: no RAID installed [02:22:38] RECOVERY - graphoid endpoints health on scb2001 is OK: All endpoints are healthy [02:22:39] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [02:22:49] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [02:22:50] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [02:22:59] RECOVERY - dhclient process on alsafi is OK: PROCS OK: 0 processes with command name dhclient [02:23:08] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [02:23:09] RECOVERY - salt-minion processes on alsafi is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:23:09] RECOVERY - configured eth on alsafi is OK: OK - interfaces up [02:23:09] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [02:23:29] RECOVERY - puppet last run on alsafi is OK: OK: Puppet is currently enabled, last run 25 minutes ago with 
0 failures [02:23:29] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [02:24:16] 06Operations: decom argon - https://phabricator.wikimedia.org/T134223#2258635 (10Dzahn) p:05Triage>03Normal [02:26:09] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.22) (duration: 08m 40s) [02:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:27:06] 06Operations, 10Traffic, 10Wikimedia-IRC-RC-Server, 07HTTPS, and 2 others: Remove the "HTTPS to HTTP" url filter in the IRC feed - https://phabricator.wikimedia.org/T122933#2258641 (10Dzahn) @Johan the change happened now, late but still on May 2nd as announced [02:34:45] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue May 3 02:34:45 UTC 2016 (duration 8m 36s) [02:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:38:32] 06Operations, 10Wikimedia-IRC-RC-Server: Migrate irc.wikimedia.org to Jessie - https://phabricator.wikimedia.org/T123729#2258647 (10Dzahn) >>! In T123729#2216681, @Krinkle wrote: > * Set up kraz (Jessie; VM) to be a replacement for argon (Precise; metal). done > * Update MediaWiki wmf-config to broadcast event... 
[02:43:24] 06Operations: decom argon - https://phabricator.wikimedia.org/T134223#2258652 (10Dzahn) [02:43:26] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2258651 (10Dzahn) [02:49:41] 06Operations, 10ops-codfw, 10RESTBase: plug in restbase2004 network cable - https://phabricator.wikimedia.org/T134197#2258655 (10Papaul) a:05Papaul>03RobH Complete [03:12:09] PROBLEM - torrus.wikimedia.org UI on netmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string Torrus Top: Wikimedia not found on https://torrus.wikimedia.org:443/torrus - 678 bytes in 0.052 second response time [03:19:33] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review, 05WMF-deploy-2016-05-01_(1.27.0-wmf.23): Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2258685 (10Mtherwjs) 05Open>03Resolved [03:35:11] (03Abandoned) 10KartikMistry: Enable non-default MT for some languages [puppet] - 10https://gerrit.wikimedia.org/r/277463 (https://phabricator.wikimedia.org/T129849) (owner: 10KartikMistry) [03:42:19] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review, 05WMF-deploy-2016-05-01_(1.27.0-wmf.23): Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2258728 (10Krenair) 05Resolved>03Open There are open patches, this is not ready to call complete. [03:57:50] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: puppet fail [04:09:26] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review, 05WMF-deploy-2016-05-01_(1.27.0-wmf.23): Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2258732 (10Mtherwjs) 05Open>03Resolved [04:11:17] mutante: Thanks for the e-mail about irc.wikimedia.org. 
[04:23:30] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [04:33:40] 06Operations, 10ops-codfw: sinistra - RAID failure - https://phabricator.wikimedia.org/T134187#2258749 (10Dzahn) p:05Triage>03Normal [04:36:14] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review, 05WMF-deploy-2016-05-01_(1.27.0-wmf.23): Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2258750 (10Dzahn) 05Resolved>03Open [04:37:01] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review, 05WMF-deploy-2016-05-01_(1.27.0-wmf.23): Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2252547 (10Dzahn) p:05Triage>03Normal "Please do not start editing this new site. This site has a tes... [04:38:47] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review, 05WMF-deploy-2016-05-01_(1.27.0-wmf.23): Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2258754 (10Dzahn) @Mtherwjs please don't just re-close after Krenair already explained it's not quite don... [04:44:14] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review, 05WMF-deploy-2016-05-01_(1.27.0-wmf.23): Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2258757 (10Mtherwjs) 05Open>03Resolved [04:46:01] Leah: :) [04:46:11] and Mtherwjs. that is edit war-ing now [04:46:23] please stop [04:49:31] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review, 05WMF-deploy-2016-05-01_(1.27.0-wmf.23): Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2258759 (10Dzahn) 05Resolved>03Open @Aklapper @Mtherwjs please stop edit war. https://en.wikipedia.o... 
[04:50:53] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review, 05WMF-deploy-2016-05-01_(1.27.0-wmf.23): Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2258762 (10Mtherwjs) 05Open>03Resolved [05:00:14] (03PS4) 10KartikMistry: cxserver: scap3 migration [puppet] - 10https://gerrit.wikimedia.org/r/286395 (https://phabricator.wikimedia.org/T120104) [05:04:10] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review, 05WMF-deploy-2016-05-01_(1.27.0-wmf.23): Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2258778 (10Dzahn) [05:04:28] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review, 05WMF-deploy-2016-05-01_(1.27.0-wmf.23): Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2252547 (10Dzahn) 05Resolved>03Open @Mtherjws nope :) [05:05:44] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review, 05WMF-deploy-2016-05-01_(1.27.0-wmf.23): Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2258782 (10DSsfduhi3) 05Open>03Resolved [05:07:10] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review, 05WMF-deploy-2016-05-01_(1.27.0-wmf.23): Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2258784 (10Dzahn) [05:17:39] PROBLEM - Disk space on elastic1029 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 79708 MB (15% inode=99%) [05:29:09] RECOVERY - Disk space on elastic1029 is OK: DISK OK [06:24:35] (03PS3) 10Giuseppe Lavagetto: hhvm: make /var/log/hhvm owned by root [puppet] - 10https://gerrit.wikimedia.org/r/285945 (owner: 10Hashar) [06:30:55] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:15] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:25] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 
1 failures [06:31:36] PROBLEM - puppet last run on mw1090 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:44] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:44] PROBLEM - puppet last run on mw1112 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:15] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:14] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:25] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:25] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:26] PROBLEM - puppet last run on db2016 is CRITICAL: CRITICAL: puppet fail [06:33:44] PROBLEM - puppet last run on elastic2007 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:54] PROBLEM - puppet last run on mw2136 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:06] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:18] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Actually, I realized that we do set ownership of the directory explicitly in the upstart script for hhvm, see https://phabricator.wikimedi" [puppet] - 10https://gerrit.wikimedia.org/r/285945 (owner: 10Hashar) [06:56:15] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:56:34] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:56:35] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:56:44] RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:45] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:56] RECOVERY - puppet last run on mw1090 is OK: OK: Puppet is currently 
enabled, last run 1 minute ago with 0 failures [06:56:56] RECOVERY - puppet last run on elastic2007 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:56:56] RECOVERY - puppet last run on mw1112 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:56] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:57:25] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:57:35] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:25] (03PS1) 10Smalyshev: [WIP] Add configs for kafka-watcher tool [puppet] - 10https://gerrit.wikimedia.org/r/286588 (https://phabricator.wikimedia.org/T97562) [06:58:35] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:36] RECOVERY - puppet last run on db2016 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:58:59] (03PS4) 10Giuseppe Lavagetto: hhvm: make /var/log/hhvm owned by root [puppet] - 10https://gerrit.wikimedia.org/r/285945 (owner: 10Hashar) [06:59:05] RECOVERY - puppet last run on mw2136 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:29] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Add configs for kafka-watcher tool [puppet] - 10https://gerrit.wikimedia.org/r/286588 (https://phabricator.wikimedia.org/T97562) (owner: 10Smalyshev) [07:04:42] <_joe_> SMalyshev: I was looking at https://github.com/wikimedia/mediawiki-services-kafka-watcher and I see quite a few things that are not done the way normal python software is handled [07:05:10] RECOVERY - torrus.wikimedia.org UI on netmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 2556 bytes in 0.137 second response time [07:05:19] <_joe_> like, you don't create a module, there is no setup.py, no main, 
no entrypoint script [07:05:31] <_joe_> and no logging whatsoever [07:06:02] <_joe_> honestly I don't think it's production-quality at the moment. [07:06:59] !log recovering torrus database @ netmon following T87815 [07:07:00] T87815: Torrus is broken - https://phabricator.wikimedia.org/T87815 [07:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:07:08] _joe_: it's a work in progress :) but if you want to help, you're more than welcome [07:07:26] _joe_: why it needs setup.py btw? [07:07:42] <_joe_> SMalyshev: also, it would be interesting to allow using multiple handlers by using something like gevent or asyncio [07:07:46] _joe_: and what you mean by "entrypoint script"? [07:07:59] <_joe_> SMalyshev: because that way you can install/deinstall it on a system [07:08:00] _joe_: it can use multiple handlers. [07:08:36] _joe_: is that how python modules are deployed now? could you explain more or give an example how puppet deploys modules via setup.py? [07:08:54] <_joe_> SMalyshev: not puppet, everyone on the planet [07:09:07] _joe_: it's not the script for everyone on the planet :) [07:09:12] <_joe_> SMalyshev: most python software we write is distributed as deb packages [07:09:31] _joe_: deb packages? that sounds rather complicated... [07:09:39] <_joe_> SMalyshev: well, it's FLOSS, I assume we want to respect standards :) I'll send a patch [07:09:49] _joe_: how you get from repo to deb package? [07:10:12] <_joe_> SMalyshev: I think it's ok to distribute this via scap3 btw [07:10:23] _joe_: well, it is FLOSS but you won't get much use of it trying to install it as a python module [07:10:35] <_joe_> don't worry about debs, but giving it some normal-looking structure would help [07:10:42] <_joe_> I can help with that btw [07:10:47] <_joe_> I'll send a patch [07:11:04] _joe_: scap3 could use some docs too :) so far the only thing I found is about how to deploy node services. 
Which is not really helpful for python :) [07:11:38] <_joe_> SMalyshev: I plan on experimenting with it in the coming weeks btw [07:11:53] <_joe_> and you know, reach out to releng people [07:11:55] _joe_: I've made first sketch at https://gerrit.wikimedia.org/r/#/c/286588/ but that does not include deployment [07:12:19] <_joe_> btw a standardish-looking python software written by us: https://github.com/wikimedia/operations-software-conftool [07:13:43] _joe_: ok, I'll take a look, thanks [07:14:40] _joe_: this one seems to rely on setup creating the CLI tool. I'm not sure whether this is the best route for a daemon... [07:15:43] not sure I understand how it's deployed [07:17:18] 06Operations: Torrus is broken - https://phabricator.wikimedia.org/T87815#2258851 (10jcrespo) 05Resolved>03Open @Mark I am reopening this and pinging you just to notify you as the "owner" of this service that this happened again today (and fixed following Gage's guide), and decide what is the its future: Ma... [07:18:58] 06Operations, 10ops-eqiad: ms-be1002 has a faulty disk - https://phabricator.wikimedia.org/T134234#2258855 (10Joe) [07:19:03] 06Operations: Torrus is broken - https://phabricator.wikimedia.org/T87815#2258867 (10jcrespo) @mark Also, I think you mentioned not puppetizing it, so collectors may be erased already, can you check? [07:19:07] 06Operations, 10ops-eqiad: ms-be1002 has a faulty disk - https://phabricator.wikimedia.org/T134234#2258868 (10Joe) p:05Triage>03Normal [07:19:41] ACKNOWLEDGEMENT - RAID on ms-be1002 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) Giuseppe Lavagetto T134234 [07:19:41] ACKNOWLEDGEMENT - puppet last run on ms-be1002 is CRITICAL: CRITICAL: Puppet has 1 failures Giuseppe Lavagetto T134234 [07:20:03] _joe_: I tried using utils/new_wmf_service.py to generate service defs, but that doesn't seem to work well currently [07:20:22] <_joe_> SMalyshev: sorry, is this script something we're supposed to use in production? 
[07:20:24] it generates a lot of unrelated changes and assumes too much about what is going on [07:20:31] _joe_: yes [07:20:39] eventually [07:20:44] <_joe_> SMalyshev: this is not a public-facing service [07:20:50] _joe_: it's part of https://phabricator.wikimedia.org/T97562 [07:21:13] <_joe_> SMalyshev: meaning, it's a daemon, doesn't need to accept external traffic [07:21:31] it's supposed to get purge messages from Kafka and send them to memcache (and maybe varnish in the future) [07:22:04] <_joe_> ok, as I said, this has nothing to do with new_wmf_service [07:22:37] _joe_: sorry, IRC kicked me out, I've probably lost a couple of last responses [07:23:19] <_joe_> 09:22 < _joe_> ok, as I said, this has nothing to do with new_wmf_service [07:23:25] <_joe_> :) [07:23:44] <_joe_> SMalyshev: I'll speak with gehel when he's around for helping you with the ops part [07:23:47] _joe_: maybe :) It's a daemon that is pulling from Kafka and pushing to anything else [07:24:04] _joe_: yeah I spoke with him about it and with godog too [07:24:07] <_joe_> yeah I got what it does ;) [07:24:38] that got me some clarity but not 100% there yet esp. the deployment part [07:25:01] <_joe_> SMalyshev: I think we can make this more robust in all respects [07:25:03] still kind of fuzzy on that one but I'll read more on scap3 and will try to see [07:25:22] <_joe_> actually I think deb packaging would make your life easier [07:25:25] <_joe_> and I can do it [07:25:44] <_joe_> you wouldn't need to do almost anything in puppet [07:26:11] <_joe_> but I'll discuss with gehel what we think is best [07:26:16] _joe_: if you can that'd be nice, at least initial patch. However, what I am concerned about is the update cycle. I.e. until we get it running, we probably would be messing a lot with it [07:26:55] is it easy to update the deb repo or we'd need to do complex stuff every time? 
[07:27:16] <_joe_> SMalyshev: until you get it running, you can simply rebuild the deb locally, or you can just pull the git source on the labs machine where you're testing [07:27:43] <_joe_> but yeah I have very present the shortcomings of the basic approach [07:27:49] <_joe_> of using debs [07:28:06] <_joe_> err, I am well aware of [07:28:13] _joe_: on labs, yes. but on something like deployment beta machines it'd be nicer to use actual process so that we know it works [07:28:13] * gehel reading back... [07:28:33] <_joe_> gehel: yeah we might want to pair up a bit on this [07:28:50] <_joe_> SMalyshev: at the very least, you need to add some meaningful logging to it [07:29:26] _joe_: I thought about it, but I'm not sure what to log and where. E.g. we don't really need to log every packet from Kafka. But what we want to log? [07:29:53] systemd will probably log errors and stuff in the journal anyway [07:30:04] <_joe_> SMalyshev: failures? [07:30:07] <_joe_> errors? [07:30:23] <_joe_> impossibility to connect/read from kafka and to write/handle? [07:30:24] wouldn't those be automatically in the journal? [07:30:45] <_joe_> only if it's an uncaught exception that gets to stdout [07:30:48] * gehel need some more coffee, night was short... [07:31:07] _joe_: right. So python would log it to journal. Do we want to log it somewhere else too? [07:31:15] <_joe_> nope [07:31:35] <_joe_> logging.basicConfig(level=logging.INFO) is enough usually :) [07:32:25] right now it just spews everything to stdout but maybe making it use logging is not a bad idea [07:32:37] <_joe_> it uses "print" ? 
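A minimal sketch of the print-to-logging switch discussed above, built around the `logging.basicConfig(level=logging.INFO)` call _joe_ mentions. The logger name `kafka_watcher` and the `handle_message` helper are hypothetical and are not taken from the kafka-watcher repository; the point is only that failures get recorded with a traceback rather than relying on an uncaught exception reaching the journal.

```python
import logging

# With no handlers configured, basicConfig attaches a StreamHandler that
# writes to stderr, which systemd would capture into the journal.
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("kafka_watcher")  # hypothetical logger name


def handle_message(payload, handler):
    """Run handler(payload); log failures instead of printing them.

    Returns True when the handler succeeded and False when it raised.
    """
    try:
        handler(payload)
    except Exception:
        # Logger.exception logs at ERROR level and appends the traceback.
        log.exception("failed to handle message %r", payload)
        return False
    # DEBUG is below the INFO threshold set above, so routine traffic
    # (e.g. every packet from Kafka) is not logged by default.
    log.debug("handled message %r", payload)
    return True
```

Keeping per-message logging at DEBUG and failures at ERROR matches the concern raised in the channel: errors are always visible, while ordinary traffic stays quiet unless the level is lowered.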
[07:32:55] <_joe_> yeah I see [07:33:00] <_joe_> that "works" [07:33:06] _joe_: well, mostly it doesn't use anything and relies on exceptions but sometimes it uses print [07:33:08] <_joe_> but yeah, use logging :) [07:33:18] ok, I'll make a patch tomorrow [07:33:52] _joe_: also, slightly more up-to-date code is here: https://gerrit.wikimedia.org/r/#/c/284989/ [07:34:15] <_joe_> SMalyshev: cool, I'll take a look [07:35:11] _joe_: oh, btw, since we talked about debs - there's a slight problem there with the fact that jessie has ancient version of python memcached library [07:35:22] would it be too hard to make a deb for a newer version? [07:35:48] <_joe_> depending on backwards compatibility, either no or yes [07:35:54] <_joe_> :) [07:36:20] _joe_: well, the library itself is not compatible with old version exactly (they changed imports) - but that's exactly why i want the new one [07:36:33] and assuming also new one actually works better :) [07:37:00] <_joe_> SMalyshev: ok, that could be an argument against using system-wide debs. but you know, usually those deps are listed in setup.py ;) [07:37:39] _joe_: so I can list it in setup.py, and python would install it. The problem is - would deployment mechanism install it? [07:38:03] <_joe_> SMalyshev: let's say it will tell me (and debhelper) how to install it :P [07:38:03] right now I do require_package('python-pymemcache') and that gets me right package, but wrong version [07:38:30] <_joe_> SMalyshev: python software needs to be organized in modules and use setuptools [07:38:31] _joe_: but if you're building deb you'll need a deb for that version anyway [07:38:56] <_joe_> SMalyshev: even if I want to install it via any other deployment mechanism setup.py is important [07:39:08] <_joe_> (e.g. with scap3 and wheels, like ores does) [07:39:24] _joe_: I'm not sure I still understand how setup.py participates in deployment... 
[07:39:32] unless you use it to build deb [07:39:58] <_joe_> SMalyshev: I'll talk with gehel and come up with some more advice [07:41:10] _joe_: ok, thanks, I'll go to bed then :) [07:41:34] <_joe_> yeah man rest [07:41:44] <_joe_> sorry for sniping you in this discussion that late [07:42:00] _joe_: oh, no problem, I appreciate the help! [07:45:50] (03PS1) 10Jcrespo: Repool db1040 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286592 (https://phabricator.wikimedia.org/T134114) [07:45:59] ^volans [07:47:05] (03CR) 10Volans: Repool db1040 after maintenance (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286592 (https://phabricator.wikimedia.org/T134114) (owner: 10Jcrespo) [07:47:11] jynus: ^^ :( [07:47:15] :) [07:47:20] (sorry bad parentheses) [07:47:40] question is I did not investigate at all, i just reimaged [07:48:06] but I do not like dumps/vslow and rcs on the same server [07:48:33] so I am assuming all issues are 5.5 related [07:49:05] 06Operations: Torrus is broken - https://phabricator.wikimedia.org/T87815#2258886 (10jcrespo) a:05Gage>03mark [07:49:17] I agree with the guess [07:52:42] (03PS2) 10Jcrespo: Repool db1040 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286592 (https://phabricator.wikimedia.org/T134114) [07:57:55] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review, 05WMF-deploy-2016-05-01_(1.27.0-wmf.23): Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2258901 (10Aklapper) 05Resolved>03Open [08:05:03] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1040 after maintenance (duration: 00m 25s) [08:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:08:31] 06Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) 
- https://phabricator.wikimedia.org/T121882#2258936 (10elukey) [08:08:33] 06Operations, 10ops-codfw, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2258935 (10elukey) 05Open>03Resolved [08:09:15] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2258937 (10elukey) [08:09:30] (03CR) 10Mobrovac: [C: 031] "Has been cherry-picked on beta since friday and working good there" [puppet] - 10https://gerrit.wikimedia.org/r/286153 (owner: 10Mobrovac) [08:11:10] (03PS6) 10Giuseppe Lavagetto: hhvm: make /var/log/hhvm owned by root [puppet] - 10https://gerrit.wikimedia.org/r/285945 (owner: 10Hashar) [08:13:05] (03CR) 10Giuseppe Lavagetto: [C: 032] hhvm: make /var/log/hhvm owned by root [puppet] - 10https://gerrit.wikimedia.org/r/285945 (owner: 10Hashar) [08:14:07] 06Operations, 10ops-eqiad, 10Analytics-Cluster: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2258939 (10elukey) Updated list (excluding empty results): ``` elukey@neodymium:~$ sudo -i salt -t 120 analytics10* cmd.run 'grep "Hardware event" /var/log/mcelog | uniq... [08:18:54] !log restarting elasticsearch server elastic2002.codfw.wmnet (T110236) [08:18:55] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [08:19:00] 06Operations, 10Traffic: Evaluate Apache Traffic Server - https://phabricator.wikimedia.org/T96853#1227616 (10MoritzMuehlenhoff) Is that still planned now that we're moving to varnish 4? [08:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:19:56] _joe_: I got my coffee, brain is now mostly working [08:20:19] <_joe_> gehel: hehe I need to take a pause now [08:20:25] :P [08:20:45] _joe_: I'll probably be there when you come back. 
[08:23:09] 06Operations, 10Gitblit: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#2258951 (10MoritzMuehlenhoff) 05stalled>03declined gitblit is currently being migrated to Diffusion (T123718), closing this bug. [08:25:01] (03CR) 10Gehel: [C: 031] cirrus: Only use curl pools on hhvm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286485 (https://phabricator.wikimedia.org/T132751) (owner: 10EBernhardson) [08:28:18] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2258957 (10MoritzMuehlenhoff) You should run debdiff on the source package, while running debdiff on the binary debs has some use cases, the source package is t... [08:37:01] 06Operations, 07Tracking: Upgrade Wikimedia servers to Ubuntu Trusty (14.04) (tracking) - https://phabricator.wikimedia.org/T65899#2258973 (10MoritzMuehlenhoff) 05Open>03declined Closing this ticket, it's currently not really useful/duplicated. We already have a tracking ticket to get rid of precise and ou... [08:37:52] (03PS1) 10Hashar: contint: pypy package for pywikibot [puppet] - 10https://gerrit.wikimedia.org/r/286598 (https://phabricator.wikimedia.org/T134235) [08:41:28] 06Operations, 10Wikimedia-General-or-Unknown, 07I18n, 07Upstream: Update Malayalam fonts packages - https://phabricator.wikimedia.org/T33950#355946 (10MoritzMuehlenhoff) @Praveenp : Would http://packages.ubuntu.com/trusty/fonts-lohit-mlym provide the required fonts? If so, we can add these to the app servers. [08:41:48] PROBLEM - graphoid endpoints health on scb2001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [08:41:49] PROBLEM - Check size of conntrack table on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:41:49] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:41:49] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:41:49] PROBLEM - puppet last run on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:42:08] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:42:08] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:42:08] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:42:11] 06Operations: Ferm rules for palladium - https://phabricator.wikimedia.org/T113344#2258993 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [08:42:19] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:42:28] PROBLEM - Disk space on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:42:29] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:42:29] PROBLEM - graphoid endpoints health on scb2002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [08:42:38] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:42:50] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:42:58] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:42:59] PROBLEM - dhclient process on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:42:59] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2258998 (10elukey) ``` elukey@copper:~$ debdiff memcached_1.4.25-2.dsc /var/cache/pbuilder/result/jessie-amd64/memcached_1.4.25-2~bpo8+1.dsc dpkg-source: warnin... [08:43:00] PROBLEM - RAID on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:43:09] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:43:18] PROBLEM - configured eth on alsafi is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:43:19] PROBLEM - DPKG on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:43:28] <_joe_> what is happening in codfw? [08:43:30] PROBLEM - salt-minion processes on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:44:00] 06Operations, 10ops-esams: replace bast3001 with newer hardware - https://phabricator.wikimedia.org/T131562#2258999 (10MoritzMuehlenhoff) [08:44:47] 06Operations, 10ops-esams, 10hardware-requests: replace bast3001 with newer hardware - https://phabricator.wikimedia.org/T131562#2259001 (10Peachey88) [08:45:33] 06Operations, 10hardware-requests: decom argon - https://phabricator.wikimedia.org/T134223#2259002 (10Peachey88) [08:45:47] 06Operations, 10ops-codfw: ms-be2007.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T133517#2259003 (10MoritzMuehlenhoff) a:03Papaul [08:46:47] 06Operations, 10ops-eqiad: ms-be1002 has a faulty disk - https://phabricator.wikimedia.org/T134234#2259006 (10MoritzMuehlenhoff) a:03Cmjohnson [08:47:04] 06Operations, 10ops-codfw: sinistra - RAID failure - https://phabricator.wikimedia.org/T134187#2259007 (10MoritzMuehlenhoff) a:03Papaul [08:50:44] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review: mw2212 had several downtimes recently - test before repool - https://phabricator.wikimedia.org/T129196#2259009 
(10MoritzMuehlenhoff) a:03Papaul [08:52:00] 06Operations, 10ops-codfw, 06DC-Ops: lvs2002 Embedded Flash/SD-CARD iLO errors - https://phabricator.wikimedia.org/T126321#2259011 (10MoritzMuehlenhoff) a:03Papaul [08:53:24] <_joe_> can someone look at alsafi please? [08:54:01] _joe_: on it [08:54:31] <_joe_> thanks [09:02:36] 06Operations, 10Graphoid, 06Services: Graphoid returns a 400 on MW API time-out - https://phabricator.wikimedia.org/T134237#2259051 (10mobrovac) [09:04:00] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [09:04:08] RECOVERY - Disk space on alsafi is OK: DISK OK [09:04:08] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [09:04:18] RECOVERY - graphoid endpoints health on scb2002 is OK: All endpoints are healthy [09:04:18] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [09:04:38] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [09:04:39] RECOVERY - dhclient process on alsafi is OK: PROCS OK: 0 processes with command name dhclient [09:04:39] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [09:04:39] RECOVERY - RAID on alsafi is OK: OK: no RAID installed [09:04:49] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [09:04:59] RECOVERY - configured eth on alsafi is OK: OK - interfaces up [09:04:59] RECOVERY - DPKG on alsafi is OK: All packages OK [09:05:18] RECOVERY - salt-minion processes on alsafi is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:05:29] RECOVERY - Check size of conntrack table on alsafi is OK: OK: nf_conntrack is 0 % full [09:05:29] RECOVERY - graphoid endpoints health on scb2001 is OK: All endpoints are healthy [09:05:29] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [09:05:30] RECOVERY - puppet last run on alsafi is OK: OK: Puppet is 
currently enabled, last run 4 seconds ago with 0 failures [09:05:38] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [09:05:48] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [09:05:48] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [09:05:49] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [09:07:37] <_joe_> why all at the same time? [09:07:48] <_joe_> moritzm: what is alsafi doing btw? [09:07:52] <_joe_> I don't remember [09:08:27] !log Update cxserver to 8a4254e [09:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:10:04] _joe_: /var/log/hhvm owned by root <-- thanks a ton :-} [09:10:12] _joe_: it's back up, just waiting for icinga to notice that [09:10:35] <_joe_> moritzm: it did [09:10:42] !log powercycled alsafi (stuck in KVM) [09:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:11:10] !log restarting elasticsearch server elastic2003.codfw.wmnet (T110236) [09:11:11] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [09:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:11:41] 06Operations, 10Graphoid, 06Services: Graphoid returns a 400 on MW API time-out - https://phabricator.wikimedia.org/T134237#2259088 (10mobrovac) p:05Triage>03High a:03Yurik @Joe tested the API manually while this problem was occurring and no time-outs happened. Moreover, there was only one instance of... 
[09:22:45] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 07Elasticsearch: Publish "pending_tasks" count from Elastic search cluster to graphite - https://phabricator.wikimedia.org/T134240#2259118 (10Gehel) [09:24:49] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [09:25:07] 06Operations, 10Graphoid, 06Services: graphoid should not use the http proxy to connect to the mediawiki api and other internal services - https://phabricator.wikimedia.org/T134241#2259139 (10Joe) [09:25:12] 06Operations, 10Graphoid, 06Services: graphoid should not use the http proxy to connect to the mediawiki api and other internal services - https://phabricator.wikimedia.org/T134241#2259151 (10Joe) p:05Triage>03High [09:26:11] 06Operations, 10Graphoid, 06Services: Graphoid returns a 400 on MW API time-out - https://phabricator.wikimedia.org/T134237#2259155 (10Joe) >>! In T134237#2259088, @mobrovac wrote: > @Joe tested the API manually while this problem was occurring and no time-outs happened. Moreover, there was only one instance... 
[09:26:41] 06Operations, 10Graphoid, 06Services, 15User-mobrovac: graphoid should not use the http proxy to connect to the mediawiki api and other internal services - https://phabricator.wikimedia.org/T134241#2259139 (10Joe) [09:26:51] <_joe_> mobrovac: ^^ :P [09:26:52] 06Operations: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242#2259160 (10fgiunchedi) [09:28:04] 06Operations: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242#2259173 (10fgiunchedi) furud.codfw.wmnet also experienced high load average, a reboot fixed it (T134098) [09:28:48] 06Operations: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242#2259160 (10MoritzMuehlenhoff) Another occurrence was serpens yesterday morning (which is also listed as having the workaround applied in the etherpad, so that doesn't seem to be effective) [09:29:19] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: puppet fail [09:29:58] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Puppet has 1 failures [09:33:44] RECOVERY - RAID on ms-be1002 is OK: OK: optimal, 13 logical, 13 physical [09:34:34] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [09:35:54] 06Operations, 10ops-eqiad: ms-be1002 has a faulty disk - https://phabricator.wikimedia.org/T134234#2258855 (10fgiunchedi) FTR `swift-drive-audit` takes care of umounting the disk and commenting `/etc/fstab` what's usually left to do is ask megacli to offline the disk and mark it as missing, possibly locating i... 
[09:38:04] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:38:15] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:39:43] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [09:41:00] <_joe_> uhm [09:41:07] <_joe_> checking [09:42:04] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [09:43:53] <_joe_> elukey: seems like aqs having issues [09:44:23] <_joe_> "uri_path":"/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Laplace%27s_demon/daily/2016031400/2016050200 [09:44:26] <_joe_> etc etc [09:44:40] _joe_ checking, but I believe that it must be the outstanding problem with cassandra :( [09:45:19] <_joe_> elukey: it is being hammered by one client AFAICT [09:45:53] <_joe_> "user_agent":"Ruby" [09:46:36] yep yep same thing each time, one single IP with a modest amount of req/sec triggers timeout in cassandra and restbase 500s (https://grafana.wikimedia.org/dashboard/db/pageviews) [09:47:17] we are moving to a new cluster with SSDs and cassandra multi-instance, but we can't do more than that in the meantime [09:47:24] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:47:48] maybe request throttling [09:49:54] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:50:43] interesting, from https://logstash.wikimedia.org/#/dashboard/elasticsearch/analytics-cassandra it seems that AQS was doing sstables compaction [09:54:13] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:55:09] !log testing online index creation on db2018 [09:55:16] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:56:44] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:57:25] the online part works well, but we will see if we will find metadata locking issues [09:59:49] however, it took 200 seconds or so [10:02:43] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2259355 (10fgiunchedi) a:05fgiunchedi>03Papaul @papaul I'm seeing 4x SSD on restbase200[78] (the machines with 1TB samsung) though those should have 5x, there should be 1TB samsun... [10:03:33] !log applying schema change on codfw-s3 db servers T130692 [10:03:34] T130692: Add new indexes from eec016ece6d2b30addcdf3d3efcc2ba59b10e858 to production databases - https://phabricator.wikimedia.org/T130692 [10:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:06:16] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Reinstall and data reload of WDQS servers - https://phabricator.wikimedia.org/T133566#2259384 (10Lydia_Pintscher) [10:06:46] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2259387 (10MoritzMuehlenhoff) Looks good. The ~bpo8 version appendix is normally only used for packages uploaded to the official jessie-backports suite in the D... [10:10:37] Is https://phabricator.wikimedia.org/T134017 a joke? A R3R on Phabricator, with a sockpuppet account once the first account was disabled? oO [10:12:56] (03PS3) 10Volans: MariaDB: Set additional salt grains for core DBs [puppet] - 10https://gerrit.wikimedia.org/r/286303 (https://phabricator.wikimedia.org/T133337) [10:13:30] ^? [10:14:00] I'm 5 hosts-close to delete that code [10:14:16] ??? 
[10:14:19] ah, my fault [10:14:28] it's to be able to salt by shard/master/slave [10:14:28] (03CR) 10jenkins-bot: [V: 04-1] MariaDB: Set additional salt grains for core DBs [puppet] - 10https://gerrit.wikimedia.org/r/286303 (https://phabricator.wikimedia.org/T133337) (owner: 10Volans) [10:14:33] I confused mariadb::core with coredb [10:14:36] :-) [10:14:37] :) [10:15:30] I still don't like it, so for now is a test, because of how $master is used [10:15:36] I agree with the comment, too. Not 100% sure if non-active masters should be masters or not [10:16:10] probably $master should be true and the code should check $::mw_primary too [10:16:24] yes [10:16:38] or even be real masters, including pt-h and all [10:16:51] I am not sure [10:17:19] the arrows are not aligned though, and lint will complain [10:17:42] fixed [10:17:43] (03PS4) 10Volans: MariaDB: Set additional salt grains for core DBs [puppet] - 10https://gerrit.wikimedia.org/r/286303 (https://phabricator.wikimedia.org/T133337) [10:20:26] probably my answer is- whoever makes the patch I agree. 
But let's test that first [10:24:51] !log applying schema change on eqiad-s3 db servers T130692 [10:24:52] T130692: Add new indexes from eec016ece6d2b30addcdf3d3efcc2ba59b10e858 to production databases - https://phabricator.wikimedia.org/T130692 [10:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:26:12] (03PS5) 10Volans: MariaDB: Set additional salt grains for core DBs [puppet] - 10https://gerrit.wikimedia.org/r/286303 (https://phabricator.wikimedia.org/T133337) [10:29:19] <_joe_> I am not sure I like this approach [10:29:29] <_joe_> let me comment on the patch [10:30:27] sure go ahead [10:32:44] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "See comment in the code" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/286303 (https://phabricator.wikimedia.org/T133337) (owner: 10Volans) [10:33:37] giuseppe, I said I was unsure [10:33:53] the issue is that $master=true should not be on puppet in the first place [10:34:19] but on that orchestration you said you wanted to discuss :-) [10:35:01] so both puppet and mediawiki can read it [10:35:06] _joe_: fully agree, that will be fixed by a different patch, at least we can start addressing shards and masters quickly and in a reliable way [10:36:00] and even before $master will go out of puppet we could improve it with $master true for all eqiad and codfw masters and use also $::mw_primary to decide what should run on which master [10:36:14] (active vs passive) [10:36:18] (03PS1) 10Elukey: Add LVS configuration for EventBus in codfw (DNS reverse/wmnet config already in place). [puppet] - 10https://gerrit.wikimedia.org/r/286621 (https://phabricator.wikimedia.org/T121558) [10:36:31] $master is a bad variable name for what initially was "the host that gets writes" [10:36:46] is basically RW :) [10:37:06] but I do not disagree with you [10:37:36] however, I think volans comment will work well after that is changed, right? 
[10:37:56] <_joe_> just comment on the patch volans :) [10:38:41] so technically, your comment is right, but it should go on site.pp [10:38:58] which was committed days ago :-) [10:40:47] (03CR) 10Jcrespo: "I wonder potential issues with shard for non-production hosts. Should we have a core=True and apply it to all including multi-source?" [puppet] - 10https://gerrit.wikimedia.org/r/286303 (https://phabricator.wikimedia.org/T133337) (owner: 10Volans) [10:42:00] !log installing poppler security updates on ocg (and other trusty hosts) [10:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:46:24] (03PS2) 10Elukey: Add LVS configuration for EventBus in codfw (DNS reverse/wmnet config already in place). [puppet] - 10https://gerrit.wikimedia.org/r/286621 (https://phabricator.wikimedia.org/T121558) [10:46:45] (03CR) 10Volans: "@JCrespo: I thought about multi-source too, they use ad different class as of now and they should have the shard grains with "replace => f" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/286303 (https://phabricator.wikimedia.org/T133337) (owner: 10Volans) [10:47:04] * volans brb [10:48:00] (03CR) 10Giuseppe Lavagetto: [C: 031] Add LVS configuration for EventBus in codfw (DNS reverse/wmnet config already in place). [puppet] - 10https://gerrit.wikimedia.org/r/286621 (https://phabricator.wikimedia.org/T121558) (owner: 10Elukey) [11:00:06] 06Operations, 10ops-codfw, 06Analytics-Kanban, 06DC-Ops, and 5 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#2259592 (10elukey) Double checked that health checks work fine: ``` elukey@kafka1001:~$ curl http://kafka2002.codfw.wmnet:8085/v1/topics {"change-prop.retry.change... 
[11:01:38] (03CR) 10John Vandenberg: [C: 031] contint: pypy package for pywikibot [puppet] - 10https://gerrit.wikimedia.org/r/286598 (https://phabricator.wikimedia.org/T134235) (owner: 10Hashar) [11:07:22] 06Operations, 10Traffic: Evaluate Apache Traffic Server - https://phabricator.wikimedia.org/T96853#2259602 (10BBlack) Yes. We decided the varnish 4 migration was the next necessary practical step regardless, but I think at least evaluating other alternatives (which are more open-source-friendly, privacy-frien... [11:08:42] 06Operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 3 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#2259603 (10Lydia_Pintscher) So when I look at http://graphite.wikimedia.org/render/?width=586&height=308&_s... [11:19:50] 06Operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 3 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#2259612 (10BBlack) I really don't think it's specifically Wikidata-related either at this point. Wikidata... 
[11:20:06] (03PS1) 10Giuseppe Lavagetto: hhvm: fix logging dir permissions [puppet] - 10https://gerrit.wikimedia.org/r/286624 [11:34:40] (03PS2) 10Muehlenhoff: Move jobrunner ferm service into the roles [puppet] - 10https://gerrit.wikimedia.org/r/286415 [11:35:12] !log cp1068: upgraded varnish to 3.0.6plus-wm9 [11:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:45:06] PROBLEM - Disk space on cp1068 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=87%) [11:50:28] ^ on that [11:52:46] RECOVERY - Disk space on cp1068 is OK: DISK OK [11:55:27] PROBLEM - Varnishkafka log producer on cp1068 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [12:01:27] (03CR) 10Giuseppe Lavagetto: [C: 031] Changeprop: Set hyperswitch as the start-up module [puppet] - 10https://gerrit.wikimedia.org/r/286153 (owner: 10Mobrovac) [12:02:49] RECOVERY - Varnishkafka log producer on cp1068 is OK: PROCS OK: 3 processes with command name varnishkafka [12:04:14] ----^ checking [12:04:25] (03PS3) 10Giuseppe Lavagetto: Changeprop: Set hyperswitch as the start-up module [puppet] - 10https://gerrit.wikimedia.org/r/286153 (owner: 10Mobrovac) [12:05:31] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Changeprop: Set hyperswitch as the start-up module [puppet] - 10https://gerrit.wikimedia.org/r/286153 (owner: 10Mobrovac) [12:05:45] <_joe_> mobrovac: you've been served [12:05:54] grazie _joe_ [12:07:21] !log Cutting 1.27.0-wmf.23 branches [12:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:07:33] !log Cutting 1.27.0-wmf.23 branches T131557 [12:07:34] T131557: MW-1.27.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T131557 [12:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:09:29] (03PS1) 10Nikerabbit: Translate: Use Apertium via cxserver [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286632 
(https://phabricator.wikimedia.org/T133008) [12:09:47] (03CR) 10jenkins-bot: [V: 04-1] Translate: Use Apertium via cxserver [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286632 (https://phabricator.wikimedia.org/T133008) (owner: 10Nikerabbit) [12:11:49] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [12:15:42] <_joe_> mobrovac: is this ok? [12:15:47] (03CR) 10Mobrovac: [C: 04-1] Change-Prop: Enable summary and definition updates. (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/286539 (owner: 10Ppchelko) [12:16:03] <_joe_> mobrovac: the error on changeprop I mean [12:16:06] _joe_: yes, ignore, i have to wait for puppet to be run everywhere to do a proper deploy [12:16:06] (03PS2) 10Nikerabbit: Translate: Use Apertium via cxserver [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286632 (https://phabricator.wikimedia.org/T133008) [12:16:54] _joe_: on that note, i'm opening an access ticket to be able to run puppet on sc(a|b) [12:17:05] <_joe_> mobrovac: ack [12:22:02] 06Operations, 10Ops-Access-Requests: Allow mobrovac to run puppet on SC(A|B) - https://phabricator.wikimedia.org/T134251#2259756 (10mobrovac) [12:22:15] 06Operations, 10Ops-Access-Requests: Allow mobrovac to run puppet on SC(A|B) - https://phabricator.wikimedia.org/T134251#2259770 (10mobrovac) @GWicke could you please approve? [12:27:50] 06Operations, 10MobileFrontend, 10Traffic, 07Regression, 07user-notice: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2072828 (10hashar) https://gerrit.wikimedia.org/r... 
[12:28:29] !log stopping wdqs-updater as it leaks pipes [12:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:29:18] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [12:30:21] still ignore mode ^^ [12:31:02] !log restarting elasticsearch server elastic2004.codfw.wmnet (T110236) [12:31:03] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [12:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:32:16] PROBLEM - changeprop endpoints health on scb2001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.132, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [12:34:24] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [12:34:33] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:36:59] 06Operations, 10ops-codfw, 06Analytics-Kanban, 06DC-Ops, and 5 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#2259830 (10mobrovac) Note that the topics' names ought to be prefixed with `codfw.`. @elukey I guess you created them by running `./bin/ensure-kafka-topics-exist` ? 
[12:39:24] PROBLEM - changeprop endpoints health on scb2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.48.43, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [12:40:04] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:42:04] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:45:14] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [12:45:41] what? [12:47:12] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review, 05WMF-deploy-2016-05-01_(1.27.0-wmf.23): Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2259832 (10jcrespo) I will setup labs replication ASAP (production side). [12:48:50] (03PS1) 10Elukey: Fix kafkatee's webrequest_{text,upload} since we have now 24 partitions per topic. [puppet] - 10https://gerrit.wikimedia.org/r/286637 [12:49:03] 06Operations: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928#2259836 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [12:51:34] 06Operations, 10ops-eqiad, 06DC-Ops: Decommission cp1037-1040 - https://phabricator.wikimedia.org/T83553#2259841 (10MoritzMuehlenhoff) a:03Cmjohnson [12:51:54] 06Operations, 10ops-eqiad, 06DC-Ops: Decommission cp1037-1040 - https://phabricator.wikimedia.org/T83553#915557 (10MoritzMuehlenhoff) This is resolved, right? These are no longer in site.pp and racktables. [12:52:06] moritzm: For CI: scandium kernel is up-to-date. You can do labnodepool1001.eqiad.wmnet anytime (i.e. now if you want) [12:52:56] 06Operations: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928#2183368 (10hashar) For CI: scandium is up-to-date. You can do labnodepool1001.eqiad.wmnet anytime. 
The reboot will cause a slight delay in replenishing the pool of CI slaves, but that is usually not a concern... [12:55:48] (03PS1) 10Dereckson: Enable NewUserMessage on gu.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286638 (https://phabricator.wikimedia.org/T134253) [12:57:06] (03PS1) 10Jdrewniak: Bumping portals to master. New footer A/B test. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286640 (https://phabricator.wikimedia.org/T133732) [13:00:02] !log Change runtime semi-synchronous replication on eqiad core DBs (s1-s7) to match configured value T131753 [13:00:03] T131753: Semi-synchronous replication status in all MySQL production clusters - https://phabricator.wikimedia.org/T131753 [13:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:01:52] !log restarting wdqs-updater and keeping it under close scrutiny for the moment [13:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:02:26] !log restarting elasticsearch server elastic2005.codfw.wmnet (T110236) [13:02:27] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [13:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:06:46] (03PS1) 10Hashar: Group 0 to php-1.27.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286641 [13:08:22] (03PS1) 10Hashar: admin: gitconfig tweak for hashar [puppet] - 10https://gerrit.wikimedia.org/r/286642 [13:08:46] (03PS2) 10Hashar: Group 0 to php-1.27.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286641 (https://phabricator.wikimedia.org/T131557) [13:09:07] (03PS3) 10Hashar: Group0 to php-1.27.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286641 (https://phabricator.wikimedia.org/T131557) [13:10:32] !log hashar@tin Started scap: testwiki to php-1.27.0-wmf.23 and rebuild l10n cache T131557 [13:10:33] T131557: MW-1.27.0-wmf.23 
deployment blockers - https://phabricator.wikimedia.org/T131557 [13:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:18:20] !log applying schema change on s3-master db T130692 [13:18:21] T130692: Add new indexes from eec016ece6d2b30addcdf3d3efcc2ba59b10e858 to production databases - https://phabricator.wikimedia.org/T130692 [13:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:20:18] hashar: are we on wmf.23 this week? [13:20:59] thought (at least according to https://www.mediawiki.org/wiki/MediaWiki_1.28/Roadmap) we are moving to 1.28.0-wmf.1 this week [13:21:13] aude: unless I have made a mistake: 1.27.0-wmf.23 [13:21:40] we are cutting 1.28 later [13:21:43] either way is ok [13:22:35] https://wikitech.wikimedia.org/wiki/Deployments says wmf23 [13:23:50] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [13:24:12] https://www.mediawiki.org/wiki/MediaWiki_1.28/Roadmap is wrong for sure [13:25:28] i have deleted it [13:26:40] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [13:26:49] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [13:28:43] hashar: thanks [13:28:56] we are making a new branch [13:30:47] ohh [13:31:04] aude: you can just create it in Gerrit [13:31:11] doing, as usual [13:31:16] ;-) [13:31:25] https://gerrit.wikimedia.org/r/#/c/286648/ [13:31:29] sorry :( [13:31:56] or i can updat ethe submodule [13:31:59] aude: note I have already cut the branches so wikidata is using .22 [13:32:02] ok [13:32:04] (or should) [13:32:10] then that is just a submodule update bump [13:32:25] I am doing the sync for testwiki [13:32:33] whenever you get the bump done, we can sync again :-} [13:32:37] this is what happens when we (wikidata) wait until too late :/ [13:32:40] my fault [13:33:09] not so 
bad :-} [13:33:15] (03PS2) 10Giuseppe Lavagetto: hhvm: fix logging dir permissions [puppet] - 10https://gerrit.wikimedia.org/r/286624 [13:33:17] I should probably have confirmed with you whether you wanted a new branch [13:33:21] luckily it is easy [13:33:41] we're suppose to make our branches on monday, ideally [13:34:10] (03CR) 10Giuseppe Lavagetto: [C: 032] hhvm: fix logging dir permissions [puppet] - 10https://gerrit.wikimedia.org/r/286624 (owner: 10Giuseppe Lavagetto) [13:34:24] <_joe_> come on jenkins [13:34:40] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:34:50] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:35:08] (03CR) 10Giuseppe Lavagetto: [V: 032] hhvm: fix logging dir permissions [puppet] - 10https://gerrit.wikimedia.org/r/286624 (owner: 10Giuseppe Lavagetto) [13:36:49] RECOVERY - changeprop endpoints health on scb2002 is OK: All endpoints are healthy [13:37:26] !log hashar@tin Finished scap: testwiki to php-1.27.0-wmf.23 and rebuild l10n cache T131557 (duration: 26m 54s) [13:37:27] T131557: MW-1.27.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T131557 [13:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:38:43] !log Warming up HHVM cache hitting testwiki T1315567 [13:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:38:58] !log Warming up HHVM cache hitting testwiki T131557 [13:38:59] T131557: MW-1.27.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T131557 [13:39:01] yeah wrong bug [13:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:40:09] !log restarting elasticsearch server elastic2006.codfw.wmnet (T110236) [13:40:10] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [13:40:16] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:42:40] PROBLEM - changeprop endpoints health on scb2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.48.43, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:44:39] RECOVERY - changeprop endpoints health on scb2002 is OK: All endpoints are healthy [13:45:37] (03PS1) 10Aude: Remove use of $wmgWikibaseEnableArbitraryAccess [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286651 (https://phabricator.wikimedia.org/T134257) [13:45:38] (03PS1) 10Aude: Remove arbitrary access config from InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286652 (https://phabricator.wikimedia.org/T134257) [13:45:40] (03PS1) 10Aude: Remove arbitraryaccess wikitag from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286653 (https://phabricator.wikimedia.org/T134257) [13:45:42] (03PS1) 10Aude: Remove arbitraryaccess.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286654 (https://phabricator.wikimedia.org/T134257) [13:46:10] RECOVERY - eventlogging-service-eventbus endpoints health on kafka2002 is OK: All endpoints are healthy [13:46:47] ---^ sorry this was me [13:47:09] RECOVERY - eventlogging-service-eventbus endpoints health on kafka2001 is OK: All endpoints are healthy [13:48:17] and also --^, the cluster wasn't healthy [13:51:09] RECOVERY - changeprop endpoints health on scb2001 is OK: All endpoints are healthy [13:51:45] (03CR) 10Ottomata: [C: 031] Add LVS configuration for EventBus in codfw (DNS reverse/wmnet config already in place). 
[puppet] - 10https://gerrit.wikimedia.org/r/286621 (https://phabricator.wikimedia.org/T121558) (owner: 10Elukey) [13:51:52] !log Warming up HHVM cache hitting testwiki /w/api.php T131557 [13:51:53] T131557: MW-1.27.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T131557 [13:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:57:14] (03PS2) 10Mobrovac: Change-Prop: Enable summary and definition updates. [puppet] - 10https://gerrit.wikimedia.org/r/286539 (owner: 10Ppchelko) [13:58:16] (03CR) 10Mobrovac: Change-Prop: Enable summary and definition updates. (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/286539 (owner: 10Ppchelko) [13:59:04] 06Operations, 10ops-codfw, 06Analytics-Kanban, 06DC-Ops, and 5 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#2260066 (10elukey) Created topics with mobrovac, all good. Last step is to enable LVS. [13:59:10] (03CR) 10Giuseppe Lavagetto: [C: 031] Move jobrunner ferm service into the roles [puppet] - 10https://gerrit.wikimedia.org/r/286415 (owner: 10Muehlenhoff) [14:00:32] !log creating sanitarium db filtering (redactatron) for jamwiki [14:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:02:28] (03CR) 10Giuseppe Lavagetto: [C: 031] "Given we agree that this issue should be addressed later, I'm ok with the change" [puppet] - 10https://gerrit.wikimedia.org/r/286303 (https://phabricator.wikimedia.org/T133337) (owner: 10Volans) [14:05:25] (03PS2) 10Elukey: Fix kafkatee's webrequest_{text,upload} since we have now 24 partitions per topic. [puppet] - 10https://gerrit.wikimedia.org/r/286637 [14:05:27] (03CR) 10Ottomata: [C: 031] Fix kafkatee's webrequest_{text,upload} since we have now 24 partitions per topic. 
[puppet] - 10https://gerrit.wikimedia.org/r/286637 (owner: 10Elukey) [14:05:58] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 660 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5582733 keys - replication_delay is 660 [14:08:57] (03CR) 10Elukey: [C: 032] Fix kafkatee's webrequest_{text,upload} since we have now 24 partitions per topic. [puppet] - 10https://gerrit.wikimedia.org/r/286637 (owner: 10Elukey) [14:09:50] (03CR) 10Hoo man: [C: 031] "This is going to enable arbitrary access on all wikiversities, adywiki and jamwiki." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286651 (https://phabricator.wikimedia.org/T134257) (owner: 10Aude) [14:10:43] (03CR) 10Hoo man: [C: 031] Remove arbitrary access config from InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286652 (https://phabricator.wikimedia.org/T134257) (owner: 10Aude) [14:10:52] 06Operations, 10Wikimedia-General-or-Unknown, 07I18n, 07Upstream: Update Malayalam fonts packages - https://phabricator.wikimedia.org/T33950#2260121 (10Praveenp) Sorry, I wish I could help, but I fed up fighting obduracy. I lost track of these things. 
:-( [14:12:00] (03CR) 10Hoo man: [C: 031] Remove arbitraryaccess wikitag from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286653 (https://phabricator.wikimedia.org/T134257) (owner: 10Aude) [14:12:10] !log changed kafkatee's config on oxygen to watch all the webrequest text|upload kafka partitions (analytics doubled them recently) [14:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:12:34] (03CR) 10Hoo man: [C: 031] Remove arbitraryaccess.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286654 (https://phabricator.wikimedia.org/T134257) (owner: 10Aude) [14:23:13] 06Operations: decom argon - https://phabricator.wikimedia.org/T134223#2260139 (10Dzahn) [14:25:17] mutante: https://phabricator.wikimedia.org/T134247 [14:36:29] !log restarting elasticsearch server elastic2007.codfw.wmnet (T110236) [14:36:30] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [14:36:30] 06Operations, 10hardware-requests: eqiad: (3) nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2260203 (10Ottomata) @JAllemandou @milimetric, can you comment on desired partition layout for this? I'm going to guess a small `/` partition, and then the rest of space (lvm?) on RAID 10 acro... [14:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:42:43] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2260210 (10ezachte) This task is set to done, but my home dir hasn't been restored yet, and https://stats.wikimedia.org/cgi-bin/search_portal.pl?search=views produces an access e... 
[14:43:22] (03PS1) 10Ottomata: Alter role::kafka::analytics::broker to be able to use confluent module during upgrade [puppet] - 10https://gerrit.wikimedia.org/r/286660 (https://phabricator.wikimedia.org/T121562) [14:43:55] 06Operations, 10hardware-requests: eqiad: (3) nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2260216 (10JAllemandou) @Ottomata: I agree on having a 30G / across disks, and my guess is that having one lvm on RAID10 for the rest would be fine (but I'm no Druid expert). [14:50:36] (03PS2) 10Ottomata: Alter role::kafka::analytics::broker to be able to use confluent module during upgrade [puppet] - 10https://gerrit.wikimedia.org/r/286660 (https://phabricator.wikimedia.org/T121562) [14:53:38] !log hashar@tin Synchronized php-1.27.0-wmf.23/extensions/Wikidata: wikidata to .23 https://gerrit.wikimedia.org/r/#/c/286657/ (duration: 02m 17s) [14:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:54:13] did we push a change today to kill image srcsets? [14:55:40] (03PS1) 10Aude: Disable Wikibase on beta.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286661 [14:55:42] (03PS1) 10Aude: Enable data access on Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286662 (https://phabricator.wikimedia.org/T134266) [14:55:50] I can jump in to take this SWAT if there aren't any objections. [14:55:52] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2260239 (10elukey) Hi @ezachte, my bad, working on it now. 
[14:56:11] thcipriani: i can do swat [14:56:17] * aude about to add a bunch of patches [14:56:24] or you can do the ones that are not mine [14:57:05] (03PS1) 10Filippo Giunchedi: cassandra: add restbase200[789] instances [dns] - 10https://gerrit.wikimedia.org/r/286663 (https://phabricator.wikimedia.org/T132976) [14:57:16] aude: yeah, I was just going to demo some things to some folks here. If I could grab the other patches, that'd be good. [14:57:27] (reading offsite) [14:57:53] ok [14:58:35] you can do mine also if youwant [14:58:41] i'm editing the wiki now [14:59:07] Hello [14:59:24] aude: you have a patch for Wikiversity/Wikidata too? [14:59:28] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase200[789] instances [dns] - 10https://gerrit.wikimedia.org/r/286663 (https://phabricator.wikimedia.org/T132976) (owner: 10Filippo Giunchedi) [14:59:49] Dereckson: i do [14:59:51] :) [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160503T1500). [15:00:04] Dereckson jan_drewniak: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:06] Perfect, we're on schedule in [[Deployments]] in this case. [15:01:34] aude: ok, I'll get out the first two and then let you take the wikibase things? 
[15:02:15] RECOVERY - NTP on restbase2004 is OK: NTP OK: Offset 0.004657387733 secs [15:02:23] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286638 (https://phabricator.wikimedia.org/T134253) (owner: 10Dereckson) [15:02:46] ok [15:02:48] (03Merged) 10jenkins-bot: Enable NewUserMessage on gu.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286638 (https://phabricator.wikimedia.org/T134253) (owner: 10Dereckson) [15:03:10] jan_drewniak: ping for swat [15:03:17] o/ [15:03:19] o/ [15:05:21] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable NewUserMessage on gu.wikiquote [[gerrit:286638]] (duration: 00m 33s) [15:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:05:28] ^ Dereckson check please [15:06:29] Looks good according Special:Version, and a test edit. Can't test further as long someone doesn't create an account specifically for this wiki. Looks good so far so. 
[15:06:45] Dereckson: ok, thanks for checking [15:07:41] (03CR) 10Hashar: [C: 031] "Proposed to Puppet Swat of May 3rd 16:00–17:00 UTC" [puppet] - 10https://gerrit.wikimedia.org/r/286484 (owner: 10Hashar) [15:07:45] (03CR) 10Hashar: [C: 031] "Proposed to Puppet Swat of May 3rd 16:00–17:00 UTC" [puppet] - 10https://gerrit.wikimedia.org/r/286642 (owner: 10Hashar) [15:07:49] (03CR) 10Hashar: [C: 031] "Proposed to Puppet Swat of May 3rd 16:00–17:00 UTC" [puppet] - 10https://gerrit.wikimedia.org/r/286598 (https://phabricator.wikimedia.org/T134235) (owner: 10Hashar) [15:08:00] (03CR) 10Hashar: [C: 031] "Proposed to Puppet Swat of May 3rd 16:00–17:00 UTC" [puppet] - 10https://gerrit.wikimedia.org/r/282323 (https://phabricator.wikimedia.org/T114421) (owner: 10Hashar) [15:08:05] (03CR) 10Hashar: [C: 031] "Proposed to Puppet Swat of May 3rd 16:00–17:00 UTC" [puppet] - 10https://gerrit.wikimedia.org/r/282322 (owner: 10Hashar) [15:08:45] !log restarting elasticsearch server elastic2008.codfw.wmnet (T110236) [15:08:46] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [15:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:08:52] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286640 (https://phabricator.wikimedia.org/T133732) (owner: 10Jdrewniak) [15:09:04] (03PS2) 10Thcipriani: Bumping portals to master. New footer A/B test. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286640 (https://phabricator.wikimedia.org/T133732) (owner: 10Jdrewniak) [15:10:03] (03CR) 10Thcipriani: Bumping portals to master. New footer A/B test. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/286640 (https://phabricator.wikimedia.org/T133732) (owner: 10Jdrewniak) [15:10:12] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286640 (https://phabricator.wikimedia.org/T133732) (owner: 10Jdrewniak) [15:10:26] ff-only mediawiki-config things. [15:10:44] ah, that will be annoying but suppose the rebase button is enough [15:11:05] (03Merged) 10jenkins-bot: Bumping portals to master. New footer A/B test. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286640 (https://phabricator.wikimedia.org/T133732) (owner: 10Jdrewniak) [15:12:28] running sync-portals now [15:13:09] !log thcipriani@tin Synchronized portals/prod/wikipedia.org/assets: (no message) (duration: 00m 37s) [15:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:13:40] !log thcipriani@tin Synchronized portals: (no message) (duration: 00m 30s) [15:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:13:50] ^ jan_drewniak check please [15:14:40] aude: all yours :) [15:14:41] thcipriani: looks good, thanks! [15:14:47] jan_drewniak: thanks for checking! [15:15:37] !log Change runtime semi-synchronous replication on eqiad x1,es2,es3 to match configured value T131753 [15:15:38] T131753: Semi-synchronous replication status in all MySQL production clusters - https://phabricator.wikimedia.org/T131753 [15:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:15:53] thcipriani: ok [15:16:28] mutante, looks like udpmxircecho is broken on kraz! 
[15:17:50] * aude grabs more coffee before i start :) [15:20:21] (03PS1) 10Filippo Giunchedi: cassandra: add restbase200[789] [puppet] - 10https://gerrit.wikimedia.org/r/286665 (https://phabricator.wikimedia.org/T132976) [15:21:08] (03CR) 10Aude: [C: 032] Remove use of $wmgWikibaseEnableArbitraryAccess [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286651 (https://phabricator.wikimedia.org/T134257) (owner: 10Aude) [15:21:58] (03PS2) 10Aude: Remove use of $wmgWikibaseEnableArbitraryAccess [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286651 (https://phabricator.wikimedia.org/T134257) [15:22:02] (03PS1) 10Muehlenhoff: Update to 1.0.2h [debs/openssl] - 10https://gerrit.wikimedia.org/r/286666 [15:24:55] (03CR) 10Aude: [C: 032] Remove use of $wmgWikibaseEnableArbitraryAccess [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286651 (https://phabricator.wikimedia.org/T134257) (owner: 10Aude) [15:25:32] (03Merged) 10jenkins-bot: Remove use of $wmgWikibaseEnableArbitraryAccess [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286651 (https://phabricator.wikimedia.org/T134257) (owner: 10Aude) [15:27:49] !log aude@tin Synchronized wmf-config/Wikibase.php: Remove use of $wmgWikibaseEnableArbitraryAccess setting (duration: 00m 28s) [15:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:28:11] !log "systemctl restart ircecho" on kraz [15:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:29:56] looks ok so far [15:30:17] 06Operations, 10Wikimedia-IRC-RC-Server: RC stream is broken over IRC - https://phabricator.wikimedia.org/T134247#2259656 (10faidon) [15:30:20] (03PS2) 10Aude: Remove arbitrary access config from InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286652 (https://phabricator.wikimedia.org/T134257) [15:32:25] (03CR) 10Aude: [C: 032] "double and triple checked that $wmgWikibaseEnableArbitraryAccess is no longer used anywhere" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/286652 (https://phabricator.wikimedia.org/T134257) (owner: 10Aude) [15:32:50] (03Merged) 10jenkins-bot: Remove arbitrary access config from InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286652 (https://phabricator.wikimedia.org/T134257) (owner: 10Aude) [15:34:03] !log aude@tin Synchronized wmf-config/InitialiseSettings.php: Remove $wmgWikibaseEnableArbitraryAccess setting (duration: 00m 26s) [15:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:31] looks good [15:36:25] (03PS2) 10Aude: Remove arbitraryaccess wikitag from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286653 (https://phabricator.wikimedia.org/T134257) [15:36:28] (03PS2) 10Filippo Giunchedi: cassandra: add restbase200[789] [puppet] - 10https://gerrit.wikimedia.org/r/286665 (https://phabricator.wikimedia.org/T132976) [15:36:35] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase200[789] [puppet] - 10https://gerrit.wikimedia.org/r/286665 (https://phabricator.wikimedia.org/T132976) (owner: 10Filippo Giunchedi) [15:37:06] (03CR) 10Aude: [C: 032] Remove arbitraryaccess wikitag from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286653 (https://phabricator.wikimedia.org/T134257) (owner: 10Aude) [15:37:39] 06Operations, 10Wikimedia-IRC-RC-Server: RC stream is broken over IRC - https://phabricator.wikimedia.org/T134247#2259656 (10faidon) OK, so recent changes messages were not flowing to kraz. Debugging revealed that udpmxircecho.py did not listen on 9390. `systemctl status ircecho` showed the following exception... 
[15:38:19] (03Merged) 10jenkins-bot: Remove arbitraryaccess wikitag from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286653 (https://phabricator.wikimedia.org/T134257) (owner: 10Aude) [15:39:23] !log aude@tin Synchronized wmf-config/CommonSettings.php: Remove arbitraryaccess wikitag (duration: 00m 26s) [15:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:39:48] (03PS2) 10Aude: Remove arbitraryaccess.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286654 (https://phabricator.wikimedia.org/T134257) [15:39:56] (03CR) 10Aude: [C: 032] Remove arbitraryaccess.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286654 (https://phabricator.wikimedia.org/T134257) (owner: 10Aude) [15:40:25] (03Merged) 10jenkins-bot: Remove arbitraryaccess.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286654 (https://phabricator.wikimedia.org/T134257) (owner: 10Aude) [15:40:52] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review, 05WMF-deploy-2016-05-01_(1.27.0-wmf.23): Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2260402 (10jcrespo) I set up labs replication filters. [15:40:59] Is Greg working this week? [15:41:13] If not, who is filling in? 
[15:41:32] !log aude@tin Synchronized dblists/: Remove arbitraryaccess.dblist (duration: 00m 25s) [15:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:41:52] (03PS2) 10Aude: Disable Wikibase on beta.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286661 [15:42:01] (03CR) 10Aude: [C: 032] Disable Wikibase on beta.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286661 (owner: 10Aude) [15:42:12] now the easier patches [15:42:16] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5578820 keys - replication_delay is 0 [15:42:17] hoo: i think greg is on leave [15:42:32] Yeah, I think so as well... that's why I'm asking [15:42:39] dunno who's filling in [15:42:41] hoo: Greg is out. poke ostriches [15:42:56] thcipriani: Ok, will mail him [15:42:57] (03Merged) 10jenkins-bot: Disable Wikibase on beta.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286661 (owner: 10Aude) [15:42:58] thanks [15:43:00] Yo yo [15:43:05] What's shakin bacon? [15:43:42] (03Abandoned) 10Alex Monk: Add jamwiki to restbase and labs DNS configs [puppet] - 10https://gerrit.wikimedia.org/r/286553 (https://phabricator.wikimedia.org/T134017) (owner: 10Alex Monk) [15:44:08] 06Operations, 10Wikimedia-IRC-RC-Server: Replace ircd-ratbox with something newer/maintained - https://phabricator.wikimedia.org/T134271#2260405 (10faidon) [15:44:18] !log aude@tin Synchronized dblists/wikidataclient.dblist: Remove beta.wikiversity as a client. 
See T54971 (duration: 00m 25s) [15:44:19] T54971: Sitelinks to Incubator, OldWikisource and BetaWikiversity - https://phabricator.wikimedia.org/T54971 [15:44:21] ostriches: We want to deploy ArticlePlaceholders to four Wikipedias next Wednesday [15:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:44:46] (03PS2) 10Aude: Enable data access on Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286662 (https://phabricator.wikimedia.org/T134266) [15:44:54] (03CR) 10Aude: [C: 032] Enable data access on Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286662 (https://phabricator.wikimedia.org/T134266) (owner: 10Aude) [15:45:23] The extension already is on beta and testwiki [15:45:27] (03Merged) 10jenkins-bot: Enable data access on Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286662 (https://phabricator.wikimedia.org/T134266) (owner: 10Aude) [15:45:44] and there is consensus from the four wikipedia communities [15:45:51] to try it [15:45:52] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2260441 (10elukey) [15:45:57] Exactly [15:46:22] https://phabricator.wikimedia.org/T117965#2256422 [15:47:24] !log aude@tin Synchronized wmf-config/InitialiseSettings.php: Enable data access for Wikiversity (duration: 00m 38s) [15:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:47:53] it's a kitten: https://en.wikiversity.org/wiki/User:Aude :D [15:48:02] :D :) [15:48:04] data access looks good [15:48:10] swat is over [15:48:10] aude, hoo: Consensus? Already deployed? Next week? Don't see why not. 
[15:48:20] ostriches: Great, thanks [15:48:40] I'll sign it up on wikitech once the schedule for next week is in place [15:48:46] !log restarting elasticsearch server elastic2009.codfw.wmnet (T110236) [15:48:46] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [15:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:52:46] ostriches: good morning! I got 1.27.0-wmf.23 cut and sec patches applied properly. It is running on testwiki! [15:54:14] Yay! [15:55:31] (03PS1) 10Andrew Bogott: Split centralized labs pdns database into two different local DBs. [puppet] - 10https://gerrit.wikimedia.org/r/286670 (https://phabricator.wikimedia.org/T128737) [15:57:10] jouncebot: next [15:57:10] In 0 hour(s) and 2 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160503T1600) [15:57:43] (03PS1) 10Jcrespo: [WIP] Fix grants for labs pdns [puppet] - 10https://gerrit.wikimedia.org/r/286671 (https://phabricator.wikimedia.org/T128737) [15:57:55] (03PS2) 10Jcrespo: [WIP] Fix grants for labs pdns [puppet] - 10https://gerrit.wikimedia.org/r/286671 (https://phabricator.wikimedia.org/T128737) [16:00:04] godog moritzm _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160503T1600). Please do the needful. [16:00:04] Krenair hashar coreyfloyd: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:09] o/ [16:00:25] <- doing puppetswat, since others are busy [16:00:37] 06Operations, 10ops-eqiad: rack/setup/deploy 3 eqiad druid nodes - https://phabricator.wikimedia.org/T134275#2260510 (10RobH) [16:01:15] Max 8 patches, there are 9 [16:01:30] there are a bunch of super trivial ones [16:01:39] we'll see how far we get! 
[16:01:45] so https://gerrit.wikimedia.org/r/#/c/283227/ [16:01:51] ^we have the guity for #9 there^ [16:01:59] :-) [16:02:03] krenair wrote some diamond monitor for the keyholder [16:02:06] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 638 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5580542 keys - replication_delay is 638 [16:02:20] which got cherry picked on beta to confirm diamond collector works fine [16:02:28] and yuvi offered advice [16:02:37] Krenair: is active elsewhere, so I assume he's here. the first patches are listed under him [16:02:44] so the bit is about having it enabled on shin ken which is more or less automatic [16:02:47] no impact on prod ;-) [16:02:49] 06Operations, 10ops-eqiad: rack/setup/deploy 3 eqiad druid nodes - https://phabricator.wikimedia.org/T134275#2260533 (10Ottomata) @Ottomata: So this is the ordering procurement task, they likely cannot see the contents of this ticket. Any setup info should be split onto an independent setup task, now created... 
[16:02:50] I am here [16:03:05] (03PS3) 10BBlack: Add jam.wikipedia to RESTBase and Labs dnsrecursor [puppet] - 10https://gerrit.wikimedia.org/r/286278 (https://phabricator.wikimedia.org/T134017) (owner: 10Dereckson) [16:03:37] 06Operations, 10ops-eqiad: rack/setup/deploy 3 eqiad druid nodes - https://phabricator.wikimedia.org/T134275#2260536 (10RobH) [16:03:47] bblack: Krenair: i'm also here to assist with restbase restarts [16:03:59] 06Operations, 10hardware-requests: eqiad: (3) nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2260539 (10RobH) [16:04:00] (03CR) 10Jcrespo: [C: 031] Add jam.wikipedia to RESTBase and Labs dnsrecursor [puppet] - 10https://gerrit.wikimedia.org/r/286278 (https://phabricator.wikimedia.org/T134017) (owner: 10Dereckson) [16:04:01] 06Operations, 10ops-eqiad: rack/setup/deploy 3 eqiad druid nodes - https://phabricator.wikimedia.org/T134275#2260510 (10RobH) [16:04:03] 06Operations, 10ops-eqiad: Rack and setup (3) Druid Nodes in eqiad - https://phabricator.wikimedia.org/T134276#2260540 (10Cmjohnson) [16:04:04] does the jam patch need restarts elsewhere? [16:04:13] 06Operations, 10ops-eqiad: Rack and setup (3) Druid Nodes in eqiad - https://phabricator.wikimedia.org/T134276#2260553 (10Cmjohnson) @JAllemandou @Milimetric, can you comment on desired partition layout for this? I'm going to guess a small / partition, and then the rest of space (lvm?) on RAID 10 across all 4... [16:04:17] (03PS2) 10Andrew Bogott: Split centralized labs pdns database into two different local DBs. 
[puppet] - 10https://gerrit.wikimedia.org/r/286670 (https://phabricator.wikimedia.org/T128737) [16:04:21] "the jam patch" lol [16:04:30] (03CR) 10BBlack: [C: 032] Add jam.wikipedia to RESTBase and Labs dnsrecursor [puppet] - 10https://gerrit.wikimedia.org/r/286278 (https://phabricator.wikimedia.org/T134017) (owner: 10Dereckson) [16:04:48] !log merging https://gerrit.wikimedia.org/r/286278 (puppetswat) [16:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:05:12] mobrovac: good to go for RB if there's a need to do something manual there [16:05:14] 06Operations, 10ops-eqiad: Rack and setup (3) Druid Nodes in eqiad - https://phabricator.wikimedia.org/T134276#2260560 (10Cmjohnson) [16:05:31] bblack: could you just force a puppet run on rb nodes for me pleasE? [16:06:22] sure [16:06:54] mobrovac: does it need to be staggered? [16:07:06] nah, just force it [16:07:11] i'll rolling-restart then [16:07:18] the only thing that worries me is the apache change for potential breakage [16:09:09] and the sudo for privileges escalation [16:09:19] mobrovac: done [16:09:27] kk, restarting [16:09:33] mobrovac: well, wait [16:09:43] well, waiting [16:09:48] several rb hosts have puppet disabled... [16:09:55] that must be staging [16:10:04] xenon and friends? [16:10:14] restbase200[789] [16:10:20] xenon is a restbase host? :) [16:10:39] rot13 in action maybe? [16:10:49] yeah, 200[789] are not in use yet bblack afaik [16:10:53] ok [16:11:06] well I ran the agent on restbase*, and it skipped those 3 [16:11:16] keep going if that's all you need [16:11:18] !log restabse restarting after https://gerrit.wikimedia.org/r/#/c/286278/ [16:11:19] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2260618 (10Papaul) a:05Papaul>03fgiunchedi @fgiunchedi I added the disks to restbase2007 and restbase2008. 
Both have now 5 SSDs [16:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:30] yeah restbase200[789] are being provisioned [16:11:32] bblack: there you go for 200[789] ^ [16:12:09] wait, does it sudo as a user to run python, which popens a sudo? [16:12:11] the keyholder patch looks dangerous at first glance, with nrpe doing sudo or whatever heh [16:13:05] I do not like that, I would stall it until later to check it more carefully [16:13:45] (03CR) 10Andrew Bogott: [WIP] Fix grants for labs pdns (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/286671 (https://phabricator.wikimedia.org/T128737) (owner: 10Jcrespo) [16:13:46] let's at least skip over it for now and see how we do on time and 8 patches and all [16:14:03] yes, I am not saying forever, just for now [16:14:05] nrpe was already doing sudo, this lets diamond do it as well [16:15:21] (03PS2) 10BBlack: Redirect yue.wikipedia.org to zh-yue.wikipedia.org for now [puppet] - 10https://gerrit.wikimedia.org/r/285086 (https://phabricator.wikimedia.org/T105999) (owner: 10Alex Monk) [16:16:34] bblack: Krenair: all good for jam, https://jam.wikipedia.org/api/rest_v1/?doc is showing up [16:16:35] it's not using sudo to run python [16:16:40] thanks mobrovac [16:16:49] (03CR) 10BBlack: [C: 032 V: 032] Redirect yue.wikipedia.org to zh-yue.wikipedia.org for now [puppet] - 10https://gerrit.wikimedia.org/r/285086 (https://phabricator.wikimedia.org/T105999) (owner: 10Alex Monk) [16:17:12] !log merging https://gerrit.wikimedia.org/r/285086 for puppetswat (apache redirects change) [16:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:17:58] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5582195 keys - replication_delay is 0 [16:19:03] still waiting to see it work ok on 1x appserver, just a bit [16:19:53] Krenair: back.. 
[16:20:14] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review: eqiad: Rack and setup new labstore - https://phabricator.wikimedia.org/T133397#2260659 (10chasemp) @cmjohnson I can handle: - update install_server module - install OS if you can finish racktables + label then assign this to me? I want to run thr... [16:20:29] works [16:20:31] Krenair, oh, I see, it only grants sudo to execute /usr/lib/nagios/plugins/check_keyholder [16:20:33] Krenair: what happened? i see it running normal [16:20:57] mutante, yeah it was restarted [16:20:57] (03PS2) 10BBlack: Set up yue.wikipedia.org DNS record [dns] - 10https://gerrit.wikimedia.org/r/285085 (https://phabricator.wikimedia.org/T105999) (owner: 10Alex Monk) [16:21:06] jynus, yup [16:22:16] (03CR) 10BBlack: [C: 032] Set up yue.wikipedia.org DNS record [dns] - 10https://gerrit.wikimedia.org/r/285085 (https://phabricator.wikimedia.org/T105999) (owner: 10Alex Monk) [16:22:40] !log authdns-update for https://gerrit.wikimedia.org/r/#/c/285085/ (yue lang) + workarounds from https://phabricator.wikimedia.org/T97051#1994679 [16:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:22:59] Krenair: what were the symptoms? did it stay online but stop talking? 
[16:23:16] mutante, ircd stayed up, but rc-pmtpa was not logged in [16:23:43] hm,ok [16:23:48] yue.wikipedia.org has address 208.80.153.224 [16:23:48] mutante, https://phabricator.wikimedia.org/T134247#2260394 [16:24:07] hashar: on to your list for now [16:24:07] sounds correct [16:24:33] so we used to have postgres installed, but it never got used https://gerrit.wikimedia.org/r/#q,286484,n,z does the cleanup [16:24:59] for sudo rights [16:25:00] (03PS2) 10BBlack: admin: contint-admins no more need postgres [puppet] - 10https://gerrit.wikimedia.org/r/286484 (owner: 10Hashar) [16:25:13] -rights seems pretty safe to me :) [16:25:28] https://gerrit.wikimedia.org/r/#q,286642,n,z is to set a online conf in my ~/.gitconfig which is used when we cut wmf branch from tin [16:25:36] (03CR) 10BBlack: [C: 032 V: 032] admin: contint-admins no more need postgres [puppet] - 10https://gerrit.wikimedia.org/r/286484 (owner: 10Hashar) [16:25:43] referenced at https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Create_the_new_branch_in_gerrit [16:25:53] !log merging https://gerrit.wikimedia.org/r/286484 for puppetswat [16:25:54] hashar, I assume postgress itself is still in use? [16:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:26:13] jynus: it has never been used and got purged from all CI hosts months ago. 
[16:26:15] Krenair: hard to understand why it works on old server [16:26:22] (03PS2) 10BBlack: admin: gitconfig tweak for hashar [puppet] - 10https://gerrit.wikimedia.org/r/286642 (owner: 10Hashar) [16:26:23] oh, thanks [16:26:29] (03CR) 10BBlack: [C: 032 V: 032] admin: gitconfig tweak for hashar [puppet] - 10https://gerrit.wikimedia.org/r/286642 (owner: 10Hashar) [16:26:29] jynus: the associated sudo rule got forgotten, and it is not needed any way since we have full root on labs instances [16:26:35] sure [16:26:44] !log merging https://gerrit.wikimedia.org/r/286642 for puppetswat [16:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:27:14] https://gerrit.wikimedia.org/r/#q,286598,n,z adds pypy ( a python interpreter with JIT ) that is for labs instances [16:27:33] (03PS2) 10BBlack: contint: pypy package for pywikibot [puppet] - 10https://gerrit.wikimedia.org/r/286598 (https://phabricator.wikimedia.org/T134235) (owner: 10Hashar) [16:27:36] there is apparently some volunteer interested in running the pywikibot framework using pypy [16:27:59] hashar, yours were indeed trivial [16:28:09] !log merging https://gerrit.wikimedia.org/r/286598 for puppetswat [16:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:28:26] (03CR) 10BBlack: [C: 032 V: 032] contint: pypy package for pywikibot [puppet] - 10https://gerrit.wikimedia.org/r/286598 (https://phabricator.wikimedia.org/T134235) (owner: 10Hashar) [16:28:28] https://gerrit.wikimedia.org/r/#/c/282322/ that one remove legacy cruft from the production slave gallium . 
No more jobs running on there [16:28:38] puppet compile is all happy, or at least it managed to compile the manifest ;-} [16:29:04] will run puppet on gallium to confirm it is all happy [16:29:16] jynus: yeah I am doing far too many trivial patches overflowing opsen ;-} [16:29:23] no [16:29:29] actually that is good [16:29:40] that is what puppet swat was about [16:30:16] !log uploaded openssl 1.0.2h for jessie-wikimedia to carbon [16:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:30:31] (03PS5) 10BBlack: contint: clean up role::ci::slave [puppet] - 10https://gerrit.wikimedia.org/r/282322 (owner: 10Hashar) [16:31:39] yup giuseppe suggested to pile them up in the swat slot [16:31:46] that is more convenient for everyone i guess [16:32:08] (03CR) 10BBlack: [C: 032] contint: clean up role::ci::slave [puppet] - 10https://gerrit.wikimedia.org/r/282322 (owner: 10Hashar) [16:32:18] !log merging https://gerrit.wikimedia.org/r/#/c/282322 for puppetswat [16:32:21] (03PS4) 10Hashar: contint: move npmtravis out of prod slave [puppet] - 10https://gerrit.wikimedia.org/r/282323 (https://phabricator.wikimedia.org/T114421) [16:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:32:42] running the next one in the puppet compiler [16:33:28] running puppet on gallium [16:33:39] the postgre related sudo rules are gone \O/ [16:34:43] (03PS3) 10Jcrespo: [WIP] Fix grants for labs pdns [puppet] - 10https://gerrit.wikimedia.org/r/286671 (https://phabricator.wikimedia.org/T128737) [16:35:00] Oh yeah, we were gonna use postgres for testing at one point [16:35:03] I forgot that [16:35:04] (03PS4) 10Jcrespo: [WIP] Fix grants for labs pdns [puppet] - 10https://gerrit.wikimedia.org/r/286671 (https://phabricator.wikimedia.org/T128737) [16:35:12] (03CR) 10Hashar: "Compiled at https://puppet-compiler.wmflabs.org/2656/" [puppet] - 10https://gerrit.wikimedia.org/r/282323 (https://phabricator.wikimedia.org/T114421) 
(owner: 10Hashar) [16:35:22] bblack: more removal of unused stuff https://gerrit.wikimedia.org/r/#/c/282323/ :-} [16:35:26] (03CR) 10BBlack: [C: 032] contint: move npmtravis out of prod slave [puppet] - 10https://gerrit.wikimedia.org/r/282323 (https://phabricator.wikimedia.org/T114421) (owner: 10Hashar) [16:35:28] compile is happy [16:35:36] !log merging https://gerrit.wikimedia.org/r/282323 for puppetswat [16:35:38] and apache is all happy after "282322 contint: clean up role::ci::slave" [16:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:35:48] thanks to the puppet compiler! [16:35:52] :) [16:36:00] ostriches, do not worry, we still use a completely different database in test and in production :-D [16:36:10] it highlighted how apache mod headers end up being removed iirc [16:36:52] jynus: Consistency is boring and overrated :P [16:37:41] yeah ok so back on the keyholder thing [16:38:02] so, there's an existing NRPE that does sudo /usr/lib/nagios/plugins/check_keyholder [16:38:07] and now diamond can do the same thing [16:38:10] jynus: bblack: all my patches are all fine 100 % puppet agent -tv approved. Thank you very much [16:38:11] right? [16:38:18] yes [16:38:26] if there is an issue, it is existing [16:38:52] e.g. 
sudo should not be used in the first place/a special user [16:39:18] yeah it's kind of ugly, but I don't think it makes it any worse [16:39:24] +1 [16:39:34] (03PS1) 10Giuseppe Lavagetto: imagemagick: create class imagemagick::install, use on librenms [puppet] - 10https://gerrit.wikimedia.org/r/286676 [16:39:36] (03PS1) 10Giuseppe Lavagetto: mediawiki: install imagemagick policy [puppet] - 10https://gerrit.wikimedia.org/r/286677 [16:39:36] but do not trust me [16:39:38] (03PS1) 10Giuseppe Lavagetto: ocg: add imagemagick policy [puppet] - 10https://gerrit.wikimedia.org/r/286678 [16:39:40] (03PS1) 10Giuseppe Lavagetto: imagemagick: install policy everywhere [puppet] - 10https://gerrit.wikimedia.org/r/286679 [16:39:46] (03PS12) 10BBlack: deployment-prep: keyholder shinken monitoring [puppet] - 10https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) (owner: 10Alex Monk) [16:39:50] it should have appropriate permission by just running as "keyholder" user [16:39:57] yes [16:40:01] <_joe_> csteipp: ^^ [16:40:28] I have no idea what user diamond runs under, maybe it has the appropriate sudo rule [16:40:31] <_joe_> csteipp: we must wait for puppet swat to be over, though [16:40:32] you mean we could fix nrpe+diamond by having them sudo to keyholder instead? [16:40:55] <_joe_> bblack: if a patch is somewhat debatable, skip it [16:40:58] (03PS1) 10Cmjohnson: Adding mac addresses to dhcpd file for mw1261-mw1283 [puppet] - 10https://gerrit.wikimedia.org/r/286680 [16:41:02] <_joe_> we can take care of it separately [16:41:02] at least on beta, the files are owned by keyholder [16:41:23] _joe_: I've done the other 8 and skipped over this one, now circling back: https://gerrit.wikimedia.org/r/#/c/283227 [16:41:42] <_joe_> bblack: ok so I'm free to merge at least one change of mine in the meanwhile? 
[16:41:45] it wasn't the one that bumped the limit, but it looked a little scary [16:41:47] <_joe_> they're kind of important [16:41:53] _joe_: go for it [16:42:01] (03PS1) 10Elukey: Add Apache mod-cgi to stat1001 configuration to allow Perl CGI scripts. [puppet] - 10https://gerrit.wikimedia.org/r/286681 (https://phabricator.wikimedia.org/T76348) [16:42:11] oh way [16:42:16] <_joe_> csteipp: a +1 on the first change as far as the policy file is concerned would be greatly appreciated [16:42:19] that is the sudo rule https://gerrit.wikimedia.org/r/#/c/283227/12/modules/keyholder/manifests/monitoring.pp,cm :) [16:42:19] <_joe_> :) [16:42:29] (03CR) 10CSteipp: [C: 031] imagemagick: create class imagemagick::install, use on librenms [puppet] - 10https://gerrit.wikimedia.org/r/286676 (owner: 10Giuseppe Lavagetto) [16:42:49] while you're doing that I'm going to poke at something else important too, brb [16:42:51] LGTM. [16:43:15] (03CR) 10Hashar: deployment-prep: keyholder shinken monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) (owner: 10Alex Monk) [16:43:17] (03CR) 10Cmjohnson: [C: 032] Adding mac addresses to dhcpd file for mw1261-mw1283 [puppet] - 10https://gerrit.wikimedia.org/r/286680 (owner: 10Cmjohnson) [16:43:19] (03CR) 10Giuseppe Lavagetto: "PERL ALERT" [puppet] - 10https://gerrit.wikimedia.org/r/286681 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey) [16:43:34] I have no idea what user diamond runs under, maybe it has the appropriate sudo rule [16:43:38] diamond [16:43:39] !log restarting elasticsearch server elastic2010.codfw.wmnet (T110236) [16:43:39] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [16:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:44:39] ((( it is 2016 and folks still write perl / cgi web interface! 
))) [16:45:37] <_joe_> hashar: hence my PERL ALERT [16:46:00] (03PS2) 10Giuseppe Lavagetto: imagemagick: create class imagemagick::install, use on librenms [puppet] - 10https://gerrit.wikimedia.org/r/286676 [16:46:18] <_joe_> elukey: seriously, don't merge that change without thorough thinking [16:46:28] (03CR) 10Giuseppe Lavagetto: [C: 032] imagemagick: create class imagemagick::install, use on librenms [puppet] - 10https://gerrit.wikimedia.org/r/286676 (owner: 10Giuseppe Lavagetto) [16:46:38] (03CR) 10Giuseppe Lavagetto: [V: 032] imagemagick: create class imagemagick::install, use on librenms [puppet] - 10https://gerrit.wikimedia.org/r/286676 (owner: 10Giuseppe Lavagetto) [16:47:30] oh and I can totally use imagemagick::install -} [16:47:40] (03PS1) 10Alex Monk: Make udpmxircecho conform to pep8 [puppet] - 10https://gerrit.wikimedia.org/r/286683 [16:48:25] _joe_ I need to make some perl CGI work on stats.wikimedia.org until we have full replacement [16:48:27] <_joe_> Krenair: <3 [16:48:33] :) [16:48:36] I don't like it too [16:48:36] <_joe_> elukey: why perl cgis? 
[16:49:40] _joe_ analytics cgis that will be replaced in the near future (like https://stats.wikimedia.org/cgi-bin/search_portal.pl?search=views) but still need to work [16:50:47] we may have never written it down properly for newer people [16:51:06] but there was a point in the past where we made a policy decision that new operations code should always be python unless there's a solid reason to go otherwise [16:51:43] (03CR) 10Alex Monk: deployment-prep: keyholder shinken monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) (owner: 10Alex Monk) [16:51:45] (03CR) 10Andrew Bogott: [WIP] Fix grants for labs pdns (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/286671 (https://phabricator.wikimedia.org/T128737) (owner: 10Jcrespo) [16:51:51] <_joe_> bblack: you'll start seeing some go soon I think [16:52:05] PROBLEM - puppet last run on restbase2009 is CRITICAL: CRITICAL: Puppet has 1 failures [16:52:22] (03PS2) 10Giuseppe Lavagetto: ocg: add imagemagick policy [puppet] - 10https://gerrit.wikimedia.org/r/286678 [16:52:22] bblack, _joe_: I am not happy about the change too but it will be temporarily, we'll migrate to the new system in this quarter [16:52:26] is there some reason we're going to prefer go for some things? 
[16:52:46] <_joe_> bblack: kubernetes has very good client libraries in go [16:52:59] <_joe_> much better than anything you'll get in python [16:53:11] <_joe_> for the api I mean [16:53:20] <_joe_> ofc you can autogen a client with the swagger spec [16:53:29] <_joe_> but in itself it's useless [16:53:51] (03CR) 10Giuseppe Lavagetto: [C: 032] ocg: add imagemagick policy [puppet] - 10https://gerrit.wikimedia.org/r/286678 (owner: 10Giuseppe Lavagetto) [16:54:18] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2260792 (10Nuria) [16:54:38] (03PS1) 10Filippo Giunchedi: cassandra: disable restbase2004-b instance [puppet] - 10https://gerrit.wikimedia.org/r/286684 [16:55:37] PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [16:55:51] is puppet swat finished I'm assuming? [16:55:58] kind of [16:56:08] there's still one un-done patch, and I haven't updated the wikitech page [16:56:13] 06Operations, 06Analytics-Kanban: Out of memory errors causing Yarn nodemanager to shutdown on analytics hosts - https://phabricator.wikimedia.org/T132559#2260809 (10JAllemandou) [16:56:19] I'm working on a sec thing instead for the moment... 
[16:56:43] bblack: thanks, I was asking because I'm about to merge https://gerrit.wikimedia.org/r/#/c/286684/ which is safe tho [16:56:53] go ahead [16:56:53] (03PS2) 10Giuseppe Lavagetto: mediawiki: install imagemagick policy [puppet] - 10https://gerrit.wikimedia.org/r/286677 [16:56:55] 06Operations, 10Ops-Access-Requests, 06Analytics-Kanban, 13Patch-For-Review: All members of analytics team need to have sudo -u hdfs on cluster {hawk} [2 pts] - https://phabricator.wikimedia.org/T126752#2260812 (10JAllemandou) [16:56:56] PROBLEM - puppet last run on restbase2004 is CRITICAL: CRITICAL: Puppet has 1 failures [16:57:10] (03PS2) 10Filippo Giunchedi: cassandra: disable restbase2004-b instance [puppet] - 10https://gerrit.wikimedia.org/r/286684 [16:57:17] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: disable restbase2004-b instance [puppet] - 10https://gerrit.wikimedia.org/r/286684 (owner: 10Filippo Giunchedi) [16:57:53] {{done}} [16:58:00] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: nf_conntrack warnings for kafka hosts - https://phabricator.wikimedia.org/T131028#2260819 (10JAllemandou) [16:58:23] 06Operations, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2260824 (10JAllemandou) [16:58:47] !log bootstrap restbase2009-a T132976 [16:58:48] T132976: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976 [16:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:00:00] bblack: my patches are all good, puppet is happy everywhere, no regression. Thank you! [17:00:04] yurik gwicke cscott arlolra subbu: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160503T1700).
[17:00:08] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2260834 (10elukey) Update: thanks to @ottomata I was able to restore correctly the home directories, now everything should be fine. I am in the process of adding mod-cgi to stat... [17:00:13] swat is a success for me [17:00:27] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 655 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5588135 keys - replication_delay is 655 [17:00:28] dinner time, be back later for the MediaWiki train deployment [17:00:36] ok let's call puppetswat done [17:00:42] thank you! [17:00:43] since another is starting anyways [17:00:50] sorry Krenair [17:00:58] (03PS1) 10Cmjohnson: Adding production dns entries for mw1261-1283 [dns] - 10https://gerrit.wikimedia.org/r/286685 [17:01:06] no problem [17:01:31] RECOVERY - puppet last run on restbase2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:02:06] (03PS5) 10Jcrespo: Fix grants for labs pdns [puppet] - 10https://gerrit.wikimedia.org/r/286671 (https://phabricator.wikimedia.org/T128737) [17:02:34] no deploy today [17:02:37] (03CR) 10Cmjohnson: [C: 032] Adding production dns entries for mw1261-1283 [dns] - 10https://gerrit.wikimedia.org/r/286685 (owner: 10Cmjohnson) [17:02:37] parsoid deploy [17:03:06] (03CR) 10Jcrespo: "I think this is it. @andrew - designate host must be an IP, otherwise the grants will be ignored." 
[puppet] - 10https://gerrit.wikimedia.org/r/286671 (https://phabricator.wikimedia.org/T128737) (owner: 10Jcrespo) [17:03:15] !log nginx on all cp* restarted for openssl-1.0.2h [17:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:03:21] (03PS3) 10Giuseppe Lavagetto: mediawiki: install imagemagick policy [puppet] - 10https://gerrit.wikimedia.org/r/286677 [17:03:50] RECOVERY - puppet last run on restbase2009 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [17:04:34] (03CR) 10Andrew Bogott: [C: 031] "Looks right, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/286671 (https://phabricator.wikimedia.org/T128737) (owner: 10Jcrespo) [17:04:36] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: install imagemagick policy [puppet] - 10https://gerrit.wikimedia.org/r/286677 (owner: 10Giuseppe Lavagetto) [17:06:01] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.48.53, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [17:06:21] downtiming, known [17:08:45] ACKNOWLEDGEMENT - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed Filippo Giunchedi removed from puppet [17:10:37] 06Operations, 10ops-codfw: sinistra - RAID failure - https://phabricator.wikimedia.org/T134187#2260885 (10Dzahn) ``` [4610067.148387] md: using 128k window, over a total of 7716112384k. [4629270.170823] Process accounting resumed [4649611.578176] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 [46496... 
[17:11:07] !log HTTP/2 enable for cache_maps (nginx upgrade) - T96848 [17:11:08] T96848: Support HTTP/2 - https://phabricator.wikimedia.org/T96848 [17:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:11:18] (03PS1) 10Elukey: Add fake secret for statistics::web role. [labs/private] - 10https://gerrit.wikimedia.org/r/286687 [17:11:20] 06Operations, 10ops-codfw: sinistra - RAID failure - https://phabricator.wikimedia.org/T134187#2260887 (10Dzahn) "..Disk failure on sdd3".. & ".. Disk failure on sdd2" [17:11:54] (03PS6) 10Jcrespo: Fix grants for labs pdns [puppet] - 10https://gerrit.wikimedia.org/r/286671 (https://phabricator.wikimedia.org/T128737) [17:12:20] bleh dpkg errors :P [17:12:51] (03CR) 10Dzahn: [C: 031] Add fake secret for statistics::web role. [labs/private] - 10https://gerrit.wikimedia.org/r/286687 (owner: 10Elukey) [17:13:39] !log restarting elasticsearch server elastic2011.codfw.wmnet (T110236) [17:13:40] (03CR) 10Elukey: [C: 032 V: 032] Add fake secret for statistics::web role. 
[labs/private] - 10https://gerrit.wikimedia.org/r/286687 (owner: 10Elukey) [17:13:40] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [17:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:13:48] (03CR) 10Dzahn: "this should fix the compiler runs on stat1001/1004 things, yea" [labs/private] - 10https://gerrit.wikimedia.org/r/286687 (owner: 10Elukey) [17:14:35] ACKNOWLEDGEMENT - Restbase root url on restbase2009 is CRITICAL: Connection refused Filippo Giunchedi bootstrapping [17:14:35] ACKNOWLEDGEMENT - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is CRITICAL: Connection refused Filippo Giunchedi bootstrapping [17:14:35] ACKNOWLEDGEMENT - restbase endpoints health on restbase2009 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.48.53, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Filippo Giunchedi bootstrapping [17:16:04] (03PS1) 10BBlack: maps.wm.o: HTTP/2 [puppet] - 10https://gerrit.wikimedia.org/r/286689 [17:16:22] PROBLEM - DPKG on cp1047 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:16:23] PROBLEM - DPKG on cp3005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:16:23] (03CR) 10BBlack: [C: 032 V: 032] maps.wm.o: HTTP/2 [puppet] - 10https://gerrit.wikimedia.org/r/286689 (owner: 10BBlack) [17:16:31] PROBLEM - DPKG on cp4019 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:16:31] PROBLEM - DPKG on cp4012 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:16:41] PROBLEM - DPKG on cp4011 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:16:42] PROBLEM - DPKG on cp3006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:16:42] PROBLEM - DPKG on cp2003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:16:50] PROBLEM - DPKG on cp3003 is CRITICAL: DPKG 
CRITICAL dpkg reports broken packages [17:17:10] that's all me, it's cache_maps servers [17:17:10] PROBLEM - DPKG on cp2021 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:17:10] PROBLEM - DPKG on cp1046 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:17:31] PROBLEM - DPKG on cp1059 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:17:31] PROBLEM - DPKG on cp4020 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:17:32] (03PS2) 10Giuseppe Lavagetto: imagemagick: install policy everywhere [puppet] - 10https://gerrit.wikimedia.org/r/286679 [17:17:33] dpkg post-install and puppet switches sometimes don't like each other, will try the other ordering for the next cluster :P [17:17:51] PROBLEM - DPKG on cp2009 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:17:51] PROBLEM - DPKG on cp2015 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:17:52] PROBLEM - DPKG on cp3004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:18:08] !log restarting blazegraph on wdqs1002 (T134238) [17:18:10] T134238: Query service fails with "Too many open files" - https://phabricator.wikimedia.org/T134238 [17:18:11] PROBLEM - DPKG on cp1060 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:19:44] (03CR) 10Giuseppe Lavagetto: [C: 032] imagemagick: install policy everywhere [puppet] - 10https://gerrit.wikimedia.org/r/286679 (owner: 10Giuseppe Lavagetto) [17:21:04] (03PS7) 10Jcrespo: Fix grants for labs pdns [puppet] - 10https://gerrit.wikimedia.org/r/286671 (https://phabricator.wikimedia.org/T128737) [17:21:11] PROBLEM - puppet last run on cp2003 is CRITICAL: CRITICAL: Puppet has 1 failures [17:22:00] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Puppet has 1 failures [17:22:20] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 1 failures [17:24:00] what a mess, once nginx and dpkg and 
puppet disagree on compatible config syntax... [17:24:10] RECOVERY - DPKG on cp1060 is OK: All packages OK [17:24:53] (03PS8) 10Jcrespo: Fix grants for labs pdns [puppet] - 10https://gerrit.wikimedia.org/r/286671 (https://phabricator.wikimedia.org/T128737) [17:25:11] PROBLEM - puppet last run on cp2015 is CRITICAL: CRITICAL: Puppet has 1 failures [17:27:11] (03CR) 10Jcrespo: [C: 032] Fix grants for labs pdns [puppet] - 10https://gerrit.wikimedia.org/r/286671 (https://phabricator.wikimedia.org/T128737) (owner: 10Jcrespo) [17:29:04] 06Operations, 10OTRS, 10Traffic, 07HTTPS, 13Patch-For-Review: ssl certificate replacement: ticket.wikimedia.org (expires 2016-02-16) - https://phabricator.wikimedia.org/T122320#2260936 (10RobH) [17:29:07] 06Operations, 10Traffic, 07HTTPS: ssl certificate replacement: tendril.wikimedia.org (expires 2016-02-15) - https://phabricator.wikimedia.org/T122319#2260939 (10RobH) [17:29:10] 06Operations, 10Traffic, 07HTTPS: acquire SSL certificate for w.wiki - https://phabricator.wikimedia.org/T91612#2260945 (10RobH) [17:29:13] 06Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 07HTTPS, 13Patch-For-Review: Set up "w.wiki" domain for usage with UrlShortener - https://phabricator.wikimedia.org/T108649#2260944 (10RobH) [17:29:31] RECOVERY - DPKG on cp1059 is OK: All packages OK [17:29:38] 06Operations, 10Traffic, 07HTTPS: SSL cert needed for benefactorevents.wikimedia.org - https://phabricator.wikimedia.org/T115028#2260969 (10RobH) [17:30:10] bblack: was there anything in puppet swat that could have caused text 4xx to spike? around 16:30 UTC [17:30:29] (03PS2) 10Elukey: Add Apache mod-cgi to stat1001 configuration to allow Perl CGI scripts. 
[puppet] - 10https://gerrit.wikimedia.org/r/286681 (https://phabricator.wikimedia.org/T76348) [17:30:30] probably, yes [17:31:03] the yue thing [17:31:22] err, spike as in triple, and staying there [17:31:38] oh, probably not [17:32:21] RECOVERY - puppet last run on cp2003 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:32:21] RECOVERY - puppet last run on cp2015 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [17:32:40] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [17:33:20] RECOVERY - DPKG on cp2009 is OK: All packages OK [17:33:26] 16:27 does seem to be when it takes off [17:33:30] RECOVERY - DPKG on cp1046 is OK: All packages OK [17:33:30] RECOVERY - DPKG on cp2021 is OK: All packages OK [17:33:49] RECOVERY - DPKG on cp2003 is OK: All packages OK [17:33:49] RECOVERY - DPKG on cp4011 is OK: All packages OK [17:33:49] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:34:00] RECOVERY - DPKG on cp1047 is OK: All packages OK [17:34:00] RECOVERY - DPKG on cp2015 is OK: All packages OK [17:34:00] RECOVERY - DPKG on cp3005 is OK: All packages OK [17:34:00] RECOVERY - DPKG on cp4012 is OK: All packages OK [17:34:05] yup, all 404s it seems [17:34:19] yue seems most likely still, if anything from swat [17:34:19] RECOVERY - DPKG on cp4019 is OK: All packages OK [17:34:30] RECOVERY - DPKG on cp4020 is OK: All packages OK [17:34:40] RECOVERY - DPKG on cp3006 is OK: All packages OK [17:34:52] 06Operations, 10ops-codfw: sinistra - RAID failure - https://phabricator.wikimedia.org/T134187#2257327 (10RobH) Sinistra is under warranty until 2016-03-02. @papaul can get dell to dispatch a replacement disk. 
[17:35:00] but if it was yue, it should've fixed itself [17:35:09] RECOVERY - DPKG on cp3003 is OK: All packages OK [17:35:15] (as the apache patch rolled out over ~30 mins, while DNS appeared immediately) [17:35:20] RECOVERY - DPKG on cp3004 is OK: All packages OK [17:38:30] !log deployed new grants on new hosts for pdns-labs database [17:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:40:38] (03PS3) 10Elukey: Add Apache mod-cgi to stat1001 configuration to allow Perl CGI scripts. [puppet] - 10https://gerrit.wikimedia.org/r/286681 (https://phabricator.wikimedia.org/T76348) [17:41:07] 06Operations, 10ops-codfw: sinistra - RAID failure - https://phabricator.wikimedia.org/T134187#2261003 (10Papaul) 2019-03-02 [17:41:53] (03CR) 10Elukey: [C: 032] Add Apache mod-cgi to stat1001 configuration to allow Perl CGI scripts. [puppet] - 10https://gerrit.wikimedia.org/r/286681 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey) [17:43:04] godog: seems to be a spike in 404s for /.well-known/apple-app-site-association [17:43:25] even at the new higher level, that URL is 87% of 404s in sampled-1k [17:46:07] (03PS1) 10Cmjohnson: Adding mgmt and prodcution dns for druid1001-1003 [dns] - 10https://gerrit.wikimedia.org/r/286694 [17:46:30] (was 66% before the rise too, though) [17:46:47] (03PS1) 10BearND: Deploy mobileapps using scap3 [puppet] - 10https://gerrit.wikimedia.org/r/286695 (https://phabricator.wikimedia.org/T129147) [17:47:36] but still, removing the apple thing from the set, the rate is relatively flat across that boundary in the graph [17:49:27] 06Operations, 06Mobile-Apps, 10Traffic: alias /apple-app-site-association and /.well-known/apple-app-site-association - https://phabricator.wikimedia.org/T130647#2261033 (10fgiunchedi) p:05Low>03High [17:50:50] bblack: hah thanks a lot! 
relevant is ^ [17:50:58] 06Operations, 06Mobile-Apps, 10Traffic: alias /apple-app-site-association and /.well-known/apple-app-site-association - https://phabricator.wikimedia.org/T130647#2141611 (10fgiunchedi) this is back and doubled our rate of 404s starting 16:27 UTC [17:51:56] 06Operations, 10ops-codfw, 06DC-Ops: lvs2002 Embedded Flash/SD-CARD iLO errors - https://phabricator.wikimedia.org/T126321#2261040 (10Papaul) @MoritzMuehlenhoff Ref 5308703178: Confirmation of your request ref:_00Dd0bUlK._50027juPsW:ref Product description: HP ProLiant DL360p Gen8 4 LFF Configure-to-order S... [17:53:07] (03PS1) 10Elukey: Fix the mod-cgi configuration for stats.wikimedia.org after Apache 2.4 upgrade. [puppet] - 10https://gerrit.wikimedia.org/r/286696 (https://phabricator.wikimedia.org/T76348) [17:54:51] 06Operations, 10ops-codfw, 06DC-Ops, 10Traffic: lvs2002 Embedded Flash/SD-CARD iLO errors - https://phabricator.wikimedia.org/T126321#2261045 (10BBlack) We'll want to coordinate the downtime on this to minimize impact, by shutting down pybal softly on lvs2002 before shutting down the OS. [17:55:27] (03CR) 10Dzahn: [C: 031] Fix the mod-cgi configuration for stats.wikimedia.org after Apache 2.4 upgrade. [puppet] - 10https://gerrit.wikimedia.org/r/286696 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey) [17:56:06] (03CR) 10Elukey: [C: 032] Fix the mod-cgi configuration for stats.wikimedia.org after Apache 2.4 upgrade. 
[puppet] - 10https://gerrit.wikimedia.org/r/286696 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey) [17:59:12] !log restarting elasticsearch server elastic2012.codfw.wmnet (T110236) [17:59:13] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [17:59:18] (03PS1) 10Dzahn: udpmxircecho: set setsockopt SO_REUSEADDR [puppet] - 10https://gerrit.wikimedia.org/r/286697 (https://phabricator.wikimedia.org/T134247) [17:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:59:40] bblack: ah yeah just saw the new ios app release announcement on wmfall, lines up [18:00:03] <- off [18:00:08] bye! [18:00:11] godog, double may be an underestimation: https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?from=1456860918149&to=1462298375165 [18:00:14] byew [18:00:17] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2261058 (10elukey) And https://stats.wikimedia.org/cgi-bin/search_portal.pl?search=views now works fine! @ezachte: would you mind to do another sanity check to verify that every... [18:00:42] (03PS2) 10Dzahn: udpmxircecho: set setsockopt SO_REUSEADDR [puppet] - 10https://gerrit.wikimedia.org/r/286697 (https://phabricator.wikimedia.org/T134247) [18:03:01] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:03:01] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:03:41] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [18:04:39] (03CR) 10Alex Monk: [C: 031] udpmxircecho: set setsockopt SO_REUSEADDR [puppet] - 10https://gerrit.wikimedia.org/r/286697 (https://phabricator.wikimedia.org/T134247) (owner: 10Dzahn) [18:04:41] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [18:09:07] (03PS4) 10Rush: Increase the filehandle limit for rabbitmq in labs. [puppet] - 10https://gerrit.wikimedia.org/r/285888 (owner: 10Andrew Bogott) [18:11:00] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:12:22] (03PS1) 10BBlack: cache_misc: HTTP/2 T96848 [puppet] - 10https://gerrit.wikimedia.org/r/286700 [18:12:40] (03CR) 10BBlack: [C: 032 V: 032] cache_misc: HTTP/2 T96848 [puppet] - 10https://gerrit.wikimedia.org/r/286700 (owner: 10BBlack) [18:14:32] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [18:14:45] (03PS1) 10Dzahn: udpmxircecho: die if socket fails to open [puppet] - 10https://gerrit.wikimedia.org/r/286701 (https://phabricator.wikimedia.org/T134247) [18:14:51] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [18:16:30] (03PS2) 10Dzahn: udpmxircecho: die if socket fails to open [puppet] - 10https://gerrit.wikimedia.org/r/286701 (https://phabricator.wikimedia.org/T134247) [18:16:43] (03CR) 10jenkins-bot: [V: 04-1] udpmxircecho: die if socket fails to open [puppet] - 10https://gerrit.wikimedia.org/r/286701 (https://phabricator.wikimedia.org/T134247) (owner: 10Dzahn) [18:17:31] !log HTTP/2 enable for cache_misc (nginx upgrade - T96848) [18:17:32] T96848: Support HTTP/2 - https://phabricator.wikimedia.org/T96848 [18:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:23:36] !log upgrading varnish3 package on cache_misc [18:23:42] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:26:44] !log cache_misc: rolling varnishd restarts for package update [18:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:27:31] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:33:26] (03CR) 10Rush: "along w/ the comments on both hosts we have:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/286670 (https://phabricator.wikimedia.org/T128737) (owner: 10Andrew Bogott) [18:34:13] (03CR) 10Mholloway: [C: 031] Deploy mobileapps using scap3 [puppet] - 10https://gerrit.wikimedia.org/r/286695 (https://phabricator.wikimedia.org/T129147) (owner: 10BearND) [18:35:21] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: puppet fail [18:38:24] wow I am accessing https://stats.wikimedia.org/ in HTTP/2 \o/ [18:38:54] Chrome happily tells me that I am using h2 [18:39:11] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [18:40:19] :) [18:40:46] (03PS13) 10Eevans: [WIP]: Cassandra 2.2.6 config [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) [18:41:17] 06Operations, 10Wikimedia-IRC-RC-Server: IRC RC server still mentions pmtpa on various places - https://phabricator.wikimedia.org/T133328#2228294 (10Dzahn) I changed the hostname (in the ircd.conf that is in private repo), NOT the bot name. I did not restart the server, but on next restart it should be changed. [18:41:32] (03CR) 10Eevans: "@Joal said:" [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [18:48:28] (03CR) 10Andrew Bogott: "Agreed, I'll add code to wipe out that directory." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/286670 (https://phabricator.wikimedia.org/T128737) (owner: 10Andrew Bogott) [18:48:44] (03PS3) 10Ottomata: Alter role::kafka::analytics::broker to be able to use confluent module during upgrade [puppet] - 10https://gerrit.wikimedia.org/r/286660 (https://phabricator.wikimedia.org/T121562) [18:48:46] (03PS11) 10Ottomata: Add new confluent module and puppetization for using confluent Kafka [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) [18:51:46] (03PS1) 10Ottomata: Remove confluent conditional in role::kafka::analytics::broker [puppet] - 10https://gerrit.wikimedia.org/r/286708 (https://phabricator.wikimedia.org/T121562) [18:51:53] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 06Labs, 10hardware-requests: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2261280 (10chasemp) >>! In T131184#2257101, @EBernhardson wrote: > This project does replace nobelium. Nobelium will be decommissioned and re... [18:52:52] (03CR) 10Ottomata: [C: 04-1] "Wait until 0.9 is done!" [puppet] - 10https://gerrit.wikimedia.org/r/286708 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [18:55:12] (03PS3) 10Andrew Bogott: Split centralized labs pdns database into two different local DBs. [puppet] - 10https://gerrit.wikimedia.org/r/286670 (https://phabricator.wikimedia.org/T128737) [18:55:14] (03PS1) 10Andrew Bogott: Clean out pdns config cruft. [puppet] - 10https://gerrit.wikimedia.org/r/286709 [18:56:06] (03PS5) 10Aaron Schulz: Switched to pt-heartbeat lag detection on s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) [18:56:39] (03CR) 10jenkins-bot: [V: 04-1] Split centralized labs pdns database into two different local DBs. 
[puppet] - 10https://gerrit.wikimedia.org/r/286670 (https://phabricator.wikimedia.org/T128737) (owner: 10Andrew Bogott) [18:56:51] (03CR) 10jenkins-bot: [V: 04-1] Clean out pdns config cruft. [puppet] - 10https://gerrit.wikimedia.org/r/286709 (owner: 10Andrew Bogott) [18:58:06] (03PS4) 10Hashar: Group0 to php-1.27.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286641 (https://phabricator.wikimedia.org/T131557) [18:59:44] (03PS2) 10Andrew Bogott: Clean out pdns config cruft. [puppet] - 10https://gerrit.wikimedia.org/r/286709 [18:59:46] jouncebot: ping [18:59:46] (03PS4) 10Andrew Bogott: Split centralized labs pdns database into two different local DBs. [puppet] - 10https://gerrit.wikimedia.org/r/286670 (https://phabricator.wikimedia.org/T128737) [19:00:04] hashar: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160503T1900). [19:00:15] (03CR) 10Hashar: [C: 032] Group0 to php-1.27.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286641 (https://phabricator.wikimedia.org/T131557) (owner: 10Hashar) [19:00:39] (03Merged) 10jenkins-bot: Group0 to php-1.27.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286641 (https://phabricator.wikimedia.org/T131557) (owner: 10Hashar) [19:01:10] (03CR) 10jenkins-bot: [V: 04-1] Split centralized labs pdns database into two different local DBs. [puppet] - 10https://gerrit.wikimedia.org/r/286670 (https://phabricator.wikimedia.org/T128737) (owner: 10Andrew Bogott) [19:01:18] !log restarting elasticsearch server elastic2013.codfw.wmnet (T110236) [19:01:19] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [19:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:01:26] (03CR) 10jenkins-bot: [V: 04-1] Clean out pdns config cruft. 
[puppet] - 10https://gerrit.wikimedia.org/r/286709 (owner: 10Andrew Bogott) [19:01:42] !log hashar@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.27.0-wmf.23 T131557 [19:01:43] T131557: MW-1.27.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T131557 [19:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:03:45] (03PS2) 10Ottomata: Remove confluent conditional in role::kafka::analytics::broker [puppet] - 10https://gerrit.wikimedia.org/r/286708 (https://phabricator.wikimedia.org/T121562) [19:03:55] mw train is boring nowadays [19:03:57] nothing explodes [19:11:23] 06Operations, 10Wikimedia-IRC-RC-Server: IRC RC server still mentions pmtpa on various places - https://phabricator.wikimedia.org/T133328#2261358 (10Danny_B) @dzahn Did you also check MOTD? IDK if the relevant line (3rd from bottom in the log above) comes from server or is a part of it... Thanks. [19:11:40] !log group0 to 1.27.0-wmf.23 is complete. [19:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:12:51] (03PS3) 10Andrew Bogott: Clean out pdns config cruft. [puppet] - 10https://gerrit.wikimedia.org/r/286709 [19:12:52] (03PS5) 10Andrew Bogott: Split centralized labs pdns database into two different local DBs. [puppet] - 10https://gerrit.wikimedia.org/r/286670 (https://phabricator.wikimedia.org/T128737) [19:13:55] (03CR) 10jenkins-bot: [V: 04-1] Clean out pdns config cruft. [puppet] - 10https://gerrit.wikimedia.org/r/286709 (owner: 10Andrew Bogott) [19:14:09] (03CR) 10jenkins-bot: [V: 04-1] Split centralized labs pdns database into two different local DBs. 
[puppet] - 10https://gerrit.wikimedia.org/r/286670 (https://phabricator.wikimedia.org/T128737) (owner: 10Andrew Bogott) [19:14:24] (03CR) 10Odder: [C: 031] Commons: Restrict changetags userright [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286522 (https://phabricator.wikimedia.org/T134196) (owner: 10Rillke) [19:15:19] (03PS4) 10Andrew Bogott: Clean out pdns config cruft. [puppet] - 10https://gerrit.wikimedia.org/r/286709 [19:15:21] (03PS6) 10Andrew Bogott: Split centralized labs pdns database into two different local DBs. [puppet] - 10https://gerrit.wikimedia.org/r/286670 (https://phabricator.wikimedia.org/T128737) [19:17:50] 06Operations, 10Wikimedia-IRC-RC-Server: IRC RC server still mentions pmtpa on various places - https://phabricator.wikimedia.org/T133328#2261378 (10Dzahn) @Danny_B yes, i did. that line in the MOTD comes from the server name. I found there is a "vimotd" command that comes with the ircd package and the motd j... [19:21:33] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [19:33:36] (03CR) 10Yurik: "Lydia, Stas, and I discussed it over hangout, and we agreed to proceed with this patch." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284091 (https://phabricator.wikimedia.org/T126741) (owner: 10Yurik) [19:33:54] (03PS3) 10Tjones: A/B/C test of control vs textcat vs accept-lang + textcat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268048 (https://phabricator.wikimedia.org/T121542) (owner: 10EBernhardson) [19:35:13] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5599958 keys - replication_delay is 0 [19:38:22] (03PS5) 10Rush: Clean out pdns config cruft. [puppet] - 10https://gerrit.wikimedia.org/r/286709 (owner: 10Andrew Bogott) [19:38:45] (03PS7) 10Rush: Split centralized labs pdns database into two different local DBs. 
[puppet] - 10https://gerrit.wikimedia.org/r/286670 (https://phabricator.wikimedia.org/T128737) (owner: 10Andrew Bogott) [19:40:42] (03CR) 10Andrew Bogott: [C: 032] Clean out pdns config cruft. [puppet] - 10https://gerrit.wikimedia.org/r/286709 (owner: 10Andrew Bogott) [19:41:09] (03CR) 10Krinkle: [C: 04-1] Switched to pt-heartbeat lag detection on s6 (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) (owner: 10Aaron Schulz) [19:43:36] (03CR) 10Smalyshev: [C: 031] Revert "Don't yet allow wikidatasparql graph urls" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284091 (https://phabricator.wikimedia.org/T126741) (owner: 10Yurik) [19:47:43] !log restarting elasticsearch server elastic2014.codfw.wmnet (T110236) [19:47:44] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [19:55:00] 06Operations, 10ops-eqiad: Rack and setup (3) Druid Nodes in eqiad - https://phabricator.wikimedia.org/T134276#2261458 (10Cmjohnson) [19:55:02] 06Operations, 10ops-eqiad: rack/setup/deploy 3 eqiad druid nodes - https://phabricator.wikimedia.org/T134275#2260510 (10Cmjohnson) [19:55:42] 06Operations, 10ops-eqiad: Rack and Set up new application servers mw1261-1283 - https://phabricator.wikimedia.org/T133798#2261461 (10Cmjohnson) [19:56:42] 06Operations, 10ops-eqiad: Rack and Set up new application servers mw1261-1283 - https://phabricator.wikimedia.org/T133798#2244548 (10Cmjohnson) [19:57:25] 06Operations, 10ops-eqiad: Rack and Set up new application servers mw1284-1307 - https://phabricator.wikimedia.org/T134309#2261471 (10Cmjohnson) [19:58:16] 06Operations, 10ops-eqiad: Rack and Set up new application servers mw1284-1307 - https://phabricator.wikimedia.org/T134309#2261486 (10Cmjohnson) [19:59:25] 06Operations, 10ops-eqiad: Rack and Set up new application servers mw1284-1307 - https://phabricator.wikimedia.org/T134309#2261471 (10Cmjohnson) [20:00:30] 
06Operations, 10ops-eqiad: Rack and Setup new elastic search - https://phabricator.wikimedia.org/T133772#2261507 (10Cmjohnson) [20:01:21] 06Operations, 10ops-eqiad: Rack and Setup new elastic search - https://phabricator.wikimedia.org/T133772#2242648 (10Cmjohnson) I attempted to change to the UEFI on elastic1047 and install and that didn't get to the installer at all. [20:02:11] !log Restarting Jenkins [20:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:02:43] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2261516 (10Cmjohnson) [20:08:16] 06Operations, 10Wikimedia-IRC-RC-Server, 13Patch-For-Review: RC stream is broken over IRC - https://phabricator.wikimedia.org/T134247#2261538 (10Dzahn) >>! In T134247#2260394, @faidon wrote: > - Longer-term: udpmxircecho should probably write some stats of messages processed somewhere and we should alert whe... [20:31:35] (03PS1) 1020after4: Add beta-specific access.conf exceptions in scap::target [puppet] - 10https://gerrit.wikimedia.org/r/286754 (https://phabricator.wikimedia.org/T121721) [20:47:56] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [20:48:08] 06Operations, 10ops-codfw: codfw: return one intel ssd to dasher for warranty replacement - https://phabricator.wikimedia.org/T132210#2261680 (10RobH) a:05RobH>03Papaul I put Papaul in contact with the Samsung & Dasher support folks, and the form to fill out. I neglected to re-assign this to him. [20:48:28] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [20:48:57] 07Blocked-on-Operations, 06Operations, 10hardware-requests: Evaluate replacing SATA disks on ganeti100X.eqiad.wmnet with SSDs - https://phabricator.wikimedia.org/T132679#2261684 (10RobH) I'm still awaiting feedback for this on T133313. 
I've pinged those involved on that task via email. [20:49:47] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5611653 keys - replication_delay is 0 [20:51:28] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [20:53:49] (03PS1) 10Gehel: Collect pending_tasks count metric from elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/286756 (https://phabricator.wikimedia.org/T134240) [20:54:57] 06Operations, 06Discovery, 03Discovery-Search-Sprint, 07Elasticsearch, 13Patch-For-Review: Publish "pending_tasks" count from Elastic search cluster to graphite - https://phabricator.wikimedia.org/T134240#2261688 (10Gehel) [20:55:12] 06Operations, 06Discovery, 03Discovery-Search-Sprint, 07Elasticsearch, 13Patch-For-Review: Publish "pending_tasks" count from Elastic search cluster to graphite - https://phabricator.wikimedia.org/T134240#2259118 (10Gehel) [20:59:07] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:00:06] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:01:40] 06Operations, 06Discovery, 03Discovery-Search-Sprint, 07Elasticsearch, 13Patch-For-Review: Publish "pending_tasks" count from Elastic search cluster to graphite - https://phabricator.wikimedia.org/T134240#2261694 (10Gehel) a:03Gehel [21:11:17] !log restarting wdqs1002 (T134238) [21:11:18] T134238: Query service fails with "Too many open files" - https://phabricator.wikimedia.org/T134238 [21:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:20:50] (03CR) 10Dzahn: [C: 031] Move jobrunner ferm service into the roles [puppet] - 10https://gerrit.wikimedia.org/r/286415 (owner: 10Muehlenhoff) [21:31:06] !log upgrading varnish3 package on cache_upload [21:31:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:33:16] (03CR) 10Mobrovac: "LGTM, but we need to check that having the same repo in both scap sources and deployment.yaml will not cause problems." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/286695 (https://phabricator.wikimedia.org/T129147) (owner: 10BearND) [21:34:17] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 632 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5619302 keys - replication_delay is 632 [21:35:17] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [21:35:22] (03CR) 10Mobrovac: "On second thought, I will check that, but please include in this patch the removal of mobileapps/deploy from hieradata/common/role/deploym" [puppet] - 10https://gerrit.wikimedia.org/r/286695 (https://phabricator.wikimedia.org/T129147) (owner: 10BearND) [21:37:05] andrewbogott ^ i am down the street heading back now. DNS? [21:37:18] chasemp: Not sure… I restarted things and it looks better but icinga isn't happy yet [21:37:46] chasemp: ok, I'm going to say not dns [21:37:53] !log cache_upload: rolling varnishd (backend only) restarts for package update [21:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:43:07] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 824473 bytes in 19.151 second response time [21:45:38] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[21:45:43] !log ebernhardson@tin Synchronized php-1.27.0-wmf.23/includes/search/SearchEngine.php: T134305 Fix invalid namespace handling in wmf.23 (duration: 00m 38s) [21:45:44] T134305: Notice: Undefined variable: namespaces in /srv/mediawiki/php-1.27.0-wmf.23/includes/search/SearchEngineConfig.php on line 109 - https://phabricator.wikimedia.org/T134305 [21:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:47:49] PROBLEM - Disk space on cp3047 is CRITICAL: DISK CRITICAL - free space: / 106 MB (1% inode=87%) [21:48:10] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5617928 keys - replication_delay is 0 [21:48:28] PROBLEM - Disk space on cp1071 is CRITICAL: DISK CRITICAL - free space: / 339 MB (3% inode=88%) [21:48:42] fucking varnishkafka [21:48:49] PROBLEM - Disk space on cp3044 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=87%) [21:48:58] PROBLEM - Disk space on cp3039 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=87%) [21:48:59] PROBLEM - Disk space on cp3046 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=87%) [21:49:08] PROBLEM - Disk space on cp3049 is CRITICAL: DISK CRITICAL - free space: / 161 MB (1% inode=87%) [21:49:09] PROBLEM - Disk space on cp1049 is CRITICAL: DISK CRITICAL - free space: / 50 MB (0% inode=87%) [21:49:18] PROBLEM - Disk space on cp3035 is CRITICAL: DISK CRITICAL - free space: / 142 MB (1% inode=87%) [21:49:29] PROBLEM - Disk space on cp1073 is CRITICAL: DISK CRITICAL - free space: / 59 MB (0% inode=87%) [21:49:48] PROBLEM - Disk space on cp1048 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=87%) [21:50:18] PROBLEM - Disk space on cp1050 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=87%) [21:50:30] PROBLEM - Disk space on cp1072 is CRITICAL: DISK CRITICAL - free space: / 145 MB (1% inode=87%) [21:51:00] PROBLEM - Disk space on cp1074 is CRITICAL: DISK CRITICAL - free space: / 0 
MB (0% inode=87%) [21:51:00] PROBLEM - Disk space on cp3038 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=88%) [21:51:08] PROBLEM - Disk space on cp1064 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=87%) [21:51:10] PROBLEM - Disk space on cp1063 is CRITICAL: DISK CRITICAL - free space: / 152 MB (1% inode=87%) [21:51:40] PROBLEM - Disk space on cp1062 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=87%) [21:51:48] PROBLEM - Disk space on cp3045 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=87%) [21:51:50] PROBLEM - Disk space on cp3034 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=87%) [21:52:00] PROBLEM - Disk space on cp3037 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=87%) [21:52:30] PROBLEM - Disk space on cp3036 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=87%) [21:52:50] PROBLEM - Disk space on cp3048 is CRITICAL: DISK CRITICAL - free space: / 222 MB (2% inode=88%) [21:53:09] RECOVERY - Disk space on cp3035 is OK: DISK OK [21:53:09] RECOVERY - Disk space on cp1063 is OK: DISK OK [21:53:19] RECOVERY - Disk space on cp1073 is OK: DISK OK [21:53:38] RECOVERY - Disk space on cp1048 is OK: DISK OK [21:53:38] RECOVERY - Disk space on cp1062 is OK: DISK OK [21:53:39] RECOVERY - Disk space on cp3047 is OK: DISK OK [21:53:39] RECOVERY - Disk space on cp3045 is OK: DISK OK [21:53:48] RECOVERY - Disk space on cp3034 is OK: DISK OK [21:53:59] RECOVERY - Disk space on cp3037 is OK: DISK OK [21:54:09] RECOVERY - Disk space on cp1050 is OK: DISK OK [21:54:18] RECOVERY - Disk space on cp1071 is OK: DISK OK [21:54:28] RECOVERY - Disk space on cp1072 is OK: DISK OK [21:54:29] RECOVERY - Disk space on cp3036 is OK: DISK OK [21:54:39] PROBLEM - Varnishkafka log producer on cp1072 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:54:39] PROBLEM - Varnishkafka log producer on cp2011 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:54:48] PROBLEM - 
Varnishkafka log producer on cp4006 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:54:48] PROBLEM - Varnishkafka log producer on cp4013 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:54:48] RECOVERY - Disk space on cp3044 is OK: DISK OK [21:54:48] RECOVERY - Disk space on cp3048 is OK: DISK OK [21:54:49] RECOVERY - Disk space on cp3039 is OK: DISK OK [21:54:49] PROBLEM - Varnishkafka log producer on cp2008 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:54:49] RECOVERY - Disk space on cp3046 is OK: DISK OK [21:54:58] RECOVERY - Disk space on cp1074 is OK: DISK OK [21:54:59] PROBLEM - Varnishkafka log producer on cp1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:54:59] RECOVERY - Disk space on cp3049 is OK: DISK OK [21:54:59] RECOVERY - Disk space on cp3038 is OK: DISK OK [21:54:59] PROBLEM - Varnishkafka log producer on cp3034 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:54:59] RECOVERY - Disk space on cp1049 is OK: DISK OK [21:54:59] RECOVERY - Disk space on cp1064 is OK: DISK OK [21:55:08] PROBLEM - Varnishkafka log producer on cp3037 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:55:09] PROBLEM - Varnishkafka log producer on cp3047 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:55:09] PROBLEM - Varnishkafka log producer on cp2005 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:55:09] PROBLEM - Varnishkafka log producer on cp2020 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:55:19] PROBLEM - Varnishkafka log producer on cp2026 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:55:19] PROBLEM - Varnishkafka log producer on cp2014 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:55:20] PROBLEM - Varnishkafka log producer on cp1074 is CRITICAL: PROCS 
CRITICAL: 0 processes with command name varnishkafka [21:55:20] PROBLEM - Varnishkafka log producer on cp1064 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:55:20] PROBLEM - Varnishkafka log producer on cp4005 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:55:28] PROBLEM - Varnishkafka log producer on cp3035 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:55:28] PROBLEM - Varnishkafka log producer on cp3048 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:55:29] PROBLEM - Varnishkafka log producer on cp2024 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:55:29] PROBLEM - Varnishkafka log producer on cp1099 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:55:30] PROBLEM - Varnishkafka log producer on cp4007 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:55:33] !log stopped varnishkafka on all cache_upload, and wiped out the spammy junk it fills the disk with in /var/cache/varnishkafka/ [21:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:55:39] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [21:55:39] PROBLEM - Varnishkafka log producer on cp4014 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:55:48] PROBLEM - Varnishkafka log producer on cp1062 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:55:48] PROBLEM - Varnishkafka log producer on cp1073 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:55:58] PROBLEM - Varnishkafka log producer on cp1049 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:55:59] PROBLEM - Varnishkafka log producer on cp3049 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:55:59] PROBLEM - 
Varnishkafka log producer on cp3036 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:56:10] PROBLEM - Varnishkafka log producer on cp1050 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:56:10] PROBLEM - Varnishkafka log producer on cp1063 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:56:19] PROBLEM - Varnishkafka log producer on cp2022 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:56:28] PROBLEM - Varnishkafka log producer on cp1048 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:56:28] PROBLEM - Varnishkafka log producer on cp4015 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:56:29] PROBLEM - Varnishkafka log producer on cp3046 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:56:29] PROBLEM - Varnishkafka log producer on cp3038 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:56:29] PROBLEM - Varnishkafka log producer on cp3044 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:56:40] RECOVERY - WDQS SPARQL on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 15318 bytes in 0.010 second response time [21:57:50] PROBLEM - Varnishkafka log producer on cp2017 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:57:58] RECOVERY - WDQS HTTP on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 15318 bytes in 0.002 second response time [21:57:59] RECOVERY - Varnishkafka log producer on cp3036 is OK: PROCS OK: 1 process with command name varnishkafka [21:58:39] PROBLEM - Varnishkafka log producer on cp3045 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:59:18] RECOVERY - Varnishkafka log producer on cp1074 is OK: PROCS OK: 1 process with command name varnishkafka [21:59:18] RECOVERY - Varnishkafka log producer on cp1064 is OK: PROCS OK: 1 process with command name varnishkafka [21:59:18] 
RECOVERY - Varnishkafka log producer on cp4005 is OK: PROCS OK: 1 process with command name varnishkafka [21:59:18] RECOVERY - Varnishkafka log producer on cp3035 is OK: PROCS OK: 1 process with command name varnishkafka [21:59:19] RECOVERY - Varnishkafka log producer on cp3048 is OK: PROCS OK: 1 process with command name varnishkafka [21:59:19] RECOVERY - Varnishkafka log producer on cp1099 is OK: PROCS OK: 1 process with command name varnishkafka [21:59:19] RECOVERY - Varnishkafka log producer on cp2024 is OK: PROCS OK: 1 process with command name varnishkafka [21:59:28] PROBLEM - Host bismuth is DOWN: PING CRITICAL - Packet loss = 100% [21:59:36] RECOVERY - Varnishkafka log producer on cp4007 is OK: PROCS OK: 1 process with command name varnishkafka [21:59:48] RECOVERY - Varnishkafka log producer on cp4014 is OK: PROCS OK: 1 process with command name varnishkafka [21:59:49] RECOVERY - Varnishkafka log producer on cp1073 is OK: PROCS OK: 1 process with command name varnishkafka [21:59:49] RECOVERY - Varnishkafka log producer on cp1062 is OK: PROCS OK: 1 process with command name varnishkafka [21:59:50] RECOVERY - Varnishkafka log producer on cp2017 is OK: PROCS OK: 1 process with command name varnishkafka [21:59:59] RECOVERY - Varnishkafka log producer on cp1049 is OK: PROCS OK: 1 process with command name varnishkafka [22:00:08] RECOVERY - Varnishkafka log producer on cp3049 is OK: PROCS OK: 1 process with command name varnishkafka [22:00:19] RECOVERY - Host bismuth is UP: PING OK - Packet loss = 0%, RTA = 1.35 ms [22:00:25] <_joe_> what's up? 
[22:00:26] RECOVERY - Varnishkafka log producer on cp1050 is OK: PROCS OK: 1 process with command name varnishkafka [22:00:26] RECOVERY - Varnishkafka log producer on cp1063 is OK: PROCS OK: 1 process with command name varnishkafka [22:00:37] _joe_: everything's fine with the varnishkafka spam [22:00:39] RECOVERY - Varnishkafka log producer on cp2022 is OK: PROCS OK: 1 process with command name varnishkafka [22:00:40] RECOVERY - Varnishkafka log producer on cp1048 is OK: PROCS OK: 1 process with command name varnishkafka [22:00:45] I don't know about WDQS and bismuth though [22:00:48] RECOVERY - Varnishkafka log producer on cp4015 is OK: PROCS OK: 1 process with command name varnishkafka [22:00:48] RECOVERY - Varnishkafka log producer on cp3046 is OK: PROCS OK: 1 process with command name varnishkafka [22:00:49] RECOVERY - Varnishkafka log producer on cp3038 is OK: PROCS OK: 1 process with command name varnishkafka [22:00:49] RECOVERY - Varnishkafka log producer on cp3044 is OK: PROCS OK: 1 process with command name varnishkafka [22:00:58] RECOVERY - Varnishkafka log producer on cp1072 is OK: PROCS OK: 1 process with command name varnishkafka [22:00:58] RECOVERY - Varnishkafka log producer on cp2011 is OK: PROCS OK: 1 process with command name varnishkafka [22:00:59] RECOVERY - Varnishkafka log producer on cp4006 is OK: PROCS OK: 1 process with command name varnishkafka [22:00:59] RECOVERY - Varnishkafka log producer on cp4013 is OK: PROCS OK: 1 process with command name varnishkafka [22:01:00] RECOVERY - Varnishkafka log producer on cp3045 is OK: PROCS OK: 1 process with command name varnishkafka [22:01:08] RECOVERY - Varnishkafka log producer on cp2008 is OK: PROCS OK: 1 process with command name varnishkafka [22:01:09] RECOVERY - Varnishkafka log producer on cp1071 is OK: PROCS OK: 1 process with command name varnishkafka [22:01:18] RECOVERY - Varnishkafka log producer on cp3034 is OK: PROCS OK: 1 process with command name varnishkafka [22:01:20] RECOVERY - 
Varnishkafka log producer on cp3037 is OK: PROCS OK: 1 process with command name varnishkafka [22:01:20] RECOVERY - Varnishkafka log producer on cp3047 is OK: PROCS OK: 1 process with command name varnishkafka [22:01:26] <_joe_> ok => bed [22:01:28] RECOVERY - Varnishkafka log producer on cp2005 is OK: PROCS OK: 1 process with command name varnishkafka [22:01:28] RECOVERY - Varnishkafka log producer on cp2020 is OK: PROCS OK: 1 process with command name varnishkafka [22:01:29] RECOVERY - Varnishkafka log producer on cp2026 is OK: PROCS OK: 1 process with command name varnishkafka [22:01:30] RECOVERY - Varnishkafka log producer on cp2014 is OK: PROCS OK: 1 process with command name varnishkafka [22:01:39] nite! [22:03:17] (03PS2) 10Smalyshev: [WIP] Add configs for kafka-watcher tool [puppet] - 10https://gerrit.wikimedia.org/r/286588 (https://phabricator.wikimedia.org/T97562) [22:04:22] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Add configs for kafka-watcher tool [puppet] - 10https://gerrit.wikimedia.org/r/286588 (https://phabricator.wikimedia.org/T97562) (owner: 10Smalyshev) [22:04:40] 06Operations, 06Project-Admins: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#2261983 (10Danny_B) [22:06:19] (03PS2) 10Legoktm: Add UploadsLink to production extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286494 (https://phabricator.wikimedia.org/T130018) (owner: 10Rillke) [22:06:21] (03PS2) 10Legoktm: Enable UploadsLink at Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286517 (https://phabricator.wikimedia.org/T130018) (owner: 10Rillke) [22:13:22] (03PS3) 10Smalyshev: [WIP] Add configs for kafka-watcher tool [puppet] - 10https://gerrit.wikimedia.org/r/286588 (https://phabricator.wikimedia.org/T97562) [22:13:32] 06Operations, 10Wikimedia-IRC-RC-Server: Replace ircd-ratbox with something newer/maintained - https://phabricator.wikimedia.org/T134271#2261992 (10Dzahn) @Faidon T132427 is about 
@Muehlenhoff building the ratbox package and the thing is that there is this custom patch in it -> https://github.com/wikimedia/o... [22:13:41] 06Operations, 10Wikimedia-IRC-RC-Server: Replace ircd-ratbox with something newer/maintained - https://phabricator.wikimedia.org/T134271#2261996 (10Krenair) Ideally one that wouldn't require this custom patch: https://github.com/wikimedia/operations-debs-ircd-ratbox/blob/master/ircd-ratbox-notalk.patch (preven... [22:16:54] yurik: I've reseted the Lydia CR-1 per your last comment, but I'm a little surprised this wasn't a part of the aude train of Wikidata config changes of the morning SWAT. [22:17:02] for https://gerrit.wikimedia.org/r/#/c/284091/ [22:22:11] 06Operations, 10Traffic: confctl: give regexen more freedom - https://phabricator.wikimedia.org/T134323#2262017 (10BBlack) [22:22:58] (03PS4) 10Yurik: Revert "Don't yet allow wikidatasparql graph urls" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284091 (https://phabricator.wikimedia.org/T126741) [22:23:30] 06Operations, 10Traffic: confctl select needs a -y flag? - https://phabricator.wikimedia.org/T134324#2262031 (10BBlack) [22:25:42] (03CR) 10Rush: [C: 032] Add beta-specific access.conf exceptions in scap::target [puppet] - 10https://gerrit.wikimedia.org/r/286754 (https://phabricator.wikimedia.org/T121721) (owner: 1020after4) [22:39:26] (03PS5) 10Andrew Bogott: Increase the filehandle limit for rabbitmq in labs. [puppet] - 10https://gerrit.wikimedia.org/r/285888 [22:40:53] (03CR) 10Andrew Bogott: [C: 032] Increase the filehandle limit for rabbitmq in labs. [puppet] - 10https://gerrit.wikimedia.org/r/285888 (owner: 10Andrew Bogott) [22:46:51] !log restarting rabbitmq on labcontrol1001 to pick up a new ulimit [22:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:47:15] !log upgrading varnish3 package on cache_text ... 
[22:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:52:43] PROBLEM - Varnish HTTP text-backend - port 3128 on cp4009 is CRITICAL: Connection refused [22:53:13] PROBLEM - Varnish HTTP text-backend - port 3128 on cp1054 is CRITICAL: Connection refused [22:53:32] PROBLEM - Varnish HTTP text-backend - port 3128 on cp3041 is CRITICAL: Connection refused [22:53:33] PROBLEM - Varnish HTTP text-backend - port 3128 on cp3030 is CRITICAL: Connection refused [22:54:02] PROBLEM - Varnish HTTP text-backend - port 3128 on cp1052 is CRITICAL: Connection refused [22:54:19] (03PS2) 10EBernhardson: cirrus: Only use curl pools on hhvm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286485 (https://phabricator.wikimedia.org/T132751) [22:54:24] PROBLEM - Varnish HTTP text-backend - port 3128 on cp3042 is CRITICAL: Connection refused [22:54:43] PROBLEM - Varnish HTTP text-backend - port 3128 on cp2001 is CRITICAL: Connection refused [22:55:43] PROBLEM - Varnish HTTP text-backend - port 3128 on cp2004 is CRITICAL: Connection refused [22:56:32] PROBLEM - Varnish HTTP text-backend - port 3128 on cp1067 is CRITICAL: Connection refused [22:56:34] .... [22:56:52] PROBLEM - Varnish HTTP text-backend - port 3128 on cp1065 is CRITICAL: Connection refused [22:56:53] PROBLEM - Varnish HTTP text-backend - port 3128 on cp4016 is CRITICAL: Connection refused [22:57:43] PROBLEM - Varnish HTTP text-backend - port 3128 on cp1055 is CRITICAL: Connection refused [22:58:28] bblack: something I can do to help let me know man, bad package upgrade? 
[22:58:33] RECOVERY - Varnish HTTP text-backend - port 3128 on cp4009 is OK: HTTP OK: HTTP/1.1 200 OK - 186 bytes in 0.151 second response time [22:58:50] it's ok so far [22:58:53] RECOVERY - Varnish HTTP text-backend - port 3128 on cp4016 is OK: HTTP OK: HTTP/1.1 200 OK - 186 bytes in 0.150 second response time [22:59:03] RECOVERY - Varnish HTTP text-backend - port 3128 on cp1054 is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 0.003 second response time [22:59:03] and not really a bad package upgrade, more like a bad package upgrade process [22:59:23] RECOVERY - Varnish HTTP text-backend - port 3128 on cp3041 is OK: HTTP OK: HTTP/1.1 200 OK - 186 bytes in 0.167 second response time [22:59:25] gotcha :) [22:59:43] RECOVERY - Varnish HTTP text-backend - port 3128 on cp2004 is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 0.073 second response time [22:59:43] RECOVERY - Varnish HTTP text-backend - port 3128 on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 186 bytes in 0.001 second response time [22:59:53] RECOVERY - Varnish HTTP text-backend - port 3128 on cp1052 is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 0.013 second response time [23:00:04] RoanKattouw ostriches Krenair Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160503T2300). Please do the needful. [23:00:04] yurik: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[23:00:14] the dependency-hell between the various varnish-related packages and configurations is disturbing [23:00:15] here [23:00:22] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [23:00:22] RECOVERY - Varnish HTTP text-backend - port 3128 on cp3042 is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 0.166 second response time [23:00:24] RECOVERY - Varnish HTTP text-backend - port 3128 on cp1067 is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 0.005 second response time [23:00:42] RECOVERY - Varnish HTTP text-backend - port 3128 on cp2001 is OK: HTTP OK: HTTP/1.1 200 OK - 186 bytes in 0.076 second response time [23:00:52] RECOVERY - Varnish HTTP text-backend - port 3128 on cp1065 is OK: HTTP OK: HTTP/1.1 200 OK - 186 bytes in 0.017 second response time [23:00:58] luckily varnish's own health stuff, and the depooling done around the process, should protect against most of the above fallout [23:01:03] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [23:01:14] Hi. I'm swatting this night. bblack: green light or you need to do something before? 
[23:01:20] there's a 503 spike related, but it's brief [23:01:24] Dereckson: green light [23:01:28] k [23:01:57] yurik: so I wondered why the revert change wasn't a part of the aude's series of patches crafted this afternoon [23:02:31] Dereckson, Lydia_WMDE, SMalyshev and I just spoke a few hours ago to settle this patch [23:02:46] RECOVERY - Varnish HTTP text-backend - port 3128 on cp3030 is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 0.169 second response time [23:03:02] aude wasn't involved in this patch from what i know [23:03:10] ok [23:03:16] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [23:04:18] Dereckson, also, i think https://gerrit.wikimedia.org/r/#/c/286474/ needs to be deployed [23:04:18] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [23:04:31] Dereckson, problem is, all graphs are badly broken at the moment [23:04:50] due to last week's jdlrobson's migration to the new loader [23:04:57] PROBLEM - Varnish HTTP text-backend - port 3128 on cp2023 is CRITICAL: Connection refused [23:05:36] PROBLEM - Varnish HTTP text-backend - port 3128 on cp2013 is CRITICAL: Connection refused [23:05:45] (03PS3) 10Dzahn: udpmxircecho: set setsockopt SO_REUSEADDR [puppet] - 10https://gerrit.wikimedia.org/r/286697 (https://phabricator.wikimedia.org/T134247) [23:06:05] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284091 (https://phabricator.wikimedia.org/T126741) (owner: 10Yurik) [23:06:15] (03CR) 10Dzahn: [C: 032] udpmxircecho: set setsockopt SO_REUSEADDR [puppet] - 10https://gerrit.wikimedia.org/r/286697 (https://phabricator.wikimedia.org/T134247) (owner: 10Dzahn) [23:06:32] (03Merged) 10jenkins-bot: Revert "Don't yet allow wikidatasparql graph urls" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284091 (https://phabricator.wikimedia.org/T126741) (owner: 10Yurik)
[23:06:40] Dereckson, i will add 286474 to the swat page [23:06:47] RECOVERY - Varnish HTTP text-backend - port 3128 on cp2023 is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 0.075 second response time [23:07:27] RECOVERY - Varnish HTTP text-backend - port 3128 on cp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 186 bytes in 0.079 second response time [23:08:21] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Revert Don't yet allow wikidatasparql graph urls (T126741) (duration: 00m 26s) [23:08:21] T126741: Add support for the wikidata's Sparql queries to graphs - https://phabricator.wikimedia.org/T126741 [23:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:08:30] yurik: please test [23:09:11] Dereckson, will do. I just added 286474 (per above), could you push it out too pls [23:09:37] (03PS1) 10BBlack: varnishreqstats: fix systemd deps [puppet] - 10https://gerrit.wikimedia.org/r/286768 [23:09:42] Oh, 286474 is required for the revert? [23:09:44] !log dereckson@tin Synchronized wmf-config/CommonSettings-labs.php: Revert Don't yet allow wikidatasparql graph urls (no op in prod) (duration: 00m 25s) [23:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:09:57] (03CR) 10BBlack: [C: 032 V: 032] varnishreqstats: fix systemd deps [puppet] - 10https://gerrit.wikimedia.org/r/286768 (owner: 10BBlack) [23:10:20] ebernhardson: ping [23:10:29] Dereckson: pong [23:11:06] Dereckson, they are not really related [23:11:18] revert is simply re-enabling one of the external data protocols [23:11:25] yurik: will wait, I'm doing config first, then extensions [23:11:32] ok [23:11:48] ebernhardson: I suggest we test it on mw1017 first, is that fine? 
[23:12:40] Dereckson: tis ok [23:12:52] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286485 (https://phabricator.wikimedia.org/T132751) (owner: 10EBernhardson) [23:14:36] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:14:37] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:14:42] (03PS3) 10Dereckson: cirrus: Only use curl pools on hhvm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286485 (https://phabricator.wikimedia.org/T132751) (owner: 10EBernhardson) [23:14:51] (03CR) 10Dereckson: "SWAT, take 2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286485 (https://phabricator.wikimedia.org/T132751) (owner: 10EBernhardson) [23:14:56] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:15:57] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:16:32] (03CR) 10Dereckson: [C: 032] "SWAT, take 3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286485 (https://phabricator.wikimedia.org/T132751) (owner: 10EBernhardson) [23:16:35] Dereckson: to trigger the second time i think you have to X out the first cr :) [23:16:38] yup you noticed ;) [23:16:57] (03Merged) 10jenkins-bot: cirrus: Only use curl pools on hhvm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286485 (https://phabricator.wikimedia.org/T132751) (owner: 10EBernhardson) [23:19:39] ebernhardson: live on mw1017 [23:21:42] Dereckson: seems fine. 
[23:21:43] (03PS3) 10Dzahn: udpmxircecho: die if socket fails to open [puppet] - 10https://gerrit.wikimedia.org/r/286701 (https://phabricator.wikimedia.org/T134247) [23:21:45] k [23:21:56] (03CR) 10Dzahn: [C: 032] udpmxircecho: die if socket fails to open [puppet] - 10https://gerrit.wikimedia.org/r/286701 (https://phabricator.wikimedia.org/T134247) (owner: 10Dzahn) [23:23:37] !log dereckson@tin Synchronized wmf-config/CirrusSearch-production.php: Cirrus: only use pooled curl in hhvm / [[Gerrit:286485]] (duration: 00m 34s) [23:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:23:54] ebernhardson: live on the whole cluster, still fine? [23:24:46] yurik: you selfmerged 286474... [23:25:04] Dereckson, it was changed by jgirault [23:25:16] so what? [23:25:25] Dereckson, well, it was more of his patch then :) [23:26:18] Dereckson: yup looks fine [23:26:31] ebernhardson: cool, thanks for testing [23:26:37] No, I don't think so. I compared the PS1 and the PS2, it removed one file. [23:26:54] jgirault: there? [23:27:00] Dereckson: yes [23:27:43] Dereckson, yes, i know - it was me copying code from matmarex's suggestion to do the same as in https://phabricator.wikimedia.org/rMWbc4e07b6f63b0865a14aef981366a79d10329c87#cb95de99 [23:27:50] jgirault: if you've social +2 rights on Graph, could you add on Gerrit https://gerrit.wikimedia.org/r/#/c/286474 looks fine for you? [23:28:28] social +2 ? i haven't heard of that one :) [23:28:51] Dereckson, you get an ok from me [23:29:02] 06Operations, 10Wikimedia-IRC-RC-Server, 13Patch-For-Review: RC stream is broken over IRC - https://phabricator.wikimedia.org/T134247#2262096 (10Dzahn) The actionables on this ticket should be done, i merged the 2 changes above and restarted the bot on both hosts. That leaves just the "longterm" actionable... 
[23:29:10] MaxSem: k [23:29:17] can't really +anything because it's merged [23:30:00] so, let's cherry pick that [23:30:04] $ mwversionsinuse [23:30:05] 1.27.0-wmf.22 1.27.0-wmf.23 [23:30:11] yurik: both branches I imagine? [23:30:15] Dereckson, yep [23:30:34] thx MaxSem [23:30:45] 06Operations, 10Wikimedia-IRC-RC-Server: udpmxircecho should write stats of messages processed and we should alert when that drops to zero - https://phabricator.wikimedia.org/T134326#2262099 (10Dzahn) [23:30:50] not sure how MaxSem always reads everything that is going on in all the channels :) [23:31:53] mutante: Did the name for wikimediafoundation.org's channel on irc.wikimedia.org change? [23:31:55] Okay. [23:31:56] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [23:32:08] 06Operations, 10Wikimedia-IRC-RC-Server, 13Patch-For-Review: RC stream is broken over IRC - https://phabricator.wikimedia.org/T134247#2259656 (10Dzahn) 05Open>03Resolved Closing this because the stream is working and the short term fixes have been implemented. I moved the remaining actionable to a subtask. [23:32:27] My bot is in #wikimediafoundation.org. [23:32:30] yurik: could you edit the deployments page? The changes are https://gerrit.wikimedia.org/r/#/c/286774/ and https://gerrit.wikimedia.org/r/#/c/286775/ [23:33:24] Dereckson, done [23:34:00] I also wonder why "/list *foundat*" just ignores the filter. [23:34:45] (03PS4) 10Smalyshev: [WIP] Add configs for kafka-watcher tool [puppet] - 10https://gerrit.wikimedia.org/r/286588 (https://phabricator.wikimedia.org/T97562) [23:34:51] We're waiting Zuul. 
[23:35:46] (CR) jenkins-bot: [V: -1] [WIP] Add configs for kafka-watcher tool [puppet] - https://gerrit.wikimedia.org/r/286588 (https://phabricator.wikimedia.org/T97562) (owner: Smalyshev)
[23:36:04] Operations, Project-Admins: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#2262123 (BBlack) Resolved>Open @Aklapper - The #Varnish tag is still causing issues. Apparently it's still possible for it to be used in new tasks, which then don't end up w...
[23:40:32] !log slow, depooled, staggered restart of varnish frontends on text and upload clusters commencing
[23:40:33] yurik: Jenkins tests are okay, we're waiting Zuul gating
[23:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:41:27] and so we're waiting a core change: https://integration.wikimedia.org/ci/job/mediawiki-extensions-php55/3270/
[23:43:07] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp1050 is CRITICAL: Connection refused
[23:43:53] Merged.
[23:44:06] Leah: i'm not aware of any change that would influence channel names
[23:44:46] (PS5) Smalyshev: [WIP] Add configs for kafka-watcher tool [puppet] - https://gerrit.wikimedia.org/r/286588 (https://phabricator.wikimedia.org/T97562)
[23:45:01] Leah: but the channel only gets created once there actually is a change
[23:45:06] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp1050 is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 0.010 second response time
[23:46:32] mutante: https://wikimediafoundation.org/wiki/Special:RecentChanges?hidebots=0&hidemyself=1
[23:46:39] Leah: the behaviour of /list is the same on both server
[23:47:00] Does "/list *wikimedia*" limit the list to only channels containing "wikimedia" for you?
[23:47:09] It seems to just spit out every channel name.
[23:47:16] no, it does not
[23:47:30] Maybe an old bug.
[23:47:33] but that's not a change
[23:47:37] Sure.
[23:48:23] are you sure a channel for foundation wiki existed?
[23:48:54] Very sure.
[23:49:02] and what was the name?
[23:49:10] [20:01] [[Former Board of Trustees members]]; Quiddity (WMF); /* Kat Walsh */ fix userpage link (leaving as redirect - could also be changed to mindspillage?); https://wikimediafoundation.org/w/index.php?diff=105743&oldid=105673
[23:49:19] Is the last message I have.
[23:49:29] mutante: I believe it was #wikimediafoundation.org
[23:49:52] hmm. no other channel ends in .org
[23:49:55] (PS1) Smalyshev: Bump WDQS cache to 5 mins [puppet] - https://gerrit.wikimedia.org/r/286776
[23:50:37] mutante: If there's some other channel name, I can use that. :-)
[23:50:40] how is snitch related to that?
[23:50:42] But I don't see any channel for that wiki.
[23:50:51] i dont see a channel like that on the old server either
[23:50:58] snitch is on irc.freenode.net. snatch is on irc.wikimedia.org.
[23:51:04] (CR) Yurik: [C: 1] Bump WDQS cache to 5 mins [puppet] - https://gerrit.wikimedia.org/r/286776 (owner: Smalyshev)
[23:51:06] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[23:51:19] mutante: Do you see any channel name containing "foundation"?
[23:51:31] This definitely worked yesterday.
[23:51:39] I've been stalking this wiki for like half a decade.
[23:51:59] well, the *foundation* stuff doesnt work :) but looking
[23:52:33] yurik: okay, ready in staging
[23:52:35] let's sync
[23:52:43] yei!
[23:52:47] Leah: it might be this issue https://phabricator.wikimedia.org/T21244
[23:53:02] Looking at kraz.wikimedia.org, rc-pmtpa is only in 139 channels, as far as I can tell.
[23:53:06] That sounds low.
[23:54:29] !log dereckson@tin Synchronized php-1.27.0-wmf.22/extensions/Graph/lib/d3-global.js: Graph: match modern module loading in core (1/3) (duration: 00m 26s)
[23:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:55:13] Leah: that means that since it was restarted (which is not long ago because i just merged something that needed a restart), 139 wikis have had edits
[23:55:19] !log dereckson@tin Synchronized php-1.27.0-wmf.22/extensions/Graph/lib/topojson-global.js: Graph: match modern module loading in core (2/3) (duration: 00m 25s)
[23:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:55:51] mutante: Okay, I hadn't realized it had been restarted again.
[23:55:59] !log dereckson@tin Synchronized php-1.27.0-wmf.22/extensions/Graph/extension.json: Graph: match modern module loading in core (3/3) (duration: 00m 25s)
[23:56:02] yurik: please test (on a wmf.22 wiki so) ^
[23:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:56:48] Leah: if you can login .. i found this earlier https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1462305917.15&target=servers.argon.ircd.users&target=servers.kraz.ircd.users
[23:57:13] eh, wrong link. that's users
[23:57:55] Dereckson, are you sure its up to date on enwiki?
[23:58:21] https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1462305917.15&target=servers.argon.ircd.channels&target=servers.kraz.ircd.channels
[23:58:43] mutante: When was the restart? https://es.wikivoyage.org/wiki/Especial:CambiosRecientes has edits, but the channel doesn't have rc-pmtpa.
[23:58:57] Dereckson, actually yeah, debug mode works
[23:59:12] Leah: i restarted the bot to implement the actionables on https://phabricator.wikimedia.org/T134247#2260394
[23:59:26] yurik: what's the URL of the JS to update ?
[23:59:27] Dereckson, awesome, release works - must have been a caching issue
[23:59:32] i dont know, the appservers are sending UDP packets
[23:59:34] we can purge that I guess
[23:59:36] the bot is just 60 lines
[23:59:42] it has no logic about this at all
[23:59:52] it just passes the stuff through
[23:59:55] Sure.
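
The behaviour described above (appservers send UDP packets, the bot passes them through, a channel only exists once a change arrives, and the channel count resets on restart) can be sketched roughly as follows. This is a minimal illustration, not the actual udpmxircecho source: the `channel<TAB>text` packet layout and the names `parse_packet`, `Relay`, and `open_udp_socket` are assumptions made for the example.

```python
import socket


def parse_packet(data: bytes):
    """Split one UDP payload into (channel, text).

    Assumed (hypothetical) layout: b"#channel\ttext\n".
    Returns None if the payload does not match that layout.
    """
    line = data.decode("utf-8", errors="replace").rstrip("\r\n")
    channel, sep, text = line.partition("\t")
    if not sep or not channel:
        return None
    return channel, text


class Relay:
    """Pass-through relay: joins a channel the first time a message for it arrives."""

    def __init__(self, send_line):
        self.send_line = send_line  # callable that writes one raw IRC line
        self.joined = set()         # empty again after every restart

    def handle(self, data: bytes) -> None:
        parsed = parse_packet(data)
        if parsed is None:
            return  # malformed packets are dropped silently
        channel, text = parsed
        if channel not in self.joined:
            # Lazy join: the channel only comes into existence here.
            self.send_line(f"JOIN {channel}")
            self.joined.add(channel)
        self.send_line(f"PRIVMSG {channel} :{text}")


def open_udp_socket(host: str, port: int) -> socket.socket:
    """Bind the listening socket, exiting the process if that fails."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        sock.close()
        raise SystemExit(f"could not bind UDP socket on {host}:{port}: {exc}")
    return sock
```

Under these assumptions, `len(relay.joined)` after a restart equals the number of distinct wikis that have sent at least one change since then, which would fit the "139 channels" observation; a quiet wiki has no channel until its next edit.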