[00:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160121T0000).
[00:00:04] urandom RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:29] * urandom is available!
[00:01:02] Reedy: down with log noise!
[00:01:20] greg-g: It's possibly only wikitech... And depends how long we're staying on .10
[00:01:22] and mine patch too! :P
[00:02:44] (03PS1) 10Dduvall: Revert "Rollback labswiki and labtestwiki to 1.27.0-wmf.9" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265429
[00:03:09] Reedy: https://gerrit.wikimedia.org/r/#/c/265425/
[00:03:29] AaronSchulz: we have far too much shit in our repos
[00:04:05] OK I'll do the SWAT
[00:04:13] (03CR) 10Dduvall: [C: 032] Revert "Rollback labswiki and labtestwiki to 1.27.0-wmf.9" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265429 (owner: 10Dduvall)
[00:04:24] Or... not
[00:04:32] RoanKattouw: uno momento
[00:04:44] but we did roll back
[00:04:51] I have a meeting
[00:04:59] I can do the SWAT but only at :30
[00:05:02] * greg-g nods
[00:05:06] (03Merged) 10jenkins-bot: Revert "Rollback labswiki and labtestwiki to 1.27.0-wmf.9" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265429 (owner: 10Dduvall)
[00:05:10] there probably won't be much to do due to the rollback
[00:05:11] Sounds like that might work out
[00:05:21] Wait, wmf NIE?
[00:05:26] *NINE
[00:05:35] Oh, labswiki
[00:05:37] for labswiki
[00:05:41] RoanKattouw_away: Revert
[00:05:52] yes, and that
[00:06:03] 21:32 thcipriani: reverted group1 wikis to 1.27.0-wmf.10 due to session errors.
[00:06:03] meh, I see
[00:06:06] I guess the eventbus one can go out, since it's only going to test wikis anywho
[00:06:09] We still have group0 on wmf11 though right?
[00:06:15] right
[00:06:23] OK
[00:06:25] (I want a freaking dashboard for this)
[00:06:27] Good enough for me
[00:06:38] greg-g: +5
[00:06:41] https://noc.wikimedia.org/conf/
[00:06:45] Currently active MediaWiki versions: 1.27.0-wmf.10, 1.27.0-wmf.11
[00:06:58] .....
[00:07:02] when was that added?
[00:07:05] Ages ago
[00:07:10] Literally, AGES ago
[00:07:22] well then
[00:07:23] greg-g: I got sick of people asking
[00:07:23] when we used svn
[00:07:28] RECOVERY - MariaDB Slave Lag: s5 on db1026 is OK: OK slave_sql_lag Seconds_Behind_Master: 0
[00:07:48] mutante, wrong - CVS ;)
[00:08:03] https://github.com/wikimedia/operations-mediawiki-config/commit/fd94140cf1ad681856d4023380562d14487dc047
[00:08:07] greg-g: Nearly 2 years ago
[00:08:14] Oh, no
[00:08:24] Yes
[00:08:24] https://github.com/wikimedia/operations-mediawiki-config/commit/09ddd03a8a6f27d012f5fc8f8f316014d9903d4f
[00:08:33] 2 days away from being 2 years ago
[00:08:45] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1950585 (10mmodell) @dzahn: awesome, thanks!
[00:08:54] oh, it's actually newer than i thought
[00:09:13] heh
[00:09:18] 2 years is hardly new though :)
[00:10:10] it does help, but I'd love a "which wikis are on which version" answer dashboard, but that's for another rant
[00:11:19] greg-g: thats what you have Reedy for >.>
[00:11:43] reedy@tin:/srv/mediawiki-staging/multiversion$ ./activeMWVersions --withdb
[00:11:43] 1.27.0-wmf.10=piwiktionary 1.27.0-wmf.11=mediawikiwiki
[00:13:41] greg-g: It'd be relatively easy
[00:13:47] Where would it live?
[00:16:05] https://www.mediawiki.org/wiki/Special:SiteMatrix ?
[00:17:26] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/).
[00:18:02] Is the scap waiting for Roan?
[00:18:07] s/scap/swat/
[00:18:41] greg-g, Reedy: we could make a tool in tools labs pretty easily I think. The data file we need is fetchable from either git or noc.
[00:19:34] I guess you could do it purely in javascript or PHP
[00:19:54] yeah, it's just a json blob these days
[00:20:26] Reedy: dunno re: where it'd live, some cool single use webpage? ;)
[00:26:15] (03CR) 10Gehel: [C: 031] "Looks good to me (but what do I know)." [puppet] - 10https://gerrit.wikimedia.org/r/265427 (https://phabricator.wikimedia.org/T120843) (owner: 10EBernhardson)
[00:28:45] greg-g, Reedy: quick and dirty POC -- https://tools.wmflabs.org/bd808-test/versions.php
[00:29:08] https://noc.wikimedia.org/conf/highlight.php?file=wikiversions.json
[00:29:09] Ctrl + F
[00:29:10] :P
[00:29:35] I wonder which way it's wanted... Like that?
[00:29:39] Or for a version, list the dbnames
[00:29:40] Or both
[00:31:02] Reedy: insert line break, then it's grep-able already
[00:32:49] bd808: yeah, I don't know how, but I'd love a simpler version
[00:33:13] from a whiteboard of mine way back in the day: https://commons.wikimedia.org/wiki/File:PersonalDashboard_v1.jpg
[00:33:13] * Reedy slaps greg-g
[00:33:33] mutante: once tin is running HHVM we can put line breaks back in!
[00:33:37] where those blue boxes around the wiki names are expanding on-click to show which wikis
[00:33:46] bd808: lolol
[00:33:51] Progress is being made again, at least
[00:34:08] nearly there I think
[00:34:31] Well, soon as tin goes offline for reinstall, we can start changing stuff
[00:34:49] If only we'd decided what version of PHP we should bump to...
[00:34:54] greg-g: ah. yeah that but ... yesterday's config blows up the idea of 3 nice buckets a bit
[00:35:03] yeah....
[00:35:20] MaxSem: yes, your understanding is correct about what I was trying to do :)
[00:35:22] solution: get rid of the buckets
[00:35:23] what are the trend lines showing?
[00:35:34] warning/errors per group
[00:35:43] *nod*
[00:35:49] bblack, will take a look today
[00:36:10] MaxSem: well, mostly. We really want every purge of $1.wikipedia.org/X to purge $1.m.wikipedia.org/X
[00:36:10] eg
[00:36:21] https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor-group1https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor-group1
[00:36:23] it's not really specific to the /wiki/Foo URLs
[00:36:25] bah https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor-group1
[00:36:36] MaxSem: thanks!
[00:36:51] greg-g: or iterate over `git log --since 1w --follow wikiversions.json` and render changes as a timeline?
[00:37:00] bleh can't type, but you get the idea I'm sure
[00:37:18] marxarelli: interesting
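A minimal sketch of the dashboard idea kicked around above (bd808's versions.php POC, plus Reedy's "for a version, list the dbnames"). Hypothetical Python rather than the POC's PHP; it assumes wikiversions.json is a flat JSON object mapping each dbname to a version string like "php-1.27.0-wmf.10", which may not be the exact schema noc serves:

```python
# Sketch only: group wikis by MediaWiki version from noc's wikiversions.json.
# Assumes a flat {dbname: "php-1.27.0-wmf.NN"} map; the real schema may differ.
import json
from collections import defaultdict
from urllib.request import urlopen

URL = 'https://noc.wikimedia.org/conf/wikiversions.json'

with urlopen(URL) as resp:
    wikiversions = json.load(resp)

by_version = defaultdict(list)
for dbname, version in wikiversions.items():
    by_version[version].append(dbname)

for version, dbnames in sorted(by_version.items()):
    print('%s (%d wikis): %s' % (version, len(dbnames), ' '.join(sorted(dbnames))))
```

Rendering `git log --follow wikiversions.json` as a timeline, per marxarelli's suggestion, would layer the deployment history on top of the same data file.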
[00:37:45] but... what the heck, this doesn't look good still: https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor-group1
[00:42:16] ^ ? plus there is a hhvm warning about dropped sessions
[00:42:30] marxarelli: are you sure that's group1?
[00:42:41] "Recursion detected in RequestContext::getLanguage in /srv/mediawiki/php-1.27.0-wmf.11/includes/context/RequestContext.php on line 351"
[00:42:50] as in, current group1?
[00:42:53] it shouldn't be
[00:42:53] query : wiki:*wiktionary OR wiki:*wikinews OR wiki:*wikibooks OR wiki:*wikiquote OR wiki:*wikisource OR wiki:*wikinews OR wiki:*wikiversity OR _missing_: wiki
[00:43:15] * greg-g scratches his head
[00:43:37] greg-g, since RoanKattouw_away is away, I can do the swat?
[00:44:15] yessir, the eventbus one looks easy enough
[00:44:19] the recursion stuff at least says wmf11
[00:44:22] marxarelli: "Recursion detected" is sadly normal
[00:44:44] and "Failed to write session data (user). Please verify that the current setting of session.save_path is correct () in /srv/mediawiki/php-1.27.0-wmf.11/includes/session/SessionManager.php on line 588"
[00:44:44] the other is wmf11 as well
[00:44:57] see https://phabricator.wikimedia.org/T124126#1950584 about that
[00:44:58] right. doesn't make sense
[00:45:24] MaxSem: and I guess the echo one, too
[00:45:39] hhvm doesn't report which wiki it happened on, so it gets caught by _missing_
[00:45:41] and the geodata one? :P
[00:45:59] * greg-g reloads page
[00:46:15] ah, i see
[00:46:49] MaxSem: sure
[00:46:57] tgr: ahhhh
[00:47:01] (re _missing_)
[00:47:04] that's an unfortunate bit of missing log data
[00:47:31] I'm going to restart the other 2 logstash services because I'm still seeing some January 2015 data trickle in
[00:47:57] marxarelli: those are hhvm pooping itself. no place to capture the wiki id from
[00:48:12] (03CR) 10MaxSem: [C: 032] Enable EventBus extension on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265142 (https://phabricator.wikimedia.org/T116786) (owner: 10Eevans)
[00:48:33] brace yourself, urandom
[00:49:00] (03Merged) 10jenkins-bot: Enable EventBus extension on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265142 (https://phabricator.wikimedia.org/T116786) (owner: 10Eevans)
[00:49:05] urandom: about?
[00:49:27] greg-g: aye
[00:49:34] good :)
[00:49:49] greg-g: how come?
[00:49:53] the swat?
[00:50:14] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/265142/ (duration: 00m 32s)
[00:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:50:22] urandom: yeah, eventbus
[00:50:24] urandom, please test:)
[00:51:24] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[00:51:26] is testwiki and test2wiki among those that have already gotten wmf.11?
[00:52:04] urandom, yes: https://noc.wikimedia.org/conf/highlight.php?file=wikiversions.json
[00:52:08] yep
[00:54:16] ebernhardson, I'm observing 13 155 of /srv/mediawiki/php-1.27.0-wmf.9/vendor/ruflin/elastica/lib/Elastica/Transport/Http.php: Operation timed out
[00:54:30] "13" means not very severe :)
[00:55:38] .9?
[00:56:51] !log maxsem@tin Synchronized php-1.27.0-wmf.11/extensions/GeoData/: https://gerrit.wikimedia.org/r/#/c/265409/ (duration: 00m 33s)
[00:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:58:32] ebernhardson, GD logging is live on wmf11 ^^^
[00:58:41] urandom, how are we looking?
[01:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160121T0100).
[01:01:02] MaxSem: still looking
[01:01:17] but nothing horrible yet
[01:01:49] (03CR) 10MaxSem: [C: 032] Enable Echo cross-wiki tracking table on all wikis with CentralAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265395 (https://phabricator.wikimedia.org/T124232) (owner: 10Catrope)
[01:02:40] (03Merged) 10jenkins-bot: Enable Echo cross-wiki tracking table on all wikis with CentralAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265395 (https://phabricator.wikimedia.org/T124232) (owner: 10Catrope)
[01:02:46] such confidence
[01:03:07] * RoanKattouw apologizes for flaking out
[01:03:09] My meeting went long
[01:03:53] alright, I need to head out
[01:04:06] !log maxsem@tin Synchronized wmf-config/: https://gerrit.wikimedia.org/r/#/c/265395/ (duration: 00m 32s)
[01:04:08] now that Roan's here he can deal with the echo thing, so I think we're all covered
[01:04:10] RoanKattouw, ^^^
[01:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:04:29] MaxSem: be sure to test your own swat :P
[01:05:15] greg-g, I have ebernhardson for that! :p
[01:05:39] oh whew, checks and balances, important
[01:05:57] * MaxSem looks outside and sees no nuclear fireballs on east
[01:06:03] LGTM!
[01:06:57] MaxSem: cool
[01:07:14] EventBus seems OK
[01:09:08] 6operations, 5Patch-For-Review: reinstall bast4001 with jessie - https://phabricator.wikimedia.org/T123674#1950843 (10Dzahn)
[01:09:10] 6operations: Switch ganglia aggregator init stuff to systemd on jessie - https://phabricator.wikimedia.org/T96842#1950844 (10Dzahn)
[01:10:07] 6operations: Switch ganglia aggregator init stuff to systemd on jessie - https://phabricator.wikimedia.org/T96842#1227390 (10Dzahn) duplicate of T124197 ?
[01:10:22] 6operations: Port Ganglia aggregator setup to systemd - https://phabricator.wikimedia.org/T124197#1948822 (10Dzahn) duplicate of T96842 ?
[01:10:25] MaxSem: those timeouts are a bit odd, and all on labtestwiki
[01:10:48] 6operations: Port Ganglia aggregator setup to systemd - https://phabricator.wikimedia.org/T124197#1950855 (10Dzahn)
[01:10:50] 6operations: Make services manageable by systemd (tracking) - https://phabricator.wikimedia.org/T97402#1950854 (10Dzahn)
[01:11:19] 6operations: Port Ganglia aggregator setup to systemd - https://phabricator.wikimedia.org/T124197#1948822 (10Dzahn)
[01:11:21] 6operations, 5Patch-For-Review: reinstall bast4001 with jessie - https://phabricator.wikimedia.org/T123674#1950856 (10Dzahn)
[01:11:48] 6operations: Reinstall magnesium with jessie - https://phabricator.wikimedia.org/T123713#1950858 (10Dzahn) a:3Dzahn
[01:11:50] doesn't look related to cluster, failures for both eqiad and codfw
[01:12:35] ebernhardson, isn't there some special network exception to allow silver to connect to cirrussearch?
[01:12:51] maybe andrewbogott forgot to copy that for labtestweb2001
[01:13:02] Krenair: oh that would make sense.
[01:13:13] is labtestwiki new as well? never heard of it before :)
[01:13:17] yes
[01:13:31] it lives on a silver-like machine called labtestweb2001
[01:13:40] modules/role/manifests/elasticsearch/server.pp: srange => '(($INTERNAL @resolve(silver.wikimedia.org)))',
[01:13:42] yeah, that'll be it
[01:13:51] andrewbogott: ^
[01:14:36] !log Restarted logstash on logstash1002
[01:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:14:56] 6operations, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1950862 (10Dzahn) maybe we could go ahead with the same setup we had on gallium, or let's make an actual blocker for the net...
[01:15:19] 6operations, 10Continuous-Integration-Infrastructure, 10netops, 5Continuous-Integration-Scaling: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1950863 (10Dzahn)
[01:15:40] !log Restarted logstash on logstash1003
[01:15:43] ebernhardson: Krenair want me to patch?
[01:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:15:47] 6operations, 10Continuous-Integration-Infrastructure, 10netops, 5Continuous-Integration-Scaling: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1204862 (10Dzahn) added @netops please specify which VLAN to use for cobalt
[01:16:01] YuviPanda: yes would be appreciated, basically just needs another ferm line for wherever labtestwiki is
[01:16:53] kk
[01:18:10] 6operations: Reinstall caesium with jessie - https://phabricator.wikimedia.org/T123714#1950882 (10Dzahn) a:3Dzahn
[01:18:38] 6operations: Reinstall caesium with jessie (and convert to VM) - https://phabricator.wikimedia.org/T123714#1950883 (10Dzahn)
[01:18:44] (03PS1) 10Yuvipanda: elasticsearch: Add ferm rule for labtestweb2001 [puppet] - 10https://gerrit.wikimedia.org/r/265436
[01:19:07] 6operations: Reinstall caesium with jessie (and convert to VM) - https://phabricator.wikimedia.org/T123714#1936383 (10Dzahn) This is releases.wikimedia.org . Totally agree it's a candidate for a ganeti VM. I'm going to request a machine for it.
[01:22:28] (03CR) 10EBernhardson: [C: 031] elasticsearch: Add ferm rule for labtestweb2001 [puppet] - 10https://gerrit.wikimedia.org/r/265436 (owner: 10Yuvipanda)
[01:23:05] (03CR) 10Yuvipanda: [C: 032] elasticsearch: Add ferm rule for labtestweb2001 [puppet] - 10https://gerrit.wikimedia.org/r/265436 (owner: 10Yuvipanda)
[01:23:11] ebernhardson: ok, I’ll add that
[01:23:47] 6operations: request VM for releases.wm.org - https://phabricator.wikimedia.org/T124261#1950905 (10Dzahn) 3NEW a:3Dzahn
[01:24:10] andrewbogott: labtestweb? I already merged patch
[01:24:20] for cirrus? great.
[01:25:09] 6operations, 10vm-requests: request VM for releases.wm.org - https://phabricator.wikimedia.org/T124261#1950905 (10Dzahn)
[01:27:28] 6operations, 10vm-requests: request VM for releases.wm.org - https://phabricator.wikimedia.org/T124261#1950922 (10Dzahn)
[01:27:58] 6operations, 10vm-requests: request VM for releases.wm.org - https://phabricator.wikimedia.org/T124261#1950923 (10Dzahn) a:5Dzahn>3None
[01:28:42] ebernhardson, apocalypse: 1393 Avro failed to serialize record for CirrusSearchRequestSet : {"payload":{"tookMs":"Expected string, but recieved integer"}} in /srv/mediawiki/php-1.27.0-wmf.10/includes/debug/logger/monolog/AvroFormatter.php on line 97
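The serialization failure above is a schema/type mismatch: the CirrusSearchRequestSet Avro schema evidently declares tookMs as a string while the log record carried an integer. A toy, hand-rolled illustration of that kind of strict check — not the real Avro library or the AvroFormatter code, and the one-field schema is assumed from the error text:

```python
# Toy strict type validation in the spirit of Avro; the schema below is
# inferred from the error message, not taken from CirrusSearchRequestSet.
SCHEMA = {'tookMs': str}

def serialize(record, schema=SCHEMA):
    for field, expected in schema.items():
        value = record.get(field)
        if not isinstance(value, expected):
            raise TypeError('%s: Expected %s, but received %s' % (
                field, expected.__name__, type(value).__name__))
    return record  # a real serializer would binary-encode here

serialize({'tookMs': '123'})  # fine: the schema wants a string
serialize({'tookMs': 123})    # raises, like the logged failure above
```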
[01:28:56] 6operations, 10vm-requests: request VM for releases.wm.org - https://phabricator.wikimedia.org/T124261#1950905 (10Dzahn) @Mark should we have a second one of this in codfw as well? so we can switch over DCs and still have releases.wm.org up ?
[01:31:58] MaxSem: those should be fixed in the latest code, we pushed out a fix yesterday :S
[01:32:05] looking..
[01:32:09] pushed?
[01:32:14] swatt'd
[01:32:21] to wmf10 too?
[01:32:31] lemme poke on tin and see..
[01:33:58] MaxSem: yes wmf.10 and wmf.11 have fix
[01:34:41] 6operations: Migrate nitrogen to jessie - https://phabricator.wikimedia.org/T123732#1950963 (10Dzahn) this is role::ipv6relay / miredo What kind of requirements does the IPv6 relay have? I installed "nload" to see how much it's used network-wise.
[01:35:24] MaxSem: where do you see them? on fluorine `grep AvroForm /a/mw-log/hhvm.log` only reports log lines from several days ago
[01:35:37] (which seems odd since they rotate...)
[01:37:09] Jan 20 09:25:53 mw1237: #012Notice: Undefined index: 1 in /srv/mediawiki/php-1.27.0-wmf.11/extensions/VisualEditor/VisualEditor.hooks.php on line 83
[01:37:10] Jan 19 15:50:59 mw1217: message repeated 217 times: [ #012Notice: Avro failed to serialize record for CirrusSearchRequestSet : {"payload":{"tookMs":"Expected string, but recieved integer"}} in /srv/mediawiki/php-1.27.0-wmf.10/includes/debug/logger/monolog/AvroFormatter.php on line 97]
[01:37:10] Jan 20 09:08:46 mw1064: message repeated 2 times: [ #012Fatal error: Stack overflow in /srv/mediawiki/php-1.27.0-wmf.11/includes/libs/objectcache/MemcachedBagOStuff.php on line 177]
[01:37:29] we have some bogus crap coming late, in other words
[01:37:36] ok :)
[01:41:00] 6operations, 10Wikimedia-General-or-Unknown: Job queue is growing and growing - https://phabricator.wikimedia.org/T124194#1950987 (10Peachey88)
[01:44:03] 6operations, 6Performance-Team, 10Wikimedia-General-or-Unknown, 5Patch-For-Review, and 2 others: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#1950990 (10ori)
[01:44:40] 6operations: Migrate nitrogen to jessie - https://phabricator.wikimedia.org/T123732#1950998 (10Dzahn) incoming and outgoing about 3MBit/s on average over a couple minutes
[01:47:34] !log nitrogen - install package upgrades
[01:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:50:39] MaxSem: I'd love to know where those syslog events get buffered for days before showing up on fluorine and in logstash
[01:50:49] it's an ongoing mystery
[01:54:47] bd808, some algorithm that produces "message repeated 217 times"?
[01:55:31] that bit is built into rsyslog but there should be a max accumulation time as well which is something like 5 minutes max
[01:55:57] * bd808 should look that up again
[02:05:24] i wonder how many of the 4 redis boxes could be down at a time
[02:05:40] to reinstall them that is
[02:06:07] (03PS1) 10Ori.livneh: Job runners: Add a dedicated htmlCacheUpdate runner [puppet] - 10https://gerrit.wikimedia.org/r/265438 (https://phabricator.wikimedia.org/T123815)
[02:08:15] (03PS2) 10Ori.livneh: Job runners: Add a dedicated htmlCacheUpdate runner [puppet] - 10https://gerrit.wikimedia.org/r/265438 (https://phabricator.wikimedia.org/T123815)
[02:10:00] (03CR) 10Aaron Schulz: [C: 031] Job runners: Add a dedicated htmlCacheUpdate runner [puppet] - 10https://gerrit.wikimedia.org/r/265438 (https://phabricator.wikimedia.org/T123815) (owner: 10Ori.livneh)
[02:10:56] (03PS3) 10Ori.livneh: Job runners: Add a dedicated htmlCacheUpdate runner [puppet] - 10https://gerrit.wikimedia.org/r/265438 (https://phabricator.wikimedia.org/T123815)
[02:11:06] (03CR) 10Ori.livneh: [C: 032 V: 032] Job runners: Add a dedicated htmlCacheUpdate runner [puppet] - 10https://gerrit.wikimedia.org/r/265438 (https://phabricator.wikimedia.org/T123815) (owner: 10Ori.livneh)
[02:14:04] bd808: I’m right now messing with broken puppet on stashbot-deploy… I don’t suppose you’d like to fix it so I don't have to?
[02:14:33] andrewbogott: actually we can just shoot those instances in the head
[02:14:47] bd808: delete the project too?
[02:14:57] I've migrated to other servers in the tools project and just not shut things down yet
[02:15:10] andrewbogott: yeah kill it all with fire
[02:15:15] great, will do. Thanks
[02:16:14] PROBLEM - puppet last run on ms-be2012 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:16:45] !log Restarting jobrunner service on job runners to ensure I180856917 gets picked up
[02:16:46] done
[02:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:17:33] andrewbogott: cool. Hopefully I won't remember tomorrow why I hadn't cleaned it all up yet. ;)
[02:24:04] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:24:22] !log citoid deploying 3a1b6c8648
[02:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:24:37] wth mobileapps?
[02:26:32] bd808: andrewbogott yay, less self hosted puppetmasters!
[02:26:34] checked mobileapps, all looks good to me
[02:26:53] the checker script concurs
[02:27:03] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 09m 33s)
[02:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:28:23] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[02:30:18] 6operations, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: Job queue is growing and growing - https://phabricator.wikimedia.org/T124194#1951071 (10ori) The bulk of the jobs in the queue are htmlCacheUpdate jobs, and @aaron suspects that the runners are slow to clear the backlog because the [[ https://gi...
[02:36:51] 6operations, 10Gerrit, 10GitHub-Mirrors, 10ValueView, and 2 others: [Bug] ValueView GitHub mirror not updated any more - https://phabricator.wikimedia.org/T123521#1951094 (10JanZerebecki) I just checked with `git ls-remote` and the github mirror still does not have `refs/meta/config`, which means syncing i...
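On bd808's "message repeated 217 times" question above: that collapsing is rsyslog's repeated-message reduction. A toy Python rendition of the idea — not rsyslog's actual implementation, and it ignores the time-based flush interval bd808 mentions:

```python
def reduce_repeats(lines):
    """Collapse consecutive duplicate log lines, roughly what rsyslog's
    repeated-message reduction does (toy model, no time-based flush)."""
    prev, repeats = None, 0
    for line in lines:
        if line == prev:
            repeats += 1
            continue
        if repeats:
            yield 'message repeated %d times: [ %s ]' % (repeats, prev)
        yield line
        prev, repeats = line, 0
    if repeats:
        yield 'message repeated %d times: [ %s ]' % (repeats, prev)

print(list(reduce_repeats(['a', 'b', 'b', 'b', 'c'])))
# -> ['a', 'b', 'message repeated 2 times: [ b ]', 'c']
```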
[02:41:53] RECOVERY - puppet last run on ms-be2012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:45:34] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:47:34] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[02:49:35] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.11) (duration: 09m 39s)
[02:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:51:50] (03PS1) 10Aude: Remove unused/no longer existing item-create oauth grant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265447
[02:56:45] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Jan 21 02:56:44 UTC 2016 (duration 7m 9s)
[02:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:57:04] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:59:13] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[03:01:43] (03CR) 10Alex Monk: [C: 031] [cirrus maint] redirect stderr to log and use full mwscript path [puppet] - 10https://gerrit.wikimedia.org/r/265427 (https://phabricator.wikimedia.org/T120843) (owner: 10EBernhardson)
[03:20:56] (03PS1) 10Andrew Bogott: Actively remove use of webproxy.eqiad.wmnet on labs [puppet] - 10https://gerrit.wikimedia.org/r/265451
[03:38:33] (03Abandoned) 10Andrew Bogott: Actively remove use of webproxy.eqiad.wmnet on labs [puppet] - 10https://gerrit.wikimedia.org/r/265451 (owner: 10Andrew Bogott)
[03:40:30] (03PS9) 10Subramanya Sastry: Add the visualdiff module + instantiate visualdiffing services [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778)
[03:58:53] PROBLEM - RAID on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:59:04] PROBLEM - nutcracker port on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:00:24] PROBLEM - puppet last run on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:01:14] PROBLEM - SSH on mw1162 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:05:35] RECOVERY - nutcracker port on mw1162 is OK: TCP OK - 0.002 second response time on port 11212
[04:08:23] PROBLEM - nutcracker process on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:14:24] PROBLEM - configured eth on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:15:14] PROBLEM - salt-minion processes on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:15:54] RECOVERY - SSH on mw1162 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0)
[04:15:54] RECOVERY - RAID on mw1162 is OK: OK: no RAID installed
[04:16:23] RECOVERY - configured eth on mw1162 is OK: OK - interfaces up
[04:16:44] RECOVERY - nutcracker process on mw1162 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[04:17:13] RECOVERY - salt-minion processes on mw1162 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[04:17:15] RECOVERY - puppet last run on mw1162 is OK: OK: Puppet is currently enabled, last run 38 minutes ago with 0 failures
[04:58:19] (03CR) 10Tim Starling: [WIP] Implement /w/static.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle)
[05:15:54] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[05:18:40] (03CR) 10Tim Starling: [WIP] Implement /w/static.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle)
[05:35:34] (03PS1) 10KartikMistry: Beta: Set ContentTranslationCorpora to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265458 (https://phabricator.wikimedia.org/T119617)
[05:36:43] PROBLEM - puppet last run on mw1133 is CRITICAL: CRITICAL: Puppet has 58 failures
[05:41:26] (03PS2) 10KartikMistry: Beta: Set ContentTranslationCorpora to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265458 (https://phabricator.wikimedia.org/T119617)
[05:43:24] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 228, down: 0, dormant: 0, excluded: 0, unused: 0
[05:46:40] (03PS1) 10KartikMistry: Enable ContentTranslationCorpora [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265459 (https://phabricator.wikimedia.org/T119617)
[05:55:45] PROBLEM - puppet last run on mw2106 is CRITICAL: CRITICAL: puppet fail
[05:56:04] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[06:02:43] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 228, down: 0, dormant: 0, excluded: 0, unused: 0
[06:04:14] RECOVERY - puppet last run on mw1133 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:23:24] RECOVERY - puppet last run on mw2106 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:28:03] PROBLEM - puppet last run on mw1138 is CRITICAL: CRITICAL: Puppet has 14 failures
[06:30:14] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[06:31:13] PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:43] PROBLEM - puppet last run on mc1017 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:43] PROBLEM - puppet last run on mw1026 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:53] PROBLEM - puppet last run on db1056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:53] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:53] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:04] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:04] PROBLEM - puppet last run on mw2020 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:14] PROBLEM - puppet last run on mw1139 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:23] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:24] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:53] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:53] PROBLEM - puppet last run on mw2141 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:54] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:03] PROBLEM - puppet last run on mw1203 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:13] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:24] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:38:14] PROBLEM - puppet last run on mw1133 is CRITICAL: CRITICAL: Puppet has 63 failures
[06:38:39] (03CR) 10Luke081515: [C: 031] "...but looks good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265413 (https://phabricator.wikimedia.org/T124234) (owner: 10Catrope)
[06:41:04] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: puppet fail
[06:41:13] PROBLEM - RAID on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:41:14] 6operations, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: Job queue is growing and growing - https://phabricator.wikimedia.org/T124194#1951238 (10Luke081515) Looks good now
[06:41:23] PROBLEM - SSH on mw1161 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:41:44] PROBLEM - puppet last run on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:04] PROBLEM - configured eth on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:35] PROBLEM - Disk space on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:43:05] PROBLEM - DPKG on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:45:35] PROBLEM - salt-minion processes on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:46:45] PROBLEM - dhclient process on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:46:45] PROBLEM - nutcracker port on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:47:23] PROBLEM - nutcracker process on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:47:43] RECOVERY - salt-minion processes on mw1161 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:48:53] RECOVERY - dhclient process on mw1161 is OK: PROCS OK: 0 processes with command name dhclient
[06:48:54] RECOVERY - nutcracker port on mw1161 is OK: TCP OK - 0.000 second response time on port 11212
[06:49:24] RECOVERY - nutcracker process on mw1161 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[06:49:24] RECOVERY - DPKG on mw1161 is OK: All packages OK
[06:53:13] RECOVERY - Disk space on mw1161 is OK: DISK OK
[06:54:04] PROBLEM - salt-minion processes on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:54:45] RECOVERY - puppet last run on mw1026 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:54:54] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:55:13] PROBLEM - nutcracker port on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:55:13] PROBLEM - dhclient process on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:55:15] RECOVERY - puppet last run on mw2020 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:55:23] RECOVERY - puppet last run on mw1139 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:55:34] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[06:55:44] PROBLEM - nutcracker process on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:55:45] PROBLEM - DPKG on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:56:03] RECOVERY - puppet last run on mw2141 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:04] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:56:13] RECOVERY - puppet last run on mw1203 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:56:33] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:56:34] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:53] RECOVERY - puppet last run on mc1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:04] RECOVERY - puppet last run on db1056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:04] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:24] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:35] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:04] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:25] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:01:44] RECOVERY - nutcracker port on mw1161 is OK: TCP OK - 0.000 second response time on port 11212
[07:01:44] RECOVERY - dhclient process on mw1161 is OK: PROCS OK: 0 processes with command name dhclient
[07:04:03] PROBLEM - Disk space on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:34] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[07:08:04] PROBLEM - nutcracker port on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:08:05] PROBLEM - dhclient process on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:10:14] RECOVERY - dhclient process on mw1161 is OK: PROCS OK: 0 processes with command name dhclient
[07:10:50] Wikimedia: a tale of salvation and woe
[07:19:04] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 228, down: 0, dormant: 0, excluded: 0, unused: 0
[07:20:53] RECOVERY - Disk space on mw1161 is OK: DISK OK
[07:21:03] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[07:23:53] PROBLEM - Apache HTTP on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:25:03] PROBLEM - dhclient process on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:25:04] PROBLEM - HHVM rendering on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:26:05] PROBLEM - Check size of conntrack table on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:26:13] PROBLEM - puppet last run on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:26:14] PROBLEM - RAID on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:27:13] PROBLEM - Disk space on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:27:54] RECOVERY - Apache HTTP on mw1116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.425 second response time
[07:28:04] RECOVERY - Check size of conntrack table on mw1116 is OK: OK: nf_conntrack is 4 % full
[07:28:13] RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 34 minutes ago with 0 failures
[07:28:14] RECOVERY - RAID on mw1116 is OK: OK: no RAID installed
[07:29:14] RECOVERY - HHVM rendering on mw1116 is OK: HTTP OK: HTTP/1.1 200 OK - 69958 bytes in 0.905 second response time
[07:31:34] RECOVERY - dhclient process on mw1161 is OK: PROCS OK: 0 processes with command name dhclient
[07:33:53] RECOVERY - Disk space on mw1161 is OK: DISK OK
[07:34:43] PROBLEM - puppet last run on mw1116 is CRITICAL: CRITICAL: Puppet has 2 failures
[07:37:54] PROBLEM - dhclient process on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:40:23] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[07:45:04] RECOVERY - salt-minion processes on mw1161 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:46:25] PROBLEM - Disk space on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:48:25] RECOVERY - Disk space on mw1161 is OK: DISK OK
[07:49:03] RECOVERY - nutcracker process on mw1161 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[07:51:25] PROBLEM - salt-minion processes on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
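The "Check size of conntrack table" results above ("nf_conntrack is 4 % full") compare the kernel's connection-tracking count to its configured maximum. A sketch of that kind of check — hypothetical Python, not the NRPE plugin actually deployed, though the /proc files are the standard kernel interface:

```python
def conntrack_percent():
    # Standard kernel counters for the connection-tracking table.
    with open('/proc/sys/net/netfilter/nf_conntrack_count') as f:
        count = int(f.read())
    with open('/proc/sys/net/netfilter/nf_conntrack_max') as f:
        limit = int(f.read())
    return 100.0 * count / limit

pct = conntrack_percent()
status = 'OK' if pct < 80 else 'CRITICAL'  # the 80% threshold is illustrative
print('%s: nf_conntrack is %d %% full' % (status, pct))
```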
[07:53:24] RECOVERY - salt-minion processes on mw1161 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:53:24] RECOVERY - SSH on mw1161 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0)
[07:53:34] RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:54:04] RECOVERY - configured eth on mw1161 is OK: OK - interfaces up
[07:54:34] RECOVERY - nutcracker port on mw1161 is OK: TCP OK - 0.000 second response time on port 11212
[07:54:35] RECOVERY - dhclient process on mw1161 is OK: PROCS OK: 0 processes with command name dhclient
[07:55:13] RECOVERY - DPKG on mw1161 is OK: All packages OK
[07:55:15] RECOVERY - RAID on mw1161 is OK: OK: no RAID installed
[07:58:04] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:59:33] 6operations, 10Traffic, 5Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#1951270 (10elukey) @BBlack: I didn't see the "Additionally, the following parameters are available as part of our commercial subscription:" before the directive, in the oth...
[08:01:34] <_joe_> !log upgrading all codfw appserver layer's kernel to linux-image-3.13.0-76-generic
[08:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:04:41] (03CR) 10Giuseppe Lavagetto: [C: 032] instrumentation: fixup for Ib0b3c139a [debs/pybal] - 10https://gerrit.wikimedia.org/r/264088 (owner: 10Giuseppe Lavagetto)
[08:05:53] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 228, down: 0, dormant: 0, excluded: 0, unused: 0
[08:06:59] (03CR) 10Hoo man: [C: 031] "Fine to deploy at any time, right is unused in Wikibase." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265447 (owner: 10Aude)
[08:08:31] (03Merged) 10jenkins-bot: instrumentation: fixup for Ib0b3c139a [debs/pybal] - 10https://gerrit.wikimedia.org/r/264088 (owner: 10Giuseppe Lavagetto)
[08:09:13] PROBLEM - puppet last run on mw1015 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[08:11:23] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100%
[08:12:03] PROBLEM - Host mw2022 is DOWN: PING CRITICAL - Packet loss = 100%
[08:12:14] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[08:12:20] <_joe_> that's me rebooting them, but this is not good
[08:12:33] PROBLEM - Host mw2049 is DOWN: PING CRITICAL - Packet loss = 100%
[08:12:43] PROBLEM - Host mw2015 is DOWN: PING CRITICAL - Packet loss = 100%
[08:13:05] <_joe_> ok, let's stop here
[08:13:43] PROBLEM - Host mw2155 is DOWN: PING CRITICAL - Packet loss = 100%
[08:13:53] PROBLEM - Host mw2014 is DOWN: PING CRITICAL - Packet loss = 100%
[08:13:53] PROBLEM - Host mw2114 is DOWN: PING CRITICAL - Packet loss = 100%
[08:14:13] PROBLEM - Host mw2100 is DOWN: PING CRITICAL - Packet loss = 100%
[08:14:14] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 228, down: 0, dormant: 0, excluded: 0, unused: 0
[08:14:33] PROBLEM - Host mw2041 is DOWN: PING CRITICAL - Packet loss = 100%
[08:14:44] PROBLEM - Host mw2141 is DOWN: PING CRITICAL - Packet loss = 100%
[08:15:33] PROBLEM - Host mw2144 is DOWN: PING CRITICAL - Packet loss = 100%
[08:15:43] PROBLEM - Host mw2177 is DOWN: PING CRITICAL - Packet loss = 100%
[08:16:53] PROBLEM - Host mw2159 is DOWN: PING CRITICAL - Packet loss = 100%
[08:16:53] PROBLEM - Host mw2043 is DOWN: PING CRITICAL - Packet loss = 100%
[08:17:49] ACKNOWLEDGEMENT - Host mw2014 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Issues with the latest ubuntu kernel
[08:17:49] ACKNOWLEDGEMENT - Host mw2015 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Issues with the latest ubuntu kernel
[08:17:49] ACKNOWLEDGEMENT - Host mw2022 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Issues with the latest ubuntu kernel
[08:17:49] ACKNOWLEDGEMENT - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Issues with the latest ubuntu kernel
[08:17:49] ACKNOWLEDGEMENT - Host mw2041 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Issues with the latest ubuntu kernel
[08:17:49] ACKNOWLEDGEMENT - Host mw2043 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Issues with the latest ubuntu kernel
[08:17:49] ACKNOWLEDGEMENT - Host mw2049 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Issues with the latest ubuntu kernel
[08:17:50] ACKNOWLEDGEMENT - Host mw2100 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Issues with the latest ubuntu kernel
[08:17:50] ACKNOWLEDGEMENT - Host mw2114 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Issues with the latest ubuntu kernel
[08:17:51] ACKNOWLEDGEMENT - Host mw2141 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Issues with the latest ubuntu kernel
[08:17:51] ACKNOWLEDGEMENT - Host mw2144 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Issues with the latest ubuntu kernel
[08:17:52] ACKNOWLEDGEMENT - Host mw2155 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Issues with the latest ubuntu kernel
[08:19:33] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw2159.codfw.wmnet because of too many down!
[08:19:46] <_joe_> wow, it works :)
[08:19:59] <_joe_> (the pybal check, I mean)
[08:20:24] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw2159.codfw.wmnet because of too many down!
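The "Could not depool server ... because of too many down!" alerts reflect PyBal's depool-threshold safeguard: a failing backend is left pooled rather than shrinking the pool below a configured fraction. A simplified model of that behaviour — hypothetical Python, not PyBal's actual internals, and the threshold semantics and pool contents are assumptions:

```python
def try_depool(server, pooled, pool_size, depool_threshold=0.5):
    """Depool `server` only if at least depool_threshold of the full pool
    would stay in rotation; the threshold semantics here are illustrative."""
    if (len(pooled) - 1) < pool_size * depool_threshold:
        print('Could not depool server %s because of too many down!' % server)
        return pooled
    return [s for s in pooled if s != server]

# With half of a hypothetical 4-server pool already down, depooling one
# more is refused (mw2160 is a made-up pool member for the example):
pool = ['mw2159.codfw.wmnet', 'mw2160.codfw.wmnet']
print(try_depool('mw2159.codfw.wmnet', pool, pool_size=4))
```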
[08:22:03] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 36.93 ms
[08:23:34] 6operations, 10Continuous-Integration-Infrastructure, 10netops, 5Continuous-Integration-Scaling: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1951302 (10akosiaris) So, gallium is in `public1-b-eqiad` (208.80.154.128/26). The story behind a public IP is a...
[08:24:35] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy
[08:25:53] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy
[08:26:15] (03CR) 10Alexandros Kosiaris: [C: 032] redis_monitoring.py: easy pep8 fixes [puppet] - 10https://gerrit.wikimedia.org/r/265374 (owner: 10Chad)
[08:26:19] (03PS2) 10Alexandros Kosiaris: redis_monitoring.py: easy pep8 fixes [puppet] - 10https://gerrit.wikimedia.org/r/265374 (owner: 10Chad)
[08:26:25] (03CR) 10Alexandros Kosiaris: [V: 032] redis_monitoring.py: easy pep8 fixes [puppet] - 10https://gerrit.wikimedia.org/r/265374 (owner: 10Chad)
[08:27:37] (03PS2) 10Alexandros Kosiaris: toolschecker.py: 1 minor pep8 fix [puppet] - 10https://gerrit.wikimedia.org/r/265373 (owner: 10Chad)
[08:27:43] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] toolschecker.py: 1 minor pep8 fix [puppet] - 10https://gerrit.wikimedia.org/r/265373 (owner: 10Chad)
[08:29:13] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[08:29:53] RECOVERY - Host mw2022 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms
[08:31:05] (03PS2) 10Alexandros Kosiaris: PacketLossLogtailer.py: pep8 fixes, mostly line too long [puppet] - 10https://gerrit.wikimedia.org/r/265315 (owner: 10Chad)
[08:31:11] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] PacketLossLogtailer.py: pep8 fixes, mostly line too long [puppet] - 10https://gerrit.wikimedia.org/r/265315 (owner: 10Chad)
[08:32:03] RECOVERY - Host mw2049 is UP: PING OK - Packet loss = 0%, RTA = 37.25 ms
[08:33:24] RECOVERY - Host mw2015 is UP: PING OK - Packet loss = 0%, RTA = 36.99 ms
[08:35:44] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 228, down: 0, dormant: 0, excluded: 0, unused: 0
[08:38:24] RECOVERY - Host mw2155 is UP: PING OK - Packet loss = 0%, RTA = 36.73 ms
[08:38:49] (03PS1) 10Hoo man: Restore s5 DB configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265466
[08:40:14] RECOVERY - Host mw2014 is UP: PING OK - Packet loss = 0%, RTA = 36.67 ms
[08:41:23] RECOVERY - Host mw2114 is UP: PING OK - Packet loss = 0%, RTA = 37.21 ms
[08:42:44] RECOVERY - Host mw2100 is UP: PING OK - Packet loss = 0%, RTA = 37.48 ms
[08:43:14] RECOVERY - Host mw2141 is UP: PING OK - Packet loss = 0%, RTA = 36.85 ms
[08:44:43] RECOVERY - Host mw2041 is UP: PING OK - Packet loss = 0%, RTA = 36.44 ms
[08:46:04] RECOVERY - Host mw2144 is UP: PING OK - Packet loss = 0%, RTA = 36.29 ms
[08:46:14] RECOVERY - Host mw2177 is UP: PING OK - Packet loss = 0%, RTA = 37.30 ms
[08:47:14] RECOVERY - Host mw2159 is UP: PING OK - Packet loss = 0%, RTA = 36.28 ms
[08:49:03] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /a 325818 MB (3% inode=99%)
[08:49:13] RECOVERY - Host mw2043 is UP: PING OK - Packet loss = 0%, RTA = 36.62 ms
[08:50:33] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[08:50:43] PROBLEM - Host mw2022 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:43] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[08:56:33] PROBLEM - nutcracker process on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:57:14] PROBLEM - puppet last run on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:57:34] PROBLEM - DPKG on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:57:44] PROBLEM - SSH on mw1162 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:57:53] PROBLEM - RAID on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:58:43] RECOVERY - nutcracker process on mw1162 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[08:59:43] RECOVERY - DPKG on mw1162 is OK: All packages OK
[09:01:24] PROBLEM - salt-minion processes on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:01:45] RECOVERY - Disk space on stat1002 is OK: DISK OK
[09:01:46] 6operations, 10Traffic, 5Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#1951327 (10elukey) Also I believe I got the wrong meaning of the max_fails directive, that does not mean "retry for" but jus "consider this backend unavailable if x request...
[09:01:54] RECOVERY - Host mw2022 is UP: PING OK - Packet loss = 0%, RTA = 36.60 ms
[09:01:58] 6operations, 10Traffic: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#1951328 (10elukey)
[09:03:53] PROBLEM - Disk space on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:06:13] PROBLEM - DPKG on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:06:53] PROBLEM - puppet last run on mw2022 is CRITICAL: CRITICAL: puppet fail
[09:07:43] RECOVERY - salt-minion processes on mw1162 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:09:24] PROBLEM - nutcracker process on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:10:13] RECOVERY - Disk space on mw1162 is OK: DISK OK
[09:10:53] PROBLEM - nutcracker port on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:11:04] RECOVERY - puppet last run on mw2022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:12:33] RECOVERY - DPKG on mw1162 is OK: All packages OK
[09:15:23] PROBLEM - configured eth on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:15:44] RECOVERY - nutcracker process on mw1162 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[09:17:04] PROBLEM - RAID on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:18:13] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[09:18:14] PROBLEM - salt-minion processes on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:18:54] PROBLEM - DPKG on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:19:33] RECOVERY - configured eth on mw1162 is OK: OK - interfaces up
[09:20:44] PROBLEM - Disk space on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:21:03] RECOVERY - DPKG on mw1162 is OK: All packages OK
[09:21:23] RECOVERY - nutcracker port on mw1162 is OK: TCP OK - 0.018 second response time on port 11212
[09:22:04] PROBLEM - nutcracker process on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:22:44] RECOVERY - Disk space on mw1162 is OK: DISK OK
[09:24:24] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 228, down: 0, dormant: 0, excluded: 0, unused: 0
[09:25:54] PROBLEM - configured eth on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:26:43] RECOVERY - salt-minion processes on mw1162 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:27:11] <_joe_> !log powercycled mw1162, memory exhaustion
[09:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:27:23] PROBLEM - DPKG on mw1162 is CRITICAL: Timeout while attempting connection
[09:27:31] heh I was logging in
[09:28:29] <_joe_> oh sorry
[09:28:39] not at all
[09:28:40] <_joe_> I tried twice, no luck, so I decided to powercycle
[09:29:05] <_joe_> I have bricked 10 codfw appservers this morning, so spent a good part of the morning in the consoles :P
[09:29:15] oops
[09:29:24] RECOVERY - SSH on mw1162 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0)
[09:29:43] RECOVERY - RAID on mw1162 is OK: OK: no RAID installed
[09:29:49] 6operations, 10Analytics, 10Analytics-Cluster, 10Traffic: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#1951353 (10elukey)
[09:30:04] RECOVERY - configured eth on mw1162 is OK: OK - interfaces up
[09:30:34] RECOVERY - nutcracker process on mw1162 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[09:31:13] 6operations: Switch ganglia aggregator init stuff to systemd on jessie - https://phabricator.wikimedia.org/T96842#1951356 (10faidon)
[09:31:15] 6operations: Port Ganglia aggregator setup to systemd - https://phabricator.wikimedia.org/T124197#1951357 (10faidon)
[09:31:23] RECOVERY - puppet last run on mw1162 is OK: OK: Puppet is currently enabled, last run 52 minutes ago with 0 failures
[09:31:35] RECOVERY - DPKG on mw1162 is OK: All packages OK
[09:34:30] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951371 (10faidon) @mmodell can you please fix IPv6 instead or explain why it is difficult to do so? FWIW, IPv6 penetration is > 10% globally and...
[09:35:20] 6operations, 10Beta-Cluster-Infrastructure, 6Labs, 10Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1951375 (10hashar)
[09:40:14] PROBLEM - HHVM rendering on mw2054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:41:34] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[09:42:14] RECOVERY - HHVM rendering on mw2054 is OK: HTTP OK: HTTP/1.1 200 OK - 69936 bytes in 3.644 second response time
[09:43:48] 6operations, 10Beta-Cluster-Infrastructure, 6Labs, 10Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1951388 (10faidon) >>! In T50501#527689, @Krinkle wrote: > Would it be an option to flatten our subdomains? > > We'd only need b...
[09:50:03] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 228, down: 0, dormant: 0, excluded: 0, unused: 0
[09:53:39] <_joe_> !log rolling reboot of the codfw appserver layer
[09:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:54:01] (03PS2) 10Hoo man: Restore s5 DB configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265466
[09:54:23] PROBLEM - Host mw2049 is DOWN: PING CRITICAL - Packet loss = 100%
[09:54:43] PROBLEM - Host mw2015 is DOWN: PING CRITICAL - Packet loss = 100%
[09:55:33] RECOVERY - Host mw2015 is UP: PING OK - Packet loss = 0%, RTA = 36.28 ms
[09:55:43] RECOVERY - Host mw2049 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms
[09:57:55] PROBLEM - Apache HTTP on mw2114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:00:03] RECOVERY - Apache HTTP on mw2114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.181 second response time
[10:02:53] PROBLEM - Host mw2100 is DOWN: PING CRITICAL - Packet loss = 100%
[10:03:20] (03CR) 10Jcrespo: [C: 031] Restore s5 DB configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265466 (owner: 10Hoo man)
[10:03:54] RECOVERY - Host mw2100 is UP: PING OK - Packet loss = 0%, RTA = 36.64 ms
[10:05:13] PROBLEM - Host mw2043 is DOWN: PING CRITICAL - Packet loss = 100%
[10:05:44] RECOVERY - Host mw2043 is UP: PING OK - Packet loss = 0%, RTA = 36.47 ms
[10:06:01] (03PS1) 10Faidon Liambotis: Drain codfw for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/265469
[10:06:04] PROBLEM - Apache HTTP on mw2037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:06:30] (03CR) 10Jcrespo: [C: 032] Restore s5 DB configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265466 (owner: 10Hoo man)
[10:06:49] (03CR) 10Faidon Liambotis: [C: 032] Drain codfw for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/265469 (owner: 10Faidon Liambotis)
[10:07:03] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[10:07:14] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[10:08:03] RECOVERY - Apache HTTP on mw2037 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.601 second response time [10:09:54] PROBLEM - HHVM rendering on mw2157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:10:05] PROBLEM - HHVM rendering on mw2195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:10:25] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Restore s5 DB configuration (duration: 01m 57s) [10:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:10:56] !log mw2098.codfw.wmnet failed to sync [10:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:11:48] <_joe_> jynus: ouch, I'm rebooting servers, should I stop? [10:11:53] no problem [10:11:54] RECOVERY - HHVM rendering on mw2157 is OK: HTTP OK: HTTP/1.1 200 OK - 69959 bytes in 2.014 second response time [10:12:04] RECOVERY - HHVM rendering on mw2195 is OK: HTTP OK: HTTP/1.1 200 OK - 69958 bytes in 0.346 second response time [10:12:09] let me know when you are done with this so I can sync it manually [10:12:12] <_joe_> that's why you'd have 2-4 servers down at a time in codfw [10:12:23] PROBLEM - Host mw2098 is DOWN: PING CRITICAL - Packet loss = 100% [10:12:25] with this specific server, I mean [10:12:42] <_joe_> I guess mw2098 will come back in a couple of minutes [10:12:44] (I suppose I can see it myself :-)) [10:13:16] <_joe_> seems like it's having troubles rebooting [10:14:05] PROBLEM - HHVM rendering on mw2030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:20] <_joe_> yes, actually I seem unable to reach the console too :/ [10:14:43] PROBLEM - Host mw2073 is DOWN: PING CRITICAL - Packet loss = 100% [10:14:43] PROBLEM - Host mw2048 is DOWN: PING CRITICAL - Packet loss = 100% [10:15:14] 10Ops-Access-Requests, 6operations: Create new puppet group `discovery-analytics-deploy` - https://phabricator.wikimedia.org/T122620#1951433 (10faidon) Nitpick: is it possible to name this perhaps something else than "discovery-analytics-deploy"? What if others outside of the Discovery department want access t... [10:16:03] PROBLEM - Apache HTTP on mw2164 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:16:04] RECOVERY - Host mw2073 is UP: PING OK - Packet loss = 0%, RTA = 37.01 ms [10:16:05] RECOVERY - HHVM rendering on mw2030 is OK: HTTP OK: HTTP/1.1 200 OK - 69959 bytes in 2.565 second response time [10:16:14] RECOVERY - Host mw2048 is UP: PING OK - Packet loss = 0%, RTA = 36.67 ms [10:16:58] 6operations, 10Continuous-Integration-Infrastructure, 10netops, 5Continuous-Integration-Scaling: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1951435 (10hashar) `gallium.wikimedia.org` has a bunch of services which are exposed publicly via the misc-web v...
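The `jynus@tin Synchronized wmf-config/db-eqiad.php` entry above is the log line scap's sync-file emits; a sketch of the corresponding invocation on the deployment host, assuming the staging checkout lives at /srv/mediawiki-staging:

    cd /srv/mediawiki-staging
    sync-file wmf-config/db-eqiad.php 'Restore s5 DB configuration'

Hosts that are mid-reboot, like mw2098 here, simply time out and are reported as failed to sync, which is why jynus plans to sync that one manually once the reboots finish.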
[10:17:54] RECOVERY - Apache HTTP on mw2164 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.362 second response time [10:19:21] <_joe_> !log mw2098 doesn't reboot, console unreachable [10:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:19:29] <_joe_> jynus: ^^ [10:19:35] PROBLEM - HHVM rendering on mw2140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:19:35] I'm reopening T85286 [10:19:41] _joe_, [10:20:21] <_joe_> yeah, makes sense [10:21:34] RECOVERY - HHVM rendering on mw2140 is OK: HTTP OK: HTTP/1.1 200 OK - 69959 bytes in 2.708 second response time [10:22:00] did you know if it was actually rebooted by you or was simply dead in the first place? [10:22:08] <_joe_> I rebooted it, yes [10:22:12] thanks [10:22:14] <_joe_> well, technically salt did [10:22:18] :-) [10:23:08] 6operations, 5codfw-rollout, 3codfw-rollout-Apr-Jun-2015: install/deploy codfw appservers - https://phabricator.wikimedia.org/T85227#1951450 (10jcrespo) [10:23:10] 6operations, 10ops-codfw: mw2098 non-responsive to mgmt - https://phabricator.wikimedia.org/T85286#1951448 (10jcrespo) 5Resolved>3Open This happened again, I am reopening this because I believe this could be related. Maybe consider, if it is under guarantee to report a faulty DRAC. To be more clear, the c... [10:29:54] PROBLEM - Apache HTTP on mw2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:30:13] PROBLEM - HHVM rendering on mw2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:32:05] RECOVERY - Apache HTTP on mw2010 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.590 second response time [10:32:23] RECOVERY - HHVM rendering on mw2010 is OK: HTTP OK: HTTP/1.1 200 OK - 69959 bytes in 3.006 second response time [10:33:13] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951457 (10Reedy) >>! In T100519#1951371, @faidon wrote: > @mmodell can you please fix IPv6 instead or explain why it is difficult to do so? FWIW... [10:33:23] PROBLEM - Apache HTTP on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:33:23] PROBLEM - HHVM rendering on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:33:24] PROBLEM - configured eth on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:33:34] PROBLEM - SSH on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:33:34] PROBLEM - dhclient process on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:33:34] PROBLEM - nutcracker port on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:33:35] PROBLEM - RAID on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:34:14] PROBLEM - nutcracker process on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:34:14] PROBLEM - DPKG on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:34:14] PROBLEM - HHVM processes on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:34:14] PROBLEM - Check size of conntrack table on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:34:53] PROBLEM - salt-minion processes on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:35:04] PROBLEM - Disk space on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
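As _joe_ says, the reboots are driven through Salt rather than by hand. A minimal sketch of a batched rolling reboot, assuming minion IDs match an mw2*.codfw.wmnet glob; the batch size is what keeps only a few servers down at any moment, as seen above:

    # reboot the codfw appservers at most 4 at a time; -b limits concurrency
    salt -b 4 'mw2*.codfw.wmnet' system.reboot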
[10:36:15] RECOVERY - HHVM processes on mw1133 is OK: PROCS OK: 6 processes with command name hhvm [10:36:15] RECOVERY - nutcracker process on mw1133 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [10:36:23] RECOVERY - DPKG on mw1133 is OK: All packages OK [10:37:03] PROBLEM - Host mw2169 is DOWN: PING CRITICAL - Packet loss = 100% [10:37:43] RECOVERY - configured eth on mw1133 is OK: OK - interfaces up [10:37:43] RECOVERY - SSH on mw1133 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0) [10:37:44] RECOVERY - Host mw2169 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [10:37:44] RECOVERY - nutcracker port on mw1133 is OK: TCP OK - 0.000 second response time on port 11212 [10:37:44] RECOVERY - dhclient process on mw1133 is OK: PROCS OK: 0 processes with command name dhclient [10:38:23] RECOVERY - Check size of conntrack table on mw1133 is OK: OK: nf_conntrack is 0 % full [10:38:54] RECOVERY - salt-minion processes on mw1133 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:39:13] RECOVERY - Disk space on mw1133 is OK: DISK OK [10:39:25] PROBLEM - puppet last run on mw2137 is CRITICAL: CRITICAL: puppet fail [10:39:53] RECOVERY - RAID on mw1133 is OK: OK: no RAID installed [10:41:13] PROBLEM - Host mw2127 is DOWN: PING CRITICAL - Packet loss = 100% [10:41:34] RECOVERY - puppet last run on mw2137 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [10:41:44] RECOVERY - Host mw2127 is UP: PING OK - Packet loss = 0%, RTA = 37.58 ms [10:45:24] PROBLEM - Apache HTTP on mw2059 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.286 second response time [10:46:08] all that are you, _joe_? [10:47:33] RECOVERY - Apache HTTP on mw2059 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.803 second response time [10:48:23] (03PS4) 10Giuseppe Lavagetto: lvs: use etcd for pybal config for ulsfo backups [puppet] - 10https://gerrit.wikimedia.org/r/263847 [10:48:44] PROBLEM - Host mw2023 is DOWN: PING CRITICAL - Packet loss = 100% [10:49:10] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: url-downloader should be set up more redundantly - https://phabricator.wikimedia.org/T122134#1951465 (10faidon) I'm not sure how good would LVS with two VMs would do — all the VM hosts are in the same availability zone (row), after all. We either need... 
[10:50:03] RECOVERY - Host mw2023 is UP: PING OK - Packet loss = 0%, RTA = 36.82 ms [10:50:05] PROBLEM - HHVM rendering on mw2040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:50:14] PROBLEM - HHVM rendering on mw2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:50:36] 6operations, 10Traffic: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#1951466 (10ema) a:3ema [10:52:05] RECOVERY - HHVM rendering on mw2040 is OK: HTTP OK: HTTP/1.1 200 OK - 69910 bytes in 2.896 second response time [10:52:13] RECOVERY - HHVM rendering on mw2023 is OK: HTTP OK: HTTP/1.1 200 OK - 69910 bytes in 2.035 second response time [10:55:42] 6operations, 10Traffic: Forward-port Varnish 3 patches to Varnish 4 - https://phabricator.wikimedia.org/T124277#1951469 (10ema) 3NEW a:3ema [10:56:37] 6operations, 10Traffic: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#1951476 (10ema) 3NEW [10:57:44] PROBLEM - HHVM rendering on mw2072 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:59:45] RECOVERY - HHVM rendering on mw2072 is OK: HTTP OK: HTTP/1.1 200 OK - 69910 bytes in 3.664 second response time [10:59:57] 6operations, 10Traffic: Forward-port VCL to Varnish 4 - https://phabricator.wikimedia.org/T124279#1951482 (10ema) 3NEW [11:00:13] PROBLEM - Host mw2147 is DOWN: PING CRITICAL - Packet loss = 100% [11:01:24] RECOVERY - Host mw2147 is UP: PING OK - Packet loss = 0%, RTA = 36.41 ms [11:01:24] RECOVERY - puppet last run on mw1133 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:02:04] PROBLEM - Apache HTTP on mw2176 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:04:03] RECOVERY - Apache HTTP on mw2176 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.463 second response time [11:04:43] PROBLEM - Host mw2166 is DOWN: PING CRITICAL - Packet loss = 100% [11:05:44] RECOVERY - Host mw2166 is UP: PING OK - Packet loss = 0%, RTA = 37.02 ms [11:05:45] PROBLEM - HHVM rendering on mw2166 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:06:01] (03PS3) 10Alex Monk: Move portals into generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/264978 (https://phabricator.wikimedia.org/T86644) [11:07:44] RECOVERY - HHVM rendering on mw2166 is OK: HTTP OK: HTTP/1.1 200 OK - 69906 bytes in 1.693 second response time [11:08:44] (03CR) 1020after4: [C: 031] "This is awesome." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [11:08:53] PROBLEM - Host mw2097 is DOWN: PING CRITICAL - Packet loss = 100% [11:09:08] !log adding new version of mariadb to carbon for jessie (10.0.23-1) [11:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:09:34] PROBLEM - HHVM rendering on mw2207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:10:14] RECOVERY - Host mw2097 is UP: PING OK - Packet loss = 0%, RTA = 36.26 ms [11:11:04] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951490 (10mmodell) @faidon: I don't have any idea how to fix ipv6. I have zero experience with the systems involved and I don't even have ipv6... 
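carbon is the install/apt server, so the "adding new version of mariadb to carbon" log entry above means importing packages into its reprepro-managed repository. A sketch, with the distribution name and .deb filename assumed rather than taken from the log:

    # import the new build into the jessie distribution of the repo (filename hypothetical)
    reprepro -C main includedeb jessie-wikimedia mariadb-server-10.0_10.0.23-1_amd64.deb
    # verify what is now published
    reprepro list jessie-wikimedia | grep -i mariadb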
[11:11:34] RECOVERY - HHVM rendering on mw2207 is OK: HTTP OK: HTTP/1.1 200 OK - 69906 bytes in 2.271 second response time [11:12:08] (03PS1) 10DCausse: Recycle completion suggester indices for small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265472 [11:12:29] (03CR) 10jenkins-bot: [V: 04-1] Recycle completion suggester indices for small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265472 (owner: 10DCausse) [11:12:53] PROBLEM - Host mw2118 is DOWN: PING CRITICAL - Packet loss = 100% [11:14:13] PROBLEM - HHVM rendering on mw2191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:14:15] (03PS2) 10DCausse: Recycle completion suggester indices for small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265472 [11:14:23] RECOVERY - Host mw2118 is UP: PING OK - Packet loss = 0%, RTA = 36.70 ms [11:14:43] (03PS4) 10Alex Monk: Move portals into generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/264978 (https://phabricator.wikimedia.org/T86644) [11:16:13] RECOVERY - HHVM rendering on mw2191 is OK: HTTP OK: HTTP/1.1 200 OK - 69906 bytes in 2.324 second response time [11:16:48] 6operations: Migrate pool counters to trusty/jessie - https://phabricator.wikimedia.org/T123734#1951498 (10akosiaris) for one of the pool counters, it probably makes sense to move it to a ganeti VM. requirements are minimal for a pool counter anyway, it's a perfect candidate for virtualization. I am saying for o... [11:17:11] <_joe_> akosiaris: no :) [11:17:13] PROBLEM - Host mw2034 is DOWN: PING CRITICAL - Packet loss = 100% [11:17:44] RECOVERY - Host mw2034 is UP: PING OK - Packet loss = 0%, RTA = 36.31 ms [11:19:23] 6operations: Migrate carbon to jessie - https://phabricator.wikimedia.org/T123733#1951500 (10akosiaris) Note the 1.5TB current usage for carbon (and another 6 TB free). carbon has a software RAID5 specifically to have enough disk space for its role. The plan seems fine, just take into consideration the storage... [11:19:40] 6operations: Migrate pool counters to trusty/jessie - https://phabricator.wikimedia.org/T123734#1951503 (10Joe) @akosiaris I think that's a bad idea - a single poolcounter server dying still causes unavailability (notwithstanding the mitigations we tried to create with T105378). I'd say until T105378 is resolved... [11:20:47] (03CR) 10Giuseppe Lavagetto: [C: 032] lvs: use etcd for pybal config for ulsfo backups [puppet] - 10https://gerrit.wikimedia.org/r/263847 (owner: 10Giuseppe Lavagetto) [11:20:52] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951505 (10faidon) >>! In T100519#1951490, @mmodell wrote: > @faidon: I don't have any idea how to fix ipv6. I have zero experience with the sys... [11:20:56] <_joe_> let's go! [11:21:12] _joe_: ? no to what ? [11:21:25] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [24.0] [11:22:03] PROBLEM - HHVM rendering on mw2067 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:22:11] <_joe_> akosiaris: poolcounter on a VM [11:22:18] why not ?
[11:22:27] <_joe_> see my comment on the ticket [11:22:46] <_joe_> sorry, trying to activate etcd on a prod pybal now [11:24:03] RECOVERY - HHVM rendering on mw2067 is OK: HTTP OK: HTTP/1.1 200 OK - 69906 bytes in 3.428 second response time [11:24:53] (03PS2) 10Jcrespo: Adding new parsercache machines (pc[12]00[4-6]) [puppet] - 10https://gerrit.wikimedia.org/r/265473 (https://phabricator.wikimedia.org/T121879) [11:25:07] <_joe_> !log restarting pybal on lvs4004, switching to etcd [11:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:26:37] (03CR) 10Alexandros Kosiaris: [C: 032] url_downloader: Add port as a hierable parameter [puppet] - 10https://gerrit.wikimedia.org/r/191379 (owner: 10Alexandros Kosiaris) [11:26:43] (03PS5) 10Alexandros Kosiaris: url_downloader: Add port as a hierable parameter [puppet] - 10https://gerrit.wikimedia.org/r/191379 [11:26:46] (03CR) 10Alexandros Kosiaris: [V: 032] url_downloader: Add port as a hierable parameter [puppet] - 10https://gerrit.wikimedia.org/r/191379 (owner: 10Alexandros Kosiaris) [11:27:27] <_joe_> !log restarting pybal on lvs4003, switching to etcd [11:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:27:45] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [11:28:04] 6operations, 10Traffic: Create separate packages for required vmods - https://phabricator.wikimedia.org/T124281#1951515 (10ema) 3NEW a:3ema [11:29:12] grep: modules/admin/files/home/akosiaris/.my.cnf: No such file or directory [11:29:15] PROBLEM - Apache HTTP on mw2142 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.247 second response time [11:29:23] every time I grep for anything in the puppet repo :( [11:29:44] PROBLEM - Apache HTTP on mw2046 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.272 second response time [11:30:04] PROBLEM - Host mw2039 is DOWN: PING CRITICAL - Packet loss = 100% [11:31:00] Krenair: git grep [11:31:33] with the extra bonus of being faster [11:31:34] RECOVERY - Apache HTTP on mw2142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.242 second response time [11:32:03] RECOVERY - Apache HTTP on mw2046 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.253 second response time [11:33:35] PROBLEM - HHVM rendering on mw2060 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.152 second response time [11:34:14] (03PS3) 10Jcrespo: Adding new parsercache machines (pc[12]00[4-6]) [puppet] - 10https://gerrit.wikimedia.org/r/265473 (https://phabricator.wikimedia.org/T121879) [11:34:41] and I was looking for that :) [11:35:08] 6operations, 10ops-codfw, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: rack/setup pc2004-2006 - https://phabricator.wikimedia.org/T121879#1951530 (10faidon) [11:35:14] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: url-downloader should be set up more redundantly - https://phabricator.wikimedia.org/T122134#1951532 (10akosiaris) Yes the single row right now comment is obviously true. Still, there is the rack level availability zone issue as well. And we 've had c... [11:35:16] <_joe_> paravoid: into what? 
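Krenair's error above appears because plain grep walks the working tree and trips over entries that exist only as broken symlinks (the .my.cnf in question presumably points at private material that is not in the public repo); `git grep`, as bblack suggests, searches tracked content only and is faster besides. A sketch with an arbitrary search pattern:

    # plain grep descends into everything, including dangling symlinks:
    grep -r 'poolcounter' .      # -> grep: modules/admin/.../.my.cnf: No such file or directory
    # git grep searches only what git tracks:
    git grep 'poolcounter' -- modules/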
[11:35:31] the parsercache task [11:35:39] there are 2 [11:35:46] <_joe_> oh for, not into, I thought you somehow were looking at the codfw appservers [11:35:47] see gerrit [11:35:53] RECOVERY - HHVM rendering on mw2060 is OK: HTTP OK: HTTP/1.1 200 OK - 69906 bytes in 8.082 second response time [11:36:28] it took me also some time to find those tasks, looking in my "assigned" is no longer useful [11:36:47] (03CR) 10Jcrespo: [C: 032] Adding new parsercache machines (pc[12]00[4-6]) [puppet] - 10https://gerrit.wikimedia.org/r/265473 (https://phabricator.wikimedia.org/T121879) (owner: 10Jcrespo) [11:37:24] PROBLEM - HHVM rendering on mw2135 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.155 second response time [11:37:44] PROBLEM - HHVM rendering on mw2139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:39:34] RECOVERY - HHVM rendering on mw2135 is OK: HTTP OK: HTTP/1.1 200 OK - 69906 bytes in 4.090 second response time [11:39:44] RECOVERY - HHVM rendering on mw2139 is OK: HTTP OK: HTTP/1.1 200 OK - 69906 bytes in 2.706 second response time [11:40:24] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: puppet fail [11:41:31] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951540 (10mmodell) @faidon: I was only summarizing the discussion we (myself, @reedy, @dzahn and @chasemp) had in IRC. Please don't shoot the me... [11:41:54] PROBLEM - HHVM rendering on mw2193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:42:04] PROBLEM - HHVM rendering on mw2123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:42:34] twentyafterfour: not my intention to shoot anyone! :) [11:44:03] RECOVERY - HHVM rendering on mw2193 is OK: HTTP OK: HTTP/1.1 200 OK - 69906 bytes in 2.422 second response time [11:44:13] RECOVERY - HHVM rendering on mw2123 is OK: HTTP OK: HTTP/1.1 200 OK - 69906 bytes in 2.534 second response time [11:45:04] PROBLEM - Host mw2122 is DOWN: PING CRITICAL - Packet loss = 100% [11:45:44] PROBLEM - puppet last run on pc1004 is CRITICAL: CRITICAL: Puppet has 2 failures [11:45:44] PROBLEM - HHVM rendering on mw2179 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.284 second response time [11:46:03] RECOVERY - Host mw2122 is UP: PING OK - Packet loss = 0%, RTA = 37.04 ms [11:46:20] (03PS5) 10Alex Monk: Move portals into generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/264978 (https://phabricator.wikimedia.org/T86644) [11:46:24] PROBLEM - puppet last run on pc2004 is CRITICAL: CRITICAL: Puppet has 2 failures [11:48:04] RECOVERY - HHVM rendering on mw2179 is OK: HTTP OK: HTTP/1.1 200 OK - 69906 bytes in 8.716 second response time [11:48:44] PROBLEM - Host mw2117 is DOWN: PING CRITICAL - Packet loss = 100% [11:48:44] PROBLEM - Host mw2204 is DOWN: PING CRITICAL - Packet loss = 100% [11:49:24] RECOVERY - Host mw2204 is UP: PING OK - Packet loss = 0%, RTA = 36.45 ms [11:49:53] PROBLEM - Host mw2092 is DOWN: PING CRITICAL - Packet loss = 100% [11:50:15] 6operations, 10ops-codfw: mw2039 fails to reboot, mgmt interface unreachable - https://phabricator.wikimedia.org/T124282#1951543 (10Joe) 3NEW [11:50:24] RECOVERY - Host mw2117 is UP: PING OK - Packet loss = 0%, RTA = 37.21 ms [11:50:34] RECOVERY - Host mw2092 is UP: PING OK - Packet loss = 0%, RTA = 38.03 ms [11:51:19] ACKNOWLEDGEMENT - Host mw2039 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto T124282 [11:52:23] 
ACKNOWLEDGEMENT - Host mw2098 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto T85286 [11:54:34] PROBLEM - HHVM rendering on mw2101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:56:24] PROBLEM - puppet last run on pc2005 is CRITICAL: CRITICAL: Puppet has 2 failures [11:56:43] RECOVERY - HHVM rendering on mw2101 is OK: HTTP OK: HTTP/1.1 200 OK - 69906 bytes in 2.094 second response time [11:56:53] PROBLEM - Host mw2131 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:24] PROBLEM - Host mw2110 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:35] PROBLEM - puppet last run on pc2006 is CRITICAL: CRITICAL: Puppet has 2 failures [11:58:13] RECOVERY - Host mw2131 is UP: PING OK - Packet loss = 0%, RTA = 36.76 ms [11:58:44] RECOVERY - Host mw2110 is UP: PING OK - Packet loss = 0%, RTA = 36.48 ms [12:02:00] PROBLEM - Apache HTTP on mw2202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:02:22] PROBLEM - NTP on mw2069 is CRITICAL: NTP CRITICAL: Offset unknown [12:02:51] PROBLEM - NTP on mw2108 is CRITICAL: NTP CRITICAL: Offset unknown [12:03:00] RECOVERY - Apache HTTP on mw2202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.479 second response time [12:03:01] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: Puppet has 2 failures [12:03:41] (03PS1) 10Jcrespo: Fixing db path for newer machines by adding a condition on the role [puppet] - 10https://gerrit.wikimedia.org/r/265479 [12:05:51] RECOVERY - NTP on mw2069 is OK: NTP OK: Offset -0.001298308372 secs [12:06:40] RECOVERY - NTP on mw2108 is OK: NTP OK: Offset 0.0001020431519 secs [12:06:44] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951552 (10Reedy) >>! In T100519#1951505, @faidon wrote: > In any case, please approach "X is broken and I don't know how to fix it" with "can so... [12:07:20] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:07:40] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951553 (10hashar) The DNS IPv6 entry has been dropped yesterday because there is no ssh service listening there to serve the git repositories.... 
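The "Offset unknown" NTP alerts above are typical of freshly rebooted hosts that have not resynchronized yet, and they clear on their own a few minutes later, as the recoveries show. To inspect an offset by hand, one could use ntpq locally or the stock monitoring plugin the Icinga check presumably wraps (the plugin path and thresholds here are assumptions):

    # peer list with offsets, run on the host itself
    ntpq -pn
    # remote check: warn at 0.5s offset, go critical at 1s
    /usr/lib/nagios/plugins/check_ntp_time -H mw2069.codfw.wmnet -w 0.5 -c 1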
[12:09:21] PROBLEM - puppet last run on pc1005 is CRITICAL: CRITICAL: Puppet has 2 failures [12:09:31] PROBLEM - Apache HTTP on mw2061 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.280 second response time [12:09:31] PROBLEM - Host mw2111 is DOWN: PING CRITICAL - Packet loss = 100% [12:09:31] PROBLEM - Host mw2119 is DOWN: PING CRITICAL - Packet loss = 100% [12:09:41] PROBLEM - HHVM rendering on mw2061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:10:11] RECOVERY - Host mw2111 is UP: PING OK - Packet loss = 0%, RTA = 37.38 ms [12:10:21] RECOVERY - Host mw2119 is UP: PING OK - Packet loss = 0%, RTA = 37.07 ms [12:10:31] PROBLEM - Apache HTTP on mw2111 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:10:45] !log upgrading cr1-codfw to JunOS 13.3R8.7 [12:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:11:41] RECOVERY - Apache HTTP on mw2061 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.860 second response time [12:11:51] RECOVERY - HHVM rendering on mw2061 is OK: HTTP OK: HTTP/1.1 200 OK - 33593 bytes in 3.203 second response time [12:12:31] RECOVERY - Apache HTTP on mw2111 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.487 second response time [12:17:51] PROBLEM - Apache HTTP on mw2143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:18:11] PROBLEM - Apache HTTP on mw2198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:18:20] PROBLEM - Apache HTTP on mw2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:20:00] RECOVERY - Apache HTTP on mw2143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.554 second response time [12:20:20] RECOVERY - Apache HTTP on mw2198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.443 second response time [12:20:21] RECOVERY - Apache HTTP on mw2019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.146 second response time [12:26:20] PROBLEM - HHVM rendering on mw2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:28:21] RECOVERY - HHVM rendering on mw2017 is OK: HTTP OK: HTTP/1.1 200 OK - 33593 bytes in 2.705 second response time [12:31:38] PROBLEM - HHVM rendering on mw2165 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:31:48] PROBLEM - NTP on mw2165 is CRITICAL: NTP CRITICAL: Offset unknown [12:31:48] PROBLEM - NTP on mw2205 is CRITICAL: NTP CRITICAL: Offset unknown [12:32:28] PROBLEM - Host mw2201 is DOWN: PING CRITICAL - Packet loss = 100% [12:32:48] RECOVERY - HHVM rendering on mw2165 is OK: HTTP OK: HTTP/1.1 200 OK - 33592 bytes in 0.280 second response time [12:33:18] RECOVERY - NTP on mw2165 is OK: NTP OK: Offset 0.00022149086 secs [12:33:18] RECOVERY - NTP on mw2205 is OK: NTP OK: Offset -0.001058459282 secs [12:33:19] RECOVERY - Host mw2201 is UP: PING OK - Packet loss = 0%, RTA = 38.39 ms [12:33:42] 6operations, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: Job queue is growing and growing - https://phabricator.wikimedia.org/T124194#1951586 (10Luke081515) 5Open>3Resolved a:3Luke081515 Seems like it is fixed now, (the queue needs still some time to make the backlog smaller) but categorysation... 
[12:33:59] 6operations, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: Job queue is growing and growing - https://phabricator.wikimedia.org/T124194#1951589 (10Luke081515) a:5Luke081515>3ori [12:34:11] 6operations, 10Wikimedia-General-or-Unknown: Job queue is growing and growing - https://phabricator.wikimedia.org/T124194#1948627 (10Luke081515) [12:37:08] PROBLEM - Host mw2196 is DOWN: PING CRITICAL - Packet loss = 100% [12:37:48] PROBLEM - HHVM rendering on mw2208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:37:48] RECOVERY - Host mw2196 is UP: PING OK - Packet loss = 0%, RTA = 36.56 ms [12:38:08] PROBLEM - Apache HTTP on mw2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:39:49] RECOVERY - HHVM rendering on mw2208 is OK: HTTP OK: HTTP/1.1 200 OK - 70157 bytes in 2.653 second response time [12:40:09] RECOVERY - Apache HTTP on mw2016 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.627 second response time [12:40:28] PROBLEM - Host mw2057 is DOWN: PING CRITICAL - Packet loss = 100% [12:40:28] PROBLEM - Host mw2068 is DOWN: PING CRITICAL - Packet loss = 100% [12:41:48] RECOVERY - Host mw2068 is UP: PING OK - Packet loss = 0%, RTA = 36.28 ms [12:41:49] RECOVERY - Host mw2057 is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms [12:41:59] PROBLEM - HHVM rendering on mw2181 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:43:04] 6operations, 10netops, 7Monitoring: Icinga monitoring for (Juniper MX480) routing engine status - https://phabricator.wikimedia.org/T124285#1951614 (10mark) 3NEW [12:43:59] RECOVERY - HHVM rendering on mw2181 is OK: HTTP OK: HTTP/1.1 200 OK - 70157 bytes in 2.354 second response time [12:44:03] 6operations, 10netops, 7Monitoring: Icinga monitoring for (Juniper MX480) routing engine status - https://phabricator.wikimedia.org/T124285#1951622 (10mark) p:5Triage>3Normal [12:44:48] 6operations, 10netops, 7Monitoring: Icinga monitoring for (Juniper MX480) routing engine status - https://phabricator.wikimedia.org/T124285#1951625 (10faidon) Note that Juniper raises a system (or chassis?) alarm when the RE is down, so a check for "show chassis alarms" and "show system alarms" (as described al... [12:44:58] PROBLEM - Host mw2182 is DOWN: PING CRITICAL - Packet loss = 100% [12:45:08] PROBLEM - Host mw2042 is DOWN: PING CRITICAL - Packet loss = 100% [12:45:48] RECOVERY - Host mw2182 is UP: PING OK - Packet loss = 0%, RTA = 36.29 ms [12:45:59] PROBLEM - HHVM rendering on mw2062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:46:08] RECOVERY - Host mw2042 is UP: PING OK - Packet loss = 0%, RTA = 36.33 ms [12:47:59] RECOVERY - HHVM rendering on mw2062 is OK: HTTP OK: HTTP/1.1 200 OK - 70157 bytes in 3.746 second response time [12:49:08] 6operations, 10Analytics-Cluster, 10EventBus, 6Services: Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters. - https://phabricator.wikimedia.org/T123954#1951634 (10Joe) I do think that we DEFINITELY want to relay events to active listeners in both datacenters. What we //don't...
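Faidon's comment on T124285 above points at "show chassis alarms" and "show system alarms" as the conditions to watch. One way such an Icinga check could be wired up is by running those commands over ssh and matching the clean output; everything below (the automation user, the exact OK string) is an assumption, not the eventual implementation:

    # Junos prints "No alarms currently active" when the chassis is clean
    if ssh -o BatchMode=yes monitor@cr1-eqiad.wikimedia.org 'show chassis alarms' \
        | grep -q 'No alarms currently active'; then
        echo 'OK: no chassis alarms'
    else
        echo 'CRITICAL: chassis alarm present'; exit 2
    fi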
[12:51:19] RECOVERY - puppet last run on mw1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:53:08] PROBLEM - Host mw2106 is DOWN: PING CRITICAL - Packet loss = 100% [12:54:19] PROBLEM - Apache HTTP on mw2163 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:54:39] RECOVERY - Host mw2106 is UP: PING OK - Packet loss = 0%, RTA = 36.25 ms [12:56:19] RECOVERY - Apache HTTP on mw2163 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.437 second response time [12:57:40] PROBLEM - Host mw2045 is DOWN: PING CRITICAL - Packet loss = 100% [12:57:59] RECOVERY - Host mw2045 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [12:57:59] 6operations, 10netops, 7Monitoring: Icinga monitoring for (Juniper MX480) routing engine status - https://phabricator.wikimedia.org/T124285#1951637 (10mark) [12:58:01] 6operations, 10netops, 7Monitoring: Juniper monitoring - https://phabricator.wikimedia.org/T83992#1951638 (10mark) [12:59:10] PROBLEM - Host cr1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [12:59:36] 6operations, 10Traffic: Forward-port VCL to Varnish 4 - https://phabricator.wikimedia.org/T124279#1951640 (10faidon) https://github.com/fgsch/varnish3to4 is pretty good. I spent a small amount of time (less than a half hour) at some point running this + manual changes against the upload VCL and I was successfu... [13:00:58] RECOVERY - Host cr1-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.31 ms [13:02:28] PROBLEM - Apache HTTP on mw2183 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:04:28] RECOVERY - Apache HTTP on mw2183 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.475 second response time [13:04:50] 6operations, 10Traffic: Forward-port VCL to Varnish 4 - https://phabricator.wikimedia.org/T124279#1951642 (10mark) And instead of inline C, we can consider using vmods too. [13:05:38] PROBLEM - Host mw2047 is DOWN: PING CRITICAL - Packet loss = 100% [13:05:59] RECOVERY - Host mw2047 is UP: PING OK - Packet loss = 0%, RTA = 37.74 ms [13:06:28] PROBLEM - Apache HTTP on mw2044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:08:19] RECOVERY - Apache HTTP on mw2044 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.192 second response time [13:12:38] PROBLEM - Host mw2184 is DOWN: PING CRITICAL - Packet loss = 100% [13:13:28] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
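The varnish3to4 script faidon mentions above mechanizes most of the VCL 3-to-4 renames (vcl_fetch -> vcl_backend_response, req.request -> req.method, error -> return (synth(...)), plus the mandatory leading "vcl 4.0;" marker), leaving the rest for hand review. A sketch of running it; the input filename is invented and the exact invocation should be checked against the project's README:

    git clone https://github.com/fgsch/varnish3to4
    ./varnish3to4/varnish3to4 upload-backend.inc.vcl > upload-backend.v4.vcl
    diff -u upload-backend.inc.vcl upload-backend.v4.vcl   # review what changed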
[13:13:29] RECOVERY - Host mw2184 is UP: PING OK - Packet loss = 0%, RTA = 36.61 ms [13:16:38] PROBLEM - Host mw2154 is DOWN: PING CRITICAL - Packet loss = 100% [13:17:29] RECOVERY - Host mw2154 is UP: PING OK - Packet loss = 0%, RTA = 36.83 ms [13:17:38] PROBLEM - HHVM rendering on mw2154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:17:49] PROBLEM - HHVM rendering on mw2171 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:19:39] RECOVERY - HHVM rendering on mw2154 is OK: HTTP OK: HTTP/1.1 200 OK - 70165 bytes in 2.371 second response time [13:19:50] RECOVERY - HHVM rendering on mw2171 is OK: HTTP OK: HTTP/1.1 200 OK - 70165 bytes in 2.371 second response time [13:20:21] <_joe_> !log rolling reboot of imagescalers, jobrunners in codfw [13:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:20:32] _joe_: <3 [13:21:06] <_joe_> paravoid: I'm just doing codfw, eqiad is a bit more challenging, but I guess I'll be done by tomorrow evening [13:21:49] PROBLEM - HHVM rendering on mw2174 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:49] RECOVERY - HHVM rendering on mw2174 is OK: HTTP OK: HTTP/1.1 200 OK - 70165 bytes in 2.427 second response time [13:23:58] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [13:24:48] PROBLEM - Host mw2107 is DOWN: PING CRITICAL - Packet loss = 100% [13:25:59] PROBLEM - Apache HTTP on mw2153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:26:19] RECOVERY - Host mw2107 is UP: PING OK - Packet loss = 0%, RTA = 36.89 ms [13:27:59] RECOVERY - Apache HTTP on mw2153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.179 second response time [13:30:18] PROBLEM - HHVM rendering on mw2095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:30:28] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:30:59] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 118, down: 2, dormant: 0, excluded: 0, unused: 0; xe-5/2/0: down - Core: cr1-codfw:xe-5/2/0 {#10695} [10Gbps DF]; ae0: down - Core: cr1-codfw:ae0 [13:31:08] PROBLEM - Host cr1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:32:30] RECOVERY - HHVM rendering on mw2095 is OK: HTTP OK: HTTP/1.1 200 OK - 70165 bytes in 2.178 second response time [13:32:38] RECOVERY - Host cr1-codfw is UP: PING OK - Packet loss = 0%, RTA = 38.12 ms [13:33:08] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [13:36:49] PROBLEM - Apache HTTP on mw2020 is CRITICAL: Connection refused [13:37:59] PROBLEM - HHVM rendering on mw2020 is CRITICAL: Connection refused [13:38:19] PROBLEM - RAID on mw2020 is CRITICAL: Connection refused by host [13:38:29] PROBLEM - configured eth on mw2020 is CRITICAL: Connection refused by host [13:38:49] PROBLEM - Check size of conntrack table on mw2020 is CRITICAL: Connection refused by host [13:38:59] PROBLEM - dhclient process on mw2020 is CRITICAL: Connection refused by host [13:39:19] PROBLEM - nutcracker port on mw2020 is CRITICAL: Connection refused by host [13:39:19] PROBLEM - DPKG on mw2020 is CRITICAL: Connection refused by host [13:39:38] PROBLEM - nutcracker process on mw2020 is CRITICAL: Connection refused by host [13:39:39] PROBLEM - Disk space on mw2020 is CRITICAL: Connection refused by host [13:39:49] PROBLEM - salt-minion processes on mw2020 is CRITICAL: Connection refused by host [13:40:19] PROBLEM - HHVM processes on mw2020 is CRITICAL: Connection refused by host [13:40:20] PROBLEM - puppet last run on mw2020 is CRITICAL: Connection refused by host [13:41:08] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [13:41:58] PROBLEM - Host mw2005 is DOWN: PING CRITICAL - Packet loss = 100% [13:42:58] RECOVERY - Host mw2005 is UP: PING OK - Packet loss = 0%, RTA = 36.06 ms [13:45:33] (03PS1) 10Alex Monk: annualreport: Ensure latest checkout of git repo [puppet] - 10https://gerrit.wikimedia.org/r/265485 [13:49:10] (03PS2) 10Jcrespo: Fixing db path for newer machines by adding a condition on the role [puppet] - 10https://gerrit.wikimedia.org/r/265479 [13:49:48] PROBLEM - Host mw2083 is DOWN: PING CRITICAL - Packet loss = 100% [13:50:06] (03CR) 10Alexandros Kosiaris: [C: 032] annualreport: Ensure latest checkout of git repo [puppet] - 10https://gerrit.wikimedia.org/r/265485 (owner: 10Alex Monk) [13:51:49] RECOVERY - Host mw2083 is UP: PING OK - Packet loss = 0%, RTA = 36.84 ms [13:51:51] (03PS3) 10Jcrespo: Fixing db path for newer machines by adding a condition on the role [puppet] - 10https://gerrit.wikimedia.org/r/265479 [13:52:10] PROBLEM - HHVM rendering on mw2151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:54:18] RECOVERY - HHVM rendering on mw2151 is OK: HTTP OK: HTTP/1.1 200 OK - 69961 bytes in 5.876 second response time [13:55:39] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 1 failures [13:56:19] PROBLEM - Apache HTTP on mw2088 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.404 second response time [13:58:28] RECOVERY - Apache HTTP on mw2088 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.178 second response time [13:58:49] (03PS1) 10Alexandros Kosiaris: simplify annualreport module [puppet] - 10https://gerrit.wikimedia.org/r/265487 [13:59:55]
(03CR) 10Alexandros Kosiaris: [C: 032] simplify annualreport module [puppet] - 10https://gerrit.wikimedia.org/r/265487 (owner: 10Alexandros Kosiaris) [14:00:45] (03PS4) 10Jcrespo: Fixing db path for newer machines by adding a condition on the role [puppet] - 10https://gerrit.wikimedia.org/r/265479 [14:02:10] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:02:49] PROBLEM - Host mw2081 is DOWN: PING CRITICAL - Packet loss = 100% [14:03:16] (03CR) 10Jcrespo: [C: 032] "Verified with puppet compiler. The fix is ugly, but it is only temporal, until pc100[123] are decommissioned." [puppet] - 10https://gerrit.wikimedia.org/r/265479 (owner: 10Jcrespo) [14:03:19] RECOVERY - Host mw2081 is UP: PING OK - Packet loss = 0%, RTA = 35.04 ms [14:05:58] PROBLEM - Host cr2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:06:59] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [14:07:16] hrm [14:07:19] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [14:07:22] I wonder if ^^^ is codfw related [14:07:26] oh, never mind then [14:09:38] RECOVERY - Host cr2-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.60 ms [14:09:58] could be if for some reason esams traffic is going via-codfw to eqiad :) [14:10:09] that's impossible :) [14:10:18] never say never! [14:11:08] RECOVERY - puppet last run on pc2004 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [14:13:28] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:13:48] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:15:28] PROBLEM - Host mw2087 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:08] RECOVERY - puppet last run on pc2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:23:09] RECOVERY - puppet last run on pc1004 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [14:24:45] RECOVERY - puppet last run on pc2006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:26:59] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:28:08] (03PS1) 10Jcrespo: Enabling ssl and ferm on parsercaches, disabling performance_schema [puppet] - 10https://gerrit.wikimedia.org/r/265488 [14:28:58] (03CR) 10Jcrespo: [C: 032] Enabling ssl and ferm on parsercaches, disabling performance_schema [puppet] - 10https://gerrit.wikimedia.org/r/265488 (owner: 10Jcrespo) [14:30:00] 6operations, 10OTRS, 7user-notice: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1951715 (10Johan) Asking dumb question to make sure the answer is as obvious as I hope it is: OTRS will (probably) be down between 0800 UTC and probably somewhere between 1400 or 1600 U... [14:36:19] PROBLEM - Host cr2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:37:01] !log upgraded cr2-codfw to JunOS 13.3R8.7 [14:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:37:58] RECOVERY - Host cr2-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.75 ms [14:40:01] 6operations, 10OTRS, 7user-notice: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1951756 (10akosiaris) >>! 
In T74109#1951715, @Johan wrote: > Asking dumb question to make sure the answer is as obvious as I hope it is: OTRS will (probably) be down between 0800 UTC an... [14:41:57] (03PS1) 10Faidon Liambotis: Revert "Drain codfw for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/265489 [14:42:35] (03CR) 10Faidon Liambotis: [C: 032] Revert "Drain codfw for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/265489 (owner: 10Faidon Liambotis) [14:47:43] 6operations, 10netops: Upgrade JunOS on cr1/cr2-codfw - https://phabricator.wikimedia.org/T113640#1951766 (10faidon) 5Open>3Resolved a:3faidon All done! [14:47:49] (03PS1) 10Jcrespo: Modifying parsercache including latest optimizations and options [puppet] - 10https://gerrit.wikimedia.org/r/265490 [14:50:36] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951773 (10BBlack) I'm putting together 3x commits for review that I think will resolve this, they should show up below... [14:51:01] (03CR) 10Jcrespo: [C: 032] Modifying parsercache including latest optimizations and options [puppet] - 10https://gerrit.wikimedia.org/r/265490 (owner: 10Jcrespo) [14:53:19] PROBLEM - puppet last run on ms-be2010 is CRITICAL: CRITICAL: puppet fail [14:55:40] (03PS1) 10BBlack: Add IPv6 for iridium-vcs.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/265492 (https://phabricator.wikimedia.org/T100519) [14:56:05] (03PS1) 10BBlack: Add iridium-vcs.eqiad.wmnet ipv6 to phab puppetization [puppet] - 10https://gerrit.wikimedia.org/r/265493 (https://phabricator.wikimedia.org/T100519) [14:56:07] (03PS1) 10BBlack: Add public IPv6 to git-ssh LVS IPs [puppet] - 10https://gerrit.wikimedia.org/r/265494 (https://phabricator.wikimedia.org/T100519) [14:56:24] 6operations, 10OTRS, 7user-notice: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1951980 (10Johan) Thanks. [14:56:53] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951981 (10BBlack) I think those 3 and then uncommenting the public after it's deployed and tested should do the trick. Needs review! [15:03:27] PROBLEM - NTP on mw2020 is CRITICAL: NTP CRITICAL: No response from NTP server [15:19:27] RECOVERY - puppet last run on ms-be2010 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [15:36:30] (03PS3) 10DCausse: Recycle completion suggester indices for small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265472 [15:39:32] 7Blocked-on-Operations, 6operations, 10Wikipedia-iOS-App-Product-Backlog: Provide access to iOS team for piwik production server - https://phabricator.wikimedia.org/T124218#1952109 (10akosiaris) 5Open>3Resolved a:3akosiaris Hello, Users: @KHammerstein @Fjalapeno @JMinor @Bgerstle-WMF @Nirzar have b... [15:42:37] 6operations, 10OCG-General-or-Unknown, 6Scrum-of-Scrums, 6Services: The OCG cleanup cache script doesn't work properly - https://phabricator.wikimedia.org/T120079#1952143 (10akosiaris) @cscott, any news on this? [15:42:55] 6operations, 5Patch-For-Review, 7Pybal: Pybal IdleConnectionMonitor with TCP KeepAlive shows random fails if more than 100 servers are involved. 
- https://phabricator.wikimedia.org/T119372#1952151 (10akosiaris) p:5High>3Normal [15:43:03] (03PS1) 10Andrew Bogott: Bump kernel version in jessie base image [puppet] - 10https://gerrit.wikimedia.org/r/265500 [15:51:13] cmjohnson: we're doing T123546 now ja? [15:51:30] now or in 9 mins...whichever you prefer...i am ready [15:52:02] well 9 mins, but ja [15:52:08] ok, i'm going to prep the el things [15:52:55] jouncebot: next [15:52:55] In 0 hour(s) and 7 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160121T1600) [15:55:52] (03PS1) 10Jcrespo: New parsercache servers for codfw datacenter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265501 (https://phabricator.wikimedia.org/T121879) [15:55:59] hashar: around for next one hour? [15:58:25] kart_: nope, leaving soonish [15:58:28] kart_: what is happening? [15:59:17] 7Blocked-on-Operations, 6operations, 10Wikipedia-iOS-App-Product-Backlog: Provide access to iOS team for piwik production server - https://phabricator.wikimedia.org/T124218#1952246 (10BGerstle-WMF) confirmed, now have access! [15:59:41] !log stopping eventlogging mysql consumers for https://phabricator.wikimedia.org/T123546 [15:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160121T1600). Please do the needful. [16:00:04] Addshore mdholloway kart_ bblack: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:00:07] *waves* [16:00:09] ottomata power down for me [16:00:21] yo [16:00:28] cmjohnson: 1 min [16:00:42] yep...ping me when ok [16:00:48] I can SWAT today. Going to try to get config changes out first then backports. addshore you're up first. [16:00:56] awesome :) [16:01:25] cmjohnson: good to go [16:01:28] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264732 (owner: 10Addshore) [16:01:34] * kart_ here [16:01:34] power down dbproxy1004 go! [16:01:35] cool [16:02:03] thcipriani: we've table creation too :) [16:02:20] <- here [16:02:35] kart_: oh good :) [16:02:49] (03Merged) 10jenkins-bot: wgRCWatchCategoryMembership true on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264732 (owner: 10Addshore) [16:03:24] did you stop eventlogging? I want to restart db1046 too [16:04:25] ottomata: powering on [16:05:03] mw2020.codfw.wmnet REMOTE HOST IDENTIFICATION HAS CHANGED! ← expected? [16:05:18] PROBLEM - Host dbproxy1004 is DOWN: PING CRITICAL - Packet loss = 100% [16:05:21] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: wgRCWatchCategoryMembership true on dewiki [[gerrit:264732]] (duration: 01m 28s) [16:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:05:27] thcipriani: depends how long it's been since you connected there [16:05:33] ^ addshore check please [16:05:40] (03PS3) 10KartikMistry: Beta: Set ContentTranslationCorpora to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265458 (https://phabricator.wikimedia.org/T119617) [16:05:45] thcipriani: looks like its deployed, going to manually test now! [16:05:49] bblack: probably yesterday morning. 
[16:06:21] it looks like it's mid-reinstall [16:06:31] or something like that, it's not in a normal state [16:06:34] mw2098.codfw mw2039.codfw mw2087.codfw also timed-out [16:07:02] _joe_: ^ ? [16:07:04] there are several mws that failed to restart after a rolling restart this morning [16:07:08] RECOVERY - Host dbproxy1004 is UP: PING OK - Packet loss = 0%, RTA = 2.57 ms [16:07:11] there are tickets about that [16:07:14] ok [16:07:20] (03PS2) 10KartikMistry: Enable ContentTranslationCorpora [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265459 (https://phabricator.wikimedia.org/T119617) [16:07:22] that may slow down scap or something :/ [16:07:41] yeah, they'll timeout every time :\ [16:08:04] jynus you have the ticket ref? phab search sucks [16:08:07] hey, I didn't restart them, don't blame me :-P [16:08:16] didn't create the tickets either [16:08:44] <_joe_> thcipriani: they are dead, yes, I should remove them [16:08:46] only this https://phabricator.wikimedia.org/T85286, after it timed out gor me [16:08:49] <_joe_> still didn't have time [16:08:59] thcipriani: generally all looking fine [16:09:07] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [16:09:15] addshore: cool, thanks for checking. [16:09:29] _joe_: kk, I'll continue with SWAT then. [16:09:59] mdholloway: I'll come back to yours since tests take a little bit to merge. [16:10:16] thcipriani: sounds good [16:10:44] kart_: about this table... [16:10:55] yes. [16:11:06] thcipriani: you can go with Beta config change first. [16:11:13] (tables are created there) [16:11:28] then tables in Production. And, enable config change. [16:12:43] this is other: https://phabricator.wikimedia.org/T124282 [16:13:07] PROBLEM - Last backup of the tools filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-tools was exit-code [16:13:10] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265458 (https://phabricator.wikimedia.org/T119617) (owner: 10KartikMistry) [16:13:50] (03CR) 1020after4: [C: 031] "+1 because I can't +2 on operations/puppet :(" [puppet] - 10https://gerrit.wikimedia.org/r/265493 (https://phabricator.wikimedia.org/T100519) (owner: 10BBlack) [16:13:52] (03Merged) 10jenkins-bot: Beta: Set ContentTranslationCorpora to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265458 (https://phabricator.wikimedia.org/T119617) (owner: 10KartikMistry) [16:13:54] ebernhardson et all: es20xx have been warning about disk space for more than a day now [16:14:06] not a day [16:14:10] months [16:14:14] kart_: kk, while we wait for that change to go out to beta, I'm going to circle back and get mdholloway 's stuff out the door. [16:14:27] thcipriani: OK! [16:14:28] er [16:14:30] dammit [16:14:37] paravoid, I've been complaining about those for almost a year [16:14:39] (03CR) 1020after4: [C: 031] "because I can't +2 in operations/puppet" [puppet] - 10https://gerrit.wikimedia.org/r/265494 (https://phabricator.wikimedia.org/T100519) (owner: 10BBlack) [16:14:39] ebernhardson: ignore that! :) [16:14:40] brainfart [16:14:42] 6operations, 10ops-eqiad, 10Analytics: Possible bad mem chip or slot on dbproxy1004 - https://phabricator.wikimedia.org/T123546#1952293 (10Cmjohnson) 5Open>3Resolved Replaced the bad DIMM at slot A3 [16:14:56] thcipriani: I will have to update Beta patch once Production in too. 
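The "REMOTE HOST IDENTIFICATION HAS CHANGED" warning above is the expected consequence of a reinstall: the host generated fresh ssh host keys, so the old ones cached in known_hosts no longer match. Once the reinstall is confirmed legitimate, the stale entry can be dropped:

    # forget the pre-reinstall host key, then reconnect to record the new one
    ssh-keygen -R mw2020.codfw.wmnet
    ssh mw2020.codfw.wmnet true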
[16:15:07] paravoid, vote yes to six servers now! [16:16:17] (03PS4) 10Hashar: contint: rsync server to hold jobs caches [puppet] - 10https://gerrit.wikimedia.org/r/253322 (https://phabricator.wikimedia.org/T116017) [16:16:17] cmjohnson: all good? [16:16:40] looks good [16:17:49] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [16:18:17] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 37, down: 0, dormant: 0, excluded: 3, unused: 0 [16:18:40] (03CR) 10Hashar: "Fixed root dir ownership so it belongs to jenkins-deploy" [puppet] - 10https://gerrit.wikimedia.org/r/253322 (https://phabricator.wikimedia.org/T116017) (owner: 10Hashar) [16:18:46] (03PS1) 10Jcrespo: Depool pc1001 for maintenance (clone to pc1004) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265504 [16:19:14] ottomata: yes I've been finished for awhile...pinged you about it [16:19:28] !log deactivating GTT BGP peering on cr2-eqiad [16:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:19:55] cmjohnson: cool, saw the message about powering on, wasn't sure that was all done [16:20:24] oh..yeah sorry I should've been more clear [16:20:38] !log started eventlogging mysql consumers [16:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:20:48] which servers have to be depooled? [16:21:02] (03CR) 10Mobrovac: [C: 04-1] Add the visualdiff module + instantiate visualdiffing services (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) (owner: 10Subramanya Sastry) [16:21:53] 6operations, 10ops-codfw: mw2087 fails to reboot, mgmt interface unreachable - https://phabricator.wikimedia.org/T124299#1952308 (10Joe) 3NEW [16:22:08] thanks cmjohnson! [16:22:22] jynus: when do you want to start the el tokudb stuff? [16:22:34] <_joe_> papaul: we do have 3 appservers that failed to reboot; would you be able to take a look today or tomorrow? [16:22:49] ACKNOWLEDGEMENT - Host mw2087 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto T124299 [16:22:54] !log thcipriani@tin Synchronized php-1.27.0-wmf.11/extensions/MobileApp/config/config.json: SWAT: Roll out RESTBase usage to Android Beta app: 100% [[gerrit:265118]] (duration: 01m 28s) [16:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:20] ottomata, let's do it [16:23:49] can I just do it? [16:23:52] ^ mdholloway sync'd out on test wikis [16:23:56] oook, gimme just a few, i'm going to puppetize this consumer stop [16:24:08] waiting [16:24:22] I will prepare some updates for that mysql, too [16:24:41] thcipriani: sweet! thanks thcipriani [16:24:49] doing .10 now [16:24:55] k [16:25:37] (03PS1) 10Giuseppe Lavagetto: lvs: switch all of ulsfo to use etcd for pybal config [puppet] - 10https://gerrit.wikimedia.org/r/265505 [16:26:12] <_joe_> bblack, ema ^^ I'm merging this after my meeting. Or, if you feel bold enough, you can go on ofc :P [16:27:07] ok :) [16:27:32] (03CR) 10DCausse: "Unless my math is wrong we have 28 shards for enwiki_content."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/265372 (https://phabricator.wikimedia.org/T124215) (owner: 10EBernhardson) [16:28:12] (03PS1) 10Ottomata: Temporarily disable eventlogging mysql consumers and burrow monitoring for them [puppet] - 10https://gerrit.wikimedia.org/r/265506 (https://phabricator.wikimedia.org/T120187) [16:28:27] !log thcipriani@tin Synchronized php-1.27.0-wmf.10/extensions/MobileApp/config/config.json: SWAT: Roll out RESTBase usage to Android Beta app: 100% [[gerrit:265117]] (duration: 01m 27s) [16:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:28:58] mdholloway: ^ sync'd and purged the url [16:29:10] (03CR) 10jenkins-bot: [V: 04-1] Temporarily disable eventlogging mysql consumers and burrow monitoring for them [puppet] - 10https://gerrit.wikimedia.org/r/265506 (https://phabricator.wikimedia.org/T120187) (owner: 10Ottomata) [16:29:27] thcipriani: great! yup, looks good. thanks again. [16:29:32] (03PS2) 10Ottomata: Temporarily disable eventlogging mysql consumers and burrow monitoring for them [puppet] - 10https://gerrit.wikimedia.org/r/265506 (https://phabricator.wikimedia.org/T120187) [16:29:37] mdholloway: cool, thanks for checking! [16:30:01] kart_: looks like beta-scap-eqiad finished, beta look ok? [16:30:27] thcipriani: a minute. [16:30:47] kart_: kk, gonna get bblack 's change done then [16:31:04] \o/ [16:31:07] (03CR) 10Ottomata: [C: 032] Temporarily disable eventlogging mysql consumers and burrow monitoring for them [puppet] - 10https://gerrit.wikimedia.org/r/265506 (https://phabricator.wikimedia.org/T120187) (owner: 10Ottomata) [16:31:18] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [16:31:37] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1952355 (10greg) Thanks @bblack [16:32:16] thcipriani: OK. Lets go for table creation on Production. [16:33:17] kart_: lemme get this change out the door (in zuul now), then I'll circle back to you change. [16:33:34] *your change [16:33:46] (03PS3) 10KartikMistry: Enable ContentTranslationCorpora [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265459 (https://phabricator.wikimedia.org/T119617) [16:33:55] (03CR) 10Luke081515: [C: 031] Add public IPv6 to git-ssh LVS IPs [puppet] - 10https://gerrit.wikimedia.org/r/265494 (https://phabricator.wikimedia.org/T100519) (owner: 10BBlack) [16:34:16] thcipriani: sure. [16:34:35] jynus: good to go [16:34:59] !log stopped eventlogging mysql consumers for long downtime: https://phabricator.wikimedia.org/T120187 [16:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:35:29] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [16:35:37] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [16:35:48] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [16:36:53] ^ I'll get these in a sec, beta-only changes [16:37:06] yeah. 
Thought so :) [16:37:07] (03PS4) 10KartikMistry: Enable ContentTranslationCorpora [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265459 (https://phabricator.wikimedia.org/T119617) [16:37:23] joe: i will [16:40:54] !log thcipriani@tin Synchronized php-1.27.0-wmf.11/extensions/MobileFrontend/includes/MobileFrontend.hooks.php: SWAT: Use TitleSquidURLs hook to purge mobile URLs directly Part I [[gerrit:265486]] (duration: 01m 28s) [16:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:41:36] ^ bblack wanted to ensure the function was there before adding it to hooks, sync-ing part II (MobileFrontend.php) now. [16:41:47] thcipriani: good thinking, thanks :) [16:41:52] 10Ops-Access-Requests, 6operations: Create new puppet group `discovery-analytics-deploy` - https://phabricator.wikimedia.org/T122620#1952407 (10EBernhardson) [16:42:15] <_joe_> papaul: thanks :)) [16:42:18] ok, its running [16:42:31] !log thcipriani@tin Synchronized php-1.27.0-wmf.11/extensions/MobileFrontend/MobileFrontend.php: SWAT: Use TitleSquidURLs hook to purge mobile URLs directly Part II [[gerrit:265486]] (duration: 01m 28s) [16:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:42:42] !log batch-converting m4-master (log) tables from innodb to tokudb [16:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:43:08] ^ bblack should be sync'd out to mediawiki.org and testwikis now (should roll to more before eod) [16:43:17] ottomata, I've left it on a screen on db1046 [16:43:25] 10Ops-Access-Requests, 6operations: Create new puppet group `discovery-analytics-deploy` - https://phabricator.wikimedia.org/T122620#1909028 (10EBernhardson) based on @faidon and @ottomata suggestions i've changed the named of the puppet group to analytics-search-users in the request, i also adjusted the usern... [16:43:27] in case you want to check its progress [16:43:58] you can also see it with mysql's SHOW PROCESSLIST and compare it to the list of conversions on /srv/tmp [16:44:03] thcipriani: it should affect group1 now, or not? [16:44:12] 10Ops-Access-Requests, 6operations: Create new puppet group `discovery-analytics-deploy` - https://phabricator.wikimedia.org/T122620#1952426 (10EBernhardson) [16:44:25] bblack: no, we rolled back group1 late yesterday [16:44:29] oh! [16:44:47] ok, I'm not sure if I can really verify on group0, but will see what I can figure out :) [16:44:52] :) [16:45:03] hmm there's an m.wikimedia.org, will see there [16:45:10] I mean m.mediawiki.org [16:46:08] thcipriani: confirmed, works right on group0 :) [16:46:31] 63 RxURL c /wiki/Manual:System_administration [16:46:31] 63 RxHeader c Host: www.mediawiki.org [16:46:31] 63 RxURL c /w/index.php?title=Manual:System_administration&action=history [16:46:32] bblack: cool, thanks for checking. Should be rolled out to group1 sometime today (if logs are looking better) [16:46:34] 63 RxHeader c Host: www.mediawiki.org [16:46:37] 63 RxURL c /wiki/Manual:System_administration [16:46:39] 63 RxHeader c Host: m.mediawiki.org [16:46:42] 63 RxURL c /w/index.php?title=Manual:System_administration&action=history [16:46:45] 63 RxHeader c Host: m.mediawiki.org [16:46:47] ok, thanks [16:47:14] kart_: never made a new table as part of SWAT, is there a .sql file for this? [16:49:05] kart_: the sql/parallel-corpora.sql one, guessing? [16:49:17] thcipriani: yes. ContentTranslation/sql/parallel-corpora.sql [16:49:19] yes. 
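(On bblack's varnishlog check pasted above: for anyone wanting to repeat that verification on a cache host, a minimal sketch, assuming the varnish 3 CLI in use at the time — the title filter is illustrative, not the exact invocation used:)

```bash
# Watch purge traffic on a cache host and confirm that both the desktop and
# the m. Host headers get purged for the same title (varnish 3 syntax).
varnishlog -c -m 'RxRequest:PURGE' | grep -A2 'System_administration'
```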
[16:49:23] Thanks :) [16:49:40] thcipriani: note that it will go to Wikishared DB [16:50:22] wikishared. [16:52:52] thcipriani: and there is, https://phabricator.wikimedia.org/T120815#1948202 [16:52:59] Please check it. [16:53:41] 7Blocked-on-Operations, 6operations, 10Wikipedia-iOS-App-Product-Backlog: Provide access to iOS team for piwik production server - https://phabricator.wikimedia.org/T124218#1952441 (10Fjalapeno) Thanks! [16:53:53] kart_: does this command look right? mwscript sql.php --wiki=aawiki --wikidb=wikishared /srv/mediawiki-staging/php-1.27.0-wmf.11/extensions/ContentTranslation/sql/parallel-corpora.sql [16:55:09] --cluster extension1 too. [16:55:09] (03CR) 10Alexandros Kosiaris: [C: 04-1] Move portals into generic sites.pp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/264978 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [16:55:20] (03CR) 10Alexandros Kosiaris: "Change looks good, comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/264978 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [16:56:15] thcipriani: possible to use sql.php --cluster extension1 and go to wikishared and source the sql there? [16:56:16] kart_: is extension1 the name of wikishared for beta? That's the impression that I get from that ticket you posted... [16:56:42] thcipriani: extension1 is the cluster, wikishared is the DB. [16:57:18] thcipriani: see: https://phabricator.wikimedia.org/T120815#1948202 [16:58:39] jynus: ^^ [16:59:21] kart_: I'm just going to source the script from: `sql wikishared` seems like the right thing [16:59:42] you are asking the wrong man, I know almost nothing about mediawiki [16:59:55] jynus: hello, friend [17:00:03] hi, nuria [17:00:04] akosiaris mutante: Dear anthropoid, the time has come. Please deploy Puppet SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160121T1700). [17:00:04] Krenair: A patch you scheduled for Puppet SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [17:00:16] jynus: there is a bit of SQL involved :) [17:00:22] jynus: could you please let me know where i can run "bash check_eventlogging_lag.sh dbstore1002"? [17:00:44] thcipriani: yes. make sure we're in wikishared, that's it. [17:00:53] kk, doing. [17:01:02] Krenair: there's a -1 from me on https://gerrit.wikimedia.org/r/#/c/264978 for a small issue, otherwise the change looks good [17:01:05] mobrovac: around ? [17:01:14] nuria, it's a horrible 5-minute script that I can share if you do not judge me very badly [17:01:16] it's graphoid time [17:01:20] kart_: blerg: The MariaDB server is running with the --read-only option so it cannot execute this statement [17:01:25] jynus: no i would love it, REALLY [17:01:29] akosiaris: yup, in the meeting you're not in :D [17:01:47] ok [17:01:51] I am not sure you have access to the master, but let me at least share the code, we can arrange how to run it later [17:01:54] thcipriani: are you in the extension1 cluster and wikishared DB? [17:02:12] the idea would be to put that in an alert, nuria, when properly done [17:02:37] but let me at least commit it to operations/software:dbtools [17:02:44] !log rebooting labvirt1008 [17:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:03:42] thcipriani: still the same? [17:04:27] kart_: yeah, still looking at it.
I'll try the command you posted with wikidb=wikishared [17:05:30] jynus: excellent, that way we have a "synchronized" way to check lag [17:06:09] thcipriani: sql.php --cluster extension1 and use wikishared; and try source parallel-corpora.sql [17:06:36] jynus: let me know when it is committed, i do not receive alerts from that repo [17:06:39] if this still has a permission issue, we need Ops/DB. [17:06:41] jynus: thank you [17:08:31] (03CR) 10Giuseppe Lavagetto: [C: 032] lvs: switch all of ulsfo to use etcd for pybal config [puppet] - 10https://gerrit.wikimedia.org/r/265505 (owner: 10Giuseppe Lavagetto) [17:08:38] (03PS2) 10Giuseppe Lavagetto: lvs: switch all of ulsfo to use etcd for pybal config [puppet] - 10https://gerrit.wikimedia.org/r/265505 [17:08:44] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/1642/lvs4001.ulsfo.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/265505 (owner: 10Giuseppe Lavagetto) [17:09:18] (03PS1) 10Jcrespo: [WIP] Quick and dirty script to check lag @ eventlogging schema [software] - 10https://gerrit.wikimedia.org/r/265509 [17:09:28] kart_: still no luck. I'm going to try running the command I posted above with the --cluster extension1 option. [17:09:33] (03CR) 10Giuseppe Lavagetto: [V: 032] lvs: switch all of ulsfo to use etcd for pybal config [puppet] - 10https://gerrit.wikimedia.org/r/265505 (owner: 10Giuseppe Lavagetto) [17:10:02] nuria, https://gerrit.wikimedia.org/r/#/c/265509/1 [17:10:20] thcipriani: is it a permission issue? :/ [17:11:36] kart_: only when using the sql option and trying to source the file. mwscript hasn't worked with any incantations I've tried. [17:11:38] are you running it on the master? [17:12:06] I remember a recent patch that by default points to a slave, and all of ours are read-only [17:12:47] <_joe_> !log restarting pybal on the main balancers in ulsfo to consume from etcd [17:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:12:54] what are you guys trying to do with the sql scripts exactly? [17:12:56] jynus: I ran `sql wikishared` on tin. Tried to source a .sql file and it gave me "The MariaDB server is running with the --read-only option so it cannot execute this statement", so I'd guess not :) [17:13:03] ottomata: hey, any reason to not install kafkacat on bastions? [17:13:16] thcipriani, then that is a slave, not a master [17:13:18] Krenair: just trying to run /srv/mediawiki-staging/php-1.27.0-wmf.11/extensions/ContentTranslation/sql/parallel-corpora.sql for the wikishared db [17:13:20] thcipriani, `sql --write wikishared` [17:13:52] (03PS6) 10Alex Monk: Move portals into generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/264978 (https://phabricator.wikimedia.org/T86644) [17:13:53] Krenair is here. Good! [17:14:21] or `mwscript sql.php testwiki --cluster extension1 --wikidb wikishared extensions/ContentTranslation/sql/parallel-corpora.sql` or something [17:14:45] Krenair: thank you! kart_ got it with sql --write wikishared and sourcing the script. [17:14:56] good that I have them as read-only, so you would not try to drift our slaves :-) [17:15:32] jynus: indeed. [17:15:45] jynus: --bow-to-jynus is the hidden option too. [17:15:47] :D [17:16:21] ha [17:16:23] jynus: we can use your script and create an alarm if you want, does that sound good? [17:16:30] cc ottomata [17:16:50] nuria, if you can polish that, I would personally make sure to deploy it [17:16:56] :-) [17:17:00] thcipriani: jynus Krenair thanks.
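(For reference, the sequence that finally worked for the wikishared table creation, recapping the commands quoted above as run from /srv/mediawiki-staging on tin; the one-shot mwscript form is the variant Krenair suggested, with his own "or something" hedge intact:)

```bash
# open a writable connection to the shared extension1 DB (the slaves are read-only)
sql --write wikishared
# then, at the mysql prompt:
#   source /srv/mediawiki-staging/php-1.27.0-wmf.11/extensions/ContentTranslation/sql/parallel-corpora.sql

# roughly equivalent one-shot form:
mwscript sql.php testwiki --cluster extension1 --wikidb wikishared \
    extensions/ContentTranslation/sql/parallel-corpora.sql
```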
[17:17:03] jynus: sounds great, send it our way [17:17:28] kart_: ok, ready to deploy your final patch, then? [17:17:30] jynus: cc ottomata, elukey and myself on your commit, thank you! [17:17:41] thcipriani: yes. Table looks OK. [17:17:45] kk [17:17:49] (03CR) 10Jcrespo: [C: 032] [WIP] Quick and dirty script to check lag @ eventlogging schema [software] - 10https://gerrit.wikimedia.org/r/265509 (owner: 10Jcrespo) [17:17:56] (03CR) 10Jcrespo: [V: 032] [WIP] Quick and dirty script to check lag @ eventlogging schema [software] - 10https://gerrit.wikimedia.org/r/265509 (owner: 10Jcrespo) [17:18:25] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265459 (https://phabricator.wikimedia.org/T119617) (owner: 10KartikMistry) [17:18:49] (03Merged) 10jenkins-bot: Enable ContentTranslationCorpora [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265459 (https://phabricator.wikimedia.org/T119617) (owner: 10KartikMistry) [17:19:18] akosiaris: i'll be ready to go in 10 mins [17:19:49] ^I've added you but merged, because for the alarm it should be improved and committed to operations/puppet, not there [17:20:18] (03CR) 10Jcrespo: "Were you notified of the commit?" [software] - 10https://gerrit.wikimedia.org/r/265509 (owner: 10Jcrespo) [17:20:25] jynus: ticket created: https://phabricator.wikimedia.org/T124306 [17:21:11] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: Locate and assign some MD1200 shelves for proper testing of labstore1002 - https://phabricator.wikimedia.org/T101741#1952507 (10coren) a:5coren>3None [17:21:57] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [17:22:18] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [17:22:32] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable ContentTranslationCorpora Part I [[gerrit:265459]] (duration: 01m 28s) [17:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:24:12] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: Enable ContentTranslationCorpora Part II [[gerrit:265459]] (duration: 01m 28s) [17:24:15] ^ kart_ check please [17:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:24:22] OK. [17:24:33] The table should start getting some data in a while. [17:24:37] 6operations, 10MediaWiki-General-or-Unknown, 10MobileFrontend-Feature-requests, 10Traffic, and 3 others: Fix mobile purging - https://phabricator.wikimedia.org/T124165#1952512 (10BBlack) ^ So the fix is in 1.27.0-wmf.11, which is on group0 so far. When it reaches group1 and group2 as well, we can resolve...
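(The lag-check script just merged in gerrit 265509 isn't pasted in the log, so purely as an illustration of the idea discussed above — comparing the newest EventLogging row timestamps between the m4 master and a replica. Every name below is an assumption for the sketch, not Jcrespo's actual code:)

```bash
#!/bin/bash
# Hypothetical sketch of an eventlogging lag check -- NOT the contents of
# check_eventlogging_lag.sh. Assumes EL tables carry a 14-digit
# MediaWiki-style `timestamp` column.
table='log.Edit_13457736'          # illustrative schema table name
for host in m4-master dbstore1002; do
  ts=$(mysql -h "$host" -BN -e "SELECT MAX(timestamp) FROM ${table};")
  echo "${host}: latest row at ${ts}"
done
# lag ~ the difference between the master's and the replica's MAX(timestamp)
```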
[17:25:18] PROBLEM - puppet last run on db2061 is CRITICAL: CRITICAL: puppet fail [17:29:00] (03PS1) 10Alexandros Kosiaris: graphoid: apply the role on scb [puppet] - 10https://gerrit.wikimedia.org/r/265511 [17:29:02] (03PS1) 10Alexandros Kosiaris: graphoid: update LVS/conftool configuration [puppet] - 10https://gerrit.wikimedia.org/r/265512 [17:30:26] !log disabled puppet and salt-minion on sca1001, sca1002 for graphoid upgrade [17:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:31:07] RECOVERY - Last backup of the tools filesystem on labstore1001 is OK: OK - Last run for unit replicate-tools was successful [17:31:41] ^this actually did fail but that's a bogus recovery so hooray icinga :) [17:32:09] 6operations, 10Analytics, 10DBA: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#1952524 (10jcrespo) 3NEW [17:32:55] (03CR) 10Alexandros Kosiaris: [C: 032] Move portals into generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/264978 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [17:32:59] (03CR) 10Alexandros Kosiaris: [C: 032] graphoid: apply the role on scb [puppet] - 10https://gerrit.wikimedia.org/r/265511 (owner: 10Alexandros Kosiaris) [17:33:09] (03PS7) 10Alexandros Kosiaris: Move portals into generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/264978 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [17:33:12] (03CR) 10Alexandros Kosiaris: [V: 032] Move portals into generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/264978 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [17:33:34] 6operations, 10Analytics, 10DBA: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#1952534 (10jcrespo) [17:36:48] PROBLEM - salt-minion processes on sca1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [17:36:58] PROBLEM - salt-minion processes on sca1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [17:38:05] known ^ [17:38:17] ACKNOWLEDGEMENT - salt-minion processes on sca1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion alexandros kosiaris graphoid migration to scb ongoing [17:38:17] ACKNOWLEDGEMENT - salt-minion processes on sca1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion alexandros kosiaris graphoid migration to scb ongoing [17:39:39] PROBLEM - puppet last run on mw2154 is CRITICAL: CRITICAL: Puppet has 1 failures [17:39:57] PROBLEM - puppet last run on mw1187 is CRITICAL: CRITICAL: Puppet has 1 failures [17:39:58] PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Puppet has 1 failures [17:40:47] PROBLEM - puppet last run on mw2180 is CRITICAL: CRITICAL: Puppet has 1 failures [17:40:48] PROBLEM - puppet last run on mw2148 is CRITICAL: CRITICAL: Puppet has 1 failures [17:41:17] PROBLEM - puppet last run on mw2185 is CRITICAL: CRITICAL: Puppet has 1 failures [17:41:27] PROBLEM - puppet last run on mw1223 is CRITICAL: CRITICAL: Puppet has 1 failures [17:41:27] PROBLEM - puppet last run on mw1071 is CRITICAL: CRITICAL: Puppet has 1 failures [17:41:27] PROBLEM - puppet last run on mw1209 is CRITICAL: CRITICAL: Puppet has 1 failures [17:41:48] PROBLEM - puppet last run on mw1099 is CRITICAL: CRITICAL: Puppet has 1 failures [17:41:57] PROBLEM - puppet last run on mw1176 is CRITICAL: CRITICAL: Puppet has 1 
failures [17:43:17] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:43:53] (03PS10) 10Subramanya Sastry: Add the visualdiff module + instantiate visualdiffing services [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) [17:44:35] thcipriani: still around? [17:44:41] kart_: yup [17:44:46] (03PS1) 10KartikMistry: Really enable ContentTranslationCorpora [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265514 [17:44:54] thcipriani: I made a mistake in the config patch :/ Set it to false instead of true. [17:45:00] (03CR) 10jenkins-bot: [V: 04-1] Add the visualdiff module + instantiate visualdiffing services [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) (owner: 10Subramanya Sastry) [17:45:11] So, can you deploy 265514 please? [17:46:11] kart_: sure, lemme get this in before the train. [17:46:45] Thanks! [17:46:47] (03CR) 10Thcipriani: [C: 032] Really enable ContentTranslationCorpora [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265514 (owner: 10KartikMistry) [17:46:59] (03PS1) 10Giuseppe Lavagetto: role::deployment: add a warning on the inactive server [puppet] - 10https://gerrit.wikimedia.org/r/265515 [17:47:12] (03Merged) 10jenkins-bot: Really enable ContentTranslationCorpora [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265514 (owner: 10KartikMistry) [17:48:24] (03PS11) 10Subramanya Sastry: Add the visualdiff module + instantiate visualdiffing services [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) [17:48:34] !log add scb1001, scb1002 in pybal graphoid config [17:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:49:06] (03PS2) 10Alexandros Kosiaris: graphoid: update LVS/conftool configuration [puppet] - 10https://gerrit.wikimedia.org/r/265512 [17:49:13] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] graphoid: update LVS/conftool configuration [puppet] - 10https://gerrit.wikimedia.org/r/265512 (owner: 10Alexandros Kosiaris) [17:49:37] (03CR) 10jenkins-bot: [V: 04-1] Add the visualdiff module + instantiate visualdiffing services [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) (owner: 10Subramanya Sastry) [17:49:53] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Really enable ContentTranslationCorpora [[gerrit:265514]] (duration: 01m 29s) [17:49:56] ^ kart_ [17:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:50:10] thcipriani: thanks. checking. [17:51:18] RECOVERY - puppet last run on db2061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:52:44] (03PS12) 10Subramanya Sastry: Add the visualdiff module + instantiate visualdiffing services [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) [17:56:04] akosiaris, was the portal change applied?
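(A flipped boolean like the one above is cheap to double-check from tin after the sync. A sketch only — the variable name below is a guess for illustration, not taken from the actual patch:)

```bash
# confirm the config flag took effect on one target wiki after the sync
# ($wgContentTranslationCorpora is an assumed name)
echo 'var_dump( $wgContentTranslationCorpora );' | mwscript eval.php --wiki=cswiki
```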
[17:56:12] Krenair: yes [17:56:30] well, on some hosts I checked [17:56:35] it was a one line change btw [17:56:47] +++ /tmp/puppet-file20160121-9518-v5ylck 2016-01-21 17:41:41.922700070 +0000 [17:56:48] @@ -1,4 +1,5 @@ [17:56:48] [17:56:48] + [17:56:49] ServerName wikipedia.org [17:57:01] not sure where that came from yet, will look into it later [17:57:31] !log depool sca1001,sca1002 for graphoid pybal config [17:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:57:56] thcipriani: thanks. We've data!! [17:58:17] kart_: nice! thanks for checking. [17:59:50] akosiaris, so prod is fine? there's a problem in beta [18:00:05] Krenair: yes, prod seems fine [18:00:25] mobrovac: mobileapps complaining on scb1001 [18:00:51] looking [18:00:56] Krenair: what kind of problem ? [18:01:01] puppet errors [18:01:19] actually, I got a migration going on, will look into it in a few [18:01:41] Ohh. [18:01:52] I had an earlier version of the patch on the beta puppetmaster [18:02:04] It got into a conflict trying to merge [18:04:13] for the record in here, context is "what should we do with the train after yesterday's rollback?": [18:04:16] 18:01 < greg-g> I think we should move ahead with the train (ie: roll out to group1, wait a bit, then roll out to wikipedis, as scheduled) [18:04:25] Krenair: a ok, good to know [18:04:31] (03PS1) 10Alexandros Kosiaris: citoid: Apply the role on scb [puppet] - 10https://gerrit.wikimedia.org/r/265517 [18:04:33] (03PS1) 10Alexandros Kosiaris: citoid: update LVS/conftool configuration [puppet] - 10https://gerrit.wikimedia.org/r/265518 [18:04:59] RECOVERY - puppet last run on mw2154 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [18:05:09] RECOVERY - puppet last run on mw1223 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:05:28] RECOVERY - puppet last run on mw2185 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:05:29] RECOVERY - puppet last run on mw1187 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:05:29] RECOVERY - puppet last run on mw1224 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:05:49] RECOVERY - puppet last run on mw1209 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:05:49] RECOVERY - puppet last run on mw2148 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:06:16] (03CR) 10Alexandros Kosiaris: [C: 032] citoid: Apply the role on scb [puppet] - 10https://gerrit.wikimedia.org/r/265517 (owner: 10Alexandros Kosiaris) [18:06:29] RECOVERY - puppet last run on mw1071 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [18:06:38] RECOVERY - puppet last run on mw2180 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:06:39] RECOVERY - puppet last run on mw1099 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:06:58] RECOVERY - puppet last run on mw1176 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:07:28] !log rebooting labvirt1001 [18:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:09:29] (03PS1) 10Rush: diamond: nfsd grab /proc/fs/nfsd/pool_stats as well [puppet] - 10https://gerrit.wikimedia.org/r/265519 [18:10:38] (03CR) 10jenkins-bot: [V: 04-1] diamond: nfsd grab /proc/fs/nfsd/pool_stats as well [puppet] - 
10https://gerrit.wikimedia.org/r/265519 (owner: 10Rush) [18:11:26] greg-g, thcipriani: is swat still happening or can we start the train early? [18:11:41] marxarelli: SWAT is complete. [18:11:59] PROBLEM - Host ores.wmflabs.org is DOWN: PING CRITICAL - Packet loss = 100% [18:12:03] might be good to have extra time to assess group1 before promoting to al [18:12:04] all [18:12:08] word [18:13:05] I wanted to do some depools for some data migration, but I think I will wait [18:13:16] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [18:13:33] 7Blocked-on-Operations, 6operations, 10RESTBase, 10hardware-requests: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#1952659 (10RobH) [18:14:15] (03CR) 1020after4: "Regarding "Update MediaWiki core to output hashes in static urls."" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [18:16:10] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [18:18:00] (03CR) 10Mobrovac: [C: 04-1] "One detail in the systemd service file left to deal with." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) (owner: 10Subramanya Sastry) [18:18:00] RECOVERY - Host ores.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 1.91 ms [18:20:30] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Zotero alive) is CRITICAL: Test Zotero alive returned the unexpected status 404 (expecting: 200) [18:20:54] anomie: is "Failed to write session data (user)." for wmf.11 something to do with your backported fix yesterday? [18:21:22] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Zotero alive) is CRITICAL: Test Zotero alive returned the unexpected status 404 (expecting: 200) [18:21:38] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 982305 bytes in 3.210 second response time [18:21:54] anomie: seeing that in fatalmonitor [18:22:33] marxarelli: https://gerrit.wikimedia.org/r/#/c/265480/ [18:22:52] should fix it [18:23:42] anomie: alright. i'll cherrypick that [18:25:23] (03PS1) 10Andrew Bogott: Send broken-puppet nags to admins of all projects! 
[puppet] - 10https://gerrit.wikimedia.org/r/265522 [18:25:44] 6operations, 10Analytics, 10DBA: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#1952723 (10Milimetric) p:5Triage>3Normal [18:28:48] (03PS1) 10Alex Monk: Delete config.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/265525 [18:28:50] (03PS1) 10Alex Monk: Fix upload.beta.wmflabs.org docroot path [puppet] - 10https://gerrit.wikimedia.org/r/265526 [18:29:00] (03PS1) 10Alexandros Kosiaris: scb: Update the realserver_ips [puppet] - 10https://gerrit.wikimedia.org/r/265527 [18:29:06] (03CR) 10Subramanya Sastry: Add the visualdiff module + instantiate visualdiffing services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) (owner: 10Subramanya Sastry) [18:29:38] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] scb: Update the realserver_ips [puppet] - 10https://gerrit.wikimedia.org/r/265527 (owner: 10Alexandros Kosiaris) [18:31:05] (03PS13) 10Subramanya Sastry: Add the visualdiff module + instantiate visualdiffing services [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) [18:31:50] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [18:32:41] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [18:34:00] !log pool scb1001, scb1002 for citoid [18:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:39:47] !log depool sca1001, sca1002 for citoid [18:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:41:26] !log enable puppet and salt-minion on sca100{1,2}.eqiad.wmnet [18:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:43:29] !log restbase disabling puppet in prod for testing firejail in staging [18:43:29] just got an ssh host verification failure for mw2020.codfw.wmnet [18:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:43:43] was that host re-imaged or something? [18:43:44] !log dduvall@tin Synchronized php-1.27.0-wmf.11/includes/session/PHPSessionHandler.php: deploy follow-up warning fix for T124126 (duration: 01m 28s) [18:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:44:45] 6operations, 10Traffic: Forward-port Varnish 3 patches to Varnish 4 - https://phabricator.wikimedia.org/T124277#1952810 (10ema) Patches marked as forward-ported are available on Varnish 4 WMF repo on Gerrit: https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/varnish4 - [X] 0010-varnishd-cache_dir... 
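(Re the mw2020 host key verification failure above: if the reimage theory holds, the standard triage from a client that keeps its own known_hosts looks roughly like the following; in production the host keys are centrally managed, so this is only a sketch:)

```bash
ssh-keygen -F mw2020.codfw.wmnet   # inspect the cached host key entry
ssh-keygen -R mw2020.codfw.wmnet   # drop it if the host really was reimaged
ssh mw2020.codfw.wmnet true        # reconnect and verify the new key
```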
[18:44:52] 6operations, 6Services: Migrate SCA cluster to SCB (Jessie and Node 4.2) - https://phabricator.wikimedia.org/T96017#1952813 (10mobrovac) [18:45:23] 6operations, 6Services: Migrate SCA cluster to SCB (Jessie and Node 4.2) - https://phabricator.wikimedia.org/T96017#1206310 (10mobrovac) [18:46:38] (03PS2) 10Alexandros Kosiaris: citoid: update LVS/conftool configuration [puppet] - 10https://gerrit.wikimedia.org/r/265518 [18:46:53] !log rebooting labvirt1002 [18:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:47:06] (03PS1) 10Mobrovac: RESTBase: Enable firejail [puppet] - 10https://gerrit.wikimedia.org/r/265531 [18:47:14] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] citoid: update LVS/conftool configuration [puppet] - 10https://gerrit.wikimedia.org/r/265518 (owner: 10Alexandros Kosiaris) [18:47:20] !log 4 apache sync failures during sync-file, appear to be known issues [18:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:48:48] (03PS1) 10Alex Monk: Move apache includes into generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/265532 (https://phabricator.wikimedia.org/T86644) [18:49:08] (03PS2) 10Alexandros Kosiaris: RESTBase: Enable firejail [puppet] - 10https://gerrit.wikimedia.org/r/265531 (owner: 10Mobrovac) [18:49:16] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] RESTBase: Enable firejail [puppet] - 10https://gerrit.wikimedia.org/r/265531 (owner: 10Mobrovac) [18:49:51] !log sync to mw2020 failed due to failed host key verification, mw2087/mw2039/mw2098 due to connection failed [18:51:20] !log starting train promotion of group1 to 1.27.0-wmf.11 [18:52:48] er, morebots? [18:52:53] !log sync to mw2020 failed due to failed host key verification, mw2087/mw2039/mw2098 due to connection failed [18:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:53:06] there ya are, buddy [18:53:08] !log starting train promotion of group1 to 1.27.0-wmf.11 [18:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:54:23] (03PS1) 10Dduvall: group1 wikis to 1.27.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265536 [18:55:16] (03CR) 10Dduvall: [C: 032] group1 wikis to 1.27.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265536 (owner: 10Dduvall) [18:55:48] (03Merged) 10jenkins-bot: group1 wikis to 1.27.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265536 (owner: 10Dduvall) [18:56:05] krenair@mw2020.codfw.wmnet's password: [18:56:06] Ummmm. [18:56:33] Krenair: that host just failed for me during sync-file as well [18:56:40] yeah, that's why I went to take a look [18:56:43] host key verification fail [18:56:48] Maybe someone in ops is reinstalling it? [18:57:10] I don't see a note in SAL about it until yours [18:57:11] !log dduvall@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.11 [18:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:57:41] Krenair: ran into it during SWAT, bblack mentioned that it was in a bad state somehow. [18:58:26] anomie, tgr: heads-up, group1 has been promoted [18:58:52] so in this train window, we're doing group1+group2 -> wmf.11? [18:59:02] group1 first [18:59:27] then wait/verify, then all [18:59:32] bblack: group1 first, wait/watch.... yeah [18:59:33] is the plan anyway [18:59:49] ok thanks [19:00:04] marxarelli: Dear anthropoid, the time has come.
Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160121T1900). [19:00:17] silly robot [19:04:04] * anomie tests logins to wikidata.org via web and API, and succeeds [19:06:27] (03PS1) 10Alexandros Kosiaris: WIP: DONT MERGE. cleanup SCA from *oid services [puppet] - 10https://gerrit.wikimedia.org/r/265541 [19:08:40] seeing some new fatal errors [19:08:45] CentralAuth/session related [19:08:48] "Fatal error: Cannot call abstract method MediaWiki\Session\SessionProvider::provideSessionInfo() in /srv/mediawiki/php-1.27.0-wmf.11/extensions/CentralAuth/includes/session/CentralAuthTokenSessionProvider.php on line 107" [19:09:04] anomie, tgr ^ [19:09:08] csteipp, tgr ^ [19:09:13] * anomie looks [19:10:18] (03PS2) 10Chad: ganglia diskstat.py: pep8 fixes all over the place [puppet] - 10https://gerrit.wikimedia.org/r/264997 [19:10:51] marxarelli: https://gerrit.wikimedia.org/r/265542 should fix that [19:11:09] anomie: kk [19:13:58] !log rebooting labvirt1003 [19:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:17:41] PROBLEM - Host www.toolserver.org is DOWN: PING CRITICAL - Packet loss = 100% [19:17:43] (03PS2) 10BBlack: Add iridium-vcs.eqiad.wmnet ipv6 to phab puppetization [puppet] - 10https://gerrit.wikimedia.org/r/265493 (https://phabricator.wikimedia.org/T100519) [19:18:14] https://grafana.wikimedia.org/dashboard/db/authentication-metrics?panelId=13&fullscreen [19:18:21] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [19:18:58] tgr: need_token errors again? [19:19:10] yeah [19:19:34] what would be a good place to ask bot operators about it? #commons? [19:19:40] still only on group1, right? [19:19:44] right [19:20:14] (03CR) 10BBlack: [C: 032] Add iridium-vcs.eqiad.wmnet ipv6 to phab puppetization [puppet] - 10https://gerrit.wikimedia.org/r/265493 (https://phabricator.wikimedia.org/T100519) (owner: 10BBlack) [19:20:38] commons would be one good place, yeah [19:21:14] (03CR) 10BBlack: [C: 032] Add IPv6 for iridium-vcs.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/265492 (https://phabricator.wikimedia.org/T100519) (owner: 10BBlack) [19:22:17] !log restbase re-enabling puppet in prod [19:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:24:56] !log restbase rolling-restart after firejail inclusion [19:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:25:36] (03PS1) 10Alex Monk: beta: Remove deployment.wmflabs.org VHost that doesn't actually resolve [puppet] - 10https://gerrit.wikimedia.org/r/265548 [19:25:51] RECOVERY - Host www.toolserver.org is UP: PING OK - Packet loss = 0%, RTA = 1.80 ms [19:31:54] !log dduvall@tin Synchronized php-1.27.0-wmf.11/extensions/CentralAuth/includes/session/CentralAuthTokenSessionProvider.php: deploy https://gerrit.wikimedia.org/r/#/c/265545/ for 1.27.0-wmf.11 (duration: 01m 28s) [19:31:56] !log rebooting labvirt1004 [19:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:32:00] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
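(After a promotion like the one just logged, the dbname-to-version mapping can be spot-checked straight from the synced wikiversions data; the staging path on tin is assumed here:)

```bash
# which version is a given wiki pinned to now?
python -m json.tool /srv/mediawiki-staging/wikiversions.json | grep cswiktionary
# expected after the group1 promotion: "cswiktionary": "php-1.27.0-wmf.11"
```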
[19:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:37:28] marxarelli: anomie tgr errors went down https://grafana.wikimedia.org/dashboard/db/authentication-metrics?panelId=13&fullscreen [19:37:37] greg-g: yeah, saw that [19:37:54] how's other graphs doing? [19:38:00] trend seems to follow the last sync [19:38:51] PROBLEM - Host mw2020 is DOWN: PING CRITICAL - Packet loss = 100% [19:39:08] fatalmonitor looks good to my n00b eyes [19:39:39] greg-g: that's probably just grafana being stupid [19:39:50] last few pixels are not really reliable [19:41:08] anybody rolled out some new version of mw to wiktionaries? [19:41:18] in last couple mins [19:41:27] tgr: seems to be consistent over the past 10 min [19:41:33] is it that dumb? [19:41:42] did someone just puppet-merge on palladium and grab my change? [19:41:47] Danny_B: yes [19:41:52] Danny_B: 18:53 UTC, I believe. [19:42:01] RECOVERY - Host mw2020 is UP: PING OK - Packet loss = 0%, RTA = 37.08 ms [19:42:06] Danny_B: wiktionaries have a new version since 10:57 [19:42:21] err, 18:57 UTC. Yeah. [19:42:23] the behaviour changed just a few mins ago on cs wikt [19:42:36] and it does not work, please roll back [19:42:49] not possible to edit at all, since edit links don't work [19:42:51] marxarelli: I'll trust it when I start seeing the low-level noise on the right side [19:42:56] edit links don't work? [19:43:47] hitting the edit link by the headline links to #/editor/1 [19:43:51] nothing happens [19:43:59] besides edit links now look different [19:44:06] i can edit en wiktionary [19:44:11] mobile? [19:44:19] sounds like a gadget issue [19:44:20] I can edit on en.wiktionary [19:44:21] they are of the size of the header [19:44:47] not as small as they used to be, neither enclosed in brackets [19:44:49] is said cs wiktionary [19:44:56] s/is/i [19:45:29] mobile edit interface does look weird, no idea if that's new though [19:45:38] not mobile, desktop [19:45:44] cs also works; http://imgur.com/msfOPSm [19:45:54] can edit there also [19:46:04] aude: you started the page [19:46:11] you did not try to edit the section [19:46:19] Danny_B: i edited the section also [19:46:20] tgr: you were right. just spiked again [19:46:41] "wgHostname":"mw1103" if that helps [19:46:45] Danny_B: i believe you though [19:47:31] interesting though i was logged in on en.wiktionary [19:47:41] then apparently got logged out / not logged in on cs.wiktionary [19:47:56] marxarelli: we have some kind of statsd buffering proxy I think [19:47:58] marxarelli: can you rollback cs wiktionary to see if that fixes Danny_B's issue? [19:48:09] greg-g: sure thing [19:48:29] (03CR) 10jenkins-bot: [V: 04-1] Bump kernel version in jessie base image [puppet] - 10https://gerrit.wikimedia.org/r/265500 (owner: 10Andrew Bogott) [19:49:03] * anomie is trying to debug why logging in isn't working on cswikt, but rolling back will probably screw that up [19:49:06] weird that i am not logged in there, but am when i go to wikivoyage, commons, etc [19:50:17] also afrikaans wiktionary, danish, ... [19:50:18] i'm creating the sshot for ya guys [19:50:24] but logged in on spanish wiktionary [19:50:29] (03PS3) 10Andrew Bogott: Bump kernel version in Jessie base image [puppet] - 10https://gerrit.wikimedia.org/r/265500 [19:50:45] greg-g: anomie is debugging. rollback, er... ? [19:51:33] aude: any luck debugging?
[19:51:35] aude: re logging: if wikitech was also changed the mw version, then i submitted some bug which may be relevant, it's marked as secure though [19:51:42] so idk if you can see it [19:51:44] greg-g: no [19:51:51] on some wiktionaries i'm not, on some i am [19:51:53] Danny_B: which bug? [19:52:03] marxarelli: rollback [19:52:07] kk [19:52:08] don't know if they are on different db servers / groups [19:52:09] just cswiki for now [19:52:37] arabic logged in (and maybe already had account there) [19:52:41] same with spanish [19:53:04] just a guess [19:53:43] marxarelli: Danny_B: cs.wikt edit interface looks the same as en.wiki [19:53:48] japanese, polish works [19:54:06] Did you guys create bot_passwords tables on all wikis? [19:54:06] http://imgur.com/EEnhSoc i clicked the edit by "výslovnost" and see the url in browser location bar [19:54:16] not vietnamese or portuguese [19:54:17] maybe Danny_B unexpectedly ended up on the mobile interface [19:54:17] Krenair: no they did not [19:54:29] Danny_B, clearly. but let them answer. [19:54:34] Krenair: i reported the bug [19:54:42] !log rebooting labvirt1005 [19:54:46] !log dduvall@tin rebuilt wikiversions.php and synchronized wikiversions files: rollback cswiktionary to 1.27.0-wmf.10 [19:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:55:06] greg-g: https://phabricator.wikimedia.org/T124335 [19:55:21] Danny_B: test again, please, we just rolled back cswiktionary [19:55:23] the ones i am logged in on already had an account [19:55:27] Krenair: should be using the central DB on meta [19:55:28] https://es.wiktionary.org/wiki/Especial:Informaci%C3%B3n_de_la_cuenta_global/Aude [19:55:32] greg-g: ok, one moment pls [19:56:26] greg-g: works as before now [19:56:42] how can i help you with debugging? what would you like me to examine? [19:57:00] wtf [19:57:15] ar.wikibooks ( no account, no login) [19:57:19] why it's only affecting cswiki I have no idea [19:58:02] greg-g: So it looks like the reason I can't log in on cswikt is because CentralAuth is suddenly creating unattached accounts. [19:58:04] greg-g: we have two outstanding issues now, mystery cswikt UI muck and need_token spike [19:58:15] if UseBotPasswords is not true, should the table be expected to exist? [19:58:17] !log mobileapps deploying 68c09e [19:58:18] No idea if that's related to any of the other problems [19:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:58:41] Krenair: If UseBotPasswords is false, it shouldn't need the table to exist. [19:58:55] marxarelli, anomie: https://phabricator.wikimedia.org/T74791#1953065 [19:59:10] marxarelli: tgr the need_token thing isn't resolving? :( [19:59:25] probably the central id provider issue? [19:59:41] maybe the bot password problem is the same? [20:00:15] greg-g: still no clue what that issue even is, just see a few bots trying to log in like crazy [20:00:28] I mean EnableBotPasswords [20:00:36] :/ [20:00:53] ermmm [20:00:57] unattached accounts?? [20:01:16] Krenair: are there DB errors or why do you ask? [20:01:26] tgr, https://phabricator.wikimedia.org/T124335 [20:01:48] Krenair: can you CC me on that?
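(On the bot_passwords question above, a direct existence check would settle where the table actually lives. A sketch only, assuming the central table is expected on the CentralAuth database and that the sql wrapper accepts that dbname:)

```bash
# does bot_passwords exist where BotPassword will look for it?
# ('centralauth' as the dbname is an assumption for this sketch)
echo "SHOW TABLES LIKE 'bot_passwords';" | sql centralauth
```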
[20:02:01] i can't unfortunately follow here nonstop atm, please ping me when you need further assistance or testing or when you think it's fixed, thank you [20:02:29] Danny_B: sadly you're the only one able to repro right now [20:02:29] tgr: added [20:02:43] Danny_B: I'll ping when we need more testing [20:03:21] ah, labswiki [20:03:27] greg-g: no prob, i just need to switch to a different window and work, so i may have some delays in replies, but i'll do my best [20:03:39] kk [20:05:49] why is labswiki not in the nonglobal list? [20:06:07] it is in the nonglobal list [20:06:20] therefore UseBotPasswords is set to false [20:06:26] bah, EnableBotPasswords* [20:06:54] tgr, Krenair: https://gerrit.wikimedia.org/r/265561 [20:07:13] But User::setPasswordInternal -> BotPassword::invalidateAllPasswordsForUser -> BotPassword::invalidateAllPasswordsForCentralId -> boom [20:08:50] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO) {#11541} [10Gbps DWDM] [20:09:33] going to go and have dinner, cherry-pick: https://gerrit.wikimedia.org/r/#/c/265563/ [20:10:01] will deploy it when I'm back if nobody else does [20:13:57] Krenair, anomie, tgr: i'll deploy it once the cherry-pick merges [20:13:57] I'll deploy [20:14:04] that works too [20:16:33] !log rebooting labvirt1006 [20:16:34] (03PS2) 10EBernhardson: Adjust cirrus titlesuggest index shard counts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261287 (https://phabricator.wikimedia.org/T124332) [20:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:16:57] (03CR) 10EBernhardson: [C: 04-1] "These values were chosen by hand, need to update with the values calculated by the script in T124332" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261287 (https://phabricator.wikimedia.org/T124332) (owner: 10EBernhardson) [22:35:37] PROBLEM - puppet last run on ms-be1020 is CRITICAL: CRITICAL: Puppet has 1 failures [22:36:49] https://phabricator.wikimedia.org/T124356 sounds like MF is leaking into vector? [22:41:16] Danny_B: I think your mobile/desktop choice sticks with you between wikis? [22:41:26] so you should try testing with a different user [22:41:46] * greg-g nods [22:43:41] RECOVERY - puppet last run on ganeti2001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [22:44:00] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [22:44:23] (03PS1) 10Alex Monk: mediawiki: Move www.wikimedia.org portal into wwwportals [puppet] - 10https://gerrit.wikimedia.org/r/265642 [22:44:44] tgr: i never tried mobile [22:45:11] yes but the error is clearly related to that somehow [22:45:16] but ok, i'll test as anonymous [22:45:20] you are getting mobile edit links [22:46:07] it's very time consuming, as it takes a while to hit it, some automated testing would be handy [22:46:17] !log started running migratePass0.php (CentralAuth) on group1 wikis [22:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:47:16] why do we have remnant.conf and main.conf in apache config?
[22:49:58] tgr: i've just hit it as anonymous in a browser i have not used for months [22:50:20] thanks for checking [22:51:51] welcome [22:52:42] if it was on me i'd roll back wmf11 and do some investigation in its sources... but i am neither an op nor mgmt... ;-) [22:55:23] (03CR) 10EBernhardson: [C: 032] Recycle completion suggester indices for small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265472 (owner: 10DCausse) [22:56:07] (03Merged) 10jenkins-bot: Recycle completion suggester indices for small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265472 (owner: 10DCausse) [22:56:13] Danny_B: are you using a standard browser? [22:56:45] tgr: #define( "standard browser" ) [22:57:09] something mainstream like current Chrome, Firefox... [22:57:51] RECOVERY - Host mw2087 is UP: PING OK - Packet loss = 0%, RTA = 36.38 ms [22:59:08] firefox, chrome, iron, k-meleon, ie, opera, vivaldi, amaya, lynx and a bunch of others [23:01:11] RECOVERY - puppet last run on ms-be1020 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [23:03:01] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [23:05:31] PROBLEM - very high load average likely xfs on ms-be1002 is CRITICAL: CRITICAL - load average: 199.11, 142.18, 71.37 [23:06:24] (03CR) 10Mobrovac: [C: 04-1] "Minor comments in-lined. Also, the upstart conf file should probably be removed." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/265628 (owner: 10Subramanya Sastry) [23:09:30] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:16:07] Danny_B: erm, can you give me an example url you hit and saw it with? [23:16:31] (03PS3) 10Subramanya Sastry: Migrate parsoid::role::testing service from upstart to systemd [puppet] - 10https://gerrit.wikimedia.org/r/265628 [23:17:41] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [23:18:12] ebernhardson: ^? [23:19:29] legoktm: https://cs.wiktionary.org/wiki/leetspeak https://cs.wiktionary.org/wiki/Polsko https://sk.wiktionary.org/wiki/kostern%C3%AD https://cs.wiktionary.org/wiki/d%C5%AFle%C5%BEit%C3%BD https://cs.wikiquote.org/wiki/Jean-Marie_Adiaffi [23:20:14] thank you [23:20:45] i kept these open, but there are several i've closed and don't remember [23:21:27] Danny_B: can you send me the html for one of those pages?
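(Since the leak is intermittent and, as noted above, tedious to hit by hand, a crude anonymous probe could loop on one of the affected URLs. A sketch only — the 'mw-mf' marker for MobileFrontend markup is an assumption:)

```bash
# hammer the desktop URL anonymously; flag any response carrying mobile markup
for i in $(seq 1 50); do
  curl -s 'https://cs.wiktionary.org/wiki/Polsko' | grep -q 'mw-mf' &&
    echo "attempt $i: mobile markup served on the desktop URL"
done
```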
[23:21:41] ooh [23:21:43] I got a repro [23:22:03] cool [23:22:11] now i can go sleep ;-) [23:24:27] (03CR) 10Mobrovac: Migrate parsoid::role::testing service from upstart to systemd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/265628 (owner: 10Subramanya Sastry) [23:25:32] (03PS2) 10Ottomata: 0.8.2.1-4 release - kafka-mirror package enhancements [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/265288 (https://phabricator.wikimedia.org/T124077) [23:25:34] (03PS2) 10Nuria: Removing code generates pageviews using legacy definition [puppet] - 10https://gerrit.wikimedia.org/r/265634 (https://phabricator.wikimedia.org/T124244) [23:25:47] legoktm: wow, thanks [23:26:03] (03CR) 10jenkins-bot: [V: 04-1] Removing code generates pageviews using legacy definition [puppet] - 10https://gerrit.wikimedia.org/r/265634 (https://phabricator.wikimedia.org/T124244) (owner: 10Nuria) [23:26:25] (03Abandoned) 10Nuria: Removing code generates pageviews using legacy definition [puppet] - 10https://gerrit.wikimedia.org/r/265634 (https://phabricator.wikimedia.org/T124244) (owner: 10Nuria) [23:26:59] yeh im seeing it on https://cs.wiktionary.org/wiki/Polsko [23:27:00] PROBLEM - Disk space on iridium is CRITICAL: DISK CRITICAL - free space: / 338 MB (3% inode=82%) [23:27:55] twentyafterfour: ^ [23:28:02] svn import causing space issues? [23:28:23] YuviPanda: already fixed [23:28:41] I think something is corrupting the parser cache. [23:28:58] this sounds very similar to the wikivoyage issue.. [23:29:09] thanks twentyafterfour [23:29:11] RECOVERY - Disk space on iridium is OK: DISK OK [23:29:26] $skin = $wgOut->getSkin(); [23:29:26] return call_user_func_array( [23:29:26] array( $skin, 'doEditSectionLink' ), [23:29:29] eww [23:29:36] sorry about that, I didn't realize ~ was on a tiny partition [23:30:31] still, it shouldn't be corrupting the parser cache [23:30:50] right, edit section links run afterwards [23:31:35] so somehow $wgOut->getSkin() is returning SkinMinerva?? [23:31:54] (03PS1) 10Nuria: Removing code generates pageviews using legacy definition [puppet] - 10https://gerrit.wikimedia.org/r/265656 (https://phabricator.wikimedia.org/T124244) [23:32:27] (03PS4) 10Subramanya Sastry: Migrate parsoid::role::testing service from upstart to systemd [puppet] - 10https://gerrit.wikimedia.org/r/265628 [23:32:49] (03CR) 10jenkins-bot: [V: 04-1] Removing code generates pageviews using legacy definition [puppet] - 10https://gerrit.wikimedia.org/r/265656 (https://phabricator.wikimedia.org/T124244) (owner: 10Nuria) [23:32:50] legoktm: it's running the MobileFormatter. Is it possible we do not have separate parser caches for mobile/desktop on those projects? [23:33:00] legoktm: you'll notice it's not just the edit links but it's doing section wrapping [23:33:30] why doesn't it contain the parsercache hash? [23:33:36] 6operations, 10ops-codfw: mw2098 non-responsive to mgmt - https://phabricator.wikimedia.org/T85286#1954018 (10Papaul) Fixes & Enhancements Enhancements: N/A Fixes: - Fix for issues that cause iDRAC7 sluggish responsiveness after a prolonged period of time (approx. 45-100 days, depending on the usage). In some... 
[23:33:43] (03CR) 10jenkins-bot: [V: 04-1] Migrate parsoid::role::testing service from upstart to systemd [puppet] - 10https://gerrit.wikimedia.org/r/265628 (owner: 10Subramanya Sastry) [23:34:34] (03PS2) 10Nuria: Removing code generates pageviews using legacy definition [puppet] - 10https://gerrit.wikimedia.org/r/265656 (https://phabricator.wikimedia.org/T124244) [23:34:36] (03CR) 10Subramanya Sastry: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/265628 (owner: 10Subramanya Sastry) [23:35:00] Platonides: by default ParserOptions::$mEnableLimitReport is false... [23:35:06] 6operations, 10ops-codfw: mw2087 fails to reboot, mgmt interface unreachable - https://phabricator.wikimedia.org/T124299#1954026 (10Papaul) Update IDRAC firmware from 1.30 to 2.21 @joe please see note on T85286 [23:35:29] (03CR) 10jenkins-bot: [V: 04-1] Removing code generates pageviews using legacy definition [puppet] - 10https://gerrit.wikimedia.org/r/265656 (https://phabricator.wikimedia.org/T124244) (owner: 10Nuria) [23:35:51] legoktm: where do you see array( $skin, 'doEditSectionLink' ), ? I'm not finding that in MobileFrontend [23:35:55] smells like old code [23:36:07] interestingly ParserOptions::matches() ignores mEnableLimitReport [23:36:11] legoktm: / jdlrobson: assessment on whether we should rollback or not due to this? [23:36:17] it makes sense [23:36:23] since it doesn't affect the output [23:36:29] jdlrobson: ParserOutput::getText() in core [23:37:01] greg-g: normally I'd say rollback, but I think we're going to have more issues rolling back the session stuff, and apparently this is very hard to repro... [23:37:12] * greg-g nods [23:37:23] I suspect that $po->setText( ExtMobileFrontend::DOMParse( $outputPage, $po->getText(), $isBeta ) ); is the culprit but I'm not sure why it's running on the desktop ParserOutput [23:37:51] it should also be disabling TOC [23:37:55] which it doesn't seem to be doing though [23:38:05] oh that runs on OutputPage [23:38:09] ergg so confusing [23:39:01] I'm going to live hack mw1017 a bit [23:39:03] idea: if the last edit of the page was coming through mobile, then the page stays in mobile output ? [23:39:16] yeh only the example https://cs.wiktionary.org/wiki/Polsko has had no recent edits [23:39:20] https://cs.wiktionary.org/w/index.php?title=Polsko&action=history doesn't say mobile? [23:39:40] legoktm: are you able to flush the parser output cache for a particular page? [23:39:50] anyway, i'm going to take a nap now (almost 1am here), feel free to pm me with additional needs [23:40:04] jdlrobson: probably, but I don't want to do that unless we have another repro [23:40:09] if so there's a few experiments we could run e.g. visit via mobile first and see whether that impacts desktop cache [23:40:25] legoktm: did you try those other links i gave you? [23:41:07] so this is sounding like something that's going to need a lot of varnish purging once fixed, right? [23:41:19] Danny_B: oops, never clicked further. 
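(On flushing the parser output cache for a single page, per the question just above: a null edit does it, and so does action=purge — e.g., assuming anonymous POST purges are permitted here:)

```bash
# purge one page's cached rendering without editing it
curl -s -X POST 'https://cs.wikiquote.org/w/api.php' \
     --data 'action=purge&titles=Jean-Marie_Adiaffi&format=json'
```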
[23:41:23] (03CR) 10Ottomata: [C: 032] 0.8.2.1-4 release - kafka-mirror package enhancements [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/265288 (https://phabricator.wikimedia.org/T124077) (owner: 10Ottomata) [23:41:29] maybe we can purge on the obj.http.Date range for when it was generating bad stuff [23:41:30] okay, I can also repro on https://cs.wiktionary.org/wiki/d%C5%AFle%C5%BEit%C3%BD and https://cs.wikiquote.org/wiki/Jean-Marie_Adiaffi [23:41:30] PROBLEM - Host ms-be1002 is DOWN: PING CRITICAL - Packet loss = 100% [23:42:52] legoktm: i purged leetspeak before because of testing [23:42:56] ok [23:43:23] * Danny_B -> nap [23:43:33] null edit fixed https://cs.wikiquote.org/wiki/Jean-Marie_Adiaffi [23:44:36] 6operations, 10ops-codfw: mw2098 non-responsive to mgmt - https://phabricator.wikimedia.org/T85286#1954075 (10Papaul) Note version 1.56 down is for IDRAC 7 and not IDRAC 6. I am still working on the IDRAC 6 for the PowerEdge R410 series like mw2039 which had the same problem as well. [23:44:59] eh [23:45:15] notices in languages I don't speak keep popping up and disappear before I can copy the text [23:47:07] even doing an actual edit on mobile doesn't cause it to come back: https://cs.wikiquote.org/w/index.php?title=Jean-Marie_Adiaffi&type=revision&diff=76155&oldid=70975 [23:48:56] bblack: maybe? we still don't know what's causing it yet :/ [23:50:43] legoktm: i tried editing via desktop site on a mobile device but that didn't cause any issues, so i'm out of ideas :/ [23:51:07] not having the limit report is bugging me... [23:52:44] (03PS1) 10Alex Monk: beta: Move login and bits apache configs into wikimedia.conf, like prod [puppet] - 10https://gerrit.wikimedia.org/r/265659 [23:53:26] legoktm: i just saw your message from earlier, i don't think mediawiki-config is me (haven't merged anything today) [23:53:49] (03CR) 10jenkins-bot: [V: 04-1] beta: Move login and bits apache configs into wikimedia.conf, like prod [puppet] - 10https://gerrit.wikimedia.org/r/265659 (owner: 10Alex Monk) [23:53:51] ebernhardson: [14:55:23] (CR) EBernhardson: [C: 2] Recycle completion suggester indices for small wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/265472 (owner: DCausse) [23:54:00] Krenair: Wouldn't it be easier to just template out the production configs (for servername, aliases etc) and update the beta config to use it? [23:54:33] (03CR) 10Subramanya Sastry: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/265628 (owner: 10Subramanya Sastry) [23:54:47] Reedy, the plan is to identify the differences and merge them [23:55:03] In theory, it should be minimal [23:55:16] legoktm: oh doh... i meant to merge a different patch :S [23:55:20] (03CR) 10Alex Monk: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/265659 (owner: 10Alex Monk) [23:55:35] well... might as well put that up to swat now since it's pre-config for a patch going out next week [23:55:41] related, https://github.com/wikimedia/mediawiki is like 8 days behind?? [23:55:54] Reedy, I could upload one change that just deletes half of the mediawiki sites config and adds a load more [23:56:13] lol [23:56:15] Reedy, however, I would like to stand a realistic chance of getting stuff done, and therefore I need to be able to convince ops to approve changes [23:58:24] https://gerrit.wikimedia.org/r/#/c/263606/ looks relatively safe [23:58:50] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [23:58:51] Danny_B: do we have any evidence this started with wmf.11 yesterday?
[23:59:41] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.