[00:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160127T0000). Please do the needful.
[00:00:04] ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:02:24] i suppose it's just me, i'll deploy
[00:02:31] !log enable puppet and codify the 192 thread count for nfsd
[00:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:02:43] (03CR) 10EBernhardson: [C: 032] Put more like query load back on eqiad for codfw load testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266559 (owner: 10EBernhardson)
[00:02:53] PROBLEM - RAID on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:03:04] PROBLEM - puppet last run on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:03:07] (03Merged) 10jenkins-bot: Put more like query load back on eqiad for codfw load testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266559 (owner: 10EBernhardson)
[00:03:23] PROBLEM - nutcracker process on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:03:32] PROBLEM - SSH on mw1161 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:03:53] PROBLEM - DPKG on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:04:42] PROBLEM - configured eth on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:05:43] PROBLEM - nutcracker port on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:05:52] PROBLEM - Disk space on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:06:13] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[00:06:52] PROBLEM - dhclient process on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:07:13] PROBLEM - salt-minion processes on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:07:52] mw1161 is down?
[00:08:03] (I was scared for a second, but its only one host)
[00:09:16] not sure how this will work out...sync-file is stuck at sync-proxies having only synced 11 of 12 proxies
[00:09:22] RECOVERY - Disk space on mw1161 is OK: DISK OK
[00:10:19] greg-g: but annoyingly, mw1161 is one of the 12 proxies used for scap
[00:10:19] ebernhardson: shouldn't matter too much
[00:10:20] !log ebernhardson@mira Synchronized wmf-config/CirrusSearch-production.php: point morelike queries back at the eqiad cluster (duration: 05m 41s)
[00:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:10:25] they'll just use other hosts
[00:10:28] oh good
[00:10:31] greg-g, happens to some hosts sometimes
[00:10:33] not hardcoded
[00:10:41] so yeah, scary
[00:10:59] ahh, i thought this was supposed to be 'row aware', and was worried it would stick with the proxy in the same DC row
[00:11:03] RECOVERY - nutcracker port on mw1161 is OK: TCP OK - 0.001 second response time on port 11212
[00:11:27] The proxy selection mechanism is the 2nd coolest thing in the python version of scap. The coolest thing is the ssh dispatch loop. ori came up with both of them. :)
[00:11:28] nah
[00:11:38] ebernhardson: it uses the "best" server
[00:11:48] it should be the same rack/row
[00:11:51] But often not
[00:11:57] lowest tcp hop count
[00:12:02] it chooses the best proxies, and proxies are typically selected to be one per row, right?
[00:12:27] Yeah
[00:13:04] https://github.com/wikimedia/operations-puppet/blob/2015be754e75ee2ceeb6a8aa6449f5a706bb7df0/hieradata/common/scap/dsh.yaml#L3-L16
[00:13:28] well, theres one per rack with mw app servers in
[00:13:38] 10Ops-Access-Requests, 6operations, 6Parsing-Team: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1968165 (10ssastry) @RobH Thanks. Looks good. @tstarling, @arlolra, @cscott are the others besides me that will need...
[00:14:43] PROBLEM - Disk space on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:15:54] !log ebernhardson@mira Synchronized php-1.27.0-wmf.11/extensions/CirrusSearch/: Allow pointing morelike queries at a specific datacenter (duration: 03m 04s)
[00:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:16:10] (03CR) 10Bmansurov: "Yeah, that's what the task says now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov)
[00:16:23] PROBLEM - nutcracker port on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:16:51] 6operations, 6Discovery: Elasticsearch health and capacity planning FY2016-17 - https://phabricator.wikimedia.org/T124626#1968174 (10Tfinc) In my previous life managing search clusters we split our corpus by geographic location, then function (full text, prefix, etc) and then partitioned the data to fit into t...
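[Editor's note] The scap proxy-selection behaviour discussed above (around 00:11–00:13) can be sketched roughly as follows. This is a hypothetical illustration, not scap's actual code: `pick_proxy`, `hop_count`, and `alive` are made-up names. The point of the design is that each sync target is paired with the closest reachable proxy, so a single dead proxy (mw1161 here) just drops out of the pool instead of blocking the sync.

```python
# Hypothetical sketch of scap-style proxy selection; the names here
# (pick_proxy, hop_count, alive) are illustrative, not scap's real API.
def pick_proxy(target, proxies, hop_count, alive):
    """Return the reachable proxy with the lowest hop count to `target`.

    Returns None when no proxy is reachable; a caller could then fall
    back to syncing the target directly from the master.
    """
    candidates = [p for p in proxies if alive(p)]
    if not candidates:
        return None
    # The "best" proxy is the one with the fewest network hops, which
    # usually (but, per the discussion above, not always) means a proxy
    # in the same rack or row as the target.
    return min(candidates, key=lambda p: hop_count(target, p))
```

This also explains why the sync above proceeded with 11 of 12 proxies: targets previously served by the dead proxy are simply reassigned to the next-closest live one.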
[00:23:31] (03PS1) 10RobH: creation of parsoid-rt-admin group [puppet] - 10https://gerrit.wikimedia.org/r/266632 (https://phabricator.wikimedia.org/T124701)
[00:25:15] 6operations, 10vm-requests, 5Patch-For-Review: move releases.wm.org to bromine (was: request VM for releases.wm.org) - https://phabricator.wikimedia.org/T124261#1968197 (10Dzahn)
[00:25:17] 6operations, 5Patch-For-Review: Reinstall caesium with jessie (and convert to VM) - https://phabricator.wikimedia.org/T123714#1968198 (10Dzahn)
[00:27:14] 10Ops-Access-Requests, 6operations, 6Parsing-Team, 5Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1968200 (10RobH)
[00:27:17] 6operations, 5Patch-For-Review: Reinstall caesium with jessie (and convert to VM) - https://phabricator.wikimedia.org/T123714#1968202 (10Dzahn) instead of reinstalling caesium, we decided to move the only service that was on it, releases.wikimedia.org, over to an exising virtual machine, bromine.eqiad.wmnet wh...
[00:28:06] 10Ops-Access-Requests, 6operations, 6Parsing-Team, 5Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1968207 (10RobH) a:5ssastry>3RobH Please note my patchset does NOT include the actual users y...
[00:28:15] 10Ops-Access-Requests, 6operations, 6Parsing-Team, 5Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1968209 (10RobH) 5Open>3stalled
[00:28:20] 6operations, 5Patch-For-Review: Reinstall caesium with jessie (and convert to VM) - https://phabricator.wikimedia.org/T123714#1968211 (10Dzahn) 5Open>3Invalid
[00:28:52] 6operations, 5Patch-For-Review: Reinstall caesium (releases.wm.org) with jessie (and convert to VM) - https://phabricator.wikimedia.org/T123714#1968212 (10greg)
[00:29:10] gah, I was confused
[00:29:15] ignore that :)
[00:29:25] 6operations, 5Patch-For-Review: Reinstall caesium (releases.wm.org) with jessie (and convert to VM) - https://phabricator.wikimedia.org/T123714#1968214 (10Dzahn)
[00:30:20] 6operations: Reinstall caesium (releases.wm.org) with jessie (and convert to VM) - https://phabricator.wikimedia.org/T123714#1936383 (10Dzahn)
[00:30:53] 6operations, 10vm-requests: move releases.wm.org to bromine (was: request VM for releases.wm.org) - https://phabricator.wikimedia.org/T124261#1968217 (10Dzahn)
[00:33:49] <_joe_> AaronSchulz: I prepared a few patches to mediawiki-config, most still definitely need refining, but I think it's a shot in the right direction to make switching datacenters easier
[00:34:22] RECOVERY - nutcracker port on mw1161 is OK: TCP OK - 0.000 second response time on port 11212
[00:34:23] RECOVERY - Disk space on mw1161 is OK: DISK OK
[00:34:30] 6operations, 10Beta-Cluster-Infrastructure, 6Labs, 10Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1968235 (10bd808)
[00:35:03] RECOVERY - configured eth on mw1161 is OK: OK - interfaces up
[00:35:12] RECOVERY - RAID on mw1161 is OK: OK: no RAID installed
[00:35:23] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 54 minutes ago with 0 failures
[00:35:23] RECOVERY - dhclient process on mw1161 is OK: PROCS OK: 0 processes with command name dhclient
[00:35:33] RECOVERY - nutcracker process on mw1161 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[00:35:42] RECOVERY - SSH on mw1161 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0)
[00:35:43] RECOVERY - salt-minion processes on mw1161 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:35:59] 6operations, 10vm-requests: move releases.wm.org to bromine (was: request VM for releases.wm.org) - https://phabricator.wikimedia.org/T124261#1968240 (10Dzahn) Hi, if you have been subscribed to this ticket it's because you are a member in one of the "releasers-" admin groups and have shell access. This is fyi...
[00:36:02] RECOVERY - DPKG on mw1161 is OK: All packages OK
[00:38:03] (03CR) 10GWicke: "I think we want to separate ganglia and especially alerts / nagios anyway, and doing so is a lot simpler when using cluster." [puppet] - 10https://gerrit.wikimedia.org/r/238431 (https://phabricator.wikimedia.org/T112644) (owner: 10Filippo Giunchedi)
[00:39:06] mw1161 is back? who touched it?
[00:40:28] greg-g, mind if I send a couple of site-requests changes through before the end of this window?
[00:41:09] Krenair: i don't think so?
[00:41:10] :)
[00:41:21] (03PS2) 10Alex Monk: Disable NewUserMessage on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266161 (https://phabricator.wikimedia.org/T122441)
[00:41:41] (03CR) 10Alex Monk: [C: 032] Disable NewUserMessage on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266161 (https://phabricator.wikimedia.org/T122441) (owner: 10Alex Monk)
[00:42:04] (03Merged) 10jenkins-bot: Disable NewUserMessage on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266161 (https://phabricator.wikimedia.org/T122441) (owner: 10Alex Monk)
[00:43:39] (03PS2) 10Alex Monk: Change ukwikinews logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266497 (https://phabricator.wikimedia.org/T124778)
[00:43:44] (03CR) 10Alex Monk: [C: 032] Change ukwikinews logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266497 (https://phabricator.wikimedia.org/T124778) (owner: 10Alex Monk)
[00:44:21] (03Merged) 10jenkins-bot: Change ukwikinews logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266497 (https://phabricator.wikimedia.org/T124778) (owner: 10Alex Monk)
[00:44:49] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/266161/ (duration: 02m 27s)
[00:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:46:39] why is sync-masters taking so long?
[00:48:02] !log krenair@mira Synchronized w/static/images/project-logos/ukwikinews.png: https://gerrit.wikimedia.org/r/#/c/266497/ (duration: 02m 29s)
[00:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:50:30] Hello. After a 'Portal' namespace added to Wuu Wikipedia in T124389, pages originally beginning with 'Portal:' are inaccessible now (like Portal:地理 and its talk page Talk:Portal:地理). I hope someone can fix it.
[00:51:08] oh, yeah
[00:51:11] there's a script for that
[00:52:01] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/266497/ (duration: 02m 26s)
[00:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:52:59] Lantern, try now
[00:53:43] 6operations, 10Incident-20160126-WikimediaDomainRedirection, 10Traffic, 7Documentation: Automate and/or better-document varnish ban procedure for operations staff, so it can be accomplished with more speed and confidence in outage conditions - https://phabricator.wikimedia.org/T124835#1968281 (10BBlack) I'...
[00:55:17] Talk:Portal:地理 is still inaccessible, https://wuu.wikipedia.org/wiki/Talk:Portal:%E5%9C%B0%E7%90%86
[00:56:06] That hasn't moved namespaces, it's still in Talk:
[00:56:16] huh
[01:00:04] ok, thx
[01:01:02] bd808, any idea what's up with sync-masters?
[01:01:19] Krenair: nope. what are you seeing?
[01:01:34] bd808, 00:51:41 Finished sync-masters (duration: 02m 07s)
[01:02:10] Lantern, can you open a task about this?
[01:02:29] It also affects Talk:Portal:江南古镇
[01:03:06] ok
[01:03:24] i will open a task
[01:03:42] Krenair: hmm... so rsync between mira and tin is super slow. ping times don't look bad and load averages are low
[01:04:10] ssh between them seems fine
[01:07:20] 10Ops-Access-Requests, 6operations, 6Parsing-Team, 5Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1968335 (10RobH) a:5RobH>3ssastry @ssastry: Can you have your manager approve the request to...
[01:08:55] Krenair: my next guess in debugging would be to get a root involved and have them run the rsync command from /usr/local/bin/scap-master-sync with --verbose added to see it sheds any light
[01:19:35] Krenair, are you poking at namespaces? cuz I'm about to press a button... :P
[01:19:51] not at the moment MaxSem
[01:19:55] ok
[01:19:57] what button are you about to press?
[01:21:08] namespaceDupes
[01:21:25] !log running mwscript namespaceDupes.php --wiki=wuuwiki --move-talk --fix
[01:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:21:43] (I did a dry run first)
[01:22:09] "Database is read-only: Brief Database Maintenance in progress, please try again in 3 minutes"
[01:22:16] uuuuuugh? :P
[01:22:39] did you run it from mira or something?
[01:22:45] yup
[01:22:51] those codfw servers won't let you write to the DB
[01:22:57] kekeke
[01:22:59] should be using terbium
[01:23:05] ?
[01:23:26] this was discovered earlier, I did ask for someone to change the reason given to be useful, but... :/
[01:26:50] !log Fail, trying something else...
[01:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:27:11] 7Puppet, 10MediaWiki-extensions-ORES, 6Revision-Scoring-As-A-Service: Fix puppet webservice name to uwsgi-ores-web - https://phabricator.wikimedia.org/T124621#1968371 (10Ladsgroup) I don't think this is related to the extension or maybe I'm wrong
[01:29:31] !log on terbium: ran mwscript namespaceDupes.php --wiki=wuuwiki --source-pseudo-namespace='' --add-suffix=/renamed --fix
[01:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:31:40] 7Puppet, 6Revision-Scoring-As-A-Service, 10ores: Fix puppet webservice name to uwsgi-ores-web - https://phabricator.wikimedia.org/T124621#1968374 (10Ladsgroup)
[01:35:12] 7Blocked-on-Operations, 6operations, 7Graphite: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451#1968377 (10GWicke)
[01:36:13] gwicke: heh. thanks for reviving that.
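[Editor's note] The breakage Lantern reported above, and the core of what `namespaceDupes.php` repairs, can be shown with a small sketch. This is an illustration, not MediaWiki's actual implementation; `resolve_stranded` is a made-up name, and namespace id 100 for 'Portal' is an assumption. Once 'Portal' becomes a real namespace, a page row stored as (namespace 0, title 'Portal:地理') is unreachable, because the title parser now resolves that name into the Portal namespace; the script rewrites such stranded rows to where the parser looks.

```python
# Illustrative sketch only (not the real namespaceDupes.php logic).
# `namespaces` maps a newly registered prefix to its numeric namespace
# id, e.g. {'Portal': 100} -- the id is a hypothetical example value.
def resolve_stranded(ns, dbkey, namespaces):
    """Map a stranded (namespace, title) row to the location where the
    title parser now resolves it; return the row unchanged when no
    registered namespace prefix matches."""
    if ns == 0 and ':' in dbkey:
        prefix, rest = dbkey.split(':', 1)
        if prefix in namespaces:
            return namespaces[prefix], rest
    return ns, dbkey
```

The `--move-talk` flag used above presumably covers the analogous talk-page case (Talk:Portal:X moving to Portal_talk:X), which is why Talk:Portal:地理 needed a second pass after the first run.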
[01:36:53] it's an ongoing issue for us
[01:41:19] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Create new puppet group `analytics-search-users` - https://phabricator.wikimedia.org/T122620#1968383 (10RobH) a:3RobH I'll hunt someone down to review this tomorrow, it has sat long enough.
[01:50:00] 7Blocked-on-Operations, 6operations, 7Graphite: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451#1968401 (10GWicke) FYI, here are some upcoming changes in Services that will use more disk space for metrics: - We are about to split RESTBase metrics by request type (internal, in...
[01:50:42] (03PS1) 10Ori.livneh: Add a speed experiment which inlines the top stylesheet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266647
[01:51:07] (03CR) 10Ori.livneh: [C: 032] Add a speed experiment which inlines the top stylesheet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266647 (owner: 10Ori.livneh)
[01:51:37] (03Merged) 10jenkins-bot: Add a speed experiment which inlines the top stylesheet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266647 (owner: 10Ori.livneh)
[01:53:41] (03PS1) 10Ori.livneh: dotfiles: symlink .hosts/mira to .hosts/tin [puppet] - 10https://gerrit.wikimedia.org/r/266648
[01:54:10] (03CR) 10Ori.livneh: [C: 032 V: 032] dotfiles: symlink .hosts/mira to .hosts/tin [puppet] - 10https://gerrit.wikimedia.org/r/266648 (owner: 10Ori.livneh)
[01:59:52] !log ori@mira Synchronized docroot and w: Icc4f6134b0: Add a speed experiment which inlines the top stylesheet (duration: 02m 28s)
[01:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:06:53] YuviPanda: https://gerrit.wikimedia.org/r/#/c/266332/
[02:07:47] (03CR) 10Yuvipanda: [C: 031] Add a service alias for mw1017 (app server debug backend) [dns] - 10https://gerrit.wikimedia.org/r/266332 (owner: 10Ori.livneh)
[02:07:55] (03CR) 10Ori.livneh: [C: 032] Add a service alias for mw1017 (app server debug backend) [dns] - 10https://gerrit.wikimedia.org/r/266332 (owner: 10Ori.livneh)
[02:09:33] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0]
[02:13:04] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[02:23:58] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 09m 51s)
[02:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:48:08] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.11) (duration: 10m 25s)
[02:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:55:21] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Jan 27 02:55:21 UTC 2016 (duration 7m 13s)
[02:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:13:42] (03PS1) 10EBernhardson: Correct invalid shard configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266655
[03:15:15] (03CR) 10EBernhardson: [C: 032] Correct invalid shard configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266655 (owner: 10EBernhardson)
[03:15:56] (03Merged) 10jenkins-bot: Correct invalid shard configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266655 (owner: 10EBernhardson)
[03:19:29] !log ebernhardson@mira Synchronized wmf-config/CirrusSearch-production.php: Correct invalid cirrus shard configuration (duration: 02m 59s)
[03:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:50:22] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: puppet fail
[03:52:14] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0]
[03:59:22] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[04:10:03] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [24.0]
[04:13:52] (03PS1) 10EBernhardson: Reduce replica count for commonswiki_file in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266658
[04:14:16] (03CR) 10jenkins-bot: [V: 04-1] Reduce replica count for commonswiki_file in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266658 (owner: 10EBernhardson)
[04:17:57] (03CR) 10EBernhardson: "If disk space is a concern, i could try and focus this in on more specific metrics to hold onto. Within cirrussearch we want to keep serve" [puppet] - 10https://gerrit.wikimedia.org/r/266567 (owner: 10EBernhardson)
[04:18:32] RECOVERY - puppet last run on ms-fe2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:21:18] (03PS2) 10Dereckson: Raise file upload limit to 2.5 GB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266544 (https://phabricator.wikimedia.org/T116514) (owner: 10TheDJ)
[04:22:27] (03PS3) 10Dereckson: Raise file upload limit to 2500 MB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266544 (https://phabricator.wikimedia.org/T116514) (owner: 10TheDJ)
[04:41:52] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[05:20:21] (03PS1) 10EBernhardson: Allow access to graphite/events/get_data [puppet] - 10https://gerrit.wikimedia.org/r/266663
[05:24:49] (03PS2) 10EBernhardson: Allow access to graphite/events/get_data [puppet] - 10https://gerrit.wikimedia.org/r/266663
[05:29:59] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1968538 (10Tgr) >>! In T124440#1966254, @Legoktm wrote: > It's still running :/ Opened T124861 about that.
[05:58:14] RECOVERY - Kafka Cluster analytics-eqiad Broker Messages In Per Second on graphite1001 is OK: OK: No anomaly detected
[06:14:12] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0]
[06:29:53] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail
[06:30:42] (03PS1) 10KartikMistry: Beta: Add cxserver registry to Beta [puppet] - 10https://gerrit.wikimedia.org/r/266668
[06:30:42] PROBLEM - puppet last run on mc1017 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:43] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:52] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:53] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:03] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:43] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:02] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:44] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:22] (03CR) 10Florianschmidtwelzow: [C: 031] Raise file upload limit to 2500 MB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266544 (https://phabricator.wikimedia.org/T116514) (owner: 10TheDJ)
[06:42:33] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[06:47:29] (03PS2) 10KartikMistry: Beta: Add cxserver registry to Beta [puppet] - 10https://gerrit.wikimedia.org/r/266668
[06:53:42] PROBLEM - puppet last run on mw1083 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:55:43] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:56:42] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [24.0]
[06:57:04] RECOVERY - puppet last run on mc1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:13] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:57:13] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:57:22] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:23] PROBLEM - puppet last run on mw1134 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:57:32] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:58:12] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:13] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:23] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:08:16] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: Deploy WDQS node on codfw - https://phabricator.wikimedia.org/T124862#1968603 (10Smalyshev) 3NEW
[07:08:31] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: Adjust balance of WDQS nodes to allow continued operation if eqiad went offline. - https://phabricator.wikimedia.org/T124627#1968613 (10Smalyshev)
[07:08:34] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: Deploy WDQS node on codfw - https://phabricator.wikimedia.org/T124862#1968614 (10Smalyshev)
[07:10:06] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: Deploy WDQS node on codfw - https://phabricator.wikimedia.org/T124862#1968603 (10Smalyshev)
[07:12:53] <_joe_> SMalyshev: I doubt this can happen this quarter
[07:13:24] _joe_: what is the blocker - hw, time, something else?
[07:13:31] <_joe_> time, mainly
[07:13:43] _joe_: we've got new ops guy, maybe he could help?
[07:13:53] <_joe_> hw, it must come from your budget :)
[07:14:01] <_joe_> SMalyshev: when did guillame joins?
[07:14:12] _joe_: next week I understand
[07:14:22] <_joe_> ok so I got that right :P
[07:14:33] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[07:14:47] <_joe_> SMalyshev: to be honest, I'd like him to help with the switchover of ES to codfw when the time arrives
[07:15:02] <_joe_> given it's a shared goal this quarter
[07:15:08] _joe_: yeah so maybe he could help. It shouldn't be a lot of work. But it's not super-urgent - it's just part of making us less critically dependent on one cluster
[07:15:23] <_joe_> SMalyshev: I agree fully it needs to be done :)
[07:16:22] _joe_: well, while he gets more familiar with eqiad/codfw stuff, that may come as one of the tasks too :) anyway, I just created the task so we know it should be done. We'll see how it works budget/time wise, if we have to wait for a couple of months, no problem, current servers work just fine for now
[07:16:47] * _joe_ nod
[07:20:13] RECOVERY - puppet last run on mw1083 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:24:02] RECOVERY - puppet last run on mw1134 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:32:32] (03CR) 10Amire80: [C: 04-1] Beta: Add cxserver registry to Beta (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266668 (owner: 10KartikMistry)
[07:33:05] wdqs in codfw will probably be begining of next FY or so. Or at least it came up during planning for next FY, sounds like i should include a machine in the budget (and note we are giving ops back a machine in eqiad) ?
[07:48:01] ebernhardson: that's a good idea
[07:50:51] (03PS3) 10KartikMistry: Beta: Add cxserver registry to Beta [puppet] - 10https://gerrit.wikimedia.org/r/266668
[07:51:58] (03CR) 10KartikMistry: Beta: Add cxserver registry to Beta (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266668 (owner: 10KartikMistry)
[08:01:35] <_joe_> ebernhardson: include at least two
[08:01:46] <_joe_> we don't want to have to switch datacenters if one machine fails
[08:15:47] (03CR) 10Amire80: [C: 04-1] Beta: Add cxserver registry to Beta (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266668 (owner: 10KartikMistry)
[08:29:05] (03PS4) 10KartikMistry: Beta: Add cxserver registry to Beta [puppet] - 10https://gerrit.wikimedia.org/r/266668
[08:30:35] (03CR) 10Amire80: [C: 031] Beta: Add cxserver registry to Beta [puppet] - 10https://gerrit.wikimedia.org/r/266668 (owner: 10KartikMistry)
[08:31:56] Can anyone from Ops merge https://gerrit.wikimedia.org/r/#/c/266668/ ? It will 'unbreak' beta Content Translation.
[08:32:02] akosiaris: godog ^^
[08:35:14] kart_: it can be cherry-picked on the beta puppetmaster
[08:35:52] that is probably a good idea anyway, since i'm not sure adding 1,145 lines of hiera data is the right way to do this
[08:38:17] ori: better as of now :)
[08:38:28] great
[08:42:29] <_joe_> thanks ori
[08:42:44] <_joe_> I wasn't paying attention to this channel early enough :/
[08:43:55] (03CR) 10KartikMistry: "Cherry-picked to Beta, but https://cxserver-beta.wmflabs.org/v1#!/Languages/get_v1_languagepairs is still empty, so I will look into this " [puppet] - 10https://gerrit.wikimedia.org/r/266668 (owner: 10KartikMistry)
[08:44:07] (03CR) 10Giuseppe Lavagetto: Use the logical redis definition for GettingStarted. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266481 (https://phabricator.wikimedia.org/T124671) (owner: 10Giuseppe Lavagetto)
[08:44:55] (03PS3) 10Giuseppe Lavagetto: Use the logical redis definition for GettingStarted. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266481 (https://phabricator.wikimedia.org/T124671)
[08:45:16] good morning
[08:45:28] (03PS2) 10Giuseppe Lavagetto: Rationalize definition of service hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266509 (https://phabricator.wikimedia.org/T114273)
[08:45:43] _joe_: ori: could you confirm production redis servers have been transitioned to Jessie ?
[08:46:07] the redis servers on beta cluster are on Trusty and the redis-server package there doesn't support one of the new option we are using
[08:46:18] (oh and good morning / night)
[08:46:37] which option?
[08:46:38] I can check
[08:46:49] <_joe_> hashar: in eqiad most redises are precises
[08:47:16] <_joe_> so it's still redis 2.6
[08:47:35] <_joe_> hashar: they'll be moved to jessie during this quarter, I guess
[08:47:39] no, I updated those to 2.8 with a backported package
[08:48:07] <_joe_> ori: which ones?
[08:48:12] ori: 'latency-monitor-threshold 100'
[08:48:16] <_joe_> rdb1001 has redis-server 2.6
[08:48:25] our main bug is https://phabricator.wikimedia.org/T124677 (job queue broken)
[08:48:32] <_joe_> ii redis-server 2:2.6.13-1+wmf1 Persistent key-value database with network interface
[08:49:13] not sure; we were seeing latency spikes and i wanted to use the latency monitor
[08:49:15] <_joe_> hashar: which version of redis do you have in beta?
[08:49:22] there is a couple other tasks that got merged in, but all related to deployment-redis01 being dead (because it can't start)
[08:49:36] what ever is shipped by Trusty so 2.8.4-2+wmf1
[08:49:41] i'll add a conditional
[08:49:45] whereas Jessie ships 3.0.6-2~bpo8+1
[08:50:02] hrm
[08:50:04] it's there
[08:50:11] it's in a if os_version('debian >= jessie') { } block
[08:50:13] I am surprised it hasn't impacted production yet , but I guess the redis server services are rather stables
[08:51:07] <_joe_> ori: uhm maybe os_version doesn't behave as it's supposed to be?
[08:51:08] oh
[08:51:24] hashar: https://github.com/wikimedia/operations-puppet/blob/production/modules/redis/manifests/init.pp#L36-L42
[08:51:27] or the redis configuration file got generated before the os_version harness has been enabled
[08:51:35] could be
[08:51:53] simply deleting the line should resolve it, then
[08:51:55] <_joe_> hashar: and you never ran puppet again?
[08:52:04] surely os_version being broken would have been noticed and iirc it is covered by tests (though they could be wrong)
[08:52:06] <_joe_> oh I see there is no ensure => absent
[08:52:09] <_joe_> damn puppet
[08:52:23] I know puppet fails to apply some refresh from time to time
[08:52:37] i don't mean to sneak off, but i'm really tired
[08:52:47] sounds like this can be solved by editing out the line
[08:52:51] <_joe_> go to bed, I think I can figure this out :)
[08:52:53] <_joe_> and yes
[08:52:55] ori: go go to bed :-}
[08:53:07] <_joe_> hashar: I'm on it
[08:53:09] ori: thank you for the confirmation we still have Precise redis on prod.
[08:53:36] redis.conf:#latency-monitor-threshold 100
[08:53:42] looks like it has been monkey patched
[08:54:05] quoting Mukunda "I commented the line from the config file and started redis, I'm going to leave it to ori to decide what to do about a permanent solution."
[08:54:48] <_joe_> yeah but he did that wrong
[08:54:54] <_joe_> let me fix this
[08:56:08] <_joe_> ok it seems allright now
[08:56:48] <_joe_> hashar: so the problem is that file_line wasn't absented on non-jessie hosts after it was already applied
[08:56:50] ?
[08:57:01] hm
[08:57:12] ohhh
[08:57:29] <_joe_> and things like file_line, cron, etc do remain on the system, they're simply unmanaged
[08:57:34] _joe_: that is because we monkey patch the configuration file that is provided by the .deb package isn't it ?
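[Editor's note] The manual cleanup being discussed here ("simply deleting the line") amounts to stripping the stale directive that the no-longer-managed `file_line` left behind in redis.conf. A minimal sketch, with a made-up helper name (`strip_directive`), assuming the fix is simply to drop any line setting the unsupported option on hosts whose redis-server predates it:

```python
# Hypothetical sketch of the manual redis.conf cleanup; strip_directive
# is a made-up name, not a real tool used here.
def strip_directive(conf_text, directive='latency-monitor-threshold'):
    """Return redis.conf contents with any line that sets `directive`
    removed. Commented-out copies (lines starting with '#') are kept,
    matching the '#latency-monitor-threshold 100' seen above."""
    kept = [line for line in conf_text.splitlines()
            if not line.lstrip().startswith(directive)]
    return '\n'.join(kept)
```

Note the underlying puppet issue is separate: `file_line` (like `cron`) leaves its content on disk once the resource stops being applied, unless an `ensure => absent` counterpart is added.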
[08:57:35] <_joe_> a fact we often forget [08:57:39] <_joe_> yes [08:57:57] <_joe_> we patch, monkey-patching is something else :) [08:58:11] so should we manually edit them or is there a change to apply in puppet? [08:58:24] twentyafterfour: hello! basically the latency-monitor-threshold invalid value is a leftover [08:58:27] <_joe_> manually edit it [08:58:40] <_joe_> I reapplied puppet and it didn't come back [08:58:50] twentyafterfour: it is only supposed to be applied on Jessie, and production uses an old redis-server not supporting that setting [08:58:58] _joe_: doing the mass edits :-} [08:59:03] thank you very much [08:59:21] _joe_: would you mind writing a quick summary on https://phabricator.wikimedia.org/T124677 ? [08:59:21] <_joe_> I did nothing :) [08:59:24] for the record [08:59:30] <_joe_> yup [08:59:39] well at least explain how puppet (mis?)behaves [09:00:33] so why was commenting the line the wrong thing to do? It seemed like a valid temporary fix. [09:00:59] <_joe_> twentyafterfour: no it seemed to me that you didn't restart all the redis instances, while you did [09:01:07] <_joe_> twentyafterfour: you did the right thing [09:01:53] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0] [09:06:33] :) [09:08:25] Notice: /Stage[main]/Sysfs/Service[sysfsutils]/ensure: ensure changed 'stopped' to 'running' [09:08:25] Info: /Stage[main]/Sysfs/Service[sysfsutils]: Unscheduling refresh on Service[sysfsutils] [09:08:26] bah [09:08:33] it is not even a daemon :} [09:19:12] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [09:20:53] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [09:21:33] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold
[1000.0] [09:23:23] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [09:24:56] Debug: Executing '/etc/init.d/sysfsutils status' [09:24:56] Debug: Executing '/etc/init.d/sysfsutils start' [09:24:56] Notice: /Stage[main]/Sysfs/Service[sysfsutils]/ensure: ensure changed 'stopped' to 'running' [09:24:56] ah [09:25:02] and there is no status .. [09:26:28] <_joe_> hashar: so hasstatus => no should fix that, maybe [09:26:50] i guess [09:27:02] gotta look at what happens on other distributions [09:27:20] <_joe_> I have no time to look into it, sorry [09:27:33] i will [09:27:39] just sharing my thoughts out loud [09:27:45] since I feel lonely in my coworking place [09:32:02] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 58.62% of data above the critical threshold [5000000.0] [09:37:52] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [09:41:05] 5xx reqs/min getting better, it looks like a spike [09:44:53] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:46:14] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [09:46:43] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:47:40] (03PS1) 10Hashar: sysfs: puppet always restarted the sysfsutils service [puppet] - 10https://gerrit.wikimedia.org/r/266684 [09:54:04] (03PS2) 10Hashar: sysfs: puppet always restarted the sysfsutils service [puppet] - 10https://gerrit.wikimedia.org/r/266684 [10:02:03] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 76 failures [10:03:03] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [24.0] [10:10:50] 7Puppet, 6operations, 10Salt: Make it possible for wmf-reimage to work seamlessly 
with a non-local salt master - https://phabricator.wikimedia.org/T124761#1968960 (10ArielGlenn) We could write a runner for the salt master that accepts a key after checking the puppet accepted cert, and we could configure the... [10:11:35] (03PS3) 10Hashar: sysfs: puppet always restarted the sysfsutils service [puppet] - 10https://gerrit.wikimedia.org/r/266684 [10:12:20] PROBLEM - NTP on mc2009 is CRITICAL: NTP CRITICAL: No response from NTP server [10:12:20] PROBLEM - NTP on mc2012 is CRITICAL: NTP CRITICAL: No response from NTP server [10:13:39] (03CR) 10Hashar: "Else puppet keeps attempting to restart sysfsutils :(" [puppet] - 10https://gerrit.wikimedia.org/r/266684 (owner: 10Hashar) [10:15:18] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:15:39] <_joe_> oh gee, toollabs [10:16:56] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.075 second response time [10:18:05] looking into ntpd on mc2009/2012 [10:20:15] (03PS3) 10Giuseppe Lavagetto: Rationalize definition of service hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266509 (https://phabricator.wikimedia.org/T114273) [10:20:53] puppet failures on mw1119 are due to lack of memory [10:21:03] (03CR) 10jenkins-bot: [V: 04-1] Rationalize definition of service hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266509 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [10:22:51] (03CR) 10Alex Monk: "New config file, will need to be added to noc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266509 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [10:23:12] (03PS4) 10Giuseppe Lavagetto: Rationalize definition of service hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266509 (https://phabricator.wikimedia.org/T114273) [10:23:20] <_joe_> Krenair: ah, right [10:23:50] <_joe_> thanks 
[10:25:51] !log restarting apache2 and hhvm on mw1119 [10:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:28:13] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [10:29:02] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [10:30:55] (03PS5) 10Giuseppe Lavagetto: Rationalize definition of service hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266509 (https://phabricator.wikimedia.org/T114273) [10:34:00] 7Puppet, 6operations, 10Salt: Make it possible for wmf-reimage to work seamlessly with a non-local salt master - https://phabricator.wikimedia.org/T124761#1969025 (10Joe) @ArielGlenn it seems like a good idea. [10:34:31] (03PS1) 10Muehlenhoff: Fix random ntp startup failures [puppet] - 10https://gerrit.wikimedia.org/r/266693 [10:35:12] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 765 [10:39:12] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [24.0] [10:40:12] RECOVERY - check_mysql on db1008 is OK: Uptime: 673316 Threads: 2 Questions: 4975629 Slow queries: 4496 Opens: 1802 Flush tables: 2 Open tables: 417 Queries per second avg: 7.389 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:42:11] (03PS5) 10Giuseppe Lavagetto: Define Production service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) [10:43:17] (03CR) 10jenkins-bot: [V: 04-1] Define Production service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [10:47:31] 6operations, 10Incident-20160126-WikimediaDomainRedirection, 10Wikimedia-Apache-configuration, 10netops: All wikis under 
wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1969049 (10Aklapper) >>! In T124804#1968040, @TheD... [10:48:33] (03PS2) 10KartikMistry: cxserver: Enable new pairs for Yandex MT [puppet] - 10https://gerrit.wikimedia.org/r/265691 (https://phabricator.wikimedia.org/T121053) [10:48:56] (03CR) 10Ema: [C: 031] Fix random ntp startup failures [puppet] - 10https://gerrit.wikimedia.org/r/266693 (owner: 10Muehlenhoff) [10:51:34] RECOVERY - NTP on mc2009 is OK: NTP OK: Offset -0.0001429319382 secs [10:53:20] godog: or akosiaris: around? [10:53:23] RECOVERY - NTP on mc2012 is OK: NTP OK: Offset 0.0004059076309 secs [10:53:27] (03PS6) 10Giuseppe Lavagetto: Define Production service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) [10:53:43] kart_: yup [10:53:57] kart_: how can I help ? [10:54:22] akosiaris: deploy the https://gerrit.wikimedia.org/r/#/c/265691/ in around 2-3 hours time? :) [10:54:35] akosiaris: let me know. I need to let other people before it. [10:54:36] sounds like a very good candidate for puppet swat [10:54:40] That's all :) [10:54:46] lemme check and +1 it if it's ok [10:54:48] akosiaris: sadly no puppet SWAT today? [10:55:00] a wednesday [10:55:02] indeed [10:55:05] ok [10:55:10] I will anyway be around indeed [10:55:11] <_joe_> why are we using hiera for such ginormous config? [10:55:31] <_joe_> (this is probably the 20th time I ask) [10:55:36] _joe_: it's the old in cxserver config vs in puppet config issue. cxserver had a regression [10:56:02] it used to be moved into the cxserver repo instead of puppet but with the migration to service-runner there was a regression LE is still investigating [10:58:35] _joe_: I'm working on it. [10:58:45] <_joe_> ok ok :) [10:59:28] akosiaris: thanks. I will ping for 'go ahead'. 
[11:03:21] 6operations, 10DBA, 10MediaWiki-Configuration, 6Release-Engineering-Team, and 2 others: codfw is in read only according to mediawiki - https://phabricator.wikimedia.org/T124795#1969195 (10Joe) jcrespo: how do we make it 100% read-only? is there an easy way to do that? I agree that we should stop using cro... [11:05:43] jynus: so, on Feb 3rd, I will need to create a backup copy of the OTRS database in the fastest possible way. What's your recommendation ? [11:05:47] mydumper ? [11:07:52] guc outage, is it worth notifying? [11:08:07] <_joe_> guc? [11:08:53] global user contributions [11:09:02] labs' tool [11:09:15] <_joe_> oh, sorry, I wasn't thinking about labs :P [11:09:30] usually no, if it is a tool no it is not [11:09:35] <_joe_> Vito: I'll get in #wikimedia-labs, if help is needed [11:09:43] but maybe we can help [11:09:53] <_joe_> yeah, my point too [11:11:41] <_joe_> Vito: I just tried to use it and it seems to work [11:12:32] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [11:13:52] _joe_: seems some istance is gone, so it should happen randomly [11:14:04] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [11:14:58] (03CR) 10Alexandros Kosiaris: [C: 031] "https://puppet-compiler.wmflabs.org/1659/ says OK, this is ready for merge" [puppet] - 10https://gerrit.wikimedia.org/r/265691 (https://phabricator.wikimedia.org/T121053) (owner: 10KartikMistry) [11:16:45] 6operations, 10vm-requests: move releases.wm.org to bromine (was: request VM for releases.wm.org) - https://phabricator.wikimedia.org/T124261#1969251 (10akosiaris) >>! In T124261#1967740, @Dzahn wrote: > @akosiaris It does mean that all shell users who are in "releasers-mediawiki" or "releasers-mobile" now get... 
[11:16:59] 6operations: move releases.wm.org to bromine (was: request VM for releases.wm.org) - https://phabricator.wikimedia.org/T124261#1969252 (10akosiaris) [11:21:54] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [11:24:03] PROBLEM - Kafka Cluster analytics-eqiad Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 15 data above and 45 below the confidence bounds [11:31:43] PROBLEM - puppet last run on elastic1009 is CRITICAL: CRITICAL: Puppet has 1 failures [11:36:07] 6operations, 5Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Check the memcached and redis (sessions) configuration and functionality in codfw - https://phabricator.wikimedia.org/T124879#1969286 (10Joe) 3NEW [11:36:48] akosiaris stopping replication, probably [11:37:15] revert by failovering to the slave [11:40:06] let me see what else is there on the shard to make it possible [11:43:57] I 'll probably revert within 8 tops if all goes south [11:44:04] 8 hours that is [11:52:35] RECOVERY - Kafka Cluster analytics-eqiad Broker Messages In Per Second on graphite1001 is OK: OK: No anomaly detected [11:54:47] 6operations, 10Traffic: update the multicast purging documentation - https://phabricator.wikimedia.org/T82096#1969329 (10hashar) Great, thank you @BBlack [11:56:27] other than that, maybe creating a snapshot [11:58:14] RECOVERY - puppet last run on elastic1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:59:23] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0] [12:00:54] 6operations, 10Incident-20160126-WikimediaDomainRedirection, 10Traffic, 7Documentation: Automate and/or better-document varnish ban procedure for operations staff, so it can be accomplished with more speed and confidence in outage conditions - https://phabricator.wikimedia.org/T124835#1969361 (10BBlack) 5... 
[12:06:42] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [12:09:56] in general, creating a backups is not a problem, recovering it when it is not the only thing on that server is :-/ [12:13:43] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0] [12:16:44] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [12:18:32] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [12:21:02] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [12:22:34] PROBLEM - HHVM rendering on mw1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50430 bytes in 0.044 second response time [12:23:04] PROBLEM - Apache HTTP on mw1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50430 bytes in 0.093 second response time [12:28:23] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.049 second response time [12:29:22] !log rebooting analytics1028 for kernel update [12:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:29:54] RECOVERY - HHVM rendering on mw1019 is OK: HTTP OK: HTTP/1.1 200 OK - 67653 bytes in 0.126 second response time [12:33:58] jynus :O [12:34:11] Hey [12:39:22] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0] [12:42:06] Bsadowski1, ? 
[12:42:53] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [12:52:03] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: Puppet has 1 failures [12:57:00] 6operations, 10DBA: External Storage on codfw (es2005-2010) is consuming 100-90GB of disk space per server and per month and it has 370GB available - https://phabricator.wikimedia.org/T119056#1969500 (10mark) p:5Normal>3High [13:04:00] (03CR) 10Alexandros Kosiaris: [C: 032] cxserver: Enable new pairs for Yandex MT [puppet] - 10https://gerrit.wikimedia.org/r/265691 (https://phabricator.wikimedia.org/T121053) (owner: 10KartikMistry) [13:04:05] (03PS3) 10Alexandros Kosiaris: cxserver: Enable new pairs for Yandex MT [puppet] - 10https://gerrit.wikimedia.org/r/265691 (https://phabricator.wikimedia.org/T121053) (owner: 10KartikMistry) [13:04:09] (03CR) 10Alexandros Kosiaris: [V: 032] cxserver: Enable new pairs for Yandex MT [puppet] - 10https://gerrit.wikimedia.org/r/265691 (https://phabricator.wikimedia.org/T121053) (owner: 10KartikMistry) [13:05:30] (03PS2) 10Bene: Use custom generator for mobile search on Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254645 (https://phabricator.wikimedia.org/T85368) [13:06:09] (03CR) 10Bene: [C: 031] "I think the issues have been resolved in the task and this should be ready to get merged." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254645 (https://phabricator.wikimedia.org/T85368) (owner: 10Bene) [13:06:35] akosiaris: thanks. 
[13:10:21] !log rebooting analytics1029 for kernel upgrade [13:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:15:22] PROBLEM - DPKG on fermium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:15:36] !log rebooting fermium for kernel upgrades [13:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:15:39] that's me ^ [13:17:12] RECOVERY - DPKG on fermium is OK: All packages OK [13:19:12] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:19:14] (03PS1) 10Giuseppe Lavagetto: conftool: temporarily remove the appservers that are down. [puppet] - 10https://gerrit.wikimedia.org/r/266711 [13:22:32] (03PS2) 10Giuseppe Lavagetto: conftool: temporarily remove the appservers that are down. [puppet] - 10https://gerrit.wikimedia.org/r/266711 [13:24:01] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool: temporarily remove the appservers that are down. [puppet] - 10https://gerrit.wikimedia.org/r/266711 (owner: 10Giuseppe Lavagetto) [13:29:04] akosiaris: can you check /etc/cxserver/config.yaml? Our change isn't reflected there (yet). [13:29:23] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [13:29:55] akosiaris: on sca1001/1002 [13:31:13] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [13:32:24] akosiaris: any idea how long it will take? It is usually fast. [13:32:25] !log rebooting analytics1030/1031 for kernel upgrade [13:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:32:34] kart_: takes a while for the change to propagate. tops 30 mins [13:32:40] a from what I see it is there now [13:33:54] kart_: so I saw that it is there now, I assume you are ok ? 
[13:34:11] <_joe_> mark: you think you can trick me in doing budget? ;) [13:34:17] 6operations, 10DBA: External Storage on codfw (es2005-2010) is consuming 100-90GB of disk space per server and per month and it has 370GB available - https://phabricator.wikimedia.org/T119056#1969592 (10mark) Let's aim for the same specs as the ES refresh we did for eqiad recently, and get quotes ASAP. [13:34:17] akosiaris: ok. Working. [13:34:19] <_joe_> uh wrong channel [13:34:30] akosiaris: I will keep this time in my mind from next time. [13:34:35] _joe_: well you may want to make sure I have budget for my staff next year ;) [13:34:35] Sorry for noise! [13:34:40] <_joe_> eheh [13:34:43] <_joe_> fair enough [13:50:11] (03PS5) 10Mdann52: Add 2 sites to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/262893 (https://phabricator.wikimedia.org/T122995) [13:50:28] (03CR) 10jenkins-bot: [V: 04-1] Add 2 sites to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/262893 (https://phabricator.wikimedia.org/T122995) (owner: 10Mdann52) [13:50:52] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 73.91% of data above the critical threshold [5000000.0] [13:53:50] (03PS1) 10Giuseppe Lavagetto: conftool-data: remove mw1031, decommissioned [puppet] - 10https://gerrit.wikimedia.org/r/266716 [13:54:14] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool-data: remove mw1031, decommissioned [puppet] - 10https://gerrit.wikimedia.org/r/266716 (owner: 10Giuseppe Lavagetto) [13:54:28] (03CR) 10Giuseppe Lavagetto: [V: 032] conftool-data: remove mw1031, decommissioned [puppet] - 10https://gerrit.wikimedia.org/r/266716 (owner: 10Giuseppe Lavagetto) [13:54:51] (03PS1) 10Jcrespo: Repool pc1003 for cloning to pc1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266717 [13:55:21] (03PS2) 10Jcrespo: Depool pc1003 for cloning to pc1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266717 [13:57:39] (03CR) 10Jcrespo: [C: 032] 
Depool pc1003 for cloning to pc1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266717 (owner: 10Jcrespo) [13:59:38] !log about to going new hardware/OS/mariadb-only for parsercache service [13:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:01:06] (03PS1) 10KartikMistry: cxserver: Add missing ru as source for MT [puppet] - 10https://gerrit.wikimedia.org/r/266721 [14:01:39] (03PS1) 10Muehlenhoff: Grant icinga permissions to ema and elukey [puppet] - 10https://gerrit.wikimedia.org/r/266722 [14:01:43] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [14:02:16] (03PS1) 10BBlack: X-Cache: add "int" status for internal responses [puppet] - 10https://gerrit.wikimedia.org/r/266723 [14:02:58] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] cxserver: Add missing ru as source for MT [puppet] - 10https://gerrit.wikimedia.org/r/266721 (owner: 10KartikMistry) [14:03:05] (03PS2) 10Alexandros Kosiaris: cxserver: Add missing ru as source for MT [puppet] - 10https://gerrit.wikimedia.org/r/266721 (owner: 10KartikMistry) [14:03:11] (03CR) 10Alexandros Kosiaris: [V: 032] cxserver: Add missing ru as source for MT [puppet] - 10https://gerrit.wikimedia.org/r/266721 (owner: 10KartikMistry) [14:03:39] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Depool pc1003 for cloning to pc1006 (duration: 02m 30s) [14:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:04:02] !log rebooting analytics 1032 to 1035 for kernel upgrades [14:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:04:14] PROBLEM - NTP on seaborgium is CRITICAL: NTP CRITICAL: Offset 26.52398765 secs [14:04:33] kart_: https://gerrit.wikimedia.org/r/266721 merged [14:04:48] cool. Thanks! 
[14:11:02] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [14:11:34] (03PS2) 10Muehlenhoff: Grant icinga permissions to ema and elukey [puppet] - 10https://gerrit.wikimedia.org/r/266722 [14:11:54] (03CR) 10Muehlenhoff: [C: 032 V: 032] Grant icinga permissions to ema and elukey [puppet] - 10https://gerrit.wikimedia.org/r/266722 (owner: 10Muehlenhoff) [14:12:17] (03PS1) 10Giuseppe Lavagetto: [WiP] Allow treating pooled=inactive differently from pooled=no in the etcd driver [debs/pybal] - 10https://gerrit.wikimedia.org/r/266728 [14:12:37] <_joe_> bblack: ^^ this is a sketch of what needs to be done, but I'm not satisfied with it [14:12:42] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [14:13:34] 6operations, 10Graphoid, 6Services, 10Traffic: Remove graphoid from parsoidcache - https://phabricator.wikimedia.org/T110477#1969647 (10BBlack) I'm doing some final validation now (checking request logs for any trailing requests to these hostnames). Will upload the changes to remove this, but not merge ye... [14:14:14] (03PS1) 10BBlack: graphoid(.eqiad).wm.o hostname removal [dns] - 10https://gerrit.wikimedia.org/r/266731 (https://phabricator.wikimedia.org/T110477) [14:14:28] (03PS1) 10BBlack: graphoid(.eqiad).wm.o VCL removal [puppet] - 10https://gerrit.wikimedia.org/r/266732 (https://phabricator.wikimedia.org/T110477) [14:18:32] 6operations, 10ContentTranslation-cxserver, 6Services, 10Traffic: Remove cxserver from parsoidcache cluster - https://phabricator.wikimedia.org/T110478#1969652 (10BBlack) FYI, I'm still seeing live requests to the cxserver public hostnames on cache_parsoid, e.g. ``` 32 RxURL c /v1/dictionary/rec... 
[14:19:05] 7Puppet, 6Revision-Scoring-As-A-Service, 10ores: Fix puppet webservice name to uwsgi-ores-web - https://phabricator.wikimedia.org/T124621#1969653 (10Halfak) Yeah. That's right. My mistake! [14:22:47] 6operations, 10RESTBase, 6Services, 10Traffic: Remove restbase from parsoidcache - https://phabricator.wikimedia.org/T110475#1969654 (10BBlack) The problem with the redirect is it's complicated, because we still have this conflict between internal and external RB URLs due to the whole `Host:` header vs `/h... [14:30:11] (03PS2) 10BBlack: graphoid(.eqiad).wm.o VCL removal [puppet] - 10https://gerrit.wikimedia.org/r/266732 (https://phabricator.wikimedia.org/T110477) [14:35:51] !log analytics 1035 hasn't been rebooted because it is a Hadoop Journal Node (will be restarted in the end) [14:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:36:56] ooo, hi elukey! what's happening (still checking email) [14:38:23] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [14:40:04] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [14:40:12] ottomata: o/ rebooting all the nodes to update the kernel, nothing big :) [14:42:52] ottomata: with the notable exception of the hadoop master/standby :-) [14:44:26] ah ok [14:44:28] cool [14:45:09] !log rebooting analytics 1036 to 1039 for kernel upgrade [14:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:47:20] (03CR) 10DCausse: "left one comment but the unit test already detected the problem :)" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266658 (owner: 10EBernhardson) [14:58:45] !log cloning persercache contents from pc1003 to pc1006 [14:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:01:39] 
(03PS1) 10BBlack: cxserver, citoid -> cache_text cluster [dns] - 10https://gerrit.wikimedia.org/r/266739 (https://phabricator.wikimedia.org/T110476) [15:02:01] (03PS3) 10BBlack: graphoid(.eqiad).wm.o VCL removal [puppet] - 10https://gerrit.wikimedia.org/r/266732 (https://phabricator.wikimedia.org/T110477) [15:02:03] (03PS1) 10BBlack: Text VCL: Add support for citoid+cxserver passes [puppet] - 10https://gerrit.wikimedia.org/r/266740 (https://phabricator.wikimedia.org/T110476) [15:02:05] (03PS1) 10BBlack: cache_parsoid: remove citoid+cxserver pass-through [puppet] - 10https://gerrit.wikimedia.org/r/266741 (https://phabricator.wikimedia.org/T110476) [15:06:31] elukey: ja analytics1026 you can just do anytime, it'll be fine [15:06:34] 1027 hm. [15:06:50] can we coordinate that with this? [15:06:51] https://phabricator.wikimedia.org/T110090 [15:07:08] i am ready to do it, but keep putting it off because i was going to do it after we do the mobile->text changes [15:07:36] 6operations, 10RESTBase, 6Services, 10Traffic: Remove restbase from parsoidcache - https://phabricator.wikimedia.org/T110475#1969780 (10BBlack) Well, the other thing I can do to make this simple is just treat it like the legacy citoid/cxserver entrypoints: if it's one of the legacy restbase hostnames, just... [15:09:43] ottomata: rebooting a journalnode host is fine as long as two others are active in the cluster, right?
[15:11:22] (03PS2) 10Muehlenhoff: Fix random ntp startup failures [puppet] - 10https://gerrit.wikimedia.org/r/266693 [15:11:37] (03CR) 10Muehlenhoff: [C: 032 V: 032] Fix random ntp startup failures [puppet] - 10https://gerrit.wikimedia.org/r/266693 (owner: 10Muehlenhoff) [15:11:39] (03PS1) 10BBlack: Text VCL: add support for legacy rb passes [puppet] - 10https://gerrit.wikimedia.org/r/266747 (https://phabricator.wikimedia.org/T110475) [15:11:41] (03PS1) 10BBlack: cache_parsoid: remove restbase pass-through [puppet] - 10https://gerrit.wikimedia.org/r/266748 (https://phabricator.wikimedia.org/T110475) [15:11:45] (03PS1) 10BBlack: restbase legacy hostnames -> cache_text cluster [dns] - 10https://gerrit.wikimedia.org/r/266749 (https://phabricator.wikimedia.org/T110475) [15:12:42] moritzm: correct [15:12:45] one at a time they will be just fine [15:13:02] ok [15:13:29] 6operations, 10ContentTranslation-cxserver, 6Services, 10Traffic, 5Patch-For-Review: Remove cxserver from parsoidcache cluster - https://phabricator.wikimedia.org/T110478#1969808 (10Nikerabbit) >>! In T110478#1965406, @BBlack wrote: > Does that imply that **nothing** should be using the hostnames `cxserv... [15:14:42] ottomata: yes I'll skip 2017 [15:14:47] *1027 [15:16:30] moritzm: does anything special need to happen to apply this other than a reboot? [15:16:43] PROBLEM - Kafka Cluster analytics-eqiad Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 47 below the confidence bounds [15:16:49] hm! [15:16:54] probably because I changed metrics, checking.. 
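The JournalNode answer above ("one at a time they will be just fine") is the usual HDFS rule: the NameNode needs a strict majority of the journal quorum to keep committing edits, so with three JournalNodes one may be down at a time. A toy illustration of the arithmetic (function name hypothetical):

```python
def journalnode_quorum_ok(total: int, active: int) -> bool:
    """True if the active JournalNodes still form a strict majority,
    which is what HDFS needs to keep writing its edit log."""
    return active > total // 2

# Three JournalNodes: losing one is fine, losing two is not.
assert journalnode_quorum_ok(3, 2)
assert not journalnode_quorum_ok(3, 1)
```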
[15:17:13] moritzm: if you hold off on 1027, I'm hoping to move some services off of there soon, and i have to schedule some maintenance for it anyway [15:21:27] ottomata: ok for 1027 [15:21:44] ok cool [15:22:04] ottomata: just installing the new kernel and a reboot (but the new kernel has been installed on all analytics hosts already) [15:22:07] i should be able to do that shortly after the mobile->text merge is complete [15:22:19] perfect [15:22:19] 6operations, 10ContentTranslation-cxserver, 6Services, 10Traffic, 5Patch-For-Review: Remove cxserver from parsoidcache cluster - https://phabricator.wikimedia.org/T110478#1969829 (10BBlack) Ok. I was under the impression that as part of some eventual plan, the CX extensions would switch to using public... [15:29:54] 6operations, 10ContentTranslation-cxserver, 6Services, 10Traffic, 5Patch-For-Review: Remove cxserver from parsoidcache cluster - https://phabricator.wikimedia.org/T110478#1969853 (10akosiaris) >>! In T110478#1969829, @BBlack wrote: > Ok. I was under the impression that as part of some eventual plan, the... [15:31:52] (03PS5) 10Ottomata: Create new puppet group analytics-search-users [puppet] - 10https://gerrit.wikimedia.org/r/265795 (https://phabricator.wikimedia.org/T122620) (owner: 10EBernhardson) [15:32:00] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Create new puppet group `analytics-search-users` - https://phabricator.wikimedia.org/T122620#1969857 (10Ottomata) I got it… [15:32:16] (03CR) 10Ottomata: [C: 032 V: 032] Create new puppet group analytics-search-users [puppet] - 10https://gerrit.wikimedia.org/r/265795 (https://phabricator.wikimedia.org/T122620) (owner: 10EBernhardson) [15:33:33] 6operations, 10ContentTranslation-cxserver, 6Services, 10Traffic, 5Patch-For-Review: Remove cxserver from parsoidcache cluster - https://phabricator.wikimedia.org/T110478#1969870 (10BBlack) @akosiaris - we're talking about two different parts of the problem. 
Regardless of whether/how cxserver's app code... [15:33:39] (03PS1) 10Subramanya Sastry: ruthenium services: Use puppetized db credentials for testreduce dbs [puppet] - 10https://gerrit.wikimedia.org/r/266753 (https://phabricator.wikimedia.org/T124704) [15:37:37] 6operations, 10ContentTranslation-cxserver, 6Services, 10Traffic, 5Patch-For-Review: Remove cxserver from parsoidcache cluster - https://phabricator.wikimedia.org/T110478#1969876 (10akosiaris) >>! In T110478#1969870, @BBlack wrote: > @akosiaris - we're talking about two different parts of the problem. R... [15:39:43] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures [15:39:52] ottomata: Looks like you were involved with the kernel upgrades mentioned in SAL, were they super urgent or something? [15:40:21] (03CR) 10Yurik: [C: 031] "seems like everything points to the restbase's url" [puppet] - 10https://gerrit.wikimedia.org/r/266732 (https://phabricator.wikimedia.org/T110477) (owner: 10BBlack) [15:40:41] (03PS1) 10Ottomata: Include role::elasticsearch::analytics on Hadoop namenodes and stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/266754 (https://phabricator.wikimedia.org/T122620) [15:40:46] MarkTraceur: moritzm knows more [15:40:51] Hmm [15:41:05] I think stat1003 was included in the reboots but I'm not 100% certain [15:41:11] Maybe it was an unrelated downtime [15:41:46] MarkTraceur: that sounds right but I'm not sure [15:41:53] Ah well [15:41:57] elukey: ? 
[15:42:06] MarkTraceur, ottomata: yeah, stat1002/stat1003 needed reboots for a kernel security update [15:42:10] It killed a script I was running, just wondered if I missed coordination of that [15:42:20] Oh, okay, if it was an urgent security thing then fine :) [15:42:27] I sent a heads-up mail to the analytics list yesterday [15:42:34] Oh, yeah, so I just fail [15:43:06] (03CR) 10Ottomata: [C: 032] Include role::elasticsearch::analytics on Hadoop namenodes and stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/266754 (https://phabricator.wikimedia.org/T122620) (owner: 10Ottomata) [15:43:23] (03CR) 10Yurik: [C: 031] "haven't tested, but looks ok. If this is how extensions should be loaded now, i'm fine with it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266433 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [15:43:40] \o/ [15:43:59] Going through email now, I'm going to identify and fix the failings in my communications [15:45:30] hmmmm _joe_, admin::groups are not collected from multiple roles? [15:46:09] ottomata: _joe_ is traveling ATM [15:46:35] ah k [15:47:41] !log rebooting analytics 1026, 1040 -> 1042 due to kernel upgrade. [15:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:48:34] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: puppet fail [15:49:03] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: puppet fail [15:49:49] ^ that's me [15:49:51] am working on it [15:51:45] (03PS2) 10BBlack: X-Cache: add "int" status for internal responses [puppet] - 10https://gerrit.wikimedia.org/r/266723 [15:53:18] ottomata: re ^, do you know where analytics webrequest parsing code lives? I've lost track, but I wanted to double-check that recent changes in X-Cache format don't break their parsing of it for cache_status (and for that matter, I suspect cache_status doesn't report what we really want it to report anyways right now...) [15:53:56] yes think so...
[15:54:37] bblack, it looks like no special parsing is done of cache_status [15:54:46] or x_cache [15:54:53] both are included in the refined webrequest table [15:54:59] directly as they are from varnish [15:55:33] https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/refine/refine_webrequest.hql#L64 [15:55:43] oh %{Varnish:handling@cache_status}x [15:55:53] ja [15:56:01] ok [15:56:10] whatever varnishkafka is configured to send is what makes it into those fields [15:56:17] probably not a very relevant field, as it's only reporting a naive interpretation of the frontend cache disposition [15:56:28] aye, huh ok [15:56:30] i.e. cache_status may come up "miss", but it is in fact a cache hit at a deeper layer, etc... [15:56:44] i can't think of any analysis that is using either of those atm. i think ops folks have looked at them before [15:56:49] aye, makes sense [15:56:55] x_cache has the results all the way down? [15:57:08] yes, although interpreting them is non-trivial [15:57:11] aye [15:57:28] feel free to do like you did with client_ip in varnish fanciness if you like [15:57:33] to make it all canonical and stuff :) [15:58:02] yeah I was thinking about (a) leaving X-Cache basically as it is for debugging and analysis we sometimes do on deeper cache-layers-internal stuff [15:58:49] and then also summarizing in a new output header that just applies one of a few overall labels for "all of the cache layers as a black box". probably just "hit|int|miss|pass" [15:59:44] (where hit = real cache object hit, int = internally-generated dynamically by varnish caches, miss|pass -> usual meanings which always result in applayer fetch) [16:00:05] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160127T1600). [16:00:05] tgr: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. 
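bblack's proposed summary header (collapsing the multi-layer X-Cache value into a single overall "hit|int|miss|pass" label) is only described in prose above; a minimal Python sketch of that kind of collapse, where the header format and the precedence rule (hit beats int beats pass beats miss) are illustrative assumptions, not the deployed VCL:

```python
def summarize_x_cache(x_cache: str) -> str:
    """Collapse a multi-layer X-Cache header, e.g. "cp1066 miss, cp3040 hit/4",
    into one label treating all cache layers as a black box.

    Assumed precedence for illustration: any real cache hit wins, then an
    internally generated ("int") response, then pass; otherwise miss.
    """
    statuses = x_cache.lower()
    if "hit" in statuses:
        return "hit"
    if "int" in statuses:
        return "int"
    if "pass" in statuses:
        return "pass"
    return "miss"
```

With this rule, a frontend "miss" in front of a deeper-layer "hit/4" is still reported as a hit overall, which is exactly the mislabeling of the naive frontend-only cache_status field discussed above.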
Please be available during the process. [16:00:18] o/ [16:00:27] going to lurk this SWAT [16:00:47] tgr: I can SWAT if you're around [16:01:01] thcipriani|afk: here [16:01:31] (03PS1) 10Jcrespo: Add testreduce account and grants from ruthenium on m5-master Bug: T124704 [puppet] - 10https://gerrit.wikimedia.org/r/266756 (https://phabricator.wikimedia.org/T124704) [16:03:12] !log rebooting analytics 1043 -> 1050 for kernel upgrade. [16:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:05:04] 6operations, 10ContentTranslation-cxserver, 6Services, 10Traffic, 5Patch-For-Review: Remove cxserver from parsoidcache cluster - https://phabricator.wikimedia.org/T110478#1969978 (10GWicke) The last time we talked about moving the CXServer API to RB the issue was that some of those APIs are really not re... [16:05:28] (03PS1) 10Ottomata: Fix for analytics-search-user changes [puppet] - 10https://gerrit.wikimedia.org/r/266757 (https://phabricator.wikimedia.org/T122620) [16:06:29] (03CR) 10jenkins-bot: [V: 04-1] Fix for analytics-search-user changes [puppet] - 10https://gerrit.wikimedia.org/r/266757 (https://phabricator.wikimedia.org/T122620) (owner: 10Ottomata) [16:07:25] (03PS2) 10Ottomata: Fix for analytics-search-user changes [puppet] - 10https://gerrit.wikimedia.org/r/266757 (https://phabricator.wikimedia.org/T122620) [16:08:19] 6operations, 10ContentTranslation-cxserver, 6Services, 10Traffic, 5Patch-For-Review: Remove cxserver from parsoidcache cluster - https://phabricator.wikimedia.org/T110478#1969991 (10BBlack) @gwicke no need for the stopgap, we'll just keep doing traffic pass-through of cxserver.wikimedia.org for now (but... [16:09:31] 6operations, 10ContentTranslation-cxserver, 6Services, 10Traffic, 5Patch-For-Review: Remove cxserver from parsoidcache cluster - https://phabricator.wikimedia.org/T110478#1969995 (10GWicke) @bblack: Okay, thanks! 
[16:09:54] (03CR) 10Ottomata: [C: 032] Fix for analytics-search-user changes [puppet] - 10https://gerrit.wikimedia.org/r/266757 (https://phabricator.wikimedia.org/T122620) (owner: 10Ottomata) [16:11:21] !log thcipriani@mira Synchronized php-1.27.0-wmf.11/extensions/CentralAuth/includes/session/CentralAuthSessionProvider.php: SWAT: Avoid forceHTTPS cookie flapping if core and CA are setting the same cookie [[gerrit:266671]] (duration: 02m 26s) [16:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:24] (03PS3) 10BBlack: X-Cache: add "int" status for internal responses [puppet] - 10https://gerrit.wikimedia.org/r/266723 [16:11:24] ^ tgr check please [16:12:08] (03PS4) 10BBlack: X-Cache: add "int" status for internal responses [puppet] - 10https://gerrit.wikimedia.org/r/266723 [16:12:24] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [16:12:26] (03CR) 10BBlack: [C: 032 V: 032] X-Cache: add "int" status for internal responses [puppet] - 10https://gerrit.wikimedia.org/r/266723 (owner: 10BBlack) [16:12:54] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:13:13] thcipriani: verified, thanks! [16:13:24] tgr: thanks for checking [16:15:20] <_joe_> ottomata: no, hiera data can either be defined in one role only, or be exactly equal across roles [16:15:45] <_joe_> Or, you use a container role [16:16:33] (03PS3) 10BBlack: Undo special-casing of testwiki in Varnish [puppet] - 10https://gerrit.wikimedia.org/r/266414 (owner: 10Ori.livneh) [16:16:53] <_joe_> Anyways, @airport, on mobile. Read the docs and the code :-P [16:17:51] ha, ok, container role? 
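_joe_'s rule above (hiera data may be defined in one role only, or must be exactly equal across roles) can be sketched as a merge constraint. This is a hypothetical Python illustration of the rule, not how puppet/hiera actually implements it:

```python
def resolve_hiera_key(roles: dict, key: str):
    """Resolve `key` across several role data hashes under the stated
    constraint: the key is defined in one role only, or every role that
    defines it carries exactly the same value. Conflicts are an error."""
    values = [cfg[key] for cfg in roles.values() if key in cfg]
    if not values:
        raise KeyError(key)
    first = values[0]
    if any(v != first for v in values[1:]):
        raise ValueError(f"conflicting hiera values for {key!r} across roles")
    return first
```

A "container role" sidesteps the conflict by defining the key once in a single role that includes the others.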
[16:17:58] (03CR) 10BBlack: [C: 032] Undo special-casing of testwiki in Varnish [puppet] - 10https://gerrit.wikimedia.org/r/266414 (owner: 10Ori.livneh) [16:19:51] (03PS2) 10Jcrespo: Add testreduce account and grants from ruthenium on m5-master [puppet] - 10https://gerrit.wikimedia.org/r/266756 (https://phabricator.wikimedia.org/T124704) [16:19:51] is gerrit review a bit broken? [16:20:02] I keep getting "line 1:66 no viable alternative at character '%'" [16:21:29] (03PS2) 10BBlack: Text VCL: add support for legacy rb passes [puppet] - 10https://gerrit.wikimedia.org/r/266747 (https://phabricator.wikimedia.org/T110475) [16:21:31] (03PS4) 10BBlack: graphoid(.eqiad).wm.o VCL removal [puppet] - 10https://gerrit.wikimedia.org/r/266732 (https://phabricator.wikimedia.org/T110477) [16:21:33] (03PS2) 10BBlack: cache_parsoid: remove restbase pass-through [puppet] - 10https://gerrit.wikimedia.org/r/266748 (https://phabricator.wikimedia.org/T110475) [16:21:35] (03PS2) 10BBlack: cache_parsoid: remove citoid+cxserver pass-through [puppet] - 10https://gerrit.wikimedia.org/r/266741 (https://phabricator.wikimedia.org/T110476) [16:21:37] (03PS2) 10BBlack: Text VCL: Add support for citoid+cxserver passes [puppet] - 10https://gerrit.wikimedia.org/r/266740 (https://phabricator.wikimedia.org/T110476) [16:21:39] (03PS3) 10Jcrespo: Add testreduce account and grants from ruthenium on m5-master [puppet] - 10https://gerrit.wikimedia.org/r/266756 (https://phabricator.wikimedia.org/T124704) [16:21:43] RECOVERY - NTP on seaborgium is OK: NTP OK: Offset -0.0001972913742 secs [16:21:49] (03CR) 10GWicke: [C: 031] Text VCL: add support for legacy rb passes [puppet] - 10https://gerrit.wikimedia.org/r/266747 (https://phabricator.wikimedia.org/T110475) (owner: 10BBlack) [16:22:16] !log thcipriani@mira Synchronized php-1.27.0-wmf.11/extensions/CentralAuth/includes/CentralAuthUtils.php: SWAT: Preserve certain keys when updating central session [[gerrit:266672]] (duration: 02m 28s) [16:22:18] ^ tgr check 
please [16:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:28] (03CR) 10GWicke: [C: 031] cache_parsoid: remove citoid+cxserver pass-through [puppet] - 10https://gerrit.wikimedia.org/r/266741 (https://phabricator.wikimedia.org/T110476) (owner: 10BBlack) [16:23:59] (03CR) 10Subramanya Sastry: [C: 031] Add testreduce account and grants from ruthenium on m5-master [puppet] - 10https://gerrit.wikimedia.org/r/266756 (https://phabricator.wikimedia.org/T124704) (owner: 10Jcrespo) [16:24:55] (03CR) 10GWicke: [C: 031] Text VCL: Add support for citoid+cxserver passes [puppet] - 10https://gerrit.wikimedia.org/r/266740 (https://phabricator.wikimedia.org/T110476) (owner: 10BBlack) [16:25:54] thcipriani: also verified, thanks again! [16:26:02] tgr: thank you! [16:26:23] PROBLEM - puppet last run on analytics1026 is CRITICAL: CRITICAL: Puppet has 1 failures [16:29:18] 6operations: move releases.wm.org to bromine (was: request VM for releases.wm.org) - https://phabricator.wikimedia.org/T124261#1970042 (10Niedzielski) {icon thumbs-up} @Dzahn, thanks for the heads up and quick summary. bromine works fine for me. [16:37:40] (03PS4) 10Jcrespo: Add testreduce account and grants from ruthenium on m5-master [puppet] - 10https://gerrit.wikimedia.org/r/266756 (https://phabricator.wikimedia.org/T124704) [16:39:24] (03CR) 10Jcrespo: [C: 032] Add testreduce account and grants from ruthenium on m5-master [puppet] - 10https://gerrit.wikimedia.org/r/266756 (https://phabricator.wikimedia.org/T124704) (owner: 10Jcrespo) [16:41:36] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Create new puppet group `analytics-search-users` - https://phabricator.wikimedia.org/T122620#1970065 (10RobH) a:5RobH>3Ottomata It appears that @ottomatta merged all the patches (which is one step better than just reviewing my review). It appears th... 
[16:45:24] ostriches, https://gerrit.wikimedia.org/r/#/q/project:operations/mediawiki-config+status:open+-label:Code-Review%253C%253D-1+-label:Verified-1,n,z is broken :( [16:45:55] broken? [16:46:05] Krenair: double encode -- https://gerrit.wikimedia.org/r/#/q/project:operations/mediawiki-config+status:open+-label:Code-Review%3C%3D-1+-label:Verified-1,n,z [16:46:16] indeed [16:46:21] but it's a URL that gerrit actually generates [16:46:37] I'm not sure exactly when or why the JS in gerrit started doing that but it is really annoying [16:46:46] when you put a '=' in the query [16:51:27] bd808: Must've been that secret upgrade and change to all the apache config I did a few months ago when you started complaining :P [16:51:37] jynus, should I hardcode 'testreduce' as the db user in my patches or is the $db_user variable set to 'testreduce'? [16:51:39] muahaha [16:52:14] ostriches: I just saw you holding a cat and touching your pinkie to your lips [16:52:31] well, a variable is better - I just like the username in the public puppet repo, only the password in the non-public [16:52:41] bd808: s/cat/puppy/ [16:52:44] subbu, ^if that makes sense to you [16:53:20] (as the username is already public in the puppet configuration for the mysql server) [16:54:36] jynus, yes .. once i upload newer version of the patches, could you leave your review comments on the patches in case they need further changes?
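The broken Gerrit link above is a classic double-encoding bug: the `%` of the already-encoded `%3C%3D` (`<=`) gets percent-encoded again into `%253C%253D`. A quick illustration, with Python's `quote` standing in for whatever Gerrit's JS does:

```python
from urllib.parse import quote

frag = "<="          # the operator in -label:Code-Review<=-1
once = quote(frag)   # correctly encoded: '%3C%3D'
twice = quote(once)  # encoded again: '%' -> '%25', giving '%253C%253D'
print(once, twice)
```

Encoding is not idempotent, so a URL must be encoded exactly once; re-encoding an already-encoded query string always mangles `%` this way.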
[16:54:56] yes, I will [16:55:17] let me also confirm access from ruthenium [17:00:25] subbu, I can confirm the right access from ruthenium [17:01:26] (03CR) 10Alex Monk: [C: 04-1] "It's still called wmgBetaFeaturesWhitelist in InitialiseSettings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266470 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [17:01:34] (03PS2) 10Subramanya Sastry: ruthenium services: Use puppetized db credentials for testreduce dbs [puppet] - 10https://gerrit.wikimedia.org/r/266753 (https://phabricator.wikimedia.org/T124704) [17:01:38] (03CR) 10Chad: "I don't think we're going to install any FreeBSD apaches...like ever :p" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265619 (owner: 10Dereckson) [17:02:16] (03CR) 10Alex Monk: [C: 04-1] "non-merged MW core dependency" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266454 (https://phabricator.wikimedia.org/T85538) (owner: 10Cenarium) [17:02:30] jynus, ok .. i updated my patches if you want to take a look. [17:03:12] error: 'files/misc/ubuntu-cloud.key': short read Success [17:03:16] error Success? [17:04:09] subbu, see comment on https://gerrit.wikimedia.org/r/#/c/266752/2 [17:04:16] Krenair, where is that? [17:05:03] from git-grep [17:05:14] while looking through the puppet repo [17:06:19] (03CR) 10Alex Monk: "Where is this actually used? I see where 404.php is used, but not 404.html."
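The puzzling "short read Success" message above is a well-known errno pitfall: a short read is a logical failure but sets no OS error, so code that formats errno on that path looks up errno 0 and prints "Success". The same lookup can be demonstrated from Python (the exact errno-0 string is platform dependent, so "Success" here is the common Linux wording, not a guarantee):

```python
import os

# strerror(0) is what perror()/strerror() would print after a short
# read: the operation "failed" logically, but no OS error was recorded,
# so errno is still 0.
msg = os.strerror(0)
print(msg)  # typically "Success" on Linux
```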
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/264461 (owner: 10Suriyaa Kudo) [17:09:24] (03PS3) 10Subramanya Sastry: ruthenium services: Use puppetized db credentials for testreduce dbs [puppet] - 10https://gerrit.wikimedia.org/r/266753 (https://phabricator.wikimedia.org/T124704) [17:13:13] (03CR) 10Alex Monk: [C: 04-1] "I don't think those constants you uncommented will be defined when InitialiseSettings gets run" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264219 (owner: 10Cenarium) [17:13:40] jynus, https://gerrit.wikimedia.org/r/#/c/266753/ is the public puppet part of it. [17:13:59] will submit this, which will allow testing the other [17:14:11] ah, ok. [17:14:25] I do not know if I mentioned this already, the private part was already done [17:14:55] that is why I wanted it without the user, as it had been committed with it [17:16:11] !log rebooting analytics1035.eqiad.wmnet for kernel upgrade [17:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:16:18] Anyone have the ssh fingerprints for deployment.eqiad.wmnet? [17:18:41] jynus, ah, that makes sense now .. i am slowly comprehending all the pieces. [17:22:51] is the change something that you can test immediately? [17:23:13] Assume you have someone with sudo helping you [17:23:23] yes. [17:23:42] (03CR) 10Alex Monk: [C: 031] Rename two namespaces at bswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247093 (https://phabricator.wikimedia.org/T115812) (owner: 10Luke081515) [17:23:45] the parsoid-rt and parsoid-vd services would come up. [17:23:57] and i should be able to open http://parsoid-tests.wikimedia.org/ [17:25:03] running puppet-compiler [17:26:03] (03CR) 10BBlack: [C: 032] "Monitored traffic for a while just-in-case, only seeing random (and very rare) crawler hits.
This is easily reverted with a 10 minute neg" [dns] - 10https://gerrit.wikimedia.org/r/266731 (https://phabricator.wikimedia.org/T110477) (owner: 10BBlack) [17:26:16] (03CR) 10Alex Monk: [C: 031] Add namespace aliases for English Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264066 (https://phabricator.wikimedia.org/T123187) (owner: 10Pmlineditor) [17:26:42] jynus, oh .. i think i forgot to set the hostname in the config since the dbs are no longer on ruthenium. [17:26:58] what is the hostname i should use? [17:27:45] (03CR) 10Jcrespo: [C: 04-1] "Role mariadb?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266753 (https://phabricator.wikimedia.org/T124704) (owner: 10Subramanya Sastry) [17:27:51] m5-master.eqiad.wmnet. got it. [17:27:54] !log rebooting analytics105* hosts to upgrade their kernel [17:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:28:13] that is right [17:28:15] (03CR) 10Alex Monk: [C: 031] Add Category to $wgNamespacesWithSubpages on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264460 (https://phabricator.wikimedia.org/T121985) (owner: 10Pmlineditor) [17:28:18] also role mariadb? [17:28:39] so, replace mariadb with mysql-client? [17:28:55] i don't know what you meant by auto-install of mysql-client [17:28:58] (03CR) 10Alex Monk: [C: 04-1] "Seems to be some confusion on the task about this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264937 (https://phabricator.wikimedia.org/T123837) (owner: 10Mdann52) [17:29:08] for now, you can just delete that role [17:29:32] we can later assess the need for a command line client, etc [17:29:51] let's make the patch as small as possible to make the service work [17:29:55] ok. [17:31:02] (03CR) 10Alexandros
[puppet] - 10https://gerrit.wikimedia.org/r/266668 (owner: 10KartikMistry) [17:31:11] (as that would probably require additional operation-access-requests) [17:31:13] (03PS4) 10Subramanya Sastry: ruthenium services: Use puppetized db credentials for testreduce dbs [puppet] - 10https://gerrit.wikimedia.org/r/266753 (https://phabricator.wikimedia.org/T124704) [17:31:24] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Create new puppet group `analytics-search-users` - https://phabricator.wikimedia.org/T122620#1970326 (10Ottomata) 5Open>3Resolved Thanks! I think we're done, I had to move things around a little bit. [17:31:40] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Create new puppet group `analytics-search-users` - https://phabricator.wikimedia.org/T122620#1970331 (10Ottomata) Sorry, I thought I replied here that I was taking this. [17:34:11] greg-g, i'm about to deploy new graphoid service - seems like no one is deploying at the moment [17:34:34] (03CR) 10Alex Monk: [C: 031] Namespace configuration on cu.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265885 (https://phabricator.wikimedia.org/T123654) (owner: 10Dereckson) [17:34:36] marxarelli, you haven't started the train yet, right? [17:34:46] not that it should be affected - it's a service [17:36:07] hmm, actually never mind, it seems mira needs to be set up for graphoid deployment first... checking [17:36:10] yurik: two questions: 1) why doesn't it go in the service deploy window at 21:00 UTC? or 2) Why don't you request a window for this so you don't have to kinda-ask-but-not-really-because-you're-just-going-to-do-it-anyway-even-if-greg-is-sick-and-not-watching-irc? [17:36:29] * greg-g is sick and barely watching irc [17:36:40] I do not think it is getting the password right [17:36:45] ^subbu [17:37:12] lol, sorry greg-g - that's how i have been deploying it before and i thought it was ok for a service.
I didn't know we had a service deploy window [17:37:42] it shows empty on puppet compiler but should show the fake one [17:37:53] now that it is getting to be more of a real thing, it needs to follow the process more rigidly, yurik [17:38:00] i guess i am not including it properly or referencing the password variable properly. [17:38:09] let me check how it is used in other files .. unless you know what the problem is. [17:38:24] let me rebuild it again, to be sure [17:38:43] I think you have to reference the full namespace, but I may be wrong [17:39:08] looks like it has to be referenced as $passwords::testreduce::mysql::user [17:39:19] subbu, see https://puppet-compiler.wmflabs.org/1663/ruthenium.eqiad.wmnet/ [17:39:41] greg-g, sure thing. That window doesn't include graphoid though. Plus it's at midnight-1am, so a bit inconvenient. I will add a window to the deployment schedule if that's ok with you? [17:39:41] I expect nosecret there [17:40:21] db_user [17:40:21] (03PS5) 10Subramanya Sastry: ruthenium services: Use puppetized db credentials for testreduce dbs [puppet] - 10https://gerrit.wikimedia.org/r/266753 (https://phabricator.wikimedia.org/T124704) [17:40:50] jynus, i updated the patch .. can you check if that does better. [17:40:54] db_pass, yes, you got it right [17:41:00] let me recheck [17:41:29] (03PS1) 10Chad: Also keep /srv/patches in sync between masters [puppet] - 10https://gerrit.wikimedia.org/r/266773 [17:43:04] subbu, that's better :-) , https://puppet-compiler.wmflabs.org/1664/ruthenium.eqiad.wmnet/ [17:43:31] yurik: propose another window time [17:43:35] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] [17:43:37] \o/ [17:44:13] greg-g, another stable window?
[17:44:16] (03CR) 10Jcrespo: [C: 031] ruthenium services: Use puppetized db credentials for testreduce dbs [puppet] - 10https://gerrit.wikimedia.org/r/266753 (https://phabricator.wikimedia.org/T124704) (owner: 10Subramanya Sastry) [17:44:36] deploy and test, subbu ? [17:44:40] because graphoid is usually one-off, and i don't want to move the window for all services if that's convenient for everyone else? [17:44:44] greg-g, % [17:44:54] that was a ^, not % [17:45:02] jynus, works for me. [17:45:17] (03CR) 10Jcrespo: [C: 032] ruthenium services: Use puppetized db credentials for testreduce dbs [puppet] - 10https://gerrit.wikimedia.org/r/266753 (https://phabricator.wikimedia.org/T124704) (owner: 10Subramanya Sastry) [17:45:29] yurik: what I mean is pick a time for graphoid that works for you/your team, I'll see if it makes sense on the calendar [17:46:17] greg-g, right, but as a permanent fixture? If possible, it would be great to simply request a window when nothing else is being deployed [17:46:29] deploying now [17:46:59] !log migrating ruthenium parsoid-test database to m5-master [17:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:47:18] * subbu logs onto ruthenium as well [17:47:25] parsoid-vd has refreshed automatically, is that enough? [17:47:40] yurik: ocg is in that situation. i still find it useful to have a regularly scheduled time (coscheduled with the parsoid deploy) even if I don't do a deploy most weeks. [17:48:22] yurik: you could join the *oids in the parsoid deploy window. ;) [17:48:35] and /etc/testreduce/parsoid-vd.settings.js updated [17:48:36] jynus, yay ... http://parsoid-tests.wikimedia.org/ is now live :) [17:48:38] what cscott said, yurik [17:49:11] and http://parsoid-tests.wikimedia.org/commits looks right.
[17:49:11] yurik: I don't like the one-off requests, if you have a window you have a window and all is good [17:49:11] this has not finished, let me check load [17:49:11] cscott, i was hoping to have an earlier window because it's running a bit late for UTC+3 greg-g [17:49:11] jynus, thanks .. at least the db part of it seems good. [17:49:11] yurik: exactly, so propose one, as I said a while ago [17:49:20] (03PS1) 10ArielGlenn: dumps: stash some current dump run config settings in file and reuse [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/266775 [17:49:27] greg-g, yes yes, i'm writing one now :) [17:49:27] yurik: again, I'm not OK with continued "just ping greg 5 minutes before I want to do something" [17:49:36] understood :) [17:49:43] cool :) [17:49:48] jynus, and http://parsoid-tests.wikimedia.org/vd_testreduce/commits is also up. [17:50:00] so, both testreduce services are operational and are connecting with the right m5-master dbs. [17:50:05] subbu, remember that you are now on a misc production server [17:50:08] greg-g, i will schedule something every day, but will skip it most of the time :P [17:50:16] !log deploy patch for T97157 [17:50:18] that has advantages (more resources) [17:50:19] yurik: sure, i understand. even the parsoid deploy window is a bit late for UTC-5, since it sometimes gets uncomfortably close to when i have to leave to pick up my kids. [17:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:50:25] managed server if it fails [17:50:31] yurik: no, not every day [17:50:32] jynus, you mean wrt m5-master? [17:50:34] but also responsibilities [17:50:44] PROBLEM - puppet last run on mendelevium is CRITICAL: CRITICAL: puppet fail [17:50:48] do not bring it down, ok ;-) [17:50:52] cscott, how about two hours before the -oids? you and i can join :) [17:51:07] jynus, do you mean wrt. ruthenium or the database .. m5-master?
greg-g will be happy, and I will only schedule it mon-thursday :) [17:51:20] m5-master [17:51:49] i see .. ah, ok. i guess we need to tune our queries then. [17:52:04] yurik: i suspect greg-g will say that plan has conflicts on t/th but is fine on m/w. ;) [17:52:19] yurik: no [17:52:39] 2hrs before parsoid would be 19:00-20:00 UTC M/W [17:52:46] (03CR) 10Alexandros Kosiaris: [C: 04-1] "looks good, minor nitpick" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266773 (owner: 10Chad) [17:53:02] yurik: two days/week, go on Tues/Thu, pick a time [17:53:55] greg-g, tues thu is good - this way i will either use that window, or join other oids at a later time. cscott - what TZ are you in? [17:54:17] later time on mon-wed [17:54:21] EST, currently UTC-5. [17:54:24] jynus, occasionally the queries that populate parsoid-tests.wikimedia.org tend to be expensive ... so, i'll work to fix those queries soon. [17:54:46] or, i should say, EST is always UTC-5, but i'm currently in EST sometimes in EDT. ;) [17:54:50] look, databases are there to be used [17:55:48] just make sure you do not create 1000 connections and use all io available, and I will be happy [17:55:49] ah, ok. that is not a problem. :) [17:55:49] 6operations, 6Parsing-Team, 10hardware-requests: Dedicated server for running Parsoid's roundtrip tests to get reliable parse latencies and use as perf. benchmarking tests - https://phabricator.wikimedia.org/T116090#1970414 (10ssastry) [17:56:12] thanks again. [17:56:16] cscott, 18-19 UTC, which is 1pm-2pm EST i think [17:56:21] greg-g, ^ ? [17:56:27] on tue thu [17:56:40] between puppet swat & mw train [17:58:32] (03CR) 10Thcipriani: "Inline question." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266773 (owner: 10Chad) [18:00:20] !log deploy patch for T103239 [18:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:01:10] (03CR) 10Daniel Kinzler: [C: 031] "It's what we want, and it works for me."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/263046 (https://phabricator.wikimedia.org/T123112) (owner: 10Thiemo Mättig (WMDE)) [18:01:42] (03CR) 10Jhernandez: [C: 031] Add sampling rates for mobile web language switcher [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [18:06:43] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api (bad URL) is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [18:06:43] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api (bad URL) is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [18:07:34] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:08:05] greg-g, cscott, gwicke - i added another earlier service deployment window at 09:00 PST on TUE and THU - this should make it easier for European and East Coast based services to be deployed :) [18:08:14] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [18:08:20] and i didn't do this ^ [18:10:04] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [18:11:03] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [18:11:16] (03CR) 10Mobrovac: [C: 04-1] cxserver, citoid -> cache_text cluster (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/266739 (https://phabricator.wikimedia.org/T110476) (owner: 10BBlack) [18:13:07] (03CR) 10Mobrovac: [C: 031] Text VCL: Add support for citoid+cxserver passes [puppet] - 10https://gerrit.wikimedia.org/r/266740 (https://phabricator.wikimedia.org/T110476) (owner: 10BBlack) [18:13:23] (03CR) 10Chad: Also keep /srv/patches in sync between masters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266773 (owner: 10Chad) [18:13:26] (03CR) 10Alexandros 
Kosiaris: Also keep /srv/patches in sync between masters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266773 (owner: 10Chad) [18:13:28] (03PS6) 10Jean-Frédéric: Add 2 sites to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/262893 (https://phabricator.wikimedia.org/T122995) (owner: 10Mdann52) [18:13:57] (03CR) 10Jean-Frédéric: "Rebased against master." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/262893 (https://phabricator.wikimedia.org/T122995) (owner: 10Mdann52) [18:14:03] (03CR) 10BBlack: cxserver, citoid -> cache_text cluster (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/266739 (https://phabricator.wikimedia.org/T110476) (owner: 10BBlack) [18:14:19] (03PS2) 10BBlack: restbase legacy hostnames -> cache_text cluster [dns] - 10https://gerrit.wikimedia.org/r/266749 (https://phabricator.wikimedia.org/T110475) [18:14:21] (03PS2) 10BBlack: cxserver, citoid -> cache_text cluster [dns] - 10https://gerrit.wikimedia.org/r/266739 (https://phabricator.wikimedia.org/T110476) [18:15:33] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [18:16:49] (03CR) 10Mobrovac: [C: 031] cache_parsoid: remove citoid+cxserver pass-through [puppet] - 10https://gerrit.wikimedia.org/r/266741 (https://phabricator.wikimedia.org/T110476) (owner: 10BBlack) [18:17:14] RECOVERY - puppet last run on mendelevium is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [18:18:58] jynus or any other root, can you run netstat -ltpn on ruthenium (i don't have root to do that) to see what is running on port 58805 .. since we are getting some mysterious failures on some tests? [18:19:02] ex. http://parsoid-tests.wikimedia.org/resultFlagNew/b410e18e3e9b25ed487f92d24995502dc2782bc9/f1ddfb884e32715c8b16d5149ee9b5119fc7de74/enwiki/2015%20NASCAR%20Xfinity%20Series [18:20:32] nodejs [18:21:20] ah, no more info besides that? [18:21:35] testred+ 27506 0.0 0.3 964968 50304 ? 
Sl Jan26 0:10 /usr/bin/nodejs /usr/lib/parsoid/src/tests/../bin/server.js --num-workers 1 --config /usr/lib/parsoid/src/tests/testreduce/parsoid-rt-client.rttest.localsettings.js [18:22:07] thanks. [18:24:24] PROBLEM - puppet last run on mw2094 is CRITICAL: CRITICAL: Puppet has 1 failures [18:29:21] subbu: do you have a link to a test failure? [18:30:10] bblack, http://parsoid-tests.wikimedia.org/resultFlagNew/b410e18e3e9b25ed487f92d24995502dc2782bc9/f1ddfb884e32715c8b16d5149ee9b5119fc7de74/enwiki/2015%20NASCAR%20Xfinity%20Series .. looks like it is one of the parsoid workers that communicates with a test client. [18:30:44] with the ruthenium reimage .. we also got upgraded from node 0.10 to node 4.2 [18:30:44] this is tests btw, not production. [18:31:15] oh sorry didn't notice link above :) [18:31:21] the funny thing is that all the failures reported on http://parsoid-tests.wikimedia.org/regressions/between/b410e18e3e9b25ed487f92d24995502dc2782bc9/f1ddfb884e32715c8b16d5149ee9b5119fc7de74 .. (with a 1 in the error column) are from the same worker. [18:31:53] there are 8 separate test clients running and all the other 7 aren't reporting it .. [18:31:58] the test output says port 58580, you asked jynus 58805 [18:32:13] oh ... good catch. :) [18:32:43] 7Blocked-on-Operations, 6operations, 6Services, 7Graphite: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451#1970579 (10GWicke) [18:32:58] * subbu remembers not to trust his short term memory [18:33:31] nothing listening on 58580 at the moment [18:33:33] (03CR) 10Mobrovac: [C: 031] graphoid(.eqiad).wm.o VCL removal [puppet] - 10https://gerrit.wikimedia.org/r/266732 (https://phabricator.wikimedia.org/T110477) (owner: 10BBlack) [18:34:08] 6operations, 10Parsoid, 6Services, 10service-template-node, 7service-runner: Create a standard service template / init / logging / package setup - https://phabricator.wikimedia.org/T88585#1970580 (10GWicke) p:5High>3Normal [18:34:09] hmm .. interesting. 
[18:34:45] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1970581 (10GWicke) p:5High>3Normal [18:35:10] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1524920 (10GWicke) Lowered priority as the main multi-DC goal is reached, and the main remaining bit is adding encryption for c... [18:35:12] subbu: "nothing listening on that port" explains the ECONNREFUSED [18:36:27] yup .. the error message is not helpful .. i don't know if it is parsoid or if it is the test client code .. maybe i should add more error logs. [18:39:39] (03PS1) 10Jcrespo: Repool pc1006 after cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266787 (https://phabricator.wikimedia.org/T121888) [18:40:20] 6operations, 7Icinga: icinga contacts and permissions for ema and elukey - https://phabricator.wikimedia.org/T124941#1970606 (10Dzahn) [18:41:34] subbu: if you have the ability to re-test older revs, you could figure out whether it's a test setup problem or a real code regression [18:41:44] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 60.71% of data above the critical threshold [5000000.0] [18:42:39] bblack, i have that ability .. but the only change is that we upgrade from node 0.10 to node 4.2 .. so, i suspect it is exposing something. [18:42:55] ah [18:43:27] so, we'll have to figure this out before we consider upgrading production to node 4.2 :) [18:44:21] nodejs 4.2 changelog: added default feature to randomly refuse connections to reduce server load for the performance win! 
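The ECONNREFUSED diagnosis above (nothing listening on the port, so the kernel sends an immediate refusal) can be sketched with a small TCP probe. This is an illustrative sketch only, not the test client's actual code; the host and port below are placeholders. It also distinguishes the other failure mode that comes up later in this log, a silent timeout, which typically points at a firewall or router ACL dropping packets rather than a missing listener.

```python
import socket

def probe(host, port, timeout=2.0):
    """Return 'open', 'refused', or 'timeout' for a TCP connect attempt."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        # Immediate RST: host is up, but nothing listens on the port
        # (the ECONNREFUSED case discussed above).
        return "refused"
    except socket.timeout:
        # No reply at all: packets silently dropped, e.g. by an ACL.
        return "timeout"
    finally:
        s.close()

# Placeholder target; 58580 is the port from the failing test above.
print(probe("127.0.0.1", 58580))
```

A "refused" result here would match what the test client saw: the worker process simply wasn't listening.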
[18:44:47] 10Ops-Access-Requests, 6operations, 6Parsing-Team, 5Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1970623 (10ssastry) I've asked @trevorparscal to approve. But, one other sudo permission require... [18:45:52] bblack, :) [18:46:34] (03CR) 10Jcrespo: [C: 032] Repool pc1006 after cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266787 (https://phabricator.wikimedia.org/T121888) (owner: 10Jcrespo) [18:48:26] !log HHVM on mw1019 still dying on a regular basis with "Lost parent, LightProcess exiting" [18:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:49:20] bd808, is it time to make a ticket? [18:49:58] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Repool pc1006 after cloning (duration: 02m 25s) [18:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:50:39] Krenair: J.oe said mysteriously 2 days ago that he knew what the problem was and that it was a "red herring". Something about it having not been restarted in a year. Maybe that server is depooled and just puking due to health checks? 
[18:50:53] RECOVERY - puppet last run on mw2094 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [18:50:58] it is cluttering the fatalmonitor logs for sure [18:53:02] (03PS1) 10Subramanya Sastry: ruthenium: Rename parsoid systemd file + start up a single worker [puppet] - 10https://gerrit.wikimedia.org/r/266788 [18:55:44] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [18:57:43] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] [18:58:26] 6operations, 7Icinga: icinga contacts and permissions for ema and elukey - https://phabricator.wikimedia.org/T124941#1970654 (10Dzahn) In the private puppet repository, on palladium, in `/root/private/modules/secret/secrets/nagios/contacts.cfg` , i added: ``` define contact{ contact_name... [18:59:20] tgr, anomie: group0 seems in pretty good shape from what i can see. any concerns about group1 promotion today? [19:00:05] marxarelli: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160127T1900). Please do the needful. [19:01:14] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [19:01:18] (03CR) 10Mobrovac: [C: 031] ruthenium: Rename parsoid systemd file + start up a single worker [puppet] - 10https://gerrit.wikimedia.org/r/266788 (owner: 10Subramanya Sastry) [19:01:25] marxarelli: I have no concerns [19:01:27] marxarelli: haven't seen any new issues since wmf11->group0 [19:01:42] great [19:02:40] i do see loads of "parent, LightProcess exiting" on fluorine but jynus (or someone), this is a known issue, right? [19:03:17] Krenair: ^ ? [19:03:29] is it from mw1019 marxarelli? [19:03:36] no, it happening on mira is known [19:03:43] the other is 19 or something else [19:04:15] ah, yes.
it's just 19 [19:04:39] yes, known [19:04:44] 6operations, 7Icinga: icinga contacts and permissions for ema and elukey - https://phabricator.wikimedia.org/T124941#1970675 (10Dzahn) In the public repo in `nagios_common/files/contactgroups.cfg` there is a contact_group called "sms". This is the critical one for paging. The newly created contacts would be a... [19:05:00] Krenair: jynus: known and OK I presume? :) Also, is there a task for it? [19:05:08] 7Puppet, 6operations, 6Release-Engineering-Team, 6Services, 7service-runner: Create a standard puppet module for service-runner services - https://phabricator.wikimedia.org/T89901#1970677 (10GWicke) @mobrovac, should we resolve this task? [19:05:11] alright then, will promote group1 shortly [19:05:55] Krenair, not ok, but not causing issues [19:06:00] ^greg [19:06:02] greg-g, I asked the same thing earlier [19:06:08] well [19:06:09] sort of [19:06:13] 31<Krenair>30 bd808, is it time to make a ticket? [19:06:22] but I am talking about mira, not the other [19:06:32] 21 Krenair: J.oe said mysteriously 2 days ago that he knew what the problem was and that it was a "red herring". Something about it having not been restarted in a year. Maybe that server is depooled and just puking due to health checks? [19:06:46] that should be checked [19:06:48] let's get a task so we have more than irc logs [19:06:50] (03PS5) 10BBlack: graphoid(.eqiad).wm.o VCL removal [puppet] - 10https://gerrit.wikimedia.org/r/266732 (https://phabricator.wikimedia.org/T110477) [19:08:12] (03CR) 10BBlack: [C: 032] "Should be safe! 
If revert is necessary, also revert https://gerrit.wikimedia.org/r/#/c/266731/" [puppet] - 10https://gerrit.wikimedia.org/r/266732 (https://phabricator.wikimedia.org/T110477) (owner: 10BBlack) [19:08:24] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [19:08:25] one day i'll try to make a bot where you just say !IRC2PHAB 10 or something and it creates a task and copies the last couple lines over there [19:08:34] (03CR) 10BBlack: "If it looks necessary to revert this, also revert https://gerrit.wikimedia.org/r/#/c/266732/" [dns] - 10https://gerrit.wikimedia.org/r/266731 (https://phabricator.wikimedia.org/T110477) (owner: 10BBlack) [19:08:47] (03CR) 10Dereckson: "It's not really the point. The point is more to have a correct handling of fatal errors and die nicely instead of have a cascading of erro" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265619 (owner: 10Dereckson) [19:10:02] 6operations, 10Analytics, 10Analytics-Cluster, 6Analytics-Kanban, 10hardware-requests: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#1970744 (10Ottomata) 3NEW [19:10:11] 6operations, 10Analytics, 10Analytics-Cluster, 6Analytics-Kanban, 10hardware-requests: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#1970755 (10Ottomata) [19:11:06] robh, do you know when T124701 will be approved (i.e. when is the ops meeting)? 
[19:11:09] (03PS3) 10BBlack: Text VCL: Add support for citoid+cxserver passes [puppet] - 10https://gerrit.wikimedia.org/r/266740 (https://phabricator.wikimedia.org/T110476) [19:11:20] (03CR) 10BBlack: [C: 032] Text VCL: Add support for citoid+cxserver passes [puppet] - 10https://gerrit.wikimedia.org/r/266740 (https://phabricator.wikimedia.org/T110476) (owner: 10BBlack) [19:11:34] (03PS1) 10Dduvall: group1 wikis to 1.27.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266795 [19:11:39] (03CR) 10BBlack: [V: 032] Text VCL: Add support for citoid+cxserver passes [puppet] - 10https://gerrit.wikimedia.org/r/266740 (https://phabricator.wikimedia.org/T110476) (owner: 10BBlack) [19:11:43] subbu: yep, so we need to append in journalctl service access? [19:12:02] yep = yes the patch addition has to have ops meeting to approve everyone getthing the sudo rights [19:12:05] yes please so i can look at logs. [19:12:19] so you need to sudo as the user, not a service? [19:12:54] to look at logs use this: [19:12:56] 'ALL = NOPASSWD: /bin/journalctl *'] [19:13:01] 6operations, 6Analytics-Kanban, 10hardware-requests: 8 x 3 SSDs per for AQS nodes. - https://phabricator.wikimedia.org/T124947#1970805 (10Ottomata) 3NEW a:3JAllemandou [19:13:02] (03CR) 10Dduvall: [C: 032] group1 wikis to 1.27.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266795 (owner: 10Dduvall) [19:13:14] 6operations, 6Analytics-Kanban, 10hardware-requests: 8 x 3 SSDs for AQS nodes. - https://phabricator.wikimedia.org/T124947#1970814 (10Ottomata) [19:13:21] i don't understand the distinction .. but right now all the services are logging to 'journal' in the systemd files .. so i / parsoid-rt-admin members need to be able to view them. [19:13:25] (03Merged) 10jenkins-bot: group1 wikis to 1.27.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266795 (owner: 10Dduvall) [19:13:26] hurray for well-formatted json. 
so much easier to verify the wikiversions diff [19:13:27] 6operations, 6Analytics-Kanban, 10hardware-requests: 8 x 3 SSDs for AQS nodes. - https://phabricator.wikimedia.org/T124947#1970805 (10Ottomata) [19:13:57] subbu: cool, mutante gave the answer. So yea, the other rights I gave are for services, now you need to read that file as that user so it should be what mutante put [19:14:02] !log dduvall@mira rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.11 [19:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:14:13] subbu: robh: journalctl * is what we do, it will let users read all logs, trying to limit that per service doesnt really work [19:14:17] I'll append it into the patchset and on Monday we can get the meeting review to allow it [19:14:29] mutante: duly noted, thank you! [19:14:48] robh, monday. ok. [19:14:50] at least not with "journalctl -u service *" [19:14:59] (03PS2) 10Dereckson: Get rid of $wg = $wmg for BetaFeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266470 (https://phabricator.wikimedia.org/T119117) [19:16:30] (03CR) 10Dereckson: "PS2: addressed PS1 comment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266470 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [19:17:27] (03PS2) 10RobH: creation of parsoid-rt-admin group [puppet] - 10https://gerrit.wikimedia.org/r/266632 (https://phabricator.wikimedia.org/T124701) [19:17:28] 6operations, 7Icinga: icinga contacts and permissions for ema and elukey - https://phabricator.wikimedia.org/T124941#1970836 (10Dzahn) @elukey could you logout of Icinga, log back in with "elukey" (non-capitalized) and the normal LDAP/wikitech password, then execute a command, like send a "custom notification"... 
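Put together, the sudo rights discussed above (the service-management rights from the patch plus the `journalctl` line mutante quoted) would end up as sudoers-style entries roughly like the following. This is an illustrative sketch: the group name comes from the patch under review, but the exact command paths and how the puppet admin module renders them into sudoers are assumptions.

```
# Sketch of parsoid-rt-admin privileges (paths and layout are assumptions):
%parsoid-rt-admin ALL = NOPASSWD: /usr/sbin/service parsoid *
%parsoid-rt-admin ALL = NOPASSWD: /bin/journalctl *
```

As noted in the channel, the wildcard on `journalctl` grants read access to all units' logs; trying to restrict it per service (e.g. `journalctl -u service *`) doesn't hold up, which is why the blanket form is used.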
[19:18:35] (03CR) 10jenkins-bot: [V: 04-1] creation of parsoid-rt-admin group [puppet] - 10https://gerrit.wikimedia.org/r/266632 (https://phabricator.wikimedia.org/T124701) (owner: 10RobH) [19:18:44] (03PS3) 10RobH: creation of parsoid-rt-admin group [puppet] - 10https://gerrit.wikimedia.org/r/266632 (https://phabricator.wikimedia.org/T124701) [19:20:21] 10Ops-Access-Requests, 6operations, 6Parsing-Team, 5Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1970860 (10RobH) I've updated the patchset to include: 'ALL = NOPASSWD: /bin/journalctl *' which... [19:21:05] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 10hardware-requests: Hadoop Node expansion for end of FY - https://phabricator.wikimedia.org/T124951#1970876 (10Ottomata) 3NEW [19:22:36] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 10hardware-requests: Hadoop Node expansion for end of FY - https://phabricator.wikimedia.org/T124951#1970895 (10Ottomata) [19:22:56] 6operations, 6Analytics-Kanban, 10hardware-requests: 8 x 3 SSDs for AQS nodes. - https://phabricator.wikimedia.org/T124947#1970805 (10Ottomata) [19:22:58] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 10hardware-requests: Hadoop Node expansion for end of FY - https://phabricator.wikimedia.org/T124951#1970876 (10Ottomata) [19:23:01] 6operations, 10Analytics, 10Analytics-Cluster, 6Analytics-Kanban, 10hardware-requests: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#1970744 (10Ottomata) [19:25:36] (03CR) 10Subramanya Sastry: creation of parsoid-rt-admin group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266632 (https://phabricator.wikimedia.org/T124701) (owner: 10RobH) [19:27:37] marxarelli: while you watch the fatalmonitor, can you report the hhvm light process issue in phab, cc'ing jynu.s and _joe._ ? 
kthx (if it hasn't already been, I may have missed it) [19:27:38] 6operations, 7Icinga: icinga contacts and permissions for ema and elukey - https://phabricator.wikimedia.org/T124941#1970923 (10Dzahn) The meta check "Check correctness of the icinga configuration" ([[ https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=neon&service=Check+correctness+of+the+ici... [19:28:11] greg-g: sure thing [19:30:15] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [19:30:16] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [19:31:35] !log stat1002 - running puppet, was reported as last run about 4 hours ago but not deactivated [19:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:31:44] aaah: [19:31:49] redis.exceptions.ConnectionError: Error connecting to mira.codfw.wmnet:6379. timed out. [19:31:59] !log stat1002 - redis.exceptions.ConnectionError: Error connecting to mira.codfw.wmnet:6379. timed out. [19:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:33:29] 6operations, 5Patch-For-Review: rack/setup pc1004-1006 - https://phabricator.wikimedia.org/T121888#1970936 (10jcrespo) [19:34:23] 6operations, 5Patch-For-Review: rack/setup pc1004-1006 - https://phabricator.wikimedia.org/T121888#1970937 (10jcrespo) 5Open>3Resolved pc100[456] are in production and pc100[123] are depooled: https://grafana.wikimedia.org/dashboard/db/server-board?from=1453318327796&to=1453922887796&var-server=pc1*&var-n... 
[19:34:31] 6operations, 10Traffic: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#1970940 (10BBlack) 3NEW [19:34:54] PROBLEM - puppet last run on cp1060 is CRITICAL: CRITICAL: Puppet has 1 failures [19:35:20] 6operations, 10Analytics-Cluster: stat1002 - redis can't connect to mira.codfw.wmnet - https://phabricator.wikimedia.org/T124955#1970956 (10Dzahn) [19:35:53] ottomata: https://phabricator.wikimedia.org/T124955 [19:36:05] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: Puppet has 1 failures [19:36:31] mutante, do you know what is happening? [19:36:40] jynus: i am guessing this server used to connect to tin.eqiad.wmnet in the past [19:37:02] jynus: and now we switched deployment servers to mira, so it tries to use that.. but there are missing ACLs or firewall rules [19:37:09] yes, an error, but nothing ongoing, right? [19:37:14] letting a server from analytics connect to mira [19:37:27] i dont really know what is broken if the redis on stat1002 cant connect [19:37:30] some stats i assume [19:38:08] something for discovery analytics, maybe numbers are wrong, but nothing like downtime [19:38:09] 6operations, 5WMF-deploy-2016-01-19_(1.27.0-wmf.11): Rise in "parent, LightProcess exiting" fatals on mw1019 since 1.27.0-wmf.11 deploy - https://phabricator.wikimedia.org/T124956#1970973 (10dduvall) 3NEW [19:38:38] PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: Puppet has 1 failures [19:39:20] there are varnish puppet errors, is that you bblack ? [19:40:16] possibly! looking [19:41:17] sorry, I was seeing too many errors and got nervous [19:41:39] yeah it's me, somehow [19:41:42] not for this thing, in general [19:42:13] PROBLEM - puppet last run on cp1046 is CRITICAL: CRITICAL: Puppet has 1 failures [19:42:19] everyting seems fine [19:42:26] well yeah I meant the puppet fails on cp10xx are me. they're not causing problems. 
[19:42:34] ACKNOWLEDGEMENT - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn https://phabricator.wikimedia.org/T124955 [19:43:28] ah, so re: stat1002, what it's doing is trying to deploy this: [19:43:31] Error: /Stage[main]/Role::Elasticsearch::Analytics/Package[wikimedia/discovery/analytics]/ [19:43:40] but it cant deploy it, because cant talk to mira [19:43:45] it was the combination of cps and stat what got me nervous, ignore me [19:43:54] i dont think it's an issue besides "no new deploys" [19:44:05] yeah my brain started assuming cache_mobile was no longer relevant, but of course it still (barely) is :P [19:44:45] 7Blocked-on-Operations, 10Deployment-Systems, 10RESTBase, 6Services: RESTBase deployment process - https://phabricator.wikimedia.org/T103344#1971004 (10GWicke) [19:44:53] mutante: hmm, sorry i didn't realize we were in a no new deploys ATM [19:45:14] ebernhardson: no, i'm just saying it's broken [19:45:17] ebernhardson, he means that it is technically impossible now [19:45:19] :-) [19:45:23] oh :) [19:45:34] mutante: but yes that is us, and it was just deployed to puppet this morning [19:45:43] i think the issue is: analytics network needs to be allowed to talk to deployment server in codfw [19:45:59] i think it already can, because this is the same deployment method used by analytics for their refinery repository [19:46:06] but maybe only tin, and not mira? [19:46:07] ebernhardson: the problem is redis.exceptions.ConnectionError: Error connecting to mira.codfw.wmnet:6379. timed out. 
[19:46:13] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [24.0] [19:46:25] akosiaris mentioned something about not being a fan of cross-DC ACL's so it might make sense that only tin can talk to analytics [19:46:29] ebernhardson: yes, but tin is eqiad and mira is codfw, and i think there are only ACLs for eqiad [19:46:35] there are now some cross-datacenter issues [19:46:39] ebernhardson: yea [19:46:54] like the one we found yesterday about db writes from codfw [19:47:10] (03PS1) 10BBlack: Add cxserver/citoid to cache_mobile [puppet] - 10https://gerrit.wikimedia.org/r/266799 (https://phabricator.wikimedia.org/T110476) [19:47:14] I agree with that, we do not necessarily want that [19:47:24] (03CR) 10BBlack: [C: 032 V: 032] Add cxserver/citoid to cache_mobile [puppet] - 10https://gerrit.wikimedia.org/r/266799 (https://phabricator.wikimedia.org/T110476) (owner: 10BBlack) [19:48:01] mutante: i suppose the question is what to do, i can put together a patch to back out the repository until tin is back in service. Unless the plan is for tin to become the backup and mira to stay primary [19:48:15] !log started nfs-exports daemon on labstore1001, had been dead for a few days [19:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:48:38] 6operations, 10Analytics-Cluster: stat1002 - redis can't connect to mira.codfw.wmnet - https://phabricator.wikimedia.org/T124955#1971034 (10Dzahn) what it's doing is trying to deploy wikimedia/discovery/analytics and it can't deploy it because of the redis connection timeout. Error: Execution of '/usr/bin/sal... [19:49:04] RECOVERY - puppet last run on cp1060 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [19:49:05] ebernhardson: i made this https://phabricator.wikimedia.org/T124955 maybe you can link that patch there?
[19:49:25] (03PS2) 10Cenarium: Autopromotion: remove deprecated onView event, fix INGROUPS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264219 [19:49:25] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is OK: OK - nfs-exports is active [19:49:44] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [19:49:47] 6operations, 10Analytics-Cluster, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: stat1002 - redis can't connect to mira.codfw.wmnet - https://phabricator.wikimedia.org/T124955#1971050 (10jcrespo) [19:49:49] ebernhardson: afaik, we want to switch to mira for at least 48 hours but then back, but i also have to ask [19:50:05] 6operations, 10Analytics-Cluster, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: stat1002 - redis can't connect to mira.codfw.wmnet - https://phabricator.wikimedia.org/T124955#1971054 (10EBernhardson) https://gerrit.wikimedia.org/r/#/c/265795/ is the patch that added this, it adds a new user to analytics mac... [19:50:56] (03CR) 10Cenarium: "So that's why they were commented out, OK I've fixed that." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264219 (owner: 10Cenarium) [19:51:30] 6operations, 10Analytics-Cluster, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: stat1002 - redis can't connect to mira.codfw.wmnet - https://phabricator.wikimedia.org/T124955#1971066 (10EBernhardson) I would also note that this means analytics can't deploy new versions of refinery as long as mira is master... [19:52:22] 6operations, 10Analytics-Cluster, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: stat1002 - redis can't connect to mira.codfw.wmnet - https://phabricator.wikimedia.org/T124955#1971075 (10Dzahn) It might need #netops because ACLs on network hardware might have to be adjusted, since the analytics VLAN is separ... 
[19:52:44] RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [19:53:44] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:54:04] RECOVERY - puppet last run on cp1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:54:37] 6operations, 10ops-eqiad, 10DBA: Decommission pc1001-1003 - https://phabricator.wikimedia.org/T124962#1971085 (10jcrespo) 3NEW [19:57:17] 6operations, 10Incident-20160126-WikimediaDomainRedirection, 10Wikimedia-Apache-configuration, 10netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimedia.org - https://phabricator.wikimedia.org/T124804#1971095 (10Krenair) [19:58:46] Krenair: they were redirecting to wikimediafoundation.org, though. [19:59:09] (03CR) 10Alex Monk: [C: 031] Autopromotion: remove deprecated onView event, fix INGROUPS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264219 (owner: 10Cenarium) [19:59:15] PROBLEM - Kafka Cluster analytics-eqiad Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 46 below the confidence bounds [19:59:20] MatmaRex, I thought some were showing the portal page from wikimedia.org? [19:59:46] hmm. maybe? the ones i've seen were doing a HTTP redirect to wikimediafoundation.org, though. [20:00:26] twentyafterfour: "There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). " [20:00:29] for 45 minutes now [20:00:31] ebernhardson: i checked the iptables rules, they exist on tin and mira, and allow that connection from that IP on that port, it must be on the network hardware [20:00:32] (on tin) [20:01:28] ottomata: eventlog1001 has puppet disabled with no reason specified [20:01:32] paravoid: would you have time to look at a router ACL maybe? 
stat1002 in analytics can't talk to mira, but it can talk to tin, i believe we are missing one to allow the codfw part [20:01:52] oh woops, thanks paravoid, that is leftover from yesterday's wikimediafoundation outage [20:02:30] fixed. [20:02:31] thanks [20:03:14] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [20:03:23] 6operations, 10Incident-20160126-WikimediaDomainRedirection, 10Wikimedia-Apache-configuration, 10netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are using wikimedia.org VHost, including /wiki/ -> wikimediafoundation.org redirect - https://phabricator.wikimedia.org/T124804#1971105 (1... [20:03:33] 6operations, 10Analytics-Cluster, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: stat1002 - redis can't connect to mira.codfw.wmnet - https://phabricator.wikimedia.org/T124955#1971106 (10Dzahn) i checked ferm/iptables rules on tin and mira. they are the same and allow connections to 6379 (the redis port) fro... [20:03:54] 6operations, 10Analytics-Cluster, 10netops, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: stat1002 - redis can't connect to mira.codfw.wmnet - https://phabricator.wikimedia.org/T124955#1971110 (10Dzahn) [20:05:03] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:05:07] 6operations, 10Analytics-Cluster, 10netops, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: stat1002 - redis can't connect to mira.codfw.wmnet - https://phabricator.wikimedia.org/T124955#1971121 (10Ottomata) Yeah, makes sense! stat1002 is in the Analytics VLAN, so a rule will need to be opened up in the VL... [20:05:55] 6operations, 6Analytics-Kanban, 10hardware-requests: 8 x 3 SSDs for AQS nodes. 
- https://phabricator.wikimedia.org/T124947#1971123 (10Ottomata) [20:06:55] PROBLEM - Disk space on ms-be2003 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdg1 is not accessible: Input/output error [20:07:34] PROBLEM - RAID on ms-be2003 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) [20:08:49] 6operations, 10Parsoid, 6Services, 10service-template-node, 7service-runner: Create a standard service template / init / logging / package setup - https://phabricator.wikimedia.org/T88585#1971133 (10mobrovac) [20:08:52] 7Puppet, 6operations, 6Release-Engineering-Team, 6Services, 7service-runner: Create a standard puppet module for service-runner services - https://phabricator.wikimedia.org/T89901#1971131 (10mobrovac) 5Open>3Resolved a:5GWicke>3mobrovac [20:09:00] 7Puppet, 6operations, 6Release-Engineering-Team, 6Services, 7service-runner: Create a standard puppet module for service-runner services - https://phabricator.wikimedia.org/T89901#1048322 (10mobrovac) Indeed @Gwicke :) Done. [20:10:11] mutante: should be fixed I think [20:10:17] 6operations, 6Analytics-Kanban, 10hardware-requests: 8 x 3 SSDs for AQS nodes. - https://phabricator.wikimedia.org/T124947#1971137 (10JAllemandou) When discussing about cassandra response time issues with @Gwicke, he told me the Services Team had used SSDs to mitigate that issue. They use Samsung 850 Pro 1Tb... [20:10:34] paravoid: thank you :) [20:13:04] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [20:13:05] ebernhardson: paravoid: ottomata: confirmed fixed Package[wikimedia/discovery/analytics]/ensure: ensure changed 'purged' to 'present' [20:13:08] ^ [20:14:01] mutante: thanks! [20:15:43] 6operations, 10Analytics-Cluster, 10netops, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: stat1002 - redis can't connect to mira.codfw.wmnet - https://phabricator.wikimedia.org/T124955#1971162 (10Dzahn) fixed by @Faidon , thanks! 
-- confirmed working now: Package[wikimedia/discovery/analytics]/ensure:... [20:15:48] danke! [20:16:01] 6operations, 10Analytics-Cluster, 10netops, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: stat1002 - redis can't connect to mira.codfw.wmnet - https://phabricator.wikimedia.org/T124955#1971166 (10Dzahn) 5Open>3Resolved a:3Dzahn [20:16:20] greg-g: heading to lunch. things looks fine according to fatalmonitor, so a tentative \o/ [20:16:41] anomie, tgr: thanks for fixing all the things :) [20:17:53] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: Puppet has 1 failures [20:18:15] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 60.71% of data above the critical threshold [5000000.0] [20:19:03] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Search-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1971185 (10EBernhardson) Another option for analytics<->codfw that me and @SMalyshev just talked about would be using an... 
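The Icinga alerts throughout this log report figures like "60.71% of data above the critical threshold [5000000.0]". A minimal sketch of that style of check follows; it is illustrative only, since the real Graphite check plugin also fetches the series over HTTP and handles time windows and null-heavy series differently.

```python
def percent_over(datapoints, threshold):
    """Percentage of non-null datapoints strictly above threshold."""
    pts = [p for p in datapoints if p is not None]  # Graphite can return nulls
    if not pts:
        return 0.0
    return 100.0 * sum(1 for p in pts if p > threshold) / len(pts)

# Hypothetical replica-lag samples; the alert fires when the percentage
# exceeds the configured critical fraction (e.g. 50%).
lag = [6_200_000, 4_900_000, 5_600_000, 7_100_000]
print(percent_over(lag, 5_000_000))
```

With these made-up samples, 3 of 4 points exceed the 5,000,000 threshold, which is the kind of majority-over-threshold condition that flips the check to CRITICAL.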
[20:20:27] marxarelli|afk: sweet [20:21:34] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: puppet fail [20:23:54] (03PS2) 10Ori.livneh: ruthenium: Rename parsoid systemd file + start up a single worker [puppet] - 10https://gerrit.wikimedia.org/r/266788 (owner: 10Subramanya Sastry) [20:25:19] (03CR) 10Ori.livneh: [V: 032] ruthenium: Rename parsoid systemd file + start up a single worker [puppet] - 10https://gerrit.wikimedia.org/r/266788 (owner: 10Subramanya Sastry) [20:25:31] (03CR) 10Ori.livneh: [C: 032] ruthenium: Rename parsoid systemd file + start up a single worker [puppet] - 10https://gerrit.wikimedia.org/r/266788 (owner: 10Subramanya Sastry) [20:31:23] PROBLEM - Kafka Cluster analytics-eqiad Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 45 below the confidence bounds [20:34:00] man why anomaly detection gotta be all weird [20:36:03] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [20:46:37] how's the train doing? [20:46:57] any blockers for the parsoid/ocg deploy window in 15 min? [20:48:13] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [20:48:14] 6operations, 10Incident-20160126-WikimediaDomainRedirection, 10Wikimedia-Apache-configuration, 10netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are using wikimedia.org VHost, including /wiki/ -> wikimediafoundation.org redirect - https://phabricator.wikimedia.org/T124804#1971299 (1... [20:51:08] 6operations, 10Incident-20160126-WikimediaDomainRedirection, 10Wikimedia-Apache-configuration, 10netops, 5Patch-For-Review: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are using wikimedia.org VHost, including /wiki/ -> wikimediafounda... 
- https://phabricator.wikimedia.org/T124804#1971323 [20:51:16] 6operations, 10Incident-20160126-WikimediaDomainRedirection, 10Wikimedia-Apache-configuration, 10netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are using wikimedia.org VHost, including /wiki/ -> wikimediafoundation.org redirect - https://phabricator.wikimedia.org/T124804#1971326 (1... [20:53:01] (03CR) 10Dzahn: creation of parsoid-rt-admin group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266632 (https://phabricator.wikimedia.org/T124701) (owner: 10RobH) [20:54:30] (03CR) 10Dzahn: creation of parsoid-rt-admin group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266632 (https://phabricator.wikimedia.org/T124701) (owner: 10RobH) [20:55:32] (03CR) 10Dzahn: [C: 031] make default log rotation for apache be 30 days [puppet] - 10https://gerrit.wikimedia.org/r/266480 (owner: 10ArielGlenn) [20:56:03] RECOVERY - Kafka Cluster analytics-eqiad Broker Messages In Per Second on graphite1001 is OK: OK: No anomaly detected [20:56:51] (03CR) 10Dzahn: [C: 04-1] "sorry, -1 unless we get the SSL cert issue resolved with letsencrypt some time later" [dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder) [20:57:12] GOOD better NOT have an anomaly when you don't [20:57:14] better stay like that! [20:59:31] warning: abnormal anomalies detected :p [21:00:04] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160127T2100). 
[21:00:15] (03PS2) 10Dzahn: phabricator: don't use communitymetrics@, use wikitech [puppet] - 10https://gerrit.wikimedia.org/r/266316 (https://phabricator.wikimedia.org/T123581) [21:00:24] (03CR) 10Dzahn: [C: 032] phabricator: don't use communitymetrics@, use wikitech [puppet] - 10https://gerrit.wikimedia.org/r/266316 (https://phabricator.wikimedia.org/T123581) (owner: 10Dzahn) [21:01:10] robh, greg-g, marxarelli|afk: any update on the train deploy? i'm assuming it has completed successfully and we are not currently in an outage and i'm clear to deploy ocg? [21:01:43] (03CR) 10Odder: "Hugely disappointing as the redirection doesn't work any longer." [dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder) [21:01:54] RECOVERY - Disk space on ms-be2003 is OK: DISK OK [21:03:55] cscott: yeah I think you are all good [21:07:21] cscott, are you deploying? I need to deploy something too, pls ping me when done [21:10:17] yurik: yup, on it. i'll ping you when done. shouldn't be long (assuming the world doesn't break) [21:14:56] * yurik thinks the world shouldn't break more than twice in one day... or was it yesterday? [21:15:59] 6operations, 7Mail, 7Mobile, 5Patch-For-Review: consolidate mailman redirects in exim aliases file - https://phabricator.wikimedia.org/T123581#1971422 (10Dzahn) @Jkrauska can you do this kind of thing on your side even if both addresses are external on lists? Or should that stay in exim? Do you happen to k... 
[21:20:22] (03CR) 10Dzahn: "I understand that must be very disappointing after all that time WMF let you wait on this just to donate a domain and i'm sorry for the wa" [dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder) [21:22:14] (03CR) 10Subramanya Sastry: creation of parsoid-rt-admin group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266632 (https://phabricator.wikimedia.org/T124701) (owner: 10RobH) [21:22:41] (03PS1) 10Ori.livneh: Speed trials: fix-up for inlined CSS variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266918 [21:22:55] (03CR) 10Ori.livneh: [C: 032] Speed trials: fix-up for inlined CSS variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266918 (owner: 10Ori.livneh) [21:23:12] hola mutante [21:23:40] letsencrypt looks really nice, any chance WMF might actually sponsor them? [21:24:05] odder: yes, it has been discussed, we are just not there yet [21:24:33] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [21:26:07] (hi odder!) [21:26:14] odder: the way this went since back in RT days is really unfortunate, sorry in the name of WMF, don't abandon that yet [21:26:22] !log ori@mira Synchronized docroot and w: (no message) (duration: 02m 26s) [21:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:26:31] !log updated OCG to version 64050af0456a43344b32e3e93561a79207565eaf [21:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:27:25] hi ori! long time no see [21:27:36] cscott: gah! sorry, i neglected to update the roadmap [21:28:04] just to confirm, yeah, group1 is on wmf.11 [21:28:14] odder: yeah, how have you been? 
[21:28:21] odder: some more info on letsencrypt and related ticket https://phabricator.wikimedia.org/T101048 [21:29:10] !log mobileapps deployed 6f35859 [21:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:29:26] yurik: ok, i'm done. [21:29:55] cscott, thx. I saw that we switched off from tin. What do i need to do to set up git deploy on the new host? [21:30:40] ori: Been alright! Donated a domain the other day to the WMF and trying to unsquat a few others [21:31:03] dem f^%$s keep renewing them though, and probably not worth to get the lawyers involved, I don't think [21:31:25] 10Ops-Access-Requests, 6operations, 6Parsing-Team, 5Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1971467 (10TrevorParscal) Approved. [21:31:58] yurik: nothing, as far as I could tell. i just logged into the new host and everything was there already. [21:32:15] yurik: https://wikitech.wikimedia.org/w/index.php?title=OCG&type=revision&diff=274824&oldid=270998 [21:32:46] cscott, it complains on git deploy start about missing user.name & user.email. Will see if i need anything else [21:33:01] (03CR) 10Odder: "I'd say let's wait for letsencrypt and make sure to dig this patch up when it's all ready and shiny." 
[dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder) [21:35:52] 10Ops-Access-Requests, 6operations, 6Parsing-Team, 5Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1971477 (10ssastry) a:5ssastry>3RobH [21:44:44] PROBLEM - graphoid endpoints health on sca1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [21:45:13] PROBLEM - graphoid endpoints health on sca1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [21:45:39] !log updated graphoid on scb* [21:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:45:54] checking why sca is showing issues ^ [21:46:15] mobrovac, is this test still pointing to sca^, or is it really on scb? [21:52:16] yurik: no, those are the tests running on sca100x, we need to stop graphoid there [21:52:49] mobrovac, is git deploy still deploys there? 
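The `git deploy start` complaint about missing `user.name` and `user.email` a few messages up is git's standard identity check; a minimal sketch of the usual fix (the name and address below are placeholders, not values from this log):

```shell
# git refuses to create commits or tags until an identity is configured;
# `git deploy start` (Trebuchet) tags the repo, which is presumably why it
# trips over this. Placeholder identity values:
git config --global user.name "Your Name"
git config --global user.email "you@example.org"
git config --get user.name    # verify the setting took effect
```

With those two values in `~/.gitconfig` the deploy tooling should stop complaining on the new host.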
[21:53:00] i thought sca100x was removed [21:53:12] i just did a full graphoid deployment + restart [21:53:41] yurik: you should have seen in the output of trebuchet that 2/4 minions succeeded [21:55:06] (03PS1) 10Papaul: admin: add dc-ops to install-server, allow to read log files [puppet] - 10https://gerrit.wikimedia.org/r/266919 [22:00:25] (03CR) 10Papaul: "I am able now to run puppet agent -t -v from carbon but i able not able to view syslog to troubleshoot MAC address issues when a new syst" [puppet] - 10https://gerrit.wikimedia.org/r/266919 (owner: 10Papaul) [22:01:41] 6operations, 10ops-codfw, 10hardware-requests: mw2173 has probably a broken disk, needs substitution and reimaging - https://phabricator.wikimedia.org/T124408#1971588 (10Papaul) 5Open>3Resolved Closing this since the system is back up [22:03:05] 6operations, 10ops-codfw, 5Patch-For-Review: mw2039 fails to reboot, mgmt interface unreachable - https://phabricator.wikimedia.org/T124282#1971594 (10Papaul) 5Open>3Resolved Closing, system is back up in service. [22:04:01] 6operations, 10ops-codfw, 10hardware-requests: mw2173 has probably a broken disk, needs substitution and reimaging - https://phabricator.wikimedia.org/T124408#1971597 (10Papaul) 5Resolved>3Open [22:05:01] 6operations, 10ops-codfw, 10hardware-requests: mw2173 has probably a broken disk, needs substitution and reimaging - https://phabricator.wikimedia.org/T124408#1955110 (10Papaul) Was en error closed this ticket by mistake. [22:12:34] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 58.33% of data above the critical threshold [5000000.0] [22:16:04] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 51.85% of data above the critical threshold [5000000.0] [22:21:05] robh: would it be possible for you dump the last 1000 lines of parsoid-rt and parsoid-rt-client logs (from ruthenium) to /tmp/ that I can take a look at? 
There is the mysterious error in testing and I want to take a look at the logs to see if it reveals something. [22:22:15] i think i can pipe into your home directory, should be ok and you can just rm it when you finish [22:22:23] that ok? [22:22:32] if you can read tmp thats cool doo [22:22:32] too [22:22:41] sounds good. [22:22:41] thanks. [22:23:05] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0] [22:23:05] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0] [22:23:31] i can read /tmp [22:23:33] subbu: Where do these log files you want live? just checking and /var/log doesnt seem to have those [22:24:06] robh, journalctl -fu .. but might need to be sudo / root. [22:24:35] journalctl -fu parsoid-rt and journalctl -fu parsoid-rt-client [22:25:14] if they are not there either .. then, i need to fix the logging setup with systemd next. :) [22:25:55] so that gives me the realtime output, not a look backward [22:26:13] trying to review how to do a historical tail [22:26:28] mobrovac, do you know ^^ .. 
[22:27:13] robh: just omit the "f" for a historical tail [22:27:24] journalctl -u parsoid-rt-client [22:27:55] yea but that still is a more|less type review, i just want it to grab the last 1k lines and shove into a file [22:28:46] and it starts at the start of the log file, where i want the end of it in a snapshot of just the last x lines (in this case 1k) [22:28:58] robh: journalctl -n 1000 -u service_name > /tmp/blah.log [22:29:29] !log starting mysqldump of MobileWebSectionUsage_14321266 from db1047 into m4-master [22:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:29:51] mobrovac: you rock [22:30:00] haha [22:30:14] i thought it wouldnt like that since when not piped it shows the more type fashion but nope its cool it likes it [22:30:49] yeah, it tests for tty before starting the output [22:31:44] subbu: they are in tmp [22:31:48] 6operations, 10Beta-Cluster-Infrastructure, 6Services: Move Node.JS services to Jessie and Node 4.2 - https://phabricator.wikimedia.org/T124989#1971690 (10mobrovac) 3NEW [22:31:58] subbu: im going to sudo you ownership so you can rm when done [22:32:04] great. thanks. [22:32:05] and remove other read rights [22:32:13] 6operations, 10Beta-Cluster-Infrastructure, 6Services: Move Node.JS services to Jessie and Node 4.2 - https://phabricator.wikimedia.org/T124989#1971709 (10mobrovac) [22:32:17] 6operations, 6Services: Migrate SCA cluster to SCB (Jessie and Node 4.2) - https://phabricator.wikimedia.org/T96017#1971710 (10mobrovac) [22:33:00] they are all yours [22:33:17] hope it helps, lemme know if you need more of them after you fix the issue =] [22:33:25] (or more in general) [22:36:14] robh .. can you restart parsoid-rt-client service? i am curious if it was just some bad state one of the test clients was stuck in .. once the services came up after successful puppet run.
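The journalctl recipes worked out in the exchange above, collected into one sketch (the unit names are the ones discussed; the `/tmp` path is illustrative, and the whole thing is guarded so it is a no-op on a host without systemd):

```shell
# Guard: only run where journalctl exists (i.e. on a systemd host).
if command -v journalctl >/dev/null 2>&1; then
    # Historical view from the start of the unit's journal
    # (the pager would kick in on a tty; --no-pager suppresses it):
    journalctl --no-pager -u parsoid-rt-client | head -n 5
    # Grab only the last 1000 entries and dump them to a file; journalctl
    # tests for a tty before starting output, so when redirected it emits
    # plain text and no pager:
    journalctl -n 1000 -u parsoid-rt > /tmp/parsoid-rt.log
fi
# Realtime follow (omit -f for the historical look above):
#   journalctl -f -u parsoid-rt-client
```

The tty check is what made the `> /tmp/blah.log` redirection "just work" in the exchange, with no extra `cat` or `--no-pager` needed.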
[22:36:46] !log restarting parsoid-rt-client service on ruthenium [22:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:36:52] doing now [22:36:55] done [22:37:51] thanks. [22:43:52] 6operations, 10Beta-Cluster-Infrastructure, 6Services: Move Node.JS services to Jessie and Node 4.2 - https://phabricator.wikimedia.org/T124989#1971808 (10mobrovac) [22:58:23] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] [23:05:35] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [23:05:48] ^ we know yuvi is going to disable a certain tool and we will try to talk w/ the user [23:12:43] 6operations, 10RESTBase-Cassandra: replace default Cassandra superuser - https://phabricator.wikimedia.org/T113622#1971939 (10GWicke) p:5Triage>3Normal We are not using the default "admin" user for any ongoing operational tasks. Additionally, the credentials for the default admin user have been recently re... 
[23:20:09] (03PS1) 10Tim Landscheidt: Tools: Fix double file resource for jlocal [puppet] - 10https://gerrit.wikimedia.org/r/266934 [23:22:25] (03PS1) 10Tim Landscheidt: Tools: Fix argument quoting in jlocal [puppet] - 10https://gerrit.wikimedia.org/r/266935 [23:25:47] 6operations, 10Beta-Cluster-Infrastructure, 6Services: Move Node.JS services to Jessie and Node 4.2 - https://phabricator.wikimedia.org/T124989#1971985 (10mobrovac) [23:28:41] 6operations, 10Beta-Cluster-Infrastructure, 6Services: Move Node.JS services to Jessie and Node 4.2 - https://phabricator.wikimedia.org/T124989#1972000 (10mobrovac) [23:30:20] (03CR) 10Yuvipanda: "I don't think the comment is not pertinent (but hey, I'm biased, I wrote it)" [puppet] - 10https://gerrit.wikimedia.org/r/266935 (owner: 10Tim Landscheidt) [23:37:23] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] [23:39:46] greg-g: can I quickly deploy a SessionManager patch before the SWAT? [23:40:02] https://phabricator.wikimedia.org/T124971 [23:41:18] (03PS1) 10Mobrovac: RESTBase: Start using deployment-restbase02 for clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266945 (https://phabricator.wikimedia.org/T125003) [23:41:20] although given that CI takes 10 min per patch I probably wouldn't finish [23:41:25] after the SWAT, then [23:42:26] tgr: sure, or during [23:42:54] ah, it's full [23:44:23] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [23:56:56] (03PS1) 10Mattflaschen: Have Beta job queue settings shadow production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266949