[00:00:04] <jouncebot>	 RoanKattouw ostriches rmoen Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151105T0000). Please do the needful.
[00:00:34] <Krenair>	 I guess we're doing https://gerrit.wikimedia.org/r/250992
[00:00:52] <YuviPanda>	 andrewbogott: oh nevermind me, yes I found the ferm
[00:01:06] <Krenair>	 James_F
[00:02:21] <James_F>	 Krenair: When CI works, yes.
[00:02:26] <YuviPanda>	 andrewbogott: hmm, not sure how to fix that
[00:02:31] <YuviPanda>	 andrewbogott: I guess we should make it an array?
[00:03:14] <greg-g>	 it is working, the tests just take a while
[00:03:15] <grrrit-wm>	 (CR) Awight: [C: -1] "Thanks for helping with this! One small change, we don't want the empty category= param cos the code that uses this variable actually add" [mediawiki-config] - https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T116609) (owner: Nemo bis)
[00:03:22] <andrewbogott>	 It’s already defined in hiera as labs_ldap_dns_host_secondary
[00:03:28] <YuviPanda>	 aaah
[00:03:30] <YuviPanda>	 ok
[00:03:30] <andrewbogott>	 so it’s just a second ferm line I think
[00:04:03] <YuviPanda>	 ok let me make a patch andrewbogott
[00:04:07] <andrewbogott>	 well, wait, I’m wrong — that’s probably a different IP from the one that’s making the nova query
[00:04:18] <andrewbogott>	 so probably need to add a labs_designate_secondary_hostname
[00:04:21] <andrewbogott>	 or something like that
[00:05:07] <grrrit-wm>	 (CR) BryanDavis: "Posted for SWAT on 2015-11-06T00:00Z." [mediawiki-config] - https://gerrit.wikimedia.org/r/249313 (https://phabricator.wikimedia.org/T116550) (owner: BryanDavis)
[00:05:34] <grrrit-wm>	 (PS1) Yuvipanda: labs: Open up nova API access to other DNS host too [puppet] - https://gerrit.wikimedia.org/r/251167
[00:05:40] <YuviPanda>	 andrewbogott: ugh, yeah, just noticed that too
[00:05:49] <James_F>	 Krenair: https://gerrit.wikimedia.org/r/251168
[00:09:13] <grrrit-wm>	 (PS2) Yuvipanda: labs: Open up nova API access to other DNS host too [puppet] - https://gerrit.wikimedia.org/r/251167
[00:09:23] <YuviPanda>	 andrewbogott: ^?
[00:10:08] <YuviPanda>	 andrewbogott: should I define it for codfw too?
[00:10:23] <grrrit-wm>	 (CR) Andrew Bogott: [C: -1] labs: Open up nova API access to other DNS host too (1 comment) [puppet] - https://gerrit.wikimedia.org/r/251167 (owner: Yuvipanda)
[00:10:32] <andrewbogott>	 I don’t think it’s useful to do for codfw right now
[00:11:30] <wikibugs>	 operations, Analytics, Analytics-Cluster, Fundraising Tech Backlog, and 2 others: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1784279 (awight) I think we're prepared to make this change now. The sample rate is parsed out of the filenames, so that...
[00:12:44] <grrrit-wm>	 (PS3) Yuvipanda: labs: Open up nova API access to other DNS host too [puppet] - https://gerrit.wikimedia.org/r/251167
[00:12:57] <YuviPanda>	 andrewbogott: %
[00:13:00] <YuviPanda>	 ^
[00:13:25] <grrrit-wm>	 (CR) Andrew Bogott: [C: 1] labs: Open up nova API access to other DNS host too [puppet] - https://gerrit.wikimedia.org/r/251167 (owner: Yuvipanda)
[00:13:37] <grrrit-wm>	 (PS4) Yuvipanda: labs: Open up nova API access to other DNS host too [puppet] - https://gerrit.wikimedia.org/r/251167
[00:13:48] <grrrit-wm>	 (CR) Yuvipanda: [C: 2 V: 2] labs: Open up nova API access to other DNS host too [puppet] - https://gerrit.wikimedia.org/r/251167 (owner: Yuvipanda)
[00:17:11] <grrrit-wm>	 (PS1) Yuvipanda: dnsrecursor: Fix permissions for config YAML file [puppet] - https://gerrit.wikimedia.org/r/251171
[00:17:35] <grrrit-wm>	 (PS2) Yuvipanda: dnsrecursor: Fix permissions for config YAML file [puppet] - https://gerrit.wikimedia.org/r/251171
[00:17:37] <icinga-wm>	 RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[00:19:06] <icinga-wm>	 RECOVERY - Recursive DNS on 208.80.155.118 is OK: DNS OK: 0.144 seconds response time. www.wikipedia.org returns 208.80.154.224
[00:19:11] <YuviPanda>	 andrewbogott: chasemp ^ dns fixed
[00:19:17] <grrrit-wm>	 (CR) Yuvipanda: [C: 2] dnsrecursor: Fix permissions for config YAML file [puppet] - https://gerrit.wikimedia.org/r/251171 (owner: Yuvipanda)
[00:20:02] <YuviPanda>	 ebernhardson: dcausse any luck with nobelium? :)
[00:23:17] <icinga-wm>	 PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: puppet fail
[00:24:37] <icinga-wm>	 PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail
[00:25:57] <YuviPanda>	 um
[00:26:36] <Krenair>	 wtf is up on tin?
[00:26:38] <Krenair>	 twentyafterfour, hi
[00:27:06] <icinga-wm>	 RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:28:27] <icinga-wm>	 RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[00:32:05] <ebernhardson>	 YuviPanda: it's back to importing
[00:32:34] <grrrit-wm>	 (PS3) Andrew Bogott: Update keystone policy.json to allow the 'observer' role to observe. [puppet] - https://gerrit.wikimedia.org/r/251151 (https://phabricator.wikimedia.org/T104588)
[00:33:46] <grrrit-wm>	 (CR) Andrew Bogott: [C: 2] Update keystone policy.json to allow the 'observer' role to observe. [puppet] - https://gerrit.wikimedia.org/r/251151 (https://phabricator.wikimedia.org/T104588) (owner: Andrew Bogott)
[00:34:26] <grrrit-wm>	 (PS1) Yuvipanda: dynamicproxy: Move invisible-unicorn into puppet [puppet] - https://gerrit.wikimedia.org/r/251176
[00:34:35] <YuviPanda>	 andrewbogott: ^^
[00:34:41] <YuviPanda>	 ebernhardson: cool! any vague ETAs?
[00:35:20] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] dynamicproxy: Move invisible-unicorn into puppet [puppet] - https://gerrit.wikimedia.org/r/251176 (owner: Yuvipanda)
[00:36:48] <andrewbogott>	 YuviPanda: will moving it to uwsgi change the port?  Or is that determined by the code and not the wsgi framework?
[00:37:29] <YuviPanda>	 andrewbogott: so when we initially built this uwsgi couldn't actually serve http directly
[00:37:31] <YuviPanda>	 now it can
[00:37:34] <YuviPanda>	 so I just got rid of nginx
[00:38:02] <YuviPanda>	 I also couldn't find the service definition file for the service earlier
[00:38:13] <andrewbogott>	 Sure, I’m just wondering about http-socket => '0.0.0.0:80',
[00:38:21] <YuviPanda>	 andrewbogott: ah yup, that's the one.
[00:38:27] <YuviPanda>	 andrewbogott: before it was using a unix socket
[00:38:30] <YuviPanda>	 that nginx listened on
[00:38:51] <andrewbogott>	 so the service is moving to 80?
[00:39:02] <YuviPanda>	  it was always on 80 no?
[00:39:04] <YuviPanda>	 oh wait
[00:39:06] <YuviPanda>	 maybe not
[00:39:13] <YuviPanda>	 was on 5668!
[00:39:14] <YuviPanda>	 good catch
[00:39:16] <YuviPanda>	 let me fix that
[00:39:32] <andrewbogott>	 It should stay on 5668 if that’s easy
[00:39:44] <YuviPanda>	 yeah
[00:39:44] <grrrit-wm>	 (PS2) Yuvipanda: dynamicproxy: Move invisible-unicorn into puppet [puppet] - https://gerrit.wikimedia.org/r/251176
[00:39:46] <YuviPanda>	 it is
[00:39:48] <YuviPanda>	 I just fixed it
[00:39:50] <YuviPanda>	 I just didn't notice it
[00:39:53] <YuviPanda>	 now to fix the crazy pep errors.....
[00:40:40] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] dynamicproxy: Move invisible-unicorn into puppet [puppet] - https://gerrit.wikimedia.org/r/251176 (owner: Yuvipanda)
[00:41:04] <ebernhardson>	 YuviPanda: hard to say, looks like 37M content documents to go, then 92M docs in the general indices. 
[00:41:42] <ebernhardson>	 but one doc does not equal another doc, wiktionary docs for example are typically very small
[00:42:59] <ebernhardson>	 getting an idea on insert speed is also odd because of that... but if we guess something like 1k doc/sec, which is probably at the higher end, 34 hours
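A rough back-of-the-envelope version of the estimate above, as a minimal sketch. The document counts (37M and 92M) and the 1k docs/sec rate are the rounded figures quoted in the messages; the function name and the assumption of a constant insert rate are illustrative only. With these rounded inputs the result comes out near 36 hours, in the same ballpark as the 34 hours quoted.

```python
def estimate_import_hours(remaining_docs: int, docs_per_second: float) -> float:
    """Return the estimated hours left, assuming a constant insert rate."""
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return remaining_docs / docs_per_second / 3600


# Rounded figures from the log: 37M content docs plus 92M general-index docs.
remaining = 37_000_000 + 92_000_000
# 1k docs/sec is the optimistic guess from the log, not a measured rate.
hours = estimate_import_hours(remaining, 1_000)
print(f"{hours:.1f} hours")
```

Because document sizes vary widely (wiktionary docs are typically very small), a single average rate is only a coarse guide.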
[00:43:39] <grrrit-wm>	 (PS3) Yuvipanda: dynamicproxy: Move invisible-unicorn into puppet [puppet] - https://gerrit.wikimedia.org/r/251176
[00:43:48] <YuviPanda>	 andrewbogott: right
[00:43:51] <YuviPanda>	 err
[00:43:53] <YuviPanda>	 ebernhardson: right
[00:43:58] <YuviPanda>	 ebernhardson: another week?
[00:44:03] <ebernhardson>	 YuviPanda: hopefully less 
[00:44:09] <YuviPanda>	 nice!
[00:44:17] <YuviPanda>	 is this with the no-nested-documents fix?
[00:44:19] * ebernhardson should figure out why elasticsearch.*.elasticsearch.indices.indexing.* are all 0
[00:44:26] <ebernhardson>	 YuviPanda: it still has nested documents, it's just lazy-loading them
[00:44:33] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] dynamicproxy: Move invisible-unicorn into puppet [puppet] - https://gerrit.wikimedia.org/r/251176 (owner: Yuvipanda)
[00:44:41] * YuviPanda nods head
[00:44:45] <ebernhardson>	 it doesn't appear to need them at indexing time, but if we query all 1800 indices they will be loaded into memory and break things
[00:44:53] <YuviPanda>	 heh
[00:45:15] <ebernhardson>	 it might depend on actually querying them via geosearch, not sure yet. never used this lazy load option before :)
[00:45:20] <YuviPanda>	 did you guys figure out if es in codfw is going to be active-active with eqiad?
[00:45:45] <ebernhardson>	 YuviPanda: the intention was to serve search queries from codfw app/api servers from that cluster, if that's what you mean?
[00:45:54] <YuviPanda>	 aaah ok
[00:45:56] <YuviPanda>	 yeah, that is.
[00:46:12] <ebernhardson>	 also the ability to shift traffic over for major upgrades (like es 2.0 which was just released)
[00:46:25] <ebernhardson>	 when they shift the major number like that the protocol between nodes changes, have to do the whole cluster
[00:47:04] <YuviPanda>	 ah
[00:47:12] <YuviPanda>	 so you can turn it all to one, do upgrade, turn it back, repeat
[00:47:16] <ebernhardson>	 yea
[00:48:23] <matt_flaschen>	 Krenair, what's the status on SWAT?  Can I add one super-late item?
[00:48:39] <Krenair>	 Waiting for twentyafterfour to appear
[00:48:44] <Krenair>	 There's something wrong
[00:49:32] <matt_flaschen>	 Okay
[00:49:46] <Krenair>	 See tin:/srv/mediawiki-staging/weird-rebase
[00:49:55] <Krenair>	 contains private info
[00:50:55] <matt_flaschen>	 Okay, I'll just put it in for tomorrow.
[01:00:05] <jouncebot>	 twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151105T0100). Please do the needful.
[01:00:21] <Krenair>	 heh
[01:14:18] <icinga-wm>	 PROBLEM - puppet last run on mw1111 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:34:07] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:35:57] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING
[01:41:06] <icinga-wm>	 RECOVERY - puppet last run on mw1111 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[01:52:00] <wikibugs>	 Puppet, Labs, Phabricator: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1784550 (Negative24) @mmodell Not at a computer but isn't that just the tracking tag var?
[01:55:34] <Krenair>	 AaronSchulz, hi
[02:04:04] <Krenair>	 !log Someone has left tin:/srv/mediawiki-staging/php-1.27.0-wmf.5 in a mess, see `git log origin/wmf/1.27.0-wmf.5..HEAD --oneline`. Note there is one commit waiting to be merged on tin (https://gerrit.wikimedia.org/r/#/c/251168/) that hasn't been yet because of this.
[02:04:10] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:04:19] <Krenair>	 I'm going to sleep.
[02:05:29] <Krenair>	 Hope nobody runs scap
[02:05:41] <grrrit-wm>	 (PS1) Ori.livneh: Revert "Remove /etc/wikimedia-image-scaler" [puppet] - https://gerrit.wikimedia.org/r/251188
[02:05:44] <Krenair>	 or syncs anything else really
[02:06:26] <ori>	 Krenair: good night. who was the last person to sync stuff?
[02:06:31] <Krenair>	 aaron
[02:07:06] <ori>	 k
[02:07:21] <grrrit-wm>	 (PS2) Ori.livneh: Revert "Remove /etc/wikimedia-image-scaler" [puppet] - https://gerrit.wikimedia.org/r/251188
[02:07:28] <grrrit-wm>	 (CR) Ori.livneh: [C: 2 V: 2] Revert "Remove /etc/wikimedia-image-scaler" [puppet] - https://gerrit.wikimedia.org/r/251188 (owner: Ori.livneh)
[02:10:22] <wikibugs>	 Puppet, Labs, Phabricator: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1784582 (mmodell) @negative24: Production isn't deployed via puppet anymore. I just need to set up labs instances to clone the deployment repo instead of the individual tags.
[02:10:59] <twentyafterfour>	 Krenair: I'm here
[02:11:33] <twentyafterfour>	 Krenair: I'll take care of it
[02:13:07] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 4 below the confidence bounds
[02:13:50] <wikibugs>	 Puppet, Labs, Phabricator: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1784583 (Negative24) Ah, ok. (I'm a little bit curious of how the deployments are deployed; are they just pulled via git or something else?)
[02:15:21] <grrrit-wm>	 (PS1) Ori.livneh: Only set $wgDisableOutputCompression to 'true' on the scalers [mediawiki-config] - https://gerrit.wikimedia.org/r/251189
[02:15:25] <ori>	 twentyafterfour: thanks
[02:15:41] <ori>	 i'm going to sync a config change, but won't touch the branches
[02:16:00] <grrrit-wm>	 (CR) Ori.livneh: [C: 2] Only set $wgDisableOutputCompression to 'true' on the scalers [mediawiki-config] - https://gerrit.wikimedia.org/r/251189 (owner: Ori.livneh)
[02:16:22] <grrrit-wm>	 (Merged) jenkins-bot: Only set $wgDisableOutputCompression to 'true' on the scalers [mediawiki-config] - https://gerrit.wikimedia.org/r/251189 (owner: Ori.livneh)
[02:17:34] <logmsgbot>	 !log ori@tin Synchronized wmf-config/CommonSettings.php: I3c397e892e: Only set $wgDisableOutputCompression to 'true' on the scalers (duration: 00m 18s)
[02:17:40] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:23:53] <wikibugs>	 operations, Analytics, Analytics-Kanban, Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1784606 (GWicke) @faidon: Until very recently (last days), there wasn't actually any REST proxy with schema validation in the EventLogging repository. @ottomata now has [a patc...
[02:28:17] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[02:35:58] <icinga-wm>	 RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[02:40:46] <logmsgbot>	 !log l10nupdate@tin Synchronized php-1.27.0-wmf.4/cache/l10n: l10nupdate for 1.27.0-wmf.4 (duration: 10m 31s)
[02:40:55] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:47:21] <logmsgbot>	 !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.4) at 2015-11-05 02:47:20+00:00
[02:47:27] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:53:07] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds
[02:58:47] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds
[03:00:56] <icinga-wm>	 RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[03:10:06] <grrrit-wm>	 (PS4) Yuvipanda: dynamicproxy: Move invisible-unicorn into puppet [puppet] - https://gerrit.wikimedia.org/r/251176
[03:11:48] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] dynamicproxy: Move invisible-unicorn into puppet [puppet] - https://gerrit.wikimedia.org/r/251176 (owner: Yuvipanda)
[03:15:13] <logmsgbot>	 !log l10nupdate@tin Synchronized php-1.27.0-wmf.5/cache/l10n: l10nupdate for 1.27.0-wmf.5 (duration: 10m 17s)
[03:15:21] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:18:01] <icinga-wm>	 ACKNOWLEDGEMENT - Last backup of the maps filesystem on labstore1001 is CRITICAL: CRITICAL - Last run for unit replicate-maps was over 1 day, 1:00:00 ago Coren Looking into it.
[03:21:45] <logmsgbot>	 !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.5) at 2015-11-05 03:21:45+00:00
[03:21:51] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:23:01] <icinga-wm>	 ACKNOWLEDGEMENT - Last backup of the maps filesystem on labstore1001 is CRITICAL: CRITICAL - Last run for unit replicate-maps was over 1 day, 1:00:00 ago Coren Previous run pre-empted by manual backup next backup starts at 04:00:00 UTC and will clear the alarm. - The acknowledgement expires at: 2015-11-06 04:30:00 UTC.
[03:31:36] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds
[03:31:45] <AaronSchulz>	 twentyafterfour: looks like it could use a hard reset back to the origin branch and cherry pick of the security change back?
[03:39:07] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds
[03:48:28] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds
[03:54:07] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds
[03:59:56] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds
[04:00:26] <icinga-wm>	 RECOVERY - Last backup of the maps filesystem on labstore1001 is OK: OK - Last run for unit replicate-maps was successful
[04:24:27] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[04:27:28] <icinga-wm>	 PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [100000000.0]
[04:28:47] <wikibugs>	 operations, Commons, Wikimedia-Media-storage, MW-1.27-release-notes, and 2 others: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#1784708 (aaron) {F2916164} Large 5.5Mb list of ~40K orphaned files in the "public" zone for all of Commons. Files in...
[04:42:38] <icinga-wm>	 RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[05:09:07] <icinga-wm>	 PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 19.05% of data above the critical threshold [100000000.0]
[05:26:57] <icinga-wm>	 RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[05:52:36] <icinga-wm>	 RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[06:21:06] <icinga-wm>	 PROBLEM - RAID on es2010 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[06:30:07] <icinga-wm>	 PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: puppet fail
[06:30:47] <icinga-wm>	 PROBLEM - puppet last run on db1045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:48] <icinga-wm>	 PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:07] <icinga-wm>	 PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:07] <icinga-wm>	 PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:07] <icinga-wm>	 PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:32:07] <icinga-wm>	 PROBLEM - puppet last run on sca1001 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:28] <icinga-wm>	 PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:17] <icinga-wm>	 PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:37] <icinga-wm>	 PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:46:19] <wikibugs>	 operations, Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, vm-requests, iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1784783 (Joe) In my experience handling out 3 million events/day to a piwik installation means sounding th...
[06:53:47] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds
[06:56:17] <icinga-wm>	 RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:56:47] <icinga-wm>	 RECOVERY - puppet last run on db1045 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:56:48] <icinga-wm>	 RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:58:36] <icinga-wm>	 RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:01:13] <wikibugs>	 operations: Import Wikimania 2015 Videos - https://phabricator.wikimedia.org/T106565#1784793 (Matanya)
[07:01:14] <wikibugs>	 operations, Patch-For-Review: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1784790 (Matanya) Open>declined a:Matanya The video team hired by Wikimedia Mexico had encoded and uploaded the videos directly to commons. This task i...
[07:03:06] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds
[07:08:46] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds
[07:14:17] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[07:19:48] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[07:23:37] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 3 below the confidence bounds
[07:25:57] <icinga-wm>	 RECOVERY - puppet last run on sca1001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[07:26:56] <icinga-wm>	 RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[07:27:17] <icinga-wm>	 RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[07:27:37] <icinga-wm>	 RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:27:57] <icinga-wm>	 RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:27:58] <icinga-wm>	 RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:33:07] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 7 below the confidence bounds
[07:38:47] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 8 below the confidence bounds
[07:46:08] <icinga-wm>	 RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[07:57:28] <grrrit-wm>	 (PS3) KartikMistry: WIP: service-runner migration for cxserver [puppet] - https://gerrit.wikimedia.org/r/250910 (https://phabricator.wikimedia.org/T117657)
[07:58:42] <grrrit-wm>	 (PS4) KartikMistry: WIP: service-runner migration for cxserver [puppet] - https://gerrit.wikimedia.org/r/250910 (https://phabricator.wikimedia.org/T117657)
[08:05:13] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Nov  5 08:05:13 UTC 2015 (duration 5m 12s)
[08:05:19] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:14:59] <grrrit-wm>	 (PS1) Muehlenhoff: Assign Salt grain through the role, not by host [puppet] - https://gerrit.wikimedia.org/r/251202
[08:20:20] <grrrit-wm>	 (CR) Muehlenhoff: [C: 2 V: 2] Assign Salt grain through the role, not by host [puppet] - https://gerrit.wikimedia.org/r/251202 (owner: Muehlenhoff)
[08:29:12] <grrrit-wm>	 (CR) DCausse: [C: 1] Make three of the newer ES nodes master eligable [puppet] - https://gerrit.wikimedia.org/r/251024 (https://phabricator.wikimedia.org/T112556) (owner: EBernhardson)
[08:33:48] <wikibugs>	 operations, ops-codfw: es2010 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T117848#1784891 (jcrespo)
[08:34:49] <icinga-wm>	 ACKNOWLEDGEMENT - RAID on es2010 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Jcrespo Reported for replacement: https://phabricator.wikimedia.org/T117848
[08:43:57] <wikibugs>	 Puppet, Labs, Phabricator: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1784904 (mmodell) negative24: #scap3
[09:00:45] <grrrit-wm>	 (PS1) Jcrespo: Add ip resolution for new codfw db servers; Poolings and depoolings [mediawiki-config] - https://gerrit.wikimedia.org/r/251205
[09:03:35] <grrrit-wm>	 (CR) Jcrespo: "Al ip resolution of codfw servers have to be added to eqiad too (eqiad would fail to find a master if we failover to codfw), but no server" [mediawiki-config] - https://gerrit.wikimedia.org/r/251205 (owner: Jcrespo)
[09:21:00] <grrrit-wm>	 (PS2) Jcrespo: Add ip resolution for new codfw db servers; Poolings and depoolings [mediawiki-config] - https://gerrit.wikimedia.org/r/251205
[09:21:24] <grrrit-wm>	 (PS2) Giuseppe Lavagetto: terbium: move mediawiki monitoring [puppet] - https://gerrit.wikimedia.org/r/250931 (https://phabricator.wikimedia.org/T116728)
[09:22:47] <grrrit-wm>	 (CR) Jcrespo: [C: 2] Add ip resolution for new codfw db servers; Poolings and depoolings [mediawiki-config] - https://gerrit.wikimedia.org/r/251205 (owner: Jcrespo)
[09:25:18] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-codfw.php: Pool db2048, db2049, db2050,db2055, db2057, db2063. Depool db2034, db2035, db2051 (duration: 00m 17s)
[09:25:24] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:25:54] <jynus>	 Thankfully I have very good notes here; otherwise I would not be able to keep track
[09:33:16] <jynus>	 !log stopping mysql and cloning db2034 -> db2062, db2035 -> db2063, db2051 -> db2058
[09:33:22] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:35:02] <grrrit-wm>	 (CR) Nemo bis: "Awight, do you have other ideas on how to prevent the redirection to the canonical URL?" [mediawiki-config] - https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T116609) (owner: Nemo bis)
[09:37:38] <wikibugs>	 operations, Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, vm-requests, iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1784983 (mark) This is clearly a system for analytics. Will it be implemented, maintained and supported by...
[09:38:30] <grrrit-wm>	 (CR) Giuseppe Lavagetto: [C: 2] terbium: move mediawiki monitoring [puppet] - https://gerrit.wikimedia.org/r/250931 (https://phabricator.wikimedia.org/T116728) (owner: Giuseppe Lavagetto)
[09:56:22] <godog>	 !log run removenode on cerium.eqiad.wmnet -- decommission was missed before reimaging
[09:56:28] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:58:07] <wikibugs>	 operations, Commons, Wikimedia-Media-storage, MW-1.27-release-notes, and 2 others: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#1785015 (Ankry) >>! In T111838#1784708, @aaron wrote: > {F2916164} > > Large 5.5Mb list of ~40K orphaned files in the...
[09:58:18] <grrrit-wm>	 (PS7) Alexandros Kosiaris: openldap: Allow specifying an additional set of LDAP schemas [puppet] - https://gerrit.wikimedia.org/r/250010 (https://phabricator.wikimedia.org/T101299) (owner: Muehlenhoff)
[10:03:04] <grrrit-wm>	 (CR) Nemo bis: "Or in other words, is this guaranteed to add at least one URL parameter (which will prevent URL redirect)?" [mediawiki-config] - https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T116609) (owner: Nemo bis)
[10:04:41] <grrrit-wm>	 (PS2) Nemo bis: Use full URL in $wgNoticeHideUrls [mediawiki-config] - https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T116609)
[10:05:40] <wikibugs>	 operations, Commons, Wikimedia-Media-storage, MW-1.27-release-notes, and 2 others: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#1785028 (jcrespo) Those 3 files and the ones on the description have a space character, could it be related to: T10767...
[10:15:52] <wikibugs>	 operations, Commons, Wikimedia-Media-storage, MW-1.27-release-notes, and 2 others: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#1785047 (Ankry) Just for records: I have received information that on 8 October 2015 another file disappeared while b...
[10:22:00] <grrrit-wm>	 (Abandoned) Giuseppe Lavagetto: terbium: remove role mediawiki::maintenance [puppet] - https://gerrit.wikimedia.org/r/250930 (https://phabricator.wikimedia.org/T116728) (owner: Giuseppe Lavagetto)
[10:30:55] <grrrit-wm>	 (CR) Alexandros Kosiaris: [C: 1] openldap: Allow specifying an additional set of LDAP schemas [puppet] - https://gerrit.wikimedia.org/r/250010 (https://phabricator.wikimedia.org/T101299) (owner: Muehlenhoff)
[10:32:05] <grrrit-wm>	 (PS8) Muehlenhoff: openldap: Allow specifying an additional set of LDAP schemas [puppet] - https://gerrit.wikimedia.org/r/250010 (https://phabricator.wikimedia.org/T101299)
[10:32:13] <grrrit-wm>	 (CR) Muehlenhoff: [C: 2 V: 2] openldap: Allow specifying an additional set of LDAP schemas [puppet] - https://gerrit.wikimedia.org/r/250010 (https://phabricator.wikimedia.org/T101299) (owner: Muehlenhoff)
[10:34:36] <icinga-wm>	 PROBLEM - Host db2034 is DOWN: PING CRITICAL - Packet loss = 100%
[10:35:05] <jynus>	 mmm
[10:36:04] <jynus>	 that is probably network saturation
[10:37:24] <grrrit-wm>	 (PS2) Muehlenhoff: Add schema definitions for openldap/labs to be passed in $extra_schemas [puppet] - https://gerrit.wikimedia.org/r/250011 (https://phabricator.wikimedia.org/T101299)
[10:38:23] <jynus>	 no, it is definitely down
[10:38:59] <grrrit-wm>	 (PS3) Muehlenhoff: Add schema definitions for openldap/labs to be passed in $extra_schemas [puppet] - https://gerrit.wikimedia.org/r/250011 (https://phabricator.wikimedia.org/T101299)
[10:39:31] <grrrit-wm>	 (CR) Muehlenhoff: [C: 2 V: 2] Add schema definitions for openldap/labs to be passed in $extra_schemas [puppet] - https://gerrit.wikimedia.org/r/250011 (https://phabricator.wikimedia.org/T101299) (owner: Muehlenhoff)
[10:54:20] <wikibugs>	 operations, ops-codfw: db2034 host crashed; mgmt interface unavailable (needs reset and hw check) - https://phabricator.wikimedia.org/T117858#1785142 (jcrespo) NEW
[10:56:02] <grrrit-wm>	 (CR) Alexandros Kosiaris: "I was under the impression that instead of relying on backports we import packages into the backports suite of our own repo. That used to " [puppet] - https://gerrit.wikimedia.org/r/251042 (owner: Ori.livneh)
[11:00:13] <wikibugs>	 operations: 2FA for SSH access to the production cluster - https://phabricator.wikimedia.org/T116750#1785162 (mark) >>! In T116750#1774292, @MoritzMuehlenhoff wrote: > Hardware budget needed: 24 * 50 dollars if all members of the "ops" group receive a Yubikey Neo -> 1200 dollars. (Plus possible shipping costs...
[11:03:26] <wikibugs>	 operations, Patch-For-Review, Swift: install/setup/deploy ms-be2016-2021 - https://phabricator.wikimedia.org/T116842#1785166 (fgiunchedi) thanks Daniel! I'll track the swift expansion here
[11:09:51] <grrrit-wm>	 (CR) Filippo Giunchedi: restbase: move to systemd unit file [puppet] - https://gerrit.wikimedia.org/r/244647 (https://phabricator.wikimedia.org/T103134) (owner: Filippo Giunchedi)
[11:10:33] <wikibugs>	 operations, RESTBase-Cassandra, Patch-For-Review: puppet should safely manage cassandra start/stop - https://phabricator.wikimedia.org/T103134#1785178 (fgiunchedi) ok, looks like there's agreement on going with the `systemctl mask` idea, the same can be applied to restbase (easier once converted to sys...
[11:15:22] <jynus>	 1.5 hours to clone a full db server. Maybe I can improve that?
[11:20:02] <godog>	 moar parallelism
[11:22:33] <jynus>	 I do not think that sending a tar.gz in parallel will win much :-)
[11:22:51] <jynus>	 but I already get a x4 or x5 improvement in bytes sent
[11:26:39] <hashar>	 godog: hello :)  I got python-os-client-config prepared for jessie-wikimedia/backports at https://phabricator.wikimedia.org/T104967#1773653  in case you missed the mail notification
[11:27:26] <mark>	 jynus: oh well, 3 days to get my home server recovered from a single drive crash ;p
[11:27:50] <godog>	 hashar: yeah I saw that, I should be able to get to it today or tomorrow
[11:27:50] <jynus>	 not ssd, I suppose
[11:27:56] <mark>	 no
[11:28:18] <mark>	 disks are sloowwww
[11:28:19] <jynus>	 I do not know what an HD is on my desktop/laptop anymore
[11:28:28] <mark>	 the whole process feels a lot like the big Labs NFS outage
[11:28:32] <jynus>	 I was a costly investment
[11:28:45] <_joe_>	 yes you were!
[11:28:48] <jynus>	 but my quality of life improved xinfinity
[11:28:49] <mark>	 i don't want to buy 8 TB of SSDs for home use :P
[11:28:58] <_joe_>	 mark: 8 TB??
[11:29:02] <jynus>	 why do you need 8 TB?!
[11:29:05] <mark>	 i do have SSD in my laptop, sure
[11:29:39] <_joe_>	 jynus: for his collection of movie backups I guess :P
[11:29:50] <mark>	 i mostly watch netflix these days actually
[11:30:04] <_joe_>	 apart from jokes, I have 2 TB and it holds all my backups
[11:30:18] <_joe_>	 oh I actually have 4 now, scratch that
[11:30:21] <mark>	 it's 8 TB before raid1 ;)
[11:30:24] <mark>	 so 4 TB usable
[11:30:33] <_joe_>	 Time machine needed a bigger disk
[11:30:34] <mark>	 there we go :P
[11:30:38] <jynus>	 yes, I have 3 TB for backups, but I do not need redundancy there
[11:31:00] <mark>	 yeah well, in my power outage, one drive died completely, another one was already a bit flaky
[11:31:07] <mark>	 so going from 3 drives to 1.5 in a raid5 setup wasn't great :P
[11:31:27] <_joe_>	 ewww
[11:31:51] <hashar>	 godog: and if you have any motivation, I could use a backport of python-shade which is blocked by that python-os-client-config  (all of that to bump Nodepool)
[11:33:02] <hashar>	 oh
[11:33:20] <hashar>	 jynus: random question from my coworking place: do we use SSD on our MariaDB servers?
[11:33:35] <hashar>	 we were wondering if one could get a transparent mix of SSD / HD
[11:33:56] <hashar>	 with the more frequently accessed data on the SSD and the rest on the HD
[11:34:00] <wikibugs>	 6operations, 6Revscoring, 6Services, 7service-deployment-requests: New Service Request: ORES - https://phabricator.wikimedia.org/T117560#1785218 (10akosiaris) >>! In T117560#1778582, @yuvipanda wrote: > From talking to @akosiaris during the offsite, we can run both the web and the celery stuff (and redis t...
[11:34:10] <jynus>	 hashar depends on many factors
[11:34:26] <jynus>	 buffer pool hit rate
[11:34:31] <jynus>	 size of working set
[11:34:39] <jynus>	 size of total db
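The factors listed above can be put into rough numbers. What follows is a minimal sketch, not anything the participants ran: it assumes two counters taken from `SHOW GLOBAL STATUS` (`Innodb_buffer_pool_read_requests` for logical reads and `Innodb_buffer_pool_reads` for reads that had to go to disk). The function names and the sample figures are hypothetical.

```python
def buffer_pool_hit_rate(read_requests: int, disk_reads: int) -> float:
    """Fraction of logical reads served from memory rather than disk.

    read_requests: value of Innodb_buffer_pool_read_requests (assumed input)
    disk_reads:    value of Innodb_buffer_pool_reads (assumed input)
    """
    if read_requests <= 0:
        raise ValueError("read_requests must be positive")
    return 1.0 - disk_reads / read_requests


def working_set_fits(working_set_gb: float, buffer_pool_gb: float) -> bool:
    """True when the hot data is no larger than the buffer pool."""
    return working_set_gb <= buffer_pool_gb


# Hypothetical sample figures, for illustration only.
rate = buffer_pool_hit_rate(read_requests=1_000_000, disk_reads=5_000)
print(f"hit rate: {rate:.3%}")
print("working set fits in memory:", working_set_fits(80, 128))
```

When the hit rate is high and the working set fits in the buffer pool, most reads never reach the disks, which is one reason the storage question has no single answer.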
[11:34:52] <_joe_>	 hashar: you'll never get a straight answer to such a question from a dba, since it truly depends on many factors. there is no silver bullet.
[11:35:10] <hashar>	 I can imagine
[11:35:17] <jynus>	 well, the question was: do we use SSD on our MariaDB servers?
[11:35:22] <hashar>	 but still wondering whether we have SSD
[11:35:23] <hashar>	 yeah
[11:35:25] <jynus>	 the  answer is yes
[11:35:27] <jynus>	 :-)
[11:35:37] <hashar>	 good enough for the DB / disk io  newbie I am
[11:35:45] <moritzm>	 hashar: if you want a better answer, you need to optimise the query  :-)
[11:35:52] <jynus>	 ha ha
[11:35:54] <_joe_>	 hashar: and about a mix of ssds/hds of course you can
[11:36:12] <jynus>	 facebook worked on a diskcache implementation
[11:36:18] <jynus>	 I do not know the state of that
[11:36:37] <jynus>	 you can do a poor man's substitution
[11:36:46] <jynus>	 putting certain tables on a different disk
[11:37:03] <godog>	 a while ago I played with linux hybrid caching, bcache and lvm native, not impressed with the latter so far but bcache seemed to work ok
[11:37:09] <jynus>	 or you know, do something at disk level, but that can have mixed results
[11:37:37] <godog>	 (this https://phabricator.wikimedia.org/T88992)
[11:37:42] <jynus>	 in most cases, investing in memory is more productive for the buck, but more expensive
[11:39:00] <jynus>	 godog, problem is that with mysql things get more complex, there are hot and cold areas even within files
[11:39:26] <hashar>	 I've been asking because some Apple Mac Minis have an SSD/HD system that is seen as a single disk. The OS takes care of offloading least recently used / big files to the HD
[11:40:07] <_joe_>	 hashar: that is the diskcache jynus was referring to, I was ofc referring to putting hot tables on SSDs directly :)
[11:40:16] <godog>	 jynus: heh I'm not sure how it handles that, it might be blockwise
[11:40:33] <jynus>	 https://www.facebook.com/notes/mysql-at-facebook/releasing-flashcache/388112370932
[11:40:47] <hashar>	 so do we purely use SSD or do we spread / shard tables between SSD and HD?
[11:41:54] <jynus>	 hashar, that is a money answer
[11:42:48] <jynus>	 if you have the money, going to SSDs is always going to be better
[11:43:28] <jynus>	 except for some wear issues with the doublewrite area that some people have experienced
[11:43:43] <jynus>	 but it requires tuning
[11:43:56] <hashar>	 I should ask the MariaDB folks here again and come back with useful tech / doc instead of mumbling :/
[11:44:01] <jynus>	 by default, mysql is tuned for HDs, so it does many sequential scans
[11:44:39] <jynus>	 for example, the transaction log is pure sequential writes
[11:45:07] <jynus>	 also, for example, we use RAID cache, which changes things a lot
[11:46:04] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: noc: remove unused ssl settings [puppet] - 10https://gerrit.wikimedia.org/r/251221 
[11:46:06] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: noc: remove mod_cgi [puppet] - 10https://gerrit.wikimedia.org/r/251222 
[11:46:08] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: noc: remove mod_userdir inclusion [puppet] - 10https://gerrit.wikimedia.org/r/251223 
[11:46:10] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: mediawiki: extract php config from web::modules, use in noc [puppet] - 10https://gerrit.wikimedia.org/r/251224 
[11:46:12] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: noc: make noc virtualhost compatible with apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/251225 
[11:46:14] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: noc: add role to mw1152 [puppet] - 10https://gerrit.wikimedia.org/r/251226 
[11:46:16] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: noc: point misc-varnish to mw1152 for both noc and dbtree [puppet] - 10https://gerrit.wikimedia.org/r/251227 
[11:48:04] <jynus>	 and to be fair, moritz's answer is the right one 90% of the time
[11:49:19] <jynus>	 in my career as a consultant, only once did I return a report saying: your queries are perfect, we can only do things in hardware/rearchitecture
[11:50:46] <_joe_>	 wow I am pretty amazed that actually happened
[11:51:11] <_joe_>	 in my experience, you have to constantly analyze and optimize your queries as your dataset evolves/grows
[11:51:30] <_joe_>	 so if someone can manage to keep a pristine record, it's pretty impressive
[11:51:38] <jynus>	 yes, basically they had been working before with an oracle employee
[11:51:59] <jynus>	 so that explains the mystery
[11:52:52] <jynus>	 it is the same as with programming- performance optimization never finishes, you only do the quick wins first
[11:54:20] <jynus>	 we are actually there on the external storage
[11:54:39] <grrrit-wm>	 (03PS2) 10Giuseppe Lavagetto: noc: remove unused ssl settings [puppet] - 10https://gerrit.wikimedia.org/r/251221 
[12:04:33] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 2 below the confidence bounds
[12:05:41] <jynus>	 the 12h bump
[12:06:30] <hashar>	 is that due to some cache expiring?
[12:06:50] <grrrit-wm>	 (03PS1) 10Filippo Giunchedi: fully deprovision tungsten [puppet] - 10https://gerrit.wikimedia.org/r/251228 (https://phabricator.wikimedia.org/T97274) 
[12:06:57] <jynus>	 as far as I know it is request-based
[12:07:26] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] fully deprovision tungsten [puppet] - 10https://gerrit.wikimedia.org/r/251228 (https://phabricator.wikimedia.org/T97274) (owner: 10Filippo Giunchedi)
[12:07:34] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Create a define to register extra LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/251229 (https://phabricator.wikimedia.org/T101299) 
[12:08:22] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Create a define to register extra LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/251229 (https://phabricator.wikimedia.org/T101299) (owner: 10Muehlenhoff)
[12:10:15] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Create a define to register extra LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/251229 (https://phabricator.wikimedia.org/T101299) 
[12:11:06] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Create a define to register extra LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/251229 (https://phabricator.wikimedia.org/T101299) (owner: 10Muehlenhoff)
[12:18:20] <grrrit-wm>	 (03PS3) 10Giuseppe Lavagetto: noc: remove unused ssl settings [puppet] - 10https://gerrit.wikimedia.org/r/251221 
[12:18:51] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] noc: remove unused ssl settings [puppet] - 10https://gerrit.wikimedia.org/r/251221 (owner: 10Giuseppe Lavagetto)
[12:26:48] <grrrit-wm>	 (03PS2) 10Giuseppe Lavagetto: noc: remove mod_cgi [puppet] - 10https://gerrit.wikimedia.org/r/251222 
[12:27:37] <grrrit-wm>	 (03PS3) 10Muehlenhoff: Create a define to register extra LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/251229 (https://phabricator.wikimedia.org/T101299) 
[12:28:36] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Create a define to register extra LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/251229 (https://phabricator.wikimedia.org/T101299) (owner: 10Muehlenhoff)
[12:30:47] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] noc: remove mod_cgi [puppet] - 10https://gerrit.wikimedia.org/r/251222 (owner: 10Giuseppe Lavagetto)
[12:33:18] <_joe_>	 !log manually disabling mod_cgi on terbium
[12:33:24] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:38:12] <grrrit-wm>	 (03PS2) 10Giuseppe Lavagetto: noc: remove mod_userdir inclusion [puppet] - 10https://gerrit.wikimedia.org/r/251223 
[12:38:32] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "noop according to the puppet compiler" [puppet] - 10https://gerrit.wikimedia.org/r/251223 (owner: 10Giuseppe Lavagetto)
[12:40:54] <icinga-wm>	 PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: puppet fail
[12:42:13] <grrrit-wm>	 (03PS2) 10Giuseppe Lavagetto: mediawiki: extract php config from web::modules, use in noc [puppet] - 10https://gerrit.wikimedia.org/r/251224 
[12:43:27] <wikibugs>	 6operations, 6Revscoring, 6Services, 7service-deployment-requests: New Service Request: ORES - https://phabricator.wikimedia.org/T117560#1785348 (10mobrovac) >>! In T117560#1783344, @GWicke wrote: > @halfak, it's a general concern, but something computationally intense and research-driven like ORES is espe...
[12:47:57] <grrrit-wm>	 (03CR) 10BBlack: [C: 04-1] "The more I think about this, the more I'm concerned about the DHE>1024 compatibility issue. Probably any client old/crappy enough that DHE+" [puppet] - 10https://gerrit.wikimedia.org/r/251153 (owner: 10BBlack)
[12:49:39] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: extract php config from web::modules, use in noc [puppet] - 10https://gerrit.wikimedia.org/r/251224 (owner: 10Giuseppe Lavagetto)
[12:50:39] <wikibugs>	 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1785355 (10hashar) @robh scandium has been installed with Trusty.  Would need to reimage it to Jessie instead (sorry).  Some firewall rules have been...
[12:51:09] <wikibugs>	 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1785357 (10hashar)
[12:54:51] <grrrit-wm>	 (03PS2) 10Giuseppe Lavagetto: noc: make noc virtualhost compatible with apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/251225 
[12:57:28] <grrrit-wm>	 (03PS4) 10Muehlenhoff: Create a define to register extra LDAP schemas [puppet] - 10https://gerrit.wikimedia.org/r/251229 (https://phabricator.wikimedia.org/T101299) 
[12:57:52] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] noc: make noc virtualhost compatible with apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/251225 (owner: 10Giuseppe Lavagetto)
[12:57:59] <grrrit-wm>	 (03CR) 10Mobrovac: restbase: move to systemd unit file (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/244647 (https://phabricator.wikimedia.org/T103134) (owner: 10Filippo Giunchedi)
[13:02:12] <grrrit-wm>	 (03PS2) 10Giuseppe Lavagetto: noc: add role to mw1152 [puppet] - 10https://gerrit.wikimedia.org/r/251226 
[13:05:24] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] noc: add role to mw1152 [puppet] - 10https://gerrit.wikimedia.org/r/251226 (owner: 10Giuseppe Lavagetto)
[13:06:34] <icinga-wm>	 RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[13:12:15] <icinga-wm>	 PROBLEM - puppet last run on mw1152 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:12:48] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: noc: puppetize dbtree directories [puppet] - 10https://gerrit.wikimedia.org/r/251233 
[13:13:16] <Krenair>	 _joe_, did you give mw1152 the same sort of network and mysql access as terbium?
[13:13:36] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] noc: puppetize dbtree directories [puppet] - 10https://gerrit.wikimedia.org/r/251233 (owner: 10Giuseppe Lavagetto)
[13:14:03] <_joe_>	 Krenair: what do you mean?
[13:15:26] <_joe_>	 is there some specific special access terbium has you are aware of?
[13:17:16] <Krenair>	 manifests/role/mariadb.pp:        srange => '@resolve((tin.eqiad.wmnet mira.codfw.wmnet terbium.eqiad.wmnet))',
[13:17:17] <Krenair>	 manifests/role/ganglia.pp:                        '10.64.32.13', # terbium
[13:17:40] <Krenair>	 maybe this: modules/ganglia/templates/deprecated/gmetad.conf.erb:trusted_hosts 208.80.152.165 208.80.154.149 208.80.154.14 10.64.32.13 #bastions, neon, terbium
[13:18:13] <icinga-wm>	 RECOVERY - puppet last run on mw1152 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:18:39] <_joe_>	 Krenair: the ganglia thing doesn't make sense anymore
[13:18:46] <Krenair>	 ok
[13:19:18] <_joe_>	 and for mariadb, I am aware of that and it's an upcoming patch
[13:20:24] <_joe_>	 ganglia I wasn't, tbh
[13:21:38] <wikibugs>	 6operations, 6Labs, 10wikitech.wikimedia.org, 7Wikimedia-log-errors: RunJobs.php fails to be executed on labswiki - https://phabricator.wikimedia.org/T117394#1785377 (10Krenair) >>! In T117394#1774385, @Krenair wrote: > IIRC, labswiki jobs are supposed to be running locally on silver only...  Actually, we...
[13:22:07] <_joe_>	 Krenair: it's an upcoming patch as I'm not sure that is needed anymore as well
[13:26:14] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Repool db2051; Depool db2042, 38, 39, 40 for cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251234 (owner: 10Jcrespo)
[13:28:54] <icinga-wm>	 PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: puppet fail
[13:34:27] <wikibugs>	 6operations, 10Wikimedia-Planet, 10procurement: ssl certificate renewal: *.planet.wikimedia.org - https://phabricator.wikimedia.org/T117866#1785390 (10RobH) 3NEW a:3mark
[13:34:35] <icinga-wm>	 PROBLEM - check_puppetrun on payments2001 is CRITICAL: CRITICAL: puppet fail
[13:35:44] <Jeff_Green>	 ^^^ check_puppetrun alerts noted, just a puppetmaster reboot
[13:39:25] <icinga-wm>	 RECOVERY - check_puppetrun on payments2001 is OK: OK: Puppet is currently enabled, last run 94 seconds ago with 0 failures
[13:40:17] <icinga-wm>	 RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[13:40:25] <icinga-wm>	 PROBLEM - puppet last run on mw2154 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:40:54] <wikibugs>	 6operations, 6Labs, 10wikitech.wikimedia.org, 7Wikimedia-log-errors: RunJobs.php fails to be executed on labswiki - https://phabricator.wikimedia.org/T117394#1785403 (10jcrespo) These were the first occurrences:  ```  {   "_index": "logstash-2015.10.31",   "_type": "mediawiki",   "_id": "AVC8f0N1lAIL90ZzMe...
[13:43:56] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2051; Depool db2042, 38, 39, 40 for cloning (duration: 00m 18s)
[13:44:02] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:57:40] <jynus>	 !log shutting down mysql and cloning db2042 -> db2062, db2038 -> db2059, db2039 -> db2060, db2040 -> db2061
[13:57:46] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:05:56] <icinga-wm>	 RECOVERY - puppet last run on mw2154 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:07:02] <grrrit-wm>	 (03CR) 10Faidon Liambotis: "This was never really the case. The "backports" section of our repository exists for backports that don't exist in backports in Debian and" [puppet] - 10https://gerrit.wikimedia.org/r/251042 (owner: 10Ori.livneh)
[14:08:16] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 04-1] "Also, -1 again because 3/4 of the previous comments went unanswered, guessing that ori missed them since they were on the commit message." [puppet] - 10https://gerrit.wikimedia.org/r/251042 (owner: 10Ori.livneh)
[14:12:26] <jynus>	 !log reinstalling db2056.codfw.wmnet
[14:12:31] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:22:26] <wikibugs>	 6operations, 10hardware-requests, 7Performance: Refresh Parser cache servers pc1001-pc1003 - https://phabricator.wikimedia.org/T111777#1785440 (10RobH)
[14:22:43] <wikibugs>	 6operations, 10hardware-requests, 7Performance: Refresh Parser cache servers pc1001-pc1003 - https://phabricator.wikimedia.org/T111777#1616223 (10RobH)
[14:25:06] <wikibugs>	 6operations, 10hardware-requests, 7Performance: Refresh Parser cache servers pc1001-pc1003 - https://phabricator.wikimedia.org/T111777#1785456 (10RobH)
[14:43:16] <icinga-wm>	 PROBLEM - Disk space on praseodymium is CRITICAL: DISK CRITICAL - free space: /var 13294 MB (3% inode=99%)
[14:44:12] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: Add loopback for cr2-esams [dns] - 10https://gerrit.wikimedia.org/r/251242 
[14:44:14] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: Allocate subnets/VLANs for cr2-esams neighbor links [dns] - 10https://gerrit.wikimedia.org/r/251243 
[14:48:51] <wikibugs>	 6operations, 10Analytics, 6Analytics-Kanban, 6Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1785475 (10Ottomata) > Until very recently (last days), there wasn't actually an EventBus-like REST proxy with schema validation in the EventLogging repository.  Not quite true, t...
[14:49:48] <grrrit-wm>	 (03CR) 10Hashar: "check experimental" [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/237380 (https://phabricator.wikimedia.org/T111377) (owner: 10Hashar)
[14:50:48] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] swift: don't wait for data disks on boot [puppet] - 10https://gerrit.wikimedia.org/r/250968 (https://phabricator.wikimedia.org/T107416) (owner: 10Filippo Giunchedi)
[14:53:16] <icinga-wm>	 PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: puppet fail
[14:55:20] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] Add loopback for cr2-esams [dns] - 10https://gerrit.wikimedia.org/r/251242 (owner: 10Faidon Liambotis)
[14:56:17] <grrrit-wm>	 (03PS5) 10Giuseppe Lavagetto: gdash: deprecate reqerror dashboard, minor correction [puppet] - 10https://gerrit.wikimedia.org/r/250395 
[14:56:30] <icinga-wm>	 ACKNOWLEDGEMENT - Disk space on praseodymium is CRITICAL: DISK CRITICAL - free space: /var 7402 MB (2% inode=99%): Filippo Giunchedi looking, effect of nodetool removenode
[14:57:30] <grrrit-wm>	 (03PS2) 10Faidon Liambotis: Allocate subnets/VLANs for cr2-esams neighbor links [dns] - 10https://gerrit.wikimedia.org/r/251243 
[14:57:58] <grrrit-wm>	 (03PS2) 10BBlack: ssl_ciphersuite: add DHE+3DES option only for "mid" [puppet] - 10https://gerrit.wikimedia.org/r/251153 
[14:58:53] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] Allocate subnets/VLANs for cr2-esams neighbor links [dns] - 10https://gerrit.wikimedia.org/r/251243 (owner: 10Faidon Liambotis)
[14:59:37] <wikibugs>	 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, and 2 others: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1785481 (10Jgreen) > Maybe we can doctor the last old files and the first new files by hand, so that they splice nearly perf...
[15:00:03] <grrrit-wm>	 (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/251245 
[15:00:17] <icinga-wm>	 PROBLEM - Check size of conntrack table on chromium is CRITICAL: CRITICAL: nf_conntrack is 92 % full
[15:00:46] <icinga-wm>	 RECOVERY - Disk space on praseodymium is OK: DISK OK
[15:00:50] <grrrit-wm>	 (03CR) 10Hashar: "check experimental" [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/251245 (owner: 10Hashar)
[15:01:08] <grrrit-wm>	 (03CR) 10BBlack: [C: 04-1] "This also needs appropriate backend definition stuff in the "directors" (which references the real "mw1152.eqiad.wmnet" uses a label like " [puppet] - 10https://gerrit.wikimedia.org/r/251227 (owner: 10Giuseppe Lavagetto)
[15:02:17] <icinga-wm>	 RECOVERY - Check size of conntrack table on chromium is OK: OK: nf_conntrack is 1 % full
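The "nf_conntrack is N % full" figure in the two alerts above is a ratio of two kernel counters. A minimal sketch of that arithmetic follows, assuming the standard Linux files `/proc/sys/net/netfilter/nf_conntrack_count` and `/proc/sys/net/netfilter/nf_conntrack_max`. The function names and the 80/90 thresholds are hypothetical and are not taken from the actual Icinga check.

```python
from pathlib import Path

# Standard Linux netfilter sysctl directory (present when conntrack is loaded).
CONNTRACK_DIR = Path("/proc/sys/net/netfilter")


def percent_full(count: int, maximum: int) -> float:
    """Share of the conntrack table in use, as a percentage."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return 100.0 * count / maximum


def classify(pct: float, warn: float = 80.0, crit: float = 90.0) -> str:
    """Map a fill percentage to a status; thresholds are assumed values."""
    if pct >= crit:
        return "CRITICAL"
    if pct >= warn:
        return "WARNING"
    return "OK"


def read_conntrack(base: Path = CONNTRACK_DIR) -> tuple[int, int]:
    """Read (count, max) from the kernel; only works on a host with conntrack."""
    count = int((base / "nf_conntrack_count").read_text())
    maximum = int((base / "nf_conntrack_max").read_text())
    return count, maximum


# Values chosen to mirror the two alerts above (92 % and 1 %).
for count, maximum in [(92, 100), (1, 100)]:
    pct = percent_full(count, maximum)
    print(f"{classify(pct)}: nf_conntrack is {pct:.0f} % full")
```

A swing from 92 % to 1 % within two minutes, as seen here, points to a short burst of tracked connections rather than sustained load.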
[15:03:02] <jynus>	 I was going to say: I didn't see much going on on chromium
[15:05:03] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: don't wait for data disks on boot [puppet] - 10https://gerrit.wikimedia.org/r/250968 (https://phabricator.wikimedia.org/T107416) (owner: 10Filippo Giunchedi)
[15:06:43] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: swift: don't wait for data disks on boot [puppet] - 10https://gerrit.wikimedia.org/r/250968 (https://phabricator.wikimedia.org/T107416) 
[15:06:53] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [V: 032] swift: don't wait for data disks on boot [puppet] - 10https://gerrit.wikimedia.org/r/250968 (https://phabricator.wikimedia.org/T107416) (owner: 10Filippo Giunchedi)
[15:07:08] <grrrit-wm>	 (03PS2) 10Giuseppe Lavagetto: noc: point misc-varnish to mw1152 for both noc and dbtree [puppet] - 10https://gerrit.wikimedia.org/r/251227 
[15:08:57] <grrrit-wm>	 (03CR) 10Hashar: "check experimental" [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/251245 (owner: 10Hashar)
[15:12:47] <wikibugs>	 7Puppet, 6operations, 5Patch-For-Review: merge swift_new and swift puppet modules/classes - https://phabricator.wikimedia.org/T107416#1785509 (10fgiunchedi) 5Open>3Resolved all done, `swift` and `swift_new` have been merged by @faidon and `nobootwait` added
[15:13:21] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 04-1] exim: Add and use $::other_site to provide LDAP fallback (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/249868 (https://phabricator.wikimedia.org/T82662) (owner: 10Alexandros Kosiaris)
[15:13:34] <grrrit-wm>	 (03CR) 10Hashar: "check experimental" [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/251245 (owner: 10Hashar)
[15:14:36] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 031] "Sounds fine to me." [puppet] - 10https://gerrit.wikimedia.org/r/250395 (owner: 10Giuseppe Lavagetto)
[15:15:11] <grrrit-wm>	 (03PS6) 10Giuseppe Lavagetto: gdash: deprecate reqerror dashboard, minor correction [puppet] - 10https://gerrit.wikimedia.org/r/250395 
[15:16:15] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] gdash: deprecate reqerror dashboard, minor correction [puppet] - 10https://gerrit.wikimedia.org/r/250395 (owner: 10Giuseppe Lavagetto)
[15:16:28] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 04-2] "Honestly, I don't really like a) hardcoding "standard" to the role classes (it doesn't really belong there), b) hardcoding eth0 into the r" [puppet] - 10https://gerrit.wikimedia.org/r/250616 (owner: 10Dzahn)
[15:17:46] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 031] "LGTM" [software/otrs] - 10https://gerrit.wikimedia.org/r/248916 (owner: 10Alexandros Kosiaris)
[15:17:51] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 031] "LGTM" [software/otrs] - 10https://gerrit.wikimedia.org/r/248915 (owner: 10Alexandros Kosiaris)
[15:18:08] <paravoid>	 Krenair: https://gerrit.wikimedia.org/r/#/c/245139/ ?
[15:18:17] <grrrit-wm>	 (03PS1) 10BBlack: config-geo: more of the middle US -> codfw [dns] - 10https://gerrit.wikimedia.org/r/251247 
[15:18:43] <Krenair>	 ?
[15:18:56] <paravoid>	 I responded there a while ago
[15:19:15] <paravoid>	 not sure if you saw that
[15:19:37] <Krenair>	 I saw it, haven't worked on it yet
[15:19:43] <paravoid>	 ok
[15:19:44] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: exim: Add and use $::other_site to provide LDAP fallback (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/249868 (https://phabricator.wikimedia.org/T82662) (owner: 10Alexandros Kosiaris)
[15:19:48] <grrrit-wm>	 (03PS3) 10Giuseppe Lavagetto: noc: point misc-varnish to mw1152 for both noc and dbtree [puppet] - 10https://gerrit.wikimedia.org/r/251227 
[15:20:15] <grrrit-wm>	 (03CR) 10BBlack: [C: 031] noc: point misc-varnish to mw1152 for both noc and dbtree [puppet] - 10https://gerrit.wikimedia.org/r/251227 (owner: 10Giuseppe Lavagetto)
[15:20:46] <icinga-wm>	 RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:22:05] <grrrit-wm>	 (03CR) 10Dzahn: "you say as a reason to not do this that "We generally have both standard and the IPv6 stuff in site.pp for all hosts." but the point is th" [puppet] - 10https://gerrit.wikimedia.org/r/250616 (owner: 10Dzahn)
[15:22:35] <grrrit-wm>	 (03PS2) 10BBlack: config-geo: more of the middle US -> codfw [dns] - 10https://gerrit.wikimedia.org/r/251247 (https://phabricator.wikimedia.org/T114659) 
[15:23:26] <paravoid>	 mutante: yeah well, I disagree with the point :)
[15:23:32] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] noc: point misc-varnish to mw1152 for both noc and dbtree [puppet] - 10https://gerrit.wikimedia.org/r/251227 (owner: 10Giuseppe Lavagetto)
[15:23:45] <paravoid>	 and I don't think the solution is to move this repetition of the ipv6 stanza all over random roles
[15:23:56] <paravoid>	 that stanza is nothing specific to the roles itself
[15:24:09] <_joe_>	 for ipv6 I agree, I am unsure about standard
[15:24:16] <paravoid>	 neither is standard, which btw is defined on site.pp
[15:24:23] <_joe_>	 repeating it in every node seems... wrong
[15:24:30] <paravoid>	 so what you're actually doing makes the modules unusable from anywhere else
[15:24:37] <paravoid>	 the roles, sorry
[15:24:53] <_joe_>	 well, we could move standard to a proper location :)
[15:25:01] <paravoid>	 why?
[15:25:09] <paravoid>	 it's the least of our problems really
[15:25:50] <_joe_>	 no I was saying if that is the reason not to include it
[15:26:01] <paravoid>	 that's one of the reasons for sure
[15:26:09] <_joe_>	 I'm actually totally neutral on the topic
[15:26:23] <_joe_>	 I see pros and cons with both approaches
[15:26:50] <grrrit-wm>	 (03PS2) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/251245 
[15:27:01] <paravoid>	 so how do these roles work on labs right now?
[15:27:09] <paravoid>	 that include standard?
[15:27:20] <_joe_>	 labs has site.pp in its puppet tree
[15:27:34] <paravoid>	 right
[15:27:36] <paravoid>	 ew..
[15:27:40] <_joe_>	 actually, site.pp is the entry point in labs as well ;)
[15:27:53] <_joe_>	 it's puppetlabs!
[15:28:17] <_joe_>	 paravoid: if we properly used environments, maybe that could be an issue
[15:28:20] <_joe_>	 but we don't
[15:28:30] <paravoid>	 I doubt there is a way to "properly use environments"
[15:28:37] <grrrit-wm>	 (03CR) 10Hashar: "check experimental" [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/251245 (owner: 10Hashar)
[15:28:56] <paravoid>	 but I'd be willing to be convinced otherwise ;)
[15:30:02] <_joe_>	 well for example it's possible to test patches to production classes/modules without actually needing to merge them. We could use it to get rid of 99% of self-hosted puppetmasters in labs
[15:34:41] <grrrit-wm>	 (03PS3) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/251245 
[15:34:59] <grrrit-wm>	 (03CR) 10Hashar: "check experimental" [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/251245 (owner: 10Hashar)
[15:41:22] <wikibugs>	 6operations, 10CirrusSearch, 6Discovery, 5Patch-For-Review: Only use newer (elastic10{16..31}) servers as master capable elasticsearch nodes - https://phabricator.wikimedia.org/T112556#1785591 (10chasemp) Is there a disadvantage to having 4 eligible masters?  I know we have a minimum viability setting righ...
[15:41:58] <wikibugs>	 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, and 2 others: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1785592 (10Pcoombe) @awight Sounds like it would be safest to just take campaigns down if it's only for a short window. Plea...
[15:43:22] <grrrit-wm>	 (03CR) 10Rush: [C: 031] "sounds good" [puppet] - 10https://gerrit.wikimedia.org/r/251019 (owner: 10Muehlenhoff)
[15:45:38] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: mw1152: allow connecting to labswiki database [puppet] - 10https://gerrit.wikimedia.org/r/251250 
[15:46:33] <_joe_>	 jynus: ^^
[15:46:53] <grrrit-wm>	 (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/251245 (owner: 10Hashar)
[15:47:43] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 04-1] "we do not need to add terbium." [puppet] - 10https://gerrit.wikimedia.org/r/251250 (owner: 10Giuseppe Lavagetto)
[15:48:05] <grrrit-wm>	 (03PS2) 10Giuseppe Lavagetto: mw1152: allow connecting to labswiki database [puppet] - 10https://gerrit.wikimedia.org/r/251250 
[15:48:07] <_joe_>	 jynus: heh I already noticed
[15:48:08] <_joe_>	 :)
[15:48:18] <jynus>	 10.x
[15:48:24] <jynus>	 has access
[15:48:28] <_joe_>	 oh
[15:48:41] <_joe_>	 so it's just a matter of firewall I guess
[15:48:52] <_joe_>	 and that grants file is... useless?
[15:49:24] <jynus>	 no, that is needed
[15:49:55] <_joe_>	 so what's the issue? just the ip wrong in the grant? I already corrected it
[15:50:45] <icinga-wm>	 RECOVERY - Host db2034 is UP: PING WARNING - Packet loss = 64%, RTA = 34.79 ms
[15:51:27] <_joe_>	 jynus: so... is the patch now correct?
[15:53:46] <jynus>	 I do not know what you are referring to
[15:53:52] <jynus>	 but by parts
[15:54:58] <jynus>	 mw1152.eqiad.wmnet doesn't need to be added to the firewall
[15:55:08] <grrrit-wm>	 (03PS1) 10Dzahn: ldap: split client.pp into one file per class [puppet] - 10https://gerrit.wikimedia.org/r/251251 
[15:55:10] <_joe_>	 it doesn't? why?
[15:55:21] <jynus>	 because it is already included on the list of hosts that can access mysqls
[15:55:21] <wikibugs>	 6operations, 6Labs, 10Labs-Infrastructure: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1785605 (10chasemp) a:5RobH>3Papaul for https://phabricator.wikimedia.org/T117097#1783632  thanks papaul
[15:55:54] <_joe_>	 jynus: including silver?
[15:56:02] <_joe_>	 I would've expected otherwise
[15:56:08] <_joe_>	 from puppet at least
[15:56:09] <jynus>	 let me recheck, but I would say yes
[15:56:54] <jynus>	 nom you are right
[15:57:06] <icinga-wm>	 PROBLEM - Host db2034 is DOWN: PING CRITICAL - Packet loss = 100%
[15:57:06] <jynus>	 no, you are right, it is separated
[15:59:58] <_joe_>	 ok, so I'll go on with merging that patch
[16:00:05] <jouncebot>	 anomie ostriches thcipriani marktraceur: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151105T1600). Please do the needful.
[16:00:05] <jouncebot>	 Luke081515: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[16:00:08] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 031] mw1152: allow connecting to labswiki database [puppet] - 10https://gerrit.wikimedia.org/r/251250 (owner: 10Giuseppe Lavagetto)
[16:00:25] <wikibugs>	 6operations, 10Traffic, 5Patch-For-Review: Re-balance North American traffic in DNS for codfw - https://phabricator.wikimedia.org/T114659#1785610 (10BBlack) This is what the US States look like, assuming patch 251247 above is applied:  {F2918924}
[16:02:05] <wikibugs>	 6operations, 10Traffic, 5Patch-For-Review: Re-balance North American traffic in DNS for codfw - https://phabricator.wikimedia.org/T114659#1785615 (10chasemp) awesome
[16:02:16] <icinga-wm>	 PROBLEM - swift codfw-prod object availability on graphite1001 is CRITICAL: CRITICAL: 37.50% of data under the critical threshold [90.0]
[16:02:41] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] mw1152: allow connecting to labswiki database [puppet] - 10https://gerrit.wikimedia.org/r/251250 (owner: 10Giuseppe Lavagetto)
[16:03:32] <jzerebecki>	 jouncebot: why didn't you ping me!
[16:03:46] <icinga-wm>	 ACKNOWLEDGEMENT - swift codfw-prod object availability on graphite1001 is CRITICAL: CRITICAL: 37.50% of data under the critical threshold [90.0] Filippo Giunchedi codfw swift expansion in progress
[16:04:12] <wikibugs>	 6operations, 10ops-codfw: es2010 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T117848#1785625 (10Cmjohnson) This server is out of warranty.  Is there a plan to replace these in the near term? I can send spare disks from eqiad to codfw if needed.
[16:04:20] <thcipriani>	 whoops, SWAT time. Luke081515|away jzerebecki matt_flaschen ready?
[16:04:41] <matt_flaschen>	 Yep
[16:04:49] <grrrit-wm>	 (03PS1) 10Dzahn: osm: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/251254 
[16:04:52] <jzerebecki>	 y
[16:05:06] <wikibugs>	 6operations, 10ops-codfw: db2034 host crashed; mgmt interface unavailable (needs reset and hw check) - https://phabricator.wikimedia.org/T117858#1785627 (10Cmjohnson) a:3Papaul Papaul,  Could you please troubleshoot this before you leave.  Thanks
[16:05:08] <_joe_>	 jynus: {{done}}, I just need the grant to be in effect now :)
[16:05:37] <jynus>	 1 sec
[16:05:55] <papaul>	 cmjohnson1: already working on it
[16:06:52] <jynus>	 curl silver.wikimedia.org:3306 works, grant should too
[16:07:59] <_joe_>	 ok thanks
[16:08:26] <hashar>	  http://cdn.debian.net/debian <-- looks nice and modern
[16:09:35] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [5000000.0]
[16:09:54] <bd808>	 jzerebecki: I think jouncebot got confused because there are <p> wrappers in the DOM for that SWAT section. The parser it uses is pretty sensitive to the DOM output.
[16:10:39] * bd808 will look at it
[16:11:36] <jzerebecki>	 hashar: use http://httpredir.debian.org/ though it sometimes has errors but recovers on retry
[16:12:41] <grrrit-wm>	 (03PS1) 10BBlack: config-geo: ulsfo fallback to codfw before eqiad [dns] - 10https://gerrit.wikimedia.org/r/251255 
[16:13:26] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[16:13:29] <grrrit-wm>	 (03CR) 10Hashar: "check experimental" [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/251245 (owner: 10Hashar)
[16:13:38] <wikibugs>	 6operations, 5Continuous-Integration-Scaling, 7Nodepool: Backport python-shade from debian/testing to jessie-wikimedia - https://phabricator.wikimedia.org/T107267#1785699 (10fgiunchedi)
[16:13:40] <wikibugs>	 6operations, 5Continuous-Integration-Scaling: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1785696 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi ok I've uploaded `python-os-client-config`  ``` root@carbon:~# reprepro -C backports...
[16:13:49] <thcipriani>	 Krenair: what is the "weird-rebase" file here?
[16:14:04] <wikibugs>	 6operations, 5Continuous-Integration-Scaling, 7Nodepool: Backport python-shade from debian/testing to jessie-wikimedia - https://phabricator.wikimedia.org/T107267#1491090 (10fgiunchedi) all dependencies should be available now internally, please try to backport
[16:15:32] <thcipriani>	 James_F: it doesn't seem like your evening swat thing was deployed, is that right?
[16:15:39] <wikibugs>	 6operations, 10ops-codfw: es2010 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T117848#1785703 (10Cmjohnson) To purchase new disks replacements from newegg, the disks are Approx $244.00 each.  I have 8 decommissioned ES hosts in eqiad that have those disks.  I can send a dozen or so disks to...
[16:16:32] <grrrit-wm>	 (03CR) 10Hashar: "check experimental" [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/251245 (owner: 10Hashar)
[16:20:33] <wikibugs>	 6operations, 10Analytics, 6Analytics-Kanban, 6Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1785720 (10faidon) More importantly, I don't understand why this is something Andrew has to do (and "soon") and not the services team "or else".  Why is it a given that the Servic...
[16:22:16] <thcipriani>	 hmm we seem to be 22 commits ahead of wmf/1.27.0-wmf.5 and only one is marked security...these don't seem to be deployed.
[16:23:17] <jzerebecki>	 that seems to be related to the weird-rebase file
[16:23:32] <bd808>	 thcipriani: Krenair and AaronSchulz were talking about that last night
[16:24:06] <bd808>	 AaronSchulz thought it needed to be reset to upstream and the one sec patch reapplied
[16:24:09] <thcipriani>	 yeah, I was just reading back scroll
[16:24:26] <icinga-wm>	 RECOVERY - Host db2034 is UP: PING OK - Packet loss = 0%, RTA = 34.41 ms
[16:24:28] <thcipriani>	 that _does_ seem like the right thing to do here.
[16:24:42] <thcipriani>	 kk, doing that.
[16:26:42] <grrrit-wm>	 (03PS2) 10BBlack: config-geo: ulsfo fallback to codfw before eqiad [dns] - 10https://gerrit.wikimedia.org/r/251255 (https://phabricator.wikimedia.org/T114659) 
[16:27:59] <wikibugs>	 6operations, 10Analytics, 6Analytics-Kanban, 6Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1785745 (10Nuria)   > As mentioned, we might want to use a single node process exposing parsoid, restbase & eventbus for small (third party) installs, but might as well use the ne...
[16:28:36] <icinga-wm>	 PROBLEM - Host db2034 is DOWN: PING CRITICAL - Packet loss = 100%
[16:31:22] <jynus>	 I am not going to ack^ db2034 for now- I think papaul is working on it
[16:32:10] <papaul>	 jynus: yes will let you know 
[16:34:38] <grrrit-wm>	 (03CR) 10Hashar: "check experimental" [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/251245 (owner: 10Hashar)
[16:34:39] <jynus>	 thank you very much!
[16:35:26] <wikibugs>	 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1785750 (10BBlack) It's been over a week since the email, which ended up going out a bit later after the releases than expect...
[16:36:01] <hashar>	 akosiaris: thank you again for the puppet package_builder class and the jessie-wikimedia  WIKIMEDIA=yes stuff :}
[16:36:15] <hashar>	 akosiaris: it works! https://integration.wikimedia.org/ci/job/debian-glue/24/
[16:36:43] <grrrit-wm>	 (03PS3) 10BBlack: HTTPS redirects: remove InstantCommons exception [puppet] - 10https://gerrit.wikimedia.org/r/224557 (https://phabricator.wikimedia.org/T102566) 
[16:36:58] <papaul>	 jynus: just a quick update: when i got here today db2034 was completely powered off
[16:36:59] <jzerebecki>	 thcipriani: note that where i checked the rest of the fleet is at 96d099dab949f5d430c01a7d6bc2d9722f622ed2
[16:37:22] <jzerebecki>	 so this will deploy some new commits
[16:38:54] <jynus>	 papaul, yep, I expected a full crash
[16:39:18] <thcipriani>	 hmm, yeah, there are 7 new commits counting James_F merged last night and the matt_flaschen one merged this morning and not counting security commits.
[16:39:57] <jynus>	 I made a couple, but mediawiki-config only
[16:40:04] <jzerebecki>	 yup
[16:41:36] <thcipriani>	 sigh. OK. I'm going to apply the remainder of the security commits. I think I'm going to let twentyafterfour verify my thinking on the repo and run a full scap as part of the train since this window is almost over and I haven't untangled this ball of wax yet.
[16:42:13] <James_F>	 thcipriani: Yup, Krenair found production in an inconsistent state and didn't deploy, I think.
[16:42:21] <thcipriani>	 repo is mostly cleaned up, but I'm confused how it got in this state and I could use a little more time to sort everything out.
[16:42:36] <thcipriani>	 James_F: kk, thanks for confirming.
[16:43:23] <thcipriani>	 sorry jzerebecki matt_flaschen and Luke081515 I'm going to scrub this SWAT until I get it sorted.
[16:43:58] <Luke081515>	 ok
[16:44:19] <matt_flaschen>	 thcipriani, it's okay.  Let me know if I can help.  PM is fine if you want.
[16:47:39] <twentyafterfour>	 thcipriani: sigh indeed
[16:54:50] <grrrit-wm>	 (03PS1) 10Ori.livneh: Relax and extend the Content-Type regexp that controls gzipping [puppet] - 10https://gerrit.wikimedia.org/r/251268 
[16:54:57] <ori>	 bblack: ^
[16:56:18] <wikibugs>	 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1785779 (10RobH) a:5hashar>3RobH
[16:58:27] <wikibugs>	 6operations, 6Labs, 10Labs-Infrastructure: deployment tracking of codfw labs test cluster - https://phabricator.wikimedia.org/T117097#1785789 (10Papaul) labtestmetal2001               ge-5/0/8    NIC1               ge-5/0/30      NIC2 labtestvirt2001                   ge-5/0/17 NIC1               ge-5/0/ 31...
[17:00:05] <jouncebot>	 akosiaris moritzm: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151105T1700).
[17:03:58] <grrrit-wm>	 (03CR) 10BryanDavis: [C: 04-1] "This is a WMF production cluster-centric change that will break beta cluster and other Labs projects that use these roles." [puppet] - 10https://gerrit.wikimedia.org/r/251019 (owner: 10Muehlenhoff)
[17:11:04] <godog>	 !log nodetool decommission on praseodymium
[17:11:10] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:15:35] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Allow configurable ACLs [puppet] - 10https://gerrit.wikimedia.org/r/251272 (https://phabricator.wikimedia.org/T101299) 
[17:16:50] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Allow configurable ACLs [puppet] - 10https://gerrit.wikimedia.org/r/251272 (https://phabricator.wikimedia.org/T101299) (owner: 10Muehlenhoff)
[17:17:06] <icinga-wm_>	 PROBLEM - cassandra CQL 10.64.16.149:9042 on praseodymium is CRITICAL: Connection refused
[17:17:16] <godog>	 expected ^
[17:17:24] <grrrit-wm>	 (03PS1) 10Filippo Giunchedi: cassandra: add praseodymium instances to seeds [puppet] - 10https://gerrit.wikimedia.org/r/251274 
[17:17:26] <grrrit-wm>	 (03PS1) 10Filippo Giunchedi: cassandra: add praseodymium-a instance [puppet] - 10https://gerrit.wikimedia.org/r/251275 
[17:19:20] <twentyafterfour>	 ok I'm going to run scap to sync thcipriani's morning swat
[17:19:45] <twentyafterfour>	 jzerebecki: matt_flaschen Luke081515 ^ fyi
[17:19:47] <jzerebecki>	 twentyafterfour: would you also include my SWAT patch from this morning?
[17:19:54] <jzerebecki>	 it was not yet merged
[17:19:55] <grrrit-wm>	 (03CR) 10BBlack: [C: 04-1] "Doesn't this miss the previously-matching image/vnd.microsoft.icon?" [puppet] - 10https://gerrit.wikimedia.org/r/251268 (owner: 10Ori.livneh)
[17:19:57] <twentyafterfour>	 yes
[17:20:04] <thcipriani>	 I only merged matt_flaschen 's patch so far for SWAT.
[17:20:08] <twentyafterfour>	 oh
[17:20:38] <twentyafterfour>	 jzerebecki: what's your patch?
[17:20:49] <thcipriani>	 also, James_F 's patch from evening SWAT is there too. Both required submodule bumps and neither of those submodules have been bumped on tin yet.
[17:21:03] <jzerebecki>	 twentyafterfour: https://gerrit.wikimedia.org/r/#q,251237,n,z
[17:21:07] <thcipriani>	 neither VisualEditor nor Flow
[17:21:16] <grrrit-wm>	 (03PS2) 10Muehlenhoff: openldap: Allow configurable ACLs [puppet] - 10https://gerrit.wikimedia.org/r/251272 (https://phabricator.wikimedia.org/T101299) 
[17:23:19] <grrrit-wm>	 (03CR) 10BBlack: [C: 032] HTTPS redirects: remove InstantCommons exception [puppet] - 10https://gerrit.wikimedia.org/r/224557 (https://phabricator.wikimedia.org/T102566) (owner: 10BBlack)
[17:23:40] <jzerebecki>	 twentyafterfour: that will also newly deploy commits by AaronSchulz, bd808 
[17:24:26] <matt_flaschen>	 twentyafterfour, I have a second commit on the schedule too: https://gerrit.wikimedia.org/r/#/c/251246
[17:24:32] <matt_flaschen>	 Which is not merged yet.
[17:24:34] <wikibugs>	 6operations, 10Traffic, 7HTTPS: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1785899 (10BBlack)
[17:24:39] <wikibugs>	 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1785897 (10BBlack) 5Open>3Resolved a:3BBlack
[17:25:25] <icinga-wm_>	 PROBLEM - Host praseodymium is DOWN: PING CRITICAL - Packet loss = 100%
[17:25:56] <bd808>	 jzerebecki: my commit was synced yesterday -- https://tools.wmflabs.org/sal/log/AVDTvYdp1oXzWjit6ReL
[17:26:05] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Update TODO [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/251281 
[17:26:07] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Fix traceback for verbose view of deployment result [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/251282 
[17:26:24] <twentyafterfour>	 matt_flaschen: looking
[17:26:31] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Update TODO [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/251281 (owner: 10Muehlenhoff)
[17:27:03] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Fix traceback for verbose view of deployment result [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/251282 (owner: 10Muehlenhoff)
[17:27:46] <icinga-wm_>	 RECOVERY - Host praseodymium is UP: PING OK - Packet loss = 0%, RTA = 2.42 ms
[17:27:53] <jzerebecki>	 bd808: mw1001 wmf.5 is at 96d099dab949f5d430c01a7d6bc2d9722f622ed2 which does not contain your commit
[17:29:07] <bd808>	 jzerebecki: how can you tell? we don't sync the .git data
[17:30:20] <jzerebecki>	 bd808: ugh. then disregard what I said.
[17:30:35] <bd808>	 The only way you can check for things that have been changed on the live cluster is by looking at the files
[17:30:44] <bd808>	 and mw1111 has my patch applied
[17:31:14] <bd808>	 (I'm not saying this is good, but it is how things work right now)
[17:31:14] <grrrit-wm>	 (03PS1) 10Chad: Remove superprotect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251286 
[17:31:21] <jzerebecki>	 yea I somehow suppressed that we sync individual files
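The point bd808 makes above (the `.git` data is not synced, so the only way to tell whether a patch is live is to look at the files themselves) can be illustrated with a minimal sketch. It assumes you have a local copy of the patched file to compare against; the function names and example paths are hypothetical and not part of scap or any Wikimedia tooling.

```python
import hashlib
from pathlib import Path


def file_digest(path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def patch_applied(deployed_path, expected_path) -> bool:
    """Hypothetical check: the deployed file matches the patched reference copy.

    This compares file contents only. It deliberately ignores git metadata,
    because the commit recorded on a host may not reflect what was synced.
    """
    return file_digest(deployed_path) == file_digest(expected_path)
```

For example, `patch_applied("/path/on/host/Foo.php", "/path/to/checkout/Foo.php")` would report whether the two copies are byte-identical, regardless of which commit either tree claims to be at.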
[17:31:57] <wikibugs>	 6operations, 10ops-codfw: db2034 host crashed; mgmt interface unavailable (needs reset and hw check) - https://phabricator.wikimedia.org/T117858#1785919 (10Papaul) Checked the server, the server was completely off. Power of the server, the iLo configuration were stay in place. I couldn't ssh@localIP but i can...
[17:33:14] <ostriches>	 James_F: Can I get you to revoke +superprotect from https://meta.wikimedia.org/wiki/Special:GlobalGroupPermissions/staff per https://gerrit.wikimedia.org/r/#/c/251286/ and https://www.mediawiki.org/wiki/WMF_Product_Development_Process/2015-11-05?
[17:33:37] <bd808>	 on a completely tangential note, there are 14 deploy branches on tin which seems quite excessive
[17:33:52] <ostriches>	 (or someone else with +sysadmin on meta)
[17:35:01] <greg-g>	 Jamesofur: ^
[17:35:21] <wikibugs>	 6operations, 5Patch-For-Review: Switch to Linux 3.19 by default on jessie hosts - https://phabricator.wikimedia.org/T100773#1785923 (10MoritzMuehlenhoff) This is down to these hosts now:  conf100[1-3] rhenium planet1001
[17:36:26] <JohnFLewis>	 ostriches: asking stewards might be easier and quicker really
[17:36:35] <ostriches>	 #wikimedia-stewards?
[17:36:41] <JohnFLewis>	 Yes
[17:37:03] <twentyafterfour>	 bd808: that is true
[17:38:26] <icinga-wm_>	 RECOVERY - Disk space on labvirt1002 is OK: DISK OK
[17:39:15] <ostriches>	 James_F, Jamesofur: nvm, asking stewards instead.
[17:39:34] <wikibugs>	 6operations, 10ops-codfw: db2034 host crashed; mgmt interface unavailable (needs reset and hw check) - https://phabricator.wikimedia.org/T117858#1785933 (10jcrespo) The issue looks like a network/board problem, right?
[17:39:50] <Jamesofur>	 I assume you want it revoked from staff too :)
[17:40:07] <twentyafterfour>	 ok merging https://gerrit.wikimedia.org/r/#/c/251246
[17:40:10] <Jamesofur>	 (They will likely ask me publicly or privately anyway)
[17:40:18] <JohnFLewis>	 Jamesofur: that's the only group it's on I think
[17:40:48] <JohnFLewis>	 It was only granted to staff unless someone sneaked it into Sysadmin later :)
[17:41:07] <ostriches>	 It's only on staff.
[17:41:12] <ostriches>	 I already checked the groups a few days ago
[17:43:52] <grrrit-wm>	 (03CR) 10Chad: [C: 032] Remove superprotect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251286 (owner: 10Chad)
[17:44:12] <grrrit-wm>	 (03Merged) 10jenkins-bot: Remove superprotect [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251286 (owner: 10Chad)
[17:44:17] <greg-g>	 weee
[17:44:32] <JohnFLewis>	 Let's do the super protect controversy again
[17:44:48] * JohnFLewis reverts saying community consensus wasn't gathered
[17:44:49] <jynus>	 !log Deploying schema change on officewiki - flow (s3)
[17:44:54] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:45:18] <ostriches>	 JohnFLewis: Yeah, let's do this again real soon :)
[17:45:21] <ostriches>	 Next month?
[17:45:29] <greg-g>	 JohnFLewis: where?
[17:45:34] <JohnFLewis>	 Noted on my calendar :)
[17:45:43] <greg-g>	 oh, hah, sorry, misunderstood, stupid multitasking :)
[17:46:11] <JohnFLewis>	 greg-g: where is still valid! I see no RFC with consensus for reverting ;)
[17:46:29] <ostriches>	 No !log from sync?
[17:46:58] <ostriches>	 !log 17:45:13 Synchronized wmf-config/: Remove +superprotect, I579c11a2 (duration: 00m 18s)
[17:47:04] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:47:14] <matt_flaschen>	 Thanks, jynus.
[17:47:43] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: cassandra: add praseodymium instances to seeds [puppet] - 10https://gerrit.wikimedia.org/r/251274 
[17:47:50] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add praseodymium instances to seeds [puppet] - 10https://gerrit.wikimedia.org/r/251274 (owner: 10Filippo Giunchedi)
[17:50:36] <icinga-wm_>	 PROBLEM - puppet last run on xenon is CRITICAL: CRITICAL: Puppet has 1 failures
[17:52:32] <jynus>	 matt_flaschen, if you have 5 minutes, please help me create some traffic on officewiki related to flow
[17:52:49] <jynus>	 if not, we may not notice problems, etc
[17:52:59] <matt_flaschen>	 jynus, sure, like new posts, or just a lot of simultaneous GET requests?
[17:53:18] <jynus>	 nothing too formal, just create, edit a new page
[17:54:02] <jynus>	 this is such a trivial change, that either it is a too obvious error or it works
[17:54:19] <jynus>	 unlike the ES storage change, that will be more complex
[17:54:55] <matt_flaschen>	 jynus, seems fine: https://office.wikimedia.org/wiki/User_talk:Mattflaschen_(WMF)/Flow_Sandbox
[17:55:26] <grrrit-wm>	 (03PS2) 10Dzahn: ldap: split client.pp into one file per class [puppet] - 10https://gerrit.wikimedia.org/r/251251 
[17:55:40] <matt_flaschen>	 Forgot to add links before, that works too though.
[17:57:55] <icinga-wm_>	 RECOVERY - Host db2034 is UP: PING WARNING - Packet loss = 58%, RTA = 34.54 ms
[17:58:55] <jynus>	 Packet loss = 58%, nice
[17:59:08] <jynus>	 that is a 58% more than 0 :-)
[18:00:01] <jynus>	 matt_flaschen, I see no errors on the logs, so let's go with the real thing
[18:00:50] <matt_flaschen>	 jynus, +1
[18:01:08] <grrrit-wm>	 (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/1200/" [puppet] - 10https://gerrit.wikimedia.org/r/251251 (owner: 10Dzahn)
[18:02:50] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: cassandra: add praseodymium-a instance [puppet] - 10https://gerrit.wikimedia.org/r/251275 
[18:02:57] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add praseodymium-a instance [puppet] - 10https://gerrit.wikimedia.org/r/251275 (owner: 10Filippo Giunchedi)
[18:03:37] <jynus>	 so it will be a 1 second write block of flow
[18:03:47] <jynus>	 hopefully the last time we have to do so
[18:04:00] <jynus>	 (thanks to the PK)
[18:04:07] <matt_flaschen>	 Great. :)
[18:04:20] <grrrit-wm>	 (03PS2) 10Dzahn: osm: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/251254 
[18:04:21] <twentyafterfour>	 thcipriani: so which patches were merged for swat?
[18:04:36] <grrrit-wm>	 (03PS3) 10Dzahn: osm: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/251254 
[18:04:46] <matt_flaschen>	 twentyafterfour, he said he only did one of mine (I have two total).
[18:04:54] <matt_flaschen>	 I don't think he merged anyone else's.
[18:04:58] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] "fixes the last "WARNING: unquoted file mode" across the repo" [puppet] - 10https://gerrit.wikimedia.org/r/251254 (owner: 10Dzahn)
[18:05:16] <thcipriani>	 twentyafterfour: yup: just the one of matt_flaschen 's and the one from James_F from evening SWAT
[18:05:19] <jynus>	 !log schema change on x1 - flowdb
[18:05:25] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:05:38] <icinga-wm_>	 PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[18:05:46] <thcipriani>	 twentyafterfour: and it looks like you merged the other of matt_flaschen 's patches for swat
[18:05:49] <jynus>	 (actually I am wrong, it is an 8 second process, but it is still online because it is a column addition)
[18:05:58] <icinga-wm_>	 PROBLEM - Restbase root url on praseodymium is CRITICAL: Connection refused
[18:06:33] <thcipriani>	 twentyafterfour: the VisualEditor submodule changed but not updated was the evening SWAT patch.
[18:06:37] <icinga-wm_>	 PROBLEM - cassandra CQL 10.64.16.149:9042 on praseodymium is CRITICAL: Connection refused
[18:06:56] <twentyafterfour>	 thcipriani: thanks
[18:07:04] <thcipriani>	 np
[18:07:10] <thcipriani>	 thank you!
[18:07:37] <icinga-wm_>	 PROBLEM - puppet last run on praseodymium is CRITICAL: CRITICAL: Puppet has 1 failures
[18:07:52] <jynus>	 traffic seems normal, lag has gone back to 0 and no errors on the log
[18:08:17] <icinga-wm_>	 PROBLEM - service on praseodymium is CRITICAL: CRITICAL - Expecting active but unit cassandra is inactive
[18:09:21] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 031] ldap: split client.pp into one file per class [puppet] - 10https://gerrit.wikimedia.org/r/251251 (owner: 10Dzahn)
[18:09:37] <icinga-wm_>	 RECOVERY - puppet last run on praseodymium is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[18:11:18] <twentyafterfour>	 ok looks like everything from swat is merged and ready to go
[18:11:34] <twentyafterfour>	 anyone else have a patch to deploy before I sync this thing?
[18:13:58] <grrrit-wm>	 (03PS1) 10coren: Make host check_disk alerts optionally critical [puppet] - 10https://gerrit.wikimedia.org/r/251292 
[18:14:07] <Coren>	 chasemp: ^^
[18:15:12] <grrrit-wm>	 (03CR) 10Rush: Make host check_disk alerts optionally critical (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/251292 (owner: 10coren)
[18:16:00] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Make host check_disk alerts optionally critical [puppet] - 10https://gerrit.wikimedia.org/r/251292 (owner: 10coren)
[18:16:14] <grrrit-wm>	 (03CR) 10coren: Make host check_disk alerts optionally critical (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/251292 (owner: 10coren)
[18:16:54] <grrrit-wm>	 (03PS2) 10coren: Make host check_disk alerts optionally critical [puppet] - 10https://gerrit.wikimedia.org/r/251292 
[18:19:31] <grrrit-wm>	 (03PS1) 10Dzahn: puppet-lint: re-enable 'unquoted file modes' check [puppet] - 10https://gerrit.wikimedia.org/r/251295 
[18:21:09] <grrrit-wm>	 (03CR) 10Rush: [C: 031] "as discussed in -labs where a virt box w/ crit disk caused a partial outage" [puppet] - 10https://gerrit.wikimedia.org/r/251292 (owner: 10coren)
[18:21:11] <grrrit-wm>	 (03PS3) 10Rush: Make host check_disk alerts optionally critical [puppet] - 10https://gerrit.wikimedia.org/r/251292 (owner: 10coren)
[18:21:13] <grrrit-wm>	 (03PS2) 10Dzahn: puppet-lint: re-enable 'unquoted file modes' check [puppet] - 10https://gerrit.wikimedia.org/r/251295 
[18:21:16] <icinga-wm_>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 216, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/2/3: down - Core: cr2-codfw:xe-5/0/1 Zayo (SO 580358) {#2909} [10Gbps DWDM]
[18:21:17] <grrrit-wm>	 (03PS1) 10coren: Labs: make disk space alerts for compute nodes paging [puppet] - 10https://gerrit.wikimedia.org/r/251297 
[18:24:39] <grrrit-wm>	 (03CR) 10Dzahn: [C: 031] "+1 for deploying it only to en.wp for now since that is the only thing that was requested and has the time constraint from external reques" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) (owner: 10Jdlrobson)
[18:25:22] <grrrit-wm>	 (03CR) 10coren: [C: 032] Make host check_disk alerts optionally critical [puppet] - 10https://gerrit.wikimedia.org/r/251292 (owner: 10coren)
[18:25:53] <grrrit-wm>	 (03PS2) 10Ori.livneh: Relax and extend the Content-Type regexp that controls gzipping [puppet] - 10https://gerrit.wikimedia.org/r/251268 
[18:26:08] <grrrit-wm>	 (03PS2) 10coren: Labs: make disk space alerts for compute nodes paging [puppet] - 10https://gerrit.wikimedia.org/r/251297 
[18:26:16] <icinga-wm_>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 118, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/0/1: down - Core: cr2-eqiad:xe-5/2/3 Zayo (SO 580358) {#11519} [10Gbps DWDM]
[18:26:21] <Coren>	 chasemp: ^^ throws the switch
[18:26:38] <grrrit-wm>	 (03CR) 10Ori.livneh: "bblack, looks like it, yeah. Amended to match 'icon' (which covers x-icon and image/vnd.microsoft.icon). There is no non-compressible mime" [puppet] - 10https://gerrit.wikimedia.org/r/251268 (owner: 10Ori.livneh)
[18:26:50] <ori>	 bblack: ^
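Ori's review comment above says the amended regexp matches 'icon', which covers both x-icon and image/vnd.microsoft.icon. A rough sketch of that idea follows. The pattern here is an assumption for illustration only: the actual regexp in the puppet patch is not quoted in the log, and the other alternatives in the pattern are placeholders.

```python
import re

# Hypothetical pattern: only the 'icon' alternative comes from the log;
# the remaining alternatives are illustrative placeholders.
COMPRESSIBLE = re.compile(r"text|json|javascript|xml|icon", re.IGNORECASE)


def should_gzip(content_type: str) -> bool:
    """Return True if the Content-Type looks compressible under the sketch pattern."""
    return COMPRESSIBLE.search(content_type) is not None
```

Because the match is a substring search, a single `icon` alternative covers `image/x-icon` and `image/vnd.microsoft.icon` alike, which addresses BBlack's earlier concern without listing each type separately.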
[18:26:54] <grrrit-wm>	 (03CR) 10Dzahn: "i liked this from a production point of view and also want to do similar changes to clean up site.pp, but if we are breaking beta cluster " [puppet] - 10https://gerrit.wikimedia.org/r/251019 (owner: 10Muehlenhoff)
[18:28:36] <grrrit-wm>	 (03PS2) 10Krinkle: Fix varnishmedia comment [puppet] - 10https://gerrit.wikimedia.org/r/243838 (owner: 10Gilles)
[18:28:56] <grrrit-wm>	 (03CR) 10Dzahn: "this is good. it's just about the timing. needs announcement on mailing lists. if it would just add the new backend but not switch it yet," [puppet] - 10https://gerrit.wikimedia.org/r/251115 (https://phabricator.wikimedia.org/T116992) (owner: 10John F. Lewis)
[18:29:17] <icinga-wm_>	 PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7232): Max retries exceeded with url: /analytics.wikimedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[18:29:19] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032 V: 032] Fix varnishmedia comment [puppet] - 10https://gerrit.wikimedia.org/r/243838 (owner: 10Gilles)
[18:29:36] <icinga-wm_>	 PROBLEM - Restbase root url on aqs1001 is CRITICAL: Connection refused
[18:30:40] <grrrit-wm>	 (03CR) 10Dzahn: "yea.. hmm. an opinion from _joe_ would be great here" [puppet] - 10https://gerrit.wikimedia.org/r/247324 (owner: 10Chad)
[18:30:54] <grrrit-wm>	 (03CR) 10Rush: [C: 031] "at some point we need to review the thresholds here but I think based on today's events getting this rolling as is seems practical" [puppet] - 10https://gerrit.wikimedia.org/r/251297 (owner: 10coren)
[18:31:09] <ori>	 paravoid: have you seen https://phabricator.wikimedia.org/T107507#1534816 ?
[18:31:32] <paravoid>	 uhm, I guess not
[18:31:40] <ori>	 paravoid: i'm all for enabling backports unconditionally, but the "consensus" (?) was to disable it, which is why i took the middle road
[18:31:52] <ori>	 enabling it seems perfectly fine
[18:32:03] <paravoid>	 yeah my point was that your middle road isn't very different though
[18:32:06] <paravoid>	 than the default
[18:32:12] <paravoid>	 not compared to what we're doing
[18:32:16] <ori>	 the default appears to have changed
[18:32:24] <ori>	 backports used to be enabled
[18:32:26] <icinga-wm_>	 RECOVERY - service on praseodymium is OK: OK - cassandra-a is active
[18:32:31] <grrrit-wm>	 (03CR) 10Nemo bis: Remove superprotect (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251286 (owner: 10Chad)
[18:32:51] <paravoid>	 so that default changed on upstream d-i between the jessie release candidates
[18:33:11] <paravoid>	 this is all so deja vu, I remember saying this in a task somewhere
[18:33:28] <wikibugs>	 6operations, 5Patch-For-Review: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1786221 (10Dzahn) @matanya thanks for the update. ok!. we are going to reclaim tungsten
[18:33:47] <wikibugs>	 6operations, 10ops-codfw: db2034 host crashed; mgmt interface unavailable (needs reset and hw check) - https://phabricator.wikimedia.org/T117858#1786224 (10jcrespo) This has been resolved to me, unless, @papaul, you want to add anything strange that you found and may be the cause of the issue. I will keep an e...
[18:35:16] <paravoid>	 this was https://bugs.debian.org/764982 btw
[18:35:22] <wikibugs>	 6operations, 6Performance-Team: Allocate tungsten for InfluxDB - https://phabricator.wikimedia.org/T117888#1786231 (10ori) 3NEW a:3Dzahn
[18:35:23] <grrrit-wm>	 (03PS3) 10Dzahn: puppet-lint: re-enable 'unquoted file modes' check [puppet] - 10https://gerrit.wikimedia.org/r/251295 (https://phabricator.wikimedia.org/T93645) 
[18:36:05] <ori>	 paravoid: i don't really care which alternative to the current status quo we pick, since they're all better, from my perspective
[18:36:14] <paravoid>	 ah yes, I said that on this bug above :P
[18:36:45] <ori>	 i just don't have the investment necessary to make sure this is adequately discussed, etc. so if you want to pull a "i'm faidon and i approve this message" thing and just pick some approach, i'd actually welcome that :)
[18:36:48] <akosiaris>	 hashar-away: yay!
[18:37:06] * YuviPanda +1's ori
[18:37:08] <wikibugs>	 6operations, 5Patch-For-Review: Move people.wikimedia.org off terbium, to somewhere open to all prod shell accounts? - https://phabricator.wikimedia.org/T116992#1786251 (10Dzahn)
[18:37:10] <wikibugs>	 6operations, 5Patch-For-Review: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1786252 (10Dzahn)
[18:37:12] <wikibugs>	 6operations, 6Performance-Team: Allocate tungsten for InfluxDB - https://phabricator.wikimedia.org/T117888#1786250 (10Dzahn)
[18:37:21] <wikibugs>	 6operations, 5Patch-For-Review: Move people.wikimedia.org off terbium, to somewhere open to all prod shell accounts? - https://phabricator.wikimedia.org/T116992#1763907 (10Dzahn)
[18:37:22] <wikibugs>	 6operations, 6Performance-Team: Allocate tungsten for InfluxDB - https://phabricator.wikimedia.org/T117888#1786231 (10Dzahn)
[18:37:35] <ostriches>	 mutante: How about https://gerrit.wikimedia.org/r/#/c/224829/? :)
[18:37:45] <akosiaris>	 ori: independently I 've been looking into influxdb as well
[18:37:55] <akosiaris>	 great timing tbh
[18:38:05] <ori>	 akosiaris: what have your impressions been so far?
[18:38:31] <akosiaris>	 so, I 've been only testing the basics with a collectd
[18:38:41] <akosiaris>	 I must say I like it way better than graphite
[18:39:05] <icinga-wm_>	 PROBLEM - cassandra-a CQL 10.64.16.188:9042 on praseodymium is CRITICAL: Connection refused
[18:39:10] <akosiaris>	 for starters the fact that they're tagging values instead of creating a hierarchy
[18:39:27] <akosiaris>	 but I 've seen that like a year ago
[18:39:35] <akosiaris>	 now at least it is stabler
[18:39:54] <akosiaris>	 I am hopeful for that thing 
[18:39:56] <chasemp>	 I like their activity level too the core group seems pretty cool
[18:40:15] <ori>	 the fact that it was designed from the outset to be horizontally scalable seems like the most attractive property -- with graphite you can scale it but it's a choose-your-own-adventure story, requiring that we cobble together different software components. every time we hit a resource ceiling it's a new crisis.
[18:40:19] <akosiaris>	 I am wondering a bit how their sharding works though. still reading/testing on that front
[18:40:22] <mutante>	 ostriches: sorry, not right now. i actually have the day off :p
[18:40:30] <mutante>	 it was the usual "just this one thing"
[18:40:34] <akosiaris>	 ori: yup
[18:40:38] <akosiaris>	 totally agree
[18:41:50] <ori>	 ostriches: it has a +1 from filippo and it looks sane to me, so i don't mind merging it. do you have a way to verify it in prod?
[18:42:16] <ostriches>	 Yeah, if we run puppet on tin & mira I can pull the new version of scap and test immediately.
[18:42:25] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032] scap: Add co-master configuration [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis)
[18:42:30] <akosiaris>	 ostriches: I 'll merge
[18:42:31] <ori>	 heh
[18:42:33] <ostriches>	 ty!
[18:42:34] <ori>	 thanks!
[18:42:41] <grrrit-wm>	 (03PS11) 10Alexandros Kosiaris: scap: Add co-master configuration [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis)
[18:43:10] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] scap: Add co-master configuration [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis)
[18:44:57] <ostriches>	 jouncebot: next
[18:44:58] <jouncebot>	 In 0 hour(s) and 15 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151105T1900)
[18:45:13] <mutante>	 cool :) i like seeing that merged
[18:45:26] <mutante>	 be back tomorrow. cya
[18:45:36] <ostriches>	 twentyafterfour: What all you gotta deploy today?
[18:45:56] <ori>	 has the repo state been sorted out?
[18:46:04] <grrrit-wm>	 (03PS3) 10coren: Labs: make disk space alerts for compute nodes paging [puppet] - 10https://gerrit.wikimedia.org/r/251297 
[18:46:23] <twentyafterfour>	 ostriches: morning swat stuff didn't get sync'd yet
[18:46:28] <twentyafterfour>	 ori: yes
[18:46:43] <ori>	 twentyafterfour: cool, thanks. i'll poke AaronSchulz to explain what happened.
[18:46:57] <twentyafterfour>	 ori: thcipriani and I got it straightened out.  And yes please do
[18:47:06] <grrrit-wm>	 (03CR) 10coren: [C: 032] Labs: make disk space alerts for compute nodes paging [puppet] - 10https://gerrit.wikimedia.org/r/251297 (owner: 10coren)
[18:47:18] <twentyafterfour>	 inquiring minds would like to know wtf
[18:47:20] <akosiaris>	 ostriches: wanna test ? run puppet on both tin and mira 
[18:47:33] <akosiaris>	 I 've ran*
[18:47:41] <akosiaris>	 sigh... sorry 21:00 over here
[18:47:48] <akosiaris>	 not my best time of the day
[18:48:06] <ostriches>	 Well, I don't wanna screw with twentyafterfour's train deploy.
[18:48:09] <wikibugs>	 6operations, 6Labs: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1786298 (10faidon) I don't remember this IRC discussion. Who was attending it? A little more context please? :)  In any case, I disagree with that consensus. I think enabling backports fleet-wid...
[18:48:12] <paravoid>	 ori: ^
[18:48:20] <ori>	 thanks
[18:48:22] <ostriches>	 akosiaris: So I might wait a min :)
[18:48:34] <akosiaris>	 ostriches: ok
[18:49:58] <wikibugs>	 6operations, 6Labs: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1786306 (10coren) @faidon: That was mostly you and Moritz.  Lemme see if I find quotables in my local logs.  :-)
[18:50:25] <paravoid>	 I was?
[18:50:38] <wikibugs>	 6operations, 5Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#1786307 (10Dzahn) ^ see my change above. we have fixed all "unquoted file mode" warnings example:  mode => 0644)  across the repo.  so we can re-enable that specific check
[18:50:47] <akosiaris>	 perhaps it was about ubuntu ?
[18:50:48] <twentyafterfour>	 ostriches: if it's not gonna take long I can wait (train deploy window isn't for another 10 minutes anyway)
[18:50:48] <paravoid>	 it would surprise me but it's entirely plausible
[18:50:52] <Coren>	 paravoid: Yep, but that was some weeks ago.  I'm looking at my local logs now to figure out when and quote it.  :-)
[18:51:08] <akosiaris>	 tbh I am not still feeling fully comfortable with -backports enabled 
[18:51:16] <akosiaris>	 got a long history of not doing it in production
[18:51:27] <ori>	 it was enabled by default
[18:51:43] <ori>	 akosiaris: it just did not need an explicit action to enable before
[18:51:43] <akosiaris>	 and I am wary of enabling it in only half the fleet (jessie vs ubuntu)
[18:52:17] <akosiaris>	 that was in jessie pre-release
[18:52:26] <ostriches>	 !log scap: deploying master@b44c268
[18:52:31] <akosiaris>	 but it always needed to be enabled explicitly
[18:52:32] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:52:41] <akosiaris>	 back in wheezy and squeeze as well
[18:53:33] <Coren>	 Ew.  We discuss "backport" a lot on IRC.  A naive grep gives me a few hundred log files.
[18:53:47] <Coren>	 Oh, wait, I updated the task the same day pretty much - that should narrow the window.
[18:53:49] <akosiaris>	 I am not against being convinced we should enable it, for the record. 
[18:53:58] * jzerebecki will be offline soon
[18:54:48] <ori>	 Coren: could you review/merge https://gerrit.wikimedia.org/r/#/c/250378/ so we can close out https://phabricator.wikimedia.org/T115711 ? 
[18:55:09] <Coren>	 ori: Sure, give me a minute and I'll look at it.
[18:55:10] <paravoid>	 I found it.
[18:55:39] <Coren>	 paravoid: date/channel so I can follow along?
[18:56:20] <DanielK_WMDE>	 hi all
[18:56:36] <grrrit-wm>	 (03CR) 10coren: [C: 031] "With the caveat that this will prevent creation of the views, but will not remove extant ones from the replicas (that needs an interventio" [software] - 10https://gerrit.wikimedia.org/r/250378 (https://phabricator.wikimedia.org/T115711) (owner: 10Jcrespo)
[18:56:39] <DanielK_WMDE>	 if anything goes wrong with the wikidata deployment, please ping me
[18:56:51] <hoo|busy>	 DanielK_WMDE: I'm around
[18:56:57] <hoo|busy>	 or is there anything especially dangerous today?
[18:57:04] <grrrit-wm>	 (03CR) 10coren: [C: 032] Delete user_daily_contribs from the views in labs [software] - 10https://gerrit.wikimedia.org/r/250378 (https://phabricator.wikimedia.org/T115711) (owner: 10Jcrespo)
[18:57:07] <hoo|busy>	 dewiki is ro?
[18:57:14] <jzerebecki>	 hoo|busy: just the backport still in progress
[18:57:30] <Coren>	 ori: I'm doing a test run now to give it V+2, then I'll merge.
[18:57:35] <ori>	 thanks
[18:57:46] <hoo|busy>	 jzerebecki: Why is that?
[18:57:53] <grrrit-wm>	 (03PS1) 10Paladox: Fix replication in phabricator [puppet] - 10https://gerrit.wikimedia.org/r/251304 
[18:57:53] <DanielK_WMDE>	 hoo|busy: ah, good to know :) i thought you were offline. wanted to make sure *someone* is around
[18:58:01] <DanielK_WMDE>	 happy if i can go offline in an hour
[18:58:11] <hoo|busy>	 Katie told to be online, so here I am
[18:58:15] <hoo|busy>	 + me
[18:58:29] <logmsgbot>	 !log demon@tin Synchronized README: no-op, testing new scap code (duration: 00m 19s)
[18:58:35] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:58:46] <grrrit-wm>	 (03CR) 10Paladox: "I am not sure if this fixes the problem but may." [puppet] - 10https://gerrit.wikimedia.org/r/251304 (owner: 10Paladox)
[18:58:53] <ostriches>	 twentyafterfour: You may see mira complain for a bit about sync-master. Everything else should work and continue as normal.
[18:58:59] * ostriches goes after trebuchet with a knife
[18:58:59] <bd808>	 ostriches: master-master sync in there yet?
[18:59:20] <ostriches>	 The code's deployed to tin, mira didn't want to update.
[18:59:21] <wikibugs>	 6operations, 6Labs: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1786335 (10MoritzMuehlenhoff) Using packages from backport selectively is fine with me, we already do it e.g. with openjdk-8 which we need for the cassandra cluster. It's a valid part of the Deb...
[18:59:23] <grrrit-wm>	 (03CR) 10Jcrespo: "tables and views already deleted." [software] - 10https://gerrit.wikimedia.org/r/250378 (https://phabricator.wikimedia.org/T115711) (owner: 10Jcrespo)
[18:59:59] * bd808 runs to look at the logs
[19:00:05] <jouncebot>	 twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151105T1900).
[19:01:58] <grrrit-wm>	 (03CR) 10Paladox: Fix replication in phabricator (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/251304 (owner: 10Paladox)
[19:02:25] <ostriches>	 bd808: https://phabricator.wikimedia.org/P2281
[19:02:37] <ostriches>	 Some are probably stale/depooled/etc.
[19:02:52] <bd808>	 ostriches: "['/srv/deployment/scap/scap/bin/sync-master', 'tin.eqiad.wmnet'] on mira.codfw.wmnet returned [127]: bash: /srv/deployment/scap/scap/bin/sync-master: No such file or directory"
[19:03:02] <bd808>	 missing the new script
[19:03:05] <ostriches>	 Yeah, I know, I said trebuchet was stupid.
[19:03:10] <ostriches>	 mira didn't want to update.
[19:03:36] <AaronSchulz>	 twentyafterfour: so did you end up doing the hard reset + repick or use some other way? Looks like it was rebased against master instead of wmf5, so a few newer commits showed up but were not deployed.
[19:04:09] <twentyafterfour>	 AaronSchulz: thcipriani did the rebasing, I think everything got started over
[19:04:59] <thcipriani>	 AaronSchulz: I did a rebase to add in the commits that were made as part of SWAT, then I reset to the head of the .5 branch.
[19:05:24] <thcipriani>	 then I repicked the security patches on top.
[19:06:42] <twentyafterfour>	 and I'm gonna sync it all right now.
[19:06:48] <thcipriani>	 when I got there we were 22 commits ahead of origin/wmf/1.27.0-wmf.5
[19:07:07] <twentyafterfour>	 I'm going to sync to group1 and let it bake for a while before syncing wmf.5 to group2
[19:07:09] <thcipriani>	 (and 2 commits behind since those had been merged for SWAT)
[19:08:41] <grrrit-wm>	 (03CR) 10Paladox: Fix replication in phabricator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/251304 (owner: 10Paladox)
[19:10:14] <wikibugs>	 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10vm-requests, 5iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1786348 (10Milimetric) @Joe, @mark, there was more context to this issue in other tickets, but I'm happy to...
[19:11:30] <grrrit-wm>	 (03CR) 10Chad: [C: 04-1] "This will not fix the problem and makes an unrelated and incorrect change to the proxy config. Please abandon." [puppet] - 10https://gerrit.wikimedia.org/r/251304 (owner: 10Paladox)
[19:13:26] <grrrit-wm>	 (03PS1) 1020after4: w/static/ symlinks for wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251308 
[19:14:10] <grrrit-wm>	 (03CR) 1020after4: [C: 032] w/static/ symlinks for wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251308 (owner: 1020after4)
[19:14:48] <grrrit-wm>	 (03Merged) 10jenkins-bot: w/static/ symlinks for wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251308 (owner: 1020after4)
[19:15:44] <logmsgbot>	 !log twentyafterfour@tin Started scap: Sync everything just to be sure
[19:15:50] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:16:44] <grrrit-wm>	 (03CR) 10coren: [V: 032] Delete user_daily_contribs from the views in labs [software] - 10https://gerrit.wikimedia.org/r/250378 (https://phabricator.wikimedia.org/T115711) (owner: 10Jcrespo)
[19:20:13] <ostriches>	 twentyafterfour: Again, mira will probably complain about missing sync-master, it should just continue and be ok tho
[19:20:13] <ostriches>	 Still trying to sort that
[19:20:25] <ostriches>	 Ah, it finally caught up
[19:20:26] <ostriches>	 Yay
[19:20:26] <ostriches>	 :)
[19:20:48] <ostriches>	 #eventualconsistency
[19:21:03] <ori>	 #eventconsi
[19:21:11] <ostriches>	 stency
[19:21:14] <grrrit-wm>	 (03PS2) 10Paladox: Fix replication in phabricator [puppet] - 10https://gerrit.wikimedia.org/r/251304 
[19:21:16] <ori>	 :)
[19:22:18] <ostriches>	 So now we get to play "fix broken permissions" in mw-staging in prod like we did in beta :)
[19:22:24] <grrrit-wm>	 (03PS3) 10Paladox: Fix replication in phabricator [puppet] - 10https://gerrit.wikimedia.org/r/251304 
[19:22:29] <ostriches>	 Although the root dir should be ok now
[19:22:33] <ostriches>	 With puppetz.
[19:25:11] <grrrit-wm>	 (03CR) 10Chad: [C: 04-1] Fix replication in phabricator (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/251304 (owner: 10Paladox)
[19:33:32] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 04-1] "Well, right now it would mean that misses (and therefore, all logged in traffic), would go ulsfo->codfw->eqiad->appservers if you look at " [dns] - 10https://gerrit.wikimedia.org/r/251255 (https://phabricator.wikimedia.org/T114659) (owner: 10BBlack)
[19:37:45] <logmsgbot>	 !log twentyafterfour@tin Finished scap: Sync everything just to be sure (duration: 22m 01s)
[19:37:52] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:37:59] <ostriches>	 twentyafterfour: And?
[19:38:00] <ostriches>	 :)
[19:44:16] <icinga-wm_>	 RECOVERY - Restbase root url on praseodymium is OK: HTTP OK: HTTP/1.1 200 - 15171 bytes in 0.015 second response time
[19:45:45] <icinga-wm_>	 RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy
[19:47:38] <grrrit-wm>	 (03PS1) 10BryanDavis: logstash: Exclude runJobs info events from logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251317 (https://phabricator.wikimedia.org/T113571) 
[19:47:49] <twentyafterfour>	 ostriches: scap executed flawlessly
[19:48:18] <twentyafterfour>	 but, wtf? 153 Notice: Undefined property: stdClass::$newContent in /srv/mediawiki/php-1.27.0-wmf.4/includes/page/WikiPage.php on line 2058
[19:48:41] <grrrit-wm>	 (03CR) 10Paladox: Fix replication in phabricator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/251304 (owner: 10Paladox)
[19:49:12] <ostriches>	 twentyafterfour: Filed ages ago.
[19:49:29] <twentyafterfour>	 weird. it just showed up in logs suddenly
[19:50:05] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[19:50:21] <ostriches>	 twentyafterfour: sync-master was good? no permission complaints?
[19:51:04] <wikibugs>	 6operations, 6Commons, 10Wikimedia-Media-storage, 5MW-1.27-release-notes, and 2 others: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#1786485 (10matmarex) >>! In T111838#1785015, @Ankry wrote: > I have checked few random files from the list and they all...
[19:51:34] <twentyafterfour>	 ostriches: actually... I hadn't even noticed the scrollback
[19:51:39] <twentyafterfour>	 19:29:41 ['/srv/deployment/scap/scap/bin/sync-master', 'tin.eqiad.wmnet'] on mira.codfw.wmnet returned [70]: 19:21:32 Copying to mira.codfw.wmnet from tin.eqiad.wmnet
[19:51:41] <twentyafterfour>	 19:21:32 Started rsync master
[19:51:43] <twentyafterfour>	 rsync: failed to set times on "/srv/mediawiki-staging/live-1.5": Operation not permitted (1)
[19:51:51] <twentyafterfour>	 followed by a bunch more failed to set times on ....
[19:52:26] <ostriches>	 twentyafterfour: Pastebin :)
[19:54:07] <grrrit-wm>	 (03CR) 10Chad: Fix replication in phabricator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/251304 (owner: 10Paladox)
[19:54:16] <twentyafterfour>	 ostriches: https://phabricator.wikimedia.org/P2282
[19:55:18] <twentyafterfour>	 the 'successful' output actually scrolled the errors up so far that I didn't even notice them :-/ obviously I'm not paying good enough attention
[19:55:36] <ostriches>	 Nbd, the rest of it worked out fine.
[19:55:55] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[19:56:25] <wikibugs>	 6operations, 7Database, 5Patch-For-Review: Drop `user_daily_contribs` table from all production wikis - https://phabricator.wikimedia.org/T115711#1786507 (10MaxSem) 5Open>3Resolved I don't see this table on betalabs.
[19:57:32] <ostriches>	 bd808: cc https://phabricator.wikimedia.org/P2282 :\
[19:58:04] <bd808>	 ostriches: looking. I imagine that means the initial clone there is not owned by mwdeploy
[19:58:30] <bd808>	 the mtime stuff requires ownership rather than just group access
[19:58:35] <grrrit-wm>	 (03Abandoned) 10Paladox: Fix replication in phabricator [puppet] - 10https://gerrit.wikimedia.org/r/251304 (owner: 10Paladox)
[19:58:35] <ostriches>	 Yeah
[19:59:35] <bd808>	 ostriches: most files there are owned by either root or Krenair 
[19:59:54] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds
[19:59:58] <bd808>	 so we need some chown help from a root to make things sane
[20:00:16] <bd808>	 or to nuke it all and start over
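Editor's sketch of the diagnosis above: setting a file's timestamps to arbitrary values (which is what rsync does when preserving mtimes) requires being the file's owner or a privileged user, so group write access is not enough. The helper below is a hypothetical illustration, not part of scap; the function name is an assumption. It lists the paths under a tree that a given uid does not own, which are the ones where "failed to set times" errors would be expected.

```python
import os


def files_not_owned_by(root, uid):
    """Return the sorted paths under root whose owner is not uid.

    Hypothetical helper for illustration only: these are the entries
    on which a non-privileged process running as uid could not set
    arbitrary timestamps, even with group write access.
    """
    mismatched = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat so that symlinks are checked themselves, not their targets
            if os.lstat(path).st_uid != uid:
                mismatched.append(path)
    return sorted(mismatched)
```

Run against a staging tree with the deploy user's uid, an empty result would suggest the ownership is consistent; any listed path is a candidate for a chown by someone with root.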
[20:02:49] * YuviPanda waves very lately at dbrant
[20:03:54] <bd808>	 ostriches: YuviPanda has time to wave so he has time to chown :)
[20:03:59] <_joe_>	 bd808: what do you need specifically?
[20:04:27] <bd808>	 _joe_: the files in /srv/mediawiki-staging on mira need to be owned by the mwdeploy user
[20:04:44] * YuviPanda can do if _joe_ isn't on it
[20:04:48] <ostriches>	 bd808: Also, we should make checkoutMediaWiki have you do that as mwdeploy as well
[20:04:50] <_joe_>	 why the mwdeploy user? group write permission is not enough?
[20:04:58] <ostriches>	 Not to set mtimes.
[20:05:01] <dbrant>	 YuviPanda: hey! i got it figured out, thanks
[20:05:13] <_joe_>	 oh you manually set mtimes?
[20:05:19] <bd808>	 ostriches: yeah that will need to be fixed too
[20:05:24] <bd808>	 _joe_: rsync does
[20:05:31] <_joe_>	 you don't just touch the file, you run setattr, via rsync
[20:05:36] <YuviPanda>	 dbrant: haha ok
[20:05:37] <_joe_>	 ok
[20:05:45] <_joe_>	 yep you need that then
[20:05:48] <_joe_>	 so, on mira?
[20:06:02] <bd808>	 _joe_: yeah. the errors we saw are at https://phabricator.wikimedia.org/P2282
[20:07:12] <_joe_>	 !log chown mwdeploy:wikidev recursively on mira for /srv/mediawiki-staging
[20:07:19] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:07:30] <_joe_>	 {{done}}
[20:07:44] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds
[20:09:53] <bd808>	 ostriches: do you have time to scap again and check that out?
[20:10:03] <ostriches>	 Yeah
[20:10:21] <logmsgbot>	 !log demon@tin Started scap: no changes, testing permissions on mira co-master
[20:10:27] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:11:26] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds
[20:14:36] <grrrit-wm>	 (03PS1) 10Chad: checkoutMediaWiki: sudo as mwdeploy for most things [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251327 
[20:15:16] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds
[20:15:25] <icinga-wm_>	 RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[20:15:53] <ostriches>	 Heh, that might be a problem.
[20:16:10] <ostriches>	 20:11:42 Started sync-masters
[20:16:11] <ostriches>	 sync-masters: 100% (ok: 1; fail: 0; left: 0)                                    
[20:16:11] <ostriches>	 20:14:58 Finished sync-masters (duration: 03m 16s)
[20:16:15] <ostriches>	 Yay!
[20:16:18] <bd808>	 w00t
[20:17:37] <grrrit-wm>	 (03CR) 10Jhobs: [C: 031] Deploy QuickSurveys to English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251130 (https://phabricator.wikimedia.org/T110661) (owner: 10Jdlrobson)
[20:17:50] <wikibugs>	 6operations, 10Deployment-Systems, 5Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1786577 (10demon)
[20:17:50] <logmsgbot>	 !log demon@tin Finished scap: no changes, testing permissions on mira co-master (duration: 07m 29s)
[20:17:56] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:19:06] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds
[20:24:01] <grrrit-wm>	 (03CR) 10Eevans: restbase: move to systemd unit file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/244647 (https://phabricator.wikimedia.org/T103134) (owner: 10Filippo Giunchedi)
[20:28:54] <icinga-wm_>	 RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[20:30:13] <bd808>	 twentyafterfour: is group2 still going to wmf.5 today?
[20:31:00] <grrrit-wm>	 (03CR) 10BryanDavis: [C: 031] checkoutMediaWiki: sudo as mwdeploy for most things [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251327 (owner: 10Chad)
[20:36:37] <wikibugs>	 6operations, 10Deployment-Systems, 5Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1786622 (10demon) Anything left on this?
[20:46:24] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 9 below the confidence bounds
[20:47:15] <wikibugs>	 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1786669 (10RobH) a:3Joe So the current summary, as I understand it is we need 2 identical machines (master/slave) in EQIAD to add to the rdb cluster.  These two servers will be name...
[20:47:28] <wikibugs>	 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1786676 (10RobH)
[20:47:42] <wikibugs>	 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1035333 (10RobH)
[20:51:59] <wikibugs>	 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1786681 (10RobH) @aaron Can the redis system config be updated to use /srv rather than /a?  My understanding is we've shifted nearly all other services to use /srv.
[20:52:01] <twentyafterfour>	 bd808: yes I just wanted to give it some time to be sure wmf.5 wasn't horribly broken
[20:52:14] <bd808>	 cool beans
[21:00:53] <wikibugs>	 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1786698 (10mark) I understand that this is a bit urgent, so let's use one of our old spares, even if they're out of warranty. We can replace when we're out of the woods.
[21:02:05] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 8 below the confidence bounds
[21:12:38] <twentyafterfour>	 ok I guess it's baked long enough. I'm gonna deploy wmf.5 to group2
[21:13:50] <wikibugs>	 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1786755 (10RobH) a:5Joe>3RobH Update from IRC:  @Mark stated he would like this to be hardware under warranty, and thus new, unless its an emergency.  @Joe stated he would like to...
[21:15:15] <grrrit-wm>	 (03PS1) 1020after4: all wikis to 1.27.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251410 
[21:15:33] <wikibugs>	 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1786761 (10RobH) 5Open>3stalled We'll allocate the two old boxes for now, and order new boxes.  I'll put this task to stalled.  I'll create a blocking task for the installation of...
[21:26:09] <wikibugs>	 6operations: install/setup/deploy wmf3153 (rdb1007) & wmf3154 (rdb1008) - https://phabricator.wikimedia.org/T117916#1786794 (10RobH) 3NEW a:3RobH
[21:26:41] <grrrit-wm>	 (03CR) 1020after4: [C: 032] all wikis to 1.27.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251410 (owner: 1020after4)
[21:27:02] <grrrit-wm>	 (03Merged) 10jenkins-bot: all wikis to 1.27.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251410 (owner: 1020after4)
[21:27:05] <twentyafterfour>	 bd808: wmf.5 coming right up
[21:27:33] <twentyafterfour>	 or not
[21:27:51] <twentyafterfour>	 21:27:10 sync-wikiversions failed: <AttributeError> 'SyncWikiversions' object has no attribute '_get_target_list'
[21:32:20] <twentyafterfour>	 (patch coming up)
[21:35:50] <wikibugs>	 6operations, 6Release-Engineering-Team, 7Database, 5Patch-For-Review, 7WorkType-Maintenance: Recover missing values from user_properties tables - https://phabricator.wikimedia.org/T114899#1786833 (10Ciencia_Al_Poder) 5Resolved>3declined
[21:36:36] <wikibugs>	 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1786836 (10aaron) >>! In T89400#1786681, @RobH wrote: > @aaron Can the redis system config be updated to use /srv rather than /a?  My understanding is we've shifted nearly all other s...
[21:42:21] <grrrit-wm>	 (03PS1) 10RobH: setting wmf3153 (rdb1007) & wmf3154 (rdb1008) mgmt dns entries [dns] - 10https://gerrit.wikimedia.org/r/251414 
[21:42:34] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds
[21:46:24] <icinga-wm_>	 PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/).
[21:51:00] <grrrit-wm>	 (03PS3) 10BBlack: Relax and extend the Content-Type regexp that controls gzipping [puppet] - 10https://gerrit.wikimedia.org/r/251268 (owner: 10Ori.livneh)
[21:51:15] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] Relax and extend the Content-Type regexp that controls gzipping [puppet] - 10https://gerrit.wikimedia.org/r/251268 (owner: 10Ori.livneh)
[21:51:32] <ori>	 thanks bblack
[21:52:05] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[21:53:21] <grrrit-wm>	 (03CR) 10BBlack: "This doesn't affect cache-tiering or routing, just frontend edge where traffic initially lands... ?" [dns] - 10https://gerrit.wikimedia.org/r/251255 (https://phabricator.wikimedia.org/T114659) (owner: 10BBlack)
[21:54:26] <bblack>	 np!
[21:54:30] <logmsgbot>	 !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: sync 1.27.0-wmf.5 to group2
[21:54:36] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:57:54] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[21:58:18] <grrrit-wm>	 (CR) RobH: [C: 2] setting wmf3153 (rdb1007) & wmf3154 (rdb1008) mgmt dns entries [dns] - https://gerrit.wikimedia.org/r/251414 (owner: RobH)
[21:59:44] <wikibugs>	 operations: install/setup/deploy wmf3153 (rdb1007) & wmf3154 (rdb1008) - https://phabricator.wikimedia.org/T117916#1786917 (RobH)
[22:05:30] <grrrit-wm>	 (PS1) RobH: setting rdb1007/rdb1008 production dns entries [dns] - https://gerrit.wikimedia.org/r/251421
[22:06:17] <grrrit-wm>	 (CR) RobH: [C: 2] setting rdb1007/rdb1008 production dns entries [dns] - https://gerrit.wikimedia.org/r/251421 (owner: RobH)
[22:07:35] <icinga-wm_>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[22:12:23] <ori>	 req errors are fine
[22:12:29] <ori>	 alert is too sensitive
[22:13:38] <twentyafterfour>	 yeah I see that alert for graphite1001 frequently ...
[22:13:56] <twentyafterfour>	 it's definitely too sensitive
[22:14:27] <ori>	 it didn't use to be too sensitive. it's probably a testament to good work from releng that it has become too sensitive. we used to have real spikes of errors more frequently.
[22:16:12] <grrrit-wm>	 (PS1) RobH: setting install params for rdb1007-1008 [puppet] - https://gerrit.wikimedia.org/r/251426
[22:16:37] <wikibugs>	 operations: install/setup/deploy wmf3153 (rdb1007) & wmf3154 (rdb1008) - https://phabricator.wikimedia.org/T117916#1786994 (RobH)
[22:18:27] <bblack>	 yeah I think that alert just looks for arbitrary pattern anomalies
[22:18:52] <bblack>	 as in, no absolute thresholds.  So yeah, if things are generally good, very minor disturbances are going to become alerts.
[22:19:22] <grrrit-wm>	 (CR) RobH: [C: 2] setting install params for rdb1007-1008 [puppet] - https://gerrit.wikimedia.org/r/251426 (owner: RobH)
[22:20:55] <icinga-wm_>	 RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[22:27:36] <grrrit-wm>	 (PS1) Dduvall: Install libjpeg-dev for diagrams in documentation [puppet] - https://gerrit.wikimedia.org/r/251428
[22:28:56] <grrrit-wm>	 (CR) Ori.livneh: [C: 2] Install libjpeg-dev for diagrams in documentation [puppet] - https://gerrit.wikimedia.org/r/251428 (owner: Dduvall)
[22:32:36] <icinga-wm_>	 RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[22:35:04] <logmsgbot>	 !log ori@tin Synchronized php-1.27.0-wmf.5/extensions/WikimediaEvents: Ic99ac31f740956: Log backend response time on edit requests (duration: 00m 35s)
[22:35:10] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:39:14] <icinga-wm_>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [5000000.0]
[22:46:26] <icinga-wm_>	 RECOVERY - puppet last run on mw1083 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[22:48:45] <icinga-wm_>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[23:00:41] <grrrit-wm>	 (PS1) BryanDavis: monolog: Ensure that context data added by WebProcessor is utf-8 safe [mediawiki-config] - https://gerrit.wikimedia.org/r/251431
[23:00:48] <bd808>	 ori: ^
[23:08:18] <grrrit-wm>	 (PS1) Gilles: Add libcurl-dev to Python Jenkins [puppet] - https://gerrit.wikimedia.org/r/251432 (https://phabricator.wikimedia.org/T111005)
[23:13:17] <grrrit-wm>	 (CR) Ori.livneh: [C: 2] Add libcurl-dev to Python Jenkins [puppet] - https://gerrit.wikimedia.org/r/251432 (https://phabricator.wikimedia.org/T111005) (owner: Gilles)
[23:13:56] <ori>	 bd808: looks ok to me, not sure how to test
[23:14:18] <ori>	 if you tested it then let's do it
[23:14:23] <bd808>	 I pasted it into my mw-vagrant config and it didn't blow up
[23:14:46] <bd808>	 and seemed to do the wanted thing for urls with nasty chars in them
[23:15:08] <grrrit-wm>	 (CR) Ori.livneh: [C: 2] monolog: Ensure that context data added by WebProcessor is utf-8 safe [mediawiki-config] - https://gerrit.wikimedia.org/r/251431 (owner: BryanDavis)
[23:15:29] <grrrit-wm>	 (Merged) jenkins-bot: monolog: Ensure that context data added by WebProcessor is utf-8 safe [mediawiki-config] - https://gerrit.wikimedia.org/r/251431 (owner: BryanDavis)
[23:16:34] <logmsgbot>	 !log ori@tin Synchronized wmf-config/logging.php: Ieb8c602a: monolog: Ensure that context data added by WebProcessor is utf-8 safe (duration: 00m 36s)
[23:16:40] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:25:39] <bd808>	 ori: https://fr.wikisource.org/wiki/R%C3%A9solution_179_du_conseil_de_s%C3%A9curit%C3%A9_des_nations_unies isn't adding to exception.log anymore :)
[23:30:31] <bd808>	 !log Logging volume into ELK cluster down dramatically; investigating 
[23:30:37] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:34:27] <bd808>	 !log Decreased replica count of logstash-2015.10.13 and logstash-2015.10.14 to free disk space on cluster
[23:34:33] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:44:15] <grrrit-wm>	 (PS5) Yuvipanda: dynamicproxy: Move invisible-unicorn into puppet [puppet] - https://gerrit.wikimedia.org/r/251176
[23:44:17] <grrrit-wm>	 (PS1) Yuvipanda: Add .pep8 exception for line length [puppet] - https://gerrit.wikimedia.org/r/251435
[23:45:37] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] dynamicproxy: Move invisible-unicorn into puppet [puppet] - https://gerrit.wikimedia.org/r/251176 (owner: Yuvipanda)
[23:45:53] <YuviPanda>	 fuck you too jenkins
[23:46:21] <YuviPanda>	 legoktm: any idea why jenkins doesn't respect either tox.ini or .pep8 in base of the project?
[23:47:02] <wikibugs>	 operations, CirrusSearch, Discovery, Patch-For-Review: Only use newer (elastic10{16..31}) servers as master capable elasticsearch nodes - https://phabricator.wikimedia.org/T112556#1787259 (EBernhardson) With four nodes we will need to increase `discovery.zen.minimum_master_nodes` to 3 to ensure ther...
[23:47:12] <grrrit-wm>	 (PS6) Yuvipanda: dynamicproxy: Move invisible-unicorn into puppet [puppet] - https://gerrit.wikimedia.org/r/251176
[23:47:24] <legoktm>	 YuviPanda: it uses the root one
[23:47:40] <YuviPanda>	 do I have a wrong .pep8?
[23:47:44] <YuviPanda>	 also does it use tox.ini or pep8?
[23:47:46] <YuviPanda>	 .pep8
[23:48:02] <YuviPanda>	 there's a 'flake8' line in the root tox.ini
[23:48:05] <YuviPanda>	 should it be a 'pep8'
[23:48:07] <YuviPanda>	 ?
[23:48:55] <ori>	 flake8 is pyflakes + pep8
[23:48:58] <legoktm>	 it looks like it's using pep8 and not flake8
[23:48:58] <legoktm>	 ugh
[23:49:01] <YuviPanda>	 yeah
[23:49:11] <YuviPanda>	 this looks like a misconfiguration somewhere... not sure where
[23:49:24] <YuviPanda>	 see adding the .pep8 into the folder with the py file gets it to shut up
[23:49:52] <YuviPanda>	 there are also 25 individual .pep8s scattered around the repo
[23:50:04] <grrrit-wm>	 (PS7) Yuvipanda: dynamicproxy: Move invisible-unicorn into puppet [puppet] - https://gerrit.wikimedia.org/r/251176
[23:51:03] <grrrit-wm>	 (CR) Yuvipanda: [C: 2] dynamicproxy: Move invisible-unicorn into puppet [puppet] - https://gerrit.wikimedia.org/r/251176 (owner: Yuvipanda)
[23:54:14] <grrrit-wm>	 (PS1) Yuvipanda: dynamicproxy: Install proper flask package [puppet] - https://gerrit.wikimedia.org/r/251436
[23:54:45] <grrrit-wm>	 (CR) Yuvipanda: [C: 2 V: 2] dynamicproxy: Install proper flask package [puppet] - https://gerrit.wikimedia.org/r/251436 (owner: Yuvipanda)
[23:54:47] <bd808>	 jouncebot: refresh
[23:54:51] <jouncebot>	 I refreshed my knowledge about deployments.