[01:45:32] <icinga-wm>	 PROBLEM - Outgoing network saturation on labstore1002 is CRITICAL: CRITICAL: 34.62% of data above the critical threshold [100000000.0]
[02:27:10] <logmsgbot>	 !log l10nupdate@tin Synchronized php-1.27.0-wmf.3/cache/l10n: l10nupdate for 1.27.0-wmf.3 (duration: 08m 14s)
[02:27:22] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:47] <logmsgbot>	 !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.3) at 2015-10-26 02:31:46+00:00
[02:31:53] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:55:32] <icinga-wm>	 RECOVERY - Outgoing network saturation on labstore1002 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[02:56:12] <icinga-wm>	 PROBLEM - puppet last run on lvs2003 is CRITICAL: CRITICAL: puppet fail
[03:22:32] <icinga-wm>	 RECOVERY - puppet last run on lvs2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:43:11] <icinga-wm>	 PROBLEM - puppet last run on mw2086 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:09:31] <icinga-wm>	 RECOVERY - puppet last run on mw2086 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[05:48:20] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Oct 26 05:48:19 UTC 2015 (duration 48m 18s)
[05:48:25] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:29:52] <icinga-wm>	 PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:42] <icinga-wm>	 PROBLEM - puppet last run on db1045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:51] <icinga-wm>	 PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:52] <icinga-wm>	 PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:11] <icinga-wm>	 PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:12] <icinga-wm>	 PROBLEM - puppet last run on mc2005 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:33] <icinga-wm>	 PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:52] <icinga-wm>	 PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:12] <icinga-wm>	 PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:31] <icinga-wm>	 PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:03] <icinga-wm>	 PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:40:35] <_joe_>	 oh, DST
[06:40:40] <_joe_>	 I was worrying :P
[06:55:12] <icinga-wm>	 RECOVERY - puppet last run on db1045 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:55:22] <icinga-wm>	 RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:55:41] <icinga-wm>	 RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:55:42] <icinga-wm>	 RECOVERY - puppet last run on mc2005 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[06:56:02] <icinga-wm>	 RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:56:02] <icinga-wm>	 RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:56:32] <icinga-wm>	 RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:56:52] <icinga-wm>	 RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:57:11] <icinga-wm>	 RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:31] <icinga-wm>	 RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:02] <icinga-wm>	 RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:35:49] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] Convert logging from print to twisted.python.log [debs/pybal] - 10https://gerrit.wikimedia.org/r/244138 (owner: 10Giuseppe Lavagetto)
[07:36:24] <grrrit-wm>	 (03Merged) 10jenkins-bot: Convert logging from print to twisted.python.log [debs/pybal] - 10https://gerrit.wikimedia.org/r/244138 (owner: 10Giuseppe Lavagetto)
[07:45:15] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] Re-adding PyBalConfigurationObserverError [debs/pybal] - 10https://gerrit.wikimedia.org/r/244669 (owner: 10Giuseppe Lavagetto)
[07:45:42] <grrrit-wm>	 (03Merged) 10jenkins-bot: Re-adding PyBalConfigurationObserverError [debs/pybal] - 10https://gerrit.wikimedia.org/r/244669 (owner: 10Giuseppe Lavagetto)
[07:58:15] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Move the base::firewall include into the impala role [puppet] - 10https://gerrit.wikimedia.org/r/246221 
[08:00:13] <grrrit-wm>	 (03PS3) 10Muehlenhoff: Assign salt grains for kafkatee [puppet] - 10https://gerrit.wikimedia.org/r/248329 
[08:01:30] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for kafkatee [puppet] - 10https://gerrit.wikimedia.org/r/248329 (owner: 10Muehlenhoff)
[08:05:12] <icinga-wm>	 PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 out: 300 virgin: 25)
[08:06:50] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Assign salt grains for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/248330 
[08:07:11] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/248330 (owner: 10Muehlenhoff)
[08:09:57] <grrrit-wm>	 (03PS7) 10Giuseppe Lavagetto: Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[08:10:26] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[08:15:15] <grrrit-wm>	 (03CR) 10ArielGlenn: "Do you have an estimate of how long it will take this to run on the larger (wikidata, commons, en wiki) wikis?" [puppet] - 10https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) (owner: 10EBernhardson)
[08:17:34] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Assign salt grains for labvirt/nova compute [puppet] - 10https://gerrit.wikimedia.org/r/248331 
[08:17:54] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for labvirt/nova compute [puppet] - 10https://gerrit.wikimedia.org/r/248331 (owner: 10Muehlenhoff)
[08:19:37] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Assign salt grains for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/248332 
[08:20:15] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/248332 (owner: 10Muehlenhoff)
[08:20:42] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Assign salt grains for terbium [puppet] - 10https://gerrit.wikimedia.org/r/248333 
[08:20:51] <icinga-wm>	 RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits.
[08:22:59] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for terbium [puppet] - 10https://gerrit.wikimedia.org/r/248333 (owner: 10Muehlenhoff)
[08:28:14] <wikibugs>	 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: maps-test200{2-4} PostgreSQL replication needs rebuilding - https://phabricator.wikimedia.org/T116553#1752416 (10akosiaris) The reason is almost certainly this: https://phabricator.wikimedia.org/P2231  Dropping and recreating the PRIMARY KEY in a 100...
[08:28:37] <grrrit-wm>	 (03PS1) 10ArielGlenn: dumps: one more conf file not updated for new path of dblists [puppet] - 10https://gerrit.wikimedia.org/r/248822 
[08:29:39] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] dumps: one more conf file not updated for new path of dblists [puppet] - 10https://gerrit.wikimedia.org/r/248822 (owner: 10ArielGlenn)
[08:29:41] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Assign salt grains for spark [puppet] - 10https://gerrit.wikimedia.org/r/248334 
[08:30:23] <grrrit-wm>	 (03PS3) 10Muehlenhoff: Assign salt grains for spark [puppet] - 10https://gerrit.wikimedia.org/r/248334 
[08:30:36] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for spark [puppet] - 10https://gerrit.wikimedia.org/r/248334 (owner: 10Muehlenhoff)
[08:44:03] <grrrit-wm>	 (03PS8) 10Giuseppe Lavagetto: Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[08:44:30] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[08:53:48] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Assign salt grains for hue [puppet] - 10https://gerrit.wikimedia.org/r/248335 
[08:55:02] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for hue [puppet] - 10https://gerrit.wikimedia.org/r/248335 (owner: 10Muehlenhoff)
[08:58:28] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Assign salt grains for ci [puppet] - 10https://gerrit.wikimedia.org/r/248336 
[09:00:38] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for ci [puppet] - 10https://gerrit.wikimedia.org/r/248336 (owner: 10Muehlenhoff)
[09:04:34] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Assign salt grains for db analytics/sanitarium [puppet] - 10https://gerrit.wikimedia.org/r/248337 
[09:05:17] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for db analytics/sanitarium [puppet] - 10https://gerrit.wikimedia.org/r/248337 (owner: 10Muehlenhoff)
[09:06:12] <grrrit-wm>	 (03PS1) 10ArielGlenn: dumps: update listing of files for rsync to lastest rsync args [puppet] - 10https://gerrit.wikimedia.org/r/248824 
[09:06:49] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Assign salt grains for pool counters [puppet] - 10https://gerrit.wikimedia.org/r/248338 
[09:07:11] <grrrit-wm>	 (03PS2) 10ArielGlenn: dumps: update listing of files for rsync to lastest rsync args [puppet] - 10https://gerrit.wikimedia.org/r/248824 
[09:07:25] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for pool counters [puppet] - 10https://gerrit.wikimedia.org/r/248338 (owner: 10Muehlenhoff)
[09:08:12] <grrrit-wm>	 (03PS3) 10ArielGlenn: dumps: update listing of files for rsync to lastest rsync args [puppet] - 10https://gerrit.wikimedia.org/r/248824 
[09:08:20] <grrrit-wm>	 (03PS3) 10Muehlenhoff: Move the base::firewall include into the impala role [puppet] - 10https://gerrit.wikimedia.org/r/246221 
[09:09:01] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] dumps: update listing of files for rsync to lastest rsync args [puppet] - 10https://gerrit.wikimedia.org/r/248824 (owner: 10ArielGlenn)
[09:09:22] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Move the base::firewall include into the impala role [puppet] - 10https://gerrit.wikimedia.org/r/246221 (owner: 10Muehlenhoff)
[09:09:36] <grrrit-wm>	 (03PS4) 10Muehlenhoff: Move the base::firewall include into the impala role [puppet] - 10https://gerrit.wikimedia.org/r/246221 
[09:09:44] <grrrit-wm>	 (03CR) 10Muehlenhoff: [V: 032] Move the base::firewall include into the impala role [puppet] - 10https://gerrit.wikimedia.org/r/246221 (owner: 10Muehlenhoff)
[09:14:06] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Move base::firewall into the archiva role [puppet] - 10https://gerrit.wikimedia.org/r/245974 
[09:14:35] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Move base::firewall into the archiva role [puppet] - 10https://gerrit.wikimedia.org/r/245974 (owner: 10Muehlenhoff)
[09:16:49] <jynus>	 !log rebooting and installing jessie on db2060-db2070
[09:16:53] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:20:30] <_joe_>	 !log restarting etcd on conf1001
[09:20:33] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:25:26] <_joe_>	 uhm the etcd cluster is in bad shape, meh
[09:30:10] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] dumps: don't escape commands not run in shell [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/244799 (owner: 10ArielGlenn)
[09:30:39] <_joe_>	 ok etcd cluster is ok again
[09:30:57] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] dumps: unfix a camelcase, imported module not fixed up yet [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/244800 (owner: 10ArielGlenn)
[09:31:06] <grrrit-wm>	 (03CR) 10ArielGlenn: [V: 032] dumps: unfix a camelcase, imported module not fixed up yet [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/244800 (owner: 10ArielGlenn)
[09:32:13] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032 V: 032] dumps; fix another indentation screwup from the pylint [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/244801 (owner: 10ArielGlenn)
[09:36:13] <ori>	 apergos: ping re: https://phabricator.wikimedia.org/T87036
[09:36:32] <ori>	 apergos: sorry, that should have been https://phabricator.wikimedia.org/T94277
[09:37:24] <apergos>	 ori: need to have a backport of the fix referred to in https://phabricator.wikimedia.org/T113932 
[09:37:37] <apergos>	 once those packages are available I can do a test run and then covert over tohhvm
[09:37:48] <apergos>	 they are all running trusty already of course
[09:38:56] <ori>	 apergos: bd808 submitted a backport in sept 30: https://gerrit.wikimedia.org/r/#/c/242773/
[09:39:00] <apergos>	 yes.
[09:39:22] <apergos>	 I mean packages built, I can test them if someone can make them available to me
[09:39:53] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 031] Include base::firewall in the mariadb::labsdb role [puppet] - 10https://gerrit.wikimedia.org/r/245958 (owner: 10Muehlenhoff)
[09:39:57] <apergos>	 ori
[09:40:02] <ori>	 i'll build it right now
[09:40:10] <apergos>	 you will? great!
[09:40:35] <apergos>	 if it works I won't be able to cut over the hosts til the next run (Nov 1) but that's soming up very soon.
[09:40:38] <apergos>	 *coming
[09:40:39] <ori>	 yes. i don't think it's unreasonable to expect ops to do it, though -- it has been traditionally done by ops.
[09:40:52] <apergos>	 joe has done those I believe
[09:41:56] <apergos>	 I don't expect we're going to gain anything from moving these long running maintenance scripts to hhvm, but I understand the desire to have everything running off of one implementation
[09:41:57] <ori>	 joe is doing a bajillion other things, though, and he is not directly responsible for the snapshot migration. i really have to ask you to show ownership here.
[09:42:15] <dcausse>	 !log deleting unused elasticsearch indices in eqiad (T112863)
[09:42:19] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:42:22] <apergos>	 he isn't, and you can ask but I looked at these packages when I first investigated, and I really have no idea about building them
[09:42:44] <ori>	 the way it looks like from my end is that any blocker is the occasion for all progress stalling completely until i nag
[09:42:46] <apergos>	 and I did spend a chunk of time poking around in the repos
[09:43:12] <_joe_>	 apergos: ok lemme update what is on wikitech on building HHVM packages
[09:43:13] <ori>	 there are packaging instructions at https://wikitech.wikimedia.org/wiki/HHVM
[09:43:19] <_joe_>	 it's super easy now with copper
[09:43:33] <_joe_>	 oh it's already there, see
[09:43:50] <ori>	 <apergos> I don't expect we're going to gain anything from moving these long running maintenance scripts to hhvm
[09:43:54] <ori>	 not sure how you reached that conclusion
[09:44:40] <apergos>	 speed wise what wouod we gain?
[09:44:43] <apergos>	 *would
[09:45:21] <ori>	 I don't know what the workload of the snapshot hosts is, and whether they are primarily IO or CPU bound, but a speedup of x2 is very plausible
[09:45:32] <ori>	 the issue with HHVM and CLI invocations is that CLI invocations tend to be short-running, and so HHVM JIT's doesn't have time to pay off
[09:45:51] <ori>	 basically by the time the script terminates HHVM has not even finished analyzing the code, let alone optimize it
[09:45:59] <ori>	 but this has no bearing on long-running jobs
[09:46:05] <ori>	 which benefit tremendously well from HHVM
[09:46:20] <_joe_>	 ori: well if the bytecode gets loaded once I don't think there should be a significant gain, or am I mistaken?
[09:46:27] <apergos>	 that is what I had thought
[09:46:43] <grrrit-wm>	 (03PS2) 10Alex Monk: beta: Add enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248639 
[09:47:04] <ori>	 no, it can recompile in the context of a single invocation
[09:47:45] <_joe_>	 anyways, off to take my meds
[09:47:49] <apergos>	 good luck
[09:48:26] <Nemo_bis>	 stacci bene
[09:49:07] <ori>	 at any rate, privately reaching the conclusion that this task is not worth doing (or not worth prioritizing) is not great, because it doesn't really give anyone a chance to challenge the conclusion and explain
[09:49:16] <apergos>	 no.  I didn't reach that conclusion.
[09:49:39] <apergos>	 as I said above, standardizing on one implementation is a perfectly valid reason
[09:49:57] <apergos>	 and even if I had not thought so, it's still a task in my queue, I'm still wanting to get it off of my queue
[09:50:44] <apergos>	 you don't seem to hear me that I cloned the repo(s) and tried looking at the package structure at the beginning and was overwhelmed
[09:51:14] <ori>	 i believe you, but you didn't communicate that on the task, or ask for help
[09:51:17] <apergos>	 now maybe I should have asked joe or bd808 to please make packages available, as joe had volunteered to do in the past. that is my bad
[09:54:46] <ori>	 i don't mean to chew you out. i just need help. the mental cost of rereading the tasks to reconstruct some notion of where things are and what they need to move forward makes me miserable and unproductive.
[09:56:06] * ori goes a-packagin'.
[09:58:15] <_joe_>	 ori: I plan on converting tin as soon as possible, FYI
[10:08:55] <grrrit-wm>	 (03PS2) 10Ori.livneh: Backport of D2486378: Implement compress.bzip2:// stream wrapper [debs/hhvm] - 10https://gerrit.wikimedia.org/r/242773 (https://phabricator.wikimedia.org/T113932) (owner: 10BryanDavis)
[10:09:04] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032 V: 032] Backport of D2486378: Implement compress.bzip2:// stream wrapper [debs/hhvm] - 10https://gerrit.wikimedia.org/r/242773 (https://phabricator.wikimedia.org/T113932) (owner: 10BryanDavis)
[10:12:21] <icinga-wm>	 PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: puppet fail
[10:13:00] <_joe_>	 what is missing from wikitech is the policy for package release to production
[10:13:07] <_joe_>	 (re HHVM)
[10:22:53] <icinga-wm>	 PROBLEM - puppet last run on db2041 is CRITICAL: CRITICAL: puppet fail
[10:39:12] <icinga-wm>	 RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:45:30] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Assign salt grains for image scalers [puppet] - 10https://gerrit.wikimedia.org/r/248842 
[10:45:32] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Assign salt grains for mw api servers [puppet] - 10https://gerrit.wikimedia.org/r/248843 
[10:45:34] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Assign salt grains for jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/248844 
[10:45:36] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Assign salt grains for application servers [puppet] - 10https://gerrit.wikimedia.org/r/248845 
[10:45:38] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Assign salt grains for librenms [puppet] - 10https://gerrit.wikimedia.org/r/248846 
[10:45:40] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Assign salt grains for graphite [puppet] - 10https://gerrit.wikimedia.org/r/248847 
[10:45:42] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Assign salt grains for openldap [puppet] - 10https://gerrit.wikimedia.org/r/248848 
[10:49:15] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 031] Add DNS entries for ms-be20[1-2][0-6] Bug:T114712 [dns] - 10https://gerrit.wikimedia.org/r/248712 (https://phabricator.wikimedia.org/T114712) (owner: 10Papaul)
[10:51:32] <icinga-wm>	 RECOVERY - puppet last run on db2041 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:53:51] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[10:54:03] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[10:54:21] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: puppetmaster: Move the role into the role module [puppet] - 10https://gerrit.wikimedia.org/r/248850 
[10:55:11] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:56:06] <grrrit-wm>	 (03Abandoned) 10Muehlenhoff: Make db2055 to db2070 as role spare [puppet] - 10https://gerrit.wikimedia.org/r/246823 (owner: 10Muehlenhoff)
[11:04:41] <icinga-wm>	 PROBLEM - Analytics Cassanda CQL query interface on aqs1001 is CRITICAL: Connection timed out
[11:06:21] <icinga-wm>	 RECOVERY - Analytics Cassanda CQL query interface on aqs1001 is OK: TCP OK - 0.006 second response time on port 9042
[11:10:02] <icinga-wm>	 PROBLEM - Restbase root url on aqs1001 is CRITICAL: Connection refused
[11:26:01] <icinga-wm>	 PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 10.34% of data above the critical threshold [100000000.0]
[11:26:21] <icinga-wm>	 RECOVERY - Restbase root url on aqs1001 is OK: HTTP OK: HTTP/1.1 200 - 727 bytes in 0.021 second response time
[11:26:32] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[11:26:52] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[11:27:51] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[11:48:52] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Assign salt grains for mw api servers [puppet] - 10https://gerrit.wikimedia.org/r/248843 
[11:54:10] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for mw api servers [puppet] - 10https://gerrit.wikimedia.org/r/248843 (owner: 10Muehlenhoff)
[12:00:59] <wikibugs>	 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1752738 (10Selsharbaty-WMF) Hi @JohnLewis!  Thank you for the quick response and getting this task done very quickly! I really appreciate your help.  I just need some clar...
[12:01:02] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Assign salt grains for jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/248844 
[12:03:50] <icinga-wm>	 RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[12:05:32] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032] "http://puppet-compiler.wmflabs.org/1074/ says noop, apart from palladium which fails due to new_install not being in the labs/private repo" [puppet] - 10https://gerrit.wikimedia.org/r/248850 (owner: 10Alexandros Kosiaris)
[12:06:37] <wikibugs>	 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1752752 (10JohnLewis) Ah! I see, thanks for pointing this out to me. I copied the full configuration over from the private list to the public list to ensure all members ar...
[12:06:41] <grrrit-wm>	 (03PS2) 10Alexandros Kosiaris: puppetmaster: Move the role into the role module [puppet] - 10https://gerrit.wikimedia.org/r/248850 
[12:07:05] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [V: 032] puppetmaster: Move the role into the role module [puppet] - 10https://gerrit.wikimedia.org/r/248850 (owner: 10Alexandros Kosiaris)
[12:10:40] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/248844 (owner: 10Muehlenhoff)
[12:10:49] <grrrit-wm>	 (03PS3) 10Muehlenhoff: Assign salt grains for jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/248844 
[12:10:56] <grrrit-wm>	 (03CR) 10Muehlenhoff: [V: 032] Assign salt grains for jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/248844 (owner: 10Muehlenhoff)
[12:11:33] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Assign salt grains for application servers [puppet] - 10https://gerrit.wikimedia.org/r/248845 
[12:15:19] <icinga-wm>	 RECOVERY - Restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy
[12:15:29] <icinga-wm>	 RECOVERY - Restbase root url on restbase-test2002 is OK: HTTP OK: HTTP/1.1 200 - 15171 bytes in 0.109 second response time
[12:16:37] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for application servers [puppet] - 10https://gerrit.wikimedia.org/r/248845 (owner: 10Muehlenhoff)
[12:22:16] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Assign salt grains for librenms [puppet] - 10https://gerrit.wikimedia.org/r/248846 
[12:24:47] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for librenms [puppet] - 10https://gerrit.wikimedia.org/r/248846 (owner: 10Muehlenhoff)
[12:27:35] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Assign salt grains for graphite [puppet] - 10https://gerrit.wikimedia.org/r/248847 
[12:28:25] <grrrit-wm>	 (03PS9) 10Giuseppe Lavagetto: Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[12:28:54] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[12:38:30] <grrrit-wm>	 (03CR) 10BBlack: "@ottomata: is the new python code tested at all?" [puppet] - 10https://gerrit.wikimedia.org/r/248067 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata)
[12:38:52] <grrrit-wm>	 (03CR) 10BBlack: "(by that I mean really tested running it somewhere, as opposed to unit tests)" [puppet] - 10https://gerrit.wikimedia.org/r/248067 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata)
[12:39:04] <wikibugs>	 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1752805 (10JohnLewis) This is now done, enjoy the correct list situation :)
[12:41:20] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[12:42:18] <wikibugs>	 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: maps-test200{2-4} PostgreSQL replication needs rebuilding - https://phabricator.wikimedia.org/T116553#1752806 (10akosiaris) p:5Triage>3Unbreak!
[12:42:23] <grrrit-wm>	 (03PS10) 10Giuseppe Lavagetto: Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[12:42:31] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[12:42:56] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[12:43:56] <grrrit-wm>	 (03PS11) 10Giuseppe Lavagetto: Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[12:44:13] <paravoid>	 apergos: speaking of your queue, my snapshot patches are still pending unreviewed :)
[12:44:20] <apergos>	 I know
[12:44:26] <apergos>	 and they are on my today queue
[12:44:26] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[12:44:33] <paravoid>	 well, two of them are reviewed by giuseppe
[12:44:34] * apergos flips over their old-fashioned paper notepad
[12:44:38] <_joe_>	 wtf is wrong with tox?
[12:44:56] <apergos>	 yep, there they are, #5 for today. I am on #4 righ tnow....
[12:46:15] <grrrit-wm>	 (03PS12) 10Giuseppe Lavagetto: Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[12:47:07] <wikibugs>	 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: maps-test200{2-4} PostgreSQL replication needs rebuilding - https://phabricator.wikimedia.org/T116553#1752808 (10akosiaris) 5Open>3Resolved a:3akosiaris After a full reinitialization of the slaves, replication is working once more. I see a coup...
[12:48:33] <grrrit-wm>	 (03PS13) 10Giuseppe Lavagetto: Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[12:49:17] <wikibugs>	 6operations: monitor postgresql replication status - https://phabricator.wikimedia.org/T116580#1752811 (10akosiaris) 3NEW
[12:53:34] <grrrit-wm>	 (03PS1) 10Merlijn van Deen: toollabs: install libsort-fields-perl [puppet] - 10https://gerrit.wikimedia.org/r/248861 (https://phabricator.wikimedia.org/T116579) 
[12:59:28] <wikibugs>	 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: maps-test200{2-4} PostgreSQL replication needs rebuilding - https://phabricator.wikimedia.org/T116553#1752834 (10Yurik) @akosiaris, sorry for the trouble. Just in case I break it again, could you write the sequence of steps that you did to recover it...
[12:59:47] <grrrit-wm>	 (03PS2) 10coren: toollabs: install libsort-fields-perl [puppet] - 10https://gerrit.wikimedia.org/r/248861 (https://phabricator.wikimedia.org/T116579) (owner: 10Merlijn van Deen)
[13:00:04] <jouncebot>	 aude: Dear anthropoid, the time has come. Please deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151026T1300).
[13:03:58] <grrrit-wm>	 (03CR) 10coren: [C: 032] toollabs: install libsort-fields-perl [puppet] - 10https://gerrit.wikimedia.org/r/248861 (https://phabricator.wikimedia.org/T116579) (owner: 10Merlijn van Deen)
[13:04:44] <Coren>	 Gah, merge collision.
[13:05:11] <icinga-wm>	 PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 13.64% of data above the critical threshold [100000000.0]
[13:06:10] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[13:07:13] <Coren>	 moritzm: Okay for me to merge "Assign salt grains for librenms"?
[13:07:49] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[13:08:39] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[13:09:41] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[13:10:19] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[13:16:59] <moritzm>	 Coren: yes, sorry about that
[13:17:19] <Coren>	 moritzm: No worries, it happens to all of us now and then.  :-)
[13:19:01] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:21:00] <Coren>	 Anyone on that restbase thing?
[13:24:49] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[13:25:39] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[13:26:40] <icinga-wm>	 RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[13:26:59] <logmsgbot>	 !log aude@tin Synchronized php-1.27.0-wmf.3/extensions/CirrusSearch: Add justMapping option to updateOneSearchIndexConfig script (duration: 00m 18s)
[13:27:02] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:27:59] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[13:30:50] <aude>	 would help to updat ethe submodule...
[13:32:39] <logmsgbot>	 !log aude@tin Synchronized php-1.27.0-wmf.3/extensions/CirrusSearch: Add justMapping option to updateOneSearchIndexConfig script (updated submodule) (duration: 00m 18s)
[13:32:43] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:36:15] <grrrit-wm>	 (03PS3) 10Muehlenhoff: Assign salt grains for graphite [puppet] - 10https://gerrit.wikimedia.org/r/248847 
[13:37:44] <grrrit-wm>	 (03CR) 10Ottomata: "Yes, had run it on cp1057 for a while, turned if off over the weekend. I just started it back up in a screen there. The test.reqstats.mi" [puppet] - 10https://gerrit.wikimedia.org/r/248067 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata)
[13:37:46] <grrrit-wm>	 (03PS4) 10Muehlenhoff: Assign salt grains for graphite [puppet] - 10https://gerrit.wikimedia.org/r/248847 
[13:38:07] <grrrit-wm>	 (03CR) 10Ottomata: "are being*" [puppet] - 10https://gerrit.wikimedia.org/r/248067 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata)
[13:38:36] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for graphite [puppet] - 10https://gerrit.wikimedia.org/r/248847 (owner: 10Muehlenhoff)
[13:41:10] <grrrit-wm>	 (03CR) 10Alex Monk: "See I0d1ba430" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236979 (https://phabricator.wikimedia.org/T110199) (owner: 10Robmoen)
[13:41:31] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Assign salt grains for openldap [puppet] - 10https://gerrit.wikimedia.org/r/248848 
[13:42:07] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for openldap [puppet] - 10https://gerrit.wikimedia.org/r/248848 (owner: 10Muehlenhoff)
[13:46:00] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[13:47:11] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[13:47:38] <_joe_>	 akosiaris, morebots ^^
[13:47:43] <_joe_>	 err mobrovac 
[13:49:31] <_joe_>	 I see some cassandra troubles probably
[13:49:49] <_joe_>	 servers being detected as up, then down immediately
[13:49:59] <icinga-wm>	 PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[13:50:26] <_joe_>	 WARN  [MessagingService-Outgoing-/10.64.0.123] 2015-10-26 13:49:49,992 OutboundTcpConnection.java:414 - Seed gossip version is -2147483648; will not connect with that version
[13:50:30] <_joe_>	 INFO  [HANDSHAKE-/10.64.0.123] 2015-10-26 13:49:49,993 OutboundTcpConnection.java:494 - Cannot handshake version with /10.64.0.123
[13:53:48] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Add missing Hiera data [puppet] - 10https://gerrit.wikimedia.org/r/248865 
[13:54:40] <Coren>	 _joe_: I was trying to find those logs.  Can you tell me where they are for future ref?
[13:54:59] <Coren>	 _joe_: Also, 1001 is having heating issues; I was about to open a ticket for Chris.
[13:55:55] <mobrovac>	 hm
[13:57:12] <wikibugs>	 6operations, 10ops-eqiad: aqs1001 getting multiple and repeated heat MCEs - https://phabricator.wikimedia.org/T116584#1752956 (10coren) 3NEW a:3Cmjohnson
[13:58:15] <grrrit-wm>	 (03CR) 10Andrew Bogott: "What does the puppet compiler think about this one?" [puppet] - 10https://gerrit.wikimedia.org/r/247209 (owner: 10Muehlenhoff)
[13:59:14] <wikibugs>	 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: maps-test200{2-4} PostgreSQL replication needs rebuilding - https://phabricator.wikimedia.org/T116553#1752966 (10akosiaris) Oh, it is basically reinitializing the slave.  ``` stop postgres mv /srv/postgres/9.4/main/recovery.conf ~/ rm -rf /srv/postgr...
[14:01:48] <mobrovac>	 _joe_: hm, seems to have stabilised, getting all endpoints healthy now when running locally service_checker on aqs1002
[14:01:58] <mobrovac>	 cass logs also say it's up
[14:03:01] <mobrovac>	 uf, aqs1001 cass is marked as down
[14:03:04] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 04-1] "Same problem as with holmium, see http://puppet-compiler.wmflabs.org/1078/" [puppet] - 10https://gerrit.wikimedia.org/r/247209 (owner: 10Muehlenhoff)
[14:04:00] <wikibugs>	 6operations, 7Graphite, 7Monitoring, 5Patch-For-Review: deprecate gdash - https://phabricator.wikimedia.org/T104365#1752970 (10faidon) 5Resolved>3Open Let's keep the task open until we actually remove the static site as well.
[14:04:07] <wikibugs>	 6operations, 7Graphite, 7Monitoring, 5Patch-For-Review: deprecate gdash - https://phabricator.wikimedia.org/T104365#1752972 (10faidon) p:5Normal>3Low
[14:04:16] <grrrit-wm>	 (03PS2) 10Aude: Enable geosearch on test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247837 
[14:04:34] <milimetric>	 _joe_: not that I could help in this situation, but joal and I are the ones who should be looking after aqs
[14:04:53] <milimetric>	 I'll try to learn about the maintenance from mobrovac and take that over
[14:04:55] <grrrit-wm>	 (03CR) 10Aude: [C: 032] Enable geosearch on test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247837 (owner: 10Aude)
[14:05:01] <grrrit-wm>	 (03Merged) 10jenkins-bot: Enable geosearch on test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247837 (owner: 10Aude)
[14:05:05] <wikibugs>	 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1752974 (10Ottomata) I'm still a little confused about how this reqid/id will work?  You are suggesting that it comes from the x-request-id that we w...
[14:05:56] <wikibugs>	 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1752975 (10Ottomata) To avoid possible conflicts, I'd suggest we call this not just `id`.  How about `uuid`?  That's what EventLogging capsule does:...
[14:07:09] <icinga-wm>	 RECOVERY - mysqld processes on db2065 is OK: PROCS OK: 1 process with command name mysqld
[14:07:19] <icinga-wm>	 RECOVERY - mysqld processes on db2067 is OK: PROCS OK: 1 process with command name mysqld
[14:07:22] <wikibugs>	 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1752979 (10Ottomata) Also, this is just a personal preference, but I'd prefer if we had a convention differentiating integer/second based 'timestamps...
[14:07:25] <icinga-wm>	 RECOVERY - mysqld processes on db2070 is OK: PROCS OK: 1 process with command name mysqld
[14:07:38] <paravoid>	 jynus: all of these are paging
[14:07:44] <jynus>	 arg
[14:07:55] <jynus>	 they are all downtimed
[14:08:12] <akosiaris>	 pages that are only the recoveries ?
[14:08:14] <akosiaris>	 nice
[14:09:09] <logmsgbot>	 !log aude@tin Synchronized wmf-config/InitialiseSettings.php: Enable geosearch on test.wikidata (duration: 00m 17s)
[14:09:14] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:09:52] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[14:10:04] <_joe_>	 Coren: heat problems are not real
[14:10:12] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[14:10:22] <_joe_>	 Coren: the log lines were in the cassandra logs in /var/log/cassandra/...
[14:10:55] <Coren>	 _joe_: Ah, the other side of course.  I tried to find where restbase was dumping its own.
[14:11:11] <icinga-wm>	 RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[14:11:59] <grrrit-wm>	 (03PS1) 10ArielGlenn: dumps: fix camelcases in WikiDumps.py (part 1) [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/248866 
[14:12:01] <grrrit-wm>	 (03PS1) 10ArielGlenn: dumps: camelcases in wikiDumps.py (part 2) [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/248867 
[14:14:20] <grrrit-wm>	 (03PS1) 10Aude: Enable GeoData extension on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248868 (https://phabricator.wikimedia.org/T115482) 
[14:15:13] <aude>	 jynus: _joe_ is everything ok now?
[14:15:21] * aude wants to deploy
[14:15:40] <jynus>	 if you refer to mysql, there is nothing wrong
[14:15:45] <aude>	 ok
[14:16:17] <grrrit-wm>	 (03CR) 10Aude: [C: 032] Enable GeoData extension on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248868 (https://phabricator.wikimedia.org/T115482) (owner: 10Aude)
[14:16:23] <grrrit-wm>	 (03Merged) 10jenkins-bot: Enable GeoData extension on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248868 (https://phabricator.wikimedia.org/T115482) (owner: 10Aude)
[14:16:49] <jynus>	 so, in adition to downtime, that was aleady there, I have disabled notifications on those hosts
[14:18:56] <logmsgbot>	 !log aude@tin Synchronized wmf-config/InitialiseSettings.php: Enable GeoData on Wikidata (duration: 00m 17s)
[14:19:00] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:19:08] <wikibugs>	 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1753027 (10Ottomata) Also, over at [[ https://phabricator.wikimedia.org/T88459#1694274  | T88459#1694274 ]], I commented:  If we adopt a convention o...
[14:35:52] <icinga-wm>	 PROBLEM - puppet last run on wtp1023 is CRITICAL: CRITICAL: Puppet has 1 failures
[14:38:21] <grrrit-wm>	 (03PS2) 10ArielGlenn: Remove class role::dataset::publicdirs, noop [puppet] - 10https://gerrit.wikimedia.org/r/246824 (owner: 10Faidon Liambotis)
[14:39:31] <milimetric>	 jynus: we have a replication problem from m4-master to analytics-store (for eventlogging data)
[14:39:32] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] Remove class role::dataset::publicdirs, noop [puppet] - 10https://gerrit.wikimedia.org/r/246824 (owner: 10Faidon Liambotis)
[14:39:44] <milimetric>	 jynus: nothing has been replicated since October 22nd for MobileWikiAppSearch_10641988
[14:39:54] <milimetric>	 (that's one example we know of, but there may be others)
[14:40:04] <milimetric>	 letting you know here, I can file a ticket if you'd like
[14:40:15] <jynus>	 let me check
[14:43:29] <jynus>	 I see m4 lagging behind, but not broken
[14:43:59] <jynus>	 but only in the last 24 hours, not since the 22nd
[14:45:10] <grrrit-wm>	 (03PS1) 10EBernhardson: Send CirrusSearch writes to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248871 
[14:45:25] <jynus>	 so, a 2 hour lag
[14:45:36] <jynus>	 less than 1 hour, actually
[14:45:38] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Send CirrusSearch writes to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248871 (owner: 10EBernhardson)
[14:46:30] <wikibugs>	 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: maps-test200{2-4} PostgreSQL replication needs rebuilding - https://phabricator.wikimedia.org/T116553#1753065 (10Yurik) @akosiaris, thx, but i suspect we won't be able to do most of these steps due to perm?
[14:47:02] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 3 below the confidence bounds
[14:47:41] <jynus>	 "INSERT IGNORE INTO `EchoInteraction_5782287`", done 45 minutes ago
[14:48:38] <jynus>	 didn't nuria stopped the imports for a while?, milimetric 
[14:48:46] <wikibugs>	 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: maps-test200{2-4} PostgreSQL replication needs rebuilding - https://phabricator.wikimedia.org/T116553#1753067 (10akosiaris) >>! In T116553#1753065, @Yurik wrote: > @akosiaris, thx, but i suspect we won't be able to do most of these steps due to perm?...
[14:48:52] <nuria>	 jynus: no, we stopped backfilling
[14:48:59] <nuria>	 jynus: but not the service
[14:49:20] <jynus>	 so, are these the backfillings ?
[14:49:21] <grrrit-wm>	 (03PS1) 10BBlack: Revert "remove cp1059 from ipsec hostlists - T114870" [puppet] - 10https://gerrit.wikimedia.org/r/248874 
[14:49:40] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] Revert "remove cp1059 from ipsec hostlists - T114870" [puppet] - 10https://gerrit.wikimedia.org/r/248874 (owner: 10BBlack)
[14:49:49] <milimetric>	 jynus: there's some confusion
[14:49:55] <milimetric>	 analytics store is lagging behind m4
[14:50:03] <jynus>	 yes
[14:50:04] <milimetric>	 m4 is not lagging as far as I can tell
[14:50:16] <jynus>	 what is the difference?
[14:50:53] <milimetric>	 EL consumer writes to m4 which then replicates to analytics-store
[14:50:55] <jynus>	 m4 replication is right now 2500 seconds behind
[14:51:02] <jynus>	 its master
[14:51:21] <jynus>	 its master is m4-master
[14:51:23] <milimetric>	 sorry, maybe we're saying the same thing and I'm bad at the terminology
[14:51:30] <jynus>	 it is ok
[14:51:35] <jynus>	 there is a slave delay
[14:51:44] <milimetric>	 my point is, regardless of what the replag monitoring is telling us, executing this gives very different results:
[14:51:53] <milimetric>	 select max(timestamp) from log.MobileWikiAppSearch_10641988;
[14:51:54] <jynus>	 my point is, maybe it is due to backlog that has been imported recently?
[14:52:09] <jynus>	 so a temporary thing
[14:52:12] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 9.09% of data above the critical threshold [500.0]
[14:52:16] <jynus>	 I see two spikes
[14:52:18] <milimetric>	 jynus: we only started backfilling this morning, this lag is back to October 22nd (4 days)
[14:52:25] <jynus>	 no
[14:52:34] <jynus>	 that I can tell you is is not true
[14:52:41] <jynus>	 https://tendril.wikimedia.org/host/view/dbstore2002.codfw.wmnet/3306
[14:52:43] <jynus>	 look
[14:52:47] <milimetric>	 right, i did look at that
[14:52:51] <jynus>	 at the Replication graph
[14:53:00] <milimetric>	 so then how do we explain the absence of data past 20151022145533 on analytics-store?
[14:53:04] <jynus>	 there was lag on the 22
[14:53:07] <jynus>	 that I caused
[14:53:19] <jynus>	 then it recovered
[14:53:20] <milimetric>	 yep, when you were helping mforns
[14:53:28] <jynus>	 and now there is lag again
[14:53:31] <jynus>	 due to large inserts
[14:53:35] <milimetric>	 but it seems to me at that point it stopped replicating a bunch of tables
[14:53:45] <milimetric>	 the recent lag is unrelated to the problem I'm talking about
[14:54:04] <milimetric>	 20151022145533 == select max(timestamp) from log.MobileWikiAppSearch_10641988;
[14:54:26] <wikibugs>	 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, and 2 others: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1753077 (10Yurik) 5Resolved>3Open Reopening - we should be able to recover from the replication failures.  T116553#1752966 outlines steps that we s...
[14:54:46] <jynus>	 ok, let's talk on #wikimedia-databases to not flood this channel, as it is a very specific thing
[14:54:55] <milimetric>	 sorry -ops, good point, brt
[14:56:09] <wikibugs>	 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: maps-test200{2-4} PostgreSQL replication needs rebuilding - https://phabricator.wikimedia.org/T116553#1752146 (10Yurik) Re-opened sudo task with extra info T106637.
[14:56:50] <grrrit-wm>	 (03PS2) 10ArielGlenn: dataset: move system user creation to module [puppet] - 10https://gerrit.wikimedia.org/r/246825 (owner: 10Faidon Liambotis)
[14:57:16] <Coren>	 _joe_: It's still an issue even if the MCEs are tripped by false positives since that makes the kernel throttle the "overheating" CPUs
[14:57:51] <icinga-wm>	 RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[14:58:10] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] dataset: move system user creation to module [puppet] - 10https://gerrit.wikimedia.org/r/246825 (owner: 10Faidon Liambotis)
[14:58:55] <bblack>	 !log repooling cp1059 varnish mobile frontend (wiped)
[14:58:58] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:00:05] <jouncebot>	 anomie ostriches thcipriani marktraceur: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151026T1500). Please do the needful.
[15:00:05] <jouncebot>	 ebernhardson Glaisher: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[15:00:12] <Glaisher>	 here
[15:01:01] <icinga-wm>	 RECOVERY - puppet last run on wtp1023 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[15:01:18] <bblack>	 ottomata: ping re the new reqstats stuff and diamond? random cache hosts I look at now have 2x diamond processes consuming as much CPU as varnish itself....
[15:01:23] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:01:34] <ebernhardson>	 here
[15:01:40] <bblack>	 ottomata: (with start times on Friday)
[15:01:42] <ebernhardson>	 i have pretty much all the patches, i can just ship these out
[15:02:20] <ebernhardson>	 Glaisher: you have a -1 from Krenair about a dependency. It's been resolved?
[15:02:33] <ebernhardson>	 looks like its proofreadpage, which looks to be out
[15:02:40] <Glaisher>	 ebernhardson: yeah
[15:02:54] <grrrit-wm>	 (03CR) 10EBernhardson: [C: 032] "proofread page has been deployed now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240640 (https://phabricator.wikimedia.org/T54709) (owner: 10Glaisher)
[15:03:20] <grrrit-wm>	 (03Merged) 10jenkins-bot: Remove Page and Index namespaces from $wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240640 (https://phabricator.wikimedia.org/T54709) (owner: 10Glaisher)
[15:04:20] <logmsgbot>	 !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Remove Predundant Page and Index namespaces from $wgContentNamespaces (duration: 00m 17s)
[15:04:22] <ebernhardson>	 Glaisher: ^
[15:04:24] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:06:42] <logmsgbot>	 !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents/WikimediaEvents.php: Update satisfaction schema id due to bad varnish caching of old id (duration: 00m 17s)
[15:06:46] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:07:00] <Glaisher>	 ebernhardson: it would be a no-op and nothing seems to have broken on the wikis
[15:07:23] <ebernhardson>	 Glaisher: sounds right, thanks for checking
[15:08:21] <bblack>	 Coren: what host has the heat issue / MCE logs?
[15:08:22] <grrrit-wm>	 (03CR) 10ArielGlenn: "The reason these classes weren't inlined is that we have a history of needing to move the jobs back and forth between the primary and seco" [puppet] - 10https://gerrit.wikimedia.org/r/246826 (owner: 10Faidon Liambotis)
[15:08:34] <Coren>	 bblack: aqs1001
[15:08:40] <bblack>	 Coren: I used to think they weren't real either, but we ended up fixing several eqiad cache hosts and it was real
[15:09:19] <Coren>	 bblack: I have no opinion on the actual thermal issue; that needs feet on the ground.  :-)
[15:09:22] <grrrit-wm>	 (03PS1) 10Rush: openstack: cleanup up old repo setups [puppet] - 10https://gerrit.wikimedia.org/r/248882 
[15:10:13] <bblack>	 well from a software perspective, the temp sensor is showing 90C, which is high for most of our hardware, and the race of MCEs corresponds as well
[15:10:26] * Coren nods.
[15:10:31] <Coren>	 Hence the phab task.  :-)
[15:10:40] <bblack>	 where?
[15:11:23] <bblack>	 ah found it
[15:13:11] <wikibugs>	 6operations, 10ops-eqiad: aqs1001 getting multiple and repeated heat MCEs - https://phabricator.wikimedia.org/T116584#1753155 (10BBlack) Note this is showing ~90-91C on the software read of the temp sensors as well:  ``` root@aqs1001:~# cat /sys/class/thermal/thermal_zone*/temp 91000 90000 ```  This seems simi...
[15:13:11] <logmsgbot>	 !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents/modules/: Move search schema from cirrussearch -> wikimediavents (duration: 00m 17s)
[15:13:15] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:13:46] <logmsgbot>	 !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents/WikimediaEvents.php: Move search schema from cirrussearch -> wikimediavents (duration: 00m 19s)
[15:13:49] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:14:09] <Vito>	 something went wrong
[15:14:31] <Steinsplitter>	 MediaWiki internal error.
[15:14:31] <Steinsplitter>	 Exception caught inside exception handler.
[15:14:31] <Steinsplitter>	 Set $wgShowExceptionDetails = true; at the bottom of LocalSettings.php to show detailed debugging information.
[15:14:34] <Bsadowski1>	 "MediaWiki internal error.
[15:14:34] <Bsadowski1>	 Exception caught inside exception handler."
[15:14:36] <Bsadowski1>	 yep
[15:14:45] <Steinsplitter>	 bblack^^
[15:15:04] <Bsadowski1>	 Went to edit a page and got that
[15:15:14] <ShakespeareFan00>	 You doing anything right now?
[15:15:18] <bd808>	 ebernhardson: roll back? Looks really broken
[15:15:20] <ShakespeareFan00>	 "MediaWiki internal error.
[15:15:22] <ShakespeareFan00>	 Exception caught inside exception handler.
[15:15:23] <Glaisher>	 ebernhardson: ^
[15:15:24] <ShakespeareFan00>	 Set $wgShowExceptionDetails = true; at the bottom of LocalSettings.php to show detailed debugging information."
[15:15:25] <akosiaris>	 ebernhardson: ^^^
[15:15:25] <Bsadowski1>	 Refreshed still happening
[15:15:25] <edsanders>	 seeing on en.wiki and mediawiki.org
[15:15:26] <akosiaris>	 lol
[15:15:28] <akosiaris>	 everyone ...
[15:15:31] <icinga-wm>	 PROBLEM - HHVM rendering on mw1033 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.080 second response time
[15:15:31] <icinga-wm>	 PROBLEM - HHVM rendering on mw1052 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.072 second response time
[15:15:31] <icinga-wm>	 PROBLEM - HHVM rendering on mw1206 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.116 second response time
[15:15:31] <icinga-wm>	 PROBLEM - HHVM rendering on mw1095 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.080 second response time
[15:15:31] <icinga-wm>	 PROBLEM - HHVM rendering on mw2094 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.151 second response time
[15:15:32] <icinga-wm>	 PROBLEM - HHVM rendering on mw2075 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.152 second response time
[15:15:32] <icinga-wm>	 PROBLEM - HHVM rendering on mw1107 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.082 second response time
[15:15:32] <Bsadowski1>	 +comma
[15:15:32] <icinga-wm>	 PROBLEM - HHVM rendering on mw1097 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.074 second response time
[15:15:32] <icinga-wm>	 PROBLEM - HHVM rendering on mw2185 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.146 second response time
[15:15:33] <icinga-wm>	 PROBLEM - HHVM rendering on mw2049 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.142 second response time
[15:15:33] <icinga-wm>	 PROBLEM - HHVM rendering on mw1246 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.045 second response time
[15:15:45] <akosiaris>	 uh oh
[15:15:45] <icinga-wm>	 PROBLEM - HHVM rendering on mw1057 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.083 second response time
[15:15:45] <icinga-wm>	 PROBLEM - HHVM rendering on mw1106 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.091 second response time
[15:15:47] <icinga-wm>	 PROBLEM - HHVM rendering on mw1141 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.075 second response time
[15:15:47] <icinga-wm>	 PROBLEM - HHVM rendering on mw1235 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.113 second response time
[15:15:47] <icinga-wm>	 PROBLEM - HHVM rendering on mw1039 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.078 second response time
[15:15:47] <icinga-wm>	 PROBLEM - HHVM rendering on mw2137 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.144 second response time
[15:15:48] <icinga-wm>	 PROBLEM - HHVM rendering on mw2097 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.146 second response time
[15:15:48] <icinga-wm>	 PROBLEM - HHVM rendering on mw2101 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.141 second response time
[15:15:49] <icinga-wm>	 PROBLEM - HHVM rendering on mw2066 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.144 second response time
[15:15:49] <icinga-wm>	 PROBLEM - HHVM rendering on mw1193 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.083 second response time
[15:15:52] <icinga-wm>	 PROBLEM - HHVM rendering on mw1211 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.062 second response time
[15:15:52] <icinga-wm>	 PROBLEM - HHVM rendering on mw1172 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.072 second response time
[15:15:52] <icinga-wm>	 PROBLEM - HHVM rendering on mw1188 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.067 second response time
[15:15:52] <icinga-wm>	 PROBLEM - HHVM rendering on mw2112 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.149 second response time
[15:15:52] <icinga-wm>	 PROBLEM - HHVM rendering on mw2043 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.141 second response time
[15:15:52] <icinga-wm>	 PROBLEM - HHVM rendering on mw2145 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.129 second response time
[15:15:53] <icinga-wm>	 PROBLEM - HHVM rendering on mw2167 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 593 bytes in 0.147 second response time
[15:16:13] <ShakespeareFan00>	 "Okay... Who Brought the dog?" ....
[15:16:16] <ShakespeareFan00>	 ;)
[15:16:30] <Vito>	 https://phabricator.wikimedia.org/T116593 <-- if someone wants to add some "the world is burning" :p
[15:16:32] <aude>	 omg
[15:16:43] <wikibugs>	 6operations, 10ops-eqiad, 10Traffic, 5Patch-For-Review: eqiad: investigate thermal issues with some cp10xx machines - https://phabricator.wikimedia.org/T103226#1753178 (10Cmjohnson) I do have thermal paste on-site.   Let me know when you want to schedule downtime on each of these.
[15:17:09] <Nemo_bis>	 tsk tsk I was first
[15:17:15] <Vito>	 lol
[15:17:23] <Steinsplitter>	 :-D
[15:17:55] <bblack>	 is rollback in progress?
[15:18:02] <anomie>	 MWException from line 331 of /srv/mediawiki/php-1.27.0-wmf.3/includes/resourceloader/ResourceLoader.php: ResourceLoader duplicate registration error. Another module has already been registered as schema.Search
[15:18:15] <_joe_>	 what the hell is happening?
[15:18:16] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1080 bytes in 0.603 second response time
[15:18:17] <aude>	 ebernhardson: ?
[15:18:23] <ShakespeareFan00>	 _joe_: It broke
[15:18:24] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1009 bytes in 0.077 second response time
[15:18:25] <ebernhardson>	 rolling back, but they were just js changes
[15:18:26] <bblack>	 we have pretty clear timing coincidence with the deploy
[15:18:28] <_joe_>	 who deployed what?
[15:18:37] <_joe_>	 ROLL BACK FFS
[15:18:41] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 on mobile-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1041 bytes in 0.569 second response time
[15:18:43] <jynus>	 is it all LVS?
[15:18:47] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1012 bytes in 0.447 second response time
[15:18:53] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1051 bytes in 0.464 second response time
[15:18:59] <wikibugs>	 6operations: MediaWiki internal error. - https://phabricator.wikimedia.org/T116596#1753247 (10zhuyifei1999) 3NEW
[15:18:59] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1080 bytes in 0.488 second response time
[15:19:01] <ShakespeareFan00>	 is the thermal issue and the complete downing of Wikimedia related?)
[15:19:04] <bblack>	 LVS is secondary
[15:19:05] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1050 bytes in 0.588 second response time
[15:19:11] <bblack>	 thermal issue is unrelated
[15:19:15] <bblack>	 the problem is the code deploy
[15:19:16] <wikibugs>	 6operations, 10Wikimedia-General-or-Unknown: Exception caught inside exception handler - https://phabricator.wikimedia.org/T116593#1753255 (10Glaisher) Caused by https://gerrit.wikimedia.org/r/248877 rollback in progress  ``` MWException from line 331 of /srv/mediawiki/php-1.27.0-wmf.3/includes/resourceloader/...
[15:19:17] <chasemp>	 _joe_: ebernhardson is rolling back now, it started right after a sync
[15:19:24] <bblack>	 15:13 < logmsgbot> !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents/WikimediaEvents.php
[15:19:25] <wikibugs>	 6operations: MediaWiki internal error. - https://phabricator.wikimedia.org/T116596#1753257 (10JohnLewis)
[15:19:27] <bblack>	 ^ and related
[15:19:27] <wikibugs>	 6operations, 10Wikimedia-General-or-Unknown: Exception caught inside exception handler - https://phabricator.wikimedia.org/T116593#1753258 (10JohnLewis)
[15:19:30] <ottomata>	 bblack, huh.  well, the diamond stuff doesn't really work.  it was working for several days fine, then certain processes started segfauting
[15:19:33] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 on text-lb.codfw.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1080 bytes in 0.279 second response time
[15:19:34] <bblack>	 not now
[15:19:37] <ottomata>	 ok
[15:19:40] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 970 bytes in 0.088 second response time
[15:19:40] <wikibugs>	 6operations: MediaWiki internal error. - https://phabricator.wikimedia.org/T116596#1753260 (10zhuyifei1999)
[15:19:42] <wikibugs>	 6operations, 10Wikimedia-General-or-Unknown: Exception caught inside exception handler - https://phabricator.wikimedia.org/T116593#1753261 (10zhuyifei1999)
[15:19:46] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 on mobile-lb.codfw.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1042 bytes in 0.272 second response time
[15:19:55] <ottomata>	 bblack, will just tell you:  i'm going to submit a separate patch to disable all of those diamond collectors.
[15:20:00] <bblack>	 ok thanks
[15:20:07] <paravoid>	 what is taking so long?
[15:20:10] <_joe_>	 chasemp: is it rolled back?
[15:20:14] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on api.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50994 bytes in 0.577 second response time
[15:20:17] <akosiaris>	 not yet
[15:20:17] <ebernhardson>	 still working, its multiple patches
[15:20:21] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1051 bytes in 0.270 second response time
[15:20:26] <wikibugs>	 6operations, 6Release-Engineering-Team, 10Wikimedia-General-or-Unknown: Exception caught inside exception handler - https://phabricator.wikimedia.org/T116593#1753268 (10QuimGil)
[15:20:29] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 on mobile-lb.codfw.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 8844 bytes in 0.245 second response time
[15:20:35] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:20:55] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 8839 bytes in 0.378 second response time
[15:20:57] <chasemp>	 _joe_: paravoid: see it's multiple patches ^^ https://gerrit.wikimedia.org/r/#/q/owner:%22EBernhardson+%253Cebernhardson%2540wikimedia.org%253E%22,n,z
[15:21:02] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 557 bytes in 0.070 second response time
[15:21:08] <logmsgbot>	 !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents: rollback (duration: 00m 18s)
[15:21:18] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:21:39] <_joe_>	 chasemp: ok, the best way to do this usually could be to just roll back on tin (git reset --hard HEAD~N)
[15:21:49] <grrrit-wm>	 (03PS1) 10Ottomata: Disable all diamond varnishreqstats collectors [puppet] - 10https://gerrit.wikimedia.org/r/248888 (https://phabricator.wikimedia.org/T83580) 
[15:21:50] <icinga-wm>	 RECOVERY - HHVM rendering on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 69551 bytes in 0.101 second response time
[15:21:50] <icinga-wm>	 RECOVERY - HHVM rendering on mw1053 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.125 second response time
[15:21:50] <icinga-wm>	 RECOVERY - HHVM rendering on mw1041 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.125 second response time
[15:21:50] <icinga-wm>	 RECOVERY - HHVM rendering on mw2175 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.260 second response time
[15:21:50] <icinga-wm>	 RECOVERY - HHVM rendering on mw2111 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.769 second response time
[15:21:51] <icinga-wm>	 RECOVERY - HHVM rendering on mw2076 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.743 second response time
[15:21:51] <icinga-wm>	 RECOVERY - HHVM rendering on mw1201 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.201 second response time
[15:21:51] <icinga-wm>	 RECOVERY - HHVM rendering on mw2130 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.869 second response time
[15:21:51] <icinga-wm>	 RECOVERY - HHVM rendering on mw1128 is OK: HTTP OK: HTTP/1.1 200 OK - 69553 bytes in 1.307 second response time
[15:21:52] <icinga-wm>	 RECOVERY - HHVM rendering on mw1143 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.302 second response time
[15:21:53] <icinga-wm>	 RECOVERY - HHVM rendering on mw2174 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.737 second response time
[15:21:53] <icinga-wm>	 RECOVERY - HHVM rendering on mw2205 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.320 second response time
[15:21:53] <_joe_>	 and then merge the changes
[15:21:54] <ShakespeareFan00>	 Peerhaps it's time code deployments had a sandbox/beta server first?
[15:22:00] <_joe_>	 heh
[15:22:03] <wikibugs>	 6operations, 6Release-Engineering-Team, 10Wikimedia-General-or-Unknown: Exception caught inside exception handler - https://phabricator.wikimedia.org/T116593#1753305 (10Glaisher)
[15:22:06] <icinga-wm>	 RECOVERY - HHVM rendering on mw1038 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.137 second response time
[15:22:06] <icinga-wm>	 RECOVERY - HHVM rendering on mw1112 is OK: HTTP OK: HTTP/1.1 200 OK - 69551 bytes in 0.118 second response time
[15:22:07] <icinga-wm>	 RECOVERY - HHVM rendering on mw1071 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.143 second response time
[15:22:07] <icinga-wm>	 RECOVERY - HHVM rendering on mw1070 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.153 second response time
[15:22:07] <icinga-wm>	 RECOVERY - HHVM rendering on mw1019 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.151 second response time
[15:22:07] <icinga-wm>	 RECOVERY - HHVM rendering on mw1101 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.145 second response time
[15:22:07] <icinga-wm>	 RECOVERY - HHVM rendering on mw1096 is OK: HTTP OK: HTTP/1.1 200 OK - 69552 bytes in 0.127 second response time
[15:22:09] <JohnFLewis>	 ShakespeareFan00: funnily enough they do
[15:22:19] <ebernhardson>	 and this code has been deployed there since last week
[15:22:19] <akosiaris>	 ok recoveries started
[15:22:20] <_joe_>	 JohnFLewis: no?
[15:22:26] <ShakespeareFan00>	 So technically deploys shouldn't cause meltdowns?
[15:22:30] <gilles>	 back to normal for me
[15:22:34] <_joe_>	 yup recoveries
[15:22:49] <chasemp>	 looks better now yes, and was def from the deploy
[15:22:50] <JohnFLewis>	 _joe_: beta cluster should be said thing
[15:22:53] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032] Disable all diamond varnishreqstats collectors [puppet] - 10https://gerrit.wikimedia.org/r/248888 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata)
[15:22:54] <_joe_>	 who's the lucky one who has to write the incident docks?
[15:22:57] <ori>	 ebernhardson: next time don't bother with git
[15:23:03] <paravoid>	 [Exception MWException] (/srv/mediawiki/php-1.27.0-wmf.3/includes/resourceloader/ResourceLoader.php:331) ResourceLoader duplicate registration error. Another module has already been registered as schema.Search
[15:23:03] <wikibugs>	 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1753319 (10mobrovac) >>! In T116247#1752974, @Ottomata wrote: > I'm still a little confused about how this reqid/id will work?  You are suggesting th...
[15:23:05] <ori>	 don't bother with patches, i mean
[15:23:07] <paravoid>	 fwiw
[15:23:22] <ori>	 revert on tin, sync, then worry about git
[15:23:28] <wikibugs>	 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1753324 (10mobrovac)
[15:23:37] <_joe_>	 ori: my point too :)
[15:23:39] <ebernhardson>	 paravoid: interesting, ok. That kind of error should probably just log and not fatal the whole site
[15:23:40] <jynus>	 started at 15:13
[15:23:46] <wikibugs>	 6operations, 10ops-eqiad, 10Traffic, 5Patch-For-Review: eqiad: investigate thermal issues with some cp10xx machines - https://phabricator.wikimedia.org/T103226#1753326 (10BBlack) @cmjohnson above I think meant for T116584
[15:23:57] <paravoid>	 ebernhardson: actionable for the postmortem that you'll write :)
[15:24:04] <greg-g>	 :)
[15:24:13] <greg-g>	 not a fun way to open my laptop monday morning :)
[15:24:19] <ebernhardson>	 paravoid: indeed
[15:24:19] <_joe_>	 paravoid: you beat me to that joke :P
[15:24:22] <Trizek>	 :D
[15:24:24] <bblack>	 ah there's the root cause, greg-g opened his laptop
[15:24:30] <greg-g>	 bblack: :P
[15:24:38] <jynus>	 beware of logging issues now
[15:24:40] <_joe_>	 yup, we don't trust neckbeards
[15:25:00] <jynus>	 that is 1 million errors/s
[15:25:01] <paravoid>	 as is the fact that this wasn't caught in QA
[15:25:26] <icinga-wm>	 RECOVERY - HHVM rendering on mw1057 is OK: HTTP OK: HTTP/1.1 200 OK - 69556 bytes in 0.133 second response time
[15:25:26] <icinga-wm>	 RECOVERY - HHVM rendering on mw2156 is OK: HTTP OK: HTTP/1.1 200 OK - 69555 bytes in 0.250 second response time
[15:25:27] <icinga-wm>	 RECOVERY - HHVM rendering on mw1106 is OK: HTTP OK: HTTP/1.1 200 OK - 69556 bytes in 0.130 second response time
[15:25:27] <icinga-wm>	 RECOVERY - HHVM rendering on mw1092 is OK: HTTP OK: HTTP/1.1 200 OK - 69556 bytes in 0.131 second response time
[15:25:28] <icinga-wm>	 RECOVERY - HHVM rendering on mw1039 is OK: HTTP OK: HTTP/1.1 200 OK - 69556 bytes in 0.121 second response time
[15:25:28] <icinga-wm>	 RECOVERY - HHVM rendering on mw1141 is OK: HTTP OK: HTTP/1.1 200 OK - 69556 bytes in 0.350 second response time
[15:25:29] <icinga-wm>	 RECOVERY - HHVM rendering on mw1235 is OK: HTTP OK: HTTP/1.1 200 OK - 69556 bytes in 0.368 second response time
[15:25:29] <icinga-wm>	 RECOVERY - LVS HTTPS IPv4 on mobile-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 10309 bytes in 0.522 second response time
[15:25:30] <icinga-wm>	 RECOVERY - HHVM rendering on mw2137 is OK: HTTP OK: HTTP/1.1 200 OK - 69556 bytes in 0.269 second response time
[15:25:30] <icinga-wm>	 RECOVERY - HHVM rendering on mw2101 is OK: HTTP OK: HTTP/1.1 200 OK - 69556 bytes in 0.262 second response time
[15:25:31] <icinga-wm>	 RECOVERY - HHVM rendering on mw2097 is OK: HTTP OK: HTTP/1.1 200 OK - 69556 bytes in 0.297 second response time
[15:25:31] <icinga-wm>	 RECOVERY - HHVM rendering on mw2066 is OK: HTTP OK: HTTP/1.1 200 OK - 69556 bytes in 0.273 second response time
[15:25:31] <_joe_>	 jynus: right now? or just before?
[15:25:41] <jynus>	 before
[15:25:50] <ebernhardson>	 paravoid: it looks like the issue with QA was order of patches applied
[15:25:53] <jynus>	 althoug there is some lag on monitoring, I suppose related to this
[15:25:53] <_joe_>	 icinga-wm getting kicked is kind of not so good
[15:26:00] <ebernhardson>	 paravoid: the code overall works, but its spread across multiple repositories
[15:26:18] <paravoid>	 _joe_: better than flooding the channel
[15:26:20] <_joe_>	 jynus: yup probably related
[15:26:40] <paravoid>	 ebernhardson: that's still something to learn from and protect against in the future though
[15:26:42] <_joe_>	 paravoid: heh, if it just stays out for 1 minute, fair enough
[15:26:43] <jynus>	 finished at 15:23, that is 10 minutes of outage
[15:27:36] <jynus>	 maybe 15:20 + tail
[15:28:02] <wikibugs>	 6operations, 6Labs, 10Labs-Infrastructure, 7Monitoring, 3labs-sprint-118: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1753375 (10Andrew)
[15:28:06] <wikibugs>	 6operations, 10ops-eqiad, 5Patch-For-Review: cp1059 has network issues - https://phabricator.wikimedia.org/T114870#1753377 (10BBlack) 5Open>3Resolved cp1059 was stable for 6 days in icinga, seems fixed.  Repooled and undowntimed (with cleared caches just in case).
[15:29:54] <wikibugs>	 6operations, 10ops-eqiad: aqs1001 getting multiple and repeated heat MCEs - https://phabricator.wikimedia.org/T116584#1753383 (10Cmjohnson) I have thermal pate on-site. Let me know when you would like to schedule downtime to try the fix.  Chris
[15:30:11] <wikibugs>	 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure, 3labs-sprint-118: Rack/Setpup labvirt1010 and 1011 - https://phabricator.wikimedia.org/T116019#1753384 (10Andrew)
[15:30:28] <wikibugs>	 6operations, 10ops-eqiad: Decommission sodium - https://phabricator.wikimedia.org/T110142#1753385 (10Cmjohnson) removed switch information
[15:30:59] <logmsgbot>	 !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/CirrusSearch: Undeploy eventlogging search schema from CirrusSearch (duration: 00m 18s)
[15:31:03] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:33:27] <wikibugs>	 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1753398 (10Ottomata) > I don't see a conflicting problem with id (even though id is a JSONSchema keyword, but it relates to the schema, not its prope...
[15:33:58] <wikibugs>	 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1753399 (10Eevans) >>! In T116247#1749452, @Ottomata wrote: > Right, but how would you do this in say, Hive?  Or in bash?  In bash:  ``` $ sudo apt-g...
[15:34:43] <wikibugs>	 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1753400 (10Ottomata) > Manual schema versions. We could increase the schema version every time we change something in the schema. Easy to achieve but...
[15:36:17] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[15:36:29] <_joe_>	 ori: ^^ congratulations :)
[15:36:40] <ori>	 \o/
[15:37:37] <wikibugs>	 6operations, 6Analytics-Backlog, 10Wikimedia-Mailing-lists: Requests to lists.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116429#1753431 (10Nuria) In order to get this requests in hadoop this domain needs to be fronted by varnish, by looking through pu...
[15:37:54] <icinga-wm>	 PROBLEM - puppet last run on mw1067 is CRITICAL: CRITICAL: Puppet has 1 failures
[15:38:26] <_joe_>	 ori: once I fix a small glitch with idleconnectionmonitor I think we can cut out a new package and start using etcd for reals here
[15:38:54] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:42:26] <grrrit-wm>	 (03Merged) 10jenkins-bot: Add EtcdConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/225649 (owner: 10Ori.livneh)
[15:43:41] <logmsgbot>	 !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents/modules: Re-deploy WME changes after deploying necessary CirrusSearch change first (duration: 00m 17s)
[15:43:44] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:44:15] <greg-g>	 site is still up, this time, ebernhardson :)
[15:44:39] <ebernhardson>	 greg-g: that was only the js, next one is what broke it before :P
[15:44:47] <greg-g>	 oh
[15:44:52] <greg-g>	 here we go...
[15:44:58] <logmsgbot>	 !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents/WikimediaEvents.php: Re-deploy WME changes after deploying necessary CirrusSearch change first (duration: 00m 18s)
[15:45:02] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:45:12] * greg-g nods
[15:45:34] <ebernhardson>	 still looks sane
[15:45:53] <greg-g>	 yeah
[15:51:09] <wikibugs>	 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1753497 (10Selsharbaty-WMF) Hi John,  Yeah, this is really helpful. I can never thank you enough!
[15:54:13] <wikibugs>	 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1753513 (10mobrovac) >>! In T116247#1753398, @Ottomata wrote: > Ok cool, if that's the case, then `reqid` or even `request_id` (I like long names...w...
[15:55:25] <ebernhardson>	 one last swat patch almost forgot about, that turns writes on to CODFW for cirrussearch. just going to punt that back to evening swat
[15:55:57] <ebernhardson>	 starting an interview in a few minuts and wont be able to watch it
[15:56:27] <greg-g>	 yeah, good call :)
[15:57:22] <ebernhardson>	 it is an ops interview, could ask them how to fix it ;)
[15:58:44] <icinga-wm>	 PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: puppet fail
[16:01:46] <wikibugs>	 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1753574 (10ArielGlenn) Any movement on this front? Is that spare still around?
[16:02:44] <wikibugs>	 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1753579 (10Ottomata) > Hm, I think duplicates should be detected based on the content of the message itself and the time stamp. EventLogging explicit...
[16:03:05] <icinga-wm>	 RECOVERY - puppet last run on mw1067 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[16:06:41] <wikibugs>	 7Puppet, 6operations, 10Continuous-Integration-Config: translatewiki-puppetlint-strict does not honor puppet-lint.rc file in /puppet - https://phabricator.wikimedia.org/T116552#1753608 (10hashar)
[16:10:09] <wikibugs>	 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, and 2 others: Kartotherian service logs inaccessible (systemd?) and not updated (/var/log) - https://phabricator.wikimedia.org/T115067#1753620 (10Dzahn) a:5akosiaris>3Dzahn
[16:11:22] <wikibugs>	 7Puppet, 6operations, 10Continuous-Integration-Config: translatewiki-puppetlint-strict does not honor puppet-lint.rc file in /puppet - https://phabricator.wikimedia.org/T116552#1753624 (10hashar) The Jenkins job is a template '{name}-puppetlint-strict' and indeed runs at the root of the repository.  Potentia...
[16:16:58] <wikibugs>	 7Puppet, 6Labs, 6Phabricator: phabricator puppet at labs broken - https://phabricator.wikimedia.org/T116442#1753651 (10Dzahn) Why can't we use the same class in prod and labs? That's the idea of testing changes,
[16:24:26] <wikibugs>	 6operations, 10ops-eqiad, 3labs-sprint-118: nobelium is overheating - https://phabricator.wikimedia.org/T116439#1753689 (10yuvipanda)
[16:24:38] <wikibugs>	 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: How to handle mgmt lan for labs bare metal? - https://phabricator.wikimedia.org/T116607#1753690 (10Andrew) 3NEW a:3Andrew
[16:25:01] <grrrit-wm>	 (03CR) 10Faidon Liambotis: "I'd like to hear more about why we've needed this in the past, but in any case how is moving three lines different than moving one?" [puppet] - 10https://gerrit.wikimedia.org/r/246826 (owner: 10Faidon Liambotis)
[16:25:17] <wikibugs>	 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, and 2 others: How to handle mgmt lan for labs bare metal? - https://phabricator.wikimedia.org/T116607#1753690 (10Andrew)
[16:25:43] <icinga-wm>	 RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[16:26:16] <wikibugs>	 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, and 2 others: How to handle mgmt lan for labs bare metal? - https://phabricator.wikimedia.org/T116607#1753700 (10RobH) I was chatting about this with Andrew.  So since all mgmt is on 'dumb' switches, we don't support multiple mgmt vlans unless...
[16:37:36] <wikibugs>	 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, and 2 others: How to handle mgmt lan for labs bare metal? - https://phabricator.wikimedia.org/T116607#1753783 (10chasemp) If the idea is these physical boxes are totally under the control of the relevant project admins we should consider mimic...
[16:42:31] <wikibugs>	 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1753813 (10Joe) I will start working on this in the next couple of weeks.  My current plan **for tin ** is to ask people to use...
[16:46:58] <wikibugs>	 6operations, 10ops-eqiad, 10netops: asw-d-eqiad SNMP failures - https://phabricator.wikimedia.org/T112781#1753827 (10BBlack) What are we blocked on here currently, do we need to order more SFPs or something to try plugging these in again?
[16:53:52] <grrrit-wm>	 (03PS1) 10Jforrester: Enable VisualEditor in the 'Projet' namespace on the French Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248910 (https://phabricator.wikimedia.org/T116603) 
[16:58:31] <wikibugs>	 7Blocked-on-Operations, 6operations, 10Parsoid, 10Salt, 6Scrum-of-Scrums: Disabling agent forwarding breaks dsh based restarts for Parsoid (required for deployments) - https://phabricator.wikimedia.org/T102039#1753883 (10ArielGlenn) I've been looking at this and seeing a couple of behaviours, one where I...
[17:00:11] <wikibugs>	 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1753896 (10Cmjohnson) Let's go with Rob's suggestion
[17:05:22] <grrrit-wm>	 (03PS2) 10Dzahn: releases: move base::firewall into the role [puppet] - 10https://gerrit.wikimedia.org/r/244691 (owner: 10Muehlenhoff)
[17:06:21] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] releases: move base::firewall into the role [puppet] - 10https://gerrit.wikimedia.org/r/244691 (owner: 10Muehlenhoff)
[17:07:45] <wikibugs>	 6operations, 10ops-eqiad, 10netops: asw-d-eqiad SNMP failures - https://phabricator.wikimedia.org/T112781#1753924 (10Cmjohnson) just a thought but some of those errors could be related to sfp's.....this occured when i had put standard sfp's in and not sfp+.  May be worth cabling up again and seeing if we get...
[17:08:55] <grrrit-wm>	 (03CR) 10Rush: [C: 04-1] "can you ref the issue this is for?" [puppet] - 10https://gerrit.wikimedia.org/r/209558 (owner: 10coren)
[17:10:42] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: Update WikimediaEnableMultiLines to OTRS 5.0.1 [software/otrs] - 10https://gerrit.wikimedia.org/r/248915 
[17:10:44] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: Update WikimediaTemplates to support 5.0.1 [software/otrs] - 10https://gerrit.wikimedia.org/r/248916 
[17:12:23] <wikibugs>	 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1753953 (10RobH) a:5Cmjohnson>3RobH
[17:12:30] <wikibugs>	 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1720246 (10RobH) p:5Normal>3High
[17:13:04] <wikibugs>	 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, and 2 others: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1753968 (10Andrew)
[17:14:13] <grrrit-wm>	 (03PS4) 10Dzahn: Move the ferm rules for elasticsearch internode traffic into role::logstash::elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/244412 (https://phabricator.wikimedia.org/T104964) (owner: 10Muehlenhoff)
[17:14:22] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/1079/" [puppet] - 10https://gerrit.wikimedia.org/r/244412 (https://phabricator.wikimedia.org/T104964) (owner: 10Muehlenhoff)
[17:14:34] <wikibugs>	 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1753981 (10RobH) I'll handle getting this spun up, and any potential onsite tasks.  (since it responds to mgmt ssh, there likely won't be any other than the labeling task)  I'll get this nam...
[17:16:51] <wikibugs>	 6operations, 10MediaWiki-Cache, 6Performance-Team, 10hardware-requests, 7Availability: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1753993 (10GWicke)
[17:17:31] <wikibugs>	 6operations, 10MediaWiki-Cache, 6Performance-Team, 10hardware-requests, 7Availability: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1687618 (10GWicke) Updated the ask to two boxes per DC in the description.
[17:18:09] <wikibugs>	 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, and 2 others: How to handle mgmt lan for labs bare metal? - https://phabricator.wikimedia.org/T116607#1754006 (10Andrew) a:5Andrew>3None
[17:18:11] <wikibugs>	 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, and 2 others: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1754008 (10Andrew) a:5Andrew>3None
[17:19:58] <grrrit-wm>	 (03PS1) 10Dzahn: logstash::elasticsearch add base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/248918 (https://phabricator.wikimedia.org/T104964) 
[17:20:16] <mutante>	 moritzm: ^ 
[17:21:07] <wikibugs>	 6operations, 10MediaWiki-Cache, 6Performance-Team, 10hardware-requests, 7Availability: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1754025 (10RobH) a:5Ottomata>3RobH
[17:22:45] <wikibugs>	 6operations, 10ops-eqiad, 10netops: asw-d-eqiad SNMP failures - https://phabricator.wikimedia.org/T112781#1754037 (10BBlack) yeah @paravoid was saying in the meeting, basically we should try one and see how it goes.  Let's cable/plug in the asw-d-eqiad connection for lvs1007 first and see what it does?  Shou...
[17:23:14] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [500.0]
[17:23:46] <mutante>	 bd808: is this right? it seems to me it is.  "Restrict access to deployment redis to internal plus silver" https://gerrit.wikimedia.org/r/#/c/245876/
[17:26:19] <YuviPanda>	 ebernhardson: I talked to mark, it is 6weeks once we get the deployment done now :) (to nobelium)
[17:26:53] <wikibugs>	 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure, 3labs-sprint-118: Rack/Setpup labvirt1010 and 1011 - https://phabricator.wikimedia.org/T116019#1754054 (10chasemp) a:5Cmjohnson>3chasemp
[17:26:53] <grrrit-wm>	 (03CR) 10BryanDavis: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/245876 (owner: 10Muehlenhoff)
[17:27:41] <grrrit-wm>	 (03PS3) 10Dzahn: Restrict access to deployment redis to internal plus silver [puppet] - 10https://gerrit.wikimedia.org/r/245876 (owner: 10Muehlenhoff)
[17:28:26] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/1080/" [puppet] - 10https://gerrit.wikimedia.org/r/245876 (owner: 10Muehlenhoff)
[17:28:35] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:30:48] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032] "Merging after ops meeting OKed this." [puppet] - 10https://gerrit.wikimedia.org/r/244627 (https://phabricator.wikimedia.org/T115067) (owner: 10Dzahn)
[17:30:56] <grrrit-wm>	 (03PS2) 10Alexandros Kosiaris: admin: let kartotherian and tilerator admins read logs [puppet] - 10https://gerrit.wikimedia.org/r/244627 (https://phabricator.wikimedia.org/T115067) (owner: 10Dzahn)
[17:31:58] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] admin: let kartotherian and tilerator admins read logs [puppet] - 10https://gerrit.wikimedia.org/r/244627 (https://phabricator.wikimedia.org/T115067) (owner: 10Dzahn)
[17:32:30] <mutante>	 :)
[17:34:46] <ebernhardson>	 YuviPanda: sweet!
[17:35:01] <ebernhardson>	 YuviPanda: i never expected to take so long to get everything ready...but good to know we have time available now :)
[17:35:31] <wikibugs>	 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, and 2 others: Kartotherian service logs inaccessible (systemd?) and not updated (/var/log) - https://phabricator.wikimedia.org/T115067#1754110 (10akosiaris) 5Open>3Resolved Change merged and tested. Resolving
[17:35:43] <grrrit-wm>	 (03PS1) 10RobH: adding in globalsign to procurement approved vendors [puppet] - 10https://gerrit.wikimedia.org/r/248921 
[17:37:33] <grrrit-wm>	 (03CR) 10RobH: [C: 032] adding in globalsign to procurement approved vendors [puppet] - 10https://gerrit.wikimedia.org/r/248921 (owner: 10RobH)
[17:37:53] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] Add DNS entries for ms-be20[1-2][0-6] Bug:T114712 [dns] - 10https://gerrit.wikimedia.org/r/248712 (https://phabricator.wikimedia.org/T114712) (owner: 10Papaul)
[17:38:17] <grrrit-wm>	 (03CR) 10Dzahn: "looks all good. free IPs in the original mgmt range, matches racktables" [dns] - 10https://gerrit.wikimedia.org/r/248712 (https://phabricator.wikimedia.org/T114712) (owner: 10Papaul)
[17:38:26] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 031] dynamicproxy: Empty data from initial-data.db [puppet] - 10https://gerrit.wikimedia.org/r/248622 (owner: 10Alex Monk)
[17:38:26] <wikibugs>	 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1754123 (10GWicke) > If we adopt a convention of always storing schema name and/or revision in the schemas themselves, then we can do like EventLoggi...
[17:40:13] <mutante>	 wait, ehm
[17:40:18] <mutante>	 i just merged the DNS change above
[17:40:29] <mutante>	 but authdns-update diff shows me more of a change than that
[17:41:02] <bblack>	 was another merge pending from earlier?
[17:41:48] <mutante>	 i think from yesterday, yes
[17:42:17] <mutante>	 cmjohnson1: ^ ms-be1019 be1020 in eqiad? yesterday?
[17:42:50] <cmjohnson1>	 yeah but I merged that ...didn't I?  
[17:43:04] <mutante>	 in authdns-update i see it as a change
[17:43:12] <mutante>	 but unlike puppet-merge this won't tell me "2 changes, warning"
[17:43:14] <cmjohnson1>	 oh..that would explain why they're not installing
[17:43:17] <mutante>	 it will just show me the unified diff
[17:43:22] <mutante>	 and that was confusing
[17:43:40] <mutante>	 ok, so let me merge this together
[17:43:56] <cmjohnson1>	 okay...sorry for the confusions
[17:44:00] <mutante>	 done, try it again then
[17:44:01] <wikibugs>	 10Ops-Access-Requests, 6operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1754144 (10greg) Jan: can you put in the description why you are requesting access? :)
[17:44:13] <mutante>	 np
[17:44:27] <mutante>	 hmm, papaul?
[17:44:58] <wikibugs>	 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1754145 (10GWicke) > I'm not so sure actually that these will always be redundant. I think the request ID should be persisted to track the same event...
[17:45:23] <icinga-wm>	 PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 3 failures
[17:45:58] <grrrit-wm>	 (03PS2) 10MaxSem: Beta: add cache headers to WP portal [puppet] - 10https://gerrit.wikimedia.org/r/248374 
[17:48:07] <wikibugs>	 6operations, 6Labs, 10Labs-Infrastructure, 3labs-sprint-117: Allocate labs subnet in dallas - https://phabricator.wikimedia.org/T115491#1754167 (10mark) For the Labs hosts/support vlans we can just follow the eqiad model for now, and copy that to codfw, with similar IP allocations as well.
[17:49:47] <wikibugs>	 6operations, 6Labs, 10Labs-Infrastructure, 3labs-sprint-117: Allocate subnet for labs test cluster instances - https://phabricator.wikimedia.org/T115492#1754169 (10mark) Let's allocate a similar amount of IPs (/20 iirc or thereabouts) as we did in eqiad, but experiment with how to use them with Neutron. I...
[17:50:24] <icinga-wm>	 PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 3 failures
[17:54:04] <wikibugs>	 6operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#1754188 (10Dzahn) >>! In T115937#1749505, @Reedy wrote: > I guess we ideally need /home/wikipedia/conf-svn/wmf-config for the actual svn repo...  I restored the entire /home/wikipedia from /home_pmtpa/wiki...
[17:55:24] <icinga-wm>	 PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 3 failures
[17:56:50] <wikibugs>	 6operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#1754219 (10Dzahn) @akosiaris @arielglenn @tstarling any ideas where else we could get "conf-svn"? I don't know after trying the old home_pmtpa that was mounted once on bast1001.
[17:56:56] <wikibugs>	 6operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#1754221 (10Dzahn) a:5Dzahn>3None
[17:57:37] <wikibugs>	 10Ops-Access-Requests, 6operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1754222 (10Dzahn) p:5Triage>3Normal
[17:59:15] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 71 data above and 7 below the confidence bounds
[18:00:24] <icinga-wm>	 PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 2 failures
[18:01:00] <wikibugs>	 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, and 2 others: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1754247 (10Dzahn) How common is this task? From T116553 i'm not sure this should be a part of normal admin work but rather an exception that we need mo...
[18:04:45] <wikibugs>	 6operations, 7Monitoring, 5Patch-For-Review: Monitor APC usage on application servers - https://phabricator.wikimedia.org/T116255#1754252 (10Krinkle) Initial dashboard up at <https://grafana.wikimedia.org/dashboard/db/hhvm-apc-usage>.
[18:05:24] <icinga-wm>	 RECOVERY - check_puppetrun on bismuth is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[18:08:55] <wikibugs>	 6operations: Include 5xx numbers in fluorine fatalmonitor - https://phabricator.wikimedia.org/T116627#1754262 (10EBernhardson) 3NEW
[18:09:13] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] openstack: Remove havana/icehouse files [puppet] - 10https://gerrit.wikimedia.org/r/248619 (owner: 10Alex Monk)
[18:09:18] <grrrit-wm>	 (03PS2) 10Andrew Bogott: openstack: Remove havana/icehouse files [puppet] - 10https://gerrit.wikimedia.org/r/248619 (owner: 10Alex Monk)
[18:17:33] <grrrit-wm>	 (03PS1) 10Dzahn: test multi-role admin group behaviour [puppet] - 10https://gerrit.wikimedia.org/r/248928 
[18:21:08] <mutante>	 godog: aargg.. ^ the actual answer is not "merge" but "fail" :(
[18:21:11] <mutante>	 Error: Could not run: Conflicting value for admin::groups found in role test_bar
[18:21:37] <mutante>	 that succkkks
[18:22:14] <mutante>	 JohnFLewis: ^ fyi, too
[18:23:21] <hoo>	 I guess you need to set the admin groups in a variable (via hiera) and make admin look that up… not nice, though
[18:25:12] <grrrit-wm>	 (03CR) 10Dzahn: "fails :/ Error: Could not run: Conflicting value for admin::groups found in role test_bar" [puppet] - 10https://gerrit.wikimedia.org/r/248928 (owner: 10Dzahn)
[18:25:51] <grrrit-wm>	 (03CR) 10Dzahn: admin: add dc-ops group to role access_new_install (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/246850 (https://phabricator.wikimedia.org/T115718) (owner: 10Dzahn)
[18:26:19] <JohnFLewis>	 hoo: that is via hiera :)
[18:26:45] <JohnFLewis>	 mutante: do we just look up the hiera variable? no merging?
[18:26:57] * JohnFLewis looks
[18:27:10] <hoo>	 oh
[18:27:13] * hoo hides
[18:27:19] <wikibugs>	 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1754366 (10mobrovac) a:3mobrovac [PR 5](https://github.com/wikimedia/restevent/pull/5) proposes the schema definitions for the basic MW events: art...
[18:28:50] <mutante>	 hoo: JohnFLewis: yes, this is via hiera. the issues comes up once you have 2 or more roles and each role assigns admin groups and then you put 2 roles on one node
[18:29:03] <JohnFLewis>	 mutante: yeah, {{looking}}
[18:29:12] <mutante>	 so that means we can define the admin groups in hiera, but only by hostnames
[18:29:16] <mutante>	 and not by roles. which sucks
[18:29:33] <mutante>	 well. or we have to use regex.yaml
[18:29:44] <godog>	 mutante: nooo :(
[18:30:04] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:33:43] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[18:36:24] <wikibugs>	 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, and 2 others: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1754445 (10Yurik) @dzahn, I think it has happened twice already.  This goes back to the origin of this task - we should be able to manage all aspects o...
[18:37:55] <grrrit-wm>	 (03PS3) 10MaxSem: Beta: add cache headers to WP portal [puppet] - 10https://gerrit.wikimedia.org/r/248374 
[18:40:04] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:41:54] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: OK: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State: RUNNING
[18:44:41] <wikibugs>	 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, and 2 others: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1754469 (10Dzahn) I don't think rebuilding from replication failures should be considered normal and part of regular admin work. We should instead focu...
[18:55:50] <grrrit-wm>	 (03PS1) 10Dzahn: admin: add datacenter-ops to palladium [puppet] - 10https://gerrit.wikimedia.org/r/248936 
[19:04:47] <nuria>	 ori: do you know of any custom code to handle replication among master and slave in eventlogging db (we are talking with jynus in #wikimedia-databases)
[19:07:49] <wikibugs>	 6operations, 6Phabricator, 10Traffic, 7Blocked-on-Security: Pharicator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#1754548 (10chasemp)
[19:08:06] <wikibugs>	 6operations, 6Phabricator, 10Traffic, 7Blocked-on-Security: Pharicator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#1645318 (10chasemp) with all due respect, it has to be reviewed by security before ops can step in :)
[19:08:58] <wikibugs>	 10Ops-Access-Requests, 6operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1754553 (10chasemp)
[19:10:18] <wikibugs>	 6operations, 10ops-eqiad: db1030 RAID degraded (disk failed) - https://phabricator.wikimedia.org/T116499#1754557 (10chasemp) p:5Triage>3High a:3Cmjohnson
[19:11:08] <wikibugs>	 7Puppet, 6operations, 10Continuous-Integration-Config: translatewiki-puppetlint-strict does not honor puppet-lint.rc file in /puppet - https://phabricator.wikimedia.org/T116552#1754562 (10chasemp) p:5Triage>3Normal
[19:11:34] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:12:46] <wikibugs>	 6operations: Include 5xx numbers in fluorine fatalmonitor - https://phabricator.wikimedia.org/T116627#1754571 (10chasemp) p:5Triage>3Normal
[19:12:50] <grrrit-wm>	 (03PS1) 10Chad: ContentTranslation: Use wfLoadExtension() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248938 
[19:13:14] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1038 is OK: OK: YARN NodeManager analytics1038.eqiad.wmnet:8041 Node-State: RUNNING
[19:13:47] <wikibugs>	 10Ops-Access-Requests, 6operations: Add Matanya to "restricted" to perform server side uploads - https://phabricator.wikimedia.org/T106447#1754575 (10chasemp) I'll make a note to include this in the next ops meeting.
[19:15:05] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:16:40] <wikibugs>	 10Ops-Access-Requests, 6operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1754582 (10chasemp) >>! In T116487#1754144, @greg wrote: > Jan: can you put in the description why you are requesting access? :)  This was the status as of the Ops meeting today.
[19:17:20] <wikibugs>	 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1754586 (10bd808) >>! In T87036#1753813, @Joe wrote: > My current plan **for tin ** is to ask people to use mira instead of tin...
[19:17:23] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:18:34] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[19:19:03] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING
[19:19:28] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#1754591 (10chasemp) a:3akosiaris Status as of now: not directly accepted, some extra information since the idea was the AQS requires scap3 and not ansibl...
[19:19:36] <wikibugs>	 6operations: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#1754593 (10chasemp)
[19:19:55] <wikibugs>	 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1754595 (10RobH) 3NEW a:3RobH
[19:20:06] <wikibugs>	 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1754604 (10RobH)
[19:20:07] <wikibugs>	 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1754603 (10RobH)
[19:20:25] <wikibugs>	 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1754595 (10RobH)
[19:20:26] <wikibugs>	 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1754610 (10RobH)
[19:20:28] <wikibugs>	 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1754605 (10RobH) 5Open>3Resolved WMF3542 is allocated as hostname lawrencium for this use.  T116645 is for installation, resolving #hardware-request.
[19:20:56] <wikibugs>	 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1754595 (10RobH)
[19:21:33] <grrrit-wm>	 (03PS1) 10RobH: setting lawrencium dns entries [dns] - 10https://gerrit.wikimedia.org/r/248939 
[19:21:39] <wikibugs>	 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1754613 (10bd808) >>! In T87036#1753813, @Joe wrote: > For terbium, I still need to understand how much work - if any - will be...
[19:22:08] <grrrit-wm>	 (03CR) 10RobH: [C: 032] setting lawrencium dns entries [dns] - 10https://gerrit.wikimedia.org/r/248939 (owner: 10RobH)
[19:23:51] <YuviPanda>	 ebernhardson: do we have any blockers other than the nobelium hardware issue?
[19:23:58] <wikibugs>	 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, and 2 others: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1754618 (10chasemp) p:5High>3Normal
[19:24:23] <wikibugs>	 6operations: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#1754619 (10akosiaris) >>! In T116169#1754591, @chasemp wrote: > Status as of now: not directly accepted, some extra information since the idea was the AQS requires scap3 and not ansible >  > @akosiaris...
[19:24:25] <wikibugs>	 6operations, 10ops-eqiad: label server lawrencium/WMF3542 - https://phabricator.wikimedia.org/T116646#1754621 (10RobH) 3NEW a:3Cmjohnson
[19:25:21] <wikibugs>	 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1754628 (10RobH)
[19:25:54] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:27:44] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:29:24] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[19:29:33] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1038 is OK: OK: YARN NodeManager analytics1038.eqiad.wmnet:8041 Node-State: RUNNING
[19:31:08] <wikibugs>	 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1754644 (10demon) >>! In T87036#1754586, @bd808 wrote: >>>! In T87036#1753813, @Joe wrote: >> My current plan **for tin ** is t...
[19:31:38] <wikibugs>	 6operations, 10Traffic, 7Performance: missing SPDY coalesce for upload.wm.o for images ref'd in projects' page outputs - https://phabricator.wikimedia.org/T116132#1754647 (10BBlack) There are a few concerns here which is why this is kind of "back burner" for now, but on the longer-term radar:  * Even in the...
[19:31:55] <wikibugs>	 7Blocked-on-Operations, 6operations, 5Continuous-Integration-Scaling: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1754650 (10chasemp) @fgiunchedi shoutout as a packaging guru :D  Can you provide any guidance?
[19:32:54] <grrrit-wm>	 (03PS1) 10RobH: setting lawrencium install params [puppet] - 10https://gerrit.wikimedia.org/r/248941 
[19:33:28] <grrrit-wm>	 (03CR) 10RobH: [C: 032] setting lawrencium install params [puppet] - 10https://gerrit.wikimedia.org/r/248941 (owner: 10RobH)
[19:33:44] <wikibugs>	 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, and 2 others: How to handle mgmt lan for labs bare metal? - https://phabricator.wikimedia.org/T116607#1754657 (10chasemp) Notes from meeting:  Ironic will have its own model, for now the mgmt interface for any labs "hardware" node will be on i...
[19:34:45] <wikibugs>	 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware - https://phabricator.wikimedia.org/T106731#1754663 (10chasemp)
[19:36:05] <cmjohnson1>	 !log swapped bad disk on db1030
[19:36:09] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:37:38] <wikibugs>	 6operations, 10ops-eqiad: db1030 RAID degraded (disk failed) - https://phabricator.wikimedia.org/T116499#1754677 (10Cmjohnson) replaced disk  cmjohnson@db1030:~$ sudo megacli -PDList -aALL |grep "Firmware state:" Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Fi...
[19:40:53] <ori>	 apergos: still there?
[19:41:08] <apergos>	 physically yes :-D
[19:41:17] <apergos>	 ori: 
[19:41:29] <apergos>	 mentally pretty checked out... what's up?
[19:41:35] <ori>	 i'll have the new package on the snapshot hosts in a few minutes
[19:41:53] <apergos>	 ah that's excellent
[19:43:39] <wikibugs>	 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1754696 (10RobH)
[19:44:06] <wikibugs>	 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1754595 (10RobH)
[19:45:25] <wikibugs>	 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1754698 (10Ottomata) > If we have a use case for emitting two secondary events *to the same topic* that were both triggered by the same primary event...
[19:46:13] <icinga-wm>	 PROBLEM - Host nobelium is DOWN: PING CRITICAL - Packet loss = 100%
[19:47:24] <wikibugs>	 6operations, 6Analytics-Backlog, 10Wikimedia-Mailing-lists: Requests to lists.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116429#1754706 (10Dzahn) Ok, let's not mix up dumps. and lists. in a single ticket please. They are different and unrelated.  I'm...
[19:49:33] <icinga-wm>	 PROBLEM - puppet last run on ms-fe1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:50:09] <wikibugs>	 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1754709 (10Ottomata) What do y'all think about keeping these 'framing' fields in a nested object?  I'm not sure if this is a good or bad idea.  If la...
[20:00:04] <jouncebot>	 gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151026T2000). Please do the needful.
[20:00:13] <icinga-wm>	 RECOVERY - Host nobelium is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[20:00:39] <subbu>	 will be another 15-30 mins before we are ready to deploy parsoid
[20:01:59] <wikibugs>	 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1754777 (10RobH) Ok, it seems the H310 doesn't play nice with Debian/Jessie, so the server I allocated won't work.  Re-opening the allocation task.
[20:02:45] <YuviPanda>	 cmjohnson1: how did it go?
[20:03:11] <grrrit-wm>	 (03PS1) 10Eevans: cassandra: updated gc settings [puppet] - 10https://gerrit.wikimedia.org/r/248960 (https://phabricator.wikimedia.org/T106619) 
[20:03:22] <wikibugs>	 6operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#1754782 (10akosiaris) >>! In T115937#1754219, @Dzahn wrote: > @akosiaris @arielglenn @tstarling any ideas where else we could get "conf-svn"? I don't know after trying the old home_pmtpa that was mounted o...
[20:03:31] <cmjohnson1>	 yuvipanda...done
[20:03:42] <YuviPanda>	 ok
[20:03:50] <YuviPanda>	 I'll stress it and see what happens!
[20:05:55] <YuviPanda>	 !log running stress on nobelium
[20:05:59] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:06:53] <wikibugs>	 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1754793 (10RobH)
[20:06:54] <wikibugs>	 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1754792 (10RobH)
[20:06:55] <wikibugs>	 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1754789 (10RobH) 5Resolved>3Open So WMF3542 has an H310 controller, which Jessie doesn't detect.  Since we don't like using these controllers, I can either replace it with a 710 (overkil...
[20:09:49] <wikibugs>	 6operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#1754814 (10akosiaris) >>! In T115937#1754782, @akosiaris wrote: >>>! In T115937#1754219, @Dzahn wrote: >> @akosiaris @arielglenn @tstarling any ideas where else we could get "conf-svn"? I don't know after...
[20:10:35] <icinga-wm>	 RECOVERY - RAID on db1030 is OK: OK: optimal, 1 logical, 2 physical
[20:11:30] <wikibugs>	 6operations, 10ops-eqiad: Rename analytics1011, 1016, and 1019 to aqs1001, 1002, 1003  - https://phabricator.wikimedia.org/T116656#1754825 (10Cmjohnson) 3NEW a:3Cmjohnson
[20:12:04] <wikibugs>	 6operations, 10OTRS: Apply security patch to OTRS (Scheduler Process ID File Access vulnerability) - https://phabricator.wikimedia.org/T114132#1754834 (10faidon) 5Open>3declined a:3faidon We'll upgrade OTRS to a newer major release instead, as work for this was already underway when this security vulnera...
[20:12:22] <bearND>	 no mobileapps service deploy today
[20:13:43] <paravoid>	 !log deactivating ulsfo<->NTT BGP peering due to upcoming network migration
[20:13:47] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:14:09] <wikibugs>	 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1754857 (10RobH) Chatted with Ariel in IRC.  Going to go with one of the:  Dell PowerEdge R420, Dual Intel Xeon E5-2440, 32GB Memory, Dual 300GB SSD, Dual 500GB Nearline SAS      promethium...
[20:14:33] <wikibugs>	 6operations: install/setup/deploy X as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1754860 (10RobH)
[20:14:48] <wikibugs>	 6operations, 10ops-eqiad: label server lawrencium/WMF3542 - https://phabricator.wikimedia.org/T116646#1754862 (10RobH) 5Open>3declined Declined, we aren't renaming this server after all.
[20:14:49] <wikibugs>	 6operations: install/setup/deploy X as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1754595 (10RobH)
[20:16:24] <icinga-wm>	 RECOVERY - puppet last run on ms-fe1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:17:22] <ebernhardson>	 YuviPanda: i can kick the import off once nobelium is working again, i don't think there are other blockers
[20:17:22] <grrrit-wm>	 (03PS3) 10Faidon Liambotis: Use testsystem role for ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/247239 (owner: 10Muehlenhoff)
[20:17:29] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032 V: 032] Use testsystem role for ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/247239 (owner: 10Muehlenhoff)
[20:17:42] <YuviPanda>	 ebernhardson: coool. I'm running a stress test that's pegging all CPU cores
[20:17:48] <YuviPanda>	 temperatures in 60-65 C
[20:17:50] <YuviPanda>	 which seem ok
[20:18:41] <grrrit-wm>	 (03PS2) 10Faidon Liambotis: Mark rubidium as spare [puppet] - 10https://gerrit.wikimedia.org/r/246832 (owner: 10Muehlenhoff)
[20:18:48] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032 V: 032] Mark rubidium as spare [puppet] - 10https://gerrit.wikimedia.org/r/246832 (owner: 10Muehlenhoff)
[20:18:56] <ebernhardson>	 YuviPanda: yea anything below 85 should be fine
[20:18:59] <YuviPanda>	 ebernhardson: yeah
[20:19:07] <YuviPanda>	 ebernhardson: wanna kick it off now? i can stop the test.
[20:19:11] <ebernhardson>	 sure
[20:19:32] <YuviPanda>	 cmjohnson1: seems to be all good! :D stress test didn't bring temperature over 65C
[20:19:35] <YuviPanda>	 cmjohnson1: thanks!
[20:19:47] <YuviPanda>	 !log stress test on nobelium complete, CPU temperature didn't go above 65C
[20:19:49] <cmjohnson1>	 cool!  glad I could help
[20:19:50] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:21:16] <ebernhardson>	 !log started copy of eqiad elasticsearch indices to noeblium
[20:21:20] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:21:34] <grrrit-wm>	 (03PS3) 10Faidon Liambotis: puppet: do not 'ensure latest' [puppet] - 10https://gerrit.wikimedia.org/r/247007 (https://phabricator.wikimedia.org/T115348) (owner: 10Dzahn)
[20:21:34] <YuviPanda>	 n00belium
[20:21:39] <ebernhardson>	 heh
[20:22:10] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] puppet: do not 'ensure latest' [puppet] - 10https://gerrit.wikimedia.org/r/247007 (https://phabricator.wikimedia.org/T115348) (owner: 10Dzahn)
[20:22:17] <YuviPanda>	 ebernhardson: I see 'php' processes - are these not hhvm?
[20:22:46] <ebernhardson>	 YuviPanda: it is hhvm, thats the default php via debian's /etc/alternatives
[20:23:01] <YuviPanda>	 aah ok
[20:23:14] <icinga-wm>	 PROBLEM - DPKG on osmium is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:23:37] <ebernhardson>	 YuviPanda: i'm pretty sure the limit here is disk io as well, not going to manage to peg things completely
[20:24:02] <YuviPanda>	 ebernhardson: yeah. any guesses on how long it'll take?
[20:24:10] <ebernhardson>	 YuviPanda: a week?
[20:24:19] <YuviPanda>	 ebernhardson: ok.
[20:24:27] <ebernhardson>	 YuviPanda: i know thats not a great number, but its 380M documents at a few hundred per second
[20:24:41] <YuviPanda>	 ebernhardson: yeah, 'tis still ok, esp. without the time pressure now
[20:24:48] <ebernhardson>	 (vs 5-6k/s importing to codfw)
[20:24:56] <YuviPanda>	 ebernhardson: what kindof tests will we do afterwards?
[20:25:06] <YuviPanda>	 ebernhardson: oh, and that's solely because of write performance on the hardware?
[20:25:27] <ebernhardson>	 actually it claims to have peaked at 16k docs/sec briefly, i think those were probably wiktionary's
[20:25:33] <ebernhardson>	 YuviPanda: i think so, but not completely sur
[20:25:52] <ebernhardson>	 YuviPanda: for labsearch i'm fairly certain, for codfw it could be a few things
[20:25:53] <YuviPanda>	 right
[20:26:36] <YuviPanda>	 because eventually we'll have to decom this host and figure out how much hardware we really need and then start that process
[20:26:55] <ebernhardson>	 yea, not sure how to figure that out yet but will think of something :)
[20:27:56] <ebernhardson>	 YuviPanda: i suppose the basic test to run is via our 'runSearch.php' script in elasticsearch. with that we can pump queries into the elasticsearch and see how it responds
[20:28:21] <YuviPanda>	 ebernhardson: right. I don't know of ES's querying abilities, but can we do more interesting queries that we can't do in prod?
[20:28:34] <YuviPanda>	 like, just target a particular title pattern for example
[20:28:38] <ebernhardson>	 and before that, seeing if it can even handle the current write load or if we have to strategically disable a few popular wikis writes
[20:28:50] <YuviPanda>	 ah, that
[20:28:51] <ebernhardson>	 YuviPanda: can do *much* more via the ES api we are exposing in labs
[20:28:52] <YuviPanda>	 yeah
[20:29:09] <ebernhardson>	 YuviPanda: but the easiest way to test is piping one query per line into runSearch.php :)
[20:29:28] <YuviPanda>	 right :)
[20:29:41] <YuviPanda>	 just want to have a mix of expensive queries too
[20:30:04] <ebernhardson>	 one difficulty...elasticsearch api can OOM the machine if your query is expensive enough
[20:30:11] <ebernhardson>	 or really, the java VM
[20:30:20] <ebernhardson>	 the machine will be fine, but the jvm will give up
[20:30:27] <YuviPanda>	 will it restart?
[20:30:34] <YuviPanda>	 will that cause the restart to be fairly long?
[20:31:10] <ebernhardson>	 it should be possible to restart, but there have been 2 prod issues in the last year where it didn't restart
[20:31:25] <ebernhardson>	 (we use a bunch of regex's to filter this out and deny them in php, but it sometimes misses something)
[20:31:41] <ebernhardson>	 that would be much harder/impossible for es api queries directly though
[20:31:52] <chasemp>	 which is partly becaues we allow globbing in prod right?
[20:32:07] <ebernhardson>	 chasemp: yes, both cases afaik were due to globbing 
[20:32:37] <YuviPanda>	 ebernhardson: we could make the proxy be more intelligent
[20:33:15] <YuviPanda>	 use the lua stuff in nginx, or write a simple one in pytohn / golang
[20:33:18] <ebernhardson>	 YuviPanda: generally i'm not super worried about that, but maybe i should be? the last time this happened the person running the queries was specifically trying to run "very expensive things" to see what happens
[20:33:23] <chasemp>	 are you thinking of building a quarry for es YuviPanda?
[20:33:30] <ebernhardson>	 people kinda asume 'oh wikipedia, its huge they can take it'
[20:33:39] <YuviPanda>	 chasemp: nah, although now that you mention it...
[20:33:39] <ebernhardson>	 where as people on labs wil understand the limited nature of the labs replica
[20:33:51] <grrrit-wm>	 (03PS3) 10Dzahn: interface: do not 'ensure latest',do require_package [puppet] - 10https://gerrit.wikimedia.org/r/247008 (https://phabricator.wikimedia.org/T115348) 
[20:34:04] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[20:34:32] <grrrit-wm>	 (03PS4) 10Faidon Liambotis: interface: do not 'ensure latest', use require_package [puppet] - 10https://gerrit.wikimedia.org/r/247008 (https://phabricator.wikimedia.org/T115348) (owner: 10Dzahn)
[20:34:35] <YuviPanda>	 ebernhardson: possibly, but they might not know what'll take it down
[20:34:46] <YuviPanda>	 ebernhardson: but yeah, i'm of the 'let us expose it and see what happens' camp
[20:34:47] <ebernhardson>	 YuviPanda: true
[20:35:12] <YuviPanda>	 chasemp: I've been meaning to rewrite quarry to decouple it from SQL so much
[20:35:23] <YuviPanda>	 chasemp: should be able to target WDQS, Posgres (OSM), and maybe even this
[20:35:23] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] interface: do not 'ensure latest', use require_package [puppet] - 10https://gerrit.wikimedia.org/r/247008 (https://phabricator.wikimedia.org/T115348) (owner: 10Dzahn)
[20:35:53] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[20:36:16] <subbu>	 deploying fresh parsoid code
[20:36:43] <chasemp>	 YuviPanda: would be super cool :)
[20:38:18] <YuviPanda>	 chasemp: yeah. Quarry's the first thing I'm going to move to k8s
[20:39:19] <YuviPanda>	 chasemp: won't be touching it for a few weeks, halfak and others (including me) are doing a workshop at an ACM conference about doing collaborative research via Quarry!
[20:39:45] <YuviPanda>	 maybe I should just the write the new thing from scratch :)
[20:43:58] <grrrit-wm>	 (03PS2) 10Dzahn: admin: add datacenter-ops to palladium [puppet] - 10https://gerrit.wikimedia.org/r/248936 (https://phabricator.wikimedia.org/T115718) 
[20:44:07] <grrrit-wm>	 (03CR) 10Legoktm: "Also extension-list?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248938 (owner: 10Chad)
[20:45:20] <grrrit-wm>	 (03PS3) 10Dzahn: admin: add datacenter-ops to palladium [puppet] - 10https://gerrit.wikimedia.org/r/248936 (https://phabricator.wikimedia.org/T115718) 
[20:46:34] <icinga-wm>	 RECOVERY - DPKG on osmium is OK: All packages OK
[20:48:06] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] "access request has been ACKed in ops meeting" [puppet] - 10https://gerrit.wikimedia.org/r/248936 (https://phabricator.wikimedia.org/T115718) (owner: 10Dzahn)
[20:50:01] <subbu>	 !log deployed parsoid version 660c59a9
[20:50:05] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:50:14] <icinga-wm>	 PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 1, unused: 0BRxe-0/0/2: down - Transit: ! NTT (service ID 234631) {#1061} [10Gbps]BR
[20:52:08] <wikibugs>	 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1755054 (10Krenair) (Note that WikitechPrivateLdapSettings actually comes from puppet so I don't think we need to worry about c...
[20:53:01] <grrrit-wm>	 (03Abandoned) 10Dzahn: test multi-role admin group behaviour [puppet] - 10https://gerrit.wikimedia.org/r/248928 (owner: 10Dzahn)
[20:53:14] <mutante>	 test succesful - we know it fails
[20:53:30] <grrrit-wm>	 (03Abandoned) 10Dzahn: admin: add dc-ops group to role access_new_install [puppet] - 10https://gerrit.wikimedia.org/r/246850 (https://phabricator.wikimedia.org/T115718) (owner: 10Dzahn)
[20:59:04] <icinga-wm>	 PROBLEM - puppet last run on elastic1020 is CRITICAL: CRITICAL: puppet fail
[21:00:55] <wikibugs>	 6operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#1755109 (10Reedy) Hmm. It should be at /srv/home_pmtpa/conf-svn but there's no sign of it. I wonder if that means we deleted it at some point. Not quite sure why we would've deleted it.. Else that means it...
[21:02:35] <grrrit-wm>	 (03PS2) 10Dzahn: admin: add papaul to datacenter ops group [puppet] - 10https://gerrit.wikimedia.org/r/246849 
[21:02:44] <icinga-wm>	 PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: puppet fail
[21:03:38] <grrrit-wm>	 (03PS3) 10Dzahn: admin: add papaul to datacenter ops group [puppet] - 10https://gerrit.wikimedia.org/r/246849 
[21:04:43] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[21:05:58] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] admin: add papaul to datacenter ops group [puppet] - 10https://gerrit.wikimedia.org/r/246849 (owner: 10Dzahn)
[21:06:33] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[21:07:28] <grrrit-wm>	 (03CR) 10Dzahn: "for T115718 and has been acked in meeting" [puppet] - 10https://gerrit.wikimedia.org/r/246849 (owner: 10Dzahn)
[21:08:36] <wikibugs>	 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1755124 (10GWicke) >>! In T116247#1754698, @Ottomata wrote: >> If we have a use case for emitting two secondary events *to the same topic* that were...
[21:16:13] <wikibugs>	 6operations, 10ops-eqiad: Rename analytics1011, 1016, and 1019 to aqs1001, 1002, 1003 - https://phabricator.wikimedia.org/T116656#1755150 (10Cmjohnson) racktables and lables updated...looks like dns was completed already
[21:16:19] <wikibugs>	 6operations, 10ops-eqiad: Rename analytics1011, 1016, and 1019 to aqs1001, 1002, 1003 - https://phabricator.wikimedia.org/T116656#1755151 (10Cmjohnson) 5Open>3Resolved
[21:18:50] <wikibugs>	 6operations, 10ops-eqiad, 3labs-sprint-118: nobelium is overheating - https://phabricator.wikimedia.org/T116439#1755165 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson Looks to be good  From IRC   YuviPanda cmjohnson1: seems to be all good! :D stress test didn't bring temperature over 65C
[21:26:04] <icinga-wm>	 RECOVERY - puppet last run on elastic1020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:26:16] <cmjohnson1>	 jzerebecki or hoo: ping
[21:27:03] <wikibugs>	 6operations, 5Patch-For-Review: create new admin group for datacenter ops to add new systems to puppet - https://phabricator.wikimedia.org/T115718#1755188 (10Dzahn) because adding admin group via role (https://gerrit.wikimedia.org/r/#/c/246850/),  doesn't work (https://gerrit.wikimedia.org/r/#/c/248928/)  i ad...
[21:27:09] <wikibugs>	 6operations, 5Patch-For-Review: create new admin group for datacenter ops to add new systems to puppet - https://phabricator.wikimedia.org/T115718#1755189 (10Dzahn) 5Open>3Resolved
[21:27:21] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: create new admin group for datacenter ops to add new systems to puppet - https://phabricator.wikimedia.org/T115718#1730969 (10Dzahn)
[21:29:45] <icinga-wm>	 RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:32:11] <jynus>	 the dbstore1002 issue is expected, I am inserting rows there as crazy
[21:37:20] <wikibugs>	 6operations, 10ops-eqiad: Reclaim einsteinium.eqiad.wmnet for spares - https://phabricator.wikimedia.org/T116252#1755228 (10Cmjohnson)
[21:37:22] <wikibugs>	 6operations, 10ops-eqiad: wipe einsteinium disks - https://phabricator.wikimedia.org/T116253#1755226 (10Cmjohnson) 5Open>3Resolved wiped
[21:39:29] <hoo>	 cmjohnson1: Hi, what's up?
[21:40:34] <cmjohnson1>	 hi hoo: wdqs1001 and 1002 never had mgmt correctly setup when they were re-named. This will require a few minutes of downtime...i want to know when I could do this
[21:40:52] <hoo>	 Any time is as bad as any other
[21:40:52] <cmjohnson1>	 https://phabricator.wikimedia.org/T84686
[21:41:09] <hoo>	 The service is declared beta, so it should be ok-ish to just take them down for a short bit
[21:41:28] <cmjohnson1>	 okay..mind if I take 1001 down now?
[21:41:30] <hoo>	 Also... do you need to do both at once? If not, that should be fine
[21:41:37] <cmjohnson1>	 no..one at a time
[21:41:47] <hoo>	 That should be fine
[21:41:51] <hoo>	 SMalyshev: ^
[21:42:02] <grrrit-wm>	 (03CR) 10MaxSem: "Tested by cherrypicking on beta puppetmaster - works as expected." [puppet] - 10https://gerrit.wikimedia.org/r/248374 (owner: 10MaxSem)
[21:42:13] <hoo>	 I guess everything is puppetized by now, so that stuff should come back on its own
[21:46:48] <grrrit-wm>	 (03PS1) 10Cmjohnson: Adding mgmt entries for wdqs1001/2 [dns] - 10https://gerrit.wikimedia.org/r/248999 
[21:47:35] <grrrit-wm>	 (03CR) 10Cmjohnson: [C: 032] Adding mgmt entries for wdqs1001/2 [dns] - 10https://gerrit.wikimedia.org/r/248999 (owner: 10Cmjohnson)
[21:48:15] <cmjohnson1>	 !log powering off wdqs1001 to update idrac settings 
[21:48:18] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:49:04] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1001 is CRITICAL: Connection timed out
[21:50:37] <wikibugs>	 6operations, 10ops-codfw, 7Swift: rack & initial on-site setup of ms-be2016-2021 - https://phabricator.wikimedia.org/T114712#1755283 (10Papaul)   ms-be2016       10.193.1.12      port xe-0/2/7 ms-be2017       10.193.1.13      port xe-0/7/7 ms-be2018       10.193.1.14      port xe-0/2/7 ms-be2019       10.193...
[21:51:33] <icinga-wm>	 PROBLEM - Host wdqs1001 is DOWN: PING CRITICAL - Packet loss = 100%
[21:55:14] <icinga-wm>	 RECOVERY - Host wdqs1001 is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms
[21:59:06] <icinga-wm>	 PROBLEM - puppet last run on wdqs1001 is CRITICAL: CRITICAL: puppet fail
[21:59:06] <icinga-wm>	 PROBLEM - WDQS HTTP on wdqs1001 is CRITICAL: Connection refused
[21:59:54] <icinga-wm>	 PROBLEM - WDQS HTTP Port on wdqs1001 is CRITICAL: connect to address 127.0.0.1 and port 80: Connection refused
[22:00:14] <icinga-wm>	 PROBLEM - Blazegraph Port on wdqs1001 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused
[22:00:43] <icinga-wm>	 PROBLEM - Blazegraph process on wdqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (blazegraph), regex args ^java .* blazegraph-service-.*-dist.war
[22:01:27] <greg-g>	 eh? ok
[22:01:29] <greg-g>	 oh
[22:02:28] <hoo>	 greg-g: It's fine
[22:02:30] <hoo>	 kinda
[22:03:56] <mutante>	 production monitoring with "declared beta"
[22:04:17] <mutante>	 doesn't understand semi-prod
[22:05:07] <hoo>	 It's stable enough for production monitoring, so I don't see an issue there
[22:05:16] <hoo>	 although I'm not on these alerts
[22:05:51] <mutante>	 where's the beta part then
[22:06:40] <hoo>	 It's mostly beta as in the data representation isn't fully stable yet
[22:07:00] <mutante>	 ah,ok
[22:07:09] <hoo>	 also, *I* don't think we know much about the stability, yet... we just never had a problem
[22:07:43] <hoo>	 I wonder how much user QPS it does... that's not visible anywhere AFAIK
[22:07:55] <hoo>	 maybe the proxy in front of it logs
[22:08:51] <hoo>	 meh, seems like the services didn't come up on their own
[22:09:07] <hoo>	 cmjohnson1: ^
[22:09:14] <hoo>	 So please leave the other one for now
[22:09:54] <cmjohnson1>	 hoo: the other was fine...so no need to bring it down. sorry for breaking things 
[22:12:07] <wikibugs>	 6operations: puppetize wdqs service startup on boot - https://phabricator.wikimedia.org/T116673#1755399 (10Dzahn) 3NEW
[22:12:16] <mutante>	 and that'
[22:12:51] <mutante>	 hoo: should i start it then?
[22:13:30] <hoo>	 mutante: I think we should do it via puppet
[22:13:55] <mutante>	 that's what the ticket is for ^
[22:14:07] <mutante>	 but don't you want it to run now
[22:15:23] <hoo>	 mutante: Fixing it via puppet is very easy
[22:15:53] <wikibugs>	 6operations: puppetize wdqs service startup on boot - https://phabricator.wikimedia.org/T116673#1755420 (10Dzahn)
[22:17:10] <mutante>	 ok, thought you'd want a "service start" to fix the current breakage
[22:17:27] <hoo>	 If we can do it via puppet now, I'd prefer that
[22:18:07] <grrrit-wm>	 (03PS1) 10Hoo man: Enable the wdqs service per default [puppet] - 10https://gerrit.wikimedia.org/r/249005 (https://phabricator.wikimedia.org/T116673) 
[22:18:21] <hoo>	 mutante|away: ^ that should do it
[22:18:26] <hoo>	 unless I screwed the syntax
[22:19:03] <hoo>	 ah, missed him :S
[22:19:53] <hoo>	 Anyone else around?
[22:20:26] <icinga-wm>	 RECOVERY - puppet last run on wdqs1001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[22:21:04] <icinga-wm>	 RECOVERY - WDQS HTTP Port on wdqs1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 80
[22:21:10] <hoo>	 wtf
[22:21:24] <icinga-wm>	 RECOVERY - Blazegraph Port on wdqs1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999
[22:21:27] <grrrit-wm>	 (03CR) 10BBlack: [C: 032] Enable the wdqs service per default [puppet] - 10https://gerrit.wikimedia.org/r/249005 (https://phabricator.wikimedia.org/T116673) (owner: 10Hoo man)
[22:21:44] <icinga-wm>	 RECOVERY - Blazegraph process on wdqs1001 is OK: PROCS OK: 1 process with UID = 998 (blazegraph), regex args ^java .* blazegraph-service-.*-dist.war
[22:22:36] <bblack>	 hoo: if it came up unexpectedly, it was likely puppet running every 30 minutes
[22:23:19] <hoo>	 mh, ok
[22:23:39] <hoo>	 still having it properly enabled is better than waiting for puppet
[22:23:43] <bblack>	 yeah
[22:23:56] <bblack>	 for wdqs1001 runs at :19 and :49 every hour for puppet
[22:24:09] <hoo>	 That matches with the recoveries
[22:24:27] <bblack>	 only way to stop puppet from turning on something manually disabled is to disable puppet (as root puppet agent --disable "reason why")
[22:25:14] <hoo>	 I don't even have shell on these boxes, so I'm mostly guessing around :P
[22:25:34] <grrrit-wm>	 (03PS1) 10MaxSem: [WIP] Switch www.wikimedia.org to source control [puppet] - 10https://gerrit.wikimedia.org/r/249009 (https://phabricator.wikimedia.org/T115964) 
[22:27:03] <wikibugs>	 6operations: puppetize wdqs service startup on boot - https://phabricator.wikimedia.org/T116673#1755486 (10hoo) 5Open>3Resolved a:3hoo
[22:27:37] <wikibugs>	 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, and 2 others: Kartotherian service logs inaccessible (systemd?) and not updated (/var/log) - https://phabricator.wikimedia.org/T115067#1755489 (10Yurik) 5Resolved>3Open Few issues: 1. Only works on maps2001, not {2-4} 2...
[22:28:08] <grrrit-wm>	 (03PS2) 10MaxSem: [WIP] Switch www portals to be deployed from Git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248526 (https://phabricator.wikimedia.org/T115964) 
[22:30:34] <icinga-wm>	 PROBLEM - NTP on wdqs1001 is CRITICAL: NTP CRITICAL: Offset unknown
[22:35:56] <wikibugs>	 6operations, 7Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1755565 (10greg) I haven't been seeing any more of my phab+gerrit mails going into spam.  Anyone else?
[22:39:21] <wikibugs>	 6operations, 7Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1755571 (10MoritzMuehlenhoff) >>! In T115416#1755565, @greg wrote: > I haven't been seeing any more of my phab+gerrit mails going into spam. >  > Anyone else?  Me neither, work...
[22:44:35] <wikibugs>	 6operations, 7Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1755598 (10Legoktm) >>! In T115416#1735775, @greg wrote: > The middle one (list-unsubscribe header) doesn't make sense to my pedantic brain (they aren't mailing lists); anyone...
[22:45:20] <greg-g>	 legoktm: haha ^
[22:48:59] <grrrit-wm>	 (03PS2) 10EBernhardson: Send CirrusSearch writes to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248871 
[22:49:06] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Send CirrusSearch writes to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248871 (owner: 10EBernhardson)
[22:49:31] <grrrit-wm>	 (03PS1) 10BBlack: ssl_ciphersuite: add ECDHE+3DES options [puppet] - 10https://gerrit.wikimedia.org/r/249017 
[22:53:38] <wikibugs>	 6operations, 7Mail, 15User-greg: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1755627 (10greg) 5Open>3Resolved a:3greg Calling this good. If anyone wants to take up the issues raised by @JKrauska they should be separate tasks.
[22:55:09] <grrrit-wm>	 (03PS3) 10EBernhardson: Send CirrusSearch writes to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248871 
[22:56:03] <icinga-wm>	 PROBLEM - puppet last run on mw2058 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:00:04] <jouncebot>	 RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151026T2300).
[23:00:04] <jouncebot>	 Krenair ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:42] <Krenair>	 Hi
[23:00:51] <Krenair>	 ebernhardson, I'll do your patch first
[23:01:17] <grrrit-wm>	 (03CR) 10Alex Monk: [C: 032] Send CirrusSearch writes to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248871 (owner: 10EBernhardson)
[23:01:24] <grrrit-wm>	 (03Merged) 10jenkins-bot: Send CirrusSearch writes to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248871 (owner: 10EBernhardson)
[23:01:57] <ebernhardson>	 Krenair: k
[23:03:43] <Krenair>	 ebernhardson, I've synced it to mw1017 and testwiki should now be writing to codfw
[23:03:47] <Krenair>	 please check
[23:04:42] <ebernhardson>	 Krenair: checking
[23:05:13] <ebernhardson>	 Krenair: everything looks sane
[23:05:17] <Krenair>	 ok
[23:06:10] <logmsgbot>	 !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/248871/3 (duration: 00m 18s)
[23:06:13] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:06:20] <Krenair>	 ebernhardson, now the other sites should be doing it, please confirm all is ok
[23:07:11] <grrrit-wm>	 (03PS1) 10RobH: Revert "setting lawrencium install params" [puppet] - 10https://gerrit.wikimedia.org/r/249020 
[23:07:31] <grrrit-wm>	 (03PS2) 10RobH: Revert "setting lawrencium install params" [puppet] - 10https://gerrit.wikimedia.org/r/249020 
[23:07:51] <grrrit-wm>	 (03CR) 10RobH: [C: 032] Revert "setting lawrencium install params" [puppet] - 10https://gerrit.wikimedia.org/r/249020 (owner: 10RobH)
[23:10:30] <ebernhardson>	 Krenair: hrm, getting errors where it complains pages dont exist in the new cluster :S 
[23:10:39] <ebernhardson>	 those don't really hurt anything, lemme double check job queue
[23:11:23] <ebernhardson>	 Krenair: yea its safe to leave out there. That class of error we get rid of immediatly. Leaving it up will let me debug whats going on here and i can undeploy it later once i figure out what is going on
[23:11:41] <Krenair>	 ok
[23:12:13] <grrrit-wm>	 (03PS2) 10Alex Monk: Remove old unused wmgUseAPIRequestLog code referencing locke, a pmtpa host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248475 
[23:12:36] <grrrit-wm>	 (03CR) 10Alex Monk: [C: 032] Remove old unused wmgUseAPIRequestLog code referencing locke, a pmtpa host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248475 (owner: 10Alex Monk)
[23:12:42] <grrrit-wm>	 (03Merged) 10jenkins-bot: Remove old unused wmgUseAPIRequestLog code referencing locke, a pmtpa host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248475 (owner: 10Alex Monk)
[23:15:15] <logmsgbot>	 !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/248475/ (duration: 00m 17s)
[23:15:19] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:16:00] <logmsgbot>	 !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/248475/ (duration: 00m 17s)
[23:16:04] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:16:19] <grrrit-wm>	 (03PS2) 10Alex Monk: Remove old bugzilla and mingle.corp RSS whitelist entries from mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248478 
[23:16:43] <grrrit-wm>	 (03CR) 10Alex Monk: [C: 032] Remove old bugzilla and mingle.corp RSS whitelist entries from mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248478 (owner: 10Alex Monk)
[23:16:53] <grrrit-wm>	 (03Merged) 10jenkins-bot: Remove old bugzilla and mingle.corp RSS whitelist entries from mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248478 (owner: 10Alex Monk)
[23:17:31] <logmsgbot>	 !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/248478/ (duration: 00m 17s)
[23:17:34] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:17:42] <Krenair>	 yay, no corp references in wmf-config :)
[23:18:24] <grrrit-wm>	 (03PS2) 10Alex Monk: Change Venetian Wikipedia logo per admin request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248633 (https://phabricator.wikimedia.org/T116476) 
[23:18:29] <grrrit-wm>	 (03CR) 10Alex Monk: [C: 032] Change Venetian Wikipedia logo per admin request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248633 (https://phabricator.wikimedia.org/T116476) (owner: 10Alex Monk)
[23:18:36] <grrrit-wm>	 (03Merged) 10jenkins-bot: Change Venetian Wikipedia logo per admin request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248633 (https://phabricator.wikimedia.org/T116476) (owner: 10Alex Monk)
[23:18:42] <wikibugs>	 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, and 2 others: Kartotherian service logs inaccessible (systemd?) and not updated (/var/log) - https://phabricator.wikimedia.org/T115067#1755730 (10Dzahn) >>! In T115067#1755489, @Yurik wrote: > 1. Only works on maps2001, not...
[23:19:15] <logmsgbot>	 !log krenair@tin Synchronized w/static/images/project-logos/vecwiki.png: https://gerrit.wikimedia.org/r/#/c/248633/ (duration: 00m 18s)
[23:19:19] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:20:18] <Krenair>	 (purged etc. too)
[23:21:24] <grrrit-wm>	 (03PS2) 10Alex Monk: Add QuickSurveys to extension-list-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248632 
[23:21:29] <grrrit-wm>	 (03CR) 10Alex Monk: [C: 032] Add QuickSurveys to extension-list-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248632 (owner: 10Alex Monk)
[23:21:34] <grrrit-wm>	 (03Merged) 10jenkins-bot: Add QuickSurveys to extension-list-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248632 (owner: 10Alex Monk)
[23:22:45] <logmsgbot>	 !log krenair@tin Synchronized wmf-config/extension-list-labs: https://gerrit.wikimedia.org/r/#/c/248632/ (duration: 00m 17s)
[23:22:49] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:24:03] <icinga-wm>	 RECOVERY - puppet last run on mw2058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:24:45] <wikibugs>	 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, and 2 others: Kartotherian service logs inaccessible (systemd?) and not updated (/var/log) - https://phabricator.wikimedia.org/T115067#1755777 (10Dzahn) >>! In T115067#1755730, @Dzahn wrote:  > Checking this i found 2002-20...
[23:28:15] <grrrit-wm>	 (03PS3) 10Alex Monk: Changed wgNamespacesToBeSearchedDefault for itwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248371 (https://phabricator.wikimedia.org/T114932) (owner: 10Pmlineditor)
[23:28:21] <grrrit-wm>	 (03CR) 10Alex Monk: [C: 032] Changed wgNamespacesToBeSearchedDefault for itwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248371 (https://phabricator.wikimedia.org/T114932) (owner: 10Pmlineditor)
[23:28:26] <grrrit-wm>	 (03Merged) 10jenkins-bot: Changed wgNamespacesToBeSearchedDefault for itwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248371 (https://phabricator.wikimedia.org/T114932) (owner: 10Pmlineditor)
[23:29:14] <logmsgbot>	 !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/248371/ (duration: 00m 20s)
[23:29:17] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:32:47] <grrrit-wm>	 (03PS1) 10Dzahn: tilerator/k10n: add trailing * to journalctl sudo [puppet] - 10https://gerrit.wikimedia.org/r/249023 (https://phabricator.wikimedia.org/T115067) 
[23:35:29] <mutante>	 hoo: nominating hoo for wdqs-admin group
[23:36:19] <grrrit-wm>	 (03CR) 10Yurik: [C: 031] "Thanks for getting the patch quickly!" [puppet] - 10https://gerrit.wikimedia.org/r/249023 (https://phabricator.wikimedia.org/T115067) (owner: 10Dzahn)
[23:36:54] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] "just follow-up fix to ACKed request. this was the intention." [puppet] - 10https://gerrit.wikimedia.org/r/249023 (https://phabricator.wikimedia.org/T115067) (owner: 10Dzahn)
[23:37:02] <hoo>	 Stas doesn't want more admin, so probably not
[23:41:18] <wikibugs>	 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, and 2 others: Kartotherian service logs inaccessible (systemd?) and not updated (/var/log) - https://phabricator.wikimedia.org/T115067#1755845 (10Dzahn) >>! In T115067#1755489, @Yurik wrote: > 2. Turns out this approach doe...
[23:41:54] <icinga-wm>	 PROBLEM - puppet last run on db2017 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:44:35] <wikibugs>	 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, and 2 others: Kartotherian service logs inaccessible (systemd?) and not updated (/var/log) - https://phabricator.wikimedia.org/T115067#1755853 (10Dzahn) 5Open>3Resolved
[23:47:36] <grrrit-wm>	 (03PS1) 10Yuvipanda: ssh: Allow direct login as servicegroups [puppet] - 10https://gerrit.wikimedia.org/r/249024 (https://phabricator.wikimedia.org/T113979) 
[23:49:15] <grrrit-wm>	 (03PS2) 10Yuvipanda: ssh: Allow direct login as servicegroups [puppet] - 10https://gerrit.wikimedia.org/r/249024 (https://phabricator.wikimedia.org/T113979) 
[23:49:34] <grrrit-wm>	 (03PS3) 10Yuvipanda: ssh: Allow direct login as servicegroups [puppet] - 10https://gerrit.wikimedia.org/r/249024 (https://phabricator.wikimedia.org/T113979) 
[23:49:44] <grrrit-wm>	 (03Abandoned) 10Dzahn: admin: create group aqs-restbase-deployers [puppet] - 10https://gerrit.wikimedia.org/r/248378 (https://phabricator.wikimedia.org/T116169) (owner: 10Dzahn)
[23:49:58] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032 V: 032] ssh: Allow direct login as servicegroups [puppet] - 10https://gerrit.wikimedia.org/r/249024 (https://phabricator.wikimedia.org/T113979) (owner: 10Yuvipanda)
[23:54:24] <wikibugs>	 10Ops-Access-Requests, 6operations: wdqs-admin group membership for Marius Hoch (hoo) and Jan Zerebecki - https://phabricator.wikimedia.org/T116702#1755904 (10Dzahn) 3NEW
[23:55:27] <matt_flaschen>	 Krenair, could you let me know when you're done SWAT?  We have a patch (wasn't planned for SWAT, it's an unbreak now we just discovered).
[23:55:44] <Krenair>	 matt_flaschen, yep, sorry, I'm done
[23:55:51] <matt_flaschen>	 No problem, thanks.
[23:55:51] <Krenair>	 want me to sync something?
[23:55:57] <Krenair>	 or are you going to? or..?
[23:56:07] <matt_flaschen>	 Krenair, sure if you don't mind: https://gerrit.wikimedia.org/r/#/c/249026/
[23:56:56] <grrrit-wm>	 (03PS1) 10Dzahn: admin: hoo and jzerebecki for wdqs admins [puppet] - 10https://gerrit.wikimedia.org/r/249027 (https://phabricator.wikimedia.org/T116702) 
[23:56:57] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: wdqs-admin group membership for Marius Hoch (hoo) and Jan Zerebecki - https://phabricator.wikimedia.org/T116702#1755922 (10Dzahn)
[23:56:58] <Krenair>	 doing
[23:57:02] <matt_flaschen>	 Thanks
[23:57:46] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: wdqs-admin group membership for Marius Hoch (hoo) and Jan Zerebecki - https://phabricator.wikimedia.org/T116702#1755932 (10Dzahn) p:5Triage>3Normal
[23:58:00] <wikibugs>	 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, and 2 others: Kartotherian service logs inaccessible (systemd?) and not updated (/var/log) - https://phabricator.wikimedia.org/T115067#1755934 (10Yurik) Works, awesome, thanks!