[00:06:35] JohnLewis, I guess it's possible that shipping to eqiad was suggested, but solr3 went to codfw instead
[00:06:47] possible
[00:16:32] either way the thing about capella is still a bit strange
[00:17:43] at least we know the server is unused in a rack in codfw
[00:19:47] JohnLewis, clue
[00:20:10] https://wikitech.wikimedia.org/w/index.php?title=Server_Spares&diff=152418&oldid=152413
[00:20:37] PowerEdge R420
[00:20:51] https://wikitech.wikimedia.org/wiki/Mobile1
[00:20:57] Poweredge 1950
[00:21:15] not the same then :)
[00:21:32] plus the paper trail of capella -> solr3 -> Dell is helpful
[00:44:35] 6operations, 10Wikimedia-General-or-Unknown, 7database: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1201372 (10aaron) >>! In T26675#1152976, @Krenair wrote: > I suspect that if we had this blob laying around before, it may have been...
[01:15:16] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected
[01:54:31] (03CR) 10Dereckson: "Changes in CommonSettings.php and InitialiseSettings.php looks good to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203627 (https://phabricator.wikimedia.org/T15712) (owner: 10devunt)
[02:08:19] (03CR) 10Dereckson: [C: 04-1] "Now the security review is done, the first step before a live deployment is to deploy it on the beta cluster to see all works really fine:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203627 (https://phabricator.wikimedia.org/T15712) (owner: 10devunt)
[02:10:08] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL 21.43% of data above the critical threshold [100000000.0]
[02:21:18] !log l10nupdate Synchronized php-1.25wmf24/cache/l10n: (no message) (duration: 06m 23s)
[02:21:32] Logged the message, Master
[02:26:10] !log LocalisationUpdate completed (1.25wmf24) at 2015-04-12 02:25:07+00:00
[02:26:17] Logged the message, Master
[02:29:57] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0]
[02:37:37] RECOVERY - Outgoing network saturation on labstore1001 is OK Less than 10.00% above the threshold [75000000.0]
[02:41:52] !log l10nupdate Synchronized php-1.26wmf1/cache/l10n: (no message) (duration: 05m 44s)
[02:41:57] Logged the message, Master
[02:46:18] !log LocalisationUpdate completed (1.26wmf1) at 2015-04-12 02:45:15+00:00
[02:46:22] Logged the message, Master
[02:50:57] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[03:35:27] PROBLEM - puppet last run on mw1074 is CRITICAL Puppet has 1 failures
[03:35:47] PROBLEM - puppet last run on mw1163 is CRITICAL Puppet has 1 failures
[03:36:37] PROBLEM - puppet last run on mw1087 is CRITICAL Puppet has 1 failures
[03:41:56] PROBLEM - puppet last run on mw1093 is CRITICAL Puppet has 1 failures
[03:42:57] PROBLEM - puppet last run on mw2064 is CRITICAL Puppet has 1 failures
[03:48:07] PROBLEM - puppet last run on es2009 is CRITICAL puppet fail
[03:52:06] PROBLEM - puppet last run on mw1169 is CRITICAL Puppet has 1 failures
[03:57:58] RECOVERY - puppet last run on mw1074 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures
[03:58:27] RECOVERY - puppet last run on mw1163 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures
[03:59:06] RECOVERY - puppet last run on mw2064 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures
[03:59:16] RECOVERY - puppet last run on mw1087 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:59:36] RECOVERY - puppet last run on mw1093 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures
[04:00:07] RECOVERY - puppet last run on mw1169 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures
[04:05:57] RECOVERY - puppet last run on es2009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:29:17] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100%
[04:30:56] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 42.93 ms
[04:47:26] 6operations, 7database: Better backup coverage for X1 database cluster - https://phabricator.wikimedia.org/T95835#1201504 (10Mattflaschen) 3NEW
[04:49:35] 6operations, 7database: Better backup coverage for X1 database cluster - https://phabricator.wikimedia.org/T95835#1201513 (10Mattflaschen)
[05:28:48] anyone know how we manage clock drift in the apache cluster? if i'm processing some data, how much fudge should i put in for that when guessing at an order of operations?
[05:29:36] i can probably declare something with 10s timestamp difference to be ordered, but what about 5s? 2s? (the timestamps i have are ms)
[05:30:05] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Apr 12 05:29:02 UTC 2015 (duration 29m 1s)
[05:30:12] Logged the message, Master
[05:32:26] ebernhardson: hey! I’m looking and it looks like we run ntp on almost all our servers
[05:32:28] * YuviPanda|zzz checks again
[05:33:01] yeah, they’re on apaches
[05:33:04] now to see how it was
[05:34:31] ntp is a good sign, i think in general it keeps a cluster pretty darn close if run regularly
[05:35:09] ebernhardson: yes, and we have a daemon running
[05:35:32] ebernhardson: I’m going to run ‘date’ on all our apaches via salt and see how we’re doing
[05:36:23] YuviPanda|zzz: excellent, thanks
[05:36:38] ebernhardson: we use ntp and it's pretty good afaik...there is $wgClockSkewFudge
[05:37:00] ebernhardson: they seem in sync to me from a cursory look
[05:37:40] YuviPanda|zzz: thanks!
[05:38:19] yw
[05:41:00] cheesecat: wow, that is used in exactly one place. you really know your esoteric pieces of mediawiki :) thanks too
[05:41:21] * cheesecat is just a cat made of cheese
[05:41:58] hehe :)
[05:43:35] http://en.wikipedia.org/wiki/Chechil !
[05:56:27] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[05:58:17] cheesecat: is that what MaxSem had?!
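The ordering heuristic ebernhardson and YuviPanda|zzz converge on above reduces to one rule: with NTP keeping hosts within some bounded skew, only treat two events as ordered when their timestamps differ by more than that bound; anything closer is effectively concurrent. A minimal sketch, assuming millisecond timestamps and an illustrative 2-second fudge (not a value measured on the cluster):

```python
# Minimal sketch of the skew-fudge rule discussed above, assuming
# millisecond timestamps and an illustrative 2000 ms bound; a real
# bound should come from observed NTP offsets, not this constant.

FUDGE_MS = 2000

def compare_events(ts_a_ms: int, ts_b_ms: int, fudge_ms: int = FUDGE_MS) -> int:
    """Return -1 if a definitely happened first, 1 if b did,
    and 0 if the gap is within the skew fudge (order unknowable)."""
    delta = ts_b_ms - ts_a_ms
    if delta > fudge_ms:
        return -1
    if delta < -fudge_ms:
        return 1
    return 0

# 10 s apart: safely ordered. 500 ms apart: treat as concurrent.
assert compare_events(1428800000000, 1428800010000) == -1
assert compare_events(1428800000000, 1428800000500) == 0
```

$wgClockSkewFudge, which cheesecat mentions and which is used in exactly one place in MediaWiki, applies the same kind of fudge on the PHP side when comparing timestamps.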
[06:01:17] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60615 bytes in 3.966 second response time
[06:08:35] YuviPanda|zzz, yes:P
[06:08:40] :D
[06:30:07] PROBLEM - puppet last run on mw2066 is CRITICAL Puppet has 1 failures
[06:30:07] PROBLEM - puppet last run on cp3042 is CRITICAL Puppet has 2 failures
[06:31:16] PROBLEM - puppet last run on db2036 is CRITICAL Puppet has 1 failures
[06:31:37] PROBLEM - puppet last run on mw2114 is CRITICAL Puppet has 1 failures
[06:34:36] PROBLEM - puppet last run on mw2127 is CRITICAL Puppet has 4 failures
[06:34:57] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 1 failures
[06:35:27] PROBLEM - puppet last run on mw2184 is CRITICAL Puppet has 1 failures
[06:46:17] RECOVERY - puppet last run on mw2114 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:46:27] RECOVERY - puppet last run on mw2066 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:46:27] RECOVERY - puppet last run on cp3042 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:46:57] RECOVERY - puppet last run on mw2184 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:47:37] RECOVERY - puppet last run on db2036 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:47:37] RECOVERY - puppet last run on mw2127 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:48:07] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:52:57] PROBLEM - puppet last run on mw2048 is CRITICAL puppet fail
[08:12:27] RECOVERY - puppet last run on mw2048 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:28:55] (03PS2) 10devunt: Add Josa extension and deploy to Korean language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203627 (https://phabricator.wikimedia.org/T15712)
[08:29:46] (03PS3) 10devunt: Add Josa extension and deploy to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203627 (https://phabricator.wikimedia.org/T15712)
[08:35:19] (03PS4) 10devunt: Add Josa extension and deploy to testwiki and labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203627 (https://phabricator.wikimedia.org/T15712)
[08:44:12] devunt, Chris Steipp mentioned coding conventions in his review, I don't see that it has been addressed
[08:45:36] I see there's https://gerrit.wikimedia.org/r/#/c/202754/1/Josa.class.php but that's not enough
[08:51:21] MaxSem, I checked all files with code-utils/stylize.php
[08:51:31] Is there something else that I have to clean?
[08:55:14] function names must be camelCase
[08:55:33] documentation
[08:55:44] utf8_to_unicode doesn't always return a result
[08:56:33] also, this function name is unclear because UTF-8 is Unicode
[09:01:36] PROBLEM - puppet last run on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:01:36] PROBLEM - RAID on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:01:46] PROBLEM - Hadoop DataNode on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:01:57] PROBLEM - SSH on analytics1017 is CRITICAL - Socket timeout after 10 seconds
[09:01:57] PROBLEM - salt-minion processes on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:02:17] PROBLEM - Disk space on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
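For context on the review exchange between MaxSem and devunt above: the Josa extension picks between Korean postposition pairs (은/는, 이/가, 을/를) based on whether the preceding syllable ends in a final consonant, which is why the code needs to get from UTF-8 bytes to code points at all. A sketch of the underlying rule, in Python rather than the extension's PHP, with illustrative function names that are not the extension's API:

```python
# Sketch of the rule behind the Josa extension. A precomposed Hangul
# syllable (U+AC00..U+D7A3) has a final consonant (batchim) iff
# (code point - 0xAC00) % 28 != 0; the choice between particle pairs
# like 은/는 hinges on that. Function names are illustrative only.

def has_batchim(syllable: str) -> bool:
    code = ord(syllable)
    if not 0xAC00 <= code <= 0xD7A3:
        raise ValueError("not a precomposed Hangul syllable")
    return (code - 0xAC00) % 28 != 0

def pick_josa(word: str, with_batchim: str, without_batchim: str) -> str:
    return with_batchim if has_batchim(word[-1]) else without_batchim

print(pick_josa("사람", "은", "는"))  # 사람은 (람 ends in ㅁ)
print(pick_josa("위키", "은", "는"))  # 위키는 (키 has no batchim)
```

Python 3 strings already expose code points directly, which sidesteps the decoding step that the utf8_to_unicode helper under review has to perform in PHP.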
[09:02:36] PROBLEM - DPKG on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:02:37] PROBLEM - dhclient process on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:02:37] PROBLEM - Hadoop NodeManager on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:02:47] PROBLEM - configured eth on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:31:51] 6operations: Ensure all disaster recover documentation is in one central location - https://phabricator.wikimedia.org/T95841#1201593 (10Peachey88) 3NEW
[09:35:16] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 609
[09:40:16] RECOVERY - check_mysql on db1008 is OK: Uptime: 2659848 Threads: 2 Questions: 16470797 Slow queries: 17712 Opens: 50615 Flush tables: 2 Open tables: 64 Queries per second avg: 6.192 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:40:48] RECOVERY - SSH on analytics1017 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[09:40:56] RECOVERY - salt-minion processes on analytics1017 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:41:07] RECOVERY - Disk space on analytics1017 is OK: DISK OK
[09:41:26] RECOVERY - DPKG on analytics1017 is OK: All packages OK
[09:41:27] RECOVERY - dhclient process on analytics1017 is OK: PROCS OK: 0 processes with command name dhclient
[09:41:27] RECOVERY - Hadoop NodeManager on analytics1017 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[09:41:37] RECOVERY - configured eth on analytics1017 is OK - interfaces up
[09:41:57] RECOVERY - puppet last run on analytics1017 is OK Puppet is currently enabled, last run 58 minutes ago with 0 failures
[09:41:58] RECOVERY - RAID on analytics1017 is OK no disks configured for RAID
[09:42:07] RECOVERY - Hadoop DataNode on analytics1017 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[10:34:13] (03PS1) 10Tim Landscheidt: gridengine: Puppetize gridengine-mailer [puppet] - 10https://gerrit.wikimedia.org/r/203656 (https://phabricator.wikimedia.org/T63160)
[11:01:02] (03CR) 10Tim Landscheidt: "Without this change:" [puppet] - 10https://gerrit.wikimedia.org/r/148917 (owner: 10Tim Landscheidt)
[11:31:34] 6operations, 10ops-fundraising: Ensure all disaster recover documentation is in one central location - https://phabricator.wikimedia.org/T95841#1201703 (10Krenair)
[11:31:55] 6operations, 10ops-fundraising: Ensure all disaster recover documentation is in one central location - https://phabricator.wikimedia.org/T95841#1201593 (10Krenair) Some of those need updating too...
[11:50:07] 6operations, 10ops-fundraising: Ensure all disaster recover documentation is in one central location - https://phabricator.wikimedia.org/T95841#1201733 (10Peachey88) >>! In T95841#1201703, @Krenair wrote: > Some of those need updating too... Subtasks, Yo! (or something)
[12:24:46] PROBLEM - puppet last run on mw2041 is CRITICAL Puppet has 1 failures
[12:36:08] PROBLEM - puppet last run on wtp2006 is CRITICAL puppet fail
[12:42:37] RECOVERY - puppet last run on mw2041 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:53:57] RECOVERY - puppet last run on wtp2006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:00:02] (03CR) 10Matanya: [C: 031] various role classes: moar small lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/202653 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn)
[14:06:55] (03PS2) 10Tim Landscheidt: Tools: Fix and simplify exim redirectors [puppet] - 10https://gerrit.wikimedia.org/r/148917
[14:09:12] (03CR) 10Tim Landscheidt: "No change (and still undeliverable = good) for T73692 addresses:" [puppet] - 10https://gerrit.wikimedia.org/r/148917 (owner: 10Tim Landscheidt)
[14:47:18] (03PS1) 10Andrew Bogott: Modify the scheduler filter to allow host aggregates (maybe) [puppet] - 10https://gerrit.wikimedia.org/r/203665
[14:47:53] !log Attached Helmut Welger@eowiki to the global account of the same name
[14:48:01] Logged the message, Master
[14:48:13] !log Attached Bradypus@enwiki and Bradypus@commonswiki to the global account of the same name
[14:48:17] Logged the message, Master
[14:49:53] (03CR) 10Andrew Bogott: [C: 032] Modify the scheduler filter to allow host aggregates (maybe) [puppet] - 10https://gerrit.wikimedia.org/r/203665 (owner: 10Andrew Bogott)
[14:54:03] !log Attached Peng@dewiktionary to the global account of the same name
[14:54:09] Logged the message, Master
[15:02:27] (03PS1) 10Andrew Bogott: Add labvirt1001 to the compute pool [puppet] - 10https://gerrit.wikimedia.org/r/203666
[15:06:49] (03PS1) 10Tim Landscheidt: Tools: Only forward mail for project users [puppet] - 10https://gerrit.wikimedia.org/r/203667 (https://phabricator.wikimedia.org/T93526)
[15:07:17] !log Attached Yagosaga@dewikibooks and Yagosaga@commonswiki to the global account of the same name
[15:07:21] Logged the message, Master
[15:12:59] (03CR) 10Tim Landscheidt: "Tested on Toolsbeta." [puppet] - 10https://gerrit.wikimedia.org/r/203667 (https://phabricator.wikimedia.org/T93526) (owner: 10Tim Landscheidt)
[15:14:50] !log Attached Srbauer@nowiki and Srbauer@sourceswiki to the global account of the same name
[15:14:54] Logged the message, Master
[15:22:16] !log Attached Aloiswuest@commonswiki, Aloiswuest@dewikiquote and Aloiswuest@dewiktionary to the global account of the same name
[15:22:20] Logged the message, Master
[15:32:36] PROBLEM - puppet last run on mw1133 is CRITICAL Puppet has 1 failures
[15:33:17] PROBLEM - puppet last run on mw1190 is CRITICAL Puppet has 1 failures
[15:39:56] !log Attached Manfred Strumpf@commonswiki to the global account of the same name
[15:40:01] Logged the message, Master
[15:49:28] RECOVERY - puppet last run on mw1190 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures
[15:50:18] RECOVERY - puppet last run on mw1133 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures
[16:32:57] 6operations, 10RESTBase, 7Monitoring, 5Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1201862 (10GWicke) List of graphite-based alerts in puppet: https://github.com/wikimedia/operations-puppet/search?utf8=%E2%9...
[16:43:37] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[16:46:47] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60655 bytes in 0.598 second response time
[16:51:46] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[16:54:28] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0]
[16:56:07] RECOVERY - High load for whatever reason on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[16:58:07] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60634 bytes in 0.360 second response time
[17:10:56] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333
[17:12:58] 6operations, 6Labs: One instance hammering on NFS should not make it unavailable to everyone else - https://phabricator.wikimedia.org/T95766#1201894 (10JanZerebecki) At the time this happened I looked at https://grafana.wikimedia.org/#/dashboard/db/labs-monitoring and the inbound traffic on labstore1001 nearly...
[17:16:27] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0
[18:09:07] PROBLEM - puppet last run on mc2011 is CRITICAL puppet fail
[18:26:57] RECOVERY - puppet last run on mc2011 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures
[18:46:09] (03CR) 10Dereckson: [C: 04-1] Give patrol to reviewers for testwiki/enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199321 (https://phabricator.wikimedia.org/T93798) (owner: 10Cenarium)
[19:06:10] (03PS1) 10Yuvipanda: tools: time out webservice commands after 30s waiting for job [puppet] - 10https://gerrit.wikimedia.org/r/203682
[19:07:07] (03CR) 10Yuvipanda: "Note that my alternative to using this was to use signals.alarm, but invoking signal handlers for this seems like one of those things that" [puppet] - 10https://gerrit.wikimedia.org/r/203682 (owner: 10Yuvipanda)
[19:29:53] (03CR) 10Merlijn van Deen: "qsub should return the job ID; can't we just match using that? (I understand it's more work, though, so this can be an OK temporary fix)" [puppet] - 10https://gerrit.wikimedia.org/r/203682 (owner: 10Yuvipanda)
[19:30:58] (03CR) 10Yuvipanda: "We don't have the job id to begin with, so we can't use it everywhere, no?" [puppet] - 10https://gerrit.wikimedia.org/r/203682 (owner: 10Yuvipanda)
[19:36:11] (03CR) 10Merlijn van Deen: "Why not? We start the job in start_web_job, and qsub returns" [puppet] - 10https://gerrit.wikimedia.org/r/203682 (owner: 10Yuvipanda)
[19:37:07] (03CR) 10Tim Landscheidt: "Why don't we have the job number? As Merlijn wrote, it's returned by qsub when starting a web service. And if we're stopping, it's the j" [puppet] - 10https://gerrit.wikimedia.org/r/203682 (owner: 10Yuvipanda)
[19:39:15] (03CR) 10Tim Landscheidt: ""qsub -terse" is probably the best approach." [puppet] - 10https://gerrit.wikimedia.org/r/203682 (owner: 10Yuvipanda)
[19:41:00] (03CR) 10Yuvipanda: "Oops, you are all totally right. The job number will still be useless when stopping a webservice and for status, but those seem less likel" [puppet] - 10https://gerrit.wikimedia.org/r/203682 (owner: 10Yuvipanda)
[19:49:44] valhallasw`cloud: webservice2 is still hackier than I’d like, though
[19:49:52] we need a proper qsub / qstat abstraction
[19:50:33] (03CR) 10Tim Landscheidt: "It would block for example on "webservice stop", if the service monitor (what's it called again?) is faster in re-starting the web service" [puppet] - 10https://gerrit.wikimedia.org/r/203682 (owner: 10Yuvipanda)
[20:04:04] (03CR) 10Dereckson: [C: 04-1] "The goal of the change is to deploy only the extension to http://ko.wikipedia.beta.wmflabs.org, so we can check all is fine in a environme" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203627 (https://phabricator.wikimedia.org/T15712) (owner: 10devunt)
[20:09:20] (03PS2) 10Yuvipanda: tools: time out webservice commands after 30s waiting for job [puppet] - 10https://gerrit.wikimedia.org/r/203682
[20:11:04] (03CR) 10Merlijn van Deen: [C: 031] "This is an improvement over the status quo, even if it's not the perfect solution." [puppet] - 10https://gerrit.wikimedia.org/r/203682 (owner: 10Yuvipanda)
[20:19:17] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 21.43% of data above the critical threshold [500.0]
[20:38:46] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[20:44:07] PROBLEM - puppet last run on mw2183 is CRITICAL puppet fail
[21:03:27] RECOVERY - puppet last run on mw2183 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[23:06:21] (03CR) 10Cenarium: "I'm not sure what you want edited in the commit message." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199321 (https://phabricator.wikimedia.org/T93798) (owner: 10Cenarium)
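To make the qsub/qstat thread above concrete: Tim Landscheidt's "qsub -terse" suggestion hands the caller the job ID at start time, and the 30-second timeout in the patch amounts to polling with a deadline rather than installing a signal.alarm handler. A rough sketch of both pieces, assuming standard gridengine tooling; this is an illustration of the idea, not the actual webservice2 code:

```python
# Sketch of the approach converged on above: capture the job ID from
# "qsub -terse" at start, then poll qstat with a deadline instead of a
# signal.alarm handler. Illustrative code, not the webservice2 source.
import subprocess
import time

def start_web_job(job_script: str) -> str:
    # With -terse, gridengine's qsub prints only the new job ID.
    out = subprocess.check_output(["qsub", "-terse", job_script])
    return out.decode().strip()

def wait_for_job(job_id: str, timeout_s: float = 30.0) -> bool:
    """Poll "qstat -j <id>" until the job exists or the deadline passes."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        result = subprocess.run(["qstat", "-j", job_id],
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        if result.returncode == 0:  # qstat found the job
            return True
        time.sleep(1)
    return False  # caller reports a timeout instead of hanging forever
```

Polling keeps the failure mode simple: on timeout the command can report an error and exit, rather than hanging forever on a job the service monitor may already be restarting, which is the blocking case Tim Landscheidt flags above.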