[00:00:04] (CR) Rush: [C: 2] bugzilla: switch svc_name to old-bugzilla [puppet] - https://gerrit.wikimedia.org/r/175144 (owner: Dzahn)
[00:26:41] (PS1) Dzahn: bugzilla: hardcode SSL cert name for migration [puppet] - https://gerrit.wikimedia.org/r/175313
[00:27:17] (CR) Rush: [C: 1] bugzilla: hardcode SSL cert name for migration [puppet] - https://gerrit.wikimedia.org/r/175313 (owner: Dzahn)
[00:27:39] (CR) John F. Lewis: [C: 1] "Sane." [mediawiki-config] - https://gerrit.wikimedia.org/r/171219 (https://bugzilla.wikimedia.org/55737) (owner: Glaisher)
[00:28:04] (CR) John F. Lewis: [C: 1] "+1 with the deletion of the wiki." [puppet] - https://gerrit.wikimedia.org/r/170925 (owner: Glaisher)
[00:28:08] (PS2) Dzahn: bugzilla: disable cron jobs [puppet] - https://gerrit.wikimedia.org/r/175308
[00:28:27] (CR) Dzahn: [C: 2] bugzilla: disable cron jobs [puppet] - https://gerrit.wikimedia.org/r/175308 (owner: Dzahn)
[00:29:28] (PS2) Rush: bugzilla: hardcode SSL cert name for migration [puppet] - https://gerrit.wikimedia.org/r/175313 (owner: Dzahn)
[00:29:55] (CR) Dzahn: [C: 2] bugzilla: hardcode SSL cert name for migration [puppet] - https://gerrit.wikimedia.org/r/175313 (owner: Dzahn)
[00:51:21] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 306 seconds
[00:52:30] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[00:55:41] (PS1) John F. Lewis: admin: grant qchris tin access (through deployers) [puppet] - https://gerrit.wikimedia.org/r/175315
[00:56:02] (PS2) John F. Lewis: admin: grant qchris tin access (through deployers) [puppet] - https://gerrit.wikimedia.org/r/175315
[00:57:12] PROBLEM - HHVM rendering on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:57:20] PROBLEM - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:58:11] RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 68590 bytes in 0.513 second response time
[00:58:12] RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 68590 bytes in 0.735 second response time
[01:02:10] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: puppet fail
[01:04:51] PROBLEM - Disk space on analytics1021 is CRITICAL: DISK CRITICAL - free space: /run/shm 965 MB (3% inode=99%):
[01:21:21] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[02:12:52] !log l10nupdate Synchronized php-1.25wmf8/cache/l10n: (no message) (duration: 00m 01s)
[02:12:55] !log LocalisationUpdate completed (1.25wmf8) at 2014-11-23 02:12:55+00:00
[02:13:00] Logged the message, Master
[02:13:02] Logged the message, Master
[02:19:20] !log l10nupdate Synchronized php-1.25wmf9/cache/l10n: (no message) (duration: 00m 01s)
[02:19:23] !log LocalisationUpdate completed (1.25wmf9) at 2014-11-23 02:19:23+00:00
[02:19:24] Logged the message, Master
[02:19:27] Logged the message, Master
[03:13:11] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [500.0]
[03:14:11] PROBLEM - puppet last run on amssq38 is CRITICAL: CRITICAL: puppet fail
[03:14:11] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: puppet fail
[03:14:30] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: puppet fail
[03:14:42] PROBLEM - puppet last run on amssq42 is CRITICAL: CRITICAL: Puppet has 5 failures
[03:15:11] PROBLEM - puppet last run on amssq62 is CRITICAL: CRITICAL: Puppet has 2 failures
[03:23:11] PROBLEM - Packetloss_Average on erbium is CRITICAL: packet_loss_average CRITICAL: 12.1723110588
[03:27:11] RECOVERY - Packetloss_Average on erbium is OK: packet_loss_average OKAY: -0.129670595238
[03:27:40] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[03:30:11] RECOVERY - puppet last run on amssq42 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[03:31:51] RECOVERY - puppet last run on amssq62 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[03:32:11] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[03:33:00] PROBLEM - puppet last run on wtp1011 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:33:21] PROBLEM - puppet last run on mw1097 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:33:51] PROBLEM - puppet last run on ms-be1007 is CRITICAL: CRITICAL: Puppet has 2 failures
[03:34:01] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[03:35:02] RECOVERY - puppet last run on amssq38 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[03:35:48] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Nov 23 03:35:48 UTC 2014 (duration 35m 47s)
[03:35:54] Logged the message, Master
[03:41:31] PROBLEM - puppet last run on mw1019 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:47:31] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:48:16] Coren: around maybe?
[03:51:10] RECOVERY - puppet last run on ms-be1007 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[03:51:41] RECOVERY - puppet last run on mw1097 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[03:52:21] RECOVERY - puppet last run on wtp1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:53:23] it appears the job queue underwent cosmic inflation between 13:00 and 15:50 on the 14th
[03:55:34] andrewbogott_afk: Coren: virt1009 wants your love... our labs instances are unusable because of IO is saturation...
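
A quick way to quantify the saturation hoo reports here is to sample iowait and disk throughput from inside one of the affected instances. The sketch below is only illustrative: it uses psutil and a five-second window, whereas the people in this log were working from atop/top output.

    # Minimal sketch: sample CPU iowait and disk throughput from inside a labs
    # instance to confirm I/O saturation. Not what anyone in this log actually
    # ran (they used atop); psutil and the 5-second window are illustrative choices.
    import psutil

    def sample_io(interval=5):
        before = psutil.disk_io_counters()
        cpu = psutil.cpu_times_percent(interval=interval)  # blocks for `interval` seconds
        after = psutil.disk_io_counters()
        read_mb = (after.read_bytes - before.read_bytes) / 1e6
        write_mb = (after.write_bytes - before.write_bytes) / 1e6
        print("iowait: %.1f%%  idle: %.1f%%" % (cpu.iowait, cpu.idle))
        print("disk: %.1f MB read, %.1f MB written in %ds" % (read_mb, write_mb, interval))

    if __name__ == "__main__":
        sample_io()
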
[03:56:08] aude: ^
[03:56:13] hopes that's enough
[03:56:27] thanks
[03:58:50] RECOVERY - puppet last run on mw1019 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[04:05:52] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[04:46:35] jackmcbarn: if you discover such things, don't be shy about !logging them; the bot has no ACLs for a reason
[04:50:51] PROBLEM - puppet last run on db2037 is CRITICAL: CRITICAL: puppet fail
[05:09:31] RECOVERY - puppet last run on db2037 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[05:43:41] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0]
[05:44:00] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[05:44:54] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 10, down: 0, shutdown: 0
[05:59:11] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[06:29:31] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:42] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:21] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:01] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:11] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:41:20] PROBLEM - puppet last run on es1003 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:45:52] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:46:01] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:46:31] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:47:11] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:47:21] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:53:02] PROBLEM - puppet last run on db2003 is CRITICAL: CRITICAL: puppet fail
[06:58:50] RECOVERY - puppet last run on es1003 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[07:11:40] RECOVERY - puppet last run on db2003 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[12:41:41] RECOVERY - haproxy failover on dbproxy1002 is OK: OK check_failover servers up 2 down 0
[13:10:12] PROBLEM - puppet last run on amssq51 is CRITICAL: CRITICAL: puppet fail
[13:30:01] RECOVERY - puppet last run on amssq51 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[15:12:21] PROBLEM - puppet last run on ssl3002 is CRITICAL: CRITICAL: Puppet has 1 failures
[15:12:51] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: puppet fail
[15:22:00] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds
[15:30:41] RECOVERY - puppet last run on ssl3002 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[15:31:01] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures
[15:32:21] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[15:41:11] hoo: What's that about virt1009?
[15:41:26] Coren: Seems to be good again
[15:42:41] hoo: I got no page; whatever might have been wrong did not trigger an alert.
[15:43:26] mh... do you have a sysload threshold for getting paged?
[15:43:33] Cause I think that hit 250 (15 min)
[15:44:58] oh, well... looks like it's bad again
[15:45:59] Load isn't a useful metric - the box is 60% idle and below 5% iowait
[15:46:22] true, that... but iowait also probably isn't here
[15:47:41] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[15:48:55] hoo: I wouldn't be surprised if virt1009 was a little slugging - one of the disks is currently rebuilding its mirror - but there's a cap on how much io bandwidth that can take and I don't see anything unusual in the io atm.
[15:49:06] sluggish*
[15:51:34] Any idea why we could be seeing such poor io perf then?
[15:54:20] At first glance, there's nothing on the host that would cause it; but there may be an instance or project that's consuming a disproportionate /fraction/ of it. Where do you see symptoms atm?
[15:54:41] PROBLEM - puppetmaster https on virt1000 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error
[15:57:21] Coren: Our jenkins instances in labs where dead dog slow
[15:57:50] let me see if I can get more useful metrics
[15:58:29] yeah
[15:58:34] extremely high iowait
[15:58:43] so stuff just times out
[15:59:05] On /, or on NFS?
[15:59:12] also disk 100% busy according to atop (I do know that the busy time is flawed... but it can given an idea)
[15:59:40] pretty sure they run on /
[16:00:07] Can you point me at a suffering instance?
[16:00:38] all jobs are finished now... but I can put load on them again, if needed
[16:00:53] wikidata-jenkins[123]
[16:01:02] Lemme go look at the instance while it's not under load first.
[16:03:27] (CR) Tnegrin: [C: 1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/175315 (owner: John F. Lewis)
[16:03:36] Well, while unloaded there is nothing untoward. Do you think you can trigger the issue again?
[16:03:49] I sure can
[16:06:37] Coren: Load should be back
[16:08:54] Puppetmaster is dead. Got a large string of shinken warnings :)
[16:09:03] I don't think you're suffering because of the host - afaict, virt1009 still has elbow room. I see the actual instances hitting their ceiling pretty hard though.
[16:09:44] that's unexpected... because stuff was totally fine until yesterday or so
[16:09:47] * Coren ponders.
[16:10:45] There was an openstack upgrade late last week; it's entirely possible that the newer version is being more diligent in limiting how much IO an single tenant is allowed to consume.
[16:11:19] So it'd have become noticable that you're hitting a per-instance limit when in the past you'd have gotten more of the raw host instead.
[16:11:33] I'll need Andrew to make sure though - he's the one who did the upgrade.
[16:11:41] Ok
[16:11:48] anything we can do about this now-ish?
[16:12:38] I don't know. Lemme see if there are new tunables in nova.
[16:13:29] What's the project name?
[16:13:46] Wikidata-build
[16:21:07] I'm not finding anything that seems relevant in any way. :-(
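
Coren's hypothesis above is that the post-upgrade nova is throttling I/O per instance. In stock OpenStack such caps are normally attached to flavors as quota:disk_* extra specs and end up as libvirt blkdeviotune settings on each domain, so one way to test the hypothesis from the virt host is to dump those settings. A minimal sketch, assuming root on the virt host and that each guest's primary disk is exposed as "vda"; neither assumption is confirmed by the log.

    # Sketch: dump libvirt's per-domain disk I/O throttle settings on a virt host.
    # If the per-instance-limit hypothesis is right, non-zero *_bytes_sec or
    # *_iops_sec values should show up here. Assumes root and that each guest's
    # primary disk is "vda"; neither is confirmed by the log.
    import subprocess

    def running_domains():
        out = subprocess.check_output(["virsh", "list", "--name"]).decode()
        return [name for name in out.splitlines() if name.strip()]

    for domain in running_domains():
        # With no values given, blkdeviotune prints the current throttle settings.
        tune = subprocess.check_output(["virsh", "blkdeviotune", domain, "vda"]).decode()
        print(domain)
        print(tune)
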
[16:23:14] yikes
[16:23:52] mid-term we could probably have this run on a tmpfs (I have a similar setup for testing in a local VM and it works very well in ram)
[16:24:17] but short term we want to have some kind of jenkins... :S
[16:49:22] I'll tell Andrew about your issue; there may be a simple fix.
[16:49:44] thanks :)
[16:54:10] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[16:58:03] oh hi hoo
[17:00:18] what am I seeing hoo?
[17:00:37] Depends on what you look at, I guess :D
[17:00:55] *looks at his cup of tea*
[17:00:59] mhhhhhm, tea!
[17:01:02] But more seriously: Our jenkins instances are unusable because we use way to much IO
[17:01:16] ahhh, disk IO or network?
[17:01:21] so stuff just times out... all the time
[17:01:22] disk IO
[17:01:36] ouch *looks*
[17:02:08] Build timed out (after 30 minutes). etf
[17:02:10] *wtf
[17:02:21] C.oren poked andrewbogott... hopefully he can help out
[17:02:23] also
[17:02:24] mid-term we could probably have this run on a tmpfs (I have a similar setup for testing in a local VM and it works very well in ram)
[17:02:48] but, how as this changed so so so much in the past months?
[17:03:13] in the past day
[17:03:13] There was an openstack upgrade late last week; it's entirely possible that the newer version is being more diligent in limiting how much IO an single tenant is allowed to consume.
[17:03:13] So it'd have become noticable that you're hitting a per-instance limit when in the past you'd have gotten more of the raw host instead.
[17:03:22] worked fine day before
[17:03:27] wow
[17:03:30] * hoo not sure when it broke
[17:03:53] Mind you, that's an hypothesis derived from what I saw and not because I /konw/
[17:04:42] All I can tell for sure this second is that while the instance is hitting a limit, the host isn't out of resources.
[17:05:12] hmmm *goes for a dig around*
[17:05:53] doubt you'll be able to find anything... nothing much changed on our side
[17:05:56] but go ahead
[17:06:42] the other integration slaves aren't having this problem? just us?
[17:08:32] addshore: Are those in labs as well?
[17:08:47] Might be taht they live on another virt* server... and thus on another openstack version
[17:08:51] I presume they are, in all the CI stuff they are refered to as labsSlaves
[17:11:12] hoo aude Populating default interwiki table
[17:11:31] PROBLEM - Disk space on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:11:40] used to take 10 seconds, now takes 17 mins
[17:11:59] I doubt that's the bottleneck
[17:12:12] RECOVERY - Disk space on rhenium is OK: DISK OK
[17:12:12] well, thats got to indicate something
[17:12:21] IO slowness and sqlite?
[17:12:26] 7 days ago that took 10 seconds :p
[17:12:34] yeah -.-
[17:13:39] everything else looks liek it is taking the same ammount of time as before, including the rest of the mw install, and the running of tests
[17:14:10] but why is the iowait so high?
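
hoo's mid-term workaround, floated twice above, is to put the Jenkins workspace on a tmpfs so the sqlite-backed test installs never touch the throttled virtual disk. A rough sketch of that idea follows; the mount point, the 2 GB size and the jenkins-slave owner are illustrative guesses, not values from this log.

    # Rough sketch of the tmpfs idea: back the Jenkins workspace with RAM so
    # sqlite-heavy test installs skip the throttled virtual disk entirely.
    # The mount point, 2G size and jenkins-slave ownership are guesses for
    # illustration. Needs root; tmpfs contents are lost on reboot, which is
    # acceptable for throwaway CI workspaces.
    import os
    import subprocess

    WORKSPACE = "/mnt/jenkins-workspace"

    os.makedirs(WORKSPACE, exist_ok=True)
    subprocess.check_call(
        ["mount", "-t", "tmpfs", "-o", "size=2g,mode=0755", "tmpfs", WORKSPACE]
    )
    subprocess.check_call(["chown", "jenkins-slave:jenkins-slave", WORKSPACE])
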
[17:15:42] * hoo kicks github
[17:16:11] 80KiB/s really -.-
[17:34:06] very odd :D
[17:35:11] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0]
[17:37:21] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[17:39:21] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2
[17:41:21] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[17:42:20] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[17:46:12] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[17:48:21] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[17:49:01] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 10, down: 0, shutdown: 0
[17:49:31] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0
[17:50:41] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Puppet has 1 failures
[17:52:50] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: puppet fail
[17:53:41] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: puppet fail
[17:54:50] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: puppet fail
[17:56:41] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[17:57:31] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[17:59:40] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[18:01:00] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Puppet has 1 failures
[18:02:50] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 10, down: 0, shutdown: 0
[18:04:10] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[18:04:10] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 1 failures
[18:04:11] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures
[18:05:10] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[18:06:01] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures
[18:06:50] (CR) Ori.livneh: [C: 1] admin: grant qchris tin access (through deployers) [puppet] - https://gerrit.wikimedia.org/r/175315 (owner: John F. Lewis)
[18:07:10] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: puppet fail
[18:07:11] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[18:16:21] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[18:17:31] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[18:17:31] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[18:17:31] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[18:18:31] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[18:19:31] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[18:51:20] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[18:52:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0]
[18:55:20] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 10, down: 0, shutdown: 0
[19:09:12] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 36, down: 1, dormant: 0, excluded: 1, unused: 0BRge-0/0/0: down - Core: msw-oe12-esamsBR
[19:11:01] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[19:15:11] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:17:45] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:19:51] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.834 second response time
[19:23:10] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:24:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[19:25:02] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.154 second response time
[19:26:31] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:27:31] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:30:40] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 68769 bytes in 8.244 second response time
[19:33:45] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:40:00] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[19:40:03] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 68769 bytes in 1.389 second response time
[19:42:01] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[19:44:00] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[19:44:11] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:46:41] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:49:50] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.258 second response time
[19:53:50] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:54:51] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.936 second response time
[19:57:30] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 68768 bytes in 9.224 second response time
[20:00:41] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:07:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0]
[20:08:01] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 68770 bytes in 9.736 second response time
[20:10:00] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: puppet fail
[20:11:01] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:14:12] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 68769 bytes in 4.914 second response time
[20:17:20] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:21:41] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 68770 bytes in 7.851 second response time
[20:22:03] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:23:10] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.229 second response time
[20:24:20] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[20:25:50] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:26:37] hoo: addshore no all the virt hosts are upgraded
[20:28:11] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:28:41] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[20:32:20] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.580 second response time
[20:34:41] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[20:35:21] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:36:21] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.326 second response time
[20:43:41] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:44:20] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 68770 bytes in 7.558 second response time
[20:44:50] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.762 second response time
[20:47:30] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:51:51] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:52:53] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.741 second response time
[20:53:31] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 68769 bytes in 5.127 second response time
[20:57:40] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:02:41] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 68769 bytes in 4.621 second response time
[21:03:14] is mw1234 ill?
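
The flapping "Apache HTTP" and "HHVM rendering" alerts above come from Icinga HTTP checks that fetch a page from the appserver and give up after a 10-second socket timeout. Below is a rough stand-in for probing mw1234 by hand, not the actual check_http configuration: the URL path is a guess, and the bare hostname would need the host's internal FQDN and network access; only the 10-second timeout and the 200/301 expectations come from the log.

    # Rough stand-in for the flapping Icinga checks: fetch a page from the
    # appserver with a 10 s timeout. The path is a guess; only the timeout and
    # the 200/301 expectations are taken from the alerts in this log.
    import requests

    def probe(host, path="/wiki/Main_Page", timeout=10):
        url = "http://%s%s" % (host, path)
        try:
            resp = requests.get(url, timeout=timeout, allow_redirects=False)
            print("%s -> HTTP %s, %d bytes in %.3fs"
                  % (url, resp.status_code, len(resp.content), resp.elapsed.total_seconds()))
        except requests.RequestException as exc:
            print("%s -> failed: %s" % (url, exc))

    probe("mw1234")  # would need the internal FQDN in practice
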
[21:05:51] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:22:01] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[21:22:50] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[21:24:11] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 10, down: 0, shutdown: 0
[21:28:18] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 1 failures
[21:35:11] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:38:15] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.976 second response time
[21:41:21] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:41:41] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[21:42:30] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[21:46:30] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.683 second response time
[21:52:41] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:52:58] looks like mw1234 is a new server added on thursday/friday. _joe_?
[21:55:41] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.926 second response time
[21:58:50] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:03:46] Eloquence: I haven't been able to access my account on blog.wikimedia.org ever since the switch to Automattic, and I just got a request to change a picture in a blog post I used in a profile I published in May. Do you know what might have happened to my account or else do you know who can help me?
[22:15:10] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.540 second response time
[22:19:21] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:26:10] !log depooling mw1234; flapping.
[22:26:19] Logged the message, Master
[22:26:40] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.707 second response time
[22:27:11] ori, should I be contacting people when this stuff starts happening?
[22:27:21] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0]
[22:27:44] Krenair: no, we have monitoring for that. But the monitoring ought to be better.
[22:28:00] Well it clearly failed here.
[22:28:27] yes, I agree
[22:28:39] if you're on the ops list, I'd send an e-mail and ask
[22:28:49] I am
[22:28:49] ok
[22:28:55] thanks
[22:30:57] um, ori
[22:31:07] woops
[22:31:19] :)
[22:31:21] thanks :)
[22:32:41] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:33:43] odd
[22:42:00] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[22:45:36] PROBLEM - HHVM busy threads on mw1234 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [482.4]
[22:58:20] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.043 second response time
[22:58:51] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 68680 bytes in 1.188 second response time
[23:08:10] RECOVERY - HHVM busy threads on mw1234 is OK: OK: Less than 1.00% above the threshold [321.6]
[23:25:17] Why is labs/tools/wikibugs2 not being replicated to github?
[23:26:42] legoktm, ?
[23:27:02] uh
[23:27:12] did we even create a gerrit repo for it?
[23:27:24] https://git.wikimedia.org/log/labs%2Ftools%2Fwikibugs2
[23:27:34] o.O
[23:27:39] I don't know then
[23:27:52] umm
[23:27:53] https://github.com/wikimedia/labs-tools-wikibugs2
[23:27:56] looks fine to me?
[23:28:20] Ah.
[23:28:24] Yeah, I can't read/type then.
[23:28:40] I think I was looking for pywikibugs2. For some reason.
[23:28:58] Gerrit repo created 2014-11-10 by QChris: https://git.wikimedia.org/log/labs%2Ftools%2Fwikibugs2/refs%2Fmeta%2Fconfig
[23:30:08] Is something wrong with those repos?
[23:30:20] I thought there was, I was just stupid.
[23:30:37] legoktm was surprised there was a gerrit repo for it.
[23:30:48] I forgot I had requested it
[23:30:50] :P
[23:30:52] :-D
[23:30:55] Ok.
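
For legoktm's question about whether labs/tools/wikibugs2 is replicated to GitHub, one way to check without the web UIs is to compare branch tips on the two remotes. A minimal sketch, assuming anonymous read access to both; the Gerrit clone URL form is an assumption, while the GitHub mirror name is the one quoted above.

    # Minimal sketch: compare branch tips on Gerrit and GitHub to see whether
    # replication is keeping up. Assumes anonymous read access to both remotes;
    # the Gerrit clone URL form is an assumption, the GitHub mirror name is the
    # one quoted in the log.
    import subprocess

    def head_of(url, ref="refs/heads/master"):
        out = subprocess.check_output(["git", "ls-remote", url, ref]).decode()
        return out.split()[0] if out else None

    gerrit = head_of("https://gerrit.wikimedia.org/r/labs/tools/wikibugs2")
    github = head_of("https://github.com/wikimedia/labs-tools-wikibugs2")
    print("in sync" if gerrit == github else "out of sync: %s vs %s" % (gerrit, github))
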