[00:00:04] (CR) Rush: [C: 2] bugzilla: switch svc_name to old-bugzilla [puppet] - https://gerrit.wikimedia.org/r/175144 (owner: Dzahn)
[00:26:41] (PS1) Dzahn: bugzilla: hardcode SSL cert name for migration [puppet] - https://gerrit.wikimedia.org/r/175313
[00:27:17] (CR) Rush: [C: 1] bugzilla: hardcode SSL cert name for migration [puppet] - https://gerrit.wikimedia.org/r/175313 (owner: Dzahn)
[00:27:39] (CR) John F. Lewis: [C: 1] "Sane." [mediawiki-config] - https://gerrit.wikimedia.org/r/171219 (https://bugzilla.wikimedia.org/55737) (owner: Glaisher)
[00:28:04] (CR) John F. Lewis: [C: 1] "+1 with the deletion of the wiki." [puppet] - https://gerrit.wikimedia.org/r/170925 (owner: Glaisher)
[00:28:08] (PS2) Dzahn: bugzilla: disable cron jobs [puppet] - https://gerrit.wikimedia.org/r/175308
[00:28:27] (CR) Dzahn: [C: 2] bugzilla: disable cron jobs [puppet] - https://gerrit.wikimedia.org/r/175308 (owner: Dzahn)
[00:29:28] (PS2) Rush: bugzilla: hardcode SSL cert name for migration [puppet] - https://gerrit.wikimedia.org/r/175313 (owner: Dzahn)
[00:29:55] (CR) Dzahn: [C: 2] bugzilla: hardcode SSL cert name for migration [puppet] - https://gerrit.wikimedia.org/r/175313 (owner: Dzahn)
[00:51:21] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 306 seconds
[00:52:30] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[00:55:41] (PS1) John F. Lewis: admin: grant qchris tin access (through deployers) [puppet] - https://gerrit.wikimedia.org/r/175315
[00:56:02] (PS2) John F. Lewis: admin: grant qchris tin access (through deployers) [puppet] - https://gerrit.wikimedia.org/r/175315
[00:57:12] PROBLEM - HHVM rendering on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:57:20] PROBLEM - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:58:11] RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 68590 bytes in 0.513 second response time
[00:58:12] RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 68590 bytes in 0.735 second response time
[01:02:10] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: puppet fail
[01:04:51] PROBLEM - Disk space on analytics1021 is CRITICAL: DISK CRITICAL - free space: /run/shm 965 MB (3% inode=99%):
[01:21:21] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[02:12:52] !log l10nupdate Synchronized php-1.25wmf8/cache/l10n: (no message) (duration: 00m 01s)
[02:12:55] !log LocalisationUpdate completed (1.25wmf8) at 2014-11-23 02:12:55+00:00
[02:13:00] Logged the message, Master
[02:13:02] Logged the message, Master
[02:19:20] !log l10nupdate Synchronized php-1.25wmf9/cache/l10n: (no message) (duration: 00m 01s)
[02:19:23] !log LocalisationUpdate completed (1.25wmf9) at 2014-11-23 02:19:23+00:00
[02:19:24] Logged the message, Master
[02:19:27] Logged the message, Master
[03:13:11] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [500.0]
[03:14:11] PROBLEM - puppet last run on amssq38 is CRITICAL: CRITICAL: puppet fail
[03:14:11] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: puppet fail
[03:14:30] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: puppet fail
[03:14:42] PROBLEM - puppet last run on amssq42 is CRITICAL: CRITICAL: Puppet has 5 failures
[03:15:11] PROBLEM - puppet last run on amssq62 is CRITICAL: CRITICAL: Puppet has 2 failures
[03:23:11] PROBLEM - Packetloss_Average on erbium is CRITICAL: packet_loss_average CRITICAL: 12.1723110588
[03:27:11] RECOVERY - Packetloss_Average on erbium is OK: packet_loss_average OKAY: -0.129670595238
[03:27:40] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[03:30:11] RECOVERY - puppet last run on amssq42 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[03:31:51] RECOVERY - puppet last run on amssq62 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[03:32:11] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[03:33:00] PROBLEM - puppet last run on wtp1011 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:33:21] PROBLEM - puppet last run on mw1097 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:33:51] PROBLEM - puppet last run on ms-be1007 is CRITICAL: CRITICAL: Puppet has 2 failures
[03:34:01] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[03:35:02] RECOVERY - puppet last run on amssq38 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[03:35:48] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Nov 23 03:35:48 UTC 2014 (duration 35m 47s)
[03:35:54] Logged the message, Master
[03:41:31] PROBLEM - puppet last run on mw1019 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:47:31] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:48:16] Coren: around maybe?
[03:51:10] RECOVERY - puppet last run on ms-be1007 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[03:51:41] RECOVERY - puppet last run on mw1097 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[03:52:21] RECOVERY - puppet last run on wtp1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:53:23] it appears the job queue underwent cosmic inflation between 13:00 and 15:50 on the 14th
[03:55:34] andrewbogott_afk: Coren: virt1009 wants your love... our labs instances are unusable because of IO is saturation...
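
A quick way to quantify the saturation hoo reports here is to sample iowait and disk throughput from inside one of the affected instances. The sketch below is only illustrative: it uses psutil and a five-second window, whereas the people in this log were working from atop/top output.

    # Minimal sketch: sample CPU iowait and disk throughput from inside a labs
    # instance to confirm I/O saturation. Not what anyone in this log actually
    # ran (they used atop); psutil and the 5-second window are illustrative choices.
    import psutil

    def sample_io(interval=5):
        before = psutil.disk_io_counters()
        cpu = psutil.cpu_times_percent(interval=interval)  # blocks for `interval` seconds
        after = psutil.disk_io_counters()
        read_mb = (after.read_bytes - before.read_bytes) / 1e6
        write_mb = (after.write_bytes - before.write_bytes) / 1e6
        print("iowait: %.1f%%  idle: %.1f%%" % (cpu.iowait, cpu.idle))
        print("disk: %.1f MB read, %.1f MB written in %ds" % (read_mb, write_mb, interval))

    if __name__ == "__main__":
        sample_io()
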
[03:56:08] aude: ^
[03:56:13] hopes that's enough
[03:56:27] thanks
[03:58:50] RECOVERY - puppet last run on mw1019 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[04:05:52] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[04:46:35] jackmcbarn: if you discover such things, don't be shy about !logging them; the bot has no ACLs for a reason
[04:50:51] PROBLEM - puppet last run on db2037 is CRITICAL: CRITICAL: puppet fail
[05:09:31] RECOVERY - puppet last run on db2037 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[05:43:41] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0]
[05:44:00] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[05:44:54] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 10, down: 0, shutdown: 0
[05:59:11] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[06:29:31] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:42] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:21] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:01] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:11] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:41:20] PROBLEM - puppet last run on es1003 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:45:52] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:46:01] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:46:31] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:47:11] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:47:21] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:53:02] PROBLEM - puppet last run on db2003 is CRITICAL: CRITICAL: puppet fail
[06:58:50] RECOVERY - puppet last run on es1003 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[07:11:40] RECOVERY - puppet last run on db2003 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[12:41:41] RECOVERY - haproxy failover on dbproxy1002 is OK: OK check_failover servers up 2 down 0
[13:10:12] PROBLEM - puppet last run on amssq51 is CRITICAL: CRITICAL: puppet fail
[13:30:01] RECOVERY - puppet last run on amssq51 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[15:12:21] PROBLEM - puppet last run on ssl3002 is CRITICAL: CRITICAL: Puppet has 1 failures
[15:12:51] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: puppet fail
[15:22:00] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds
[15:30:41] RECOVERY - puppet last run on ssl3002 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[15:31:01] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures
[15:32:21] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[15:41:11] hoo: What's that about virt1009?
[15:41:26] Coren: Seems to be good again
[15:42:41] hoo: I got no page; whatever might have been wrong did not trigger an alert.
[15:43:26] mh... do you have a sysload threshold for getting paged?
[15:43:33] Cause I think that hit 250 (15 min)
[15:44:58] oh, well... looks like it's bad again
[15:45:59] Load isn't a useful metric - the box is 60% idle and below 5% iowait
[15:46:22] true, that... but iowait also probably isn't here
[15:47:41] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[15:48:55] hoo: I wouldn't be surprised if virt1009 was a little slugging - one of the disks is currently rebuilding its mirror - but there's a cap on how much io bandwidth that can take and I don't see anything unusual in the io atm.
[15:49:06] sluggish*
[15:51:34] Any idea why we could be seeing such poor io perf then?
[15:54:20] At first glance, there's nothing on the host that would cause it; but there may be an instance or project that's consuming a disproportionate /fraction/ of it. Where do you see symptoms atm?
[15:54:41] PROBLEM - puppetmaster https on virt1000 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error
[15:57:21] Coren: Our jenkins instances in labs where dead dog slow
[15:57:50] let me see if I can get more useful metrics
[15:58:29] yeah
[15:58:34] extremely high iowait
[15:58:43] so stuff just times out
[15:59:05] On /, or on NFS?
[15:59:12] also disk 100% busy according to atop (I do know that the busy time is flawed... but it can given an idea)
[15:59:40] pretty sure they run on /
[16:00:07] Can you point me at a suffering instance?
[16:00:38] all jobs are finished now... but I can put load on them again, if needed
[16:00:53] wikidata-jenkins[123]
[16:01:02] Lemme go look at the instance while it's not under load first.
[16:03:27] (CR) Tnegrin: [C: 1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/175315 (owner: John F. Lewis)
[16:03:36] Well, while unloaded there is nothing untoward. Do you think you can trigger the issue again?
[16:03:49] I sure can
[16:06:37] Coren: Load should be back
[16:08:54] Puppetmaster is dead. Got a large string of shinken warnings :)
[16:09:03] I don't think you're suffering because of the host - afaict, virt1009 still has elbow room. I see the actual instances hitting their ceiling pretty hard though.
[16:09:44] that's unexpected... because stuff was totally fine until yesterday or so
[16:09:47] * Coren ponders.
[16:10:45] There was an openstack upgrade late last week; it's entirely possible that the newer version is being more diligent in limiting how much IO an single tenant is allowed to consume.
[16:11:19] So it'd have become noticable that you're hitting a per-instance limit when in the past you'd have gotten more of the raw host instead.
[16:11:33] I'll need Andrew to make sure though - he's the one who did the upgrade.
[16:11:41] Ok
[16:11:48] anything we can do about this now-ish?
[16:12:38] I don't know. Lemme see if there are new tunables in nova.
[16:13:29] What's the project name?
[16:13:46] Wikidata-build
[16:21:07] I'm not finding anything that seems relevant in any way. :-(
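
Coren's hypothesis above is that the post-upgrade nova is throttling I/O per instance. In stock OpenStack such caps are normally attached to flavors as quota:disk_* extra specs and end up as libvirt blkdeviotune settings on each domain, so one way to test the hypothesis from the virt host is to dump those settings. A minimal sketch, assuming root on the virt host and that each guest's primary disk is exposed as "vda"; neither assumption is confirmed by the log.

    # Sketch: dump libvirt's per-domain disk I/O throttle settings on a virt host.
    # If the per-instance-limit hypothesis is right, non-zero *_bytes_sec or
    # *_iops_sec values should show up here. Assumes root and that each guest's
    # primary disk is "vda"; neither is confirmed by the log.
    import subprocess

    def running_domains():
        out = subprocess.check_output(["virsh", "list", "--name"]).decode()
        return [name for name in out.splitlines() if name.strip()]

    for domain in running_domains():
        # With no values given, blkdeviotune prints the current throttle settings.
        tune = subprocess.check_output(["virsh", "blkdeviotune", domain, "vda"]).decode()
        print(domain)
        print(tune)
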
[16:23:14] yikes
[16:23:52] mid-term we could probably have this run on a tmpfs (I have a similar setup for testing in a local VM and it works very well in ram)
[16:24:17] but short term we want to have some kind of jenkins... :S
[16:49:22] I'll tell Andrew about your issue; there may be a simple fix.
[16:49:44] thanks :)
[16:54:10] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[16:58:03] oh hi hoo
[17:00:18] what am I seeing hoo?
[17:00:37] Depends on what you look at, I guess :D
[17:00:55] *looks at his cup of tea*
[17:00:59] mhhhhhm, tea!
[17:01:02] But more seriously: Our jenkins instances are unusable because we use way to much IO
[17:01:16] ahhh, disk IO or network?
[17:01:21] so stuff just times out... all the time
[17:01:22] disk IO
[17:01:36] ouch *looks*
[17:02:08] Build timed out (after 30 minutes). etf
[17:02:10] *wtf
[17:02:21] C.oren poked andrewbogott... hopefully he can help out
[17:02:23] also
[17:02:24] mid-term we could probably have this run on a tmpfs (I have a similar setup for testing in a local VM and it works very well in ram)
[17:02:48] but, how as this changed so so so much in the past months?
[17:03:13] in the past day
[17:03:13] There was an openstack upgrade late last week; it's entirely possible that the newer version is being more diligent in limiting how much IO an single tenant is allowed to consume.
[17:03:13] So it'd have become noticable that you're hitting a per-instance limit when in the past you'd have gotten more of the raw host instead.
[17:03:22] worked fine day before
[17:03:27] wow
[17:03:30] * hoo not sure when it broke
[17:03:53] Mind you, that's an hypothesis derived from what I saw and not because I /konw/
[17:04:42] All I can tell for sure this second is that while the instance is hitting a limit, the host isn't out of resources.
[17:05:12] hmmm *goes for a dig around*
[17:05:53] doubt you'll be able to find anything... nothing much changed on our side
[17:05:56] but go ahead
[17:06:42] the other integration slaves aren't having this problem? just us?
[17:08:32] addshore: Are those in labs as well?
[17:08:47] Might be taht they live on another virt* server... and thus on another openstack version
[17:08:51] I presume they are, in all the CI stuff they are refered to as labsSlaves
[17:11:12] hoo aude Populating default interwiki table
[17:11:31] PROBLEM - Disk space on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:11:40] used to take 10 seconds, now takes 17 mins
[17:11:59] I doubt that's the bottleneck
[17:12:12] RECOVERY - Disk space on rhenium is OK: DISK OK
[17:12:12] well, thats got to indicate something
[17:12:21] IO slowness and sqlite?
[17:12:26] 7 days ago that took 10 seconds :p
[17:12:34] yeah -.-
[17:13:39] everything else looks liek it is taking the same ammount of time as before, including the rest of the mw install, and the running of tests
[17:14:10] but why is the iowait so high?
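
hoo's mid-term workaround, floated twice above, is to put the Jenkins workspace on a tmpfs so the sqlite-backed test installs never touch the throttled virtual disk. A rough sketch of that idea follows; the mount point, the 2 GB size and the jenkins-slave owner are illustrative guesses, not values from this log.

    # Rough sketch of the tmpfs idea: back the Jenkins workspace with RAM so
    # sqlite-heavy test installs skip the throttled virtual disk entirely.
    # The mount point, 2G size and jenkins-slave ownership are guesses for
    # illustration. Needs root; tmpfs contents are lost on reboot, which is
    # acceptable for throwaway CI workspaces.
    import os
    import subprocess

    WORKSPACE = "/mnt/jenkins-workspace"

    os.makedirs(WORKSPACE, exist_ok=True)
    subprocess.check_call(
        ["mount", "-t", "tmpfs", "-o", "size=2g,mode=0755", "tmpfs", WORKSPACE]
    )
    subprocess.check_call(["chown", "jenkins-slave:jenkins-slave", WORKSPACE])
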
[17:15:42] * hoo kicks github
[17:16:11] 80KiB/s really -.-
[17:34:06] very odd :D
[17:35:11] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0]
[17:37:21] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[17:39:21] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2
[17:41:21] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[17:42:20] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[17:46:12] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[17:48:21] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[17:49:01] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 10, down: 0, shutdown: 0
[17:49:31] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0
[17:50:41] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Puppet has 1 failures
[17:52:50] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: puppet fail
[17:53:41] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: puppet fail
[17:54:50] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: puppet fail
[17:56:41] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[17:57:31] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[17:59:40] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[18:01:00] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Puppet has 1 failures
[18:02:50] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 10, down: 0, shutdown: 0
[18:04:10] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[18:04:10] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 1 failures
[18:04:11] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures
[18:05:10] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[18:06:01] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures
[18:06:50] (CR) Ori.livneh: [C: 1] admin: grant qchris tin access (through deployers) [puppet] - https://gerrit.wikimedia.org/r/175315 (owner: John F. Lewis)
[18:07:10] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: puppet fail
[18:07:11] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[18:16:21] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[18:17:31] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[18:17:31] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[18:17:31] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[18:18:31] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[18:19:31] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[18:51:20] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[18:52:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0]
[18:55:20] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 10, down: 0, shutdown: 0
[19:09:12] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 36, down: 1, dormant: 0, excluded: 1, unused: 0BRge-0/0/0: down - Core: msw-oe12-esamsBR
[19:11:01] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[19:15:11] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:17:45] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:19:51] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.834 second response time
[19:23:10] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:24:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[19:25:02] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.154 second response time
[19:26:31] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:27:31] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:30:40] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 68769 bytes in 8.244 second response time
[19:33:45] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:40:00] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[19:40:03] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 68769 bytes in 1.389 second response time
[19:42:01] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[19:44:00] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[19:44:11] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:46:41] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:49:50] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.258 second response time
[19:53:50] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:54:51] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.936 second response time
[19:57:30] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 68768 bytes in 9.224 second response time
[20:00:41] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:07:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0]
[20:08:01] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 68770 bytes in 9.736 second response time
[20:10:00] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: puppet fail
[20:11:01] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:14:12] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 68769 bytes in 4.914 second response time
[20:17:20] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:21:41] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 68770 bytes in 7.851 second response time
[20:22:03] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:23:10] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.229 second response time
[20:24:20] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[20:25:50] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:26:37] hoo: addshore no all the virt hosts are upgraded
[20:28:11] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:28:41] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[20:32:20] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.580 second response time
[20:34:41] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[20:35:21] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:36:21] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.326 second response time
[20:43:41] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:44:20] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 68770 bytes in 7.558 second response time
[20:44:50] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.762 second response time
[20:47:30] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:51:51] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:52:53] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.741 second response time
[20:53:31] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 68769 bytes in 5.127 second response time
[20:57:40] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:02:41] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 68769 bytes in 4.621 second response time
[21:03:14] is mw1234 ill?
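
The flapping "Apache HTTP" and "HHVM rendering" alerts above come from Icinga HTTP checks that fetch a page from the appserver and give up after a 10-second socket timeout. Below is a rough stand-in for probing mw1234 by hand, not the actual check_http configuration: the URL path is a guess, and the bare hostname would need the host's internal FQDN and network access; only the 10-second timeout and the 200/301 expectations come from the log.

    # Rough stand-in for the flapping Icinga checks: fetch a page from the
    # appserver with a 10 s timeout. The path is a guess; only the timeout and
    # the 200/301 expectations are taken from the alerts in this log.
    import requests

    def probe(host, path="/wiki/Main_Page", timeout=10):
        url = "http://%s%s" % (host, path)
        try:
            resp = requests.get(url, timeout=timeout, allow_redirects=False)
            print("%s -> HTTP %s, %d bytes in %.3fs"
                  % (url, resp.status_code, len(resp.content), resp.elapsed.total_seconds()))
        except requests.RequestException as exc:
            print("%s -> failed: %s" % (url, exc))

    probe("mw1234")  # would need the internal FQDN in practice
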
[21:05:51] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:22:01] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[21:22:50] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[21:24:11] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 10, down: 0, shutdown: 0
[21:28:18] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 1 failures
[21:35:11] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:38:15] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.976 second response time
[21:41:21] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:41:41] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[21:42:30] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[21:46:30] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.683 second response time
[21:52:41] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:52:58] looks like mw1234 is a new server added on thursday/friday. _joe_?
[21:55:41] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.926 second response time
[21:58:50] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:03:46] Eloquence: I haven't been able to access my account on blog.wikimedia.org ever since the switch to Automattic, and I just got a request to change a picture in a blog post I used in a profile I published in May. Do you know what might have happened to my account or else do you know who can help me?
[22:15:10] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 8.540 second response time
[22:19:21] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:26:10] !log depooling mw1234; flapping.
[22:26:19] Logged the message, Master
[22:26:40] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.707 second response time
[22:27:11] ori, should I be contacting people when this stuff starts happening?
[22:27:21] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0]
[22:27:44] Krenair: no, we have monitoring for that. But the monitoring ought to be better.
[22:28:00] Well it clearly failed here.
[22:28:27] yes, I agree
[22:28:39] if you're on the ops list, I'd send an e-mail and ask
[22:28:49] I am
[22:28:49] ok
[22:28:55] thanks
[22:30:57] um, ori
[22:31:07] woops
[22:31:19] :)
[22:31:21] thanks :)
[22:32:41] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:33:43] odd
[22:42:00] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[22:45:36] PROBLEM - HHVM busy threads on mw1234 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [482.4]
[22:58:20] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.043 second response time
[22:58:51] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 68680 bytes in 1.188 second response time
[23:08:10] RECOVERY - HHVM busy threads on mw1234 is OK: OK: Less than 1.00% above the threshold [321.6]
[23:25:17] Why is labs/tools/wikibugs2 not being replicated to github?
[23:26:42] legoktm, ?
[23:27:02] uh
[23:27:12] did we even create a gerrit repo for it?
[23:27:24] https://git.wikimedia.org/log/labs%2Ftools%2Fwikibugs2
[23:27:34] o.O
[23:27:39] I don't know then
[23:27:52] umm
[23:27:53] https://github.com/wikimedia/labs-tools-wikibugs2
[23:27:56] looks fine to me?
[23:28:20] Ah.
[23:28:24] Yeah, I can't read/type then.
[23:28:40] I think I was looking for pywikibugs2. For some reason.
[23:28:58] Gerrit repo created 2014-11-10 by QChris: https://git.wikimedia.org/log/labs%2Ftools%2Fwikibugs2/refs%2Fmeta%2Fconfig
[23:30:08] Is something wrong with those repos?
[23:30:20] I thought there was, I was just stupid.
[23:30:37] legoktm was surprised there was a gerrit repo for it.
[23:30:48] I forgot I had requested it
[23:30:50] :P
[23:30:52] :-D
[23:30:55] Ok.
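
For legoktm's question about whether labs/tools/wikibugs2 is replicated to GitHub, one way to check without the web UIs is to compare branch tips on the two remotes. A minimal sketch, assuming anonymous read access to both; the Gerrit clone URL form is an assumption, while the GitHub mirror name is the one quoted above.

    # Minimal sketch: compare branch tips on Gerrit and GitHub to see whether
    # replication is keeping up. Assumes anonymous read access to both remotes;
    # the Gerrit clone URL form is an assumption, the GitHub mirror name is the
    # one quoted in the log.
    import subprocess

    def head_of(url, ref="refs/heads/master"):
        out = subprocess.check_output(["git", "ls-remote", url, ref]).decode()
        return out.split()[0] if out else None

    gerrit = head_of("https://gerrit.wikimedia.org/r/labs/tools/wikibugs2")
    github = head_of("https://github.com/wikimedia/labs-tools-wikibugs2")
    print("in sync" if gerrit == github else "out of sync: %s vs %s" % (gerrit, github))
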