[00:04:50] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server
[00:05:14] Change abandoned: MaxSem; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57774
[00:06:10] !log kaldari Finished syncing Wikimedia installation... :
[00:06:19] Logged the message, Master
[00:15:40] New patchset: Faidon; "New IPs for Varnish ACL for Dialog Sri Lanka WAP source IPs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66166
[00:16:28] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66148
[00:17:03] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66166
[00:21:09] RECOVERY - Puppet freshness on db45 is OK: puppet ran at Fri May 31 00:20:58 UTC 2013
[00:38:29] PROBLEM - SSH on mc15 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:39:19] RECOVERY - SSH on mc15 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[01:03:50] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset 0.0009056329727 secs
[01:21:17] New patchset: Hazard-SJ; "(bug 49001) Restrict editing the Query namespace" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66192
[01:23:41] New patchset: Hazard-SJ; "(bug 49001) Restrict editing the Query namespace on Wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66192
[01:31:37] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset 0.001476287842 secs
[02:02:27] New review: Tim Starling; "Why would query-update be needed to discuss changes to queries? Is the plan to change this configura..." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/66192
[02:06:06] !log LocalisationUpdate completed (1.22wmf5) at Fri May 31 02:06:06 UTC 2013
[02:06:16] Logged the message, Master
[02:10:09] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours
[02:10:09] PROBLEM - Puppet freshness on cp1038 is CRITICAL: No successful Puppet run in the last 10 hours
[02:10:09] PROBLEM - Puppet freshness on cp1039 is CRITICAL: No successful Puppet run in the last 10 hours
[02:11:03] !log LocalisationUpdate completed (1.22wmf4) at Fri May 31 02:11:03 UTC 2013
[02:11:09] PROBLEM - Puppet freshness on cp1037 is CRITICAL: No successful Puppet run in the last 10 hours
[02:11:10] PROBLEM - Puppet freshness on cp1040 is CRITICAL: No successful Puppet run in the last 10 hours
[02:11:11] Logged the message, Master
[02:31:40] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri May 31 02:31:40 UTC 2013
[02:31:49] Logged the message, Master
[02:42:48] PROBLEM - HTTP radosgw on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:44:06] New review: Hazard-SJ; "This change is intended to replace https://www.wikidata.org/wiki/Special:AbuseFilter/9 for now (ther..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66192
[02:44:57] PROBLEM - HTTP radosgw on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:44:57] PROBLEM - HTTP radosgw on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:45:36] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out
[02:45:36] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out
[02:45:52] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out
[02:46:11] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out
[02:46:11] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out
[02:46:11] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out
[02:46:11] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out
[02:46:51] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out
[02:47:11] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:48:02] PROBLEM - HTTP Apache on ms-fe1001 is CRITICAL: Connection timed out
[02:56:12] New patchset: Hazard-SJ; "(bug 49001) Restrict editing the Query namespace on Wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66192
[02:57:01] TimStarling: ^
[02:57:10] took out restriction of the talk namespace
[02:58:27] Jasper_Deng: omg so many reviewers?
[02:58:37] and somehow you missed Reedy :P
[03:04:01] legoktm: hey, that's not my review
[03:04:08] not even my patch
[03:04:12] lol
[03:04:14] its your bug
[03:14:39] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection timed out
[03:23:27] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 1.921 second response time
[03:26:37] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection timed out
[03:28:47] PROBLEM - HTTP radosgw on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:31:28] PROBLEM - HTTP Apache on ms-fe1002 is CRITICAL: Connection timed out
[03:34:02] gerrit seems sloooooooow
[03:39:25] It's not just me?
[03:39:32] no
[03:39:35] I get "Working..." a lot.
[03:39:49] Hrm.
[03:39:56] And the production sites seem to be having issues.
[03:41:15] apergos: Are you around? The image scalers seem to be acting up.
[03:41:55] https://commons.wikimedia.org/wiki/Special:NewFiles
[03:44:01] TimStarling?
[03:44:37] https://upload.wikimedia.org/wikipedia/commons/thumb/a/a2/Igor_Zaripov.jpg/220px-Igor_Zaripov.jpg
[03:44:42] >
[03:44:43] 503 Service Unavailable
[03:44:43] The server is currently unavailable. Please try again at a later time.
[03:44:46] There was a problem while contacting the image scaler: [Errno 110] ETIMEDOUT
[03:44:50] >
[03:45:13] unlikely that gerrit is related, it is regularly slow for its own reasons
[03:45:27] Right. I think the image scalers may be breaking Score too.
[03:45:39] search spike
[03:45:53] Score (reported in Bugzilla) + thumbnails being broken (reported in -tech) seem to be related. Gerrit seems unrelated.
[03:46:04] Speaking of search... Ram left?
[03:46:14] I thought he was going to be the search guy for a while.
[03:47:24] ms-fe1002 is not in ganglia?
[03:47:37] should be
[03:47:45] oh right
[03:47:48] http://ganglia.wikimedia.org/latest/?c=Ceph%20eqiad&m=cpu_report&r=custom&s=by%20name&hc=4&mc=2&cs=fe1001
[03:47:58] just the whole ceph cluster is off the net
[03:48:10] I guess that breaks the ganglia host search feature
[03:48:28] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=Ceph+eqiad looks wonk
[03:48:45] Right, ceph was the Score error. Okay.
[03:50:40] Susan: gerrit is always slow, its just been wrose the last day or so
[03:51:01] Yeah, I've noticed.
[03:51:06] I thought it might just be me.
[03:51:11] 7s times for lots of stuff
[03:51:14] (gerrit)
[03:51:26] Yeah, I got an internal server error earlier today as well.
[03:51:29] The black screen of death.
[03:51:38] Usually I just get "Working...".
[03:51:45] It's like code review on dial-up.
[03:52:22] https://bugzilla.wikimedia.org/show_bug.cgi?id=49004 is the ceph/image scaler bug, BTW.
[03:52:31] The ceph docs seem to suggest it's just being experimented with.
[03:52:43] it's configured as a multiwrite slave
[03:54:34] so, should we just fail over now?
[03:54:47] unless paravoid or apergos are floating around
[03:55:23] fail over = just use ceph?
[03:55:26] I mean swift?
[03:55:28] yes
[03:55:32] what are the consequences?
[03:55:59] can it be resynced?
[03:56:25] will it make it more difficult to get it back online later?
[03:56:28] there are scripts that can handle the resyncing needs quickly enough
[03:56:45] ok, do that then
[03:57:11] also for the occasional ops that failed on swift it might cause some strangness
[03:57:31] like if a file failed to save to swift and now we switched to it
[03:58:07] better than the strangeness now ;)
[03:58:41] it means we'll have some more time to analyse the issue on ceph
[03:58:44] New patchset: Aaron Schulz; "Disabled ceph backends and switched to just swift." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66196
[03:59:06] assuming it stays broken after the traffic goes away
[03:59:28] New patchset: Aaron Schulz; "Disabled ceph backends and switched to just swift." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66196
[04:00:33] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66196
[04:01:50] !log aaron synchronized wmf-config/filebackend.php 'Disabled ceph backends and switched to just swift.'
[04:01:57] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 4.248 second response time
[04:01:59] Logged the message, Master
[04:02:07] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.056 second response time
[04:02:07] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.057 second response time
[04:02:07] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time
[04:02:07] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.064 second response time
[04:02:17] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.045 second response time
[04:02:17] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 62927 bytes in 0.234 second response time
[04:02:18] https://commons.wikimedia.org/wiki/Special:NewFiles looks OK now
[04:02:41] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.056 second response time
[04:02:47] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.065 second response time
[04:03:01] TimStarling: I think I'm going to head out for a while
[04:03:07] ok
[04:03:21] is anything else likely to be broken by ceph being down?
[04:05:55] I don't think so
[04:06:37] ok, see you later
[04:06:48] I think ms-fe1001 did fix itself, so that's a bit unfortunate
[04:07:35] it's accepting connections now, whereas previously it just timed out
[04:17:31] !log on nickel: restarting gmetad to see if that fixes ceph cluster reporting
[04:17:40] Logged the message, Master
[04:17:44] Warning: we failed to resolve data source name ms-fe1001.eqiad.wmnet,
[04:18:41] well, that explains that part of it
[04:24:49] μοrning
[04:24:57] what's going on?
[04:25:33] ceph had issues and was swapped out for swift.
[04:25:59] New patchset: Tim Starling; "Fix gmetad source list typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66206
[04:26:57] I'm not sure what happened to it
[04:27:51] I have this:
[04:27:53] [0353][tstarling@fenari:/home/wikipedia/common/wmf-config]$ telnet ms-fe1001 80
[04:27:53] Trying 10.64.0.167...
[04:27:53] ^C
[04:27:53] [0354][tstarling@fenari:/home/wikipedia/common/wmf-config]$ telnet ms-fe1002 80
[04:27:53] Trying 10.64.0.168...
[04:27:54] ^C
[04:28:00] i.e. I gave up waiting on both
[04:28:10] but ping worked
[04:28:59] so I suppose that could have been a listen queue overflow
[04:29:34] actually I'm not sure what would cause it
[04:29:35] " When syncookies are enabled there is no logical maximum length and this setting is ignored."
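(Editor's note: the "failed to resolve data source name" warning above comes from ganglia's gmetad, whose config lists each cluster as a `data_source` line — the thing Tim's "Fix gmetad source list typo" patch touches. A minimal sketch of that directive, with the host list here being illustrative rather than copied from the actual patch:

```
# gmetad.conf -- one data_source line per cluster:
#   data_source "cluster name" [polling interval] host1[:port] host2[:port] ...
# gmetad must be able to resolve every listed host, or it logs the
# "failed to resolve data source name ..." warning seen above.
data_source "Ceph eqiad" ms-fe1001.eqiad.wmnet ms-fe1002.eqiad.wmnet
```

gmetad polls the listed hosts in order and uses the first one that answers, so a single typo'd or unresolvable hostname at the front of the list can blank out a whole cluster's graphs.)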
[04:32:23] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66206
[04:34:00] ganglia still has a flat line
[04:35:29] despite ms-fe1002 responding on port 8649
[04:36:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:37:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[04:42:46] bah, ganglia doesn't like any of the ceph hosts for some reason
[04:42:49] grrrr
[04:49:33] yeah, like I said
[04:49:35] I'm working on it
[04:57:18] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server
[05:01:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:02:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.142 second response time
[05:02:36] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server
[05:08:12] maybe the problem is the apache_status module
[05:10:35] RECOVERY - HTTP Apache on ms-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 229 bytes in 0.012 second response time
[05:11:45] whatever you poked, it's giving us data at least
[05:21:00] yeah, I poked first and puppetized second
[05:21:15] New patchset: Tim Starling; "Disable the gmond apache_status module on ceph hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66225
[05:21:37] sorry for the reduced gerrit notification spam, I know ops people love botspam
[05:22:55] I'm sure someone can produce some later if we are feeling deprived
[05:23:09] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66225
[05:25:46] !log on msfe1001-1004: disabled apache_status ganglia module
[05:25:55] Logged the message, Master
[05:26:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:29:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time
[05:32:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:33:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time
[05:36:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:41:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[05:41:58] PROBLEM - Apache HTTP on mw1017 is CRITICAL: Connection refused
[05:42:58] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.062 second response time
[05:52:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:53:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[05:57:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:58:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time
[06:12:26] RECOVERY - HTTP Apache on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 229 bytes in 0.008 second response time
[06:12:54] PROBLEM - HTTP Apache on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:13:44] PROBLEM - HTTP Apache on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:27:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:28:14] PROBLEM - Disk space on mc15 is CRITICAL: Timeout while attempting connection
[06:29:04] RECOVERY - Disk space on mc15 is OK: DISK OK
[06:29:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time
[06:33:34] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.00258243084 secs
[06:38:50] PROBLEM - HTTP Apache on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:42:13] !log Running a script on terbium to list out files that were updated (overwritten) but are not synced on all wikis (it found none on commons/enwiki)
[06:42:21] Logged the message, Master
[06:56:31] !log ceph osds 50 and 132 on ms-fe1002 are logging 'slow requests', don't know how to restart specific osds in bobtail though
[06:56:39] Logged the message, Master
[07:01:25] New patchset: Pyoungmeister; "removing myself from icinga for vacation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66232
[07:06:02] New patchset: Pyoungmeister; "removing myself from icinga for vacation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66232
[07:26:41] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:40:53] PROBLEM - HTTP Apache on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:02:13] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.002958297729 secs
[08:15:38] New patchset: Mark Bergsma; "Awful hacks to make Puppet work are nothing new..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66124
[08:19:44] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66232
[08:24:41] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66124
[08:35:05] New patchset: Mark Bergsma; "Replace '/' separator with a space" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66236
[08:38:01] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66236
[08:40:32] New patchset: Mark Bergsma; "Replace the entire string" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66237
[08:43:04] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66237
[08:43:05] jenkins is still sleeping
[08:44:53] ok, so am I apparently
[08:45:37] New patchset: Mark Bergsma; "Using a correct regexp with match groups does help" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66238
[08:47:18] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66238
[08:51:07] New patchset: Mark Bergsma; "Add all mobile UAs currently listed in the Squid configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65823
[08:52:16] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65823
[08:57:23] hey mark, the analytics team is using dclass for device detection in kraken; it uses a decision tree written in C and uses the OpenDDR device library. There is a also a Varnish vmod module, see https://github.com/TheWeatherChannel/dClass/tree/master/servers/varnish
[08:57:43] yeah I know
[08:58:15] any plans to start using it?
[08:58:22] not at this time, perhaps in the future
[08:58:30] right now I'm just focusing on making varnish behave like squid :)
[08:58:38] understood
[09:00:53] RECOVERY - RAID on es1001 is OK: OK: State is Optimal, checked 2 logical device(s)
[09:04:39] but if you see the plan on zero on wikitech-l, we might need it earlier for that ;)
[09:32:42] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100%
[09:34:22] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[09:50:07] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:07] PROBLEM - Puppet freshness on db1032 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:07] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:07] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:07] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:07] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:07] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:08] PROBLEM - Puppet freshness on mw1171 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:08] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:09] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:09] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:10] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:10] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[10:21:27] goddamit
[10:21:33] didn't hear the page
[10:21:59] what the hell
[10:26:20] RECOVERY - HTTP Apache on ms-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 229 bytes in 0.005 second response time
[10:26:21] RECOVERY - HTTP Apache on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 229 bytes in 4.732 second response time
[10:26:41] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 1.426 second response time
[10:26:45] RECOVERY - HTTP radosgw on ms-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.768 second response time
[10:27:10] RECOVERY - HTTP radosgw on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.005 second response time
[10:28:50] RECOVERY - HTTP radosgw on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 4.721 second response time
[10:29:32] RECOVERY - HTTP Apache on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 229 bytes in 0.001 second response time
[10:29:42] RECOVERY - HTTP Apache on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 229 bytes in 0.001 second response time
[10:29:42] RECOVERY - HTTP radosgw on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.004 second response time
[10:44:47] PROBLEM - SSH on mc15 is CRITICAL: Connection timed out
[10:45:47] RECOVERY - SSH on mc15 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[10:53:45] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/65706
[11:07:50] New patchset: Ottomata; "Adding alerts for webrequest data loss in HDFS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66241
[11:08:48] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66241
[11:40:27] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100%
[11:41:26] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[12:09:56] PROBLEM - Disk space on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:10:46] RECOVERY - Disk space on mc15 is OK: DISK OK
[12:11:06] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours
[12:11:06] PROBLEM - Puppet freshness on cp1038 is CRITICAL: No successful Puppet run in the last 10 hours
[12:11:06] PROBLEM - Puppet freshness on cp1039 is CRITICAL: No successful Puppet run in the last 10 hours
[12:12:06] PROBLEM - Puppet freshness on cp1037 is CRITICAL: No successful Puppet run in the last 10 hours
[12:12:06] PROBLEM - Puppet freshness on cp1040 is CRITICAL: No successful Puppet run in the last 10 hours
[12:16:26] RECOVERY - Puppet freshness on amssq47 is OK: puppet ran at Fri May 31 12:16:23 UTC 2013
[12:16:56] RECOVERY - Puppet freshness on cp1039 is OK: puppet ran at Fri May 31 12:16:45 UTC 2013
[12:17:36] RECOVERY - Puppet freshness on cp1040 is OK: puppet ran at Fri May 31 12:17:29 UTC 2013
[12:18:36] RECOVERY - Puppet freshness on cp1037 is OK: puppet ran at Fri May 31 12:18:31 UTC 2013
[12:18:46] RECOVERY - Puppet freshness on cp1038 is OK: puppet ran at Fri May 31 12:18:43 UTC 2013
[13:26:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:27:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time
[13:32:52] New patchset: Faidon; "Ceph: move monitors to separate rows" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66247
[13:34:48] no jenkins?
[13:35:04] no
[13:35:29] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66247
[13:35:39] New patchset: Faidon; "Revert "Disable the gmond apache_status module on ceph hosts"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66248
[13:36:04] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66248
[13:47:45] New patchset: Faidon; "Ceph: use role::ceph::mon on the new mons" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66250
[13:48:48] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66250
[14:02:13] where is jenkins?
[14:21:12] Jenkins called in sick today
[14:22:07] Lazy f*k doesn't even wfh.
[14:27:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:28:27] !log reedy synchronized php-1.22wmf5/extensions/SecurePoll/
[14:28:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[14:28:36] Logged the message, Master
[14:29:30] !log reedy synchronized php-1.22wmf4/extensions/SecurePoll/
[14:29:38] Logged the message, Master
[14:33:27] !log Created bv2013_edit tables on all wikis
[14:33:37] Logged the message, Master
[14:40:33] !log reedy synchronized php-1.22wmf5/extensions/SecurePoll/
[14:40:42] Logged the message, Master
[14:41:34] !log reedy synchronized php-1.22wmf4/extensions/SecurePoll/
[14:41:42] Logged the message, Master
[14:46:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:47:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.192 second response time
[14:55:14] New patchset: Petrb; "improved a help of sql command a bit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66262
[14:56:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:58:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[15:09:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[15:12:38] Change merged: Ottomata; [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/65267
[15:12:48] New review: Ottomata; "Woot! Thanks Alex!" [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/65267
[15:15:24] New patchset: Ottomata; "Updating modules/cdh4 to latest ecosystem commit." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66263
[15:16:13] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66263
[15:18:56] heya paravoid, you there?
[15:19:19] the stuff that I need to puppetize and reinstall the kraken hadoop nodes is finally in ops/puppet, woo!
[15:19:20] q for you:
[15:19:26] kraken role class…or kraken module?
[15:22:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:23:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[15:25:20] so, yay ceph?
[15:26:00] New patchset: Petrb; "improved a help of sql command a bit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66262
[15:34:22] New patchset: Petrb; "inserted ksh and mysql client to exec nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66264
[15:40:01] New review: coren; "mysql-client (a) conflicts with mariadb-common (and libmariadbclient) and should be (b) mariadb-clie..." [operations/puppet] (production) C: -2; - https://gerrit.wikimedia.org/r/66264
[15:40:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:41:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[15:52:07] New patchset: Petrb; "inserted sql tool to execnodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66266
[15:52:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:53:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time
[15:54:29] New patchset: Petrb; "subversion to all exec nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66273
[16:22:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:23:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.154 second response time
[16:24:44] New review: Aude; "this is better than the abuse filter rule. We are not using (or have enabled, afaik) the query name..." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/66192
[16:28:44] paravoid, if you are around i'd love a puppet brain bounce
[16:29:25] or even ori-l, you there?
[16:31:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:33:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[16:52:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:53:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[17:02:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:03:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[17:11:08] ottomata: hey
[17:12:20] hey, in short standup meeting
[17:12:35] but, anyway, i'm prepping for puppetization of hadoop nodes using cdh4
[17:12:38] so.
[17:12:42] the module is non-wmf specific
[17:12:51] i'm trying to figure out where to put wmf-specifc usage of the module
[17:13:03] first I will use it to puppetize labs
[17:14:55] kraken module?
[17:14:58] not sure.
[17:15:07] mabye just roles/kraken.pp with a buncha classes in there?
[17:15:14] i DO need to add a couple of config files
[17:15:20] that are not part of the cdh4 module
[17:15:26] roles/kraken.pp sounds right
[17:15:36] and for files? templates/kraken/...?
[17:15:51] doesn't seem like what a role class should be doing, you know?
[17:17:14] why not? my pattern is typically this:
[17:17:38] module classes take parameters and therefore need to be declared using 'myclass { param => value }'
[17:17:59] and these declarations should be inside a role class, that does not take parameters
[17:18:09] (i.e. it should be include-able)
[17:18:15] aye
[17:18:27] hmm, ok will continue with that then
[17:18:55] so you end up with a hierarchy of: module -> configureable software platform, role -> configuration of that software platform
[17:19:54] both ::labs:: and ::production:: classes in roles/kraken.pp?
[17:21:21] if the setup is so different that including them in the same file will make it an unreadable profusion of if realm?s, then roles/kraken.pp and roles/kraken-labs.pp. if you can box the differences into a compact if/else, same file
[17:22:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:23:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[17:23:19] New patchset: Sanja pavlovic; "Per bug #48012. Patch for worker.py. It checks for external programs existence in the initialization part." [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/64095
[17:26:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:28:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time
[17:31:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:33:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[17:51:19] Yay Ceva is picking up the servers -- RobH
[17:52:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:53:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[18:02:24] paravoid: hi
[18:02:40] paravoid: can I ask for some tips on why gbp is ignoring my debian/gbp.conf please ?
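(Editor's note: the module/role hierarchy ori-l describes above — parameterized module classes declared from parameterless, include-able role classes, with realm differences boxed into a compact if/else — can be sketched in Puppet. All class, parameter, and host names below are illustrative, not the actual cdh4 module's API:

```puppet
# modules/cdh4/manifests/hadoop.pp -- the non-WMF-specific module:
# takes parameters, so it must be declared with values.
class cdh4::hadoop($namenode_host, $datanode_mounts = []) {
  # ... install packages, render config templates from the parameters ...
}

# manifests/role/kraken.pp -- the WMF-specific role: no parameters,
# so nodes can simply 'include role::kraken::hadoop'.
class role::kraken::hadoop {
  # realm differences boxed into a compact if/else, per the discussion
  if $::realm == 'labs' {
    $namenode = 'namenode.example.wmflabs'      # placeholder hostname
  } else {
    $namenode = 'namenode.example.eqiad.wmnet'  # placeholder hostname
  }
  class { 'cdh4::hadoop':
    namenode_host => $namenode,
  }
}
```

The point of the split is that the module stays reusable by anyone, while everything site-specific — hostnames, mount layouts, extra config files — lives in the role class.)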
[18:02:43] it seems to be the case [18:03:02] like if I feed it the params manually --debian-branch and stuff, it seems to take them into consideration [18:03:08] but the gbp.conf is left out [18:10:28] RECOVERY - Host virt1000 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [18:10:56] RECOVERY - Host labs-ns1.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [18:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [18:37:41] if someone is around who can do something, it would be lovely to have jenkins and gerrit irc bot back [18:40:58] aude: they disappeared? [18:41:08] I restarted ircecho on maganese [18:41:36] yay! [18:41:39] i see the bot now [18:41:52] I don't see ircecho for jenkins, but I'd imagine it's reporting through gerrit [18:42:01] jenkins seems to have died [18:42:08] not reviewing my code since yesterday [18:42:16] not reviewing anyone's code [18:42:34] It says Queue lengths: 334 events, 0 results. [18:42:37] (https://integration.wikimedia.org/zuul/) [18:42:53] need hashar [18:43:26] Also looks like someone's doing security stuff in gerrit again: https://integration.wikimedia.org/ci/job/mediawiki-core-whitespaces/4589/ [18:43:42] change is a draft, but jenkins was running tests for it [18:44:58] strange that it seems to be stuck there though [18:47:37] ok [18:50:40] Ah fun... interesting info leakage. [18:51:53] (in this case I know the person who's working on it, and it's not really secret. 
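[editor's note] For reference, a minimal debian/gbp.conf of the kind being debugged above might look like this. Two common reasons such a file appears to be ignored (assumptions here, not a diagnosis of this case): the file exists only on a branch other than the one currently checked out, or an option sits under the wrong section header, since gbp reads per-command options from the matching section:

```ini
# Hypothetical debian/gbp.conf; branch names are examples only.
[DEFAULT]
debian-branch = master
upstream-branch = upstream
pristine-tar = True

# Options specific to gbp buildpackage go in their own section;
# placing them under the wrong header makes them silently ignored.
[buildpackage]
export-dir = ../build-area/
```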
But we should close that) [18:52:20] It shows the full commit message as well as the user who uploaded it [18:52:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:52:36] Kinda pointless because once you know what repo a forbidden changeid is in, you can download it through git [18:53:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [19:06:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:07:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [19:13:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:14:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [19:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:50] !log aaron synchronized php-1.22wmf4/maintenance/copyFileBackend.php '3d0e3f8e4d09d4a2462f583c25dd284e6ee3e466' [19:22:59] Logged the message, Master [19:23:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.142 second response time [19:23:57] !log aaron synchronized php-1.22wmf5/maintenance/copyFileBackend.php '82352634594d0a67d3b9f6597c3bb2c0911f8028' [19:24:06] Logged the message, Master [19:46:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [19:51:02] PROBLEM - Puppet freshness on db1032 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:02] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the 
last 10 hours [19:51:02] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:02] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:02] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:02] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:02] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:03] PROBLEM - Puppet freshness on mw1171 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:03] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:04] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:04] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:05] PROBLEM - Puppet freshness on virt1000 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:05] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:06] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [19:53:04] !log Running a terbium script to copy missing files to swift as well as ceph [19:53:12] Logged the message, Master [19:55:06] AaronSchulz: do we know what caused the ceph meltdown last night yet? 
[19:56:35] you'd have to wait for paravoid [19:56:53] New patchset: Petrb; "improved a help of sql command a bit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66262 [19:57:05] New review: Yurik; "lots of minor style fixes are needed :)" [operations/dumps] (ariel) C: -1; - https://gerrit.wikimedia.org/r/64095 [19:57:06] Coren can you merge this it is rather important - it fixes a bug :/ [19:57:08] gotcha [19:57:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:58:31] New review: coren; "LGM" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/66262 [19:59:04] petan: Once Jenkins gets to it, I'll merge. [19:59:09] ty [19:59:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [19:59:27] I think jenkins is broken because that patch is there for several hours [20:00:06] Coren it is syntactially correct I can run it myself [20:00:17] I dont know what is wrong with jenkins :/ [20:00:50] New review: coren; "+manual Verified -- Jenkins is ill." [operations/puppet] (production); V: 2 - https://gerrit.wikimedia.org/r/66262 [20:00:54] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66262 [20:10:57] petan: yeh jenkins is broken for me too [20:11:13] * aude summons hashar [20:11:26] it's not running gate submits on +2s [20:11:37] not running at all [20:11:57] aude: Can you afford the needed sacrifices for such a summoning spell? 
:-) [20:12:10] heh [20:14:51] RECOVERY - Puppet freshness on virt1000 is OK: puppet ran at Fri May 31 20:14:42 UTC 2013 [20:28:31] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100% [20:29:22] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [20:32:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:33:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [20:38:43] New patchset: coren; "Tool Labs: enable identd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66296 [20:41:14] anyone know what's up with gerrit? [20:44:23] binasher: It's broken. [20:44:24] :-) [20:44:45] let's not jump to conclusions here [20:44:55] it's just a flesh wound [20:48:40] * yurik begins a contest of throwing rotten tomatoes at gerrit... i heard it would make it move faster... [20:49:18] !log restarted gerrit [20:49:26] Logged the message, Master [20:50:41] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65604 [20:50:50] binasher: we should have an auto-restart cron...didn't we have that with swift for a while? :D [20:51:08] yup :) [20:51:21] every root's favorite tool [20:58:00] AaronSchulz: do you have an opinion as to how the twemproxy config should be deployed? i'm thinking of either 1) leaving in wmf-config and adding "pkill -1 nutcracker" to scap / sync-file scripts or 2) doing something like the squid config, where they have their own deploy script that utilizes the puppet volatile repo [21:01:31] meeting :) [21:01:37] <^demon> manybubbles: Sorry I've been out sick today, but I did want to drop in and say hi. We've spoken on the phone, but not IRC yet :) [21:01:47] * ^demon is Chad [21:01:53] Ryan_Lane: No reverse for Openstack domain names? 
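[editor's note] Option 1 from the twemproxy question above (keep the config in wmf-config and have the sync tooling signal the daemon) could be sketched as below. The sync command and paths are stand-ins, not the real scap internals, and whether SIGHUP actually reloads a twemproxy config (rather than just reopening logs) depends on the build; the `pkill -1 nutcracker` step is reproduced here as proposed on-channel:

```shell
# Hedged sketch of "leave config in wmf-config, add pkill -1 to scap".
sync_twemproxy_config() {
    src="$1"; dst="$2"
    cp "$src" "$dst"            # stand-in for sync-file
    # Signal every nutcracker process, as proposed above;
    # "|| true" keeps the sync from failing on hosts without the daemon.
    pkill -1 nutcracker || true
}

# Demo run against temp files instead of the live config.
tmp=$(mktemp -d)
echo "memcached-pool:" > "$tmp/new.yml"
sync_twemproxy_config "$tmp/new.yml" "$tmp/nutcracker.yml"
cat "$tmp/nutcracker.yml"
```

Option 2 (a squid-style deploy script via the puppet volatile repo) is dismissed later in the log because puppet already runs scap on newly started app servers before apache starts.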
[21:02:06] well, there should be [21:02:30] writing the records into LDAP in such a way that this doesn't break pdns is difficult [21:02:53] Coren: is this going to be a problem for ident? [21:03:17] ^demon: cool! nice to talk to you again! Are you the guy on the East Coast who'll be working some on search with me or am I just lost? [21:03:19] Ryan_Lane: I don't think so, still testing things. [21:03:35] actually, I am certainly still lost, but I'm getting unlost, I think. [21:04:06] <^demon> manybubbles: Yep, just a bit north of you in Virginia (for another month, moving soon) [21:05:21] ^demon: cool. robla mentioned that we might want to meet up sometime before you move. I'm in Raleigh which is only a few hours drive. Richmond has some really nice museums I could take my kids to while we meet. [21:06:30] <^demon> Sounds like something we can work out, I'll be in town pretty much until moving day. [21:08:29] Ryan_Lane: From what I can tell, pidentd works "out of the box" for any host that needs/wants it. I tweaked a few options for labs, but I doubt there is genuine use for a role for that. 
[21:08:54] cool [21:09:07] PROBLEM - twemproxy process on mw1009 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:07] PROBLEM - twemproxy process on mw1172 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:07] PROBLEM - twemproxy process on mw1170 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:08] PROBLEM - twemproxy process on mw1073 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:08] PROBLEM - twemproxy process on mw1191 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:17] PROBLEM - twemproxy process on mw1199 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:17] PROBLEM - twemproxy process on mw1090 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:17] PROBLEM - twemproxy process on mw1070 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:17] PROBLEM - twemproxy process on mw15 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:09:17] PROBLEM - twemproxy process on mw3 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:09:17] PROBLEM - twemproxy process on mw7 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:09:29] PROBLEM - twemproxy process on mw1043 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:29] PROBLEM - twemproxy process on mw1121 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:30] PROBLEM - twemproxy process on mw1141 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:30] PROBLEM - twemproxy process on mw1145 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:30] PROBLEM - twemproxy process on srv235 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:09:37] PROBLEM - twemproxy process on mw1039 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:39] PROBLEM - twemproxy process on mw1061 is CRITICAL: NRPE: Command 
check_twemproxy not defined [21:09:39] PROBLEM - twemproxy process on mw1082 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:39] PROBLEM - twemproxy process on mw1059 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:39] PROBLEM - twemproxy process on mw1174 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:39] PROBLEM - twemproxy process on mw1148 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:39] PROBLEM - twemproxy process on mw113 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:09:40] New patchset: coren; "Tool Labs: enable identd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66296 [21:09:47] PROBLEM - twemproxy process on mw1008 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:47] PROBLEM - twemproxy process on mw1102 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:47] PROBLEM - twemproxy process on mw1119 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:47] PROBLEM - twemproxy process on mw1026 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:47] PROBLEM - twemproxy process on mw1177 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:47] PROBLEM - twemproxy process on mw1 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:09:57] PROBLEM - twemproxy process on mw1019 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:57] PROBLEM - twemproxy process on mw1058 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:57] PROBLEM - twemproxy process on mw1120 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:57] PROBLEM - twemproxy process on mw1110 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:57] PROBLEM - twemproxy process on mw1086 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:57] PROBLEM - twemproxy process on mw1184 is CRITICAL: NRPE: Command check_twemproxy not defined 
[21:09:57] PROBLEM - twemproxy process on mw1108 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:58] PROBLEM - twemproxy process on mw1135 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:58] PROBLEM - twemproxy process on mw12 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:09:59] PROBLEM - twemproxy process on tmh1002 is CRITICAL: NRPE: Command check_twemproxy not defined [21:11:02] well. [21:11:09] sorry about that! [21:11:32] :D [21:11:55] Anyone know when/if we expect Jenkins back? [21:13:07] twemproxy alerts should just be from puppet running on neon before running on all of the apaches [21:13:12] <^demon> Coren: Jenkins is gone? [21:13:28] oh..and the nrpe restart might be broken still? [21:13:56] oh jenkins is back? [21:14:00] ^demon: It hasn't verified anything in hours that I can see. [21:14:13] <^demon> Jenkins is doin' stuff. [21:14:27] <^demon> maybe it's zuul that's fubar'd. [21:14:39] https://gerrit.wikimedia.org/r/#/c/63855/ [21:14:48] it reviewed a patch i submitted much earlier today [21:14:54] reviewed it like just now [21:15:01] !log completed echo event_page_id schema migrations [21:15:11] Logged the message, Master [21:16:01] <^demon> aude: And this is why I won't fix bug 48690 yet ;-) [21:16:02] New review: awjrichards; "Looks good; should wait til deployment time to merge." [operations/mediawiki-config] (master); V: 1 C: 1; - https://gerrit.wikimedia.org/r/65843 [21:16:21] ^demon: agree [21:17:25] <^demon> manybubbles: I'm hoping to feel better by Monday, and then we can start diving into search together. [21:17:45] sigh.. "/etc/init.d/nagios-nrpe-server restart" doesn't work at all [21:18:01] is manybubbles our new search person? [21:18:15] ^demon: sounds good. I might still be reading wikipages for a while but I'd like to get a look. [21:18:18] yup! [21:18:24] sweet! welcome aboard manybubbles ! [21:18:26] I'm the new search person. [21:18:27] thanks! 
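[editor's note] The "NRPE: Command check_twemproxy not defined" alerts above mean the NRPE daemon on those hosts had no matching command definition yet (the check was deployed to the monitoring server before puppet had run on the apaches, as noted above). A hypothetical definition consistent with the PROCS output of the working checks would look like this; the plugin path and thresholds are assumptions:

```ini
# e.g. a file under /etc/nagios/nrpe.d/ (path is an assumption)
# Expects exactly one nutcracker process running as nobody (UID 65534),
# matching the "1 process with UID = 65534" output in the recoveries.
command[check_twemproxy]=/usr/lib/nagios/plugins/check_procs -c 1:1 -C nutcracker -u nobody
```

NRPE only picks up new command definitions on restart, which is why the broken `/etc/init.d/nagios-nrpe-server restart` mentioned above prolonged the alert storm.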
[21:18:34] I'm excited. [21:18:39] so are we :) [21:18:42] search search we need search [21:18:57] RECOVERY - twemproxy process on mw1110 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:18:59] * aude needs solr [21:19:07] "what do we want?" "search!" "when do we want it?" "wait, I know I wrote down that answer somewhere..." [21:19:07] for wikidata [21:19:17] :) [21:19:30] <^demon> greg-g: When do we want it? 6 years ago ;-) [21:20:59] <^demon> I think lsearchd is on the [[List of things we want a time machine for so we can go back and say "NO!"]] [21:21:22] <^demon> (Which should totally be a page on mw.org, if it isn't) [21:21:45] hah [21:22:27] sounds like a good page [21:22:47] RECOVERY - twemproxy process on mw1119 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:22:47] RECOVERY - twemproxy process on mw1008 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:22:47] RECOVERY - twemproxy process on mw1102 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:22:47] RECOVERY - twemproxy process on mw1026 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:22:57] RECOVERY - twemproxy process on mw1184 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:22:57] RECOVERY - twemproxy process on mw1120 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:22:57] RECOVERY - twemproxy process on mw1108 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:22:57] RECOVERY - twemproxy process on mw1135 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:22:57] RECOVERY - twemproxy process on mw1058 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:22:57] RECOVERY - twemproxy process on mw1086 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker 
[21:22:57] RECOVERY - twemproxy process on mw1019 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:06] I'll start having a look at search stuff as soon as I have pulled myself from under this huge pile of reading material and as soon as I have a laptop [21:23:07] RECOVERY - twemproxy process on mw1009 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:07] RECOVERY - twemproxy process on mw1172 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:07] RECOVERY - twemproxy process on mw1191 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:07] RECOVERY - twemproxy process on mw1170 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:07] RECOVERY - twemproxy process on mw1073 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:17] RECOVERY - twemproxy process on mw1199 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:17] RECOVERY - twemproxy process on mw1070 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:17] RECOVERY - twemproxy process on mw1090 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:27] RECOVERY - twemproxy process on mw1145 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:27] RECOVERY - twemproxy process on mw1121 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:27] RECOVERY - twemproxy process on mw1043 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:27] RECOVERY - twemproxy process on mw1141 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:37] RECOVERY - twemproxy process on mw1174 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:37] RECOVERY - twemproxy process on 
mw1059 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:37] RECOVERY - twemproxy process on mw1148 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:37] RECOVERY - twemproxy process on mw1061 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:37] RECOVERY - twemproxy process on mw1039 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:37] RECOVERY - twemproxy process on mw1082 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:41] or some other temporary solution [21:23:47] RECOVERY - twemproxy process on mw1177 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:24:31] hashar is jenkins dead? [21:24:37] nop [21:24:45] * awjr slaps jenkins [21:24:49] awjr: it just received too many patches [21:24:59] oh, so it'll take a while to catch up? [21:25:09] I guess [21:25:14] hrmph [21:25:37] most probably we had l10nbot + bunch of patches against master + some cherry picking in wmf branches [21:25:42] that is a ton of patches to test out hehe [21:26:11] https://integration.wikimedia.org/zuul/ shows up 4 patches being tested for merge [21:26:23] yah [21:26:27] hashar, thanks for your email. I think I know how jenkins-job-builder works, but I want to add a new custom job type. Will I be the first person to do that? [21:26:30] im surprised it's all mediawiki/core though? [21:26:36] why does it bog down in these situations, scheduling is a known CS problem :/ [21:26:41] ah nm there's something different :) [21:27:02] andrewbogott: I think :-D [21:27:14] OK, I'll just create a new place for the scripts then. [21:27:34] I agree that it seems bad to have separate rules for different scripts, but it's better than simply ignoring pep8 entirely like we do now. [21:27:50] andrewbogott: about to go to bed. 
I have been doing ton of python today and connected to find out about Zuul potential issue [21:28:21] andrewbogott: we can catch up on monday if you want, i will be more or less connected from noon to 3pm PST [21:28:30] ok -- I think I can probably move forward now, anyway. Thanks again for emailing. [21:29:07] we probably want to have .pep8 in ops/puppet to ignore any whitespaces / tabs errors :-D [21:29:16] <^demon> andrewbogott: Well maybe we should make everything pep8 compliant :) [21:29:27] but yeah [21:29:34] ^demon++ [21:29:41] python scripts should all be pep8 compliants [21:30:05] Welp, I'd love that but have already tried and failed. [21:30:40] <^demon> I've had 2 changes sitting since like February just to fix some pep8 violations. [21:30:53] the main issue is tabs being used instead of space, that can be ignored and then pep8 job can be made voting :-D [21:31:23] ^demon: Idont bother fixing pep8 issues in ops/* anymore :-D [21:31:27] <^demon> Pfft, it takes no time to fix tabs => spaces. [21:31:35] <^demon> If someone would just merge said commits. [21:31:49] the other repo, I usually self merge them after a week or so [21:32:55] <^demon> andrewbogott: If you're feeling generous....https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:pep8,n,z ;-) [21:33:35] !log Zuul / Jenkins had some delay most probably related to a lot of patchsets being sent at the same time. Zuul catching up right now. [21:33:37] * andrewbogott looks for the patch that prompted the most recent pep8 holy war [21:33:44] Logged the message, Master [21:36:23] basically some people think the standard is not matching their habit and thus want yet another standard :-D [21:37:29] <^demon> Habits die hard. [21:38:10] PROBLEM - twemproxy process on tmh1001 is CRITICAL: NRPE: Command check_twemproxy not defined [21:38:16] How can I make gerrit show me all my patches, ever? 
[21:38:21] PROBLEM - twemproxy process on mw10 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:21] PROBLEM - twemproxy process on mw11 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:21] PROBLEM - twemproxy process on mw115 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:21] PROBLEM - twemproxy process on mw13 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:21] PROBLEM - twemproxy process on mw61 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:21] PROBLEM - twemproxy process on mw8 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:31] PROBLEM - twemproxy process on mw1050 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:31] PROBLEM - twemproxy process on fenari is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:31] PROBLEM - twemproxy process on mw4 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:31] PROBLEM - twemproxy process on mw14 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:31] PROBLEM - twemproxy process on tmh2 is CRITICAL: NRPE: Command check_twemproxy not defined [21:38:39] <^demon> andrewbogott: https://gerrit.wikimedia.org/r/#/q/owner:self,n,z [21:38:40] PROBLEM - twemproxy process on mw2 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:41] PROBLEM - twemproxy process on hume is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:41] PROBLEM - twemproxy process on mw37 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker 
[21:38:41] PROBLEM - twemproxy process on terbium is CRITICAL: NRPE: Command check_twemproxy not defined [21:38:41] PROBLEM - twemproxy process on mw9 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:41] PROBLEM - twemproxy process on mw99 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:41] PROBLEM - twemproxy process on srv193 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:57] ^demon: So… I just have to type that url freeform? No gui for that? [21:39:00] PROBLEM - twemproxy process on mw6 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:39:00] PROBLEM - twemproxy process on tmh1 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:39:00] PROBLEM - twemproxy process on mw5 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:39:00] PROBLEM - twemproxy process on mw55 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:39:34] <^demon> Type "owner:self" in the search box. [21:39:43] <^demon> It should autosuggest as you start typing own... [21:40:11] ok [21:40:22] New patchset: coren; "Identd daemon module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66296 [21:40:25] ^demon: are you aware of any issue with Gerrit today? [21:40:33] <^demon> I've been out sick all day. [21:40:41] <^demon> Just popped into IRC to say hey mostly. [21:40:47] same there :( [21:41:00] ^demon, hashar, greg-g: Here is the patch where I lost the fight to simply have files be pep8-compliant. I don't really understand why the #noqa solution wasn't acceptable, but I'm reluctant to repoen the issue. [21:41:00] <^demon> I've heard some "it's slow" comments, but really dunno for sure. 
[21:41:09] anyway seems zuul did not receive anything from Gerrit for a few hours but it is catching up again now. [21:41:27] If someone can sweep in, superhero-like, and insist that pep8 is The Way then I will stop worrying about trying to shoehorn in exceptions. [21:41:45] showhorning exceptions conversation: https://gerrit.wikimedia.org/r/#/c/61999/ [21:42:03] ^demon: yeah also noticed that adding a reviewer was slow :) Anyway Gerrit got restarted by asher so you should be fine for the weekend hehe. [21:42:12] Coren: +2/+2'd [21:42:22] <^demon> Pfft. Restarting isn't a fix... [21:42:34] <^demon> Got some replication failures...I can poke those later. [21:42:35] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66296 [21:44:32] I am off to bed [21:44:39] Zuul catching up nicely. [21:44:45] so it should be fine! [21:44:55] <^demon> andrewbogott: The only exception I agree with is the line-too-long (there's always exceptions that are acceptable). The rest of it sounds like people not wanting to change habits. [21:45:18] <^demon> Perhaps the argument can be made that it's not a python project, but there's something to be said for consistency. [21:45:27] andrewbogott: yeah I have seen that patch. I think we should just use pep8 standard, only ignoring some specific annoyances :-] [21:45:33] <^demon> (It's not like we've got 2 python files, we've got *dozens*) [21:46:02] the same could go with the puppet manifests :-] [21:46:12] I think the consensus is to have 4 spaces for indentation [21:46:22] might be worth making sure everyone agree / accept the consensus [21:46:32] then we can enforce that rule in puppet-lint [21:48:26] binasher: what would the advantage of the volatile repo be? [21:48:41] * AaronSchulz leans toward the former...though perhaps out of ignorance [21:49:51] AaronSchulz: i don't think it would have an advantage in this case. 
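[editor's note] The ".pep8 in ops/puppet" idea floated above could be a config fragment along these lines; the exact codes and limit are assumptions, shown only to make the proposal concrete (W191 is pep8's "indentation contains tabs", E101 "indentation contains mixed spaces and tabs"):

```ini
# Hypothetical [pep8] section for the repo, e.g. in setup.cfg or tox.ini:
# ignore only the tab/whitespace codes so the pep8 job can be made
# voting, and relax line length per the one exception agreed above.
[pep8]
ignore = W191,E101
max-line-length = 120
```

The per-line `# noqa` approach mentioned earlier in the log is the alternative for individual exceptions rather than whole error classes.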
in the squid case, it's good because if a host is down during deploy and then comes back online, it will still get the fresh config via puppet before being started [21:50:19] AaronSchulz: we have puppet run scap on newly started app servers before apache starts though, so that's already covered [21:50:29] right [22:15:04] i'm leaning to the former as well