[00:04:50] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server
[00:05:14] Change abandoned: MaxSem; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57774
[00:06:10] !log kaldari Finished syncing Wikimedia installation... :
[00:06:19] Logged the message, Master
[00:15:40] New patchset: Faidon; "New IPs for Varnish ACL for Dialog Sri Lanka WAP source IPs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66166
[00:16:28] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66148
[00:17:03] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66166
[00:21:09] RECOVERY - Puppet freshness on db45 is OK: puppet ran at Fri May 31 00:20:58 UTC 2013
[00:38:29] PROBLEM - SSH on mc15 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:39:19] RECOVERY - SSH on mc15 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[01:03:50] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset 0.0009056329727 secs
[01:21:17] New patchset: Hazard-SJ; "(bug 49001) Restrict editing the Query namespace" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66192
[01:23:41] New patchset: Hazard-SJ; "(bug 49001) Restrict editing the Query namespace on Wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66192
[01:31:37] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset 0.001476287842 secs
[02:02:27] New review: Tim Starling; "Why would query-update be needed to discuss changes to queries? Is the plan to change this configura..." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/66192
[02:06:06] !log LocalisationUpdate completed (1.22wmf5) at Fri May 31 02:06:06 UTC 2013
[02:06:16] Logged the message, Master
[02:10:09] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours
[02:10:09] PROBLEM - Puppet freshness on cp1038 is CRITICAL: No successful Puppet run in the last 10 hours
[02:10:09] PROBLEM - Puppet freshness on cp1039 is CRITICAL: No successful Puppet run in the last 10 hours
[02:11:03] !log LocalisationUpdate completed (1.22wmf4) at Fri May 31 02:11:03 UTC 2013
[02:11:09] PROBLEM - Puppet freshness on cp1037 is CRITICAL: No successful Puppet run in the last 10 hours
[02:11:10] PROBLEM - Puppet freshness on cp1040 is CRITICAL: No successful Puppet run in the last 10 hours
[02:11:11] Logged the message, Master
[02:31:40] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri May 31 02:31:40 UTC 2013
[02:31:49] Logged the message, Master
[02:42:48] PROBLEM - HTTP radosgw on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:44:06] New review: Hazard-SJ; "This change is intended to replace https://www.wikidata.org/wiki/Special:AbuseFilter/9 for now (ther..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66192
[02:44:57] PROBLEM - HTTP radosgw on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:44:57] PROBLEM - HTTP radosgw on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:45:36] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out
[02:45:36] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out
[02:45:52] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out
[02:46:11] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out
[02:46:11] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out
[02:46:11] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out
[02:46:11] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out
[02:46:51] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out
[02:47:11] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:48:02] PROBLEM - HTTP Apache on ms-fe1001 is CRITICAL: Connection timed out
[02:56:12] New patchset: Hazard-SJ; "(bug 49001) Restrict editing the Query namespace on Wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66192
[02:57:01] TimStarling: ^
[02:57:10] took out restriction of the talk namespace
[02:58:27] Jasper_Deng: omg so many reviewers?
[02:58:37] and somehow you missed Reedy :P
[03:04:01] legoktm: hey, that's not my review
[03:04:08] not even my patch
[03:04:12] lol
[03:04:14] its your bug
[03:14:39] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection timed out
[03:23:27] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 1.921 second response time
[03:26:37] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection timed out
[03:28:47] PROBLEM - HTTP radosgw on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:31:28] PROBLEM - HTTP Apache on ms-fe1002 is CRITICAL: Connection timed out
[03:34:02] gerrit seems sloooooooow
[03:39:25] It's not just me?
[03:39:32] no
[03:39:35] I get "Working..." a lot.
[03:39:49] Hrm.
[03:39:56] And the production sites seem to be having issues.
[03:41:15] apergos: Are you around? The image scalers seem to be acting up.
[03:41:55] https://commons.wikimedia.org/wiki/Special:NewFiles
[03:44:01] TimStarling?
[03:44:37] https://upload.wikimedia.org/wikipedia/commons/thumb/a/a2/Igor_Zaripov.jpg/220px-Igor_Zaripov.jpg
[03:44:42] >
[03:44:43] 503 Service Unavailable
[03:44:43] The server is currently unavailable. Please try again at a later time.
[03:44:46] There was a problem while contacting the image scaler: [Errno 110] ETIMEDOUT
[03:44:50] >
[03:45:13] unlikely that gerrit is related, it is regularly slow for its own reasons
[03:45:27] Right. I think the image scalers may be breaking Score too.
[03:45:39] search spike
[03:45:53] Score (reported in Bugzilla) + thumbnails being broken (reported in -tech) seem to be related. Gerrit seems unrelated.
[03:46:04] Speaking of search... Ram left?
[03:46:14] I thought he was going to be the search guy for a while.
[03:47:24] ms-fe1002 is not in ganglia?
[03:47:37] should be
[03:47:45] oh right
[03:47:48] http://ganglia.wikimedia.org/latest/?c=Ceph%20eqiad&m=cpu_report&r=custom&s=by%20name&hc=4&mc=2&cs=fe1001
[03:47:58] just the whole ceph cluster is off the net
[03:48:10] I guess that breaks the ganglia host search feature
[03:48:28] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=Ceph+eqiad looks wonk
[03:48:45] Right, ceph was the Score error. Okay.
[03:50:40] Susan: gerrit is always slow, its just been wrose the last day or so
[03:51:01] Yeah, I've noticed.
[03:51:06] I thought it might just be me.
[03:51:11] 7s times for lots of stuff
[03:51:14] (gerrit)
[03:51:26] Yeah, I got an internal server error earlier today as well.
[03:51:29] The black screen of death.
[03:51:38] Usually I just get "Working...".
[03:51:45] It's like code review on dial-up.
[03:52:22] https://bugzilla.wikimedia.org/show_bug.cgi?id=49004 is the ceph/image scaler bug, BTW.
[03:52:31] The ceph docs seem to suggest it's just being experimented with.
[03:52:43] it's configured as a multiwrite slave
[03:54:34] so, should we just fail over now?
[03:54:47] unless paravoid or apergos are floating around
[03:55:23] fail over = just use ceph?
[03:55:26] I mean swift?
[03:55:28] yes
[03:55:32] what are the consequences?
[03:55:59] can it be resynced?
[03:56:25] will it make it more difficult to get it back online later?
[03:56:28] there are scripts that can handle the resyncing needs quickly enough
[03:56:45] ok, do that then
[03:57:11] also for the occasional ops that failed on swift it might cause some strangness
[03:57:31] like if a file failed to save to swift and now we switched to it
[03:58:07] better than the strangeness now ;)
[03:58:41] it means we'll have some more time to analyse the issue on ceph
[03:58:44] New patchset: Aaron Schulz; "Disabled ceph backends and switched to just swift." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66196
[03:59:06] assuming it stays broken after the traffic goes away
[03:59:28] New patchset: Aaron Schulz; "Disabled ceph backends and switched to just swift." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66196
[04:00:33] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66196
[04:01:50] !log aaron synchronized wmf-config/filebackend.php 'Disabled ceph backends and switched to just swift.'
[04:01:57] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 4.248 second response time
[04:01:59] Logged the message, Master
[04:02:07] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.056 second response time
[04:02:07] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.057 second response time
[04:02:07] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time
[04:02:07] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.064 second response time
[04:02:17] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.045 second response time
[04:02:17] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 62927 bytes in 0.234 second response time
[04:02:18] https://commons.wikimedia.org/wiki/Special:NewFiles looks OK now
[04:02:41] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.056 second response time
[04:02:47] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.065 second response time
[04:03:01] TimStarling: I think I'm going to head out for a while
[04:03:07] ok
[04:03:21] is anything else likely to be broken by ceph being down?
[04:05:55] I don't think so
[04:06:37] ok, see you later
[04:06:48] I think ms-fe1001 did fix itself, so that's a bit unfortunate
[04:07:35] it's accepting connections now, whereas previously it just timed out
[04:17:31] !log on nickel: restarting gmetad to see if that fixes ceph cluster reporting
[04:17:40] Logged the message, Master
[04:17:44] Warning: we failed to resolve data source name ms-fe1001.eqiad.wmnet,
[04:18:41] well, that explains that part of it
[04:24:49] μοrning
[04:24:57] what's going on?
[04:25:33] ceph had issues and was swapped out for swift.
[04:25:59] New patchset: Tim Starling; "Fix gmetad source list typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66206
[04:26:57] I'm not sure what happened to it
[04:27:51] I have this:
[04:27:53] [0353][tstarling@fenari:/home/wikipedia/common/wmf-config]$ telnet ms-fe1001 80
[04:27:53] Trying 10.64.0.167...
[04:27:53] ^C
[04:27:53] [0354][tstarling@fenari:/home/wikipedia/common/wmf-config]$ telnet ms-fe1002 80
[04:27:53] Trying 10.64.0.168...
[04:27:54] ^C
[04:28:00] i.e. I gave up waiting on both
[04:28:10] but ping worked
[04:28:59] so I suppose that could have been a listen queue overflow
[04:29:34] actually I'm not sure what would cause it
[04:29:35] " When syncookies are enabled there is no logical maximum length and this setting is ignored."
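(Editor's note: the "failed to resolve data source name" warning above comes from ganglia's gmetad, whose config lists each cluster as a `data_source` line — the thing Tim's "Fix gmetad source list typo" patch touches. A minimal sketch of that directive, with the host list here being illustrative rather than copied from the actual patch:

```
# gmetad.conf -- one data_source line per cluster:
#   data_source "cluster name" [polling interval] host1[:port] host2[:port] ...
# gmetad must be able to resolve every listed host, or it logs the
# "failed to resolve data source name ..." warning seen above.
data_source "Ceph eqiad" ms-fe1001.eqiad.wmnet ms-fe1002.eqiad.wmnet
```

gmetad polls the listed hosts in order and uses the first one that answers, so a single typo'd or unresolvable hostname at the front of the list can blank out a whole cluster's graphs.)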
[04:32:23] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66206
[04:34:00] ganglia still has a flat line
[04:35:29] despite ms-fe1002 responding on port 8649
[04:36:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:37:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[04:42:46] bah, ganglia doesn't like any of the ceph hosts for some reason
[04:42:49] grrrr
[04:49:33] yeah, like I said
[04:49:35] I'm working on it
[04:57:18] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server
[05:01:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:02:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.142 second response time
[05:02:36] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server
[05:08:12] maybe the problem is the apache_status module
[05:10:35] RECOVERY - HTTP Apache on ms-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 229 bytes in 0.012 second response time
[05:11:45] whatever you poked, it's giving us data at least
[05:21:00] yeah, I poked first and puppetized second
[05:21:15] New patchset: Tim Starling; "Disable the gmond apache_status module on ceph hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66225
[05:21:37] sorry for the reduced gerrit notification spam, I know ops people love botspam
[05:22:55] I'm sure someone can produce some later if we are feeling deprived
[05:23:09] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66225
[05:25:46] !log on msfe1001-1004: disabled apache_status ganglia module
[05:25:55] Logged the message, Master
[05:26:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:29:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time
[05:32:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:33:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time
[05:36:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:41:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[05:41:58] PROBLEM - Apache HTTP on mw1017 is CRITICAL: Connection refused
[05:42:58] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.062 second response time
[05:52:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:53:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[05:57:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:58:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time
[06:12:26] RECOVERY - HTTP Apache on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 229 bytes in 0.008 second response time
[06:12:54] PROBLEM - HTTP Apache on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:13:44] PROBLEM - HTTP Apache on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:27:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:28:14] PROBLEM - Disk space on mc15 is CRITICAL: Timeout while attempting connection
[06:29:04] RECOVERY - Disk space on mc15 is OK: DISK OK
[06:29:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time
[06:33:34] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.00258243084 secs
[06:38:50] PROBLEM - HTTP Apache on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:42:13] !log Running a script on terbium to list out files that were updated (overwritten) but are not synced on all wikis (it found none on commons/enwiki)
[06:42:21] Logged the message, Master
[06:56:31] !log ceph osds 50 and 132 on ms-fe1002 are logging 'slow requests', don't know how to restart specific osds in bobtail though
[06:56:39] Logged the message, Master
[07:01:25] New patchset: Pyoungmeister; "removing myself from icinga for vacation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66232
[07:06:02] New patchset: Pyoungmeister; "removing myself from icinga for vacation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66232
[07:26:41] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:40:53] PROBLEM - HTTP Apache on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:02:13] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.002958297729 secs
[08:15:38] New patchset: Mark Bergsma; "Awful hacks to make Puppet work are nothing new..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66124
[08:19:44] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66232
[08:24:41] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66124
[08:35:05] New patchset: Mark Bergsma; "Replace '/' separator with a space" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66236
[08:38:01] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66236
[08:40:32] New patchset: Mark Bergsma; "Replace the entire string" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66237
[08:43:04] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66237
[08:43:05] jenkins is still sleeping
[08:44:53] ok, so am I apparently
[08:45:37] New patchset: Mark Bergsma; "Using a correct regexp with match groups does help" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66238
[08:47:18] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66238
[08:51:07] New patchset: Mark Bergsma; "Add all mobile UAs currently listed in the Squid configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65823
[08:52:16] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65823
[08:57:23] hey mark, the analytics team is using dclass for device detection in kraken; it uses a decision tree written in C and uses the OpenDDR device library. There is a also a Varnish vmod module, see https://github.com/TheWeatherChannel/dClass/tree/master/servers/varnish
[08:57:43] yeah I know
[08:58:15] any plans to start using it?
[08:58:22] not at this time, perhaps in the future
[08:58:30] right now I'm just focusing on making varnish behave like squid :)
[08:58:38] understood
[09:00:53] RECOVERY - RAID on es1001 is OK: OK: State is Optimal, checked 2 logical device(s)
[09:04:39] but if you see the plan on zero on wikitech-l, we might need it earlier for that ;)
[09:32:42] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100%
[09:34:22] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[09:50:07] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:07] PROBLEM - Puppet freshness on db1032 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:07] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:07] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:07] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:07] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:07] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:08] PROBLEM - Puppet freshness on mw1171 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:08] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:09] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:09] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:10] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:10] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[10:21:27] goddamit
[10:21:33] didn't hear the page
[10:21:59] what the hell
[10:26:20] RECOVERY - HTTP Apache on ms-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 229 bytes in 0.005 second response time
[10:26:21] RECOVERY - HTTP Apache on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 229 bytes in 4.732 second response time
[10:26:41] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 1.426 second response time
[10:26:45] RECOVERY - HTTP radosgw on ms-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.768 second response time
[10:27:10] RECOVERY - HTTP radosgw on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.005 second response time
[10:28:50] RECOVERY - HTTP radosgw on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 4.721 second response time
[10:29:32] RECOVERY - HTTP Apache on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 229 bytes in 0.001 second response time
[10:29:42] RECOVERY - HTTP Apache on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 229 bytes in 0.001 second response time
[10:29:42] RECOVERY - HTTP radosgw on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.004 second response time
[10:44:47] PROBLEM - SSH on mc15 is CRITICAL: Connection timed out
[10:45:47] RECOVERY - SSH on mc15 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[10:53:45] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/65706
[11:07:50] New patchset: Ottomata; "Adding alerts for webrequest data loss in HDFS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66241
[11:08:48] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66241
[11:40:27] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100%
[11:41:26] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[12:09:56] PROBLEM - Disk space on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:10:46] RECOVERY - Disk space on mc15 is OK: DISK OK
[12:11:06] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours
[12:11:06] PROBLEM - Puppet freshness on cp1038 is CRITICAL: No successful Puppet run in the last 10 hours
[12:11:06] PROBLEM - Puppet freshness on cp1039 is CRITICAL: No successful Puppet run in the last 10 hours
[12:12:06] PROBLEM - Puppet freshness on cp1037 is CRITICAL: No successful Puppet run in the last 10 hours
[12:12:06] PROBLEM - Puppet freshness on cp1040 is CRITICAL: No successful Puppet run in the last 10 hours
[12:16:26] RECOVERY - Puppet freshness on amssq47 is OK: puppet ran at Fri May 31 12:16:23 UTC 2013
[12:16:56] RECOVERY - Puppet freshness on cp1039 is OK: puppet ran at Fri May 31 12:16:45 UTC 2013
[12:17:36] RECOVERY - Puppet freshness on cp1040 is OK: puppet ran at Fri May 31 12:17:29 UTC 2013
[12:18:36] RECOVERY - Puppet freshness on cp1037 is OK: puppet ran at Fri May 31 12:18:31 UTC 2013
[12:18:46] RECOVERY - Puppet freshness on cp1038 is OK: puppet ran at Fri May 31 12:18:43 UTC 2013
[13:26:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:27:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time
[13:32:52] New patchset: Faidon; "Ceph: move monitors to separate rows" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66247
[13:34:48] no jenkins?
[13:35:04] no
[13:35:29] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66247
[13:35:39] New patchset: Faidon; "Revert "Disable the gmond apache_status module on ceph hosts"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66248
[13:36:04] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66248
[13:47:45] New patchset: Faidon; "Ceph: use role::ceph::mon on the new mons" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66250
[13:48:48] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66250
[14:02:13] where is jenkins?
[14:21:12] Jenkins called in sick today
[14:22:07] Lazy f*k doesn't even wfh.
[14:27:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:28:27] !log reedy synchronized php-1.22wmf5/extensions/SecurePoll/
[14:28:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[14:28:36] Logged the message, Master
[14:29:30] !log reedy synchronized php-1.22wmf4/extensions/SecurePoll/
[14:29:38] Logged the message, Master
[14:33:27] !log Created bv2013_edit tables on all wikis
[14:33:37] Logged the message, Master
[14:40:33] !log reedy synchronized php-1.22wmf5/extensions/SecurePoll/
[14:40:42] Logged the message, Master
[14:41:34] !log reedy synchronized php-1.22wmf4/extensions/SecurePoll/
[14:41:42] Logged the message, Master
[14:46:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:47:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.192 second response time
[14:55:14] New patchset: Petrb; "improved a help of sql command a bit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66262
[14:56:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:58:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[15:09:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:10:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[15:12:38] Change merged: Ottomata; [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/65267
[15:12:48] New review: Ottomata; "Woot! Thanks Alex!" [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/65267
[15:15:24] New patchset: Ottomata; "Updating modules/cdh4 to latest ecosystem commit." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66263
[15:16:13] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66263
[15:18:56] heya paravoid, you there?
[15:19:19] the stuff that I need to puppetize and reinstall the kraken hadoop nodes is finally in ops/puppet, woo!
[15:19:20] q for you:
[15:19:26] kraken role class…or kraken module?
[15:22:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:23:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[15:25:20] so, yay ceph?
[15:26:00] New patchset: Petrb; "improved a help of sql command a bit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66262
[15:34:22] New patchset: Petrb; "inserted ksh and mysql client to exec nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66264
[15:40:01] New review: coren; "mysql-client (a) conflicts with mariadb-common (and libmariadbclient) and should be (b) mariadb-clie..." [operations/puppet] (production) C: -2; - https://gerrit.wikimedia.org/r/66264
[15:40:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:41:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[15:52:07] New patchset: Petrb; "inserted sql tool to execnodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66266
[15:52:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:53:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time
[15:54:29] New patchset: Petrb; "subversion to all exec nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66273
[16:22:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:23:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.154 second response time
[16:24:44] New review: Aude; "this is better than the abuse filter rule. We are not using (or have enabled, afaik) the query name..." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/66192
[16:28:44] paravoid, if you are around i'd love a puppet brain bounce
[16:29:25] or even ori-l, you there?
[16:31:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:33:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[16:52:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:53:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[17:02:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:03:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[17:11:08] ottomata: hey
[17:12:20] hey, in short standup meeting
[17:12:35] but, anyway, i'm prepping for puppetization of hadoop nodes using cdh4
[17:12:38] so.
[17:12:42] the module is non-wmf specific
[17:12:51] i'm trying to figure out where to put wmf-specifc usage of the module
[17:13:03] first I will use it to puppetize labs
[17:14:55] kraken module?
[17:14:58] not sure.
[17:15:07] mabye just roles/kraken.pp with a buncha classes in there?
[17:15:14] i DO need to add a couple of config files
[17:15:20] that are not part of the cdh4 module
[17:15:26] roles/kraken.pp sounds right
[17:15:36] and for files? templates/kraken/...?
[17:15:51] doesn't seem like what a role class should be doing, you know?
[17:17:14] why not? my pattern is typically this:
[17:17:38] module classes take parameters and therefore need to be declared using 'myclass { param => value }'
[17:17:59] and these declarations should be inside a role class, that does not take parameters
[17:18:09] (i.e. it should be include-able)
[17:18:15] aye
[17:18:27] hmm, ok will continue with that then
[17:18:55] so you end up with a hierarchy of: module -> configureable software platform, role -> configuration of that software platform
[17:19:54] both ::labs:: and ::production:: classes in roles/kraken.pp?
[17:21:21] if the setup is so different that including them in the same file will make it an unreadable profusion of if realm?s, then roles/kraken.pp and roles/kraken-labs.pp. if you can box the differences into a compact if/else, same file
[17:22:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:23:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[17:23:19] New patchset: Sanja pavlovic; "Per bug #48012. Patch for worker.py. It checks for external programs existence in the initialization part." [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/64095
[17:26:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:28:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time
[17:31:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:33:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[17:51:19] Yay Ceva is picking up the servers -- RobH
[17:52:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:53:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[18:02:24] paravoid: hi
[18:02:40] paravoid: can I ask for some tips on why gbp is ignoring my debian/gbp.conf please ?
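(Editor's note: the module/role hierarchy ori-l describes above — parameterized module classes declared from parameterless, include-able role classes, with realm differences boxed into a compact if/else — can be sketched in Puppet. All class, parameter, and host names below are illustrative, not the actual cdh4 module's API:

```puppet
# modules/cdh4/manifests/hadoop.pp -- the non-WMF-specific module:
# takes parameters, so it must be declared with values.
class cdh4::hadoop($namenode_host, $datanode_mounts = []) {
  # ... install packages, render config templates from the parameters ...
}

# manifests/role/kraken.pp -- the WMF-specific role: no parameters,
# so nodes can simply 'include role::kraken::hadoop'.
class role::kraken::hadoop {
  # realm differences boxed into a compact if/else, per the discussion
  if $::realm == 'labs' {
    $namenode = 'namenode.example.wmflabs'      # placeholder hostname
  } else {
    $namenode = 'namenode.example.eqiad.wmnet'  # placeholder hostname
  }
  class { 'cdh4::hadoop':
    namenode_host => $namenode,
  }
}
```

The point of the split is that the module stays reusable by anyone, while everything site-specific — hostnames, mount layouts, extra config files — lives in the role class.)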
[18:02:43] it seems to be the case [18:03:02] like if I feed it the params manually --debian-branch and stuff, it seems to take them into consideration [18:03:08] but the gbp.conf is left out [18:10:28] RECOVERY - Host virt1000 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [18:10:56] RECOVERY - Host labs-ns1.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [18:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [18:37:41] if someone is around who can do something, it would be lovely to have jenkins and gerrit irc bot back [18:40:58] aude: they disappeared? [18:41:08] I restarted ircecho on maganese [18:41:36] yay! [18:41:39] i see the bot now [18:41:52] I don't see ircecho for jenkins, but I'd imagine it's reporting through gerrit [18:42:01] jenkins seems to have died [18:42:08] not reviewing my code since yesterday [18:42:16] not reviewing anyone's code [18:42:34] It says Queue lengths: 334 events, 0 results. [18:42:37] (https://integration.wikimedia.org/zuul/) [18:42:53] need hashar [18:43:26] Also looks like someone's doing security stuff in gerrit again: https://integration.wikimedia.org/ci/job/mediawiki-core-whitespaces/4589/ [18:43:42] change is a draft, but jenkins was running tests for it [18:44:58] strange that it seems to be stuck there though [18:47:37] ok [18:50:40] Ah fun... interesting info leakage. [18:51:53] (in this case I know the person who's working on it, and it's not really secret. 
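[editor's note] For reference, a minimal debian/gbp.conf of the kind being debugged above might look like this. Two common reasons such a file appears to be ignored (assumptions here, not a diagnosis of this case): the file exists only on a branch other than the one currently checked out, or an option sits under the wrong section header, since gbp reads per-command options from the matching section:

```ini
# Hypothetical debian/gbp.conf; branch names are examples only.
[DEFAULT]
debian-branch = master
upstream-branch = upstream
pristine-tar = True

# Options specific to gbp buildpackage go in their own section;
# placing them under the wrong header makes them silently ignored.
[buildpackage]
export-dir = ../build-area/
```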
But we should close that) [18:52:20] It shows the full commit message as well as the user who uploaded it [18:52:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:52:36] Kinda pointless because once you know what repo a forbidden changeid is in, you can download it through git [18:53:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [19:06:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:07:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [19:13:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:14:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [19:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:50] !log aaron synchronized php-1.22wmf4/maintenance/copyFileBackend.php '3d0e3f8e4d09d4a2462f583c25dd284e6ee3e466' [19:22:59] Logged the message, Master [19:23:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.142 second response time [19:23:57] !log aaron synchronized php-1.22wmf5/maintenance/copyFileBackend.php '82352634594d0a67d3b9f6597c3bb2c0911f8028' [19:24:06] Logged the message, Master [19:46:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [19:51:02] PROBLEM - Puppet freshness on db1032 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:02] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the 
last 10 hours [19:51:02] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:02] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:02] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:02] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:02] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:03] PROBLEM - Puppet freshness on mw1171 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:03] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:04] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:04] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:05] PROBLEM - Puppet freshness on virt1000 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:05] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:06] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [19:53:04] !log Running a terbium script to copy missing files to swift as well as ceph [19:53:12] Logged the message, Master [19:55:06] AaronSchulz: do we know what caused the ceph meltdown last night yet? 
[19:56:35] you'd have to wait for paravoid [19:56:53] New patchset: Petrb; "improved a help of sql command a bit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66262 [19:57:05] New review: Yurik; "lots of minor style fixes are needed :)" [operations/dumps] (ariel) C: -1; - https://gerrit.wikimedia.org/r/64095 [19:57:06] Coren can you merge this it is rather important - it fixes a bug :/ [19:57:08] gotcha [19:57:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:58:31] New review: coren; "LGM" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/66262 [19:59:04] petan: Once Jenkins gets to it, I'll merge. [19:59:09] ty [19:59:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [19:59:27] I think jenkins is broken because that patch is there for several hours [20:00:06] Coren it is syntactially correct I can run it myself [20:00:17] I dont know what is wrong with jenkins :/ [20:00:50] New review: coren; "+manual Verified -- Jenkins is ill." [operations/puppet] (production); V: 2 - https://gerrit.wikimedia.org/r/66262 [20:00:54] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66262 [20:10:57] petan: yeh jenkins is broken for me too [20:11:13] * aude summons hashar [20:11:26] it's not running gate submits on +2s [20:11:37] not running at all [20:11:57] aude: Can you afford the needed sacrifices for such a summoning spell? 
:-) [20:12:10] heh [20:14:51] RECOVERY - Puppet freshness on virt1000 is OK: puppet ran at Fri May 31 20:14:42 UTC 2013 [20:28:31] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100% [20:29:22] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [20:32:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:33:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [20:38:43] New patchset: coren; "Tool Labs: enable identd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66296 [20:41:14] anyone know what's up with gerrit? [20:44:23] binasher: It's broken. [20:44:24] :-) [20:44:45] let's not jump to conclusions here [20:44:55] it's just a flesh wound [20:48:40] * yurik begins a contest of throwing rotten tomatoes at gerrit... i heard it would make it move faster... [20:49:18] !log restarted gerrit [20:49:26] Logged the message, Master [20:50:41] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65604 [20:50:50] binasher: we should have an auto-restart cron...didn't we have that with swift for a while? :D [20:51:08] yup :) [20:51:21] every root's favorite tool [20:58:00] AaronSchulz: do you have an opinion as to how the twemproxy config should be deployed? i'm thinking of either 1) leaving in wmf-config and adding "pkill -1 nutcracker" to scap / sync-file scripts or 2) doing something like the squid config, where they have their own deploy script that utilizes the puppet volatile repo [21:01:31] meeting :) [21:01:37] <^demon> manybubbles: Sorry I've been out sick today, but I did want to drop in and say hi. We've spoken on the phone, but not IRC yet :) [21:01:47] * ^demon is Chad [21:01:53] Ryan_Lane: No reverse for Openstack domain names? 
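[editor's note] Option 1 from the twemproxy question above (keep the config in wmf-config and have the sync tooling signal the daemon) could be sketched as below. The sync command and paths are stand-ins, not the real scap internals, and whether SIGHUP actually reloads a twemproxy config (rather than just reopening logs) depends on the build; the `pkill -1 nutcracker` step is reproduced here as proposed on-channel:

```shell
# Hedged sketch of "leave config in wmf-config, add pkill -1 to scap".
sync_twemproxy_config() {
    src="$1"; dst="$2"
    cp "$src" "$dst"            # stand-in for sync-file
    # Signal every nutcracker process, as proposed above;
    # "|| true" keeps the sync from failing on hosts without the daemon.
    pkill -1 nutcracker || true
}

# Demo run against temp files instead of the live config.
tmp=$(mktemp -d)
echo "memcached-pool:" > "$tmp/new.yml"
sync_twemproxy_config "$tmp/new.yml" "$tmp/nutcracker.yml"
cat "$tmp/nutcracker.yml"
```

Option 2 (a squid-style deploy script via the puppet volatile repo) is dismissed later in the log because puppet already runs scap on newly started app servers before apache starts.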
[21:02:06] well, there should be [21:02:30] writing the records into LDAP in such a way that this doesn't break pdns is difficult [21:02:53] Coren: is this going to be a problem for ident? [21:03:17] ^demon: cool! nice to talk to you again! Are you the guy on the East Coast who'll be working some on search with me or am I just lost? [21:03:19] Ryan_Lane: I don't think so, still testing things. [21:03:35] actually, I am certainly still lost, but I'm getting unlost, I think. [21:04:06] <^demon> manybubbles: Yep, just a bit north of you in Virginia (for another month, moving soon) [21:05:21] ^demon: cool. robla mentioned that we might want to meet up sometime before you move. I'm in Raleigh which is only a few hours drive. Richmond has some really nice museums I could take my kids to while we meet. [21:06:30] <^demon> Sounds like something we can work out, I'll be in town pretty much until moving day. [21:08:29] Ryan_Lane: From what I can tell, pidentd works "out of the box" for any host that needs/wants it. I tweaked a few options for labs, but I doubt there is genuine use for a role for that. 
[21:08:54] cool [21:09:07] PROBLEM - twemproxy process on mw1009 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:07] PROBLEM - twemproxy process on mw1172 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:07] PROBLEM - twemproxy process on mw1170 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:08] PROBLEM - twemproxy process on mw1073 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:08] PROBLEM - twemproxy process on mw1191 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:17] PROBLEM - twemproxy process on mw1199 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:17] PROBLEM - twemproxy process on mw1090 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:17] PROBLEM - twemproxy process on mw1070 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:17] PROBLEM - twemproxy process on mw15 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:09:17] PROBLEM - twemproxy process on mw3 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:09:17] PROBLEM - twemproxy process on mw7 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:09:29] PROBLEM - twemproxy process on mw1043 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:29] PROBLEM - twemproxy process on mw1121 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:30] PROBLEM - twemproxy process on mw1141 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:30] PROBLEM - twemproxy process on mw1145 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:30] PROBLEM - twemproxy process on srv235 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:09:37] PROBLEM - twemproxy process on mw1039 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:39] PROBLEM - twemproxy process on mw1061 is CRITICAL: NRPE: Command 
check_twemproxy not defined [21:09:39] PROBLEM - twemproxy process on mw1082 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:39] PROBLEM - twemproxy process on mw1059 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:39] PROBLEM - twemproxy process on mw1174 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:39] PROBLEM - twemproxy process on mw1148 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:39] PROBLEM - twemproxy process on mw113 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:09:40] New patchset: coren; "Tool Labs: enable identd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66296 [21:09:47] PROBLEM - twemproxy process on mw1008 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:47] PROBLEM - twemproxy process on mw1102 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:47] PROBLEM - twemproxy process on mw1119 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:47] PROBLEM - twemproxy process on mw1026 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:47] PROBLEM - twemproxy process on mw1177 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:47] PROBLEM - twemproxy process on mw1 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:09:57] PROBLEM - twemproxy process on mw1019 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:57] PROBLEM - twemproxy process on mw1058 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:57] PROBLEM - twemproxy process on mw1120 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:57] PROBLEM - twemproxy process on mw1110 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:57] PROBLEM - twemproxy process on mw1086 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:57] PROBLEM - twemproxy process on mw1184 is CRITICAL: NRPE: Command check_twemproxy not defined 
[21:09:57] PROBLEM - twemproxy process on mw1108 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:58] PROBLEM - twemproxy process on mw1135 is CRITICAL: NRPE: Command check_twemproxy not defined [21:09:58] PROBLEM - twemproxy process on mw12 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:09:59] PROBLEM - twemproxy process on tmh1002 is CRITICAL: NRPE: Command check_twemproxy not defined [21:11:02] well. [21:11:09] sorry about that! [21:11:32] :D [21:11:55] Anyone know when/if we expect Jenkins back? [21:13:07] twemproxy alerts should just be from puppet running on neon before running on all of the apaches [21:13:12] <^demon> Coren: Jenkins is gone? [21:13:28] oh..and the nrpe restart might be broken still? [21:13:56] oh jenkins is back? [21:14:00] ^demon: It hasn't verified anything in hours that I can see. [21:14:13] <^demon> Jenkins is doin' stuff. [21:14:27] <^demon> maybe it's zuul that's fubar'd. [21:14:39] https://gerrit.wikimedia.org/r/#/c/63855/ [21:14:48] it reviewed a patch i submitted much earlier today [21:14:54] reviewed it like just now [21:15:01] !log completed echo event_page_id schema migrations [21:15:11] Logged the message, Master [21:16:01] <^demon> aude: And this is why I won't fix bug 48690 yet ;-) [21:16:02] New review: awjrichards; "Looks good; should wait til deployment time to merge." [operations/mediawiki-config] (master); V: 1 C: 1; - https://gerrit.wikimedia.org/r/65843 [21:16:21] ^demon: agree [21:17:25] <^demon> manybubbles: I'm hoping to feel better by Monday, and then we can start diving into search together. [21:17:45] sigh.. "/etc/init.d/nagios-nrpe-server restart" doesn't work at all [21:18:01] is manybubbles our new search person? [21:18:15] ^demon: sounds good. I might still be reading wikipages for a while but I'd like to get a look. [21:18:18] yup! [21:18:24] sweet! welcome aboard manybubbles ! [21:18:26] I'm the new search person. [21:18:27] thanks! 
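[editor's note] The "NRPE: Command check_twemproxy not defined" alerts above mean the NRPE daemon on those hosts had no matching command definition yet (the check was deployed to the monitoring server before puppet had run on the apaches, as noted above). A hypothetical definition consistent with the PROCS output of the working checks would look like this; the plugin path and thresholds are assumptions:

```ini
# e.g. a file under /etc/nagios/nrpe.d/ (path is an assumption)
# Expects exactly one nutcracker process running as nobody (UID 65534),
# matching the "1 process with UID = 65534" output in the recoveries.
command[check_twemproxy]=/usr/lib/nagios/plugins/check_procs -c 1:1 -C nutcracker -u nobody
```

NRPE only picks up new command definitions on restart, which is why the broken `/etc/init.d/nagios-nrpe-server restart` mentioned above prolonged the alert storm.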
[21:18:34] I'm excited. [21:18:39] so are we :) [21:18:42] search search we need search [21:18:57] RECOVERY - twemproxy process on mw1110 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:18:59] * aude needs solr [21:19:07] "what do we want?" "search!" "when do we want it?" "wait, I know I wrote down that answer somewhere..." [21:19:07] for wikidata [21:19:17] :) [21:19:30] <^demon> greg-g: When do we want it? 6 years ago ;-) [21:20:59] <^demon> I think lsearchd is on the [[List of things we want a time machine for so we can go back and say "NO!"]] [21:21:22] <^demon> (Which should totally be a page on mw.org, if it isn't) [21:21:45] hah [21:22:27] sounds like a good page [21:22:47] RECOVERY - twemproxy process on mw1119 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:22:47] RECOVERY - twemproxy process on mw1008 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:22:47] RECOVERY - twemproxy process on mw1102 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:22:47] RECOVERY - twemproxy process on mw1026 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:22:57] RECOVERY - twemproxy process on mw1184 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:22:57] RECOVERY - twemproxy process on mw1120 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:22:57] RECOVERY - twemproxy process on mw1108 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:22:57] RECOVERY - twemproxy process on mw1135 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:22:57] RECOVERY - twemproxy process on mw1058 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:22:57] RECOVERY - twemproxy process on mw1086 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker 
[21:22:57] RECOVERY - twemproxy process on mw1019 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:06] I'll start having a look at search stuff as soon as I have pulled myself from under this huge pile of reading material and as soon as I have a laptop [21:23:07] RECOVERY - twemproxy process on mw1009 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:07] RECOVERY - twemproxy process on mw1172 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:07] RECOVERY - twemproxy process on mw1191 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:07] RECOVERY - twemproxy process on mw1170 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:07] RECOVERY - twemproxy process on mw1073 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:17] RECOVERY - twemproxy process on mw1199 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:17] RECOVERY - twemproxy process on mw1070 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:17] RECOVERY - twemproxy process on mw1090 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:27] RECOVERY - twemproxy process on mw1145 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:27] RECOVERY - twemproxy process on mw1121 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:27] RECOVERY - twemproxy process on mw1043 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:27] RECOVERY - twemproxy process on mw1141 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:37] RECOVERY - twemproxy process on mw1174 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:37] RECOVERY - twemproxy process on 
mw1059 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:37] RECOVERY - twemproxy process on mw1148 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:37] RECOVERY - twemproxy process on mw1061 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:37] RECOVERY - twemproxy process on mw1039 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:37] RECOVERY - twemproxy process on mw1082 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:23:41] or some other temporary solution [21:23:47] RECOVERY - twemproxy process on mw1177 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [21:24:31] hashar is jenkins dead? [21:24:37] nop [21:24:45] * awjr slaps jenkins [21:24:49] awjr: it just received too many patches [21:24:59] oh, so it'll take a while to catch up? [21:25:09] I guess [21:25:14] hrmph [21:25:37] most probably we had l10nbot + bunch of patches against master + some cherry picking in wmf branches [21:25:42] that is a ton of patches to test out hehe [21:26:11] https://integration.wikimedia.org/zuul/ shows up 4 patches being tested for merge [21:26:23] yah [21:26:27] hashar, thanks for your email. I think I know how jenkins-job-builder works, but I want to add a new custom job type. Will I be the first person to do that? [21:26:30] im surprised it's all mediawiki/core though? [21:26:36] why does it bog down in these situations, scheduling is a known CS problem :/ [21:26:41] ah nm there's something different :) [21:27:02] andrewbogott: I think :-D [21:27:14] OK, I'll just create a new place for the scripts then. [21:27:34] I agree that it seems bad to have separate rules for different scripts, but it's better than simply ignoring pep8 entirely like we do now. [21:27:50] andrewbogott: about to go to bed. 
I have been doing ton of python today and connected to find out about Zuul potential issue [21:28:21] andrewbogott: we can catch up on monday if you want, i will be more or less connected from noon to 3pm PST [21:28:30] ok -- I think I can probably move forward now, anyway. Thanks again for emailing. [21:29:07] we probably want to have .pep8 in ops/puppet to ignore any whitespaces / tabs errors :-D [21:29:16] <^demon> andrewbogott: Well maybe we should make everything pep8 compliant :) [21:29:27] but yeah [21:29:34] ^demon++ [21:29:41] python scripts should all be pep8 compliants [21:30:05] Welp, I'd love that but have already tried and failed. [21:30:40] <^demon> I've had 2 changes sitting since like February just to fix some pep8 violations. [21:30:53] the main issue is tabs being used instead of space, that can be ignored and then pep8 job can be made voting :-D [21:31:23] ^demon: Idont bother fixing pep8 issues in ops/* anymore :-D [21:31:27] <^demon> Pfft, it takes no time to fix tabs => spaces. [21:31:35] <^demon> If someone would just merge said commits. [21:31:49] the other repo, I usually self merge them after a week or so [21:32:55] <^demon> andrewbogott: If you're feeling generous....https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:pep8,n,z ;-) [21:33:35] !log Zuul / Jenkins had some delay most probably related to a lot of patchsets being sent at the same time. Zuul catching up right now. [21:33:37] * andrewbogott looks for the patch that prompted the most recent pep8 holy war [21:33:44] Logged the message, Master [21:36:23] basically some people think the standard is not matching their habit and thus want yet another standard :-D [21:37:29] <^demon> Habits die hard. [21:38:10] PROBLEM - twemproxy process on tmh1001 is CRITICAL: NRPE: Command check_twemproxy not defined [21:38:16] How can I make gerrit show me all my patches, ever? 
[21:38:21] PROBLEM - twemproxy process on mw10 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:21] PROBLEM - twemproxy process on mw11 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:21] PROBLEM - twemproxy process on mw115 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:21] PROBLEM - twemproxy process on mw13 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:21] PROBLEM - twemproxy process on mw61 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:21] PROBLEM - twemproxy process on mw8 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:31] PROBLEM - twemproxy process on mw1050 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:31] PROBLEM - twemproxy process on fenari is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:31] PROBLEM - twemproxy process on mw4 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:31] PROBLEM - twemproxy process on mw14 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:31] PROBLEM - twemproxy process on tmh2 is CRITICAL: NRPE: Command check_twemproxy not defined [21:38:39] <^demon> andrewbogott: https://gerrit.wikimedia.org/r/#/q/owner:self,n,z [21:38:40] PROBLEM - twemproxy process on mw2 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:41] PROBLEM - twemproxy process on hume is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:41] PROBLEM - twemproxy process on mw37 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker 
[21:38:41] PROBLEM - twemproxy process on terbium is CRITICAL: NRPE: Command check_twemproxy not defined [21:38:41] PROBLEM - twemproxy process on mw9 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:41] PROBLEM - twemproxy process on mw99 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:41] PROBLEM - twemproxy process on srv193 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:38:57] ^demon: So… I just have to type that url freeform? No gui for that? [21:39:00] PROBLEM - twemproxy process on mw6 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:39:00] PROBLEM - twemproxy process on tmh1 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:39:00] PROBLEM - twemproxy process on mw5 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:39:00] PROBLEM - twemproxy process on mw55 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:39:34] <^demon> Type "owner:self" in the search box. [21:39:43] <^demon> It should autosuggest as you start typing own... [21:40:11] ok [21:40:22] New patchset: coren; "Identd daemon module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66296 [21:40:25] ^demon: are you aware of any issue with Gerrit today? [21:40:33] <^demon> I've been out sick all day. [21:40:41] <^demon> Just popped into IRC to say hey mostly. [21:40:47] same there :( [21:41:00] ^demon, hashar, greg-g: Here is the patch where I lost the fight to simply have files be pep8-compliant. I don't really understand why the #noqa solution wasn't acceptable, but I'm reluctant to repoen the issue. [21:41:00] <^demon> I've heard some "it's slow" comments, but really dunno for sure. 
[21:41:09] anyway seems zuul did not receive anything from Gerrit for a few hours but it is catching up again now. [21:41:27] If someone can sweep in, superhero-like, and insist that pep8 is The Way then I will stop worrying about trying to shoehorn in exceptions. [21:41:45] showhorning exceptions conversation: https://gerrit.wikimedia.org/r/#/c/61999/ [21:42:03] ^demon: yeah also noticed that adding a reviewer was slow :) Anyway Gerrit got restarted by asher so you should be fine for the weekend hehe. [21:42:12] Coren: +2/+2'd [21:42:22] <^demon> Pfft. Restarting isn't a fix... [21:42:34] <^demon> Got some replication failures...I can poke those later. [21:42:35] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66296 [21:44:32] I am off to bed [21:44:39] Zuul catching up nicely. [21:44:45] so it should be fine! [21:44:55] <^demon> andrewbogott: The only exception I agree with is the line-too-long (there's always exceptions that are acceptable). The rest of it sounds like people not wanting to change habits. [21:45:18] <^demon> Perhaps the argument can be made that it's not a python project, but there's something to be said for consistency. [21:45:27] andrewbogott: yeah I have seen that patch. I think we should just use pep8 standard, only ignoring some specific annoyances :-] [21:45:33] <^demon> (It's not like we've got 2 python files, we've got *dozens*) [21:46:02] the same could go with the puppet manifests :-] [21:46:12] I think the consensus is to have 4 spaces for indentation [21:46:22] might be worth making sure everyone agree / accept the consensus [21:46:32] then we can enforce that rule in puppet-lint [21:48:26] binasher: what would the advantage of the volatile repo be? [21:48:41] * AaronSchulz leans toward the former...though perhaps out of ignorance [21:49:51] AaronSchulz: i don't think it would have an advantage in this case. 
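[editor's note] The ".pep8 in ops/puppet" idea floated above could be a config fragment along these lines; the exact codes and limit are assumptions, shown only to make the proposal concrete (W191 is pep8's "indentation contains tabs", E101 "indentation contains mixed spaces and tabs"):

```ini
# Hypothetical [pep8] section for the repo, e.g. in setup.cfg or tox.ini:
# ignore only the tab/whitespace codes so the pep8 job can be made
# voting, and relax line length per the one exception agreed above.
[pep8]
ignore = W191,E101
max-line-length = 120
```

The per-line `# noqa` approach mentioned earlier in the log is the alternative for individual exceptions rather than whole error classes.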
in the squid case, it's good because if a host is down during deploy and then comes back online, it will still get the fresh config via puppet before being started [21:50:19] AaronSchulz: we have puppet run scap on newly started app servers before apache starts though, so that's already covered [21:50:29] right [22:15:04] i'm leaning to the former as well