[00:00:21] Yay, error 500 [00:01:17] sweet :) [00:05:44] 1 isn't much use [00:09:14] Server: nginx/0.7.65 [00:09:15] o_0 [00:11:31] X-Cache: MISS from cp1004.eqiad.wmnet,MISS from cp1012.eqiad.wmnet [00:11:31] X-Cache-Lookup: MISS from cp1004.eqiad.wmnet:3128,MISS from cp1012.eqiad.wmnet:80 [00:13:41] Are we using nginx for api app servers? [00:14:20] /proxy [00:41:36] New patchset: Tim Starling; "Restrict NFS exports" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7765 [00:41:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7765 [00:46:49] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7765 [00:46:52] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7765 [00:53:03] "homeless apache servers" ? [00:53:11] Yeah :) [00:53:39] PROBLEM - Puppet freshness on search22 is CRITICAL: Puppet has not run in the last 10 hours [00:53:53] I think that how much we deal with homeless Apaches is appropriate for our respective neighborhoods [00:59:50] We should get people to donate to help them buy homes [01:02:29] heh [01:41:24] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 232 seconds [01:44:15] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [02:23:52] morebots is AWOL [02:30:34] damn it. wikitech doesn't pull from our repo [02:30:52] I have morebots packaged and updated [02:31:00] We don't miss her [02:31:53] I like using !log and it appearing in my twitter stream [02:34:49] arrrrrggghhh [02:35:14] Reedy: The bot is very anti-social to be on a social network :D [02:35:23] It's submissive [02:35:41] Oh so you just like dominating it? :D [02:36:12] It's a hussie [02:36:18] It suggests most people are master [02:36:26] ok, it's back [02:36:27] PROBLEM - Frontend Squid HTTP on cp1004 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [02:37:11] thanks [02:39:03] RECOVERY - Frontend Squid HTTP on cp1004 is OK: HTTP OK HTTP/1.0 200 OK - 27545 bytes in 0.109 seconds [03:03:12] RECOVERY - Puppet freshness on search22 is OK: puppet ran at Wed May 16 03:03:03 UTC 2012 [03:03:12] PROBLEM - Frontend Squid HTTP on cp1004 is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out [03:05:16] New review: Krinkle; "Maybe we need a Settings file for labs and for production and CommonSettings.php includes the right ..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/7706 [03:05:48] New review: Krinkle; "Maybe use PrivateSettings for that (which isn't in the repo, but is included)" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/7706 [03:09:17] New review: Krinkle; "(no comment)" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/7577 [03:10:36] Nemo_bis: I recall you showing interest in dbbot-wm . FYI, the code is now up on GitHub. including documentation. 
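
An aside on the headers quoted near the top of this log (Server: nginx/0.7.65, X-Cache / X-Cache-Lookup MISS from cp1004 and cp1012): a minimal sketch, not from the log, of pulling those response headers to see which cache layer answered a request. The URL is an arbitrary example.

    <?php
    // Fetch the response headers for a request and print the ones discussed above.
    $url = 'https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&format=json';
    $headers = get_headers( $url, 1 );   // 1 = return an associative array

    foreach ( array( 'Server', 'X-Cache', 'X-Cache-Lookup' ) as $name ) {
        if ( isset( $headers[$name] ) ) {
            $value = is_array( $headers[$name] ) ? implode( ', ', $headers[$name] ) : $headers[$name];
            echo "$name: $value\n";
        }
    }
    // "MISS from cp1004.eqiad.wmnet, MISS from cp1012.eqiad.wmnet" means neither the
    // backend nor the frontend squid had the object cached; the Server header reflects
    // whichever backend actually produced the response body.
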
[03:10:52] https://github.com/Krinkle/ts-krinkle-Kribo https://github.com/Krinkle/ts-krinkle-wmfDbBot [03:14:27] !log depooling cp1004 and stopping the squid backend service to let some connections close [03:14:32] Logged the message, Master [03:16:06] RECOVERY - Frontend Squid HTTP on cp1004 is OK: HTTP OK HTTP/1.0 200 OK - 27543 bytes in 0.113 seconds [03:17:54] PROBLEM - Backend Squid HTTP on cp1004 is CRITICAL: Connection refused [03:22:47] !log repooling squid frontend on cp1004 [03:22:51] Logged the message, Master [03:29:18] RECOVERY - Backend Squid HTTP on cp1004 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.108 seconds [03:30:41] New patchset: Hashar; "redirect (302) /w/ to /w/index.php" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/7772 [03:34:46] !log stopped the squid process on cp1004 and stopped puppet to avoid it being restarted. it's having issues and I can't debug it right now. [03:34:50] Logged the message, Master [03:35:26] @externals [03:35:26] Krinkle: [all.dblist] last update: 2012-05-16 01:40:02 (UTC); [db.php] last update: 2012-05-16 01:40:02 (UTC) [03:36:39] PROBLEM - Backend Squid HTTP on cp1004 is CRITICAL: Connection refused [03:45:39] Hey guys I'm getting constant API errors now, mostly from cp1005.eqiad.wmnet [03:45:49] New patchset: Hashar; "disable "last message repeated n times"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7773 [03:46:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7773 [03:48:15] kaldari: What's the error msg? [03:48:41] same as it's been all afternoon: Error: ERR_SOCKET_FAILURE, errno (98) Address already in use [03:49:08] Ugh [03:49:19] And Ryan is on his way home [03:49:22] He fixed cp1004 earlier [03:49:29] Yeah [03:49:49] Constant? [03:49:55] not anymore [03:50:01] it's just intermittant now [03:50:14] but I was getting it constantly for a few minutes [03:50:19] Yeaaah [03:51:08] I missed Ryan by 10 minutes [03:51:32] New review: Hashar; "$cluster is going to be used to include labs specific settings and, in some files, to tweak settings." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/7706 [03:51:38] I think he had somewhere to be [03:51:48] i suspect there's only really TimStarling around atm.. 
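
The errno (98) in kaldari's API errors above is EADDRINUSE. A standalone sketch, using PHP's sockets extension and an arbitrary loopback port, that reproduces the same failure mode (two sockets competing for one local address/port pair):

    <?php
    // Reproduce "Address already in use" (errno 98, EADDRINUSE), the error the API
    // squids are surfacing above. The loopback address and port are arbitrary.
    $a = socket_create( AF_INET, SOCK_STREAM, SOL_TCP );
    $b = socket_create( AF_INET, SOCK_STREAM, SOL_TCP );

    socket_bind( $a, '127.0.0.1', 18080 );   // first bind succeeds
    socket_listen( $a );

    if ( !@socket_bind( $b, '127.0.0.1', 18080 ) ) {
        $errno = socket_last_error( $b );
        // On Linux this prints: bind failed: errno 98 (Address already in use),
        // i.e. no usable local address/port pair was available for this socket --
        // the same condition squid reports for its outbound connections.
        echo "bind failed: errno $errno (" . socket_strerror( $errno ) . ")\n";
    }
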
[03:52:00] New patchset: Hashar; "override $cluster when on labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7706 [03:52:14] New review: Hashar; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7706 [03:52:17] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7706 [03:53:25] New patchset: Hashar; "implements beta labs specific domains" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7705 [03:53:58] New review: Hashar; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7705 [03:54:00] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7705 [03:54:26] It seems to be failing about 10-20% of the time on API POST requests for me - all cp1005 as far as I can tell [03:54:29] judging by the error message, Ryan just shut down cp1004, he didn't fix it [03:54:42] I mean the log message [03:55:01] Yeah, he said he didn't have time to debug it [03:57:10] yeah, it's spamming the syslog at a high rate [03:57:43] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [04:02:06] mutante mentioned that ops had to upgrade the kernel on the API machines recently, but I don't know the details [04:04:36] Most likely for the uptime bug [04:04:42] bind(4306, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 [04:05:09] !log updating a few plugins on Jenkins (host: gallium ) [04:05:13] Logged the message, Master [04:05:31] doesn't seem right [04:07:01] ok, that is right [04:10:01] kaldari: Ryan despooled cp1005 [04:10:23] kaldari: ho sorry, you were already aware about it :-D [04:10:32] how? [04:10:42] you mean cp1004? [04:11:03] yeah [04:11:05] well, it's the backend process that's an issue [04:11:05] I'm going to turn it off and stop puppet [04:11:10] yeah cp1004 [04:11:14] sorry for the confusion tim [04:12:14] I'm not sure why it is calling bind for outbound connections, is that necessary? [04:15:47] I think so [04:15:53] Man it's been forever since I've done C socket programming [04:17:32] Hah I guess you're right [04:17:39] You don't need to bind() an outgoing socket [04:18:39] Or at least not explicitly [04:18:46] But I seem to recall that connect() calls bind() or something [04:21:43] PROBLEM - Frontend Squid HTTP on cp1005 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [04:23:13] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [04:23:22] RECOVERY - Frontend Squid HTTP on cp1005 is OK: HTTP OK HTTP/1.0 200 OK - 27542 bytes in 0.116 seconds [04:30:15] it looks like you can use bind() to specify a local interface for an outbound connection [04:31:40] the relevant interface is configurable in squid [04:44:39] timstarling - check out -http://ganglia.wikimedia.org/latest/?c=Application%20servers%20pmtpa&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [04:57:47] it looks like something happened about 8 hours ago judging by the graphs here: http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=load_one&s=by+name&c=Application+servers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [04:58:37] that's around the time we did the deploy for PageTriage to en.wiki [05:02:59] would turning off PageTriage in the en.wiki config be worth trying? 
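
An aside on the bind() question above: a minimal sketch of what the strace line (bind to 0.0.0.0 with port 0, then connect) corresponds to for an outgoing connection. The destination address and port are placeholders, and the closing comment is a general statement about ephemeral port exhaustion rather than a confirmed diagnosis of this incident.

    <?php
    // Squid explicitly bind()s an *outgoing* socket to port 0 before connect();
    // port 0 just asks the kernel for an ephemeral port, and an explicit bind is
    // also how a specific source address would be pinned (cf. squid's
    // tcp_outgoing_address directive).
    $s = socket_create( AF_INET, SOCK_STREAM, SOL_TCP );

    // Optional for outbound traffic -- connect() binds implicitly if this is skipped.
    socket_bind( $s, '0.0.0.0', 0 );

    if ( @socket_connect( $s, '10.2.1.1', 80 ) ) {           // placeholder destination
        socket_getsockname( $s, $addr, $port );
        echo "connected, local side is $addr:$port\n";        // the ephemeral port picked
    } else {
        echo 'connect failed: ' . socket_strerror( socket_last_error( $s ) ) . "\n";
    }
    // When the ephemeral port range is exhausted (for example by a very large
    // number of sockets sitting in TIME_WAIT), an explicit bind() like this is
    // one place EADDRINUSE shows up.
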
[05:04:08] TimStarling, Reedy: ^ [05:08:46] ctwoo: ^ [05:09:32] i would defer to tim/roan or sam on this matter [05:15:06] * kaldari whistles to himself [05:15:44] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [05:17:05] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out [05:22:08] Krinkle, thanks; is it linked on Meta somewhere? [05:22:14] or at least wikitech [05:22:24] What is linked ? [05:22:40] Krinkle, dbbot source code [05:22:44] https://www.mediawiki.org/wiki/dbbot-wm [05:22:45] you highlighted me a while ago [05:22:47] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [05:22:48] ok thanks [05:22:51] * Nemo_bis out now [05:24:17] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [05:27:50] those ganglia graphs are broken [05:28:33] i c [05:29:54] !log experimentally started squid on cp1004 [05:29:57] Logged the message, Master [05:30:10] let's see what happens... [05:30:26] RECOVERY - Backend Squid HTTP on cp1004 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.107 seconds [05:30:53] it's the same [05:30:57] just got the socket failure error for cp1004 [05:31:05] I'm stopping it again [05:32:45] now back to cp1005 socket error [05:34:53] RECOVERY - Lucene on search1015 is OK: TCP OK - 9.020 second response time on port 8123 [05:35:47] PROBLEM - Backend Squid HTTP on cp1004 is CRITICAL: Connection refused [05:38:38] hey Ryan [05:38:45] hi [05:38:47] sup? [05:38:53] looks like we're having the same problem from cp1005 as we were having from cp1004 [05:39:02] * Ryan_Lane grumbles [05:39:17] Tim can probably fill you in on more [05:39:40] I haven't found out much [05:39:40] seems the new version of ganglia isn't working very well on the app servers [05:39:44] sounds like i should scroll up [05:39:51] yeah, it's broken [05:40:15] yet its fine on all the other hosts... [05:40:16] weird [05:40:17] but the problem is pretty hard to see on ganglia [05:40:21] yeah [05:40:33] you can see extra system CPU [05:40:38] all of the sockets are being used up [05:41:15] bind() fails, but I'm not sure how that is possible [05:41:16] I didn't get much time to investigate [05:41:22] the FD count according to cachemgr is quite low [05:42:26] 418 for the backend and 6700 for the frontend [05:42:38] that seems awfully low [05:42:42] not enough to cause an ephemeral port exhaustion or anything like that [05:44:02] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out [05:44:31] 93020 connections in time wait [05:44:50] if it helps at all, I first noticed the problem about 7+ hours ago, around 22:00 UTC [05:45:21] the rise in system CPU was around 22:30 on cp1005 [05:45:44] what limit do time_wait sockets count towards? [05:46:14] * TimStarling googles [05:46:14] let's see.... [05:46:20] heh. 
doing the same [05:46:44] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.027 second response time on port 8123 [05:47:48] net.ipv4.ip_local_port_range [05:47:56] I believe [05:48:57] tcp_max_tw_buckets [05:49:24] one site recommends setting tcp_tw_recycle=1 [05:49:45] net.ipv4.tcp_max_tw_buckets = 360000 <— so, it isn't that [05:50:11] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [05:50:48] but tcp_tw_recycle is 0 on cp1003 [05:51:17] question is, why are we seeing so many more TIME_WAIT on cp1005 [05:51:28] I saw the same thing on cp1004 earlier [05:52:08] of course, this could simply be a red herring. [05:52:12] it could [05:52:27] but if I set this thing to 1 then the TIME_WAIT connections should go away and we can get on with other theories [05:52:32] just in soft state [05:52:50] !log on cp1005 setting tcp_tw_recycle=1 [05:52:53] Logged the message, Master [05:53:01] * Ryan_Lane nods [05:53:55] * jeremyb seems to be caught up. doesn't seem to be much I can do to help at this point (from the outside) [05:54:24] down to 34k now [05:54:49] 11k, seems to have plateaued [05:55:32] no more bind errors! [05:55:35] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 192 seconds [05:55:53] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 196 seconds [05:56:02] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out [05:56:12] "Enabling this option is not recommended since this causes problems when working with NAT" <— oh no! we're going to break everything! [05:56:14] heh [05:56:31] my reference is http://www.stolk.org/debian/timewait.html [05:56:49] I'll try it on cp1004 [05:58:17] !log setting net.ipv4.tcp_tw_recycle=1 on cp1005 seems to have fixed it, doing it on cp1004 as well now [05:58:20] Logged the message, Master [05:58:54] RECOVERY - Backend Squid HTTP on cp1004 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.162 seconds [06:01:25] seems to be fixed on my end [06:02:20] a lot of TIME_WAITs were from the appservers [06:02:40] in fact, it looks like the majority of them were [06:03:13] that makes sense, though [06:03:29] does this version of squid not support connection pooling? [06:06:24] dunno [06:07:42] actually, the properly behaving systems don't have a lot of TIME_WAITs to the appservers [06:07:45] what about pipelining? or maybe they're the same thing? [06:10:17] persistent connections is what I'm thinking of, really [06:10:49] maybe these backends were affected because they hold some special high-traffic URLs [06:10:53] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:11:11] they are api squids [06:11:39] if they were api squids, why would they have connections to appservers.svc.pmtpa.wmnet ? [06:11:45] isn't there another VIP for API? [06:11:54] yes [06:12:05] haha, used some generic terms in google and [[Manual:Squid caching]] was the second hit [06:12:05] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.029 second response time on port 8123 [06:12:46] they are definitely listed as API in text-settings, though [06:13:36] separate issue: search1015 seems to need a boot. 
(is intermittent in nagios about same as 1016 a few days ago) [06:14:01] (that's the pool4 recovery that just happened) [06:16:06] night guys, thanks for saving Wikipedia agains ;) [06:16:14] kaldari: night [06:18:58] cachemgr shows an equal number of requests delivered to appservers.svc and api.svc [06:20:02] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [06:20:02] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [06:20:16] !log restarted lucene on search1015 [06:20:20] Logged the message, Master [06:20:44] seems lucene is possibly misconfigured, but that's unrelated to the process being locked up [06:20:56] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.027 second response time on port 8123 [06:22:05] tcpdump shows it too [06:22:10] odd [06:22:25] well, it does have it configured in the file [06:22:33] I wonder why that's the case [06:23:24] the ones in pmtpa do as well, though [06:23:24] where is it configured? I have been looking at the configuration files and it looks fine [06:24:20] I was looking at the backend conf [06:24:34] cp1005:/etc/squid/squid.conf ? [06:24:40] yes [06:25:06] cache_peer_access 10.2.1.1 deny api_php [06:25:18] ah. right [06:25:19] that should prevent api.php requests from going to 10.2.1.1, but I see them in tcpdump [06:25:29] that makes no sense at all [06:25:54] wait [06:25:54] cache_peer_access 10.2.1.1 allow all [06:26:19] yeah, the first matching rule takes precedence [06:26:22] or at least it's meant to [06:27:03] it's configured the same way in pmtpa [06:28:05] pmtpa is also talking to the appservers [06:28:11] on the api squids [06:29:04] "And, finally, don't forget rules are read from top to bottom. The first rule matched will be used. Other rules won't be applied." [06:29:26] http://www.visolve.com/system_services/opensource/squid/squid30/accesscontrols-4.php#http_access [06:29:35] seems not to be working correctly, then :) [06:29:50] if it's changed, everything would break [06:30:41] I've got to go get some groceries [06:30:48] the site's up isn't it? [06:31:09] yeah [06:31:19] the tw recycle change is working for now [06:31:26] I also need to go away [06:31:28] I need to pack [06:32:23] well, with this configuration the api traffic isn't hitting the app servers, but other requests are. [06:32:31] so it makes sense that there is traffic hitting them [06:34:40] I wonder if one of the api servers is acting up [06:35:14] damn it [06:35:18] ding ding ding [06:35:39] ? [06:35:49] 2012-05-16 06:35:09 srv204 enwiki: [7d70df73] /w/api.php Exception from line 44 of /usr/local/apache/common-local/php-1.20wmf2/extensions/PageTriage/api/ApiPageTriageTemplate.php: ApiPageTriageTemplate::execute: template file not found: "/usr/local/apache/common-local/php-1.20wmf2/extensions/PageTriage/modules/ext.pageTriage.views.toolbar/ext.pageTriage.toolbarView.html" [06:36:00] about a billion of those errors [06:36:03] baaaahhhh [06:36:05] a constant stream [06:36:14] did someone fuck up a deploy? [06:36:30] maybe [06:36:48] heh. people don't check the exception log after they deploy something? :D [06:37:07] * jeremyb is hanging getting into bastion1.pmtpa.wmflabs. 
very likely unrelated to all of this [06:37:42] hm [06:37:44] is for me too [06:37:46] that isn't good [06:38:09] home dirs are hanging [06:38:32] labs-nfs1 is having issues [06:38:36] might need to rebootit [06:39:48] OATHAuth is going to make me fix these session issues with labs quicker I can already tell [06:40:05] rotfl [06:40:11] using it already? [06:40:18] yep [06:40:25] I'm likely going to force cloudadmins to use it [06:40:53] i heard [06:41:46] well sure enough that html file ain't in there [06:41:53] I'll see if I can find it somewhere in the tree [06:42:17] question is, is it supposed to be there, or not supposed to be referenced anymore? [06:42:40] I don't know, but if it's referenced then it can darn well be there for right now [06:42:40] hm. looks like labs-nfs1 reboot itself [06:42:46] yeah [06:44:04] well that's bad [06:44:46] well, it probably reboot itself because it patched and needed a reboot [06:45:00] it's not a great instance to have that happen on, though. heh [06:45:21] not in wnf2 or wmf3. so looks like broken code. meh [06:45:40] it could be in an older version of the extension [06:45:43] did you chek its log? [06:46:16] no, I'm about to do that now [06:47:31] that missing file was introduced in https://gerrit.wikimedia.org/r/7764/ (~6 hrs ago) [06:48:14] I see the refereence is in some javascript here [06:49:02] maybe it wasn't deployed? [06:49:37] yep [06:49:40] wasn't deployed [06:49:52] maybe we should page kaldari [06:50:28] I'm going to page him [06:50:30] good [06:50:31] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:50:56] what determines what version of an extension is deployed? [06:50:59] it's just bordeline getting too late i.e. I don't yet feel guilty about it [06:51:28] the extensions megarepo just has master and no other branches [06:51:37] quick hack so the unfinished toolbar won't load if this code gets randomly deployed, since it's in master (oops!) [06:51:40] that's awesome :-D [06:51:49] it could be causing the issue. [06:52:00] it's for sure spamming the hell out of the logs [06:54:35] well something is causing that piece to execute,. master or no. so yeah let's hope he shows up in here soon [07:00:08] who is raindrift? that's our other option [07:00:08] https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/extensions/PageTriage.git;a=commitdiff;h=d80183f6a69d14a14d77b929ae8b6b9638e84a23 [07:03:27] ian [07:04:11] could ping him I guess [07:05:52] 21:09 logmsgbot: raindrift synchronizing Wikimedia installation... : PageTriage update [07:06:01] that's from yesterday [07:06:05] * Ryan_Lane nods [07:06:45] you wanna ping him? (I have a soft voip phone so I'm restricted to actual phone calls I guess) [07:07:22] but I am happy to sit in here and work it out with whoever shows up (as in you should go to bed) [07:07:36] pages [07:07:36] *paged [07:07:37] thanks [07:07:40] I still need to pack. heh [07:07:46] ok. well get packing :-D [07:07:50] and now labs is having some issue [07:07:56] oh. right :-/ [07:08:03] I bet you anything it's due to gluster again [07:08:24] gluster is turning out to be a real ball-buster isn't it (scuse the language) [07:08:27] I'm hoping the load goes back down to normal [07:08:31] yes. 
it's killing me [07:08:33] so nice in theory [07:08:41] so fubar in real life [07:09:16] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:09:17] if there is something you would want me to try/do if it doesn't behave, I can babysit it [07:09:35] I'm thinking it's actually due to labs-nfs1 [07:10:32] raindrift: heh. funny autocorrect too [07:10:41] hello raindrift [07:10:41] totally [07:10:43] hi! [07:10:45] sorry for the page, [07:10:49] so, I guess our stuff is broken? [07:10:53] what's going on is a pile of errors in the logs: [07:10:55] (yeah) [07:11:01] raindrift: check /home/w/log/exception.log [07:11:05] 2012-05-16 06:35:09 srv204 enwiki: [7d70df73] /w/api.php Exception from line 44 of /usr/local/apache/common-local/php-1.20wmf2/extensions/PageTriage/api/ApiPageTriageTemplate.php: ApiPageTriageTemplate::execute: template file not found: "/usr/local/apache/common-local/php-1.20wmf2/extensions/PageTriage/modules/ext.pageTriage.views.toolbar/ext.pageTriage.toolbarView.html [07:11:07] like these [07:11:15] and so there is no View.html yet of course [07:11:25] er ext.pageTriage.toolbarView.html [07:11:56] what I don't understand is why that stuff is being executed at all [07:12:57] i don't either. [07:13:07] grrrr... I love it when other european leaders tell us what our elections are for. (sorry, have morning news program on. prolyl shouldn't, it just pisses me off) [07:13:19] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:13:24] I am guessing that it has to do with yesterday's deployment, what else could it be [07:13:42] anyway, the fix for this is to make the api not throw those errors, since you can request any template file. i'll go fix it. just a minute. [07:13:48] ok [07:13:51] does it need to be deployed at all? or is it just for e.g. test/test2? [07:13:53] well, the file is actually missing [07:13:59] sure. [07:14:10] but it's for a feature that should be disabled. [07:14:13] ah [07:14:32] the toolbar's not done yet, and is therefore turned off. [07:14:39] at least in theory :-D [07:14:54] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:15:03] i don't understand what the procedure is to get the current deployed commit id/tag for an extension. PageTriage doesn't even have tags or branches (besides master) it seems [07:15:28] no branches. right [07:19:41] anyone feel like reviewing a change? https://gerrit.wikimedia.org/r/7775 [07:20:40] what happens when that returns false (I don't feel like digging through the rest of the code)? [07:21:17] apergos: look at the 3 lines above [07:21:56] that doesn't answer my question [07:22:12] yeah, maybe that should just return, actually. [07:22:23] what does the caller do when execute returns false, that was my q [07:22:25] ok [07:24:07] okay, fixed [07:25:50] so here I am at patchset two and it says return false... [07:26:28] wtf [07:26:47] there's no difference between the patchsets [07:26:59] ok, so it's not just me hatin' on gerrit's ui [07:27:01] besides commit msg [07:27:49] sorry, i'm still not used to specifying "-a" every time I want to commit. [07:28:00] should be better now. [07:28:33] the thing I don't get is why that js code is being executed at all. It shouldn't be. 
The only way it should run is if $wgPageTriageEnableCurationToolbar is set, which it isn't. [07:28:50] worth testing but given the hour where you are, not right now [07:29:07] (unless you are now so irritated that you have to fix it before you go to bed :-D) [07:29:15] no, i have things to do in the morning. [07:29:23] merged [07:29:31] awesome. thanks. [07:29:38] should i deploy this? [07:29:48] * apergos grits teeth [07:29:56] what else would need to go out with it? [07:30:06] i mean, it's one file. i could just sync the file. [07:30:24] yes, just the one file [07:30:35] sounds good. i'll do that. [07:32:20] prolly worth figuring out why the code is executed soon-ish (next couple of days) [07:35:11] yeah. i'm working on that section actively, so i'll do it tomorrow. [07:35:28] cool, I'm curious to learn the answer too :-D [07:37:00] * jeremyb just subscribed to notifications for all changes on the PageTriage repo ;-P [07:37:07] heh [07:37:39] oooooh, yay, there's an option now to not use that flash clipboard crap [07:37:54] *flash* clipboard? eeewww [07:39:11] there's always been that option [07:40:08] I have never noticed it, (which is a good thing) [07:41:15] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:42:47] okay. sync'd. [07:43:04] raindrift: thanks [07:43:06] no more errors [07:43:07] i'm going to bed now. i'll have my phone if y'all need me. [07:43:11] ok. thanks [07:43:17] (yah the log looks lots better) [07:43:27] thanks for letting me know. still weird that it was doing that at all. [07:43:47] like, that RL module shouldn't even have been loaded. But, whatever. I'll work it out later. Good night! [07:43:51] we live in the wierd zone, what can I say [07:43:55] night! [07:44:08] ok. I better pack [07:44:11] go go go [07:44:24] I think I fixed the labs issue [07:44:29] oh yay [07:44:32] so what was it? [07:44:32] stupid NFS server causing cascading failure [07:44:39] ah nfs [07:44:44] the bane of our existence [07:44:47] well, it's an nfs instance [07:44:47] Ryan_Lane: well i guess i didn't look so closely before then. certainly would have invoked it earlier if i'd seen it [07:47:27] oh sweet. my flight has wifi [07:48:20] lucky! [07:49:04] wow [07:49:10] and the exit row was open and doesn't cost extra [07:50:22] and it's the long flight :D [07:50:25] \o/ [07:51:45] sweet! [07:51:52] well booked [07:52:18] it is the middle of the week... 
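
On the toolbar question above, a hypothetical sketch of the gating being described, not the actual PageTriage code or the committed fix: the module name is taken from the missing-file path quoted earlier in the log, and the hook-based shape is an assumption.

    <?php
    // Hypothetical sketch: only ask ResourceLoader for the unfinished toolbar module
    // when $wgPageTriageEnableCurationToolbar is set, so none of its API calls (or
    // the missing .html template) are reachable in production.
    $wgHooks['BeforePageDisplay'][] = function ( $out, $skin ) {
        global $wgPageTriageEnableCurationToolbar;

        if ( !empty( $wgPageTriageEnableCurationToolbar ) ) {
            $out->addModules( 'ext.pageTriage.views.toolbar' );
        }
        return true;
    };
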
[07:52:30] but yeah, luck too [07:53:32] the second flight charged for the exit row [07:53:45] it's exit row and aisle at that :) [07:57:00] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:57:32] nacht [08:07:13] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7726 [08:07:15] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/7726 [08:12:53] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:13:45] @docs [08:13:45] Krinkle: https://www.mediawiki.org/wiki/dbbot-wm [08:13:45] @externals [08:13:46] Krinkle: [operations/mediawiki-config.git] Checked out HEAD: 49ce19eeca5b8238e096b089222312f9258285be - https://gerrit.wikimedia.org/r/gitweb?p=operations/mediawiki-config.git;a=commit;h=49ce19eeca5b8238e096b089222312f9258285be [08:13:57] @info db36 [08:13:57] Krinkle: [db36: s1] 10.0.6.46 [08:14:03] @info 10.0.6.46 [08:14:03] Krinkle: [10.0.6.46: s1] db36 [08:14:44] great, its working. No longer fetching *.php.txt from noc.wikimedia.org [08:18:44] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [08:26:59] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:53:23] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:55:11] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 190 seconds [08:56:05] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 213 seconds [09:08:52] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , trwiktionary (13946) [09:11:34] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:17:52] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:18:37] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , trwiktionary (17067) [09:26:25] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:27:46] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:27:46] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:29:52] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [09:29:52] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [09:49:31] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [09:55:13] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [10:18:35] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:41:59] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 
processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:42:44] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:43:17] New patchset: ArielGlenn; "verify toc of tarballs; clean up dup code" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/7783 [10:44:49] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7783 [10:44:51] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/7783 [10:57:39] New patchset: Lcarr; "adding in analytics1-b-eqiad subnet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7784 [10:57:44] mutante: wanna review that? ^^ [10:57:56] on it [10:57:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7784 [10:58:25] mixed tabs/spaces [10:58:46] amending:) [11:00:03] ah i just copied and pasted the labs one ;) [11:00:04] meh, fetching from gerrit.. and waiting.. [11:00:06] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:00:24] timeout from gerrit..wth [11:00:26] copy and pasting errors is bad style ;-) [11:01:02] hehe [11:03:42] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:03:59] New patchset: Dzahn; "adding in analytics1-b-eqiad subnet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7784 [11:04:01] replaced "labs-hosts1-b-eqiad " [11:04:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7784 [11:05:06] meh :) one more [11:07:07] also waiting on push-.. i have connection issues..hmmm [11:09:21] New patchset: Dzahn; "adding in analytics1-b-eqiad subnet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7784 [11:09:24] there [11:09:41] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7784 [11:13:31] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7784 [11:13:33] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7784 [11:19:18] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:55:09] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:01:39] New patchset: ArielGlenn; "besides add missing arg, don't quite verification after first missing file" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/7786 [12:02:39] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7786 [12:02:41] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/7786 [12:17:10] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:21:58] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:25:16] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:30:58] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:31:16] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:42:40] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [13:01:52] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [13:11:01] PROBLEM - Puppet freshness on cp1004 is CRITICAL: Puppet has not run in the last 10 hours [13:27:05] hiiiiiiyaaaa [13:27:59] hello ottomata [13:28:16] someone's... enthusiastic today :P [13:29:13] hi there! [13:29:20] heh [13:29:23] aye! [13:29:29] got stat1 waiting for precise! [13:29:32] thought I'd poke a bit [13:29:32] https://rt.wikimedia.org/Ticket/Display.html?id=2946 [13:31:34] New patchset: Ottomata; "{role,misc}/statistics.pp - installing generic mysqld on stat1." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7285 [13:31:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7285 [13:32:24] New review: Ottomata; "Oooook, using generic::mysql::* classes is better." 
[operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/7285 [13:52:34] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [13:59:10] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:03:31] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:13:54] hi [14:14:19] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:14:55] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:16:34] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:19:31] hashar: you free for a minute? [14:19:37] sure [14:19:38] hashar: welcome :) [14:19:44] paravoid: just woke up :-(((( [14:19:46] at 4pm!!! [14:19:50] heh [14:20:04] my sleep habits are totally broken doh [14:20:04] I just woke up too :( [14:20:19] the good news are, I shifted my schedule a bit and run some errands in the morning, so I'll be available until late in case you need me [14:20:22] kaldari: are you in Europe? :-( [14:20:29] California [14:20:36] 7am here [14:20:54] so that is about right to wake up!!! Plus the sun is already shinning outside. Perfect time to wake up! [14:21:02] anyway, what were you willing to ask? [14:21:15] Ryan Lane mentioned that PageTriage extension was throwing exceptions. Can you tell how often? [14:21:35] paravoid: I am out this evening meeting some friends, so will probably stop working in 2 hours roughly [14:22:02] I'm pretty sure I know why it was throwing exceptions (due to a race condition), but I wanted to find out how bad the problem is [14:22:13] kaldari: too there was like a hell a lot of errors this morning (aka like 8 hours ago) about some view file missing ( a .html ) [14:22:17] do you have access to the cluster? [14:22:30] not shell access, no [14:22:30] ' [14:22:47] that is a good thing (ask brion) [14:22:49] although I will later this week probably :) [14:22:53] dont!!! [14:23:01] it will eat all your available time ahaha [14:23:02] I've tried to avoid it for years [14:23:10] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:23:14] but Terry requested it for me [14:24:01] oh my god [14:24:21] it's that bad? [14:24:26] no no [14:24:28] not your [14:24:29] another issue [14:24:32] will see that later [14:24:56] basically, we have all exceptions logged in ONE file [14:25:01] so that is sometime a bit impressive [14:25:07] hooray! [14:25:09] i bet [14:25:17] oh I remember someone fixed an issue with PageTriage this morning [14:25:23] logmsgbot: raindrift synchronized php-1.20wmf2/extensions/PageTriage/api/ApiPageTriageTemplate.php 'fixing exception bug that makes lots of logspam' [14:25:27] and in 1.20wmf3 too [14:25:32] raindrift was the user [14:25:37] ah cool [14:25:54] when the hell did he do that? [14:25:57] https://wikitech.wikimedia.org/view/Server_admin_log [14:26:06] at 7:41 UTC [14:26:19] aka just before 11pm for ya? 
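
Related to hashar's point below that every exception lands in one shared file: MediaWiki can split an extension's noise into its own log group. A sketch of that configuration; the group name and the udp2log destination are placeholders, not what production actually uses.

    <?php
    // LocalSettings-style fragment: route one extension's log traffic to its own
    // channel instead of the shared exception/debug files. Placeholder destination.
    $wgDebugLogGroups['pagetriage'] = 'udp://udp2log.example.wmnet:8420/pagetriage';

    // In the extension, log recoverable problems to the group rather than throwing;
    // wfDebugLog() is a no-op for unconfigured groups, so this is safe everywhere.
    wfDebugLog( 'pagetriage', 'template file not found: ext.pageTriage.toolbarView.html' );
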
[14:26:50] more like midnight I think [14:28:17] so he fixed some exception [14:28:34] and at one point some view file ending in .html was missing on some apaches [14:28:34] is it throwing any currently? [14:28:43] which was throwing error [14:29:16] ah, that would explain a lot [14:29:39] only one exception since 9:30am UTC [14:29:43] which is midnight for you [14:29:43] oh well, I guess I'm late to the party [14:30:10] I knew I shouldn't have gone to sleep :) [14:30:26] well the good thing is that some other people knew how to fix it ! [14:30:42] the only exception I get is [14:30:43] 2012-05-16 10:17:32 mw26 enwiki: [63730017] /w/api.php?action=pagetriagelist&limit=1000&namespace=0 Exception from line 1732 of /usr/local/apache/common-local/php-1.20wmf2/includes/GlobalFunctions.php: Internal error in ApiFormatXml::recXmlPrint: (pages, ...) has integer keys without _element value. Use ApiResult::setIndexedTagName(). [14:30:56] want me to send it to you by email for later fixing? [14:31:07] yes please! [14:31:07] thnkas [14:31:28] rkaldari@wikimedia.org [14:32:10] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:33:21] anyone know what would cause a Could not update the journal database for storage backend "local-NFS". error when Special:Upload ing? [14:33:22] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:33:25] this is on wikimania2013 wiki [14:35:44] maybe Reedy ^ [14:36:09] kaldari: sent. Enjoy your breakfast :-] [14:36:18] thnkas :) [14:46:34] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:54:40] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [14:56:10] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [14:57:11] PROBLEM - NTP on srv278 is CRITICAL: NTP CRITICAL: Offset unknown [14:57:38] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [15:00:56] RECOVERY - NTP on srv278 is OK: NTP OK: Offset 0.08215665817 secs [15:08:53] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time [15:15:11] Jeff_Green: so the additional disks are in for aluminum [15:15:19] the raid controller is backordered, pinging dell again about it [15:15:19] yayyyyyy [15:15:23] booooo [15:15:31] lol [15:15:43] can we install disks today, assuming the FR folks are ok with the downtime? 
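
On the ApiFormatXml exception hashar pastes above: that message is the standard complaint when an API module returns a numerically indexed list without naming its elements for the XML formatter. A hypothetical sketch of the usual fix inside the module's execute(); the 'page' tag and the result shape are illustrative assumptions, not the actual pagetriagelist code.

    <?php
    // Inside the API module's execute(): give the integer-keyed list an element name
    // before adding it, so format=xml can serialise it as <page> tags.
    $pages = array(
        array( 'title' => 'Example one' ),
        array( 'title' => 'Example two' ),
    );

    $result = $this->getResult();
    $result->setIndexedTagName( $pages, 'page' );
    $result->addValue( null, 'pages', $pages );
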
[15:16:36] Yep, we can, just wanted to check with you on a downtime window [15:16:41] ok [15:17:10] i'll let you know as soon as FR folks appear and I can schedule it [15:18:28] cool [15:23:35] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [15:51:38] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [16:04:09] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:18:15] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:23:57] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:24:56] maplebed: https://gerrit.wikimedia.org/r/7798 [16:26:16] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7798 [16:26:18] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7798 [16:26:19] morning Patrick [16:28:59] !log deploying gerrit change 7798 to the mobile varnish servers [16:29:03] Logged the message, Master [16:29:21] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:31:15] !log clearing the mobile varnish cache [16:31:18] Logged the message, Master [16:35:42] * AaronSchulz keeps hearing the word "porno" [16:36:33] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:37:20] AaronSchulz: set fenari to be your socks proxy and load en.m.wikipedia.org [16:38:05] and you get pr0n? [16:38:22] \o/ [16:39:56] <^demon> That must be an awful problem to have. [16:40:13] Reedy: you there? [16:40:36] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:41:04] Reedy: can you tell me more about "enable centralauth logging to file" ? [16:41:11] it's making the apaches rather unhappy: http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=load_one&s=by+name&c=Application+servers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [16:41:19] could it be sent to nfs instead? [16:42:16] notpeter: nfs imposes higher iowait times; won't that just make the problem worse? [16:42:35] maplebed: via udp2log [16:42:39] maplebed: a porn article or actual porn? [16:42:56] it's how we do most of our agregated logging, as far as I know [16:43:10] preilly suggests that the header is inadventently getting cached (and is figuring out how to fix it.) [16:43:16] they go to nfs1/2 via udp2log [16:43:35] notpeter: you sure? not syslog? [16:44:05] * Damianz gives you some hadoop and a little scribe :D [16:44:06] apache error logs get sent to nfs via udp syslog, php logs via udplog [16:44:19] hurray! we're both right! [16:44:24] Damianz: hurry [16:45:13] * AaronSchulz is afraid to set a socks proxy [16:50:04] !log running ipblocks schema migration on all s7 dbs via osc [16:50:08] Logged the message, Master [16:50:48] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:50:54] RobH: what's feasible from your end re. time to work on aluminium? 
[16:51:01] !log udpating dns for osm web servers [16:51:04] Logged the message, RobH [16:51:20] Jeff_Green: im just racking stuff and working away the backlog, so can do it whenever [16:51:24] ah ok [16:51:37] i figure i will go do lunch sometime soon, but my afternoon is open as well [16:51:41] binasher: is that the schema change needed for ipv6? [16:51:41] they say now is actually a good time, if that works for you [16:51:45] I will be here from now till about 7pm est [16:51:47] cook [16:51:50] cool even [16:51:51] cook! [16:51:55] Jeff_Green: that works for me [16:52:02] ok one sec, lemme just make sure they're clear [16:52:16] yay no more porn! [16:52:19] ok, you going to shut it down or shall I? (when you insure its clear) [16:52:22] err... booo no more porn. [16:52:25] paravoid: nope, that'll come next.. this is a change needed for new block functionality in mediawiki [16:52:46] RobH: i'll do it [16:52:51] I keep hearing about porn, I surely must be missing something :) [16:53:59] RobH: it's shutting down now [16:55:01] paravoid: It's a known fact that the internet leads to penises [16:55:36] PROBLEM - Host aluminium is DOWN: PING CRITICAL - Packet loss = 100% [16:56:09] !log running ipblocks schema migration on all s6 dbs via osc [16:56:12] Logged the message, Master [16:56:29] binasher: osc? [16:56:36] online schema change? [16:56:42] yeah [16:57:17] Jeff_Green: ok, going to go add the disks and power it back on [16:57:27] !log aluminum shut down for hard disk additions [16:57:30] Logged the message, RobH [16:57:42] RobH: k [16:58:35] !log running ipblocks schema migration on s5/dewiki via osc [16:58:38] Logged the message, Master [16:59:10] !log running ipblocks schema migration on all s4 dbs via osc [16:59:13] Logged the message, Master [17:00:51] !log running ipblocks schema migration on all s3 (819) dbs via osc [17:00:55] Logged the message, Master [17:06:32] Jeff_Green: ok, its booting back up [17:06:35] great [17:07:46] RECOVERY - Host aluminium is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms [17:08:22] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [17:08:22] Jeff_Green: ok, its back up [17:08:23] RobH: it's up. thanks! [17:08:25] ha [17:08:31] !log aluminum back online [17:08:34] Logged the message, RobH [17:08:58] cool, returning to racking and such, afk [17:14:05] Reedy: ping? [17:16:08] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7753 [17:16:10] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7753 [17:16:58] !log deploying change to swift to make which containers write thumbs configurable [17:17:01] Logged the message, Master [17:38:25] !log running ipblocks schema migration on all s2 dbs via osc [17:38:28] Logged the message, Master [17:40:32] !!log running ipblocks migration on enwiki via osc [17:42:39] ahhhhhem :) [17:42:40] https://rt.wikimedia.org/Ticket/Display.html?id=2946 [17:43:03] !log ipblocks migration completed for all wikis [17:43:06] AaronSchulz: ^^^ [17:43:06] Logged the message, Master [17:43:15] http://wikitech.wikimedia.org/view/How_to_do_a_schema_change [17:43:24] do we really have a lot of tables with no PKs? 
[17:43:29] that seems like an anti-pattern [17:44:00] We do, yeah [17:44:01] we have some, and we have a lot of tables where the pk or only unique key is multi-column, although that's no longer a big problem [17:44:02] There's a bug about it [17:44:10] but the no-pk thing is really vexing [17:44:34] <^demon> What tables have no pk? [17:44:38] Reedy: remember how logging didn't have a PK? [17:44:41] https://bugzilla.wikimedia.org/show_bug.cgi?id=15441 [17:44:48] tim had some fun doing that schema change ;) [17:45:02] https://bugzilla.wikimedia.org/show_bug.cgi?id=15441#c2 [17:45:26] <^demon> lol hitcounter. [17:45:31] isn't the first unique index the PK? [17:45:37] (if not given explicitly) [17:46:02] AaronSchulz: yes [17:46:16] so really it's the "No PK or unique index:" that are an issue [17:46:27] funny that the list has all of the worst tables (for other reasons) [17:47:10] I guess you can assume that if there is no PK, that the table probably was poorly designed for other reasons too [17:50:40] hi RoanKattouw [17:51:10] I had a problem earlier on wikimania2013wiki -- [15:33:21] anyone know what would cause a Could not update the journal database for storage backend "local-NFS". error when Special:Upload ing? [17:51:24] problem's still there AFAIK [17:51:59] AaronSchulz: ---^^ [17:52:52] NFS sucks [17:55:34] the list of tables without a unique or primary key defined in bugzilla 15441 is missing a few, from looking at enwiki [17:56:06] lol [17:56:13] That doesn't suprise me [17:56:25] I think in another bug we noticed some were added to newer tables, but never patched back [17:56:32] Reedy: was that a new wiki [17:56:44] I think that was from the SQL file, so effectively [17:56:56] It's 4 months ago :p [17:56:57] AaronSchulz: Yes that's a new wiki [17:56:57] * AaronSchulz has "addWiki.php and filejournal table" on his todo list [17:57:16] https://www.mediawiki.org/wiki/Special:Code/MediaWiki/5210 is at fault [17:57:19] for user [17:57:22] older than I thought though [17:57:34] Yeah, look at https://bugzilla.wikimedia.org/show_bug.cgi?id=33228 [17:58:52] Reedy: addWiki.php needs to run the optional filejournal table patch [17:59:11] lol [17:59:16] FIXITFIXITFIXITFIXITFIXITFIXITFIXITFIXIT [17:59:28] I just never got around to this, sounds like an extra sourceFile() call [17:59:58] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [18:00:00] yeah, should be enough [18:00:23] * AaronSchulz volunteers Reedy >:) [18:00:35] Where's the file? [18:01:01] When did you create those? [18:01:20] archives/patch-filejournal.sql [18:02:16] Thehelpfulone: it should work now [18:02:29] AaronSchulz: when did those tables start being needed? [18:03:38] Wondering if any other new wikis need it.. [18:03:40] AaronSchulz: thanks [18:03:44] '19:54 logmsgbot: aaron synchronized wmf-config/CommonSettings.php 'Moved remaining wikis over to new backend config' ' [18:03:47] Reedy: may 8 [18:03:48] !log running recentchanges.rc_ip (ipv6) schema migration on all s7 dbs via osc [18:03:51] Logged the message, Master [18:03:57] Ah, so only wm2013 at fault [18:04:02] /with issues [18:04:19] binasher: which osc are you using? 
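
A sketch of the "extra sourceFile() call" discussed above for addWiki.php, so newly created wikis get the optional filejournal table. Exactly where this belongs in addWiki.php, and how it obtains its database handle, are assumptions; the patch path is the one named in the conversation.

    <?php
    // Hypothetical addition to addWiki.php: after sourcing tables.sql for the new
    // wiki, also apply the optional filejournal patch so the file backend's journal
    // writes have somewhere to go.
    global $IP;
    $dbw = wfGetDB( DB_MASTER );   // assumption: the handle for the newly created wiki
    $dbw->sourceFile( "$IP/maintenance/archives/patch-filejournal.sql" );
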
[18:05:40] AaronSchulz: fenari:/home/asher/db/pt-online-schema-change-2.1.1-no_child_table_patch via /home/asher/db/run-online-schema-change [18:05:56] ok, the pt one [18:10:04] i don't think i'm going to document things until the next version comes out / i don't have to modify it / trust it to run without watching it like a hawk [18:10:15] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [18:10:31] !log running recentchanges.rc_ip (ipv6) schema migration on all s6 dbs via osc [18:10:35] Logged the message, Master [18:11:58] Reedy, the interwiki map, http://meta.wikimedia.org/wiki/Interwiki_map -- apparently wm2013: links don't work, they were added the day before the sync date? please can you run the script again? [18:14:16] Later [18:16:07] yeah no problem, I realised you're in deployment atm [18:18:20] hm frwiki.recentchanges had about 1 million rows and took 2 minutes [18:19:05] but db46 got some replag [18:19:15] Feck [18:19:15] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [18:19:16] meh, snapshot host [18:19:18] That's quite amazing [18:21:39] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [18:21:56] !log running recentchanges.rc_ip (ipv6) schema migration on s5 dbs via osc [18:21:59] Logged the message, Master [18:25:16] !log synced wikiversions.* files from NFS to spence local to prevent death of check_job_queue monitoring [18:25:19] Logged the message, Master [18:27:02] !log running recentchanges.rc_ip (ipv6) schema migration on s3 dbs via osc (s4 already completed during prior testing) [18:27:05] Logged the message, Master [18:29:09] binasher: you saw my note about db39 yesterday? (just making sure I didn't "fix" something in a way to break something else) [18:29:25] apergos: no, i didn't [18:29:29] ah [18:30:07] * binasher searches email for db39, doesn't see anything [18:30:13] what happened? [18:30:24] here and in log, jsut a sec [18:30:42] and I forgot to mention it yesterday [18:30:54] ways to scare binasher [18:31:07] :) [18:31:10] 10:37 apergos: on db39 dropped triggers pt_osc_elwiki_recentchanges ins, del, upd, they were preventing all elwiki edits except bot edits with the complaint Table 'elwiki._recentchanges_new' doesn't exist ... binasher, doublecheck me please? [18:31:36] I looked and did not see any other dbs with TRN/TRG files on db39 [18:31:39] only elwiki [18:31:41] weird [18:31:51] odd [18:32:11] I figure this might have had to do with the commonswiki stuff [18:32:22] hrm, that would have been an artifact of some failure case testing i was doing [18:32:22] it jives with the time that edits stopped happening over there [18:32:29] ok [18:32:55] all right, just wanted to make sure you knew and that I didn't break something [18:35:04] nope.. just me breaking things! sigh [18:40:15] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [18:44:48] ok cool [18:46:41] apergos: thanks for fixing! [18:47:19] sure [18:47:25] thanks for checking [19:13:38] New patchset: Reedy; "Bring sync-dblist somewhere upto date, using sudo, fan and timeout" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7820 [19:13:47] !log restarting ganglia on nickel [19:13:50] Logged the message, notpeter [19:13:59] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7820 [19:14:29] !log manually ran ddsh -cM -g mediawiki-installation -o -oSetupTimeout=30 -F30 "sudo -u mwdeploy rsync -a 10.0.5.8::common/*.dblist /usr/local/apache/common-local" because sync-dblist is woefully out of date.. [19:14:33] Logged the message, Master [19:16:03] Reedy: reminds, me sync-wikiversions should really do the dat and cdb in command, it would be a bit faster [19:17:05] !log running recentchanges.rc_ip (ipv6) schema migration on s2 dbs via osc [19:17:08] Logged the message, Master [19:25:19] !log running recentchanges.rc_ip (ipv6) schema migration on enwiki master (5.2mil rows) via osc - batten down the hatches! [19:25:22] Logged the message, Master [19:26:37] AaronSchulz: can you pause the sha1 backfill? [19:27:16] binasher: hrm, lets see [19:27:24] not easily tbh [19:28:43] is it running in a single process? [19:29:05] there are a handful, I'm pruning the screens that finished [19:29:26] just send them all a SIGSTOP [19:29:45] and you can send them a SIGCONT in about 10 minutes [19:30:15] New patchset: Dzahn; "add 1.20wmf3 as "good" version, declare 1.17 not good anymore now" [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/7822 [19:30:24] actually db12 is the only enwiki db thats struggling.. that box is sad, and needs a new kernel [19:30:50] AaronSchulz: don't worry about it [19:30:51] New review: Dzahn; "(no comment)" [operations/debs/wikistats] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7822 [19:30:53] Change merged: Dzahn; [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/7822 [19:31:56] ok [19:32:30] the best ganglia graph i've ever seen (hattip notpeter) http://ganglia.wikimedia.org/latest/graph.php?r=20min&z=xlarge&c=Application+servers+pmtpa&m=load_one&s=by+name&mc=2&g=mem_report [19:32:46] LOL [19:33:25] heh [19:33:45] I still don't understand why ganglia is ok everywhere except the app servers [19:33:52] it's apparently not happy with the upgrade to ganglia [19:33:55] negative memory! they have data from the future [19:34:44] Reedy: where in git are the sync scripts? [19:34:53] * AaronSchulz can't find them [19:34:57] files/misc/scripts [19:35:04] in the puppet repo [19:35:20] ahh, I was looking in files/misc/, didn't notice scripts [19:36:16] heh [19:36:20] it's not obvious [19:38:26] !log shutting down mysql on db46, preparing to reboot for kernel upgrade [19:38:29] Logged the message, Master [19:39:45] New patchset: Aaron Schulz; "Sync dat and cdb files at once to go a bit faster." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7823 [19:40:05] New review: gerrit2; "Lint check passed." 
[19:43:38] New review: Reedy; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/7823 [19:44:52] !log rebooted db46 [19:44:54] Logged the message, Master [19:45:58] PROBLEM - Host db46 is DOWN: PING CRITICAL - Packet loss = 100% [19:49:07] RECOVERY - Host db46 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [19:50:53] !log recentchanges.rc_ip migration completed [19:50:56] Logged the message, Master [19:51:20] !log stopping mysql on db12 [19:51:22] Logged the message, Master [19:55:43] PROBLEM - MySQL Replication Heartbeat on db46 is CRITICAL: CRIT replication delay 773 seconds [19:56:10] PROBLEM - MySQL Slave Delay on db46 is CRITICAL: CRIT replication delay 769 seconds [19:57:31] PROBLEM - mysqld processes on db12 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [19:57:39] !log rebooting db12 for kernel upgrade [19:57:42] Logged the message, Master [20:00:04] PROBLEM - Host db12 is DOWN: PING CRITICAL - Packet loss = 100% [20:00:49] RECOVERY - Host db12 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [20:01:43] RECOVERY - mysqld processes on db12 is OK: PROCS OK: 1 process with command name mysqld [20:05:01] PROBLEM - MySQL Slave Delay on db12 is CRITICAL: CRIT replication delay 659 seconds [20:05:05] !log converted centralauth.globalblocks from myisam to innodb [20:05:09] Logged the message, Master [20:05:37] PROBLEM - MySQL Replication Heartbeat on db12 is CRITICAL: CRIT replication delay 660 seconds [20:05:39] !log ran ipv6 migrations on globalblocks [20:05:43] Logged the message, Master [20:08:55] RECOVERY - MySQL Slave Delay on db46 is OK: OK replication delay 22 seconds [20:10:07] RECOVERY - MySQL Replication Heartbeat on db46 is OK: OK replication delay 0 seconds [20:19:18] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [20:22:09] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [20:25:20] New patchset: Pyoungmeister; "adding db61 and 62 as s1 slaves" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7825 [20:25:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7825 [20:26:28] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7825 [20:26:30] RECOVERY - MySQL Slave Delay on db12 is OK: OK replication delay 0 seconds [20:26:31] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7825 [20:26:39] RECOVERY - MySQL Replication Heartbeat on db12 is OK: OK replication delay 1 seconds [20:34:07] New patchset: Jgreen; "adding r-base to aluminium/grosley per RT #2972" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7827 [20:34:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7827 [20:36:07] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7827 [20:36:09] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7827 [20:49:08] Reedy, please can you run the interwiki map script now? [21:18:11] mark: Is the IPv6 connectivity testing still a running experiment?
I see your name on that javascript code and it found its way on a lot of wikis in the mean time (through the famous "cut/n/paste all of en.wiki scripts to fix a tiny bug" procedure) [21:18:42] http://ipv4.labs.wikimedia.org/ etc. [21:18:58] domain is still up but still used / cared about ? does it need the traffic still. [21:26:10] Krinkle: mark is not currently here [21:26:39] * Damianz eats some packets [21:26:39] ok [21:26:50] someone else is welcome to answer too, of course, would they know the answer [21:28:24] Krinkle: not sure if it is still known/needed [21:28:44] Krinkle: we're in the netherlands right now and it's 11:30pm ... [21:29:09] can you post up a message on wikitech-l ? [21:29:42] no rush :) [21:29:51] oh you are in the Netherlands [21:29:53] me too :P [21:30:10] Yeah but he's weird [21:30:22] I saw him up at 5am last night :) [21:30:38] there's that [21:30:38] 5am is a normal time [21:30:49] Geographical location does not necessarily correlate with working timezone [21:31:30] well not always, but i am right now [21:31:34] 5am, are you sure ? [21:31:48] because we were working way too late the last two days [21:32:03] but… new servers are racked! yay! [21:32:19] and we roped in another wikimedian who broke down tons of boxes and organized trash and recycling runs :) [21:34:14] LeslieCarr: I was talking about Krinkle being up at 5, not mark [21:34:32] hehe [21:34:39] oh! [21:34:40] :) [21:45:12] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7829 [21:45:14] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7829 [21:45:54] !log reverted this morning's mobile push - tests completed [21:45:57] Logged the message, Master [21:49:23] hu, /mnt/thumbs is empty for hume [21:49:29] *huh [21:50:04] stale? [21:54:04] maplebed: can you peek at https://bugzilla.wikimedia.org/show_bug.cgi?id=31680? I wonder if those files are on swift? [21:56:46] * AaronSchulz sends Reedy back to cr torture [21:56:52] AaronSchulz: yeah, I'll look. [21:58:05] AaronSchulz: the four resolutions in swift (at least in the container listing) are 142, 220, 320, 800. [21:58:15] one sec and I'll see if the other ones are there but not in the container listing. [21:59:07] AaronSchulz: neither file exists on swift (120px or 640px) [21:59:17] Do you want me to annotate the bug with that info or does it not matter? [21:59:35] I guess you can add it [22:01:24] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [22:01:34] added. [22:02:18] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [22:05:36] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [22:06:01] New patchset: Reedy; "Adding fork limits and setup timeouts to normalise scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7831 [22:06:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7831 [22:07:45] New patchset: Reedy; "Sync dat and cdb files at once to go a bit faster." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7823 [22:08:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7823 [22:09:38] New review: Aaron Schulz; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/7823 [22:24:24] anyone know if we switched WM planet into git or if it's still in subversion? 
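On the missing-thumbnail question above (bug 31680): besides reading the container listing, individual renditions can be checked from the command line with the swift client. A small sketch, assuming the sharded thumbnail container naming and a made-up file name; the real container and object names depend on the wiki and file, and credentials come from ST_AUTH/ST_USER/ST_KEY or the -A/-U/-K options:

    # Illustrative names only; nothing here is taken verbatim from the log.
    swift list wikipedia-commons-local-thumb.a1 --prefix 'a/a1/Example.jpg/'
    swift stat wikipedia-commons-local-thumb.a1 'a/a1/Example.jpg/640px-Example.jpg'

A listing that only shows the 142, 220, 320 and 800px entries, plus a not-found error from the stat call, would match what maplebed reports for the 120px and 640px renditions.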
[22:24:30] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [22:24:41] Still in SVN i think [22:24:51] certainly, svn isn't locked [22:27:28] Reedy: cool thanks much [22:29:43] New patchset: Aaron Schulz; "Set $wgSiteStatsAsyncFactor=1 on testwikis." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7833 [22:30:50] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7833 [22:30:52] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7833 [22:46:19] New patchset: Aaron Schulz; "Made mediawikiwiki use $wgSiteStatsAsyncFactor=1." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7835 [22:46:50] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:46:53] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7835 [22:46:55] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7835 [22:55:23] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:56:01] * Krinkle imagines RobH with a magnifying glass, watching the optical particle stream leaving from my laptop, across the ocean, entering the data center for my http request [22:56:53] seen the movie Hackers (1995) recently. incredible how they visualize all that internet stuff. If it where only half that cool to look at :P [22:56:57] were* [22:57:35] Have you not seen the pair of trolls we employ to watch over the packets and implement the firewall limits? [22:58:14] hehe [23:01:36] robla: :) [23:02:11] AaronSchulz: ? [23:12:02] PROBLEM - Puppet freshness on cp1004 is CRITICAL: Puppet has not run in the last 10 hours [23:31:23] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:34:14] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
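The swift-container-auditor alerts that keep flapping through this log (ms-be1, ms-be4, ms-be5) are plain nagios process-count checks, and clearing one is typically just a matter of getting the auditor process running again. Roughly what both sides look like; the plugin path, the check thresholds and the use of swift-init here are assumptions, not commands quoted in the log:

    # Process check matching the alert text above ("processes with regex args");
    # the plugin path and thresholds vary per host.
    /usr/lib/nagios/plugins/check_procs -c 1:1 \
        --ereg-argument-array='^/usr/bin/python /usr/bin/swift-container-auditor'

    # If it reports 0 matching processes, restart the auditor on the backend:
    swift-init container-auditor restart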