[00:00:13] mutante: http.debian.net is a service (listens to port 80). idk how what you linked works but it seems different that what i was talking about [00:00:40] New patchset: Reedy; "Lower email throttle" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45273 [00:00:50] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45273 [00:01:46] jeremyb_: mirrors.ubuntu.com also happens to listen on port 80, but i don't know what the real question was then [00:02:31] !log reedy synchronized wmf-config/InitialiseSettings.php 'Lowering throttle to 5/day for new users, and' [00:02:33] Logged the message, Master [00:04:29] New patchset: Reedy; "(bug 44587) Fix trwiki FlaggedRevs autopromotion config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51172 [00:04:34] mutante: http.debian.net doesn't actually serve files... (besides a description of how it works at the site's root) [00:04:38] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51172 [00:04:49] mutante: otherwise only 302s, iirc [00:06:46] mirrors.u.c seems to only have text files [00:07:52] New patchset: Reedy; "Remove disabled ReaderFeedback" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47532 [00:08:32] New patchset: Andrew Bogott; "Refactor mediawiki::singlenode and wikidata::singlenode into modules." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50451 [00:10:15] btw, I don't know what the cause is, but I've been seeing a significant increase in page load times on Wikimedia sites since about a week. 
Haven't noticed it on any other domains [00:10:27] takes up to a second sometimes for html to arrive [00:10:53] network inspector shows me 500-800ms latency for "waiting" [00:11:10] on initial hit of the document [00:14:55] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [00:15:26] PROBLEM - MySQL Slave Delay on db1049 is CRITICAL: CRIT replication delay 199 seconds [00:15:26] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [00:15:51] notpeter: the runners look fine [00:16:25] RECOVERY - MySQL Slave Delay on db1049 is OK: OK replication delay 0 seconds [00:17:35] PROBLEM - Host search23 is DOWN: PING CRITICAL - Packet loss = 100% [00:17:43] I caught it this time [00:17:51] this request has been going for 8 minutes to bits [00:17:53] it won't time out [00:18:15] https://bits.wikimedia.org/nl.wikipedia.org/load.php?debug=false&lang=nl&modules=jquery.ui.button%2Ccore%2Cdialog%2Cdraggable%2Cmouse%2Cposition%2Cresizable%2Cwidget&skin=vector&version=20130218T165324Z&* [00:18:16] X-Cache:strontium hit (658) [00:18:22] X-Varnish:233645980 204707512 [00:18:32] Server:nginx/1.1.19 [00:18:50] ofcourse now its cached [00:18:59] 27 08:38:50 <+nagios-wm> PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours [00:19:13] probably not relevant.... 
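[Editor's note: the latency probe Krinkle pastes above (the load.php URL plus its X-Cache / X-Varnish / Server response headers and the "waiting" time) can be reproduced from a shell with plain curl. A hedged sketch; the `probe` helper name is made up for illustration, and only standard curl options are used:]

```shell
#!/bin/sh
# Print the cache-related response headers and the total request time for
# a URL, mirroring what the browser network inspector showed in channel.
probe() {
    curl -s -o /dev/null -D - -w 'time_total=%{time_total}s\n' "$1" \
        | grep -iE '^(x-cache|x-varnish|server):|^time_total='
}

# Usage against the bits URL pasted in the log (network access required):
# probe 'https://bits.wikimedia.org/nl.wikipedia.org/load.php?debug=false&lang=nl&modules=jquery.ui.core&skin=vector'
```

[A time_total consistently in the 0.5-0.8s range on an X-Cache hit would point at the frontend (varnish/nginx) rather than the appservers, which matches the suspicion about bits here.]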
[00:19:13] New patchset: Reedy; "Bug 44493 - Simple English Wiktionary local system messages ignored: set $wgLanguageCode to en" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51303 [00:19:29] jeremyb_: nagios is/was lying [00:19:39] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51303 [00:19:40] Reedy: oh, right that rings a bell [00:19:45] (didn't really read much about it) [00:19:56] Ariel checked a couple of hosts and it wasn't true [00:19:58] Krinkle: so, any ideas if your observations are really for the wiki or bits or both? [00:20:16] Reedy: it probably just wasn't sending the traps to both? [00:20:23] (wild guess) [00:20:24] jeremyb_: Could be separate problems [00:21:01] jeremyb_: but yeah, about once or twice an hour all http requests to wmf are taking about 2 seconds to respond. [00:21:05] then everything is fine [00:21:12] Krinkle: bits boxes have been extra bouncy recently (at least according to what i remember nagios saying here) [00:21:23] and bits is separately timing out,sometimes throwing 50x warning, sometimes just not timing out for minutes long [00:21:25] Krinkle: once paravoid restarted a couple boxen iirc [00:21:28] 2-3 days ago [00:21:54] New patchset: Reedy; "Bug 44894 - Set $wgAutoConfirmCount to 10 for ko.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51304 [00:21:55] PROBLEM - MySQL Slave Running on db59 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error BIGINT UNSIGNED value is out of range in (enwiki.article_f [00:22:07] binasher: speak of the devil [00:22:08] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51304 [00:22:45] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error BIGINT UNSIGNED value is out of range in (enwiki.article_f [00:22:50] mariiia [00:24:02] ugh 
[00:24:07] New patchset: Reedy; "More for bug 44894 allow sysops to add/remove confirmed group" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51305 [00:24:25] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51305 [00:26:15] New patchset: Reedy; "Bug 45065 - enable webfonts on sa.wikiquote.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51306 [00:26:31] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51306 [00:27:18] !log reedy synchronized wmf-config/ [00:27:20] Logged the message, Master [00:28:59] Reedy: fyi, since we don't yet have a good scaptrap solution post-svn and pre-gerrit-with-tag-support, I started this page: http://wikitech.wikimedia.org/view/Deployments/Scaptrap, not sure of a better solution for you, is this OK or do you want something else? [00:29:13] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51300 [00:32:19] !log authdns-update to support wikinews.de [00:32:20] Logged the message, RobH [00:33:05] New review: MaxSem; "LoadBalancer and GeoData use random." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43029 [00:35:28] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [00:35:29] RECOVERY - MySQL disk space on neon is OK: DISK OK [00:39:38] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [00:44:58] RECOVERY - MySQL Slave Running on db59 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [00:46:03] New patchset: RobH; "adding support for wikinews.de" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/51309 [00:46:44] New patchset: Lcarr; "merging all icinga configurations to one file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51310 [00:47:30] PROBLEM - MySQL Slave Delay on db59 is CRITICAL: CRIT replication delay 838 seconds [00:47:52] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51310 [00:50:36] ... git review is sloooooooow! 
[00:50:39] New patchset: Reedy; "Bug 44784 - Changing local timezone in ko.wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51311 [00:50:46] New patchset: coren; "Make pre-login sshd banner optional" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51312 [00:51:15] use git push origin HEAD:refs/for/master [00:51:28] RECOVERY - MySQL Slave Delay on db59 is OK: OK replication delay 0 seconds [00:51:45] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51311 [00:52:09] New patchset: Hashar; "(bug 45525) beta: $wgCategoryCollations we support" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51313 [00:52:15] New review: RobH; "robh@fenari:~$ apache-fast-test robtest mw1044" [operations/apache-config] (master) C: 2; - https://gerrit.wikimedia.org/r/51309 [00:52:15] Change merged: RobH; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/51309 [00:52:20] <^demon> Reedy: `git config --global alias.push-for-review "push origin HEAD:refs/for/master"` ;-) [00:52:40] <^demon> And because Ryan was silly when he setup puppet... 
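[Editor's note: ^demon's alias for the `refs/for` push above can be tried safely without touching real config; a minimal sketch, using a throwaway HOME purely so the demo does not modify the user's actual ~/.gitconfig — the alias itself is exactly as quoted in channel:]

```shell
#!/bin/sh
# Set up the push-for-review alias in an isolated HOME so this demo leaves
# the real ~/.gitconfig alone.
HOME="$(mktemp -d)"
export HOME

git config --global alias.push-for-review 'push origin HEAD:refs/for/master'

# `git push-for-review` now expands to the full refs/for push; confirm
# how git stored it:
git config --global alias.push-for-review   # → push origin HEAD:refs/for/master
```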
[00:52:49] <^demon> `git config --global alias.push-puppet "push origin HEAD:refs/for/production"` [00:53:27] New patchset: Reedy; "Bug 44616 - Enable Labeled Section Transclusion extension on id.wikt" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51314 [00:53:52] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51314 [00:54:23] !log reedy synchronized wmf-config/ [00:54:25] Logged the message, Master [00:55:14] New patchset: Ryan Lane; "Use LDAP, requiring ops, for icinga admin" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51315 [00:55:34] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 182 seconds [00:55:44] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 186 seconds [00:56:03] New patchset: Reedy; "Bug 26402 - Labeled section transclusion installation" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47353 [00:56:28] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47353 [00:56:34] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [00:56:44] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [00:57:21] robh is doing a graceful restart of all apaches [00:57:40] !log robh gracefulled all apaches [00:57:42] Logged the message, Master [00:57:55] !log reedy synchronized wmf-config/InitialiseSettings.php [00:57:57] Logged the message, Master [00:58:56] !log dsh restarted eqiad apaches in progress [00:58:58] Logged the message, RobH [00:59:00] New patchset: Ryan Lane; "Use LDAP, requiring ops, for icinga admin" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51315 [01:02:21] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51315 [01:09:36] New patchset: Reedy; "Remove wikimaniawiki entries as it's a redirect to the current wikimania 
event" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51318 [01:10:07] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51318 [01:10:38] !log reedy synchronized wmf-config/InitialiseSettings.php [01:12:34] Logged the message, Master [01:22:03] which search cluster is causing the paging issue ? is it #4 with machines search1015,16,19-22 ? [01:22:25] yea [01:22:26] *yes [01:22:55] ok, just want to look at those logs ... [01:24:41] New patchset: Ryan Lane; "Use ldaps for virt0/1000 connections" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51323 [01:26:39] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51323 [01:29:30] New patchset: Reedy; "lucene.php: simple loadbalancing of requests across datacenters" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43029 [01:31:36] RECOVERY - MySQL Slave Running on db35 is OK: OK replication [01:31:49] New patchset: Reedy; "lucene.php: simple loadbalancing of requests across datacenters" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43029 [01:36:36] PROBLEM - MySQL Slave Delay on db35 is CRITICAL: CRIT replication delay 241160 seconds [01:39:42] RECOVERY - MySQL Slave Delay on db35 is OK: OK replication delay seconds [01:41:42] PROBLEM - mysqld processes on db35 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [01:42:42] RECOVERY - mysqld processes on db35 is OK: PROCS OK: 1 process with command name mysqld [01:45:42] PROBLEM - MySQL Slave Running on db35 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 142 for key PRIMARY on query. 
Default dat [01:48:40] New patchset: Ryan Lane; "Redirect 80 to 443 for icinga and icinga-admin" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51330 [01:49:43] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51330 [02:03:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:04:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [02:10:17] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [02:30:14] !log LocalisationUpdate completed (1.21wmf10) at Thu Feb 28 02:30:14 UTC 2013 [02:30:17] Logged the message, Master [03:09:04] !log reedy synchronized docroot [03:09:06] Logged the message, Master [03:10:29] !log reedy synchronized docroot [03:10:31] Logged the message, Master [03:12:01] !log reedy synchronized docroot [03:12:03] Logged the message, Master [03:24:37] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [03:24:47] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [03:29:14] New patchset: Reedy; "Bug 44335 - Botadmin user group in ml.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51337 [03:29:54] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51337 [03:30:31] !log reedy synchronized wmf-config/InitialiseSettings.php [03:30:33] Logged the message, Master [03:39:47] New patchset: Reedy; "Remove slashes from tel protocol" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51338 [03:40:14] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51338 [03:40:59] !log reedy synchronized wmf-config/InitialiseSettings.php [03:41:00] Logged the message, Master [03:54:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds [03:55:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.058 second response time [03:59:33] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [03:59:43] RECOVERY - MySQL disk space on neon is OK: DISK OK [04:10:13] PROBLEM - Puppet freshness on tin is CRITICAL: Puppet has not run in the last 10 hours [04:10:23] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:10:53] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:25:43] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [04:27:13] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK: HTTP/1.1 200 OK - 633 bytes in 0.001 second response time [04:28:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:34:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.466 second response time [04:39:35] New patchset: Legoktm; "(Bug 45538) Disable "ArticleFeedbackv4" on enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51340 [04:54:13] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [04:58:19] New patchset: Legoktm; "(Bug 45538) Make ArticleFeedbackv5 opt-in for enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51341 [05:00:13] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [05:04:26] Change abandoned: Reedy; "Duplicate of https://gerrit.wikimedia.org/r/#/c/47551/" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51340 [05:05:05] oh oops [05:05:17] :p [05:15:24] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:16:53] PROBLEM - SSH on niobium is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [05:18:04] RECOVERY - Puppet freshness on tin is OK: puppet ran at Thu Feb 28 05:17:53 UTC 2013 [05:25:53] RECOVERY - Puppet freshness on knsq17 is OK: puppet ran at Thu Feb 28 05:25:48 UTC 2013 [05:27:53] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [05:32:53] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:35:43] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [05:36:13] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK: HTTP/1.1 200 OK - 633 bytes in 4.276 second response time [05:42:13] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [05:45:03] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [05:46:55] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.002 second response time on port 8123 [05:51:03] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [05:57:53] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [06:02:43] arg, i seem to have lost my ability to ack payments alerts when we switched to icinga [06:03:02] any OpsEns around who can ACK the check_gcisp on the payments boxes? [06:03:03] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:03:53] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [06:09:38] pgehres: Apparently I can... [06:09:51] my phone thanks you [06:10:03] any chance you can give me perms as well? 
[06:10:08] "Acknowledge service problems" [06:10:11] Not sure, I can look [06:10:13] PROBLEM - Puppet freshness on knsq28 is CRITICAL: Puppet has not run in the last 10 hours [06:10:24] It should be automagic as you should be in the WMF ldap group [06:10:54] Oh, no [06:10:59] Not Authorized [06:11:05] That's stupid, why not tell me originally!? [06:11:15] yeah, same here [06:11:32] although right as you got here the service has recovered [06:11:48] so, unless it flaps, it should be silent for a while [06:12:05] but it sounds like we need some rights around here [06:12:16] hmm [06:12:24] looks like those errors have gone away again.. [06:12:30] need me some tasty rights! [06:12:42] tasty tasty crunchy acls [06:13:43] I guess apergos or some of the SF natives would be the best people to try.. [06:14:26] yeah, /me bets they are all out partying since all Ops are here, except apergo_s [06:14:48] I know they all aren't ;) [06:15:20] heh, well i also know that at least Jeff's phone also expoded [06:15:34] so, i would hope he would some up online [06:16:45] He disappeared earlier, so AFAIK isn't out with some of them [06:16:53] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out [06:16:59] ah [06:18:43] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.000 second response time on port 8123 [06:22:03] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:22:57] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [06:23:43] acls for what? 
[06:24:06] I can't check payments, I don't have access to the fundraising stuff [06:24:15] acknowlagement of payments alerts flapping [06:24:22] ack in icinga [06:24:46] and there goes searchpool4 again >_< [06:24:50] i just want to be able to make my phone stop exploding on 3rd party issues [06:25:06] but i lost that permission in the switchover [06:25:33] I think paravoid had the same issue earlier too [06:25:58] the accounts were moved over [06:26:03] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:27:53] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.003 second response time on port 8123 [06:28:11] !log stabbed searchpool4 and restarted lucene search etc gah hate hate hate [06:28:14] Logged the message, Master [06:28:31] yes I know, "tell us how you really feel about the search cluster" [06:28:48] I'm feeling pain... and anger... [06:29:00] arousal? [06:29:04] no? [06:29:05] no. [06:29:08] okay. [06:29:17] ori-l: I am highly amused [06:29:40] I think you made a typoo [06:31:04] * mwalker resists urge to make jokes about flinging poo [06:31:35] The community do that for us [06:32:04] pgehres: what was your account name? 
[06:32:14] i was able to login as pgehres [06:32:19] in icinga [06:32:27] should be labs/gerrit logins [06:32:29] hm [06:32:30] * pgehres looks for nagios creds [06:32:44] it was pgehres in nagios as well [06:32:57] hyeah I see you [06:33:12] New review: Matmarex; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51313 [06:33:25] I see you in on icinga too [06:34:29] you're supposed to be able to get in on https://icinga-admin.wikimedia.org [06:35:02] i was able to login, just not able to ACK the payments_gcsip is CRITICAL spam [06:35:08] huh [06:35:36] it seems quiet now, so it can likely wait for leslie in the AM [06:35:47] unless you really want to learn icinga :-) [06:37:25] actually, i take that back, payments3 just alerted again [06:38:14] well I am having trouble even authenticating [06:38:18] so there ya go [06:39:43] well, that sure doesn't make things easy [06:39:48] no [06:39:55] I see my creds in there copied straight from nagios and yet [06:40:11] any known exploits in icinga that we can utilizes to elevate privs? [06:40:14] It's been ldap-ed [06:40:36] See if your labs/gerrit creds work [06:42:17] jesus h [06:42:30] I have to use a wikiname to log in? [06:42:37] I think so.. [06:42:56] ok well that is a bit of suckage [06:43:12] anyways I'm in, lemme just record this in keeppassx or I shall never rememebr it [06:45:08] which payments? I see payments1002 and payments4 [06:45:26] none of them at the moment [06:45:29] hahaha [06:45:30] it was all of them [06:46:12] if it flaps does the ack carry over to the next flappage? [06:46:26] no idea [06:54:41] ok here is what I jsut acked: [06:55:00] check_min_fraud on payments1002, 1 and 3 [06:55:08] I didn't see anything else go off for those boxes [06:55:37] kk, thanks. 
the big one that makes my phone go boom is check_gcsip, but that seems OK at the moment [06:55:41] ok [06:55:50] unless our payment processor explodes again, we should be quiet overnight [06:56:03] do you set an expire time? [06:56:05] and I will get Leslie to fix my permissions in the morning [06:56:14] I"m looking at these ack options [06:56:24] i can't do anything [06:56:40] no I mean on nagios [06:56:46] did you set an expire time for the acks? [06:56:59] that's what i mean, i can't ack anything [06:57:04] on nagios [06:57:06] when you had creds [06:57:09] oh [06:57:10] and you could ack them [06:57:24] I'm trying to set it up the way you did [06:57:33] only did it once or twice and it seemed to ack until it recovered [06:57:38] ok [06:57:41] and then it if failed again, it would alert [06:58:17] ah not authorized anyways [06:58:19] sorry [06:58:25] :-D :-D [06:58:25] no worries, thanks for trying [06:58:29] sure [06:58:48] I was guessing the 'sticky ack' would carry over through flaps but... 
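[Editor's note: the acks being discussed are, under the hood, external commands written to the daemon's command pipe. The ACKNOWLEDGE_SVC_PROBLEM syntax below is the standard nagios/icinga external-command format (fields after the service: sticky, notify, persistent, author, comment); the `ack_svc` wrapper and the example host/service names are illustrative only, and doing this for real requires shell access to the pipe rather than the web UI permissions pgehres is missing:]

```shell
#!/bin/sh
# Write an ACKNOWLEDGE_SVC_PROBLEM external command to a nagios/icinga
# command pipe. Flag values used: sticky=2, notify=1, persistent=0.
ack_svc() {
    # $1=command file  $2=host  $3=service  $4=author  $5=comment
    printf '[%s] ACKNOWLEDGE_SVC_PROBLEM;%s;%s;2;1;0;%s;%s\n' \
        "$(date +%s)" "$2" "$3" "$4" "$5" > "$1"
}

# e.g. (path and names as discussed in channel, purely illustrative):
# ack_svc /var/lib/nagios/rw/nagios.cmd payments1002 check_gcsip pgehres 'third-party flap'
```

[A sticky ack (first flag = 2) persists across changes between non-OK states but is cleared on recovery, which matches apergos' observation below that an ack held until the service recovered and then alerted again on the next failure.]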
[06:58:49] moot point [06:59:02] i already replied to the thread on the ops list, so hopefully someone will see that [06:59:08] cool [07:00:50] and on that bombshell, I am off to bed [07:00:56] thanks again Reedy and apergos [08:54:10] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [08:58:08] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [09:02:38] 2.5 hours after restarts on both hosts in the pool [09:02:51] we could restart them every half hour out off cron >_< [09:24:33] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [09:24:43] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [09:57:33] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [09:58:23] RECOVERY - MySQL disk space on neon is OK: DISK OK [10:09:33] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 188 seconds [10:10:03] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 202 seconds [10:11:53] PROBLEM - Puppet freshness on tmh1 is CRITICAL: Puppet has not run in the last 10 hours [10:11:53] PROBLEM - Puppet freshness on tmh1001 is CRITICAL: Puppet has not run in the last 10 hours [10:11:53] PROBLEM - Puppet freshness on tmh1002 is CRITICAL: Puppet has not run in the last 10 hours [10:12:53] PROBLEM - Puppet freshness on tmh2 is CRITICAL: Puppet has not run in the last 10 hours [10:17:34] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 4 seconds [10:18:04] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [10:35:38] PROBLEM - Solr on vanadium is CRITICAL: Average request time is 400.06927 (gt 400) [11:06:58] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [11:07:58] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.001 second response time 
on port 8123 [11:45:51] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [11:46:31] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [11:49:42] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [11:54:41] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.008 second response time on port 8123 [12:03:12] New review: MaxSem; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43029 [12:06:41] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [12:06:51] PROBLEM - swift-account-reaper on ms-be12 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [12:07:43] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.001 second response time on port 8123 [12:08:22] PROBLEM - swift-account-reaper on ms-be11 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [12:11:11] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [12:49:22] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [12:50:12] RECOVERY - MySQL disk space on neon is OK: DISK OK [12:58:24] !log restarted icinga (not sure why it stopped, log file simply said 'caught TERM signal', in any case it complained that Error: Could not create external command file '/var/lib/nagios/rw/nagios.cmd' as named pipe ) [12:58:49] *cough* [12:59:15] ah no morebots [13:02:50] can't apparently get on wikitech instance any more [13:02:55] log by hand I guess [13:07:13] bah, now it's sooo quiet here [13:07:16] I like it [13:07:59] * apergos drops a pin [13:35:01] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 183 seconds [13:35:21] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 188 seconds 
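[Editor's note: the "Could not create external command file '/var/lib/nagios/rw/nagios.cmd' as named pipe" error apergos logs above typically means something that is not a FIFO is sitting at that path (or the rw/ directory is broken). A hedged recovery sketch — the function name is invented for illustration, and ownership/permissions are install-specific:]

```shell
#!/bin/sh
# If a stale regular file occupies the command-file path, remove it so
# icinga can recreate the FIFO on startup.
fix_command_pipe() {
    # $1: path to the external command file, e.g. /var/lib/nagios/rw/nagios.cmd
    if [ -e "$1" ] && [ ! -p "$1" ]; then
        rm -f "$1"
    fi
}

# Usage (no-op when the path is absent or already a FIFO):
# fix_command_pipe /var/lib/nagios/rw/nagios.cmd
# To create the pipe by hand instead (adjust owner/group for the install):
# mkfifo -m 660 /var/lib/nagios/rw/nagios.cmd
```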
[13:45:57] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 15 seconds [13:46:07] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 8 seconds [14:02:57] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 181 seconds [14:02:58] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [14:03:17] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 185 seconds [14:03:57] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.002 second response time on port 8123 [14:07:59] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 4 seconds [14:08:17] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [14:10:08] PROBLEM - Varnish HTTP bits on strontium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:10:57] RECOVERY - Varnish HTTP bits on strontium is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 0.001 second response time [14:29:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:30:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [14:41:03] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [14:41:53] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.003 second response time on port 8123 [14:54:43] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [14:54:53] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [14:55:19] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [14:56:03] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [14:56:53] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [15:00:43] 
PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [15:02:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:04:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.241 second response time [15:08:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.131 second response time [15:27:25] RECOVERY - MySQL disk space on neon is OK: DISK OK [15:27:58] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [15:28:15] PROBLEM - SSH on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:29:06] PROBLEM - Varnish HTTP bits on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:29:15] RECOVERY - SSH on palladium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:29:55] RECOVERY - Varnish HTTP bits on palladium is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 0.003 second response time [15:42:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:42:25] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [15:43:05] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [15:43:55] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [15:52:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.126 second response time [16:03:49] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [16:04:49] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [16:10:09] PROBLEM - 
Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:10:59] PROBLEM - Puppet freshness on knsq28 is CRITICAL: Puppet has not run in the last 10 hours [16:10:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [16:19:49] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [16:20:49] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [16:25:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:37:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [17:09:10] where can i find a copy of apache-fast-test? [17:09:28] i searched a little in git and i tried `apt-file search` in labs [17:10:23] mutante: ^ [17:16:28] <^demon> jeremyb_: I don't believe it's puppetized. I can pastebin it from fenari if you'd like. [17:16:50] do you have cluster access? [17:16:53] heh I was about to do the same thing [17:17:59] <^demon> https://noc.wikimedia.org/~demon/apache-fast-test [17:18:36] I wonder what else is in /home/w/bin/that's not in git someplace [17:18:59] <^demon> Too much, if I had to guess. [17:19:12] that's what I was afraid of [17:32:14] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:33:04] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.067 second response time [17:33:20] New patchset: Demon; "Updating gerrit to 2.5.2-1506-g278aa9a" [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/51280 [17:34:50] New review: Demon; "New(er) war can be found: https://integration.mediawiki.org/nightly/gerrit/wmf/gerrit-2.5.2-1506-g27..." 
[operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/51280 [17:35:03] * jeremyb_ fetches [17:35:16] ^demon: danke [17:35:19] <^demon> yw [17:35:23] ah, written by jeff [17:35:27] i figured it was older than that [17:36:00] nope [17:38:31] i wonder if jeff likes perl :P [17:40:32] jeremyb_: perl is the best thing in the whole town! [17:41:04] <^demon> I believe Jeff may be exaggerating. [17:41:20] I like perl scripting better than shell scripting! [17:41:23] That script is pretty neat though [17:41:48] it was fun, got to learn a little about threads [17:47:14] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 184 seconds [17:47:31] hey would you want to put that in say operations/tools or something? [17:47:34] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 188 seconds [17:47:53] since people like it enough to want it [17:48:11] apergos: apergos sure. nobody has ever really reviewed it or anything afaik though [17:48:23] and yet there it is in /home/w/bin [17:48:26] jsut sayn :-P [17:48:35] Putting it in puppet and making sure it's there on all bastion hosts and stuff would be nice ;) [17:49:23] alright alright, let's put in an RT ticket [17:49:41] then people can notice and flame it or whatever [17:49:41] :-P [17:49:49] :-D [17:50:05] ha, you are on RT duty [17:50:13] so you can handle that ticket :-P [17:52:03] blahrgh [17:52:09] i'm going to do it with a cowsay... [17:52:31] heh [17:52:37] we would expect nothing less [17:52:54] http://trouser.org/cowsay?f=cow&msg=go%20apache-fast-test%20kthxbye [17:52:57] there you go [17:53:10] :-D [17:54:15] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [17:54:17] New patchset: Aaron Schulz; "Removed unused global in jobs-loop (only local one matters)." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/51363 [17:54:55] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [17:55:32] joy [17:55:38] and also: Feb 28 17:55:05 neon rsyslogd-2177: imuxsock begins to drop messages from pid 16414 due to rate-limiting [17:56:31] apergos: https://gerrit.wikimedia.org/r/51363 easy review [17:56:38] not urgent of course [17:56:51] I shall look now [17:56:58] I guess we have a meeting in [17:56:59] uh [17:57:07] 4 minutes! brb [18:02:13] apergos ? [18:02:52] sec [18:03:00] had to take a pit stop [18:04:48] there's only you in there and it's really noisy [18:04:51] dunno what that means [18:05:08] woosters: [18:05:32] can u hear me? [18:05:45] apergos [18:06:09] no I could not hear you [18:06:21] I could only hear some sort of really awful bass rhythm beat going on there [18:06:56] now I hear silence [18:07:06] let me skype u intead [18:07:08] also it's only you and not the room [18:07:10] instead [18:07:32] it's harder to hear people as a group on skype [18:09:56] New patchset: Asher; "remove server name labels as it changes hashing behavior" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51364 [18:10:34] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51364 [18:12:31] guys whatever that sound is, it's not the sound of people talking [18:12:35] or of anything useful [18:12:52] dudes! you are going to blow my ears out [18:12:54] woosters: [18:16:05] RECOVERY - MySQL Slave Running on db35 is OK: OK replication [18:18:24] woosters: I am taking the headphones off cause the sound is actually driving me crazy [18:18:36] please ping me when you want me to check the audio again [18:19:07] PROBLEM - MySQL Slave Running on db35 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 142 for key PRIMARY on query.
Default dat [18:19:25] PROBLEM - SSH on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:19:45] PROBLEM - Varnish HTTP bits on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:19:53] !log reedy synchronized php-1.21wmf10/extensions/WikimediaMaintenance [18:20:15] RECOVERY - SSH on palladium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:20:35] RECOVERY - Varnish HTTP bits on palladium is OK: HTTP OK: HTTP/1.1 200 OK - 637 bytes in 0.001 second response time [18:20:41] notpeter: when upgrading from mysql 5.1 -> mysql or mariadb 5.5, you must run mysql_upgrade [18:21:58] !log reedy synchronized php-1.21wmf10/cache/interwiki.cdb 'Updating 1.21wmf10 interwiki cache' [18:22:21] Reedy: there is no morebots right now [18:22:31] it's a lie anyway :) [18:22:33] <^demon> Need more bots :\ [18:22:34] it departed earlier and I don't appear to have working creds on wikitech any more [18:22:41] (which I need to find out about too) [18:24:35] !log reedy synchronized php-1.21wmf10/cache/interwiki.cdb 'Updating 1.21wmf10 interwiki cache' [18:27:10] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [18:27:10] RECOVERY - MySQL disk space on neon is OK: DISK OK [18:36:12] so Aaron I will look at that change now but if it needs more than half an eye it will get put off (remotely following meeting) [18:36:40] RECOVERY - MySQL Slave Running on db35 is OK: OK replication [18:38:20] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:38:30] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:38:30] PROBLEM - mysqld processes on db35 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [18:40:34] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51363 [18:41:28] !log rebuilding db35 from a hotbackup of db55 (s5); testing a new build of mariadb 5.5.29 
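binasher's note above — a MySQL 5.1 to MySQL/MariaDB 5.5 upgrade must be followed by running `mysql_upgrade` against the new server — can be sketched as a small wrapper. This is a hypothetical illustration, not anything from the puppet repo; the helper names and defaults are assumptions:

```python
import subprocess

def mysql_upgrade_cmd(user="root", socket=None, force=False):
    """Build the mysql_upgrade invocation for an in-place 5.1 -> 5.5 upgrade.

    mysql_upgrade checks every table for incompatibilities with the new
    server version and upgrades the system tables; it must be run against
    the *new* mysqld once it is started on the old datadir.
    """
    cmd = ["mysql_upgrade", "--user=%s" % user]
    if socket:      # connect over the local socket, typical on a db host
        cmd.append("--socket=%s" % socket)
    if force:       # re-run even if mysql_upgrade was already executed
        cmd.append("--force")
    return cmd

def run_upgrade(**kwargs):
    # Would actually execute on the db host; shown only for completeness.
    return subprocess.call(mysql_upgrade_cmd(**kwargs))
```

Skipping this step is what typically produces replication errors and broken system tables after the version bump.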
[18:41:39] binasher: is the hashing based on the labels? [18:41:49] AaronSchulz: if there are labels, yes [18:41:55] that's actually good [18:42:05] not for an initial switch, but in general [18:42:20] it makes it easy to swap out a server without changing the hashing [18:42:23] definitely [18:42:34] I'd really like to see that actually [18:42:36] that's what it's intended for [18:42:43] but… for an initial switch.. :( [18:42:54] if we want to do it [18:43:08] we'll have to bite the invalidation at some point anyways though [18:43:17] yeah, I was going to say that [18:43:28] aaron, change live once puppet runs on those hosts. [18:43:30] I was talking to RobH about this a few days back [18:44:13] AaronSchulz: https://github.com/twitter/twemproxy/blob/master/notes/recommendation.md [18:44:27] see the section "Node Names for Consistent Hashing" [18:45:30] it's great if mc1015 dies and we want to swap a spare, mc1018 permanently into it [18:46:06] looking! [18:46:55] oh yeah [18:47:07] how obvious and yet not done before [18:50:47] binasher: so how soon could this be done? [18:51:01] it's a feature missing from the current pecl setup [18:55:18] AaronSchulz: if we agree that it's worth going to the name-based hashing and accepting that an initial deploy will invalidate a bunch of keys, i think it can be done much sooner [18:55:48] as that eliminates an entire class of testing :) [18:56:10] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:56:20] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK: HTTP/1.1 200 OK - 633 bytes in 0.002 second response time [18:56:57] !log reedy synchronized php-1.21wmf10/cache/interwiki.cdb 'Updating 1.21wmf10 interwiki cache' [18:59:26] AaronSchulz: maybe we should just do it like that.. tear off the bandaid!
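The twemproxy recommendation binasher links ("Node Names for Consistent Hashing") is the idea being discussed: hash keys against stable logical node names instead of host:port pairs, so the hardware behind a name can be swapped (mc1015 dies, spare mc1018 takes over) without remapping any keys. A rough Python sketch of the technique — the ring construction and class here are illustrative assumptions, not twemproxy's actual implementation:

```python
import bisect
import hashlib

def _hash(s):
    # md5-based point on the ring, stable across processes
    return int(hashlib.md5(s.encode()).hexdigest()[:8], 16)

class HashRing:
    """Consistent-hash ring keyed by *logical* node names.

    Keys hash against names like 'mc1015'; the host:port behind a name
    can change without moving a single key. Hashing host:port directly
    would remap a slice of the keyspace on every hardware swap.
    """
    def __init__(self, nodes, vnodes=100):
        # nodes: {logical_name: "host:port"}
        self.nodes = dict(nodes)
        self.ring = sorted((_hash("%s-%d" % (name, i)), name)
                           for name in self.nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def get(self, key):
        # first ring point clockwise from the key's hash, wrapping around
        i = bisect.bisect(self.points, _hash(key)) % len(self.ring)
        name = self.ring[i][1]
        return name, self.nodes[name]

    def replace_server(self, name, new_addr):
        # swap a spare in behind the same name: ring is untouched
        self.nodes[name] = new_addr
```

As noted in the channel, an initial switch *to* named nodes still invalidates keys once — the benefit only accrues for swaps made after that.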
[19:00:30] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:00:36] jfdi is the best [19:03:22] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK: HTTP/1.1 200 OK - 633 bytes in 0.646 second response time [19:03:52] notpeter: re: search [19:04:21] do you remember how '*?' is evaluated for index inclusion? [19:04:38] it's like a case statement [19:04:45] it's everything that's not covered anywhere else [19:04:52] <^demon> *? -> [black box] -> included in index [19:04:59] <^demon> [black box] is how I describe all of lsearchd. [19:05:05] yeah, what ^demon sadi ;) [19:05:07] yep [19:05:38] hahah [19:06:05] ok, so its everything but does automagically exclude any *.nspart? index included elsewhere? [19:06:16] I habeeb so [19:06:49] that is what shows up on the pool4 frontends [19:11:32] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:12] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:13:12] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:14:38] New patchset: Asher; "relocating the main commons search-index within pool4 to the relatively empty spelling index hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51380 [19:16:12] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:18] New patchset: Asher; "relocating the main commons search-index within pool4 to the relatively empty spelling index hosts - in pmtpa only for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51380 [19:20:02] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:21:22] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 0.001 second response time [19:23:05] New patchset: Reedy; "Remove wgLegalTitleChars, same as default" [operations/mediawiki-config] 
(master) - https://gerrit.wikimedia.org/r/51382 [19:24:28] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51382 [19:24:39] ACKNOWLEDGEMENT - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours LeslieCarr poop [19:25:14] lol [19:25:20] :-D [19:25:25] oh i didn't realize it would say the comment i put in [19:25:25] oops [19:25:26] haha [19:25:40] Ryan_Lane: oh great ldap knowing person :) [19:25:42] publically logged too [19:25:42] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:52] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:26:29] Ryan_Lane: so i want to allow all wmf group admin rights on icinga [19:26:39] Ryan_Lane: but i am not sure where that is configured - can you point me in the right direction ? [19:27:45] Hm. Where should I point a general 'Tool Labs' problem email? (think: maintainer/webmaster/etc) [19:27:46] <^demon> where wmf group is configured? [19:27:49] <^demon> LeslieCarr: ^ [19:28:17] ^demon where the privilege restriction is configured [19:28:46] <^demon> Ah, dunno that bit. [19:29:00] how does nagios/icinga decide there is a problem with search ? Is it running a search query ? Is there a config file or page where I can see the query ? [19:29:01] me neither :) [19:29:14] xyzram: it's configured in puppet [19:29:22] Which file ? [19:29:41] so search.pp has monitor_service [19:30:26] and then to see what that command means, templates/icinga/checkcommands.cfg.erb [19:32:26] LeslieCarr: thats how it works right now [19:32:36] LeslieCarr: from the authentication perspective anyway [19:32:54] I don't know how the admin part of icinga works, though [19:33:05] I thought if you could auth, then you have admin [19:33:10] oh ? 
hrm -- so mwalker's account isn't being allowed to submit commands [19:33:11] yeah hrm [19:33:13] PROBLEM - MySQL Replication Heartbeat on db55 is CRITICAL: CRIT replication delay 193 seconds [19:33:13] PROBLEM - MySQL Slave Delay on db55 is CRITICAL: CRIT replication delay 197 seconds [19:33:14] hrm hrm hrm [19:33:26] I don't actually know, though :) [19:33:32] obviously he's not clicking it correctly [19:33:32] ;) [19:33:35] i'll keep searching [19:33:37] but from the auth POV, it's current;y set up for wmf group [19:33:46] command_line $USER1$/check_tcp -t 90 -w 10 -p 8123 -H $HOSTADDRESS$ ? [19:36:43] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:37:54] LeslieCarr: check_lucene is defined to invoke check_tcp and the latter is defined to invoke itself; where is check_tcp ? [19:38:15] check_tcp is a file located in the nagios-plugins package i believe [19:38:20] they're all written in python [19:38:33] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 0.001 second response time [19:38:33] Does it have doc/manpage ? [19:38:44] you can install nagios-plugins and nagios-extras on some labs instance to find it [19:39:01] Do you know what it does ? [19:39:03] your guess is as good as mine :) there's no man page [19:39:14] OK [19:39:14] i'm guessing it just does a tcp check [19:39:17] and if it receives syn it's ok [19:39:19] i mean synack [19:39:29] is my best guess [19:39:44] Ok, thanks. [19:41:05] if you look at other monitor_services there are more complex monitoring commands we can invoke [19:41:24] <^demon> I need to do some for gerrit prolly. [19:41:50] <^demon> Monitoring the log to know if replication is failing, for example. 
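The guesses above about `check_tcp` are essentially right: the plugin (a C program from the nagios-plugins package) attempts a TCP connect and reports OK if the handshake completes within the timeout. A hedged Python approximation of that behavior — not the real plugin, which additionally supports warning thresholds, send/expect strings, and SSL:

```python
import socket
import time

def check_tcp(host, port, timeout=10.0):
    """Minimal check_tcp-style probe.

    Returns (ok, seconds). Mirrors the channel's reading of the plugin:
    a completed TCP handshake (i.e. we got the SYN-ACK and connect(2)
    returned 0) within the timeout is OK; refusal or timeout is CRITICAL.
    The port/timeout mirror `check_tcp -t 90 -w 10 -p 8123` from the
    lucene check discussed above.
    """
    start = time.time()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.time() - start
    except OSError:  # covers connection refused and socket.timeout
        return False, time.time() - start
```

So the search-pool alerts above only prove the lsearchd port accepts connections, not that queries return sensible results; a richer `monitor_service` command would be needed for that.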
[19:42:23] New patchset: Lcarr; "adding khorn and mwalker to having permissions for icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51387 [19:45:49] New patchset: Lcarr; "adding khorn and mwalker to having permissions for icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51387 [19:46:31] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51387 [19:53:35] New patchset: Asher; "internally resharding the main indices of major projects within pool4 to fairly unused spell nodes [pmtpa only - for testing]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51380 [19:55:10] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51380 [19:55:47] it's c at least for nagios. connect(2), and expect 0 return value [19:56:05] ACKNOWLEDGEMENT - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours Matt Walker bad box! [19:56:13] :) [19:56:41] http://fossies.org/dox/nagios-plugins-1.4.16/check__tcp_8c_source.html here's a copy on the web (I had the source lying around on my hd from some version or other though) [19:56:51] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [19:56:53] no idea whether that's most current etc [19:57:02] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:58:51] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:59:51] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:00:41] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:03:11] Ops -- food and meeting are happening in Chambers right now. [20:03:51] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:15] is there gavel? 
[20:04:15] !log resharding search-pool4 in pmtpa - restarted lsearchd on all local nodes and indexers on searchidx2 [20:04:20] a gavel* [20:04:41] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:05:51] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 0.001 second response time [20:11:55] PROBLEM - Puppet freshness on tmh1001 is CRITICAL: Puppet has not run in the last 10 hours [20:11:55] PROBLEM - Puppet freshness on tmh1002 is CRITICAL: Puppet has not run in the last 10 hours [20:11:55] PROBLEM - Puppet freshness on tmh1 is CRITICAL: Puppet has not run in the last 10 hours [20:12:55] PROBLEM - Puppet freshness on tmh2 is CRITICAL: Puppet has not run in the last 10 hours [20:38:26] New patchset: Asher; "moving search-pool4 queries to pmtpa to test resharding" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51407 [20:38:46] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51407 [20:51:33] New patchset: Asher; "testing a search-pool5 in pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51483 [20:57:15] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [20:57:55] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [20:59:42] !log disabled editing on wikitech.wikimedia.org [21:01:28] !log changing A record for wikitech.wm.o and changing labsconsole.wm.o into a cname to wikitech [21:01:49] Ryan_Lane no logbot? :o [21:01:56] ah [21:01:57] hahaha [21:01:59] right.... 
[21:02:06] I turned off editing [21:02:10] I'll log when I finish, I guess [21:02:14] it's here [21:02:19] but doesn't work :( [21:02:23] wikitech is read only [21:02:28] oh lol [21:02:30] true [21:04:47] so access to the linode instance where morebots runs [21:05:00] don't do anything with it right now [21:05:03] the wikitech creds on fenari didn't work for me today [21:05:06] I'm merging wikitech and labsconsole [21:05:20] at some point it woul dbe good to have creds to restart that bot [21:05:26] wherever it runs now [21:05:29] it'll be on wikitech-static now [21:05:34] the creds are under racktables [21:05:38] ah [21:06:05] good to know (I didn't realize it had already been moved) [21:06:06] thanks [21:07:03] RECOVERY - MySQL Replication Heartbeat on db55 is OK: OK replication delay 0 seconds [21:07:13] RECOVERY - MySQL Slave Delay on db55 is OK: OK replication delay 0 seconds [21:12:10] New patchset: Ryan Lane; "Rename labsconsole to wikitech" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51542 [21:12:16] !log olivneh synchronized php-1.21wmf10/extensions/EventLogging 'Updates to JavaScript API' [21:12:27] that'll need to be relogged ;) [21:14:16] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51542 [21:14:45] Ryan_Lane: no problem [21:14:58] New patchset: Pyoungmeister; "sharding out pool5 for pmtpa search" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51546 [21:15:10] Acknowledged inability to write to wikitech, Master. 
[21:15:18] :D [21:15:47] binasher: ^^^^^ [21:16:27] <^demon> We should totally make morebots say that when it can't work :p [21:17:13] notpeter: this is responsible for some of the pmtpa replag - https://bugzilla.wikimedia.org/show_bug.cgi?id=45584 [21:18:29] <^demon> binasher: I just saw ^ [21:18:33] <^demon> That code is anciennntttttt [21:19:14] how did i manage to not notice that for so long [21:20:09] I don't even know, dog [21:22:01] Does it just need/want batching? [21:22:41] # Set query limit [21:22:41] if ( !empty( $this->history['limit'] ) ) { [21:22:41] $opts['LIMIT'] = intval( $this->history['limit'] ); [21:22:41] } [21:22:42] heh [21:24:37] New review: Asher; "(1 comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/51546 [21:24:53] Reedy: yup! [21:31:27] !log create missing dump directory on streber, enables "RT-shredder" plugin which comes with RT since 3.8 [21:34:21] http://etherpad.wikimedia.org/mobile-ops-syncup-28feb2013 [21:46:09] New patchset: Pgehres; "Fixing my last name for icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51571 [21:46:46] New patchset: Pyoungmeister; "setting up search pool5 in pmtpa" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51572 [21:47:11] oops sorry pgehres [21:47:12] :-/ [21:47:20] no worries [21:47:33] you could also just change your name to match my typo [21:47:41] thats a lot of paperwork [21:47:47] !log olivneh synchronized php-1.21wmf10/extensions/GuidedTour 'Update to split test (1/3)' [21:48:02] !log olivneh synchronized php-1.21wmf10/extensions/E3Experiments 'Update to split test (2/3)' [21:48:18] !log olivneh synchronized php-1.21wmf10/extensions/GettingStarted 'Update to split test (3/3)' [21:49:15] :) [21:49:30] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51571 [21:50:04] Change abandoned: Pyoungmeister; "in favor of https://gerrit.wikimedia.org/r/#/c/51572/" [operations/puppet] 
(production) - https://gerrit.wikimedia.org/r/51546 [21:50:13] LeslieCarr: i am not usually this annoying, only when my phone explodes for 30 minutes and I can't stop it [21:50:17] :-) [21:50:55] hehehe [21:50:57] :) [21:51:31] before i do it, does anyone think we should not poweroff spence ? [21:52:16] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 194 seconds [21:52:26] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 199 seconds [21:55:20] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51572 [21:55:46] pgehres, is that a "slow explosion"? :) [21:56:22] more like a mortar attack on a cron [21:56:28] * Platonides imagines the phone exploding at slow motion, matrix-like [21:56:31] 8 texts, every 5 minutes [21:56:35] hehehe [21:56:47] better than one email per minute [21:56:57] oh, there were emails too [21:57:10] gmail's mute is helpful [21:57:13] Platonides: Holy crap, how much email do you get. OTRS doesn't even get that much! [21:57:19] *get? [21:57:34] (at least, info-en doesn't) [21:59:29] binasher: convoluted code is convoluted :/ [21:59:41] :) [22:01:17] I could just put LIMIT 5 on it, but then apergos would likely want to beat me [22:02:53] LeslieCarr: I had to restart icinga today [22:02:56] and I don't know why [22:03:00] oh ? [22:03:01] hrm [22:03:03] what happened ? [22:03:05] luckily I had notification from watchmouse [22:03:27] dunno, the log said it received TERM and then it failed to restart (see sal) [22:03:47] it complained it couldn't open some file or pipe and maybe I should check for an already existing instance [22:04:25] limit 5 of what, Reedy?
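The alternative to slapping a blunt LIMIT on the bug 45584 export query (and the "query continuation" Reedy is trying to work out) is keyset pagination: repeat a bounded query, resuming from the last row's sort key. A sketch under assumed names — `page`/`id` stand in for whatever columns the real ORDER BY uses, and `run_query` is a hypothetical stand-in for the database call:

```python
def fetch_all(run_query, batch=500):
    """Keyset-pagination sketch for an otherwise unbounded export query.

    `run_query(after, limit)` must ORDER BY (page, id) and treat `after`
    as WHERE (page, id) > after. Each batch is bounded, so replication
    lag from one giant long-running read is avoided; the tricky part the
    channel notes — extra ORDER BYs and WHEREs — is exactly what makes
    choosing the continuation columns hard in the real code.
    """
    after = None
    while True:
        rows = run_query(after=after, limit=batch)
        if not rows:
            return
        for row in rows:
            yield row
        # last row's sort key becomes the continuation token
        after = (rows[-1]["page"], rows[-1]["id"])
```

Unlike OFFSET-based paging, resuming by sort key stays cheap no matter how deep into the result set the export is.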
mutante: I see you've added me to the paging list, but ironically my cellphone provider changed the day before you added me :S [22:04:33] revisions for export [22:04:38] (see also RT ticket) [22:04:49] yeah, I would beat you [22:04:56] apergos: https://bugzilla.wikimedia.org/show_bug.cgi?id=45584 [22:05:37] that code predates me by years of course [22:05:58] Yeaah :/ [22:06:22] Trying to work out what we should use for query continuation with the numerous conditions already in place [22:06:27] ugh [22:07:27] with all the different order by and other wheres, I could see it starting to do other stuff we don't really want [22:10:18] !log stopped gmetad on neon for testing [22:10:56] RECOVERY - mysqld processes on db35 is OK: PROCS OK: 1 process with command name mysqld [22:11:02] well I knew this issue would come up sooner or later [22:11:04] *sigh* [22:11:17] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [22:11:22] !started gmetad on neon again [22:12:46] Jeff_Green: no logging right now [22:13:06] ha [22:13:15] New patchset: Asher; "change from explicit mariadb version to present while testing multiple versions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51574 [22:14:18] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51574 [22:15:04] New patchset: Tim Starling; "Revert "For bug 44570 - Make the parser cache expire at 30 days"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51575 [22:16:10] Change merged: Ryan Lane; [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/51280 [22:18:32] New patchset: Tim Starling; "Revert "For bug 44570 - Make the parser cache expire at 30 days"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51575 [22:18:46] PROBLEM - MySQL Slave Delay on db35 is CRITICAL: CRIT replication delay 1026 seconds [22:18:55] Change merged: Tim Starling;
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51575 [22:19:17] TimStarling: ;) [22:19:44] !log tstarling synchronized wmf-config/InitialiseSettings.php [22:20:11] here I am in a mobile team meeting saying that they should talk to platform about things that might seriously damage site performance [22:20:54] seems a bit hypocritical when you let through changes like that [22:22:33] sure, an order of magnitude shorter parser cache expiry, that couldn't possibly do any harm could it? [22:23:45] depends on the distribution [22:24:46] RECOVERY - MySQL Slave Delay on db35 is OK: OK replication delay 0 seconds [22:25:26] TimStarling, like https://gerrit.wikimedia.org/r/51577 ?:P [22:28:26] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [22:29:11] ^demon: done [22:29:16] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [22:29:21] <^demon> \o/ [22:33:12] I'm still confused how that is any/much different from purging stuff over 30 days old via the purge parser cache script [22:49:03] !log upgrading mariadb boxes to 5.5.29 [22:49:39] New patchset: Asher; "adding search_pool5 to lvs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51579 [22:50:51] whoa, https://wikitech.wikimedia.org/view/How_to_deploy_code redirects to blank page on labs. I am a headless chicken without that page! [22:51:19] it's now located at wikitech-old whilst ryan is in the midst of updating [22:52:15] New patchset: Asher; "adding search_pool5 to lvs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51579 [22:53:18] PROBLEM - mysqld processes on db59 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [22:53:55] mwalker we're done, so wait 100 seconds and go ahead [22:54:14] spagewmf: cool beans -- thanks kindly! 
[22:54:56] Hmm [22:56:55] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51579 [22:59:18] RECOVERY - mysqld processes on db59 is OK: PROCS OK: 1 process with command name mysqld [23:01:19] PROBLEM - MySQL Slave Delay on db59 is CRITICAL: CRIT replication delay 466 seconds [23:03:15] New patchset: Lcarr; "good night spence" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51582 [23:03:35] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51483 [23:03:56] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51582 [23:04:48] PROBLEM - mysqld processes on db52 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:05:20] PROBLEM - SSH on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:05:48] PROBLEM - Varnish HTTP bits on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:06:18] RECOVERY - SSH on palladium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [23:06:33] !log asher synchronized wmf-config/lucene.php 'sending search-pool4 traffic to pmtpa, where the index/host distribution has been rebalanced' [23:06:39] RECOVERY - Varnish HTTP bits on palladium is OK: HTTP OK: HTTP/1.1 200 OK - 637 bytes in 0.001 second response time [23:06:49] RECOVERY - mysqld processes on db52 is OK: PROCS OK: 1 process with command name mysqld [23:07:42] !log powering off spence [23:08:19] RECOVERY - MySQL Slave Delay on db59 is OK: OK replication delay 0 seconds [23:09:01] !log serveradminlog is not working [23:09:05] !log wikitech is down [23:09:15] !log wikitech isn't logging [23:09:20] !log why won't you log wikitech? [23:09:23] !log herp derp derp [23:09:25] :D [23:09:40] :) [23:10:03] !log logging is dead [23:10:25] notpeter: i hate you. [23:10:28] !log Rob and notpeter are talking out loud [23:10:32] RobH: why are you talking out loud? 
[23:10:35] RobH: I know :) [23:10:50] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [23:10:50] RobH: why are you talking out loud? [23:10:52] RobH: do your hand motions in irc [23:10:53] !log RobH is making the threatening i'm looking at you hand motion [23:10:57] you have to log this! [23:11:50] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [23:12:40] !log i miss logging [23:13:02] whining accepted, o whiney one [23:16:00] PROBLEM - MySQL Slave Delay on db39 is CRITICAL: CRIT replication delay 206 seconds [23:17:40] !log pgehres synchronized php-1.21wmf10/extensions/DonationInterface/ 'Updating DonatonInterface-langonly' [23:21:19] New patchset: Asher; "pulling db1043 and shifting watchlist q's to db1050" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51588 [23:22:38] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51588