[00:00:13] mutante: http.debian.net is a service (listens to port 80). idk how what you linked works but it seems different that what i was talking about [00:00:40] New patchset: Reedy; "Lower email throttle" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45273 [00:00:50] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45273 [00:01:46] jeremyb_: mirrors.ubuntu.com also happens to listen on port 80, but i don't know what the real question was then [00:02:31] !log reedy synchronized wmf-config/InitialiseSettings.php 'Lowering throttle to 5/day for new users, and' [00:02:33] Logged the message, Master [00:04:29] New patchset: Reedy; "(bug 44587) Fix trwiki FlaggedRevs autopromotion config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51172 [00:04:34] mutante: http.debian.net doesn't actually serve files... (besides a description of how it works at the site's root) [00:04:38] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51172 [00:04:49] mutante: otherwise only 302s, iirc [00:06:46] mirrors.u.c seems to only have text files [00:07:52] New patchset: Reedy; "Remove disabled ReaderFeedback" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47532 [00:08:32] New patchset: Andrew Bogott; "Refactor mediawiki::singlenode and wikidata::singlenode into modules." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50451 [00:10:15] btw, I don't know what the cause is, but I've been seeing a significant increase in page load times on Wikimedia sites since about a week. 
Haven't noticed it on any other domains [00:10:27] takes up to a second sometimes for html to arrive [00:10:53] network inspector shows me 500-800ms latency for "waiting" [00:11:10] on initial hit of the document [00:14:55] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [00:15:26] PROBLEM - MySQL Slave Delay on db1049 is CRITICAL: CRIT replication delay 199 seconds [00:15:26] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [00:15:51] notpeter: the runners look fine [00:16:25] RECOVERY - MySQL Slave Delay on db1049 is OK: OK replication delay 0 seconds [00:17:35] PROBLEM - Host search23 is DOWN: PING CRITICAL - Packet loss = 100% [00:17:43] I caught it this time [00:17:51] this request has been going for 8 minutes to bits [00:17:53] it won't time out [00:18:15] https://bits.wikimedia.org/nl.wikipedia.org/load.php?debug=false&lang=nl&modules=jquery.ui.button%2Ccore%2Cdialog%2Cdraggable%2Cmouse%2Cposition%2Cresizable%2Cwidget&skin=vector&version=20130218T165324Z&* [00:18:16] X-Cache:strontium hit (658) [00:18:22] X-Varnish:233645980 204707512 [00:18:32] Server:nginx/1.1.19 [00:18:50] ofcourse now its cached [00:18:59] 27 08:38:50 <+nagios-wm> PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours [00:19:13] probably not relevant.... 
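[Editor's note: the latency probe Krinkle pastes above (the load.php URL plus its X-Cache / X-Varnish / Server response headers and the "waiting" time) can be reproduced from a shell with plain curl. A hedged sketch; the `probe` helper name is made up for illustration, and only standard curl options are used:]

```shell
#!/bin/sh
# Print the cache-related response headers and the total request time for
# a URL, mirroring what the browser network inspector showed in channel.
probe() {
    curl -s -o /dev/null -D - -w 'time_total=%{time_total}s\n' "$1" \
        | grep -iE '^(x-cache|x-varnish|server):|^time_total='
}

# Usage against the bits URL pasted in the log (network access required):
# probe 'https://bits.wikimedia.org/nl.wikipedia.org/load.php?debug=false&lang=nl&modules=jquery.ui.core&skin=vector'
```

[A time_total consistently in the 0.5-0.8s range on an X-Cache hit would point at the frontend (varnish/nginx) rather than the appservers, which matches the suspicion about bits here.]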
[00:19:13] New patchset: Reedy; "Bug 44493 - Simple English Wiktionary local system messages ignored: set $wgLanguageCode to en" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51303 [00:19:29] jeremyb_: nagios is/was lying [00:19:39] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51303 [00:19:40] Reedy: oh, right that rings a bell [00:19:45] (didn't really read much about it) [00:19:56] Ariel checked a couple of hosts and it wasn't true [00:19:58] Krinkle: so, any ideas if your observations are really for the wiki or bits or both? [00:20:16] Reedy: it probably just wasn't sending the traps to both? [00:20:23] (wild guess) [00:20:24] jeremyb_: Could be separate problems [00:21:01] jeremyb_: but yeah, about once or twice an hour all http requests to wmf are taking about 2 seconds to respond. [00:21:05] then everything is fine [00:21:12] Krinkle: bits boxes have been extra bouncy recently (at least according to what i remember nagios saying here) [00:21:23] and bits is separately timing out,sometimes throwing 50x warning, sometimes just not timing out for minutes long [00:21:25] Krinkle: once paravoid restarted a couple boxen iirc [00:21:28] 2-3 days ago [00:21:54] New patchset: Reedy; "Bug 44894 - Set $wgAutoConfirmCount to 10 for ko.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51304 [00:21:55] PROBLEM - MySQL Slave Running on db59 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error BIGINT UNSIGNED value is out of range in (enwiki.article_f [00:22:07] binasher: speak of the devil [00:22:08] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51304 [00:22:45] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error BIGINT UNSIGNED value is out of range in (enwiki.article_f [00:22:50] mariiia [00:24:02] ugh 
[00:24:07] New patchset: Reedy; "More for bug 44894 allow sysops to add/remove confirmed group" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51305 [00:24:25] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51305 [00:26:15] New patchset: Reedy; "Bug 45065 - enable webfonts on sa.wikiquote.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51306 [00:26:31] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51306 [00:27:18] !log reedy synchronized wmf-config/ [00:27:20] Logged the message, Master [00:28:59] Reedy: fyi, since we don't yet have a good scaptrap solution post-svn and pre-gerrit-with-tag-support, I started this page: http://wikitech.wikimedia.org/view/Deployments/Scaptrap, not sure of a better solution for you, is this OK or do you want something else? [00:29:13] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51300 [00:32:19] !log authdns-update to support wikinews.de [00:32:20] Logged the message, RobH [00:33:05] New review: MaxSem; "LoadBalancer and GeoData use random." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43029 [00:35:28] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [00:35:29] RECOVERY - MySQL disk space on neon is OK: DISK OK [00:39:38] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [00:44:58] RECOVERY - MySQL Slave Running on db59 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [00:46:03] New patchset: RobH; "adding support for wikinews.de" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/51309 [00:46:44] New patchset: Lcarr; "merging all icinga configurations to one file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51310 [00:47:30] PROBLEM - MySQL Slave Delay on db59 is CRITICAL: CRIT replication delay 838 seconds [00:47:52] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51310 [00:50:36] ... git review is sloooooooow! 
[00:50:39] New patchset: Reedy; "Bug 44784 - Changing local timezone in ko.wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51311 [00:50:46] New patchset: coren; "Make pre-login sshd banner optional" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51312 [00:51:15] use git push origin HEAD:refs/for/master [00:51:28] RECOVERY - MySQL Slave Delay on db59 is OK: OK replication delay 0 seconds [00:51:45] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51311 [00:52:09] New patchset: Hashar; "(bug 45525) beta: $wgCategoryCollations we support" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51313 [00:52:15] New review: RobH; "robh@fenari:~$ apache-fast-test robtest mw1044" [operations/apache-config] (master) C: 2; - https://gerrit.wikimedia.org/r/51309 [00:52:15] Change merged: RobH; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/51309 [00:52:20] <^demon> Reedy: `git config --global alias.push-for-review "push origin HEAD:refs/for/master"` ;-) [00:52:40] <^demon> And because Ryan was silly when he setup puppet... 
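[Editor's note: ^demon's alias for the `refs/for` push above can be tried safely without touching real config; a minimal sketch, using a throwaway HOME purely so the demo does not modify the user's actual ~/.gitconfig — the alias itself is exactly as quoted in channel:]

```shell
#!/bin/sh
# Set up the push-for-review alias in an isolated HOME so this demo leaves
# the real ~/.gitconfig alone.
HOME="$(mktemp -d)"
export HOME

git config --global alias.push-for-review 'push origin HEAD:refs/for/master'

# `git push-for-review` now expands to the full refs/for push; confirm
# how git stored it:
git config --global alias.push-for-review   # → push origin HEAD:refs/for/master
```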
[00:52:49] <^demon> `git config --global alias.push-puppet "push origin HEAD:refs/for/production"` [00:53:27] New patchset: Reedy; "Bug 44616 - Enable Labeled Section Transclusion extension on id.wikt" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51314 [00:53:52] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51314 [00:54:23] !log reedy synchronized wmf-config/ [00:54:25] Logged the message, Master [00:55:14] New patchset: Ryan Lane; "Use LDAP, requiring ops, for icinga admin" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51315 [00:55:34] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 182 seconds [00:55:44] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 186 seconds [00:56:03] New patchset: Reedy; "Bug 26402 - Labeled section transclusion installation" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47353 [00:56:28] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47353 [00:56:34] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [00:56:44] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [00:57:21] robh is doing a graceful restart of all apaches [00:57:40] !log robh gracefulled all apaches [00:57:42] Logged the message, Master [00:57:55] !log reedy synchronized wmf-config/InitialiseSettings.php [00:57:57] Logged the message, Master [00:58:56] !log dsh restarted eqiad apaches in progress [00:58:58] Logged the message, RobH [00:59:00] New patchset: Ryan Lane; "Use LDAP, requiring ops, for icinga admin" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51315 [01:02:21] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51315 [01:09:36] New patchset: Reedy; "Remove wikimaniawiki entries as it's a redirect to the current wikimania 
event" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51318 [01:10:07] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51318 [01:10:38] !log reedy synchronized wmf-config/InitialiseSettings.php [01:12:34] Logged the message, Master [01:22:03] which search cluster is causing the paging issue ? is it #4 with machines search1015,16,19-22 ? [01:22:25] yea [01:22:26] *yes [01:22:55] ok, just want to look at those logs ... [01:24:41] New patchset: Ryan Lane; "Use ldaps for virt0/1000 connections" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51323 [01:26:39] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51323 [01:29:30] New patchset: Reedy; "lucene.php: simple loadbalancing of requests across datacenters" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43029 [01:31:36] RECOVERY - MySQL Slave Running on db35 is OK: OK replication [01:31:49] New patchset: Reedy; "lucene.php: simple loadbalancing of requests across datacenters" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43029 [01:36:36] PROBLEM - MySQL Slave Delay on db35 is CRITICAL: CRIT replication delay 241160 seconds [01:39:42] RECOVERY - MySQL Slave Delay on db35 is OK: OK replication delay seconds [01:41:42] PROBLEM - mysqld processes on db35 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [01:42:42] RECOVERY - mysqld processes on db35 is OK: PROCS OK: 1 process with command name mysqld [01:45:42] PROBLEM - MySQL Slave Running on db35 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 142 for key PRIMARY on query. 
Default dat [01:48:40] New patchset: Ryan Lane; "Redirect 80 to 443 for icinga and icinga-admin" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51330 [01:49:43] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51330 [02:03:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:04:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [02:10:17] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [02:30:14] !log LocalisationUpdate completed (1.21wmf10) at Thu Feb 28 02:30:14 UTC 2013 [02:30:17] Logged the message, Master [03:09:04] !log reedy synchronized docroot [03:09:06] Logged the message, Master [03:10:29] !log reedy synchronized docroot [03:10:31] Logged the message, Master [03:12:01] !log reedy synchronized docroot [03:12:03] Logged the message, Master [03:24:37] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [03:24:47] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [03:29:14] New patchset: Reedy; "Bug 44335 - Botadmin user group in ml.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51337 [03:29:54] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51337 [03:30:31] !log reedy synchronized wmf-config/InitialiseSettings.php [03:30:33] Logged the message, Master [03:39:47] New patchset: Reedy; "Remove slashes from tel protocol" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51338 [03:40:14] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51338 [03:40:59] !log reedy synchronized wmf-config/InitialiseSettings.php [03:41:00] Logged the message, Master [03:54:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds [03:55:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.058 second response time [03:59:33] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [03:59:43] RECOVERY - MySQL disk space on neon is OK: DISK OK [04:10:13] PROBLEM - Puppet freshness on tin is CRITICAL: Puppet has not run in the last 10 hours [04:10:23] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:10:53] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:25:43] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [04:27:13] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK: HTTP/1.1 200 OK - 633 bytes in 0.001 second response time [04:28:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:34:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.466 second response time [04:39:35] New patchset: Legoktm; "(Bug 45538) Disable "ArticleFeedbackv4" on enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51340 [04:54:13] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [04:58:19] New patchset: Legoktm; "(Bug 45538) Make ArticleFeedbackv5 opt-in for enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51341 [05:00:13] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [05:04:26] Change abandoned: Reedy; "Duplicate of https://gerrit.wikimedia.org/r/#/c/47551/" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51340 [05:05:05] oh oops [05:05:17] :p [05:15:24] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:16:53] PROBLEM - SSH on niobium is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [05:18:04] RECOVERY - Puppet freshness on tin is OK: puppet ran at Thu Feb 28 05:17:53 UTC 2013 [05:25:53] RECOVERY - Puppet freshness on knsq17 is OK: puppet ran at Thu Feb 28 05:25:48 UTC 2013 [05:27:53] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [05:32:53] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:35:43] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [05:36:13] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK: HTTP/1.1 200 OK - 633 bytes in 4.276 second response time [05:42:13] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [05:45:03] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [05:46:55] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.002 second response time on port 8123 [05:51:03] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [05:57:53] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [06:02:43] arg, i seem to have lost my ability to ack payments alerts when we switched to icinga [06:03:02] any OpsEns around who can ACK the check_gcisp on the payments boxes? [06:03:03] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:03:53] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [06:09:38] pgehres: Apparently I can... [06:09:51] my phone thanks you [06:10:03] any chance you can give me perms as well? 
[06:10:08] "Acknowledge service problems" [06:10:11] Not sure, I can look [06:10:13] PROBLEM - Puppet freshness on knsq28 is CRITICAL: Puppet has not run in the last 10 hours [06:10:24] It should be automagic as you should be in the WMF ldap group [06:10:54] Oh, no [06:10:59] Not Authorized [06:11:05] That's stupid, why not tell me originally!? [06:11:15] yeah, same here [06:11:32] although right as you got here the service has recovered [06:11:48] so, unless it flaps, it should be silent for a while [06:12:05] but it sounds like we need some rights around here [06:12:16] hmm [06:12:24] looks like those errors have gone away again.. [06:12:30] need me some tasty rights! [06:12:42] tasty tasty crunchy acls [06:13:43] I guess apergos or some of the SF natives would be the best people to try.. [06:14:26] yeah, /me bets they are all out partying since all Ops are here, except apergo_s [06:14:48] I know they all aren't ;) [06:15:20] heh, well i also know that at least Jeff's phone also expoded [06:15:34] so, i would hope he would some up online [06:16:45] He disappeared earlier, so AFAIK isn't out with some of them [06:16:53] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out [06:16:59] ah [06:18:43] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.000 second response time on port 8123 [06:22:03] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:22:57] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [06:23:43] acls for what? 
[06:24:06] I can't check payments, I don't have access to the fundraising stuff [06:24:15] acknowlagement of payments alerts flapping [06:24:22] ack in icinga [06:24:46] and there goes searchpool4 again >_< [06:24:50] i just want to be able to make my phone stop exploding on 3rd party issues [06:25:06] but i lost that permission in the switchover [06:25:33] I think paravoid had the same issue earlier too [06:25:58] the accounts were moved over [06:26:03] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:27:53] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.003 second response time on port 8123 [06:28:11] !log stabbed searchpool4 and restarted lucene search etc gah hate hate hate [06:28:14] Logged the message, Master [06:28:31] yes I know, "tell us how you really feel about the search cluster" [06:28:48] I'm feeling pain... and anger... [06:29:00] arousal? [06:29:04] no? [06:29:05] no. [06:29:08] okay. [06:29:17] ori-l: I am highly amused [06:29:40] I think you made a typoo [06:31:04] * mwalker resists urge to make jokes about flinging poo [06:31:35] The community do that for us [06:32:04] pgehres: what was your account name? 
[06:32:14] i was able to login as pgehres [06:32:19] in icinga [06:32:27] should be labs/gerrit logins [06:32:29] hm [06:32:30] * pgehres looks for nagios creds [06:32:44] it was pgehres in nagios as well [06:32:57] hyeah I see you [06:33:12] New review: Matmarex; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51313 [06:33:25] I see you in on icinga too [06:34:29] you're supposed to be able to get in on https://icinga-admin.wikimedia.org [06:35:02] i was able to login, just not able to ACK the payments_gcsip is CRITICAL spam [06:35:08] huh [06:35:36] it seems quiet now, so it can likely wait for leslie in the AM [06:35:47] unless you really want to learn icinga :-) [06:37:25] actually, i take that back, payments3 just alerted again [06:38:14] well I am having trouble even authenticating [06:38:18] so there ya go [06:39:43] well, that sure doesn't make things easy [06:39:48] no [06:39:55] I see my creds in there copied straight from nagios and yet [06:40:11] any known exploits in icinga that we can utilizes to elevate privs? [06:40:14] It's been ldap-ed [06:40:36] See if your labs/gerrit creds work [06:42:17] jesus h [06:42:30] I have to use a wikiname to log in? [06:42:37] I think so.. [06:42:56] ok well that is a bit of suckage [06:43:12] anyways I'm in, lemme just record this in keeppassx or I shall never rememebr it [06:45:08] which payments? I see payments1002 and payments4 [06:45:26] none of them at the moment [06:45:29] hahaha [06:45:30] it was all of them [06:46:12] if it flaps does the ack carry over to the next flappage? [06:46:26] no idea [06:54:41] ok here is what I jsut acked: [06:55:00] check_min_fraud on payments1002, 1 and 3 [06:55:08] I didn't see anything else go off for those boxes [06:55:37] kk, thanks. 
the big one that makes my phone go boom is check_gcsip, but that seems OK at the moment [06:55:41] ok [06:55:50] unless our payment processor explodes again, we should be quiet overnight [06:56:03] do you set an expire time? [06:56:05] and I will get Leslie to fix my permissions in the morning [06:56:14] I"m looking at these ack options [06:56:24] i can't do anything [06:56:40] no I mean on nagios [06:56:46] did you set an expire time for the acks? [06:56:59] that's what i mean, i can't ack anything [06:57:04] on nagios [06:57:06] when you had creds [06:57:09] oh [06:57:10] and you could ack them [06:57:24] I'm trying to set it up the way you did [06:57:33] only did it once or twice and it seemed to ack until it recovered [06:57:38] ok [06:57:41] and then it if failed again, it would alert [06:58:17] ah not authorized anyways [06:58:19] sorry [06:58:25] :-D :-D [06:58:25] no worries, thanks for trying [06:58:29] sure [06:58:48] I was guessing the 'sticky ack' would carry over through flaps but... 
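[Editor's note: the acks being discussed are, under the hood, external commands written to the daemon's command pipe. The ACKNOWLEDGE_SVC_PROBLEM syntax below is the standard nagios/icinga external-command format (fields after the service: sticky, notify, persistent, author, comment); the `ack_svc` wrapper and the example host/service names are illustrative only, and doing this for real requires shell access to the pipe rather than the web UI permissions pgehres is missing:]

```shell
#!/bin/sh
# Write an ACKNOWLEDGE_SVC_PROBLEM external command to a nagios/icinga
# command pipe. Flag values used: sticky=2, notify=1, persistent=0.
ack_svc() {
    # $1=command file  $2=host  $3=service  $4=author  $5=comment
    printf '[%s] ACKNOWLEDGE_SVC_PROBLEM;%s;%s;2;1;0;%s;%s\n' \
        "$(date +%s)" "$2" "$3" "$4" "$5" > "$1"
}

# e.g. (path and names as discussed in channel, purely illustrative):
# ack_svc /var/lib/nagios/rw/nagios.cmd payments1002 check_gcsip pgehres 'third-party flap'
```

[A sticky ack (first flag = 2) persists across changes between non-OK states but is cleared on recovery, which matches apergos' observation below that an ack held until the service recovered and then alerted again on the next failure.]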
[06:58:49] moot point [06:59:02] i already replied to the thread on the ops list, so hopefully someone will see that [06:59:08] cool [07:00:50] and on that bombshell, I am off to bed [07:00:56] thanks again Reedy and apergos [08:54:10] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [08:58:08] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [09:02:38] 2.5 hours after restarts on both hosts in the pool [09:02:51] we could restart them every half hour out off cron >_< [09:24:33] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [09:24:43] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [09:57:33] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [09:58:23] RECOVERY - MySQL disk space on neon is OK: DISK OK [10:09:33] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 188 seconds [10:10:03] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 202 seconds [10:11:53] PROBLEM - Puppet freshness on tmh1 is CRITICAL: Puppet has not run in the last 10 hours [10:11:53] PROBLEM - Puppet freshness on tmh1001 is CRITICAL: Puppet has not run in the last 10 hours [10:11:53] PROBLEM - Puppet freshness on tmh1002 is CRITICAL: Puppet has not run in the last 10 hours [10:12:53] PROBLEM - Puppet freshness on tmh2 is CRITICAL: Puppet has not run in the last 10 hours [10:17:34] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 4 seconds [10:18:04] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [10:35:38] PROBLEM - Solr on vanadium is CRITICAL: Average request time is 400.06927 (gt 400) [11:06:58] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [11:07:58] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.001 second response time 
on port 8123 [11:45:51] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [11:46:31] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [11:49:42] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [11:54:41] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.008 second response time on port 8123 [12:03:12] New review: MaxSem; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43029 [12:06:41] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [12:06:51] PROBLEM - swift-account-reaper on ms-be12 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [12:07:43] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.001 second response time on port 8123 [12:08:22] PROBLEM - swift-account-reaper on ms-be11 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [12:11:11] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [12:49:22] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [12:50:12] RECOVERY - MySQL disk space on neon is OK: DISK OK [12:58:24] !log restarted icinga (not sure why it stopped, log file simply said 'caught TERM signal', in any case it complained that Error: Could not create external command file '/var/lib/nagios/rw/nagios.cmd' as named pipe ) [12:58:49] *cough* [12:59:15] ah no morebots [13:02:50] can't apparently get on wikitech instance any more [13:02:55] log by hand I guess [13:07:13] bah, now it's sooo quiet here [13:07:16] I like it [13:07:59] * apergos drops a pin [13:35:01] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 183 seconds [13:35:21] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 188 seconds 
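[Editor's note: the "Could not create external command file '/var/lib/nagios/rw/nagios.cmd' as named pipe" error apergos logs above typically means something that is not a FIFO is sitting at that path (or the rw/ directory is broken). A hedged recovery sketch — the function name is invented for illustration, and ownership/permissions are install-specific:]

```shell
#!/bin/sh
# If a stale regular file occupies the command-file path, remove it so
# icinga can recreate the FIFO on startup.
fix_command_pipe() {
    # $1: path to the external command file, e.g. /var/lib/nagios/rw/nagios.cmd
    if [ -e "$1" ] && [ ! -p "$1" ]; then
        rm -f "$1"
    fi
}

# Usage (no-op when the path is absent or already a FIFO):
# fix_command_pipe /var/lib/nagios/rw/nagios.cmd
# To create the pipe by hand instead (adjust owner/group for the install):
# mkfifo -m 660 /var/lib/nagios/rw/nagios.cmd
```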
[13:45:57] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 15 seconds [13:46:07] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 8 seconds [14:02:57] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 181 seconds [14:02:58] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [14:03:17] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 185 seconds [14:03:57] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.002 second response time on port 8123 [14:07:59] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 4 seconds [14:08:17] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [14:10:08] PROBLEM - Varnish HTTP bits on strontium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:10:57] RECOVERY - Varnish HTTP bits on strontium is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 0.001 second response time [14:29:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:30:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [14:41:03] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [14:41:53] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.003 second response time on port 8123 [14:54:43] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [14:54:53] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [14:55:19] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [14:56:03] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [14:56:53] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [15:00:43] 
PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [15:02:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:04:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.241 second response time [15:08:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.131 second response time [15:27:25] RECOVERY - MySQL disk space on neon is OK: DISK OK [15:27:58] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [15:28:15] PROBLEM - SSH on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:29:06] PROBLEM - Varnish HTTP bits on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:29:15] RECOVERY - SSH on palladium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:29:55] RECOVERY - Varnish HTTP bits on palladium is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 0.003 second response time [15:42:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:42:25] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [15:43:05] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [15:43:55] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [15:52:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.126 second response time [16:03:49] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [16:04:49] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [16:10:09] PROBLEM - 
Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:10:59] PROBLEM - Puppet freshness on knsq28 is CRITICAL: Puppet has not run in the last 10 hours [16:10:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [16:19:49] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [16:20:49] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [16:25:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:37:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [17:09:10] where can i find a copy of apache-fast-test? [17:09:28] i searched a little in git and i tried `apt-file search` in labs [17:10:23] mutante: ^ [17:16:28] <^demon> jeremyb_: I don't believe it's puppetized. I can pastebin it from fenari if you'd like. [17:16:50] do you have cluster access? [17:16:53] heh I was about to do the same thing [17:17:59] <^demon> https://noc.wikimedia.org/~demon/apache-fast-test [17:18:36] I wonder what else is in /home/w/bin/that's not in git someplace [17:18:59] <^demon> Too much, if I had to guess. [17:19:12] that's what I was afraid of [17:32:14] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:33:04] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.067 second response time [17:33:20] New patchset: Demon; "Updating gerrit to 2.5.2-1506-g278aa9a" [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/51280 [17:34:50] New review: Demon; "New(er) war can be found: https://integration.mediawiki.org/nightly/gerrit/wmf/gerrit-2.5.2-1506-g27..." 
[operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/51280 [17:35:03] * jeremyb_ fetches [17:35:16] ^demon: danke [17:35:19] <^demon> yw [17:35:23] ah, written by jeff [17:35:27] i figured it was older than that [17:36:00] nope [17:38:31] i wonder if jeff likes perl :P [17:40:32] jeremyb_: perl is the best thing in the whole town! [17:41:04] <^demon> I believe Jeff may be exaggerating. [17:41:20] I like perl scripting better than shell scripting! [17:41:23] That script is pretty neat though [17:41:48] it was fun, got to learn a little about threads [17:47:14] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 184 seconds [17:47:31] hey would you want to put that in say operations/tools or something? [17:47:34] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 188 seconds [17:47:53] since people like it enough to want it [17:48:11] apergos: apergos sure. nobody has ever really reviewed it or anything afaik though [17:48:23] and yet there it is in /home/w/bin [17:48:26] jsut sayn :-P [17:48:35] Putting it in puppet and making sure it's there on all bastion hosts and stuff would be nice ;) [17:49:23] alright alright, let's put in an RT ticket [17:49:41] then people can notice and flame it or whatever [17:49:41] :-P [17:49:49] :-D [17:50:05] ha, you are on RT duty [17:50:13] so you can handle that ticket :-P [17:52:03] blahrgh [17:52:09] i'm going to do it with a cowsay... [17:52:31] heh [17:52:37] we would expect nothing less [17:52:54] http://trouser.org/cowsay?f=cow&msg=go%20apache-fast-test%20kthxbye [17:52:57] there you go [17:53:10] :-D [17:54:15] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [17:54:17] New patchset: Aaron Schulz; "Removed unused global in jobs-loop (only local one matters)." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/51363 [17:54:55] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [17:55:32] joy [17:55:38] and also: Feb 28 17:55:05 neon rsyslogd-2177: imuxsock begins to drop messages from pid 16414 due to rate-limiting [17:56:31] apergos: https://gerrit.wikimedia.org/r/51363 easy review [17:56:38] not urgent of course [17:56:51] I shall look now [17:56:58] I guess we have a meeting in [17:56:59] uh [17:57:07] 4 minutes! brb [18:02:13] apergos ? [18:02:52] sec [18:03:00] had to take a pit stop [18:04:48] there's only you in there and it's really noisy [18:04:51] dunno what that means [18:05:08] woosters: [18:05:32] can u hear me? [18:05:45] apergos [18:06:09] no I could not hear you [18:06:21] I could only hear some sort of really awful bass rhythm beat going on there [18:06:56] now I hear silence [18:07:06] let me skype u intead [18:07:08] also it's only you and not the room [18:07:10] instead [18:07:32] it's harder to hear people as a group on skype [18:09:56] New patchset: Asher; "remove server name labels as it changes hashing behavior" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51364 [18:10:34] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51364 [18:12:31] guys whatever that sound is, it's not the sound of people talking [18:12:35] or of anything useful [18:12:52] dudes! you are going to blow my ears out [18:12:54] woosters: [18:16:05] RECOVERY - MySQL Slave Running on db35 is OK: OK replication [18:18:24] woosters: I am taking the headphones off cause the sound is actually driving me crazy [18:18:36] please ping me when you want me to check the audio again [18:19:07] PROBLEM - MySQL Slave Running on db35 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 142 for key PRIMARY on query.
Default dat [18:19:25] PROBLEM - SSH on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:19:45] PROBLEM - Varnish HTTP bits on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:19:53] !log reedy synchronized php-1.21wmf10/extensions/WikimediaMaintenance [18:20:15] RECOVERY - SSH on palladium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:20:35] RECOVERY - Varnish HTTP bits on palladium is OK: HTTP OK: HTTP/1.1 200 OK - 637 bytes in 0.001 second response time [18:20:41] notpeter: when upgrading from mysql 5.1 -> mysql or mariadb 5.5, you must run mysql_upgrade [18:21:58] !log reedy synchronized php-1.21wmf10/cache/interwiki.cdb 'Updating 1.21wmf10 interwiki cache' [18:22:21] Reedy: there is no morebots right now [18:22:31] it's a lie anyway :) [18:22:33] <^demon> Need more bots :\ [18:22:34] it departed earlier and I don't appear to have working creds on wikitech any more [18:22:41] (which I need to find out about too) [18:24:35] !log reedy synchronized php-1.21wmf10/cache/interwiki.cdb 'Updating 1.21wmf10 interwiki cache' [18:27:10] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [18:27:10] RECOVERY - MySQL disk space on neon is OK: DISK OK [18:36:12] so Aaron I will look at that change now but if it needs more than half an eye it will get put off (remotely following meeting) [18:36:40] RECOVERY - MySQL Slave Running on db35 is OK: OK replication [18:38:20] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:38:30] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:38:30] PROBLEM - mysqld processes on db35 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [18:40:34] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51363 [18:41:28] !log rebuilding db35 from a hotbackup of db55 (s5); testing a new build of mariadb 5.5.29 
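binasher's note above — a MySQL 5.1 to MySQL/MariaDB 5.5 upgrade must be followed by running `mysql_upgrade` against the new server — can be sketched as a small wrapper. This is a hypothetical illustration, not anything from the puppet repo; the helper names and defaults are assumptions:

```python
import subprocess

def mysql_upgrade_cmd(user="root", socket=None, force=False):
    """Build the mysql_upgrade invocation for an in-place 5.1 -> 5.5 upgrade.

    mysql_upgrade checks every table for incompatibilities with the new
    server version and upgrades the system tables; it must be run against
    the *new* mysqld once it is started on the old datadir.
    """
    cmd = ["mysql_upgrade", "--user=%s" % user]
    if socket:      # connect over the local socket, typical on a db host
        cmd.append("--socket=%s" % socket)
    if force:       # re-run even if mysql_upgrade was already executed
        cmd.append("--force")
    return cmd

def run_upgrade(**kwargs):
    # Would actually execute on the db host; shown only for completeness.
    return subprocess.call(mysql_upgrade_cmd(**kwargs))
```

Skipping this step is what typically produces replication errors and broken system tables after the version bump.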
[18:41:39] binasher: is the hashing based on the labels? [18:41:49] AaronSchulz: if there are labels, yes [18:41:55] that's actually good [18:42:05] not for an initial switch, but in general [18:42:20] it makes it easy to swap out a server without changing the hashing [18:42:23] definitely [18:42:34] I'd really like to see that actually [18:42:36] that's what it's intended for [18:42:43] but… for an initial switch.. :( [18:42:54] if we want to do it [18:43:08] we'll have to bite the invalidation at some point anyways though [18:43:17] yeah, I was going to say that [18:43:28] aaron, change live once puppet runs on those hosts. [18:43:30] I was talking to RobH about this a few days back [18:44:13] AaronSchulz: https://github.com/twitter/twemproxy/blob/master/notes/recommendation.md [18:44:27] see the section "Node Names for Consistent Hashing" [18:45:30] it's great if mc1015 dies and we want to swap a spare, mc1018 permanently into it [18:46:06] looking! [18:46:55] oh yeah [18:47:07] how obvious and yet not done before [18:50:47] binasher: so how soon could this be done? [18:51:01] it's a feature missing from the current pecl setup [18:55:18] AaronSchulz: if we agree that it's worth going to the name-based hashing and accepting that an initial deploy will invalidate a bunch of keys, i think it can be done much sooner [18:55:48] as that eliminates an entire class of testing :) [18:56:10] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:56:20] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK: HTTP/1.1 200 OK - 633 bytes in 0.002 second response time [18:56:57] !log reedy synchronized php-1.21wmf10/cache/interwiki.cdb 'Updating 1.21wmf10 interwiki cache' [18:59:26] AaronSchulz: maybe we should just do it like that.. tear off the bandaid!
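The twemproxy recommendation binasher links ("Node Names for Consistent Hashing") is the idea being discussed: hash keys against stable logical node names instead of host:port pairs, so the hardware behind a name can be swapped (mc1015 dies, spare mc1018 takes over) without remapping any keys. A rough Python sketch of the technique — the ring construction and class here are illustrative assumptions, not twemproxy's actual implementation:

```python
import bisect
import hashlib

def _hash(s):
    # md5-based point on the ring, stable across processes
    return int(hashlib.md5(s.encode()).hexdigest()[:8], 16)

class HashRing:
    """Consistent-hash ring keyed by *logical* node names.

    Keys hash against names like 'mc1015'; the host:port behind a name
    can change without moving a single key. Hashing host:port directly
    would remap a slice of the keyspace on every hardware swap.
    """
    def __init__(self, nodes, vnodes=100):
        # nodes: {logical_name: "host:port"}
        self.nodes = dict(nodes)
        self.ring = sorted((_hash("%s-%d" % (name, i)), name)
                           for name in self.nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def get(self, key):
        # first ring point clockwise from the key's hash, wrapping around
        i = bisect.bisect(self.points, _hash(key)) % len(self.ring)
        name = self.ring[i][1]
        return name, self.nodes[name]

    def replace_server(self, name, new_addr):
        # swap a spare in behind the same name: ring is untouched
        self.nodes[name] = new_addr
```

As noted in the channel, an initial switch *to* named nodes still invalidates keys once — the benefit only accrues for swaps made after that.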
[19:00:30] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:00:36] jfdi is the best [19:03:22] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK: HTTP/1.1 200 OK - 633 bytes in 0.646 second response time [19:03:52] notpeter: re: search [19:04:21] do you remember how '*?' is evaluated for index inclusion? [19:04:38] it's like a case statement [19:04:45] it's everything that's not covered anywhere else [19:04:52] <^demon> *? -> [black box] -> included in index [19:04:59] <^demon> [black box] is how I describe all of lsearchd. [19:05:05] yeah, what ^demon sadi ;) [19:05:07] yep [19:05:38] hahah [19:06:05] ok, so its everything but does automagically exclude any *.nspart? index included elsewhere? [19:06:16] I habeeb so [19:06:49] that is what shows up on the pool4 frontends [19:11:32] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:12] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:13:12] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:14:38] New patchset: Asher; "relocating the main commons search-index within pool4 to the relatively empty spelling index hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51380 [19:16:12] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:18] New patchset: Asher; "relocating the main commons search-index within pool4 to the relatively empty spelling index hosts - in pmtpa only for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51380 [19:20:02] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:21:22] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 0.001 second response time [19:23:05] New patchset: Reedy; "Remove wgLegalTitleChars, same as default" [operations/mediawiki-config] 
(master) - https://gerrit.wikimedia.org/r/51382 [19:24:28] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51382 [19:24:39] ACKNOWLEDGEMENT - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours LeslieCarr poop [19:25:14] lol [19:25:20] :-D [19:25:25] oh i didn't realize it would say the comment i put in [19:25:25] oops [19:25:26] haha [19:25:40] Ryan_Lane: oh great ldap knowing person :) [19:25:42] publically logged too [19:25:42] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:52] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:26:29] Ryan_Lane: so i want to allow all wmf group admin rights on icinga [19:26:39] Ryan_Lane: but i am not sure where that is configured - can you point me in the right direction ? [19:27:45] Hm. Where should I point a general 'Tool Labs' problem email? (think: maintainer/webmaster/etc) [19:27:46] <^demon> where wmf group is configured? [19:27:49] <^demon> LeslieCarr: ^ [19:28:17] ^demon where the privilege restriction is configured [19:28:46] <^demon> Ah, dunno that bit. [19:29:00] how does nagios/icinga decide there is a problem with search ? Is it running a search query ? Is there a config file or page where I can see the query ? [19:29:01] me neither :) [19:29:14] xyzram: it's configured in puppet [19:29:22] Which file ? [19:29:41] so search.pp has monitor_service [19:30:26] and then to see what that command means, templates/icinga/checkcommands.cfg.erb [19:32:26] LeslieCarr: thats how it works right now [19:32:36] LeslieCarr: from the authentication perspective anyway [19:32:54] I don't know how the admin part of icinga works, though [19:33:05] I thought if you could auth, then you have admin [19:33:10] oh ? 
hrm -- so mwalker's account isn't being allowed to submit commands [19:33:11] yeah hrm [19:33:13] PROBLEM - MySQL Replication Heartbeat on db55 is CRITICAL: CRIT replication delay 193 seconds [19:33:13] PROBLEM - MySQL Slave Delay on db55 is CRITICAL: CRIT replication delay 197 seconds [19:33:14] hrm hrm hrm [19:33:26] I don't actually know, though :) [19:33:32] obviously he's not clicking it correctly [19:33:32] ;) [19:33:35] i'll keep searching [19:33:37] but from the auth POV, it's current;y set up for wmf group [19:33:46] command_line $USER1$/check_tcp -t 90 -w 10 -p 8123 -H $HOSTADDRESS$ ? [19:36:43] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:37:54] LeslieCarr: check_lucene is defined to invoke check_tcp and the latter is defined to invoke itself; where is check_tcp ? [19:38:15] check_tcp is a file located in the nagios-plugins package i believe [19:38:20] they're all written in python [19:38:33] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 0.001 second response time [19:38:33] Does it have doc/manpage ? [19:38:44] you can install nagios-plugins and nagios-extras on some labs instance to find it [19:39:01] Do you know what it does ? [19:39:03] your guess is as good as mine :) there's no man page [19:39:14] OK [19:39:14] i'm guessing it just does a tcp check [19:39:17] and if it receives syn it's ok [19:39:19] i mean synack [19:39:29] is my best guess [19:39:44] Ok, thanks. [19:41:05] if you look at other monitor_services there are more complex monitoring commands we can invoke [19:41:24] <^demon> I need to do some for gerrit prolly. [19:41:50] <^demon> Monitoring the log to know if replication is failing, for example. 
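The guesses above about `check_tcp` are essentially right: the plugin (a C program from the nagios-plugins package) attempts a TCP connect and reports OK if the handshake completes within the timeout. A hedged Python approximation of that behavior — not the real plugin, which additionally supports warning thresholds, send/expect strings, and SSL:

```python
import socket
import time

def check_tcp(host, port, timeout=10.0):
    """Minimal check_tcp-style probe.

    Returns (ok, seconds). Mirrors the channel's reading of the plugin:
    a completed TCP handshake (i.e. we got the SYN-ACK and connect(2)
    returned 0) within the timeout is OK; refusal or timeout is CRITICAL.
    The port/timeout mirror `check_tcp -t 90 -w 10 -p 8123` from the
    lucene check discussed above.
    """
    start = time.time()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.time() - start
    except OSError:  # covers connection refused and socket.timeout
        return False, time.time() - start
```

So the search-pool alerts above only prove the lsearchd port accepts connections, not that queries return sensible results; a richer `monitor_service` command would be needed for that.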
[19:42:23] New patchset: Lcarr; "adding khorn and mwalker to having permissions for icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51387 [19:45:49] New patchset: Lcarr; "adding khorn and mwalker to having permissions for icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51387 [19:46:31] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51387 [19:53:35] New patchset: Asher; "internally resharding the main indices of major projects within pool4 to fairly unused spell nodes [pmtpa only - for testing]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51380 [19:55:10] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51380 [19:55:47] it's c at least for nagios. connect(2), and expect 0 return value [19:56:05] ACKNOWLEDGEMENT - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours Matt Walker bad box! [19:56:13] :) [19:56:41] http://fossies.org/dox/nagios-plugins-1.4.16/check__tcp_8c_source.html here's a copy on the web (I had the source lying around on my hd from some version or other though) [19:56:51] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [19:56:53] no idea whether that's most current etc [19:57:02] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:58:51] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:59:51] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:00:41] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:03:11] Ops -- food and meeting are happening in Chambers right now. [20:03:51] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:15] is there gavel? 
[20:04:15] !log resharding search-pool4 in pmtpa - restarted lsearchd on all local nodes and indexers on searchidx2 [20:04:20] a gavel* [20:04:41] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:05:51] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 0.001 second response time [20:11:55] PROBLEM - Puppet freshness on tmh1001 is CRITICAL: Puppet has not run in the last 10 hours [20:11:55] PROBLEM - Puppet freshness on tmh1002 is CRITICAL: Puppet has not run in the last 10 hours [20:11:55] PROBLEM - Puppet freshness on tmh1 is CRITICAL: Puppet has not run in the last 10 hours [20:12:55] PROBLEM - Puppet freshness on tmh2 is CRITICAL: Puppet has not run in the last 10 hours [20:38:26] New patchset: Asher; "moving search-pool4 queries to pmtpa to test resharding" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51407 [20:38:46] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51407 [20:51:33] New patchset: Asher; "testing a search-pool5 in pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51483 [20:57:15] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [20:57:55] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [20:59:42] !log disabled editing on wikitech.wikimedia.org [21:01:28] !log changing A record for wikitech.wm.o and changing labsconsole.wm.o into a cname to wikitech [21:01:49] Ryan_Lane no logbot? :o [21:01:56] ah [21:01:57] hahaha [21:01:59] right.... 
[21:02:06] I turned off editing [21:02:10] I'll log when I finish, I guess [21:02:14] it's here [21:02:19] but doesn't work :( [21:02:23] wikitech is read only [21:02:28] oh lol [21:02:30] true [21:04:47] so access to the linode instance where morebots runs [21:05:00] don't do anything with it right now [21:05:03] the wikitech creds on fenari didn't work for me today [21:05:06] I'm merging wikitech and labsconsole [21:05:20] at some point it woul dbe good to have creds to restart that bot [21:05:26] wherever it runs now [21:05:29] it'll be on wikitech-static now [21:05:34] the creds are under racktables [21:05:38] ah [21:06:05] good to know (I didn't realize it had already been moved) [21:06:06] thanks [21:07:03] RECOVERY - MySQL Replication Heartbeat on db55 is OK: OK replication delay 0 seconds [21:07:13] RECOVERY - MySQL Slave Delay on db55 is OK: OK replication delay 0 seconds [21:12:10] New patchset: Ryan Lane; "Rename labsconsole to wikitech" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51542 [21:12:16] !log olivneh synchronized php-1.21wmf10/extensions/EventLogging 'Updates to JavaScript API' [21:12:27] that'll need to be relogged ;) [21:14:16] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51542 [21:14:45] Ryan_Lane: no problem [21:14:58] New patchset: Pyoungmeister; "sharding out pool5 for pmtpa search" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51546 [21:15:10] Acknowledged inability to write to wikitech, Master. 
[21:15:18] :D [21:15:47] binasher: ^^^^^ [21:16:27] <^demon> We should totally make morebots say that when it can't work :p [21:17:13] notpeter: this is responsible for some of the pmtpa replag - https://bugzilla.wikimedia.org/show_bug.cgi?id=45584 [21:18:29] <^demon> binasher: I just saw ^ [21:18:33] <^demon> That code is anciennntttttt [21:19:14] how did i manage to not notice that for so long [21:20:09] I don't even know, dog [21:22:01] Does it just need/want batching? [21:22:41] # Set query limit [21:22:41] if ( !empty( $this->history['limit'] ) ) { [21:22:41] $opts['LIMIT'] = intval( $this->history['limit'] ); [21:22:41] } [21:22:42] heh [21:24:37] New review: Asher; "(1 comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/51546 [21:24:53] Reedy: yup! [21:31:27] !log create missing dump directory on streber, enables "RT-shredder" plugin which comes with RT since 3.8 [21:34:21] http://etherpad.wikimedia.org/mobile-ops-syncup-28feb2013 [21:46:09] New patchset: Pgehres; "Fixing my last name for icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51571 [21:46:46] New patchset: Pyoungmeister; "setting up search pool5 in pmtpa" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51572 [21:47:11] oops sorry pgehres [21:47:12] :-/ [21:47:20] no worries [21:47:33] you could also just change your name to match my typo [21:47:41] thats a lot of paperwork [21:47:47] !log olivneh synchronized php-1.21wmf10/extensions/GuidedTour 'Update to split test (1/3)' [21:48:02] !log olivneh synchronized php-1.21wmf10/extensions/E3Experiments 'Update to split test (2/3)' [21:48:18] !log olivneh synchronized php-1.21wmf10/extensions/GettingStarted 'Update to split test (3/3)' [21:49:15] :) [21:49:30] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51571 [21:50:04] Change abandoned: Pyoungmeister; "in favor of https://gerrit.wikimedia.org/r/#/c/51572/" [operations/puppet] 
(production) - https://gerrit.wikimedia.org/r/51546 [21:50:13] LeslieCarr: i am not usually this annoying, only when my phone explodes for 30 minutes and I can't stop it [21:50:17] :-) [21:50:55] hehehe [21:50:57] :) [21:51:31] before i do it, does anyone think we should not poweroff spence ? [21:52:16] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 194 seconds [21:52:26] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 199 seconds [21:55:20] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51572 [21:55:46] pgehres, is that a "slow explosion"? :) [21:56:22] more like a mortar attack on a cron [21:56:28] * Platonides imagines the phone exploding at slow motion, matrix-like [21:56:31] 8 texts, every 5 minutes [21:56:35] hehehe [21:56:47] better than one email per minute [21:56:57] oh, there were emails too [21:57:10] gmail's mute is helpful [21:57:13] Platonides: Holy crap, how much email do you get. OTRS doesn't even get that much! [21:57:19] *get? [21:57:34] (at least, info-en doesn't) [21:59:29] binasher: convoluted code is convoluted :/ [21:59:41] :) [22:01:17] I could just put LIMIT 5 on it, but then apergos would likely want to beat me [22:02:53] LeslieCarr: I had to restart icinga today [22:02:56] and I don't know why [22:03:00] oh ? [22:03:01] hrm [22:03:03] what happened ? [22:03:05] luckily I had notification from watchmouse [22:03:27] dunno, the log said it received TERM and then it failed to restart (see sal) [22:03:47] it complained it couldn't open some file or pipe and maybe I should check for an already existing instance [22:04:25] limit 5 of what, Reedy?
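The alternative to slapping a blunt LIMIT on the bug 45584 export query (and the "query continuation" Reedy is trying to work out) is keyset pagination: repeat a bounded query, resuming from the last row's sort key. A sketch under assumed names — `page`/`id` stand in for whatever columns the real ORDER BY uses, and `run_query` is a hypothetical stand-in for the database call:

```python
def fetch_all(run_query, batch=500):
    """Keyset-pagination sketch for an otherwise unbounded export query.

    `run_query(after, limit)` must ORDER BY (page, id) and treat `after`
    as WHERE (page, id) > after. Each batch is bounded, so replication
    lag from one giant long-running read is avoided; the tricky part the
    channel notes — extra ORDER BYs and WHEREs — is exactly what makes
    choosing the continuation columns hard in the real code.
    """
    after = None
    while True:
        rows = run_query(after=after, limit=batch)
        if not rows:
            return
        for row in rows:
            yield row
        # last row's sort key becomes the continuation token
        after = (rows[-1]["page"], rows[-1]["id"])
```

Unlike OFFSET-based paging, resuming by sort key stays cheap no matter how deep into the result set the export is.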
mutante: I see you've added me to the paging list, but ironically my cellphone provider changed the day before you added me :S [22:04:33] revisions for export [22:04:38] (see also RT ticket) [22:04:49] yeah, I would beat you [22:04:56] apergos: https://bugzilla.wikimedia.org/show_bug.cgi?id=45584 [22:05:37] that code predates me by years of course [22:05:58] Yeaah :/ [22:06:22] Trying to work out what we should use for query continuation with the numerous conditions already in place [22:06:27] ugh [22:07:27] with all the different order by and other wheres, I could see it starting to do other stuff we don't really want [22:10:18] !log stopped gmetad on neon for testing [22:10:56] RECOVERY - mysqld processes on db35 is OK: PROCS OK: 1 process with command name mysqld [22:11:02] well I knew this issue would come up sooner or later [22:11:04] *sigh* [22:11:17] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [22:11:22] !started gmetad on neon again [22:12:46] Jeff_Green: no logging right now [22:13:06] ha [22:13:15] New patchset: Asher; "change from explicit mariadb version to present while testing multiple versions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51574 [22:14:18] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51574 [22:15:04] New patchset: Tim Starling; "Revert "For bug 44570 - Make the parser cache expire at 30 days"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51575 [22:16:10] Change merged: Ryan Lane; [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/51280 [22:18:32] New patchset: Tim Starling; "Revert "For bug 44570 - Make the parser cache expire at 30 days"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51575 [22:18:46] PROBLEM - MySQL Slave Delay on db35 is CRITICAL: CRIT replication delay 1026 seconds [22:18:55] Change merged: Tim Starling;
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51575 [22:19:17] TimStarling: ;) [22:19:44] !log tstarling synchronized wmf-config/InitialiseSettings.php [22:20:11] here I am in a mobile team meeting saying that they should talk to platform about things that might seriously damage site performance [22:20:54] seems a bit hypocritical when you let through changes like that [22:22:33] sure, an order of magnitude shorter parser cache expiry, that couldn't possibly do any harm could it? [22:23:45] depends on the distribution [22:24:46] RECOVERY - MySQL Slave Delay on db35 is OK: OK replication delay 0 seconds [22:25:26] TimStarling, like https://gerrit.wikimedia.org/r/51577 ?:P [22:28:26] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [22:29:11] ^demon: done [22:29:16] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [22:29:21] <^demon> \o/ [22:33:12] I'm still confused how that is any/much different from purging stuff over 30 days old via the purge parser cache script [22:49:03] !log upgrading mariadb boxes to 5.5.29 [22:49:39] New patchset: Asher; "adding search_pool5 to lvs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51579 [22:50:51] whoa, https://wikitech.wikimedia.org/view/How_to_deploy_code redirects to blank page on labs. I am a headless chicken without that page! [22:51:19] it's now located at wikitech-old whilst ryan is in the midst of updating [22:52:15] New patchset: Asher; "adding search_pool5 to lvs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51579 [22:53:18] PROBLEM - mysqld processes on db59 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [22:53:55] mwalker we're done, so wait 100 seconds and go ahead [22:54:14] spagewmf: cool beans -- thanks kindly! 
[22:54:56] Hmm [22:56:55] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51579 [22:59:18] RECOVERY - mysqld processes on db59 is OK: PROCS OK: 1 process with command name mysqld [23:01:19] PROBLEM - MySQL Slave Delay on db59 is CRITICAL: CRIT replication delay 466 seconds [23:03:15] New patchset: Lcarr; "good night spence" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51582 [23:03:35] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51483 [23:03:56] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51582 [23:04:48] PROBLEM - mysqld processes on db52 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:05:20] PROBLEM - SSH on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:05:48] PROBLEM - Varnish HTTP bits on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:06:18] RECOVERY - SSH on palladium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [23:06:33] !log asher synchronized wmf-config/lucene.php 'sending search-pool4 traffic to pmtpa, where the index/host distribution has been rebalanced' [23:06:39] RECOVERY - Varnish HTTP bits on palladium is OK: HTTP OK: HTTP/1.1 200 OK - 637 bytes in 0.001 second response time [23:06:49] RECOVERY - mysqld processes on db52 is OK: PROCS OK: 1 process with command name mysqld [23:07:42] !log powering off spence [23:08:19] RECOVERY - MySQL Slave Delay on db59 is OK: OK replication delay 0 seconds [23:09:01] !log serveradminlog is not working [23:09:05] !log wikitech is down [23:09:15] !log wikitech isn't logging [23:09:20] !log why won't you log wikitech? [23:09:23] !log herp derp derp [23:09:25] :D [23:09:40] :) [23:10:03] !log logging is dead [23:10:25] notpeter: i hate you. [23:10:28] !log Rob and notpeter are talking out loud [23:10:32] RobH: why are you talking out loud? 
[23:10:35] RobH: I know :) [23:10:50] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [23:10:50] RobH: why are you talking out loud? [23:10:52] RobH: do your hand motions in irc [23:10:53] !log RobH is making the threatening i'm looking at you hand motion [23:10:57] you have to log this! [23:11:50] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [23:12:40] !log i miss logging [23:13:02] whining accepted, o whiney one [23:16:00] PROBLEM - MySQL Slave Delay on db39 is CRITICAL: CRIT replication delay 206 seconds [23:17:40] !log pgehres synchronized php-1.21wmf10/extensions/DonationInterface/ 'Updating DonatonInterface-langonly' [23:21:19] New patchset: Asher; "pulling db1043 and shifting watchlist q's to db1050" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51588 [23:22:38] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51588