[00:07:09] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:08:59] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 1 logical drive(s), 4 physical drive(s)
[00:11:15] MaxSem: heh
[00:20:49] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:21:39] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2192283 seconds since restart
[00:23:29] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:24:49] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:26:19] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[00:28:39] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2192703 seconds since restart
[00:32:49] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:34:29] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:34:39] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2193063 seconds since restart
[00:35:19] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[00:53:38] purging issues with both pmtpa and esams
[00:53:43] going to comment on bug now
[00:54:06] or make a new one i think actually
[01:09:11] https://bugzilla.wikimedia.org/56545
[01:10:35] back in a bit. maybe
[01:27:47] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:29:17] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:29:47] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2196363 seconds since restart
[01:30:07] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[01:32:47] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:33:17] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:33:37] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2196603 seconds since restart
[01:35:07] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[01:45:26] springle: check out comment 1 here: https://bugzilla.wikimedia.org/show_bug.cgi?id=56545#c1
[01:45:40] that loosely corresponds with this:
[01:45:47] (Nov 1) 10:51 logmsgbot: springle synchronized wmf-config/db-pmtpa.php 'depool first batch of pmtpa boxes to be decommissioned/shipped'
[01:46:12] did a multicast proxy get decommissioned a little prematurely?
[01:46:27] PROBLEM - Puppet freshness on lvs1002 is CRITICAL: No successful Puppet run in the last 10 hours
[01:47:27] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[01:48:08] hrm....I suppose looking at the page history of that page, it could have happened any time between Nov 1 and Nov 3:
[01:48:13] https://de.wikipedia.org/w/index.php?title=Massenmedien&action=history
[01:48:42] also, i think i was not able to confirm the other one from volker
[01:48:51] i wonder how it got fixed
[01:48:57] i didn't try purging any of these myself
[01:49:21] i had a loop through all DCs fetch each page 5 times. just bash+curl
[01:50:39] robla: those pmtpa boxes were only db slaves, and most are still actually running. don't know about multicast
[01:51:16] Something doesn't look right: https://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&m=vhtcpd_inpkts_sane&s=descending&c=Text+caches+esams&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[01:52:53] It looks like something is interrupting vhtcp traffic to esams
[01:55:37] PROBLEM - Host mediawiki-lb.pmtpa.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::8
[01:55:57] RECOVERY - Host mediawiki-lb.pmtpa.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 35.38 ms
[01:57:57] yup
[02:01:24] !log LocalisationUpdate failed: git pull of extensions failed
[02:01:46] Logged the message, Master
[02:06:27] springle: I think you and TimStarling are the only ones with the access and know-how to fix the cache purging issue that are in working hours right now. Sean, are you on it, or do we need to get you some help?
[02:08:04] based on the graph that Bryan posted above, there's a legit problem there
[02:08:36] ah...looks like it might actually be going again
[02:08:44] I will look at it
[02:09:19] thanks
[02:09:57] thanks TimStarling
[02:10:12] i'm not familiar with the text caches
[02:11:29] relax everyone, i'm here
[02:11:47] i'll proceed to assist tim by making uninformed guesses about what might be the issue
[02:11:58] ori-l: that was my plan!
[02:12:09] bd808: :)
[02:12:11] seriously....me too
[02:12:14] does anyone know what happened last time?
[02:12:51] https://bugzilla.wikimedia.org/show_bug.cgi?id=54647#c5 is the closest I've gotten so far
[02:13:08] well, clearly, it was a network issue
[02:13:18] That's all I was told sadly.
[02:13:58] The ganglia graphs seem to be showing some traffic now.
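The "loop through all DCs, fetch each page 5 times, just bash+curl" check robla mentions isn't shown in the log. A hypothetical sketch of the same idea — fetch a page through each DC's frontend with the canonical Host header and compare Last-Modified values to spot a stale cache — might look like this (the DC names and placeholder IPs are assumptions, not the production addresses):

```python
# Sketch of a cross-DC cache-staleness check, in the spirit of the
# bash+curl loop described above. Frontend IPs here are placeholders.
from urllib.request import Request, urlopen

DC_FRONTENDS = {
    "eqiad": "192.0.2.10",   # placeholder, not real frontend addresses
    "esams": "192.0.2.20",
    "ulsfo": "192.0.2.30",
}

def fetch_last_modified(dc_ip, host, path):
    """Fetch `path` from one DC's frontend, sending the canonical Host header."""
    req = Request("http://%s%s" % (dc_ip, path), headers={"Host": host})
    with urlopen(req, timeout=10) as resp:
        return resp.headers.get("Last-Modified")

def find_stale_dcs(last_modified_by_dc):
    """Given {dc: Last-Modified header}, return DCs whose copy is older
    than the newest copy seen anywhere (i.e. likely missed a purge)."""
    from email.utils import parsedate_to_datetime
    parsed = {dc: parsedate_to_datetime(lm)
              for dc, lm in last_modified_by_dc.items() if lm}
    if not parsed:
        return []
    newest = max(parsed.values())
    return sorted(dc for dc, ts in parsed.items() if ts < newest)
```

This matches how jeremyb later says he confirmed ulsfo was fine: articles were stale in the same DCs as each other, and fresh in the same DCs as each other.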
[02:14:39] PROBLEM - search indices - check lucene status page on search14 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 52969 bytes in 0.146 second response time
[02:14:39] It looks like there was a ~2h gap form 00:00 to 02:00
[02:14:56] s/form/from/
[02:15:58] * bd808 just got called to dinner
[02:16:10] i'm still reading up on the bug reports
[02:18:01] maybe it was related to the saturation and rerouting on the 26th
[02:25:26] lots of SmokeAlert packet loss emails about esams over the weekend
[02:31:51] well, dobson is not getting HTCP packets, and it is definitely running udpmcast.py and it has joined the relevant multicast group
[02:32:05] dobson had an icinga alert earlier
[02:32:29] but dobson is in pmtpa
[02:32:45] do we really send all HTCP traffic via pmtpa?
[02:32:59] why is ulsfo working then?
[02:33:02] I suppose it is not implausible
[02:33:24] the alerts jeremyb is referring to are: [21:54:45] PROBLEM - NTP peers on dobson is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:33:25] the dobson alerts were all NTP and recovered in ~a min
[02:33:51] also packet loss on emery
[02:34:10] where IIRC 'packet loss' == gaps in udp log seq ids
[02:34:26] obviously there shouldn't be more than one udpmcast in the world
[02:34:42] if there was, there would be duplicate packets sent
[02:42:28] jeremyb: why do you think ulsfo is working?
[02:44:07] UDP RcvbufErrors on dobson rises by about 5k / sec
[02:44:33] ok, so ulsfo is indeed working and is getting UDP packets directly from the apaches
[02:44:41] not from a relay
[02:44:58] ori-l: what does that mean?
[02:45:16] that something is not draining the buffer fast enough?
[02:45:57] you know dobson is a DNS server?
[02:47:26] oh, no. ok, disregard.
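The RcvbufErrors figure Tim quotes is a kernel UDP counter: it increments when a datagram arrives but the socket's receive buffer is full, i.e. the application isn't draining fast enough, exactly as ori-l guesses. The counters are cumulative, so a rate comes from differencing two samples. A small sketch of reading them from Linux's /proc/net/snmp (the 10-second sampling interval is an arbitrary choice for illustration):

```python
# Read the kernel's cumulative UDP counters and turn two samples into a
# drop rate, as in "UDP RcvbufErrors ... rises by about 5k / sec".
# On a live host you would read open("/proc/net/snmp").read() twice.

def parse_udp_counters(snmp_text):
    """Extract UDP counters from /proc/net/snmp-style text.

    The file holds a header line naming the fields and a value line per
    protocol; we zip the two "Udp:" lines together into a dict."""
    lines = [l.split() for l in snmp_text.splitlines() if l.startswith("Udp:")]
    header, values = lines[0], lines[1]
    return dict(zip(header[1:], (int(v) for v in values[1:])))

def rcvbuf_error_rate(sample_a, sample_b, seconds):
    """Receive-buffer drops per second between two samples `seconds` apart."""
    return (sample_b["RcvbufErrors"] - sample_a["RcvbufErrors"]) / seconds
```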
[02:48:00] it's not saturated
[02:48:47] CPU usage is not especially high
[02:48:56] it looked high at first, and then I realised it only has 2 cores
[02:49:28] anyway, it is unlikely that I can fix this
[02:51:23] what do you think might be going on?
[02:54:33] I don't know, I just know enough about networks to know when it's time to call Mark or Leslie
[02:55:02] it wouldn't be the first failure of multicast routing, would it?
[02:55:42] I can probably move udpmcast to eqiad
[02:55:53] then they can have their sleep
[02:57:31] i just came across https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging which has some useful tips, i assume you're aware but mentioning it just in case
[02:59:21] what's a quiet misc host at eqiad that I should put this on?
[03:00:12] tungsten
[03:00:26] slotted to be the graphite host, replacing professor
[03:00:41] right now it's just running statsd (udp metric aggregation) with very minimal traffic going on
[03:00:58] ok
[03:01:10] I don't know if it matters, but my irc logs say that the endpoint in esams is nescio now and not hooft. Hooft is still running a copy of the relay but the varnish boxes were seeing packets from nescio on 2013-10-16.
[03:04:55] yes, nescio
[03:05:01] I just copied the configuration from dobson
[03:05:28] !log moved udpmcast.py from dobson to tungsten to work around failure of multicast routing eqiad -> pmtpa
[03:05:45] Logged the message, Master
[03:07:29] yeah, ok, so it needs to be a server with an external IP address
[03:09:49] like... carbon
[03:10:17] or chromium, if we stick with sharing DNS recursors
[03:10:31] chromium has low CPU usage
[03:12:41] !log moved udpmcast to chromium since it actually has an external IP address
[03:12:55] so, it works now
[03:12:55] Logged the message, Master
[03:19:27] I see packets being logged in ganglia again.
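The purges being relayed here are HTCP CLR datagrams (RFC 2756) sent over UDP multicast, as described on the Multicast_HTCP_purging page linked above. As a rough illustration of what one of those datagrams looks like — the byte layout below is a simplified approximation of the RFC (nibble packing and AUTH handling are glossed over), not a byte-exact copy of what MediaWiki emits:

```python
# Approximate sketch of an HTCP CLR ("purge this URL") datagram per
# RFC 2756. Field layout is a simplification for illustration only.
import struct

HTCP_OP_CLR = 4  # CLR opcode per RFC 2756

def countstr(s):
    """RFC 2756 COUNTSTR: 2-byte big-endian length prefix, then the bytes."""
    b = s.encode("ascii")
    return struct.pack("!H", len(b)) + b

def build_htcp_clr(url, trans_id=0):
    # SPECIFIER: method, URI, version, request headers (empty here)
    specifier = countstr("HEAD") + countstr(url) + countstr("HTTP/1.0") + countstr("")
    # CLR op-data starts with a 2-byte reserved field before the specifier
    op_data = struct.pack("!H", 0) + specifier
    # DATA: its own length (incl. this 8-byte header), opcode/response byte
    # (nibble order glossed over), flags byte, 4-byte transaction id, op-data
    data_len = 8 + len(op_data)
    data = struct.pack("!HBBI", data_len, HTCP_OP_CLR, 0, trans_id) + op_data
    auth = struct.pack("!H", 2)  # empty AUTH block; its length includes itself
    # HEADER: total length, then major/minor version
    total_len = 4 + data_len + 2
    return struct.pack("!HBB", total_len, 2, 0) + data + auth
```

In production the relay's job is just to receive these datagrams on the multicast group and re-send them toward networks the multicast routing doesn't reach, which is why the box running it needs reachable addressing on both sides.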
[03:22:54] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[03:25:54] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:26:05] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000
[03:29:04] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:35:11] when I search bing for "htcp port", it can't even admit any possibility that i might have actually meant "htcp" and not made a typo
[03:35:33] I didn't know people binged.
[03:35:39] it just shows all pages containing "http", with no note to say that it has done so, and no link to actually search for htcp
[03:36:15] sure, I use bing, it is kinder to RequestPolicy since it doesn't track every outbound click
[03:36:52] google actually shows you links with the correct URL, but changes the href attribute on left/right/middle mouse down
[03:37:11] so when the click event occurs, it sends you via a tracking page
[03:37:14] PROBLEM - Host wikiversity-lb.pmtpa.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::7
[03:37:28] whereas bing doesn't track outbound clicks at all
[03:37:34] RECOVERY - Host wikiversity-lb.pmtpa.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 35.43 ms
[03:38:02] it is kind of halfway between google and duck duck go :)
[03:38:27] but yes, you have to start your queries with "+" or else it thinks you are stupid
[03:38:34] PROBLEM - Host wikidata-lb.pmtpa.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::12
[03:38:54] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[03:39:24] RECOVERY - Host wikidata-lb.pmtpa.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 16%, RTA = 31.05 ms
[03:39:34] PROBLEM - NTP peers on linne is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:41:24] RECOVERY - NTP peers on linne is OK: NTP OK: Offset -0.007041 secs
[03:41:54] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:42:27] Ah, Google dropped "+" syntax support, I think.
[03:42:37] I use Chrome as browser, but default to Duck Duck Go. It's decent.
[03:42:56] Its !bang syntax is nice.
[03:45:24] PROBLEM - Host foundation-lb.pmtpa.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::9
[03:45:44] RECOVERY - Host foundation-lb.pmtpa.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 35.58 ms
[03:48:54] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[03:51:55] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:53:54] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[03:56:54] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:15:03] !log LocalisationUpdate failed: git pull of extensions failed
[04:15:13] this is me
[04:15:19] Logged the message, Master
[04:19:48] !log l10nupdate failures are due to "No submodule mapping found in .gitmodules for path 'WikibaseDatabase'"; repo appears to have been deleted; fixed with git rm --cached WikibaseDatabase as l10nupdate on tin.
[04:20:02] Logged the message, Master
[04:30:48] !log LocalisationUpdate completed (1.23wmf1) at Mon Nov 4 04:30:48 UTC 2013
[04:31:02] Logged the message, Master
[04:33:23] yay
[04:40:40] !log LocalisationUpdate completed (1.23wmf2) at Mon Nov 4 04:40:40 UTC 2013
[04:40:56] Logged the message, Master
[04:49:54] Elsie: + for a google search? it's now just wrap in double quotes
[04:50:02] Elsie: they use to warn about the deprecation
[04:51:23] TimStarling: i knew ulsfo was working from last-modified headers as pasted on the bug. both unpurged some places articles were old in the same DCs as each other and new in the same DCs as each other
[04:51:54] where does SAL twitter relay run now? i guess it's still morebots?
[04:52:03] is apparently broken
[04:52:56] still morebots
[04:54:36] ahhh, ori-l is one of the morebots peoples :P
[04:54:50] per tools.wmflabs.org index :)
[04:54:59] no thank you
[04:55:12] i rewrote logmsgbot and it works very reliably
[04:55:25] morebots is a mess
[04:55:25] I am a logbot running on tools-exec-02.
[04:55:25] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[04:55:25] To log a message, type !log .
[04:56:06] haha
[04:56:13] ori-l: add me to the morebots tool then? :)
[04:56:50] OK
[05:02:36] morning
[05:02:44] what's going on?
[05:03:24] see ops list
[05:03:51] k, I literally just started working so I haven't seen anything yet
[05:03:51] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[05:03:59] nothing urgent afaik
[05:04:35] rpf most likely
[05:05:00] TimStarling: think it's worth repurging the interim? (like was already done last month) do we have a well defined start time for the breakage?
[05:05:18] the graph shows sometime wed or thurs and again a day later i think
[05:05:33] but it also looks like the graph hasn't been around too long
[05:05:36] https://ganglia.wikimedia.org/latest/stacked.php?m=vhtcpd_inpkts_sane&c=Text%20caches%20esams&r=week&st=1383541389&host_regex=
[05:05:41] do we have a log of BGP sessions? maybe that will correlate to the ganglia data
[05:06:03] state transitions you mean?
[05:06:09] juniper logs have these, yes
[05:06:11] i *think* BGP logging may have been turned on at some point?
[05:06:21] and observium will probably have them as well
[05:06:27] not too long ago. e.g. last 2 months
[05:06:28] observium has some logs, yes
[05:06:58] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:07:06] we had a flapping 10g wave yesterday
[05:07:34] between eqiad and sdtpa iirc
[05:08:08] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Nov 4 05:08:07 UTC 2013
[05:08:24] Logged the message, Master
[05:09:03] still flapping...
[05:10:34] jeremyb: how do i add you to a particular tool?
[05:10:44] maybe YuviPanda knows
[05:10:53] ori-l: https://tools.wmflabs.org/ has links for add and remove per tool
[05:10:57] sure
[05:11:01] yeah, what jeremyb said
[05:11:54] jeremyb: what's your username?
[05:12:07] flappped 1298 times since yesterday
[05:12:09] =ircnick
[05:12:17] autocomplete didn't think so
[05:12:22] huh
[05:12:52] well i'm already listed on tools.wmflabs.org...
[05:13:18] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000
[05:13:33] ok, added
[05:14:34] hrmmm, no enotif (no echo even)
[05:14:37] danke :)
[05:16:28] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:16:57] ori-l: so to clarify, morebots is the admin tool not the morebots tool? :)
[05:18:20] I can still remove you if you like
[05:18:29] anyways, added to morebots
[05:18:37] TimStarling: re: Google changing links -- they've made a standard out of tracking clicks now
[05:18:50] (I finished reading emails & backlog :)
[05:19:13]
[05:19:20] bing tracks link clicks as well
[05:19:45] http://www.whatwg.org/specs/web-apps/current-work/multipage/links.html#dfnReturnLink-0
[05:20:33] For URLs that are HTTP URLs, the requests must be performed using the POST method, with an entity body with the MIME type text/ping consisting of the four-character string "PING".
[05:20:37] * paravoid pukes
[05:21:18] well, if it's standard, it's easy to block
[05:21:34] oh the spec says that user agents should allow the user to block
[05:21:49] "User agents should allow the user to adjust this behavior, for example in conjunction with a setting that disables the sending of HTTP Referer (sic) headers. Based on the user's preferences, UAs may either ignore the ping attribute altogether, or selectively ignore URLs in the list (e.g. ignoring any third-party URLs)."
[05:22:18] IE will probably spoil it all by disabling it by default
[05:22:28] like DNT
[05:22:38] yeah I got the reference :)
[05:23:05] (grrrit-wm was down again, I restarted it, seems to work now)
[05:24:48] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[05:24:58] caniuse.com doesn't have the ping attribute
[05:25:01] too new I guess
[05:25:21] Google mobile search is getting faster - to be exact, 200-400 milliseconds faster! We are gradually rolling out this improvement to all browsers that support the attribute (currently, mobile Chrome and Safari).
[05:25:41] heh, nice overstatement
[05:27:58] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:35:13] (PS2) Ori.livneh: [WIP] Add Graphite module & role [operations/puppet] - https://gerrit.wikimedia.org/r/92271
[05:37:22] (CR) Ori.livneh: "Still WIP. Now uses Debian packages + FHS paths. The Debian Carbon package uses the same crummy init script that is not multi-instance awa" [operations/puppet] - https://gerrit.wikimedia.org/r/92271 (owner: Ori.livneh)
[05:37:28] if only we had a memcached module...
[05:37:47] too much baggage
[05:38:08] they won't conflict because the things that currently use the memcached module and graphite won't ever be applied on the same host anyway
[05:38:19] sure
[05:39:00] carbonctl, heh
[05:39:08] reminds of me something :P
[05:39:41] yeah, the reuse was conscious. it was a good idea that has been working well (/sbin/eventloggingctl)
[05:40:01] note that there's no upstart service module any more, that is a bit alarming but sensible ultimately
[05:40:21] the problem is that carbon/init is a task job that goes from starting to stopped
[05:40:34] which is normal, that's how task jobs are supposed to function
[05:40:49] it spins up carbon/cache & carbon/relay instances per what is configured
[05:41:15] the upstart service provider in puppet is not aware of task jobs, and it's not aware of instance jobs either
[05:41:34] so everything is managed via carbonctl
[05:41:35] why do you need multiple instances?
[05:41:50] it's explained in a comment
[05:42:01] carbon cache is CPU bound and the python gil prevents it from utilizing multiple cores
[05:42:03] ah, found it
[05:44:28] may I suggest to file a Debian bug about this use case at some point?
[05:45:03] I'd do it, but you're clearly better informed than me
[05:45:31] sure, will do
[05:46:01] on professor we have profiler-to-carbon sitting in front of carbon
[05:46:15] and it is indeed maxing out one core
[05:46:51] (changeset looks good so far, fwiw)
[05:47:07] glad to see the debian packages worked out
[05:47:09] yay, coo
[05:47:09] l
[05:47:23] oh, that reminds me
[05:47:52] python-carbon can be removed from apt.wm.o, i think
[05:48:05] (the debian package is 'graphite-carbon')
[05:48:15] professor is on lucid so it's not using that anyway
[05:48:27] right
[05:48:28] done
[05:49:28] thanks
[05:49:37] morebots: ping
[05:49:37] I am a logbot running on tools-exec-01.
[05:49:37] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[05:49:37] To log a message, type !log .
[05:50:01] jeremyb: make !logs structured data
[05:50:18] ori-l: elaborate? :)
[05:50:29] !log test
[05:50:43] Logged the message, Master
[05:51:26] make references to individuals, commits, and hosts machine-readable
[05:51:42] ori-l: were you around when I was proposing/discussing fedmsg?
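Ori's rationale for multiple carbon-cache instances — CPython's GIL lets only one thread execute Python bytecode at a time, so a CPU-bound daemon can't use more than one core from threads alone — is the standard motivation for running one process per core. A generic sketch of that pattern (this illustrates the idea only; it is not carbon's actual code, and it assumes the default fork start method on Linux):

```python
# Process-per-core pattern: CPython threads share one GIL, so CPU-bound
# work gets real parallelism only from multiple processes. This mirrors
# why one carbon-cache instance per core is run instead of one big one.
from multiprocessing import Pool, cpu_count

def cpu_bound_work(n):
    """Stand-in for one instance's CPU-bound workload."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_instances(workloads):
    """Fan one workload out per process, up to one per core, the way
    several carbon-cache instances would run side by side."""
    with Pool(processes=min(len(workloads), cpu_count())) as pool:
        return pool.map(cpu_bound_work, workloads)
```

With threads instead of a Pool, the same workloads would serialize on the GIL and peg a single core, which is the "maxing out one core" behavior described on professor.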
[05:51:44] so that you can easily grep the SAL
[05:52:00] paravoid: that's the fedora message bus built on top of 0mq, right?
[05:52:04] yes
[05:52:20] i found out about it through you, but i don't recall a discussion, so i probably missed that part
[05:52:29] it looked good, didn't look into it in detail
[05:52:29] not much of a discussion
[05:53:24] there's definitely a need to standardize inter-service communication somehow
[05:53:38] ori-l: errr, the input sent to morebots is free text. how do you suggest to force the users to format properly?
[05:53:43] stop using !log on IRC?
[05:53:55] it even has a mediawiki module already, so funny
[05:54:00] note: i have thought about the very same issue myself
[05:54:07] jeremyb: IRC-generated !logs are only a subset of messages
[05:54:17] many are generated by scripts
[05:54:22] right
[05:54:54] though it's the sort of thing that could and should be replaced by fedmsg i guess
[05:55:05] much as having two bots interact is quaint
[05:57:13] one of the things the core team has been talking about is replacing free-text wfDebugs with some sort of standard log message descriptor
[05:57:21] indicating type / severity / component
[05:57:32] i think bd808 had some ideas and was going to write an rfc
[05:58:02] !log test2
[05:58:14] Logged the message, Master
[06:02:15] wtf
[06:02:21] 0mq went from 2.2 to 13.1?
[06:02:50] pyzmq that is
[06:02:56] trying to remember who else did that :P
[06:03:18] that's why debian packages can reset the version number for the version number :)
[06:03:23] i forget what that's called
[06:03:40] [09:13:02 PM] hrmmm, no enotif (no echo even) <-- file a bug for the echo notifications and I can look at it sometime
[06:03:46] doesn't matter in this case though, only if you go backwards
[06:03:59] legoktm: uga, too many bugs to file!
[06:04:09] PyZMQ releases ≤ 2.2.0 matched libzmq versioning, but this will no longer be the case. To avoid confusion with the contemporary libzmq-3.2 major version release, PyZMQ is jumping to 13.0 (it will be the thirteenth release, so why not?).
[06:05:47] [09:55:04 PM] much as having two bots interact is quaint <-- there's a bug somewhere about issues when morebots and logmsgbot end up on different ends of netsplits
[06:06:55] shouldn't that be mitigated by having them connect to the exact same server?
[06:08:16] hm. they probably should be on dickson
[06:08:26] [10:08:19 PM] morebots is connected on rothfuss.freenode.net (FR)
[06:08:28] [10:08:23 PM] logmsgbot is connected on leguin.freenode.net (Umeå, SE, EU)
[06:09:42] (PS1) Jeremyb: tool labs exec_environ: add python-twitter package [operations/puppet] - https://gerrit.wikimedia.org/r/93426
[06:09:58] legoktm: no. see paravoid for why not
[06:10:16] can i get a merge? ^^
[06:10:22] why not?
[06:10:45] so then if we have an outage that affects dickson we can't !log ?
[06:11:25] why would we connect to dickson?
[06:11:44] jeremyb: you set fallback servers. so if it cant connect to the first server, it tries another one
[06:11:50] paravoid: because its inside labs?
[06:11:56] it's not inside labs
[06:12:01] it's in the sandbox vlan
[06:12:08] but still, what's the benefit?
[06:12:28] legoktm: dickson is dedicated not virtual
[06:12:35] i uh...don't know what that means.
[06:12:48] maybe paravoid wants to merge? :)
[06:12:58] we have a separate (new) network for "hosted" equipment
[06:13:01] legoktm: bare metal
[06:13:10] no, i meant "sandbox vlan"
[06:13:20] which is separate from production and separate from labs
[06:13:26] oh, ok
[06:13:32] legoktm: that means it has about as much access to our networks as any other outside host
[06:13:35] and it's sandboxed, there's some constraints into it for security purposes
[06:13:37] well, then it doesn't really matter
[06:14:08] legoktm: *but* if eqiad falls off the grid then dickson probably does too
[06:14:19] if eqiad fails, then we lose morebots too.
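ori-l's suggestion above — make !log references to individuals, commits, and hosts machine-readable so the SAL can be grepped and queried — could start as simply as extracting them from the free text with patterns. A hypothetical sketch (the regexes, field names, and the sample entry are made up for illustration; this is not how morebots works):

```python
# Hypothetical structuring of a free-text !log line into a queryable
# record, per "make references to individuals, commits, and hosts
# machine-readable". Patterns here are illustrative guesses.
import re

HOST_RE = re.compile(r"\b(?:db|mw|srv|es|cp|lvs|search)\d+\b")
SHA1_RE = re.compile(r"\b[0-9a-f]{7,40}\b")
GERRIT_RE = re.compile(r"https://gerrit\.wikimedia\.org/r/(\d+)")

def structure_log_entry(nick, message):
    """Turn one SAL line into a record indexable by host, commit, or change."""
    return {
        "who": nick,
        "message": message,
        "hosts": sorted(set(HOST_RE.findall(message))),
        "commits": SHA1_RE.findall(message),
        "changes": [int(n) for n in GERRIT_RE.findall(message)],
    }
```

Entries generated by scripts (sync tools, l10nupdate) could emit these fields directly, which is where something like fedmsg would slot in.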
[06:14:43] not true for morebots' old home
[06:14:55] anyway...
[06:15:15] should i break tool labs rules and leave the bot working or break the bot or does someone want to deploy that for me? :)
[06:16:04] jeremyb: why not just install the package locally for now with pip?
[06:16:36] (CR) Jeremyb: "note, this package is already installed on tools-login somehow. twitter works when run on tools-login but not when run from the grid" [operations/puppet] - https://gerrit.wikimedia.org/r/93426 (owner: Jeremyb)
[06:16:53] legoktm: ewww? i hate pip? idk
[06:17:19] I don't.
[06:17:26] legoktm: but i wonder how it ever broke to begin with. maybe some hosts have it and some don't?
[06:17:39] if that's the case thats a bug
[06:17:43] all hosts should be equal
[06:18:18] right, hence i sent a patch to equalize
[06:18:26] i was guessing could be manual installs
[06:18:55] the twitter breakage started when apergos booted morebots. so i guess before that point it was in a place with twitter and after not
[06:24:52] !log left morebots running on tools-login because it doesn't work on all grid hosts. see https://gerrit.wikimedia.org/r/93426
[06:25:07] Logged the message, Master
[06:30:42] PROBLEM - MySQL Replication Heartbeat on db57 is CRITICAL: CRIT replication delay 319 seconds
[06:31:12] PROBLEM - MySQL Slave Delay on db57 is CRITICAL: CRIT replication delay 330 seconds
[06:38:42] PROBLEM - MySQL Replication Heartbeat on db57 is CRITICAL: CRIT replication delay 319 seconds
[06:39:13] * springle glares at db57
[06:39:56] springle: does it hear you?
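The replication alerts above (and the "-1 second" recovery that follows) boil down to comparing a delay reading against warning/critical thresholds. A sketch of that check logic — the thresholds here are guesses for illustration, not the production values:

```python
# Sketch of the threshold logic behind alerts like
# "CRIT replication delay 319 seconds". Thresholds are illustrative.

def check_replication(delay_seconds, warn=180, crit=300):
    """Classify a replication-delay reading the way the paged checks do.

    A small negative reading (like db1007's "-1") just means the
    heartbeat row is fresher than the comparison clock; treat it as OK.
    """
    if delay_seconds >= crit:
        return "CRITICAL", "CRIT replication delay %d seconds" % delay_seconds
    if delay_seconds >= warn:
        return "WARNING", "WARN replication delay %d seconds" % delay_seconds
    return "OK", "OK replication delay %d seconds" % delay_seconds
```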
[06:40:02] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 313 seconds
[06:40:12] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 321 seconds
[06:40:15] hah, db1007 joins the party
[06:40:44] * jeremyb goes to sleep
[06:42:02] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds
[06:42:12] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay -1 seconds
[06:43:16] -1? :)
[06:43:28] it's that fast :)
[06:43:33] time slave
[06:49:22] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:51:12] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[06:53:52] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:54:22] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:55:12] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[06:55:24] (PS1) ArielGlenn: payments as empty hash for lvs service ips in eqiad [operations/puppet] - https://gerrit.wikimedia.org/r/93432
[06:55:42] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2215924 seconds since restart
[06:56:46] (CR) ArielGlenn: [C: 2] payments as empty hash for lvs service ips in eqiad [operations/puppet] - https://gerrit.wikimedia.org/r/93432 (owner: ArielGlenn)
[06:59:02] RECOVERY - Puppet freshness on lvs1002 is OK: puppet ran at Mon Nov 4 06:58:52 UTC 2013
[07:00:42] RECOVERY - Puppet freshness on lvs1005 is OK: puppet ran at Mon Nov 4 07:00:33 UTC 2013
[07:38:44] PROBLEM - Host dobson is DOWN: PING CRITICAL - Packet loss = 100%
[07:38:44] PROBLEM - Host emery is DOWN: PING CRITICAL - Packet loss = 100%
[07:38:44] PROBLEM - Host db53 is DOWN: PING CRITICAL - Packet loss = 100%
[07:38:44] PROBLEM - Host es4 is DOWN: PING CRITICAL - Packet loss = 100%
[07:38:54] PROBLEM - Host es10 is DOWN: PING CRITICAL - Packet loss = 100%
[07:38:54] PROBLEM - Host db9 is DOWN: PING CRITICAL - Packet loss = 100%
[07:39:04] PROBLEM - Host mw90 is DOWN: PING CRITICAL - Packet loss = 100%
[07:39:05] PROBLEM - Host mw5 is DOWN: PING CRITICAL - Packet loss = 100%
[07:39:05] PROBLEM - Host mw100 is DOWN: PING CRITICAL - Packet loss = 100%
[07:39:05] PROBLEM - Host mw41 is DOWN: PING CRITICAL - Packet loss = 100%
[07:39:14] PROBLEM - Host 208.80.152.132 is DOWN: PING CRITICAL - Packet loss = 100%
[07:39:24] PROBLEM - Host ms-be4 is DOWN: CRITICAL - Time to live exceeded (10.0.6.203)
[07:39:24] PROBLEM - Host es3 is DOWN: CRITICAL - Time to live exceeded (10.0.0.227)
[07:39:24] PROBLEM - Host mw105 is DOWN: CRITICAL - Time to live exceeded (10.0.11.105)
[07:39:24] PROBLEM - Host mw70 is DOWN: CRITICAL - Time to live exceeded (10.0.11.70)
[07:39:24] PROBLEM - Host srv290 is DOWN: CRITICAL - Time to live exceeded (10.0.8.40)
[07:39:24] PROBLEM - Host mw116 is DOWN: CRITICAL - Time to live exceeded (10.0.11.116)
[07:41:34] PROBLEM - Host pappas is DOWN: PING CRITICAL - Packet loss = 100%
[07:41:34] PROBLEM - Host grosley is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:04] PROBLEM - Host loudon is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:14] PROBLEM - Host db78 is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:20] wtf
[07:42:28] eeep
[07:42:33] argh
[07:42:36] network is split somehow
[07:42:38] everything at pmtpa
[07:42:49] the fiber is permanently cut now I guess
[07:42:54] but something is funky with the network
[07:42:54] RECOVERY - Host db34 is UP: PING WARNING - Packet loss = 86%, RTA = 35.49 ms
[07:43:04] RECOVERY - Host mw124 is UP: PING WARNING - Packet loss = 54%, RTA = 30.11 ms
[07:43:04] RECOVERY - Host mw43 is UP: PING WARNING - Packet loss = 54%, RTA = 30.11 ms
[07:43:04] RECOVERY - Host ps1-d2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 29.14 ms
[07:43:04] RECOVERY - Host mw73 is UP: PING OK - Packet loss = 0%, RTA = 26.72 ms
[07:43:05] RECOVERY - Host db67 is UP: PING OK - Packet loss = 0%, RTA = 26.83 ms
[07:44:07] pages
[07:45:14] RECOVERY - Host loudon is UP: PING OK - Packet loss = 0%, RTA = 27.28 ms
[07:45:14] RECOVERY - Host pappas is UP: PING OK - Packet loss = 0%, RTA = 26.75 ms
[07:45:15] RECOVERY - Host grosley is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms
[07:45:15] RECOVERY - Host db78 is UP: PING OK - Packet loss = 0%, RTA = 26.85 ms
[08:00:34] !log cr2-eqiad set xe-5/2/1 disable; 10g wave to cr1-sdtpa, flapping since yesterday, causing packet loss and outages
[08:00:55] Logged the message, Master
[08:02:24] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 28.0715908029 (gt 8.0)
[08:06:28] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 0.1014765625
[08:09:09] page? and yet no message here
[08:13:58] sms is acting up
[08:14:02] I'm getting pages but usually delayed
[08:14:06] and not from "Wikimedia" anymore
[08:14:10] but some random +38 number
[08:18:28] awesome...
[08:24:03] (PS1) Reedy: mediawikiwiki and testwikidatawiki to 1.23wmf2 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93433
[08:24:04] (PS1) Reedy: Non wikipedias to 1.23wmf2 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93434
[08:32:12] !log reedy synchronized php-1.23wmf2/ 'Ibcf77ed7f04c14a477d7cfd0e244929c552c3394'
[08:32:25] Logged the message, Master
[08:51:38] RECOVERY - search indices - check lucene status page on search14 is OK: HTTP OK: HTTP/1.1 200 OK - 52993 bytes in 0.155 second response time
[08:59:08] (PS1) ArielGlenn: remove dhcp entries for cp1021-1036, 1041-42, reclaimed (rt #5981) [operations/puppet] - https://gerrit.wikimedia.org/r/93436
[09:00:20] (CR) ArielGlenn: [C: 2] remove dhcp entries for cp1021-1036, 1041-42, reclaimed (rt #5981) [operations/puppet] - https://gerrit.wikimedia.org/r/93436 (owner: ArielGlenn)
[09:48:05] !log Jenkins: upgrading PHP_CodeSniffer from 1.4.6 to 1.4.7
[09:48:20] Logged the message, Master
[09:56:22] !log added python-apscheduler to apt.wikimedia.org
[09:56:25] hashar: ^
[09:56:32] !!!
[09:56:38] Logged the message, Master
[09:57:44] akosiaris: thank you! I will play with Zuul on labs and will hopefully be able to close down the RT ticket requesting some packages :D
[10:00:32] :-)
[10:02:01] thanks :)
[10:10:55] apergos: puppetd --enable; puppetd -vt --noop; puppetd --disable
[10:11:23] for streber and similar cases like them
[10:11:30] paravoid: don't disable, set high ospf metrics instead
[10:11:44] mark: hi
[10:11:48] hi
[10:11:54] * mark changes that
[10:11:54] that won't tell me why someone disabled puppet on a host
[10:12:07] mark: why?
[10:12:08] people do it for testing something so their stuff won't get overwritten
[10:12:10] so we can monitor it?
[10:12:22] yes [10:12:24] apergos: it might tell you what, which might help you find who :) [10:12:34] also less disruptive, although it doesn't matter much if the link was already flapping like hell anyway [10:12:49] hell = over a thousand times [10:12:59] we also had a multicast outage that Tim dealt with before I came online [10:15:41] nice [10:15:49] tim moving udpmcast literally hours before I'm decommissioning it [10:15:55] heh [10:19:31] oh you fixed the metric already? [10:19:41] as I announced [10:20:05] ah it was an /me and I missed it [10:20:20] log it or leslie will get confused by SAL :) [10:20:36] leslie doesn't read SAL [10:20:40] she reads rancid :) [10:21:13] !log Reenabled 10G wave between cr2-eqiad and cr1-sdtpa, set high OSPF metrics instead [10:21:26] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:21:26] Logged the message, Master [10:22:26] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 1 logical drive(s), 4 physical drive(s) [10:25:50] so pmtpa squids are also not receiving multicast now [10:47:25] !log Deactivated anycast PIM RP on cr2-eqiad [10:47:41] Logged the message, Master [10:48:56] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000 [10:51:56] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:52:36] !log Deactivated BFD on flapping link ae0 between cr1-eqiad and cr2-eqiad [10:52:52] Logged the message, Master [10:55:00] cr1-eqiad is in trouble [10:55:04] oh? 
[10:59:06] PROBLEM - Host wikipedia-lb.pmtpa.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::1 [10:59:08] PROBLEM - Host mediawiki-lb.pmtpa.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::8 [10:59:26] RECOVERY - Host wikipedia-lb.pmtpa.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 35.42 ms [10:59:46] RECOVERY - Host mediawiki-lb.pmtpa.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 35.48 ms [11:22:17] !log multicast routing to pmtpa restored for now [11:22:32] Logged the message, Master [11:24:31] (03PS1) 10Hashar: zuul: dependencies for Gearman based version [operations/puppet] - 10https://gerrit.wikimedia.org/r/93454 [11:30:04] what's up? [11:30:11] in trouble how? [11:51:21] (03PS1) 10Hashar: zuul: configuration for gearman [operations/puppet] - 10https://gerrit.wikimedia.org/r/93457 [11:59:46] (03PS1) 10Hashar: role::zuul::labs::gearman to test out in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/93458 [12:18:11] (03PS2) 10Hashar: role::zuul::labs::gearman to test out in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/93458 [12:22:04] (03PS3) 10Hashar: role::zuul::labs::gearman to test out in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/93458 [12:36:59] (03PS1) 10Mark Bergsma: Revoke Yuri's shell access [operations/puppet] - 10https://gerrit.wikimedia.org/r/93464 [12:38:04] (03CR) 10Mark Bergsma: [C: 032] Revoke Yuri's shell access [operations/puppet] - 10https://gerrit.wikimedia.org/r/93464 (owner: 10Mark Bergsma) [12:41:05] (03CR) 10Faidon Liambotis: [C: 031] "+1, so far." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/92271 (owner: 10Ori.livneh) [12:43:18] (03PS2) 10Hashar: zuul: configuration for gearman [operations/puppet] - 10https://gerrit.wikimedia.org/r/93457 [12:43:19] (03PS4) 10Hashar: role::zuul::labs::gearman to test out in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/93458 [12:44:31] (03CR) 10Hashar: "Fixed a typo in zuul.conf template: gearman_server -> gearman_server_start" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93457 (owner: 10Hashar) [13:01:41] !log Jenkins upgrading gearman (0.0.4 -> 0.0.5) plugin on gallium from http://repo.jenkins-ci.org/repo/org/jenkins-ci/plugins/gearman-plugin/0.0.5/ [13:01:58] Logged the message, Master [13:02:39] commuting [13:05:33] (03CR) 10Faidon Liambotis: [C: 032] Further constrain W0 X-CS setting to mobile Wikipedia, for now. [operations/puppet] - 10https://gerrit.wikimedia.org/r/92818 (owner: 10Dr0ptp4kt) [13:06:13] (03CR) 10Faidon Liambotis: [C: 032] Enable mobile redirect for wikimania2014.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/92671 (owner: 10MaxSem) [13:23:26] paravoid: do you have a moment to also review https://gerrit.wikimedia.org/r/#/c/93006/ pls? Analytics will be happy [13:24:55] sigh [13:26:50] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:26:50] PROBLEM - DPKG on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:27:40] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 1 logical drive(s), 4 physical drive(s) [13:27:50] RECOVERY - DPKG on searchidx1001 is OK: All packages OK [13:34:47] paravoid: we believe in you :) [13:34:56] (zero + analytics) [13:36:45] (03PS5) 10Hashar: role::zuul::labs::gearman to test out in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/93458 [13:48:41] (03PS1) 10Mark Bergsma: Add neighbor blocks for eqiad/esams link [operations/dns] - 10https://gerrit.wikimedia.org/r/93468 [13:49:18] LeslieCarr: mark: palladium and strontium have LACP over 4 interfaces and an MTU of 9192. We can probably do without both. I can also not touch any configuration on the switch and leave it as it is. Any standards we follow ? [13:49:39] yeah remove that [13:49:58] it's been problematic since ubuntu precise [13:50:09] so now if we need more than GigE, we just order 10G [13:50:17] and jumbo frames we've never used [13:50:20] how problematic ? [13:50:49] (03CR) 10Mark Bergsma: [C: 032] Add neighbor blocks for eqiad/esams link [operations/dns] - 10https://gerrit.wikimedia.org/r/93468 (owner: 10Mark Bergsma) [13:50:51] ok i will just use eth0 and have the other ports disabled with the description intact [13:51:05] yes [13:51:12] check if they are in the access-ports interface range [13:51:21] due to lacp their config may be a bit different right now [13:51:35] ok thanks for the hint [13:55:10] RECOVERY - Disk space on wtp1008 is OK: DISK OK [13:55:37] mark: setting up eqiad<->esams? [13:55:44] yes [13:55:48] \o/ [13:55:56] don't celebrate yet [13:55:59] i bet multicast won't work [13:56:08] heh [13:58:35] 97ms, not bad [14:07:16] any dev interested in investigating a possible weird bug? [14:08:16] Have you tried Bugzilla?
[14:09:50] akosiaris: paravoid: thanks to your help on packaging, I now got a Zuul instance in labs that is using Gearman :-]]] [14:12:03] :-) [14:12:50] Elsie: I'm poking here for some reasons [14:13:16] !log Configured OSPF/OSPF3 on cr1-eqiad:xe-4/2/2 <--> cr2-knams:xe-1/1/0 [14:13:32] Logged the message, Master [14:15:49] Vito: You probably want #wikimedia-tech or #mediawiki. [14:15:53] multicast does work now... [14:15:55] Though you're being so vague it's difficult to know which. :-) [14:16:01] \o/ [14:16:19] it pings! [14:16:36] omg, it works [14:16:42] Vito: this channel is merely for the server / infrastructure team. Not that much for bugs with the software :-D [14:16:59] so that's the right place ;) [14:17:03] Vito: you really want #wikimedia-tech and do file a bug at https://bugzilla.wikimedia.org/ [14:17:08] I think there's some error in the db [14:17:15] Vito: no. There are no devs here :-D [14:17:36] i'm starting to wonder if that old core switch in pmtpa is the culprit [14:17:56] akosiaris: you lucky bastard [14:18:02] ?? [14:18:19] not arguing with you but why ? [14:18:21] akosiaris: no weird haproxy and such for the new puppet setup [14:18:29] ah :-) [14:18:38] hashar: https://it.wikipedia.org/w/index.php?title=Speciale:Ripristina&target=Utente%3ADott_Alessandro_Sartore <-- there are no deleted nor suppressed revs, I'm wondering if something went wrong with the db [14:24:05] !log Killed udpmcast on chromium [14:24:17] Logged the message, Master [14:25:56] the link is doing ~ 900 Mbps now [14:26:20] oh wow [14:35:04] damn tampa [14:35:55] what? [14:36:40] what's wrong?
[14:39:23] (03PS1) 10Akosiaris: Adding palladium/strontium (new puppetmasters) IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/93475 [14:40:56] !log removed LACP configuration from asw-a-eqiad for strontium and set it to standard access-port and private1-a-eqiad vlan [14:41:08] Logged the message, Master [14:41:17] i think the old foundry csw1-sdtpa is causing our multicast problems [14:41:18] !log removed LACP configuration from asw-b-eqiad for palladium and set it to standard access-port and private1-b-eqiad vlan [14:41:35] Logged the message, Master [14:42:10] (03PS1) 10Hashar: Merge upstream 'v0.7.1' into master [operations/debs/jenkins-debian-glue] - 10https://gerrit.wikimedia.org/r/93476 [14:42:51] the lazy person I am let Jenkins build the package now : https://integration.wikimedia.org/ci/job/operations-debs-jenkins-debian-glue-debian-glue/4/ :D [14:44:10] !log Disabled multicast traffic reduction on csw1-sdtpa [14:44:26] Logged the message, Master [14:45:49] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:47:39] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 1 logical drive(s), 4 physical drive(s) [15:07:49] perhaps i'll sacrifice one of the two fibers between sdtpa and pmtpa and connect that eqiad link over that [15:08:05] don't really need > 10 Gbps between the floors anymore anyway [15:08:19] (03CR) 10Akosiaris: [C: 032] Adding palladium/strontium (new puppetmasters) IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/93475 (owner: 10Akosiaris) [15:23:51] manybubbles: "Thanks go to Nik Everett, a frequent contributor to Elasticsearch" \o/ [15:47:13] ottomata1: heya [15:47:20] hiya [15:47:45] apt.wm.org has librdkafka 0.8.0 now [15:47:47] enjoy [15:47:55] 0.8.0-1~precise1 to be exact [15:50:03] awesome, saw that [15:50:05] thanks so much! 
[16:16:12] (03PS4) 10Andrew Bogott: Switch to using uwsgi for the proxy api [operations/puppet] - 10https://gerrit.wikimedia.org/r/92664 [16:22:13] paravoid: what'd I do? [16:22:42] manybubbles: http://www.elasticsearch.org/blog/0-90-6-released/ [16:22:51] paravoid: ah cool! lets upgrade! [16:22:55] "Other highlighting improvements" [16:23:12] I had no idea when 0.90.6 was coming [16:23:28] I assumed soon but I'm not sure how they make that decision [16:23:42] if you're in their irc channel or otherwise talking with them [16:23:50] you should tell them to make an apt repository [16:23:59] since they make .debs, having an apt is a tiny step over that [16:24:02] (03PS1) 10Cmjohnson: Removing misc servers tola, celsus, lardner, wtp1, kuo from pmtpa dsh, wtp1 from dhcpd (being decom'd) : [operations/puppet] - 10https://gerrit.wikimedia.org/r/93493 [16:24:12] and it's very useful [16:24:26] we wouldn't put the apt repo directly in our servers of course (although others might) [16:24:35] but we have a mechanism for importing third-party repositories [16:24:43] that includes verifying cryptographic signatures [16:24:57] https://github.com/elasticsearch/elasticsearch/issues/3286 [16:25:16] ok, slapped with the bug report [16:25:16] paravoid: I'd like that very much [16:25:22] ...to which you replied 4 months ago [16:25:26] yeah [16:25:49] that was something I started looking at a while ago [16:25:57] it is silly they don't just have one [16:26:10] sorry, that delayed slap was me finding the issue number. 
[16:29:15] paravoid: the "Pretty is prettier" is actually one I'm excited about [16:29:39] the lack of "\n" was making my terminal angry from time to time [16:31:54] !log Disabled multicast routing between eqiad and pmtpa, setup udpmcast between chromium and dobson instead [16:32:07] Logged the message, Master [16:33:45] AaronSchulz: ping [16:34:41] (03PS1) 10Cmjohnson: Removing dns entries for celsus,kuo, lardner,tola,wtp1 [operations/dns] - 10https://gerrit.wikimedia.org/r/93497 [16:34:45] AaronSchulz: thumbs are very close to completion (~1.5 days, plus another day for a final run?), so now we just miss 5T of temp files [16:35:12] AaronSchulz: so I need to either start copying all that or preferrably have a fix for #56401 and let that finish :) [16:39:50] (03CR) 10Cmjohnson: [C: 032] Removing misc servers tola, celsus, lardner, wtp1, kuo from pmtpa dsh, wtp1 from dhcpd (being decom'd) : [operations/puppet] - 10https://gerrit.wikimedia.org/r/93493 (owner: 10Cmjohnson) [16:41:20] (03PS5) 10Andrew Bogott: Switch to using uwsgi for the proxy api [operations/puppet] - 10https://gerrit.wikimedia.org/r/92664 [16:41:34] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for celsus,kuo, lardner,tola,wtp1 [operations/dns] - 10https://gerrit.wikimedia.org/r/93497 (owner: 10Cmjohnson) [16:42:00] !log dns update [16:42:16] Logged the message, Master [16:45:07] what are you using ES for? [16:45:55] it's the new infrastructure that's being built for the search box you see on the sites [16:46:31] https://wikitech.wikimedia.org/wiki/Elasticsearch is the docs I guess [16:47:08] the old one was a custom written daemon based on lucene [16:47:13] unmaintained for years [16:47:49] okay, nice [16:48:24] but ES will be populated solely for searching, it wont be used as the authoritative store for the wiki content itself? 
(i.e., replace your mysqls) [16:48:48] correct [16:49:24] too bad, that would be interesting :) [16:56:05] oh we do use ES as the authoritative store for wiki content [16:56:12] except we call it External Storage, and isn't ElasticSearch ;) [16:56:25] the ES acronym has become mightily confusing lately [17:07:14] mark: he, close enough :) [17:27:34] if i wanted to get an idea of what effects an extension utilizing 50M, 250M, or even >1G of memcache would have on the other users of memcache, where should i start? [17:27:37] (03PS1) 10Dzahn: add sockpuppet to misc_pmtpa [operations/puppet] - 10https://gerrit.wikimedia.org/r/93502 [17:27:38] (03PS1) 10Dzahn: ensure /srv/org/wikimedia exists on bugzilla server ..and some minor formatting [operations/puppet] - 10https://gerrit.wikimedia.org/r/93503 [17:28:28] (03PS2) 10Dzahn: add sockpuppet to misc_pmtpa [operations/puppet] - 10https://gerrit.wikimedia.org/r/93502 [17:29:08] (03CR) 10Dzahn: [C: 032] "sockpuppet is a misc. as well" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93502 (owner: 10Dzahn) [17:31:09] RECOVERY - DPKG on xenon is OK: All packages OK [17:31:51] (03PS2) 10Dzahn: ensure /srv/org/wikimedia exists on bugzilla server ..and some minor formatting [operations/puppet] - 10https://gerrit.wikimedia.org/r/93503 [17:38:01] (03PS1) 10Mark Bergsma: Cleanup old/now unused ulsfo upload service IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/93504 [17:39:09] (03CR) 10Mark Bergsma: [C: 032] Cleanup old/now unused ulsfo upload service IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/93504 (owner: 10Mark Bergsma) [17:40:44] (03PS11) 10Akosiaris: Modularizing puppetmaster [operations/puppet] - 10https://gerrit.wikimedia.org/r/91353 [17:43:50] (03CR) 10Akosiaris: [C: 032] "Removed puppetmaster::self and merging. 
More cleanup to follow" [operations/puppet] - 10https://gerrit.wikimedia.org/r/91353 (owner: 10Akosiaris) [17:45:09] (03PS6) 10Akosiaris: Puppetmaster module multi-master capable [operations/puppet] - 10https://gerrit.wikimedia.org/r/93061 [17:47:17] (03CR) 10Akosiaris: [C: 032] Puppetmaster module multi-master capable [operations/puppet] - 10https://gerrit.wikimedia.org/r/93061 (owner: 10Akosiaris) [17:47:45] (03Abandoned) 10Chad: Shut down search_pool[1-3] in pmtpa [operations/puppet] - 10https://gerrit.wikimedia.org/r/92017 (owner: 10Chad) [17:49:44] !log Deactivated ae1.101 on cr1-esams and cr2-knams [17:50:01] Logged the message, Master [17:53:03] (03PS1) 10Akosiaris: Fix a erroneus check in post-merge [operations/puppet] - 10https://gerrit.wikimedia.org/r/93507 [17:54:19] I broke post-merge... fixing it now [17:56:52] (03CR) 10Akosiaris: [C: 032] Fix a erroneus check in post-merge [operations/puppet] - 10https://gerrit.wikimedia.org/r/93507 (owner: 10Akosiaris) [18:01:16] fixed [18:02:36] (03PS2) 10Akosiaris: palladium/strontium as puppetmasters [operations/puppet] - 10https://gerrit.wikimedia.org/r/93082 [18:05:35] (03CR) 10Akosiaris: [C: 032] palladium/strontium as puppetmasters [operations/puppet] - 10https://gerrit.wikimedia.org/r/93082 (owner: 10Akosiaris) [18:16:43] !IE6 [18:16:47] !IE6 is April 8 2014 - celebrate end of extended support [18:16:48] Key was added [18:21:00] (03PS4) 10Mark Bergsma: Repartition esams LVS service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/92344 [18:25:42] !log aaron synchronized php-1.23wmf2/maintenance/cleanupUploadStash.php 'f3332a3d932c6919b3311d89edee20f886e4e86e' [18:25:56] Logged the message, Master [18:26:03] thanks :) [18:26:08] although it started working now [18:26:37] but useful nonetheless [18:27:02] (03PS1) 10Chad: sudo for myself on arsenic [operations/puppet] - 10https://gerrit.wikimedia.org/r/93511 [18:27:11] <^d> RobH: ^ [18:27:44] cool [18:27:52] waiting on tests.... 
[18:29:26] (03CR) 10RobH: [C: 032] "who watches the watchers? the honor system." [operations/puppet] - 10https://gerrit.wikimedia.org/r/93511 (owner: 10Chad) [18:30:01] ^d: merging now, will force a run on arsenic for ya [18:30:39] !log Configured cr1-esams and cr2-knams for new LVS service IP range 91.198.174.192/27 [18:30:46] <^d> \o/ [18:30:57] Logged the message, Master [18:32:00] ^d: you should have it now [18:32:06] saw it slap in your account [18:33:00] <^d> Working, thanks! [18:39:03] (03PS1) 10Mark Bergsma: Add new esams upload LVS service IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/93515 [19:15:45] hmm, the node_modules directory in /srv/deployment/parsoid/config is owned by a user called 'sartoris' now [19:15:59] so I can't deploy as I can't update those files as gwicke [19:17:19] apergos: ping [19:18:07] ask Ryan_Lane about that [19:18:16] Vito: ponng but I am cooking dinner [19:18:17] sartoris seems to be the new name for git-deploy [19:18:18] and will eat soon [19:18:25] apergos: yeah, thanks [19:18:52] apergos: heheheh, me too, btw there's a possibile db issue I think it deserves your attention [19:19:14] maybe someone who is actually in working hours now? [19:19:18] basically there are some lost revs on it.wiki [19:19:25] I mean I'm still hanging out but it's well after my 'shift' [19:19:29] ugh [19:19:34] Ryan_Lane: ping [19:19:37] point me out someone as trustworthy as you ;p [19:19:48] is there a bug report? like with what page, when it was noticed, etc [19:20:02] any weird actions done to the page or revs.... [19:20:15] gwicke: it should be group writeable... [19:20:34] and sgid. is it not? [19:20:41] Ryan_Lane: I'm getting permission errors [19:20:48] really? 
one sec [19:20:55] touch node_modules/test [19:20:57] touch: cannot touch `node_modules/test': Permission denied [19:21:08] no opened bugs [19:21:11] but some example [19:21:14] https://it.wikipedia.org/w/index.php?title=Speciale:Ripristina&target=Utente%3AVandra%2FSandbox3 [19:21:39] https://it.wikipedia.org/w/index.php?title=Speciale:Ripristina&target=Utente%3ADott_Alessandro_Sartore [19:21:47] gwicke: which repo? [19:21:58] no suppressed revs btw [19:22:00] Ryan_Lane: /srv/deployment/parsoid/config [19:22:16] ah, yeah, bad permissions [19:22:17] one sec [19:22:25] I did not try /Parsoid yet [19:22:34] need to update the config first [19:23:09] gwicke: whenever you want to try node 0.10... :) [19:23:32] paravoid: we should test it in rt testing first [19:23:39] gwicke: it's an issue with the directory missing write and sgid bits for group [19:23:40] yes [19:23:46] I'm fixing it for all repos, just in case [19:23:48] gwicke: ok, try now [19:23:51] paravoid: should we use the ppa now that it has 0.10.21 too? [19:24:09] no, I backported packages from Debian [19:24:11] or do you have a package ready? [19:24:14] 0.10.21 as well [19:24:16] ah, awesome [19:24:19] I needed to backport c-ares first though [19:24:22] the ppa is very non-standard [19:24:57] where are you doing the roundtrips? [19:25:03] labs? [19:25:05] apergos: so, the shadow reference feature is something for mediawiki localization [19:25:06] (03PS1) 10Lcarr: adding new parsoid public IP [operations/dns] - 10https://gerrit.wikimedia.org/r/93522 [19:25:36] apergos: during the fetch stage the shadow reference repo would be checked out to the version that's going to be checked out during the checkout phase [19:25:37] Ryan_Lane: that seems to have worked, thanks! 
[19:25:41] gwicke: yw [19:25:50] apergos: then, we'll have a custom mediawiki module [19:25:58] paravoid: yes, parsoid.wmflabs.org [19:26:06] it'll run the localization script on the target, which will generate the cache [19:26:06] and a few clients that do the actual work [19:26:24] ahhh [19:26:39] right and there is where you need the repo as it would be deployed, gotcha [19:26:42] so, we'll generate the cache on all targets, rather than transferring the cache [19:27:15] and since the shadow reference feature is generic, it can be used by other repos, if they need it [19:27:28] ^d: git.wm.org isn't very happy .. it seems [19:27:52] paravoid: the first half of this week is a bit busy, might have to defer the 0.10 upgrade until Thursday [19:28:02] k [19:28:02] (03CR) 10Lcarr: [C: 032] adding new parsoid public IP [operations/dns] - 10https://gerrit.wikimedia.org/r/93522 (owner: 10Lcarr) [19:28:52] apergos: so, the way I was going to implement this was to add a config option to the deployment hash: 'shadow_repo' => true [19:29:09] then, during the fetch stage, add a clone/fetch for the reference repo [19:29:42] then, replace all .gitmodules files in the checkout to reference the local submodules from the normal checkout [19:29:54] then do a recursive submodule update --init [19:30:54] eeww [19:30:57] I tested it for MW on my local system and it finished the clone/submodule clone for everything in about 30 seconds [19:31:00] (the git modules step) [19:31:01] apergos, Ryan_Lane: https://bugzilla.wikimedia.org/show_bug.cgi?id=56577 [19:31:08] yeah, not much you can do about that [19:31:25] 30 secs isn't awful [19:31:27] git submodules kind of suck [19:31:34] yeah, 30 secs was for a full clone, too [19:31:46] further fetches should occur much faster [19:31:49] yep [19:32:21] and since it's references/hard links it shouldn't take up much disk space
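The "references/hard links" point above is git's alternates mechanism: a clone made with `--reference` records a local object store in `.git/objects/info/alternates` and borrows objects from it instead of copying them, which is why a full clone plus submodules can finish in seconds with little extra disk. A self-contained demo with stock git (throwaway temp directories, not the real /srv/deployment layout):

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d)
cd "$tmp"

# A stand-in for a gerrit-hosted repository.
git init -q upstream
(cd upstream && git config user.email t@example.org && git config user.name t &&
  echo v1 > f && git add f && git commit -qm init)

# A local mirror that will act as the shared object store.
git clone -q upstream mirror

# A clone that references the mirror: objects are borrowed via the
# alternates file rather than copied into the new clone.
git clone -q --reference mirror upstream checkout
cat checkout/.git/objects/info/alternates
```

The same flag works for submodule checkouts, which is what makes the shadow-reference idea cheap for a repository the size of MediaWiki plus extensions.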
[19:33:28] (03PS1) 10Dzahn: remove search-pool[1-3].svc.pmtpa.wmnet from monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/93523 [19:33:33] (hard link source and target) [19:33:42] ah, that's handled by git [19:33:45] but... [19:34:26] yay mutante for removing those [19:34:38] the submodules have something like this: url = https://gerrit.wikimedia.org/r/p/mediawiki/extensions/AntiBot.git [19:35:01] so you need to modify that to point to the repo on the local filesystem [19:35:07] I'm already doing that in the main repo [19:35:13] to not reference gerrit [19:35:30] it's pretty simple to do it to point to the filesystem, since we already have the repo location [19:35:49] Ryan_Lane: oh, btw- is there a way to suppress Parsoid restarts when pushing out the config repository? [19:36:08] s#https://gerrit.wikimedia.org/r/p/mediawiki/#/srv/deployment/mediawiki/slot0/# [19:36:14] (03CR) 10Dzahn: "https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=search-pool" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93523 (owner: 10Dzahn) [19:36:15] gwicke: at this time? no [19:36:33] hmm, k [19:36:34] I should make that another allowed runner call so that it can be done separately [19:36:39] that will take out Parsoid for a bit then [19:36:49] old code won't work with new dependencies [19:37:11] !change 93523 | apergos [19:37:11] apergos: https://gerrit.wikimedia.org/r/#q,93523,n,z [19:37:13] I'll try to be quick [19:37:41] (03CR) 10Mark Bergsma: [C: 032] Add new esams upload LVS service IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/93515 (owner: 10Mark Bergsma) [19:37:56] so please ignore Parsoid warnings in the next minutes [19:39:22] I already looked at it, did you not see my yay? 
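The substitution quoted above maps each submodule's gerrit URL onto the local filesystem checkout. Applied to the example url line from the chat (both the `.gitmodules`-style url line and the slot0 path are quoted from the log):

```shell
#!/bin/sh
# Apply the exact substitution quoted in the chat to a .gitmodules-style
# url line: the gerrit prefix is swapped for the local slot0 path, and the
# extension path plus .git suffix are preserved unchanged.
url="url = https://gerrit.wikimedia.org/r/p/mediawiki/extensions/AntiBot.git"
echo "$url" | sed 's#https://gerrit.wikimedia.org/r/p/mediawiki/#/srv/deployment/mediawiki/slot0/#'
# -> url = /srv/deployment/mediawiki/slot0/extensions/AntiBot.git
```

With every url rewritten this way, a recursive `git submodule update --init` never has to contact gerrit at all.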
:-D [19:39:39] Ryan_Lane: it seems that a commit --amend confuses git-deploy a bit [19:39:46] gwicke: yeah, it would [19:39:49] badly [19:39:54] you should never ever do that [19:39:57] apergos: i didn't, now i did :) [19:40:06] one file was not committed along with the rest [19:40:06] :-D [19:40:12] which caused sync to fail [19:40:15] then you should do another commit [19:40:20] oh suck [19:40:31] Ryan_Lane: ok- next time [19:40:35] gwicke: when you amend a commit, you destroy the history [19:40:41] how can I get back to where I started? [19:40:42] and any repo that's downstream will break [19:40:55] there are no downstream repos ;) [19:41:01] gwicke: yes, there are [19:41:05] every single target [19:41:16] if it was synced, yes [19:41:19] git is the transport mechanism, remember? :) [19:41:19] it is not [19:41:24] oh [19:41:25] good [19:41:50] so you amended a commit that hadn't been synced yet? [19:41:54] then you should have no issues [19:42:00] git deploy --force sync [19:42:10] otherwise it has no clue anything changed [19:43:19] it says 'It looks like you havent started yet!' [19:43:28] git deploy start [19:43:30] :) [19:43:42] hm, ok [19:43:49] I had started earlier [19:44:12] force-syncing now [19:45:01] and done [19:45:30] RECOVERY - Disk space on wtp1018 is OK: DISK OK [19:45:40] RECOVERY - Disk space on wtp1023 is OK: DISK OK [19:47:59] !log updated Parsoid to d7b556f25353 [19:48:17] Logged the message, Master [19:51:38] git.wikimedia having troubles? [19:51:57] (03PS1) 10Lcarr: adding new public IP for parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/93527 [19:52:09] marktraceur: yes. 
there's a bug about that [19:52:16] (03PS2) 10Lcarr: adding new public IP for parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/93527 [19:52:19] https://bugzilla.wikimedia.org/show_bug.cgi?id=56557 [19:52:22] Fun times [19:52:25] naturally no one wants to fix it [19:52:31] since today's morning UTC [19:52:45] so pretty long-ish now [19:52:50] (03CR) 10jenkins-bot: [V: 04-1] adding new public IP for parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/93527 (owner: 10Lcarr) [19:54:08] Sigh [19:57:45] (03CR) 10Mark Bergsma: [C: 04-2] "Totally the wrong IP range for that service IP. Also, no IPv6?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93527 (owner: 10Lcarr) [19:58:09] Reedy: online again yet? [19:58:48] !log restarted gitblit on antimony [19:58:49] (03PS6) 10Andrew Bogott: Switch to using uwsgi for the proxy api [operations/puppet] - 10https://gerrit.wikimedia.org/r/92664 [19:59:04] Logged the message, Master [20:00:05] mutante: you know that's bug 56557? [20:00:22] jeremyb: i do, just commented [20:00:29] "mid-air" [20:00:59] (03CR) 10Mark Bergsma: "Sorry, not outdated lvs.pp, but put the monitoring lines in the external services section" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93527 (owner: 10Lcarr) [20:01:24] (03PS1) 10Lcarr: fixed the ip for parsoid-lb.eqiad.wikimedia.org [operations/dns] - 10https://gerrit.wikimedia.org/r/93567 [20:01:32] mark: i mixed up the line numbers and ip's [20:01:47] need to change my colors to be a bit more different [20:02:04] (03CR) 10Lcarr: [C: 032] fixed the ip for parsoid-lb.eqiad.wikimedia.org [operations/dns] - 10https://gerrit.wikimedia.org/r/93567 (owner: 10Lcarr) [20:02:39] LeslieCarr: why is it in the multimedia range? [20:03:03] i didn't realize we had already separated out the ranges [20:03:17] for eqiad [20:03:29] let's put a comment in the rdns file ? 
[20:03:58] we haven't yet, but you should at least take it into account for new ips [20:04:10] greg-g: Am now [20:04:15] An hour wasn't a bad guess [20:04:34] Reedy: weee [20:04:43] go forth and deploy and such [20:05:04] I actually did most of the prep this morning [20:05:05] after i get lunch i'll put comments in the file and make sure it's in the "misc" section [20:05:15] (03CR) 10Reedy: [C: 032] mediawikiwiki and testwikidatawiki to 1.23wmf2 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93433 (owner: 10Reedy) [20:06:12] LeslieCarr: if you leave it for review I can roll them in with my other lvs changes tomorrow [20:06:52] Who do I need to slap for the bugzilla spam? [20:06:54] okay [20:07:35] Reedy: ? [20:07:51] 40 new bugzilla emails [20:09:20] * Reedy waits for jenkins [20:11:47] (03PS2) 10Dzahn: remove Tampa search-pool monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/93523 [20:12:36] (03PS3) 10Dzahn: remove Tampa search-pool monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/93523 [20:14:39] Has Jenkins died? [20:15:04] (03CR) 10Reedy: [V: 032] "Dead Jenkins is apparently dead" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93433 (owner: 10Reedy) [20:16:46] Reedy: nah, it's just very busy [20:19:51] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: mediawikiwiki and testwikidatawiki to 1.23wmf2 [20:20:06] Logged the message, Master [20:22:23] !log Created BetaFeatures tables on mediawikiwiki and testwikidatawiki [20:22:36] Logged the message, Master [20:22:37] marktraceur: Are we enabling it on testwikidatawiki? [20:23:32] what does it do? 
[20:23:38] probably fine [20:24:25] yeah, looks fine [20:24:39] aude: Better we break testwikidatawiki with it than wikidatawiki ;) [20:25:03] definitely [20:25:44] (03PS1) 10Aaron Schulz: Bumped jobqueue warning threshhold [operations/puppet] - 10https://gerrit.wikimedia.org/r/93585 [20:27:04] MultimediaViewer seems a bit pointless [20:27:34] CommonsMetadata for the same reason.. Though the idea of it does seem wikidata-sih [20:28:33] can always disable it [20:29:13] i doubt they break anything though doubt they are needed for wikidata :) [20:29:23] * aude doesn't care [20:29:52] !log reedy synchronized wmf-config/InitialiseSettings.php 'Enable BetaFeatures and friends on mediawikiwiki and testwikidatawiki' [20:30:05] ^demon: PHP Warning: Search backend error during full text search for ''. Error message is: IndexMissingException[[zuwikibooks_content] missing] [20:30:07] Logged the message, Master [20:30:11] (03CR) 10Dzahn: [C: 031] decommision db3[29] db4[2-6] db5[1235689] [operations/puppet] - 10https://gerrit.wikimedia.org/r/93052 (owner: 10Springle) [20:30:24] (03PS1) 10Reedy: Enable BetaFeatures and friends on mediawikiwiki and testwikidatawiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93586 [20:31:01] (03CR) 10Reedy: [C: 032] Enable BetaFeatures and friends on mediawikiwiki and testwikidatawiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93586 (owner: 10Reedy) [20:31:12] (03Merged) 10jenkins-bot: Enable BetaFeatures and friends on mediawikiwiki and testwikidatawiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93586 (owner: 10Reedy) [20:31:24] <^demon> Reedy: Ohdarn. 
[20:31:39] (03PS2) 10Reedy: Non wikipedias to 1.23wmf2 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93434 [20:31:43] (03CR) 10Reedy: [C: 032] Non wikipedias to 1.23wmf2 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93434 (owner: 10Reedy) [20:31:45] That's better [20:32:02] (03Merged) 10jenkins-bot: Non wikipedias to 1.23wmf2 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93434 (owner: 10Reedy) [20:32:53] Reedy: Yeah, they're all getting pushed to mw.o and testwikidatawiki - I dunno why exactly, but that's what I was told [20:33:01] (I mean re: test.wikidata) [20:33:18] more testing the better [20:33:32] Absolutely [20:33:44] and eventually commons will get some of the wikibase repo features [20:33:50] probably* [20:33:51] Wonderful [20:33:55] aude: We sure hope so [20:34:13] marktraceur: it's the suite of testwikis :) [20:34:32] (why you were told what you were told) :) [20:34:36] Right [20:34:38] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias to 1.23wmf2 [20:34:48] * greg-g crosses fingers [20:34:53] Logged the message, Master [20:35:26] PHP Fatal error: Cannot use object of type EchoEvent as array in /usr/local/apache/common-local/php-1.23wmf2/extensions/Echo/includes/EventLogging.php on line 72 [20:35:34] Did a fix get committed for that already... [20:35:35] yeah, that one again? [20:35:37] I deleted the email notification [20:35:51] let me check [20:35:54] backporting it if so [20:36:06] https://gerrit.wikimedia.org/r/#/c/93516/ [20:36:09] Reedy: ^ [20:36:11] thanks [20:36:18] I'd go so far as the bug ;) [20:36:45] hth [20:36:57] a gerrit patch uploader example, ^demon ^^ [20:39:30] (03CR) 10Dzahn: [C: 032] remove Tampa search-pool monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/93523 (owner: 10Dzahn) [20:40:14] Is Ryan Lane around? [20:40:28] Eloquence: he was, but he's in Hong.... [20:40:30] there. [20:40:31] Eloquence: he is! 
:) [20:40:34] Hi Ryan_Lane :) [20:40:40] haha [20:40:41] howdy [20:40:46] Ryan_Lane, gwicke needs some help with an emergency parsoid revert [20:40:57] git deploy start [20:41:01] git reset --hard [20:41:09] git deploy sync [20:41:53] boarding is starting right now, but I should have a few mins [20:42:00] trying with --force [20:42:23] was there an issue with the deploy, or the code that was deployed? [20:42:42] there's a complaint on wikitech-l. i assume this is related [20:42:44] the code [20:42:47] ok [20:42:47] code that was deployed, I believe (see wikitech-l, all edits on french wikipedia are junk) [20:42:52] https://pl.wikipedia.org/w/index.php?title=Waldemar_Nol&curid=1602043&diff=37817412&oldid=27653361 [20:42:57] greg-g: all edits on all wikipedias [20:43:03] that include any non-ascii characters [20:43:04] MatmaRex: wonderful [20:43:11] yeah, reverting to the old commit id will work, then [20:43:12] which is all edits for all wikipedias except for the english one [20:43:17] so, english is maybe part ok! [20:43:17] Ryan_Lane: that seemed to work [20:43:27] great [20:44:24] gwicke, I am getting "error contacting parsoid" on fr.wp now instead [20:44:26] PROBLEM - Parsoid on wtp1001 is CRITICAL: Connection refused [20:44:26] PROBLEM - Parsoid on wtp1017 is CRITICAL: Connection refused [20:44:26] PROBLEM - Parsoid on wtp1006 is CRITICAL: Connection refused [20:44:26] PROBLEM - Parsoid on wtp1022 is CRITICAL: Connection refused [20:44:36] PROBLEM - Parsoid on wtp1019 is CRITICAL: Connection refused [20:44:36] PROBLEM - Parsoid on wtp1003 is CRITICAL: Connection refused [20:44:41] .... 
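[editor's note] The revert sequence Ryan dictated above is worth unpacking: the heavy lifting is a plain `git reset --hard` to the last known-good commit, bracketed by `git deploy start` / `git deploy sync` so the deploy system notices and fans the rollback out. A minimal, self-contained sketch of that core step (repo contents, file names, and commit messages are invented for illustration; `git deploy` itself is only shown in comments since it is site-specific tooling):

```shell
# Roll a checkout back to a known-good commit -- the middle step of
# "git deploy start; git reset --hard <sha>; git deploy sync".
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo && cd repo
git config user.email ops@example.org
git config user.name ops
echo good > app.js && git add app.js && git commit -qm 'known-good deploy'
good=$(git rev-parse HEAD)                  # remember the good SHA
echo broken > app.js && git commit -qam 'bad deploy'
git reset --hard -q "$good"                 # discard the bad deploy locally
cat app.js                                  # prints "good"
# In production this local reset is bracketed by:
#   git deploy start; git reset --hard <good-sha>; git deploy sync
```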
[20:44:43] heh [20:44:46] PROBLEM - LVS HTTP IPv4 on parsoid.svc.eqiad.wmnet is CRITICAL: Connection refused [20:44:49] PROBLEM - Parsoid on wtp1009 is CRITICAL: Connection refused [20:44:50] PROBLEM - Parsoid on wtp1011 is CRITICAL: Connection refused [20:44:50] PROBLEM - Parsoid on wtp1008 is CRITICAL: Connection refused [20:44:50] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused [20:44:50] PROBLEM - Parsoid on wtp1015 is CRITICAL: Connection refused [20:44:56] PROBLEM - Parsoid on wtp1010 is CRITICAL: Connection refused [20:45:02] we get it, icinga-wm [20:45:05] gwicke: want me to restart parsoid via salt? [20:45:06] PROBLEM - Parsoid on wtp1018 is CRITICAL: Connection refused [20:45:06] PROBLEM - Parsoid on wtp1024 is CRITICAL: Connection refused [20:45:06] PROBLEM - Parsoid on wtp1016 is CRITICAL: Connection refused [20:45:06] PROBLEM - Parsoid on wtp1014 is CRITICAL: Connection refused [20:45:06] PROBLEM - Parsoid on wtp1021 is CRITICAL: Connection refused [20:45:07] PROBLEM - Parsoid on wtp1012 is CRITICAL: Connection refused [20:45:16] PROBLEM - Parsoid on wtp1005 is CRITICAL: Connection refused [20:45:16] PROBLEM - Parsoid on wtp1007 is CRITICAL: Connection refused [20:45:16] PROBLEM - Parsoid on wtp1013 is CRITICAL: Connection refused [20:45:16] PROBLEM - Parsoid on wtp1023 is CRITICAL: Connection refused [20:45:16] PROBLEM - Parsoid on wtp1020 is CRITICAL: Connection refused [20:45:17] PROBLEM - Parsoid on wtp1002 is CRITICAL: Connection refused [20:46:36] PROBLEM - LVS HTTP IPv4 on parsoidcache.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 769 bytes in 0.004 second response time [20:46:41] gwicke: if yes, let me know now, since I need to board.... [20:47:02] Ryan_Lane: type the command here if you have it ready [20:47:13] performing roan extraction ... 
[20:47:28] salt -G 'deployment_target:parsoid' parsoid.restart_parsoid [20:47:32] Eloquence: well, that's certainly progress :) [20:47:33] k [20:47:36] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [20:47:36] RECOVERY - LVS HTTP IPv4 on parsoidcache.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1461 bytes in 0.002 second response time [20:47:39] RECOVERY - Parsoid on wtp1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [20:47:44] (the error i mean.) [20:47:46] RECOVERY - LVS HTTP IPv4 on parsoid.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time [20:47:48] OMG, it worked from IRC :P [20:47:49] RECOVERY - Parsoid on wtp1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.006 second response time [20:47:50] RECOVERY - Parsoid on wtp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [20:47:50] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [20:47:50] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [20:47:50] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.026 second response time [20:47:56] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.009 second response time [20:48:01] salt -G 'deployment_target:parsoid' parsoid.restart_parsoid parsoid [20:48:06] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [20:48:06] RECOVERY - Parsoid on wtp1024 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [20:48:06] RECOVERY - Parsoid on wtp1014 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.003 second response time [20:48:06] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.008 second response time [20:48:06] RECOVERY - Parsoid on 
wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time [20:48:07] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.002 second response time [20:48:16] RECOVERY - Parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.002 second response time [20:48:16] RECOVERY - Parsoid on wtp1007 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time [20:48:16] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.002 second response time [20:48:16] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.003 second response time [20:48:16] RECOVERY - Parsoid on wtp1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [20:48:17] RECOVERY - Parsoid on wtp1013 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.008 second response time [20:48:26] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [20:48:26] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [20:48:26] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.002 second response time [20:48:26] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.003 second response time [20:48:41] when is it a good time for me to say "So... about getting Parsoid on the deploy calendar....." (we were just deploying MediaWiki right now, too) [20:48:46] it's silly that we need a custom module to restart parsoid [20:48:57] why do we need a custom module? 
[20:49:03] but parsoid still doesn't have a proper init script [20:49:17] oh [20:49:21] I can give that a stab [20:49:27] I think ori wrote one [20:49:46] knowing ori, he probably wrote an upstart job [20:50:08] yep [20:50:21] heheh [20:50:23] http://git.wikimedia.org/blob/mediawiki%2Fvagrant.git/8edc7847d5f8247a576905a2b7a6915b47915c97/puppet%2Fmodules%2Fmediawiki%2Ftemplates%2Fparsoid.conf.erb [20:50:50] ahoy [20:50:58] hi joel! [20:50:59] it's a joel! [20:51:00] hi cajoel [20:51:09] looking to secure some internal web services, and I need the SSL star cert+key, etc.. [20:51:17] https ftw [20:51:30] I think there was a separate wildcard for *.corp [20:51:34] there is [20:51:38] can someone point me in the right direction. [20:51:40] !log reedy synchronized php-1.23wmf2/extensions/Echo 'bug 56521' [20:51:43] our * doesn't support corp [20:51:51] cool, np [20:51:54] cajoel: there should be *.corp.wm [20:51:59] Logged the message, Master [20:52:01] no clue where corp's is :) [20:52:03] expecting IT to have that? :) [20:52:04] hah [20:52:13] we don't expect anything [20:52:19] I'll rescan for corp , that's easier to seach for... [20:52:21] ...except everything [20:53:15] cajoel: it's the one used on mingle [20:53:40] cajoel: see cert details on https://mingle.corp.wikimedia.org/ [20:54:13] Great, I can dig that out of there. [20:54:49] ok, boarding time [20:54:53] * Ryan_Lane waves [20:54:59] have a nice flight! [20:55:13] too late to point to https://www.youtube.com/watch?v=DtyfiPIHsIg [20:55:49] (03PS1) 10Odder: (bug 56570) Set up 'accountcreator' user group on cawiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93591 [20:56:25] gwicke: so, about adding Parsoid deploys to the Deployments calendar.... 
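[editor's note] For context on the Upstart remark: a job for a node service like Parsoid is only a few stanzas. A hypothetical sketch — this is not the contents of the parsoid.conf.erb linked above; paths, user, port, and start/stop events are all assumptions:

```
# /etc/init/parsoid.conf -- hypothetical Upstart job sketch
description "Parsoid HTTP service"

start on (local-filesystems and net-device-up IFACE!=lo)
stop on runlevel [!2345]

setuid parsoid
respawn

exec /usr/bin/nodejs /var/lib/parsoid/api/server.js
```

With something like this in place, `service parsoid restart` replaces the custom salt module Ryan was grumbling about.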
[20:56:39] (03PS1) 10Cmjohnson: Removing mgmt dns entries for osm-cp1-4 osm-db1-2 [operations/dns] - 10https://gerrit.wikimedia.org/r/93592 [20:59:07] (03PS1) 10Reedy: Remove old commented out pmtpa solr box config for GeoData [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93593 [21:00:07] (03CR) 10Reedy: [C: 032] Remove old commented out pmtpa solr box config for GeoData [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93593 (owner: 10Reedy) [21:00:09] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns entries for osm-cp1-4 osm-db1-2 [operations/dns] - 10https://gerrit.wikimedia.org/r/93592 (owner: 10Cmjohnson) [21:00:27] (03Merged) 10jenkins-bot: Remove old commented out pmtpa solr box config for GeoData [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93593 (owner: 10Reedy) [21:00:33] !log dns update [21:00:49] Logged the message, Master [21:02:04] (03PS1) 10RobH: new bug-attachment.wikimedia.org cert [operations/puppet] - 10https://gerrit.wikimedia.org/r/93594 [21:03:22] (03PS1) 10Cmjohnson: Removing mgmt dns entries for wmf5815/wmf5821 (otto/varro) [operations/dns] - 10https://gerrit.wikimedia.org/r/93595 [21:03:52] cajoel: https://git.wikimedia.org/history/operations%2Fpuppet.git/153e3dc9b411c8d9ce9d3b2c56d71e4ba51cc0d1/files%2Fssl%2Fstar.wikimedia.org.pem [21:04:26] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
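[editor's note] For anyone, like cajoel, trying to identify which wildcard cert a live service presents: `openssl` will print the subject of whatever the server hands back. A sketch using a throwaway self-signed cert so it runs anywhere; the `s_client` form in the trailing comment is what you would actually point at mingle.corp (hostname per the chat, everything else illustrative):

```shell
# Generate a throwaway self-signed wildcard cert, then inspect its subject.
tmp=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=*.corp.wikimedia.org" \
  -keyout "$tmp/key.pem" -out "$tmp/cert.pem" 2>/dev/null
openssl x509 -noout -subject -in "$tmp/cert.pem"
# Against a live host, fetch the presented cert instead:
#   openssl s_client -connect mingle.corp.wikimedia.org:443 </dev/null 2>/dev/null \
#     | openssl x509 -noout -subject -dates
```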
[21:06:06] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [21:06:45] (03CR) 10RobH: [C: 032] new bug-attachment.wikimedia.org cert [operations/puppet] - 10https://gerrit.wikimedia.org/r/93594 (owner: 10RobH) [21:29:05] !log schedule icinga downtime/disable notifications for db3[29] db4[2-6] db5[1235689] [21:29:29] Logged the message, Master [21:29:34] cmjohnson1: ^ there, that took a while, but i added them all to a scheduled downtime of 1 year and disabled notifications for hosts and services on them [21:30:15] mutante: great thank you [21:33:05] (03CR) 10Dzahn: "< mutante> !log schedule icinga downtime/disable notifications for db3[29] db4[2-6] db5[1235689]" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93052 (owner: 10Springle) [21:39:33] (03PS1) 10Cmjohnson: Removing dhcpd and dsh entries for storage3 [operations/puppet] - 10https://gerrit.wikimedia.org/r/93598 [21:40:59] (03PS2) 10Cmjohnson: Removing dhcpd and dsh entries for storage3 [operations/puppet] - 10https://gerrit.wikimedia.org/r/93598 [21:41:05] (03CR) 10Cmjohnson: [C: 032] Removing dhcpd and dsh entries for storage3 [operations/puppet] - 10https://gerrit.wikimedia.org/r/93598 (owner: 10Cmjohnson) [21:41:31] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns entries for wmf5815/wmf5821 (otto/varro) [operations/dns] - 10https://gerrit.wikimedia.org/r/93595 (owner: 10Cmjohnson) [21:41:56] (03PS1) 10Cmjohnson: Removing dns entries for storage3 [operations/dns] - 10https://gerrit.wikimedia.org/r/93600 [21:42:29] (03PS1) 10Lcarr: moved parsoid into proper ssubnet Created comments about LVS subnet plans [operations/dns] - 10https://gerrit.wikimedia.org/r/93601 [21:42:40] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for storage3 [operations/dns] - 10https://gerrit.wikimedia.org/r/93600 (owner: 10Cmjohnson) [21:43:59] !log on mediawiki/core.git deleted and retagged 1.22.0rc0 (same commit: 
c00622b7f6f207f1a2056d47437a5b1891b490a7 ) [21:43:59] !log dns update [21:44:19] Logged the message, Master [21:46:44] (03CR) 10Lcarr: [C: 032] moved parsoid into proper ssubnet Created comments about LVS subnet plans [operations/dns] - 10https://gerrit.wikimedia.org/r/93601 (owner: 10Lcarr) [21:55:29] (03CR) 10Hashar: "You could also introduce a warning level at 10k or 50k or whatever by replicating the logic and using exit code 1. Doc at http://nagios.so" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93585 (owner: 10Aaron Schulz) [22:02:14] (03PS2) 10Mwalker: Enable CentralNotice CrossWiki Hiding [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92817 [22:07:32] !log Reindexing GeoData [22:07:52] Logged the message, Master [22:15:11] !log mwalker synchronized php-1.23wmf1/extensions/CentralNotice/ 'Updating CentralNotice to master - mobile redirects and cross wiki hiding' [22:15:29] Logged the message, Master [22:15:44] !log mwalker synchronized php-1.23wmf2/extensions/CentralNotice/ 'Updating CentralNotice to master - mobile redirects and cross wiki hiding' [22:15:59] Logged the message, Master [22:16:58] !log mwalker synchronized php-1.23wmf2/resources/mediawiki/mediawiki.inspect.js 'Pushing https://gerrit.wikimedia.org/r/#/c/93587/' [22:17:14] Logged the message, Master [22:17:18] mwalker: <3; thanks. [22:17:28] (03PS1) 10MarkTraceur: Set up Beta Commons as an API repo for beta sites [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93610 [22:19:06] tgr: D'you want to make sure 93610 there is good? 
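[editor's note] Hashar's review comment on the jobqueue change points at the standard Nagios plugin convention: exit 0 for OK, 1 for WARNING, 2 for CRITICAL. A hedged sketch of the two-threshold shape he suggests — thresholds, message format, and function name are illustrative, not the contents of the actual check being reviewed:

```shell
# Two-threshold Nagios-style check: WARNING (exit 1) past a lower bound,
# CRITICAL (exit 2) past a higher one, OK (exit 0) otherwise.
check_jobqueue() {
  local size=$1 warn=${2:-10000} crit=${3:-100000}
  if [ "$size" -ge "$crit" ]; then
    echo "JOBQUEUE CRITICAL: $size jobs"; return 2
  elif [ "$size" -ge "$warn" ]; then
    echo "JOBQUEUE WARNING: $size jobs"; return 1
  fi
  echo "JOBQUEUE OK: $size jobs"
}

check_jobqueue 5000   # prints "JOBQUEUE OK: 5000 jobs", exit status 0
```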
[22:19:25] Also chrismcmahon, if you want to make sure that's a sane way to do that, it'd be nice [22:19:41] I didn't want to take away the commons setup, but now betacommons is another repo [22:20:25] (03CR) 10Mwalker: [C: 032] Enable CentralNotice CrossWiki Hiding [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92817 (owner: 10Mwalker) [22:21:12] (03PS3) 10Lcarr: adding new public IP VIP for parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/93527 [22:22:01] (03CR) 10jenkins-bot: [V: 04-1] adding new public IP VIP for parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/93527 (owner: 10Lcarr) [22:22:54] !log mwalker synchronized wmf-config/CommonSettings.php 'Enabling cross wiki banner hiding for CentralNotice' [22:23:08] Logged the message, Master [22:24:56] (03PS4) 10Lcarr: adding new public IP VIP for parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/93527 [22:25:03] marktraceur: I'm not entirely sure what I'm seeing there, but I added hashar as a reviewer. Are you sure you want to make beta commons point to commons.wikimedia.org/w/api.php ? that seems sketchy to me, seems like it should be commons.wikimedia.beta.wmflabs.org/w/api.php [22:25:29] Argh, did I do that [22:25:30] * marktraceur fails [22:25:50] :D [22:26:03] (03PS2) 10MarkTraceur: Set up Beta Commons as an API repo for beta sites [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93610 [22:26:11] * marktraceur apologizes for wasting everyone's time [22:26:32] marktraceur: note that commons.beta has commons.prod as a foreign repo [22:26:45] marktraceur: APOLOGY NOT ACCEPTED! [22:26:53] marktraceur: and I thought instant commons managed that by itself [22:27:24] hashar: I guess I should turn off betacommons being a repo on betacommons [22:27:49] <^d> marktraceur: I heard you like commons in your commons. [22:27:50] marktraceur: I guess. 
the whole conf is a bit nasty for sure :( [22:28:18] hashar: I'd be OK replacing the instantcommons config with betacommons, but I'm afraid that would screw up testing somehow [22:28:48] marktraceur: as long as chrismcmahon is happy, any change is fine to me :] [22:28:54] marktraceur: maybe just replace $wgForeignFileRepos[x] where $wgForeignFileRepos[x]['name'] = wikimediacommons? [22:29:33] chrismcmahon: Is it a bad idea to no longer have prod commons images on beta sites? [22:30:10] (03CR) 10Gergő Tisza: [C: 031] "(1 comment)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93610 (owner: 10MarkTraceur) [22:30:43] Anyway, it seems like there's no good way to do that with the existing setup in InitSettings-labs [22:30:46] But [22:30:46] Ah well [22:30:51] Someone has to go first [22:31:18] hashar marktraceur this is not my area of expertise :-) . my concern is that we need a consistent set of code on beta, namely everything that is in master but not necessarily yet deployed, including API and config. [22:31:31] Yeah [22:31:56] marktraceur: the data is of less concern than the code that manipulates the data, if that makes sense [22:32:02] I see [22:32:15] So if changes to the data cause test failures [22:32:24] You're not going to be super angry with me? :D [22:32:51] marktraceur: as long as the code being tested fails on legit data, I don't care so much where that data lives [22:33:11] marktraceur: and if we fuck up, we can change it later. beta is like that. [22:33:59] Cool [22:34:46] Agh, damn it [22:34:54] tgr: BetaCommons is already a remote repo I think [22:35:21] * marktraceur tries to confirm [22:35:55] marktraceur: so lets say that we have e.g. UploadWizard on beta commons pointing an api.php on beta commons but the images it's checking against are at the real commons. If something failed, then that would indicate that the new code would fail on the production data, which would be a Good Thing. 
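[editor's note] The repo chain being debated here (beta wiki → beta commons → production commons) is wired up through MediaWiki's `$wgForeignFileRepos`. A hypothetical sketch of what a ForeignAPIRepo entry for beta commons might look like — field values are assumptions for illustration, not the actual InitialiseSettings-labs.php contents:

```php
// Hypothetical sketch only -- not the real beta configuration.
$wgForeignFileRepos[] = array(
	'class'                  => 'ForeignAPIRepo',
	'name'                   => 'betacommons',
	'apibase'                => 'http://commons.wikimedia.beta.wmflabs.org/w/api.php',
	'hashLevels'             => 2,
	'fetchDescription'       => true,
	'descriptionCacheExpiry' => 43200,
	'apiThumbCacheExpiry'    => 86400,
);
```

Swapping `apibase` between the production and beta Commons API endpoints is exactly the mistake caught in review above — tgr's comment flagged that the first patch set pointed beta at commons.wikimedia.org.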
[22:36:11] Right [22:36:59] greg-g: if you have not been following along, I think I nutshelled it right there for ya ^^ [22:37:36] marktraceur: that said, hashar is way better than me at this config stuff. :-) [22:39:07] !log mwalker synchronized php-1.23wmf1/skins/vector/ 'Deploying fix for bug 56366 : https://gerrit.wikimedia.org/r/#/c/93408/' [22:39:09] marktraceur: we needed the production commons as a foreign repo, cause there are a bunch of images used from it [22:39:21] marktraceur: an example are the kittens for Wikilove [22:39:23] Logged the message, Master [22:39:47] !log mwalker synchronized php-1.23wmf2/skins/vector/ 'Deploying fix for bug 56366 : https://gerrit.wikimedia.org/r/#/c/93408/' [22:39:54] Right [22:40:01] Logged the message, Master [22:40:04] hashar: But, we also have betacommons set up, I'm just testing now [22:40:22] thanks hashar marktraceur yes exactly, examples are great [22:40:33] i think the chain is : beta -> beta commons (via instant commons maybe) -> foreignApiRepo of commons in prod [22:40:56] Yeah [22:41:01] http://en.wikipedia.beta.wmflabs.org/wiki/Lightbox_demo/Using_betacommons [22:41:05] So that works [22:41:13] Except it reveals bugs, apparently, in MMV [22:42:59] Wellp, yeah, CMD bugs ahoy [22:43:02] Thanks y'all [22:49:54] RoanKattouw: gwicke - when mark does the lvs changes tomorrow, the public parsoid ip should be live [22:50:03] parsoid-lb.eqiad.wikimedia.org [22:50:25] Awesome [22:50:28] YuviPanda: ---^^ [22:50:40] woooot! [22:50:57] and it has ipv6 [22:51:06] becuse it's 2 more awesome than ipv4 [22:51:08] LeslieCarr: yay, thanks! [22:51:12] but no ssl, I think? [22:51:19] thanks LeslieCarr! [22:51:28] we didn't add any new services, so no ssl [22:51:37] so they aren't live yet but will be tomorrow? 
[22:54:56] yep [22:55:04] marktraceur: yay for bugs not in prod [22:56:07] Heh [22:56:11] chrismcmahon: Always a good plan [22:56:58] I am off, see you tomorrow [23:08:18] (03PS1) 10Dzahn: fix bugzilla SSL cert in puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/93617 [23:11:53] !log csteipp synchronized php-1.23wmf2/includes 'bug 55332' [23:12:07] Logged the message, Master [23:12:47] !log csteipp synchronized php-1.23wmf1/includes 'bug 55332' [23:13:00] Logged the message, Master [23:19:56] * AaronSchulz chuckles [23:20:23] (03CR) 10Dzahn: [C: 032] "semi-revert of Change-Id: I3e163570feecab48809d86b5c8c7ddaa629babbe" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93617 (owner: 10Dzahn) [23:23:56] !log ori synchronized php-1.23wmf2/resources/mediawiki/mediawiki.inspect.js 'Ib2252003f2: mediawiki.inspect#dumpTable: fix broken FF workaround' [23:24:12] Logged the message, Master [23:28:20] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:29:10] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [23:37:35] YuviPanda: have a few minutes to help me debug something? Currently the api can read from data.db but not write to it... [23:37:37] or so it seems [23:38:01] labs instance proxy-abogott-8 [23:38:09] andrewbogott: weird. sshing [23:38:26] andrewbogott: give me a couple of minutes, new machine, so setting up a new key [23:38:49] YuviPanda: uwsgi (and, theoretically the unicorn) log to /var/log/uwsgi/app/ [23:38:51] thanks [23:40:20] PROBLEM - Host virt0 is DOWN: PING CRITICAL - Packet loss = 100% [23:42:10] um… someone reboot virt0? [23:42:21] andrewbogott: okay, I'm in [23:42:25] looking at it now [23:42:40] PROBLEM - Host labs-ns0.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:42:45] Is wikitech.wikimedia.org broken? 
[23:42:52] YuviPanda: ok… I'm going to be distracted now on account of virt0 [23:42:53] holy cross post batman [23:42:57] andrewbogott: heh, ok [23:43:10] RECOVERY - Host virt0 is UP: PING WARNING - Packet loss = 80%, RTA = 40.65 ms [23:43:10] RECOVERY - Host labs-ns0.wikimedia.org is UP: PING WARNING - Packet loss = 80%, RTA = 40.63 ms [23:43:15] Yes [23:43:32] cmjohnson1: still around? [23:43:54] * Elsie beats Reedy. [23:43:55] (03PS1) 10Dzahn: fix Bugzilla SSL cert in puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/93620 [23:44:16] (03PS1) 10RobH: new bugzilla.wikimedia.org cert [operations/puppet] - 10https://gerrit.wikimedia.org/r/93621 [23:44:30] PROBLEM - Puppetmaster HTTPS on virt0 is CRITICAL: Connection timed out [23:45:20] PROBLEM - SSH on virt0 is CRITICAL: Connection timed out [23:45:32] andrewbogott: let me know when you aren't distractex [23:45:37] *distracted [23:45:50] PROBLEM - HTTP on virt0 is CRITICAL: Connection timed out [23:45:59] Damn, when I can't ping or ssh into a host, I don't really know how to start debugging :( [23:46:00] PROBLEM - LDAPS on virt0 is CRITICAL: Connection timed out [23:46:30] PROBLEM - Host virt0 is DOWN: PING CRITICAL - Packet loss = 100% [23:47:17] connects to virt0.mgmt [23:47:23] andrewbogott: [23:47:30] RECOVERY - Host virt0 is UP: PING WARNING - Packet loss = 93%, RTA = 40.69 ms [23:47:33] Oh, I guess I know how to that... [23:47:38] Just don't know what I would learn [23:47:53] so, i'm on the shell [23:48:51] anyone know of springle-afk's availability today? [23:48:57] we have a set of data loss bugs we need a little help with: bugs 53687/56589/56577 [23:49:08] !log starting puppetmaster on virt0 [23:49:25] andrewbogott: opendj is up, puppetmaster starting.. ehm.. what else [23:49:33] https://bugzilla.wikimedia.org/53687 [23:49:37] mutante, does it look like the machine cycled? [23:49:42] https://bugzilla.wikimedia.org/56589 [23:49:46] And does it have a network connection? 
From here looks like not really [23:49:48] https://bugzilla.wikimedia.org/56577 [23:49:50] PROBLEM - LDAP on virt0 is CRITICAL: Connection timed out [23:49:51] andrewbogott: actually..no. uptime 172d [23:50:00] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [23:50:17] robla: Given 53687's age, I'm not sure it can be considered critical. [23:50:26] mutante: I would say that someone tripped over the network cable, except I'm pretty sure that's not a thing that can happen :) [23:50:42] Elsie: it was attempted to be fixed, but that issue is remaining [23:50:50] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 8.534 second response time [23:50:50] RECOVERY - LDAP on virt0 is OK: TCP OK - 1.041 second response time on port 389 [23:50:51] so yeah, still critical, dataloss [23:50:56] Elsie: something may have gotten worse in the latest deploy (hence 56577 and 56589) [23:50:57] (potential) [23:51:00] RECOVERY - LDAPS on virt0 is OK: TCP OK - 0.041 second response time on port 636 [23:51:35] mutante, adjacent machines (e.g. virt5) seem to be fine. [23:51:56] A couple of comments from Tim on #wikimedia-tech: [23:52:11] Bye, morebots. [23:52:11] [15:10:38] [[2019_apres_la_chute_de_New_York]] is missing on all the s6 servers except the master [23:52:12] [15:11:50] which is pretty scary [23:52:21] probably be on in a couple hours robla [23:52:37] I left him a couple messages about this earlier, figuring he would at least want to know what's up [23:53:06] mutante, probably time to try a reboot if you don't see anything... [23:53:07] thanks apergos [23:53:46] are you guys liable to be around in a couple hours? [23:53:53] Tim will be [23:53:54] it's not the same bug as 53687 [23:54:13] I'm not sure it's dataloss, per se, if master is okay. 
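[editor's note] The virt0 triage above follows the usual pattern: if a host answers neither ping nor ssh, the out-of-band management interface (`virt0.mgmt` here) is the next stop, since it stays up even when the host's OS or NIC is wedged. A toy sketch of that first decision — host and mgmt naming are illustrative:

```shell
# If a host doesn't answer ping, point the operator at its mgmt console.
probe() {
  local host=$1
  if ping -c 1 -W 1 "$host" >/dev/null 2>&1; then
    echo "up: $host"
  else
    echo "down: $host (next stop: out-of-band console, e.g. ssh $host.mgmt)"
  fi
}

probe 203.0.113.1   # TEST-NET address, never routable
```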
[23:54:20] andrewbogott: ok, rebooting it [23:54:20] (03PS1) 10Chad: Fix up multiversion to not require dba_* functions [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93622 [23:54:36] Elsie: you want to wait until it turns into data loss? [23:54:45] No. [23:54:45] Elsie: hence my "(potential)" :) [23:54:53] also, are you sure there is no data loss? [23:54:58] because I am not sure [23:55:14] I add an if clause. ;-) But I'll stop distracting you. [23:55:15] !log rebooting virt0 [23:55:40] PROBLEM - Host virt0 is DOWN: PING CRITICAL - Packet loss = 100% [23:57:10] PROBLEM - LVS HTTP IPv6 on foundation-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: Connection timed out [23:57:10] PROBLEM - LVS HTTP IPv6 on wikimedia-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: Connection timed out [23:57:10] PROBLEM - SSH on ssl1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:57:10] PROBLEM - SSH on ssl2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:57:11] PROBLEM - SSH on lvs5 is CRITICAL: Connection timed out [23:57:11] PROBLEM - LVS HTTP IPv6 on upload-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 1.301 second response time [23:57:14] PROBLEM - check_mysql on payments4 is CRITICAL: Slave IO: No Slave SQL: Yes Seconds Behind Master: (null) [23:57:20] PROBLEM - LVS HTTP IPv6 on wiktionary-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: Connection timed out [23:57:20] PROBLEM - LVS HTTPS IPv6 on wikinews-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: Connection timed out [23:57:20] PROBLEM - LVS HTTP IPv4 on wikibooks-lb.pmtpa.wikimedia.org is CRITICAL: Connection timed out [23:57:20] PROBLEM - LVS HTTP IPv6 on wikipedia-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: Connection timed out [23:57:22] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: Connection timed out [23:57:22] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: Connection timed out [23:57:46] yeah, 
there's a reason I don't like to use this channel for discussion [23:58:00] RECOVERY - SSH on ssl1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:58:00] RECOVERY - SSH on ssl2 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:58:00] RECOVERY - LVS HTTP IPv6 on foundation-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 68416 bytes in 0.183 second response time [23:58:00] RECOVERY - LVS HTTP IPv6 on wikimedia-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 91562 bytes in 0.217 second response time [23:58:00] RECOVERY - SSH on lvs5 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:58:10] RECOVERY - LVS HTTP IPv4 on wikibooks-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 67964 bytes in 0.180 second response time [23:58:10] RECOVERY - LVS HTTP IPv6 on wiktionary-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 68416 bytes in 0.181 second response time [23:58:10] RECOVERY - LVS HTTP IPv6 on wikipedia-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 68416 bytes in 0.180 second response time [23:58:10] maybe we can use #wikimedia-tech? 
[23:58:14] RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 68416 bytes in 0.299 second response time [23:58:14] RECOVERY - LVS HTTPS IPv6 on wikinews-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 68414 bytes in 0.296 second response time [23:58:14] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 67964 bytes in 0.182 second response time [23:58:14] RECOVERY - LVS HTTP IPv6 on upload-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 654 bytes in 0.073 second response time [23:58:16] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:58:16] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:59:00] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.065 second response time [23:59:01] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.069 second response time [23:59:10] RECOVERY - SSH on virt0 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:59:13] sure [23:59:30] RECOVERY - Host virt0 is UP: PING WARNING - Packet loss = 86%, RTA = 40.79 ms