[00:15:48] lesliecarr: the zayo link? no [00:21:46] PROBLEM - MySQL Replication Heartbeat on db1010 is CRITICAL: CRIT replication delay 311 seconds [00:21:56] PROBLEM - MySQL Slave Delay on db1010 is CRITICAL: CRIT replication delay 325 seconds [00:23:56] PROBLEM - Disk space on elastic1008 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 10095 MB (3% inode=99%): [00:24:46] RECOVERY - MySQL Replication Heartbeat on db1010 is OK: OK replication delay -0 seconds [00:24:56] RECOVERY - MySQL Slave Delay on db1010 is OK: OK replication delay 0 seconds [00:30:55] ^d: fucking elasticsearch it was trying to move stuff to a node that was almost full..... [00:31:03] <^d> :( [00:33:56] RECOVERY - Disk space on elastic1008 is OK: DISK OK [00:34:03] I figured out how to stop it [00:34:06] but I'm unhappy [00:34:20] I told it it could only have some number of GB of disk space rather than a percent [00:34:24] and it was like "cool!" [00:34:30] and it threw some shards on the floor [00:34:36] and it is initializing those.... [00:45:05] (03PS7) 10Ori.livneh: Add logstash config for udp2log [operations/puppet] - 10https://gerrit.wikimedia.org/r/106154 (owner: 10BryanDavis) [00:45:21] (03CR) 10Ori.livneh: [C: 032 V: 032] "OK. But I am adding an item on our calendar to revert this in six weeks, on Thursday, February 27. I don't want to be stuck with this fore" [operations/puppet] - 10https://gerrit.wikimedia.org/r/106154 (owner: 10BryanDavis) [02:30:36] PROBLEM - Disk space on virt10 is CRITICAL: DISK CRITICAL - free space: /var/lib/nova/instances 39037 MB (3% inode=99%): [02:32:02] !log LocalisationUpdate completed (1.23wmf10) at 2014-01-17 02:32:01+00:00 [02:39:29] Oh, poop. [03:08:34] (03PS1) 10Springle: Disable slow information_schema query that scans all databases and tables. [operations/puppet] - 10https://gerrit.wikimedia.org/r/107994 [03:10:16] (03CR) 10Springle: [C: 032] Disable slow information_schema query that scans all databases and tables.
[operations/puppet] - 10https://gerrit.wikimedia.org/r/107994 (owner: 10Springle) [03:11:19] (03Abandoned) 10Hydriz: Move WikimediaIncubator extension call to be after Scribunto [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107340 (owner: 10Hydriz) [03:11:50] unmerged puppet changes for logstash... [03:12:11] ori: ^ ok to merge? [03:13:20] on palladium [03:19:23] * springle goes with 'yes' and watches a logstash puppet run [03:26:08] !log LocalisationUpdate completed (1.23wmf11) at 2014-01-17 03:26:07+00:00 [03:34:02] springle: yes, thanks. sorry for blocking. [03:42:41] ori: they named a DFS after you http://ori.scs.stanford.edu/ [03:56:51] ebernhardson: it's only fair [03:59:49] it truly is -- ori was already distributed in the physical sense so it would make sense to make him distributed in the virtual sense as well [04:03:24] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-01-17 04:03:24+00:00 [04:59:03] (03PS1) 10Springle: depool db1007 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107997 [04:59:39] (03CR) 10Springle: [C: 032] depool db1007 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107997 (owner: 10Springle) [04:59:46] (03Merged) 10jenkins-bot: depool db1007 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107997 (owner: 10Springle) [05:00:25] !log springle synchronized wmf-config/db-eqiad.php 'depool db1007' [05:41:27] (03PS1) 10Springle: repool db1007, depool db1041 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107998 [05:42:10] (03CR) 10Springle: [C: 032] repool db1007, depool db1041 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107998 (owner: 10Springle) [05:42:16] (03Merged) 10jenkins-bot: repool db1007, depool db1041 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107998 (owner: 10Springle) [05:43:02] !log springle synchronized wmf-config/db-eqiad.php 'repool db1007, depool db1041' [06:33:09] (03PS1) 
10Springle: Another ganglia aggregator for db multicast group to try to reduce stats gaps. [operations/puppet] - 10https://gerrit.wikimedia.org/r/107999 [06:34:45] (03CR) 10Springle: [C: 032] Another ganglia aggregator for db multicast group to try to reduce stats gaps. [operations/puppet] - 10https://gerrit.wikimedia.org/r/107999 (owner: 10Springle) [07:50:42] (03PS1) 10Ori.livneh: Add txStatsD module [operations/puppet] - 10https://gerrit.wikimedia.org/r/108010 [08:08:33] (03PS2) 10Ori.livneh: Add txStatsD module [operations/puppet] - 10https://gerrit.wikimedia.org/r/108010 [08:10:07] (03CR) 10Ori.livneh: [C: 032] Add txStatsD module [operations/puppet] - 10https://gerrit.wikimedia.org/r/108010 (owner: 10Ori.livneh) [08:45:01] (03PS1) 10Ori.livneh: Port statsd role to use txStatsD; apply to tungsten [operations/puppet] - 10https://gerrit.wikimedia.org/r/108013 [08:46:17] (03CR) 10Ori.livneh: [C: 032] Port statsd role to use txStatsD; apply to tungsten [operations/puppet] - 10https://gerrit.wikimedia.org/r/108013 (owner: 10Ori.livneh) [08:46:21] !log Replacing StatsD on tungsten with txStatsD; see commit I19ecf608d for rationale. [08:46:56] morebots died. [08:50:12] !log restarted morebots. morebots missed Sean's 'repool db1007, depool db1041' @ 6:26 UTC. [08:50:17] !log Replacing StatsD on tungsten with txStatsD; see commit I19ecf608d for rationale. [08:50:20] Logged the message, Master [08:50:26] Logged the message, Master [08:50:38] !log finished elasticsearch upgrade. we're now 0.90.10 all the way. [08:50:45] Logged the message, Master [08:51:12] poor manybubbles [08:51:20] get some sleep! 
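An editor's sketch of the Elasticsearch fix ^d described at 00:34 (capping allocation by GB of free disk rather than a percentage, so a nearly-full node stops receiving relocated shards). Setting names assume the 0.90-era disk-threshold allocation decider, and the values are invented, not the ones actually applied:

```
# Hypothetical sketch, not the change that was actually made: switch the
# disk allocation watermarks from percentages to absolute free-space
# values. With absolute values these are *minimum free space* thresholds.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.disk.threshold_enabled": true,
    "cluster.routing.allocation.disk.watermark.low": "30gb",
    "cluster.routing.allocation.disk.watermark.high": "15gb"
  }
}'
```

With percentages, the watermarks are interpreted as disk space *used*; given as absolute sizes they are interpreted as disk space *free*, which is what makes "some number of GB rather than a percent" behave differently.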
[08:51:24] txstatsd, hmmm [08:51:35] ori: there almost isn't a point to trying now [08:51:41] my kids will be up in three hours [08:51:49] i know the feeling [08:51:52] paravoid: it really is better [08:52:02] see commit msg for rationale [08:53:33] yeah I'm reading about it [08:55:36] PROBLEM - MySQL Slave Delay on db69 is CRITICAL: CRIT replication delay 306 seconds [08:55:46] PROBLEM - MySQL Replication Heartbeat on db69 is CRITICAL: CRIT replication delay 310 seconds [08:55:49] http://joemiller.me/2011/09/21/list-of-statsd-server-implementations/ [08:55:53] for crying out loud [08:56:08] i went through pretty much all of them [08:56:16] because i got fed up with statsd [08:56:20] I remember wanting to install statsite before [08:56:26] that, or something in python [08:56:35] we talked about this before, remember? [08:57:47] I'd personally keep the role named "statsd" btw [08:57:51] oh, too late [08:58:00] nevermind :) [08:58:28] manybubbles|away: mind if I move and reboot the labs instance search-test? [08:58:34] oops [08:58:46] RECOVERY - MySQL Replication Heartbeat on db69 is OK: OK replication delay 138 seconds [08:58:46] andrewbogott: huh? [08:58:49] search-test? [08:58:50] statsite is nice but doesn't support plugins (though it does have pluggable syncs) [08:58:58] it also doesn't support histograms [08:59:08] manybubbles|away: Yeah, I'm just trying to free up some space on the virt host it lives on. [08:59:19] Will be down for a few minutes, then rebooted. That OK? [08:59:20] and the fact that it is written in C makes it onerous to extend [08:59:25] why do you need histograms in statsd? [08:59:36] RECOVERY - MySQL Slave Delay on db69 is OK: OK replication delay 0 seconds [08:59:42] instead of generating them after the fact with graphite? 
[08:59:53] andrewbogott: I don't believe it is mine so I don't mind [08:59:58] hm, ok :) [09:00:00] paravoid: because statsd aggregates [09:00:01] (txstatsd looks just fine, I'm just curious) [09:00:12] but in general you can move and reboot any of my machines now that I'm going to sleep [09:00:19] fair enough :) [09:00:24] a histogram of 60-second averages isn't that useful [09:00:40] right [09:00:42] makes sense [09:01:08] paravoid: unrelated question -- [09:01:34] kibana uses PUT requests (REST hipsters, I know...) to save dashboards, and saving dashboards is apparently something you do pretty regularly when you're using kibana [09:02:06] so the default varnish vcls doesn't allow it [09:02:14] the wikimedia-wide vcl_recv blocks PUT requests, and it gets included before the misc varnish VCL so you can't really override it [09:02:15] right [09:03:08] bryan and i weren't sure about the best way of getting around that [09:04:03] maybe configure kibana not to act like a college student that just encountered roy fielding for the first time [09:05:05] i don't think we can do that without deviating from upstream [09:12:54] off to bed but will read pings tomorrow [09:18:46] RECOVERY - Disk space on virt11 is OK: DISK OK [09:23:10] #notgooglehangout [09:23:43] mutante: ? [09:24:08] andrewbogott: wrong window [09:24:12] :) [09:24:32] lo [09:24:42] hi hashar [09:25:26] and I missed andrew :D [09:28:11] ori: +3 on using a python based statsd hehe [09:28:38] the more I write python the more I like that language [09:34:57] hashar: shall we merge the contint/packages.pp change now while you're here? 
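A toy illustration of ori's point above about why the histogram has to live in the statsd layer: statsd aggregates before flushing, so a histogram built downstream from 60-second averages never sees the outliers. All numbers here are invented:

```shell
# 59 requests at 10 ms and one at 1000 ms inside a single 60 s flush
# window. A histogram kept in the statsd layer still sees the 1000 ms
# outlier; a histogram built later from per-window averages only ever
# sees the mean.
samples="$(printf '10\n%.0s' $(seq 59); echo 1000)"
mean=$(echo "$samples" | awk '{ s += $1 } END { print s / NR }')
max=$(echo "$samples" | sort -n | tail -1)
echo "per-window mean: $mean ms, true max: $max ms"
```

The 1000 ms request collapses into a 26.5 ms average, which is why "a histogram of 60-second averages isn't that useful".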
[09:38:50] yeah [09:39:01] mutante: it is simple enough :-] [09:39:06] will run puppet on the machines [09:39:37] (03CR) 10Dzahn: [C: 032] "should be NOOP" [operations/puppet] - 10https://gerrit.wikimedia.org/r/107347 (owner: 10Matanya) [09:40:24] hashar: thx, now [09:40:39] |log running puppet on gallium and lanthanum [09:40:55] I am still wondering how puppet cache the catalog [09:40:56] s [09:41:08] that was a pipe :) [09:41:10] ! [09:41:18] yeah that is intentional :D [09:41:26] so you get the message, but it is not logged [09:41:26] heh,ok [09:41:28] that is a nice feature [09:41:33] as for puppet catalog: [09:41:33] info: Caching catalog for lanthanum.eqiad.wmnet [09:41:34] info: Applying configuration version '1389951638' [09:41:42] I am afraid it cache the catalog based on the timestamp :/ [09:42:19] (03PS1) 10Springle: Use s[2-7] snapshot slaves for 'dump' (WikiExporter) query group [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/108017 [09:42:20] and a next run yields: [09:42:21] info: Caching catalog for lanthanum.eqiad.wmnet [09:42:21] info: Applying configuration version '1389951692' [09:43:08] (03CR) 10Faidon Liambotis: [C: 04-1] "Needs a manual rebase" [operations/puppet] - 10https://gerrit.wikimedia.org/r/107819 (owner: 10Matanya) [09:43:12] (03CR) 10Faidon Liambotis: [C: 032] hive: puppet 3 compatibility fix: fully qualify variables [operations/puppet] - 10https://gerrit.wikimedia.org/r/107821 (owner: 10Matanya) [09:43:27] (03CR) 10Springle: [C: 032] Use s[2-7] snapshot slaves for 'dump' (WikiExporter) query group [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/108017 (owner: 10Springle) [09:43:33] (03Merged) 10jenkins-bot: Use s[2-7] snapshot slaves for 'dump' (WikiExporter) query group [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/108017 (owner: 10Springle) [09:43:57] hashar: yea, timestamp. 
date -d@1389951692 [09:43:58] (03PS1) 10Hashar: /find_tabs.sh to find puppet manifest using tabs [operations/puppet] - 10https://gerrit.wikimedia.org/r/108018 [09:44:04] Fri Jan 17 01:41:32 PST 2014 [09:44:24] mutante: yeah I found some slides presenting a hack to use the GIT sha1 instead [09:44:36] though one would need the sha1 of both ops/puppet and the private repo [09:44:41] (03CR) 10Faidon Liambotis: [C: 04-1] ldap: puppet 3 compatibility fix: fully qualify variable (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/107823 (owner: 10Matanya) [09:44:41] that might help getting more catalogs cached [09:44:48] !log springle synchronized wmf-config/db-eqiad.php 'LB 'dump' s[2-7]' [09:44:54] mutante: and a tiny script to find manifests still having tabs: https://gerrit.wikimedia.org/r/108018 [09:44:54] aha, interesting [09:44:55] (03CR) 10Faidon Liambotis: [C: 032] smokeping: puppet 3 compatibility fix: module path [operations/puppet] - 10https://gerrit.wikimedia.org/r/107825 (owner: 10Matanya) [09:44:56] Logged the message, Master [09:45:10] (03CR) 10Faidon Liambotis: [C: 032] geoip: puppet 3 compatibility fix: module path [operations/puppet] - 10https://gerrit.wikimedia.org/r/107826 (owner: 10Matanya) [09:45:14] hashar: script to find tabs = puppet lint ? 
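A plain-grep sketch of what the tab-finding script in https://gerrit.wikimedia.org/r/108018 presumably amounts to (an illustration, not the script itself): list manifests containing a literal tab character without a full puppet-lint parse.

```shell
# List puppet manifests that contain a literal tab, with no parsing.
# Takes the manifests directory as an argument.
find_tabs() {
    find "$1" -name '*.pp' -exec grep -l "$(printf '\t')" {} +
}
```

Usage would be e.g. `find_tabs manifests/`; since grep never parses the files, this is much cheaper than running puppet-lint over every manifest and filtering its output.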
[09:45:26] in this case it only find tabs and is WAY faster [09:45:32] ok [09:46:57] (03PS2) 10Hashar: retab realm.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/104808 [09:47:23] (03PS2) 10Hashar: realm.pp puppet lint fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/104809 [09:47:54] the realm.pp change is simple enough if you have time for it [09:48:24] hashar: did that before locally with: puppet-lint manifests/* | grep "tab char" [09:48:46] yeah that would work, albeit it causes a full parse by puppet-lint [09:51:54] I published the script in Gerrit merely for information [09:52:00] might not be a good idea to merge it [09:54:38] i was thinking that you just use it locally, but maybe it can go into that "tools" repo instead [09:56:43] (03CR) 10Dzahn: [C: 031] retab realm.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/104808 (owner: 10Hashar) [09:59:36] mutante: in a different repo, it is unlikely to be noticed / used :D [10:00:21] hashar: fair , yea [10:05:13] hashar: heh, yea, folks will always comment on stuff they see in lint changes, that you didn't actually write and is unrelated, but that's what happens:) [10:05:25] the "while you're at it"-effect [10:27:17] (03CR) 10Dzahn: "that's because it was already done before then it seems:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/107555 (owner: 10Matanya) [10:30:25] (03PS2) 10Matanya: nagios: puppet 3 compatibility fix: fully qualify variables [operations/puppet] - 10https://gerrit.wikimedia.org/r/107819 [10:31:04] (03CR) 10Dzahn: "don't you need <%= scope.lookupvar() in the .erb template to get the content of $host there?" 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/107555 (owner: 10Matanya) [10:41:33] (03CR) 10Dzahn: "actual question because it appears to work like this on antimony but in a lot of other Apache site templates you see the other way" [operations/puppet] - 10https://gerrit.wikimedia.org/r/107555 (owner: 10Matanya) [10:43:21] mutante: might fix the issues in a different change so [10:43:31] I try to avoid changing code / layout too much when doing lint changes [10:43:44] so they are easier to review and less likely to cause an issue when being applied on servers [10:44:24] hashar: yea, so far that didn't even include opinion, i just noticed that this always happens to all of us [10:56:05] (03PS1) 10Alexandros Kosiaris: Add iron.wikimedia.org to bastion_hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/108023 [11:00:56] (03PS2) 10Alexandros Kosiaris: Add iron.wikimedia.org to bastion_hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/108023 [11:04:25] (03CR) 10Dzahn: [C: 031] "yep, those IPs are iron. and i noticed earlier that i couldn't ssh to antimony from iron, while i could from bast1001, so iron should be a" [operations/puppet] - 10https://gerrit.wikimedia.org/r/108023 (owner: 10Alexandros Kosiaris) [11:06:58] (03CR) 10Hashar: certs.pp puppet lint fixes (039 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 (owner: 10Hashar) [11:07:29] (03PS3) 10Hashar: certs.pp puppet lint fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 [11:08:03] (03CR) 10jenkins-bot: [V: 04-1] certs.pp puppet lint fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 (owner: 10Hashar) [11:08:17] :( [11:13:06] (03PS4) 10Hashar: certs.pp puppet lint fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 [11:13:22] (03CR) 10Hashar: "rebased on tip of production branch, fixed conflicts." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 (owner: 10Hashar) [11:14:10] (03PS3) 10Alexandros Kosiaris: Add iron.wikimedia.org to bastion_hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/108023 [11:14:57] (03PS5) 10Hashar: certs.pp puppet lint fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 [11:15:22] (03CR) 10Hashar: "Fix puppet-lint error in the define certificates::rapidssl_ca_2 which was introduced while rebasing and I forgot to lint :(" [operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 (owner: 10Hashar) [11:16:27] (03PS1) 10Dzahn: let iron have a AAAA record [operations/dns] - 10https://gerrit.wikimedia.org/r/108027 [11:19:49] (03CR) 10Hashar: "minor issue: there is no AAAA entry for iron." (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/108023 (owner: 10Alexandros Kosiaris) [11:20:55] (03CR) 10Dzahn: "(PS1) Dzahn: let iron have a AAAA record [operations/dns] - https://gerrit.wikimedia.org/r/108027" [operations/puppet] - 10https://gerrit.wikimedia.org/r/108023 (owner: 10Alexandros Kosiaris) [11:25:49] (03CR) 10Alexandros Kosiaris: [C: 032] Add iron.wikimedia.org to bastion_hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/108023 (owner: 10Alexandros Kosiaris) [11:26:36] (03PS1) 10Alexandros Kosiaris: lint 1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa [operations/dns] - 10https://gerrit.wikimedia.org/r/108031 [11:26:38] (03PS1) 10Alexandros Kosiaris: Add reverse IPv6 records for bastion hosts [operations/dns] - 10https://gerrit.wikimedia.org/r/108032 [11:28:26] (03CR) 10Dzahn: certs.pp puppet lint fixes (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 (owner: 10Hashar) [11:29:48] (03CR) 10Hashar: certs.pp puppet lint fixes (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 (owner: 10Hashar) [11:30:28] (03PS6) 10Hashar: certs.pp puppet lint fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 [11:33:23] (03CR) 
10Dzahn: certs.pp puppet lint fixes (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 (owner: 10Hashar) [11:34:04] * mutante hides from hashar [11:36:53] (03CR) 10Matanya: certs.pp puppet lint fixes (035 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 (owner: 10Hashar) [11:47:32] commuting out to coworking place [11:51:47] (03CR) 10Dzahn: "reverse for this is in https://gerrit.wikimedia.org/r/#/c/108032/1" [operations/dns] - 10https://gerrit.wikimedia.org/r/108027 (owner: 10Dzahn) [11:56:52] (03PS2) 10Dzahn: let iron and bast1001 have AAAA records [operations/dns] - 10https://gerrit.wikimedia.org/r/108027 [11:57:04] (03CR) 10jenkins-bot: [V: 04-1] let iron and bast1001 have AAAA records [operations/dns] - 10https://gerrit.wikimedia.org/r/108027 (owner: 10Dzahn) [11:58:24] good catch, jenkins:) [11:58:27] (03PS3) 10Dzahn: let iron and bast1001 have AAAA records [operations/dns] - 10https://gerrit.wikimedia.org/r/108027 [13:50:18] (03PS1) 10Hashar: beta: monitor fatal errors on beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/108039 [13:53:26] PROBLEM - Varnish HTTP text-backend on cp1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:53:46] PROBLEM - Varnish traffic logger on cp1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:54:06] PROBLEM - Varnish HTCP daemon on cp1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:55:15] wow [13:55:17] paravoid: ^XFS: possible memory allocation deadlock, another one, what [13:55:20] would you prefer [13:55:29] and here I was saying how it's fixed [13:55:46] it keeps spitting them out on mgmt currently [13:57:06] RECOVERY - Varnish HTCP daemon on cp1054 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [13:57:16] RECOVERY - Varnish HTTP text-backend on cp1054 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.000 second response time [13:57:36] RECOVERY - Varnish traffic logger on cp1054 is OK: PROCS OK: 2 processes with command name varnishncsa [14:05:58] (03PS1) 10Hashar: puppetize beta fatal monitor [operations/puppet] - 10https://gerrit.wikimedia.org/r/108041 [14:07:18] mutante: feel brave enough for some puppet cron {} review ? :-D [14:07:36] that is for beta [14:10:43] hashar: sorry, in the middle of testing a module, have to say later [14:10:58] (labs issues) [14:12:05] mutante: tis ok [14:12:20] * hashar applies for ops position to be able to self merge  [14:12:37] +1 [14:12:44] +1 [14:13:06] +2:P [14:13:23] jenkins tells HR? :p [14:16:59] mark: so, echo 1 > /proc/sys/vm/compact_memory fixed the symptoms of "possible memory allocation deadlock in kmem_alloc" [14:17:44] without a reboot [14:18:00] but this is the box that had TSO disabled [14:18:03] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=cp1054.eqiad.wmnet&m=mem_report&s=by+name&mc=2&g=mem_report&c=Text+caches+eqiad [14:18:25] so disabling TSO does /something/, but apparently not fixing the issues entirely [14:19:37] so, maybe compact_memory on a cronjob? [14:19:45] plus disable TSO anyway? [14:19:48] maybe 3.11 too? [14:22:32] hashar: labs currently cant configure instances. something broke in last deploy .being investigated https://bugzilla.wikimedia.org/show_bug.cgi?id=60167 [14:32:34] mutante: can it be LDAP related? 
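What the stopgap floated at 14:19 for the cp1054 XFS "possible memory allocation deadlock" might look like; the cron interval and interface name are guesses, not deployed configuration:

```
# Hypothetical sketch only. Periodically force memory compaction so
# XFS's kmem_alloc can find contiguous pages:
#
#   /etc/cron.d/compact-memory
#   */15 * * * * root echo 1 > /proc/sys/vm/compact_memory
#
# and keep TCP segmentation offload disabled on the NIC as well:
ethtool -K eth0 tso off
```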
[14:32:46] cause we apparently have some LDAP replication issue going on [14:33:32] mutante: forget me, Coren commented about the mediawiki code being updated [14:35:01] Yeah, I would have just restored to known good code if I was confident I wouldn't be breaking something-in-progress. I wish Mike was a little more verbose while he worked though so we'd know for sure. [14:37:20] (03CR) 10Alexandros Kosiaris: [C: 04-1] beta: monitor fatal errors on beta labs (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/108039 (owner: 10Hashar) [14:40:14] (03CR) 10Hashar: beta: monitor fatal errors on beta labs (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/108039 (owner: 10Hashar) [14:40:31] (03CR) 10Alexandros Kosiaris: [C: 032] "I think it is kind of an overkill to have that rather small misc/ class and then just include it on the role class. But since it seems to " [operations/puppet] - 10https://gerrit.wikimedia.org/r/108041 (owner: 10Hashar) [14:41:01] argh [14:41:11] why is there two conflicting person in my little brain [14:41:48] isn't that a mental illness ? [14:41:55] yea it is [14:42:04] the good thing is that I can stay alone for extensive period of time [14:42:11] since I got friends to talk to everytime [14:42:15] ahahahaha [14:42:31] albeit those friends keep disputing so it is a bit noisy at night [14:42:44] found out a way to came over it which is to think about images, they vanish [14:43:27] ok ok. just be careful. 
two is ok but if you get to something like 8 you will become the subject of a movie like that guy in Identity [14:44:15] (03PS2) 10Hashar: beta: monitor fatal errors on beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/108039 [14:44:15] :-D [14:44:46] it is usually a conversation between me and my other self with a third unidentified party coming from time to time to interrupt us [14:44:58] (03CR) 10Hashar: "added a shebang and made the file executable" [operations/puppet] - 10https://gerrit.wikimedia.org/r/108039 (owner: 10Hashar) [14:45:04] see ? i told you. be careful :-) [14:45:17] that third guy is the way done a slippery slope [14:45:22] down* [14:45:24] (03PS2) 10Hashar: puppetize beta fatal monitor [operations/puppet] - 10https://gerrit.wikimedia.org/r/108041 [14:45:51] akosiaris: would you keep the .rb suffix when installing file in a /bin/ dir ? [14:46:06] no [14:46:19] I 'd rather not [14:46:27] (03CR) 10Alexandros Kosiaris: [C: 032] beta: monitor fatal errors on beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/108039 (owner: 10Hashar) [14:46:34] (03CR) 10Hashar: "Yup should make the beta classes a puppet module maybe, just like we did for continuous integration." [operations/puppet] - 10https://gerrit.wikimedia.org/r/108041 (owner: 10Hashar) [15:00:58] (03CR) 10Faidon Liambotis: [C: 04-1] "Nak, these are autoconfigured IPs. Use interface::add_ip6_mapped to give them proper IPv6 addresses." [operations/dns] - 10https://gerrit.wikimedia.org/r/108032 (owner: 10Alexandros Kosiaris) [15:01:10] (03CR) 10Faidon Liambotis: [C: 04-2] "Let's do both forward & reverse in one commit please." 
[operations/dns] - 10https://gerrit.wikimedia.org/r/108027 (owner: 10Dzahn) [15:06:43] (03CR) 1001tonythomas: [C: 04-1] "It shows Can Merge: No" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101820 (owner: 10Arav93) [15:22:36] PROBLEM - DPKG on rhodium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:35:20] (03CR) 10Mark Bergsma: "Yurik: could you please rebase this change to current HEAD? When needing to change it, I'd like to make sure I'm not working out outdated " [operations/puppet] - 10https://gerrit.wikimedia.org/r/102316 (owner: 10Yurik) [16:09:11] (03PS3) 10coren: openstack: convert iptables to ferm [operations/puppet] - 10https://gerrit.wikimedia.org/r/98307 (owner: 10Faidon Liambotis) [16:10:49] (03CR) 10coren: [C: 031] "Carry over previous +1 with the change to reflect the merge conflict (allow all of 10/8 rather than distinguish between pmtpa and eqiad)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/98307 (owner: 10Faidon Liambotis) [16:11:27] (03PS1) 10Hashar: restore $oaiAgentRegex [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/108055 [16:12:36] (03CR) 10Brion VIBBER: [C: 032] "We're pretty sure putting this back will fix a legacy configuration that stopped working recently." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/108055 (owner: 10Hashar) [16:14:05] !log hashar synchronized wmf-config/CommonSettings.php 'restore $oaiAgentRegex' [16:14:12] Logged the message, Master [16:14:40] !log hashar synchronized wmf-config/InitialiseSettings.php 'touch, restore $oaiAgentRegex' [16:14:47] Logged the message, Master [16:17:48] (03CR) 10Hashar: "added back with https://gerrit.wikimedia.org/r/108055" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105475 (owner: 10Reedy) [16:33:30] enwiki timeouts for me [16:33:35] anyone know what's up? [16:34:33] isup.me says it's up [16:34:51] <^d> Doesn't your internet suck? 
[16:35:19] ^d: everything else works fine [16:35:32] ^d: https://dpaste.de/ahCy [16:35:34] so it can connect [16:35:39] just returns an empty reply [16:35:41] the server [16:36:19] https://ganglia.wikimedia.org/latest/ works for me [16:36:34] <^d> :\ [16:36:41] hitting other wikipedias also has the same result [16:36:59] hitting tools works [16:37:06] any idea whom I should flag? [16:37:55] Jeff_Green: I guess this isn't RT duty, but I suppose I could ping you? ^ [16:38:15] hmm, seems to work now [16:38:23] <^d> Alphabetically always screws the same people :p [16:38:54] it's just really slow now [16:38:55] but works [16:42:20] is still really, *really* slow [16:42:23] (everything else is fast) [16:43:00] could be a network issue with the nl data center or something, maybe? [16:43:02] traceroute? [16:43:15] wfm over here [16:43:20] doing [16:43:25] yeah, works for brion too [16:44:06] still running [16:44:15] YuviPanda: how does wikitech feel? [16:44:27] because it would be cluster wiki [16:44:35] wouldn't [16:44:40] mutante: same as before [16:45:33] mutante: https://dpaste.de/GMxh [16:45:34] so far [16:46:03] 954.641 ms doesnt sound optimal.. ehm [16:46:14] but that's still before our infra [16:46:44] mutante: hmm, so network from cluster to me is broken? [16:47:31] ash sounds like Ashburn [16:47:51] ae0-xcr1.ash.cw.net ? [16:47:59] yeah, possibly [16:48:04] cr = core router [16:48:09] it does tell me it is hitting traceroute to text-lb.eqiad.wikimedia.org (208.80.154.224), 64 hops max, 52 byte packets [16:48:11] so eqiad [16:48:22] paravoid: ^? [16:48:44] oh, in here [16:48:44] do you know how close that is to us ? ash.cw.net [16:48:58] mark: ? [16:49:38] what? [16:49:51] people are reporting traceroutes to eqiad that end there [16:50:01] end at cw.net? 
[16:50:02] https://dpaste.de/GMxh [16:50:11] 15 ae0-xcr1.ash.cw.net (195.2.30.45) 954.641 ms * * [16:50:33] my guess is cr=core router, and ash=Ashburn [16:50:45] curl also gave me https://dpaste.de/ahCy a while ago. it returns now, is just really slow [16:50:50] http://www.cw.net/ says Vodafone [16:50:55] yes, and cw = cable&wireless, aka vodafone [16:50:57] and gives peering info [16:51:05] Vodafone AS1273 [16:51:09] I know [16:51:42] so people report slowness on all wikis, like YuviPanda [16:51:59] traceroute for bits https://dpaste.de/crtu [16:52:02] of course it's slow, there's 650ms & lost packets from the 5th hop onwards [16:52:24] looks like an issue on your side [16:52:24] < mutante> 954.641 ms doesnt sound optimal.. ehm [16:52:45] is anyone else reporting issues? [16:52:50] i think that is the case, yea [16:52:52] brion [16:53:03] brion's is fine now [16:53:06] i think [16:53:10] brion: can I get a traceroute from you as well? [16:53:25] paravoid: I'll just wait it out and see if it gets better. Thanks! [16:53:32] paravoid: just wanted to report, since everything else works as usual [16:54:44] thanks YuviPanda [17:00:47] interestingly my trace route craps out at xe-5-0-0.was10.ip4.tinet.net [17:00:54] though a direct ping reaches... [17:01:10] can I see it? [17:01:16] in private if you prefer [17:01:16] lemme cut-n-paste [17:01:28] mine craps out at te3-4.co2.as30217.net, from the office, though the site is responsive [17:01:47] https://gist.github.com/brion/508ad5579a7a319ebd74 [17:02:00] oh i'm sure my home IP is all over wikipedia, i edit logged out by accident all the time ;) [17:02:03] http://paste.debian.net/76888/ [17:02:36] i think that *.was10.ip4.tinet.net is in washington dc area though judging by the ping time [17:02:47] only 1-2ms different from my final ping [17:03:24] "traceroute" uses UDP by default [17:03:30] can it be related to the S.F->Ashburn fiber? 
[17:03:44] iirc we block these for to prevent attacks such as DNS amplification attacks [17:03:45] i can pull the fiber...it shouldn't have traffic on it though [17:03:54] cmjohnson1: don't, everything's okay [17:03:57] okay [17:04:02] ok [17:04:08] brion: use "traceroute -I" or "mtr" [17:04:23] the "crap out" part is the last hop, by design [17:04:48] ok, traceroute -I works as expected [17:04:54] whee [17:05:09] yeah the last hops were from tibet to eqiad and there into the text-lb [17:05:15] *tinet [17:05:24] tibet, hah [17:05:25] silly autocorrect! tibet has shitty network access [17:05:27] traceroute from a friend of mine, on a different ISP http://hastebin.com/cufokoguge.md [17:05:31] seems fine, so just my ISP I guess [17:05:44] YuviPanda: yup [17:05:46] sorry about the alarm [17:05:50] welcome to network hell yuvi [17:05:53] it's still interesting to see you going via Vodafone India [17:05:59] where YOUR AND ONLY YOUR internet fails [17:06:05] I don't know why only wiki is slow, however [17:06:09] everything else is fast enough [17:06:13] paravoid: oh? why so? [17:06:17] I am in India... [17:06:30] no, we have a new path to Vodafone [17:06:37] so it was interesting to me [17:06:46] ah [17:07:24] paravoid: can you tell me what you mean by 'new path to Vodafone'? Just curious (and know nothing of networks at that scale) [17:07:30] let's all just take a second to revel in the wonder that is the internet: that it works at all is an amazing tribute to human engineering [17:08:55] brion: /me does 10 hail marys [17:09:19] greg-g: the internet gods are old-school, they demand blood sacrifice [17:09:33] oh, alright then, lemme catch monte real quick [17:09:47] greg-g: no sacrificing ouyr iOS developer [17:10:11] *our [17:10:43] hehe [17:10:54] no comment [17:54:19] paravoid, got it. it's good to know that we're not quite as TPA-dependent as we used to be, in any case. :P [17:54:24] Still lots of stuff to do though. 
[17:54:26] not much at all [17:54:29] Labs + Bugzilla + some misc stuff? [17:55:22] yeah, mail [17:56:03] hi, saw the alerts while en route to the office :( [17:56:03] marktraceur: Logging is still useful for people in this channel, and to save later: https://wikitech.wikimedia.org/w/index.php?title=Server_Admin_Log&diff=96578&oldid=96577 [17:56:03] PROBLEM - check_mysql on payments1001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [17:56:03] jgage, don't worry, you'll get a nice outage yet :P [17:56:03] BZ is very close and 4.4 sitting on zirconium, we are like 95% there, would have needed db sync and switchover though [17:56:03] Heh, yes [17:56:03] hehehe [17:56:04] and very few merges [17:56:04] at least i didn't cause this one, phew [18:00:16] PROBLEM - check_mysql on payments1001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [18:01:31] <^d> I wonder if we can move url-downloader to some $randomUnderusedMisc in eqiad. Should just be two lines of puppet to move. [18:02:26] <^d> Oh, maybe some other stuff too. Ignore me. Bad time. 
[18:05:16] PROBLEM - check_mysql on payments1001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [18:05:22] ^d: for later https://gerrit.wikimedia.org/r/#/c/107590/ [18:06:51] <^d> mutante: Yeah I saw :) [18:10:16] PROBLEM - check_mysql on payments1001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [18:15:16] PROBLEM - check_mysql on payments1001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [18:16:07] oh paravoid [18:16:13] i just realized that this didn't merge [18:16:13] https://gerrit.wikimedia.org/r/#/c/107723/ [18:16:17] because of your -2 [18:18:56] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [18:20:16] PROBLEM - check_mysql on payments1001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [18:25:16] PROBLEM - check_mysql on payments1001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [18:30:16] PROBLEM - check_mysql on payments1001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [18:32:05] <^d> crap, we do have gerrit problems from tampa downtime. [18:32:18] <^d> we have two ldap hosts, one in eqiad and one in tampa. so you've got a chance of hitting tampa. [18:32:35] <^d> (would explain why i've been getting intermittent errors) [18:40:50] The IRC RC feed is down [18:42:45] part of tampa fiber issue, see ops channel [18:43:24] ori, this is the operations channel :) [18:43:49] i need coffee. anyways, known. [18:44:12] everything that is currently down should be linked to Tampa tracking ticket ideally [18:44:42] at least the good part about it is identifying some that may have been forgotten [18:44:51] How have three fibres been cut at the same time? 
:/ [18:45:22] Krenair: that's what everybody asks:) maybe by not noticing you already cut one before [18:45:38] oh ok, so it was just on the last one until earlier when that broke too? [18:45:41] and then cutting the last one later [18:46:01] it's speculation, we're waiting for response with more details [18:46:16] I'm guessing the ticket you refer to is in RT [18:46:22] so anyone on the phone with fpl ? [18:46:24] they like phone [18:46:26] LeslieCarr: I was [18:46:27] i'm guessing paravoid is [18:46:28] ok [18:46:28] hehe [18:46:31] Krenair: yea, but there might also be BZ for the same [18:46:40] Krenair: either one, also happy to transfer it later [18:46:41] LeslieCarr: three cuts, orlando, sarasota, tampa [18:46:43] couldn't find one [18:46:51] fyi they can be slackers, so sometimes you have to keep calling... but they also mainly just call up other people [18:47:02] this is leslie suspicious face [18:47:05] <- [18:47:16] they seemed fairly knowledgeable about the situation [18:47:38] as soon as I called and before I gave any details they were saying "so... we have three cuts right now. heh, amazing, isn't it?" [18:49:45] any chance we can recover the irc feed before we're back up in tpa? [18:50:04] logmsgbot? [18:50:10] irc.wikimedia.org [18:50:13] oh [18:50:27] any OOB in tampa? [18:50:33] ekrem [18:51:03] csteipp: can you ssh into tin/terbium? [18:51:25] AaronSchulz: oh hey, I've disabled swift @ pmtpa and deployed earlier [18:51:28] AaronSchulz: tin yeah [18:51:44] paravoid: change of plans? [18:51:46] terbium too [18:51:55] Eloquence: looking; I'm not very familiar with it [18:52:03] csteipp: i get "Network is unreachable" [18:52:07] IRC migration ticket is https://rt.wikimedia.org/Ticket/Display.html?id=4784 [18:52:08] AaronSchulz: no, Tampa is disconnected from eqiad [18:52:08] I can get into fenari though [18:52:20] AaronSchulz: what's your proxy server?
[18:52:36] paravoid: puppetized but lacks packaging [18:52:43] AaronSchulz: fiber cuts; same cause for the problem you're encountering, use bast1001 to fix this [18:52:44] bast1001.wikimedia.org is what I use... [18:52:56] csteipp: still fenari [18:53:09] * AaronSchulz updates his ProxyCommand [18:53:14] Ah, I bet fenari -> eqiad isn't happy... [18:53:23] AaronSchulz: tampa <-> eqiad is broken and will be for the next hour or two (or three) [18:53:33] fenari <-> anything in eqiad 10.x won't work [18:53:53] nor will tin/terbium <-> pmtpa 10.x, like mwN boxes in tampa [18:54:21] anyone familiar with the IRC feed? [18:54:24] paravoid: wmgRC2UDPAddress and wmgRC2UDPPort in InitialiseSettings.php specifies host/port for rc feed; modules/ircd/manifests/mediawiki-irc-relay.pp forwards to irc [18:54:27] paravoid: https://wikitech.wikimedia.org/wiki/IRCD [18:54:41] ori: port 9390 it seems [18:55:34] paravoid: modules/ircd but we don't have the actual package [18:55:36] hrm [18:55:43] yeah, give me a sec [18:56:50] https://gerrit.wikimedia.org/r/#/c/94407/7/modules/ircd/manifests/mediawiki-irc-relay.pp [18:56:51] it'd be good to move it; the urgency is that most countervandalism tools depend on it, and wikimedia wikis depend on countervandalism tools to control vandalism [18:57:06] I know [18:57:09] (ekrem being hardcoded in there is the bad part) [18:58:39] guillom: newsworthy? ^^ [18:59:04] twkozlowski: yes, probably. [19:00:07] backlogs says IRC, labs, BZ, and FR [19:00:51] BZ and labs already up, FR isn't anything we're interested it from a user's perspective... [19:00:52] partially FR, I'd get a quote from Jeff_Green and/or mwalker and/or K4-713 about it [19:00:55] is this summary right? :) [19:01:01] yeah [19:01:15] greg-g: a quote re. the effects of the pmtpa outage? 
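The bastion switch AaronSchulz makes above (updating his ProxyCommand from fenari to bast1001) would look roughly like this in `~/.ssh/config`; the Host pattern is an illustrative assumption, and `-W` requires OpenSSH 5.4+:

```
# Reach eqiad-internal (10.x) hosts via bast1001 instead of fenari
# while tampa <-> eqiad is down. Host pattern is illustrative.
Host *.eqiad.wmnet
    ProxyCommand ssh -W %h:%p bast1001.wikimedia.org
```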
[19:01:23] yea, BZ and partial labs have been brought back thanks to the work-around [19:01:29] but the fibers are still not back [19:01:33] twkozlowski: and some mail [19:01:34] Jeff_Green: before they go off saying FR was down ;) [19:01:41] it's a little more nuanced than that ;) [19:01:46] true [19:02:00] greg-g: Fundraising has been fine since 10:17 PST, as far as I know. [19:02:14] So how is it the fibre cuts are being worked around at the moment? [19:02:48] BZ labs outage 17:19-17:49, IRC 17:19 - ? [19:03:22] did something automatically disable puppet? [19:03:40] in /var/log/puppet.log i see several messages like this: [19:03:45] "notice: Skipping run of Puppet configuration client; administratively disabled; use 'puppet Puppet configuration client --enable' to re-enable." [19:04:05] this in on sylvester, a vm in the catgraph project. [19:04:14] can this be related to the network problems? [19:04:34] should i just re-enable it? [19:04:44] that wouldn't automatically disable puppet, however there is a puppet bug which occasionally causes it to disable itself [19:04:47] when running the daemon [19:05:35] JohannesK_WMDE: i've seen puppet getting in that state sometimes, unrelated of any networking stuff [19:05:38] to [19:06:11] JohannesK_WMDE: you can just enable it, and the bonus bug is what it tells you to type has a typo in it:p [19:06:22] LeslieCarr, mutante: thanks, i'll just reenable it and see what happens [19:06:30] mutante: lol okay [19:06:47] JohannesK_WMDE: s/Puppet configuration client/agent/ [19:07:02] mutante: makes sense :D thanks [19:07:15] JohannesK_WMDE: yw [19:18:49] gonna deploy https://gerrit.wikimedia.org/r/#/c/108072/ [19:19:09] god speed. [19:19:50] !log faidon updated /a/common to {{Gerrit|I4342b062b}}: Switch wmgRC2UDPAddress to a temp eqiad relay [19:20:15] \o/ [19:20:22] !:) [19:20:22] /me laughs [19:20:49] ? 
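The "administratively disabled" state JohannesK_WMDE hit above is just a lockfile on disk, so it can be checked before blindly re-enabling. A minimal sketch; the lockfile name and state directory vary across puppet versions, so treat the names below as assumptions:

```shell
# Report whether the puppet agent has been administratively disabled,
# by checking for the lockfile that `puppet agent --disable` drops.
# Takes the agent state directory as its argument.
puppet_status() {
    state_dir=$1
    if [ -f "$state_dir/agent_disabled.lock" ]; then
        echo disabled
    else
        echo enabled
    fi
}
# Re-enable with: puppet agent --enable
# (the hint printed in puppet's own log message has a typo,
# as mutante points out above)
```

Typical usage would be `puppet_status /var/lib/puppet/state`.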
[19:20:50] wow, i didn't know it would reply that [19:21:04] oh [19:21:23] i just wanted to chime in and say how cool it is you are finding a fix, wm-bot surprised me too :P [19:21:36] !log faidon synchronized wmf-config/InitialiseSettings.php 'Switch wmgRC2UDPAddress to a temp eqiad relay' [19:21:57] I'm not very familiar with the irc feed, can someone verify it works now? [19:22:08] Doesn't work for me [19:22:20] oh ok, different ip [19:22:33] 11:28 -!- Irssi: Join to #en.wikipedia was synced in 0 secs [19:22:46] * Connecting to 208.80.154.157 (208.80.154.157) port 6667... [19:22:46] * Connection failed. Error: Connection refused [19:22:59] the irc server is where it was [19:23:01] i don't see the feed yet though [19:23:17] paravoid: nothing changed there [19:23:25] ok, will debug further, thanks [19:25:32] paravoid: works! [19:25:36] I know [19:25:36] :) [19:25:43] =) [19:25:59] paravoid: works [19:26:10] arr, too late, thanks! [19:26:54] Eloquence: ^ [19:27:06] awesome :) [19:27:37] So how is it the fibre cuts are being worked around at the moment? [19:28:04] Krenair: a team is on the way is what we know [19:28:06] Working. [19:28:31] Krenair: what do you mean? [19:28:34] Krenair: oh, "how", paravoid knows:) [19:30:06] paravoid, well it seems I can somehow connect to some pmtpa stuff but other things internally are/were broken (such as relaying RC from wikis to the IRC server) [19:30:28] tampa is an island network-wise right now [19:30:45] it works and has "internet", but is not connected to eqiad [19:31:07] you can see the whole internet but eqiad, and of course private doesn't work [19:31:20] so it has to go via the public internet to get to eqiad? [19:31:50] yah [19:31:53] it won't get to eqiad [19:31:58] not via public internet either [19:32:01] Hello ops. 
You are probaby super busy but we are going to do maintenance work on EventLogging DB [19:32:09] we shall be adding a column [19:32:20] yeah, we're going to exploit the fact that there may be data issues anyway to do a migration on db1047 [19:32:37] ori: sounds nasty [19:32:45] no funkiness expected, worst case scenario is EventLogging meltdown but that won't spillover to any other services [19:32:55] * greg-g nods [19:33:12] either it's important enough to not do it on a friday and esp. while having a partial outage, or it's not important for us to care :) [19:33:15] but I appreciate the notice :) [19:33:20] from -tech: [19:33:21] 14:32 < MusikAnim> Looks like IRC feed is back to normal function. Big thanks to the WM team! [19:33:31] oh, ya'll know [19:33:36] thanks for prompt response [19:33:40] * greg-g should stop multitasking [19:33:48] yes, the IRC feed goes via esams now [19:34:00] eqiad -> esams -> pmtpa, as crazy as that sounds [19:34:39] ew. [19:34:40] it's awesome that that worked:) [19:35:33] i had totally different scenarios in mind, like you wanted to actually move it and stuff [19:36:56] RECOVERY - Host sq69 is UP: PING OK - Packet loss = 0%, RTA = 33.38 ms [19:36:56] RECOVERY - Host ssl4 is UP: PING OK - Packet loss = 0%, RTA = 33.59 ms [19:36:56] RECOVERY - Host sq67 is UP: PING OK - Packet loss = 0%, RTA = 33.07 ms [19:36:56] RECOVERY - Host mchenry is UP: PING OK - Packet loss = 0%, RTA = 33.06 ms [19:36:56] RECOVERY - Host brewster is UP: PING OK - Packet loss = 0%, RTA = 33.03 ms [19:36:56] RECOVERY - Host tridge is UP: PING OK - Packet loss = 0%, RTA = 32.96 ms [19:36:56] RECOVERY - Host capella is UP: PING OK - Packet loss = 0%, RTA = 32.96 ms [19:36:57] RECOVERY - Host virt0 is UP: PING OK - Packet loss = 0%, RTA = 32.94 ms [19:36:57] RECOVERY - Host dataset2 is UP: PING OK - Packet loss = 0%, RTA = 33.12 ms [19:36:58] RECOVERY - Host payments2 is UP: PING OK - Packet loss = 0%, RTA = 33.09 ms [19:36:58] RECOVERY - Host payments4 is UP: PING 
OK - Packet loss = 0%, RTA = 33.23 ms [19:36:59] woot [19:36:59] RECOVERY - Host payments1 is UP: PING OK - Packet loss = 0%, RTA = 33.13 ms [19:37:05] oh, hey [19:37:06] RECOVERY - Host lvs2 is UP: PING OK - Packet loss = 0%, RTA = 33.00 ms [19:37:06] RECOVERY - Host fenari is UP: PING OK - Packet loss = 0%, RTA = 33.11 ms [19:37:06] RECOVERY - Host sanger is UP: PING OK - Packet loss = 0%, RTA = 34.16 ms [19:37:06] RECOVERY - Host pdf2 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [19:37:06] RECOVERY - Host ns1.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [19:37:06] RECOVERY - Host ekrem is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [19:37:06] RECOVERY - Host sq70 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [19:37:07] RECOVERY - Host kaulen is UP: PING OK - Packet loss = 0%, RTA = 33.11 ms [19:37:07] RECOVERY - Host 208.80.152.132 is UP: PING OK - Packet loss = 0%, RTA = 32.95 ms [19:37:08] that's me with network stuff [19:37:08] RECOVERY - Host lvs5 is UP: PING OK - Packet loss = 0%, RTA = 33.11 ms [19:37:08] RECOVERY - Host emery is UP: PING OK - Packet loss = 0%, RTA = 33.02 ms [19:37:09] RECOVERY - Host formey is UP: PING OK - Packet loss = 0%, RTA = 33.26 ms [19:37:09] RECOVERY - Host yvon is UP: PING OK - Packet loss = 0%, RTA = 33.14 ms [19:37:10] RECOVERY - Host ssl3 is UP: PING OK - Packet loss = 0%, RTA = 34.12 ms [19:37:10] not actual recovery [19:37:16] RECOVERY - Host lvs3 is UP: PING OK - Packet loss = 0%, RTA = 33.09 ms [19:37:16] RECOVERY - Host lvs4 is UP: PING OK - Packet loss = 0%, RTA = 33.40 ms [19:37:16] RECOVERY - Host pdf3 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [19:37:16] RECOVERY - Host ssl2 is UP: PING OK - Packet loss = 0%, RTA = 33.10 ms [19:37:16] RECOVERY - Host ms10 is UP: PING OK - Packet loss = 0%, RTA = 33.28 ms [19:37:16] RECOVERY - Host hume is UP: PING OK - Packet loss = 0%, RTA = 33.16 ms [19:37:26] RECOVERY - Host lvs1 is UP: PING OK - Packet loss = 0%, RTA = 33.61 ms [19:37:27] 
RECOVERY - Host lvs6 is UP: PING OK - Packet loss = 0%, RTA = 34.07 ms [19:37:27] RECOVERY - Host linne is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms [19:37:27] RECOVERY - Host mexia is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [19:37:27] RECOVERY - Host labs-ns0.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 34.08 ms [19:37:27] RECOVERY - Host zhen is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [19:37:30] LeslieCarr: static? [19:37:36] RECOVERY - Host dobson is UP: PING OK - Packet loss = 0%, RTA = 33.24 ms [19:37:36] RECOVERY - Host stat1 is UP: PING OK - Packet loss = 0%, RTA = 33.07 ms [19:37:46] RECOVERY - Host sq68 is UP: PING OK - Packet loss = 0%, RTA = 33.16 ms [19:37:46] RECOVERY - Host manutius is UP: PING OK - Packet loss = 0%, RTA = 33.27 ms [19:37:46] RECOVERY - Host ssl1 is UP: PING OK - Packet loss = 0%, RTA = 33.16 ms [19:37:46] RECOVERY - Host locke is UP: PING OK - Packet loss = 0%, RTA = 33.09 ms [19:37:46] RECOVERY - Host 208.80.152.131 is UP: PING OK - Packet loss = 0%, RTA = 33.05 ms [19:38:00] yep and deactivated the border-in4 term wikimedia-prefixes - paravoid [19:38:08] ok [19:38:16] via? [19:38:25] via just the deactivate statement [19:38:31] you should be able to rollback and be fine [19:38:39] no, I mean, via which transit [19:38:44] xo [19:38:53] you know we lost XO for some hours yesterday, right? 
:) [19:39:02] XO @ tampa specifically [19:39:17] so the chances of it happening twice are lower ;) [19:39:23] heh [19:39:27] RECOVERY - Host grosley is UP: PING OK - Packet loss = 0%, RTA = 33.03 ms [19:39:46] i just did it because it's at both locations [19:39:46] RECOVERY - Host loudon is UP: PING OK - Packet loss = 0%, RTA = 33.02 ms [19:40:04] yeah [19:40:16] RECOVERY - Host pappas is UP: PING OK - Packet loss = 0%, RTA = 33.08 ms [19:40:26] RECOVERY - Host payments3 is UP: PING OK - Packet loss = 0%, RTA = 33.10 ms [19:47:56] PROBLEM - Puppet freshness on ssl1 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:47:15 PM UTC [19:48:08] up up up up [19:48:56] PROBLEM - Puppet freshness on capella is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:48:06 PM UTC [19:48:56] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:48:41 PM UTC [19:48:56] PROBLEM - Puppet freshness on sq68 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:48:11 PM UTC [19:48:56] PROBLEM - Puppet freshness on hume is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:48:36 PM UTC [19:48:56] PROBLEM - Puppet freshness on ssl4 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:47:46 PM UTC [19:48:56] PROBLEM - Puppet freshness on cp4019 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:48:26 PM UTC [19:49:56] PROBLEM - Puppet freshness on lvs6 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:49:21 PM UTC [19:50:54] greg-g: that's what you get for that, a bunch of freshness spam :p:) cya later, signing off for tonight [19:50:56] PROBLEM - Puppet freshness on stat1002 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:49:52 PM UTC [19:51:56] PROBLEM - Puppet freshness on brewster is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:50:57 PM UTC [19:51:56] PROBLEM - Puppet freshness on manutius is CRITICAL: Last successful Puppet run was Fri 
17 Jan 2014 04:51:07 PM UTC [19:51:56] PROBLEM - Puppet freshness on mchenry is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:51:32 PM UTC [19:54:36] RECOVERY - DPKG on rhodium is OK: All packages OK [19:54:56] PROBLEM - Puppet freshness on emery is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:54:45 PM UTC [19:54:56] PROBLEM - Puppet freshness on linne is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:54:40 PM UTC [19:57:56] PROBLEM - Puppet freshness on virt0 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:57:23 PM UTC [19:58:56] PROBLEM - Puppet freshness on formey is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:58:33 PM UTC [19:59:56] PROBLEM - Puppet freshness on pdf2 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:59:39 PM UTC [19:59:56] PROBLEM - Puppet freshness on sq69 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:59:34 PM UTC [20:00:36] PROBLEM - DPKG on rhodium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:00:56] PROBLEM - Puppet freshness on lvs1 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:59:54 PM UTC [20:00:56] PROBLEM - Puppet freshness on pdf3 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:00:36 PM UTC [20:00:56] PROBLEM - Puppet freshness on sq70 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:00:36 PM UTC [20:00:56] PROBLEM - Puppet freshness on ssl3 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:00:00 PM UTC [20:01:56] PROBLEM - Puppet freshness on locke is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:00:51 PM UTC [20:01:56] PROBLEM - Puppet freshness on ssl2 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:01:36 PM UTC [20:02:56] PROBLEM - Puppet freshness on zhen is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:02:12 PM UTC [20:02:56] PROBLEM - Puppet freshness on sanger is CRITICAL: Last successful Puppet run was Fri 17 
Jan 2014 05:01:51 PM UTC [20:03:56] PROBLEM - Puppet freshness on yvon is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:02:53 PM UTC [20:03:56] PROBLEM - Puppet freshness on tridge is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:02:48 PM UTC [20:03:56] PROBLEM - Puppet freshness on lvs5 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:03:33 PM UTC [20:04:56] PROBLEM - Puppet freshness on lvs2 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:04:19 PM UTC [20:04:56] PROBLEM - Puppet freshness on sq67 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:03:58 PM UTC [20:05:56] PROBLEM - Puppet freshness on mexia is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:04:59 PM UTC [20:06:56] PROBLEM - Puppet freshness on fenari is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:06:05 PM UTC [20:07:56] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:07:31 PM UTC [20:09:56] PROBLEM - Puppet freshness on dobson is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:09:12 PM UTC [20:11:56] PROBLEM - Puppet freshness on lvs3 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:11:42 PM UTC [20:12:56] PROBLEM - Puppet freshness on stat1 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:12:38 PM UTC [20:12:56] PROBLEM - Puppet freshness on ekrem is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:12:38 PM UTC [20:13:56] PROBLEM - Puppet freshness on ms10 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:13:28 PM UTC [20:13:56] PROBLEM - Puppet freshness on lvs4 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:13:43 PM UTC [20:48:14] Ahahaha [20:48:19] Amusing side effect: [20:48:24] 
http://en.wikipedia.beta.wmflabs.org/w/api.php?action=query&format=json&titles=File%3AWikimedia+Foundation+-+Team+1+-+California+Academy+of+Sciences%2Ejpg&prop=imageinfo&iiextmetadatalanguage=en&iiprop=timestamp%7Cuser%7Cuserid%7Ccomment%7Curl%7Csize%7Csha1%7Cmime%7Cmediatype%7Cmetadata%7Cextmetadata&iiurlwidth=640&meta=filerepoinfo [20:48:40] Image metadata can't be fetched because labs can't talk to the cluster (where Commons is) [20:48:54] So I can see the already-generated thumbnails but can't get data about them [20:48:57] This is fun. [21:07:12] marktraceur: You have an amusing definition of "amusing". [21:09:04] marktraceur: yeah, annoying :/ [21:10:04] paravoid, are you working? [21:10:07] what is happening? [21:10:08] http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=&vl=&x=&n=&hreg%5B%5D=cp301%5B1-4%5D&mreg%5B%5D=frontend.client_conn&gtype=stack&glegend=show&aggregate=1 [21:11:56] just ganglia being silly I suppose? [21:13:30] i guess? [21:13:31] if so, totally silly [21:13:38] why does it show requests for cp301[34]? [21:13:39] at all? [21:13:42] that is totally weird [21:14:07] you're getting some ganglia expertise lately, maybe you want to debug? :) [21:16:28] (I'm packing) [21:20:15] haha, ok [21:20:18] i do not want to debug :p [21:20:20] maybe later [21:44:16] !log removed myself from all aliases [21:46:43] :( [21:47:32] The logbot's not working because it can't reach the cluster as well? [21:48:32] or maybe needs restarting... [21:49:00] hrm, where does that live.... [21:49:02] paravoid: can someone from ops fix that thing? argh. [21:49:09] LeslieCarr: i'll restart it; it's on tool labs [21:49:18] thanks [21:49:53] I think all tools in Labs can't reach the cluster as long as pmtpa-eqiad is down.
[21:50:10] oh yeah [21:50:16] well [21:50:25] they may be able to if they have a public ip [21:50:30] if no public ip, then they can not [21:51:12] thanks ori [21:51:16] hrm, the method names are missing from https://gdash.wikimedia.org/dashboards/filebackend/ :/ [21:51:19] !log removed myself from all mail aliases [21:51:27] Logged the message, Mistress of the network gear. [21:51:28] !log restarted morebots [21:51:35] Logged the message, Master [21:51:52] ori: We could try running the logstash irc input to log stuff like that [21:51:55] What's the word on getting the link back up? [21:52:08] I stand corrected: en.wikipedia.org times out from Labs, but wikitech.wikimedia.org is accessible. [21:52:12] bd808: yes plz [21:52:20] but it also needs to be able to update the SAL [21:52:46] !log to restart morebots: ssh to tool labs, become morebots, then: qdel $(qstat | grep production | cut -d' ' -f 1) ; sleep 5 ; jstart -N production /usr/lib/adminbot/adminlogbot.py --config ./confs/production-logbot.py [21:52:52] Logged the message, Master [21:52:57] So somebody needs to write a wiki output plugin [21:53:27] or a mediawiki extension for showing date from logstash [21:53:39] which could ostensibly use all the work ^d and manybubbles did on cirrus [21:53:46] *showing data [21:54:19] i imagine showing a list of log entries in elasticsearch matching some filter is not very different from producing search results [21:54:30] <^d> What you probably want actually is to use the Elastica extension we have. [21:54:32] <^d> For this. [21:54:35] It's exactly the same mostly [21:54:51] <^d> Most of Cirrus is MW-search-specific and wouldn't be terribly useful to you in logstash. [21:54:58] right, Elastica [21:55:19] <^d> But yeah, you could use Elastica and then output the data to some sort of special page [21:55:20] <^d> Or w/e. 
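bd808's idea above of pointing a logstash irc input at this channel would be a few lines of config. Option names below are from memory of the logstash 1.x irc input and the nick is made up, so treat all of it as an assumption:

```
input {
  irc {
    host     => "chat.freenode.net"
    channels => ["#wikimedia-operations"]
    nick     => "logstash-irc"   # hypothetical bot nick
  }
}
```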
that would solve another problem -- the fact that the SAL is huge and morebots updates it by fetching the whole thing, adding a line to the top, and then submitting the entire thing back for parsing/saving [21:55:41] ori: On a slightly related note: https://gerrit.wikimedia.org/r/#/c/108153/ [21:56:45] bd808: should it be going to a specific host like that or should it use multicast? [21:56:48] I haven't tried the irc plugin anywhere yet. I could play with it in labs a bit to see how it works. What channel can I attach to? -core? [21:57:05] Or better I'll just do a ## channel to start [21:57:35] just /join here and consume !log in parallel with morebots [21:59:07] bd808: ^ see my question above re: udp log [21:59:42] ori: Missed that. Umm I think we have to go direct or we will get duplicate log messages unfortunately [22:00:17] Another reason that we need to get the native redis output written [22:00:56] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 06:59:52 PM UTC [22:00:56] bd808: can you document that in a todo in the source? [22:01:06] Sure. [22:01:39] AaronSchulz: I added a 'MediaWiki' prefix to all Graphite metric names, so the significant bits shifted dot-separated location to the right [22:01:50] AaronSchulz: probably that's the reason for the method names going AWOL [22:02:02] to all MediaWiki-generated Graphite metrics, I mean. [22:03:24] is there a bug for that? [22:04:19] ori: TODO added [22:05:21] grrrit-wm also needs to be restarted [22:07:49] grrrit-wm prints nothing to stderr and '2', repeatedly, to stdout. nice. [22:09:30] ori: I think marktraceur disabled grrrit-wm on purpose since it can't read from/to Gerrit ATM. [22:09:38] Yeah [22:09:50] what good does disabling it do?
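On the SAL-update inefficiency ori describes above (fetch whole page, prepend a line, resubmit): the MediaWiki edit API's `prependtext` parameter adds text without round-tripping the page body through the client. A hedged curl sketch; token acquisition is elided and the entry text is made up:

```
curl -s 'https://wikitech.wikimedia.org/w/api.php' \
    --data-urlencode 'action=edit' \
    --data-urlencode 'title=Server_Admin_Log' \
    --data-urlencode 'prependtext=* 21:55 example !log entry' \
    --data-urlencode "token=${EDIT_TOKEN}"
```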
[22:10:06] Nothing worse than having it around [22:13:15] except when the stream is back someone will need to be around to manually start it [22:19:08] bd808: merged, tcpdump shows data going to logstash1001 [22:19:24] * bd808 runs off to look at logs [22:19:48] hey guys; would there be a reason a labs instance cannot talk to commons? I thought our pipes were back up and running? [22:19:58] they are not [22:20:50] ori: https://logstash.wikimedia.org/#dashboard/temp/7-q7h8QoTJKn3H08z3bf2g [22:20:56] RECOVERY - Host ms-be1 is UP: PING OK - Packet loss = 0%, RTA = 35.50 ms [22:20:56] RECOVERY - Host mw20 is UP: PING OK - Packet loss = 0%, RTA = 36.46 ms [22:20:56] RECOVERY - Host ps1-d3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 37.90 ms [22:20:56] RECOVERY - Host mw49 is UP: PING OK - Packet loss = 0%, RTA = 35.61 ms [22:20:56] RECOVERY - Host db71 is UP: PING OK - Packet loss = 0%, RTA = 35.92 ms [22:20:56] RECOVERY - Host mw60 is UP: PING OK - Packet loss = 0%, RTA = 35.47 ms [22:20:56] RECOVERY - Host professor is UP: PING OK - Packet loss = 0%, RTA = 35.45 ms [22:21:56] bd808: \o/ very cool! [22:22:03] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [22:22:03] RECOVERY - Puppet freshness on manutius is OK: puppet ran at Fri Jan 17 22:21:59 UTC 2014 [22:22:10] RECOVERY - Puppet freshness on stat1002 is OK: puppet ran at Fri Jan 17 22:22:04 UTC 2014 [22:22:10] RECOVERY - Puppet freshness on mchenry is OK: puppet ran at Fri Jan 17 22:22:09 UTC 2014 [22:22:12] heh, is tampa back? 
[22:22:15] yes [22:22:25] one of the links at least [22:22:31] RECOVERY - Host erzurumi is UP: PING OK - Packet loss = 0%, RTA = 35.50 ms [22:22:40] RECOVERY - Puppet freshness on dataset2 is OK: puppet ran at Fri Jan 17 22:22:34 UTC 2014 [22:23:10] PROBLEM - MySQL Replication Heartbeat on db67 is CRITICAL: CRIT replication delay 16525 seconds [22:23:20] RECOVERY - Host db78 is UP: PING OK - Packet loss = 0%, RTA = 35.38 ms [22:23:30] PROBLEM - MySQL Replication Heartbeat on es7 is CRITICAL: CRIT replication delay 8987 seconds [22:23:30] PROBLEM - MySQL Replication Heartbeat on db63 is CRITICAL: CRIT replication delay 15942 seconds [22:23:40] PROBLEM - MySQL Slave Delay on db69 is CRITICAL: CRIT replication delay 17029 seconds [22:23:50] PROBLEM - Disk space on virt10 is CRITICAL: Timeout while attempting connection [22:23:50] PROBLEM - MySQL Slave Delay on db63 is CRITICAL: Timeout while attempting connection [22:23:50] PROBLEM - MySQL Slave Delay on db67 is CRITICAL: CRIT replication delay 15793 seconds [22:23:51] PROBLEM - MySQL Replication Heartbeat on db69 is CRITICAL: CRIT replication delay 16807 seconds [22:23:51] RECOVERY - Host es3 is UP: PING WARNING - Packet loss = 61%, RTA = 35.36 ms [22:24:00] PROBLEM - MySQL Slave Delay on es7 is CRITICAL: CRIT replication delay 6356 seconds [22:24:00] PROBLEM - MySQL Replication Heartbeat on db72 is CRITICAL: CRIT replication delay 12420 seconds [22:24:00] PROBLEM - MySQL Slave Delay on db72 is CRITICAL: CRIT replication delay 12366 seconds [22:24:14] yeah, gonna be a little behind ;) [22:24:21] bd808: because the API log is really a request log rather than an operational log per se, I figure it should be sent to kafka [22:25:00] RECOVERY - Puppet freshness on linne is OK: puppet ran at Fri Jan 17 22:24:57 UTC 2014 [22:25:13] might be worth looking into https://github.com/quipo/kafka-php / https://github.com/michal-harish/kafka-php [22:25:31] RECOVERY - MySQL Replication Heartbeat on es7 is OK: OK replication delay 0 
seconds [22:25:40] RECOVERY - Puppet freshness on emery is OK: puppet ran at Fri Jan 17 22:25:33 UTC 2014 [22:26:00] RECOVERY - MySQL Slave Delay on es7 is OK: OK replication delay 0 seconds [22:27:32] ori: There is work on a kafka output plugin at https://github.com/joekiller/logstash-kafka [22:27:40] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Fri Jan 17 22:27:37 UTC 2014 [22:27:54] but logstash is not a useful middleman [22:28:40] RECOVERY - Puppet freshness on pdf2 is OK: puppet ran at Fri Jan 17 22:28:33 UTC 2014 [22:28:44] Ah. You just want to go straight to kafka from our next-gen logging system then. [22:29:08] yeah, it's a good use-case for backend configurability [22:29:20] RECOVERY - Puppet freshness on formey is OK: puppet ran at Fri Jan 17 22:29:19 UTC 2014 [22:29:24] * bd808 nods [22:29:50] RECOVERY - Puppet freshness on lvs1 is OK: puppet ran at Fri Jan 17 22:29:45 UTC 2014 [22:30:10] RECOVERY - Puppet freshness on ssl3 is OK: puppet ran at Fri Jan 17 22:30:05 UTC 2014 [22:30:20] RECOVERY - Puppet freshness on sq69 is OK: puppet ran at Fri Jan 17 22:30:15 UTC 2014 [22:30:50] PROBLEM - Disk space on virt10 is CRITICAL: Timeout while attempting connection [22:32:20] PROBLEM - Host mw68 is DOWN: PING CRITICAL - Packet loss = 100% [22:32:20] PROBLEM - Host ersch is DOWN: PING CRITICAL - Packet loss = 100% [22:32:20] PROBLEM - Host mw114 is DOWN: PING CRITICAL - Packet loss = 100% [22:32:20] PROBLEM - Host labstore4 is DOWN: PING CRITICAL - Packet loss = 100% [22:32:20] PROBLEM - Host mw105 is DOWN: PING CRITICAL - Packet loss = 100% [22:32:39] greg-g: we won't do ulsfo next week after all [22:33:12] greg-g: transport got delivered just today; transit hasn't been lit up yet, vendor investigating; plus, both Mark and I will be travelling or busy in meetings [22:33:31] PROBLEM - puppet disabled on searchidx2 is CRITICAL: Timeout while attempting connection [22:33:40] PROBLEM - puppet disabled on tmh2 is CRITICAL: Timeout while attempting 
connection [22:33:48] it doesn't make much sense, as much as I'd enjoy the reduced latency from SF :) [22:34:00] RECOVERY - Host ps1-b2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 37.18 ms [22:34:00] RECOVERY - Host ps1-c3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 37.86 ms [22:34:00] RECOVERY - Host ps1-b5-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 37.35 ms [22:34:00] RECOVERY - Host ps1-a5-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 38.05 ms [22:34:00] RECOVERY - Host ps1-d2-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 37.97 ms [22:35:10] RECOVERY - Puppet freshness on pdf3 is OK: puppet ran at Fri Jan 17 22:35:07 UTC 2014 [22:35:20] RECOVERY - Puppet freshness on lvs2 is OK: puppet ran at Fri Jan 17 22:35:12 UTC 2014 [22:35:20] RECOVERY - Puppet freshness on lvs5 is OK: puppet ran at Fri Jan 17 22:35:17 UTC 2014 [22:35:34] so many recovery pages ;) [22:35:43] paravoid: shall i revert the hacks ? [22:36:00] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Jan 17 22:35:58 UTC 2014 [22:36:08] let's wait to see it stable first I'd say [22:36:19] you never know, the crew is probably still on-site [22:36:27] and only one of the two is up now [22:37:00] RECOVERY - Puppet freshness on virt1004 is OK: puppet ran at Fri Jan 17 22:36:53 UTC 2014 [22:37:44] From Labs I can connect now to labsdb* in eqiad, but en.wikipedia.org times out. [22:37:50] PROBLEM - Packetloss_Average on emery is CRITICAL: packet_loss_average CRITICAL: 54.4386242623 [22:38:30] scfc_de: are you sure? there's no reason this would happen [22:38:40] RECOVERY - Puppet freshness on fenari is OK: puppet ran at Fri Jan 17 22:38:39 UTC 2014 [22:39:10] RECOVERY - Puppet freshness on dobson is OK: puppet ran at Fri Jan 17 22:39:00 UTC 2014 [22:39:33] paravoid: Yes. Log into tools-login.wmflabs.org, "curl http://en.wikipedia.org/" => times out. 
[22:39:35] hrm, let me try
[22:40:05] hrm, yeah, it stops at cr1-sdtpa
[22:40:09] according to mtr
[22:40:17] i'll check it out
[22:40:40] what an awesome last day ;)
[22:40:40] RECOVERY - MySQL Slave Delay on db63 is OK: OK replication delay 92 seconds
[22:40:58] heh
[22:41:31] RECOVERY - MySQL Replication Heartbeat on db63 is OK: OK replication delay -1 seconds
[22:41:40] RECOVERY - Puppet freshness on lvs3 is OK: puppet ran at Fri Jan 17 22:41:32 UTC 2014
[22:42:26] hrm, so it works from fenari
[22:42:50] RECOVERY - Puppet freshness on ms10 is OK: puppet ran at Fri Jan 17 22:42:43 UTC 2014
[22:42:50] RECOVERY - MySQL Slave Delay on db67 is OK: OK replication delay 0 seconds
[22:42:50] RECOVERY - Puppet freshness on stat1 is OK: puppet ran at Fri Jan 17 22:42:48 UTC 2014
[22:43:11] scfc_de: can you ping en.wikipedia.org ? (i don't have access on tool labs and i'm not starting now!)
[22:43:20] RECOVERY - MySQL Replication Heartbeat on db67 is OK: OK replication delay -1 seconds
[22:43:30] RECOVERY - Puppet freshness on ekrem is OK: puppet ran at Fri Jan 17 22:43:23 UTC 2014
[22:43:34] paravoid: :) ok
[22:44:00] RECOVERY - Puppet freshness on lvs4 is OK: puppet ran at Fri Jan 17 22:43:54 UTC 2014
[22:44:26] greg-g: sorry for the mixup, I'm sure I said "probably" somewhere along my update, but I should had been more clear
[22:44:52] no worries
[22:45:04] LeslieCarr: Apparently now yes and on the first try :-9.
[22:45:17] does it curl now as well ?
[22:45:53] Yes, :80 works fine.
[22:46:05] so the cr's in tampa are not the most powerful processor wise and just got like a million new routes dumped on them -- since it's in bgp, it's possible they hadn't been able to process and install the route yet
[22:46:40] RECOVERY - MySQL Slave Delay on db69 is OK: OK replication delay 0 seconds
[22:46:50] RECOVERY - MySQL Replication Heartbeat on db69 is OK: OK replication delay -0 seconds
[22:47:20] RECOVERY - Puppet freshness on ssl1 is OK: puppet ran at Fri Jan 17 22:47:17 UTC 2014
[22:47:38] (Memo to self: "curl http://en.wikipedia.org/" is bad for testing, as the redirect to /wiki/Main_Page has no content :-).)
[22:48:00] RECOVERY - Puppet freshness on capella is OK: puppet ran at Fri Jan 17 22:47:53 UTC 2014
[22:48:10] RECOVERY - Puppet freshness on hume is OK: puppet ran at Fri Jan 17 22:48:08 UTC 2014
[22:48:20] RECOVERY - Puppet freshness on sq68 is OK: puppet ran at Fri Jan 17 22:48:18 UTC 2014
[22:48:20] RECOVERY - Puppet freshness on cp4019 is OK: puppet ran at Fri Jan 17 22:48:18 UTC 2014
[22:48:25] hehe, i still totally use wget
[22:48:30] RECOVERY - Puppet freshness on ssl4 is OK: puppet ran at Fri Jan 17 22:48:23 UTC 2014
[22:49:14] So now one link is up? Are the link statuses public somewhere? (Icinga/Ganglia/etc.)
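scfc_de's memo above is about curl not following redirects by default: a plain GET of http://en.wikipedia.org/ returns a 301 to /wiki/Main_Page with an empty body, so a "successful" curl proves only that the frontend answered, not that content is served. A small sketch demonstrating the difference against a local stand-in server (the paths mirror the real site, but the server here is fake):

```python
import http.client
import http.server
import threading
import urllib.request


class RedirectDemo(http.server.BaseHTTPRequestHandler):
    """Mimics en.wikipedia.org: / redirects to /wiki/Main_Page."""

    def do_GET(self):
        if self.path == "/":
            self.send_response(301)
            self.send_header("Location", "/wiki/Main_Page")
            self.end_headers()          # 301 with no body, like the real site
        else:
            body = b"<html>Main Page</html>"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):       # keep the demo quiet
        pass


server = http.server.HTTPServer(("127.0.0.1", 0), RedirectDemo)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# Like plain `curl http://.../`: one request, redirect NOT followed.
conn = http.client.HTTPConnection("127.0.0.1", port)
conn.request("GET", "/")
resp = conn.getresponse()
raw_status, raw_body = resp.status, resp.read()   # a 301 and an empty body

# Like `curl -L`: urllib follows the redirect and fetches real content.
with urllib.request.urlopen("http://127.0.0.1:%d/" % port) as r:
    followed_body = r.read()

server.shutdown()
```

For connectivity checks of this kind, `curl -L` (or fetching /wiki/Main_Page directly) avoids the empty-301 false positive.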
[22:50:40] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100%
[22:51:50] RECOVERY - Packetloss_Average on emery is OK: packet_loss_average OKAY: -0.102021311475
[23:01:20] RECOVERY - Puppet freshness on sq70 is OK: puppet ran at Fri Jan 17 23:01:17 UTC 2014
[23:01:31] RECOVERY - Puppet freshness on locke is OK: puppet ran at Fri Jan 17 23:01:27 UTC 2014
[23:02:19] (PS1) Se4598: Resetting legacy channel names on labs and enabling IRC-RC echo again [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/108163
[23:02:20] RECOVERY - Puppet freshness on tridge is OK: puppet ran at Fri Jan 17 23:02:18 UTC 2014
[23:02:30] RECOVERY - Puppet freshness on ssl2 is OK: puppet ran at Fri Jan 17 23:02:28 UTC 2014
[23:03:00] RECOVERY - Puppet freshness on yvon is OK: puppet ran at Fri Jan 17 23:02:59 UTC 2014
[23:03:29] (CR) Se4598: "unconfirmed to fix the bug/can't tested" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/108163 (owner: Se4598)
[23:03:50] PROBLEM - Puppet freshness on sanger is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:01:51 PM UTC
[23:07:10] RECOVERY - Puppet freshness on sanger is OK: puppet ran at Fri Jan 17 23:07:01 UTC 2014
[23:07:32] (CR) Tim Landscheidt: [C: -1] "Just cosmetics." (2 comments) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/108163 (owner: Se4598)
[23:10:32] (PS2) Se4598: Resetting legacy channel names on labs and enabling IRC-RC echo again [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/108163
[23:16:28] (PS1) Chad: WIP: Move purge-checkuser script off hume and to terbium [operations/puppet] - https://gerrit.wikimedia.org/r/108165
[23:19:17] (CR) Chad: "I *think* this is the last thing running on terbium MW-wise. I think we also have jobqueue monitoring there, but we should already have th" [operations/puppet] - https://gerrit.wikimedia.org/r/108165 (owner: Chad)
[23:22:31] (PS1) Gergő Tisza: Make UploadWizard respect the Flickr blacklist on Commons [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/108167
[23:24:58] (CR) Gergő Tisza: Make UploadWizard respect the Flickr blacklist on Commons (1 comment) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/108167 (owner: Gergő Tisza)
[23:44:15] …lawnmower?