[00:15:48] lesliecarr: the zayo link? no [00:21:46] PROBLEM - MySQL Replication Heartbeat on db1010 is CRITICAL: CRIT replication delay 311 seconds [00:21:56] PROBLEM - MySQL Slave Delay on db1010 is CRITICAL: CRIT replication delay 325 seconds [00:23:56] PROBLEM - Disk space on elastic1008 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 10095 MB (3% inode=99%): [00:24:46] RECOVERY - MySQL Replication Heartbeat on db1010 is OK: OK replication delay -0 seconds [00:24:56] RECOVERY - MySQL Slave Delay on db1010 is OK: OK replication delay 0 seconds [00:30:55] ^d: fucking elasticsearch it was trying to move stuff to a node that was almost full..... [00:31:03] <^d> :( [00:33:56] RECOVERY - Disk space on elastic1008 is OK: DISK OK [00:34:03] I figured out how to stop it [00:34:06] but I'm unhappy [00:34:20] I told it it could only have some number of GB of disk space rather than a percent [00:34:24] and it was like "cool!" [00:34:30] and it threw some shards on the floor [00:34:36] and it is initializing those.... [00:45:05] (03PS7) 10Ori.livneh: Add logstash config for udp2log [operations/puppet] - 10https://gerrit.wikimedia.org/r/106154 (owner: 10BryanDavis) [00:45:21] (03CR) 10Ori.livneh: [C: 032 V: 032] "OK. But I am adding an item on our calendar to revert this in six weeks, on Thursday, February 27. I don't want to be stuck with this fore" [operations/puppet] - 10https://gerrit.wikimedia.org/r/106154 (owner: 10BryanDavis) [02:30:36] PROBLEM - Disk space on virt10 is CRITICAL: DISK CRITICAL - free space: /var/lib/nova/instances 39037 MB (3% inode=99%): [02:32:02] !log LocalisationUpdate completed (1.23wmf10) at 2014-01-17 02:32:01+00:00 [02:39:29] Oh, poop. [03:08:34] (03PS1) 10Springle: Disable slow information_schema query that scans all databases and tables. [operations/puppet] - 10https://gerrit.wikimedia.org/r/107994 [03:10:16] (03CR) 10Springle: [C: 032] Disable slow information_schema query that scans all databases and tables.
[operations/puppet] - 10https://gerrit.wikimedia.org/r/107994 (owner: 10Springle) [03:11:19] (03Abandoned) 10Hydriz: Move WikimediaIncubator extension call to be after Scribunto [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107340 (owner: 10Hydriz) [03:11:50] unmerged puppet changes for logstash... [03:12:11] ori: ^ ok to merge? [03:13:20] on palladium [03:19:23] * springle goes with 'yes' and watches a logstash puppet run [03:26:08] !log LocalisationUpdate completed (1.23wmf11) at 2014-01-17 03:26:07+00:00 [03:34:02] springle: yes, thanks. sorry for blocking. [03:42:41] ori: they named a DFS after you http://ori.scs.stanford.edu/ [03:56:51] ebernhardson: it's only fair [03:59:49] it truly is -- ori was already distributed in the physical sense so it would make sense to make him distributed in the virtual sense as well [04:03:24] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-01-17 04:03:24+00:00 [04:59:03] (03PS1) 10Springle: depool db1007 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107997 [04:59:39] (03CR) 10Springle: [C: 032] depool db1007 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107997 (owner: 10Springle) [04:59:46] (03Merged) 10jenkins-bot: depool db1007 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107997 (owner: 10Springle) [05:00:25] !log springle synchronized wmf-config/db-eqiad.php 'depool db1007' [05:41:27] (03PS1) 10Springle: repool db1007, depool db1041 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107998 [05:42:10] (03CR) 10Springle: [C: 032] repool db1007, depool db1041 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107998 (owner: 10Springle) [05:42:16] (03Merged) 10jenkins-bot: repool db1007, depool db1041 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107998 (owner: 10Springle) [05:43:02] !log springle synchronized wmf-config/db-eqiad.php 'repool db1007, depool db1041' [06:33:09] (03PS1) 
10Springle: Another ganglia aggregator for db multicast group to try to reduce stats gaps. [operations/puppet] - 10https://gerrit.wikimedia.org/r/107999 [06:34:45] (03CR) 10Springle: [C: 032] Another ganglia aggregator for db multicast group to try to reduce stats gaps. [operations/puppet] - 10https://gerrit.wikimedia.org/r/107999 (owner: 10Springle) [07:50:42] (03PS1) 10Ori.livneh: Add txStatsD module [operations/puppet] - 10https://gerrit.wikimedia.org/r/108010 [08:08:33] (03PS2) 10Ori.livneh: Add txStatsD module [operations/puppet] - 10https://gerrit.wikimedia.org/r/108010 [08:10:07] (03CR) 10Ori.livneh: [C: 032] Add txStatsD module [operations/puppet] - 10https://gerrit.wikimedia.org/r/108010 (owner: 10Ori.livneh) [08:45:01] (03PS1) 10Ori.livneh: Port statsd role to use txStatsD; apply to tungsten [operations/puppet] - 10https://gerrit.wikimedia.org/r/108013 [08:46:17] (03CR) 10Ori.livneh: [C: 032] Port statsd role to use txStatsD; apply to tungsten [operations/puppet] - 10https://gerrit.wikimedia.org/r/108013 (owner: 10Ori.livneh) [08:46:21] !log Replacing StatsD on tungsten with txStatsD; see commit I19ecf608d for rationale. [08:46:56] morebots died. [08:50:12] !log restarted morebots. morebots missed Sean's 'repool db1007, depool db1041' @ 6:26 UTC. [08:50:17] !log Replacing StatsD on tungsten with txStatsD; see commit I19ecf608d for rationale. [08:50:20] Logged the message, Master [08:50:26] Logged the message, Master [08:50:38] !log finished elasticsearch upgrade. we're now 0.90.10 all the way. [08:50:45] Logged the message, Master [08:51:12] poor manybubbles [08:51:20] get some sleep! 
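An editor's sketch of the Elasticsearch fix ^d described at 00:34 (capping allocation by GB of free disk rather than a percentage, so a nearly-full node stops receiving relocated shards). Setting names assume the 0.90-era disk-threshold allocation decider, and the values are invented, not the ones actually applied:

```
# Hypothetical sketch, not the change that was actually made: switch the
# disk allocation watermarks from percentages to absolute free-space
# values. With absolute values these are *minimum free space* thresholds.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.disk.threshold_enabled": true,
    "cluster.routing.allocation.disk.watermark.low": "30gb",
    "cluster.routing.allocation.disk.watermark.high": "15gb"
  }
}'
```

With percentages, the watermarks are interpreted as disk space *used*; given as absolute sizes they are interpreted as disk space *free*, which is what makes "some number of GB rather than a percent" behave differently.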
[08:51:24] txstatsd, hmmm [08:51:35] ori: there almost isn't a point to trying now [08:51:41] my kids will be up in three hours [08:51:49] i know the feeling [08:51:52] paravoid: it really is better [08:52:02] see commit msg for rationale [08:53:33] yeah I'm reading about it [08:55:36] PROBLEM - MySQL Slave Delay on db69 is CRITICAL: CRIT replication delay 306 seconds [08:55:46] PROBLEM - MySQL Replication Heartbeat on db69 is CRITICAL: CRIT replication delay 310 seconds [08:55:49] http://joemiller.me/2011/09/21/list-of-statsd-server-implementations/ [08:55:53] for crying out loud [08:56:08] i went through pretty much all of them [08:56:16] because i got fed up with statsd [08:56:20] I remember wanting to install statsite before [08:56:26] that, or something in python [08:56:35] we talked about this before, remember? [08:57:47] I'd personally keep the role named "statsd" btw [08:57:51] oh, too late [08:58:00] nevermind :) [08:58:28] manybubbles|away: mind if I move and reboot the labs instance search-test? [08:58:34] oops [08:58:46] RECOVERY - MySQL Replication Heartbeat on db69 is OK: OK replication delay 138 seconds [08:58:46] andrewbogott: huh? [08:58:49] search-test? [08:58:50] statsite is nice but doesn't support plugins (though it does have pluggable syncs) [08:58:58] it also doesn't support histograms [08:59:08] manybubbles|away: Yeah, I'm just trying to free up some space on the virt host it lives on. [08:59:19] Will be down for a few minutes, then rebooted. That OK? [08:59:20] and the fact that it is written in C makes it onerous to extend [08:59:25] why do you need histograms in statsd? [08:59:36] RECOVERY - MySQL Slave Delay on db69 is OK: OK replication delay 0 seconds [08:59:42] instead of generating them after the fact with graphite? 
[08:59:53] andrewbogott: I don't believe it is mine so I don't mind [08:59:58] hm, ok :) [09:00:00] paravoid: because statsd aggregates [09:00:01] (txstatsd looks just fine, I'm just curious) [09:00:12] but in general you can move and reboot any of my machines now that I'm going to sleep [09:00:19] fair enough :) [09:00:24] a histogram of 60-second averages isn't that useful [09:00:40] right [09:00:42] makes sense [09:01:08] paravoid: unrelated question -- [09:01:34] kibana uses PUT requests (REST hipsters, I know...) to save dashboards, and saving dashboards is apparently something you do pretty regularly when you're using kibana [09:02:06] so the default varnish vcls doesn't allow it [09:02:14] the wikimedia-wide vcl_recv blocks PUT requests, and it gets included before the misc varnish VCL so you can't really override it [09:02:15] right [09:03:08] bryan and i weren't sure about the best way of getting around that [09:04:03] maybe configure kibana not to act like a college student that just encountered roy fielding for the first time [09:05:05] i don't think we can do that without deviating from upstream [09:12:54] off to bed but will read pings tomorrow [09:18:46] RECOVERY - Disk space on virt11 is OK: DISK OK [09:23:10] #notgooglehangout [09:23:43] mutante: ? [09:24:08] andrewbogott: wrong window [09:24:12] :) [09:24:32] lo [09:24:42] hi hashar [09:25:26] and I missed andrew :D [09:28:11] ori: +3 on using a python based statsd hehe [09:28:38] the more I write python the more I like that language [09:34:57] hashar: shall we merge the contint/packages.pp change now while you're here? 
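A toy illustration of ori's point above about why the histogram has to live in the statsd layer: statsd aggregates before flushing, so a histogram built downstream from 60-second averages never sees the outliers. All numbers here are invented:

```shell
# 59 requests at 10 ms and one at 1000 ms inside a single 60 s flush
# window. A histogram kept in the statsd layer still sees the 1000 ms
# outlier; a histogram built later from per-window averages only ever
# sees the mean.
samples="$(printf '10\n%.0s' $(seq 59); echo 1000)"
mean=$(echo "$samples" | awk '{ s += $1 } END { print s / NR }')
max=$(echo "$samples" | sort -n | tail -1)
echo "per-window mean: $mean ms, true max: $max ms"
```

The 1000 ms request collapses into a 26.5 ms average, which is why "a histogram of 60-second averages isn't that useful".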
[09:38:50] yeah [09:39:01] mutante: it is simple enough :-] [09:39:06] will run puppet on the machines [09:39:37] (03CR) 10Dzahn: [C: 032] "should be NOOP" [operations/puppet] - 10https://gerrit.wikimedia.org/r/107347 (owner: 10Matanya) [09:40:24] hashar: thx, now [09:40:39] |log running puppet on gallium and lanthanum [09:40:55] I am still wondering how puppet cache the catalog [09:40:56] s [09:41:08] that was a pipe :) [09:41:10] ! [09:41:18] yeah that is intentional :D [09:41:26] so you get the message, but it is not logged [09:41:26] heh,ok [09:41:28] that is a nice feature [09:41:33] as for puppet catalog: [09:41:33] info: Caching catalog for lanthanum.eqiad.wmnet [09:41:34] info: Applying configuration version '1389951638' [09:41:42] I am afraid it cache the catalog based on the timestamp :/ [09:42:19] (03PS1) 10Springle: Use s[2-7] snapshot slaves for 'dump' (WikiExporter) query group [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/108017 [09:42:20] and a next run yields: [09:42:21] info: Caching catalog for lanthanum.eqiad.wmnet [09:42:21] info: Applying configuration version '1389951692' [09:43:08] (03CR) 10Faidon Liambotis: [C: 04-1] "Needs a manual rebase" [operations/puppet] - 10https://gerrit.wikimedia.org/r/107819 (owner: 10Matanya) [09:43:12] (03CR) 10Faidon Liambotis: [C: 032] hive: puppet 3 compatibility fix: fully qualify variables [operations/puppet] - 10https://gerrit.wikimedia.org/r/107821 (owner: 10Matanya) [09:43:27] (03CR) 10Springle: [C: 032] Use s[2-7] snapshot slaves for 'dump' (WikiExporter) query group [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/108017 (owner: 10Springle) [09:43:33] (03Merged) 10jenkins-bot: Use s[2-7] snapshot slaves for 'dump' (WikiExporter) query group [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/108017 (owner: 10Springle) [09:43:57] hashar: yea, timestamp. 
date -d@1389951692 [09:43:58] (03PS1) 10Hashar: /find_tabs.sh to find puppet manifest using tabs [operations/puppet] - 10https://gerrit.wikimedia.org/r/108018 [09:44:04] Fri Jan 17 01:41:32 PST 2014 [09:44:24] mutante: yeah I found some slides presenting a hack to use the GIT sha1 instead [09:44:36] though one would need the sha1 of both ops/puppet and the private repo [09:44:41] (03CR) 10Faidon Liambotis: [C: 04-1] ldap: puppet 3 compatibility fix: fully qualify variable (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/107823 (owner: 10Matanya) [09:44:41] that might help getting more catalogs cached [09:44:48] !log springle synchronized wmf-config/db-eqiad.php 'LB 'dump' s[2-7]' [09:44:54] mutante: and a tiny script to find manifests still having tabs: https://gerrit.wikimedia.org/r/108018 [09:44:54] aha, interesting [09:44:55] (03CR) 10Faidon Liambotis: [C: 032] smokeping: puppet 3 compatibility fix: module path [operations/puppet] - 10https://gerrit.wikimedia.org/r/107825 (owner: 10Matanya) [09:44:56] Logged the message, Master [09:45:10] (03CR) 10Faidon Liambotis: [C: 032] geoip: puppet 3 compatibility fix: module path [operations/puppet] - 10https://gerrit.wikimedia.org/r/107826 (owner: 10Matanya) [09:45:14] hashar: script to find tabs = puppet lint ? 
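A plain-grep sketch of what the tab-finding script in https://gerrit.wikimedia.org/r/108018 presumably amounts to (an illustration, not the script itself): list manifests containing a literal tab character without a full puppet-lint parse.

```shell
# List puppet manifests that contain a literal tab, with no parsing.
# Takes the manifests directory as an argument.
find_tabs() {
    find "$1" -name '*.pp' -exec grep -l "$(printf '\t')" {} +
}
```

Usage would be e.g. `find_tabs manifests/`; since grep never parses the files, this is much cheaper than running puppet-lint over every manifest and filtering its output.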
[09:45:26] in this case it only find tabs and is WAY faster [09:45:32] ok [09:46:57] (03PS2) 10Hashar: retab realm.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/104808 [09:47:23] (03PS2) 10Hashar: realm.pp puppet lint fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/104809 [09:47:54] the realm.pp change is simple enough if you have time for it [09:48:24] hashar: did that before locally with: puppet-lint manifests/* | grep "tab char" [09:48:46] yeah that would work, albeit it causes a full parse by puppet-lint [09:51:54] I published the script in Gerrit merely for information [09:52:00] might not be a good idea to merge it [09:54:38] i was thinking that you just use it locally, but maybe it can go into that "tools" repo instead [09:56:43] (03CR) 10Dzahn: [C: 031] retab realm.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/104808 (owner: 10Hashar) [09:59:36] mutante: in a different repo, it is unlikely to be noticed / used :D [10:00:21] hashar: fair , yea [10:05:13] hashar: heh, yea, folks will always comment on stuff they see in lint changes, that you didn't actually write and is unrelated, but that's what happens:) [10:05:25] the "while you're at it"-effect [10:27:17] (03CR) 10Dzahn: "that's because it was already done before then it seems:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/107555 (owner: 10Matanya) [10:30:25] (03PS2) 10Matanya: nagios: puppet 3 compatibility fix: fully qualify variables [operations/puppet] - 10https://gerrit.wikimedia.org/r/107819 [10:31:04] (03CR) 10Dzahn: "don't you need <%= scope.lookupvar() in the .erb template to get the content of $host there?" 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/107555 (owner: 10Matanya) [10:41:33] (03CR) 10Dzahn: "actual question because it appears to work like this on antimony but in a lot of other Apache site templates you see the other way" [operations/puppet] - 10https://gerrit.wikimedia.org/r/107555 (owner: 10Matanya) [10:43:21] mutante: might fix the issues in a different change so [10:43:31] I try to avoid changing code / layout too much when doing lint changes [10:43:44] so they are easier to review and less likely to cause an issue when being applied on servers [10:44:24] hashar: yea, so far that didn't even include opinion, i just noticed that this always happens to all of us [10:56:05] (03PS1) 10Alexandros Kosiaris: Add iron.wikimedia.org to bastion_hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/108023 [11:00:56] (03PS2) 10Alexandros Kosiaris: Add iron.wikimedia.org to bastion_hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/108023 [11:04:25] (03CR) 10Dzahn: [C: 031] "yep, those IPs are iron. and i noticed earlier that i couldn't ssh to antimony from iron, while i could from bast1001, so iron should be a" [operations/puppet] - 10https://gerrit.wikimedia.org/r/108023 (owner: 10Alexandros Kosiaris) [11:06:58] (03CR) 10Hashar: certs.pp puppet lint fixes (039 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 (owner: 10Hashar) [11:07:29] (03PS3) 10Hashar: certs.pp puppet lint fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 [11:08:03] (03CR) 10jenkins-bot: [V: 04-1] certs.pp puppet lint fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 (owner: 10Hashar) [11:08:17] :( [11:13:06] (03PS4) 10Hashar: certs.pp puppet lint fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 [11:13:22] (03CR) 10Hashar: "rebased on tip of production branch, fixed conflicts." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 (owner: 10Hashar) [11:14:10] (03PS3) 10Alexandros Kosiaris: Add iron.wikimedia.org to bastion_hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/108023 [11:14:57] (03PS5) 10Hashar: certs.pp puppet lint fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 [11:15:22] (03CR) 10Hashar: "Fix puppet-lint error in the define certificates::rapidssl_ca_2 which was introduced while rebasing and I forgot to lint :(" [operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 (owner: 10Hashar) [11:16:27] (03PS1) 10Dzahn: let iron have a AAAA record [operations/dns] - 10https://gerrit.wikimedia.org/r/108027 [11:19:49] (03CR) 10Hashar: "minor issue: there is no AAAA entry for iron." (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/108023 (owner: 10Alexandros Kosiaris) [11:20:55] (03CR) 10Dzahn: "(PS1) Dzahn: let iron have a AAAA record [operations/dns] - https://gerrit.wikimedia.org/r/108027" [operations/puppet] - 10https://gerrit.wikimedia.org/r/108023 (owner: 10Alexandros Kosiaris) [11:25:49] (03CR) 10Alexandros Kosiaris: [C: 032] Add iron.wikimedia.org to bastion_hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/108023 (owner: 10Alexandros Kosiaris) [11:26:36] (03PS1) 10Alexandros Kosiaris: lint 1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa [operations/dns] - 10https://gerrit.wikimedia.org/r/108031 [11:26:38] (03PS1) 10Alexandros Kosiaris: Add reverse IPv6 records for bastion hosts [operations/dns] - 10https://gerrit.wikimedia.org/r/108032 [11:28:26] (03CR) 10Dzahn: certs.pp puppet lint fixes (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 (owner: 10Hashar) [11:29:48] (03CR) 10Hashar: certs.pp puppet lint fixes (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 (owner: 10Hashar) [11:30:28] (03PS6) 10Hashar: certs.pp puppet lint fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 [11:33:23] (03CR) 
10Dzahn: certs.pp puppet lint fixes (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 (owner: 10Hashar) [11:34:04] * mutante hides from hashar [11:36:53] (03CR) 10Matanya: certs.pp puppet lint fixes (035 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/104743 (owner: 10Hashar) [11:47:32] commuting out to coworking place [11:51:47] (03CR) 10Dzahn: "reverse for this is in https://gerrit.wikimedia.org/r/#/c/108032/1" [operations/dns] - 10https://gerrit.wikimedia.org/r/108027 (owner: 10Dzahn) [11:56:52] (03PS2) 10Dzahn: let iron and bast1001 have AAAA records [operations/dns] - 10https://gerrit.wikimedia.org/r/108027 [11:57:04] (03CR) 10jenkins-bot: [V: 04-1] let iron and bast1001 have AAAA records [operations/dns] - 10https://gerrit.wikimedia.org/r/108027 (owner: 10Dzahn) [11:58:24] good catch, jenkins:) [11:58:27] (03PS3) 10Dzahn: let iron and bast1001 have AAAA records [operations/dns] - 10https://gerrit.wikimedia.org/r/108027 [13:50:18] (03PS1) 10Hashar: beta: monitor fatal errors on beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/108039 [13:53:26] PROBLEM - Varnish HTTP text-backend on cp1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:53:46] PROBLEM - Varnish traffic logger on cp1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:54:06] PROBLEM - Varnish HTCP daemon on cp1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:55:15] wow [13:55:17] paravoid: ^XFS: possible memory allocation deadlock, another one, what [13:55:20] would you prefer [13:55:29] and here I was saying how it's fixed [13:55:46] it keeps spitting them out on mgmt currently [13:57:06] RECOVERY - Varnish HTCP daemon on cp1054 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [13:57:16] RECOVERY - Varnish HTTP text-backend on cp1054 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.000 second response time [13:57:36] RECOVERY - Varnish traffic logger on cp1054 is OK: PROCS OK: 2 processes with command name varnishncsa [14:05:58] (03PS1) 10Hashar: puppetize beta fatal monitor [operations/puppet] - 10https://gerrit.wikimedia.org/r/108041 [14:07:18] mutante: feel brave enough for some puppet cron {} review ? :-D [14:07:36] that is for beta [14:10:43] hashar: sorry, in the middle of testing a module, have to say later [14:10:58] (labs issues) [14:12:05] mutante: tis ok [14:12:20] * hashar applies for ops position to be able to self merge  [14:12:37] +1 [14:12:44] +1 [14:13:06] +2:P [14:13:23] jenkins tells HR? :p [14:16:59] mark: so, echo 1 > /proc/sys/vm/compact_memory fixed the symptoms of "possible memory allocation deadlock in kmem_alloc" [14:17:44] without a reboot [14:18:00] but this is the box that had TSO disabled [14:18:03] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=cp1054.eqiad.wmnet&m=mem_report&s=by+name&mc=2&g=mem_report&c=Text+caches+eqiad [14:18:25] so disabling TSO does /something/, but apparently not fixing the issues entirely [14:19:37] so, maybe compact_memory on a cronjob? [14:19:45] plus disable TSO anyway? [14:19:48] maybe 3.11 too? [14:22:32] hashar: labs currently cant configure instances. something broke in last deploy .being investigated https://bugzilla.wikimedia.org/show_bug.cgi?id=60167 [14:32:34] mutante: can it be LDAP related? 
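What the stopgap floated at 14:19 for the cp1054 XFS "possible memory allocation deadlock" might look like; the cron interval and interface name are guesses, not deployed configuration:

```
# Hypothetical sketch only. Periodically force memory compaction so
# XFS's kmem_alloc can find contiguous pages:
#
#   /etc/cron.d/compact-memory
#   */15 * * * * root echo 1 > /proc/sys/vm/compact_memory
#
# and keep TCP segmentation offload disabled on the NIC as well:
ethtool -K eth0 tso off
```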
[14:32:46] cause we apparently have some LDAP replication issue going on [14:33:32] mutante: forget me, Coren commented about the mediawiki code being updated [14:35:01] Yeah, I would have just restored to known good code if I was confident I wouldn't be breaking something-in-progress. I wish Mike was a little more verbose while he worked though so we'd know for sure. [14:37:20] (03CR) 10Alexandros Kosiaris: [C: 04-1] beta: monitor fatal errors on beta labs (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/108039 (owner: 10Hashar) [14:40:14] (03CR) 10Hashar: beta: monitor fatal errors on beta labs (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/108039 (owner: 10Hashar) [14:40:31] (03CR) 10Alexandros Kosiaris: [C: 032] "I think it is kind of an overkill to have that rather small misc/ class and then just include it on the role class. But since it seems to " [operations/puppet] - 10https://gerrit.wikimedia.org/r/108041 (owner: 10Hashar) [14:41:01] argh [14:41:11] why is there two conflicting person in my little brain [14:41:48] isn't that a mental illness ? [14:41:55] yea it is [14:42:04] the good thing is that I can stay alone for extensive period of time [14:42:11] since I got friends to talk to everytime [14:42:15] ahahahaha [14:42:31] albeit those friends keep disputing so it is a bit noisy at night [14:42:44] found out a way to came over it which is to think about images, they vanish [14:43:27] ok ok. just be careful. 
two is ok but if you get to something like 8 you will become the subject of a movie like that guy in Identity [14:44:15] (03PS2) 10Hashar: beta: monitor fatal errors on beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/108039 [14:44:15] :-D [14:44:46] it is usually a conversation between me and my other self with a third unidentified party coming from time to time to interrupt us [14:44:58] (03CR) 10Hashar: "added a shebang and made the file executable" [operations/puppet] - 10https://gerrit.wikimedia.org/r/108039 (owner: 10Hashar) [14:45:04] see ? i told you. be careful :-) [14:45:17] that third guy is the way done a slippery slope [14:45:22] down* [14:45:24] (03PS2) 10Hashar: puppetize beta fatal monitor [operations/puppet] - 10https://gerrit.wikimedia.org/r/108041 [14:45:51] akosiaris: would you keep the .rb suffix when installing file in a /bin/ dir ? [14:46:06] no [14:46:19] I 'd rather not [14:46:27] (03CR) 10Alexandros Kosiaris: [C: 032] beta: monitor fatal errors on beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/108039 (owner: 10Hashar) [14:46:34] (03CR) 10Hashar: "Yup should make the beta classes a puppet module maybe, just like we did for continuous integration." [operations/puppet] - 10https://gerrit.wikimedia.org/r/108041 (owner: 10Hashar) [15:00:58] (03CR) 10Faidon Liambotis: [C: 04-1] "Nak, these are autoconfigured IPs. Use interface::add_ip6_mapped to give them proper IPv6 addresses." [operations/dns] - 10https://gerrit.wikimedia.org/r/108032 (owner: 10Alexandros Kosiaris) [15:01:10] (03CR) 10Faidon Liambotis: [C: 04-2] "Let's do both forward & reverse in one commit please." 
[operations/dns] - 10https://gerrit.wikimedia.org/r/108027 (owner: 10Dzahn) [15:06:43] (03CR) 1001tonythomas: [C: 04-1] "It shows Can Merge: No" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101820 (owner: 10Arav93) [15:22:36] PROBLEM - DPKG on rhodium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:35:20] (03CR) 10Mark Bergsma: "Yurik: could you please rebase this change to current HEAD? When needing to change it, I'd like to make sure I'm not working out outdated " [operations/puppet] - 10https://gerrit.wikimedia.org/r/102316 (owner: 10Yurik) [16:09:11] (03PS3) 10coren: openstack: convert iptables to ferm [operations/puppet] - 10https://gerrit.wikimedia.org/r/98307 (owner: 10Faidon Liambotis) [16:10:49] (03CR) 10coren: [C: 031] "Carry over previous +1 with the change to reflect the merge conflict (allow all of 10/8 rather than distinguish between pmtpa and eqiad)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/98307 (owner: 10Faidon Liambotis) [16:11:27] (03PS1) 10Hashar: restore $oaiAgentRegex [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/108055 [16:12:36] (03CR) 10Brion VIBBER: [C: 032] "We're pretty sure putting this back will fix a legacy configuration that stopped working recently." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/108055 (owner: 10Hashar) [16:14:05] !log hashar synchronized wmf-config/CommonSettings.php 'restore $oaiAgentRegex' [16:14:12] Logged the message, Master [16:14:40] !log hashar synchronized wmf-config/InitialiseSettings.php 'touch, restore $oaiAgentRegex' [16:14:47] Logged the message, Master [16:17:48] (03CR) 10Hashar: "added back with https://gerrit.wikimedia.org/r/108055" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105475 (owner: 10Reedy) [16:33:30] enwiki timeouts for me [16:33:35] anyone know what's up? [16:34:33] isup.me says it's up [16:34:51] <^d> Doesn't your internet suck? 
[16:35:19] ^d: everything else works fine [16:35:32] ^d: https://dpaste.de/ahCy [16:35:34] so it can connect [16:35:39] just returns an empty reply [16:35:41] the server [16:36:19] https://ganglia.wikimedia.org/latest/ works for me [16:36:34] <^d> :\ [16:36:41] hitting other wikipedias also has the same result [16:36:59] hitting tools works [16:37:06] any idea whom I should flag? [16:37:55] Jeff_Green: I guess this isn't RT duty, but I suppose I could ping you? ^ [16:38:15] hmm, seems to work now [16:38:23] <^d> Alphabetically always screws the same people :p [16:38:54] it's just really slow now [16:38:55] but works [16:42:20] is still really, *really* slow [16:42:23] (everything else is fast) [16:43:00] could be a network issue with the nl data center or something, maybe? [16:43:02] traceroute? [16:43:15] wfm over here [16:43:20] doing [16:43:25] yeah, works for brion too [16:44:06] still running [16:44:15] YuviPanda: how does wikitech feel? [16:44:27] because it would be cluster wiki [16:44:35] wouldn't [16:44:40] mutante: same as before [16:45:33] mutante: https://dpaste.de/GMxh [16:45:34] so far [16:46:03] 954.641 ms doesnt sound optimal.. ehm [16:46:14] but that's still before our infra [16:46:44] mutante: hmm, so network from cluster to me is broken? [16:47:31] ash sounds like Ashburn [16:47:51] ae0-xcr1.ash.cw.net ? [16:47:59] yeah, possibly [16:48:04] cr = core router [16:48:09] it does tell me it is hitting traceroute to text-lb.eqiad.wikimedia.org (208.80.154.224), 64 hops max, 52 byte packets [16:48:11] so eqiad [16:48:22] paravoid: ^? [16:48:44] oh, in here [16:48:44] do you know how close that is to us ? ash.cw.net [16:48:58] mark: ? [16:49:38] what? [16:49:51] people are reporting traceroutes to eqiad that end there [16:50:01] end at cw.net? 
[16:50:02] https://dpaste.de/GMxh [16:50:11] 15 ae0-xcr1.ash.cw.net (195.2.30.45) 954.641 ms * * [16:50:33] my guess is cr=core router, and ash=Ashburn [16:50:45] curl also gave me https://dpaste.de/ahCy a while ago. it returns now, is just really slow [16:50:50] http://www.cw.net/ says Vodafone [16:50:55] yes, and cw = cable&wireless, aka vodafone [16:50:57] and gives peering info [16:51:05] Vodafone AS1273 [16:51:09] I know [16:51:42] so people report slowness on all wikis, like YuviPanda [16:51:59] traceroute for bits https://dpaste.de/crtu [16:52:02] of course it's slow, there's 650ms & lost packets from the 5th hop onwards [16:52:24] looks like an issue on your side [16:52:24] < mutante> 954.641 ms doesnt sound optimal.. ehm [16:52:45] is anyone else reporting issues? [16:52:50] i think that is the case, yea [16:52:52] brion [16:53:03] brion's is fine now [16:53:06] i think [16:53:10] brion: can I get a traceroute from you as well? [16:53:25] paravoid: I'll just wait it out and see if it gets better. Thanks! [16:53:32] paravoid: just wanted to report, since everything else works as usual [16:54:44] thanks YuviPanda [17:00:47] interestingly my trace route craps out at xe-5-0-0.was10.ip4.tinet.net [17:00:54] though a direct ping reaches... [17:01:10] can I see it? [17:01:16] in private if you prefer [17:01:16] lemme cut-n-paste [17:01:28] mine craps out at te3-4.co2.as30217.net, from the office, though the site is responsive [17:01:47] https://gist.github.com/brion/508ad5579a7a319ebd74 [17:02:00] oh i'm sure my home IP is all over wikipedia, i edit logged out by accident all the time ;) [17:02:03] http://paste.debian.net/76888/ [17:02:36] i think that *.was10.ip4.tinet.net is in washington dc area though judging by the ping time [17:02:47] only 1-2ms different from my final ping [17:03:24] "traceroute" uses UDP by default [17:03:30] can it be related to the S.F->Ashburn fiber? 
[17:03:44] iirc we block these for to prevent attacks such as DNS amplification attacks [17:03:45] i can pull the fiber...it shouldn't have traffic on it though [17:03:54] cmjohnson1: don't, everything's okay [17:03:57] okay [17:04:02] ok [17:04:08] brion: use "traceroute -I" or "mtr" [17:04:23] the "crap out" part is the last hop, by design [17:04:48] ok, traceroute -I works as expected [17:04:54] whee [17:05:09] yeah the last hops were from tibet to eqiad and there into the text-lb [17:05:15] *tinet [17:05:24] tibet, hah [17:05:25] silly autocorrect! tibet has shitty network access [17:05:27] traceroute from a friend of mine, on a different ISP http://hastebin.com/cufokoguge.md [17:05:31] seems fine, so just my ISP I guess [17:05:44] YuviPanda: yup [17:05:46] sorry about the alarm [17:05:50] welcome to network hell yuvi [17:05:53] it's still interesting to see you going via Vodafone India [17:05:59] where YOUR AND ONLY YOUR internet fails [17:06:05] I don't know why only wiki is slow, however [17:06:09] everything else is fast enough [17:06:13] paravoid: oh? why so? [17:06:17] I am in India... [17:06:30] no, we have a new path to Vodafone [17:06:37] so it was interesting to me [17:06:46] ah [17:07:24] paravoid: can you tell me what you mean by 'new path to Vodafone'? Just curious (and know nothing of networks at that scale) [17:07:30] let's all just take a second to revel in the wonder that is the internet: that it works at all is an amazing tribute to human engineering [17:08:55] brion: /me does 10 hail marys [17:09:19] greg-g: the internet gods are old-school, they demand blood sacrifice [17:09:33] oh, alright then, lemme catch monte real quick [17:09:47] greg-g: no sacrificing ouyr iOS developer [17:10:11] *our [17:10:43] hehe [17:10:54] no comment [17:54:19] paravoid, got it. it's good to know that we're not quite as TPA-dependent as we used to be, in any case. :P [17:54:24] Still lots of stuff to do though. 
[17:54:26] not much at all [17:54:29] Labs + Bugzilla + some misc stuff? [17:55:22] yeah, mail [17:56:03] hi, saw the alerts while en route to the office :( [17:56:03] marktraceur: Logging is still useful for people in this channel, and to save later: https://wikitech.wikimedia.org/w/index.php?title=Server_Admin_Log&diff=96578&oldid=96577 [17:56:03] PROBLEM - check_mysql on payments1001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [17:56:03] jgage, don't worry, you'll get a nice outage yet :P [17:56:03] BZ is very close and 4.4 sitting on zirconium, we are like 95% there, would have needed db sync and switchover though [17:56:03] Heh, yes [17:56:03] hehehe [17:56:04] and very few merges [17:56:04] at least i didn't cause this one, phew [18:00:16] PROBLEM - check_mysql on payments1001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [18:01:31] <^d> I wonder if we can move url-downloader to some $randomUnderusedMisc in eqiad. Should just be two lines of puppet to move. [18:02:26] <^d> Oh, maybe some other stuff too. Ignore me. Bad time. 
[18:05:16] PROBLEM - check_mysql on payments1001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [18:05:22] ^d: for later https://gerrit.wikimedia.org/r/#/c/107590/ [18:06:51] <^d> mutante: Yeah I saw :) [18:10:16] PROBLEM - check_mysql on payments1001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [18:15:16] PROBLEM - check_mysql on payments1001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [18:16:07] oh paravoid [18:16:13] i just realized that this didn't merge [18:16:13] https://gerrit.wikimedia.org/r/#/c/107723/ [18:16:17] because of your -2 [18:18:56] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [18:20:16] PROBLEM - check_mysql on payments1001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [18:25:16] PROBLEM - check_mysql on payments1001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [18:30:16] PROBLEM - check_mysql on payments1001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [18:32:05] <^d> crap, we do have gerrit problems from tampa downtime. [18:32:18] <^d> we have two ldap hosts, one in eqiad and one in tampa. so you've got a chance of hitting tampa. [18:32:35] <^d> (would explain why i've been getting intermittent errors) [18:40:50] The IRC RC feed is down [18:42:45] part of tampa fiber issue, see ops channel [18:43:24] ori, this is the operations channel :) [18:43:49] i need coffee. anyways, known. [18:44:12] everything that is currently down should be linked to Tampa tracking ticket ideally [18:44:42] at least the good part about it is identifying some that may have been forgotten [18:44:51] How have three fibres been cut at the same time? 
:/ [18:45:22] Krenair: that's what everybody asks:) maybe by not noticing you already cut one before [18:45:38] oh ok, so it was just on the last one until earlier when that broke too? [18:45:41] and then cutting the last one later [18:46:01] it's speculation, we're waiting for response with more details [18:46:16] I'm guessing the ticket you refer to is in RT [18:46:22] so anyone on the phone with fpl ? [18:46:24] they like phone [18:46:26] LeslieCarr: I was [18:46:27] i'm guessing paravoid is [18:46:28] ok [18:46:28] hehe [18:46:31] Krenair: yea, but there might also be BZ for the same [18:46:40] Krenair: either one, also happy to transfer it later [18:46:41] LeslieCarr: three cuts, orlando, sarasota, tampa [18:46:43] couldn't find one [18:46:51] fyi they can be slackers, so sometimes you have to keep calling... but they also mainly just call up other people [18:47:02] this is leslie suspicious face [18:47:05] <- [18:47:16] they seemed fairly knowledgeable about the situation [18:47:38] as soon as I called and before I gave any details they were saying "so... we have three cuts right now. heh, amazing, isn't it?" [18:49:45] any chance we can recover the irc feed before we're back up in tpa? [18:50:04] logmsgbot? [18:50:10] irc.wikimedia.org [18:50:13] oh [18:50:27] any OOB in tampa? [18:50:33] ekrem [18:51:03] csteipp: can you ssh into tin/terbium? [18:51:25] AaronSchulz: oh hey, I've disabled swift @ pmtpa and deployed earlier [18:51:28] AaronSchulz: tin yeah [18:51:44] paravoid: change of plans? [18:51:46] terbium too [18:51:55] Eloquence: looking; I'm not very familiar with it [18:52:03] csteipp: i get "Network is unreachable" [18:52:07] IRC migration ticket is https://rt.wikimedia.org/Ticket/Display.html?id=4784 [18:52:08] AaronSchulz: no, Tampa is disconnected from eqiad [18:52:08] I can get into fenari though [18:52:20] AaronSchulz: what's your proxy server?
[18:52:36] paravoid: puppetized but lacks packaging [18:52:43] AaronSchulz: fiber cuts; same cause for the problem you're encountering, use bast1001 to fix this [18:52:44] bast1001.wikimedia.org is what I use... [18:52:56] csteipp: still fenari [18:53:09] * AaronSchulz updates his ProxyCommand [18:53:14] Ah, I bet fenari -> eqiad isn't happy... [18:53:23] AaronSchulz: tampa <-> eqiad is broken and will be for the next hour or two (or three) [18:53:33] fenari <-> anything in eqiad 10.x won't work [18:53:53] nor will tin/terbium <-> pmtpa 10.x, like mwN boxes in tampa [18:54:21] anyone familiar with the IRC feed? [18:54:24] paravoid: wmgRC2UDPAddress and wmgRC2UDPPort in InitialiseSettings.php specifies host/port for rc feed; modules/ircd/manifests/mediawiki-irc-relay.pp forwards to irc [18:54:27] paravoid: https://wikitech.wikimedia.org/wiki/IRCD [18:54:41] ori: port 9390 it seems [18:55:34] paravoid: modules/ircd but we don't have the actual package [18:55:36] hrm [18:55:43] yeah, give me a sec [18:56:50] https://gerrit.wikimedia.org/r/#/c/94407/7/modules/ircd/manifests/mediawiki-irc-relay.pp [18:56:51] it'd be good to move it; the urgency is that most countervandalism tools depend on it, and wikimedia wikis depend on countervandalism tools to control vandalism [18:57:06] I know [18:57:09] (ekrem being hardcoded in there is the bad part) [18:58:39] guillom: newsworthy? ^^ [18:59:04] twkozlowski: yes, probably. [19:00:07] backlogs says IRC, labs, BZ, and FR [19:00:51] BZ and labs already up, FR isn't anything we're interested it from a user's perspective... [19:00:52] partially FR, I'd get a quote from Jeff_Green and/or mwalker and/or K4-713 about it [19:00:55] is this summary right? :) [19:01:01] yeah [19:01:15] greg-g: a quote re. the effects of the pmtpa outage? 
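The bastion switch AaronSchulz makes above (updating his ProxyCommand from fenari to bast1001) would look roughly like this in `~/.ssh/config`; the Host pattern is an illustrative assumption, and `-W` requires OpenSSH 5.4+:

```
# Reach eqiad-internal (10.x) hosts via bast1001 instead of fenari
# while tampa <-> eqiad is down. Host pattern is illustrative.
Host *.eqiad.wmnet
    ProxyCommand ssh -W %h:%p bast1001.wikimedia.org
```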
[19:01:23] yea, BZ and partial labs have been brought back thanks to the work-around [19:01:29] but the fibers are still not back [19:01:33] twkozlowski: and some mail [19:01:34] Jeff_Green: before they go off saying FR was down ;) [19:01:41] it's a little more nuanced than that ;) [19:01:46] true [19:02:00] greg-g: Fundraising has been fine since 10:17 PST, as far as I know. [19:02:14] So how is it the fibre cuts are being worked around at the moment? [19:02:48] BZ labs outage 17:19-17:49, IRC 17:19 - ? [19:03:22] did something automatically disable puppet? [19:03:40] in /var/log/puppet.log i see several messages like this: [19:03:45] "notice: Skipping run of Puppet configuration client; administratively disabled; use 'puppet Puppet configuration client --enable' to re-enable." [19:04:05] this in on sylvester, a vm in the catgraph project. [19:04:14] can this be related to the network problems? [19:04:34] should i just re-enable it? [19:04:44] that wouldn't automatically disable puppet, however there is a puppet bug which occasionally causes it to disable itself [19:04:47] when running the daemon [19:05:35] JohannesK_WMDE: i've seen puppet getting in that state sometimes, unrelated of any networking stuff [19:05:38] to [19:06:11] JohannesK_WMDE: you can just enable it, and the bonus bug is what it tells you to type has a typo in it:p [19:06:22] LeslieCarr, mutante: thanks, i'll just reenable it and see what happens [19:06:30] mutante: lol okay [19:06:47] JohannesK_WMDE: s/Puppet configuration client/agent/ [19:07:02] mutante: makes sense :D thanks [19:07:15] JohannesK_WMDE: yw [19:18:49] gonna deploy https://gerrit.wikimedia.org/r/#/c/108072/ [19:19:09] god speed. [19:19:50] !log faidon updated /a/common to {{Gerrit|I4342b062b}}: Switch wmgRC2UDPAddress to a temp eqiad relay [19:20:15] \o/ [19:20:22] !:) [19:20:22] /me laughs [19:20:49] ? 
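The "administratively disabled" state JohannesK_WMDE hit above is just a lockfile on disk, so it can be checked before blindly re-enabling. A minimal sketch; the lockfile name and state directory vary across puppet versions, so treat the names below as assumptions:

```shell
# Report whether the puppet agent has been administratively disabled,
# by checking for the lockfile that `puppet agent --disable` drops.
# Takes the agent state directory as its argument.
puppet_status() {
    state_dir=$1
    if [ -f "$state_dir/agent_disabled.lock" ]; then
        echo disabled
    else
        echo enabled
    fi
}
# Re-enable with: puppet agent --enable
# (the hint printed in puppet's own log message has a typo,
# as mutante points out above)
```

Typical usage would be `puppet_status /var/lib/puppet/state`.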
[19:20:50] wow, i didn't know it would reply that [19:21:04] oh [19:21:23] i just wanted to chime in and say how cool it is you are finding a fix, wm-bot surprised me too :P [19:21:36] !log faidon synchronized wmf-config/InitialiseSettings.php 'Switch wmgRC2UDPAddress to a temp eqiad relay' [19:21:57] I'm not very familiar with the irc feed, can someone verify it works now? [19:22:08] Doesn't work for me [19:22:20] oh ok, different ip [19:22:33] 11:28 -!- Irssi: Join to #en.wikipedia was synced in 0 secs [19:22:46] * Connecting to 208.80.154.157 (208.80.154.157) port 6667... [19:22:46] * Connection failed. Error: Connection refused [19:22:59] the irc server is where it was [19:23:01] i don't see the feed yet though [19:23:17] paravoid: nothing changed there [19:23:25] ok, will debug further, thanks [19:25:32] paravoid: works! [19:25:36] I know [19:25:36] :) [19:25:43] =) [19:25:59] paravoid: works [19:26:10] arr, too late, thanks! [19:26:54] Eloquence: ^ [19:27:06] awesome :) [19:27:37] So how is it the fibre cuts are being worked around at the moment? [19:28:04] Krenair: a team is on the way is what we know [19:28:06] Working. [19:28:31] Krenair: what do you mean? [19:28:34] Krenair: oh, "how", paravoid knows:) [19:30:06] paravoid, well it seems I can somehow connect to some pmtpa stuff but other things internally are/were broken (such as relaying RC from wikis to the IRC server) [19:30:28] tampa is an island network-wise right now [19:30:45] it works and has "internet", but is not connected to eqiad [19:31:07] you can see the whole internet but eqiad, and of course private doesn't work [19:31:20] so it has to go via the public internet to get to eqiad? [19:31:50] yah [19:31:53] it won't get to eqiad [19:31:58] not via public internet either [19:32:01] Hello ops. 
You are probaby super busy but we are going to do maintenance work on EventLogging DB [19:32:09] we shall be adding a column [19:32:20] yeah, we're going to exploit the fact that there may be data issues anyway to do a migration on db1047 [19:32:37] ori: sounds nasty [19:32:45] no funkiness expected, worst case scenario is EventLogging meltdown but that won't spillover to any other services [19:32:55] * greg-g nods [19:33:12] either it's important enough to not do it on a friday and esp. while having a partial outage, or it's not important for us to care :) [19:33:15] but I appreciate the notice :) [19:33:20] from -tech: [19:33:21] 14:32 < MusikAnim> Looks like IRC feed is back to normal function. Big thanks to the WM team! [19:33:31] oh, ya'll know [19:33:36] thanks for prompt response [19:33:40] * greg-g should stop multitasking [19:33:48] yes, the IRC feed goes via esams now [19:34:00] eqiad -> esams -> pmtpa, as crazy as that sounds [19:34:39] ew. [19:34:40] it's awesome that that worked:) [19:35:33] i had totally different scenarios in mind, like you wanted to actually move it and stuff [19:36:56] RECOVERY - Host sq69 is UP: PING OK - Packet loss = 0%, RTA = 33.38 ms [19:36:56] RECOVERY - Host ssl4 is UP: PING OK - Packet loss = 0%, RTA = 33.59 ms [19:36:56] RECOVERY - Host sq67 is UP: PING OK - Packet loss = 0%, RTA = 33.07 ms [19:36:56] RECOVERY - Host mchenry is UP: PING OK - Packet loss = 0%, RTA = 33.06 ms [19:36:56] RECOVERY - Host brewster is UP: PING OK - Packet loss = 0%, RTA = 33.03 ms [19:36:56] RECOVERY - Host tridge is UP: PING OK - Packet loss = 0%, RTA = 32.96 ms [19:36:56] RECOVERY - Host capella is UP: PING OK - Packet loss = 0%, RTA = 32.96 ms [19:36:57] RECOVERY - Host virt0 is UP: PING OK - Packet loss = 0%, RTA = 32.94 ms [19:36:57] RECOVERY - Host dataset2 is UP: PING OK - Packet loss = 0%, RTA = 33.12 ms [19:36:58] RECOVERY - Host payments2 is UP: PING OK - Packet loss = 0%, RTA = 33.09 ms [19:36:58] RECOVERY - Host payments4 is UP: PING 
OK - Packet loss = 0%, RTA = 33.23 ms [19:36:59] woot [19:36:59] RECOVERY - Host payments1 is UP: PING OK - Packet loss = 0%, RTA = 33.13 ms [19:37:05] oh, hey [19:37:06] RECOVERY - Host lvs2 is UP: PING OK - Packet loss = 0%, RTA = 33.00 ms [19:37:06] RECOVERY - Host fenari is UP: PING OK - Packet loss = 0%, RTA = 33.11 ms [19:37:06] RECOVERY - Host sanger is UP: PING OK - Packet loss = 0%, RTA = 34.16 ms [19:37:06] RECOVERY - Host pdf2 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [19:37:06] RECOVERY - Host ns1.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [19:37:06] RECOVERY - Host ekrem is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [19:37:06] RECOVERY - Host sq70 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [19:37:07] RECOVERY - Host kaulen is UP: PING OK - Packet loss = 0%, RTA = 33.11 ms [19:37:07] RECOVERY - Host 208.80.152.132 is UP: PING OK - Packet loss = 0%, RTA = 32.95 ms [19:37:08] that's me with network stuff [19:37:08] RECOVERY - Host lvs5 is UP: PING OK - Packet loss = 0%, RTA = 33.11 ms [19:37:08] RECOVERY - Host emery is UP: PING OK - Packet loss = 0%, RTA = 33.02 ms [19:37:09] RECOVERY - Host formey is UP: PING OK - Packet loss = 0%, RTA = 33.26 ms [19:37:09] RECOVERY - Host yvon is UP: PING OK - Packet loss = 0%, RTA = 33.14 ms [19:37:10] RECOVERY - Host ssl3 is UP: PING OK - Packet loss = 0%, RTA = 34.12 ms [19:37:10] not actual recovery [19:37:16] RECOVERY - Host lvs3 is UP: PING OK - Packet loss = 0%, RTA = 33.09 ms [19:37:16] RECOVERY - Host lvs4 is UP: PING OK - Packet loss = 0%, RTA = 33.40 ms [19:37:16] RECOVERY - Host pdf3 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [19:37:16] RECOVERY - Host ssl2 is UP: PING OK - Packet loss = 0%, RTA = 33.10 ms [19:37:16] RECOVERY - Host ms10 is UP: PING OK - Packet loss = 0%, RTA = 33.28 ms [19:37:16] RECOVERY - Host hume is UP: PING OK - Packet loss = 0%, RTA = 33.16 ms [19:37:26] RECOVERY - Host lvs1 is UP: PING OK - Packet loss = 0%, RTA = 33.61 ms [19:37:27] 
RECOVERY - Host lvs6 is UP: PING OK - Packet loss = 0%, RTA = 34.07 ms [19:37:27] RECOVERY - Host linne is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms [19:37:27] RECOVERY - Host mexia is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [19:37:27] RECOVERY - Host labs-ns0.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 34.08 ms [19:37:27] RECOVERY - Host zhen is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [19:37:30] LeslieCarr: static? [19:37:36] RECOVERY - Host dobson is UP: PING OK - Packet loss = 0%, RTA = 33.24 ms [19:37:36] RECOVERY - Host stat1 is UP: PING OK - Packet loss = 0%, RTA = 33.07 ms [19:37:46] RECOVERY - Host sq68 is UP: PING OK - Packet loss = 0%, RTA = 33.16 ms [19:37:46] RECOVERY - Host manutius is UP: PING OK - Packet loss = 0%, RTA = 33.27 ms [19:37:46] RECOVERY - Host ssl1 is UP: PING OK - Packet loss = 0%, RTA = 33.16 ms [19:37:46] RECOVERY - Host locke is UP: PING OK - Packet loss = 0%, RTA = 33.09 ms [19:37:46] RECOVERY - Host 208.80.152.131 is UP: PING OK - Packet loss = 0%, RTA = 33.05 ms [19:38:00] yep and deactivated the border-in4 term wikimedia-prefixes - paravoid [19:38:08] ok [19:38:16] via? [19:38:25] via just the deactivate statement [19:38:31] you should be able to rollback and be fine [19:38:39] no, I mean, via which transit [19:38:44] xo [19:38:53] you know we lost XO for some hours yesterday, right? 
:) [19:39:02] XO @ tampa specifically [19:39:17] so the chances of it happening twice are lower ;) [19:39:23] heh [19:39:27] RECOVERY - Host grosley is UP: PING OK - Packet loss = 0%, RTA = 33.03 ms [19:39:46] i just did it because it's at both locations [19:39:46] RECOVERY - Host loudon is UP: PING OK - Packet loss = 0%, RTA = 33.02 ms [19:40:04] yeah [19:40:16] RECOVERY - Host pappas is UP: PING OK - Packet loss = 0%, RTA = 33.08 ms [19:40:26] RECOVERY - Host payments3 is UP: PING OK - Packet loss = 0%, RTA = 33.10 ms [19:47:56] PROBLEM - Puppet freshness on ssl1 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:47:15 PM UTC [19:48:08] up up up up [19:48:56] PROBLEM - Puppet freshness on capella is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:48:06 PM UTC [19:48:56] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:48:41 PM UTC [19:48:56] PROBLEM - Puppet freshness on sq68 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:48:11 PM UTC [19:48:56] PROBLEM - Puppet freshness on hume is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:48:36 PM UTC [19:48:56] PROBLEM - Puppet freshness on ssl4 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:47:46 PM UTC [19:48:56] PROBLEM - Puppet freshness on cp4019 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:48:26 PM UTC [19:49:56] PROBLEM - Puppet freshness on lvs6 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:49:21 PM UTC [19:50:54] greg-g: that's what you get for that, a bunch of freshness spam :p:) cya later, signing off for tonight [19:50:56] PROBLEM - Puppet freshness on stat1002 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:49:52 PM UTC [19:51:56] PROBLEM - Puppet freshness on brewster is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:50:57 PM UTC [19:51:56] PROBLEM - Puppet freshness on manutius is CRITICAL: Last successful Puppet run was Fri 
17 Jan 2014 04:51:07 PM UTC [19:51:56] PROBLEM - Puppet freshness on mchenry is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:51:32 PM UTC [19:54:36] RECOVERY - DPKG on rhodium is OK: All packages OK [19:54:56] PROBLEM - Puppet freshness on emery is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:54:45 PM UTC [19:54:56] PROBLEM - Puppet freshness on linne is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:54:40 PM UTC [19:57:56] PROBLEM - Puppet freshness on virt0 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:57:23 PM UTC [19:58:56] PROBLEM - Puppet freshness on formey is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:58:33 PM UTC [19:59:56] PROBLEM - Puppet freshness on pdf2 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:59:39 PM UTC [19:59:56] PROBLEM - Puppet freshness on sq69 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:59:34 PM UTC [20:00:36] PROBLEM - DPKG on rhodium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:00:56] PROBLEM - Puppet freshness on lvs1 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 04:59:54 PM UTC [20:00:56] PROBLEM - Puppet freshness on pdf3 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:00:36 PM UTC [20:00:56] PROBLEM - Puppet freshness on sq70 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:00:36 PM UTC [20:00:56] PROBLEM - Puppet freshness on ssl3 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:00:00 PM UTC [20:01:56] PROBLEM - Puppet freshness on locke is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:00:51 PM UTC [20:01:56] PROBLEM - Puppet freshness on ssl2 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:01:36 PM UTC [20:02:56] PROBLEM - Puppet freshness on zhen is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:02:12 PM UTC [20:02:56] PROBLEM - Puppet freshness on sanger is CRITICAL: Last successful Puppet run was Fri 17 
Jan 2014 05:01:51 PM UTC [20:03:56] PROBLEM - Puppet freshness on yvon is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:02:53 PM UTC [20:03:56] PROBLEM - Puppet freshness on tridge is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:02:48 PM UTC [20:03:56] PROBLEM - Puppet freshness on lvs5 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:03:33 PM UTC [20:04:56] PROBLEM - Puppet freshness on lvs2 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:04:19 PM UTC [20:04:56] PROBLEM - Puppet freshness on sq67 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:03:58 PM UTC [20:05:56] PROBLEM - Puppet freshness on mexia is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:04:59 PM UTC [20:06:56] PROBLEM - Puppet freshness on fenari is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:06:05 PM UTC [20:07:56] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:07:31 PM UTC [20:09:56] PROBLEM - Puppet freshness on dobson is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:09:12 PM UTC [20:11:56] PROBLEM - Puppet freshness on lvs3 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:11:42 PM UTC [20:12:56] PROBLEM - Puppet freshness on stat1 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:12:38 PM UTC [20:12:56] PROBLEM - Puppet freshness on ekrem is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:12:38 PM UTC [20:13:56] PROBLEM - Puppet freshness on ms10 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:13:28 PM UTC [20:13:56] PROBLEM - Puppet freshness on lvs4 is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:13:43 PM UTC [20:48:14] Ahahaha [20:48:19] Amusing side effect: [20:48:24] 
http://en.wikipedia.beta.wmflabs.org/w/api.php?action=query&format=json&titles=File%3AWikimedia+Foundation+-+Team+1+-+California+Academy+of+Sciences%2Ejpg&prop=imageinfo&iiextmetadatalanguage=en&iiprop=timestamp%7Cuser%7Cuserid%7Ccomment%7Curl%7Csize%7Csha1%7Cmime%7Cmediatype%7Cmetadata%7Cextmetadata&iiurlwidth=640&meta=filerepoinfo [20:48:40] Image metadata can't be fetched because labs can't talk to the cluster (where Commons is) [20:48:54] So I can see the already-generated thumbnails but can't get data about them [20:48:57] This is fun. [21:07:12] marktraceur: You have an amusing definition of "amusing". [21:09:04] marktraceur: yeah, annoying :/ [21:10:04] paravoid, are you working? [21:10:07] what is happening? [21:10:08] http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=&vl=&x=&n=&hreg%5B%5D=cp301%5B1-4%5D&mreg%5B%5D=frontend.client_conn&gtype=stack&glegend=show&aggregate=1 [21:11:56] just ganglia being silly I suppose? [21:13:30] i guess? [21:13:31] if so, totally silly [21:13:38] why does it show requests for cp301[34]? [21:13:39] at all? [21:13:42] that is totally weird [21:14:07] you're getting some ganglia expertise lately, maybe you want to debug? :) [21:16:28] (I'm packing) [21:20:15] haha, ok [21:20:18] i do not want to debug :p [21:20:20] maybe later [21:44:16] !log removed myself from all aliases [21:46:43] :( [21:47:32] The logbot's not working because it can't reach the cluster as well? [21:48:32] or maybe needs restarting... [21:49:00] hrm, where does that live.... [21:49:02] paravoid: can someone from ops fix that thing? argh. [21:49:09] LeslieCarr: i'll restart it; it's on tool labs [21:49:18] thanks [21:49:53] I think all tools in Labs can't reach the cluster as long as pmtpa-eqiad is down.
[21:50:10] oh yeah [21:50:16] well [21:50:25] they may be able to if they have a public ip [21:50:30] if no public ip, then they can not [21:51:12] thanks ori [21:51:16] hrm, the method names are missing from https://gdash.wikimedia.org/dashboards/filebackend/ :/ [21:51:19] !log removed myself from all mail aliases [21:51:27] Logged the message, Mistress of the network gear. [21:51:28] !log restarted morebots [21:51:35] Logged the message, Master [21:51:52] ori: We could try running the logstash irc input to log stuff like that [21:51:55] What's the word on getting the link back up? [21:52:08] I stand corrected: en.wikipedia.org times out from Labs, but wikitech.wikimedia.org is accessible. [21:52:12] bd808: yes plz [21:52:20] but it also needs to be able to update the SAL [21:52:46] !log to restart morebots: ssh to tool labs, become morebots, then: qdel $(qstat | grep production | cut -d' ' -f 1) ; sleep 5 ; jstart -N production /usr/lib/adminbot/adminlogbot.py --config ./confs/production-logbot.py [21:52:52] Logged the message, Master [21:52:57] So somebody needs to write a wiki output plugin [21:53:27] or a mediawiki extension for showing date from logstash [21:53:39] which could ostensibly use all the work ^d and manybubbles did on cirrus [21:53:46] *showing data [21:54:19] i imagine showing a list of log entries in elasticsearch matching some filter is not very different from producing search results [21:54:30] <^d> What you probably want actually is to use the Elastica extension we have. [21:54:32] <^d> For this. [21:54:35] It's exactly the same mostly [21:54:51] <^d> Most of Cirrus is MW-search-specific and wouldn't be terribly useful to you in logstash. [21:54:58] right, Elastica [21:55:19] <^d> But yeah, you could use Elastica and then output the data to some sort of special page [21:55:20] <^d> Or w/e. 
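bd808's idea above of pointing a logstash irc input at this channel would be a few lines of config. Option names below are from memory of the logstash 1.x irc input and the nick is made up, so treat all of it as an assumption:

```
input {
  irc {
    host     => "chat.freenode.net"
    channels => ["#wikimedia-operations"]
    nick     => "logstash-irc"   # hypothetical bot nick
  }
}
```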
that would solve another problem -- the fact that the SAL is huge and morebots updates it by fetching the whole thing, adding a line to the top, and then submitting the entire thing back for parsing/saving [21:55:41] ori: On a slightly related note: https://gerrit.wikimedia.org/r/#/c/108153/ [21:56:45] bd808: should it be going to a specific host like that or should it use multicast? [21:56:48] I haven't tried the irc plugin anywhere yet. I could play with it in labs a bit to see how it works. What channel can I attach to? -core? [21:57:05] Or better I'll just do a ## channel to start [21:57:35] just /join here and consume !log in parallel with morebots [21:59:07] bd808: ^ see my question above re: udp log [21:59:42] ori: Missed that. Umm I think we have to go direct or we will get duplicate log messages unfortunately [22:00:17] Another reason that we need to get the native redis output written [22:00:56] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 06:59:52 PM UTC [22:00:56] bd808: can you document that in a todo in the source? [22:01:06] Sure. [22:01:39] AaronSchulz: I added a 'MediaWiki' prefix to all Graphite metric names, so the significant bits shifted dot-separated location to the right [22:01:50] AaronSchulz: probably that's the reason for the method names going AWOL [22:02:02] to all MediaWiki-generated Graphite metrics, I mean. [22:03:24] is there a bug for that? [22:04:19] ori: TODO added [22:05:21] grrrit-wm also needs to be restarted [22:07:49] grrrit-wm prints nothing to stderr and '2', repeatedly, to stdout. nice. [22:09:30] ori: I think marktraceur disabled grrrit-wm on purpose since it can't read from/to Gerrit ATM. [22:09:38] Yeah [22:09:50] what good does disabling it do?
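On the SAL-update inefficiency ori describes above (fetch whole page, prepend a line, resubmit): the MediaWiki edit API's `prependtext` parameter adds text without round-tripping the page body through the client. A hedged curl sketch; token acquisition is elided and the entry text is made up:

```
curl -s 'https://wikitech.wikimedia.org/w/api.php' \
    --data-urlencode 'action=edit' \
    --data-urlencode 'title=Server_Admin_Log' \
    --data-urlencode 'prependtext=* 21:55 example !log entry' \
    --data-urlencode "token=${EDIT_TOKEN}"
```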
[22:10:06] Nothing worse than having it around [22:13:15] except when the stream is back someone will need to be around to manually start it [22:19:08] bd808: merged, tcpdump shows data going to logstash1001 [22:19:24] * bd808 runs off to look at logs [22:19:48] hey guys; would there be a reason a labs instance cannot talk to commons? I thought our pipes were back up and running? [22:19:58] they are not [22:20:50] ori: https://logstash.wikimedia.org/#dashboard/temp/7-q7h8QoTJKn3H08z3bf2g [22:20:56] RECOVERY - Host ms-be1 is UP: PING OK - Packet loss = 0%, RTA = 35.50 ms [22:20:56] RECOVERY - Host mw20 is UP: PING OK - Packet loss = 0%, RTA = 36.46 ms [22:20:56] RECOVERY - Host ps1-d3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 37.90 ms [22:20:56] RECOVERY - Host mw49 is UP: PING OK - Packet loss = 0%, RTA = 35.61 ms [22:20:56] RECOVERY - Host db71 is UP: PING OK - Packet loss = 0%, RTA = 35.92 ms [22:20:56] RECOVERY - Host mw60 is UP: PING OK - Packet loss = 0%, RTA = 35.47 ms [22:20:56] RECOVERY - Host professor is UP: PING OK - Packet loss = 0%, RTA = 35.45 ms [22:21:56] bd808: \o/ very cool! [22:22:03] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [22:22:03] RECOVERY - Puppet freshness on manutius is OK: puppet ran at Fri Jan 17 22:21:59 UTC 2014 [22:22:10] RECOVERY - Puppet freshness on stat1002 is OK: puppet ran at Fri Jan 17 22:22:04 UTC 2014 [22:22:10] RECOVERY - Puppet freshness on mchenry is OK: puppet ran at Fri Jan 17 22:22:09 UTC 2014 [22:22:12] heh, is tampa back? 
[22:22:15] yes [22:22:25] one of the links at least [22:22:31] RECOVERY - Host erzurumi is UP: PING OK - Packet loss = 0%, RTA = 35.50 ms [22:22:40] RECOVERY - Puppet freshness on dataset2 is OK: puppet ran at Fri Jan 17 22:22:34 UTC 2014 [22:23:10] PROBLEM - MySQL Replication Heartbeat on db67 is CRITICAL: CRIT replication delay 16525 seconds [22:23:20] RECOVERY - Host db78 is UP: PING OK - Packet loss = 0%, RTA = 35.38 ms [22:23:30] PROBLEM - MySQL Replication Heartbeat on es7 is CRITICAL: CRIT replication delay 8987 seconds [22:23:30] PROBLEM - MySQL Replication Heartbeat on db63 is CRITICAL: CRIT replication delay 15942 seconds [22:23:40] PROBLEM - MySQL Slave Delay on db69 is CRITICAL: CRIT replication delay 17029 seconds [22:23:50] PROBLEM - Disk space on virt10 is CRITICAL: Timeout while attempting connection [22:23:50] PROBLEM - MySQL Slave Delay on db63 is CRITICAL: Timeout while attempting connection [22:23:50] PROBLEM - MySQL Slave Delay on db67 is CRITICAL: CRIT replication delay 15793 seconds [22:23:51] PROBLEM - MySQL Replication Heartbeat on db69 is CRITICAL: CRIT replication delay 16807 seconds [22:23:51] RECOVERY - Host es3 is UP: PING WARNING - Packet loss = 61%, RTA = 35.36 ms [22:24:00] PROBLEM - MySQL Slave Delay on es7 is CRITICAL: CRIT replication delay 6356 seconds [22:24:00] PROBLEM - MySQL Replication Heartbeat on db72 is CRITICAL: CRIT replication delay 12420 seconds [22:24:00] PROBLEM - MySQL Slave Delay on db72 is CRITICAL: CRIT replication delay 12366 seconds [22:24:14] yeah, gonna be a little behind ;) [22:24:21] bd808: because the API log is really a request log rather than an operational log per se, I figure it should be sent to kafka [22:25:00] RECOVERY - Puppet freshness on linne is OK: puppet ran at Fri Jan 17 22:24:57 UTC 2014 [22:25:13] might be worth looking into https://github.com/quipo/kafka-php / https://github.com/michal-harish/kafka-php [22:25:31] RECOVERY - MySQL Replication Heartbeat on es7 is OK: OK replication delay 0 
seconds [22:25:40] RECOVERY - Puppet freshness on emery is OK: puppet ran at Fri Jan 17 22:25:33 UTC 2014 [22:26:00] RECOVERY - MySQL Slave Delay on es7 is OK: OK replication delay 0 seconds [22:27:32] ori: There is work on a kafka output plugin at https://github.com/joekiller/logstash-kafka [22:27:40] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Fri Jan 17 22:27:37 UTC 2014 [22:27:54] but logstash is not a useful middleman [22:28:40] RECOVERY - Puppet freshness on pdf2 is OK: puppet ran at Fri Jan 17 22:28:33 UTC 2014 [22:28:44] Ah. You just want to go straight to kafka from our next-gen logging system then. [22:29:08] yeah, it's a good use-case for backend configurability [22:29:20] RECOVERY - Puppet freshness on formey is OK: puppet ran at Fri Jan 17 22:29:19 UTC 2014 [22:29:24] * bd808 nods [22:29:50] RECOVERY - Puppet freshness on lvs1 is OK: puppet ran at Fri Jan 17 22:29:45 UTC 2014 [22:30:10] RECOVERY - Puppet freshness on ssl3 is OK: puppet ran at Fri Jan 17 22:30:05 UTC 2014 [22:30:20] RECOVERY - Puppet freshness on sq69 is OK: puppet ran at Fri Jan 17 22:30:15 UTC 2014 [22:30:50] PROBLEM - Disk space on virt10 is CRITICAL: Timeout while attempting connection [22:32:20] PROBLEM - Host mw68 is DOWN: PING CRITICAL - Packet loss = 100% [22:32:20] PROBLEM - Host ersch is DOWN: PING CRITICAL - Packet loss = 100% [22:32:20] PROBLEM - Host mw114 is DOWN: PING CRITICAL - Packet loss = 100% [22:32:20] PROBLEM - Host labstore4 is DOWN: PING CRITICAL - Packet loss = 100% [22:32:20] PROBLEM - Host mw105 is DOWN: PING CRITICAL - Packet loss = 100% [22:32:39] greg-g: we won't do ulsfo next week after all [22:33:12] greg-g: transport got delivered just today; transit hasn't been lit up yet, vendor investigating; plus, both Mark and I will be travelling or busy in meetings [22:33:31] PROBLEM - puppet disabled on searchidx2 is CRITICAL: Timeout while attempting connection [22:33:40] PROBLEM - puppet disabled on tmh2 is CRITICAL: Timeout while attempting 
connection [22:33:48] it doesn't make much sense, as much as I'd enjoy the reduced latency from SF :) [22:34:00] RECOVERY - Host ps1-b2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 37.18 ms [22:34:00] RECOVERY - Host ps1-c3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 37.86 ms [22:34:00] RECOVERY - Host ps1-b5-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 37.35 ms [22:34:00] RECOVERY - Host ps1-a5-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 38.05 ms [22:34:00] RECOVERY - Host ps1-d2-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 37.97 ms [22:35:10] RECOVERY - Puppet freshness on pdf3 is OK: puppet ran at Fri Jan 17 22:35:07 UTC 2014 [22:35:20] RECOVERY - Puppet freshness on lvs2 is OK: puppet ran at Fri Jan 17 22:35:12 UTC 2014 [22:35:20] RECOVERY - Puppet freshness on lvs5 is OK: puppet ran at Fri Jan 17 22:35:17 UTC 2014 [22:35:34] so many recovery pages ;) [22:35:43] paravoid: shall i revert the hacks ? [22:36:00] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Jan 17 22:35:58 UTC 2014 [22:36:08] let's wait to see it stable first I'd say [22:36:19] you never know, the crew is probably still on-site [22:36:27] and only one of the two is up now [22:37:00] RECOVERY - Puppet freshness on virt1004 is OK: puppet ran at Fri Jan 17 22:36:53 UTC 2014 [22:37:44] From Labs I can connect now to labsdb* in eqiad, but en.wikipedia.org times out. [22:37:50] PROBLEM - Packetloss_Average on emery is CRITICAL: packet_loss_average CRITICAL: 54.4386242623 [22:38:30] scfc_de: are you sure? there's no reason this would happen [22:38:40] RECOVERY - Puppet freshness on fenari is OK: puppet ran at Fri Jan 17 22:38:39 UTC 2014 [22:39:10] RECOVERY - Puppet freshness on dobson is OK: puppet ran at Fri Jan 17 22:39:00 UTC 2014 [22:39:33] paravoid: Yes. Log into tools-login.wmflabs.org, "curl http://en.wikipedia.org/" => times out. 
[22:39:35] hrm, let me try
[22:40:05] hrm, yeah, it stops at cr1-sdtpa
[22:40:09] according to mtr
[22:40:17] i'll check it out
[22:40:40] what an awesome last day ;)
[22:40:40] RECOVERY - MySQL Slave Delay on db63 is OK: OK replication delay 92 seconds
[22:40:58] heh
[22:41:31] RECOVERY - MySQL Replication Heartbeat on db63 is OK: OK replication delay -1 seconds
[22:41:40] RECOVERY - Puppet freshness on lvs3 is OK: puppet ran at Fri Jan 17 22:41:32 UTC 2014
[22:42:26] hrm, so it works from fenari
[22:42:50] RECOVERY - Puppet freshness on ms10 is OK: puppet ran at Fri Jan 17 22:42:43 UTC 2014
[22:42:50] RECOVERY - MySQL Slave Delay on db67 is OK: OK replication delay 0 seconds
[22:42:50] RECOVERY - Puppet freshness on stat1 is OK: puppet ran at Fri Jan 17 22:42:48 UTC 2014
[22:43:11] scfc_de: can you ping en.wikipedia.org ? (i don't have access on tool labs and i'm not starting now!)
[22:43:20] RECOVERY - MySQL Replication Heartbeat on db67 is OK: OK replication delay -1 seconds
[22:43:30] RECOVERY - Puppet freshness on ekrem is OK: puppet ran at Fri Jan 17 22:43:23 UTC 2014
[22:43:34] paravoid: :) ok
[22:44:00] RECOVERY - Puppet freshness on lvs4 is OK: puppet ran at Fri Jan 17 22:43:54 UTC 2014
[22:44:26] greg-g: sorry for the mixup, I'm sure I said "probably" somewhere along my update, but I should had been more clear
[22:44:52] no worries
[22:45:04] LeslieCarr: Apparently now yes and on the first try :-9.
[22:45:17] does it curl now as well ?
[22:45:53] Yes, :80 works fine.
[22:46:05] so the cr's in tampa are not the most powerful processor wise and just got like a million new routes dumped on them -- since it's in bgp, it's possible they hadn't been able to process and install the route yet
[22:46:40] RECOVERY - MySQL Slave Delay on db69 is OK: OK replication delay 0 seconds
[22:46:50] RECOVERY - MySQL Replication Heartbeat on db69 is OK: OK replication delay -0 seconds
[22:47:20] RECOVERY - Puppet freshness on ssl1 is OK: puppet ran at Fri Jan 17 22:47:17 UTC 2014
[22:47:38] (Memo to self: "curl http://en.wikipedia.org/" is bad for testing, as the redirect to /wiki/Main_Page has no content :-).)
[22:48:00] RECOVERY - Puppet freshness on capella is OK: puppet ran at Fri Jan 17 22:47:53 UTC 2014
[22:48:10] RECOVERY - Puppet freshness on hume is OK: puppet ran at Fri Jan 17 22:48:08 UTC 2014
[22:48:20] RECOVERY - Puppet freshness on sq68 is OK: puppet ran at Fri Jan 17 22:48:18 UTC 2014
[22:48:20] RECOVERY - Puppet freshness on cp4019 is OK: puppet ran at Fri Jan 17 22:48:18 UTC 2014
[22:48:25] hehe, i still totally use wget
[22:48:30] RECOVERY - Puppet freshness on ssl4 is OK: puppet ran at Fri Jan 17 22:48:23 UTC 2014
[22:49:14] So now one link is up? Are the link statuses public somewhere? (Icinga/Ganglia/etc.)
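scfc_de's memo above is about curl not following redirects by default: a plain GET of http://en.wikipedia.org/ returns a 301 to /wiki/Main_Page with an empty body, so a "successful" curl proves only that the frontend answered, not that content is served. A small sketch demonstrating the difference against a local stand-in server (the paths mirror the real site, but the server here is fake):

```python
import http.client
import http.server
import threading
import urllib.request


class RedirectDemo(http.server.BaseHTTPRequestHandler):
    """Mimics en.wikipedia.org: / redirects to /wiki/Main_Page."""

    def do_GET(self):
        if self.path == "/":
            self.send_response(301)
            self.send_header("Location", "/wiki/Main_Page")
            self.end_headers()          # 301 with no body, like the real site
        else:
            body = b"<html>Main Page</html>"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):       # keep the demo quiet
        pass


server = http.server.HTTPServer(("127.0.0.1", 0), RedirectDemo)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# Like plain `curl http://.../`: one request, redirect NOT followed.
conn = http.client.HTTPConnection("127.0.0.1", port)
conn.request("GET", "/")
resp = conn.getresponse()
raw_status, raw_body = resp.status, resp.read()   # a 301 and an empty body

# Like `curl -L`: urllib follows the redirect and fetches real content.
with urllib.request.urlopen("http://127.0.0.1:%d/" % port) as r:
    followed_body = r.read()

server.shutdown()
```

For connectivity checks of this kind, `curl -L` (or fetching /wiki/Main_Page directly) avoids the empty-301 false positive.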
[22:50:40] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100%
[22:51:50] RECOVERY - Packetloss_Average on emery is OK: packet_loss_average OKAY: -0.102021311475
[23:01:20] RECOVERY - Puppet freshness on sq70 is OK: puppet ran at Fri Jan 17 23:01:17 UTC 2014
[23:01:31] RECOVERY - Puppet freshness on locke is OK: puppet ran at Fri Jan 17 23:01:27 UTC 2014
[23:02:19] (PS1) Se4598: Resetting legacy channel names on labs and enabling IRC-RC echo again [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/108163
[23:02:20] RECOVERY - Puppet freshness on tridge is OK: puppet ran at Fri Jan 17 23:02:18 UTC 2014
[23:02:30] RECOVERY - Puppet freshness on ssl2 is OK: puppet ran at Fri Jan 17 23:02:28 UTC 2014
[23:03:00] RECOVERY - Puppet freshness on yvon is OK: puppet ran at Fri Jan 17 23:02:59 UTC 2014
[23:03:29] (CR) Se4598: "unconfirmed to fix the bug/can't tested" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/108163 (owner: Se4598)
[23:03:50] PROBLEM - Puppet freshness on sanger is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 05:01:51 PM UTC
[23:07:10] RECOVERY - Puppet freshness on sanger is OK: puppet ran at Fri Jan 17 23:07:01 UTC 2014
[23:07:32] (CR) Tim Landscheidt: [C: -1] "Just cosmetics." (2 comments) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/108163 (owner: Se4598)
[23:10:32] (PS2) Se4598: Resetting legacy channel names on labs and enabling IRC-RC echo again [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/108163
[23:16:28] (PS1) Chad: WIP: Move purge-checkuser script off hume and to terbium [operations/puppet] - https://gerrit.wikimedia.org/r/108165
[23:19:17] (CR) Chad: "I *think* this is the last thing running on terbium MW-wise. I think we also have jobqueue monitoring there, but we should already have th" [operations/puppet] - https://gerrit.wikimedia.org/r/108165 (owner: Chad)
[23:22:31] (PS1) Gergő Tisza: Make UploadWizard respect the Flickr blacklist on Commons [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/108167
[23:24:58] (CR) Gergő Tisza: Make UploadWizard respect the Flickr blacklist on Commons (1 comment) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/108167 (owner: Gergő Tisza)
[23:44:15] …lawnmower?