[00:22:21] (03PS1) 10Ori.livneh: mediawiki: install php5-dbg on Zend app servers [puppet] - 10https://gerrit.wikimedia.org/r/165145 [00:23:26] (03CR) 10Ori.livneh: [C: 032] mediawiki: install php5-dbg on Zend app servers [puppet] - 10https://gerrit.wikimedia.org/r/165145 (owner: 10Ori.livneh) [00:24:00] (03CR) 10Chad: mediawiki: install php5-dbg on Zend app servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/165145 (owner: 10Ori.livneh) [00:24:17] <^demon|away> ori: I was about to leave a comment :p [00:25:08] (03PS1) 10Ori.livneh: mediawiki::packages::php5: formatting tweak [puppet] - 10https://gerrit.wikimedia.org/r/165146 [00:25:16] PROBLEM - puppet last run on amslvs2 is CRITICAL: CRITICAL: puppet fail [00:25:56] (03CR) 10Ori.livneh: [C: 032 V: 032] "per ^d" [puppet] - 10https://gerrit.wikimedia.org/r/165146 (owner: 10Ori.livneh) [00:26:05] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [00:27:30] <^demon|away> ori: :) [00:33:32] (03PS3) 10Krinkle: Gzip .svg and .ico files on bits.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/113687 (https://bugzilla.wikimedia.org/61442) (owner: 10Brion VIBBER) [00:33:34] (03PS10) 10Krinkle: Gzip SVGs on back upload varnishes [puppet] - 10https://gerrit.wikimedia.org/r/108484 (https://bugzilla.wikimedia.org/54291) (owner: 10Ori.livneh) [00:34:08] (03CR) 10Krinkle: "The mid-string match bug was fixed by using string comparison instead of regex matching (alternatively, we could add ^ and $ to the regex)" [puppet] - 10https://gerrit.wikimedia.org/r/108484 (https://bugzilla.wikimedia.org/54291) (owner: 10Ori.livneh) [00:34:35] !log core dumps were enabled on mw1088, unexpectedly started gathering natural segfault traffic [00:34:42] Logged the message, Master [00:34:58] (03CR) 10Krinkle: [C: 031] "Fixed the open-ended regex by completing the half-match of 'svg' to 'svg+xml' and adding an $-end marker as well. Escaped the '+' accordi" [puppet] - 10https://gerrit.wikimedia.org/r/113687 (https://bugzilla.wikimedia.org/61442) (owner: 10Brion VIBBER) [00:35:06] (03CR) 10Krinkle: [C: 031] Gzip SVGs on back upload varnishes [puppet] - 10https://gerrit.wikimedia.org/r/108484 (https://bugzilla.wikimedia.org/54291) (owner: 10Ori.livneh) [00:36:06] Question: ewulczyn____ had a Wikitech account created for him, but it did not have an associated gerrit login. [00:36:10] What should he do? [00:38:51] awight: that is odd, it should be the same thing [00:39:01] same LDAP user [00:39:26] error on gerrit login? [00:42:16] mutante: he doesn't show up in the reviewer autocomplete... asking him to try logging in now, though. [00:42:18] (03CR) 10Dzahn: [C: 032] "checked with cmjohnson, already disconnected" [dns] - 10https://gerrit.wikimedia.org/r/164128 (owner: 10Dzahn) [00:42:51] awight: what's the username? [00:43:04] RECOVERY - puppet last run on amslvs2 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [00:43:12] mutante: ewulczyn. Thanks for taking a look! [00:43:29] mutante: *actually* he just logged in successfully! Sorry for the false alarm. [00:44:07] mutante: hah, apparently u don't show up in the autocomplete list until logging in at least once. [00:44:24] awight: ah!
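A quick way to verify the SVG gzip behaviour reviewed above once either change is deployed; the URL here is a placeholder, not a real file. With the anchored 'image/svg+xml' match in place, a gzip-capable client should get Content-Encoding: gzip back:

    curl -sI -H 'Accept-Encoding: gzip' \
      'https://upload.wikimedia.org/wikipedia/commons/x/xx/Example.svg' \
      | grep -i '^content-'

This prints the Content-Type and Content-Encoding headers; a missing Content-Encoding would mean the content-type check still isn't matching.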
alright, cool [00:44:49] (03PS2) 10Ori.livneh: apache: keep two weeks' worth of logs, rather than 1yr [puppet] - 10https://gerrit.wikimedia.org/r/130296 (owner: 10ArielGlenn) [00:45:18] ^ mutante, re-added you, now that the patch is against the current HEAD [00:47:10] (03CR) 10Ori.livneh: [C: 031] "I think it makes sense to do it in apache/manifests/init.pp -- it's the sort of thing that should be standardized everywhere." [puppet] - 10https://gerrit.wikimedia.org/r/130296 (owner: 10ArielGlenn) [00:48:29] (03CR) 10Dzahn: [C: 031] "thanks! this looks very reasonable to me. minor nitpick is only mixed spaces/tabs in the logrotate config" [puppet] - 10https://gerrit.wikimedia.org/r/130296 (owner: 10ArielGlenn) [00:51:04] (03PS3) 10Ori.livneh: apache: keep two weeks' worth of logs, rather than 1yr [puppet] - 10https://gerrit.wikimedia.org/r/130296 (owner: 10ArielGlenn) [00:53:35] (03CR) 10Dzahn: [C: 031] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/130296 (owner: 10ArielGlenn) [00:55:08] mutante: thanks! i'll leave it for apergo.s to review/merge [00:55:17] ori: thank you too [00:55:49] yea, for the report too [01:15:44] !log ori Synchronized php-1.25wmf2/extensions/Wikidata: Ie92da71 / I44f1dce: Update Wikidata, fixes for serialization issues (duration: 00m 10s) [01:15:52] Logged the message, Master [01:16:18] !log ori Synchronized php-1.25wmf1/extensions/Wikidata: Ie92da71 / I44f1dce: Update Wikidata, fixes for serialization issues (duration: 00m 09s) [01:16:20] ^ aude [01:16:23] thanks [01:16:23] Logged the message, Master [01:16:36] aude: isn't it 3am for you? [01:16:48] it is [01:17:09] aw, crap. well, thanks very much for staying on this bug. i hope you manage to get some rest. [01:17:15] i hope it wasn't too stressful [01:19:06] looks like the item still doesn't load though :( [01:19:14] * aude looks at the logs [01:19:51] I think it's OK to let it be and get some rest, really. I'll poke at it later when I get home, too. [01:21:04] gwicke: ping [01:21:13] bblack: pong [01:21:31] hey, what do you know about PURGE traffic on the parsoidcache? (as in who generates it and why and how much, etc?) [01:21:49] * ori heads out [01:21:56] I think we generate it from the parsoid cache update job [01:22:06] (because it seems like purge isn't handled right at all in the VCL) [01:22:11] (03PS2) 10Dzahn: remove pmtpa from all $domain_search [puppet] - 10https://gerrit.wikimedia.org/r/159441 [01:22:14] bblack: I was looking at the number of network connections on the varnishes [01:22:23] they seem to hover around 12k [01:22:53] (03CR) 10jenkins-bot: [V: 04-1] remove pmtpa from all $domain_search [puppet] - 10https://gerrit.wikimedia.org/r/159441 (owner: 10Dzahn) [01:23:05] but the front-end connection limit should be at 10k [01:23:09] 100k even [01:23:19] so probably not an issue [01:23:19] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [01:23:35] bblack: we don't rely on the purge [01:23:42] ori: alright [01:23:47] thanks for helping out with this [01:24:00] bblack: originally the idea was that the front-ends just forward the purge to the backend [01:24:13] generally the front-ends are only used for hashing [01:24:59] they do cache though, right? 
[01:25:08] I don't think so [01:25:38] sub vcl_fetch { [01:25:40] set beresp.ttl = 0s; [01:25:42] } [01:25:54] ah the cache is config'd for 1GB [01:26:01] (03PS3) 10Dzahn: remove pmtpa from all $domain_search [puppet] - 10https://gerrit.wikimedia.org/r/159441 [01:26:02] and yeah I guess ttl =0s [01:27:31] (03PS4) 10Dzahn: remove pmtpa from all $domain_search [puppet] - 10https://gerrit.wikimedia.org/r/159441 [01:27:47] bblack: I just grepped the logs for 'Failed API', and the number seems to fluctuate quite a bit [01:27:56] anyways, I think purging there should set req.hash_ignore_busy. it was a problem on other caches. trying that locally just to see if there's some tertiary fallout that affects the timeout issue for the non-PURGE requests. [01:28:41] (03PS5) 10Dzahn: remove pmtpa from all $domain_search [puppet] - 10https://gerrit.wikimedia.org/r/159441 [01:28:54] I didn't see any of the timed-out requests in the varnishncsa logs [01:29:18] was looking at the frontend [01:29:36] (03CR) 10Dzahn: [C: 032] remove pmtpa from all $domain_search [puppet] - 10https://gerrit.wikimedia.org/r/159441 (owner: 10Dzahn) [01:29:39] are there any LVS connection limits? [01:31:44] gwicke: not really, no, and parsoid has much lower conn counts than other working services on the same lvs [01:32:04] what I do know now is the timeouts are "real" [01:32:23] as in you got them outside of parsoid as well? [01:32:31] as in: a sniffer on cp1045 does see a request come in and no response go back out for 10s [01:32:39] aha! [01:32:43] I haven't seen it myself with curl, but I've observed it with the natural wtp traffic [01:32:56] it's a very strange pattern though, like this is due to some subtle bug [01:34:29] at the TCP level from cp1045's perspective, you see the GET request come in, you see the ACK for that. Then the 10s timeout goes by. Then the wtp client sends a FIN because it's giving up on the connection, and the varnish box acks the FIN and almost immediately spews back the request data that was being waited on (+some small ms wait), and then the client aborts the connection. [01:35:00] (03PS2) 10Dzahn: remove subnet 10.4.6.0/24 - pmtpa virtual hosts [puppet] - 10https://gerrit.wikimedia.org/r/164242 [01:35:04] it's almost like varnish doesn't think the request was fully sent to it, until the client decides to go ahead and abort and starts closing its side [01:35:47] (at which point it's too late as far as the client cares) [01:36:00] so the issue is likely between front- and backend varnish? [01:36:04] yet when I look at the bytes of the GET request in detail, it does look complete [01:36:11] (03CR) 10Dzahn: [C: 032] "Nmap done: 256 IP addresses (0 hosts up) :)" [puppet] - 10https://gerrit.wikimedia.org/r/164242 (owner: 10Dzahn) [01:36:15] no, this is the traffic between a wtp machine and the varnish frontend [01:36:20] ah [01:37:06] we had issues with backend connections on the frontend being too low before [01:37:27] (03Abandoned) 10Dzahn: make boron an 'official bastion host' [puppet] - 10https://gerrit.wikimedia.org/r/165098 (owner: 10Dzahn) [01:38:05] I've seen that timing of varnish very quickly sending the necessary response data right after the client gives up at the 10s mark way too many times in my random traces for it to be coincidence though. [01:38:17] I think there's a relation there somewhere. some ugly stupid bug.
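A minimal sketch of the packet-level tracing bblack describes above, assuming the capture interface is eth0 and that the wtp clients talk to the frontend on port 80 (both assumptions; adjust for the real host):

    # Record full packets so a single GET -> 10s silence -> FIN -> late-response
    # flow can be replayed offline.
    tcpdump -i eth0 -s 0 -w /tmp/parsoid-timeouts.pcap 'tcp port 80'

    # Replay one suspect flow by its ephemeral client port (54321 is a placeholder);
    # -ttt prints inter-packet gaps, which makes the 10s stall stand out.
    tcpdump -ttt -A -r /tmp/parsoid-timeouts.pcap 'tcp port 54321'

Working from a capture avoids the problem noted below that varnishlog and strace only show a connection's state after the fact.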
[01:38:57] I'm glad though that you narrowed it down a lot [01:39:07] so far we were looking all over [01:39:28] (unfortunately it's really hard to accurately trace out just one ephemeral connection's flow from a varnish strace, and varnishlog just spews the whole status of the thing after-the-fact) [01:40:58] the purge fix doesn't seem to have affected anything critical, but it's probably a good thing in general. I'll puppet that out while I think and look some more [01:41:25] (03CR) 10Dzahn: [C: 031] "thanks for this patch, it just needs to wait until actually all hosts have been removed from puppet stored configs and are really out of i" [puppet] - 10https://gerrit.wikimedia.org/r/165091 (owner: 10Matanya) [01:42:00] (03CR) 10Dzahn: [C: 04-1] "well, technically -1 until all hosts are really decom'ed" [puppet] - 10https://gerrit.wikimedia.org/r/165091 (owner: 10Matanya) [01:43:08] (03CR) 10Dzahn: [C: 031] "it's true, there is no db1001.pmtpa.wmnet, '1001' already means it should be eqiad. let springle decide though" [puppet] - 10https://gerrit.wikimedia.org/r/165090 (owner: 10Matanya) [01:43:46] (03PS1) 10BBlack: Set req.hash_ignore_busy for parsoid purges [puppet] - 10https://gerrit.wikimedia.org/r/165159 [01:43:48] (03CR) 10Dzahn: "db1001.eqiad.wmnet has address 10.64.0.5" [puppet] - 10https://gerrit.wikimedia.org/r/165090 (owner: 10Matanya) [01:44:32] (03CR) 10Dzahn: [C: 032] "just changes usage info text" [puppet] - 10https://gerrit.wikimedia.org/r/165088 (owner: 10Matanya) [01:45:03] why can't gerrit consistently hotlink real git commit hashes from the same repo in commit messages? I even give it the full hash and that still doesn't work all the time :p [01:45:22] (03PS2) 10BBlack: Set req.hash_ignore_busy for parsoid purges [puppet] - 10https://gerrit.wikimedia.org/r/165159 [01:45:28] (03CR) 10BBlack: [C: 032 V: 032] Set req.hash_ignore_busy for parsoid purges [puppet] - 10https://gerrit.wikimedia.org/r/165159 (owner: 10BBlack) [01:45:50] bblack: thanks for investigating this, it's much appreciated! [01:46:19] np [01:46:40] I'll head home now [01:46:55] (03CR) 10Dzahn: [C: 04-1] "looks like that it was not intentional to also touch the torrus things here. could you remove that and the dependency I9164f99c33203633 ha" [puppet] - 10https://gerrit.wikimedia.org/r/164273 (owner: 10Hoo man) [01:47:12] (03PS2) 10Dzahn: Remove all references to pmtpa from role::cache [puppet] - 10https://gerrit.wikimedia.org/r/164273 (owner: 10Hoo man) [01:47:35] ok [01:49:10] just sent a quick heads-up to the team [01:49:26] so that subbu doesn't spend the night looking for issues on the parsoid side [01:49:51] bye! [01:50:05] it may yet be a parsoid issue, but if it is it's probably very low level. Something in some node.js network or async i/o library :) [02:14:23] !log LocalisationUpdate completed (1.25wmf1) at 2014-10-07 02:14:21+00:00 [02:14:29] Logged the message, Master [02:23:27] (03CR) 10Springle: [C: 04-1] "+1 to the fix, but use the CNAME." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/165090 (owner: 10Matanya) [02:26:04] !log LocalisationUpdate completed (1.25wmf2) at 2014-10-07 02:26:04+00:00 [02:26:11] Logged the message, Master [02:28:18] PROBLEM - puppet last run on amssq60 is CRITICAL: CRITICAL: puppet fail [02:48:15] RECOVERY - puppet last run on amssq60 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [03:28:09] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Oct 7 03:28:09 UTC 2014 (duration 28m 8s) [03:28:18] Logged the message, Master [05:16:39] anyone object to me using hhvm on osmium for debugging? usually fine, unless someone is specifically using it [05:23:17] What is this supposed to show? (Linked on wikitech main page) http://torrus.wikimedia.org/torrus/CDN?path=/Totals/ [05:23:36] * Nemo_bis never managed to see anything in torrus, probably because lack of privileges [06:02:27] !log ori Synchronized php-1.25wmf2/includes/objectcache/HashBagOStuff.php: I0b0b5f01: HashBagOStuff: use the value itself as the CAS token (duration: 00m 07s) [06:02:32] Logged the message, Master [06:02:37] !log ori Synchronized php-1.25wmf1/includes/objectcache/HashBagOStuff.php: I0b0b5f01: HashBagOStuff: use the value itself as the CAS token (duration: 00m 06s) [06:02:41] Logged the message, Master [06:06:10] ebernhardson: fine [06:28:27] PROBLEM - puppet last run on search1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:05] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:05] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:25] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:35] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures [06:41:20] (03PS2) 10Matanya: mysql_wmf: db1001 is in eqiad not in pmtpa [puppet] - 10https://gerrit.wikimedia.org/r/165090 [06:45:24] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:45:34] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [06:45:43] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:45:54] RECOVERY - puppet last run on search1001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:46:04] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:50:44] PROBLEM - puppet last run on db1048 is CRITICAL: CRITICAL: Puppet has 1 failures [06:57:19] (03CR) 10Krinkle: [C: 031] use scap's embedded linking, remove lint script [puppet] - 10https://gerrit.wikimedia.org/r/160691 (https://bugzilla.wikimedia.org/68255) (owner: 10Filippo Giunchedi) [06:57:30] (03Abandoned) 10Krinkle: contint: Package 'php5-parsekit' is absent on Trusty, don't require it [puppet] - 10https://gerrit.wikimedia.org/r/161748 (https://bugzilla.wikimedia.org/68255) (owner: 10Krinkle) [07:06:27] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 222, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]BR [07:08:25] RECOVERY - puppet last run on db1048 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [07:24:27] PROBLEM - Swift HTTP backend on ms-fe2002 is 
CRITICAL: Connection timed out [07:24:36] PROBLEM - Swift HTTP backend on ms-fe2001 is CRITICAL: Connection timed out [07:26:26] PROBLEM - puppet last run on ms-be2004 is CRITICAL: CRITICAL: Puppet has 9 failures [07:40:04] good morning [07:40:24] if any ops feels brave, we could use removal of some obsolete code ( php_parsekit and the lame php linter we have been using for ages) : https://gerrit.wikimedia.org/r/#/c/160691/ [07:40:42] php linting is done by scap directly [07:45:40] RECOVERY - puppet last run on ms-be2004 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [07:59:00] PROBLEM - Swift HTTP backend on ms-fe2001 is CRITICAL: Connection timed out [07:59:01] PROBLEM - Swift HTTP backend on ms-fe2002 is CRITICAL: Connection timed out [07:59:49] PROBLEM - Host ns1-v6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::e [08:00:19] PROBLEM - Host 2620:0:860:2:d6ae:52ff:fead:5610 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:2:d6ae:52ff:fead:5610 [08:00:20] PROBLEM - Host labcontrol2001 is DOWN: CRITICAL - Time to live exceeded (208.80.153.14) [08:00:20] PROBLEM - Host 208.80.153.42 is DOWN: CRITICAL - Time to live exceeded (208.80.153.42) [08:00:20] PROBLEM - Host 208.80.153.12 is DOWN: PING CRITICAL - Packet loss = 100% [08:00:20] PROBLEM - Host bast2001 is DOWN: PING CRITICAL - Packet loss = 100% [08:00:20] PROBLEM - Host baham is DOWN: PING CRITICAL - Packet loss = 100% [08:00:20] PROBLEM - Host install2001 is DOWN: PING CRITICAL - Packet loss = 100% [08:00:21] PROBLEM - Host achernar is DOWN: PING CRITICAL - Packet loss = 100% [08:00:37] PROBLEM - Host 2620:0:860:1:d6ae:52ff:feac:4dc8 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:1:d6ae:52ff:feac:4dc8 [08:00:39] PROBLEM - Host cr2-codfw is DOWN: CRITICAL - Time to live exceeded (208.80.153.193) [08:00:39] PROBLEM - Host ns1-v4 is DOWN: CRITICAL - Time to live exceeded (208.80.153.231) [08:00:39] PROBLEM - Host ms-be2006 is DOWN: PING CRITICAL - Packet loss = 100% [08:00:39] PROBLEM - Host db2016 is DOWN: PING CRITICAL - Packet loss = 100% [08:00:39] PROBLEM - Host lvs2003 is DOWN: PING CRITICAL - Packet loss = 100% [08:00:40] PROBLEM - Host lvs2006 is DOWN: PING CRITICAL - Packet loss = 100% [08:00:55] <_joe_> wtf codfw down again [08:01:58] PROBLEM - RAID on db2002 is CRITICAL: Timeout while attempting connection [08:01:58] PROBLEM - check if salt-minion is running on db2039 is CRITICAL: Timeout while attempting connection [08:01:59] PROBLEM - RAID on ms-be2006 is CRITICAL: Timeout while attempting connection [08:01:59] PROBLEM - very high load average likely xfs on ms-be2001 is CRITICAL: Timeout while attempting connection [08:02:00] PROBLEM - swift-container-replicator on ms-be2011 is CRITICAL: Timeout while attempting connection [08:02:00] PROBLEM - swift-object-replicator on ms-be2008 is CRITICAL: Timeout while attempting connection [08:02:01] PROBLEM - check if dhclient is running on install2001 is CRITICAL: Connection refused or timed out [08:02:20] PROBLEM - check if dhclient is running on db2034 is CRITICAL: Timeout while attempting connection [08:02:20] PROBLEM - Disk space on ms-be2002 is CRITICAL: Timeout while attempting connection [08:02:20] PROBLEM - SSH on ms-fe2003 is CRITICAL: Connection timed out [08:02:21] PROBLEM - NTP on ms-be2009 is CRITICAL: NTP CRITICAL: No response from NTP server [08:02:21] PROBLEM - NTP on ms-be2002 is CRITICAL: NTP CRITICAL: No response from NTP server [08:02:21] PROBLEM - puppet last run on db2034 is CRITICAL: Timeout while attempting connection 
[08:02:21] PROBLEM - check if salt-minion is running on db2012 is CRITICAL: Timeout while attempting connection [08:02:22] PROBLEM - puppet last run on db2003 is CRITICAL: Timeout while attempting connection [08:02:22] PROBLEM - swift-account-replicator on ms-be2006 is CRITICAL: Timeout while attempting connection [08:02:23] PROBLEM - Swift HTTP backend on ms-fe2003 is CRITICAL: Connection timed out [08:02:42] PROBLEM - puppet last run on db2009 is CRITICAL: Timeout while attempting connection [08:02:42] PROBLEM - check configured eth on ms-be2002 is CRITICAL: Timeout while attempting connection [08:03:00] PROBLEM - Memcached on ms-fe2004 is CRITICAL: Connection timed out [08:03:14] PROBLEM - check configured eth on ms-fe2004 is CRITICAL: Timeout while attempting connection [08:03:47] RECOVERY - NTP on ms-be2009 is OK: NTP OK: Offset -0.06556928158 secs [08:03:47] RECOVERY - puppet last run on db2034 is OK: OK: Puppet is currently enabled, last run 1214 seconds ago with 0 failures [08:03:47] RECOVERY - puppet last run on db2003 is OK: OK: Puppet is currently enabled, last run 730 seconds ago with 0 failures [08:03:47] RECOVERY - NTP on ms-be2002 is OK: NTP OK: Offset -0.08006155491 secs [08:03:47] RECOVERY - check if salt-minion is running on db2012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:03:48] RECOVERY - swift-account-replicator on ms-be2006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [08:03:48] RECOVERY - Host ms-be2004 is UP: PING OK - Packet loss = 0%, RTA = 52.09 ms [08:03:49] RECOVERY - Host lvs2004 is UP: PING OK - Packet loss = 0%, RTA = 52.28 ms [08:03:49] RECOVERY - Host lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 56.04 ms [08:03:50] RECOVERY - Host ms-be2005 is UP: PING OK - Packet loss = 0%, RTA = 53.94 ms [08:03:50] RECOVERY - Host lvs2001 is UP: PING OK - Packet loss = 0%, RTA = 53.63 ms [08:03:51] RECOVERY - Host ms-be2012 is UP: PING OK - Packet loss = 0%, RTA = 52.06 ms [08:04:00] RECOVERY - puppet last run on db2009 is OK: OK: Puppet is currently enabled, last run 1250 seconds ago with 0 failures [08:04:00] RECOVERY - check configured eth on ms-be2002 is OK: NRPE: Unable to read output [08:04:00] RECOVERY - Host lvs2005 is UP: PING OK - Packet loss = 0%, RTA = 52.13 ms [08:04:00] RECOVERY - Host 2620:0:860:1:d6ae:52ff:feac:4dc8 is UP: PING OK - Packet loss = 0%, RTA = 52.22 ms [08:04:00] RECOVERY - Host baham is UP: PING OK - Packet loss = 0%, RTA = 53.58 ms [08:04:01] RECOVERY - Host db2028 is UP: PING OK - Packet loss = 0%, RTA = 52.45 ms [08:04:01] RECOVERY - Host db2004 is UP: PING OK - Packet loss = 0%, RTA = 52.15 ms [08:04:02] RECOVERY - Host lvs2003 is UP: PING OK - Packet loss = 0%, RTA = 53.07 ms [08:04:02] RECOVERY - Memcached on ms-fe2004 is OK: TCP OK - 0.054 second response time on port 11211 [08:04:10] RECOVERY - SSH on db2001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [08:04:10] RECOVERY - NTP on db2007 is OK: NTP OK: Offset -0.001999258995 secs [08:04:10] RECOVERY - check configured eth on db2017 is OK: NRPE: Unable to read output [08:04:10] RECOVERY - DPKG on db2034 is OK: All packages OK [08:04:10] RECOVERY - Disk space on ms-fe2003 is OK: DISK OK [08:04:11] RECOVERY - check configured eth on ms-be2011 is OK: NRPE: Unable to read output [08:04:11] RECOVERY - swift-object-replicator on ms-be2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [08:04:12] RECOVERY - swift-object-updater on 
ms-be2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [08:04:25] RECOVERY - check configured eth on ms-fe2004 is OK: NRPE: Unable to read output [08:04:32] RECOVERY - check if salt-minion is running on db2039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:04:32] RECOVERY - RAID on db2002 is OK: OK: optimal, 1 logical, 2 physical [08:04:32] RECOVERY - very high load average likely xfs on ms-be2001 is OK: OK - load average: 10.10, 12.91, 14.02 [08:04:32] RECOVERY - swift-container-replicator on ms-be2011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [08:04:32] RECOVERY - swift-object-replicator on ms-be2008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [08:04:33] RECOVERY - RAID on ms-be2006 is OK: OK: optimal, 14 logical, 14 physical [08:04:33] RECOVERY - Host ns1-v6 is UP: PING OK - Packet loss = 0%, RTA = 52.26 ms [08:04:47] RECOVERY - check if dhclient is running on install2001 is OK: PROCS OK: 0 processes with command name dhclient [08:04:47] RECOVERY - SSH on ms-fe2003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [08:04:47] RECOVERY - check if dhclient is running on db2034 is OK: PROCS OK: 0 processes with command name dhclient [08:04:47] RECOVERY - Disk space on ms-be2002 is OK: DISK OK [08:04:55] RECOVERY - Host 2620:0:860:2:d6ae:52ff:fead:5610 is UP: PING OK - Packet loss = 0%, RTA = 52.08 ms [08:04:55] RECOVERY - Host 208.80.153.42 is UP: PING OK - Packet loss = 0%, RTA = 52.04 ms [08:06:06] RECOVERY - Host cr1-codfw is UP: PING OK - Packet loss = 0%, RTA = 52.77 ms [08:06:26] PROBLEM - puppet last run on ms-be2002 is CRITICAL: CRITICAL: Puppet has 26 failures [08:06:37] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: puppet fail [08:06:37] PROBLEM - puppet last run on ms-be2006 is CRITICAL: CRITICAL: puppet fail [08:06:49] PROBLEM - puppet last run on ms-fe2004 is CRITICAL: CRITICAL: puppet fail [08:06:55] PROBLEM - puppet last run on db2033 is CRITICAL: CRITICAL: Puppet has 9 failures [08:07:06] PROBLEM - puppet last run on db2035 is CRITICAL: CRITICAL: puppet fail [08:07:25] PROBLEM - puppet last run on db2005 is CRITICAL: CRITICAL: puppet fail [08:11:25] PROBLEM - Swift HTTP backend on ms-fe2001 is CRITICAL: Connection timed out [08:11:35] PROBLEM - Swift HTTP backend on ms-fe2002 is CRITICAL: Connection timed out [08:11:35] PROBLEM - Swift HTTP backend on ms-fe2003 is CRITICAL: Connection timed out [08:12:16] PROBLEM - Host ns1-v4-old is DOWN: PING CRITICAL - Packet loss = 100% [08:12:45] PROBLEM - Host 2620:0:860:2:d6ae:52ff:fead:5610 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:2:d6ae:52ff:fead:5610 [08:12:47] PROBLEM - Host db2004 is DOWN: PING CRITICAL - Packet loss = 100% [08:12:47] PROBLEM - Host db2028 is DOWN: PING CRITICAL - Packet loss = 100% [08:13:05] PROBLEM - Host 2620:0:860:1:d6ae:52ff:feac:4dc8 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:1:d6ae:52ff:feac:4dc8 [08:13:09] PROBLEM - Host cr2-codfw is DOWN: CRITICAL - Plugin timed out after 15 seconds [08:13:09] PROBLEM - Host cr1-codfw is DOWN: CRITICAL - Plugin timed out after 15 seconds [08:13:10] PROBLEM - Host db2005 is DOWN: PING CRITICAL - Packet loss = 100% [08:13:10] PROBLEM - Host bast2001 is DOWN: PING CRITICAL - Packet loss = 100% [08:13:10] PROBLEM - Host labcontrol2001 is DOWN: PING CRITICAL - Packet loss = 100% [08:13:10] PROBLEM - Host ms-be2005 is DOWN: PING CRITICAL - Packet loss = 
100% [08:13:10] PROBLEM - Host ms-fe2003 is DOWN: PING CRITICAL - Packet loss = 100% [08:17:46] PROBLEM - Host install2001 is DOWN: CRITICAL - Time to live exceeded (208.80.153.4) [08:17:46] PROBLEM - Host baham is DOWN: CRITICAL - Time to live exceeded (208.80.153.13) [08:17:55] PROBLEM - Host ns1-v4-old is DOWN: PING CRITICAL - Packet loss = 100% [08:18:04] PROBLEM - Host 2620:0:860:2:d6ae:52ff:fead:5610 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:2:d6ae:52ff:fead:5610 [08:18:04] PROBLEM - Host labcontrol2001 is DOWN: CRITICAL - Time to live exceeded (208.80.153.14) [08:18:04] PROBLEM - Host ns1-v6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::e [08:18:05] PROBLEM - Host db2030 is DOWN: PING CRITICAL - Packet loss = 100% [08:18:05] PROBLEM - Host ms-fe2003 is DOWN: PING CRITICAL - Packet loss = 100% [08:18:05] PROBLEM - Host db2005 is DOWN: PING CRITICAL - Packet loss = 100% [08:18:05] PROBLEM - Host lvs2004 is DOWN: PING CRITICAL - Packet loss = 100% [08:18:06] PROBLEM - Host ms-be2011 is DOWN: PING CRITICAL - Packet loss = 100% [08:18:06] PROBLEM - Host db2010 is DOWN: PING CRITICAL - Packet loss = 100% [08:18:07] PROBLEM - Host ms-be2007 is DOWN: PING CRITICAL - Packet loss = 100% [08:18:07] PROBLEM - Host db2017 is DOWN: PING CRITICAL - Packet loss = 100% [08:18:08] PROBLEM - Host db2012 is DOWN: PING CRITICAL - Packet loss = 100% [08:18:08] PROBLEM - Host db2035 is DOWN: PING CRITICAL - Packet loss = 100% [08:18:09] PROBLEM - Host ms-be2006 is DOWN: PING CRITICAL - Packet loss = 100% [08:18:09] PROBLEM - Host db2038 is DOWN: PING CRITICAL - Packet loss = 100% [08:18:16] PROBLEM - Host lvs2005 is DOWN: CRITICAL - Plugin timed out after 15 seconds [08:18:16] PROBLEM - Host db2028 is DOWN: CRITICAL - Plugin timed out after 15 seconds [08:18:16] PROBLEM - Host db2023 is DOWN: CRITICAL - Plugin timed out after 15 seconds [08:18:16] PROBLEM - Host db2019 is DOWN: CRITICAL - Plugin timed out after 15 seconds [08:18:16] PROBLEM - Host ms-fe2004 is DOWN: CRITICAL - Plugin timed out after 15 seconds [08:18:16] PROBLEM - Host lvs2003 is DOWN: CRITICAL - Plugin timed out after 15 seconds [08:18:25] <_joe_> mmm [08:21:36] !log Reload Zuul to deploy 5e905e7c9dde9f47482d [08:21:44] Logged the message, Master [08:37:43] RECOVERY - Host labcontrol2001 is UP: PING OK - Packet loss = 0%, RTA = 52.67 ms [08:37:43] RECOVERY - Host lvs2006 is UP: PING OK - Packet loss = 0%, RTA = 54.07 ms [08:37:43] RECOVERY - Host db2038 is UP: PING OK - Packet loss = 0%, RTA = 52.38 ms [08:37:43] RECOVERY - Host ms-be2002 is UP: PING OK - Packet loss = 0%, RTA = 53.09 ms [08:37:43] RECOVERY - Host ms-fe2003 is UP: PING OK - Packet loss = 0%, RTA = 53.09 ms [08:37:43] RECOVERY - Host achernar is UP: PING OK - Packet loss = 0%, RTA = 53.49 ms [08:37:43] RECOVERY - Host db2023 is UP: PING OK - Packet loss = 0%, RTA = 53.78 ms [08:39:57] PROBLEM - puppet last run on acamar is CRITICAL: CRITICAL: puppet fail [08:40:08] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: puppet fail [08:40:08] RECOVERY - Host cr1-codfw is UP: PING OK - Packet loss = 0%, RTA = 53.69 ms [08:40:17] PROBLEM - puppet last run on ms-be2011 is CRITICAL: CRITICAL: puppet fail [08:40:17] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: puppet fail [08:40:18] PROBLEM - puppet last run on db2019 is CRITICAL: CRITICAL: puppet fail [08:40:27] PROBLEM - puppet last run on lvs2006 is CRITICAL: CRITICAL: puppet fail [08:40:28] PROBLEM - puppet last run on db2001 is CRITICAL: CRITICAL: puppet fail [08:40:32] PROBLEM - 
puppet last run on achernar is CRITICAL: CRITICAL: puppet fail [08:40:47] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: puppet fail [08:40:47] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: puppet fail [08:40:47] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: puppet fail [08:40:47] PROBLEM - puppet last run on db2030 is CRITICAL: CRITICAL: puppet fail [08:40:47] PROBLEM - puppet last run on ms-be2001 is CRITICAL: CRITICAL: puppet fail [08:40:47] PROBLEM - puppet last run on db2011 is CRITICAL: CRITICAL: puppet fail [08:40:47] PROBLEM - puppet last run on db2023 is CRITICAL: CRITICAL: puppet fail [08:40:48] PROBLEM - puppet last run on db2004 is CRITICAL: CRITICAL: puppet fail [08:40:48] PROBLEM - puppet last run on ms-be2008 is CRITICAL: CRITICAL: puppet fail [08:40:49] PROBLEM - puppet last run on db2002 is CRITICAL: CRITICAL: puppet fail [08:40:49] PROBLEM - puppet last run on ms-be2004 is CRITICAL: CRITICAL: puppet fail [08:40:50] PROBLEM - puppet last run on ms-be2009 is CRITICAL: CRITICAL: puppet fail [08:40:50] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: puppet fail [08:40:51] PROBLEM - puppet last run on db2007 is CRITICAL: CRITICAL: puppet fail [08:40:51] PROBLEM - puppet last run on db2009 is CRITICAL: CRITICAL: puppet fail [08:40:52] PROBLEM - puppet last run on db2016 is CRITICAL: CRITICAL: puppet fail [08:41:00] <_joe_> ns1 still unreachable [08:41:06] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: puppet fail [08:41:16] (03CR) 10Giuseppe Lavagetto: HAT: mark failed requests with an additional header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/165028 (owner: 10Giuseppe Lavagetto) [08:41:31] PROBLEM - puppet last run on lvs2001 is CRITICAL: CRITICAL: puppet fail [08:41:37] PROBLEM - puppet last run on db2038 is CRITICAL: CRITICAL: puppet fail [08:41:37] RECOVERY - puppet last run on db2035 is OK: OK: Puppet is currently enabled, last run 87 seconds ago with 0 failures [08:41:37] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: puppet fail [08:42:07] PROBLEM - puppet last run on lvs2003 is CRITICAL: CRITICAL: puppet fail [08:42:07] PROBLEM - puppet last run on ms-fe2002 is CRITICAL: CRITICAL: puppet fail [08:42:16] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: Puppet has 15 failures [08:42:19] PROBLEM - puppet last run on db2039 is CRITICAL: CRITICAL: puppet fail [08:42:26] PROBLEM - puppet last run on db2036 is CRITICAL: CRITICAL: puppet fail [08:42:27] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: puppet fail [08:42:37] PROBLEM - puppet last run on db2003 is CRITICAL: CRITICAL: puppet fail [08:42:48] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 9 failures [08:42:57] PROBLEM - puppet last run on db2034 is CRITICAL: CRITICAL: puppet fail [08:42:57] PROBLEM - puppet last run on db2029 is CRITICAL: CRITICAL: puppet fail [08:42:57] PROBLEM - puppet last run on db2037 is CRITICAL: CRITICAL: puppet fail [08:42:57] PROBLEM - puppet last run on ms-be2005 is CRITICAL: CRITICAL: puppet fail [08:42:57] PROBLEM - puppet last run on ms-be2012 is CRITICAL: CRITICAL: puppet fail [08:42:58] RECOVERY - puppet last run on ms-be2002 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [08:43:17] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [08:43:17] RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 5 seconds 
ago with 0 failures [08:43:27] RECOVERY - puppet last run on db2019 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [08:43:46] RECOVERY - puppet last run on db2005 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [08:43:47] RECOVERY - puppet last run on db2034 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [08:43:47] RECOVERY - puppet last run on db2009 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [08:44:08] RECOVERY - puppet last run on ms-fe2004 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [08:44:26] RECOVERY - puppet last run on ms-be2006 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [08:45:00] RECOVERY - puppet last run on db2002 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [08:45:01] RECOVERY - puppet last run on ms-be2004 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [08:45:26] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [08:46:07] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [08:46:39] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [08:46:46] RECOVERY - puppet last run on lvs2001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [08:46:47] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [08:46:57] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [08:47:06] RECOVERY - puppet last run on db2038 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [08:47:07] RECOVERY - puppet last run on ms-fe2003 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [08:47:27] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [08:48:07] RECOVERY - puppet last run on db2007 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [08:48:07] RECOVERY - puppet last run on db2016 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [08:48:07] RECOVERY - puppet last run on db2029 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [08:48:46] RECOVERY - puppet last run on ms-be2011 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [08:49:16] RECOVERY - puppet last run on db2037 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [08:50:17] PROBLEM - Swift HTTP backend on ms-fe2002 is CRITICAL: Connection timed out [08:50:17] PROBLEM - Swift HTTP backend on ms-fe2003 is CRITICAL: Connection timed out [08:50:58] PROBLEM - Host achernar is DOWN: PING CRITICAL - Packet loss = 100% [08:51:00] PROBLEM - Host cr2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [08:51:00] PROBLEM - Host labs-ns1.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [08:51:00] PROBLEM - Host 208.80.153.12 is DOWN: PING CRITICAL - Packet loss = 100% [08:51:00] PROBLEM - Host lvs2001 is DOWN: PING CRITICAL - Packet loss = 100% [08:51:30] PROBLEM - Host pollux is DOWN: CRITICAL - Plugin timed out after 15 seconds [08:51:38] 
PROBLEM - Host acamar is DOWN: PING CRITICAL - Packet loss = 100% [08:51:41] PROBLEM - Host db2007 is DOWN: PING CRITICAL - Packet loss = 100% [08:51:41] PROBLEM - Host db2003 is DOWN: PING CRITICAL - Packet loss = 100% [08:51:41] PROBLEM - Host bast2001 is DOWN: PING CRITICAL - Packet loss = 100% [08:51:41] PROBLEM - Host db2017 is DOWN: PING CRITICAL - Packet loss = 100% [08:51:41] PROBLEM - Host db2002 is DOWN: PING CRITICAL - Packet loss = 100% [08:51:41] PROBLEM - Host db2019 is DOWN: PING CRITICAL - Packet loss = 100% [08:51:41] PROBLEM - Host db2028 is DOWN: PING CRITICAL - Packet loss = 100% [08:54:04] (03PS4) 10Filippo Giunchedi: use scap's embedded linking, remove lint script [puppet] - 10https://gerrit.wikimedia.org/r/160691 (https://bugzilla.wikimedia.org/68255) [08:55:24] !log Building additional contint slaves in labs (integration-slave1004 with precise and integration-slave1009 with trusty) [08:55:32] Logged the message, Master [08:56:07] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] use scap's embedded linking, remove lint script [puppet] - 10https://gerrit.wikimedia.org/r/160691 (https://bugzilla.wikimedia.org/68255) (owner: 10Filippo Giunchedi) [08:56:49] hashar: ^ merged if you notice sth funny [08:57:17] godog: awesome [08:58:42] hashar: hm.. re-using slave1004 means ssh known_hosts complains, sorry, just noticed it afterwards forgot about that. edit your known_hosts to remove 1004 (and 1005 as well I suppose) [08:59:17] are you adding more Trusty slaves? [08:59:30] see log [08:59:57] :-] [09:00:17] Krinkle: you can !log in #wikimedia-qa as well [09:00:25] ? [09:00:29] hashar: what does that do [09:00:29] spurt out stuff on https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL [09:00:37] that is the equivalent of !log in -operations [09:00:45] but for folks managing integrateion and beta [09:00:50] i.e. the release engineering team [09:00:53] (and others) [09:01:15] hashar: this is ops-y though. What about labs integration SAL? [09:01:20] so we have three places [09:01:22] that is to avoid cluttering the production SAL with beta related stuff [09:01:30] this isn't beta stuff. I don't care about beta. [09:01:34] I no more use the integration / deployment-prep SAL [09:01:57] CI is being integrated under the ReleaseEngineering team :] [09:02:13] OK. I agree let's not use nova resources SAL for individual labs project (integration, and beta in this case) [09:02:28] and only use prod SAL and RelEng SAL? [09:02:50] I guess that goes for hashar and bd808|BUFFER as well :P https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/SAL [09:03:48] hashar: This ties quite closely to production though, with jenkins, gerrit and zuul, which is full prod. [09:04:10] I've never even heard of releng sal until now, I doubt ops will look there if stuff is up with jenkins and we're not around. [09:04:30] If this is going to happen, maybe announce to engineering/ops just so they know [09:04:41] hashar: what about Zuul and Jenkins restarts, where do you log those? [09:04:46] I am not going to log in two places. [09:04:49] Tell me which and I'll do it [09:06:19] (03CR) 10Filippo Giunchedi: [C: 031] delete nfs[12].pmtpa SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/164694 (owner: 10Dzahn) [09:07:18] Reedy: https://gerrit.wikimedia.org/r/#/c/164314 good to be merged I'd say? 
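The known_hosts cleanup suggested above does not need hand-editing; ssh-keygen can drop the stale entries directly (the fully-qualified labs hostnames here are assumptions):

    # Remove the old host keys for the rebuilt instances so the next
    # connection records the new keys without a mismatch warning.
    ssh-keygen -R integration-slave1004.eqiad.wmflabs
    ssh-keygen -R integration-slave1005.eqiad.wmflabs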
[09:07:27] Krinkle: I would log everything via #wikimedia-qa [09:07:40] Krinkle: though for outages I log in #wikimedia-operations :] [09:07:51] since that is where most folks look at [09:08:18] for Zuul configuration changes, I don't bother logging. I am assuming as soon as a change is +2ed / merged, it is always being deployed. [09:08:28] at worst git reflog provides some history [09:12:29] (03PS2) 10Filippo Giunchedi: lists.wm.org - raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/161177 (https://bugzilla.wikimedia.org/38516) (owner: 10Chmarkine) [09:12:50] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] lists.wm.org - raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/161177 (https://bugzilla.wikimedia.org/38516) (owner: 10Chmarkine) [09:16:43] (03PS5) 10Filippo Giunchedi: contint: python3.4 on Trusty labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/164071 (owner: 10Hashar) [09:16:49] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] contint: python3.4 on Trusty labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/164071 (owner: 10Hashar) [09:19:24] RECOVERY - Host ms-fe2001 is UP: PING OK - Packet loss = 0%, RTA = 54.08 ms [09:19:24] RECOVERY - Host db2019 is UP: PING OK - Packet loss = 0%, RTA = 53.35 ms [09:19:24] RECOVERY - Host lvs2006 is UP: PING OK - Packet loss = 0%, RTA = 53.38 ms [09:19:24] RECOVERY - Host lvs2004 is UP: PING OK - Packet loss = 0%, RTA = 54.66 ms [09:19:25] RECOVERY - Host db2009 is UP: PING OK - Packet loss = 0%, RTA = 54.13 ms [09:21:36] PROBLEM - puppet last run on ms-be2010 is CRITICAL: CRITICAL: puppet fail [09:21:38] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: puppet fail [09:21:48] PROBLEM - puppet last run on ms-be2006 is CRITICAL: CRITICAL: puppet fail [09:21:57] PROBLEM - puppet last run on db2005 is CRITICAL: CRITICAL: puppet fail [09:21:57] PROBLEM - puppet last run on db2010 is CRITICAL: CRITICAL: puppet fail [09:21:57] PROBLEM - puppet last run on db2019 is CRITICAL: CRITICAL: puppet fail [09:21:57] PROBLEM - puppet last run on db2028 is CRITICAL: CRITICAL: puppet fail [09:22:06] PROBLEM - puppet last run on db2036 is CRITICAL: CRITICAL: puppet fail [09:22:12] PROBLEM - puppet last run on db2039 is CRITICAL: CRITICAL: puppet fail [09:22:16] PROBLEM - puppet last run on db2002 is CRITICAL: CRITICAL: puppet fail [09:22:16] PROBLEM - puppet last run on db2007 is CRITICAL: CRITICAL: puppet fail [09:22:16] PROBLEM - puppet last run on ms-be2004 is CRITICAL: CRITICAL: puppet fail [09:22:16] PROBLEM - puppet last run on db2038 is CRITICAL: CRITICAL: puppet fail [09:22:16] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: puppet fail [09:22:17] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: puppet fail [09:22:17] PROBLEM - puppet last run on db2009 is CRITICAL: CRITICAL: puppet fail [09:22:18] PROBLEM - puppet last run on db2037 is CRITICAL: CRITICAL: puppet fail [09:22:18] PROBLEM - puppet last run on db2034 is CRITICAL: CRITICAL: puppet fail [09:22:19] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: puppet fail [09:22:19] PROBLEM - puppet last run on db2012 is CRITICAL: CRITICAL: puppet fail [09:22:26] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: puppet fail [09:22:27] PROBLEM - puppet last run on db2029 is CRITICAL: CRITICAL: puppet fail [09:22:27] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: puppet fail [09:22:36] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: puppet fail [09:22:37] PROBLEM -
puppet last run on ms-be2011 is CRITICAL: CRITICAL: puppet fail [09:22:37] PROBLEM - puppet last run on ms-fe2004 is CRITICAL: CRITICAL: puppet fail [09:23:26] PROBLEM - puppet last run on db2016 is CRITICAL: CRITICAL: puppet fail [09:23:27] RECOVERY - puppet last run on db2009 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [09:23:52] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [09:24:06] PROBLEM - puppet last run on lvs2001 is CRITICAL: CRITICAL: puppet fail [09:24:07] RECOVERY - puppet last run on db2005 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [09:24:07] RECOVERY - puppet last run on db2019 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [09:24:16] RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [09:24:27] RECOVERY - puppet last run on db2034 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [09:24:38] RECOVERY - puppet last run on ms-fe2004 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [09:24:57] RECOVERY - puppet last run on ms-be2006 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [09:25:17] RECOVERY - puppet last run on db2002 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [09:25:17] RECOVERY - puppet last run on ms-be2004 is OK: OK: Puppet is currently enabled, last run 63 seconds ago with 0 failures [09:26:25] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [09:26:39] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [09:26:41] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 64 seconds ago with 0 failures [09:26:50] RECOVERY - puppet last run on ms-fe2003 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [09:26:50] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [09:27:00] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [09:27:20] RECOVERY - puppet last run on lvs2001 is OK: OK: Puppet is currently enabled, last run 61 seconds ago with 0 failures [09:27:39] RECOVERY - puppet last run on db2038 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [09:27:54] RECOVERY - puppet last run on db2029 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [09:27:54] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [09:28:00] RECOVERY - puppet last run on ms-be2011 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [09:28:40] RECOVERY - puppet last run on db2007 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [09:28:49] RECOVERY - puppet last run on db2016 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [09:29:30] RECOVERY - puppet last run on lvs2006 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [09:29:30] RECOVERY - puppet last run on db2001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures 
[09:29:40] RECOVERY - puppet last run on db2037 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [09:30:09] RECOVERY - puppet last run on ms-be2001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [09:30:10] RECOVERY - puppet last run on db2004 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [09:30:19] RECOVERY - puppet last run on db2023 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [09:30:20] RECOVERY - puppet last run on acamar is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [09:30:20] RECOVERY - puppet last run on ms-be2008 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [09:30:30] RECOVERY - puppet last run on ms-be2012 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [09:30:39] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [09:30:50] RECOVERY - puppet last run on ms-be2005 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [09:31:40] RECOVERY - puppet last run on db2003 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [09:32:20] RECOVERY - puppet last run on lvs2003 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [09:33:20] RECOVERY - puppet last run on ms-fe2002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [09:33:20] RECOVERY - puppet last run on db2011 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [09:33:20] RECOVERY - puppet last run on db2030 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [09:33:49] RECOVERY - puppet last run on db2012 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [09:34:26] (03CR) 10Filippo Giunchedi: "following up from IRC, urllib3 is in trusty so in theory the remote api calls don't need to happen on localhost. 
However there will be som" [puppet] - 10https://gerrit.wikimedia.org/r/163945 (owner: 10Chad) [09:34:40] RECOVERY - puppet last run on db2028 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [09:35:09] RECOVERY - puppet last run on achernar is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [09:35:19] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [09:36:40] RECOVERY - puppet last run on db2010 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [09:37:10] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [09:37:51] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [09:37:59] RECOVERY - puppet last run on ms-be2009 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [09:39:21] RECOVERY - puppet last run on ms-be2010 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [09:45:42] (03CR) 10Filippo Giunchedi: "it seems the package is already maintained under debian-science, if we're not making significant changes perhaps a rebuild for trusty is e" [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/163577 (owner: 10KartikMistry) [09:55:59] !log start swiftrepl of non-commons originals eqiad -> codfw [09:56:07] Logged the message, Master [10:06:29] godog: did you see the ugly tiny shell script that I had for the various groups? [10:08:18] paravoid: yep I'm using that :)) thanks! [10:09:31] !log start swiftrepl of commons originals eqiad -> codfw [10:09:38] Logged the message, Master [10:10:07] ok [10:17:28] (03CR) 10Krinkle: "Is there a way to have it add up cpu values so that it measures the total and not individual user/system/idle etc. Similar to what Ganglia" [puppet] - 10https://gerrit.wikimedia.org/r/161015 (owner: 10Yuvipanda) [10:23:48] (03PS2) 10Giuseppe Lavagetto: HAT: mark failed requests with an additional header [puppet] - 10https://gerrit.wikimedia.org/r/165028 [10:23:52] varnish 500 for phabricator, fun :) [10:24:57] (03PS3) 10Giuseppe Lavagetto: HAT: mark failed requests with an additional header [puppet] - 10https://gerrit.wikimedia.org/r/165028 [10:29:57] so how do I get added to the ops group @ phabricator? [10:30:22] <_joe_> paravoid: when you find out, lemme know [10:30:36] <_joe_> I think that involves pinging chase, and bribing him [10:46:22] (03CR) 10Zfilipin: [C: 04-1] "Found one line that is no longer needed. More comments inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/163791 (owner: 10Krinkle) [10:53:09] !log restart commons swiftrepl from ms-fe1003 and non-commons from ms-fe1004 to avoid maxing out copper's nic [10:53:15] Logged the message, Master [10:53:17] 10gbit FTW [10:56:07] :) [10:56:59] why not run it on a frontend instead of on copper? [10:57:11] i guess using > 1 Gbps might load swift in eqiad too much though ;) [10:58:18] brb [10:59:28] haha we'll see! but yeah there are two swiftrepl now one on ms-fe1003 and one on ms-fe1004 [10:59:57] is aaron aware of these plans? [11:02:37] paravoid: replication-wise? there's been a thread on ops@ "swift replication in codfw: options" [11:05:17] how are we going to keep it in sync after swiftrepl? 
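One answer that comes up in the replies below is swift's built-in container sync, which is enabled per container with the standard swift client. A hedged sketch: the realm ('wmf'), destination cluster name ('codfw'), account and container names are placeholders and have to match container_sync_realms.conf on the proxies of both clusters:

    # Point an eqiad container at its codfw twin and set the shared sync key.
    swift post -t '//wmf/codfw/AUTH_mediawiki/wikipedia-commons-local-public' \
               -k 'shared-secret' wikipedia-commons-local-public

The swift-container-sync daemons then walk each local container database and push new or updated objects to the sync-to URL, which is also why the enumeration cost raised below grows with the number and size of containers.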
[11:07:15] good question, I played with swift's container sync and seemed to be working, I'm planning to publish a patch today to add that to the swift pipeline and then turn it on per-container and see how it does [11:13:35] <_joe_> godog: anything we decide to use better be network-partition tolerant :) [11:16:50] I sincerely doubt container sync will work at our scale... [11:17:08] unless they rewrote it since [11:18:21] we used to have mediawiki write to both swift clusters, but aaron wasn't fond of that solution at all [11:18:43] it was a hack on mediawiki's part for one, and of course it meant that writes were synchronous with all the performance/stability problems that entails [11:20:20] mediawiki does keep a log of operations in the database btw, but that wouldn't cover thumbnails obviously [11:20:48] but something like that, that doesn't involve enumerating all files in every container all the time, is probably the way to go [11:20:55] <_joe_> paravoid: my proposal was to use something like a message queue to feed new/modified originals to the replicated swift cluster, and maybe run switftrepl every N weeks to ensure everything is still in sync [11:21:14] <_joe_> paravoid: I guess thumbnails are supposed to be local to the DC [11:21:22] that doesn't cover it [11:21:35] they'll get outdated, and then we'll have an imagescaler outage on the DC switchover [11:22:09] <_joe_> paravoid: mmmh so you think we need to replicate all thumbnails as well? [11:22:26] <_joe_> can't we just do active-active in read [11:22:35] <_joe_> which would solve this problem [11:22:45] what do you mean? [11:23:28] <_joe_> if we do keep both swift clusters live for _reading_, it will make it so that both have reasonably good thumbnail caches [11:24:20] paravoid: swift container sync keeps track of where it was though [11:24:36] _joe_: that's harder than it sounds [11:24:46] (03CR) 10KartikMistry: "Yes. I took over maintenance (with help from upstream) as we want it available from official Ubuntu repository later. But, current version" [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/163577 (owner: 10KartikMistry) [11:25:20] <_joe_> paravoid: I wasn't convinced that was easy at all [11:25:21] _joe_: scaling requires mediawiki, which can't run active-active atm even for this simple case [11:25:37] (we tried, and failed because of memcache) [11:25:46] this means you'd scale on the other DC, which means latency [11:25:48] <_joe_> because of memcache? [11:26:56] I don't remember the details well, but I vaguely remember the imagescaling/file storage path of mediawiki to require memcache which was desynced between the two DCs at the time [11:27:43] <_joe_> I sincerely thought imagescalers could work active/active [11:33:07] (03PS1) 10Filippo Giunchedi: swift: update ganglia_new and gdash dashboards [puppet] - 10https://gerrit.wikimedia.org/r/165190 [11:34:14] not afaik [11:51:32] (03PS5) 10Giuseppe Lavagetto: mediawiki: consolidate apache configs [puppet] - 10https://gerrit.wikimedia.org/r/164358 [11:56:36] (03CR) 10Krinkle: "The 'libsikuli-script-java' package was already in the class (I moved it to reduce complexity, not introducing it). Submit a follow-up pat" [puppet] - 10https://gerrit.wikimedia.org/r/163791 (owner: 10Krinkle) [12:04:41] (03CR) 10Krinkle: [C: 04-1] "This adds legacy aliases to domains that currently don't have them." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/164358 (owner: 10Giuseppe Lavagetto) [12:06:55] (03PS4) 10KartikMistry: Update config for Language pairs [puppet] - 10https://gerrit.wikimedia.org/r/163841 [12:08:08] (03CR) 10Giuseppe Lavagetto: "About the legacy aliases, we discussed this with ori and we don't think adding those has any consequence apart from accepting more urls; o" [puppet] - 10https://gerrit.wikimedia.org/r/164358 (owner: 10Giuseppe Lavagetto) [12:09:08] (03CR) 10Giuseppe Lavagetto: mediawiki: consolidate apache configs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/164358 (owner: 10Giuseppe Lavagetto) [12:23:30] (03CR) 10Jsahleen: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/163841 (owner: 10KartikMistry) [12:32:12] anyone around who can talk about the status of sticking wikidata items? [12:32:20] I'm just seeing if I can help any [12:50:44] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: update ganglia_new and gdash dashboards [puppet] - 10https://gerrit.wikimedia.org/r/165190 (owner: 10Filippo Giunchedi) [12:52:23] <_joe_> manybubbles: have you seen my mail and hoo's response on ops@? [12:53:18] _joe_: still in the "superprotect used on wikidata thread? [12:56:34] <_joe_> manybubbles: yes [12:56:53] _joe_: so its not crashing but its still slow and sometimes runs out of memory? [12:58:10] <_joe_> manybubbles: < TimStarling> my test case is http://www.wikidata.org/w/index.php?title=Q183&oldid=143201634 [12:58:41] <_joe_> it still fails under zend, it just doesn't segfault anymore [12:59:59] <_joe_> or maybe it still segfaults, from what I see in the logs on logstash when requesting it [13:01:23] <_joe_> manybubbles: anyways, you would find more informed people on #wikidata, probably [13:07:07] (03CR) 10Filippo Giunchedi: [C: 031] "indeed, I was confused by the latest experimental upload in debian and thought it came from svn.debian.org." [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/163577 (owner: 10KartikMistry) [13:31:40] (03PS1) 10Filippo Giunchedi: swift: point ganglia codfw cluster to install2001 [puppet] - 10https://gerrit.wikimedia.org/r/165196 [13:38:50] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: point ganglia codfw cluster to install2001 [puppet] - 10https://gerrit.wikimedia.org/r/165196 (owner: 10Filippo Giunchedi) [13:48:56] hey andrewbogott [13:49:01] (03CR) 10BBlack: [C: 031] HAT: mark failed requests with an additional header [puppet] - 10https://gerrit.wikimedia.org/r/165028 (owner: 10Giuseppe Lavagetto) [13:49:18] 'morning [13:50:06] <_joe_> bblack: thanks [13:50:06] andrewbogott: mind if I bug you with https://gerrit.wikimedia.org/r/#/c/165123/? [13:50:20] I suspect that'll be my last neon affecting change in a while... [13:50:40] reading... [13:51:05] (03PS4) 10Giuseppe Lavagetto: HAT: mark failed requests with an additional header [puppet] - 10https://gerrit.wikimedia.org/r/165028 [13:51:22] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] HAT: mark failed requests with an additional header [puppet] - 10https://gerrit.wikimedia.org/r/165028 (owner: 10Giuseppe Lavagetto) [13:51:51] YuviPanda: so previously on each run it changed the ownership and then changed it back? dang [13:52:07] andrewbogott: it changed them back only for *some* files [13:52:23] andrewbogott: more sadly, it just made ownership as root, and gave a+rw permissions so icinga user can do things to it [13:52:26] which is stupider... 
[13:52:27] perhaps [13:52:30] (03CR) 10Andrew Bogott: [C: 032] icinga: Make all config files belong to icinga:icinga [puppet] - 10https://gerrit.wikimedia.org/r/165123 (owner: 10Yuvipanda) [13:52:40] andrewbogott: if you look at the chmods, they are all for a, rather than o or g [13:54:12] akosiaris: anything i can do to help with: https://gerrit.wikimedia.org/r/#/c/160628/ ? [13:54:36] andrewbogott: yay! :) do a icinga -v /etc/icinga/icinga.cfg as well? [13:54:43] (03CR) 10Matanya: "Anybody home?" [puppet] - 10https://gerrit.wikimedia.org/r/161184 (owner: 10Matanya) [13:55:27] (03CR) 10Matanya: "poke." [puppet] - 10https://gerrit.wikimedia.org/r/159462 (owner: 10Matanya) [13:55:33] (03PS2) 10Yuvipanda: varnish:qualify vars [puppet] - 10https://gerrit.wikimedia.org/r/161184 (owner: 10Matanya) [13:55:55] YuviPanda: must've been a typo that I missed "Error: /Stage[main]/Icinga/File[/etc/icinga/icinga.cfg]/group: change from root to icing failed: Could not find group icing" [13:56:05] andrewbogott: bah. [13:56:10] andrewbogott: let me submit [13:56:15] thanks [13:57:27] andrewbogott: ^ [13:57:27] (03PS1) 10Yuvipanda: icinga: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/165197 [13:57:28] err [13:57:37] <_joe_> stupid apache [13:58:41] andrewbogott: ^^ [13:59:13] (03CR) 10Andrew Bogott: [C: 032] icinga: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/165197 (owner: 10Yuvipanda) [13:59:27] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures [14:05:43] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [14:06:48] andrewbogott: yay! [14:07:09] matanya: I think I know as much as you do on that front [14:07:27] <^demon|away> I've been thinking... [14:07:32] thanks akosiaris. i'll poke chasemp later [14:07:34] andrewbogott: did icinga -v report no errors? [14:07:36] <^demon|away> The nice thing about the LAMP acronym is the L doesn't change when you change Linux distros or versions. [14:07:41] <^demon|away> HAT, while creative, will be obsolete when we move on from Trusty. Do we plan to rename HAT -> HAX then? [14:07:50] HAX doesn'ts ound too bad [14:07:51] YuviPanda: I don't know yet, puppet hasn't actually returned [14:07:55] andrewbogott: oh [14:07:57] andrewbogott: I see. [14:07:57] andrewbogott: ok [14:08:06] <^demon|away> YuviPanda: We got lucky on the LTS letters :p [14:08:08] andrewbogott: hmm, I just saw icinga-wm say no errors on neon... [14:08:11] ^demon|away: :D [14:08:19] <^demon|away> Could've been HAW or HAY :) [14:08:21] as long as it is not HOAX :) [14:08:26] ^demon|away: HAL? [14:08:41] then we need to hire a DC Ops person named Dave... [14:08:45] HA* [14:08:57] <^demon|away> I like HAL [14:08:58] HA* -> High Availability! [14:09:02] Webscale!!1 [14:09:08] HAU? [14:09:20] and now remove the A cause we are going nginx :P [14:09:27] haha :D [14:09:32] nginx -> varnish -> nginx -> hhvm :D [14:09:41] <^demon|away> HNT just doesn't roll off the tongue like HAT [14:09:47] <^demon|away> Or HNX [14:10:08] clearly we should keep apache around just for that case ;) [14:10:14] <^demon|away> Was about to say. [14:10:16] <^demon|away> Must keep apache. [14:10:22] YuviPanda: https://dpaste.de/5ADs [14:10:23] <^demon|away> ngnix is too hard to fit in a cute acronym. [14:10:47] andrewbogott: yay :) [14:10:49] seems ok... 
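The config pre-flight used here, spelled out as a sketch. The hostnames and paths are the ones that appear in the log; the flags are standard puppet/icinga:

    # on the icinga host (neon), after the change is puppet-merged:
    sudo puppet agent --test                  # regenerate the icinga config
    sudo icinga -v /etc/icinga/icinga.cfg     # parse and verify only, no restart
    # exit status 0 means the generated config loads; warnings (like the
    # duplicate service definitions pasted above) are still worth reading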
[14:12:02] Yeah, I don't know what's up with the duplicates but I can't imagine they're from your patch [14:12:13] YuviPanda: btw, did you see that I reverted one of your patches yesterday? [14:12:18] andrewbogott: oh? no, which one? [14:12:29] * YuviPanda checks [14:12:33] https://gerrit.wikimedia.org/r/#/c/165007/ [14:13:00] andrewbogott: ah, hmm. [14:13:08] andrewbogott: let me re-do that patch, and be more careful this time... [14:15:18] (03CR) 10Chad: "I'd be ok with tweaking this to use a request if that makes things easier on precise. Will amend today." [puppet] - 10https://gerrit.wikimedia.org/r/163945 (owner: 10Chad) [14:16:55] (03CR) 10Chad: "Also, while we're here. Where's the best place for me to stash elastictool.py? /usr/lib/python2.7?" [puppet] - 10https://gerrit.wikimedia.org/r/163945 (owner: 10Chad) [14:19:58] (03PS1) 10Alexandros Kosiaris: Remove /srv/deployment/mathoid/mathoid resource [puppet] - 10https://gerrit.wikimedia.org/r/165199 [14:23:05] K4: Dear anthropoid, the time has come. Please deploy Fundraising (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141007T1423). [14:25:47] (03PS1) 10Reedy: Remove Apache config stuff from noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165200 [14:26:59] (03PS1) 10Reedy: No one cares about lucene either [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165201 [14:27:34] (03CR) 10Alexandros Kosiaris: [C: 032] Remove /srv/deployment/mathoid/mathoid resource [puppet] - 10https://gerrit.wikimedia.org/r/165199 (owner: 10Alexandros Kosiaris) [14:27:41] Reedy: just squash both changes together :-] [14:27:50] pfft @{ [14:27:54] wth is that smilie [14:29:15] (03PS2) 10Reedy: Remove dead/out of date links from noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165200 [14:29:16] angry moustache [14:29:25] (03Abandoned) 10Reedy: No one cares about lucene either [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165201 (owner: 10Reedy) [14:29:49] chasemp: you were working on SNI with cp1008, right? [14:29:53] chasemp: has a bunch of duplicate defs... https://dpaste.de/5ADs [14:30:41] (03CR) 10Reedy: [C: 032] Remove dead/out of date links from noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165200 (owner: 10Reedy) [14:30:48] (03Merged) 10jenkins-bot: Remove dead/out of date links from noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165200 (owner: 10Reedy) [14:31:12] !log reedy Synchronized docroot and w: Fixup noc (duration: 00m 16s) [14:31:22] Logged the message, Master [14:32:11] (03PS1) 10Alexandros Kosiaris: Followup commit to bfaaff1. Remove the unneeded require [puppet] - 10https://gerrit.wikimedia.org/r/165202 [14:32:44] Hmm [14:32:51] NOC is behind misc varnish now, isn't it? [14:32:58] (03PS1) 10Zfilipin: contint: Sikuli is no longer used anywhere [puppet] - 10https://gerrit.wikimedia.org/r/165204 (https://bugzilla.wikimedia.org/54393) [14:33:22] Or is it because sync fails on terbium... [14:34:41] (03CR) 10Zfilipin: "Done: https://gerrit.wikimedia.org/r/#/c/165204/" [puppet] - 10https://gerrit.wikimedia.org/r/163791 (owner: 10Krinkle) [14:35:10] Can someone rm -rf terbium:/srv/mediawiki/docroot/noc/dbtree please? [14:35:23] Permissions are wrong, so just need it removing so I can sync it back from tin [14:36:20] (03PS1) 10Giuseppe Lavagetto: mediawiki: workaround an apache bug [puppet] - 10https://gerrit.wikimedia.org/r/165206 [14:36:22] <_joe_> Reedy: add that to the operations queue on phabricator [14:36:30] ... 
[14:36:32] <_joe_> which is read-only, unfortunately [14:36:33] <_joe_> :P [14:36:42] <_joe_> just joking, gimme 2 mins [14:36:55] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] mediawiki: workaround an apache bug [puppet] - 10https://gerrit.wikimedia.org/r/165206 (owner: 10Giuseppe Lavagetto) [14:36:58] thanks [14:37:33] <_joe_> did I ever said I hate apache? [14:37:54] more than once [14:37:57] :-) [14:38:06] (03CR) 10Alexandros Kosiaris: [C: 032] Followup commit to bfaaff1. Remove the unneeded require [puppet] - 10https://gerrit.wikimedia.org/r/165202 (owner: 10Alexandros Kosiaris) [14:38:47] <_joe_> Reedy: which permissions should it have [14:38:53] <_joe_> I can just change those [14:39:24] <_joe_> Reedy: check it now [14:39:27] I guess mwdeploy:mwdeploy should own it at least [14:39:33] RECOVERY - mathoid on sca1002 is OK: HTTP OK: HTTP/1.1 200 OK - 301 bytes in 0.038 second response time [14:39:37] yes! [14:40:03] RECOVERY - mathoid on sca1001 is OK: HTTP OK: HTTP/1.1 200 OK - 301 bytes in 0.022 second response time [14:40:05] where physikerwelt ? he would be thrilled with this... [14:40:35] <_joe_> a nodejs app responding correctly to health checks? I'm impressed as well [14:41:57] mutante: jzerebecki I'm mostly done with the icinga work, and will start poking at shinken now. [14:42:13] (03CR) 10Hashar: [C: 031] contint: Sikuli is no longer used anywhere [puppet] - 10https://gerrit.wikimedia.org/r/165204 (https://bugzilla.wikimedia.org/54393) (owner: 10Zfilipin) [14:42:24] mutante: jzerebecki it shouldn't be too hard to set up icinga on labs, but I'll probably not get to that until after shinken stuff is done. feel free to add me as a reviewer to patches though! [14:44:29] (03CR) 10Mark Bergsma: "/usr/lib/python2.7 is never the right place unless it's installed by a Debian package. You probably want /usr/local/lib/python2.7/site-pac" [puppet] - 10https://gerrit.wikimedia.org/r/163945 (owner: 10Chad) [14:46:06] (03PS1) 10Reedy: Remove report.py from noc index as cgi-bin doesn't work [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165207 [14:47:09] (03PS1) 10Reedy: Remove interwiki.cdb symlink (unused) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165208 [14:49:13] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The correct way to set up python modules and their binaries is by creating a proper python module, with setup.py, and the entry_points['co" [puppet] - 10https://gerrit.wikimedia.org/r/163945 (owner: 10Chad) [14:49:32] PROBLEM - Host ps1-c2-pmtpa is DOWN: CRITICAL - Plugin timed out after 15 seconds [14:49:33] PROBLEM - Host ps1-c1-pmtpa is DOWN: CRITICAL - Plugin timed out after 15 seconds [14:49:42] (03PS1) 10Reedy: Remove cirrus.dblist symlink (doesn't exist) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165210 [14:49:43] \o/ [14:49:53] PROBLEM - Host ps1-c3-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% [14:49:59] <_joe_> is tampa gone for good? 
[14:50:09] almost [14:50:15] die die die [14:50:31] very few signs of life left ;) [14:50:43] <_joe_> the only thing that sucks being in a remote team is we can't properly celebrate [14:50:46] man I remember when I was hired [14:50:53] PROBLEM - Host ps1-d2-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% [14:51:10] when people were telling me that "we also have this tampa DC but it's probably not worth explaining to you much about it, it's about to go away" [14:51:19] that was true [14:51:34] PROBLEM - Host ps1-d1-pmtpa is DOWN: CRITICAL - Plugin timed out after 15 seconds [14:51:37] _joe_: there's time to fedex whisky out for the next ops meeting [14:52:06] tampa has served us for nearly 11 years [14:52:19] no hurricanes, plenty of power failures ;p [14:52:44] (03CR) 10Reedy: [C: 032] Remove cirrus.dblist symlink (doesn't exist) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165210 (owner: 10Reedy) [14:52:54] and a lawnmower [14:52:59] (03Merged) 10jenkins-bot: Remove cirrus.dblist symlink (doesn't exist) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165210 (owner: 10Reedy) [14:53:10] and a few fiber cuts, that probably aggregate to less downtime compared to what codfw has so far :) [14:53:21] (03PS2) 10Reedy: Remove report.py from noc index as cgi-bin doesn't work [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165207 [14:53:23] <_joe_> paravoid: :/ [14:53:36] (03CR) 10Reedy: [C: 032] Remove report.py from noc index as cgi-bin doesn't work [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165207 (owner: 10Reedy) [14:53:42] (03Merged) 10jenkins-bot: Remove report.py from noc index as cgi-bin doesn't work [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165207 (owner: 10Reedy) [14:56:11] (03PS2) 10Reedy: Remove interwiki.cdb symlink (unused) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165208 [14:56:35] (03CR) 10Reedy: [C: 032] Remove interwiki.cdb symlink (unused) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165208 (owner: 10Reedy) [14:56:41] (03Merged) 10jenkins-bot: Remove interwiki.cdb symlink (unused) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165208 (owner: 10Reedy) [14:58:54] (03PS1) 10Yuvipanda: dynamicproxy: Block tweetmemebot for tools proxy [puppet] - 10https://gerrit.wikimedia.org/r/165212 (https://bugzilla.wikimedia.org/71120) [14:58:57] Betacommand: ^ [14:59:41] YuviPanda: I must have that bot on ignore [14:59:56] Betacommand: ah, heh. https://gerrit.wikimedia.org/r/#/c/165212/ [15:00:05] manybubbles, anomie, ^d, marktraceur: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141007T1500). Please do the needful. [15:00:35] * anomie sees nothing for SWAT this morning [15:00:38] !log reedy Synchronized docroot and w: (no message) (duration: 00m 14s) [15:00:44] Logged the message, Master [15:01:28] (03CR) 10Amire80: [C: 031] contint: Sikuli is no longer used anywhere [puppet] - 10https://gerrit.wikimedia.org/r/165204 (https://bugzilla.wikimedia.org/54393) (owner: 10Zfilipin) [15:02:27] (03PS1) 10Alexandros Kosiaris: Give sca cluster the mathoid LVS IP [puppet] - 10https://gerrit.wikimedia.org/r/165214 [15:03:48] (03CR) 10Alexandros Kosiaris: [C: 032] Give sca cluster the mathoid LVS IP [puppet] - 10https://gerrit.wikimedia.org/r/165214 (owner: 10Alexandros Kosiaris) [15:03:48] YuviPanda: I saw a message about SNI, but can't find it :) my bouncer is not happy, but fyi I didn't do much of the SNI stuff in the end, mark and (I think?) 
brandon knocked it out while I was away [15:04:00] chasemp: ah, hmm. [15:04:08] chasemp: it just seems to be including https monitoring twice [15:04:09] I think [15:04:39] bblack: SNI related duplicate https defs on cp1008 in icinga: https://dpaste.de/5ADs [15:07:03] PROBLEM - Disk space on analytics1035 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 148727 MB (3% inode=99%): [15:07:12] I thought what they did affected only the web-misc cluster? [15:07:17] but I'm really not sure even [15:11:19] Betacommand: need to get Coren to merge https://gerrit.wikimedia.org/r/#/c/165212/ [15:12:28] (03CR) 10coren: [C: 032] "Catch-by-UA is rarely sufficient, but this'll help." [puppet] - 10https://gerrit.wikimedia.org/r/165212 (https://bugzilla.wikimedia.org/71120) (owner: 10Yuvipanda) [15:12:47] Coren: heh, we already have iptables blocks for their IP, I think [15:12:57] Some of them, at least. [15:13:01] yeah [15:13:13] All the more annoying because they don't actually /exist/ anymore. [15:13:25] yeah [15:13:37] * YuviPanda forces a run on tools-webproxy [15:13:44] hmm, didn't pick it up? [15:13:48] * YuviPanda waits for puppet-merge [15:17:01] It's been puppet-merged already; you probably were faster than the speed of pull. :-) [15:17:53] Coren: heh :) [15:18:00] yeah, I did it again after about 15s and it was fine [15:19:48] ok what about SNI and certs and blah blah? :) [15:19:56] I've been missing some stuff here [15:20:22] bblack: hah :) [15:20:41] bblack: https://dpaste.de/5ADs is output of icinga -v, says https service is duplicated on cp1008 (where SNI tests were being done, IIRC) [15:22:11] I just reloaded icinga like yesterday and it was fine. so, someone borked it since then :p [15:23:10] when people make monitoring changes, they should really really run puppet manually on masters then neon, or at least tail the puppet log and wait for the run on neon that makes changes [15:23:15] it gets borked so often [15:23:26] bblack: nah, this has been the case for quite a while now, I think? I saw this 2 weeks back as well... [15:23:33] bblack: it's not an error, so isn't borked. just a warning [15:23:38] oh? [15:23:53] bblack: yeah [15:24:02] bblack: so icinga starts up fine, etc. [15:24:03] bblack: run a 'how to icinga for ops' session :p [15:28:52] (03PS1) 10BBlack: fix dupe monitor defs on cp1008 [puppet] - 10https://gerrit.wikimedia.org/r/165222 [15:29:16] (03CR) 10BBlack: [C: 032 V: 032] fix dupe monitor defs on cp1008 [puppet] - 10https://gerrit.wikimedia.org/r/165222 (owner: 10BBlack) [15:29:32] bblack: thanks! [15:29:43] now to figure out the etherpad one [15:31:26] Coren: I see the CA patch, and I like it, but I don't have the bandwidth to validate it just yet. [15:32:27] again it's one of those things where I fear a small oversight could break prod SSL. 
We'll probably want to check them in puppet-compiler, and then shut off puppet on them, do just one, validate that the results look sane, do the rest, etc [15:36:27] YuviPanda: icinga -v gives 13x "Warning:" lines, then says "Total Warnings: 0" at the bottom of the run :) [15:38:35] out by 13 error [15:41:15] well now that the HTTPS checks are fixed, it's just an everday off-by-one error :p [15:46:27] (03PS1) 10Andrew Bogott: Change ldap 'master' settings in firstboot.sh [puppet] - 10https://gerrit.wikimedia.org/r/165226 [15:53:21] (03PS6) 10Giuseppe Lavagetto: mediawiki: consolidate apache configs [puppet] - 10https://gerrit.wikimedia.org/r/164358 [15:57:07] bblack: Fair enough; that _is_ why I wanted you in that particular loop. :-) [15:58:13] (03PS2) 10Andrew Bogott: Change ldap 'master' settings in firstboot.sh [puppet] - 10https://gerrit.wikimedia.org/r/165226 [15:58:15] (03PS1) 10Andrew Bogott: Removed a bunch of 'refreshonly' directives. [puppet] - 10https://gerrit.wikimedia.org/r/165227 [15:59:36] (03CR) 10Andrew Bogott: [C: 032] Change ldap 'master' settings in firstboot.sh [puppet] - 10https://gerrit.wikimedia.org/r/165226 (owner: 10Andrew Bogott) [15:59:57] <_joe_> Krinkle|detached: https://gerrit.wikimedia.org/r/164358 amended with your suggestions (partially); I think the patch is better now, thanks! [16:00:01] (03CR) 10Andrew Bogott: [C: 032] Removed a bunch of 'refreshonly' directives. [puppet] - 10https://gerrit.wikimedia.org/r/165227 (owner: 10Andrew Bogott) [16:02:06] (03PS1) 10Andrew Bogott: Remove autofs setup in vmbuilder. [puppet] - 10https://gerrit.wikimedia.org/r/165228 [16:04:12] (03CR) 10Andrew Bogott: [C: 032] Remove autofs setup in vmbuilder. [puppet] - 10https://gerrit.wikimedia.org/r/165228 (owner: 10Andrew Bogott) [16:04:15] <_joe_> bbl [16:05:51] (03CR) 10Reedy: [C: 031] "Yay" [puppet] - 10https://gerrit.wikimedia.org/r/164358 (owner: 10Giuseppe Lavagetto) [16:10:36] bblack: heh, yeah :) icinga can't count [16:15:49] (03PS1) 10Reedy: Cleanup CentralNotice enabling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165230 [16:16:10] (03PS1) 10BryanDavis: iegreview: Create module and role for deployment [puppet] - 10https://gerrit.wikimedia.org/r/165231 [16:16:12] (03PS1) 10BryanDavis: iegreview: Apply role to zirconium and configure varnish [puppet] - 10https://gerrit.wikimedia.org/r/165232 [16:16:31] (03PS2) 10Reedy: Cleanup CentralNotice enabling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165230 [16:18:30] (03CR) 10BryanDavis: "Inline comment about need for Sean to pick the right db server to host this." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/165231 (owner: 10BryanDavis) [16:19:41] (03PS2) 10BryanDavis: iegreview: Create module and role for deployment [puppet] - 10https://gerrit.wikimedia.org/r/165231 [16:20:00] (03PS2) 10BryanDavis: iegreview: Apply role to zirconium and configure varnish [puppet] - 10https://gerrit.wikimedia.org/r/165232 [16:21:03] (03PS2) 10Alexandros Kosiaris: remove sanger [dns] - 10https://gerrit.wikimedia.org/r/164261 (owner: 10Dzahn) [16:22:35] !log Created echo tables on fawikivoyage on extension1 cluster [16:22:42] Logged the message, Master [16:22:44] (03PS1) 10BryanDavis: iegreview: Put iegreview.wikimedia.org behind misc-web-lb.eqiad [dns] - 10https://gerrit.wikimedia.org/r/165236 [16:22:46] (03CR) 10Alexandros Kosiaris: [C: 032] remove sanger [dns] - 10https://gerrit.wikimedia.org/r/164261 (owner: 10Dzahn) [16:23:43] (03PS1) 10Reedy: Enable Echo on fawikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165237 [16:23:57] (03CR) 10Reedy: [C: 032] Enable Echo on fawikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165237 (owner: 10Reedy) [16:24:04] (03Merged) 10jenkins-bot: Enable Echo on fawikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165237 (owner: 10Reedy) [16:24:30] !log reedy Synchronized database lists: echo for fawikivoyage (duration: 00m 20s) [16:24:35] Logged the message, Master [16:26:07] (03PS3) 10BryanDavis: iegreview: Create module and role for deployment [puppet] - 10https://gerrit.wikimedia.org/r/165231 (https://bugzilla.wikimedia.org/71597) [16:26:26] (03PS3) 10BryanDavis: iegreview: Apply role to zirconium and configure varnish [puppet] - 10https://gerrit.wikimedia.org/r/165232 (https://bugzilla.wikimedia.org/71597) [16:26:57] (03PS2) 10BryanDavis: iegreview: Put iegreview.wikimedia.org behind misc-web-lb.eqiad [dns] - 10https://gerrit.wikimedia.org/r/165236 (https://bugzilla.wikimedia.org/71597) [16:28:43] (03PS2) 10Filippo Giunchedi: swift: add container sync [puppet] - 10https://gerrit.wikimedia.org/r/160430 [16:29:47] (03PS1) 10Cscott: Disable PediaPress POD function. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165239 (https://bugzilla.wikimedia.org/71675) [16:34:20] (03PS2) 10Reedy: WIP: Extract wmf-beta-scap to runas-withagent wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/160960 [16:38:04] (03PS3) 10Reedy: WIP: Extract wmf-beta-scap to runas-withagent wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/160960 [16:41:24] (03PS4) 10Reedy: WIP: Extract wmf-beta-scap to runas-withagent wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/160960 [16:41:51] (03PS5) 10Reedy: Extract wmf-beta-scap to runas-withagent wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/160960 [16:43:12] (03PS2) 10Giuseppe Lavagetto: mediawiki: add HHVM proxy rules in main.conf [puppet] - 10https://gerrit.wikimedia.org/r/159490 [16:43:15] andrewbogott, hey [16:43:27] Krenair: have you tried turning it off and on again? [16:43:37] ummm. [16:43:47] <_joe_> hello IT? [16:43:47] https://bugzilla.wikimedia.org/show_bug.cgi?id=71731 [16:44:16] Oh, nope, that actually fixed it. [16:44:18] Thanks andrewbogott. [16:44:22] :) [16:44:23] np [16:44:24] (I logged out and back in again.) 
[16:44:32] It's a stupid bug, I don't know why it's been biting extra hard this week [16:44:56] (03PS5) 10Reedy: Use sync-dir to copy out l10n json files, build cdbs on hosts [puppet] - 10https://gerrit.wikimedia.org/r/158623 (https://bugzilla.wikimedia.org/70443) [16:45:25] (I was running into it on NovaProxy) [16:45:31] (03PS6) 10Reedy: Extract wmf-beta-scap to runas-withagent wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/160960 [16:45:56] (03CR) 10John F. Lewis: [C: 031] fix cert mismatch on mail.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/154223 (https://bugzilla.wikimedia.org/44731) (owner: 10Jeremyb) [16:46:57] (03PS6) 10Reedy: Use sync-dir to copy out l10n json files, build cdbs on hosts [puppet] - 10https://gerrit.wikimedia.org/r/158623 (https://bugzilla.wikimedia.org/70443) [16:48:40] bd808: I think the l10nupdate fanout stuff is done ^^ [16:48:46] (03PS3) 10Reedy: Remove sync-l10nupdate(-1)? [puppet] - 10https://gerrit.wikimedia.org/r/158624 [16:48:55] oh cool [16:49:22] bd808: greg-g just said springle was just having issues with tin because of it [16:49:27] thought I might aswell get it sorted at least [16:49:28] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [16:49:56] Yeah s.pringle emailed ops-l about how tin was crushed during the sync [16:51:56] Reedy: Is sudo-withagent straight from my prior code? I think I see a bug with it. The sudo stanza won't pass RUN_AS as $1 because of the shift. [16:52:19] Ah [16:52:29] No, I moved and (broke) altered it [16:52:35] So the shift should come after the sudo or the sudo should add in the user [16:53:14] Oh yeah I would have been hard coding RUN_AS before [16:54:07] PROBLEM - Swift HTTP backend on ms-fe2002 is CRITICAL: Connection timed out [16:54:16] PROBLEM - Swift HTTP backend on ms-fe2003 is CRITICAL: Connection timed out [16:54:27] PROBLEM - Host baham is DOWN: PING CRITICAL - Packet loss = 100% [16:54:36] PROBLEM - Host install2001 is DOWN: CRITICAL - Plugin timed out after 15 seconds [16:54:39] PROBLEM - Host labcontrol2001 is DOWN: CRITICAL - Plugin timed out after 15 seconds [16:54:54] bd808: I'm not sure I quite follow [16:55:27] PROBLEM - Host 2620:0:860:1:d6ae:52ff:feac:4dc8 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:1:d6ae:52ff:feac:4dc8 [16:55:27] RECOVERY - Host baham is UP: PING OK - Packet loss = 0%, RTA = 54.22 ms [16:55:29] RECOVERY - Host labcontrol2001 is UP: PING OK - Packet loss = 0%, RTA = 53.35 ms [16:55:37] RECOVERY - Host install2001 is UP: PING OK - Packet loss = 0%, RTA = 53.38 ms [16:56:27] PROBLEM - Host cr2-codfw is DOWN: CRITICAL - Plugin timed out after 15 seconds [16:57:27] PROBLEM - Host ns1-v6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::e [16:57:47] PROBLEM - Host 208.80.153.42 is DOWN: PING CRITICAL - Packet loss = 100% [16:58:06] Reedy: $@ = "reedy /some/app arg1 arg2"; then shift and sudo and $@ = "/some/app arg1 arg2"; The second invocation will pop /someapp as the RUN_AS var and loop. 
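For reference, a minimal sketch of the shift-ordering bug bd808 describes above. This is just the shape of such a wrapper, not the actual sudo-withagent source:

    #!/bin/bash
    # usage: sudo-withagent USER COMMAND [ARGS...]
    RUN_AS=$1
    if [ "$(id -un)" != "$RUN_AS" ]; then
        # re-exec with the ORIGINAL "$@" so the user name is still $1 on
        # the second pass; the buggy version did `shift` before this line,
        # so the re-invocation popped the command path as RUN_AS and looped
        exec sudo -u "$RUN_AS" -- "$0" "$@"
    fi
    shift                     # safe now: running as RUN_AS, drop the user name
    exec ssh-agent -- "$@"    # run the wrapped command under a fresh ssh-agent

Moving the shift below the sudo, as suggested, means the argument list is only consumed once the wrapper is already running as the target user.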
[16:58:06] PROBLEM - Host labs-ns1.wikimedia.org is DOWN: CRITICAL - Plugin timed out after 15 seconds [16:58:07] PROBLEM - Host labcontrol2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:58:07] PROBLEM - Host achernar is DOWN: PING CRITICAL - Packet loss = 100% [16:58:07] PROBLEM - Host acamar is DOWN: PING CRITICAL - Packet loss = 100% [16:58:07] PROBLEM - Host db2007 is DOWN: PING CRITICAL - Packet loss = 100% [16:58:07] PROBLEM - Host ms-be2006 is DOWN: PING CRITICAL - Packet loss = 100% [16:58:25] bd808: Ah, gotcha [16:58:37] Reedy: Easy fix is to move the shift below the sudo [16:58:43] is the icinga-wm bot code somewhere I can snag it for office it? [16:59:13] cajoel: hey! it's just irccecho + icinga writing to a monitored file... [16:59:22] <^d> Yeah, I was going to say it's just a lame ircecho script. [16:59:29] <^d> (really, ircecho is lame :p) [16:59:34] cajoel: manifests/role/echoirc.pp [16:59:38] merci [16:59:43] cajoel: code should be in puppet. [16:59:45] ah, bah. [17:00:07] ^d: I think chasemp was talking about a shared redis pubsub/queue that would output to irc, sal, graphite event, etc based on rules... [17:00:21] (03PS7) 10Reedy: Extract wmf-beta-scap to runas-withagent wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/160960 [17:00:44] that's a dream, but nowhere here is it reality [17:00:57] reality is tail -f [17:02:36] PROBLEM - Host 2620:0:860:2:d6ae:52ff:fead:5610 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:2:d6ae:52ff:fead:5610 [17:02:36] PROBLEM - Host cr1-codfw is DOWN: CRITICAL - Plugin timed out after 15 seconds [17:04:40] bd808: fixed :) [17:04:41] ?? [17:04:53] again!? [17:04:54] wtf [17:04:57] mark: ^ [17:05:07] yeah I know [17:05:22] seriously? [17:06:32] i wonder what % of codfw customers were affected by the last fiber cut. or whatever this problem is. 
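Conceptually, the icinga-wm pipeline cajoel asks about above reduces to the following. The real ircecho is a small Python daemon (see manifests/role/echoirc.pp); the log path and the send_to_irc helper are illustrative only:

    # icinga appends human-readable notifications to a file, and the bot
    # does nothing smarter than follow it, hence "reality is tail -f":
    tail -F /var/log/icinga/irc.log | while read -r line; do
        send_to_irc "#wikimedia-operations" "$line"   # hypothetical helper
    done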
[17:06:44] (03CR) 1020after4: [C: 031] Use sync-dir to copy out l10n json files, build cdbs on hosts [puppet] - 10https://gerrit.wikimedia.org/r/158623 (https://bugzilla.wikimedia.org/70443) (owner: 10Reedy) [17:06:54] (03CR) 1020after4: [C: 031] Extract wmf-beta-scap to runas-withagent wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/160960 (owner: 10Reedy) [17:07:16] hopefully a lot, so that they build more redundancy [17:09:27] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [17:20:27] (03PS8) 10Reedy: Extract wmf-beta-scap to sudo-withagent wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/160960 [17:23:07] (03CR) 10Reedy: [C: 031] "We just need to get ops to merge it now :)" [puppet] - 10https://gerrit.wikimedia.org/r/160960 (owner: 10Reedy) [17:34:07] RECOVERY - Host cr1-codfw is UP: PING WARNING - Packet loss = 44%, RTA = 73.35 ms [17:34:16] RECOVERY - Host ms-be2008 is UP: PING OK - Packet loss = 0%, RTA = 53.60 ms [17:34:17] RECOVERY - Host lvs2006 is UP: PING OK - Packet loss = 0%, RTA = 52.82 ms [17:34:17] RECOVERY - Host labcontrol2001 is UP: PING OK - Packet loss = 0%, RTA = 53.86 ms [17:34:17] RECOVERY - Host ms-be2006 is UP: PING OK - Packet loss = 0%, RTA = 55.05 ms [17:34:17] RECOVERY - Host db2023 is UP: PING OK - Packet loss = 0%, RTA = 55.04 ms [17:36:27] PROBLEM - puppet last run on ms-fe2004 is CRITICAL: CRITICAL: puppet fail [17:36:27] PROBLEM - puppet last run on lvs2001 is CRITICAL: CRITICAL: puppet fail [17:36:37] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: puppet fail [17:36:37] PROBLEM - puppet last run on db2035 is CRITICAL: CRITICAL: puppet fail [17:36:37] PROBLEM - puppet last run on ms-be2001 is CRITICAL: CRITICAL: puppet fail [17:36:38] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: puppet fail [17:36:38] PROBLEM - puppet last run on ms-be2010 is CRITICAL: CRITICAL: puppet fail [17:36:47] PROBLEM - puppet last run on db2002 is CRITICAL: CRITICAL: puppet fail [17:36:56] PROBLEM - puppet last run on lvs2006 is CRITICAL: CRITICAL: puppet fail [17:36:57] PROBLEM - puppet last run on db2034 is CRITICAL: CRITICAL: puppet fail [17:36:57] PROBLEM - puppet last run on db2033 is CRITICAL: CRITICAL: puppet fail [17:36:57] PROBLEM - puppet last run on db2036 is CRITICAL: CRITICAL: puppet fail [17:36:59] PROBLEM - puppet last run on ms-be2011 is CRITICAL: CRITICAL: puppet fail [17:36:59] PROBLEM - puppet last run on db2029 is CRITICAL: CRITICAL: puppet fail [17:36:59] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: puppet fail [17:36:59] PROBLEM - puppet last run on ms-fe2002 is CRITICAL: CRITICAL: puppet fail [17:36:59] PROBLEM - puppet last run on db2038 is CRITICAL: CRITICAL: puppet fail [17:36:59] PROBLEM - puppet last run on ms-be2009 is CRITICAL: CRITICAL: puppet fail [17:37:00] PROBLEM - puppet last run on db2016 is CRITICAL: CRITICAL: puppet fail [17:37:00] PROBLEM - puppet last run on db2009 is CRITICAL: CRITICAL: puppet fail [17:37:00] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: puppet fail [17:37:01] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: puppet fail [17:37:01] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: puppet fail [17:37:07] PROBLEM - puppet last run on db2039 is CRITICAL: CRITICAL: puppet fail [17:37:07] PROBLEM - puppet last run on lvs2005 is CRITICAL: CRITICAL: puppet fail [17:37:07] PROBLEM - puppet last run on db2007 is CRITICAL: CRITICAL: puppet fail [17:37:07] PROBLEM - puppet last 
run on ms-be2002 is CRITICAL: CRITICAL: puppet fail [17:37:07] PROBLEM - puppet last run on db2005 is CRITICAL: CRITICAL: puppet fail [17:37:07] PROBLEM - puppet last run on acamar is CRITICAL: CRITICAL: puppet fail [17:37:07] PROBLEM - puppet last run on ms-be2006 is CRITICAL: CRITICAL: puppet fail [17:37:17] PROBLEM - puppet last run on ms-be2005 is CRITICAL: CRITICAL: puppet fail [17:37:18] PROBLEM - puppet last run on db2023 is CRITICAL: CRITICAL: puppet fail [17:37:18] PROBLEM - puppet last run on ms-be2012 is CRITICAL: CRITICAL: puppet fail [17:37:36] PROBLEM - puppet last run on db2019 is CRITICAL: CRITICAL: puppet fail [17:37:36] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: puppet fail [17:37:36] PROBLEM - puppet last run on ms-be2004 is CRITICAL: CRITICAL: puppet fail [17:37:36] PROBLEM - puppet last run on ms-be2008 is CRITICAL: CRITICAL: puppet fail [17:37:36] PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: puppet fail [17:37:36] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: puppet fail [17:37:37] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: puppet fail [17:37:37] PROBLEM - puppet last run on db2037 is CRITICAL: CRITICAL: puppet fail [17:37:46] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [17:38:16] RECOVERY - puppet last run on lvs2005 is OK: OK: Puppet is currently enabled, last run 69 seconds ago with 0 failures [17:38:56] RECOVERY - puppet last run on ms-be2009 is OK: OK: Puppet is currently enabled, last run 63 seconds ago with 0 failures [17:39:57] RECOVERY - puppet last run on ms-be2010 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [17:40:07] RECOVERY - puppet last run on db2033 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [17:40:26] (03PS1) 10coren: Labs: Allow using a full MX configurably [puppet] - 10https://gerrit.wikimedia.org/r/165249 [17:40:38] RECOVERY - puppet last run on db2019 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [17:40:57] RECOVERY - puppet last run on db2035 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [17:41:24] (03CR) 10Faidon Liambotis: [C: 04-2] "This is horrible, please no." [puppet] - 10https://gerrit.wikimedia.org/r/165249 (owner: 10coren) [17:41:32] chasemp, ori, ensure_packges: yes or no? [17:42:01] I'm not I know what the story is? [17:42:04] ha, should I use it? [17:42:09] i want to install python3 on analytics worker nodes [17:42:16] ah, ori knows that one then :) [17:42:30] paravoid: ... horrible? Care to be a bit more specific? It's not unusual to have a class switch behaviour according to a puppet var [17:42:37] RECOVERY - puppet last run on ms-be2002 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [17:42:48] not sure what the preferred option for something like that is, dummy wrapper class for python3? ensure_packages? if !defined(...)? [17:43:17] RECOVERY - puppet last run on db2034 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [17:43:36] to change what a role class does altogether depending on what a global variable is set to? [17:43:37] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [17:43:39] no we don't do that [17:43:43] that's a terrible pattern [17:43:49] Coren: this is your chance to be a heira hero! 
[17:43:57] (That's what I told Jeff too but he wasn't in to it) [17:44:07] hiera hiera hiera! [17:44:12] <_joe_> ottomata: eh. [17:44:17] RECOVERY - puppet last run on ms-fe2004 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [17:44:23] <_joe_> but hiera in labs is still a problem [17:44:36] <_joe_> given we need an interface for it in wikitech [17:44:37] RECOVERY - puppet last run on db2009 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [17:44:47] RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [17:44:47] RECOVERY - puppet last run on db2005 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [17:44:47] RECOVERY - puppet last run on ms-be2006 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [17:44:54] paravoid: I honestly don't see it; how is selecting between "be a dumb forwarder" and "be a full mx" completely different? One is a strict subset of the other. [17:44:55] By 'be a heira hero' I meant 'totally implement heira in labs in order to implement this one little change' :) [17:44:59] _joe_: opinions on ensure_packages? [17:44:59] <_joe_> adn I _don't_ have time to code it [17:45:09] RECOVERY - puppet last run on ms-be2004 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [17:45:14] <_joe_> ottomata: look at require_packages [17:45:22] Coren: if in doubt, don't use global variables altogether [17:45:27] RECOVERY - puppet last run on db2002 is OK: OK: Puppet is currently enabled, last run 69 seconds ago with 0 failures [17:45:47] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [17:46:03] paravoid: Oh, I agree that's a *very* good principle; the problem is that puppet is really really dumb about one class overriding another. (Yes, hiera could help here) [17:46:10] <_joe_> Coren: use a class parameter [17:46:17] * andrewbogott -> lunch [17:46:26] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:46:35] not that either [17:46:36] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [17:46:40] separate role classes [17:46:44] don't include the one you don't want to [17:46:46] <_joe_> or ^^ [17:47:07] and don't try to solve unrelated abstraction problems you're having in role/mail.pp [17:47:12] <_joe_> paravoid: yeah I forgot the ldap based ENC allows that [17:47:15] paravoid: We have no mechanism to state "include this everywhere except where that is included" [17:47:16] RECOVERY - puppet last run on ms-fe2003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [17:47:17] RECOVERY - puppet last run on lvs2001 is OK: OK: Puppet is currently enabled, last run 62 seconds ago with 0 failures [17:47:27] <_joe_> Coren: we have in fact [17:47:34] _joe_: We do? [17:47:37] RECOVERY - puppet last run on ms-be2011 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [17:47:37] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:47:37] RECOVERY - puppet last run on db2029 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [17:47:42] XY problem; why is that included in the first place? 
[17:47:50] you know we have mailservers in production too, right? :) [17:47:51] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [17:47:56] RECOVERY - puppet last run on db2007 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [17:48:05] <_joe_> Coren: hiera_include('mx_class', 'the_base_one') [17:48:08] <_joe_> :) [17:48:09] i.e. we have role::mail::mx in production [17:48:17] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [17:48:34] <_joe_> paravoid: all labs instances probably already include the class Coren was trying to modify [17:48:37] RECOVERY - puppet last run on db2038 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:48:49] I am well aware :) [17:48:54] * Coren ponders. [17:49:01] either fix it one-off, with puppetmaster::self, or fix _that_ [17:49:02] _joe_: That might actually work. [17:49:08] <_joe_> Coren: there are several ways to solve this [17:49:17] RECOVERY - puppet last run on db2037 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [17:49:24] i.e. remove role::mail::sender from the default set of classes that labs include [17:49:31] (03Abandoned) 10coren: Labs: Allow using a full MX configurably [puppet] - 10https://gerrit.wikimedia.org/r/165249 (owner: 10coren) [17:49:33] possibly add it to a "standard" labs class that can be removed [17:49:37] RECOVERY - puppet last run on lvs2006 is OK: OK: Puppet is currently enabled, last run 61 seconds ago with 0 failures [17:49:47] RECOVERY - puppet last run on db2016 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:49:57] <_joe_> paravoid: mmmh that pattern is not optimal as well IMO [17:50:07] RECOVERY - puppet last run on db2023 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [17:50:07] RECOVERY - puppet last run on ms-be2012 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [17:50:07] <_joe_> but sorry, time for dinner! [17:50:12] paravoid: Any solution that leaves an instance in labs without exactly one of role::mail::sender and role::mail::mx is a problem. [17:50:16] RECOVERY - puppet last run on ms-be2005 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [17:50:18] why? [17:50:27] RECOVERY - puppet last run on ms-be2001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [17:50:37] What the hell is with the constant codfw power emails [17:50:44] device up device down device up [17:50:56] its spammign the shit out of noc [17:51:04] robh: codfw comes and goes [17:51:05] paravoid: because otherwise they mail out cronspam from root@evil.name.wmflabs [17:51:07] RECOVERY - puppet last run on acamar is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [17:51:08] fibre issues [17:51:19] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [17:51:22] RECOVERY - puppet last run on ms-be2008 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [17:51:22] Coren: ? 
[17:51:36] if they have neither role::mail::sender nor role::mail:mx, they won't mail cronspam at all [17:52:15] robh: both mark and I are on it with telia, we talked about this extensively during the ops meeting you missed [17:53:07] paravoid: without exim4.minimal they use whatever default mail output various programs try to use, which end up sending mail directly by SMTP. Badly. (IIRC you're correct cron tries to say local - but php tries to be too smart for instance) [17:53:38] And crontabs with MAILTO also cause issues. [17:53:56] RECOVERY - puppet last run on ms-fe2002 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:54:36] RECOVERY - puppet last run on achernar is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [17:58:00] ottomata ^d let me know if you'd need the es deb uploaded, I'll be around for another ~1h [17:59:19] no hurry for me, not going to work on the python thing for a while, and it is not blocking [17:59:26] i think they might need it for their upgrade(?) [17:59:31] manybubbles: ? [17:59:50] godog, btw, that's mostly just changing the version number in reprepro updates in puppet [17:59:56] and i think running reprepro update :) [18:00:05] Reedy, greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141007T1800). [18:00:18] es being elastic search? [18:00:28] ottomata: oh ok so it is just that, sweet [18:00:39] Reedy: ye oops not external storage [18:00:44] heh [18:00:56] Reedy: I don't have a clue about external storage tho, what is it? :P [18:00:57] o yeah, require_package, right [18:00:58] i remember this [18:01:01] godog: mysql :) [18:01:06] <^d> godog: Yes, please so I can start the process on beta. [18:01:06] i think...? [18:01:37] we can upgrade logstash (again) sooon [18:01:49] <^d> godog: Yep, I see the 1.3.4 package on packages.elasticsearch.org [18:01:52] <^d> .deb, that is [18:03:41] just a point release? [18:04:22] (03PS1) 10Ottomata: require_package python3 on Hadoop nodes [puppet] - 10https://gerrit.wikimedia.org/r/165255 [18:05:04] <^d> Reedy: Yeah, 1.3.2 -> 1.3.4. It's got some minor improvements we could make use of and 1.4 is still a tad far off :) [18:05:17] Fair enough [18:05:30] <^d> 1.4 we need though, lots of reasons. [18:06:49] I can get logstash done too [18:07:16] PROBLEM - CI tmpfs disk space on gallium is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 21 MB (4% inode=99%): [18:07:37] PROBLEM - Disk space on gallium is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 19 MB (3% inode=99%): [18:08:41] ottomata: do you remember how to reprepro update a single package/distribution as opposed to every distribution ? [18:09:56] nope :/ [18:10:08] i just run reprepro update, but, i think the updates file is pretty strict as to what versions get updated [18:10:22] so, if you run update, i don't think anything will change except for what you change [18:10:24] akosiaris: might know more? 
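To make the cronspam failure mode Coren describes above concrete: an instance crontab like the following (the job path is hypothetical) produces mail for any output, and with no sanely routed local MTA, cron or PHP's mail() will attempt SMTP delivery on their own:

    # crontab on a labs instance
    MAILTO=root
    */5 * * * * /usr/local/bin/some-labs-job
    # with neither role::mail::sender nor role::mail::mx applied, output
    # leaves the instance as root@<instance>.wmflabs, the cronspam above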
[18:10:51] (03CR) 10Ottomata: [C: 032 V: 032] require_package python3 on Hadoop nodes [puppet] - 10https://gerrit.wikimedia.org/r/165255 (owner: 10Ottomata) [18:11:10] ah nevermind I think I worked around it :) [18:11:53] ^d: should be good to go [18:12:17] (namely by temporarily changing Update: elasticsearch only) [18:12:34] <^d> Let's have a look [18:14:16] <^d> godog: Working fine on deployment-elastic01 in beta, thx [18:14:38] ^d: np [18:14:41] * YuviPanda gently pokes ^d about graphite - elasticsearch [18:16:19] <^d> Oh yeah, I meant to review that plugin. [18:16:41] for mutante http://devopsreactions.tumblr.com/post/99391149783/sysadmins-being-introduced-to-kanban [18:18:28] <^d> YuviPanda: "Configuration is possible via three parameters:" [18:18:34] <^d> He then goes on to list 6 [18:19:03] oozing quality, I see [18:24:30] (03PS1) 10Reedy: Non wikipedias to 1.25wmf2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165258 [18:25:01] <^d> !log jenkins tmpfs run out of space again, tests failing [18:25:06] Logged the message, Master [18:28:50] (03CR) 10Dzahn: [C: 031] iegreview: Put iegreview.wikimedia.org behind misc-web-lb.eqiad [dns] - 10https://gerrit.wikimedia.org/r/165236 (https://bugzilla.wikimedia.org/71597) (owner: 10BryanDavis) [18:30:18] (03CR) 10Dzahn: [C: 031] iegreview: Apply role to zirconium and configure varnish [puppet] - 10https://gerrit.wikimedia.org/r/165232 (https://bugzilla.wikimedia.org/71597) (owner: 10BryanDavis) [18:32:26] (03CR) 10Reedy: [C: 032] Non wikipedias to 1.25wmf2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165258 (owner: 10Reedy) [18:32:33] (03Merged) 10jenkins-bot: Non wikipedias to 1.25wmf2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165258 (owner: 10Reedy) [18:34:30] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias to 1.25wmf2 [18:34:37] Logged the message, Master [18:37:47] (03PS3) 10Reedy: Cleanup CentralNotice enabling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165230 [18:37:55] (03CR) 10Reedy: [C: 032] Cleanup CentralNotice enabling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165230 (owner: 10Reedy) [18:38:48] (03Merged) 10jenkins-bot: Cleanup CentralNotice enabling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165230 (owner: 10Reedy) [18:39:51] (03CR) 10Dzahn: "some inline comments and nice to have would be a short README.md in the module root, just cause they are parsed for doc.wm.org" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/165231 (https://bugzilla.wikimedia.org/71597) (owner: 10BryanDavis) [18:39:51] gah, our checkcommands.cfg is a pain :| [18:39:55] * YuviPanda goes to refactor some more [18:42:36] !log csteipp Synchronized php-1.25wmf2/extensions/CentralAuth/CentralAuthPlugin.php: (no message) (duration: 00m 06s) [18:42:43] Logged the message, Master [18:43:28] !log sanger - deleted salt key, revoked puppet cert, rm icinga stored config, already out of DNS - Killing sanger.wikimedia.org...done. 
[18:43:32] (03CR) 10BryanDavis: iegreview: Create module and role for deployment (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/165231 (https://bugzilla.wikimedia.org/71597) (owner: 10BryanDavis) [18:45:12] !log deployed fix for bug 71749 [18:45:17] Logged the message, Master [18:47:11] (03CR) 10Dzahn: [C: 031] "overall i think it's pretty nice and clean, using new Apache setup, parameterized and all" [puppet] - 10https://gerrit.wikimedia.org/r/165231 (https://bugzilla.wikimedia.org/71597) (owner: 10BryanDavis) [18:48:44] (03CR) 10Manybubbles: [C: 031] "I uploaded the plugins to archiva and successfully git deployed them to beta to validate that they were in archiva properly." [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/164633 (owner: 10Manybubbles) [18:48:51] YuviPanda: maybe we can remove some of them that were only used by Tampa hosts which i'm removing :) [18:49:11] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [18:49:41] ACKNOWLEDGEMENT - Host ps1-c1-pmtpa is DOWN: CRITICAL - Plugin timed out after 15 seconds daniel_zahn Tampa :) [18:49:41] ACKNOWLEDGEMENT - Host ps1-c2-pmtpa is DOWN: CRITICAL - Plugin timed out after 15 seconds daniel_zahn Tampa :) [18:49:41] ACKNOWLEDGEMENT - Host ps1-c3-pmtpa is DOWN: CRITICAL - Plugin timed out after 15 seconds daniel_zahn Tampa :) [18:49:41] ACKNOWLEDGEMENT - Host ps1-d1-pmtpa is DOWN: CRITICAL - Plugin timed out after 15 seconds daniel_zahn Tampa :) [18:49:41] ACKNOWLEDGEMENT - Host ps1-d2-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn Tampa :) [18:50:02] mutante: checkcommands? yeah, that would be nice :) [18:50:11] mutante: although if they could be used in the future, I don't mind [18:50:13] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [18:50:24] mutante: the biggest problem now is the template variables. I'll need to move them into individual files [18:50:44] mutante: and even though we have http.cfg, dns.cfg, etc, they have the actual used check commands commented out and moved to checkcommands.cfg [18:50:46] which is, ugh [18:51:47] i'm not sure when that split into separate files started. used checkcommands.cfg but not the separate ones [18:52:10] originally everything was just in one file [18:52:18] mutante: I think the separate files were just from packages, and somehow got co-opted into puppet [18:52:31] ah, yeah, that would make sense [18:52:34] I've a vague feeling that the comands defined in the separate files aren't used *at all* [18:52:55] i think that is true because i dont remember ever touching them [18:52:57] a good thing to do would be to install the packages and see what files pop up there... [18:53:04] while i have debugged icinga issues quite a bit [18:53:41] YuviPanda: dpkg -L [18:54:24] hmm, I need a precise machine... [18:54:48] bah, a machine that has icinga installed... [18:54:51] PROBLEM - CI tmpfs disk space on gallium is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 26 MB (5% inode=99%): [18:54:56] hold on [18:55:13] mutante: can you paste? 
:) [18:55:32] yea, what was the phabricator link for the pastebin :) [18:55:35] just a sec [18:55:36] aaah [18:56:05] hit plus sign, create "pastebin" instead of "task" :) [18:57:19] That disk space issue on gallium appears to be causing issues: https://integration.wikimedia.org/ci/job/mwext-Collection-testextension/21/console [18:57:40] (not sure if anyone had noticed) [18:57:48] (03CR) 10Reedy: "Still needed?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164723 (owner: 10Hoo man) [18:58:25] (03Abandoned) 10Reedy: Enable Echo for Persian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164491 (https://bugzilla.wikimedia.org/71669) (owner: 10Reza) [18:58:39] YuviPanda: how's that? https://phabricator.wikimedia.org/P3 [18:58:50] Krenair: Yup. I've already filed an RT ticket asking for more ram so we can have a larger ramdisk [18:58:54] Reedy: Ah, forgot about that [18:59:07] ah nice, mutante [18:59:39] mutante: yup, most of them seem to be from nagios-plugin-basic [18:59:41] But I still want to do that, yes [18:59:52] mutante: we're using the executable files from there, but not the cfg [18:59:54] YuviPanda: ack [19:00:00] (03PS4) 10Reedy: Enable "import" on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164933 (https://bugzilla.wikimedia.org/71681) (owner: 10Reza) [19:00:56] mutante: I'm trying to figure out how to do this properly... [19:01:37] (03PS5) 10Ori.livneh: base::standard-packages: install `perf` [puppet] - 10https://gerrit.wikimedia.org/r/164883 [19:01:47] (03CR) 10Ori.livneh: "ping?" [puppet] - 10https://gerrit.wikimedia.org/r/164883 (owner: 10Ori.livneh) [19:02:27] (03CR) 10Reedy: [C: 032] Enable "import" on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164933 (https://bugzilla.wikimedia.org/71681) (owner: 10Reza) [19:02:42] (03Merged) 10jenkins-bot: Enable "import" on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164933 (https://bugzilla.wikimedia.org/71681) (owner: 10Reza) [19:03:05] (03CR) 10Reedy: [C: 04-1] "Needs rebasing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/139421 (https://bugzilla.wikimedia.org/66587) (owner: 10Reedy) [19:03:50] PROBLEM - Disk space on gallium is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 19 MB (3% inode=99%): [19:05:04] YuviPanda: i'm not sure yet. let puppet ensure them abent if unused? just to be simpler to debu [19:05:08] g [19:05:08] <^d> Yes yes, we know icinga. [19:05:47] mutante: they aren't unused, is the problem. they're modified :| [19:05:56] (03PS2) 10Reedy: Require 10 edits for autoconfirmed on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165014 (https://bugzilla.wikimedia.org/71709) (owner: 10Calak) [19:06:02] (03CR) 10Reedy: [C: 032] Require 10 edits for autoconfirmed on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165014 (https://bugzilla.wikimedia.org/71709) (owner: 10Calak) [19:06:07] what ^d just said should be a bot trigger for _actually_ telling icinga that, and it would actually stop [19:06:23] <^d> Hehe [19:06:25] mutante: so what we've to do is to undo the modifications, move custom things into custom cfg files when necessary, and then let icinga just use the files from the package... [19:06:33] (03Merged) 10jenkins-bot: Require 10 edits for autoconfirmed on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165014 (https://bugzilla.wikimedia.org/71709) (owner: 10Calak) [19:06:36] the shell part is already there mostly.. 
just need to integrate into a bot [19:08:03] RECOVERY - Disk space on gallium is OK: DISK OK [19:08:03] (03PS3) 10Reedy: Add namespace alias on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164773 (https://bugzilla.wikimedia.org/71668) (owner: 10Calak) [19:08:18] (03CR) 10Reedy: [C: 032] Add namespace alias on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164773 (https://bugzilla.wikimedia.org/71668) (owner: 10Calak) [19:08:25] (03Merged) 10jenkins-bot: Add namespace alias on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164773 (https://bugzilla.wikimedia.org/71668) (owner: 10Calak) [19:09:05] (03PS2) 10Reedy: Disable PediaPress POD function. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165239 (https://bugzilla.wikimedia.org/71675) (owner: 10Cscott) [19:09:10] (03CR) 10Reedy: [C: 032] Disable PediaPress POD function. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165239 (https://bugzilla.wikimedia.org/71675) (owner: 10Cscott) [19:09:17] (03Merged) 10jenkins-bot: Disable PediaPress POD function. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165239 (https://bugzilla.wikimedia.org/71675) (owner: 10Cscott) [19:09:30] <^d> !log cleared old files from runs on gallium tmpfs, testing should recover now. [19:09:30] RECOVERY - CI tmpfs disk space on gallium is OK: DISK OK [19:09:34] Logged the message, Master [19:11:00] (03PS6) 10Reedy: Re-enable all Math modes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/139421 (https://bugzilla.wikimedia.org/66587) [19:11:20] (03CR) 10Reedy: [C: 032] Re-enable all Math modes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/139421 (https://bugzilla.wikimedia.org/66587) (owner: 10Reedy) [19:11:27] (03Merged) 10jenkins-bot: Re-enable all Math modes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/139421 (https://bugzilla.wikimedia.org/66587) (owner: 10Reedy) [19:12:30] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 5 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [19:12:38] (03PS2) 10Reedy: Enable otherProjectsLinks by default on itwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164719 (https://bugzilla.wikimedia.org/71464) (owner: 10Glaisher) [19:15:52] (03PS1) 10GWicke: Reduce number of parsoid runners even further [puppet] - 10https://gerrit.wikimedia.org/r/165317 [19:16:32] (03CR) 10Reedy: [C: 032] Enable otherProjectsLinks by default on itwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164719 (https://bugzilla.wikimedia.org/71464) (owner: 10Glaisher) [19:16:46] (03Merged) 10jenkins-bot: Enable otherProjectsLinks by default on itwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164719 (https://bugzilla.wikimedia.org/71464) (owner: 10Glaisher) [19:18:41] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [19:19:09] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 26s) [19:19:14] Logged the message, Master [19:20:27] !log Created EducationProgram tables on cawiki [19:20:32] Logged the message, Master [19:21:44] (03CR) 10Subramanya Sastry: [C: 031] "Once we resolve the varnish issues, all the misses that lead to tpl/ext/image reqs. 
hitting the api cluster should start getting fulfilled" [puppet] - 10https://gerrit.wikimedia.org/r/165317 (owner: 10GWicke) [19:21:59] (03PS2) 10Reedy: Enable EducationProgram extension on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164532 (https://bugzilla.wikimedia.org/71381) (owner: 10Glaisher) [19:24:47] (03CR) 10Reedy: [C: 032] Enable EducationProgram extension on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164532 (https://bugzilla.wikimedia.org/71381) (owner: 10Glaisher) [19:24:53] (03Merged) 10jenkins-bot: Enable EducationProgram extension on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164532 (https://bugzilla.wikimedia.org/71381) (owner: 10Glaisher) [19:26:22] (03PS2) 10Reedy: Remove one wiki included by mistake in aeacad551 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163906 (https://bugzilla.wikimedia.org/71403) (owner: 10Nemo bis) [19:26:29] (03CR) 10Reedy: [C: 032] Remove one wiki included by mistake in aeacad551 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163906 (https://bugzilla.wikimedia.org/71403) (owner: 10Nemo bis) [19:26:36] (03Merged) 10jenkins-bot: Remove one wiki included by mistake in aeacad551 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163906 (https://bugzilla.wikimedia.org/71403) (owner: 10Nemo bis) [19:26:51] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 224, down: 0, dormant: 0, excluded: 0, unused: 0 [19:29:05] !log reedy Synchronized database lists: (no message) (duration: 00m 20s) [19:29:11] Logged the message, Master [19:29:55] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 23s) [19:30:01] Logged the message, Master [19:38:41] PROBLEM - puppetmaster https on palladium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error [19:38:41] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: Puppet has 2 failures [19:38:41] PROBLEM - puppetmaster backend https on palladium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error [19:38:46] PROBLEM - puppet last run on mw1103 is CRITICAL: CRITICAL: Puppet has 45 failures [19:38:50] PROBLEM - puppet last run on mw1015 is CRITICAL: CRITICAL: puppet fail [19:39:00] PROBLEM - puppet last run on db1068 is CRITICAL: CRITICAL: puppet fail [19:39:00] PROBLEM - puppet last run on mw1101 is CRITICAL: CRITICAL: puppet fail [19:39:01] PROBLEM - puppet last run on mw1047 is CRITICAL: CRITICAL: Puppet has 20 failures [19:39:01] PROBLEM - puppet last run on db1009 is CRITICAL: CRITICAL: Puppet has 18 failures [19:39:11] PROBLEM - puppet last run on ms-be1014 is CRITICAL: CRITICAL: puppet fail [19:39:11] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: puppet fail [19:39:12] PROBLEM - puppet last run on db1005 is CRITICAL: CRITICAL: puppet fail [19:39:12] PROBLEM - puppet last run on lanthanum is CRITICAL: CRITICAL: puppet fail [19:39:12] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 15 failures [19:39:21] PROBLEM - puppet last run on analytics1021 is CRITICAL: CRITICAL: Puppet has 22 failures [19:39:22] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Puppet has 23 failures [19:39:23] PROBLEM - puppet last run on mw1070 is CRITICAL: CRITICAL: puppet fail [19:39:23] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail [19:39:23] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: 
Puppet has 1 failures [19:39:23] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: Puppet has 12 failures [19:39:33] PROBLEM - puppet last run on mw1095 is CRITICAL: CRITICAL: Puppet has 64 failures [19:39:34] PROBLEM - puppet last run on tmh1002 is CRITICAL: CRITICAL: puppet fail [19:39:37] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 16 failures [19:39:37] PROBLEM - puppet last run on mw1102 is CRITICAL: CRITICAL: puppet fail [19:39:37] PROBLEM - puppet last run on mw1128 is CRITICAL: CRITICAL: Puppet has 30 failures [19:39:37] PROBLEM - puppet last run on mw1137 is CRITICAL: CRITICAL: Puppet has 42 failures [19:39:37] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Puppet has 29 failures [19:39:40] PROBLEM - puppet last run on tmh1001 is CRITICAL: CRITICAL: Puppet has 14 failures [19:39:40] PROBLEM - puppet last run on ms-be1005 is CRITICAL: CRITICAL: puppet fail [19:39:40] PROBLEM - puppet last run on elastic1009 is CRITICAL: CRITICAL: Puppet has 25 failures [19:39:41] PROBLEM - puppet last run on mc1015 is CRITICAL: CRITICAL: Puppet has 11 failures [19:39:41] PROBLEM - puppet last run on analytics1012 is CRITICAL: CRITICAL: Puppet has 10 failures [19:39:41] PROBLEM - puppet last run on mw1017 is CRITICAL: CRITICAL: Puppet has 63 failures [19:39:41] PROBLEM - puppet last run on es1005 is CRITICAL: CRITICAL: puppet fail [19:39:42] PROBLEM - puppet last run on mw1058 is CRITICAL: CRITICAL: Puppet has 67 failures [19:39:50] PROBLEM - puppet last run on cp1008 is CRITICAL: CRITICAL: puppet fail [19:39:50] PROBLEM - puppet last run on zinc is CRITICAL: CRITICAL: Puppet has 21 failures [19:39:50] PROBLEM - puppet last run on hydrogen is CRITICAL: CRITICAL: puppet fail [19:39:50] PROBLEM - puppet last run on virt1005 is CRITICAL: CRITICAL: Puppet has 25 failures [19:39:51] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.046 second response time [19:39:51] PROBLEM - puppet last run on mw1078 is CRITICAL: CRITICAL: Puppet has 58 failures [19:39:51] PROBLEM - puppet last run on search1003 is CRITICAL: CRITICAL: Puppet has 48 failures [19:39:51] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: puppet fail [19:39:52] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: puppet fail [19:39:52] PROBLEM - puppet last run on wtp1014 is CRITICAL: CRITICAL: puppet fail [19:39:53] PROBLEM - puppet last run on amssq52 is CRITICAL: CRITICAL: Puppet has 23 failures [19:39:53] PROBLEM - puppet last run on db1049 is CRITICAL: CRITICAL: puppet fail [19:39:54] PROBLEM - puppet last run on mw1085 is CRITICAL: CRITICAL: Puppet has 67 failures [19:39:54] PROBLEM - puppet last run on mw1199 is CRITICAL: CRITICAL: Puppet has 52 failures [19:40:02] PROBLEM - puppet last run on db2010 is CRITICAL: CRITICAL: Puppet has 15 failures [19:40:02] PROBLEM - puppet last run on ms-be2009 is CRITICAL: CRITICAL: Puppet has 27 failures [19:40:02] PROBLEM - puppet last run on mw1019 is CRITICAL: CRITICAL: Puppet has 58 failures [19:40:03] PROBLEM - puppet last run on search1009 is CRITICAL: CRITICAL: puppet fail [19:40:03] PROBLEM - puppet last run on mw1214 is CRITICAL: CRITICAL: puppet fail [19:40:12] PROBLEM - puppet last run on mw1018 is CRITICAL: CRITICAL: Puppet has 56 failures [19:40:13] PROBLEM - puppet last run on labsdb1002 is CRITICAL: CRITICAL: Puppet has 20 failures [19:40:13] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 21 failures [19:40:13] PROBLEM - puppet last run on 
cp1064 is CRITICAL: CRITICAL: Puppet has 23 failures [19:40:13] PROBLEM - puppet last run on search1021 is CRITICAL: CRITICAL: puppet fail [19:40:13] PROBLEM - puppet last run on elastic1013 is CRITICAL: CRITICAL: Puppet has 22 failures [19:40:20] PROBLEM - puppet last run on labsdb1007 is CRITICAL: CRITICAL: Puppet has 27 failures [19:40:21] PROBLEM - puppet last run on elastic1010 is CRITICAL: CRITICAL: Puppet has 25 failures [19:40:30] PROBLEM - puppet last run on search1008 is CRITICAL: CRITICAL: Puppet has 38 failures [19:40:30] PROBLEM - puppet last run on erbium is CRITICAL: CRITICAL: puppet fail [19:40:30] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 27 failures [19:40:31] PROBLEM - puppet last run on db1058 is CRITICAL: CRITICAL: puppet fail [19:40:31] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 26 failures [19:40:31] PROBLEM - puppet last run on mw1157 is CRITICAL: CRITICAL: Puppet has 67 failures [19:40:31] PROBLEM - puppet last run on mw1075 is CRITICAL: CRITICAL: Puppet has 53 failures [19:40:32] PROBLEM - puppet last run on db1019 is CRITICAL: CRITICAL: Puppet has 19 failures [19:40:32] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: Puppet has 22 failures [19:40:33] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: Puppet has 25 failures [19:40:36] charming! [19:40:37] :S [19:40:40] PROBLEM - puppet last run on cp1057 is CRITICAL: CRITICAL: puppet fail [19:40:40] PROBLEM - puppet last run on search1019 is CRITICAL: CRITICAL: Puppet has 45 failures [19:40:41] PROBLEM - puppet last run on zirconium is CRITICAL: CRITICAL: puppet fail [19:40:50] PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: puppet fail [19:40:51] PROBLEM - puppet last run on db2033 is CRITICAL: CRITICAL: puppet fail [19:40:51] PROBLEM - puppet last run on mw1020 is CRITICAL: CRITICAL: Puppet has 67 failures [19:40:52] puppetmaster fail [19:41:05] PROBLEM - puppet last run on lvs2005 is CRITICAL: CRITICAL: Puppet has 24 failures [19:41:06] PROBLEM - puppet last run on mw1127 is CRITICAL: CRITICAL: puppet fail [19:41:06] PROBLEM - puppet last run on amssq58 is CRITICAL: CRITICAL: puppet fail [19:41:06] PROBLEM - puppet last run on ssl1004 is CRITICAL: CRITICAL: Puppet has 20 failures [19:41:06] PROBLEM - puppet last run on mw1136 is CRITICAL: CRITICAL: puppet fail [19:41:06] PROBLEM - puppet last run on lvs1006 is CRITICAL: CRITICAL: Puppet has 21 failures [19:41:07] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: puppet fail [19:41:10] PROBLEM - puppet last run on mw1169 is CRITICAL: CRITICAL: puppet fail [19:41:11] PROBLEM - puppet last run on elastic1016 is CRITICAL: CRITICAL: Puppet has 25 failures [19:41:11] PROBLEM - puppet last run on es1003 is CRITICAL: CRITICAL: Puppet has 19 failures [19:41:20] PROBLEM - puppet last run on ocg1002 is CRITICAL: CRITICAL: puppet fail [19:41:20] PROBLEM - puppet last run on db1041 is CRITICAL: CRITICAL: puppet fail [19:41:20] PROBLEM - puppet last run on mw1191 is CRITICAL: CRITICAL: Puppet has 67 failures [19:41:21] PROBLEM - puppet last run on virt1009 is CRITICAL: CRITICAL: Puppet has 23 failures [19:41:21] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: puppet fail [19:41:21] PROBLEM - puppet last run on analytics1024 is CRITICAL: CRITICAL: Puppet has 22 failures [19:41:22] !log restarting apache on palladium - mod_passenger fail [19:41:28] Logged the message, Master [19:41:29] thank goodness for the precise numbers [19:41:32] is someone tallying them up? 
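(Triage note: the storm above is the puppetmaster, palladium, returning HTTP 500s because mod_passenger wedged, not the individual hosts breaking; the passenger tracebacks and the fix follow just below. Roughly the sequence used, assuming a root shell on palladium — the exact restart command is an assumption, the log only says "restarting apache on palladium":)

    # on palladium: confirm the 500s come from passenger, not from puppet itself
    tail -n 100 /var/log/apache2/error.log   # full of Passenger::ApplicationPool tracebacks
    # bounce apache, which also restarts the mod_passenger application pool
    service apache2 restart
    # spot-check from any client; the rest recover on their next scheduled agent run
    puppet agent --test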
[19:41:33] in 'Passenger::ApplicationPoolPtr Passenger::ApplicationPoolServer::connect()' (ApplicationPoolServer.h:746) [19:41:33] PROBLEM - puppet last run on amssq45 is CRITICAL: CRITICAL: Puppet has 24 failures [19:41:34] PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: puppet fail [19:41:36] in 'int Hooks::handleRequest(request_rec*)' (Hooks.cpp:523) [19:41:41] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: puppet fail [19:41:41] PROBLEM - puppet last run on mw1083 is CRITICAL: CRITICAL: Puppet has 58 failures [19:41:41] PROBLEM - puppet last run on lvs4001 is CRITICAL: CRITICAL: Puppet has 23 failures [19:41:41] PROBLEM - puppet last run on ms-be2010 is CRITICAL: CRITICAL: Puppet has 26 failures [19:41:50] PROBLEM - puppet last run on mw1013 is CRITICAL: CRITICAL: Puppet has 50 failures [19:41:53] PROBLEM - puppet last run on mw1184 is CRITICAL: CRITICAL: Puppet has 70 failures [19:41:55] error log looks better already [19:42:00] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: puppet fail [19:42:00] PROBLEM - puppet last run on mw1138 is CRITICAL: CRITICAL: puppet fail [19:42:01] PROBLEM - puppet last run on labsdb1001 is CRITICAL: CRITICAL: puppet fail [19:42:01] PROBLEM - puppet last run on mw1062 is CRITICAL: CRITICAL: puppet fail [19:42:01] PROBLEM - puppet last run on db1010 is CRITICAL: CRITICAL: puppet fail [19:42:01] PROBLEM - puppet last run on mc1008 is CRITICAL: CRITICAL: Puppet has 23 failures [19:42:10] PROBLEM - puppet last run on es1009 is CRITICAL: CRITICAL: Puppet has 23 failures [19:42:10] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.034 second response time [19:42:10] PROBLEM - puppet last run on analytics1036 is CRITICAL: CRITICAL: puppet fail [19:42:11] PROBLEM - puppet last run on mw1036 is CRITICAL: CRITICAL: puppet fail [19:42:11] PROBLEM - puppet last run on mw1196 is CRITICAL: CRITICAL: Puppet has 67 failures [19:42:12] PROBLEM - puppet last run on mw1094 is CRITICAL: CRITICAL: Puppet has 73 failures [19:42:12] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: puppet fail [19:42:12] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: Puppet has 23 failures [19:42:18] PROBLEM - puppet last run on mw1109 is CRITICAL: CRITICAL: puppet fail [19:42:30] PROBLEM - puppet last run on search1014 is CRITICAL: CRITICAL: Puppet has 54 failures [19:42:31] PROBLEM - puppet last run on analytics1019 is CRITICAL: CRITICAL: Puppet has 18 failures [19:42:31] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Puppet has 24 failures [19:42:31] PROBLEM - puppet last run on ssl3003 is CRITICAL: CRITICAL: puppet fail [19:42:32] PROBLEM - puppet last run on mw1096 is CRITICAL: CRITICAL: Puppet has 69 failures [19:42:32] PROBLEM - puppet last run on db1024 is CRITICAL: CRITICAL: Puppet has 22 failures [19:42:32] PROBLEM - puppet last run on praseodymium is CRITICAL: CRITICAL: Puppet has 21 failures [19:42:32] PROBLEM - puppet last run on mw1035 is CRITICAL: CRITICAL: Puppet has 66 failures [19:42:32] PROBLEM - puppet last run on mw1147 is CRITICAL: CRITICAL: puppet fail [19:42:45] !log temp. 
stopped icinga-wm [19:42:51] Logged the message, Master [19:42:51] @palladium:~# tail -f /var/log/apache2/error.log was full of those passenger fails , we had them before [19:43:14] now it's quiet again, should be recovering in a bit [19:43:24] eww [19:43:26] ugly errors [19:43:37] thanks for hopping on it so fast [19:44:04] so first step is troubleshoot apache, then puppetmaster ;] [19:44:08] i was about to go in the opposite direction [19:44:25] then puppet master process that is. [19:44:25] the apache error log on the puppetmaster [19:44:37] Unexpected error in mod_passenger: Could not connect to the ApplicationPool server: Broken pipe (32) [19:44:39] yea [19:44:43] and if it's full of mod_passenger fails (the Ruby on Apache module) , then just restart Apacche [19:44:48] gtk [19:44:58] yes, that [19:45:11] well, at least it was similar last time [19:46:08] Phusion_Passenger/2.2.11 ..configured -- resuming normal operations [20:00:50] RECOVERY - puppet last run on ytterbium is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [20:00:51] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [20:00:51] RECOVERY - puppet last run on uranium is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [20:01:00] RECOVERY - puppet last run on wtp1021 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [20:01:00] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [20:01:00] RECOVERY - puppet last run on mc1009 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [20:01:01] RECOVERY - puppet last run on es1006 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [20:01:01] RECOVERY - puppet last run on mw1040 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [20:01:01] RECOVERY - puppet last run on cp1043 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [20:01:01] RECOVERY - puppet last run on mw1062 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [20:01:02] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [20:01:02] RECOVERY - puppet last run on wtp1024 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [20:01:11] RECOVERY - puppet last run on es1009 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [20:01:21] RECOVERY - puppet last run on ms-be1001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [20:01:22] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [20:01:22] RECOVERY - puppet last run on rdb1004 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [20:01:32] RECOVERY - puppet last run on mw1109 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [20:01:33] RECOVERY - puppet last run on ms-be1013 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [20:01:33] RECOVERY - puppet last run on mc1011 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [20:01:33] RECOVERY - puppet last run on ms-fe1003 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [20:01:33] RECOVERY - puppet last 
run on ocg1001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [20:01:33] RECOVERY - puppet last run on ssl3003 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [20:01:33] RECOVERY - puppet last run on mw1067 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [20:01:34] RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [20:01:34] RECOVERY - puppet last run on mw1124 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [20:01:36] \o/ [20:01:42] RECOVERY - puppet last run on vanadium is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [20:01:42] RECOVERY - puppet last run on virt1002 is OK: OK: Puppet is currently enabled, last run 67 seconds ago with 0 failures [20:01:43] RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [20:01:43] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [20:01:43] RECOVERY - puppet last run on pc1001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [20:01:48] ARISE SERVERS [20:01:52] RECOVERY - puppet last run on amssq59 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [20:01:53] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [20:01:53] RECOVERY - puppet last run on wtp1019 is OK: OK: Puppet is currently enabled, last run 63 seconds ago with 0 failures [20:01:53] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 61 seconds ago with 0 failures [20:02:01] (they werent really down but it seemed appropriate to say) [20:02:02] RECOVERY - puppet last run on mw1115 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [20:02:06] RECOVERY - puppet last run on mw1038 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [20:02:10] heh, it randomly comes back on next puppet run [20:02:11] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [20:02:11] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [20:02:11] RECOVERY - puppet last run on mw1130 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [20:02:12] RECOVERY - puppet last run on mw1048 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [20:02:12] RECOVERY - puppet last run on mw1080 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [20:02:23] RECOVERY - puppet last run on cp1049 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [20:02:23] RECOVERY - puppet last run on mc1004 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [20:02:26] could have kept it killed until they are really all done [20:02:36] RECOVERY - puppet last run on amssq44 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [20:02:39] RECOVERY - puppet last run on analytics1017 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [20:02:40] RECOVERY - puppet last run on amssq43 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures 
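(The recoveries above trickle in because each host only clears on its next scheduled agent run — hence "it randomly comes back on next puppet run" and the suggestion below to keep icinga-wm killed until they are really all done. A hypothetical way to force the issue fleet-wide, assuming salt minions on the hosts as the salt-minion checks elsewhere in this log suggest; nobody actually did this here:)

    # trigger agent runs in batches so the freshly restarted master isn't flattened again
    salt --batch-size 50 '*' cmd.run 'puppet agent --test'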
[20:02:47] RECOVERY - puppet last run on wtp1017 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [20:02:47] RECOVERY - puppet last run on mw1028 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [20:02:47] RECOVERY - puppet last run on mw1005 is OK: OK: Puppet is currently enabled, last run 62 seconds ago with 0 failures [20:02:47] PROBLEM - Host ps1-d3-pmtpa is DOWN: CRITICAL - Plugin timed out after 15 seconds [20:02:55] but it's down from >200 to 35.. [20:02:57] RECOVERY - puppet last run on mw1178 is OK: OK: Puppet is currently enabled, last run 65 seconds ago with 0 failures [20:02:58] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [20:03:07] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [20:03:08] RECOVERY - puppet last run on mw1072 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [20:03:08] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 65 seconds ago with 0 failures [20:03:17] RECOVERY - puppet last run on mw1063 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [20:03:18] its kinda nice to see all the clears roll past [20:03:28] undoing the 'oh shit' of the errors [20:03:40] RECOVERY - puppet last run on searchidx1001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [20:03:40] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [20:03:48] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 67 seconds ago with 0 failures [20:03:57] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 74, down: 0, dormant: 0, excluded: 0, unused: 0 [20:04:05] removes more tampa stuff [20:04:16] k:) [20:06:14] (03PS1) 10Cscott: Add djvu tools for OCG. [puppet] - 10https://gerrit.wikimedia.org/r/165329 [20:06:35] (03PS5) 10Dzahn: Move fatalmonitor to fluorine [puppet] - 10https://gerrit.wikimedia.org/r/164314 (owner: 10Reedy) [20:06:37] or actually.. ^ [20:07:10] (03CR) 10Dzahn: [C: 032] Move fatalmonitor to fluorine [puppet] - 10https://gerrit.wikimedia.org/r/164314 (owner: 10Reedy) [20:09:43] (03CR) 10Dzahn: "thanks, yea. needed. /home/wikipedia/syslog/apache.log would not work anymore indeed" [puppet] - 10https://gerrit.wikimedia.org/r/164314 (owner: 10Reedy) [20:12:07] Reedy: ^ that worked, it's on fluorine, and tested it, and it does show me warnings [20:12:12] like f.e. [20:12:16] 1 PHP Warning: Recursion detected in RequestContext::getLanguage [20:13:22] https://github.com/search?q=%22home/wikipedia%22+%40wikimedia&type=Code&utf8=%E2%9C%93 [20:13:47] mutante: does it exist at all on other servers, or are those all broken? [20:14:47] Krinkle: i think the all broken [20:14:49] https://github.com/wikimedia/operations-puppet/blob/23f71c00eb9e5cd84e25dc78006189275eb51d7d/manifests/role/syslog.pp#L12 [20:14:58] k :) [20:15:17] but syslogging, it has been moved [20:15:26] a bunch of this is hopefully unused [20:16:29] on fluorine it's /a/mw-log [20:16:40] PROBLEM - puppet last run on neon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:16:55] (03CR) 10Dzahn: "file is now on fluorine. ran it manually, works for me. 
it did show me some warnings" [puppet] - 10https://gerrit.wikimedia.org/r/164314 (owner: 10Reedy) [20:22:14] (03PS2) 10Dzahn: remove pdf servers [dns] - 10https://gerrit.wikimedia.org/r/163308 [20:23:27] (03CR) 10Dzahn: [C: 032] remove pdf servers [dns] - 10https://gerrit.wikimedia.org/r/163308 (owner: 10Dzahn) [20:25:49] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 1493 seconds ago with 0 failures [20:27:26] !log pdf2/pdf3 - revoked puppet certs, removed from DNS & icinga [20:27:36] Logged the message, Master [20:28:39] wheeee [20:29:32] the output of "puppetstoredconfigclean.rb" is so nice [20:29:37] Killing pdf3.wikimedia.org...done. [20:39:34] (03PS3) 10Dzahn: decom mexia [puppet] - 10https://gerrit.wikimedia.org/r/159437 [20:44:27] (03PS4) 10Dzahn: decom mexia [puppet] - 10https://gerrit.wikimedia.org/r/159437 [20:46:04] (03CR) 10Dzahn: [C: 032] "ticket is resolved. mexia is down" [puppet] - 10https://gerrit.wikimedia.org/r/159437 (owner: 10Dzahn) [20:48:54] !log mexia - revoke salt,puppet,monitoring,storedconfigs [20:49:30] Logged the message, Master [20:50:57] <^d> manybubbles: About? I've got elastic stuck in yellow on beta. [20:51:05] about [20:51:08] beta? [20:51:10] checking [20:51:26] <^d> 47 shards are unassigned after rolling 02 and 03. [20:51:35] <^d> But none are initializing or moving. [20:51:50] <^d> (the majority reinitialized, obviously) [20:55:20] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 9, down: 0, shutdown: 0 [20:55:47] RECOVERY - BGP status on cr2-codfw is OK: OK: host 208.80.153.193, sessions up: 6, down: 0, shutdown: 0 [20:56:13] weird [20:56:17] still looking [20:56:38] RECOVERY - BGP status on cr1-codfw is OK: OK: host 208.80.153.192, sessions up: 6, down: 0, shutdown: 0 [20:57:07] RECOVERY - BGP status on cr1-eqiad is OK: OK: host 208.80.154.196, sessions up: 15, down: 0, shutdown: 0 [20:59:30] (03PS2) 10Dzahn: decom dobson [puppet] - 10https://gerrit.wikimedia.org/r/164120 [21:00:04] (03PS3) 10Dzahn: decom dobson [puppet] - 10https://gerrit.wikimedia.org/r/164120 [21:00:05] spagewmf, ebernhardson: Respected human, time to deploy Flow (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141007T2100). Please do the needful. [21:00:21] ^d: wtf. I restarted deployment-elastic01 and it fixed itself. [21:00:25] not cool, dude. not cool. [21:00:30] <^d> weird... [21:00:41] <^d> I was checking logs, didn't see anything. [21:00:44] if that happens in prod I'm going to be pissed [21:00:53] <^d> I'll be more careful in prod. [21:01:03] <^d> And keep an eye out for that explicitly. [21:01:30] I've noticed that sometimes you need to slap the allocation algorithm around if it gets complacent. I have no idea why though. 
I've read the code and I don't remember seeing a lazy clause [21:01:51] (03CR) 10Dzahn: [C: 032] "dobson is down" [puppet] - 10https://gerrit.wikimedia.org/r/164120 (owner: 10Dzahn) [21:04:42] !log dobson - revoke puppet cert, delete from storedconfigs/icinga, deleted from dsh [21:04:50] Logged the message, Master [21:05:58] <^d> manybubbles: It's just well hidden ;-) [21:06:32] <^d> Kind of like rolling a manual car down the hill to get the engine going :p [21:07:06] more like shaking a sleeping person until they wake up [21:08:19] <^d> *snicker* [21:08:36] (03PS1) 10BBlack: Remove old ns1/mexia refs from authdns [puppet] - 10https://gerrit.wikimedia.org/r/165358 [21:10:22] (03PS1) 10Yuvipanda: Include ganglia in standard only for production [puppet] - 10https://gerrit.wikimedia.org/r/165360 [21:10:24] Krinkle: ^ [21:10:57] YuviPanda: still, it seems wrong of it to take up, like, 8GB of ram? [21:11:09] Krinkle: oh, that seems... something else wrong somewhere :| [21:11:12] (over weeks time) [21:11:20] maybe Ganglia is just written terribly? [21:11:21] is it buffering up indefinitely since it can't connect? [21:11:22] * YuviPanda isn't sure [21:11:25] that's possible [21:11:30] greg-g: FYI the CiteThisPage stuff didn't feel ready; pushed it back three weeks. https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=130294&oldid=130234 [21:11:30] (03CR) 10Krinkle: [C: 031] Include ganglia in standard only for production [puppet] - 10https://gerrit.wikimedia.org/r/165360 (owner: 10Yuvipanda) [21:11:44] Krinkle: try killing the process? [21:12:10] <^d> manybubbles: Gah, same thing happened with 04. [21:12:16] * ^d puts angry hat on [21:12:31] YuviPanda: cherry-picking now, will kill the process to see if that helps [21:12:39] Krinkle: ok [21:12:41] (03PS3) 10Dzahn: remove fenari [dns] - 10https://gerrit.wikimedia.org/r/163313 [21:14:14] (03PS19) 10Krinkle: contint: Add Xvfb module, role::ci::slave::localbrowser and Chromium [puppet] - 10https://gerrit.wikimedia.org/r/163791 [21:14:34] (03PS4) 10Dzahn: remove fenari [dns] - 10https://gerrit.wikimedia.org/r/163313 [21:14:49] (03CR) 10Krinkle: "Remove "require trusty". Per Antoine, these are included in the main role, not a dedicated role, thus can't require trusty as that'll made" [puppet] - 10https://gerrit.wikimedia.org/r/163791 (owner: 10Krinkle) [21:14:54] (03PS20) 10Krinkle: contint: Add Xvfb module, role::ci::slave::localbrowser and Chromium [puppet] - 10https://gerrit.wikimedia.org/r/163791 [21:16:56] (03PS2) 10Dzahn: remove mexia [dns] - 10https://gerrit.wikimedia.org/r/164260 [21:17:05] ^d: fucking hell [21:17:18] <^d> I'm seeing a pattern. [21:17:36] <^d> restart node, it comes up and starts initializing a ton of shards (ok) [21:17:48] <^d> Then it switches to /moving/ 2 shards (our max) [21:17:50] <^d> Those finish [21:17:55] <^d> It never goes back to initializing. [21:18:53] <^d> kicking 01 got everything happy again [21:18:57] <^d> I really dunno wtf is going on [21:20:38] James_F: cool [21:21:04] greg-g: Better safe than sorry. [21:21:25] yeppers [21:21:43] <^d> manybubbles: I'm not starting prod tonight, it's already after 5pm your time and I'd rather not get us in a bad spot if beta gave us troubles. 
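(For reference, the same "kick" can usually be delivered through the stock elasticsearch cluster APIs instead of a node restart — a sketch against any node, host and port assumed; this is not what ^d actually ran, which was a restart of deployment-elastic01:)

    # see which shards the allocator is sitting on
    curl -s 'http://localhost:9200/_cat/shards?v' | grep UNASSIGNED
    # an empty reroute forces a fresh allocation round -- the API-level slap
    curl -s -XPOST 'http://localhost:9200/_cluster/reroute' -d '{}'
    # allocation-decider logging can also be raised on the fly, as suggested below
    curl -s -XPUT 'http://localhost:9200/_cluster/settings' \
        -d '{"transient": {"logger.cluster.routing.allocation.decider": "TRACE"}}'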
[21:21:51] +1 [21:21:52] (03CR) 10Ori.livneh: [C: 031] contint: Add Xvfb module, role::ci::slave::localbrowser and Chromium [puppet] - 10https://gerrit.wikimedia.org/r/163791 (owner: 10Krinkle) [21:22:29] ^d: you can crank up the logging on the allocation decider on the fly and see what it is thinking [21:22:30] its OK [21:22:53] <^d> Yeah I was going to kick them a few more times with debugging ramped up. [21:23:30] (03PS2) 10Dzahn: decom mchenry [puppet] - 10https://gerrit.wikimedia.org/r/164123 [21:24:52] (03PS1) 10BBlack: fix icinga BGP checks, mostly [puppet] - 10https://gerrit.wikimedia.org/r/165367 [21:24:54] (03CR) 10Dzahn: [C: 032] "mchenry has been shut down on Oct 1st" [puppet] - 10https://gerrit.wikimedia.org/r/164123 (owner: 10Dzahn) [21:25:47] (03CR) 10BBlack: [C: 032] fix icinga BGP checks, mostly [puppet] - 10https://gerrit.wikimedia.org/r/165367 (owner: 10BBlack) [21:25:56] YuviPanda: ganglia is still being restarted [21:25:58] gmond [21:26:07] bah [21:26:11] Krinkle: can you file a bug? i'll look tomorrow [21:26:17] deep in puppet on something else... [21:26:22] I cherry-picked the change to puppet master, ran puppet on the local instances and killed the process by pid from top [21:26:27] YuviPanda: OK [21:26:33] sorry [21:26:36] YuviPanda: component? [21:26:48] Krinkle: just labs, infrastructure, I guess [21:26:53] ops stuff has terrible presence in bugzilla [21:27:25] (03PS2) 10BBlack: Remove old ns1/mexia refs from authdns [puppet] - 10https://gerrit.wikimedia.org/r/165358 [21:27:27] (03PS2) 10BBlack: fix icinga BGP checks, mostly [puppet] - 10https://gerrit.wikimedia.org/r/165367 [21:27:30] gerrit is being really annoying :p [21:27:44] !log mchenry - revoke puppet cert, clean storedconfigs/rm from icinga [21:27:48] <^d> bblack: s/being// [21:27:50] <^d> It is all the time. [21:27:50] (03CR) 10BBlack: [C: 032] Remove old ns1/mexia refs from authdns [puppet] - 10https://gerrit.wikimedia.org/r/165358 (owner: 10BBlack) [21:27:51] Logged the message, Master [21:27:59] (03CR) 10BBlack: [V: 032] Remove old ns1/mexia refs from authdns [puppet] - 10https://gerrit.wikimedia.org/r/165358 (owner: 10BBlack) [21:28:11] (03CR) 10BBlack: [V: 032] fix icinga BGP checks, mostly [puppet] - 10https://gerrit.wikimedia.org/r/165367 (owner: 10BBlack) [21:28:23] (03CR) 10Dzahn: [C: 031] "bblack, here, updated" [dns] - 10https://gerrit.wikimedia.org/r/164260 (owner: 10Dzahn) [21:30:05] YuviPanda: https://bugzilla.wikimedia.org/show_bug.cgi?id=71761 [21:30:26] Krinkle: thanks. taken [21:30:51] (03CR) 10BBlack: [C: 031] remove mexia [dns] - 10https://gerrit.wikimedia.org/r/164260 (owner: 10Dzahn) [21:31:32] (03CR) 10Dzahn: "should the $default_sites be codfw and eqiad ?" [puppet] - 10https://gerrit.wikimedia.org/r/164498 (owner: 10Dzahn) [21:32:13] sees mchenry being removed from icinga configs [21:32:38] (03CR) 10Dzahn: [C: 032] remove mexia [dns] - 10https://gerrit.wikimedia.org/r/164260 (owner: 10Dzahn) [21:40:47] (03PS1) 10Calak: Create new user groups on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165371 (https://bugzilla.wikimedia.org/71760) [21:43:07] (03CR) 10Dzahn: "robh: still -1?" [puppet] - 10https://gerrit.wikimedia.org/r/159439 (owner: 10Dzahn) [21:44:18] (03CR) 10Dzahn: [C: 031] "meh, this just DHCP" [puppet] - 10https://gerrit.wikimedia.org/r/159440 (owner: 10Dzahn) [21:46:34] (03CR) 10Dzahn: "the entire 10.0.0.0/16 ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/164241 (owner: 10Dzahn) [21:46:58] (03CR) 10Dzahn: "how does all this still work with dataset2 being down, or does it ?" [puppet] - 10https://gerrit.wikimedia.org/r/164233 (owner: 10Dzahn) [21:47:02] goddamit, our nginx module requires wmflib [21:47:03] sigh [21:47:18] (03CR) 10Hoo man: "Dzahn: It was intentional as they mentioned pmtpa caches. If you want this to be changed I can do that of course." [puppet] - 10https://gerrit.wikimedia.org/r/164273 (owner: 10Hoo man) [21:47:35] (03PS2) 10Calak: Create new user groups on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165371 (https://bugzilla.wikimedia.org/71760) [21:47:36] mutante: If you just merge that and the other one it doesn't matter(tm) [21:47:47] other one = follow-up [21:50:07] ottomata: ori the nginx module requires wmflib, is that intentional? [21:50:13] * YuviPanda is trying to use that in an unrelated puppet repo [21:51:25] YuviPanda: I do not know! [21:51:32] hmm, sigh [21:51:34] is wmflib a submodule? [21:51:35] it should be! [21:51:37] ottomata: no [21:51:43] ottomata: btw, I'm using librarian puppet, and it's quite nice! [21:52:19] ^demon|brb: good news! 1.3.4 passed regression tests! [21:52:22] ottomata: +1 to making that into a module [21:52:41] (03CR) 10Dzahn: "Hooman: I see. alright. yea, no big deal about the torrus change, i think it conflicts with another one i made at Change-Id: I0f1b9e5c3193" [puppet] - 10https://gerrit.wikimedia.org/r/164273 (owner: 10Hoo man) [21:58:37] ottomata, YuviPanda: I've poked ori about moving wmflib to a submodule before. I'd like to have it in mediawiki-vagrant too. [21:58:47] yeah, good idea [21:58:53] bd808: librarian-puppet is nice, btw [22:00:12] YuviPanda: cool. I've only used it in boxen but it's really nice there. [22:00:55] * bd808 should update his boxen manifests [22:04:28] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [22:05:29] PROBLEM - puppet last run on amssq49 is CRITICAL: CRITICAL: Puppet has 1 failures [22:11:27] PROBLEM - check if salt-minion is running on virt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:11:37] PROBLEM - DPKG on virt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:11:48] PROBLEM - check if dhclient is running on virt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:11:57] PROBLEM - SSH on virt1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:12:07] PROBLEM - check configured eth on virt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:12:18] PROBLEM - RAID on virt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:12:38] PROBLEM - puppet last run on virt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:12:48] PROBLEM - Disk space on virt1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:15:42] spagewmf, ebernhardson: are you still deploying flow? [22:16:12] greg-g: i'd like to deploy ocg assuming the flowers are done [22:16:36] there's an ocg bug day tomorrow, i've got a few last-minute bugs to add.
;) [22:17:48] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [22:19:00] (03PS1) 10BBlack: clean up esams router monitoring a bit [puppet] - 10https://gerrit.wikimedia.org/r/165379 [22:20:02] YuviPanda: yes, intentional [22:20:14] ori: hmm, ok [22:20:50] (03CR) 10BBlack: [C: 032] clean up esams router monitoring a bit [puppet] - 10https://gerrit.wikimedia.org/r/165379 (owner: 10BBlack) [22:22:02] ^ will probably cause a few esams router alerts in the near future until I go clean them up, don't panic [22:22:10] What's up with toollabs? [22:23:09] ah, hmm [22:23:12] that box seems dead as wlel [22:23:32] cscott: flowers? oh, flow-ers [22:23:52] sjoerddebruin: rebooting [22:23:59] cscott: you're good [22:23:59] okay YuviPanda [22:24:08] RECOVERY - puppet last run on amssq49 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [22:25:46] greg-g: thanks [22:26:06] (03CR) 10Krinkle: [C: 04-1] "This is not how packages are uninstalled in puppet. Change them to package ensure => absent." [puppet] - 10https://gerrit.wikimedia.org/r/165204 (https://bugzilla.wikimedia.org/54393) (owner: 10Zfilipin) [22:29:58] (03PS3) 10Dzahn: remove dobson [dns] - 10https://gerrit.wikimedia.org/r/164119 [22:31:15] (03CR) 10Dzahn: [C: 032] remove dobson [dns] - 10https://gerrit.wikimedia.org/r/164119 (owner: 10Dzahn) [22:35:24] (03PS2) 10Dzahn: remove mchenry [dns] - 10https://gerrit.wikimedia.org/r/164259 [22:35:50] (03CR) 10jenkins-bot: [V: 04-1] remove mchenry [dns] - 10https://gerrit.wikimedia.org/r/164259 (owner: 10Dzahn) [22:37:39] !log cycling power on virt1005 -- unresponsive [22:37:46] Logged the message, Master [22:39:12] (03PS3) 10Dzahn: remove mchenry [dns] - 10https://gerrit.wikimedia.org/r/164259 [22:40:29] !log fenari - revoked puppet cert, rm salt key, rm from icinga ... 
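(The decommission !log entries above and below — dobson, mchenry, mexia, fenari — all follow the same recipe; in rough command form, run on the puppet and salt masters. The exact invocations are an approximation, with puppetstoredconfigclean.rb being the script praised earlier in this log:)

    host=fenari.wikimedia.org
    puppet cert clean "$host"             # revoke and remove the client certificate
    puppetstoredconfigclean.rb "$host"    # purge storedconfigs so icinga/exported resources drop the host
    salt-key -d "$host"                   # delete the salt key
    # DNS and dsh removal happen as separate [dns]/[puppet] gerrit changes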
[22:40:34] Logged the message, Master [22:40:51] PROBLEM - Host virt1005 is DOWN: CRITICAL - Plugin timed out after 15 seconds [22:42:12] (03CR) 10Dzahn: [C: 032] remove mchenry [dns] - 10https://gerrit.wikimedia.org/r/164259 (owner: 10Dzahn) [22:42:21] !log ori Synchronized php-1.25wmf2/extensions/WikimediaEvents: Update WikimediaEvents for Ied71b5032: Groundwork for HHVM productivity analysis (duration: 00m 04s) [22:42:31] Logged the message, Master [22:42:41] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 98, down: 1, dormant: 0, excluded: 0, unused: 0BRae0: down - BR [22:43:32] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 81, down: 2, dormant: 0, excluded: 0, unused: 0BRge-1/2/5: down - BRxe-0/0/3: down - Core: csw2-esams:xe-3/1/1 [10Gbps DF CWDM C47]BR [22:43:36] (03CR) 10Dzahn: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/164259 (owner: 10Dzahn) [22:44:02] RECOVERY - SSH on virt1005 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [22:44:04] RECOVERY - Disk space on virt1005 is OK: DISK OK [22:44:04] RECOVERY - check if dhclient is running on virt1005 is OK: PROCS OK: 0 processes with command name dhclient [22:44:12] RECOVERY - Host virt1005 is UP: PING OK - Packet loss = 0%, RTA = 1.97 ms [22:44:32] RECOVERY - RAID on virt1005 is OK: OK: Active: 16, Working: 16, Failed: 0, Spare: 0 [22:44:32] RECOVERY - check configured eth on virt1005 is OK: NRPE: Unable to read output [22:44:33] RECOVERY - DPKG on virt1005 is OK: All packages OK [22:44:41] RECOVERY - check if salt-minion is running on virt1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:44:41] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.244 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [22:44:42] RECOVERY - puppet last run on virt1005 is OK: OK: Puppet is currently enabled, last run 2828 seconds ago with 0 failures [22:47:25] !log db60,db69-74,es4,es7,es10 - remove from icinga monitoring, puppet certs, salt keys [22:47:35] Logged the message, Master [22:48:29] !log ori Synchronized php-1.25wmf1/extensions/WikimediaEvents: Update WikimediaEvents for Ied71b5032: Groundwork for HHVM productivity analysis (duration: 00m 04s) [22:48:35] Logged the message, Master [22:50:43] !log db68,tarin - revoke the last remaining pmtpa certs [22:50:49] Logged the message, Master [22:54:17] !log updated OCG to version c778ea8b898f8ad8c2b7ad9de78a75469e7ed061 [22:54:23] Logged the message, Master [22:56:09] (03PS1) 10BBlack: fix mr1-esams IP for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/165389 [22:56:28] (03CR) 10BBlack: [C: 032 V: 032] fix mr1-esams IP for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/165389 (owner: 10BBlack) [22:58:54] YuviPanda: arr? [22:59:00] mutante: ? [22:59:02] Error: Could not open command file '/var/lib/nagios/rw/nagios.cmd' for update! [22:59:10] "The permissions on the external command file and/or directory may be incorrect. Read the FAQs on how to setup proper permissions." [22:59:27] hmm, [22:59:31] mutante: what's the perms on that file? [22:59:32] any changes to the permission fix 'execs' ? [22:59:42] maybe it was just really unluck timing [22:59:45] mutante: we killed them. [22:59:58] hrmm.. when [22:59:59] after changing a lot of ownership [23:00:03] mutante: yesterday [23:00:05] RoanKattouw, ^d, marktraceur, MaxSem: Dear anthropoid, the time has come. 
Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141007T2300). [23:00:41] mutante: where's that error from? [23:00:44] icinga logs? [23:01:23] YuviPanda: web interface [23:01:52] mutante: oh, when you try to execute a command? [23:01:55] yes [23:02:12] but i used it earlier today [23:02:20] that's... werid [23:02:22] *weird [23:02:29] now it's broken. i can repeat it [23:02:37] mutante: can you check perms of that file, and also that folder? [23:02:39] and tell me what it is? [23:03:47] YuviPanda: prw-rw---- 1 icinga icinga [23:03:55] nothing to swat, cooll [23:03:57] YuviPanda: drwxrwxrwx 2 icinga nagios [23:04:14] Oh, well, that's easy. [23:04:21] mutante: hmm, since it's owned by icinga, and I guess icinga itself runs as icinga, it should be able to write... [23:04:27] apparently icinga web doesn't run as icinga. [23:04:32] :) [23:05:18] so we are getting closer again to the reason those exec's existed :p [23:05:28] heh [23:05:42] let me submit a patch for just this one, but not as an exec [23:07:51] (03PS1) 10Yuvipanda: icinga: Setup perms for command files [puppet] - 10https://gerrit.wikimedia.org/r/165391 [23:08:04] mutante: ^ this lets it be grouped as www-data, which *should* make this work [23:08:28] previously, anyone with any access on that machine would be able to write whatever into that... [23:08:31] 777 is always a hack, I think [23:09:21] YuviPanda: what i dont understand about that is why it changed NOW [23:09:27] i still used it earlier today [23:09:32] mutante: me neither... [23:09:36] mutante: I *think* what happened was... [23:09:42] mutante: you used the file, and icinga *wrote* to it fine [23:09:53] mutante: since it still had the old perms [23:09:59] mutante: and then it wrote them, and no 'fix perms' was applied... [23:10:03] mutante: and then it couldn't write to them again [23:10:16] that's my current theory, at least [23:10:20] YuviPanda: ah, , yea, reasonable! [23:12:23] YuviPanda: do you have the link handy that removed the exec's? [23:12:33] mutante: not handy, no. let me find [23:12:54] mutante: https://gerrit.wikimedia.org/r/#/c/165123/ [23:15:02] YuviPanda: it seems the only one it doesnt touch is actually the one for /var/lib/nagios/rw [23:15:36] i see.. [23:15:39] mutante: yup, I presumed that 1. since things writing to it would run as user icinga, 2. the parent folder was owned by icinga, 3. ones reading from it would run as icinga, it would be ok... [23:15:41] clearly not [23:15:53] (03PS1) 10MaxSem: Disable GeoData on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165393 [23:16:40] (03CR) 10Dzahn: [C: 032] "yep, should fix sending ACKs/host commands via web ui, and sure better than 0777 hack :)" [puppet] - 10https://gerrit.wikimedia.org/r/165391 (owner: 10Yuvipanda) [23:19:05] (03CR) 10MaxSem: [C: 032] Disable GeoData on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165393 (owner: 10MaxSem) [23:19:12] (03Merged) 10jenkins-bot: Disable GeoData on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165393 (owner: 10MaxSem) [23:20:28] !log maxsem Synchronized wmf-config/InitialiseSettings.php: SWAT: https://gerrit.wikimedia.org/r/165393 (duration: 00m 04s) [23:20:35] Logged the message, Master [23:20:46] ACKNOWLEDGEMENT - Host ps1-d3-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn Tampa [23:20:53] YuviPanda: ^ fixed, thank you [23:20:58] mutante: can you try another? 
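(The shape of the fix in 165391, going by the perms pasted above and the "grouped as www-data" remark: give the web server's group write access to the command pipe instead of leaving the directory 0777. In shell terms it amounts to roughly the following — the exact modes the puppet file resources enforce are a guess, assuming the icinga web UI runs as www-data:)

    # before: prw-rw---- icinga icinga   /var/lib/nagios/rw/nagios.cmd
    #         drwxrwxrwx icinga nagios   /var/lib/nagios/rw   <- the 777 hack
    chgrp www-data /var/lib/nagios/rw
    chmod 2710 /var/lib/nagios/rw                 # setgid dir; group may traverse, others nothing
    chgrp www-data /var/lib/nagios/rw/nagios.cmd
    chmod 0660 /var/lib/nagios/rw/nagios.cmd      # icinga and the web UI may write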
[23:21:04] (03CR) 10Dzahn: "follow-up fix to I6d95a9b45e868d13b" [puppet] - 10https://gerrit.wikimedia.org/r/165391 (owner: 10Yuvipanda) [23:23:53] ACKNOWLEDGEMENT - DPKG on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn test [23:23:59] YuviPanda: still ok [23:24:04] yay, cool [23:24:07] and no 777 hack [23:24:10] nice [23:25:08] YuviPanda: yes :) just left to wonder what /var/lib/nagios/rm/ was [23:25:20] rm, not rw [23:25:25] yeah, is empty... [23:25:26] almost looks typo [23:25:32] rw is also a *stupid* folder name [23:29:28] !log restarting every shutoff VM on virt1005 [23:29:36] Logged the message, Master [23:29:36] (03CR) 10BryanDavis: [C: 031] Extract wmf-beta-scap to sudo-withagent wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/160960 (owner: 10Reedy) [23:31:14] (03CR) 10Ebrahim: [C: 031] Create new user groups on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165371 (https://bugzilla.wikimedia.org/71760) (owner: 10Calak) [23:34:56] (03PS1) 10BBlack: Disable cr1-esams BGP check for now [puppet] - 10https://gerrit.wikimedia.org/r/165396 [23:35:18] <^demon|brb> YuviPanda: So, the graphite plugin. Looks sane enough to try out testing in beta. I'd rather not go through the whole archiva song and dance to build it properly in case we decide it sucks so I'm going to cheat and install it manually. [23:35:30] yeah, fine by me [23:35:33] (03CR) 10BBlack: [C: 032 V: 032] Disable cr1-esams BGP check for now [puppet] - 10https://gerrit.wikimedia.org/r/165396 (owner: 10BBlack) [23:35:37] point it to labmon1001.eqiad.wmnet [23:35:42] and log on -labs :) [23:35:53] <^demon|brb> Does it require any sort of setup on graphite's side? [23:35:57] (03CR) 10BryanDavis: [C: 04-1] Use sync-dir to copy out l10n json files, build cdbs on hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/158623 (https://bugzilla.wikimedia.org/70443) (owner: 10Reedy) [23:36:00] ^demon|brb: nope [23:36:06] <^demon|brb> Okie dokie, will give it a shot then [23:36:25] ^demon|brb: cool. just remember the hostname to send to :) [23:39:35] Krinkle: graphite is back up, btw [23:44:38] (03PS5) 10Dzahn: remove fenari [dns] - 10https://gerrit.wikimedia.org/r/163313 [23:46:03] (03PS6) 10Dzahn: remove fenari [dns] - 10https://gerrit.wikimedia.org/r/163313 [23:46:51] (03PS7) 10Dzahn: remove fenari [dns] - 10https://gerrit.wikimedia.org/r/163313 [23:48:54] (03CR) 10Dzahn: [C: 032] "bye" [dns] - 10https://gerrit.wikimedia.org/r/163313 (owner: 10Dzahn) [23:49:07] good riddance! [23:49:30] we should've just named that host "42", because it was the answer to every question :p [23:50:23] bblack: :) did you know https://www.wikidata.org/wiki/Q42 [23:51:08] i guess this was kind of rejected [23:51:10] "#8094: deploy protactinium as fenari replacement" [23:51:15] not needed anymore [23:51:59] heh Q42 is awesome [23:52:33] bblack: it was fun when wikidata first started and everybody was trying to get "cool" numbers :) [23:52:43] also, 23 is George Washington , heh [23:53:07] or see Q666 [23:54:00] elastic1019 - down since 67 days [23:54:21] ah, needs reinstall it says