[00:00:39] (03CR) 10Ori.livneh: [C: 032] wfdebug-ganglia: set reporting interval to 60 seconds [operations/puppet] - 10https://gerrit.wikimedia.org/r/91807 (owner: 10Ori.livneh) [00:03:04] !log kaldari synchronized php-1.23wmf1/extensions/MobileFrontend/ 'Updating MobileFrontend for cherrypick' [00:03:15] Logged the message, Master [00:04:46] (03PS1) 10Ryan Lane: Reduce salt call timeout to 1s for fetch/checkout [operations/puppet] - 10https://gerrit.wikimedia.org/r/91809 [00:04:53] ^^ \o/ [00:05:16] 58 second reduction in deployment overhead from salt [00:06:40] !log kaldari synchronized php-1.22wmf22/extensions/MobileFrontend/ 'Updating MobileFrontend for cherrypick' [00:06:53] Logged the message, Master [00:07:08] (03CR) 10Ryan Lane: [C: 032] Reduce salt call timeout to 1s for fetch/checkout [operations/puppet] - 10https://gerrit.wikimedia.org/r/91809 (owner: 10Ryan Lane) [00:13:27] # INFO : Step 'sync' finished. Started at 2013-10-25 00:12:55; took 19 seconds to complete [00:13:28] \o/ [00:13:58] !log kaldari synchronized php-1.22wmf22/extensions/MobileFrontend/ 'Updating MobileFrontend for cherrypick' [00:49:48] !log kaldari synchronized php-1.22wmf22/extensions/MobileFrontend/ 'Updating MobileFrontend for cherrypick' [01:01:16] !log kaldari synchronized php-1.22wmf22/extensions/MobileFrontend/ 'Updating MobileFrontend for cherrypick' [01:04:11] !log kaldari synchronized php-1.22wmf22/extensions/MobileFrontend/ 'Updating MobileFrontend for cherrypick, one more time with feeling' [01:04:25] Logged the message, Master [01:06:15] (03PS1) 10Yurik: Fixed incorrect check against empty string in Varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/91813 [01:56:37] i'm out of battery and heading home. 
would be gone if someone could check noc@ again in the next hour or two [01:56:53] will get back on from home [02:09:39] jeremyb: no news to noc@ , no reply from smart, Leslie contact Philippines chapter for help, on wikimedia-ph list [02:10:24] I don't see why we're spending all this time tbh [02:10:33] there's a crappy ISP somewhere in the world [02:11:09] that has a broken network and doesn't respond to notices from others or listen to their customers [02:11:14] oh well? [02:11:20] heh, yea, it's a good idea to let wikimedia-ph handle more of it [02:11:25] they are local to the ISP [02:11:26] or noone? [02:11:51] jeremyb: archives not public, oh well , but in there https://lists.wikimedia.org/mailman/private/wikimedia-ph/ [02:12:22] paravoid: yea, the users just think it's us and dont know they should complain to their provider [02:12:33] so we get all the OTRS tickets jeremyb handles [02:12:55] besides that, yea, agree [02:13:39] decentralize to chapter ++ [02:14:56] I didn't say exactly that :() [02:14:57] :) [02:15:25] I'm saying that we've done more than enough, we should just stop caring now [02:15:35] I'm sure there are multiple broken ISPs in the world, we can't fix everything [02:16:04] just tell the users that it's their ISP's problem, we did everything we could to contact them and failed [02:16:25] time for them to switch ISPs :) [02:16:32] yea, basically what we're doing, we tell them to contact ISP [02:16:50] !log LocalisationUpdate completed (1.22wmf22) at Fri Oct 25 02:16:50 UTC 2013 [02:17:05] Logged the message, Master [02:19:53] paravoid: fwiw, i get the impression that this ISP didn't break until ~2 weeks ago [02:20:01] so? [02:20:23] so, maybe they have a log of what changed 2 weeks ago? [02:20:32] I really don't care? 
:) [02:20:37] ok [02:20:54] we're pretty sure they are the ones that are broken [02:21:02] we've notified them [02:21:26] I mean, feel free to pursue this further, I won't stop you :) [02:22:01] well kul has apparently had recent contact with them [02:22:08] so that's one route [02:22:32] and i was thinking about trying one other way [02:23:08] !log starting recentchanges OSC bug 55844 [02:23:23] Logged the message, Master [02:23:55] i'm kinda baffled why they're active on twitter but ignoring us [02:24:11] but yeah, it's their problem [02:24:39] oh yea, the WP Zero route [02:24:57] right [02:25:06] jeremyb: did you tweet to them then ? [02:25:11] yes [02:25:19] well, then that's really all you could do [02:26:23] unless you wanna subscribe to wikimedia-ph to follow up with them , heh [02:26:41] well smart just started following me personally on twitter [02:26:47] and replied [02:26:54] there you go ,, and? [02:27:10] they want my number [02:27:12] :P [02:27:24] duh :p but something [02:27:29] with their service which i don't subscribe to [02:27:31] yes, something [02:27:45] they think you are customer in .ph i bet [02:27:57] yeah [02:28:18] no, it's Wikipedia [02:28:21] :) [02:29:01] goes back to setting up bed [02:29:23] tell me what happened later, now i wanna know the ending [02:29:44] uhhh, ikea? is it puppetized? [02:30:26] haha, kind of, it uses all kinds of thumbnails for instructions that are not under cc, and needs base::tools::hammer [02:33:54] oh, they have a standard toolkit you can buy :P [02:33:59] how's the i18n? [02:34:29] solved by just using pictograms [02:35:02] role::shelf::billy isn't included yet [02:36:54] jeremyb: in the future Ikea will be like thingiverse.com and you 3D-print it and just pay the license, but we'll have free furniture on commons :) ttyl [02:38:29] haha [02:38:48] you could rent a drive up printer at uhaul? 
[02:38:49] :P [02:48:11] !log LocalisationUpdate completed (1.23wmf1) at Fri Oct 25 02:48:10 UTC 2013 [02:48:29] Logged the message, Master [03:02:06] springle: Hey. are the archive and externallinks pk additions all done? [03:02:18] Reedy: no [03:06:16] Reedy: couple more days. have some recentchanges OSC going on simultaneously. archive/externallinks need master rotations which will happen afterwards [03:06:52] I [03:07:11] plus S6 is held up with the UpdateCollation job [03:07:55] cool. wasn't sure where things were, and then saw the recentchanges stuff starting [03:08:50] lots of improvements happening :) [03:09:03] slowly :) [03:10:40] well, if you have a time machine... [03:21:39] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Oct 25 03:21:39 UTC 2013 [03:21:58] Logged the message, Master [05:45:18] hi [05:45:43] I have a question related to categories [05:45:59] I have a limited understanding of the mediawiki schema [05:46:35] so, I was reading mediawiki documentation in an attempt to understand how I could derive category history (and category events) based on what is available in the mediawiki database [05:49:28] another question which might be a bit strange is, I'm trying to count the number of total revisions inside the enwiki database [05:50:03] So far I've tried these two SELECT COUNT(*) FROM revision; SELECT COUNT(rev_id) FROM revision; [05:50:07] they both take a lot of time [05:50:30] so then I thought "ok, maybe I can just get the max rev_id" [05:51:11] SELECT MAX(rev_id) FROM revision; [05:51:14] I got back 578653633 [05:51:33] but it doesn't help since there are missing revisions [05:51:46] because they were deleted and moved into the archive table? [05:53:09] what's the name of the archive table please ? [05:56:04] legoktm: you know what I was thinking about deriving category history ? 
I was thinking, since the dump is public data, I can throw it up on AWS, spin up a bunch of instances, let them make easy work of the ~600M revisions to parse them with reverse regexes(since the categories will be at the end of each revision text) and get the categories for articles and their evolution across time [05:56:20] heh [05:56:21] and after that's done, having a cronjob run once in a while could keep that up-to-date [05:56:23] that would be interesting. [06:06:43] but the problem is that I'm not sure if the dumps contain rev_ids [06:06:55] so I couldn't link them back to the enwiki database [06:07:15] and the dump probably doesn't contain some revisions as they're archived, not sure about that one [06:07:21] they have rev_ids [06:07:32] oh that's interesting [06:07:36] but they wouldnt have deleted revisions if the revision was deleted prior to dump creation [06:08:12] true [06:09:07] legoktm: do they contain category ids as well ? [06:09:17] i dont think so [06:09:35] iirc, you can download dumps of the categorylinks sql table [06:15:34] legoktm: so this weird idea, what kind of things would it catch that aren't already in the category , categorylinks table ? [06:15:42] legoktm: the Template: case that you told me about [06:15:47] legoktm: are there others like it ? 
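The reverse-regex idea above can be sketched roughly as follows (a minimal Python sketch, with made-up revision text; it assumes the plain English `[[Category:...]]` link syntax, so localized prefixes, category aliases, and template-added categories would all be missed):

```python
import re

# Category links normally sit near the end of a revision's wikitext:
#   [[Category:Physics]]  or  [[Category:Physics|sort key]]
# Assumes the English "Category:" prefix only; localized prefixes,
# aliases, and categories added through templates are not caught.
CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]")

def categories(wikitext):
    """Return category names linked directly from one revision's text."""
    return [name.strip() for name in CATEGORY_RE.findall(wikitext)]

# Diffing consecutive revisions of a page yields add/remove "category events".
rev_1000 = "Article text.\n[[Category:Physics]]"
rev_1001 = "Article text.\n[[Category:Physics]]\n[[Category:Optics|Lens]]"

added = set(categories(rev_1001)) - set(categories(rev_1000))
removed = set(categories(rev_1000)) - set(categories(rev_1001))
print(added, removed)   # {'Optics'} set()
```

Run per page over the dump's revision stream, this gives category add/remove events per rev_id; a periodic incremental pass over new revisions could then keep it current.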
[06:16:17] the template case would be in the categorylinks table [06:16:31] just not easily parseable from revision text [08:02:14] (03CR) 10Ori.livneh: "It eats up all the CPU on my VM doing absolutely nothing:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/90733 (owner: 10Physikerwelt) [08:28:52] (03PS1) 10ArielGlenn: put the glusterfs mount back (dumps rsyncs) [operations/puppet] - 10https://gerrit.wikimedia.org/r/91826 [08:29:57] (03CR) 10ArielGlenn: [C: 032] put the glusterfs mount back (dumps rsyncs) [operations/puppet] - 10https://gerrit.wikimedia.org/r/91826 (owner: 10ArielGlenn) [09:39:59] (03PS1) 10Hashar: ctags configuration [operations/puppet] - 10https://gerrit.wikimedia.org/r/91836 [10:00:03] (03CR) 10Akosiaris: [C: 032] ctags configuration [operations/puppet] - 10https://gerrit.wikimedia.org/r/91836 (owner: 10Hashar) [10:19:23] out for lunch [10:40:18] (03CR) 10Mark Bergsma: [C: 04-1] "Yay!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/91492 (owner: 10Ori.livneh) [11:09:47] (03PS2) 10TTO: redirect vikipedi[a].com.tr to tr.wikipedia.org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/88705 (owner: 10Dzahn) [11:30:36] mark, around? [11:30:48] yes [11:30:56] mark, i'm trying to make sense of the varnish log, you might be able to help :) [11:31:02] i simulated the crash in beta [11:32:21] so i discovered that one of the (minor) bugs was in checking -- https://gerrit.wikimedia.org/r/#/c/91813/ [11:32:28] mark ^ [11:33:59] so how does that cause a crash? [11:34:42] PROBLEM - Disk space on copper is CRITICAL: DISK CRITICAL - free space: / 355 MB (3% inode=90%): [11:35:44] mark - that bug - doesn't, just causes an error in the log - because netmapper gets an empty string instead of an IP [11:36:00] mark, could you connect to beta labs - ssh 10.4.1.82 [11:36:14] and look at /etc/varnish/tmp.txt [11:36:40] what's the hostname for that ip? 
[11:37:08] deployment-cache-mobile01, but DNS was broken a few days ago [11:37:50] the log shows tons of "0 WorkThread - 0x7ff6267f2aa0 start" entries [11:38:42] mark, once you open the file, search for "FORCE" [11:39:03] the first line -- 11 SessionOpen c 109.172.15.11 53958 :80 [11:39:12] right above the FORCE [11:39:48] ok [11:49:54] ...and then? [11:49:55] :) [11:52:33] hmm, mark, i just tried something else and getting weird responses from the backend, postpone your inquiry for a bit please [11:52:47] * mark goes to lunch then :) [11:52:51] i think it might have been a few bugs [11:52:53] :) [11:53:03] sure, but please +2 that little varnish change [11:53:24] mark^! [11:53:29] :) [11:57:06] and btw, one other reason for the failure - it seems varnish doesn't handle onerror="continue" esi param [12:01:52] PROBLEM - Disk space on cp1046 is CRITICAL: DISK CRITICAL - free space: /srv/sda3 7475 MB (2% inode=99%): /srv/sdb3 7353 MB (2% inode=99%): [12:01:52] PROBLEM - Varnish HTCP daemon on cp1046 is CRITICAL: NRPE: Unable to read output [12:04:02] PROBLEM - DPKG on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:04:53] PROBLEM - Varnish HTCP daemon on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:05:59] RECOVERY - Varnish HTCP daemon on cp1046 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [12:05:59] RECOVERY - DPKG on cp1046 is OK: All packages OK [12:06:49] PROBLEM - SSH on cp1046 is CRITICAL: Server answer: [12:08:49] RECOVERY - SSH on cp1046 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [12:09:39] PROBLEM - Varnish traffic logger on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[12:14:39] RECOVERY - Varnish traffic logger on cp1046 is OK: PROCS OK: 2 processes with command name varnishncsa [12:15:49] PROBLEM - Disk space on cp1046 is CRITICAL: DISK CRITICAL - free space: /srv/sda3 7474 MB (2% inode=99%): /srv/sdb3 7352 MB (2% inode=99%): [12:15:59] PROBLEM - Varnish HTCP daemon on cp1046 is CRITICAL: NRPE: Call to popen() failed [12:16:59] RECOVERY - Varnish HTCP daemon on cp1046 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [12:16:59] PROBLEM - DPKG on cp1046 is CRITICAL: NRPE: Unable to read output [12:17:09] PROBLEM - RAID on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:18:09] RECOVERY - RAID on cp1046 is OK: OK: no RAID installed [12:19:59] RECOVERY - DPKG on cp1046 is OK: All packages OK [12:20:49] PROBLEM - Disk space on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:21:59] PROBLEM - Varnish HTCP daemon on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:22:39] PROBLEM - Varnish traffic logger on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:22:49] PROBLEM - SSH on cp1046 is CRITICAL: Server answer: [12:22:59] RECOVERY - Varnish HTCP daemon on cp1046 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [12:23:39] RECOVERY - Varnish traffic logger on cp1046 is OK: PROCS OK: 2 processes with command name varnishncsa [12:23:49] RECOVERY - SSH on cp1046 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [12:26:39] PROBLEM - Varnish traffic logger on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:27:39] RECOVERY - Varnish traffic logger on cp1046 is OK: PROCS OK: 2 processes with command name varnishncsa [12:27:49] PROBLEM - SSH on cp1046 is CRITICAL: Server answer: [12:27:49] PROBLEM - Disk space on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[12:27:59] PROBLEM - DPKG on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:28:09] PROBLEM - RAID on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:28:59] PROBLEM - Varnish HTCP daemon on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:30:39] PROBLEM - Varnish traffic logger on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:30:49] RECOVERY - SSH on cp1046 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [12:30:59] RECOVERY - Varnish HTCP daemon on cp1046 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [12:30:59] RECOVERY - DPKG on cp1046 is OK: All packages OK [12:33:09] RECOVERY - RAID on cp1046 is OK: OK: no RAID installed [12:33:39] RECOVERY - Varnish traffic logger on cp1046 is OK: PROCS OK: 2 processes with command name varnishncsa [12:34:59] PROBLEM - Varnish HTCP daemon on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:34:59] PROBLEM - DPKG on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:35:59] RECOVERY - DPKG on cp1046 is OK: All packages OK [12:36:39] PROBLEM - Varnish traffic logger on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:36:49] PROBLEM - SSH on cp1046 is CRITICAL: Server answer: [12:37:39] RECOVERY - Varnish traffic logger on cp1046 is OK: PROCS OK: 2 processes with command name varnishncsa [12:38:49] RECOVERY - SSH on cp1046 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [12:38:59] PROBLEM - DPKG on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:39:09] PROBLEM - RAID on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:40:59] PROBLEM - Varnish HTCP daemon on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[12:41:59] RECOVERY - Varnish HTCP daemon on cp1046 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [12:42:09] PROBLEM - RAID on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:43:49] PROBLEM - Disk space on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:43:59] RECOVERY - DPKG on cp1046 is OK: All packages OK [12:44:09] RECOVERY - RAID on cp1046 is OK: OK: no RAID installed [12:44:59] PROBLEM - Varnish HTCP daemon on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:45:59] RECOVERY - Varnish HTCP daemon on cp1046 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [12:46:49] PROBLEM - Disk space on cp1046 is CRITICAL: DISK CRITICAL - free space: /srv/sda3 7473 MB (2% inode=99%): /srv/sdb3 7351 MB (2% inode=99%): [12:48:59] PROBLEM - DPKG on cp1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:49:59] RECOVERY - DPKG on cp1046 is OK: All packages OK [12:52:19] PROBLEM - RAID on cp1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:52:49] PROBLEM - Varnish traffic logger on cp1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:52:59] PROBLEM - SSH on cp1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:53:39] PROBLEM - Varnish HTTP mobile-frontend on cp1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:54:09] PROBLEM - Varnish HTCP daemon on cp1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:54:09] PROBLEM - Varnish HTTP mobile-backend on cp1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:54:09] PROBLEM - DPKG on cp1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:58:09] RECOVERY - Varnish HTTP mobile-backend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 6.871 second response time [12:58:09] RECOVERY - RAID on cp1046 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [12:58:39] RECOVERY - Varnish traffic logger on cp1046 is OK: PROCS OK: 2 processes with command name varnishncsa [12:58:49] RECOVERY - SSH on cp1046 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [12:58:59] RECOVERY - Varnish HTCP daemon on cp1046 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [12:58:59] RECOVERY - DPKG on cp1046 is OK: All packages OK [12:59:29] RECOVERY - Varnish HTTP mobile-frontend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 262 bytes in 0.001 second response time [13:02:49] !log Rebooting cp1046, xfs mem alloc issues [13:03:05] Logged the message, Master [13:04:29] PROBLEM - Host cp1046 is DOWN: PING CRITICAL - Packet loss = 100% [13:04:49] Hm. Anyone knows where in git the wikitech config is? I see it on the local box, but it doesn't seem to be a checkout. [13:05:34] is it in git? :) [13:06:15] RECOVERY - Host cp1046 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [13:06:25] ... I sure as hell /hope/ so. [13:06:31] i suspect it may not be [13:06:36] * Coren shudders. [13:06:40] How... uncouth. [13:07:12] it was an unmaintained completely isolated setup on some random vhost in some network until ryan migrated it like 6 months ago or so? [13:08:24] Ah. Fun. 
[13:26:57] (03PS1) 10Mark Bergsma: Use Varnish by default for role::cache::text [operations/puppet] - 10https://gerrit.wikimedia.org/r/91863 [13:26:58] (03PS1) 10Mark Bergsma: Add ulsfo text caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/91864 [13:28:45] (03CR) 10Mark Bergsma: [C: 032] Use Varnish by default for role::cache::text [operations/puppet] - 10https://gerrit.wikimedia.org/r/91863 (owner: 10Mark Bergsma) [13:30:39] (03CR) 10Physikerwelt: "I can not run the current version of the puppet script" [operations/puppet] - 10https://gerrit.wikimedia.org/r/90733 (owner: 10Physikerwelt) [13:32:31] (03CR) 10Physikerwelt: "I was doing too many things in parallel... I wanted to write" [operations/puppet] - 10https://gerrit.wikimedia.org/r/90733 (owner: 10Physikerwelt) [13:35:05] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [13:37:05] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [13:40:05] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [13:41:05] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [13:47:32] So, someone reached out regarding receiving 504 errors when trying to access the sites - sadly with little more information than that. [13:47:44] I realized, though, that I don't know where the right place is to forward said issues. 
[13:47:55] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [13:48:42] RT [13:48:46] leslie has been on it for days already [13:48:55] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [13:48:57] there seems to be an ISP in the philippines with a broken proxy [13:49:20] and she can't get ahold of them in any way, she's even tried reaching out to Wikimedia chapter in the Philippines now [13:49:22] Yeah, I saw that - I'm not sure this is the same issue (they mentioned french), but I'll forward it that way [13:49:37] ok [13:49:41] Thanks :) [13:50:02] if it's oceania that could be ulsfo [13:50:06] which is a slightly different SSL setup [13:50:14] but the existing issues so far that I've seen have not been ulsfo [13:51:27] I've reached out for more info, with a reply-to of ops-requests@ [13:58:35] RECOVERY - Disk space on copper is OK: DISK OK [13:59:19] !log shot swiftrepl on copper, almost out of space, tossed the log... restart as you like [13:59:34] what is almost out of space? [13:59:34] Logged the message, Master [13:59:46] 100mb or so left [13:59:52] would be gone in 30 mins [13:59:53] of what? [13:59:54] or lss [13:59:56] on / [14:00:04] of the log of copper? [14:00:08] could you just ask first before killing? [14:00:19] I have already talked to faidon about this for several days [14:00:24] he said it was fine to shoot it [14:00:53] he didn't want to truncate the log? [14:00:56] no [14:01:10] he doesn't use the log for anything except 'is the job done yet?' [14:01:40] yeah I know, so shouldn't it be better to truncate the log and keeping the process running? 
:) [14:01:52] we did this same drill a few days ago: shoot job, toss log, restart it [14:02:14] ah, because the fd doesn't get released while the process runs [14:02:27] exactly [14:48:29] (03CR) 10Mark Bergsma: [C: 032] Add ulsfo text caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/91864 (owner: 10Mark Bergsma) [15:07:37] PROBLEM - Varnish HTTP text-frontend on cp4009 is CRITICAL: Connection refused [15:07:37] PROBLEM - Varnish traffic logger on cp4016 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [15:07:37] PROBLEM - Varnish traffic logger on cp4010 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [15:07:47] PROBLEM - Varnish traffic logger on cp4009 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [15:07:57] PROBLEM - HTTPS on cp4010 is CRITICAL: Connection refused [15:08:07] PROBLEM - HTTPS on cp4016 is CRITICAL: Connection refused [15:08:17] PROBLEM - HTTPS on cp4009 is CRITICAL: Connection refused [15:08:37] PROBLEM - Varnish HTTP text-frontend on cp4016 is CRITICAL: Connection refused [15:08:37] PROBLEM - Varnish HTTP text-frontend on cp4010 is CRITICAL: Connection refused [15:12:09] with bad settings. Some stuff continued to work with the bad settings. Some [15:12:12] stuff didn't. Expect to see some more bugs filed around how we make sure this [15:12:15] eek [15:12:18] bad paste [15:15:42] (03PS3) 10Ottomata: Updating with recent upstream changes to varnishkafka.conf [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/91664 [15:18:28] greg-g: if all goes well, next week I think I'll put text/wiki traffic on varnish in ulsfo. 
That would also include wikipedia, but as it's "only" for OC, it wouldn't be a lot of traffic [15:18:48] I don't know when exactly yet, but I'll do it during European work hours, well outside of any SF deployment windows [15:19:55] mark: sweet [15:20:41] mark: that might be worth a note in the deploy highlights email I write, since its wikipedias. Have a favorite day or two? [15:21:15] monday or tuesday I think, but I can't guarantee it [15:21:31] :) [15:22:00] I'll just say "the week of..." ;) [15:22:05] yeah [15:29:39] (03PS1) 10Mark Bergsma: Add Text caches ulsfo cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/91877 [15:31:02] Hm… right now all the logrotate scripts are in files/logrotate. Seems to me there's a major design choice here -- do things like that get grouped with the tool they configure, or with the tool that uses them? [15:31:10] (03PS1) 10Odder: (bug 40941) Increase font size in Gerrit diff messages [operations/puppet] - 10https://gerrit.wikimedia.org/r/91879 [15:31:13] For instance, I'd think that the gluster logrotate config would go in the gluster module [15:33:37] (03CR) 10Mark Bergsma: [C: 032] Add Text caches ulsfo cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/91877 (owner: 10Mark Bergsma) [15:34:09] andrewbogott: yeah that makes sense [15:34:18] I consider things like logrotate generic infrastructure used by other more specific modules [15:34:33] * andrewbogott nods [15:34:55] yup, we had that discussion before with nagios plugins [15:35:07] and ganglia plugins [15:35:12] and a bunch more yeah [15:35:12] right [15:44:31] (03PS1) 10Cmjohnson: Removing site.pp entries for wm126-134 [operations/puppet] - 10https://gerrit.wikimedia.org/r/91882 [15:44:38] andrewbogott: btw I have already modularized install-server. So dont put it in your plans for modularization. I am holding it back for some changes in network.pp so I can use ferm and drop the iptables rules in there (yuck!) 
[15:45:17] akosiaris: thanks, I will add this to https://wikitech.wikimedia.org/wiki/Puppet_Todo [15:46:26] :) [15:48:08] apergos: https://gerrit.wikimedia.org/r/91882 [15:51:13] cmjohnson1: dhcp entries? [15:52:57] apergos: we should probably take out mgmt as well. it's not like they're going to be used in the same place again [15:53:13] (03PS1) 10Andrew Bogott: Move generic::gluster* into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/91884 [15:53:24] after they're powered down [15:54:13] sure [15:55:37] RECOVERY - Varnish HTTP text-frontend on cp4009 is OK: HTTP OK: HTTP/1.1 200 OK - 198 bytes in 0.146 second response time [15:55:47] RECOVERY - Varnish traffic logger on cp4009 is OK: PROCS OK: 2 processes with command name varnishncsa [15:57:37] RECOVERY - Varnish HTTP text-frontend on cp4010 is OK: HTTP OK: HTTP/1.1 200 OK - 198 bytes in 0.150 second response time [15:57:37] RECOVERY - Varnish HTTP text-frontend on cp4016 is OK: HTTP OK: HTTP/1.1 200 OK - 198 bytes in 0.150 second response time [15:57:37] RECOVERY - Varnish traffic logger on cp4016 is OK: PROCS OK: 2 processes with command name varnishncsa [15:57:37] RECOVERY - Varnish traffic logger on cp4010 is OK: PROCS OK: 2 processes with command name varnishncsa [16:00:19] (03PS2) 10Andrew Bogott: Move generic::gluster* into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/91884 [16:07:35] PROBLEM - Varnish HTTP text-frontend on cp4018 is CRITICAL: Connection refused [16:07:45] PROBLEM - Varnish traffic logger on cp4018 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [16:07:55] PROBLEM - Varnish HTTP text-frontend on cp4017 is CRITICAL: Connection refused [16:08:05] PROBLEM - Varnish traffic logger on cp4017 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [16:08:15] PROBLEM - HTTPS on cp4018 is CRITICAL: Connection refused [16:08:25] PROBLEM - HTTPS on cp4008 is CRITICAL: Connection refused [16:08:35] PROBLEM - HTTPS on cp4017 
is CRITICAL: Connection refused [16:14:08] (03PS1) 10Cmjohnson: Removing mgmt ip's for mw126-135 [operations/dns] - 10https://gerrit.wikimedia.org/r/91885 [16:16:50] (03PS2) 10Cmjohnson: Removing site.pp entries for wm126-134 [operations/puppet] - 10https://gerrit.wikimedia.org/r/91882 [16:19:05] RECOVERY - Varnish traffic logger on cp4017 is OK: PROCS OK: 2 processes with command name varnishncsa [16:19:15] (03PS3) 10Cmjohnson: Removing site.pp entries for wm126-134 [operations/puppet] - 10https://gerrit.wikimedia.org/r/91882 [16:19:29] (03CR) 10Cmjohnson: [C: 032] Removing site.pp entries for wm126-134 [operations/puppet] - 10https://gerrit.wikimedia.org/r/91882 (owner: 10Cmjohnson) [16:19:55] RECOVERY - Varnish HTTP text-frontend on cp4017 is OK: HTTP OK: HTTP/1.1 200 OK - 198 bytes in 0.148 second response time [16:21:56] (03CR) 10Cmjohnson: [C: 032] Removing mgmt ip's for mw126-135 [operations/dns] - 10https://gerrit.wikimedia.org/r/91885 (owner: 10Cmjohnson) [16:22:29] !log dns update [16:22:42] Logged the message, Master [16:24:01] (03PS1) 10Vogone: Added filemover user group to bnwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91886 [16:29:35] RECOVERY - Varnish HTTP text-frontend on cp4018 is OK: HTTP OK: HTTP/1.1 200 OK - 198 bytes in 0.150 second response time [16:29:45] RECOVERY - Varnish traffic logger on cp4018 is OK: PROCS OK: 2 processes with command name varnishncsa [16:50:45] PROBLEM - Disk space on cp1051 is CRITICAL: DISK CRITICAL - free space: /srv/sda3 13277 MB (4% inode=99%): /srv/sdb3 12348 MB (3% inode=99%): [16:54:58] (03PS1) 10Dzahn: fix wrong reverse DNS for mw110 and mw110.mgmt [operations/dns] - 10https://gerrit.wikimedia.org/r/91891 [16:58:52] (03CR) 10Dzahn: [C: 032] "pmtpa/apaches:{ 'host': 'mw110.pmtpa.wmnet', 'weight': 200, 'enabled': False } #won't image properly, needs work" [operations/dns] - 10https://gerrit.wikimedia.org/r/91891 (owner: 10Dzahn) [16:59:29] !log DNS update, fix mw110 reverse entries 
[16:59:44] Logged the message, Master [17:03:34] mutante: that host was never installed (completely) as it turns out... now there is not much point either [17:04:08] apergos: yea, it wasn't installed because it didn't work because of the above. just resolving RT #6086 by Chris [17:04:40] mutante: cool..i am going to create RT to decom 110 [17:04:40] Is it relatively new hardware? [17:04:47] reedy yes [17:05:02] just didnt want the broken entry either way [17:05:15] yeah..same here, apergos and I were debating on fixing or decomming [17:05:32] cmjohnson1: eh, ok.. in that case .. yea well [17:05:33] neither of us were passionate either way so decom won out [17:05:35] it didn't take long [17:14:34] (03Draft5) 10Akosiaris: Modularizing puppetmaster [operations/puppet] - 10https://gerrit.wikimedia.org/r/91353 [17:29:21] (03CR) 10Eloquence: [C: 04-1] "I'm sorry, but I don't agree re: the username issue. Let's not hardcode past idiosyncrasies in a way that makes things more confusing in t" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91344 (owner: 10Legoktm) [17:29:23] (03CR) 10Chad: [C: 031] "Upstream won't take it. They want everything as small and compact as possible." [operations/puppet] - 10https://gerrit.wikimedia.org/r/91879 (owner: 10Odder) [17:30:27] (03CR) 10Eloquence: "I like "MediaWiki message delivery", the alternative suggested by Nemo. Descriptive and clear." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91344 (owner: 10Legoktm) [17:35:54] (03PS3) 10Legoktm: Enable MassMessage on all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91344 [17:37:24] (03CR) 10Legoktm: "Switched the username to "MediaWiki message delivery" which I've created a global account for." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91344 (owner: 10Legoktm) [17:47:54] (03PS1) 10Cmjohnson: Decommission mw110 [operations/puppet] - 10https://gerrit.wikimedia.org/r/91897 [17:56:43] (03CR) 10Reedy: [C: 04-1] "[18:52:17] Reedy: https://gerrit.wikimedia.org/r/#/c/91344/3/wmf-config/InitialiseSettings.php shouldn't that be using $wmg?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91344 (owner: 10Legoktm) [17:57:01] (03CR) 10Cmjohnson: [C: 032] Decommission mw110 [operations/puppet] - 10https://gerrit.wikimedia.org/r/91897 (owner: 10Cmjohnson) [17:57:10] AaronSchulz: err, why should it use wmg? [17:57:28] some variables use wg, some are wmg so I'm not really sure what the distinction is [17:57:44] Because loading the extension will override those variables [17:57:49] They're loaded at the start [17:57:54] THEN your extension is loaded [17:57:57] overriding those with the defaults [17:58:03] oh. [17:58:05] $wgMyGlobal = $wmgMyGlobal; [17:58:08] gotcha [17:58:15] how do I do the array merge thingy for +metawiki then? [17:59:24] Defining the default in InitialiseSettings.php is usually the simplest way [17:59:42] ok [18:00:45] legoktm: btw, just chatted with Erik, let's get the updated version of MM (pending some changes he suggested, I think) out to testwikis on Nov 7th (not next week, but the week after) and then, pending OK, we'll roll it out. [18:01:28] ok. what changes need to be made still? [18:02:21] rename? :-P [18:02:56] I already switch the config for that [18:03:28] legoktm: that plus something else Erik mentioned, he was going to comment on the change, I think. 
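The wg/wmg point Reedy makes above comes down to load order: site config runs before the extension's setup file, and the setup file resets its own globals to defaults. A toy Python analogy (variable names are illustrative, not the real PHP globals):

```python
# Order 1: set the live wg* value directly in site config -- the
# extension's setup file runs afterwards and clobbers it.
settings = {}
settings["wgServiceUser"] = "MediaWiki message delivery"  # InitialiseSettings analogue
settings["wgServiceUser"] = "default user"                # extension setup runs later
assert settings["wgServiceUser"] == "default user"        # site value lost

# Order 2: stash the site value under a wmg* name the extension never
# touches, then copy it into the wg* setting after the extension loads.
wmg = {"wmgServiceUser": "MediaWiki message delivery"}    # InitialiseSettings analogue
settings["wgServiceUser"] = "default user"                # extension setup runs
settings["wgServiceUser"] = wmg["wmgServiceUser"]         # CommonSettings, afterwards
assert settings["wgServiceUser"] == "MediaWiki message delivery"
```

That is the `$wgMyGlobal = $wmgMyGlobal;` pattern from the conversation: the wmg namespace exists purely so site values survive extension loading.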
[18:03:37] ok [18:04:39] (03PS4) 10Legoktm: Enable MassMessage on all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91344 [18:04:48] Reedy, AaronSchulz ^ [18:04:52] i have to run to class now, bbl [18:05:01] * AaronSchulz remembers saying that [18:05:04] * AaronSchulz feels old now [18:05:26] (03CR) 10jenkins-bot: [V: 04-1] Enable MassMessage on all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91344 (owner: 10Legoktm) [18:06:26] (03PS1) 10Manybubbles: Change Elasticsearch defaults that cause pain/fear [operations/puppet] - 10https://gerrit.wikimedia.org/r/91903 [18:07:13] AaronSchulz: those were the days..... [18:07:36] * AaronSchulz thinks of that terrible song [18:08:24] the one every graduation played for like 5 years? [18:08:47] something about roads and time of your life and all that [18:08:57] * greg-g shudders [18:09:01] <^d> Hah. [18:09:03] yeah, just remembered the melody [18:09:05] <^d> I love the commit summary. [18:09:10] <^d> "Change Elasticsearch defaults that cause pain/fear" [18:09:36] ^d: it fits [18:10:09] (03CR) 10Chad: [C: 031] "Yes, please." [operations/puppet] - 10https://gerrit.wikimedia.org/r/91903 (owner: 10Manybubbles) [18:10:28] this is one of the few times that I've complained on the mailing list, received a single well thought out response with code, and been able to just drop the code right in. [18:10:40] I tested it, and it is great [18:10:53] wonderful comments, everything [18:11:40] I suppose that is some kind of plagiarism, but the comments were pretty much what we wanted too. [18:13:28] manybubbles: just credit it in a comment [18:13:41] voila! no plagiarism [18:13:47] now, copyvio, that's another issue... [18:14:24] (03CR) 10Manybubbles: [C: 04-1] "One moment while I add credit for this great suggestion." [operations/puppet] - 10https://gerrit.wikimedia.org/r/91903 (owner: 10Manybubbles) [18:25:35] :) [18:25:48] heya ori-l [18:25:56] you around? 
[19:11:36] PROBLEM - Varnish traffic logger on cp1060 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:13:36] RECOVERY - Varnish traffic logger on cp1060 is OK: PROCS OK: 2 processes with command name varnishncsa [19:21:36] PROBLEM - Disk space on cp1060 is CRITICAL: NRPE: Call to popen() failed [19:29:17] (03CR) 10Andrew Bogott: "I've verified on labs that switching from openstack::gluster-service to gluster::service is a noop. That should trickle down to the subcl" [operations/puppet] - 10https://gerrit.wikimedia.org/r/91884 (owner: 10Andrew Bogott) [19:29:36] https://bugzilla.wikimedia.org/enter_bug.cgi?product=Tools&format=guided [19:32:57] (03PS2) 10Manybubbles: Change Elasticsearch defaults that cause pain/fear [operations/puppet] - 10https://gerrit.wikimedia.org/r/91903 [19:33:16] PROBLEM - Varnish HTCP daemon on cp1060 is CRITICAL: NRPE: Unable to read output [19:33:36] PROBLEM - Disk space on cp1060 is CRITICAL: DISK CRITICAL - free space: /srv/sda3 7069 MB (2% inode=99%): /srv/sdb3 7529 MB (2% inode=99%): [19:35:06] PROBLEM - RAID on cp1060 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:35:16] PROBLEM - Varnish HTCP daemon on cp1060 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:35:36] PROBLEM - Disk space on cp1060 is CRITICAL: DISK CRITICAL - free space: /srv/sda3 7069 MB (2% inode=99%): /srv/sdb3 7529 MB (2% inode=99%): [19:36:06] RECOVERY - RAID on cp1060 is OK: OK: no RAID installed [19:37:16] RECOVERY - Varnish HTCP daemon on cp1060 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [19:38:35] ottomata: hey [19:38:36] PROBLEM - Disk space on cp1060 is CRITICAL: NRPE: Unable to read output [19:38:36] PROBLEM - Varnish traffic logger on cp1060 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[19:39:36] RECOVERY - Varnish traffic logger on cp1060 is OK: PROCS OK: 2 processes with command name varnishncsa [19:40:13] heya [19:40:16] manybubbles: want me to merge that? [19:40:20] (03CR) 10Ottomata: [C: 032] Change Elasticsearch defaults that cause pain/fear [operations/puppet] - 10https://gerrit.wikimedia.org/r/91903 (owner: 10Manybubbles) [19:40:23] ori-l: [19:40:23] so [19:40:31] ottomata: sure [19:40:34] thanks [19:40:34] varnishkafka now writes periodic stats to a log file [19:40:36] ! [19:40:37] in json format [19:40:51] kinda like this [19:40:51] https://gist.github.com/ottomata/7155267 [19:40:58] what is the best way to get them into ganglia? [19:41:12] shoudl I look into statsd so we could also send to graphite if we wanted? [19:41:28] or should I just write a gmetric/ganglia python module thing to send them on to ganglia [19:41:34] manybubbles: merged [19:42:01] ottomata: whatever you prefer, really [19:42:16] PROBLEM - Varnish HTCP daemon on cp1060 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:43:00] well, if I did it to statsd, it'd be easier to put in graphite or something else later, right? [19:43:28] (03CR) 10Ottomata: [C: 032 V: 032] Updating with recent upstream changes to varnishkafka.conf [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/91664 (owner: 10Ottomata) [19:43:36] PROBLEM - Disk space on cp1060 is CRITICAL: DISK CRITICAL - free space: /srv/sda3 7068 MB (2% inode=99%): /srv/sdb3 7529 MB (2% inode=99%): [19:44:16] RECOVERY - Varnish HTCP daemon on cp1060 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [19:45:13] Anyone who can provide a little puppet help for a newbie? 
[19:45:43] https://gerrit.wikimedia.org/r/#/c/91953/4/puppet/modules/exim-conf/manifests/init.pp is currently giving me in "make noop": [19:45:49] ottomata: yes [19:45:55] err: /Stage[main]/Exim-conf/File[/etc/mailname]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/exim-config/mailname at /root/translatewiki/puppet/modules/exim-conf/manifests/init.pp:4 [19:46:11] I'm probably doing something fairly elementary wrong. [19:46:33] yeah siebrand [19:46:41] i don't know what class exim is [19:46:49] but I highly doubt it has a source parameter [19:47:36] PROBLEM - Disk space on cp1060 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:49:36] PROBLEM - Disk space on cp1060 is CRITICAL: DISK CRITICAL - free space: /srv/sda3 7068 MB (2% inode=99%): /srv/sdb3 7529 MB (2% inode=99%): [19:50:58] ori-l: do we have an example of sending things to statsd and then to ganglia right now? [19:51:10] our statsd can write to ganglia too [19:51:13] so you just have to send to statsd [19:51:26] tied up with something atm but can help in a bit [19:51:28] k [19:51:40] ottomata: See https://github.com/example42/puppet-exim [19:52:38] ok it does! [19:53:17] hmmm, ok, siebrand, is this on your VM or something? [19:53:21] is puppetmaster running there? [19:53:28] this looks like maybe your fileserver.conf isn't properly configured [19:53:30] ottomata: translatewiki.net [19:53:30] not sure though [20:00:26] PROBLEM - DPKG on cp1060 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:01:26] RECOVERY - DPKG on cp1060 is OK: All packages OK [20:02:36] PROBLEM - Disk space on cp1060 is CRITICAL: NRPE: Unable to read output [20:04:36] PROBLEM - Disk space on cp1060 is CRITICAL: DISK CRITICAL - free space: /srv/sda3 7068 MB (2% inode=99%): /srv/sdb3 7529 MB (2% inode=99%): [20:04:58] ottomata: Anyway... I asked primarily what the file action failed, which was earlier in the file I referenced... 
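[editor's note] The approach ori-l suggests above — send the varnishkafka JSON stats to statsd and let statsd forward to ganglia (or graphite) — can be sketched as below. The JSON key names and metric prefix are hypothetical placeholders (the real format is in ottomata's linked gist); statsd's UDP line protocol (`name:value|g` for a gauge) is as documented upstream.

```python
import json
import socket

def json_stats_to_statsd(line, prefix="varnishkafka"):
    """Flatten one JSON stats line into statsd gauge packets.

    Only numeric fields become metrics; the key names here are
    illustrative, not the actual varnishkafka output schema.
    """
    stats = json.loads(line)
    packets = []
    for key, value in sorted(stats.items()):
        if isinstance(value, (int, float)):
            packets.append("%s.%s:%s|g" % (prefix, key, value))
    return packets

def send_to_statsd(packets, host="localhost", port=8125):
    # statsd speaks a simple line protocol over UDP; fire and forget.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for p in packets:
        sock.sendto(p.encode("utf-8"), (host, port))

if __name__ == "__main__":
    sample = '{"txmsgs": 1200, "txerrs": 0, "client": "cp1060"}'
    for packet in json_stats_to_statsd(sample):
        print(packet)
```

Because the emitter only knows about statsd, the ganglia-vs-graphite decision stays on the statsd side, which is the flexibility ottomata was asking about.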
[20:06:10] right [20:06:18] so i think possibly fileserver.conf is not confiured properly [20:06:28] it sounds like it doesn't know how to resolve your source to that file [20:06:36] (i'm just guessing here) [20:06:58] (03PS1) 10Aaron Schulz: Fix annoyance with ctrl-C in mwscriptwikiset scripts [operations/puppet] - 10https://gerrit.wikimedia.org/r/91990 [20:07:11] either that, or hm [20:07:17] the puppet url to that file isn't right [20:07:18] uhh [20:07:36] maybe try it without the modules/ bit [20:07:45] source => 'puppet:///exim-config/mailname' [20:08:07] hmm, no [20:08:10] i think you ahve it right [20:08:11] http://docs.puppetlabs.com/puppet/2.7/reference/modules_fundamentals.html#files [20:09:56] (03CR) 10Legoktm: "I don't know why jenkins didn't like the latest patchset." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91344 (owner: 10Legoktm) [20:15:29] when was lists.wikimedia.org switched to https only? [20:22:01] huh [20:22:54] hm, February 2012 maybe https://bugzilla.wikimedia.org/show_bug.cgi?id=33897#c3 [20:31:29] ottomata: I found the problem... [20:31:39] it's not exim-config bot exim-conf.... :| [20:33:42] (03CR) 10Dr0ptp4kt: "(2 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [20:33:55] (03PS9) 10Dr0ptp4kt: Add an extra header for cache variance of W0 banners for proxies. [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 [20:41:28] 3~/q gwicke_away [20:41:30] er :) [20:42:52] (03CR) 10Dr0ptp4kt: "@BBlack and @mark, regarding a more broad solution, we're fine with whatever you guys determine as appropriate. 
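[editor's note] The error siebrand hit — "Could not retrieve information from environment production source(s) puppet:///modules/exim-config/mailname" — came down to the source URL naming a module (`exim-config`) that didn't match the on-disk module directory (`exim-conf`). A simplified sketch of how stock puppet maps a `puppet:///modules/...` source URL onto the master's filesystem (ignoring fileserver.conf mounts and environments, and assuming a conventional modulepath):

```python
import os

def resolve_puppet_source(url, modulepath="/etc/puppet/modules"):
    """Map puppet:///modules/<module>/<rest> to the on-disk path
    the fileserver looks for: <modulepath>/<module>/files/<rest>.

    Simplified model of stock puppet module fileserving; real
    puppet also consults fileserver.conf mounts and environments.
    """
    prefix = "puppet:///modules/"
    if not url.startswith(prefix):
        raise ValueError("only module file sources handled in this sketch")
    module, _, rest = url[len(prefix):].partition("/")
    return os.path.join(modulepath, module, "files", rest)

# The failing manifest asked for module "exim-config", but the module
# directory was actually named "exim-conf", so this path did not exist:
print(resolve_puppet_source("puppet:///modules/exim-config/mailname"))
```

Since the URL's module segment must equal the module directory name exactly, a one-character mismatch produces the opaque "could not retrieve information from environment" error rather than a plain file-not-found.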
For the near term, though," [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [20:53:54] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [20:54:54] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 26.60 ms [21:09:59] (03PS3) 10Dzahn: add dsh group "misc-servers" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88126 [21:11:45] (03PS1) 10Cmjohnson: Removing dns entry for mw110 [operations/dns] - 10https://gerrit.wikimedia.org/r/92003 [21:11:57] (03CR) 10Dzahn: [C: 032] "i'm gonna maintain it to keep track of the non-cluster servers, it's not used in anything automatic" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88126 (owner: 10Dzahn) [21:13:29] !log removing mw110 from pybal [21:13:47] Logged the message, Master [21:13:58] (03CR) 10Dzahn: [C: 032] Add pmtpa apaches for completeness [operations/puppet] - 10https://gerrit.wikimedia.org/r/91383 (owner: 10Reedy) [21:15:17] (03CR) 10Cmjohnson: [C: 032] Removing dns entry for mw110 [operations/dns] - 10https://gerrit.wikimedia.org/r/92003 (owner: 10Cmjohnson) [21:15:48] !log dns update [21:25:11] gwicke: what's the nodejs version used in production? [21:25:17] and is that version available somewhere as a deb? [21:26:02] YuviPanda: we currently use the old Ubuntu 0.8.x [21:26:12] Version: 0.8.2-1chl1~precise1 [21:26:14] i suppose [21:26:19] *nod* [21:26:39] I was scheming with Faidon to rebuild the debian package for ubuntu [21:26:46] I see [21:26:51] so that we can get up-to-date 0.10 [21:26:54] right [21:27:11] gwicke: I was going to spend a weekend working on exposing our RC feed over websockets [21:27:18] and wondering if I should write it in nodejs or twisted [21:27:32] since node is already on the cluster, seemed to make sense to just use it... 
[21:27:39] (03PS1) 10Cmjohnson: adding dhcpd file for mw110 [operations/puppet] - 10https://gerrit.wikimedia.org/r/92008 [21:28:01] gwicke: the only question being one of deploying the additional libraries needed. Hopefully I can get started by just adding node_modules in the repository... [21:28:13] (03CR) 10Cmjohnson: [C: 032] adding dhcpd file for mw110 [operations/puppet] - 10https://gerrit.wikimedia.org/r/92008 (owner: 10Cmjohnson) [21:34:12] mutante: regarding https://gerrit.wikimedia.org/r/#/c/91638/ what do you mean by your comment [21:34:19] suppose we need 2 of those blocks for that or [21:34:19] can it be unified? [21:35:22] cmjohnson1: see how we are removing a host range out of the middle of another range [21:35:48] and now it needs "for i in range" twice [21:36:08] oh..okay...we need 2 blocks for that [21:36:26] yea, the question was if we do, but we do [21:36:32] ariel already answered it [21:37:16] do we even have any search above 37? [21:37:57] ori-l: can I bother you in person for 2 minutes? [21:38:04] yeah [21:40:45] mutante: we do not have any search boxes above search36 so the 2nd block is not necessary. I am not sure why it goes to search51. [21:40:54] cmjohnson1: not activated, in DNS we do [21:41:07] cmjohnson1: ok, then even better [21:41:15] unless we are planning on expanding which were not ...let's remove it [21:42:29] cmjohnson1: amending [21:44:29] !log powering off to decom search21-36 [21:44:46] Logged the message, Master [21:45:02] (03PS4) 10Dzahn: remove search21-36, keep search13-20 [operations/dns] - 10https://gerrit.wikimedia.org/r/91638 [21:46:05] (03PS1) 10Andrew Bogott: Stamp out the last reference to generic::nginx, remove. [operations/puppet] - 10https://gerrit.wikimedia.org/r/92011 [21:47:25] (03PS5) 10Dzahn: remove search21-50, keep search13-20 [operations/dns] - 10https://gerrit.wikimedia.org/r/91638 [21:48:03] (03PS2) 10Andrew Bogott: Stamp out the last reference to nginx_site, remove. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/92011 [21:48:45] PROBLEM - Host search22 is DOWN: PING CRITICAL - Packet loss = 100% [21:49:25] PROBLEM - Host search24 is DOWN: PING CRITICAL - Packet loss = 100% [21:49:35] PROBLEM - Host search23 is DOWN: PING CRITICAL - Packet loss = 100% [21:49:55] PROBLEM - Host search25 is DOWN: PING CRITICAL - Packet loss = 100% [21:50:54] <^d> mutante: that you? ^ [21:51:21] ^d: nope 14:47 < cmjohnson1> !log powering off to decom search21-36 [21:51:25] (03CR) 10Andrew Bogott: [C: 032] Stamp out the last reference to nginx_site, remove. [operations/puppet] - 10https://gerrit.wikimedia.org/r/92011 (owner: 10Andrew Bogott) [21:51:35] PROBLEM - Host search26 is DOWN: PING CRITICAL - Packet loss = 100% [21:51:44] ^d: he's moving them so they become cirrus hosts [21:51:47] cmjohnson1: right [21:51:50] <^d> Yeah, I knew that :) [21:51:57] correct [21:51:58] <^d> I saw your dns change so got confused. [21:52:35] PROBLEM - Host search27 is DOWN: PING CRITICAL - Packet loss = 100% [21:54:27] ACKNOWLEDGEMENT - Host search22 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn moved by Chris. see #5883: Servers for CirrusSearchs Elasticsearch Instances [21:54:27] ACKNOWLEDGEMENT - Host search23 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn moved by Chris. see #5883: Servers for CirrusSearchs Elasticsearch Instances [21:54:27] ACKNOWLEDGEMENT - Host search24 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn moved by Chris. see #5883: Servers for CirrusSearchs Elasticsearch Instances [21:54:27] ACKNOWLEDGEMENT - Host search25 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn moved by Chris. see #5883: Servers for CirrusSearchs Elasticsearch Instances [21:54:27] ACKNOWLEDGEMENT - Host search26 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn moved by Chris. 
see #5883: Servers for CirrusSearchs Elasticsearch Instances [21:54:27] ACKNOWLEDGEMENT - Host search27 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn moved by Chris. see #5883: Servers for CirrusSearchs Elasticsearch Instances [21:54:27] ACKNOWLEDGEMENT - Host search28 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn moved by Chris. see #5883: Servers for CirrusSearchs Elasticsearch Instances [21:54:45] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: No route to host [21:55:30] mutante: assuming that's you as well ? [21:55:36] no, i'm not [21:56:23] that change isnt merged [21:56:55] PROBLEM - Host search29 is DOWN: PING CRITICAL - Packet loss = 100% [21:57:00] leslicarr: taking down tampa search to move...got the ok from mark but it appears that we need to wait [21:57:14] lesliecarr [21:57:15] <^d> They're not being used. [21:57:22] <^d> They just need to have their monitoring shut off too ;-) [21:57:41] <^d> And removed from pybal [21:58:02] ok [21:59:30] <^d> Yeah, just verified, none of the wikis point to search in tampa. 
[21:59:40] <^d> :) [22:00:20] !log removing search21-36 from pybal [22:00:33] Logged the message, Master [22:02:04] (03PS1) 10Chad: Remove old tampa search config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92016 [22:02:22] (03CR) 10jenkins-bot: [V: 04-1] Remove old tampa search config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92016 (owner: 10Chad) [22:02:45] PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection refused [22:03:37] https://wikitech.wikimedia.org/wiki/Server_Lifecycle [22:03:50] https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reclaim_or_Decommission [22:06:09] PROBLEM - LVS Lucene on search-pool3.svc.pmtpa.wmnet is CRITICAL: Connection refused [22:07:34] ACKNOWLEDGEMENT - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection refused daniel_zahn shut down by Chris - RT #5883 [22:07:38] ACKNOWLEDGEMENT - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection refused daniel_zahn shut down by Chris - RT #5883 [22:07:41] ACKNOWLEDGEMENT - LVS Lucene on search-pool3.svc.pmtpa.wmnet is CRITICAL: Connection refused daniel_zahn shut down by Chris - RT #5883 [22:07:44] (03PS1) 10Chad: Shut down search_pool[1-3] in pmtpa [operations/puppet] - 10https://gerrit.wikimedia.org/r/92017 [22:08:19] cmjohnson1: i disabled notifications for that [22:08:21] (03CR) 10Chad: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/92017 (owner: 10Chad) [22:08:23] the paging , you know [22:08:26] well, that just hella paged. [22:08:46] so yea, we shouldnt decom things like that in that order [22:08:56] ie: we need to disable notifications before we go turning off servers. 
[22:11:59] RECOVERY - Host search22 is UP: PING OK - Packet loss = 0%, RTA = 26.96 ms [22:12:09] RECOVERY - Host search23 is UP: PING OK - Packet loss = 0%, RTA = 27.17 ms [22:12:24] fyi: i powered them up [22:12:33] so here comes the recovery msgs [22:12:59] RECOVERY - Host search25 is UP: PING OK - Packet loss = 0%, RTA = 26.64 ms [22:13:09] RECOVERY - Host search24 is UP: PING OK - Packet loss = 0%, RTA = 28.58 ms [22:13:29] RECOVERY - Host search26 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [22:14:59] RECOVERY - Host search27 is UP: PING OK - Packet loss = 0%, RTA = 27.77 ms [22:15:09] PROBLEM - search indices - check lucene status page on search23 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:15:39] PROBLEM - search indices - check lucene status page on search26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:15:59] RECOVERY - search indices - check lucene status page on search23 is OK: HTTP OK: HTTP/1.1 200 OK - 269 bytes in 0.056 second response time [22:16:29] RECOVERY - search indices - check lucene status page on search26 is OK: HTTP OK: HTTP/1.1 200 OK - 207 bytes in 0.055 second response time [22:18:00] <^d> Yeah so we can totally decom them but RobH is right (and what I was trying to say :)) [22:20:31] (03PS1) 10Cmjohnson: adding search21-36 to decom.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/92020 [22:21:17] ^d I didn't think they were still being monitored...and didn't check ...bad cmjohnson1 [22:23:58] (03CR) 10Cmjohnson: [C: 032] adding search21-36 to decom.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/92020 (owner: 10Cmjohnson) [22:25:34] (03PS1) 10Cmjohnson: Revert "adding search21-36 to decom.pp" [operations/puppet] - 10https://gerrit.wikimedia.org/r/92021 [22:27:09] PROBLEM - NTP on search25 is CRITICAL: NTP CRITICAL: Offset unknown [22:29:07] (03CR) 10Cmjohnson: [C: 032] Revert "adding search21-36 to decom.pp" [operations/puppet] - 10https://gerrit.wikimedia.org/r/92021 (owner: 
10Cmjohnson) [22:32:09] RECOVERY - NTP on search25 is OK: NTP OK: Offset -0.0003695487976 secs [22:32:32] (03PS3) 10Chad: Switch to single Json object for gerrit's reviewer count query [operations/puppet] - 10https://gerrit.wikimedia.org/r/84743 (owner: 10QChris) [22:39:39] PROBLEM - RAID on cp1059 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [22:40:39] RECOVERY - RAID on cp1059 is OK: OK: no RAID installed [22:41:49] PROBLEM - Disk space on cp1059 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [22:42:53] (03PS1) 10Cmjohnson: adding search21-36 to decom.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/92025 [22:43:11] (03CR) 10Cmjohnson: [C: 032] adding search21-36 to decom.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/92025 (owner: 10Cmjohnson) [22:43:59] PROBLEM - DPKG on cp1059 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [22:44:59] PROBLEM - SSH on cp1059 is CRITICAL: Server answer: [22:45:59] RECOVERY - DPKG on cp1059 is OK: All packages OK [22:45:59] RECOVERY - SSH on cp1059 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [22:50:49] PROBLEM - Disk space on cp1059 is CRITICAL: DISK CRITICAL - free space: /srv/sda3 7440 MB (2% inode=99%): /srv/sdb3 7941 MB (2% inode=99%): [22:52:49] PROBLEM - Disk space on cp1059 is CRITICAL: DISK CRITICAL - free space: /srv/sda3 7440 MB (2% inode=99%): /srv/sdb3 7941 MB (2% inode=99%): [22:54:59] PROBLEM - SSH on cp1059 is CRITICAL: Server answer: [22:55:59] RECOVERY - SSH on cp1059 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:05:34] PROBLEM - Puppet freshness on search30 is CRITICAL: No successful Puppet run in the last 10 hours [23:05:44] PROBLEM - Puppet freshness on search32 is CRITICAL: No successful Puppet run in the last 10 hours [23:05:54] PROBLEM - Puppet freshness on search23 is CRITICAL: No successful Puppet run in the last 10 hours [23:06:04] PROBLEM - Puppet freshness on search24 is CRITICAL: 
No successful Puppet run in the last 10 hours [23:06:14] PROBLEM - Puppet freshness on search26 is CRITICAL: No successful Puppet run in the last 10 hours [23:06:24] PROBLEM - Puppet freshness on search27 is CRITICAL: No successful Puppet run in the last 10 hours [23:34:17] (03PS1) 10Aaron Schulz: Switched to JobQueueFederated [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92032 [23:34:43] (03CR) 10Aaron Schulz: [C: 04-1] Switched to JobQueueFederated [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92032 (owner: 10Aaron Schulz) [23:42:11] (03PS1) 10Ori.livneh: Apply misc::monitoring::view::mobile on nickel [operations/puppet] - 10https://gerrit.wikimedia.org/r/92035 [23:44:01] !log aaron synchronized php-1.22wmf22/extensions/SwiftCloudFiles '2661b2d67a0bffb6f3a6bd680114a5e7acec5994' [23:44:14] Logged the message, Master [23:45:55] RECOVERY - Puppet freshness on search30 is OK: puppet ran at Fri Oct 25 23:45:53 UTC 2013 [23:46:34] PROBLEM - Puppet freshness on search30 is CRITICAL: No successful Puppet run in the last 10 hours [23:53:04] RECOVERY - Puppet freshness on search32 is OK: puppet ran at Fri Oct 25 23:53:00 UTC 2013 [23:53:14] RECOVERY - Puppet freshness on search27 is OK: puppet ran at Fri Oct 25 23:53:05 UTC 2013 [23:53:24] PROBLEM - Puppet freshness on search27 is CRITICAL: No successful Puppet run in the last 10 hours [23:53:44] PROBLEM - Puppet freshness on search32 is CRITICAL: No successful Puppet run in the last 10 hours [23:55:54] RECOVERY - Puppet freshness on search23 is OK: puppet ran at Fri Oct 25 23:55:46 UTC 2013 [23:55:54] PROBLEM - Puppet freshness on search23 is CRITICAL: No successful Puppet run in the last 10 hours [23:58:54] RECOVERY - Puppet freshness on search26 is OK: puppet ran at Fri Oct 25 23:58:47 UTC 2013 [23:59:12] (03CR) 10TTO: "recheck" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91344 (owner: 10Legoktm) [23:59:14] PROBLEM - Puppet freshness on search26 is CRITICAL: No 
successful Puppet run in the last 10 hours [23:59:54] RECOVERY - Puppet freshness on search24 is OK: puppet ran at Fri Oct 25 23:59:48 UTC 2013