[00:02:29] greg-g: Hey, need a follow-up to the SWAT deploy – either enabling Wikinews as a wiki family in Parsoid or a revert of the config change. Would prefer the former. [00:02:41] … and he's not around. Boo. [00:02:52] ori: You around? [00:03:02] Just do it and don't break stuff :P [00:04:04] James_F, let me see what I can do [00:04:15] it'll be a temporary patch [00:04:21] RoanKattouw: ^^^ [00:04:30] (Thanks!) [00:09:11] !log enabled wikinews family in Parsoid with temporary live patch to un-break VE deploy [00:09:17] Logged the message, Master [00:10:21] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [00:10:42] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [00:13:37] gwicke: Thanks. [00:33:51] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [00:38:21] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 108.433334 [00:40:28] (03PS1) 10Ori.livneh: beta cluster: un-split-brain memcached config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125926 [00:40:51] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 127.76667 [00:49:11] PROBLEM - MySQL InnoDB on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:50:01] RECOVERY - MySQL InnoDB on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [00:51:55] James_F|Away: hey, sorry, I've been out today due to migraine/etc, hope it all worked out (haven't read scroll back, just the ping) [01:00:13] greg-g: it worked out; gwicke deployed a temp fix [01:02:21] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [01:05:56] ori: thankya [01:06:42] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [01:08:24] * greg-g goes back to not really looking at a screen [01:08:25] greg-g, was a trivial config change to make VE work on wikinews too [01:15:11] (03PS1) 10Ori.livneh: Embed $wgCopyrightIcon [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125927 [01:15:25] AaronSchulz: ^ what do you think? [01:16:22] I'm not sure about it. It would mean one request fewer, but at the cost of a slight increase of page size [01:16:26] ori: Better hard code? This sounds like one of these things which at some point break... [01:17:23] hoo: hasn't CommonSettings.php suffered enough? you want to add base64-encoded binary data now, too? :) [01:17:38] hoo: joking aside, that's worth thinking about, but I'd like to figure out first whether it's worth doing at all [01:18:19] <^d> That feels icky. [01:18:54] I guess it changes rarely so that's not an issue but I'd be curious what the payload increase is [01:18:56] ori: How much data is it? [01:19:43] the png file is 2426 bytes, base64 encoding swells that to 3258, and gzip brings it back down to 2515 [01:19:49] <^d> Long as we're caching the icon well enough I'd agree that it doesn't change often enough to be worth removing the request at the cost of page size. [01:20:25] <^d> Otherwise why not just embed all images as base64 and then we have WAY LESS requests? ;-) [01:20:36] ^d: you joke, but..
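(For reference, ori's size figures at [01:19:43] are easy to reproduce from a shell; a minimal sketch, assuming the icon sits at a hypothetical local path:)

    stat -c%s copyright.png                  # raw PNG size (2426 bytes in ori's case)
    base64 copyright.png | wc -c             # base64 swells it by roughly 4/3 (~3258)
    base64 copyright.png | gzip -9 | wc -c   # gzip claws most of that back (~2515)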
[01:21:07] IIRC yahoo found that something like 40% of page views are made by users with a cold cache [01:21:08] ori, keep in mind that requests will soon become cheaper [01:21:19] gwicke: w/SPDY, you mean [01:21:26] yeah, and HTTP 2 [01:21:40] "soon" [01:21:45] 67% of requests currently [01:21:58] gwicke: how did you get that figure? [01:22:16] http://caniuse.com/spdy [01:22:27] I stand corrected, it's 65.35% [01:23:15] we have spdy in labs already [01:23:16] ori: Could we add this to some wikimedia specific extension and then embed it into CSS? [01:23:33] hoo: why? [01:23:44] ori: To not have it within the page body? [01:23:44] gwicke: i'm with you, but spdy is not popular with ops [01:23:51] ^d: migrated my testwiki's files to saio and it seems to work fine [01:23:59] * AaronSchulz hugs copyFileBackend.php [01:24:16] ori, I think we can work that out [01:24:23] currently it's blocked on nginx upgrade [01:26:03] there is a nice video illustrating the speedup at https://code.google.com/p/mod-spdy/ [01:26:08] gwicke: if it's true that SPDY is within reach, then it's true that it's better not to embed it [01:26:11] better than those cold numbers ;) [01:28:29] <^d> gwicke: caniuse is a huge listing of technologies I don't use :p [01:28:44] <^d> When I have to do something facing a user I still use <TABLE> and so forth ;-) [01:29:06] hehe, uppercase even ;) [01:29:11] those were the days [01:29:13] <^d> Javascript? Pshaw. [01:29:36] <^d> CSS? Just UI extras I don't need. [01:29:41] more seriously, whenever you are using a google or fb service you are already using spdy [01:30:22] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 44.599998 [01:30:40] <^d> gwicke: You dunno that. I could be one of those Firefox 3.x holdouts ;-) [01:30:57] hehe, or Safari [01:31:16] <^d> Safari is good for one thing...downloading a new browser on a fresh Mac. [01:31:20] <^d> Fuck Safari. [01:31:25] <^d> :) [01:32:04] another case where safari excels: http://caniuse.com/nav-timing [01:33:14] gotta run, see you later! [01:33:20] * ori waves [01:35:01] ori: What I meant above is loading it via RL so that it can be cached...
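(hoo's suggestion at [01:23:16] and [01:35:01] amounts to shipping the icon as a data URI inside a ResourceLoader-served stylesheet, so it rides the long-lived CSS cache instead of every page body. A hedged sketch of generating such a rule; the file and class names are illustrative, not the actual proposal:)

    printf '.copyright-icon { background-image: url(data:image/png;base64,%s); }\n' \
        "$(base64 -w0 copyright.png)" > copyright-icon.css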
[01:35:04] anyway, good night [01:36:27] hoo: the .png is cached [01:36:42] and good night [01:36:59] ori: Sure, but not if it's in the page body [01:37:51] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 115.366669 [02:00:21] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [02:01:42] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [02:03:29] (03PS4) 10Springle: Remove prettify, seems to be unused [operations/software] - 10https://gerrit.wikimedia.org/r/118952 (owner: 10Reedy) [02:04:04] (03CR) 10Springle: [C: 032] Remove prettify, seems to be unused [operations/software] - 10https://gerrit.wikimedia.org/r/118952 (owner: 10Reedy) [02:04:21] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 351.799988 [02:04:51] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 798.666687 [02:05:21] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [02:05:51] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [02:09:11] (03CR) 10Springle: [C: 031] "maintain-replicas.pl is Coren's domain. Upstream, rc_source is replicating from the sanitarium and ready to go." [operations/software] - 10https://gerrit.wikimedia.org/r/125369 (owner: 10Aude) [02:12:42] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 2887 MB (3% inode=99%): [02:15:24] (03CR) 10Springle: [C: 031] "Do we need (or have we had) any discussion with Legal about exposing these tables in labs?" [operations/software] - 10https://gerrit.wikimedia.org/r/118582 (owner: 10Aude) [02:16:25] (03PS3) 10Springle: Fixup whitespace of jOrgChart.js [operations/software] - 10https://gerrit.wikimedia.org/r/118953 (owner: 10Reedy) [02:17:04] (03CR) 10Springle: [C: 032] Fixup whitespace of jOrgChart.js [operations/software] - 10https://gerrit.wikimedia.org/r/118953 (owner: 10Reedy) [02:18:42] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3299 MB (3% inode=99%): [02:22:56] !log LocalisationUpdate completed (1.23wmf21) at 2014-04-15 02:22:54+00:00 [02:23:03] Logged the message, Master [02:42:50] !log LocalisationUpdate completed (1.23wmf22) at 2014-04-15 02:42:48+00:00 [02:42:54] Logged the message, Master [03:00:42] RECOVERY - Disk space on virt0 is OK: DISK OK [03:25:57] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Apr 15 03:25:52 UTC 2014 (duration 25m 51s) [03:26:03] Logged the message, Master [04:34:53] (03CR) 10Aaron Schulz: "I wonder if the cache hit rate on this is more than half the time. On the one hand this can reduce bandwidth usage by avoiding the packet" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125927 (owner: 10Ori.livneh) [05:26:25] (03CR) 10Chad: Enhanced recent changes: explicitly disable by default (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124292 (owner: 10Nemo bis) [05:40:09] :D this is awesome: https://github.com/jordansissel/fpm [05:40:11] (03PS1) 10ArielGlenn: subversion role lint, pass svn hostname to cert check [operations/puppet] - 10https://gerrit.wikimedia.org/r/125944 [05:42:37] wow werdna, totally going to check that out!
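(fpm, linked at [05:40:09], builds a native package from a plain directory in one command; a hypothetical example of the kind of invocation that earns the "awesome", with made-up names and paths:)

    fpm -s dir -t deb -n myapp -v 1.0 --prefix /opt/myapp ./build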
[05:43:16] (03CR) 10ArielGlenn: [C: 032] subversion role lint, pass svn hostname to cert check [operations/puppet] - 10https://gerrit.wikimedia.org/r/125944 (owner: 10ArielGlenn) [05:48:50] apergos: trying to puppetise configuration on CentOS 5 [05:48:59] which is a PITA because there is no CentOS 5 package for node because it's too old [05:49:09] god knows why I ended up with a CentOS 5 server to maintain [05:49:16] that's your first mistake... [05:49:24] but puppetising seems like a nice way to upgrade smoothly [05:50:24] (03CR) 10Nemo bis: Enhanced recent changes: explicitly disable by default (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124292 (owner: 10Nemo bis) [05:50:40] already at 6.5.. hm [06:12:32] (03CR) 10ArielGlenn: [C: 032] Revert "Revert "decom: remove brewster incl. mgmt"" [operations/dns] - 10https://gerrit.wikimedia.org/r/125753 (owner: 10Dzahn) [06:18:49] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [06:20:23] (03Abandoned) 10Dzahn: Revert "decom : brewster" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125748 (owner: 10Reedy) [06:22:49] (03CR) 10Dzahn: "nice! this fixed the last unknown in icinga :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125944 (owner: 10ArielGlenn) [06:36:49] PROBLEM - mysqld processes on db73 is CRITICAL: PROCS CRITICAL: 2 processes with command name mysqld [06:37:31] heh [06:37:39] (03CR) 10ArielGlenn: "noticed that 'apaches' in pmtpa dns manifests resolves to the same as appservers, and there is also an 'apaches2' which I have no idea wha" [operations/dns] - 10https://gerrit.wikimedia.org/r/120063 (owner: 10Dzahn) [06:40:17] (03CR) 10Dzahn: "good point. apaches2 -> 10.0.5.5 . 100% packet loss" [operations/dns] - 10https://gerrit.wikimedia.org/r/120063 (owner: 10Dzahn) [06:41:50] springle: thanks for letting us know you saw.. waves [06:46:13] (03PS7) 10Dzahn: decom Tampa: remove service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/120063 [06:47:11] !log moving pmtpa m1 and x1 slaves to db73 and db69 on 12th floor [06:47:17] Logged the message, Master [06:47:48] (03CR) 10Dzahn: "how about this. remove "apaches2" and switch "apaches" from the IP of appservers.svc.pmtpa.wmnet to appservers.svc.eqiad.wmnet (10.2.1.1 -" [operations/dns] - 10https://gerrit.wikimedia.org/r/120063 (owner: 10Dzahn) [06:49:41] mutante: re db67, i've backed up analytics data and am about to give one final warning to the list. then we can decom [06:51:15] springle: great! :) [06:51:40] then we can as well use that change later that removes multiple at once [06:51:51] yay [06:51:51] yep [06:51:53] thanks for the heads up [06:53:30] (03PS8) 10Dzahn: decom Tampa: remove service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/120063 [06:56:32] (03CR) 10Dzahn: "or kill "apaches" altogether and create "hhvm" :p" [operations/dns] - 10https://gerrit.wikimedia.org/r/120063 (owner: 10Dzahn) [06:59:34] (03PS1) 10Dzahn: remove dbdump.pmtpa.wmnet [operations/dns] - 10https://gerrit.wikimedia.org/r/125948 [07:00:53] <_joe_> springle: hi man [07:01:08] hey _joe_ [07:01:36] (03CR) 10ArielGlenn: "if apaches points to tampa (= nothing running now), why keep it at all?" [operations/dns] - 10https://gerrit.wikimedia.org/r/120063 (owner: 10Dzahn) [07:02:22] !log shutdown db67 for decom. 
analytics data is backed up on dbstore1002 [07:02:27] Logged the message, Master [07:02:45] (03PS1) 10Dzahn: remove imagedump.pmtpa.wmnet [operations/dns] - 10https://gerrit.wikimedia.org/r/125949 [07:04:25] (03CR) 10Chad: Enhanced recent changes: explicitly disable by default (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124292 (owner: 10Nemo bis) [07:07:11] (03PS1) 10Dzahn: remove arptest [operations/dns] - 10https://gerrit.wikimedia.org/r/125950 [07:10:34] (03CR) 10Springle: [C: 031] "Never heard of it :-)" [operations/dns] - 10https://gerrit.wikimedia.org/r/125948 (owner: 10Dzahn) [07:12:38] (03PS9) 10Dzahn: decom Tampa: remove service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/120063 [07:16:01] (03PS1) 10Dzahn: remove syslog service IP (Tampa) [operations/dns] - 10https://gerrit.wikimedia.org/r/125952 [07:16:38] (03CR) 10Giuseppe Lavagetto: "Repackaging python-urllib3 is probably trivial. However since all the urllib3-specific functionalities are useless to us at the moment, I'" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125726 (owner: 10Giuseppe Lavagetto) [07:20:17] (03CR) 10Dzahn: "this removed a blocker on the dobson/linne decom tickets. thanks" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125185 (owner: 10BBlack) [07:26:40] (03CR) 10ArielGlenn: [C: 031] decom Tampa: remove service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/120063 (owner: 10Dzahn) [07:39:20] (03CR) 10Springle: "db67 is backed-up, shutdown and ready to go." [operations/puppet] - 10https://gerrit.wikimedia.org/r/122406 (owner: 10Dzahn) [07:39:32] (03CR) 10Springle: [C: 031] remove db67 from coredb,decom db64,db65,db66,db70 [operations/puppet] - 10https://gerrit.wikimedia.org/r/122406 (owner: 10Dzahn) [07:39:56] (03CR) 10Springle: [C: 031] "db67 is backed-up, shutdown and ready to go." [operations/dns] - 10https://gerrit.wikimedia.org/r/122412 (owner: 10Matanya) [07:41:08] (03PS1) 10ArielGlenn: add ntp servers on eeden.esams, rubidium (rt #7101) [operations/puppet] - 10https://gerrit.wikimedia.org/r/125954 [07:41:23] (03CR) 10Dzahn: "looks like we should also get this for production some time" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125888 (owner: 10BryanDavis) [07:41:56] (03CR) 10Dzahn: "also see Change-Id: Ia351ef7e997" [operations/puppet] - 10https://gerrit.wikimedia.org/r/123852 (owner: 10Dzahn) [07:42:15] (03CR) 10jenkins-bot: [V: 04-1] add ntp servers on eeden.esams, rubidium (rt #7101) [operations/puppet] - 10https://gerrit.wikimedia.org/r/125954 (owner: 10ArielGlenn) [07:44:56] (03CR) 10Dzahn: "some are "WARNING: kernel parameter(s) net.ipv4.conf.lo_LVS.rp_filter have unexpected value(s). "" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125117 (owner: 10Alexandros Kosiaris) [07:45:01] (03PS2) 10ArielGlenn: add ntp servers on eeden.esams, rubidium (rt #7101) [operations/puppet] - 10https://gerrit.wikimedia.org/r/125954 [07:46:41] (03CR) 10Dzahn: [C: 031] Adding ensure parameter to varnish::logging [operations/puppet] - 10https://gerrit.wikimedia.org/r/125742 (owner: 10Ottomata) [07:47:54] (03CR) 10Dzahn: [C: 031] "all for it since this and 2 others should unblock RT #6143" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125743 (owner: 10Ottomata) [07:48:58] (03CR) 10Dzahn: [C: 031] "8418-8441 tcp,udp Unassigned" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125744 (owner: 10Ottomata) [07:50:10] (03CR) 10Dzahn: "does it need a ferm change?" 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/125744 (owner: 10Ottomata) [07:57:55] (03CR) 10Dzahn: [C: 031] "yay for resolving #7101. rubidium and the code sounds reasonable" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125954 (owner: 10ArielGlenn) [08:18:01] (03PS1) 10Dzahn: add monitoring for Ubuntu mirror being in sync [operations/puppet] - 10https://gerrit.wikimedia.org/r/125956 [08:20:00] (03CR) 10Dzahn: "see Change-Id: I6024b3b57906 for actually using the script now to resolve RT #3793" [operations/puppet] - 10https://gerrit.wikimedia.org/r/112738 (owner: 10Matanya) [08:22:08] (03CR) 10Dzahn: "on carbon: /tmp/check_apt_mirror hostname: carbon upstreamdate: 2014-04-14 23:30:40 UTC t1: 1397518240 t2: 1397550087" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125956 (owner: 10Dzahn) [08:29:06] (03CR) 10Dzahn: [C: 031] "awesome! and it's also correct that db63 stays in s1, right" [operations/puppet] - 10https://gerrit.wikimedia.org/r/122406 (owner: 10Dzahn) [08:29:14] (03PS3) 10Dzahn: remove db67 from coredb,decom db64,db65,db66,db70 [operations/puppet] - 10https://gerrit.wikimedia.org/r/122406 [08:29:45] springle: just rebasing..we can do anytime incl. the config change? [08:30:14] not touching role/coredb and i would have merged already :) [08:35:22] (03CR) 10Dzahn: [C: 031] "heh, apergos, this was my Google result https://wikitech.wikimedia.org/wiki/User:ArielGlenn/Server_cleanup" [operations/dns] - 10https://gerrit.wikimedia.org/r/125948 (owner: 10Dzahn) [08:36:34] (03CR) 10Alexandros Kosiaris: [C: 032] add ntp servers on eeden.esams, rubidium (rt #7101) [operations/puppet] - 10https://gerrit.wikimedia.org/r/125954 (owner: 10ArielGlenn) [08:36:52] (03CR) 10Dzahn: [C: 031] "https://wikitech.wikimedia.org/wiki/Dumps/Image_dumps_plans_2012 ? --> https://wikitech.wikimedia.org/wiki/User:ArielGlenn/Server_cleanu" [operations/dns] - 10https://gerrit.wikimedia.org/r/125949 (owner: 10Dzahn) [08:38:04] I need to update that page, been letting it rot for awhile now [08:39:08] (03CR) 10Dzahn: [C: 031] "already had +1 from dba but the constant need for rebasing removes these..it's a bummer" [operations/puppet] - 10https://gerrit.wikimedia.org/r/122406 (owner: 10Dzahn) [08:40:42] (03CR) 10Dzahn: [C: 031] "this also appears in https://wikitech.wikimedia.org/wiki/User:ArielGlenn/Server_cleanup#Hosts_in_dns_and_not_in_dhcp" [operations/dns] - 10https://gerrit.wikimedia.org/r/125950 (owner: 10Dzahn) [08:51:34] mutante: what constant need for rebasing ? you only need to rebase once, right before committing [08:55:13] akosiaris: fair, but it becomes larger and possibly more annoying to fix over time [08:56:31] yeah, but don't you get bored of constantly fixing it ? [09:00:49] shrug..something in between never fixing it and ...ok.. "constant need" is wrong [09:24:50] (03PS2) 10Giuseppe Lavagetto: Substituting the check_graphite script. [operations/puppet] - 10https://gerrit.wikimedia.org/r/125726 [09:33:29] mutante: re db67, yes, anytime. go ahead [09:34:43] (03CR) 10Dzahn: [C: 032] remove db67 from coredb,decom db64,db65,db66,db70 [operations/puppet] - 10https://gerrit.wikimedia.org/r/122406 (owner: 10Dzahn) [09:37:18] springle: done, db64,65,66,70, do you want to shut them down? [09:37:32] they are still responding [09:37:37] ok [09:38:28] checks monitoring [09:38:50] actually, wait [09:39:00] ?
[09:39:02] puppet stored configs, cleaning that [09:39:41] let me clean it up from icinga first [09:40:06] ok :) [09:40:25] i'll take it over from here, nvm [09:47:08] !log db64-67 - puppetstoredconfigclean.rb db${db}.pmtpa.wmnet ; puppetca --clean db${db}.pmtpa.wmnet ; salt-key -d db${db}.pmtpa.wmnet [09:47:14] Logged the message, Master [09:49:33] mutante: re db63 staying in s1. that's actually on 10th floor i think, and I have db60 down on 12th in s1. so let me check it out then we can decom either 60 or 63, but not both [09:50:04] * springle got a bit mixed up [09:51:06] ah, ok, yea, i also got a bit mixed up about db64/db67 being in icinga [09:51:21] but removing now [09:51:34] pending neon puppet runs [09:52:43] springle: thanks, maybe we should check with chris as well (list of stuff to go to 12th) [09:54:06] cmjohnson1: hello [10:05:53] springle: db70 was zombie. still asks for a password but not keys and console output �#!])a���a~�c!c#�a��c [10:07:23] !log db70 - powerdown via mgmt [10:07:28] Logged the message, Master [10:07:39] (03CR) 10Mark Bergsma: [C: 031] "swift can be removed also, it's not active and the content is stale." [operations/dns] - 10https://gerrit.wikimedia.org/r/120063 (owner: 10Dzahn) [10:10:26] yay [10:47:12] (03PS10) 10Dzahn: decom Tampa: remove service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/120063 [10:51:54] !log db65,db66 - shutdown [10:51:59] Logged the message, Master [10:55:02] !log db64,db67 - powerdown via mgmt [10:55:08] Logged the message, Master [11:00:39] RECOVERY - DPKG on ms-be1 is OK: All packages OK [11:00:40] (03PS2) 10Dzahn: db: remove 64-67 and 70 [operations/dns] - 10https://gerrit.wikimedia.org/r/122412 (owner: 10Matanya) [11:01:09] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.000277777777778 [11:05:28] PROBLEM - Host db65 is DOWN: PING CRITICAL - Packet loss = 100% [11:05:28] PROBLEM - Host db66 is DOWN: PING CRITICAL - Packet loss = 100% [11:08:58] ACKNOWLEDGEMENT - Host db65 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn grr. scheduled downtime gone [11:09:29] ACKNOWLEDGEMENT - Host db66 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT #7129 [11:14:00] apergos: have you seen dataset1001's diskspace alert? [11:15:24] (03PS2) 10TTO: Remove useless "confirmed" permission assignments [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/116059 [11:18:06] (03PS1) 10ArielGlenn: show what's backed up in motd (rt #469) [operations/puppet] - 10https://gerrit.wikimedia.org/r/125971 [11:18:15] paravoid: no [11:18:26] I'll have a look in just a moment [11:18:28] :) [11:21:25] the disk space is fine, that threshold should be lowered to 3%; the raid alert concerns me though [11:21:30] * apergos files a ticket [11:22:00] no I don't, chris got there well before me [11:25:09] (03CR) 10Mark Bergsma: "What would that output look like, especially with multiple directories? Can you give an example?"
[operations/puppet] - 10https://gerrit.wikimedia.org/r/125971 (owner: 10ArielGlenn) [11:25:11] (03CR) 10Dzahn: [C: 032] "re-removed from monitoring,salt/puppet,shutdown" [operations/dns] - 10https://gerrit.wikimedia.org/r/122412 (owner: 10Matanya) [11:26:30] !log DNS update - remove db64,db65,db66,db67,db70 [11:26:35] Logged the message, Master [11:28:58] (03CR) 10Dzahn: [C: 032] remove dbdump.pmtpa.wmnet [operations/dns] - 10https://gerrit.wikimedia.org/r/125948 (owner: 10Dzahn) [11:30:15] !log DNS update - removed dbdump.pmtpa.wmnet [11:30:20] Logged the message, Master [11:30:50] (03CR) 10ArielGlenn: "Ah ha, I thought backup::set needed to be called once per directory, I missed the statistics example." [operations/puppet] - 10https://gerrit.wikimedia.org/r/125971 (owner: 10ArielGlenn) [11:32:45] (03CR) 10Dzahn: "10.0.5.1 - Nmap done: 1 IP address (0 hosts up) scanned" [operations/dns] - 10https://gerrit.wikimedia.org/r/125948 (owner: 10Dzahn) [11:35:29] (03CR) 10Dzahn: "as suggested by Alex we should replace the plugin with an existing one, see https://wiki.ubuntu.com/Mirrors/Monitoring%20Scripts this cha" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125956 (owner: 10Dzahn) [11:37:01] (03CR) 10Dzahn: "Matanya, thanks for the contribution but we'd like to replace it with Non-HTTP-based script from here: https://wiki.ubuntu.com/Mirrors/Mon" [operations/puppet] - 10https://gerrit.wikimedia.org/r/112738 (owner: 10Matanya) [11:37:49] (03CR) 10ArielGlenn: "It behaves ok with multiple directories, one entry per line; note that the names in these entries have - instead of / if we care about tha" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125971 (owner: 10ArielGlenn) [11:41:18] MaxSem: hey [11:41:27] yo [11:48:45] matanya: I (finally) have some time for code review… if there's anything you'd like me to prioritize please let me know.
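(The check_apt_mirror output quoted at [08:22:08] works out to t2 - t1 = 31847 seconds, i.e. the mirror was roughly nine hours behind upstream. The core of such a check is just comparing the mirror's trace timestamp against the clock; a sketch of the idea with an illustrative trace-file path and threshold, not the deployed script:)

    t1=$(date -d "$(head -1 /srv/mirrors/ubuntu/project/trace/upstream)" +%s)  # last upstream sync
    t2=$(date +%s)                                                             # now
    lag=$((t2 - t1))
    [ "$lag" -gt 86400 ] && echo "CRITICAL: mirror ${lag}s behind" || echo "OK: ${lag}s behind"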
[11:56:38] <_joe_> also, I can code-review as well as long as it's puppet and in particular if it's on 2.x => 3 migration [12:01:06] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [12:07:24] (03PS1) 10Dzahn: decom db78 (was once fundraising db) [operations/puppet] - 10https://gerrit.wikimedia.org/r/125974 [12:10:43] (03CR) 10Dzahn: "not in DNS since I4f22cde90915" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125974 (owner: 10Dzahn) [12:14:43] (03CR) 10Dzahn: [C: 032] "racadm serveraction powerdown" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125974 (owner: 10Dzahn) [12:15:12] (03PS5) 10Hashar: Puppet masters in labs needs puppet::self::geoip [operations/puppet] - 10https://gerrit.wikimedia.org/r/121677 [12:15:43] (03PS2) 10Hashar: contint::slave-scripts recurse submodules [operations/puppet] - 10https://gerrit.wikimedia.org/r/122342 [12:16:17] (03PS3) 10Hashar: contint: get rid of misc::pbuilder on slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/122707 [12:16:35] (03PS4) 10Hashar: contint: directory to hold debian-glue packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/122712 [12:16:37] grrr [12:16:42] gerrit does not like me today [12:18:38] (03PS2) 10Alexandros Kosiaris: add monitoring for Ubuntu mirror being in sync [operations/puppet] - 10https://gerrit.wikimedia.org/r/125956 (owner: 10Dzahn) [12:19:46] !log adding ms-be101[345] to Swift eqiad's rings, at 33% weight; old rings kept at ms-fe1001:~/swift-2014-04-14 [12:19:52] Logged the message, Master [12:19:53] (03PS2) 10Hashar: contint: get composer on Jenkins slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/124305 [12:20:02] apergos: ^^^ [12:20:12] apergos: fyi, if something bad happens :) [12:20:18] ok, thank for the heads up! [12:20:40] (03CR) 10Hashar: [C: 031 V: 032] "Rebased on top of https://gerrit.wikimedia.org/r/#/c/122342/ which enable submodules for all contint repositories processed by git::clone(" [operations/puppet] - 10https://gerrit.wikimedia.org/r/124305 (owner: 10Hashar) [12:20:48] mutante: I think #125956 is good to go, wanna have a look ? 
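(On the "33% weight" in the !log at [12:19:46]: new Swift backends enter the ring at a fraction of their target weight and get ramped up as rebalancing completes; per the exchange just below, the old weights were rescaled to 2000 and the new 3T boxes start at 1000 on their way to 3000. A hedged sketch of the pattern with illustrative zone, IP, and device values, not the exact commands run:)

    swift-ring-builder object.builder add z4-10.64.0.201:6000/sdb1 1000   # 33% of the eventual 3000
    swift-ring-builder object.builder rebalance
    # later, once replication settles, ramp it up:
    # swift-ring-builder object.builder set_weight z4-10.64.0.201:6000/sdb1 3000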
[12:21:42] apergos: btw, since the new boxes are 3T disks, I've switched the old weight "100" to "2000" quite a few weeks back [12:21:54] so the new ones will eventually have weight 3000, but now are at 1000 [12:22:17] gotcha [12:22:54] (03CR) 10Dzahn: "just update the comment and commit message that it is the new script and where it comes from, and just one little tiny tab" (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/125956 (owner: 10Dzahn) [12:26:15] (03CR) 10Hashar: [C: 031 V: 032] "rebased" [operations/puppet] - 10https://gerrit.wikimedia.org/r/122342 (owner: 10Hashar) [12:26:22] (03CR) 10Hashar: [C: 031 V: 032] "rebased" [operations/puppet] - 10https://gerrit.wikimedia.org/r/122707 (owner: 10Hashar) [12:26:30] (03CR) 10Hashar: [C: 031 V: 032] "rebased" [operations/puppet] - 10https://gerrit.wikimedia.org/r/122712 (owner: 10Hashar) [12:26:49] (03CR) 10Hashar: [C: 031 V: 032] "rebased" [operations/puppet] - 10https://gerrit.wikimedia.org/r/121677 (owner: 10Hashar) [12:27:42] (03PS3) 10Dzahn: add monitoring for Ubuntu mirror being in sync [operations/puppet] - 10https://gerrit.wikimedia.org/r/125956 [12:28:05] mutante: thanks for the ubuntu mirror monitoring :] [12:28:46] (03PS4) 10Dzahn: add monitoring for Ubuntu mirror being in sync [operations/puppet] - 10https://gerrit.wikimedia.org/r/125956 [12:29:14] hashar: matanya triggered it by adding a check, pretty old ticket, yea [12:29:22] mutante: can you walk me through decommissioning a host? [12:29:33] Or, I guess I could read the docs :) [12:29:40] mutante: there is a status page at https://launchpad.net/ubuntu/+mirror/ubuntu.wikimedia.org-archive might be worth linking to it [12:30:04] mutante: it provides a nice dashboard, showing for example that our Trusty mirrors are lagged by a week. [12:30:36] (03PS5) 10Alexandros Kosiaris: add monitoring for Ubuntu mirror being in sync [operations/puppet] - 10https://gerrit.wikimedia.org/r/125956 (owner: 10Dzahn) [12:30:52] andrewbogott: start by making patches that remove them from puppet repo and dns repo? [12:31:52] hashar: yes, that is what alex pointed out as well and why he replaced the http based script [12:32:19] andrewbogott: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reclaim_or_Decommission [12:32:22] mutante: it provides a nice dashboard, showing for example that our Trusty mirrors are lagged by a week. [12:32:24] rrr [12:32:24] yep, reading [12:33:00] andrewbogott: the short version is "grep -r hostname *" in puppet repo and see what you find ...first [12:33:37] (03PS1) 10Andrew Bogott: Remove pmtpa compute servers from puppet.
[operations/puppet] - 10https://gerrit.wikimedia.org/r/125977 [12:33:43] then stop puppet agent and salt minion on hosts, remove from puppetstoredconfigs, run puppet on neon until it's gone from icinga [12:35:45] root@palladium:~# puppetstoredconfigclean.rb db66.pmtpa.wmnet [12:36:22] root@palladium:~# puppetca --clean db66.pmtpa.wmnet [12:36:30] root@palladium:~# salt-key -d db66.pmtpa.wmnet [12:36:36] andrewbogott: ^ examples for you [12:36:39] (03PS1) 10Andrew Bogott: Remove virt5-15 from dns [operations/dns] - 10https://gerrit.wikimedia.org/r/125978 [12:36:43] mutante: thanks [12:38:46] (03CR) 10Alexandros Kosiaris: [C: 032] add monitoring for Ubuntu mirror being in sync [operations/puppet] - 10https://gerrit.wikimedia.org/r/125956 (owner: 10Dzahn) [12:39:25] akosiaris: you beat me to it, same second :p [12:39:35] i hit review and it changes to merged, hah [12:39:42] (03PS1) 10Andrew Bogott: Remove virt12-15 from puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/125979 [12:40:45] (03CR) 10Dzahn: "just wondering how the +/- 0 for zookeeper is in here" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125977 (owner: 10Andrew Bogott) [12:41:07] :-) [12:42:53] (03CR) 10Hashar: beta: New script to restart apaches (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/125888 (owner: 10BryanDavis) [12:43:52] (03CR) 10Dzahn: [C: 031] "just do it after puppet removal and shutdown" [operations/dns] - 10https://gerrit.wikimedia.org/r/125978 (owner: 10Andrew Bogott) [12:45:05] (03CR) 10Dzahn: [C: 04-1] "also remove reverse records from 10.in-addr.arpa please" [operations/dns] - 10https://gerrit.wikimedia.org/r/125978 (owner: 10Andrew Bogott) [12:45:10] (03PS2) 10Andrew Bogott: Remove virt12-15 from puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/125979 [12:45:59] andrewbogott: for some reason you always touch zookeeper module? [12:46:11] mutante: probably just working on a stale branch… I'll fix. [12:46:19] Anyway, I'm going to set aside that bigger patch and just do one host for now [12:46:33] alright [12:50:41] stop puppet agent and salt minion on the box before revoking cert/key, or they will try to re-add themselves, and they will show as unaccepted cruft [12:52:26] (03PS3) 10Andrew Bogott: Remove virt12-15 from puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/125979 [12:52:51] ok, now… mutante, https://gerrit.wikimedia.org/r/#/c/125979/ look ok? [12:54:28] (03CR) 10Dzahn: [C: 032] Remove virt12-15 from puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/125979 (owner: 10Andrew Bogott) [12:55:34] (03CR) 10Dzahn: "now stop the puppet agent and run puppetstoredconfigclean.rb and puppet on neon and check they are out of Icinga" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125979 (owner: 10Andrew Bogott) [12:56:46] andrewbogott: you can either remove it completely before shutting it down or schedule a downtime/disable notifications and all that [12:57:11] Don't think I need to schedule… no one should notice. [12:57:18] famous last words :/ [12:57:26] (03CR) 10ArielGlenn: "maybe a change needs to go in backup::host as well?" 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/125971 (owner: 10ArielGlenn) [12:58:20] (03PS1) 10Reedy: Non Wikipedia to 1.23wmf22 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125983 [13:00:37] andrewbogott: depending on the order, sometimes it is already out of Icinga and then mysteriously comes back, that is if puppet agent still ran when you removed from storedconfigs and then neon.. [13:01:49] re-adds it on next run. s/mysteriously/can be confusing [13:05:16] mutante: virt12 isn't showing up in icinga anymore. So, next step is shutdown? [13:05:47] (03PS2) 10Hashar: gerrit: remove one replication to gallium [operations/puppet] - 10https://gerrit.wikimedia.org/r/122419 [13:05:48] andrewbogott: yes, shutdown [13:05:53] via OS or mgmt [13:06:00] (03CR) 10Hashar: "Added reference to bug 63937" [operations/puppet] - 10https://gerrit.wikimedia.org/r/122419 (owner: 10Hashar) [13:06:09] (03CR) 10jenkins-bot: [V: 04-1] gerrit: remove one replication to gallium [operations/puppet] - 10https://gerrit.wikimedia.org/r/122419 (owner: 10Hashar) [13:06:41] mutante: and then… $ puppetca clean virt12.pmtpa.wmnet [13:06:42] ? [13:07:23] andrewbogott: yes, puppetca --clean and salt-key -d [13:07:35] on palladium [13:07:39] (03PS3) 10Hashar: gerrit: remove one replication to gallium [operations/puppet] - 10https://gerrit.wikimedia.org/r/122419 [13:07:49] (03CR) 10Hashar: "rebased" [operations/puppet] - 10https://gerrit.wikimedia.org/r/122419 (owner: 10Hashar) [13:09:08] andrewbogott: i'd !log the shutdowns / removal from puppet [13:10:29] (03PS1) 10Andrew Bogott: Decom virt12 [operations/puppet] - 10https://gerrit.wikimedia.org/r/125984 [13:10:34] PROBLEM - Host virt12 is DOWN: PING CRITICAL - Packet loss = 100% [13:10:40] Dammit icinga! [13:10:47] andrewbogott: ^ and there you have the example of the re-appearing, heh [13:10:56] anyway… that decom patch correct? [13:11:03] (That'll shut up icinga, right?) [13:11:16] no [13:11:28] what will make it shut up is re-deleting from the storedconfigs thing [13:11:32] and running puppet again on neon [13:11:41] you don't need to do that patch [13:11:46] ok [13:12:04] puppetstoredconfigclean.rb is on palladium though, not neon, right? [13:12:12] yes, here i'll do it [13:12:19] root@palladium:~# puppetstoredconfigclean.rb virt12.pmtpa.wmnet [13:12:19] Killing virt12.pmtpa.wmnet...done. [13:12:25] thanks [13:12:33] if i repeat that now it says Can't find host virt12.pmtpa.wmnet. [13:12:35] * andrewbogott refreshes palladium [13:12:47] ! log shutdown and decommissioned virt12 [13:12:54] you need to run puppet on neon now [13:13:06] !log shutdown and decommissioned virt12 [13:13:11] Logged the message, Master [13:13:22] mutante, the decom patch is right? [13:13:46] adding to manifests/decommissioning.pp is not needed anymore [13:13:54] oh? ok. [13:13:58] or if it is i did things wrong all the time [13:14:11] never needed it to remove stuff from icinga [13:14:21] i think it's deprecated [13:15:24] oh, wait [13:15:43] * andrewbogott waits [13:15:46] "Inclusion in decommissioning.pp removes the host from puppet's storedconfig, but with a delay of several hours to a day, and not always reliably, so we don't rely on that mechanism any more. " [13:15:51] this is why [13:16:06] we just did that with the script [13:16:12] so we don't have to wait [13:16:28] it's the same as running puppetstoredconfigclean.rb [13:16:39] but that delay [13:17:20] ok, I thought it was sort of the same thing. Makes sense.
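(Putting mutante's walkthrough together, the per-host sequence that keeps icinga from resurrecting a decommissioned box looks roughly like this; a sketch reusing the host name from above, with the first command run on the host itself and the rest on palladium:)

    ssh virt12.pmtpa.wmnet 'service puppet stop; service salt-minion stop'  # stop agents so they cannot re-register
    puppetstoredconfigclean.rb virt12.pmtpa.wmnet   # purge stored configs (removes the icinga checks)
    puppetca --clean virt12.pmtpa.wmnet             # revoke the puppet certificate
    salt-key -d virt12.pmtpa.wmnet                  # drop the salt key
    # finally, run puppet on neon and confirm the host is gone from icinga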
[13:17:28] And I guess no need to keep things there for records? [13:18:04] (03CR) 10Dzahn: "Server Lifecycle page says we don't rely on this anymore. "Inclusion in decommissioning.pp removes the host from puppet's storedconfig, bu" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125984 (owner: 10Andrew Bogott) [13:18:44] since the page says we don't rely on it anymore, you can consider it optional way to do it or we can remove it entirely [13:19:25] +2 for remove entirely [13:19:26] did not use that to track decoms either, instead looked at dsh groups to see what is left [13:19:37] (that's a +2 without submit :-P) [13:22:09] andrewbogott: no, it's out of sync anyways [13:23:59] ok. I'm going to let this sit a bit and see how openstack handles the vanished node... [13:24:06] Will worry about wiping &c later today. [13:24:07] nods [13:24:21] Or is wiping left to the DC folks? [13:24:33] andrewbogott: after it's gone from puppet and dns and shutdown, then create followup-ticket in pmtpa [13:24:41] link that to the core ticket, resolve the core ticket [13:24:55] ok [13:24:55] on the pmtpa ticket tell people to wipe it etc [13:26:50] (03PS1) 10Dzahn: decom the decom script [operations/puppet] - 10https://gerrit.wikimedia.org/r/125986 [13:27:04] (03CR) 10Ottomata: "No, there's no ferm on this afaik." [operations/puppet] - 10https://gerrit.wikimedia.org/r/125744 (owner: 10Ottomata) [13:29:08] (03CR) 10ArielGlenn: [C: 031] "yay!" [operations/dns] - 10https://gerrit.wikimedia.org/r/120063 (owner: 10Dzahn) [13:32:15] (03PS2) 10Dzahn: decom the decom script [operations/puppet] - 10https://gerrit.wikimedia.org/r/125986 [13:32:33] wow [13:33:08] (03PS1) 10Alexandros Kosiaris: Revert "Introduce machine deleteme" [operations/dns] - 10https://gerrit.wikimedia.org/r/125987 [13:33:35] (03PS1) 10Alexandros Kosiaris: Revert "Adding host deleteme.wikimedia.org" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125988 [13:33:51] (03CR) 10Dzahn: [C: 04-1] "trying to remove this entirely to make decom'ing less confusing in I76c59c5f23647" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125984 (owner: 10Andrew Bogott) [13:34:06] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "Introduce machine deleteme" [operations/dns] - 10https://gerrit.wikimedia.org/r/125987 (owner: 10Alexandros Kosiaris) [13:37:59] (03CR) 10ArielGlenn: [C: 031] "I think it's high time for this file to go." [operations/puppet] - 10https://gerrit.wikimedia.org/r/125986 (owner: 10Dzahn) [13:38:43] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "Adding host deleteme.wikimedia.org" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125988 (owner: 10Alexandros Kosiaris) [13:41:12] RobH: WMF3409 released. [13:41:16] !log Repacking Gerrit replicated repositories on lanthanum and gallium (both under /srv/ssd/gerrit/ ) [13:41:21] Logged the message, Master [13:42:02] !log Command executed (as gerritslave user): find /srv/ssd/gerrit -type d -name '*.git' -exec bash -c 'echo; date; cd {}; echo; pwd; echo; git repack -ad; date;' \; [13:42:08] Logged the message, Master [13:44:30] (03PS3) 10Giuseppe Lavagetto: Substituting the check_graphite script. [operations/puppet] - 10https://gerrit.wikimedia.org/r/125726 [13:45:39] (03CR) 10jenkins-bot: [V: 04-1] Substituting the check_graphite script. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/125726 (owner: 10Giuseppe Lavagetto) [13:50:30] (03PS1) 10Alexandros Kosiaris: Introduce check to test if dhclient is running [operations/puppet] - 10https://gerrit.wikimedia.org/r/125989 [13:50:32] (03PS4) 10Giuseppe Lavagetto: Substituting the check_graphite script. [operations/puppet] - 10https://gerrit.wikimedia.org/r/125726 [13:51:56] !log Jenkins compressing console logs of builds. On gallium as user jenkins : find /var/lib/jenkins/jobs -wholename '*/builds/*/log' -type f -exec gzip --best {} \; [13:52:00] (03CR) 10jenkins-bot: [V: 04-1] Introduce check to test if dhclient is running [operations/puppet] - 10https://gerrit.wikimedia.org/r/125989 (owner: 10Alexandros Kosiaris) [13:52:02] Logged the message, Master [13:56:36] (03PS2) 10Alexandros Kosiaris: Introduce check to test if dhclient is running [operations/puppet] - 10https://gerrit.wikimedia.org/r/125989 [13:57:43] (03CR) 10jenkins-bot: [V: 04-1] Introduce check to test if dhclient is running [operations/puppet] - 10https://gerrit.wikimedia.org/r/125989 (owner: 10Alexandros Kosiaris) [14:00:22] (03PS1) 10Faidon Liambotis: Move ms-fe30xx/ms-be30xx to private1-esams [operations/dns] - 10https://gerrit.wikimedia.org/r/125990 [14:03:32] (03CR) 10Dzahn: "my question is though: why do we include "sudoers::appserver" in class bastionhost? It's not an appserver, and if it is meant to be for de" [operations/puppet] - 10https://gerrit.wikimedia.org/r/122399 (owner: 10Dzahn) [14:04:52] (03PS1) 10Hashar: contint: compress Jenkins console logs once per day [operations/puppet] - 10https://gerrit.wikimedia.org/r/125991 [14:04:54] (03CR) 10Dzahn: "i'm just asking that here because i would effectively add that on iron as well with this." [operations/puppet] - 10https://gerrit.wikimedia.org/r/122399 (owner: 10Dzahn) [14:05:35] (03CR) 10Hashar: "I have been running such a command from time to time to free up disk space on gallium (which host the Jenkins master)." [operations/puppet] - 10https://gerrit.wikimedia.org/r/125991 (owner: 10Hashar) [14:14:33] (03PS5) 10Giuseppe Lavagetto: Substituting the check_graphite script. [operations/puppet] - 10https://gerrit.wikimedia.org/r/125726 [14:14:42] (03PS3) 10Anomie: Add contact pages for legal to testwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/119873 (owner: 10Reedy) [14:15:09] (03CR) 10Anomie: [C: 04-1] "-1 because I presume Reedy still doesn't want these emails" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/119873 (owner: 10Reedy) [14:17:00] (03CR) 10Ottomata: "I'd merge this, but I want to be absolutely sure that it isn't going to break things for new and existing self hosted puppet masters. Has" [operations/puppet] - 10https://gerrit.wikimedia.org/r/121677 (owner: 10Hashar) [14:19:31] (03CR) 10Andrew Bogott: [C: 032] "I'll risk it :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/121677 (owner: 10Hashar) [14:19:36] (03CR) 10Hashar: "That was definitely breaking puppet master self when I wrote the patch a few weeks ago. 
I guess one can try booting an instance, convert " [operations/puppet] - 10https://gerrit.wikimedia.org/r/121677 (owner: 10Hashar) [14:19:41] (03PS1) 10Alexandros Kosiaris: rp_filter check tunings [operations/puppet] - 10https://gerrit.wikimedia.org/r/125995 [14:19:50] (03CR) 10Hashar: "Or merge :-]" [operations/puppet] - 10https://gerrit.wikimedia.org/r/121677 (owner: 10Hashar) [14:20:19] andrewbogott: ottomata: I am creating an instance and will apply the puppetmaster class to it to confirm [14:20:37] hashar: I have one existing, I'm testing right now [14:20:51] \O/ [14:20:53] (03CR) 10jenkins-bot: [V: 04-1] rp_filter check tunings [operations/puppet] - 10https://gerrit.wikimedia.org/r/125995 (owner: 10Alexandros Kosiaris) [14:23:34] hashar: Seems happy. Maybe you can double-check by updating the beta puppetmaster [14:24:09] hashar: , andrewbogott, thanks! [14:24:17] I am not sure why the beta puppetmaster does not need it [14:24:47] can someone pm me an email address for andrew garrett? [14:24:54] unless he's here now… werdna? That's you, right? [14:25:02] yeah that is him [14:25:09] though he is in australia so probably sleeping right now [14:25:36] ah, nm, found an email on a mailing list [14:26:20] <_joe_> I know I created something of a monster-patch, but does anyone care to take a look? [14:28:04] (03CR) 10Hashar: "rebased deployment-salt and it still works :-] Not sure what happened originally." [operations/puppet] - 10https://gerrit.wikimedia.org/r/121677 (owner: 10Hashar) [14:31:26] (03PS3) 10Alexandros Kosiaris: Introduce check to test if dhclient is running [operations/puppet] - 10https://gerrit.wikimedia.org/r/125989 [14:34:02] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce check to test if dhclient is running [operations/puppet] - 10https://gerrit.wikimedia.org/r/125989 (owner: 10Alexandros Kosiaris) [14:34:04] (03PS2) 10Alexandros Kosiaris: rp_filter check tunings [operations/puppet] - 10https://gerrit.wikimedia.org/r/125995 [14:35:49] (03CR) 10Alexandros Kosiaris: [C: 032] rp_filter check tunings [operations/puppet] - 10https://gerrit.wikimedia.org/r/125995 (owner: 10Alexandros Kosiaris) [14:36:15] this should squash some of the rp_filter warnings [14:36:34] hopefully what will be left will be stuff that actually needs fixing [14:37:58] (03PS1) 10Hashar: contint: extract android SDK dependencies to a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/126000 [14:39:18] (03CR) 10Hashar: "I factored out the code out of the contint module in favor of a new androidsdk::dependencies class : https://gerrit.wikimedia.org/r/#/c/12" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125241 (owner: 10Yuvipanda) [14:40:34] (03CR) 10Hashar: "Made for Tim Landscheidt who wrote: https://gerrit.wikimedia.org/r/#/c/125241/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/126000 (owner: 10Hashar) [14:49:31] is there a read-only archive of our old svn repos someplace? Trying to track down authorship of a file [14:50:18] andrewbogott: svn.wikimedia.org is still up AFAIK (ro, though) [14:50:19] https://svn.wikimedia.org/viewvc [14:50:27] oh, perfect. That was easy [14:50:29] thanks hoo [14:50:47] yes, svn is still up, I just fixed a check on that name yesterday [14:51:01] oh, hm, adminbot isn't in there. I wonder where it lived pre-git [14:51:03] maybe nowhere :( [14:51:06] manybubbles, MaxSem: Unless either of you want to do today's SWAT, I'll take it. [14:51:15] anomie: have fun! [14:51:53] ebernhardson: SWAT deploy in about 10 minutes.
[14:52:13] * MaxSem is in a different TZ now [14:53:41] yes, we switched svn to antimony for that:) [14:53:47] not tampa [14:56:03] (03PS1) 10Hashar: applicationserver: rm code for labs instance in pmtpa [operations/puppet] - 10https://gerrit.wikimedia.org/r/126002 [14:57:18] (03CR) 10Dzahn: [C: 031] applicationserver: rm code for labs instance in pmtpa [operations/puppet] - 10https://gerrit.wikimedia.org/r/126002 (owner: 10Hashar) [14:58:52] :-) [15:00:05] ebernhardson: Ping, you ready to test your backport once I deploy it? [15:00:42] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.000277777777778 [15:01:04] hmmmm [15:01:05] anomie: yes [15:01:08] (03CR) 10Dzahn: [C: 032] decom Tampa: remove service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/120063 (owner: 10Dzahn) [15:01:09] still too sensitive! [15:01:16] * anomie starts the SWAT deploy process [15:01:18] manybubbles: i'm going to move to a cafe [15:01:22] want to start another elastic node first [15:01:29] mind if I start moving shards off the next one? [15:01:42] ottomata: no problem [15:01:47] elastic1009 [15:01:58] sure! [15:02:19] !log DNS update - removing Tampa service IPs [15:02:25] Logged the message, Master [15:02:44] k, that's going [15:02:48] shards moving off on 1009 [15:02:54] heading to a cafe, back soon! [15:03:14] wow they're really gone... [15:03:26] please check.. it's a little exciting [15:03:31] (03CR) 10Andrew Bogott: "I've emailed the authors to get informal consent to relicensing as gpl3+. Once (if) I get confirmation we can rewrite this with a sensibl" [operations/debs/adminbot] - 10https://gerrit.wikimedia.org/r/68935 (owner: 10AzaToth) [15:03:36] everything still ok, right? [15:04:24] http://torrus.wikimedia.org/torrus/Facilities?path=/Power_usage/Total_power_usage/Power_per_site&view=lastyear [15:04:55] Reedy: :) [15:06:07] PROBLEM - check if dhclient is running on dobson is CRITICAL: Connection refused by host [15:06:27] PROBLEM - check if dhclient is running on mchenry is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:07:49] ACKNOWLEDGEMENT - check if dhclient is running on dobson is CRITICAL: Connection refused by host alexandros kosiaris hardy, to be decom [15:07:49] ACKNOWLEDGEMENT - check if dhclient is running on mchenry is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. alexandros kosiaris hardy, to be decom [15:09:02] (03CR) 10Dzahn: "Reedy, what was imagedump" [operations/dns] - 10https://gerrit.wikimedia.org/r/125949 (owner: 10Dzahn) [15:09:40] I'm guessing it's to images what download/dumps is to xml [15:09:59] I'm guessing that's not true [15:10:13] (my media rsync/tarball services never used that name) [15:10:20] unless it's a name eft over from before 2008 [15:10:23] *left [15:11:03] chasemp: I have several pending changes in gerrit to admins.pp… will it upset you if I just remove myself as reviewer and add you to all of them? I assume they should all be stonewalled pending a proper new design.
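(For context, "moving shards off" a node like elastic1009 ahead of a reinstall is typically done by excluding it from allocation and letting the cluster drain it; a hedged sketch against the Elasticsearch 1.x cluster-settings API, with an illustrative endpoint host, and not necessarily the exact mechanism used here:)

    curl -XPUT http://localhost:9200/_cluster/settings -d '{
      "transient": { "cluster.routing.allocation.exclude._name": "elastic1009" }
    }'
    # watch the relocations drain off:
    curl -s http://localhost:9200/_cat/shards | grep RELOCATING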
[15:11:55] either way is good for me, if it's something blocking you no problems doing your thing, I am going to be writing a one-off to dump the whole admins.pp to a hash [15:12:08] or at least that is my intention [15:12:16] I'm just trying to clean up my epic review backlog… want to clean out things that I know I will never look at [15:12:40] passing them my way sounds reasonable [15:13:07] ok thanks [15:13:14] !log anomie synchronized php-1.23wmf21/extensions/Flow 'SWAT: Flow: Prevent logspam on enwiki 125930' [15:13:19] Logged the message, Master [15:13:51] That was a disturbing number of error messages... [15:14:46] (03CR) 10Andrew Bogott: "Adding chase as a reviewer since he's going to spearhead this giant redesign." [operations/puppet] - 10https://gerrit.wikimedia.org/r/107848 (owner: 10Faidon Liambotis) [15:15:03] anomie: what kind [15:15:19] (03CR) 10Alexandros Kosiaris: "Good catch. I am inclined to say no to that sudo rule." [operations/puppet] - 10https://gerrit.wikimedia.org/r/122399 (owner: 10Dzahn) [15:15:38] mutante: Lots of messages along the lines of "mw1204: Received disconnect from 10.64.0.207: 2: Too many authentication failures for anomie" [15:16:09] based on log output, i would guess the patch didn't make it all the way out either (since the patch reduced some logging levels) [15:16:23] ebernhardson: Yeah, it seems to not have. Working on it. [15:16:36] anomie: did the local ssh agent die? [15:16:58] anomie: just mw1204? [15:17:07] mutante: No, lots. Probably all. [15:17:58] Failed publickey for anomie ...indeed [15:18:39] on mw1204 you have one with comment bjorsch@wikimedia.org-2 [15:19:01] do you load that with ssh-add? [15:21:12] the "-2" part makes me think maybe you use -1 [15:21:27] PROBLEM - Puppet freshness on ms-be1004 is CRITICAL: Last successful Puppet run was Tue 15 Apr 2014 12:20:52 PM UTC [15:21:39] !log anomie synchronized php-1.23wmf21/extensions/Flow 'SWAT: Flow: Prevent logspam on enwiki 125930' [15:21:52] mutante: Must be agent trouble on my end, it all worked that time. [15:22:10] anomie: happened to me as well with some sync scripts in the past [15:22:14] * anomie is done with the SWAT deploy [15:22:24] anomie: sometimes the agent died on my end when things were too fast [15:22:28] sounds good [15:22:49] mutante: Now I just need to figure out what blew up on my end. Thanks. [15:22:59] ebernhardson: Should be deployed now [15:23:05] for me it worked better when my network was slow :p [15:23:09] seriously [15:23:33] * anomie is going to reboot and see if the agent issue happens again or if it was just a fluke [15:23:55] anomie: looks to have worked, thanks [15:24:06] (03CR) 10Hashar: [C: 031] "My only contribution was adding the .pep8 file with https://gerrit.wikimedia.org/r/#/c/47115/" [operations/debs/adminbot] - 10https://gerrit.wikimedia.org/r/68935 (owner: 10AzaToth) [15:25:07] (03CR) 10Andrew Bogott: [C: 031] "This looks right to me, but I'd like coren to merge." [operations/puppet] - 10https://gerrit.wikimedia.org/r/123149 (owner: 10Tim Landscheidt) [15:25:53] (03CR) 10Andrew Bogott: [C: 04-1] "needs rebase :(" [operations/puppet] - 10https://gerrit.wikimedia.org/r/114734 (owner: 10Tim Landscheidt) [15:25:56] (03CR) 10Dzahn: [C: 04-1] "first i'd like bastionhost to not include sudoers::appserver anymore. that will be another patch" [operations/puppet] - 10https://gerrit.wikimedia.org/r/122399 (owner: 10Dzahn) [15:27:07] RobH: Can you look at https://gerrit.wikimedia.org/r/#/c/111386/3 and (ideally) explain it to me?
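(Backing up to the SWAT hiccup above: the "Too many authentication failures" burst anomie hit is the classic symptom of an ssh agent offering every loaded key until sshd's MaxAuthTries cutoff; some hedged ways to confirm and work around it, with an illustrative key path:)

    ssh-add -l                                 # list the keys the agent is offering
    ssh-add -D && ssh-add ~/.ssh/id_rsa_wmf    # keep only the key the cluster expects
    # or pin a single key per connection:
    ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa_wmf mw1204.eqiad.wmnet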
[15:27:36] Is it that that line is left over from when we self-signed? [15:29:25] puppet executes a literal "cat" of the things that you mention there, and the result becomes the *-chained.pem [15:30:44] so if you install certificate star.wmflabs.org and use the -chained in the apache config [15:30:50] and it is now from RapidSSL, then yes [15:31:07] ok… I don't know if it's now from RapidSSL though. How can I tell? [15:31:11] Other than asking the person who bought it? [15:31:22] otherwise you have a broken chain [15:31:31] but need to double check what it is from, yes [15:31:42] if it had a broken chain would it still work? [15:31:45] Because… it works [15:31:57] depends on client and how you check [15:32:52] ok. So, how to find out who issued the cert? [15:32:55] https://www.ssllabs.com/ssltest/analyze.html?d=wmflabs.org [15:32:58] or openssl [15:33:38] openssl x509 -in certificate.crt -text -noout [15:34:46] andrewbogott: it works in many browsers, but if you try to curl from a command line you get the error right away [15:34:57] i ran into that issue [15:35:05] hey, I'm confused about something. [15:35:08] Subject: serialNumber=BhQHbaOWi1kF5o57ZgySvt3TVywIQOGI, OU=GT90855227, OU=See www.rapidssl.com/resources/cps (c)13, OU=Domain Control Validated - RapidSSL(R), CN=bugzilla.wikimedia.org [15:35:12] andrewbogott: [15:35:15] hmm, one sec [15:35:24] (plus everything mutante already said ;) [15:35:36] ok, I just dug that out too. [15:35:40] So, patch is correcdt. [15:35:40] eh, wrong cert, but look at the Subject [15:35:44] *correct [15:35:46] is what i should have said [15:35:56] Subject: serialNumber=VlQJ1SFLQmiQkfbrLvyvKHGvxlkXpTeo, OU=GT17134841, OU=See www.rapidssl.com/resources/cps (c)13, OU=Domain Control Validated - RapidSSL(R), CN=*.wmflabs.org [15:36:04] (03CR) 10RobH: [C: 031] star.wmflabs.org: fix intermediate CA [operations/puppet] - 10https://gerrit.wikimedia.org/r/111386 (owner: 10Jeremyb) [15:36:08] Certificate name mismatch [15:36:21] PROBLEM - check if dhclient is running on tridge is CRITICAL: NRPE: Command check_check_dhclient not defined [15:36:23] (03CR) 10Dzahn: [C: 031] star.wmflabs.org: fix intermediate CA [operations/puppet] - 10https://gerrit.wikimedia.org/r/111386 (owner: 10Jeremyb) [15:36:39] ah, I remember... [15:36:51] PROBLEM - check if dhclient is running on pdf2 is CRITICAL: Connection refused by host [15:36:51] PROBLEM - check if dhclient is running on pdf3 is CRITICAL: Connection refused by host [15:36:57] <^d> ottomata: I added myself to the elastic repartitioning rt. Could you be sure to update it with the status as nodes are done? [15:37:09] ^d, sure! [15:37:13] <^d> Thanks! [15:37:48] ACKNOWLEDGEMENT - check if dhclient is running on pdf2 is CRITICAL: Connection refused by host daniel_zahn pdf... ssshhh [15:38:33] ACKNOWLEDGEMENT - check if dhclient is running on pdf3 is CRITICAL: Connection refused by host daniel_zahn no NRPE here... 
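(Condensing the x509 inspection above to just the question asked, "how to find out who issued the cert?", with an illustrative file path, plus the same check straight off the wire:)

    openssl x509 -in /etc/ssl/certs/star.wmflabs.org.pem -noout -subject -issuer
    echo | openssl s_client -connect eqiadproxytest.wmflabs.org:443 2>/dev/null \
        | openssl x509 -noout -issuer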
[15:40:38] (03PS4) 10Andrew Bogott: star.wmflabs.org: fix intermediate CA [operations/puppet] - 10https://gerrit.wikimedia.org/r/111386 (owner: 10Jeremyb) [15:43:10] (03PS1) 10Ottomata: Adding tnegrin to icinga analytics contact group [operations/puppet] - 10https://gerrit.wikimedia.org/r/126007 [15:43:30] (03CR) 10Ottomata: [C: 032 V: 032] Adding tnegrin to icinga analytics contact group [operations/puppet] - 10https://gerrit.wikimedia.org/r/126007 (owner: 10Ottomata) [15:43:32] (03CR) 10Andrew Bogott: [C: 032] star.wmflabs.org: fix intermediate CA [operations/puppet] - 10https://gerrit.wikimedia.org/r/111386 (owner: 10Jeremyb) [15:48:08] (03CR) 10Dzahn: "hmm, i think it might also be "RapidSSL_CA.pem GeoTrust_Global_CA.pem" like the others" [operations/puppet] - 10https://gerrit.wikimedia.org/r/111386 (owner: 10Jeremyb) [15:49:22] (03CR) 10Dzahn: "i:/C=US/O=GeoTrust, Inc./CN=RapidSSL CA" [operations/puppet] - 10https://gerrit.wikimedia.org/r/111386 (owner: 10Jeremyb) [15:50:33] (03PS1) 10Andrew Bogott: Added GeoTrust_Global_CA.pem to the star.wmflabs.org cert chain [operations/puppet] - 10https://gerrit.wikimedia.org/r/126008 [15:50:45] mutante: ^ ? [15:50:53] oh [15:50:56] it has to be both! [15:51:05] wildcards have to have both rapid and geo listed, mutante is correct [15:51:07] or it'll break [15:51:25] …which is what ^ does, right? [15:51:38] (03CR) 10Dzahn: [C: 031] "per comments on I4fba98a3856f59" [operations/puppet] - 10https://gerrit.wikimedia.org/r/126008 (owner: 10Andrew Bogott) [15:51:54] andrewbogott: i think so yep [15:51:59] (03CR) 10Dzahn: "openssl s_client -connect wikistats.wmflabs.org:443 -CAfile /etc/ssl/certs/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/126008 (owner: 10Andrew Bogott) [15:52:02] https://gerrit.wikimedia.org/r/#/c/126008/ fixes yes [15:52:28] so the non wildcards dont need both, but the wildcards do, heh [15:53:02] (03CR) 10RobH: [C: 031] Added GeoTrust_Global_CA.pem to the star.wmflabs.org cert chain [operations/puppet] - 10https://gerrit.wikimedia.org/r/126008 (owner: 10Andrew Bogott) [15:53:17] (03CR) 10Dzahn: [C: 032] Added GeoTrust_Global_CA.pem to the star.wmflabs.org cert chain [operations/puppet] - 10https://gerrit.wikimedia.org/r/126008 (owner: 10Andrew Bogott) [15:53:20] what's better than just one coworker's +1? [15:53:23] two of em =] [15:53:48] thanks y'all [15:53:52] andrewbogott: So there is also one other little puppet bug that crops up on occasion [15:54:00] i'll tell you about it now so you arent surprised later [15:54:01] yeah? [15:54:12] sometimes when puppet runs and creates the chained cert file it has a cat error [15:54:18] andrewbogott: better to delete the chained-file now [15:54:20] and combines the first and second begin and end cert lines [15:54:28] andrewbogott: then let puppet recreate it [15:54:29] and yes, you have to remove the old chain file for the new one to create correctly [15:54:33] as mutante says [15:54:36] and then restart service [15:54:48] and after it creates the new one, just cat it to ensure its sane and doesnt combine lines by mistake [15:54:51] ok. I actually don't even know where that file is/what service, etc... [15:54:59] its in /etc/ssl/certs [15:55:06] as star.wikimedia.org.chained.cert iirc [15:55:12] andrewbogott: on the "proxy" instance that is somehow specially locked down [15:55:15] well, not wikimedia [15:55:17] sorry, wmflabs [15:55:18] because it has certs in labs [15:55:22] ok, but the actual cert isn't generated from puppet anyway, right?
Because it's on labs, has to have been copied over by hand [15:55:36] oh, maybe, sorry, labs differences i dunno =[ [15:55:49] bah. I've just copied the cert around, don't know where it came from :( [15:55:51] andrewbogott: i expected it to be puppet generated [15:55:56] same [15:56:25] We're talking about star.wmflabs.org.chained.pem? And not star.wmflabs.org.pem? [15:56:35] andrewbogott: chained is created by puppet [15:56:47] ok, I'll move it away and see what I get [15:57:01] yes [15:57:58] andrewbogott: it does this [15:58:00] /bin/cat ${certname}.pem ${ca} > ${location}/${certname}.chained.pem [15:58:13] just check if the newlines look ok [15:58:17] after it did that [15:58:37] yeah, it looks fine to me [15:58:50] and chained is used in webserver config, right [15:58:53] and my browser still loads wikitech [15:59:05] bah, wikitech, wrong site [15:59:27] (03PS1) 10Ottomata: Removing some references to stat1, replacing some of them with stat1003 [operations/puppet] - 10https://gerrit.wikimedia.org/r/126009 [15:59:41] manybubbles: elastic1009 shards are moved [15:59:45] reinstalling [15:59:45] andrewbogott: yuviproxy, right [15:59:53] at what point do you want to play with it? [16:00:03] once it is good to go but before we start moving shards back to it? [16:00:04] so i suspect the newline error is when my stupid text editor inputs soft line break characters, i need to test this later when i feel like touching ssl again [16:00:27] but maybe not, i just tend to assume when something breaks and i touched it at some point that its somehow my fault ;] [16:00:41] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [16:00:41] ottomata: once you've rebuilt it but before you move the shards back to it [16:01:22] ok… http://eqiadproxytest.wmflabs.org loads fine [16:01:28] good sign? [16:02:10] k [16:03:00] andrewbogott: yes, i can get a Verified 0 [16:03:06] root@zirconium:~# openssl s_client -connect eqiadproxytest.wmflabs.org:443 -CApath /etc/ssl/certs/ [16:03:10] cool, thanks again [16:03:16] Verify return code: 0 (ok) [16:04:59] PROBLEM - Host elastic1009 is DOWN: PING CRITICAL - Packet loss = 100% [16:05:01] hey akosiaris [16:05:03] is this ok? [16:05:03] https://gerrit.wikimedia.org/r/#/c/126009/1/files/backup/disklist-daily [16:05:18] removing the disklist entries for stat1? [16:05:33] andrewbogott: of course, now as the last person to commit major ssl changes you are the 'expert' until someone else does it ;] [16:05:41] mutante: WE ARE FREE [16:05:42] https://www.ssllabs.com/ssltest/analyze.html?d=eqiadproxytest.wmflabs.org [16:05:58] !log reinstalling elastic1009 [16:06:04] Logged the message, Master [16:06:33] andrewbogott: A-, just like Bugzilla , if you want 100/100/100 let's talk about https://rt.wikimedia.org/Ticket/Display.html?id=7281 :) [16:06:42] ottomata: dont put stat1003 there please.
the removal is ok [16:07:33] mutante: For today I don't care, just trying to clear code review requests :) [16:07:57] wait [16:07:59] it still says this [16:08:02] Subject Labs CA Not in trust store [16:08:50] (03CR) 10Alexandros Kosiaris: [C: 04-1] Removing some references to stat1, replacing some of them with stat1003 (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/126009 (owner: 10Ottomata) [16:08:52] Chain issues Incomplete, Extra certs, Contains anchor [16:08:54] hrmm [16:09:16] that sounds like before the change, maybe cached at ssllabs because i clicked before [16:09:20] Coren: would you be the person to chat with about labsstore1001? [16:09:30] andrewbogott: or you? [16:09:36] don't put stat1003 there? [16:09:43] akosiaris: ? [16:09:44] its saturating its port regularly, i think we need to add another ethernet connection [16:09:48] RobH: Coren's basement flooded while he and Earnest were in Athens so I don't think he'll be on IRC much today. [16:09:54] But, in theory it is Coren, yes. [16:09:55] andrewbogott: ouch [16:09:58] yeah [16:09:58] that suuucks [16:09:59] PROBLEM - Host ms-fe3001 is DOWN: PING CRITICAL - Packet loss = 100% [16:10:09] RECOVERY - Host elastic1009 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [16:10:16] I'll handle via email then, thanks =] [16:10:36] oh reading comments [16:11:38] (03CR) 10Chad: [C: 031] gerrit: remove one replication to gallium [operations/puppet] - 10https://gerrit.wikimedia.org/r/122419 (owner: 10Hashar) [16:12:09] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.141 [16:12:09] PROBLEM - check if dhclient is running on elastic1009 is CRITICAL: Connection refused by host [16:12:09] PROBLEM - Disk space on elastic1009 is CRITICAL: Connection refused by host [16:12:19] PROBLEM - SSH on elastic1009 is CRITICAL: Connection refused [16:12:19] PROBLEM - check configured eth on elastic1009 is CRITICAL: Connection refused by host [16:12:19] PROBLEM - DPKG on elastic1009 is CRITICAL: Connection refused by host [16:12:19] PROBLEM - puppet disabled on elastic1009 is CRITICAL: Connection refused by host [16:12:29] PROBLEM - RAID on elastic1009 is CRITICAL: Connection refused by host [16:13:10] !log moving ms-fe3xxx/ms-be3xxx to private1-esams [16:13:16] Logged the message, Master [16:16:55] (03CR) 10Dzahn: [C: 032] gerrit: remove one replication to gallium [operations/puppet] - 10https://gerrit.wikimedia.org/r/122419 (owner: 10Hashar) [16:17:03] (03CR) 10Faidon Liambotis: [C: 032] Move ms-fe30xx/ms-be30xx to private1-esams [operations/dns] - 10https://gerrit.wikimedia.org/r/125990 (owner: 10Faidon Liambotis) [16:17:20] akosiaris: did you mean to say that disklist stuff is for 'amanda'? [16:17:34] 'Dont please. This is for backups' [16:18:13] (03PS2) 10Ottomata: Removing some references to stat1, replacing some of them with stat1003 [operations/puppet] - 10https://gerrit.wikimedia.org/r/126009 [16:18:17] !log swapping bad disk slot 4 on dataset1001 [16:18:23] Logged the message, Master [16:20:46] (03PS1) 10Faidon Liambotis: Move ms-fe30xx/ms-be30xx to private1-esams [operations/puppet] - 10https://gerrit.wikimedia.org/r/126012 [16:21:28] ^d: hi, did you get the gerrit replication plugin restarted? 
( re: removing a replication to gallium https://gerrit.wikimedia.org/r/#/c/122419/ ) [16:21:33] (03CR) 10Faidon Liambotis: [C: 032] Move ms-fe30xx/ms-be30xx to private1-esams [operations/puppet] - 10https://gerrit.wikimedia.org/r/126012 (owner: 10Faidon Liambotis) [16:21:44] <^d> hashar: Ah, no. [16:21:47] <^d> Didn't realize the patch merged. [16:21:50] <^d> Lemme do it. [16:21:54] thanks ! :-] [16:22:19] !log shutting down mw1163 to replace DIMM [16:22:25] Logged the message, Master [16:24:19] PROBLEM - NTP on elastic1009 is CRITICAL: NTP CRITICAL: No response from NTP server [16:24:59] PROBLEM - Host mw1163 is DOWN: PING CRITICAL - Packet loss = 100% [16:26:12] (03PS2) 10BryanDavis: beta: New script to restart apaches [operations/puppet] - 10https://gerrit.wikimedia.org/r/125888 [16:27:09] PROBLEM - Host elastic1009 is DOWN: PING CRITICAL - Packet loss = 100% [16:27:13] (03PS1) 10Dzahn: remove include sudo::appserver from bastions [operations/puppet] - 10https://gerrit.wikimedia.org/r/126014 [16:27:18] ^d: should we just: gerrit plugin reload replication ? [16:27:29] <^d> I had to run puppet so the config change went live. [16:27:33] <^d> That restarted gerrit on its own. [16:27:44] <^d> Oh duh, nvm. Will reload. [16:28:11] (03CR) 10BryanDavis: beta: New script to restart apaches (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/125888 (owner: 10BryanDavis) [16:28:19] RECOVERY - SSH on elastic1009 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.3 (protocol 2.0) [16:28:29] RECOVERY - Host elastic1009 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [16:28:42] (03CR) 10Dzahn: [C: 031] "sudo::appserver is stuff like apache2ctl, renice,find-nearest-rsync, mwdeploy NOPASSWD: ALL,.." [operations/puppet] - 10https://gerrit.wikimedia.org/r/126014 (owner: 10Dzahn) [16:29:02] <^d> hashar: config live, replication reloaded. [16:29:29] ^d: awesome. I am deleting /var/lib/git/gerrit on gallium [16:30:45] !log gallium had two Gerrit replication streams, one of them got removed {{gerrit|122419}} thus deleting the target directories under /var/lib/git [16:30:51] Logged the message, Master [16:31:09] !log ... all Jenkins jobs are using /srv/ssd/gerrit instead [16:31:09] PROBLEM - DPKG on ytterbium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:31:22] RECOVERY - Host mw1163 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [16:31:28] Logged the message, Master [16:31:47] <^d> hashar: Yay, less replication :) [16:31:56] :-] [16:32:06] heh [16:32:09] RECOVERY - DPKG on ytterbium is OK: All packages OK [16:32:34] ^d: we might well remove the ones to Jenkins slaves entirely. Gotta talk about it with Timo (we could just clone from Gerrit now since we no longer reclone on every build) [16:33:06] <^d> I'd rather not do that. [16:33:12] (03CR) 10Dzahn: "after I82cabbe9f6" [operations/puppet] - 10https://gerrit.wikimedia.org/r/122399 (owner: 10Dzahn) [16:34:09] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running.
status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [16:34:33] ^d: ok :-] [16:38:46] (03PS3) 10Manybubbles: Make Elasticsearch more reliable in beta [operations/puppet] - 10https://gerrit.wikimedia.org/r/125331 [16:39:09] RECOVERY - check if dhclient is running on elastic1009 is OK: PROCS OK: 0 processes with command name dhclient [16:39:10] RECOVERY - Disk space on elastic1009 is OK: DISK OK [16:39:19] RECOVERY - check configured eth on elastic1009 is OK: NRPE: Unable to read output [16:39:19] RECOVERY - DPKG on elastic1009 is OK: All packages OK [16:39:19] RECOVERY - puppet disabled on elastic1009 is OK: OK [16:39:29] RECOVERY - RAID on elastic1009 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [16:40:42] !Log replacing eth cable on mw1193 [16:40:48] Logged the message, Master [16:43:15] (03CR) 10Dzahn: move LDAP admin permissions,tools out of site.pp (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/125721 (owner: 10Dzahn) [16:43:58] manybubbles: unicast hosts change looks good, shall I merge? [16:44:06] ottomata: sure! [16:44:31] (03PS1) 10Faidon Liambotis: autoinstall: add partman for ms-fe/ms-be esams [operations/puppet] - 10https://gerrit.wikimedia.org/r/126016 [16:44:39] also elastic1009 is ready to go [16:44:43] manybubbles: ^ [16:44:44] do your thang [16:45:09] (03CR) 10Faidon Liambotis: [C: 032 V: 032] autoinstall: add partman for ms-fe/ms-be esams [operations/puppet] - 10https://gerrit.wikimedia.org/r/126016 (owner: 10Faidon Liambotis) [16:45:15] (03CR) 10Ottomata: [C: 032 V: 032] Make Elasticsearch more reliable in beta [operations/puppet] - 10https://gerrit.wikimedia.org/r/125331 (owner: 10Manybubbles) [16:45:44] oop, paravoid, you puppet-merging or me? [16:45:44] :) [16:45:52] ah you did ti! :) [16:46:07] cool [16:47:08] ottomata: cool, thanks for the merge. I'll go look at Elastic1009 [16:49:13] (03CR) 10Dzahn: [C: 04-1] "please don't touch unrelated zookeeper submodule" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125977 (owner: 10Andrew Bogott) [16:49:21] ottomata: I'm done with my tweak. 
[16:49:35] if you have the command up to move the shards back please run it [16:49:40] otherwise I'll go look it up: [16:49:41] :) [16:50:05] !log mw1093 replacing ethernet cable [16:50:09] Logged the message, Master [16:50:24] i got it [16:50:33] sweet [16:50:52] !log raised "new generation" size on elastic1009 to test a performance theory [16:50:57] Logged the message, Master [16:51:03] manybubbles: moving shards back to 1009 [16:51:09] ottomata: sweet [16:51:12] (03PS2) 10Dzahn: remove sudo::appserver from bastions [operations/puppet] - 10https://gerrit.wikimedia.org/r/126014 [16:52:19] RECOVERY - NTP on elastic1009 is OK: NTP OK: Offset -0.01924777031 secs [16:54:21] (03PS2) 10Dzahn: remove syslog service IP (Tampa) [operations/dns] - 10https://gerrit.wikimedia.org/r/125952 [16:56:18] !Log mw1057 replacing ethernet cable [16:56:23] Logged the message, Master [16:59:49] PROBLEM - Host mw1057 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:04] (03CR) 10Dzahn: [C: 031] Initial commit of pmacct module and role [operations/puppet] - 10https://gerrit.wikimedia.org/r/115345 (owner: 10Jkrauska) [17:00:40] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.0622222222222 [17:01:29] RECOVERY - Host mw1057 is UP: PING OK - Packet loss = 0%, RTA = 2.17 ms [17:03:29] paravoid: ping [17:03:35] pong [17:03:50] lvs1005...i agree most likely cable...okay to try and fix now? [17:04:17] (03CR) 10Andrew Bogott: [C: 04-1] "Looks OK, but please fix whitespace" [operations/puppet] - 10https://gerrit.wikimedia.org/r/126002 (owner: 10Hashar) [17:05:17] cmjohnson1: ack, go ahead [17:05:55] great..thx [17:06:07] !Log fixing lvs1005 eth1 cable [17:06:12] Logged the message, Master [17:06:55] do we have a ticket for migrating url-downloader to eqiad? [17:06:57] I think no [17:07:44] paravoid: only if you consider it part of linne [17:08:00] because templates/wikimedia.org:url-downloader 1H IN A 208.80.152.143 ; linne [17:08:05] and linne has a ticket [17:08:18] PROBLEM - Host lvs1005 is DOWN: PING CRITICAL - Packet loss = 100% [17:08:36] well, it is kind of mentioned on 6157 [17:08:49] linne is a Wikimedia Upload-by-URL proxy (misc::url-downloader). [17:08:58] PROBLEM - Host misc-web-lb.eqiad.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [17:08:59] but ticket isnt very specific [17:09:38] RECOVERY - Host misc-web-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [17:09:42] RECOVERY - Host lvs1005 is UP: PING OK - Packet loss = 0%, RTA = 1.18 ms [17:09:45] !log stopped pybal on lvs1005 [17:09:51] Logged the message, Master [17:10:57] mutante|away: #7284 [17:11:41] paravoid: yep, cool, and linked i see [17:13:13] and saw reply to lvs1005 as well..yep, most of the times its just cable [17:13:52] 7279, chris [17:14:12] yep..forgot to stop pybal [17:14:17] sorry for pages [17:16:24] cmjohnson1: https://www.youtube.com/watch?v=07So_lJQyqw plays when i get icinga pages [17:17:29] so replaced the one end at lvs1005 didn't fix problem...need to fix end at ge-6/0/46 [17:20:08] PROBLEM - Host lvs1005 is DOWN: PING CRITICAL - Packet loss = 100% [17:20:48] PROBLEM - Host misc-web-lb.eqiad.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [17:21:23] so all these misc web lb are false and i dont need to worry right? [17:21:42] <_joe|away> RobH: not really sure [17:21:51] cmjohnson1: are these from your lvs stuff?
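(Aside: the "command to move the shards back" being traded just above is Elasticsearch's allocation-exclusion setting. A minimal sketch, assuming a cluster member reachable on the standard port 9200 and reusing elastic1009's IP from the health checks in this log:

    # Drain: relocate every shard off the node about to be reinstalled
    curl -XPUT localhost:9200/_cluster/settings -d '
      {"transient": {"cluster.routing.allocation.exclude._ip": "10.64.32.141"}}'

    # Un-drain: clear the exclusion so shards flow back onto the rebuilt node
    curl -XPUT localhost:9200/_cluster/settings -d '
      {"transient": {"cluster.routing.allocation.exclude._ip": ""}}'

    # Watch relocating_shards fall back to 0 and status return to green
    curl -s 'localhost:9200/_cluster/health?pretty'

The setting also takes _name and _host forms; the status / relocating_shards / unassigned_shards fields in the icinga messages throughout this log are this same _cluster/health document flattened onto one line.)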
[17:22:11] he mentions forgetting to stop pybal and sorry for pages, and pages are for misc web lb [17:22:15] so im guessing its ok to ignore [17:25:10] ottomata: moving shards back off 1009 so I can bounce it again [17:25:23] it has so few shards that it's probably faster than the fast restart mechanism [17:26:13] (03PS2) 10Dzahn: applicationserver: rm code for labs instance in pmtpa [operations/puppet] - 10https://gerrit.wikimedia.org/r/126002 (owner: 10Hashar) [17:28:08] RECOVERY - Host lvs1005 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [17:28:23] (03CR) 10BryanDavis: [C: 031] applicationserver: rm code for labs instance in pmtpa [operations/puppet] - 10https://gerrit.wikimedia.org/r/126002 (owner: 10Hashar) [17:28:28] RECOVERY - Host misc-web-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [17:30:26] (03CR) 10Andrew Bogott: [C: 032] applicationserver: rm code for labs instance in pmtpa [operations/puppet] - 10https://gerrit.wikimedia.org/r/126002 (owner: 10Hashar) [17:31:14] manybubbles: HIII [17:31:18] I AM IN UR HANGOUT [17:32:21] (03PS1) 10Dzahn: retab role/applicationserver [operations/puppet] - 10https://gerrit.wikimedia.org/r/126025 [17:33:33] (03CR) 10Dzahn: "could not resist. thanks for merging the pmtpa cleanup. ttyl :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/126025 (owner: 10Dzahn) [17:34:49] (03CR) 10Hoo man: [C: 031] "Makes sense, would approve if I could" [operations/puppet] - 10https://gerrit.wikimedia.org/r/126014 (owner: 10Dzahn) [17:35:12] (03PS1) 10Faidon Liambotis: Add webproxy.{esams,ulsfo}.wmnet service aliases [operations/dns] - 10https://gerrit.wikimedia.org/r/126026 [17:36:09] (03CR) 10Faidon Liambotis: [C: 032] Add webproxy.{esams,ulsfo}.wmnet service aliases [operations/dns] - 10https://gerrit.wikimedia.org/r/126026 (owner: 10Faidon Liambotis) [17:36:44] wait... what "require mysql_wmf::client" on the bastion hosts... *sigh* [17:38:06] (03CR) 10Andrew Bogott: [C: 032] retab role/applicationserver [operations/puppet] - 10https://gerrit.wikimedia.org/r/126025 (owner: 10Dzahn) [17:40:51] (03PS1) 10Hoo man: Remove mysql client from bastionhost [operations/puppet] - 10https://gerrit.wikimedia.org/r/126027 [17:41:34] (03CR) 10Ottomata: "Why not? I told people that they could do this. Is that wrong? Do we really need to give people special accounts on machines so that th" [operations/puppet] - 10https://gerrit.wikimedia.org/r/126027 (owner: 10Hoo man) [17:42:49] (03CR) 10Hoo man: "Do we actually have people who just use the bastions to connect to mysql, but don't have access to any other hosts? Also they should proba" [operations/puppet] - 10https://gerrit.wikimedia.org/r/126027 (owner: 10Hoo man) [17:43:55] (03CR) 10Ottomata: "Yes, they do exist! Some folks use MySQL GUIs through bast1001 to connect to research databases." [operations/puppet] - 10https://gerrit.wikimedia.org/r/126027 (owner: 10Hoo man) [17:44:18] (03CR) 10Ottomata: "That doesn't require a mysql client on bast1001 though, so maybe that is ok. We should check." [operations/puppet] - 10https://gerrit.wikimedia.org/r/126027 (owner: 10Hoo man) [17:45:19] (03CR) 10Hoo man: "And these need the mysql-client-5.5 packages (no idea how these guys connect, never used one)? I just think it's architecturally wrong to " [operations/puppet] - 10https://gerrit.wikimedia.org/r/126027 (owner: 10Hoo man) [17:46:31] (03CR) 10Hoo man: "I think it's ok to use the bastions to connect to services, also for like wgetting stuff into the cluster or stuff like that...
but a mysq" [operations/puppet] - 10https://gerrit.wikimedia.org/r/126027 (owner: 10Hoo man) [17:47:58] PROBLEM - Host mw1163 is DOWN: PING CRITICAL - Packet loss = 100% [17:49:18] RECOVERY - Host mw1163 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [17:49:40] * AaronSchulz looks at http://www.mediawiki.org/wiki/Requests_for_comment/Simplify_thumbnail_cache [17:52:21] paravoid: so as long as any thumb URL within the same varnish object (via hash) is getting hits, won't the whole hash-chain stay in cache? So someone could keep requesting different parameter combinations and they wouldn't go away even if unused? I'm assuming LRU only applies to objects. [17:53:44] * AaronSchulz needs to investigate all these random ' could not get local copy ' thumbnail.log entries [17:54:57] (03PS12) 10BryanDavis: [WIP] Configure scap master and clients in beta [operations/puppet] - 10https://gerrit.wikimedia.org/r/123674 [17:55:51] (03CR) 10BryanDavis: "Patch set 12 was a manual rebase needed for whitespace changes in manifests/role/applicationserver.pp" [operations/puppet] - 10https://gerrit.wikimedia.org/r/123674 (owner: 10BryanDavis) [18:00:48] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.000277777777778 [18:00:55] (03PS1) 10Faidon Liambotis: Use webproxy.$site.wmnet alias instead of carbon [operations/puppet] - 10https://gerrit.wikimedia.org/r/126029 [18:00:57] (03PS1) 10Faidon Liambotis: autoinstall: add http/proxy for private1-esams [operations/puppet] - 10https://gerrit.wikimedia.org/r/126030 [18:00:59] (03PS1) 10Faidon Liambotis: install-server: set up esams, ulsfo webproxies [operations/puppet] - 10https://gerrit.wikimedia.org/r/126031 [18:02:15] (03CR) 10Faidon Liambotis: [C: 032] Use webproxy.$site.wmnet alias instead of carbon [operations/puppet] - 10https://gerrit.wikimedia.org/r/126029 (owner: 10Faidon Liambotis) [18:02:34] (03CR) 10Faidon Liambotis: [C: 032 V: 032] autoinstall: add http/proxy for private1-esams [operations/puppet] - 10https://gerrit.wikimedia.org/r/126030 (owner: 10Faidon Liambotis) [18:03:29] ottomata: done with 1009 - well, done restarting it [18:03:33] and moving shards back to it now [18:03:44] so you are safe to start moving shards off the next one [18:04:10] ok thanks [18:06:08] * AaronSchulz may have to look at the source [18:07:54] moving shards off of elastic1010 [18:09:28] PROBLEM - LVS HTTP IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection refused [18:10:29] RECOVERY - LVS HTTP IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.050 second response time [18:11:00] mark: so, remember how you were saying that we don't need a local resolver on LVS boxes anymore? [18:11:18] yes? 
[18:11:30] we didn't think of dns-rec-lb :) [18:11:43] and resolv.conf's timeout:3 apparently isn't enough [18:11:59] hehe [18:12:00] right [18:12:05] oh and pybal is probably buggy as well [18:12:12] it spewed a bunch of backtraces [18:15:36] (03CR) 10Alexandros Kosiaris: [C: 032] install-server: set up esams, ulsfo webproxies [operations/puppet] - 10https://gerrit.wikimedia.org/r/126031 (owner: 10Faidon Liambotis) [18:20:08] (03PS1) 10Ottomata: Using custom partman/elasticsearch.cfg for elasticsearch nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/126036 [18:21:23] (03PS2) 10Ottomata: Using custom partman/elasticsearch.cfg for elasticsearch nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/126036 [18:21:44] (03CR) 10Ottomata: [C: 032 V: 032] "Partman is still fairly cryptic to me, but I think this should work." [operations/puppet] - 10https://gerrit.wikimedia.org/r/126036 (owner: 10Ottomata) [18:22:18] PROBLEM - Puppet freshness on ms-be1004 is CRITICAL: Last successful Puppet run was Tue 15 Apr 2014 12:20:52 PM UTC [18:26:49] (03PS1) 10Mwalker: Adding Node 0.10 dependency back to OCG [operations/puppet] - 10https://gerrit.wikimedia.org/r/126039 [18:28:39] Reedy: feeling shell-bug-y? twkozlowski has some pent up patches :) https://gerrit.wikimedia.org/r/#/q/owner:%22Odder+%253Ctomasz%2540twkozlowski.net%253E%22+status:open,n,z [18:33:49] (03CR) 10Reedy: [C: 032] Non Wikipedia to 1.23wmf22 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125983 (owner: 10Reedy) [18:34:09] (03Merged) 10jenkins-bot: Non Wikipedia to 1.23wmf22 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125983 (owner: 10Reedy) [18:35:20] paravoid: nevermind, hash-chain works the way I'd hope [18:35:30] (for lru) [18:35:59] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non Wikipedias to 1.23wmf22 [18:36:04] (03PS2) 10Reedy: Enable Import on Spanish Wikiquote [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/120152 (owner: 10Odder) [18:36:05] Logged the message, Master [18:36:09] (03CR) 10Reedy: [C: 032] Enable Import on Spanish Wikiquote [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/120152 (owner: 10Odder) [18:36:51] (03Merged) 10jenkins-bot: Enable Import on Spanish Wikiquote [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/120152 (owner: 10Odder) [18:38:06] (03PS3) 10Reedy: Add new user groups to Spanish Wikiquote [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/120153 (owner: 10Odder) [18:38:08] (03PS2) 10Mwalker: Adding Node 0.10 dependency back to OCG [operations/puppet] - 10https://gerrit.wikimedia.org/r/126039 [18:38:11] (03CR) 10Reedy: [C: 032] Add new user groups to Spanish Wikiquote [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/120153 (owner: 10Odder) [18:38:35] Reedy: thanks :) [18:38:37] Wee! I can close some bugs now! 
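(Aside: the resolv.conf timeout:3 mentioned in the dns-rec-lb exchange above is a glibc resolver option. A minimal sketch of the shape of such a file, with placeholder addresses rather than the real recursor IPs:

    # /etc/resolv.conf
    nameserver 203.0.113.10   # first recursor (placeholder address)
    nameserver 203.0.113.11   # fallback recursor (placeholder address)
    options timeout:3 attempts:2

timeout:N caps how long the resolver waits on each nameserver per attempt and attempts:M caps the passes it makes over the list, so a dead first recursor still adds seconds to every uncached lookup; that stall is what a local caching resolver on the LVS hosts had been hiding, and slow answers like that are exactly the sort of thing that can make a health-checking daemon such as pybal spew backtraces.)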
(03Merged) 10jenkins-bot: Add new user groups to Spanish Wikiquote [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/120153 (owner: 10Odder) [18:39:44] Gah [18:39:45] 47 Fatal error: Declaration of ProofreadPageContent::preloadTransform() must be compatible with that of Content::preloadTransform() in /usr/local/apache/common-lo [18:39:45] cal/php-1.23wmf22/extensions/ProofreadPage/includes/page/ProofreadPageContent.php on line 262 [18:41:06] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikisources back to 1.23wmf21 due to ProofreadPage fatal [18:41:11] Logged the message, Master [18:41:25] :/ [18:43:02] (03PS3) 10Mwalker: Adding Node 0.10 dependency back to OCG [operations/puppet] - 10https://gerrit.wikimedia.org/r/126039 [18:43:10] * Reedy files a bug but is going to fix now [18:43:45] weird, only changes were i18n/json: https://git.wikimedia.org/summary/?r=mediawiki/extensions/ProofreadPage.git [18:43:58] Presumably a change in core [18:44:07] And extension(s) weren't updated [18:44:08] yeah [18:45:43] heh [18:45:49] PhpStorm already knew it was buggy [18:46:01] public function preloadTransform( Title $title, ParserOptions $popts ) { [18:46:28] public function preloadTransform( Title $title, ParserOptions $parserOptions, $params = array() ); [18:47:49] * Reedy looks for someone to shout at [18:48:06] :) [18:48:27] https://gerrit.wikimedia.org/r/#/c/116482/ [18:49:15] greg-g: To make it worse, there was only one extension to fix [18:50:08] haha [18:51:36] notice: Finished catalog run in -7078.24 seconds [18:54:14] really [18:54:40] * apergos checks out for the night.. have a good rest of the day folks [18:55:26] (03PS1) 10Ottomata: Prefer Ubuntu's openjdk-7* packages rather than what we have in Wikimedia apt [operations/puppet] - 10https://gerrit.wikimedia.org/r/126048 [18:55:34] (03PS2) 10Ottomata: Prefer Ubuntu's openjdk-7* packages rather than what we have in Wikimedia apt [operations/puppet] - 10https://gerrit.wikimedia.org/r/126048 [18:55:35] paravoid: too much performance [18:56:36] <^d> Can we get the time back from previous puppet runs that took too long? [18:57:01] (03PS3) 10Ottomata: Prefer Ubuntu's openjdk-7* packages rather than what we have in Wikimedia apt [operations/puppet] - 10https://gerrit.wikimedia.org/r/126048 [18:57:17] ottomata: noo [18:57:20] paravoid: I'm not super confident about that change, I tested it in one spot [18:57:22] yeah that's fine [18:57:26] not sure if that is the right thing to do [18:57:29] want your input [18:58:47] (03CR) 10Faidon Liambotis: [C: 04-2] "That's very dirty in general, and even more so having this in the apt module." [operations/puppet] - 10https://gerrit.wikimedia.org/r/126048 (owner: 10Ottomata) [18:59:46] (03CR) 10Ottomata: "Oh, ok. I assumed you wanted the potential downgrades stopped."
[operations/puppet] - 10https://gerrit.wikimedia.org/r/126048 (owner: 10Ottomata) [19:00:40] (03PS1) 10coren: Tool Labs: install python-mwparserfromhell [operations/puppet] - 10https://gerrit.wikimedia.org/r/126050 [19:00:48] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.000277700638711 [19:07:13] !log reedy synchronized php-1.23wmf22/extensions/ProofreadPage [19:07:18] Logged the message, Master [19:07:18] !log reinstalling elastic1010 [19:07:18] PROBLEM - Host mw1163 is DOWN: PING CRITICAL - Packet loss = 100% [19:07:22] Logged the message, Master [19:07:53] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikisources back to 1.23wmf22 [19:07:59] Logged the message, Master [19:08:08] RECOVERY - Host mw1163 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [19:08:49] (03CR) 10coren: [C: 032] "Simple package addition." [operations/puppet] - 10https://gerrit.wikimedia.org/r/126050 (owner: 10coren) [19:09:10] (03PS2) 10Reedy: Enable NewUserMessage extension on Urdu Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/123366 (owner: 10Odder) [19:09:14] (03CR) 10Reedy: [C: 032] Enable NewUserMessage extension on Urdu Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/123366 (owner: 10Odder) [19:09:48] PROBLEM - Host elastic1010 is DOWN: PING CRITICAL - Packet loss = 100% [19:09:57] (03CR) 10jenkins-bot: [V: 04-1] Enable NewUserMessage extension on Urdu Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/123366 (owner: 10Odder) [19:10:39] PHP Warning: unlink(/tmp/MWRealmTests/general.ext): No such file or directory in /srv/ssd/jenkins-slave/workspace/operations-mw-config-tests/tests/multiversion/MWRealmTest.php on line 42 [19:10:43] (03CR) 10Reedy: Enable NewUserMessage extension on Urdu Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/123366 (owner: 10Odder) [19:10:43] Not funny at all. [19:10:47] (03CR) 10Reedy: [C: 032] Enable NewUserMessage extension on Urdu Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/123366 (owner: 10Odder) [19:10:58] (03Merged) 10jenkins-bot: Enable NewUserMessage extension on Urdu Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/123366 (owner: 10Odder) [19:11:57] 19:09:55 There were 5 failures: hmmm [19:12:17] (03PS2) 10Reedy: Add New York Public Library to wgCopyUploadsDomains [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124273 (owner: 10Odder) [19:12:21] (03CR) 10Reedy: [C: 032] Add New York Public Library to wgCopyUploadsDomains [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124273 (owner: 10Odder) [19:12:28] <_joe|away> /win 19 [19:13:43] _joe|away: Was it highly effective? 
[19:13:45] _joe|away: https://pthree.org/2007/07/18/irssi-windows-1-throuh-80/ [19:14:58] RECOVERY - Host elastic1010 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [19:15:43] (03Merged) 10jenkins-bot: Add New York Public Library to wgCopyUploadsDomains [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124273 (owner: 10Odder) [19:15:52] <_joe|away> greg-g: eheh [19:16:34] (03PS2) 10Reedy: Modify wgAddGroups, wgRemoveGroups on brwikimedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124393 (owner: 10Odder) [19:16:39] (03CR) 10Reedy: [C: 032] Modify wgAddGroups, wgRemoveGroups on brwikimedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124393 (owner: 10Odder) [19:16:50] (03Merged) 10jenkins-bot: Modify wgAddGroups, wgRemoveGroups on brwikimedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124393 (owner: 10Odder) [19:17:18] PROBLEM - puppet disabled on elastic1010 is CRITICAL: Connection refused by host [19:17:18] PROBLEM - Disk space on elastic1010 is CRITICAL: Connection refused by host [19:17:18] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.142 [19:17:28] PROBLEM - check if dhclient is running on elastic1010 is CRITICAL: Connection refused by host [19:17:28] PROBLEM - RAID on elastic1010 is CRITICAL: Connection refused by host [19:17:38] (03PS2) 10Reedy: Local assignment of accountcreator flag on ptwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125210 (owner: 10Odder) [19:17:38] PROBLEM - DPKG on elastic1010 is CRITICAL: Connection refused by host [19:17:38] PROBLEM - check configured eth on elastic1010 is CRITICAL: Connection refused by host [19:17:42] (03CR) 10Reedy: [C: 032] Local assignment of accountcreator flag on ptwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125210 (owner: 10Odder) [19:17:49] PROBLEM - SSH on elastic1010 is CRITICAL: Connection refused [19:17:56] (03CR) 10Reedy: "Is this actually good to go?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/115153 (owner: 10Odder) [19:18:20] (03Merged) 10jenkins-bot: Local assignment of accountcreator flag on ptwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125210 (owner: 10Odder) [19:18:53] (03CR) 10Amire80: "Depends on what the performance people say. I wrote my reasoning at Bug 60939." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/115153 (owner: 10Odder) [19:20:09] !log reedy synchronized php-1.23wmf22/includes/PrefixSearch.php 'I82b5ca65864099c180d915055c43e6839bd4f4a2' [19:20:15] Logged the message, Master [19:20:26] (03CR) 10Reedy: Create a FeaturedFeed for the Tech News bulletin (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124272 (owner: 10Odder) [19:20:52] (03PS2) 10Reedy: Added Markus Glaser's GPG keys [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121397 (owner: 10Mglaser) [19:21:30] (03CR) 10Reedy: "Is the fact he uploaded them himself enough?" 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121397 (owner: 10Mglaser) [19:21:36] (03PS3) 10Reedy: Bug 34897 - Enable Special:Import on cawikiquote I4bdaa1b4c679356e6355987b31d1dce04ae85bd3 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121834 (owner: 10Nknudsen) [19:21:41] (03CR) 10Reedy: [C: 032] Bug 34897 - Enable Special:Import on cawikiquote I4bdaa1b4c679356e6355987b31d1dce04ae85bd3 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121834 (owner: 10Nknudsen) [19:21:51] (03Merged) 10jenkins-bot: Bug 34897 - Enable Special:Import on cawikiquote I4bdaa1b4c679356e6355987b31d1dce04ae85bd3 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121834 (owner: 10Nknudsen) [19:22:20] (03PS3) 10Reedy: Remove C: namespace alias (for categories) from hiwiki config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113656 (owner: 10Gerrit Patch Uploader) [19:22:24] (03CR) 10Reedy: [C: 032] Remove C: namespace alias (for categories) from hiwiki config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113656 (owner: 10Gerrit Patch Uploader) [19:22:34] (03Merged) 10jenkins-bot: Remove C: namespace alias (for categories) from hiwiki config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/113656 (owner: 10Gerrit Patch Uploader) [19:22:39] (03CR) 10Legoktm: [C: 031] "GPL3+ is fine with me." [operations/debs/adminbot] - 10https://gerrit.wikimedia.org/r/68935 (owner: 10AzaToth) [19:24:14] !log reedy synchronized wmf-config/ [19:24:20] Logged the message, Master [19:29:38] PROBLEM - NTP on elastic1010 is CRITICAL: NTP CRITICAL: No response from NTP server [19:31:18] !log setting refresh interval on elasticsearch indexes to 30s to test effect on load [19:31:23] Logged the message, Master [19:31:58] PROBLEM - Host elastic1010 is DOWN: PING CRITICAL - Packet loss = 100% [19:33:48] RECOVERY - SSH on elastic1010 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.3 (protocol 2.0) [19:33:58] RECOVERY - Host elastic1010 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [19:39:23] (03CR) 10Rush: [C: 04-1] "These are my notes, I really think this is great and I'm not sure about the culture of review here so I hope this doesn't come off as peda" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125726 (owner: 10Giuseppe Lavagetto) [19:40:44] <_joe_> chasemp: thanks! [19:41:55] crap now I have to change my password [19:42:00] I included wrong paste [19:42:47] (03PS13) 10BryanDavis: [WIP] Configure scap master and clients in beta [operations/puppet] - 10https://gerrit.wikimedia.org/r/123674 [19:43:33] <_joe_> chasemp: I was about to tell you in private... [19:43:49] <_joe_> you can edit the comment, I think [19:44:15] bonehead move, no big deal already changing it [19:44:36] (03PS1) 10Mwalker: Enable CentralNotice CrossWiki Hiding [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126065 [19:45:05] <_joe_> chasemp: a couple of problems are a little weird tbh [19:45:22] <_joe_> I gotta look into that (and write tests) [19:49:07] (03Abandoned) 10Ottomata: Prefer Ubuntu's openjdk-7* packages rather than what we have in Wikimedia apt [operations/puppet] - 10https://gerrit.wikimedia.org/r/126048 (owner: 10Ottomata) [19:49:30] (03CR) 10Rush: "for posterity, this was a mistaken paste. the issue was resolved immediately." [operations/puppet] - 10https://gerrit.wikimedia.org/r/125726 (owner: 10Giuseppe Lavagetto) [19:50:06] !log Restarting stuck Jenkins [19:50:11] Logged the message, Mr. 
Obvious [19:51:26] (03PS1) 10Ottomata: Fixing partman/elasticsearch.cfg [operations/puppet] - 10https://gerrit.wikimedia.org/r/126066 [19:52:25] (03PS2) 10Ottomata: Fixing partman/elasticsearch.cfg [operations/puppet] - 10https://gerrit.wikimedia.org/r/126066 [19:53:15] (03CR) 10Ottomata: [C: 032 V: 032] Fixing partman/elasticsearch.cfg [operations/puppet] - 10https://gerrit.wikimedia.org/r/126066 (owner: 10Ottomata) [19:57:38] PROBLEM - Host elastic1010 is DOWN: PING CRITICAL - Packet loss = 100% [20:00:48] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.0589052514587 [20:02:48] RECOVERY - Host elastic1010 is UP: PING OK - Packet loss = 0%, RTA = 2.59 ms [20:04:19] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [20:04:49] PROBLEM - SSH on elastic1010 is CRITICAL: Connection refused [20:04:59] http://gdash.wikimedia.org/dashboards/reqerror/ is... interesting [20:05:00] ori: ^ [20:05:29] hrm [20:05:36] 15:21 logmsgbot: anomie synchronized php-1.23wmf21/extensions/Flow 'SWAT: Flow: Prevent logspam on enwiki 125930' [20:05:49] well, there's a spike of errors: http://ur1.ca/edq1f [20:05:54] the Expand view button on commons was added by wmf? [20:07:32] that big ole spike at 18:XX was proofreadpage that Reedy fixed, probably [20:07:34] http://ru.wikipedia.org/wiki/Города_Сибирского_федерального_округа 500s [20:07:41] [2014-04-15 20:01:01] Fatal error: LuaSandboxFunction::call() [luasandboxfunction.call]: PANIC: unprotected error in call to Lua API (not enough memory) at /usr/local/apache/common-local/php-1.23wmf21/extensions/Scribunto/engines/LuaSandbox/Engine.php on line 264 [20:07:54] anomie: ^ [20:08:11] the spike coincides well with the swat push [20:08:14] then there's http://pt.wikipedia.org/wiki/Usuário:Gustavotcabral/Auto/Patologia , which times out [20:08:26] yeah, pretty clear correlation [20:08:47] ?? [20:08:55] Steinsplitter: see #wikimedia-tech [20:09:02] Steinsplitter: yes, they're going to add an update [20:09:02] thanks [20:09:09] good [20:09:13] the exceptions are wikibase [20:09:15] by update I hope you mean rollback [20:13:36] bblack: around? [20:16:14] (03CR) 10Matanya: [C: 031] remove pmtpa payments LVS monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/125715 (owner: 10Dzahn) [20:17:40] paravoid, greg-g: quarterly review in 10 mins, can't help with the bugs in prod sadly [20:17:50] RECOVERY - SSH on elastic1010 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.3 (protocol 2.0) [20:19:17] RobH, you've got partman skills, ja? [20:20:03] i wouldn't go that far, no. [20:20:10] well i mean [20:20:16] ha, more than anybody else here maybe?
[20:20:25] still nope ;] [20:20:40] (you can still ask me, just sayin ;) [20:21:24] ori: Sounds like https://bugzilla.wikimedia.org/show_bug.cgi?id=59130 [20:21:30] RobH: trying to make this work: [20:21:30] https://github.com/wikimedia/operations-puppet/blob/production/modules/install-server/files/autoinstall/partman/elasticsearch.cfg [20:21:35] i based it on raid1-30G [20:21:38] so far no good [20:21:43] hm, I just noticed a tabbing difference [20:21:45] going to change it and see [20:21:53] so annoying that I have to wait so long to find out if the thang works [20:22:03] I do but I'm about to join a meeting [20:22:06] later maybe [20:22:26] i can look in a few minutes and try to help, but need to not stop mid commit or i will lose track of what im doin [20:22:32] so in about 5 [20:22:36] ish [20:22:43] ok sure [20:23:13] (03PS1) 10Ottomata: partman/elasticsearch.cfg - fixing some tab differences [operations/puppet] - 10https://gerrit.wikimedia.org/r/126118 [20:23:28] (03CR) 10Ottomata: [C: 032 V: 032] partman/elasticsearch.cfg - fixing some tab differences [operations/puppet] - 10https://gerrit.wikimedia.org/r/126118 (owner: 10Ottomata) [20:26:57] (03PS1) 10RobH: adding osmium to dns [operations/dns] - 10https://gerrit.wikimedia.org/r/126119 [20:27:09] PROBLEM - Host mw1163 is DOWN: PING CRITICAL - Packet loss = 100% [20:27:49] RECOVERY - Host mw1163 is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [20:28:56] (03CR) 10RobH: [C: 032] adding osmium to dns [operations/dns] - 10https://gerrit.wikimedia.org/r/126119 (owner: 10RobH) [20:30:09] PROBLEM - Host elastic1010 is DOWN: PING CRITICAL - Packet loss = 100% [20:33:27] paravoid: ori something deployed? those req errors are back to normal-ish level? [20:33:53] (03PS6) 10Odder: Create a FeaturedFeed for the Tech News bulletin [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124272 [20:33:54] I did not [20:34:19] (03PS1) 10Manybubbles: Fix bad server names for labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/126125 [20:35:07] well neat [20:35:19] RECOVERY - Host elastic1010 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [20:35:26] annnnnd, going back up [20:35:53] ottomata: https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color%28cactiStyle%28alias%28reqstats.5xx,%225xx%20resp/min%22%29%29,%22blue%22%29 shows a spike, but oxygen's 5xx.log does not show it; this used to work, any ideas what's recently changed? [20:37:49] PROBLEM - SSH on elastic1010 is CRITICAL: Connection refused [20:39:09] ottomata: im lookin at it now [20:39:21] (the partman thing) [20:39:35] (just letting you know i didnt forget ;) [20:40:16] hm paravoid [20:40:25] no, not in that time frame [20:40:30] sqstat is currently hosted on analytics1003 [20:40:51] as of last wed [20:41:13] that is the only change I know of [20:41:15] the biggest difference is that sqstat is running on the multicast relay now [20:41:20] rather than the unicast that was going to emery [20:42:39] ottomata: do you have stuff to do on 1010? [20:42:50] yes i am trying to figure out this new partman recipe on 1010 [20:42:58] trying to make what we talked about today work [20:43:03] haven't succeeded yet [20:43:20] So what is the error you are seeing with this partman? 
[20:43:31] uh, not even sure where to look for errors, [20:43:34] its just not doing what I want [20:43:51] i want it to make a raid 0 with all remaining space on sda3 and sdb3, and mount it at /var/lib/elasticsearch [20:43:56] well, you can see what happens in it by reading the output in the installer logs [20:44:03] after install? [20:44:06] and seeing what it says its doing when it hits those steps [20:44:06] or on carbon? [20:44:14] oh, so the system is installing? [20:44:19] yes [20:44:32] i adapted this file from raid1-30G [20:44:38] so / and swap work great [20:44:39] and its just not the outcome in partitions you want then? [20:44:44] those are sda,b1 and sda,b2 [20:44:55] raid1-30G leaves sda3 and sdb3 unpartitioned [20:44:58] i'm trying to make them raid0 [20:45:06] so far haven't succeeded [20:45:22] it does what raid1-30G does, except maybe changes the fs type [20:45:28] for the physical partitions [20:45:39] what system did you reinstall with this? 1010? [20:45:39] haven't succeeded in creating the array yet [20:45:44] elastic1010 [20:45:49] i'm trying again now [20:45:52] its installing now [20:46:46] i think you may want to look at raid1-varnish.cfg [20:46:50] (03CR) 10Ottomata: [C: 032 V: 032] Fix bad server names for labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/126125 (owner: 10Manybubbles) [20:46:54] as it does something nearly identical for varnish systems [20:47:02] I think it may be a bit more useful =] [20:47:03] ACK [20:47:05] GRRRR [20:47:11] i didn't puppet merge my last changes to this file [20:47:12] sigh [20:47:18] this install try won't work either [20:49:17] so the -1 in the size declaration for the final partition is 'use rest of disk'? [20:50:28] yes, at least, it works for raid1-30G [20:50:34] that came from there [20:50:49] my changes are attempting to take the physical partitions, which do get created [20:50:54] and put raid0 on them [20:52:19] PROBLEM - Host elastic1010 is DOWN: PING CRITICAL - Packet loss = 100% [20:53:57] so yea, its difficult to pull info out of the installer logs [20:54:10] but commit your changes and run them and then lets read over the installer log and see what it tries to do with the recipe [20:54:38] (if it couldnt do the install at all it would be a bit more painful to read, so at least there is that) [20:55:33] i've been able to muddle my way through a number of partman fixes over time but its never been easy ;_; [20:55:41] ok yeah, when you say installer logs, you mean on carbon? [20:55:49] fingers crossed this time it will work! [20:55:51] iirc they are local to the system [20:55:58] in the logs directory, lemme see [20:56:06] ok [20:56:15] yep, /var/log/installer [20:56:27] ok, will check it when it boots [20:56:45] then we get to read over it (warning, may cause your eyes to bleed) [20:57:10] haha ok i'll put on my welding mask [20:57:29] these logs have led me to ask mark questions where i knew damn well the answer was going to be 'read the installer log' but hoped he would say something else.
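(Aside: the recipe being debugged here and the log-reading routine both fit in a few lines. A minimal sketch of the d-i preseed shape and the spelunking commands, not the literal contents of elasticsearch.cfg:

    # A partman-auto-raid recipe is a list of entries of the form
    #   <raid level> <#devices> <#spares> <fs> <mountpoint> <devices> .
    # Backslashes continue the value; losing one silently truncates the recipe.
    d-i partman-auto-raid/recipe string \
        1 2 0 ext3 /                      /dev/sda1#/dev/sdb1 . \
        1 2 0 swap -                      /dev/sda2#/dev/sdb2 . \
        0 2 0 ext4 /var/lib/elasticsearch /dev/sda3#/dev/sdb3 .

    # After the reboot, d-i leaves its logs on the installed host:
    ls /var/log/installer/          # syslog, partman, hardware-summary, ...
    # syslog records what debconf actually received for the recipe, so a
    # truncated or mis-joined entry shows up in its SET lines:
    grep 'partman-auto-raid' /var/log/installer/syslog

The leading field of each entry is the RAID level, so the last entry above is the RAID0 of the two third partitions mounted at /var/lib/elasticsearch; the mangled "SET 2 ext4 ..." line quoted further down is what such an entry looks like once a continuation goes missing.)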
[20:57:43] heh [20:58:11] (they get slightly easier to parse over time, but not much) [21:00:49] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [21:02:19] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [21:09:19] PROBLEM - Host elastic1010 is DOWN: PING CRITICAL - Packet loss = 100% [21:11:49] RECOVERY - SSH on elastic1010 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.3 (protocol 2.0) [21:11:59] RECOVERY - Host elastic1010 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [21:12:21] ok RobH, it did not work! [21:12:25] checking log [21:12:29] !log Zuul locked again :/ Unpooling and repooling Jenkins slaves. [21:12:34] Logged the message, Master [21:12:44] (03PS1) 10RobH: adding rhenium to netboot.cfg [operations/puppet] - 10https://gerrit.wikimedia.org/r/126130 [21:12:52] cool, lemme merge this and i'll check it out as well [21:13:13] i'll spin up this install so it can do its thing while we parse [21:13:36] cajoel: ^ this is your netflow server install im working on presently, i just added you to some relevant tickets [21:14:36] !log cleared /tmp/ on integration-slave1002 (filled up by hhvm job, known issue, bug filed already) [21:14:41] Logged the message, Master [21:15:24] !log Jenkins is processing jobs again [21:15:29] Logged the message, Master [21:16:46] !log restarting elastic1009 to test performance changes. cluster will go yellow for a few minutes. might go red (wikitech is busted) [21:16:52] Logged the message, Master [21:19:19] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5237: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 374 [21:19:19] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5237: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 374 [21:19:29] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5237: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 374 [21:19:29] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5237: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 374 [21:19:30] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5237: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 374 [21:19:30] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running.
status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5237: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 374 [21:19:30] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5237: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 374 [21:19:30] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5237: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 374 [21:19:30] PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5237: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 374 [21:19:49] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.141 [21:19:49] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5237: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 374 [21:19:49] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5237: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 374 [21:19:49] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5237: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 374 [21:19:50] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5237: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 374 [21:19:59] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5237: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 374 [21:20:19] (03CR) 10RobH: [C: 032] adding rhenium to netboot.cfg [operations/puppet] - 10https://gerrit.wikimedia.org/r/126130 (owner: 10RobH) [21:20:46] yeah yeah yeah [21:20:51] labswiki [21:21:32] cluster is back to yellow - labswiki should be back [21:22:01] (03PS1) 10Ottomata: Setting nagios retries and check intervals on CirrusSearch-slow-queries alert [operations/puppet] - 10https://gerrit.wikimedia.org/r/126132 [21:22:06] uh oh [21:22:09] oh ok [21:22:12] that ok manybubbles? 
[21:22:15] just saw your log statement [21:22:18] sok [21:22:22] its wikitech [21:22:26] k [21:22:27] it went red because wikitech was down [21:22:49] PROBLEM - Puppet freshness on ms-be1004 is CRITICAL: Last successful Puppet run was Tue 15 Apr 2014 12:20:52 PM UTC [21:22:54] RobH, I assume I should be looking at the partman log file? [21:23:01] i have no idea how to read that, you are right! [21:23:05] very repetitive! [21:23:16] yay! [21:23:17] wc -l partman [21:23:20] 18479 partman [21:23:49] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.141 [21:23:49] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5289: relocating_shards: 0: initializing_shards: 42: unassigned_shards: 280 [21:23:49] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5297: relocating_shards: 0: initializing_shards: 42: unassigned_shards: 272 [21:23:49] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5298: relocating_shards: 0: initializing_shards: 42: unassigned_shards: 271 [21:23:49] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5300: relocating_shards: 0: initializing_shards: 42: unassigned_shards: 269 [21:23:59] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5311: relocating_shards: 0: initializing_shards: 42: unassigned_shards: 258 [21:24:58] (03PS2) 10Ottomata: Setting nagios retries and check intervals on CirrusSearch-slow-queries alert [operations/puppet] - 10https://gerrit.wikimedia.org/r/126132 [21:25:03] (03CR) 10Ottomata: [C: 032 V: 032] Setting nagios retries and check intervals on CirrusSearch-slow-queries alert [operations/puppet] - 10https://gerrit.wikimedia.org/r/126132 (owner: 10Ottomata) [21:25:17] (03PS1) 10RobH: rhenium needs gpt partitioning [operations/puppet] - 10https://gerrit.wikimedia.org/r/126134 [21:26:12] (03CR) 10RobH: [C: 032 V: 032] "rob is too impatient to wait for zuul for this change" [operations/puppet] - 10https://gerrit.wikimedia.org/r/126134 (owner: 10RobH) [21:29:21] ottomata: ok, im reading through them [21:29:34] so partman yes, and also just the syslog for it as well [21:30:10] i have no idea how to read the partman file either ;] [21:30:25] but usually the syslog file has a slightly more understandable (but not really ;) [21:30:28] output [21:32:38] ok [21:32:46] ottomata: look around line 2574 [21:32:52] is where it starts partitioning, im just gettingthere [21:33:13] ok reading [21:33:14] syslog [21:33:19] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5352: relocating_shards: 0: initializing_shards: 42: unassigned_shards: 217 [21:33:19] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5352: relocating_shards: 0: initializing_shards: 42: unassigned_shards: 217 [21:33:29] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5360: relocating_shards: 0: initializing_shards: 42: unassigned_shards: 209 [21:33:29] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5361: relocating_shards: 0: initializing_shards: 42: unassigned_shards: 208 [21:33:29] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5361: relocating_shards: 0: initializing_shards: 42: unassigned_shards: 208 [21:33:30] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5363: relocating_shards: 0: initializing_shards: 42: unassigned_shards: 206 [21:33:30] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5363: relocating_shards: 0: initializing_shards: 42: unassigned_shards: 206 [21:33:30] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5363: relocating_shards: 0: initializing_shards: 42: unassigned_shards: 206 [21:33:30] PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 14: number_of_data_nodes: 14: active_primary_shards: 1895: active_shards: 5363: relocating_shards: 0: initializing_shards: 42: unassigned_shards: 206 [21:33:49] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5364: relocating_shards: 0: initializing_shards: 42: unassigned_shards: 205 [21:33:49] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5364: relocating_shards: 0: initializing_shards: 42: unassigned_shards: 205 [21:33:53] ignore [21:33:57] I'm on it [21:34:00] pssh [21:34:00] ottomata: oh, just for the record, im totally NOT doing that thing where i lead you to the answer, as that assumes i know the answer [21:34:01] Apr 15 20:53:20 debconf: --> SET 2 ext4 /var/lib/elasticsearch /dev/sda3#/dev/sdb3 . [21:34:02] its just another restart.... [21:34:02] that's not right [21:34:07] ;] [21:34:16] ottomata: we should talk about replacing that monitor [21:34:20] that should be /dev/md2 [21:34:21] I wrote something to do it [21:34:26] but we haven't talked much about it. [21:34:41] uhh ok, manybubbles, let's talk later, i'm hoping to get 1010 back up and quit working for the day [21:34:52] ottomata: yeah, later [21:34:54] hmm, well [21:35:51] ottomata: yea, seems fubar [21:36:38] yeah don't get it [21:37:09] AH! [21:37:11] i'm missing a \ [21:37:12] hm [21:37:20] i'd think it would have a long output for the 2625 Apr 15 20:53:20 debconf: --> SET partman-auto-raid/recipe 1 2 0 ext3 / /dev/sda1#/dev/sdb1 . 1 2 0 swap - /dev/sda2#/dev/sdb2 . [21:37:24] i think the line ended too early [21:37:27] the other partitions have a long output for all the settings [21:37:29] and it doesn't [21:37:30] so it got all confused [21:37:32] yeah [21:37:33] ahh, yea that would do it [21:37:39] we should see sda3 stuff on that same line [21:37:53] indeed, the installer logs always tell =] [21:38:02] even if they make eyes bleed [21:38:34] (03PS1) 10Ottomata: Another fix for elasticsearch.cfg [operations/puppet] - 10https://gerrit.wikimedia.org/r/126138 [21:38:53] ottomata: now im invested in this working =] [21:38:53] (03CR) 10Ottomata: [C: 032 V: 032] Another fix for elasticsearch.cfg [operations/puppet] - 10https://gerrit.wikimedia.org/r/126138 (owner: 10Ottomata) [21:38:58] ha, yeah! [21:39:06] !log jenkins /var/lib/git cleaned up on gallium [21:39:06] wish it didn't take so long to find out!
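For anyone retracing this debugging session: while d-i is running, its logs live under /var/log/ (reachable from the installer's shell console), and they are preserved under /var/log/installer/ on the finished system. A rough way to jump to the partitioning step RobH calls out, assuming those standard d-i paths:

    grep -n 'debconf: --> SET' /var/log/syslog | less    # locate the partman recipe lines being applied
    sed -n '2550,2650p' /var/log/syslog                  # read around the line-2574 region mentioned above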
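The bug ottomata spots is easiest to see against a well-formed recipe. Here is a minimal sketch of a partman-auto-raid preseed with the same layout (illustrative values only, not the actual elasticsearch.cfg): each "raid-level device-count spare-count fs mountpoint devices ." group must land on one logical line, so one missing trailing backslash truncates the recipe and drops the sda3/sdb3 pair out of the md2 definition, which is exactly the short debconf SET line quoted above.

    d-i partman-auto-raid/recipe string \
        1 2 0 ext3 / /dev/sda1#/dev/sdb1 . \
        1 2 0 swap - /dev/sda2#/dev/sdb2 . \
        1 2 0 ext4 /var/lib/elasticsearch /dev/sda3#/dev/sdb3 .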
[21:39:12] Logged the message, Master [21:41:27] ok rebooting elastic1010 for like the 6th time today [21:41:52] just remember to make a sacrifice to the server gods every 13th reboot [21:42:29] just shredding a random file out of your document store usually suffices. [21:43:35] PROBLEM - Host elastic1010 is DOWN: PING CRITICAL - Packet loss = 100% [21:44:26] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:44:26] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:44:26] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:44:35] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:44:35] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:44:35] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:44:35] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:44:35] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:44:35] RECOVERY - ElasticSearch health check on elastic1016 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:44:45] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:44:45] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:44:45] RECOVERY - ElasticSearch health check on elastic1007 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:44:55] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:44:55] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:45:05] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:45:46] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00037037037037 [21:46:00] (03CR) 10Hashar: "Thank you!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/126025 (owner: 10Dzahn) [21:46:35] PROBLEM - Host mw1163 is DOWN: PING CRITICAL - Packet loss = 100% [21:47:25] RECOVERY - Host mw1163 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [21:47:33] ottomata: so now im battling partman for my server, and its halting on that step with no output and cannot access logs ;_; [21:47:49] its gonna be a partman kinda afternoon [21:48:03] <^demon|away> ottomata, manybubbles: You guys doing another box? [21:48:09] * ^demon|away saw the recoveries [21:48:20] still on 1010, the recoveries are manybubbles thang [21:48:34] <^demon|away> Ah k [21:48:45] RECOVERY - Host elastic1010 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [21:48:55] ^demon|away: RobH you can blame ^demon|away for making our eyes bleed on partman logs!
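The shard counts in these alerts come straight from Elasticsearch's cluster health API, which the Icinga check polls; you can read the same state by hand (hostname here is illustrative):

    curl -s 'http://elastic1001:9200/_cluster/health?pretty'
    # status is red while the restarted node's primary shards are unassigned
    # (active_primary_shards 1895 vs 1896), then green again once
    # unassigned_shards drains back to 0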
[21:48:55] RECOVERY - Puppet freshness on ms-be1004 is OK: puppet ran at Tue Apr 15 21:48:47 UTC 2014 [21:48:58] :p [21:49:13] * ^demon|away hides [21:49:33] RobH, I am sorry for your partman loss [21:49:39] i am glad I am not in your situation [21:49:47] my partman at least is almost working :) [21:50:46] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [21:50:55] PROBLEM - SSH on elastic1010 is CRITICAL: Connection refused [21:51:35] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (201678) [21:51:42] !log restarting elastic1009 again [21:51:46] Logged the message, Master [21:54:44] (03PS1) 10RobH: fixing raid1-gpt to use variable rest of disk space [operations/puppet] - 10https://gerrit.wikimedia.org/r/126142 [21:54:59] <^d> check_job_queue lies. [21:55:10] <^d> no wikis have > 160k jobs, much less 199999 [21:55:18] <^d> Oh, total, [21:55:20] <^d> silly check [21:55:58] ^d: nice [21:56:36] (03CR) 10RobH: [C: 032] fixing raid1-gpt to use variable rest of disk space [operations/puppet] - 10https://gerrit.wikimedia.org/r/126142 (owner: 10RobH) [21:57:25] PROBLEM - Puppet freshness on elastic1010 is CRITICAL: Last successful Puppet run was Tue 15 Apr 2014 06:57:10 PM UTC [21:57:51] <^d> Easy fix, actually. [21:59:13] ^d: Re Gerrit, do you know the syntax for finding open changesets that cannot be merged automatically ("owner:self status:open ???")? [22:00:17] <^d> Offhand I can't think of what it'd be. [22:00:19] <^d> https://gerrit.wikimedia.org/r/Documentation/user-search.html would know [22:00:42] <^d> Doesn't seem so :\ [22:02:35] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [22:02:39] ^d: k, thanks. [22:03:28] ^d: Do we have other labels than Verified and CodeReview defined in this instance? [22:03:34] <^d> Nope [22:03:45] O M G RobH [22:03:48] /dev/md2 494G 198M 469G 1% /var/lib/elasticsearch [22:03:55] RECOVERY - SSH on elastic1010 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.3 (protocol 2.0) [22:04:28] going to reboot this machine just to double check that the array is recreated on boot [22:05:38] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (203059) [22:06:35] why job queue get big? [22:06:38] PROBLEM - Host elastic1010 is DOWN: PING CRITICAL - Packet loss = 100% [22:08:20] its just the parsoid jobs - so long as they go down over time we're ok with that.... [22:08:58] RECOVERY - Host elastic1010 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [22:09:20] awesooome it works! [22:10:38] RECOVERY - Puppet freshness on elastic1010 is OK: puppet ran at Tue Apr 15 22:10:30 UTC 2014 [22:11:46] ottomata: sweet [22:11:52] and i fixed mine as well and got it installed [22:11:57] success all around \o/ [22:12:18] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [22:12:39] ooooh [22:15:54] ottomata: though i have to admit its kinda nice to fix baffling partman errors ;] [22:16:43] it is satisfying [22:16:54] it also means no one has to ever think about formatting elastic nodes again [22:17:01] (uh, unless different disks :/ ) [22:17:11] then it should be another recipe [22:17:15] aye [22:17:18] nothing should be manual on production =] [22:17:28] RECOVERY - DPKG on elastic1010 is OK: All packages OK [22:17:28] RECOVERY - check if dhclient is running on elastic1010 is OK: PROCS OK: 0 processes with command name dhclient [22:17:38] RECOVERY - RAID on elastic1010 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [22:17:38] RECOVERY - check configured eth on elastic1010 is OK: NRPE: Unable to read output [22:17:57] hm, what about this RobH? [22:17:59] i've been running [22:18:06] tune2fs -m 0 on the partitions [22:18:13] after I formatted them ext4 [22:18:18] RECOVERY - Disk space on elastic1010 is OK: DISK OK [22:18:18] RECOVERY - puppet disabled on elastic1010 is OK: OK [22:18:58] can we automate that somehow? [22:19:18] hrmm [22:19:56] im sure we can somehow, but im not sure the best way [22:20:07] i imagine we also never want it run after data is in place right? [22:20:38] (03PS1) 10Chad: Clean up job queue length check [operations/puppet] - 10https://gerrit.wikimedia.org/r/126148 [22:22:30] (03CR) 10Chad: "Needs Iaedbdb90 to actually work, but won't fatal or anything without it." [operations/puppet] - 10https://gerrit.wikimedia.org/r/126148 (owner: 10Chad) [22:25:10] yeahhh, ok moving shards back to elastic1010 now [22:25:21] it isn't really harmful [22:25:33] it just tells the fs to not reserve blocks [22:25:45] but it has to be run while the partition is not mounted [22:25:56] RobH, i'm going to sign off, and think about that tomorrow [22:26:07] manybubbles: elastic1010 is moving shards back now [22:26:10] should be good to go [22:26:19] i'm here for a bit longer, but not really working anymore [22:26:23] if anything is weird ping me [22:26:25] ottomata: cyas, glad ya got it fixed, easier to relax now [22:26:45] thanks! ottomata and RobH [22:26:54] I see the shards recovering to the host [22:29:35] <^d> ottomata, manybubbles: Important doc update. https://wikitech.wikimedia.org/w/index.php?title=Search&diff=109767&oldid=109765 [22:31:38] RECOVERY - NTP on elastic1010 is OK: NTP OK: Offset -0.01151764393 secs [22:36:34] have you switched mediawiki.org to elastic search yet? [22:44:11] <^d> hashar: It's been on Cirrus for ages :p [22:44:21] <^d> Like the first after test(2)wiki [22:48:31] (03PS1) 10Hashar: admins::jenkins sort account list [operations/puppet] - 10https://gerrit.wikimedia.org/r/126154 [22:48:33] (03PS1) 10Hashar: contint: gives access to Bryan Davis [operations/puppet] - 10https://gerrit.wikimedia.org/r/126155 [22:50:17] ^d: I noticed a page that is not updated in the search result :D [22:50:34] <^d> What page? [22:50:48] ah no it is gone after I edited it apparently [22:51:12] sorry :] [22:51:28] <^d> Gone?
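On hashar's Gerrit question from earlier (open changesets that cannot be merged automatically): the search documentation for this instance indeed has no such operator, but newer Gerrit releases added an is:mergeable predicate, so on a recent enough server the query would look something like:

    owner:self status:open -is:mergeable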
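For context on the tune2fs exchange above: -m 0 sets the filesystem's reserved-blocks percentage to zero, which is why ottomata calls it harmless, and doing it right after mkfs sidesteps the never-after-data worry. One conceivable automation hook, sketched only; whether d-i's late_command is the right phase, and the device name, are assumptions:

    # in the preseed file, after partman has built /dev/md2 and formatted it ext4
    d-i preseed/late_command string \
        in-target tune2fs -m 0 /dev/md2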
[22:51:31] <^d> Hmm :) [22:51:32] <^d> Ok [22:57:02] (03PS2) 10Hashar: contint: gives access to Bryan Davis [operations/puppet] - 10https://gerrit.wikimedia.org/r/126155 [22:57:24] (03CR) 10Hashar: "Since I have no clue whether that requires an RT ticket, I filed one https://rt.wikimedia.org/Ticket/Display.html?id=7292 and amended the c" [operations/puppet] - 10https://gerrit.wikimedia.org/r/126155 (owner: 10Hashar) [22:57:37] off to bed! [22:57:54] * bd808 waves [22:58:11] ah no [22:58:44] ^d: I run git repack on Gerrit replication destinations ? Should we do that automatically as a monthly maintenance task ? [22:59:12] <^d> You can [22:59:26] ^d: will add some cron somewhere and add you as a reviewer :] [22:59:47] greg-g, is there a ban on me deploying my own swat deploy? [23:00:00] <^d> Okie dokie [23:00:07] mwalker: nope [23:00:18] fantastic [23:00:33] mwalker: are you doing swat then? [23:00:33] (03CR) 10Mwalker: [C: 032] Enable CentralNotice CrossWiki Hiding [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126065 (owner: 10Mwalker) [23:00:39] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [23:00:40] i mean, the other swat patches, if any [23:00:54] there aren't any [23:00:59] so it's just my config change [23:01:02] Too late to get one in? :D [23:01:12] marktraceur, no; just add it to the calendar [23:01:15] 'kay [23:01:17] and then tell me :) [23:01:50] !log restarting Zuul to clear leaked file descriptor (known issue, fixed upstream) [23:01:55] Added, mwalker [23:01:57] Logged the message, Master [23:02:05] https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=109769&oldid=109748 [23:02:15] Uhhh [23:02:24] Only to wmf22, which I believe is the branch on Commons [23:02:43] I can make a cherry-pick if you want [23:02:45] mwalker: "official" "nope" [23:03:00] Reedy, I'm going to push https://gerrit.wikimedia.org/r/#/c/121834/ and https://gerrit.wikimedia.org/r/#/c/113656/ [23:03:05] which are undeployed configuration changes [23:03:27] Ugh [23:03:29] marktraceur, I'll take the cherry pick work if you do it [23:03:30] Thanks/sorry [23:03:33] np [23:03:44] I thought I'd pulled after jenkins had finished merging [23:04:39] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (201681) [23:04:44] mwalker: Updated. [23:05:03] (03CR) 10Hashar: [C: 04-1] "The job I started earlier today is still running. I am not sure how much time it takes to traverse the whole file hierarchy." [operations/puppet] - 10https://gerrit.wikimedia.org/r/125991 (owner: 10Hashar) [23:05:10] !log mwalker Started scap: Configuration changes, {{gerrit|113656}}, {{gerrit|121834}}, {{gerrit|126065}} [23:05:15] Logged the message, Master [23:05:29] (03CR) 10Ori.livneh: "Is there a way to constrain the range of fonts that will be loaded automatically?"
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/115153 (owner: 10Odder) [23:06:30] PROBLEM - Host mw1163 is DOWN: PING CRITICAL - Packet loss = 100% [23:07:09] RECOVERY - Host mw1163 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [23:08:04] (03PS1) 10Faidon Liambotis: Add ms-fe.esams.wmnet round-robin [operations/dns] - 10https://gerrit.wikimedia.org/r/126159 [23:08:17] (03CR) 10Faidon Liambotis: [C: 032] Add ms-fe.esams.wmnet round-robin [operations/dns] - 10https://gerrit.wikimedia.org/r/126159 (owner: 10Faidon Liambotis) [23:08:21] !log mwalker Finished scap: Configuration changes, {{gerrit|113656}}, {{gerrit|121834}}, {{gerrit|126065}} (duration: 03m 11s) [23:08:27] Logged the message, Master [23:08:37] Reedy, ^ if you want to test [23:10:04] (03PS1) 10Faidon Liambotis: swift: remove pmtpa-labs & pmtpa-labsupgrade roles [operations/puppet] - 10https://gerrit.wikimedia.org/r/126161 [23:10:06] (03PS1) 10Faidon Liambotis: Swift: add esams-prod role classes and use them [operations/puppet] - 10https://gerrit.wikimedia.org/r/126162 [23:10:26] (03CR) 10Faidon Liambotis: [C: 032 V: 032] swift: remove pmtpa-labs & pmtpa-labsupgrade roles [operations/puppet] - 10https://gerrit.wikimedia.org/r/126161 (owner: 10Faidon Liambotis) [23:12:27] (03PS1) 10Hoo man: Add abusefilter-modify-restricted to sysop on commons [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126163 [23:12:43] mwalker: You doing swat or something? [23:12:55] hoo, yepyep [23:13:11] If you've got some more moments, you might want to add that in [23:13:27] sure; let me get marktraceur's stuff out [23:14:05] hmm; actually hoo; do you have a bug request or wiki discussion or something for this? [23:14:20] (03CR) 10Faidon Liambotis: [C: 032] Swift: add esams-prod role classes and use them [operations/puppet] - 10https://gerrit.wikimedia.org/r/126162 (owner: 10Faidon Liambotis) [23:14:40] mwalker: Stuff is broken over there in the current configuration... so I doubt there's consensus needed [23:14:48] not a big deal, also [23:14:55] hokay; taking your word for it [23:15:26] (03CR) 10Mwalker: [C: 032] "[16:14] hmm; actually hoo; do you have a bug request or wiki discussion or something for this?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126163 (owner: 10Hoo man) [23:15:49] mwalker: A Commons admin came into -stewards to ask a steward to do some changes to a filter needing that userright - plus when enabling restricted actions, that right should really be there regardless anyway :) [23:16:21] (03CR) 10Mwalker: "[16:15] mwalker: A Commons admin came into -stewards to ask a steward to do some changes to a filter needing that userright - " [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126163 (owner: 10Hoo man) [23:16:57] mwalker: are we just copy and pasting what people ping you with now into bugs? :p [23:17:02] *patches [23:17:38] JohnLewis: That's a decent way of documenting the discussion that happened [23:17:39] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [23:17:59] hoo: I was 'joking' :p [23:18:25] JohnLewis, ya; I'm just documenting the rationale behind the change in case anyone comes over later and goes; wtf [23:18:28] JohnLewis: You troll me quite often... I should have known better :P [23:18:54] mwalker: So 'blame them, they convinced me to do it!' [23:19:06] hoo: You should! [23:19:23] "It was a bad idea. The users made me do it."
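Circling back to the monthly repack cron hashar volunteered a little earlier: a minimal sketch of what it might look like, with the path, schedule, and bare-repo layout all assumptions rather than what was actually deployed:

    # m h dom mon dow   (run on the 1st of each month at 03:30)
    30 3 1 * * for r in /var/lib/git/*.git; do git --git-dir="$r" repack -a -d -q; done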
[23:19:37] !log mwalker Started scap: Configuration change {{gerrit|126163}} and MultimediaViewer {{gerrit|126158}} [23:19:42] hehe [23:19:43] Logged the message, Master [23:20:15] nah; it's more like, it looked sane, they had a good rationale; if we want we can revert it and then ask the requesters about it later [23:20:49] isn't that a rough analogue of the Change, Revert, Discuss workflow? [23:21:01] marktraceur, I'm scapping your change now [23:21:17] *nod* watching with bated breath [23:21:20] ah damn... made a mistake [23:21:20] 'I'm sane, they're not. Got it? Here's the IRC logs to prove it!' :p [23:21:30] They don't have an abusefilter group :P [23:21:41] Need to add it to sysop like I said in the summary [23:21:53] !log mwalker Finished scap: Configuration change {{gerrit|126163}} and MultimediaViewer {{gerrit|126158}} (duration: 02m 15s) [23:21:54] * hoo restarts his brain [23:21:59] Logged the message, Master [23:22:00] mwalker: Well, these IRC logs don't make up for that mistake :p [23:22:24] ya; but I know of hoo by reputation [23:22:35] I wouldn't just do it for people I didn't know [23:22:39] hoo: Currently fixing it? [23:22:57] Huzzah, thanks mwalker [23:22:59] (03PS1) 10Hoo man: There's no abusefilter group on commons, just sysop [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126168 [23:23:13] mwalker: ^ :P I can also do it, if I already managed to annoy you [23:23:25] (03PS2) 10Mwalker: There's no abusefilter group on commons, just sysop [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126168 (owner: 10Hoo man) [23:23:31] (it's working) [23:23:39] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (200490) [23:23:39] (03CR) 10Mwalker: [C: 032] There's no abusefilter group on commons, just sysop [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126168 (owner: 10Hoo man) [23:23:50] :D [23:24:19] hoo, you have deploy rights? [23:24:34] mwalker: Is that a surprise to you? :P Yep [23:24:55] mwalker: Well, until ops find out he made you merge bad code :D [23:25:52] !log mwalker synchronized wmf-config/abusefilter.php '{{gerrit|126168}} more abuse filter configuration fun' [23:25:59] Logged the message, Master [23:26:12] hoo, try it now [23:26:25] mwalker: Now it's right... bah, thanks [23:26:33] shiney [23:26:39] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [23:26:46] I wtfed a bit looking at https://commons.wikimedia.org/wiki/Special:ListGroupRights seeing a new group :P [23:27:44] thanks, mwalker :) [23:28:01] np [23:28:06] greg-g, I think I'm done [23:28:26] Wait, how long has 'Namespace restrictions' been on Special:ListGroupRights? [23:28:40] JohnLewis: ask Krenair :P [23:28:51] He did it quite some time back AFAIR [23:29:14] I only just noticed on testwikidatawiki hoo :p [23:29:23] JohnLewis, for not a very long time at all [23:29:24] mwalker: awesome [23:29:31] The patch was made ages ago, only merged recently [23:29:34] mwalker: yay two scaps under 3.5 minutes! [23:29:51] Krenair: Ah - nice feature anyway :) [23:29:55] greg-g, I know! scap shall no longer be my whipping boy [23:29:59] :) [23:30:00] :) [23:34:19] * marktraceur dumps gatorade on mwalker [23:34:21] Thanks coach [23:34:34] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (200725) [23:34:39] hah [23:34:46] wouldn't the coach be greg?
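The log never shows the diff for 126163/126168 themselves, but the end state described above, Commons sysops able to edit filters that use restricted actions and no separate abusefilter group, would amount to something like this in wmf-config/abusefilter.php (a sketch, not the committed code):

    // commonswiki: let admins modify filters that use restricted actions
    if ( $wgDBname === 'commonswiki' ) {
        $wgGroupPermissions['sysop']['abusefilter-modify-restricted'] = true;
    }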
[23:35:08] JohnLewis, hoo: Less than a week ago actually. [23:35:19] Maybe [23:35:25] mwalker: What are you, the QB? [23:35:32] Krenair: Really? Explains why I haven't seen it yet. [23:35:51] JohnLewis, see https://gerrit.wikimedia.org/r/#/c/40096/ [23:35:59] marktraceur, hmm... I'm not sure this analogy is working so well [23:36:03] PROBLEM - swift-container-updater on ms-be3001 is CRITICAL: NRPE: Command check_swift-container-updater not defined [23:36:03] PROBLEM - swift-container-replicator on ms-be3002 is CRITICAL: NRPE: Command check_swift-container-replicator not defined [23:36:03] PROBLEM - swift-account-server on ms-be3003 is CRITICAL: NRPE: Command check_swift-account-server not defined [23:36:10] marktraceur, mwalker: greg-g would be the calendar. Telling you when you can do things, mwalker is the mastermind of the case :P [23:36:13] PROBLEM - swift-container-server on ms-be3002 is CRITICAL: NRPE: Command check_swift-container-server not defined [23:36:13] PROBLEM - swift-container-auditor on ms-be3003 is CRITICAL: NRPE: Command check_swift-container-auditor not defined [23:36:13] PROBLEM - swift-object-auditor on ms-be3001 is CRITICAL: NRPE: Command check_swift-object-auditor not defined [23:36:13] PROBLEM - swift-container-replicator on ms-be3003 is CRITICAL: NRPE: Command check_swift-container-replicator not defined [23:36:13] PROBLEM - swift-container-updater on ms-be3002 is CRITICAL: NRPE: Command check_swift-container-updater not defined [23:36:13] PROBLEM - swift-object-replicator on ms-be3001 is CRITICAL: NRPE: Command check_swift-object-replicator not defined [23:36:13] JohnLewis, Uploaded 2012-12-23, merged 2014-04-10 [23:36:16] hoo, I just had a thought... what change did we make that caused the permissions problem? [23:36:23] PROBLEM - Memcached on ms-fe3001 is CRITICAL: Connection refused [23:36:23] PROBLEM - swift-object-auditor on ms-be3002 is CRITICAL: NRPE: Command check_swift-object-auditor not defined [23:36:23] PROBLEM - swift-container-server on ms-be3003 is CRITICAL: NRPE: Command check_swift-container-server not defined [23:36:23] PROBLEM - swift-object-server on ms-be3001 is CRITICAL: NRPE: Command check_swift-object-server not defined [23:36:23] PROBLEM - Swift HTTP backend on ms-fe3001 is CRITICAL: Connection refused [23:36:23] PROBLEM - swift-account-auditor on ms-be3001 is CRITICAL: NRPE: Command check_swift-account-auditor not defined [23:36:24] PROBLEM - swift-object-updater on ms-be3001 is CRITICAL: NRPE: Command check_swift-object-updater not defined [23:36:24] PROBLEM - swift-object-replicator on ms-be3002 is CRITICAL: NRPE: Command check_swift-object-replicator not defined [23:36:24] PROBLEM - swift-container-updater on ms-be3003 is CRITICAL: NRPE: Command check_swift-container-updater not defined [23:36:33] PROBLEM - Swift HTTP frontend on ms-fe3001 is CRITICAL: Connection refused [23:36:34] PROBLEM - swift-account-reaper on ms-be3001 is CRITICAL: NRPE: Command check_swift-account-reaper not defined [23:36:34] PROBLEM - swift-object-auditor on ms-be3003 is CRITICAL: NRPE: Command check_swift-object-auditor not defined [23:36:34] PROBLEM - swift-object-server on ms-be3002 is CRITICAL: NRPE: Command check_swift-object-server not defined [23:36:34] PROBLEM - swift-object-replicator on ms-be3003 is CRITICAL: NRPE: Command check_swift-object-replicator not defined [23:36:34] PROBLEM - swift-object-updater on ms-be3002 is CRITICAL: NRPE: Command check_swift-object-updater not defined [23:36:34] PROBLEM - swift-account-replicator on ms-be3001 is 
CRITICAL: NRPE: Command check_swift-account-replicator not defined [23:36:35] PROBLEM - swift-account-auditor on ms-be3002 is CRITICAL: NRPE: Command check_swift-account-auditor not defined [23:36:43] uhhhh [23:36:43] mwalker: Pretty good question [23:36:43] PROBLEM - Memcached on ms-fe3002 is CRITICAL: Connection refused [23:36:43] PROBLEM - swift-account-server on ms-be3001 is CRITICAL: NRPE: Command check_swift-account-server not defined [23:36:43] PROBLEM - swift-object-server on ms-be3003 is CRITICAL: NRPE: Command check_swift-object-server not defined [23:36:43] PROBLEM - swift-account-reaper on ms-be3002 is CRITICAL: NRPE: Command check_swift-account-reaper not defined [23:36:44] PROBLEM - Swift HTTP backend on ms-fe3002 is CRITICAL: Connection refused [23:36:53] PROBLEM - swift-object-updater on ms-be3003 is CRITICAL: NRPE: Command check_swift-object-updater not defined [23:36:53] PROBLEM - swift-account-replicator on ms-be3002 is CRITICAL: NRPE: Command check_swift-account-replicator not defined [23:36:53] PROBLEM - swift-account-auditor on ms-be3003 is CRITICAL: NRPE: Command check_swift-account-auditor not defined [23:36:53] PROBLEM - swift-container-auditor on ms-be3001 is CRITICAL: NRPE: Command check_swift-container-auditor not defined [23:36:53] PROBLEM - Swift HTTP frontend on ms-fe3002 is CRITICAL: Connection refused [23:36:53] PROBLEM - swift-account-reaper on ms-be3003 is CRITICAL: NRPE: Command check_swift-account-reaper not defined [23:36:54] PROBLEM - swift-account-server on ms-be3002 is CRITICAL: NRPE: Command check_swift-account-server not defined [23:36:54] PROBLEM - swift-container-replicator on ms-be3001 is CRITICAL: NRPE: Command check_swift-container-replicator not defined [23:37:00] Whoaaaa [23:37:03] PROBLEM - swift-container-auditor on ms-be3002 is CRITICAL: NRPE: Command check_swift-container-auditor not defined [23:37:03] PROBLEM - swift-account-replicator on ms-be3003 is CRITICAL: NRPE: Command check_swift-account-replicator not defined [23:37:03] PROBLEM - swift-container-server on ms-be3001 is CRITICAL: NRPE: Command check_swift-container-server not defined [23:37:03] greg-g, I don't think I caused this.... [23:37:12] I'm currently trying to identify the reason, somehow that must have been possible at some point [23:37:16] Right - Who broke icinga? :p [23:37:34] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [23:37:58] and I couldn't find a rights change leading to that (neither on wiki nor in git) [23:38:03] wtf [23:38:35] paravoid: swift? ^^ [23:38:54] greg-g: backend swift storage nodes according to wikitech. [23:39:05] yeah [23:39:09] hoo, well; if it broke earlier today it would be some regression from the 1.23wmf22 deploy [23:39:26] mwalker: Don't think so... if it was in AbuseFilter, I would have seen it [23:39:44] probably the problem has been there for a bit longer, but nobody tried to edit a filter since [23:39:50] heh [23:39:52] maybe because of my configuration clean ups :P [23:40:25] do you think it's worth exploring further? [23:40:34] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (201005) [23:40:38] I'm loath to apply a bandaid if it's just covering up a festering wound [23:41:01] also; nice job queue...
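The "Command ... not defined" criticals mean Icinga's server-side service checks were created before the new esams hosts had the matching NRPE command stanzas; once puppet catches up, the recoveries below report plain check_procs output instead. Judging from those recovery messages, the missing definition is shaped roughly like this (a sketch; the real stanza is generated by puppet and the file path is a guess):

    # /etc/nagios/nrpe.d/swift.cfg on the ms-be30xx hosts
    command[check_swift-container-updater]=/usr/lib/nagios/plugins/check_procs -c 1:1 --ereg-argument-array '^/usr/bin/python /usr/bin/swift-container-updater'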
[23:41:11] eep [23:41:12] sorry about that [23:41:15] the ghost wiki is apparently busy today [23:41:24] ignore ms-fe3xxx/ms-be3xxx alerts [23:41:32] paravoid: :) ok [23:41:34] paravoid, kk; so long as its expected [23:43:34] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [23:47:34] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (202656) [23:47:34] RECOVERY - swift-account-replicator on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [23:47:43] RECOVERY - swift-account-server on ms-be3001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [23:47:53] RECOVERY - swift-container-auditor on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:47:53] RECOVERY - swift-container-replicator on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [23:48:03] RECOVERY - swift-container-server on ms-be3001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [23:48:13] RECOVERY - swift-object-auditor on ms-be3001 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [23:48:13] RECOVERY - swift-object-replicator on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [23:48:23] RECOVERY - swift-object-server on ms-be3001 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [23:48:24] RECOVERY - swift-account-auditor on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [23:48:33] RECOVERY - swift-account-reaper on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [23:56:33] RECOVERY - Swift HTTP frontend on ms-fe3001 is OK: HTTP OK: HTTP/1.1 200 OK - 137 bytes in 0.198 second response time [23:56:53] RECOVERY - Swift HTTP frontend on ms-fe3002 is OK: HTTP OK: HTTP/1.1 200 OK - 137 bytes in 0.196 second response time [23:57:03] RECOVERY - swift-container-updater on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [23:57:03] RECOVERY - swift-container-replicator on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [23:57:13] RECOVERY - swift-container-auditor on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:57:13] RECOVERY - swift-container-server on ms-be3002 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [23:57:13] RECOVERY - swift-container-replicator on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [23:57:23] RECOVERY - swift-container-server on ms-be3003 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [23:57:23] RECOVERY - swift-object-auditor on ms-be3002 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [23:57:24] RECOVERY - swift-container-updater on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [23:57:24] RECOVERY - swift-object-updater on ms-be3001 is OK: PROCS OK: 1 process with regex args 
^/usr/bin/python /usr/bin/swift-object-updater [23:57:33] RECOVERY - swift-object-replicator on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [23:57:33] RECOVERY - swift-object-server on ms-be3002 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [23:57:34] RECOVERY - swift-object-auditor on ms-be3003 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [23:57:34] RECOVERY - swift-account-auditor on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [23:57:34] RECOVERY - swift-object-replicator on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [23:57:34] RECOVERY - swift-object-updater on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [23:57:43] RECOVERY - swift-object-server on ms-be3003 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [23:57:43] RECOVERY - swift-account-reaper on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [23:57:53] RECOVERY - swift-account-auditor on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [23:57:53] RECOVERY - swift-account-replicator on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [23:57:53] RECOVERY - swift-object-updater on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [23:57:53] RECOVERY - swift-account-reaper on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [23:57:53] RECOVERY - swift-account-server on ms-be3002 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [23:58:03] RECOVERY - swift-account-replicator on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [23:58:03] RECOVERY - swift-container-auditor on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:58:03] RECOVERY - swift-account-server on ms-be3003 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [23:58:13] RECOVERY - swift-container-updater on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [23:58:34] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [23:58:40] sorry about that :) [23:59:47] wee