[16:58:58] !log installing many package upgrades on wikitech-static
[16:59:01] Logged the message, Master
[17:00:33] (CR) BryanDavis: [WIP] Add role::mediawiki_vagrant_lxc (2 comments) [puppet] - https://gerrit.wikimedia.org/r/193665 (owner: BryanDavis)
[17:01:13] godog: fwiw, here is how to schedule a downtime from shell: https://wikitech.wikimedia.org/wiki/Icinga#Scheduling_downtimes_with_a_shell_command if you can think of where to execute that
[17:02:59] (yea, we actually had Labs Nagios back then, regression)
[17:04:23] mutante: mhh perhaps via http would be less cumbersome, e.g. https://github.com/zorkian/nagios-api
[17:04:46] PROBLEM - salt-minion processes on wtp2005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[17:06:16] PROBLEM - Disk space on uranium is CRITICAL: DISK CRITICAL - free space: / 348 MB (3% inode=84%):
[17:06:16] RECOVERY - dhclient process on wtp2004 is OK: PROCS OK: 0 processes with command name dhclient
[17:06:37] RECOVERY - DPKG on wtp2004 is OK: All packages OK
[17:07:05] RECOVERY - RAID on wtp2004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[17:07:15] RECOVERY - configured eth on wtp2004 is OK: NRPE: Unable to read output
[17:07:31] godog: how about running it via ssh?
[17:07:35] RECOVERY - Disk space on wtp2004 is OK: DISK OK
[17:08:25] RECOVERY - puppet last run on wtp2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:08:35] RECOVERY - NTP on wtp2004 is OK: NTP OK: Offset 0.0703625679 secs
[17:09:03] mutante: yep why not
[17:15:45] (PS1) Papaul: add asset tag mgmt info fro mw(2135-2214) [dns] - https://gerrit.wikimedia.org/r/200889
[17:21:37] PROBLEM - NTP on wtp2019 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:21:48] (PS1) John F. Lewis: create shell for lpintscher [puppet] - https://gerrit.wikimedia.org/r/200891 (https://phabricator.wikimedia.org/T94390)
[17:27:41] (CR) Dzahn: [C: 1] "the username and UID match LDAP, the groups seem to be what is requested, the key is different from labs. lgtm" [puppet] - https://gerrit.wikimedia.org/r/200891 (https://phabricator.wikimedia.org/T94390) (owner: John F. Lewis)
[17:29:32] (CR) Dzahn: "just needs the waiting period" [puppet] - https://gerrit.wikimedia.org/r/200891 (https://phabricator.wikimedia.org/T94390) (owner: John F. Lewis)
[17:31:38] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: puppet fail
[17:33:28] RECOVERY - salt-minion processes on wtp2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[17:33:49] RECOVERY - Disk space on wtp2019 is OK: DISK OK
[17:34:08] RECOVERY - DPKG on wtp2019 is OK: All packages OK
[17:34:18] RECOVERY - RAID on wtp2019 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[17:34:28] RECOVERY - configured eth on wtp2019 is OK: NRPE: Unable to read output
[17:34:37] RECOVERY - dhclient process on wtp2019 is OK: PROCS OK: 0 processes with command name dhclient
[17:34:48] RECOVERY - puppet last run on wtp2019 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[17:36:18] RECOVERY - NTP on wtp2019 is OK: NTP OK: Offset -0.01317048073 secs
[17:40:13] (PS1) 20after4: Group1 wikis to 1.25wmf23 [mediawiki-config] - https://gerrit.wikimedia.org/r/200897
[17:42:50] (CR) 20after4: [C: 2] Group1 wikis to 1.25wmf23 [mediawiki-config] - https://gerrit.wikimedia.org/r/200897 (owner: 20after4)
[17:44:21] (PS1) Manybubbles: Remove "using new search" message [mediawiki-config] - https://gerrit.wikimedia.org/r/200898
[17:44:52] (Merged) jenkins-bot: Group1 wikis to 1.25wmf23 [mediawiki-config] - https://gerrit.wikimedia.org/r/200897 (owner: 20after4)
[17:48:15] (PS1) Glaisher: Revert "Make Spam Blacklist global file protocol-relative" [mediawiki-config] - https://gerrit.wikimedia.org/r/200901 (https://phabricator.wikimedia.org/T94591)
[17:48:21] twentyafterfour, can you let me know when you're done with the deploy please?
[17:48:26] greg-g: twentyafterfour ^
[17:48:36] can that be deployed please?
[17:48:53] I was just waiting for the window in 10 minutes
[17:49:23] if there is no reason to wait I'll go ahead and push it
[17:49:25] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:49:39] you merged a config patch 10 minutes before your window? :/
[17:49:56] Krenair: yeah I thought it was time
[17:50:41] twentyafterfour, let's pull them both onto tin, and you can sync your one when it's time?
[17:50:54] ok
[17:51:18] (CR) Alex Monk: [C: 2] Revert "Make Spam Blacklist global file protocol-relative" [mediawiki-config] - https://gerrit.wikimedia.org/r/200901 (https://phabricator.wikimedia.org/T94591) (owner: Glaisher)
[17:52:26] (CR) Alex Monk: [V: 2] Revert "Make Spam Blacklist global file protocol-relative" [mediawiki-config] - https://gerrit.wikimedia.org/r/200901 (https://phabricator.wikimedia.org/T94591) (owner: Glaisher)
[17:53:50] ok so pull and deploy them both? or go ahead and deploy CommonSettings now?
[17:54:26] I just did
[17:54:33] !log krenair Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/200901/ (duration: 01m 06s)
[17:54:39] Logged the message, Master
[17:54:40] ok cool
[17:54:51] (pulled both, synced only one, since your one does not touch commonsettings)
[17:55:00] Glaisher, ^
[17:55:02] please test
[17:55:08] thanks
[17:55:14] I guess it'll take some time
[17:55:30] after the original patch was deployed it was working
[17:55:50] the error rate is certainly going down
[17:55:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[17:56:28] ok.. the edit didn't go thru
[17:57:46] PROBLEM - HHVM rendering on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:57:46] PROBLEM - Apache HTTP on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:59:19] so...obvious question here....what was the point of making the file proto-rel?
[17:59:44] I was wondering the same thing
[18:00:04] everything else we now have is protocol-relative, no?
[18:00:04] twentyafterfour, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150331T1800). Please do the needful.
[18:00:44] Glaisher: but that's an internal file location
[18:01:35] the file?
[18:02:06] PROBLEM - HHVM busy threads on mw1194 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [115.2]
[18:02:35] PROBLEM - HHVM queue size on mw1194 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [80.0]
[18:03:12] yeah
[18:03:26] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group1 to VERSION
[18:03:36] Logged the message, Master
[18:04:05] !log group1 to VERSION=1.25wmf23
[18:04:11] Logged the message, Master
[18:05:29] isn't it retrieved as an external file from url rather than an internal file?
[18:06:03] !log sync_wikiversions failed for host mw2213.codfw.wmnet port 22: Connection timed out
[18:06:09] Logged the message, Master
[18:07:50] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group1 to $VERSION, mw2213.codfw.wmnet failed, trying one more time
[18:07:55] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[18:07:55] Logged the message, Master
[18:08:53] twentyafterfour, yeah I had that issue syncing to mw2213 earlier
[18:09:02] I couldn't actually ping the host
[18:09:07] !log mw2213.codfw.wmnet still timing out
[18:09:09] And it's in codfw
[18:09:12] Logged the message, Master
[18:09:42] So it's not actually serving traffic, I wouldn't worry too much
[18:09:48] ok
[18:10:58] still, something for ops to deal with in some way...
[18:11:59] re: 2213 https://phabricator.wikimedia.org/T93857
[18:22:40] (PS2) Legoktm: shinken: spam #wikimedia-releng with integration alerts [puppet] - https://gerrit.wikimedia.org/r/200770
[18:23:36] (CR) Yuvipanda: "Should get a +1 from someone else in releng, I guess. it's already spammed a fuckton." [puppet] - https://gerrit.wikimedia.org/r/200770 (owner: Legoktm)
[18:24:55] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[18:41:33] (CR) Jforrester: [C: 1] shinken: spam #wikimedia-releng with integration alerts [puppet] - https://gerrit.wikimedia.org/r/200770 (owner: Legoktm)
[18:52:48] Jeff_Green: hi, is there a reason dkim doesn't work for mail from meta, but does for mail from wikis ?
[18:53:26] matanya: I have an educated guess, but tell me what you mean re. doesn't work?
[18:54:06] copying the error in a sec
[18:54:12] k
[18:56:47] (CR) Krinkle: [C: 1] "Yes please" [puppet] - https://gerrit.wikimedia.org/r/200770 (owner: Legoktm)
[18:56:55] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[18:57:35] Jeff_Green: I have a log for commons off hand : dkim=neutral, reason = "invalid (public key: granularity mismatch)" header.d=wikimedia.org
[18:57:42] (CR) Yuvipanda: [C: 2] shinken: spam #wikimedia-releng with integration alerts [puppet] - https://gerrit.wikimedia.org/r/200770 (owner: Legoktm)
[18:58:03] matanya: PM?
[18:58:08] sure YuviPanda
[18:58:09] legoktm: done
[18:58:39] matanya: do you have a full message we can look at?
[18:58:55] it's hard for me to make sense of things without looking at email headers
[18:58:57] YuviPanda: ty :D
[18:59:15] Jeff_Green: you want me to forward you the email ?
[18:59:38] yespls, including the full headers
[19:00:48] jgreen@wikimedia.org it is ?
[19:00:56] yup
[19:01:22] shipped
[19:02:40] paravoid: Having connectivity problems to eqiad again... want an mtr?
[19:06:08] Jeff_Green: if you need anything else to look into this, let me know
[19:12:39] matanya: somehow the headers were eaten in the forward
[19:12:48] arg
[19:12:59] how can i make sure you get them ?
[19:13:07] what email client are you using?
[19:13:45] rainloop
[19:14:00] but i can access the shell and grab the orig
[19:14:22] ok, maybe tar it or something
[19:14:37] i have to run out for a bit, but I will take a look when I get back
[19:14:41] ok
[19:18:34] (PS1) Krinkle: shinken: Set up IRC notifications for cvn [puppet] - https://gerrit.wikimedia.org/r/200929
[19:19:41] (PS2) Krinkle: shinken: Set up IRC notifications for cvn [puppet] - https://gerrit.wikimedia.org/r/200929
[19:22:11] Jeff_Green: sent
[19:22:56] andrewbogott: Coren https://etherpad.wikimedia.org/p/Labs-Incident-Report-20015-03-31
[19:23:03] _joe_: ^ if you are around, since you helped
[19:25:32] YuviPanda: Minor wording tweak to make clear that the rsync finished.
[19:25:45] Coren: ah, I didn’t know that. cool
[19:25:55] Coren: is there a task for the labstore1002 switch?
[19:26:30] YuviPanda: No, we have yet to sit down and make a testing schedule; though pointing at that task may make sense.
[19:26:41] matanya: I have to run to a meeting, but I'll try to make sense of this when I get back
[19:27:07] IIRC we want to test switchover of all switchoverable subsystems.
[19:27:10] Coren: alright, can you make a task and put that on the etherpad?
[19:27:16] Yep.
[19:28:29] thanks
[19:29:39] (PS1) Chad: Remove a me-only hack from git repos in prod [mediawiki-config] - https://gerrit.wikimedia.org/r/200938
[19:32:45] YuviPanda: Created and added. As a side-effect, if we don't switch *back* that makes 1001 idle and ripe for reinstalling with Jessie.
[19:33:03] Ima create a ticket for that
[19:37:26] YuviPanda: Both added to the etherpad
[19:37:34] (PS4) Andrew Bogott: Make labs resolv.conf play nice with resolvconf [puppet] - https://gerrit.wikimedia.org/r/200595 (https://phabricator.wikimedia.org/T93691)
[19:42:49] twentyafterfour: the $VERSION in the log message implies more scripting going on? :)
[19:42:56] * greg-g just got back to laptop after appt
[19:43:25] greg-g: yeah and I had single quotes around my command so it didn't replace
[19:43:56] (CR) Andrew Bogott: [C: -1] "This doesn't work. Resolvconf is just too clever and is constantly messing with the settings I want. In this version of the patch it jus" [puppet] - https://gerrit.wikimedia.org/r/200595 (https://phabricator.wikimedia.org/T93691) (owner: Andrew Bogott)
[19:44:58] twentyafterfour: /me nods figured
[19:45:07] but cool :)
[19:46:18] greg-g: I'm working on the part to interface with gerrit/zuul
[19:46:36] auto-+2'ing/merging?
[19:47:58] greg-g: I might keep the +2 part manual but I figured out how to detect when the merge finishes and then pull on tin and run scap
[19:48:39] * greg-g nods
[19:48:51] that's still nice
[19:49:08] so basically, once you +2 it, it auto runs scap, which is kinda cool
[19:49:15] (right?)
[19:49:46] greg-g: right. I want to eliminate the manual typing or copy/pasting of commands
[19:50:01] :)
[19:50:11] greg-g: to avoid making more 1.20wmf25 branches ;)
[19:51:08] yeah, oops ;)
[19:55:45] Coren: andrewbogott am going for breakfast / lunch now, do take a look at the etherpads (I added more actionables)?
[19:57:37] hmm no graphs are loading for me in ganglia. known issue?
[19:59:54] jgage: I see the same thing now.
[20:00:44] * YuviPanda shall brb in 1h
[20:00:48] same result regardless of date range selected
[20:01:22] jgage: if I open an image in a new tab, I occasionally get a graph when I refresh but it's intermittent
[20:01:35] hmmm
[20:02:27] uranium's root partition is 100% full :(
[20:03:22] Ah, no separate /var. Logs musta filled it
[20:03:25] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[20:03:38] syslog is crazy big.
[20:03:57] Mar 31 19:40:07 uranium /usr/sbin/gmetad[8414]: RRD_update (/srv/ganglia/rrds/Text caches ulsfo/cp4010.ulsfo.wmnet/frontend.n_lru_nuked.rrd): rrdcached: illegal attempt to update using time 1427830801.000000 when last update time is 1427830801.000000 (minimum one second step)
[20:04:17] Box off ntp?
[20:05:08] ntpd is running and the current time is correct
[20:05:30] Not necessarily uranium; some box might be sending data with a bad timestamp.
[20:05:58] ah yeah that makes ganglia sad iirc
[20:06:05] I flushed the syslog; it grows at several Mb/min so that's only a bandaid
[20:06:35] RECOVERY - Disk space on uranium is OK: DISK OK
[20:06:37] It's all cp4018
[20:06:58] git add modules/cassandra/manifests/init.pp
[20:06:59] fatal: Path 'modules/cassandra/manifests/init.pp' is in submodule 'modules/cassandra'
[20:07:08] ^and this is why i can't just contribute
[20:07:36] cp4018 does seem to be synced
[20:07:45] But it's seriously flooding uranium.
[20:08:48] Ah - it seems to have quieted down after 11Mb of complaints.
[20:08:55] PROBLEM - puppet last run on mc2010 is CRITICAL: CRITICAL: puppet fail
[20:09:17] (PS1) Chad: Checkout proper branch when using master [mediawiki-config] - https://gerrit.wikimedia.org/r/200996
[20:12:42] (PS1) Andrew Bogott: Disable automatic updating of resolv.conf [puppet] - https://gerrit.wikimedia.org/r/200999 (https://phabricator.wikimedia.org/T93691)
[20:12:57] (PS1) Dzahn: cassandra: disable Thrift RPC interface [puppet] - https://gerrit.wikimedia.org/r/201000 (https://bugzilla.wikimedia.org/94330)
[20:13:23] (CR) Andrew Bogott: "This seems drastic, but it does get puppet back in control which is what I think we want." [puppet] - https://gerrit.wikimedia.org/r/200999 (https://phabricator.wikimedia.org/T93691) (owner: Andrew Bogott)
[20:15:58] (CR) Tim Landscheidt: "Looks like https://github.com/rodjek/puppet-lint/issues/307 would cover this case, so no new RFE necessary." [puppet] - https://gerrit.wikimedia.org/r/198116 (https://phabricator.wikimedia.org/T87132) (owner: Tim Landscheidt)
[20:18:36] hm something is weird with text-ulsfo graphs. about 12 hours ago the number of cpus in the cluster spiked and started fluctuating. net traffic for the cluster shows a similar increase, but graphs for the individual nodes don't correspond. https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&c=Text+caches+ulsfo&h=&tab=m&vn=&hide-hf=false&m=network_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name
[20:18:52] papaul_: I'd like to troubleshoot mc2017 boot issues with you if you are available?
[20:20:54] (PS1) Smalyshev: T93658: create script for TTL dumps [puppet] - https://gerrit.wikimedia.org/r/201003
[20:23:03] (CR) Tim Landscheidt: Disable automatic updating of resolv.conf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/200999 (https://phabricator.wikimedia.org/T93691) (owner: Andrew Bogott)
[20:23:55] RECOVERY - puppet last run on mc2010 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[20:27:12] (PS1) John F. Lewis: shinken: add wmt project to monitoring [puppet] - https://gerrit.wikimedia.org/r/201005
[20:27:25] (PS2) John F. Lewis: shinken: add wmt project to monitoring [puppet] - https://gerrit.wikimedia.org/r/201005
[20:28:21] RobH: just about to leave
[20:28:33] let's plan to work on it tomorrow my am
[20:28:34] =]
[20:28:34] RobH: will you do tomorrow morning
[20:28:36] k
[20:28:42] have a nice evening
[20:28:46] thanks
[20:28:58] i will ping you tomorrow morning
[20:29:41] (CR) Andrew Bogott: Disable automatic updating of resolv.conf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/200999 (https://phabricator.wikimedia.org/T93691) (owner: Andrew Bogott)
[20:29:47] robh: where are we with the new logstash hosts? Waiting for m.ark to approve still?
[20:30:06] i thought the last update still left you and mark in discussion
[20:30:07] (PS2) Smalyshev: T93658: create script for TTL dumps [puppet] - https://gerrit.wikimedia.org/r/201003
[20:30:08] lemme pull it up
[20:30:11] matanya: still around?
[20:30:43] bd808: So yea, basically my understanding is he is asking why no raid1, and you state what you state
[20:30:50] so i dunno
[20:31:33] I'll ask him about it tomorrow when we are both about
[20:31:40] you may want to chat with him if you find a moment
[20:31:43] just to ensure you two are on the same page
[20:31:48] *nod*
[20:33:11] I crazily thought in last January that getting $18K of new hardware would be quick and painless :/
[20:33:19] *late January
[20:35:52] (CR) Eevans: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/201000 (https://bugzilla.wikimedia.org/94330) (owner: Dzahn)
[20:39:11] (PS1) Gilles: Initial venv [software/sentry] - https://gerrit.wikimedia.org/r/201006 (https://phabricator.wikimedia.org/T84956)
[20:47:58] (PS2) Gilles: Basic role for Sentry [puppet] - https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956)
[20:51:34] urandom: what was the time-series database that you had developed, again?
[20:52:26] anything i can do about the puppet compiler still being down?
[20:52:38] still says puppet-compiler02.eqiad.wmflabs is offline
[20:53:06] because it changed the IP yesterday?
[20:53:12] ori: http://newts.io
[20:54:09] urandom: oh, cool
[20:54:11] thanks
[20:54:33] neat. is there a doc that compares it to other tsdb?
[20:54:53] of course not, that would be flame bait :)
[20:54:56] haha
[20:55:38] 20:53 < The_Photo> something wrong with Wikipedia servers?
[20:55:38] 20:53 < The_Photo> wp spanish is slowest here in Brazil
[20:56:29] (from -tech, btw)
[20:56:40] jgage: the biggest difference is in the data model, it assumes that the common case is the one where you query for a group of metrics
[20:57:01] mutante: not sure, come on over to -releng to ask (that's a slave of jenkins)
[20:57:10] like bytes in/out, 1, 5, and 15 load average
[20:57:38] and it optimizes storage access along those lines, as well
[20:58:17] greg-g: thanks, will do!
[20:58:26] urandom: cool :)
[20:59:14] (CR) Hoo man: [C: -1] "-1 to make sure this is not getting merged before the folder name change has been decided on and implemented." [puppet] - https://gerrit.wikimedia.org/r/201003 (owner: Smalyshev)
[21:00:37] hoo: is the patch itself ok? it's the first one I do for puppet so I'm not sure if it's fine
[21:00:56] (CR) Smalyshev: "Is there any progress in that area? We'd like to get some dumps soon-ish." [puppet] - https://gerrit.wikimedia.org/r/201003 (owner: Smalyshev)
[21:01:25] SMalyshev: Haven't really checked... I will poke at this more tomorrow, hopefully
[21:01:36] ok, thanks
[21:02:34] (CR) Manybubbles: [C: 1] "Fire when ready!" [puppet] - https://gerrit.wikimedia.org/r/201003 (owner: Smalyshev)
[21:03:54] (CR) Yuvipanda: [C: 1] "lgtm, I'll merge once IRC channel questions are answered" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/201005 (owner: John F. Lewis)
[21:04:19] (CR) Yuvipanda: [C: 2] shinken: Set up IRC notifications for cvn [puppet] - https://gerrit.wikimedia.org/r/200929 (owner: Krinkle)
[21:10:27] (CR) John F. Lewis: shinken: add wmt project to monitoring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/201005 (owner: John F. Lewis)
[21:11:57] (CR) Dzahn: [C: 1] "@Yuvipanda: confirmed just now i could join it without being invited. go for it" [puppet] - https://gerrit.wikimedia.org/r/201005 (owner: John F. Lewis)
[21:12:18] mutante: turns out that the grrrit-wm issue was that that config said #wmt instead of ##wmt
[21:12:55] YuviPanda: ah:)
[21:13:23] JohnFLewis: lgtm, but needs manual rebase :(
[21:13:43] YuviPanda: my enemy :(
[21:27:03] JohnFLewis: so to rebase, I go to the gerrit patch page, copy the ‘check out’ command, and check it out
[21:27:13] JohnFLewis: and then I do ‘git fetch’ and ‘git rebase origin/production’
[21:27:15] and resolve them
[21:27:17] and then I push again
[21:27:25] and then I can do ‘git checkout production’ to go back to my local copy
[21:27:55] YuviPanda: right, I swear this part of git just hates me, constantly :/
[21:29:37] (PS3) Yuvipanda: shinken: add wmt project to monitoring [puppet] - https://gerrit.wikimedia.org/r/201005 (owner: John F. Lewis)
[21:33:05] (CR) Yuvipanda: [C: 2] shinken: add wmt project to monitoring [puppet] - https://gerrit.wikimedia.org/r/201005 (owner: John F. Lewis)
[21:38:11] (PS1) Yuvipanda: shinken: Specify service notification command for wmt IRC [puppet] - https://gerrit.wikimedia.org/r/201041
[21:38:42] JohnFLewis: ^ this was needed as well :)
[21:39:10] copy and paste fail :)
[21:39:57] (CR) Yuvipanda: [C: 2 V: 2] shinken: Specify service notification command for wmt IRC [puppet] - https://gerrit.wikimedia.org/r/201041 (owner: Yuvipanda)
[21:40:17] JohnFLewis: :) can you document it on wikitech please?
[21:41:24] I'll create a basic list in my namespace and shove it into a wikitech page tomorrow because I need sleep
[21:43:28] JohnFLewis: <3
[22:00:51] (CR) Gergő Tisza: "Should I add config files etc. here or make that a separate patch?" [puppet] - https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: Gilles)
[22:03:53] PROBLEM - puppet last run on mw2133 is CRITICAL: CRITICAL: puppet fail
[22:06:26] (CR) Yuvipanda: ganglia: DRY, use hiera (1 comment) [puppet] - https://gerrit.wikimedia.org/r/198566 (https://phabricator.wikimedia.org/T93776) (owner: Giuseppe Lavagetto)
[22:15:44] (PS1) Ori.livneh: Set $wgLogoHD for enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/201050
[22:20:33] RECOVERY - puppet last run on mw2133 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[22:23:58] (CR) Isarra: [C: 1] Set $wgLogoHD for enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/201050 (owner: Ori.livneh)
[22:58:52] gwicke: https://gerrit.wikimedia.org/r/#/c/201000
[22:59:21] (CR) GWicke: [C: 1] cassandra: disable Thrift RPC interface [puppet] - https://gerrit.wikimedia.org/r/201000 (https://bugzilla.wikimedia.org/94330) (owner: Dzahn)
[22:59:39] mutante: one less thing we need to firewall ;)
[23:00:04] RoanKattouw, ^d, Krenair, tgr: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150331T2300). Please do the needful.
[23:00:35] <^demon|away> RoanKattouw: you taking it since you've got one on the list?
[23:03:59] gwicke: ok:)
[23:04:06] (CR) Dzahn: [C: 2] cassandra: disable Thrift RPC interface [puppet] - https://gerrit.wikimedia.org/r/201000 (https://bugzilla.wikimedia.org/94330) (owner: Dzahn)
[23:07:57] gwicke: the config changed but it did not make puppet restart the service, should it?
[23:08:25] mutante: no
[23:08:35] puppet has no concept of a coordinated restart
[23:08:50] so we have to make sure that we do that manually
[23:08:55] gwicke: how do you coordinate one?
[23:09:26] with coordinated being 'wait until a node is back up & is processing requests' before proceeding
[23:09:40] I'd like to automate that
[23:09:44] gwicke: you can do that in upstart
[23:09:54] upstart?
[23:09:57] ^demon|away: Sure, I'll take it
[23:10:01] are the machines using jessie or trusty?
[23:10:07] jessie
[23:10:17] oh. well, i'm sure you can do it in systemd too, but i don't know how :P
[23:10:38] I was leaning towards trying the recipe in http://docs.ansible.com/guide_rolling_upgrade.html
[23:11:12] main issue is that we don't have a deploy host that has jessie or a reasonably recent ansible
[23:13:25] urandom: do you have the permissions for restart?
[23:13:37] mutante: yup!
[23:14:26] urandom: so i merged the config change to disable RPC and watched it on restbase1001 but it's still listening
[23:14:52] gwicke: it seems like Wants / After config directives in systemd do what you want, no?
[23:15:06] and we went meta again
[23:15:13] mutante: but it hasn't been restarted?
[23:15:19] urandom: yea
[23:15:45] !log restarting cassandra on restbase1001
[23:15:53] Logged the message, Master
[23:16:04] hmm
[23:16:17] ah, it worked now:)
[23:16:44] mutante: was that a 'yea' it hasn't been restarted, or a 'yea' it had?
[23:16:59] it was a "yea" the listening port is gone from netstat :)
[23:17:26] yes, it had not been restarted because:
[23:17:41] 16:08 < mutante> gwicke: the config changed but it did not make puppet restart the service, should it?
[23:17:44] 16:08 < gwicke> mutante: no
[23:18:04] yeah, I don't think that would be a good idea
[23:18:33] to restart as part of a cassandra.yaml change, I mean
[23:18:40] automatically
[23:20:51] (PS1) Kaldari: Enable Gather on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/201065
[23:21:45] yea, so we just need one on all nodes though
[23:21:52] (CR) Kaldari: [C: -2] "Not until change Ic82604e is deployed and tested." [mediawiki-config] - https://gerrit.wikimedia.org/r/201065 (owner: Kaldari)
[23:26:40] mutante: has that been applied to all 6 now?
[23:42:15] RoanKattouw: I added a config change to the SWAT window, but it should only be done after Jon’s changes are deployed and tested. I can handle doing that config change myself, but wanted to list it in the window anyway.
[23:42:30] OK
[23:42:36] I'm not done yet, still waiting on Jenkins
[23:43:24] RoanKattouw: Just ping me or Jon once the sync for wmf22 is done.
[23:43:37] OK
[23:43:53] PROBLEM - puppet last run on amssq61 is CRITICAL: CRITICAL: puppet fail
[23:47:42] chasemp: around?
[23:51:08] Barely, what's up?
[23:51:41] chasemp: ah, wanted to catch up on how the ‘metrics for prod’ is going, since I’m starting doing that for tools...
[23:51:48] but if you’re only barely around I’ll catch up tomorrow
[23:52:08] Tomorrow then :)
[23:53:01] chasemp: :)
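
Editor's note on the coordinated restart discussed at 23:08-23:18: the idea of restarting one node, waiting "until a node is back up & is processing requests", and only then moving on can be sketched as below. This is a minimal illustration, not what was run in the channel and not how puppet, systemd, or ansible implement it. The `restart` and `is_healthy` callbacks are hypothetical placeholders that the caller would have to supply (for example an ssh-based service restart and a port or query check).

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    hosts: Iterable[str],
    restart: Callable[[str], None],      # hypothetical: restarts the service on one host
    is_healthy: Callable[[str], bool],   # hypothetical: True once the host serves requests
    timeout: float = 300.0,
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Restart hosts one at a time, waiting for each to become healthy.

    Returns the hosts that were restarted, in order. Raises TimeoutError as
    soon as one host fails to become healthy in time, so the remaining hosts
    are left untouched.
    """
    done: List[str] = []
    for host in hosts:
        restart(host)
        deadline = clock() + timeout
        # Poll until the node reports healthy or the deadline passes.
        while not is_healthy(host):
            if clock() >= deadline:
                raise TimeoutError(f"{host} not healthy after {timeout}s; stopping")
            sleep(poll_interval)
        done.append(host)
    return done
```

Stopping at the first unhealthy node is a deliberate choice in this sketch: it keeps at most one node down at a time, which is the property the discussion was after. Injecting `sleep` and `clock` only serves to make the loop testable without real waiting.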