[00:00:22] I won't merge anything right this second but I would look at it tomorrow morning and test it absolutely [00:00:30] (03CR) 10Dzahn: [C: 031] "https://wikitech.wikimedia.org/wiki/Linne" [puppet] - 10https://gerrit.wikimedia.org/r/162171 (owner: 10Dzahn) [00:00:48] apergos: k [00:00:48] (it's 3 am for me, really a ridiculous time to be reviewing anything but I was in the salt internals anyways...) [00:01:12] I have a candidate host to tet it on even... do not anyone sign salt keys on db1050, kthxbye [00:01:23] *test it on [00:01:33] bd808: what didn't work with: git -c "core.sharedRepository=group" clone [...] [00:01:51] bd808: was it simply that it wouldn't work for existing clones? [00:02:26] I... can't remember. I wonder if we documented on the original patch set. [00:03:04] I do remember that you and I spent quite a bit of time coming up with something that fixed the scap clone [00:04:39] bd808: I think the issue was that we didn't realize the dir would need to be shared right away [00:04:48] so we were stuck needing to fix clones on all the app servers [00:05:00] yeah. [00:05:25] (03PS1) 10Dzahn: NTP client config - use rubidium/mexia as servers [puppet] - 10https://gerrit.wikimedia.org/r/162175 [00:05:26] That whole block I'm touching is related to converting an existing clone that was made without sahred [00:05:37] *shared [00:06:00] (03CR) 10Dzahn: "also see: Change-Id: Ie5c4a8028bd1" [puppet] - 10https://gerrit.wikimedia.org/r/162171 (owner: 10Dzahn) [00:06:15] bd808: it doesn't seem worth keeping, IMO. i'd get rid of it. you need to sudo to edit files on /etc; why not vagrant? [00:06:32] at minimum, i'd not attempt to fix repos that were not cloned in shared mode [00:07:13] (03CR) 10Dzahn: [C: 031] "ntpd is running on both hosts, this was the idea, right?" [puppet] - 10https://gerrit.wikimedia.org/r/162175 (owner: 10Dzahn) [00:07:19] ori: I think that would work in this use case too. 
I can test a patch that just rips that whole if block out [00:07:32] bd808: +1 to that [00:07:49] The biggest problem I have right now is the "o=" permissions part [00:08:05] which is blatantly wrong in the general case [00:08:13] but was right for scap [00:08:20] or at least ok [00:08:36] (03CR) 10Dzahn: "but are we sure? because also see Change-Id: I398ba453a8f7" [puppet] - 10https://gerrit.wikimedia.org/r/162175 (owner: 10Dzahn) [00:09:49] (03CR) 10BryanDavis: [C: 04-1] "Ori would like me to test a version of this that just removes the whole "fix an existing clone" block." [puppet] - 10https://gerrit.wikimedia.org/r/162160 (https://bugzilla.wikimedia.org/70959) (owner: 10BryanDavis) [00:12:12] bd808: git -c "core.sharedRepository=group" clone [..] did the right thing for me [00:14:29] ori: Does group set the sticky bit? That seems to be what the comments in https://gerrit.wikimedia.org/r/#/c/118745/2/modules/git/manifests/clone.pp,unified were worried about. [00:15:51] bd808: users can sudo to git pull [00:16:16] bd808: i think it makes sense that puppet-managed resources require sudo to manipulate. mediawiki-vagrant is a bit of an exception because it's a single-user environment. [00:16:48] ori: True, but I think spagewmf has filed bugs against labs_vagrant before for needing sudo. [00:16:53] bd808: put it this way, if this code wasn't there -- i.e., if there was no shared => parameter, and the only use-case for one was the convenience factor for labs_vagrant users, would you be in favor of adding it? [00:16:59] I habitually use sudo [00:17:08] WONTFIX is an option [00:17:24] let's just kill it [00:18:02] ori: gawd no. I just picked the shared clone as an alternative to the horrible reality of recursive file management [00:18:34] which is all pointing back to wondering why we care that sudo needed to be used or not [00:19:37] spagewmf: You are my labs_vagrant power user. 
Would needing to sudo to update things in /vagrant make you more horribly sad than the puppet code needed to make it possible makes Ori? [00:20:23] ori if the code checks out tomorrow I would push it live (you would be gone), do you mind? [00:20:32] (after this answer I'm going to bed) [00:20:46] (03PS2) 10Dzahn: Remove tridge [puppet] - 10https://gerrit.wikimedia.org/r/161948 (owner: 10Alexandros Kosiaris) [00:23:10] (03PS2) 10Dzahn: NTP client config - use rubidium/eeden as servers [puppet] - 10https://gerrit.wikimedia.org/r/162175 [00:23:36] (03CR) 10Dzahn: [C: 032] Remove tridge [puppet] - 10https://gerrit.wikimedia.org/r/161948 (owner: 10Alexandros Kosiaris) [00:26:24] bd808: " tab completion is part of the api for something like this" [00:26:25] Amem. [00:26:27] Amen. [00:26:36] !log tridge - revoking puppet cert, deleting salt key, decom ... [00:26:42] Logged the message, Master [00:28:24] guess I'll read the answer later... gone [00:29:22] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: Puppet has 1 failures [00:30:44] (03PS1) 10Dzahn: delete amanda config files [puppet] - 10https://gerrit.wikimedia.org/r/162182 [00:32:03] (03CR) 10Dzahn: "revoked puppet cert, deleted salt key, stopped puppet agent and salt minion, scheduled icinga downtime..." [puppet] - 10https://gerrit.wikimedia.org/r/161948 (owner: 10Alexandros Kosiaris) [00:32:52] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [00:33:37] apergos: i think i like the extra config file too, so i'll just suggest you discuss it with godog if you have the chance. i'm cool with whatever you guys decide on. 
[00:33:49] apergos: and yeah, feel free to merge, regardless of whether or not i'm around [00:34:04] (03PS1) 10Dzahn: remove ferm rule for amanda/tridge [puppet] - 10https://gerrit.wikimedia.org/r/162184 [00:40:40] (03PS1) 10Dzahn: remove tridge from dsh and dhcp [puppet] - 10https://gerrit.wikimedia.org/r/162186 [00:41:34] (03CR) 10Dzahn: [C: 032] remove tridge from dsh and dhcp [puppet] - 10https://gerrit.wikimedia.org/r/162186 (owner: 10Dzahn) [00:42:08] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [00:43:17] (03PS1) 10Dzahn: remove cron that rsynced nfs home [puppet] - 10https://gerrit.wikimedia.org/r/162189 [00:45:42] (03CR) 10Dzahn: "this has been merged but still hasn't been applied it seems, and legal is ping'ing. need Ariel?" [puppet] - 10https://gerrit.wikimedia.org/r/134121 (owner: 10Filippo Giunchedi) [00:46:39] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [00:48:12] (03CR) 10Dzahn: "ottomata: looks right, right?" [puppet] - 10https://gerrit.wikimedia.org/r/162166 (owner: 10Dzahn) [00:51:26] (03Abandoned) 10Dzahn: [WIP] MySQL scripts to generate Kanban chart for RT [puppet] - 10https://gerrit.wikimedia.org/r/111152 (owner: 10Diederik) [00:52:44] (03CR) 10Dzahn: "yea, my bad, i mixed up snmptraps and snmpwalks. thanks Alex." [puppet] - 10https://gerrit.wikimedia.org/r/127246 (owner: 10ArielGlenn) [00:53:06] (03CR) 10Dzahn: "Ariel, abandoning it because we got rid of snmptraps" [puppet] - 10https://gerrit.wikimedia.org/r/127246 (owner: 10ArielGlenn) [00:53:10] (03Abandoned) 10Dzahn: move the puppet snmptrap into a class so it can be run in last stage [puppet] - 10https://gerrit.wikimedia.org/r/127246 (owner: 10ArielGlenn) [00:54:37] (03CR) 10Dzahn: "yea, how do we handle hardware repairs etc? will deployers be ok if a few servers timeout during this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/160953 (owner: 10Alexandros Kosiaris) [00:57:58] (03CR) 10Dzahn: [C: 031] ""RSpec as well as unit tests are included"<- this is probably rare among modules and _really_ nice!" [puppet] - 10https://gerrit.wikimedia.org/r/159738 (owner: 10Alexandros Kosiaris) [01:02:49] (03CR) 10Dzahn: [C: 031] "+1 because _joe_ said he doesn't want to block it" [puppet] - 10https://gerrit.wikimedia.org/r/159729 (https://bugzilla.wikimedia.org/38516) (owner: 10Chmarkine) [01:06:15] (03PS7) 10Dzahn: icinga plugin script for HDFS webrequests [puppet] - 10https://gerrit.wikimedia.org/r/151095 (owner: 10Ottomata) [01:06:37] (03CR) 10Dzahn: [C: 032] icinga plugin script for HDFS webrequests [puppet] - 10https://gerrit.wikimedia.org/r/151095 (owner: 10Ottomata) [01:11:18] (03PS1) 10Dzahn: icinga - delete unused check_hdfs_path [puppet] - 10https://gerrit.wikimedia.org/r/162191 [01:12:26] (03CR) 10Dzahn: [C: 032] "you might never read this, because gerrit, but there you go, added and deleted" [puppet] - 10https://gerrit.wikimedia.org/r/162191 (owner: 10Dzahn) [01:14:14] (03CR) 10Dzahn: "deleted in Change-Id: I99e5584cb562bdcb36" [puppet] - 10https://gerrit.wikimedia.org/r/151095 (owner: 10Ottomata) [01:16:16] (03CR) 10Ori.livneh: "Ariel: as mentioned on IRC, this is what I did in PS1. I actually agree with Filippo that having an additional config file is cleaner, but" [puppet] - 10https://gerrit.wikimedia.org/r/161332 (owner: 10Ori.livneh) [01:16:29] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [01:16:41] (03CR) 10Dzahn: "everybody: come on, this is from 2012 ?!" [apache-config] - 10https://gerrit.wikimedia.org/r/24407 (owner: 10Jeremyb) [01:16:49] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [01:17:18] hah, what's that? 
:) [01:17:57] jeremyb: hey, yea, what about it :) [01:18:09] mutante: not really online now [01:19:58] (03CR) 10Dzahn: "- wrong repo" [apache-config] - 10https://gerrit.wikimedia.org/r/24407 (owner: 10Jeremyb) [01:21:07] (03CR) 10Dzahn: "i think this probably never happened: "Please make a collective decision somewhere (e.g. on the mailing list thread) and let me know what " [apache-config] - 10https://gerrit.wikimedia.org/r/24407 (owner: 10Jeremyb) [01:21:43] jeremyb: it's alright, the collective needs to make a decision [01:21:51] (but still wrong repo) [01:22:19] we have a collective? :) [01:22:31] you asked for it:) [01:23:01] maybe I did! [01:23:09] 2 years between upload and update date.. oh man [01:23:29] maybe we should stall it until we can review in phab [01:43:03] (03PS1) 10Dzahn: create shell account for nettrom [puppet] - 10https://gerrit.wikimedia.org/r/162192 [01:47:59] (03PS1) 10Dzahn: add nettrom to various statistic/analytics groups [puppet] - 10https://gerrit.wikimedia.org/r/162193 [01:52:44] (03CR) 10Dzahn: "how come "analytics-users" is an empty group? does it really make sense he would be alone in there?" [puppet] - 10https://gerrit.wikimedia.org/r/162193 (owner: 10Dzahn) [01:54:09] (03CR) 10Dzahn: [C: 031] "analytics-ops: please confirm" [puppet] - 10https://gerrit.wikimedia.org/r/162193 (owner: 10Dzahn) [01:55:14] (03PS3) 10BryanDavis: Remove git::clone support for converting clones to shared [puppet] - 10https://gerrit.wikimedia.org/r/162160 [02:02:06] (03CR) 10BryanDavis: "Tested via cherry-pick on ieg-dev.eqiad.wmflabs. 
Initial git clone of /srv/vagrant comes out with desired 2775 permissions and no file per" [puppet] - 10https://gerrit.wikimedia.org/r/162160 (owner: 10BryanDavis) [02:06:39] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3613 MB (3% inode=99%): [02:17:31] !log LocalisationUpdate completed (1.24wmf20) at 2014-09-23 02:17:31+00:00 [02:17:44] Logged the message, Master [02:19:13] (03PS4) 10Yurik: Remove git::clone support for converting clones to shared [puppet] - 10https://gerrit.wikimedia.org/r/162160 (owner: 10BryanDavis) [02:30:38] !log LocalisationUpdate completed (1.24wmf21) at 2014-09-23 02:30:38+00:00 [02:30:44] Logged the message, Master [02:43:48] !log LocalisationUpdate completed (1.24wmf22) at 2014-09-23 02:43:48+00:00 [02:43:55] Logged the message, Master [02:55:58] can someone get me the stack trace of this from mediawikiwiki? [acd305b3] 2014-09-23 02:55:34: Fatal exception of type MWException [02:56:10] Looking [02:57:23] 2014-09-23 02:55:34 mw1020 mediawikiwiki: [acd305b3] /w/index.php?title=Project:Sandbox&action=edit&oldid=1176667 Exception from line 594 of /srv/mediawiki/php-1.24wmf22/includes/content/ContentHandler.php: Format text/x-wiki is not supported for content model css [02:57:26] jackmcbarn: ---^^ [02:57:46] RoanKattouw: can i have the full stack trace? 
[02:58:24] Yeah I'll PM it to you [02:58:50] Not because it's secret but because it's 10+ lines and I'm too lazy to pastebin at this hour [02:58:51] I should eat or something [03:00:19] RECOVERY - Disk space on virt0 is OK: DISK OK [03:01:13] RoanKattouw: thanks [03:17:47] (03PS7) 10Krinkle: contint: Ensure nodejs-legacy is installed [puppet] - 10https://gerrit.wikimedia.org/r/159226 [03:17:58] (03PS2) 10Krinkle: contint: Package 'php5-parsekit' is absent on Trusty, don't require it [puppet] - 10https://gerrit.wikimedia.org/r/161748 [03:20:42] (03CR) 10Tim Starling: [C: 032] Fix profiling error CommonSettings.php-skin-include1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161911 (owner: 10Tim Starling) [03:21:23] (03Merged) 10jenkins-bot: Fix profiling error CommonSettings.php-skin-include1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161911 (owner: 10Tim Starling) [03:29:07] !log tstarling Synchronized wmf-config/CommonSettings.php: fix profiling (duration: 00m 07s) [03:29:13] Logged the message, Master [03:33:39] mutante (cc bblack): ping since it's your RT week: can't triage maint-announce very well if I don't know which links are important. can we get some kind of document that indicates which providers should trigger e.g. a geodns failover before window starts and which are no big deal? [03:34:15] also, i don't even know which are peering vs. transit. tampa vs. eqiad [03:34:17] etc. [03:35:51] i guess some of this is overlap with either racktables/servermon or officewiki [03:35:53] ? [03:37:07] e.g. i *think* fpl is a tampa thing [03:37:58] so weird that they sometimes put dates and times in subjects and sometimes don't [03:42:29] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Sep 23 03:42:29 UTC 2014 (duration 42m 28s) [03:42:36] Logged the message, Master [03:47:48] (03CR) 10Krinkle: "Follows-up 22afa99e5d786eb0eb195f. 
Nice :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161911 (owner: 10Tim Starling) [03:48:09] (03CR) 10Krinkle: Include CologneBlue and Modern if they exist as proper skins (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/135927 (owner: 10Chad) [03:51:10] andrewbogott_afk: Coren: adding shell rights to a user takes 30+ secs POST??? [03:51:17] seems excessive [03:52:34] jeremyb: It is, but it's a consequence of the rest it has to do (though I rarely saw 30s, more around 20s or so). It has to create a keystone auth thingy, add to the bastion project (which does a bazillion ldap things) and a couple other things. It can probably stand a rewrite by now because there's too much stuff piled on in suboptimal ways. [03:53:02] 30secs was reproducible AFAICT [03:53:25] btw, who can add to tools? [03:53:51] (the project not an individual tool) [03:54:58] i think you meant "bajillion"? :) [03:56:07] Coren: oh, btw, please add to https://wikitech.wikimedia.org/wiki/MediaWiki:Sidebar [[Release Engineering/SAL]] [03:56:45] jeremyb: Plz to file a bz for this? I was heading to bed. :-) [03:56:55] really?? :) [03:57:06] It's midnight here. :-P [03:57:14] you know i live in your TZ :) [03:57:40] Heh. I have a hard enough time remembering names, don't ask me to also manage to keep time zones straight. :-) [03:57:41] i may come visit you in a month or 3 btw. 
just had someone here from your city and i need to reciprocate [03:58:09] nacht [04:10:30] !log tstarling Synchronized php-1.24wmf21/languages/Language.php: profiling (duration: 00m 05s) [04:10:37] Logged the message, Master [04:27:57] whoa, i totally thought i was talking to Coren in #-labs before :( [04:49:38] !log tstarling Synchronized php-1.24wmf21/languages/Language.php: profiling (duration: 00m 05s) [05:12:08] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 1 failures [05:28:18] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:27:28] PROBLEM - puppet last run on search1007 is CRITICAL: CRITICAL: Epic puppet fail [06:28:09] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:09] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:18] PROBLEM - puppet last run on analytics1010 is CRITICAL: CRITICAL: Epic puppet fail [06:28:18] PROBLEM - puppet last run on mw1114 is CRITICAL: CRITICAL: Epic puppet fail [06:29:28] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:29] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:08] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:08] PROBLEM - puppet last run on mw1100 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:28] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:29] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:29] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:58] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:58] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:58] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 
1 failures [06:31:09] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:18] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:18] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:19] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:28] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:35] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:38] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:39] PROBLEM - puppet last run on mw1213 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:00] PROBLEM - puppet last run on amslvs1 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:08] PROBLEM - puppet last run on pc1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:09] PROBLEM - puppet last run on db1026 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:09] PROBLEM - puppet last run on virt1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:18] PROBLEM - puppet last run on wtp1005 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:28] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:36] PROBLEM - puppet last run on virt1004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:36] PROBLEM - puppet last run on db1039 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:48] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:48] PROBLEM - puppet last run on mc1014 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:49] PROBLEM - puppet last run on db1043 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:49] PROBLEM - puppet last run on db2037 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:49] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:49] PROBLEM - puppet last run on 
lvs3004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:09] PROBLEM - puppet last run on mw1014 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:18] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:20] <_joe_> mmmh more than usual [06:35:20] PROBLEM - puppet last run on mw1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:31] PROBLEM - puppet last run on mw1195 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:31] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:31] PROBLEM - puppet last run on amssq36 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:50] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:57] Sockpuppet? [06:38:12] <_joe_> mod_passenger and logrotation [06:40:24] (03Abandoned) 10Jeremyb: raise account creation throttle: Puget Sound [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161191 (https://bugzilla.wikimedia.org/70953) (owner: 10Jeremyb) [06:45:28] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:45:28] RECOVERY - puppet last run on mw1100 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:45:38] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:45:38] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 61 seconds ago with 0 failures [06:45:48] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:45:49] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:45:49] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:46:18] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last 
run 16 seconds ago with 0 failures [06:46:18] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [06:46:18] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:46:19] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:46:28] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [06:46:29] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:46:29] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:46:30] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:46:48] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:46:48] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:46:49] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:46:49] RECOVERY - puppet last run on mw1213 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:47:19] RECOVERY - puppet last run on pc1002 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:47:23] RECOVERY - puppet last run on amslvs1 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:47:30] RECOVERY - puppet last run on mw1002 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:47:38] RECOVERY - puppet last run on analytics1010 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:47:38] RECOVERY - puppet 
last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [06:47:38] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:47:38] RECOVERY - puppet last run on mw1195 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [06:47:39] RECOVERY - puppet last run on search1007 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:47:48] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:47:48] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:47:49] RECOVERY - puppet last run on db1043 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:47:59] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:48:20] RECOVERY - puppet last run on virt1001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:48:29] RECOVERY - puppet last run on wtp1005 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:48:38] RECOVERY - puppet last run on virt1004 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:48:38] RECOVERY - puppet last run on db1039 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:48:39] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:48:48] RECOVERY - puppet last run on mc1014 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:49:00] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:49:04] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last 
run 39 seconds ago with 0 failures [06:49:19] RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:49:19] RECOVERY - puppet last run on db1026 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:49:28] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:49:49] RECOVERY - puppet last run on amssq36 is OK: OK: Puppet is currently enabled, last run 63 seconds ago with 0 failures [06:49:59] RECOVERY - puppet last run on db2037 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [07:38:57] (03PS1) 10Giuseppe Lavagetto: compare-puppet-catalogs: use hiera [software] - 10https://gerrit.wikimedia.org/r/162208 [07:39:17] (03CR) 10Giuseppe Lavagetto: [C: 032] compare-puppet-catalogs: use hiera [software] - 10https://gerrit.wikimedia.org/r/162208 (owner: 10Giuseppe Lavagetto) [08:11:01] (03CR) 10Filippo Giunchedi: [C: 031] NTP client config - use rubidium/eeden as servers [puppet] - 10https://gerrit.wikimedia.org/r/162175 (owner: 10Dzahn) [08:12:04] (03CR) 10Filippo Giunchedi: [C: 031] "good to be merged this week I think unless no -1s" [puppet] - 10https://gerrit.wikimedia.org/r/159692 (owner: 10Ori.livneh) [08:21:19] (03CR) 10Filippo Giunchedi: "no strong opinions on either approach, why a separate config file wouldn't be preferred in this case though?" [puppet] - 10https://gerrit.wikimedia.org/r/161332 (owner: 10Ori.livneh) [08:23:41] (03CR) 10Filippo Giunchedi: [C: 031] decom linne [puppet] - 10https://gerrit.wikimedia.org/r/162171 (owner: 10Dzahn) [08:24:37] (03CR) 10Filippo Giunchedi: "DNS has some service CNAMEs for ntp, are we going to use that or explicit names in config files?" 
[puppet] - 10https://gerrit.wikimedia.org/r/162175 (owner: 10Dzahn) [08:25:23] (03CR) 10Alexandros Kosiaris: "Yeah, linne has to be migrated to the module/role stuff before we applied it to chromium as well or else it will break. The linne change i" [puppet] - 10https://gerrit.wikimedia.org/r/159738 (owner: 10Alexandros Kosiaris) [08:27:46] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Wanna absent the file resource /root/.ssh/home-rsync as well ? it is equally useless I think now" [puppet] - 10https://gerrit.wikimedia.org/r/162189 (owner: 10Dzahn) [08:30:31] (03CR) 10Alexandros Kosiaris: [C: 04-1] "The change is fine, -1 blocking until" [puppet] - 10https://gerrit.wikimedia.org/r/162171 (owner: 10Dzahn) [08:33:21] (03PS10) 10Filippo Giunchedi: Add Grafana module & role [puppet] - 10https://gerrit.wikimedia.org/r/133274 (owner: 10Ori.livneh) [08:33:30] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Add Grafana module & role [puppet] - 10https://gerrit.wikimedia.org/r/133274 (owner: 10Ori.livneh) [08:34:41] mutante: FYI I merged the hdfs/icinga stuff too [08:34:42] (03CR) 10Alexandros Kosiaris: [C: 032] delete amanda config files [puppet] - 10https://gerrit.wikimedia.org/r/162182 (owner: 10Dzahn) [08:34:48] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [08:35:07] (03CR) 10Alexandros Kosiaris: [C: 032] remove ferm rule for amanda/tridge [puppet] - 10https://gerrit.wikimedia.org/r/162184 (owner: 10Dzahn) [08:35:18] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [08:35:56] <_joe_> godog: so... should we install it somewhere? [08:36:30] <_joe_> oh I see, zirconium [08:36:36] _joe_: it == grafana? 
[08:37:39] <_joe_> yep [08:38:04] <_joe_> godog: there is a problem with that change [08:38:11] <_joe_> check misc-varnishes [08:38:23] <_joe_> I fear we didn't add the needed backends [08:38:29] PROBLEM - puppet last run on zirconium is CRITICAL: CRITICAL: Epic puppet fail [08:38:55] <_joe_> and zirconium as well :) [08:39:16] (03CR) 10ArielGlenn: "(Sorry Ori, I didn't see the ps1 version.)" [puppet] - 10https://gerrit.wikimedia.org/r/161332 (owner: 10Ori.livneh) [08:39:52] (03CR) 10Gilles: [C: 031] Up image scaller wgMaxShellFileSize to 512MB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162038 (owner: 10Reedy) [08:40:09] (03PS1) 10Filippo Giunchedi: grafana: pass server_name to grafana::web::apache [puppet] - 10https://gerrit.wikimedia.org/r/162210 [08:40:17] _joe_: ^ [08:40:30] <_joe_> godog: eheh [08:40:49] (03PS3) 10Giuseppe Lavagetto: puppet: introduce hiera for production [puppet] - 10https://gerrit.wikimedia.org/r/160924 [08:41:18] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] grafana: pass server_name to grafana::web::apache [puppet] - 10https://gerrit.wikimedia.org/r/162210 (owner: 10Filippo Giunchedi) [08:41:20] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet: introduce hiera for production [puppet] - 10https://gerrit.wikimedia.org/r/160924 (owner: 10Giuseppe Lavagetto) [08:41:29] (03PS4) 10Giuseppe Lavagetto: puppet: introduce hiera for production [puppet] - 10https://gerrit.wikimedia.org/r/160924 [08:42:50] <_joe_> and hiera is live [08:43:30] RECOVERY - puppet last run on zirconium is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [08:44:30] \o/ [08:44:32] happy days [08:44:37] <_joe_> grrr [08:44:38] <_joe_> typo [08:46:12] (03CR) 10Alexandros Kosiaris: "CNAMEs please, let's not deviate from our defacto standard of Service IPs/aliases I see that both eqiad and pmtpa point to linne, we need " [puppet] - 10https://gerrit.wikimedia.org/r/162175 (owner: 10Dzahn) [08:46:53] (03PS1) 10Giuseppe Lavagetto: hiera: fix 
filename [puppet] - 10https://gerrit.wikimedia.org/r/162211 [08:46:59] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Puppet has 1 failures [08:47:07] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] hiera: fix filename [puppet] - 10https://gerrit.wikimedia.org/r/162211 (owner: 10Giuseppe Lavagetto) [08:47:43] (03CR) 1020after4: [C: 031] T458: Rename ext_ref description and hide it from users [puppet] - 10https://gerrit.wikimedia.org/r/162161 (owner: 10Chad) [08:47:57] (03PS3) 10Filippo Giunchedi: Add grafana.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/133275 (owner: 10Ori.livneh) [08:48:08] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Add grafana.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/133275 (owner: 10Ori.livneh) [08:49:09] (03PS1) 10Giuseppe Lavagetto: fix hiera path [software] - 10https://gerrit.wikimedia.org/r/162212 [08:49:10] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [08:49:40] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] fix hiera path [software] - 10https://gerrit.wikimedia.org/r/162212 (owner: 10Giuseppe Lavagetto) [08:50:10] <_joe_> bbiab [08:50:35] ori: grafana changes deployed \o/ (dns+puppet) pending software deployment [08:58:45] <_joe_> godog: I think you may use the trebuchet package provider [08:59:25] <_joe_> and it should take care of deploying a first version [08:59:35] _joe_: indeed, it is already configured but I'll let ori have fun [09:02:37] akosiaris: it looks like the /data/db20/home dir is not in the trige copy from iron... it's gone now? :-) [09:02:41] *tridge [09:02:57] (if so I don't need to look at that copy for anything!) [09:03:38] apergos: yes I did not copy it. I thought we had agreed it made no sense to move it [09:03:44] yay! [09:04:05] no I'm totally good with that. which just means, sorry I asked for the mount, I don't need it :-D [09:04:18] cool, unmounting then :-) [09:04:22] thank you! 
[09:04:35] you 're welcome [09:06:56] (03PS1) 10Giuseppe Lavagetto: hiera: test functionality of dynlookup feature [puppet] - 10https://gerrit.wikimedia.org/r/162213 [09:09:17] apergos: I was checking https://gerrit.wikimedia.org/r/#/c/161332/ again, it looks like passing the minion config as a dictionary is supported only in recent versions? [09:13:38] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:14:20] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] codfw-prod: initial ring [software/swift-ring] - 10https://gerrit.wikimedia.org/r/161992 (owner: 10Filippo Giunchedi) [09:15:38] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [09:16:15] !log deployed codfw-prod swift ring to palladium [09:16:18] PROBLEM - puppet last run on mw1053 is CRITICAL: CRITICAL: Puppet has 1 failures [09:16:21] Logged the message, Master [09:18:43] is it possible to get a patch merged which I had scheduled for SWAT but I was away and did not reply during the deployment window? [09:18:57] two actually [09:20:44] godog: I can't remember what versions I tested my little test script on [09:20:47] I'll check it [09:20:51] (03PS8) 10Alexandros Kosiaris: module/role for url-downloader [puppet] - 10https://gerrit.wikimedia.org/r/159738 [09:23:08] apergos: ok! [09:24:00] (03CR) 10Alexandros Kosiaris: [C: 032] module/role for url-downloader [puppet] - 10https://gerrit.wikimedia.org/r/159738 (owner: 10Alexandros Kosiaris) [09:32:08] PROBLEM - check configured eth on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:48] PROBLEM - nutcracker port on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:48] PROBLEM - check if dhclient is running on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:49] PROBLEM - SSH on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:32:49] PROBLEM - DPKG on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:33:05] PROBLEM - RAID on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:10] (03PS2) 10Giuseppe Lavagetto: hiera: test functionality of dynlookup feature [puppet] - 10https://gerrit.wikimedia.org/r/162213 [09:33:40] PROBLEM - Disk space on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:53] PROBLEM - nutcracker process on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:34:08] (03PS2) 10Alexandros Kosiaris: url_downloader: assign to chromium [puppet] - 10https://gerrit.wikimedia.org/r/161951 [09:35:10] (03CR) 10Alexandros Kosiaris: [C: 032] url_downloader: assign to chromium [puppet] - 10https://gerrit.wikimedia.org/r/161951 (owner: 10Alexandros Kosiaris) [09:37:41] RECOVERY - Disk space on mw1053 is OK: DISK OK [09:37:49] RECOVERY - nutcracker process on mw1053 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [09:37:59] RECOVERY - nutcracker port on mw1053 is OK: TCP OK - 0.000 second response time on port 11212 [09:37:59] RECOVERY - check if dhclient is running on mw1053 is OK: PROCS OK: 0 processes with command name dhclient [09:38:00] RECOVERY - SSH on mw1053 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [09:38:00] RECOVERY - DPKG on mw1053 is OK: All packages OK [09:38:28] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed [09:38:37] RECOVERY - check configured eth on mw1053 is OK: NRPE: Unable to read output [09:39:58] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [09:49:59] (03PS3) 10Giuseppe Lavagetto: hiera: test functionality of dynlookup feature [puppet] - 10https://gerrit.wikimedia.org/r/162213 [09:57:07] PROBLEM - puppet last run on chromium is CRITICAL: CRITICAL: Puppet has 1 failures [10:07:21] for the love of ... 
[10:07:35] squid developers must be on drugs or something [10:08:42] <_joe_> mmmh I just broke the puppet compiler in some funny way [10:09:07] <_joe_> well, whatever [10:09:21] <_joe_> I'll work on it tomorrow [10:13:15] godog: I just tested my little script on 0.17.5+ds-1 and it seems to be fine, this is on a test image with an empty config (so it would try to contact 'salt' as the master, if it were going to, and I would hear about that :-D) [10:14:24] (03PS4) 10Giuseppe Lavagetto: hiera: test functionality of dynlookup feature [puppet] - 10https://gerrit.wikimedia.org/r/162213 [10:15:56] (03CR) 10Giuseppe Lavagetto: [C: 032] hiera: test functionality of dynlookup feature [puppet] - 10https://gerrit.wikimedia.org/r/162213 (owner: 10Giuseppe Lavagetto) [10:17:12] apergos: nice! both approaches work then, I can review another PS if needed [10:17:35] well ori was saying it would be ok to just go back to ps1 [10:17:48] as opposed to having to tweak some third ps [10:17:54] I should test it [10:19:58] i.e. I know the approach works, I don't know if the exact changes made to grain-ensure don't have some little error in them until I test it [10:21:14] PROBLEM - puppet last run on mw1017 is CRITICAL: CRITICAL: Epic puppet fail [10:24:45] <_joe_> ^^ that is me [10:26:08] apergos: ah ok, I was under the impression that you had a third PS with a different approach like in the comments [10:26:49] I think what I do is pretty close to what ps1 does [10:27:17] i.e. 
it's bsed o setting the option and shoveling that in, I don't care if it's via a class or what [10:36:24] RECOVERY - puppet last run on mw1017 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [10:36:36] <_joe_> oh damn I am demented [10:37:09] <_joe_> I should've ran puppet on all puppetmasters [10:40:47] godog: ps1 wfm [10:41:46] tested on db1050 which has no running minion (no good keys to be accepted, I left it broken last night intentionally) [10:44:00] PROBLEM - url_dowloader on linne is CRITICAL: Connection refused [10:44:41] PROBLEM - url_dowloader on chromium is CRITICAL: Connection refused [10:46:51] apergos: cool! [10:49:30] PROBLEM - puppet last run on db1031 is CRITICAL: CRITICAL: Puppet has 1 failures [10:49:31] PROBLEM - puppet last run on ms-be1003 is CRITICAL: CRITICAL: Puppet has 1 failures [10:49:41] PROBLEM - puppet last run on db1073 is CRITICAL: CRITICAL: Puppet has 1 failures [10:49:51] PROBLEM - puppet last run on mc1006 is CRITICAL: CRITICAL: Puppet has 1 failures [10:52:50] PROBLEM - puppet last run on amslvs1 is CRITICAL: CRITICAL: Puppet has 1 failures [10:54:32] PROBLEM - puppet last run on mw1180 is CRITICAL: CRITICAL: Puppet has 1 failures [10:54:33] PROBLEM - puppet last run on mc1005 is CRITICAL: CRITICAL: Puppet has 1 failures [10:55:16] <_joe_> mmmh [10:57:00] <_joe_> apt timeout [11:02:40] RECOVERY - puppet last run on db1073 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [11:03:40] RECOVERY - puppet last run on db1031 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [11:03:42] RECOVERY - puppet last run on ms-be1003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [11:04:00] RECOVERY - puppet last run on mc1006 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [11:06:41] RECOVERY - puppet last run on amslvs1 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 
failures [11:08:30] RECOVERY - puppet last run on mw1180 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [11:08:35] RECOVERY - puppet last run on mc1005 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [11:10:28] (03PS1) 10Alexandros Kosiaris: url_downloader: squid3 cache_dir fix [puppet] - 10https://gerrit.wikimedia.org/r/162218 [11:10:38] _joe_: ^ [11:10:42] * akosiaris sigh [11:11:20] PROBLEM - MySQL Slave Delay on db1060 is CRITICAL: CRIT replication delay 325 seconds [11:11:54] PROBLEM - MySQL Replication Heartbeat on db1009 is CRITICAL: CRIT replication delay 331 seconds [11:12:03] PROBLEM - MySQL Slave Delay on db1063 is CRITICAL: CRIT replication delay 366 seconds [11:12:05] PROBLEM - MySQL Replication Heartbeat on db1060 is CRITICAL: CRIT replication delay 339 seconds [11:12:06] PROBLEM - MySQL Slave Delay on db1002 is CRITICAL: CRIT replication delay 343 seconds [11:12:33] PROBLEM - MySQL Replication Heartbeat on db1054 is CRITICAL: CRIT replication delay 344 seconds [11:12:43] PROBLEM - MySQL Replication Heartbeat on db1002 is CRITICAL: CRIT replication delay 380 seconds [11:12:44] PROBLEM - MySQL Slave Delay on db1054 is CRITICAL: CRIT replication delay 410 seconds [11:12:53] PROBLEM - MySQL Slave Delay on db1009 is CRITICAL: CRIT replication delay 394 seconds [11:12:57] what ho [11:13:42] more LinksUpdate [11:14:00] RECOVERY - MySQL Replication Heartbeat on db1009 is OK: OK replication delay 0 seconds [11:14:10] RECOVERY - MySQL Slave Delay on db1063 is OK: OK replication delay 0 seconds [11:14:13] RECOVERY - MySQL Replication Heartbeat on db1060 is OK: OK replication delay -1 seconds [11:14:13] RECOVERY - MySQL Slave Delay on db1002 is OK: OK replication delay 0 seconds [11:14:15] RECOVERY - MySQL Replication Heartbeat on db1054 is OK: OK replication delay -0 seconds [11:14:27] RECOVERY - MySQL Replication Heartbeat on db1002 is OK: OK replication delay -1 seconds [11:14:40] RECOVERY - MySQL 
Slave Delay on db1054 is OK: OK replication delay 0 seconds [11:14:49] RECOVERY - MySQL Slave Delay on db1009 is OK: OK replication delay 0 seconds [11:14:52] RECOVERY - MySQL Slave Delay on db1060 is OK: OK replication delay 0 seconds [11:21:38] (03CR) 10Alexandros Kosiaris: [C: 032] url_downloader: squid3 cache_dir fix [puppet] - 10https://gerrit.wikimedia.org/r/162218 (owner: 10Alexandros Kosiaris) [11:22:15] error: Ref refs/remotes/origin/production is at 6fb28eebe7abb863e4f42fc020bcd4c8339d756e but expected c45f8cea30e586e6104839cd794e758c28153028 [11:22:29] I bet strontium is going to be out of sync now [11:23:51] apergos: likely! the lame fix I do usually is ssh gitpuppet@strontium from palladium as root [11:24:17] * apergos peeks in... ah, not me [11:24:27] no, it is me [11:24:34] the question is why that error [11:24:54] Unpacking objects: 100% (6/6), done. [11:24:54] error: Ref refs/remotes/origin/production is at 6fb28eebe7abb863e4f42fc020bcd4c8339d756e but expected c45f8cea30e586e6104839cd794e758c28153028 [11:24:54] From https://gerrit.wikimedia.org/r/p/operations/puppet [11:24:54] ! c45f8ce..6fb28ee production -> origin/production (unable to update local ref) [11:24:55] supposedly there's an auth issue that crops up once in a while... I dunno though [11:25:16] where the previous sync didn't go right [11:25:17] that must not be an auth issue [11:25:25] oh.. 
[11:26:07] I've had it happen to me a couple times with various values of sudo but never got anything useful out of inspection [11:26:12] (03CR) 10Springle: [C: 031] Added refreshLinks to $wgJobBackoffThrottling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161603 (owner: 10Aaron Schulz) [11:31:29] (03PS1) 10Aklapper: Update Phabricator footer (License, Terms of Use) [puppet] - 10https://gerrit.wikimedia.org/r/162219 [11:31:31] (03PS1) 10Giuseppe Lavagetto: mediawiki: first step in simplification [puppet] - 10https://gerrit.wikimedia.org/r/162220 [11:31:33] (03PS1) 10Giuseppe Lavagetto: mediawiki: use hiera in appserver role [puppet] - 10https://gerrit.wikimedia.org/r/162221 [11:32:22] RECOVERY - puppet last run on chromium is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [11:32:29] (03PS2) 10Giuseppe Lavagetto: mediawiki: first step in simplification [puppet] - 10https://gerrit.wikimedia.org/r/162220 [11:33:16] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: first step in simplification [puppet] - 10https://gerrit.wikimedia.org/r/162220 (owner: 10Giuseppe Lavagetto) [11:35:38] (03CR) 10Qgil: [C: 031] "This looks exactly like the changes I have applied manually via the web interface. Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/162219 (owner: 10Aklapper) [12:07:54] mark: Hi. I'm pretty lost wondering what Ops work is left to get phab.wmfusercontent.org cert / SNI / nginx to work (apart from updating the storage path in Phabricator's settings), before allowing people to upload files in Phabricator and store them on phab.wmfusercontent.org? [12:07:58] Am I right that https://gerrit.wikimedia.org/r/#/c/161180/ is the next step? Any basic timeframe? 
[12:08:10] i have no idea [12:08:13] you'll have to wait for chase [12:08:18] heh, alright :) [12:08:45] i mean, i can imagine some stuff but it depends on phabricator and what the plans are [12:10:39] ...while it feels to be like waiting for the SNI/nginx low-level stuff to be sorted out. But guess that means I have to talk to Chase, alright. [12:11:26] no, the nginx stuff is all done [12:11:33] sni/nginx that is [12:11:40] oh, good [12:11:51] i finished that on thursday last week I believe [12:11:51] (03CR) 10Ottomata: [C: 032 V: 032] Add java_opts option [puppet/kafka] - 10https://gerrit.wikimedia.org/r/162152 (owner: 10Plucas) [12:11:59] (03CR) 10Ottomata: "Thanks!" [puppet/kafka] - 10https://gerrit.wikimedia.org/r/162152 (owner: 10Plucas) [12:12:15] but there's no backend setup for it yet that I'm aware of [12:12:48] (03CR) 10Ottomata: "stat1003 has public IP but is firewalled. Bastion access is needed." [puppet] - 10https://gerrit.wikimedia.org/r/162166 (owner: 10Dzahn) [12:14:16] I guess I wonder what "backend setup" means in this case and how complex that task is (but if that takes too long to explain and if you're busy, ignore this implicit question please) [12:15:36] it means, a web server vhost on the phabricator needs to be setup for it that varnish will talk to [12:15:43] and phabricator needs to be configured for it [12:15:53] and it all depends on how phabricator handles that [12:16:20] so yeah, we could do that, but i have no idea if that'd be consistent with the plans for it, and given that there are security implications of it all [12:16:26] i'd rather not guess and wait a few days until Chase can work it out [12:17:03] (03CR) 10Ottomata: "analytics-privatedata-users gives access to everything that analytics-users does, but with privileges to access private data. The Hadoop " [puppet] - 10https://gerrit.wikimedia.org/r/162193 (owner: 10Dzahn) [12:17:44] (03CR) 10Ottomata: "bastion access will be needed in order to access stat1002." 
[puppet] - 10https://gerrit.wikimedia.org/r/162193 (owner: 10Dzahn) [12:19:34] mark: alright, that is helpful. thank you for explaining! [12:21:32] brandon's work btw is related but not for phabricator [12:21:41] so because we had to fix SNI for phabricator, we decided to push it for production use as well [12:21:42] anyone around mind if I deploy some config changes now? [12:21:52] it's one of our plans we've had for a while [12:21:56] guessed so [12:21:57] I'm trying to figure out a brownout case in cirrus [12:22:00] ah, nice coincidence [12:22:14] well, we fixed it for phab, so now we have the ground work for the rest ;) [12:22:18] so thanks ;) [12:22:31] manybubbles: go ahead [12:22:44] (03CR) 10Manybubbles: [C: 032] Increase weight of author namespace in wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161022 (https://bugzilla.wikimedia.org/69771) (owner: 10Manybubbles) [12:22:54] (03Merged) 10jenkins-bot: Increase weight of author namespace in wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161022 (https://bugzilla.wikimedia.org/69771) (owner: 10Manybubbles) [12:23:18] (03CR) 10Manybubbles: [C: 032] Lower throttle on two cirrus jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161960 (owner: 10Manybubbles) [12:23:24] (03Merged) 10jenkins-bot: Lower throttle on two cirrus jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161960 (owner: 10Manybubbles) [12:24:30] !log manybubbles Synchronized wmf-config/: Some new cirrus config (duration: 00m 07s) [12:24:36] Logged the message, Master [12:29:17] (03CR) 10Mdann52: [C: 04-1] "Sysop should not automatically receive the right - sysop =/= OTRS user" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161456 (https://bugzilla.wikimedia.org/70386) (owner: 10Jackmcbarn) [12:34:07] (03CR) 10Revi: "@Mdann52: That means 'sysop can remove OTRS-member right'." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/161456 (https://bugzilla.wikimedia.org/70386) (owner: 10Jackmcbarn) [13:00:25] (03PS1) 10Manybubbles: Throttle cirrus links update jobs some more [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162226 [13:00:43] (03CR) 10Manybubbles: [C: 032] Throttle cirrus links update jobs some more [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162226 (owner: 10Manybubbles) [13:00:48] (03Merged) 10jenkins-bot: Throttle cirrus links update jobs some more [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162226 (owner: 10Manybubbles) [13:01:31] !log manybubbles Synchronized wmf-config/: Throttle cirrus jobs some more. (duration: 00m 04s) [13:01:37] Logged the message, Master [13:05:31] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Puppet has 1 failures [13:19:32] hi is phabricator.wikimedia supposed to work? [13:19:46] I get 500 when I try to login: Request: POST http://phabricator.wikimedia.org/auth/login/ldap:self/, from 90.183.23.27 via cp1044 cp1044 ([208.80.154.241]:80), Varnish XID 1573317246 [13:19:47] Forwarded for: 90.183.23.27 Error: 503, Service Unavailable at Tue, 23 Sep 2014 13:18:49 GMT [13:20:00] * 503 [13:20:47] (03PS1) 10Alexandros Kosiaris: url-downloader: further fixes [puppet] - 10https://gerrit.wikimedia.org/r/162228 [13:23:50] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [13:28:37] (03PS2) 10Giuseppe Lavagetto: mediawiki: use hiera in appserver role [puppet] - 10https://gerrit.wikimedia.org/r/162221 [13:31:25] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: use hiera in appserver role [puppet] - 10https://gerrit.wikimedia.org/r/162221 (owner: 10Giuseppe Lavagetto) [13:39:21] (03CR) 10Alexandros Kosiaris: [C: 032] url-downloader: further fixes [puppet] - 10https://gerrit.wikimedia.org/r/162228 (owner: 10Alexandros Kosiaris) [13:47:21] (03CR) 10Alexandros Kosiaris: [C: 032] Change url-downloader to 
point to new IP [dns] - 10https://gerrit.wikimedia.org/r/161956 (owner: 10Alexandros Kosiaris) [13:47:51] akosiaris: yay [13:48:01] !log change url-downloader ip to point to the new one [13:48:02] (03PS4) 1001tonythomas: Corrected the exim regex expression and POST url [puppet] - 10https://gerrit.wikimedia.org/r/161679 [13:48:07] Logged the message, Master [13:48:11] Reedy: you are fast :-) [13:49:00] let's monitor it now for a couple of hours... not like it gets much traffic but anyway [13:52:01] (03PS1) 10Reedy: Don't hardcode url-downloader.wikimedia.org:8080, use $wgCopyUploadProxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162230 [13:52:17] I guess it'd be mostly likely flickr uploads and such [13:52:21] I guess they should be quicker too [13:53:19] probably [13:57:16] Test it in a couple of hours (via MediaWiki) [13:57:20] Disable squid on linne, test again [13:57:31] disable/stop [13:59:10] * YuviPanda|afk wonders if we should put forth an OPW/GSoC project for ops [14:02:26] <_joe_> YuviPanda: oh I have lots of ideas... 
"reimplement puppet in python" for starters [14:02:34] hahaha :D [14:02:53] pyppet [14:03:19] I wonder if 'move things into modules' can be an OPW thing [14:03:50] * YuviPanda would also personally like ganglia to die and be replaced by graphite [14:07:48] (03PS1) 10Giuseppe Lavagetto: mediawiki: simplify api and rendering [puppet] - 10https://gerrit.wikimedia.org/r/162234 [14:07:58] <_joe_> YuviPanda: first we need to make graphite scale [14:08:02] that is true [14:08:27] we could've theoretically done it when graphite.wmflabs.org was on a labs machine, but we sidestepped that with getting a real machine for it [14:08:38] I wonder how tungsten is doing [14:08:46] <_joe_> bad [14:08:50] heh [14:09:38] I suppose the way to do it is to move statsd+carbon-relay into one host, and spread the caches among other hosts [14:09:56] <_joe_> I think godog researched a bit about this [14:12:25] ye, the main bottleneck is the disk unsurprisingly enough [14:12:49] yeah [14:12:54] we could also just move 'em to SSDs [14:13:36] InfluxDB seems a bit too immature at this point [14:13:46] yep SSDs would certainly help [14:14:01] (03PS2) 10Yuvipanda: nagios_common: Move check_ssl_cert into module [puppet] - 10https://gerrit.wikimedia.org/r/161952 [14:14:06] _joe_: wanna merge ^? [14:14:27] <_joe_> YuviPanda: I'm busy trying to bring down prod right now [14:14:34] _joe_: ah, heh :) [14:14:41] <_joe_> with noop changes on appservers [14:14:52] * YuviPanda looks forward to 1.5months from now, when he can get busy trying to bring down prod [14:15:30] (03CR) 10Jackmcbarn: "Indeed. In fact, it's not possible for one group to automatically contain another group." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161456 (https://bugzilla.wikimedia.org/70386) (owner: 10Jackmcbarn) [14:15:32] <_joe_> YuviPanda: oh you're a developer; bringing down prod is your #1 goal already. 
The difference is that as an ops you're not supposed to [14:16:07] hehe [14:16:22] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: simplify api and rendering [puppet] - 10https://gerrit.wikimedia.org/r/162234 (owner: 10Giuseppe Lavagetto) [14:18:24] (03PS1) 10Yuvipanda: nagios_common: Move check_all_memcached into module [puppet] - 10https://gerrit.wikimedia.org/r/162238 [14:26:34] (03PS1) 10Yuvipanda: nagios_common: Move users check into module [puppet] - 10https://gerrit.wikimedia.org/r/162239 [14:27:22] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: Move users check into module [puppet] - 10https://gerrit.wikimedia.org/r/162239 (owner: 10Yuvipanda) [14:27:58] (03PS2) 10Yuvipanda: nagios_common: Move users check into module [puppet] - 10https://gerrit.wikimedia.org/r/162239 [14:28:00] (03PS2) 10Yuvipanda: nagios_common: Move check_all_memcached into module [puppet] - 10https://gerrit.wikimedia.org/r/162238 [14:28:02] (03PS3) 10Yuvipanda: nagios_common: Move check_ssl_cert into module [puppet] - 10https://gerrit.wikimedia.org/r/161952 [14:29:02] (03PS3) 10ArielGlenn: salt: make grain-ensure.py operate locally [puppet] - 10https://gerrit.wikimedia.org/r/161332 (owner: 10Ori.livneh) [14:29:58] (03CR) 10ArielGlenn: "just resubmitted patchset 1." [puppet] - 10https://gerrit.wikimedia.org/r/161332 (owner: 10Ori.livneh) [14:30:07] _joe_: I'm just going to move a bunch more things, and I'll bug you tomorrow [14:30:26] <_joe_> YuviPanda: ok [14:30:34] <_joe_> I'm done with prod for today [14:33:32] (03PS4) 10ArielGlenn: salt: make grain-ensure.py operate locally [puppet] - 10https://gerrit.wikimedia.org/r/161332 (owner: 10Ori.livneh) [14:34:29] (03CR) 10ArielGlenn: [C: 032] salt: make grain-ensure.py operate locally [puppet] - 10https://gerrit.wikimedia.org/r/161332 (owner: 10Ori.livneh) [14:34:42] The external account ("LDAP") you just authenticated with is not configured to allow registration on this Phabricator install. An administrator may have recently disabled it. 
[14:37:10] _joe_: regarding bringing down production, the FR team is complaining about server errors on https://meta.wikimedia.org/wiki/Special:GlobalAllocation [14:37:20] could that be related? [14:37:33] <_joe_> Jeff_Green: I don't think so [14:37:45] <_joe_> Jeff_Green: I just risked to remove servers from pools [14:38:24] that page is spitting 503s [14:38:54] <_joe_> meaning it takes too much time to be rendered [14:38:59] there are a lot of IPs and ports on the list, figuring out what it means [14:39:36] <_joe_> the last one is the varnish host [14:40:05] <_joe_> the ones before are usually the varnishes and public ips from the the various datacenters [14:40:21] <_joe_> take a look at logs on fluorine [14:40:36] <_joe_> or, try to fetch that page on the backend directly with apache-fast-test [14:41:36] ha, I wrote it but I don't even know where we run that now :-P [14:42:04] (03CR) 10Mdann52: [C: 031] ""Facepalm" that's why you should always read the documentation and stuff first...." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161456 (https://bugzilla.wikimedia.org/70386) (owner: 10Jackmcbarn) [14:42:13] <_joe_> tin [14:42:13] <_joe_> I guessed you knew :P [14:42:19] thanks [14:42:54] PROBLEM - puppet last run on db1050 is CRITICAL: CRITICAL: Puppet has 1 failures [14:43:57] why do I have the feeling I just murdered the entire cluster? [14:43:58] Jeff_Green: hey, question: is frack still using the Netapp log-syncing thingy ? [14:44:35] re. first problem -- apache-fast-test gets all 500's, mix of ISE and "read timeout" [14:45:33] akosiaris: replication of the fundraising share? yeah. 
writes/reads are all at eqiad, replicates to pmtpa just for redundancy [14:46:10] Jeff_Green: cool then, I have to break the replication for the netapp to be movable [14:46:22] and zero disks and all that jazz [14:46:25] right [14:46:37] as long as the eqiad one stays up fundraising should not be disrupted [14:46:44] I 'll probably do it tomorrow [14:46:46] oh my [14:46:55] oh my re. the GlobalAllocation thing I mean [14:46:56] <_joe_> netapp jazz [14:47:03] PHP Fatal error: Allowed memory size of 268435456 bytes exhausted (tried to allocate 71 bytes) in /srv/mediawiki/php-1.24wmf21/extensions/CentralNotice/special/SpecialGlobalAllocation.php on line 436 [14:47:03] RECOVERY - puppet last run on db1050 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [14:47:10] <_joe_> Jeff_Green: ;) [14:47:20] fail wonderful fail [14:47:27] now what? who works on that thing? [14:47:38] <_joe_> bugzilla it and forget [14:47:49] :-( [14:48:51] <_joe_> that's how I was taught [14:49:05] <_joe_> Jeff_Green: you can ask in #mediawiki-core I guess [14:49:45] _joe_: there's already a ticket for this error [14:49:52] https://bugzilla.wikimedia.org/show_bug.cgi?id=53443 [14:50:00] from 2013-08 [14:50:09] <_joe_> eheh [14:50:13] well slightly different actually [14:50:13] <_joe_> sounds promising [14:50:28] * anomie assumes manybubbles will SWAT this morning [14:52:21] incoming spam! [14:52:24] INCOMING SPAM! 
[14:52:28] ha [14:52:30] (03PS1) 10Yuvipanda: nagios_common: Move telnet into module [puppet] - 10https://gerrit.wikimedia.org/r/162244 [14:52:32] (03PS1) 10Yuvipanda: nagios_common: Move ssh into module [puppet] - 10https://gerrit.wikimedia.org/r/162245 [14:52:34] (03PS1) 10Yuvipanda: nagios_common: Move snmp into module [puppet] - 10https://gerrit.wikimedia.org/r/162246 [14:52:36] (03PS1) 10Yuvipanda: nagios_common: Move real into module [puppet] - 10https://gerrit.wikimedia.org/r/162247 [14:52:38] (03PS1) 10Yuvipanda: nagios_common: Move radius into module [puppet] - 10https://gerrit.wikimedia.org/r/162248 [14:52:40] (03PS1) 10Yuvipanda: nagios_common: Move rpc-nfs into module [puppet] - 10https://gerrit.wikimedia.org/r/162249 [14:52:42] (03PS1) 10Yuvipanda: nagios_common: Move tcp_udp into module [puppet] - 10https://gerrit.wikimedia.org/r/162250 [14:52:44] (03PS1) 10Yuvipanda: nagios_common: Move apt into module [puppet] - 10https://gerrit.wikimedia.org/r/162251 [14:52:46] (03PS1) 10Yuvipanda: nagios_common: Move breeze into module [puppet] - 10https://gerrit.wikimedia.org/r/162252 [14:52:48] (03PS1) 10Yuvipanda: nagios_common: Move dhcp into module [puppet] - 10https://gerrit.wikimedia.org/r/162253 [14:52:50] (03PS1) 10Yuvipanda: nagios_common: Move disk-smb into module [puppet] - 10https://gerrit.wikimedia.org/r/162254 [14:52:52] (03PS1) 10Yuvipanda: nagios_common: Move disk into module [puppet] - 10https://gerrit.wikimedia.org/r/162255 [14:52:54] (03PS1) 10Yuvipanda: nagios_common: Move dns into module [puppet] - 10https://gerrit.wikimedia.org/r/162256 [14:52:56] (03PS1) 10Yuvipanda: nagios_common: Move dummy into module [puppet] - 10https://gerrit.wikimedia.org/r/162257 [14:52:58] (03PS1) 10Yuvipanda: nagios_common: Move flexlm into module [puppet] - 10https://gerrit.wikimedia.org/r/162258 [14:53:00] (03PS1) 10Yuvipanda: nagios_common: Move ftp into module [puppet] - 10https://gerrit.wikimedia.org/r/162259 [14:53:02] (03PS1) 10Yuvipanda: icinga: Remove 
hppjd check [puppet] - 10https://gerrit.wikimedia.org/r/162260 [14:53:04] (03PS1) 10Yuvipanda: nagios_common: Move http into module [puppet] - 10https://gerrit.wikimedia.org/r/162261 [14:53:06] (03PS1) 10Yuvipanda: nagios_common: Move ifstatus into module [puppet] - 10https://gerrit.wikimedia.org/r/162262 [14:53:08] (03PS1) 10Yuvipanda: nagios_common: Move ldap into module [puppet] - 10https://gerrit.wikimedia.org/r/162263 [14:53:10] (03PS1) 10Yuvipanda: nagios_common: Move load into module [puppet] - 10https://gerrit.wikimedia.org/r/162264 [14:53:12] (03PS1) 10Yuvipanda: nagios_common: Move mail into module [puppet] - 10https://gerrit.wikimedia.org/r/162265 [14:53:14] (03PS1) 10Yuvipanda: nagios_common: move mrtg into module [puppet] - 10https://gerrit.wikimedia.org/r/162266 [14:53:16] (03PS1) 10Yuvipanda: nagios_common: move mysql into module [puppet] - 10https://gerrit.wikimedia.org/r/162267 [14:53:18] (03PS1) 10Yuvipanda: nagios_common: move netware into module [puppet] - 10https://gerrit.wikimedia.org/r/162268 [14:53:20] (03PS1) 10Yuvipanda: nagios_common: move news into module [puppet] - 10https://gerrit.wikimedia.org/r/162269 [14:53:22] (03PS1) 10Yuvipanda: nagios_common: move nt into module [puppet] - 10https://gerrit.wikimedia.org/r/162270 [14:53:24] (03PS1) 10Yuvipanda: nagios_common: move ntp into module [puppet] - 10https://gerrit.wikimedia.org/r/162271 [14:53:26] (03PS1) 10Yuvipanda: nagios_common: move pgsql into module [puppet] - 10https://gerrit.wikimedia.org/r/162272 [14:53:28] (03PS1) 10Yuvipanda: nagios_common: move ping into module [puppet] - 10https://gerrit.wikimedia.org/r/162273 [14:53:30] (03PS1) 10Yuvipanda: nagios_common: move procs into module [puppet] - 10https://gerrit.wikimedia.org/r/162274 [14:53:32] (03PS1) 10Yuvipanda: nagios_common: move vsz into module [puppet] - 10https://gerrit.wikimedia.org/r/162275 [14:53:41] END OF SPAM [14:54:06] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: Move tcp_udp into module [puppet] - 
10https://gerrit.wikimedia.org/r/162250 (owner: 10Yuvipanda) [14:54:12] oh dear [14:54:13] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: Move apt into module [puppet] - 10https://gerrit.wikimedia.org/r/162251 (owner: 10Yuvipanda) [14:54:23] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: Move breeze into module [puppet] - 10https://gerrit.wikimedia.org/r/162252 (owner: 10Yuvipanda) [14:54:27] jenkins is not amused [14:54:43] missed a comma [14:54:51] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: Move radius into module [puppet] - 10https://gerrit.wikimedia.org/r/162248 (owner: 10Yuvipanda) [14:54:57] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: Move dhcp into module [puppet] - 10https://gerrit.wikimedia.org/r/162253 (owner: 10Yuvipanda) [14:54:59] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: Move disk into module [puppet] - 10https://gerrit.wikimedia.org/r/162255 (owner: 10Yuvipanda) [14:55:03] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: Move disk-smb into module [puppet] - 10https://gerrit.wikimedia.org/r/162254 (owner: 10Yuvipanda) [14:55:15] jenkins is the real spammer here :P [14:55:16] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: Move dummy into module [puppet] - 10https://gerrit.wikimedia.org/r/162257 (owner: 10Yuvipanda) [14:55:40] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: Move ftp into module [puppet] - 10https://gerrit.wikimedia.org/r/162259 (owner: 10Yuvipanda) [14:55:50] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: Move rpc-nfs into module [puppet] - 10https://gerrit.wikimedia.org/r/162249 (owner: 10Yuvipanda) [14:55:52] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: Move http into module [puppet] - 10https://gerrit.wikimedia.org/r/162261 (owner: 10Yuvipanda) [14:55:59] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: Move ifstatus into module [puppet] - 10https://gerrit.wikimedia.org/r/162262 (owner: 10Yuvipanda) [14:56:19] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: Move dns into module [puppet] - 
10https://gerrit.wikimedia.org/r/162256 (owner: 10Yuvipanda) [14:56:27] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: Move ldap into module [puppet] - 10https://gerrit.wikimedia.org/r/162263 (owner: 10Yuvipanda) [14:56:32] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: Move load into module [puppet] - 10https://gerrit.wikimedia.org/r/162264 (owner: 10Yuvipanda) [14:56:34] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: Move mail into module [puppet] - 10https://gerrit.wikimedia.org/r/162265 (owner: 10Yuvipanda) [14:56:48] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: move mrtg into module [puppet] - 10https://gerrit.wikimedia.org/r/162266 (owner: 10Yuvipanda) [14:56:56] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: move netware into module [puppet] - 10https://gerrit.wikimedia.org/r/162268 (owner: 10Yuvipanda) [14:57:01] (03CR) 10jenkins-bot: [V: 04-1] icinga: Remove hppjd check [puppet] - 10https://gerrit.wikimedia.org/r/162260 (owner: 10Yuvipanda) [14:57:06] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: Move flexlm into module [puppet] - 10https://gerrit.wikimedia.org/r/162258 (owner: 10Yuvipanda) [14:57:20] * YuviPanda waits for the spam to complete, so he can double check before spamming again [14:57:31] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: move pgsql into module [puppet] - 10https://gerrit.wikimedia.org/r/162272 (owner: 10Yuvipanda) [14:57:34] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: move news into module [puppet] - 10https://gerrit.wikimedia.org/r/162269 (owner: 10Yuvipanda) [14:57:34] PROBLEM - puppet last run on mw1113 is CRITICAL: CRITICAL: Puppet has 1 failures [14:57:36] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: move nt into module [puppet] - 10https://gerrit.wikimedia.org/r/162270 (owner: 10Yuvipanda) [14:57:38] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: move ntp into module [puppet] - 10https://gerrit.wikimedia.org/r/162271 (owner: 10Yuvipanda) [14:57:40] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: move ping into module 
[puppet] - 10https://gerrit.wikimedia.org/r/162273 (owner: 10Yuvipanda) [14:57:47] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: move procs into module [puppet] - 10https://gerrit.wikimedia.org/r/162274 (owner: 10Yuvipanda) [14:57:52] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: move vsz into module [puppet] - 10https://gerrit.wikimedia.org/r/162275 (owner: 10Yuvipanda) [14:58:41] is jenkins broken? [14:58:42] INCOMING SPAM [14:58:45] omg [14:58:46] aude: no, I missed a comma [14:58:46] (03PS2) 10Yuvipanda: nagios_common: Move tcp_udp into module [puppet] - 10https://gerrit.wikimedia.org/r/162250 [14:58:48] (03PS2) 10Yuvipanda: nagios_common: Move apt into module [puppet] - 10https://gerrit.wikimedia.org/r/162251 [14:58:50] (03PS2) 10Yuvipanda: nagios_common: Move radius into module [puppet] - 10https://gerrit.wikimedia.org/r/162248 [14:58:52] (03PS2) 10Yuvipanda: nagios_common: Move rpc-nfs into module [puppet] - 10https://gerrit.wikimedia.org/r/162249 [14:58:54] :( [14:58:54] (03PS2) 10Yuvipanda: nagios_common: Move disk-smb into module [puppet] - 10https://gerrit.wikimedia.org/r/162254 [14:58:56] (03PS2) 10Yuvipanda: nagios_common: Move disk into module [puppet] - 10https://gerrit.wikimedia.org/r/162255 [14:58:58] (03PS2) 10Yuvipanda: nagios_common: Move breeze into module [puppet] - 10https://gerrit.wikimedia.org/r/162252 [14:59:00] (03PS2) 10Yuvipanda: nagios_common: Move dhcp into module [puppet] - 10https://gerrit.wikimedia.org/r/162253 [14:59:02] (03PS2) 10Yuvipanda: nagios_common: Move snmp into module [puppet] - 10https://gerrit.wikimedia.org/r/162246 [14:59:04] (03PS2) 10Yuvipanda: nagios_common: Move real into module [puppet] - 10https://gerrit.wikimedia.org/r/162247 [14:59:06] (03PS2) 10Yuvipanda: nagios_common: Move telnet into module [puppet] - 10https://gerrit.wikimedia.org/r/162244 [14:59:08] (03PS2) 10Yuvipanda: nagios_common: Move ssh into module [puppet] - 10https://gerrit.wikimedia.org/r/162245 [14:59:10] (03PS2) 10Yuvipanda: nagios_common: 
move mysql into module [puppet] - 10https://gerrit.wikimedia.org/r/162267 [14:59:12] (03PS2) 10Yuvipanda: nagios_common: move mrtg into module [puppet] - 10https://gerrit.wikimedia.org/r/162266 [14:59:14] (03PS2) 10Yuvipanda: nagios_common: Move mail into module [puppet] - 10https://gerrit.wikimedia.org/r/162265 [14:59:16] (03PS2) 10Yuvipanda: nagios_common: Move load into module [puppet] - 10https://gerrit.wikimedia.org/r/162264 [14:59:19] (03PS2) 10Yuvipanda: nagios_common: move ntp into module [puppet] - 10https://gerrit.wikimedia.org/r/162271 [14:59:20] (03PS2) 10Yuvipanda: nagios_common: move nt into module [puppet] - 10https://gerrit.wikimedia.org/r/162270 [14:59:23] (03PS2) 10Yuvipanda: nagios_common: move news into module [puppet] - 10https://gerrit.wikimedia.org/r/162269 [14:59:24] (03PS2) 10Yuvipanda: nagios_common: move netware into module [puppet] - 10https://gerrit.wikimedia.org/r/162268 [14:59:26] (03PS2) 10Yuvipanda: nagios_common: Move ftp into module [puppet] - 10https://gerrit.wikimedia.org/r/162259 [14:59:28] (03PS2) 10Yuvipanda: nagios_common: Move flexlm into module [puppet] - 10https://gerrit.wikimedia.org/r/162258 [14:59:30] (03PS2) 10Yuvipanda: nagios_common: Move dummy into module [puppet] - 10https://gerrit.wikimedia.org/r/162257 [14:59:32] (03PS2) 10Yuvipanda: nagios_common: Move dns into module [puppet] - 10https://gerrit.wikimedia.org/r/162256 [14:59:34] (03PS2) 10Yuvipanda: nagios_common: Move ldap into module [puppet] - 10https://gerrit.wikimedia.org/r/162263 [14:59:37] (03PS2) 10Yuvipanda: nagios_common: Move ifstatus into module [puppet] - 10https://gerrit.wikimedia.org/r/162262 [14:59:39] (03PS2) 10Yuvipanda: nagios_common: Move http into module [puppet] - 10https://gerrit.wikimedia.org/r/162261 [14:59:41] (03PS2) 10Yuvipanda: icinga: Remove hppjd check [puppet] - 10https://gerrit.wikimedia.org/r/162260 [14:59:43] (03PS2) 10Yuvipanda: nagios_common: move pgsql into module [puppet] - 10https://gerrit.wikimedia.org/r/162272 
[14:59:45] (03PS2) 10Yuvipanda: nagios_common: move ping into module [puppet] - 10https://gerrit.wikimedia.org/r/162273 [14:59:47] (03PS2) 10Yuvipanda: nagios_common: move procs into module [puppet] - 10https://gerrit.wikimedia.org/r/162274 [14:59:49] (03PS2) 10Yuvipanda: nagios_common: move vsz into module [puppet] - 10https://gerrit.wikimedia.org/r/162275 [14:59:51] (03PS3) 10Yuvipanda: nagios_common: Move users check into module [puppet] - 10https://gerrit.wikimedia.org/r/162239 [14:59:53] (03PS3) 10Yuvipanda: nagios_common: Move check_all_memcached into module [puppet] - 10https://gerrit.wikimedia.org/r/162238 [14:59:55] (03PS4) 10Yuvipanda: nagios_common: Move check_ssl_cert into module [puppet] - 10https://gerrit.wikimedia.org/r/161952 [14:59:57] (03CR) 10jenkins-bot: [V: 04-1] nagios_common: move mysql into module [puppet] - 10https://gerrit.wikimedia.org/r/162267 (owner: 10Yuvipanda) [15:00:04] oh for fuck's sake [15:00:11] <_joe_> Yuviiiiii [15:00:14] that isn't nice [15:00:17] sorry [15:00:31] anomie: yeah - I'll swat [15:00:31] oh [15:00:47] YuviPanda: should I wait to SWAT until nagios is happy? [15:00:48] that error was from earlier, jenkins shouldn't complain anymore [15:00:58] manybubbles: no, it should be all good now [15:00:59] oh, its just patches. 
well, here I go [15:01:23] * YuviPanda refrains from pushing things until SWAT is over [15:01:39] <_joe_> brb [15:05:40] God YuviPanda [15:07:28] manybubbles: hey at least I didn't write 'uploadwizard' in the title [15:07:36] Heh [15:08:13] Actually that probably would have been fine, I expected a ping about SWAT anyway [15:08:19] !log manybubbles Synchronized php-1.24wmf22/extensions/CirrusSearch/: SWAT deploy cirrus backports (duration: 00m 05s) [15:08:21] But I guess jouncebot is hungover [15:08:23] Logged the message, Master [15:08:26] manybubbles: it was 35*3 pings [15:08:52] err [15:08:53] marktraceur: [15:09:22] YuviPanda: Which only means I'd come to this channel, scroll up to see your silliness, and still make fun of you the same way [15:09:25] <3 [15:09:45] marktraceur: heh [15:15:24] I'm here today. Can someone merge the patches I scheduled for yesterday's SWAT? [15:15:31] https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=128186 [15:16:27] !log manybubbles Synchronized php-1.24wmf21/extensions/CirrusSearch/: SWAT update Cirrus for better error handling (duration: 00m 04s) [15:16:31] Logged the message, Master [15:16:57] RECOVERY - puppet last run on mw1113 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [15:17:35] * manybubbles done with SWAT [15:18:48] https://gerrit.wikimedia.org/r/#/q/owner:%22Yuvipanda+%253Cyuvipanda%2540gmail.com%253E%22+project:operations/puppet+status:open,n,z if someone wants to merge [15:20:29] Glaisher: oh more things to SWAT? I didn't see those until just now [15:20:56] manybubbles: I just got here and added them just now. Sorry [15:21:25] anomie: is removing these namespace aliases just as simple as deploying the patch? [15:22:15] <^d> Anyone have a clue why puppet's disabled since the 12th on searchidx1001? 
[15:22:39] (03CR) 10Manybubbles: [C: 032] Add '*.beeldbank.cultureelerfgoed.nl' to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161779 (https://bugzilla.wikimedia.org/70840) (owner: 10Gerrit Patch Uploader) [15:22:43] <^d> 'reason not specified' :p [15:22:48] (03Merged) 10jenkins-bot: Add '*.beeldbank.cultureelerfgoed.nl' to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161779 (https://bugzilla.wikimedia.org/70840) (owner: 10Gerrit Patch Uploader) [15:22:56] * manybubbles is swatting again [15:23:51] manybubbles: Well, then someone would have to go through and null edit (or purge via API with forcelinkupdate) any pages linking with those namespaces to get the links tables updated. I'm not sure whether there's a safe maintenance script for that sort of thing; I tried asking Reedy yesterday but if he replied I missed it. [15:24:11] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: SWAT add *.beeldbank.cultureelerfgoed.nl to upload list (duration: 00m 04s) [15:24:16] Logged the message, Master [15:24:18] Glaisher: did the uploads bug [15:24:28] thanks [15:25:38] anomie: cool. If we don't hear back from reedy in the next 20 minutes I'll boot the bug to the next swat (again :sadface:) [15:27:24] not many c: and mw: links at ckbwiki [15:28:43] http://ckb.wikipedia.org/w/api.php?action=query&list=iwbacklinks&format=xml&iwblprefix=mw&iwbllimit=40&iwblprop=iwprefix [15:28:48] I'm here [15:28:50] http://ckb.wikipedia.org/w/api.php?action=query&list=iwbacklinks&format=xml&iwblprefix=c&iwbllimit=40&iwblprop=iwprefix [15:29:54] AFAIK there isn't a maintenance script for it [15:30:01] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move check_ssl_cert into module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/161952 (owner: 10Yuvipanda) [15:31:11] Reedy: so I can just null edit all those pages manually? 
it's only a dozen [15:31:54] WFM [15:32:05] If it was more, I'd just script api purges or something [15:32:13] k [15:32:25] (03CR) 10Manybubbles: [C: 032] Remove "MW" and "C" namespace alias on ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161684 (https://bugzilla.wikimedia.org/70564) (owner: 10Gerrit Patch Uploader) [15:32:50] (03Merged) 10jenkins-bot: Remove "MW" and "C" namespace alias on ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161684 (https://bugzilla.wikimedia.org/70564) (owner: 10Gerrit Patch Uploader) [15:33:25] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Epic puppet fail [15:33:52] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: SWAT remove C and MW namespace aliases from ckbwiki (duration: 00m 07s) [15:33:57] Logged the message, Master [15:34:06] (03CR) 10Andrew Bogott: [C: 032] nagios_common: move vsz into module [puppet] - 10https://gerrit.wikimedia.org/r/162275 (owner: 10Yuvipanda) [15:35:14] andrewbogott: lots of dependent patches, I'm afraid. 
easiest course of action is to start at https://gerrit.wikimedia.org/r/#/c/162238/ and keep following the 'needed by' link [15:36:50] (03CR) 10Andrew Bogott: [C: 032] nagios_common: move procs into module [puppet] - 10https://gerrit.wikimedia.org/r/162274 (owner: 10Yuvipanda) [15:37:25] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:38:55] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CRITICAL: Puppet has 1 failures [15:39:32] Reedy: can't edit this page: https://ckb.wikipedia.org/w/index.php?title=%D9%88%DB%8C%DA%A9%DB%8C%D9%BE%DB%8C%D8%AF%DB%8C%D8%A7:%DA%95%D8%A7%D9%BE%D8%B1%D8%B3%DB%8C%DB%8C%DB%95%DA%A9%D8%A7%D9%86/%DA%86%D8%A7%DA%A9%D8%B3%D8%A7%D8%B2%DB%8C%DB%8C_%D9%85%D8%A7%D9%81%DB%95%DA%A9%D8%A7%D9%86%DB%8C_%D9%BE%D8%A7%DA%B5%D9%88%DB%8E%D9%86%DB%95%DB%8C_%DA%A9%DB%95%DA%B5%DA%A9%D8%A7%D9%88%DB%95%DA%98%D9%88%D9%88&action=edit [15:39:36] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move users check into module [puppet] - 10https://gerrit.wikimedia.org/r/162239 (owner: 10Yuvipanda) [15:40:29] (03PS1) 10Ottomata: Escape awk variable in kafkatee output [puppet] - 10https://gerrit.wikimedia.org/r/162282 [15:40:35] manybubbles: protected or something? [15:40:44] Reedy: I can't read it so I'm not sure [15:40:45] * Reedy sets an en interface [15:41:05] I've null edited that page [15:41:15] Yup, it's protected [15:41:17] "This page has been protected to prevent editing or other actions." [15:41:41] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Move check_all_memcached into module [puppet] - 10https://gerrit.wikimedia.org/r/162238 (owner: 10Yuvipanda) [15:42:02] YuviPanda: let's let those two settle in before merging any more... 
[15:42:08] andrewbogott: ok [15:42:48] andrewbogott: icinga.pp is shorter by about 250 lines after the end of this (still 600 lines or so), but hopefully it'll be empty when this is done [15:44:09] andrewbogott: can you force puppet on neon and tell me if it succeeds? [15:44:21] * YuviPanda has a feeling it might find duplicate definitions, since the old files haven't been explicitly removed [15:47:09] ok - I null edited all the pages that weren't protected [15:48:30] thanks [15:48:58] PROBLEM - Disk space on searchidx1001 is CRITICAL: DISK CRITICAL - free space: / 33 MB (0% inode=21%): [15:50:13] YuviPanda: puppet ran and updated things on neon, no problems. [15:50:21] andrewbogott: wheee! cool [15:52:36] andrewbogott: I've to go now, so don't know if it's a good idea to merge the rest now [15:53:03] YuviPanda: that's fine, I have a meeting in a minute, and we should wait an hour for everything to cycle. [15:53:09] andrewbogott: cool [15:53:10] Just ping me when you're around and want me to merge some more of them. [15:53:16] andrewbogott: yeah. thanks :D [15:53:55] what's the internal range for codfw lvs service IPs? asking for ms-fe.codfw.wmnet cc: bblack [15:56:28] (03CR) 10Hashar: "operations/debs/wikimedia-task-appserver is deployed on all application server and its content is being slowly migrated to puppet so we ca" [debs/wikimedia-task-appserver] - 10https://gerrit.wikimedia.org/r/115135 (https://bugzilla.wikimedia.org/61090) (owner: 10Hashar) [16:02:53] RECOVERY - puppet last run on searchidx1001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [16:07:38] bd808: Align in vim works like a charm, thanks :) [16:07:52] godog: \o/ tools rule [16:09:15] indeed! 
I got puppet-syntax-vim too which will align itself sometimes [16:24:22] RECOVERY - Disk space on searchidx1001 is OK: DISK OK [16:29:41] <_joe_> !log manually created /srv/mediawiki bind mount on searchidx1001; moved old contents to /a/mediawiki-stale, to avoid filling the disk [16:29:45] Logged the message, Master [16:30:10] (03CR) 10Ori.livneh: [C: 031] "Yep." [puppet] - 10https://gerrit.wikimedia.org/r/162160 (owner: 10BryanDavis) [16:31:31] (03CR) 10Andrew Bogott: [C: 032] Remove git::clone support for converting clones to shared [puppet] - 10https://gerrit.wikimedia.org/r/162160 (owner: 10BryanDavis) [16:32:24] Thanks ori and andrewbogott [16:41:03] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [16:49:53] (03PS1) 10Giuseppe Lavagetto: searchidx: mount /srv/mediawiki as a bind from /a [puppet] - 10https://gerrit.wikimedia.org/r/162288 [16:50:22] <_joe_> ^d: ^^ [16:50:39] <^d> Looking. [16:50:52] <^d> Also, like I just said in -core, I think we need to restart the indexing process too. [16:51:02] <^d> It's been running since June, which would explain why it's looking in the wrong place for config. [16:51:07] <^d> (old values still in memory) [16:51:34] hah [16:51:45] (03CR) 10Chad: [C: 031] searchidx: mount /srv/mediawiki as a bind from /a [puppet] - 10https://gerrit.wikimedia.org/r/162288 (owner: 10Giuseppe Lavagetto) [16:52:52] since june.. 
[16:53:03] (03PS2) 10Giuseppe Lavagetto: searchidx: mount /srv/mediawiki as a bind from /a [puppet] - 10https://gerrit.wikimedia.org/r/162288 [16:53:14] ^d is too kind, massaging lucene still [16:53:17] (03CR) 10Giuseppe Lavagetto: [C: 032] searchidx: mount /srv/mediawiki as a bind from /a [puppet] - 10https://gerrit.wikimedia.org/r/162288 (owner: 10Giuseppe Lavagetto) [16:53:25] (03CR) 10Giuseppe Lavagetto: [V: 032] searchidx: mount /srv/mediawiki as a bind from /a [puppet] - 10https://gerrit.wikimedia.org/r/162288 (owner: 10Giuseppe Lavagetto) [16:53:38] (03CR) 10Aaron Schulz: [C: 032] Removed redundant config due to new job runner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161590 (owner: 10Aaron Schulz) [16:53:58] (03Merged) 10jenkins-bot: Removed redundant config due to new job runner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161590 (owner: 10Aaron Schulz) [16:54:04] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [16:56:53] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CRITICAL: Epic puppet fail [16:57:24] <_joe_> stupid puppet [16:57:27] (03PS1) 10Giuseppe Lavagetto: fix mount parameter [puppet] - 10https://gerrit.wikimedia.org/r/162290 [16:58:01] (03CR) 10Chad: [C: 031] fix mount parameter [puppet] - 10https://gerrit.wikimedia.org/r/162290 (owner: 10Giuseppe Lavagetto) [16:58:15] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] fix mount parameter [puppet] - 10https://gerrit.wikimedia.org/r/162290 (owner: 10Giuseppe Lavagetto) [16:59:22] !log aaron Synchronized wmf-config/jobqueue-eqiad.php: Removed redundant config due to new job runner (duration: 00m 05s) [16:59:27] Logged the message, Master [16:59:54] RECOVERY - puppet last run on searchidx1001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [17:00:08] <_joe_> ^d: now you may safely restart the daemon [17:00:14] <_joe_> and... I'm off [17:06:24] <^d> Gah what happened?!? 
[17:07:38] (03PS1) 10Filippo Giunchedi: swift: refactor into module, add codfw [puppet] - 10https://gerrit.wikimedia.org/r/162291 [17:15:37] (03PS1) 10Milimetric: Make archive table name configurable [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/162292 [17:15:39] (03CR) 10jenkins-bot: [V: 04-1] Make archive table name configurable [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/162292 (owner: 10Milimetric) [17:20:13] (03PS1) 10Milimetric: Configure archive table name [puppet] - 10https://gerrit.wikimedia.org/r/162293 [17:21:01] (03CR) 10jenkins-bot: [V: 04-1] Configure archive table name [puppet] - 10https://gerrit.wikimedia.org/r/162293 (owner: 10Milimetric) [17:21:13] (03CR) 10Aaron Schulz: [C: 032] Set $wgBloomFilterStores in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161382 (owner: 10Aaron Schulz) [17:21:25] (03Merged) 10jenkins-bot: Set $wgBloomFilterStores in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161382 (owner: 10Aaron Schulz) [17:31:15] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [17:36:50] (03PS1) 10MaxSem: Experiment: disable gadgets caching [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162297 [17:41:26] <_joe_> ^d: problems? [17:41:58] <^d> I can't get the indexer started. Kicking off the lucene.jobs.sh by hand ends up with a ton of RMI errors. [17:42:21] <^d> And puppet doesn't seem to start it. [17:43:31] <_joe_> ^d: ok, let me get on that server [17:43:56] <_joe_> is there any cache you may need to clear? [17:44:15] <^d> Not afaik. [17:44:40] <_joe_> also, how do you start it? [17:45:32] <_joe_> sorry but It's the first time I see that service [17:45:49] <_joe_> I think this may have something to do with the root being full [17:46:41] <^d> sudo -u lsearch /a/search/lucene.jobs.sh inc-updater-start should do it. [17:47:31] <_joe_> how's usually started? 
[17:47:38] <_joe_> just from CLI? [17:47:49] <^d> Yep. [17:47:50] <^d> Ah, two steps [17:48:15] <^d> /etc/init.d/lucene-search-2 start [17:48:20] <_joe_> ok [17:48:20] <^d> sudo -u lsearch /a/search/lucene.jobs.sh inc-updater-start [17:48:32] <_joe_> oh so both should start? [17:49:05] <^d> Less errors now. I think it's doing something. [17:49:42] hey, i'll take over for _joe_ so he can eat dinner [17:49:55] ^d: gimme directions! ;] [17:50:07] <^d> I think it might be better. [17:50:08] greg-g: Is there a chance I could get a provisional deploy window around 14:00 to push out like 6 patches to Commons? [17:50:09] <^d> Not sure yet. [17:50:13] <_joe_> robh: I think he may need help troubleshooting [17:50:14] <^d> It stopped giving as many errors. [17:50:18] We're trying to revamp our interfacez. [17:50:23] _joe_: we troubleshoot lsearchd? [17:50:38] i thought we merely hammered in more supports to hold up the tottering infrastructure until elastic takes over? [17:50:39] ;D [17:50:39] <_joe_> my general advice with java is: restart it. If it doesn't restart, try again [17:50:54] <^d> Well something's running now :p [17:50:58] <_joe_> if that doesn't do, check the logs :) [17:51:03] <_joe_> ^d: eheh ok, laters [17:51:06] _joe_: so you restarted it right? [17:51:07] thats all? [17:51:13] <^d> I just did. [17:51:15] <_joe_> robh: ^d did [17:51:19] cool [17:51:23] <_joe_> but, there were issues [17:51:23] ok, have a nice dinner [17:51:28] <_joe_> anyways, laters [17:52:31] marktraceur: to commons? are they tested/etc on beta? [17:52:52] marktraceur: to be clear: I'm asking because they sound like new things, instead of bugfixes [17:53:11] They're relocations of various features, so nothing 100% new per se, but a lot of movement [17:53:15] We can wait. [17:54:17] gotcha [17:54:23] <^d> Ok, arbcom_fiwiki is corrupted. [17:54:23] mediaviewer? 
[17:54:34] PROBLEM - Apache HTTP on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:54:35] PROBLEM - Apache HTTP on mw1203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:54:36] ^d: fuck [17:54:37] <^d> boardgovcomwiki too [17:54:43] they'll take ages to reindex [17:54:55] PROBLEM - Apache HTTP on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:04] PROBLEM - Apache HTTP on mw1130 is CRITICAL: Connection timed out [17:55:06] PROBLEM - Apache HTTP on mw1119 is CRITICAL: Connection timed out [17:55:14] PROBLEM - Apache HTTP on mw1121 is CRITICAL: Connection timed out [17:55:14] PROBLEM - Apache HTTP on mw1128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:14] PROBLEM - Apache HTTP on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:14] PROBLEM - Apache HTTP on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:25] ... [17:55:27] PROBLEM - Apache HTTP on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:27] PROBLEM - Apache HTTP on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:28] that doesnt look good [17:55:29] ???? 
[17:55:33] PROBLEM - Apache HTTP on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:33] PROBLEM - Apache HTTP on mw1137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:33] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [17:55:34] PROBLEM - Apache HTTP on mw1115 is CRITICAL: Connection timed out [17:55:34] PROBLEM - Apache HTTP on mw1146 is CRITICAL: Connection timed out [17:55:34] PROBLEM - Apache HTTP on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:34] PROBLEM - Apache HTTP on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:35] PROBLEM - Apache HTTP on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:35] PROBLEM - Apache HTTP on mw1206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:48] (03CR) 10Plucas: "Thanks for merging. Note that this change isn't really useful until this one is also merged into operations/debs/kafka: https://gerrit.wik" [puppet/kafka] - 10https://gerrit.wikimedia.org/r/162152 (owner: 10Plucas) [17:55:54] PROBLEM - Apache HTTP on mw1148 is CRITICAL: Connection timed out [17:55:54] PROBLEM - puppet last run on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:55:54] PROBLEM - Apache HTTP on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:54] PROBLEM - Apache HTTP on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:54] PROBLEM - Apache HTTP on mw1196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:55] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:55] PROBLEM - Apache HTTP on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:56] PROBLEM - RAID on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:55:56] PROBLEM - Apache HTTP on mw1139 is CRITICAL: Connection timed out [17:55:57] PROBLEM - Apache HTTP on mw1147 is CRITICAL: Connection timed out [17:55:57] PROBLEM - Apache HTTP on mw1125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:58] PROBLEM - Apache HTTP on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:04] PROBLEM - Apache HTTP on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:04] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:04] PROBLEM - Apache HTTP on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:04] PROBLEM - Apache HTTP on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:04] PROBLEM - Apache HTTP on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:09] api servers are spiked [17:56:24] PROBLEM - puppet last run on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:56:34] PROBLEM - Apache HTTP on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:34] PROBLEM - Apache HTTP on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:45] RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 914 seconds ago with 0 failures [17:56:46] RECOVERY - RAID on mw1147 is OK: OK: no RAID installed [17:56:54] PROBLEM - Apache HTTP on mw1118 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:57:00] PROBLEM - RAID on mw1120 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:57:05] PROBLEM - Apache HTTP on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:57:15] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 732 seconds ago with 0 failures [17:57:24] PROBLEM - Apache HTTP on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:57:25] PROBLEM - check configured eth on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:57:34] 1 recovery.. [17:57:34] PROBLEM - RAID on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:57:38] PROBLEM - Apache HTTP on mw1126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:57:55] PROBLEM - puppet last run on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:58:15] RECOVERY - check configured eth on mw1126 is OK: NRPE: Unable to read output [17:58:24] RECOVERY - RAID on mw1126 is OK: OK: no RAID installed [17:58:26] this might be NRPE being messed up? [17:58:36] ganglia showed a huge spike on them [17:58:39] ah, cool [17:58:40] its all mw api [17:58:44] PROBLEM - RAID on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:58:51] * YuviPanda has been messing around with icinga, was slightly worried [17:58:53] so something just smacked the shit out of them and i dunno what yet [17:58:55] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 772 seconds ago with 0 failures [17:59:02] (haven't touched NRPE tho) [17:59:30] PROBLEM - check configured eth on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:59:35] PROBLEM - RAID on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:59:54] RECOVERY - RAID on mw1120 is OK: OK: no RAID installed [17:59:55] load on them is 99+ [18:00:02] yeah I'm on 1117 [18:00:03] load average: 97.69, 82.31, 49.89 [18:00:08] wtf? [18:00:15] RECOVERY - check configured eth on mw1125 is OK: NRPE: Unable to read output [18:00:21] is this related to the search stuff? 
[18:00:24] greg-g: Yeah, media viewer stuff [18:00:24] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.039 second response time [18:00:24] RECOVERY - RAID on mw1123 is OK: OK: no RAID installed [18:00:25] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.314 second response time [18:00:28] So low-impart [18:00:31] impact [18:00:42] bblack: i dont know how it could be [18:00:44] RECOVERY - RAID on mw1143 is OK: OK: no RAID installed [18:00:53] 10.64.48.16: :real_connect(): (HY000/1040): Too many connections [18:00:58] <^d> bblack: No. Only thing impacted is old-search indexer. [18:01:07] PROBLEM - Apache HTTP on mw1124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:01:10] lsearch doesnt interface with api cluster afaik [18:01:10] <^d> Shouldn't have any connection with api. [18:01:19] two of us said so, it must be true! [18:01:24] RECOVERY - Apache HTTP on mw1121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.079 second response time [18:01:27] PROBLEM - LVS HTTP IPv4 on api.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:01:30] (chad's input on this is of a higher value than my own ;) [18:02:00] RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.674 second response time [18:02:00] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.091 second response time [18:02:00] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.104 second response time [18:02:00] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.196 second response time [18:02:09] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.826 second response time [18:02:09] RECOVERY - Apache HTTP on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved 
Permanently - 400 bytes in 1.421 second response time [18:02:10] RECOVERY - Apache HTTP on mw1203 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.491 second response time [18:02:10] ok, and now they are starting to go to lower load [18:02:13] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.989 second response time [18:02:20] RECOVERY - Apache HTTP on mw1124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.344 second response time [18:02:20] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.393 second response time [18:02:21] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.450 second response time [18:02:29] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.156 second response time [18:02:30] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.085 second response time [18:02:30] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.091 second response time [18:02:30] RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.106 second response time [18:02:30] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.197 second response time [18:02:30] RECOVERY - Apache HTTP on mw1207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.684 second response time [18:02:30] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.298 second response time [18:02:31] RECOVERY - Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.146 second response time [18:02:31] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.936 second response time [18:02:32] RECOVERY - 
Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.928 second response time [18:02:32] RECOVERY - Apache HTTP on mw1144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.151 second response time [18:02:38] bblack: im guessing you didnt do anything either and its just recovering? [18:02:39] RECOVERY - Apache HTTP on mw1130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.796 second response time [18:02:39] RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.446 second response time [18:02:39] RECOVERY - LVS HTTP IPv4 on api.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 4129 bytes in 0.359 second response time [18:02:40] I think MaxSem might be right with the mysql errors [18:02:50] yeah I've just been looking, not touching [18:02:50] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.801 second response time [18:02:50] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.282 second response time [18:02:51] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.536 second response time [18:03:01] Reedy / MaxSem it would make sense [18:03:02] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.195 second response time [18:03:11] RECOVERY - Apache HTTP on mw1126 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.109 second response time [18:03:12] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.266 second response time [18:03:14] Reedy, actually I might be not - they seem to be slightly in the past [18:03:20] and if it wasnt resolving, i may be inclined to possibly poke a dba [18:03:21] RECOVERY - Apache HTTP on mw1134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.171 second response time [18:03:21] 
RECOVERY - Apache HTTP on mw1118 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.251 second response time [18:03:21] RECOVERY - Apache HTTP on mw1193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.667 second response time [18:03:22] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.105 second response time [18:03:22] RECOVERY - Apache HTTP on mw1196 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.562 second response time [18:03:31] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.488 second response time [18:03:33] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.532 second response time [18:03:40] Ok... well, if it actually fixes itself, I'll just email ops list with the start and end time info [18:03:42] RECOVERY - Apache HTTP on mw1137 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.084 second response time [18:03:45] db1063 [18:04:05] altthough... no - it all started at the same time [18:04:23] db1063 is "just" an s2 slave [18:04:35] just errors were gone far sooner that the appservers recovered [18:04:47] they sometimes take a while to catch up [18:04:54] Reedy, other hosts were also mentioned;) [18:05:03] Yeah [18:05:05] <^d> fuckity fuck fuck fuck. [18:05:11] <^d> lsearchd relies on php message files. [18:05:21] mwahahahaha [18:05:22] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [18:05:22] <^d> so fixing their path won't do me a damn bit of good. [18:05:22] it loads them directly? [18:05:26] <^d> Yes. [18:05:29] <^d> It's retarded. [18:05:30] sweeeeeet [18:05:34] <^d> Capital R Retarded. [18:05:40] Fuck the wrapper [18:05:49] <^d> It has a PHP parser baked in. [18:05:51] <^d> It's sooooo lame. [18:05:55] wtf is does it need them for? [18:05:59] Profit [18:06:01] <^d> Namespaces. [18:07:08] what about custom namespaces? 
[18:07:53] <^d> Funny, the old message files are still there (months out of date, but w/e) [18:08:04] <^d> MaxSem: You're asking too much of lsearchd :p [18:08:24] just feed it the old files [18:08:37] <^d> Well that's what's weird. Old files are there. [18:08:40] <^d> It should see them [18:09:02] <^d> Oh, some of them are. [18:09:04] <^d> Not all of them [18:09:50] * ^d fires up eclipse and lsearchd. [18:09:54] <^d> What a fun day. [18:10:21] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [18:12:22] (03PS1) 10Reedy: Non wikipedias to 1.24wmf22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162303 [18:12:51] (03CR) 10Reedy: [C: 032] Non wikipedias to 1.24wmf22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162303 (owner: 10Reedy) [18:12:55] (03Merged) 10jenkins-bot: Non wikipedias to 1.24wmf22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162303 (owner: 10Reedy) [18:14:45] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias to 1.24wmf22 [18:14:51] Logged the message, Master [18:17:21] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [18:17:21] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [18:17:56] (03PS2) 10Reedy: Up image scaller wgMaxShellFileSize to 512MB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162038 [18:18:00] (03CR) 10Reedy: [C: 032] Up image scaller wgMaxShellFileSize to 512MB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162038 (owner: 10Reedy) [18:18:05] (03Merged) 10jenkins-bot: Up image scaller wgMaxShellFileSize to 512MB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162038 (owner: 10Reedy) [18:18:25] (03PS2) 10Reedy: Don't hardcode url-downloader.wikimedia.org:8080, use $wgCopyUploadProxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162230 [18:18:30] (03CR) 10Reedy: [C: 032] Don't hardcode 
url-downloader.wikimedia.org:8080, use $wgCopyUploadProxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162230 (owner: 10Reedy) [18:18:34] (03Merged) 10jenkins-bot: Don't hardcode url-downloader.wikimedia.org:8080, use $wgCopyUploadProxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162230 (owner: 10Reedy) [18:19:10] (03PS2) 10Reedy: Minor fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160890 (owner: 10MZMcBride) [18:19:14] (03CR) 10Reedy: [C: 032] Minor fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160890 (owner: 10MZMcBride) [18:19:19] (03Merged) 10jenkins-bot: Minor fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160890 (owner: 10MZMcBride) [18:20:41] is there gonna be an outage report? [18:21:42] (03PS2) 10Reedy: Enable UserMerge extension on all public wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160828 (https://bugzilla.wikimedia.org/68844) (owner: 10Legoktm) [18:21:42] 14:03 < robh> Ok... well, if it actually fixes itself, I'll just email ops list with the start and end time info [18:21:48] (03CR) 10Reedy: [C: 032] Enable UserMerge extension on all public wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160828 (https://bugzilla.wikimedia.org/68844) (owner: 10Legoktm) [18:21:53] (03Merged) 10jenkins-bot: Enable UserMerge extension on all public wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160828 (https://bugzilla.wikimedia.org/68844) (owner: 10Legoktm) [18:21:59] woo [18:24:02] (03PS3) 10Reedy: Enable Flow on [[mw:Talk:MediaWiki 1.25]] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158237 (owner: 10Jforrester) [18:24:06] (03CR) 10Reedy: [C: 032] Enable Flow on [[mw:Talk:MediaWiki 1.25]] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158237 (owner: 10Jforrester) [18:24:11] (03Merged) 10jenkins-bot: Enable Flow on [[mw:Talk:MediaWiki 1.25]] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158237 (owner: 10Jforrester) [18:24:13] (03PS3) 10Reedy: Enable 
Flow on [[mw:User talk:Jdforrester (WMF)]] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161172 (owner: 10Jforrester) [18:24:16] (03CR) 10Reedy: [C: 032] Enable Flow on [[mw:User talk:Jdforrester (WMF)]] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161172 (owner: 10Jforrester) [18:24:21] (03Merged) 10jenkins-bot: Enable Flow on [[mw:User talk:Jdforrester (WMF)]] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161172 (owner: 10Jforrester) [18:24:24] (03PS2) 10Reedy: Enable Flow on [[mw:Talk:HHVM/About]] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162002 (https://bugzilla.wikimedia.org/71048) (owner: 10Jforrester) [18:24:28] (03CR) 10Reedy: [C: 032] Enable Flow on [[mw:Talk:HHVM/About]] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162002 (https://bugzilla.wikimedia.org/71048) (owner: 10Jforrester) [18:24:33] (03Merged) 10jenkins-bot: Enable Flow on [[mw:Talk:HHVM/About]] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162002 (https://bugzilla.wikimedia.org/71048) (owner: 10Jforrester) [18:25:02] (03PS2) 10Reedy: Added refreshLinks to $wgJobBackoffThrottling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161603 (owner: 10Aaron Schulz) [18:25:05] Thanks Reedy. 
[18:25:08] (03CR) 10Reedy: [C: 032] Added refreshLinks to $wgJobBackoffThrottling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161603 (owner: 10Aaron Schulz) [18:25:14] (03Merged) 10jenkins-bot: Added refreshLinks to $wgJobBackoffThrottling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161603 (owner: 10Aaron Schulz) [18:25:37] (03PS2) 10Reedy: Add OTRS-member group to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161456 (https://bugzilla.wikimedia.org/70386) (owner: 10Jackmcbarn) [18:25:42] (03CR) 10Reedy: [C: 032] Add OTRS-member group to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161456 (https://bugzilla.wikimedia.org/70386) (owner: 10Jackmcbarn) [18:25:47] (03Merged) 10jenkins-bot: Add OTRS-member group to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161456 (https://bugzilla.wikimedia.org/70386) (owner: 10Jackmcbarn) [18:26:28] ori: Can https://gerrit.wikimedia.org/r/#/c/161468/ go out? [18:27:21] (03PS2) 10Reedy: Set bloom cache config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161021 (owner: 10Aaron Schulz) [18:27:27] (03CR) 10Reedy: [C: 032] Set bloom cache config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161021 (owner: 10Aaron Schulz) [18:27:31] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 9 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [18:27:33] (03Merged) 10jenkins-bot: Set bloom cache config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161021 (owner: 10Aaron Schulz) [18:27:45] * ^d throws some rocks at lsearchd [18:28:04] (03Abandoned) 10Reedy: Enable Flow on mw.org Talk:HHVM/About [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162124 (owner: 10Spage) [18:28:31] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. 
[18:28:47] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 16s) [18:28:52] Logged the message, Master [18:29:12] https://en.wikipedia.org/wiki/Special:UserMerge :D [18:29:26] 13 Warning: Redis::connect() [redis.connect]: php_network_getaddresses: getaddrinfo failed: Name or service not known in /srv/mediawik [18:29:26] i/php-1.24wmf21/includes/clientpool/RedisConnectionPool.php on line 206 [18:29:26] 3 Warning: Redis::connect() [redis.connect]: php_network_getaddresses: getaddrinfo failed: Name or service not known in /srv/mediawik [18:29:26] i/php-1.24wmf22/includes/clientpool/RedisConnectionPool.php on line 206 [18:29:36] I'm guessing 161021 wasn't ready to go [18:30:01] <^d> file:///a/search/conf/messages/MessagesTest2.php [18:30:06] (03CR) 10Reedy: "13 Warning: Redis::connect() [redis.connect]: php_network_getaddresses: getaddrinfo failed: Name or service n" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161021 (owner: 10Aaron Schulz) [18:30:06] <^d> lol, stupid lsearchd. [18:30:13] (03PS1) 10Reedy: Revert "Set bloom cache config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162313 [18:30:22] (03CR) 10Reedy: [C: 032] Revert "Set bloom cache config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162313 (owner: 10Reedy) [18:30:29] (03Merged) 10jenkins-bot: Revert "Set bloom cache config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162313 (owner: 10Reedy) [18:30:52] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 14s) [18:30:59] Logged the message, Master [18:34:22] <_joe_> ^d: I'm back if you need any help [18:34:44] Reedy: not yet (re: lua profiler) [18:35:08] (03CR) 10Reedy: [C: 04-1] "-1 per ori on irc: "not yet"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161468 (owner: 10Jackmcbarn) [18:37:17] <^d> _joe_: I think it's mostly working, it's just spammy logs. 
[18:38:29] <_joe_> you just described any java app [18:38:39] <_joe_> operating normally, I mean [18:44:39] (03PS1) 10Plucas: server::jmxtrans: add namespacing option [puppet/kafka] - 10https://gerrit.wikimedia.org/r/162320 [18:45:21] (03CR) 10Plucas: "I have tested this; if $group_prefix is left unspecified, there is no change in behavior." [puppet/kafka] - 10https://gerrit.wikimedia.org/r/162320 (owner: 10Plucas) [18:46:22] Reedy: I can connect to those servers just fine [18:54:49] was that apache2.log? [18:58:08] (03PS1) 10Aaron Schulz: Restore "Set bloom cache config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162333 [18:58:14] (03CR) 10Aaron Schulz: [C: 04-2] Restore "Set bloom cache config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162333 (owner: 10Aaron Schulz) [18:59:49] ottomata ^d manybubbles http://sbdevel.wordpress.com [19:00:43] AaronSchulz: /home/wikipedia/syslog/apache.log [19:01:49] (03CR) 10Ottomata: [C: 032 V: 032] Support extra java opts from $JAVA_OPTS [debs/kafka] - 10https://gerrit.wikimedia.org/r/162150 (owner: 10Plucas) [19:02:06] Reedy: is that the same as apache2 log on fluorine? [19:02:06] (03CR) 10Ottomata: "Ah, hadn't realized that wasn't merged. Done." [debs/kafka] - 10https://gerrit.wikimedia.org/r/162150 (owner: 10Plucas) [19:02:11] hello [19:02:39] (03PS1) 10Reza: Add OTRS-member group to fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162345 [19:02:54] AaronSchulz: nope [19:03:58] (03PS2) 10Reza: Add OTRS-member group to fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162345 (https://bugzilla.wikimedia.org/54368) [19:04:00] (03PS1) 10Gergő Tisza: Deploy ImageMetrics extension on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162358 (https://bugzilla.wikimedia.org/71188) [19:04:37] Reedy: which host? 
[19:04:48] fenari [19:04:49] :D [19:04:56] (03CR) 10Ottomata: [C: 032 V: 032] server::jmxtrans: add namespacing option [puppet/kafka] - 10https://gerrit.wikimedia.org/r/162320 (owner: 10Plucas) [19:05:42] AaronSchulz: they were increasing in frequency, hence reverting [19:05:57] * AaronSchulz hasn't used fenari in ages [19:09:21] PROBLEM - RAID on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:10:01] what's going on with that host? [19:10:02] RECOVERY - RAID on mw1139 is OK: OK: no RAID installed [19:15:21] mutante: Hey, HaeB reported experiencing https://github.com/ether/etherpad-lite/issues/2107 [19:15:37] There's a script to run that will fix it for the broken pad - https://etherpad.wikimedia.org/p/WRN201409 [19:20:39] marktraceur: I hope not every first comment on a bug from unknown people isn't https://github.com/ether/etherpad-lite/issues/2107#issuecomment-37647040 :) [19:21:07] I wouldn't be surprised [19:21:19] JohnMcLear can be a mite silly :) he's also way overextended [19:23:34] mutante: yes, that would be great; people are trying to use that pad right now. (i'm checking my other pads and fortunately this seems to be the only one affected so far) [19:23:50] HaeB: Did copying it out work? [19:24:02] Because if so, it wasn't the content. [19:24:44] marktraceur: all i see on that etherpad is a Loading screen [19:25:14] where would the script be you mentioned? [19:27:29] HaeB: doing what though? running checkPad ? [19:27:33] _joe_: are the memory leaks gone? [19:27:52] <_joe_> AaronSchulz: they present about once a week [19:28:01] 1.2G atm, which is OK [19:28:08] mutante: https://github.com/ether/etherpad-lite/blob/develop/bin/repairPad.js [19:28:27] ... i believe [19:28:35] https://github.com/ether/etherpad-lite/pull/2210 ? [19:30:32] greg-g: how's the train? [19:30:33] HaeB: ok..looking [19:30:54] greg-g: i'd like to do an OCG deploy if now's a good time. 
[19:31:28] mutante: On the github issue report [19:31:34] Maybe in stable already, not sure [19:31:46] Yeah, the link you have there [19:31:58] Sounds like it was just a freak corruption. [19:33:21] WARNING: This script must not be used while etherpad is running! [19:33:23] hrmm [19:33:47] Hrmm indeed. [19:34:35] and now i actually see an error [19:35:02] Script error. in https://etherpad.wikimedia.org/javascripts/lib/ep_etherpad-lite/static/js/pad.js?callback=require.define at line 0' [19:36:32] marktraceur: HaeB : tried.. but i dont see a difference yet on that pad :/ [19:37:26] cscott: good to go [19:37:35] greg-g: ok, thanks. [19:38:47] (03PS1) 10RobH: adding ms-be2012 mac address to install module [puppet] - 10https://gerrit.wikimedia.org/r/162368 [19:38:53] me neither :( [19:38:56] !log stopped etherpad, added repairPad.js, attempted repair of pad 'WRN201409', started etherpad [19:39:01] Logged the message, Master [19:39:15] but the script has run? [19:40:01] yea, it spits out this warning about not running it while etherpad is running [19:40:04] and then that's it [19:40:48] and it doesnt seem to matter if etherpad is actually running or not [19:40:54] it warns anyways [19:40:55] oh, even though etherpad was actually stopped? [19:40:59] yes [19:41:10] stopped it and checked the process was gone [19:41:39] right, the script doesn't seem to check if the process is running, right? it prints the warning anyway [19:41:59] yep [19:42:20] and $PADID is really just the name of the pad, correct? [19:42:48] at least i wouldn't know of other IDs assigned to them [19:43:07] i'm not sure. marktraceur ? [19:43:25] he's the etherpad expert ;) [19:44:39] not getting to this: console.info("finished"); [19:44:42] but also no error [19:48:16] checkPad.js and extractPadData.js dont have comments either [19:48:51] those were already installed it checkPad is supposed to check for data corruption [19:49:52] oh.. ? 
[19:49:56] padID a string, format is GROUPID$PADNAME, for example the pad test of group g.s8oes9dhwrvt0zif has padID g.s8oes9dhwrvt0zif$test [19:50:21] i guess i'm not using the right padid [19:55:11] mutante: from http://etherpad.org/doc/v1.2.1/#index_custom_static_files it does look like they actually do use 'padid' just for that part of the pad's URL [19:55:43] .. so "WRN201409" should have been correct here [19:57:01] HaeB: ok.. then they didn't work [19:57:13] asked in #etherpad.. but .. yea..sigh [19:58:40] HaeB: is it a lot of content that is lost? [19:58:52] i wonder if it would break again on a different pad name, with identical content [19:59:07] (03CR) 10RobH: [C: 032] adding ms-be2012 mac address to install module [puppet] - 10https://gerrit.wikimedia.org/r/162368 (owner: 10RobH) [20:00:03] mutante: actually the export URL still works - i've now been able to extract the plaintext: https://etherpad.wikimedia.org/p/WRN201409/export/txt [20:00:20] it's just that all the formatting (including line breaks) appears lost [20:00:30] and the URL remains broken [20:00:44] HaeB: great, i was about to ask that next.. can we still export it.. actually that is what i was hoping to get with extractPadData [20:01:12] it would still be useful to copy it over directly in the database, to preserve the formatting [20:01:12] HaeB: we could try deleting the pad and recreating it to fix the URL itself? [20:01:44] yes that should work [20:01:48] the database is one giant key/value store [20:01:54] afaict [20:01:57] the db also has all the attribution and history of course [20:02:15] not terribly important to preserve in this case though [20:04:49] well, deletePad doesnt do much either.. wth [20:05:11] looks for that admin login on web now [20:07:05] it appears the password i have on file isn't correct [20:07:18] thanks! 
and perhaps we should document this somewhere as an emergency procedure [20:07:29] well, but nothing works :p [20:07:54] (especially when it happens during a meeting where a pad is used for note-taking ;) [20:07:55] oh [20:11:22] oh, look, it wrote a .db file [20:11:29] WRN201409.db [20:13:35] that has the authors etc, so extractPadData works [20:14:54] A tool for deleting pads from the CLI, because sometimes a brick is required to fix a window. [20:15:03] haha, right, but the brick fails [20:15:42] no bricks in a house built of glass please [20:16:32] any known issue ? I'm getting internal Error: bad token when uploading files [20:16:44] ah, and now when editing too [20:17:21] matanya: uploading on commons? [20:17:27] yes [20:18:03] uuh.. that doesn't sound too good. i suppose swift [20:18:34] is mesos being used in the cluster right now? [20:19:10] I see https://en.wikipedia.org/wiki/JobServer#Mesos_clustering [20:19:13] (03CR) 10Rush: [C: 04-1] "So this has been talked about a bit already. I have thought it is dangerous and silly to let users change these 'fixed' historical identif" [puppet] - 10https://gerrit.wikimedia.org/r/162161 (owner: 10Chad) [20:19:15] but can’t find it in puppet [20:19:21] paravoid: ^^ [20:21:10] (03CR) 10Rush: [C: 04-1] "the labstore hosts need to be figured out, there is a conflict between ldap and puppet account management. There is a lot of chatter abou" [puppet] - 10https://gerrit.wikimedia.org/r/160628 (owner: 10Matanya) [20:21:34] Oh that’s just a project page oh I confused [20:23:14] yeah, it's not [20:23:28] (03CR) 10Rush: [C: 04-2] "This is not the right way to go about this. So here is the rub, this is not a stacking sort of process. 
We can't disable a handful here," [puppet] - 10https://gerrit.wikimedia.org/r/161624 (owner: 10Aklapper) [20:24:23] (03Abandoned) 10Rush: *.wmfusercontent.org ssl termination on web-misc cluster [puppet] - 10https://gerrit.wikimedia.org/r/159820 (owner: 10Rush) [20:24:32] (03Abandoned) 10Rush: allow only for localssl protoproxy [puppet] - 10https://gerrit.wikimedia.org/r/159961 (owner: 10Rush) [20:24:58] !log updated OCG to version 1cf9281ec3e01d6cbb27053de9f2423582fcc156 [20:25:03] Logged the message, Master [20:25:40] <_joe_> preilly: no [20:26:06] greg-g: ok, done with ocg deploy. thanks! [20:26:40] _joe_: thanks! [20:27:31] <_joe_> preilly: it is a very interesting software though, in my opinion [20:28:14] _joe_: yeah totally I sold my company OrlyAtomics to Mesosphere and Mesosphere is the company behind Apache Mesos http://techcrunch.com/2014/09/17/mesosphere-snags-orlyatomics-in-acquihire-deal/ [20:32:16] cscott: thank you! :) [20:44:17] (03PS1) 10BBlack: be explicit about server names for misc default [puppet] - 10https://gerrit.wikimedia.org/r/162457 [20:44:35] (03CR) 10BBlack: [C: 032] be explicit about server names for misc default [puppet] - 10https://gerrit.wikimedia.org/r/162457 (owner: 10BBlack) [20:45:53] (03PS1) 10Plucas: 0.8.1.1-3 release [debs/kafka] - 10https://gerrit.wikimedia.org/r/162458 [20:46:55] (03PS1) 10Dzahn: minor typo fix in url_downloader role [puppet] - 10https://gerrit.wikimedia.org/r/162459 [20:51:22] (03CR) 10Rush: "just a note, please don't "test" configurations through web in anticipation of fixed_settings. These settings override each other and can" [puppet] - 10https://gerrit.wikimedia.org/r/162219 (owner: 10Aklapper) [20:53:14] ACKNOWLEDGEMENT - url_dowloader on chromium is CRITICAL: Connection refused daniel_zahn its running - but on the second IP on the same host [20:55:19] (03CR) 10Qgil: "Calling it "Reference" and making it non-editable is good. I agree that we don't need to remove right now." 
[puppet] - 10https://gerrit.wikimedia.org/r/162161 (owner: 10Chad) [20:55:44] (03PS2) 10Qgil: T458: Rename ext_ref description and hide it from users [puppet] - 10https://gerrit.wikimedia.org/r/162161 (owner: 10Chad) [21:00:51] (03PS1) 10Rush: set phab file domain [puppet] - 10https://gerrit.wikimedia.org/r/162461 [21:01:04] (03CR) 10Qgil: "I am the one to blame ref touching the configuration via web. I tried in phab-01 before touching phabricator.wikimedia.org, though. Ref to" [puppet] - 10https://gerrit.wikimedia.org/r/162219 (owner: 10Aklapper) [21:01:55] (03PS1) 10Dzahn: fix monitoring of url-downloader [puppet] - 10https://gerrit.wikimedia.org/r/162463 [21:02:38] (03CR) 10Rush: [C: 032] set phab file domain [puppet] - 10https://gerrit.wikimedia.org/r/162461 (owner: 10Rush) [21:02:58] (03CR) 10Reedy: [C: 031] minor typo fix in url_downloader role [puppet] - 10https://gerrit.wikimedia.org/r/162459 (owner: 10Dzahn) [21:03:47] (03CR) 10Dzahn: [C: 032] minor typo fix in url_downloader role [puppet] - 10https://gerrit.wikimedia.org/r/162459 (owner: 10Dzahn) [21:07:14] (03CR) 10Dzahn: "also applies to linne currently, but we are shutting that down" [puppet] - 10https://gerrit.wikimedia.org/r/162463 (owner: 10Dzahn) [21:15:18] (03PS1) 10Andrew Bogott: Add a couple of custom schema that were present on virt1000 but not puppetized [puppet] - 10https://gerrit.wikimedia.org/r/162468 [21:19:03] (03CR) 10Andrew Bogott: [C: 032] Add a couple of custom schema that were present on virt1000 but not puppetized [puppet] - 10https://gerrit.wikimedia.org/r/162468 (owner: 10Andrew Bogott) [21:25:53] !log ebernhardson Started scap: Bump flow submodule (and change an i18n message) in 1.24wmf21 and 1.24wmf22 [21:25:57] Logged the message, Master [21:54:07] !log ebernhardson Finished scap: Bump flow submodule (and change an i18n message) in 1.24wmf21 and 1.24wmf22 (duration: 28m 14s) [21:54:12] Logged the message, Master [21:56:24] (03PS2) 10Dzahn: fix monitoring of url-downloader 
[puppet] - 10https://gerrit.wikimedia.org/r/162463 [21:57:18] uh, 28 mins is longggg [21:57:31] (03PS3) 10Dzahn: fix monitoring of url-downloader [puppet] - 10https://gerrit.wikimedia.org/r/162463 [21:58:26] (03CR) 10Dzahn: "PS2: check the hostname instead of IP (some will say to not rely on DNS in icinga config, some will say to not hardcode the IP address)" [puppet] - 10https://gerrit.wikimedia.org/r/162463 (owner: 10Dzahn) [21:58:28] (03CR) 10Aklapper: "Would prefer to get the full footer merged, even if not correctly displayed on some pages and making us look ridiculous. Upstream already " [puppet] - 10https://gerrit.wikimedia.org/r/162219 (owner: 10Aklapper) [21:59:24] (03CR) 10Dzahn: [C: 032] "@neon:~# /usr/lib/nagios/plugins/check_tcp -H url-downloader.wikimedia.org -p 8080" [puppet] - 10https://gerrit.wikimedia.org/r/162463 (owner: 10Dzahn) [22:01:16] MaxSem: but how to make it faster? the breakout seems reasonable: 1m rsync common, 7m building LocalisationCache for 2x branches, 2.5m syncing proxies, 10m syncing apaches, 8m rebuilding cdb's [22:01:51] also flow is done with deploy window [22:02:06] (03CR) 10Rush: "no -1 from me, better lawful and ugly than reverse" [puppet] - 10https://gerrit.wikimedia.org/r/162219 (owner: 10Aklapper) [22:02:17] well, it used to take just 20 not so long ago:( [22:04:44] !log aaron Synchronized php-1.24wmf22/includes/jobqueue/JobRunner.php: f23f1ad35f02f6a17c9b5842aa6d8c152a273639 (duration: 00m 04s) [22:04:49] Logged the message, Master [22:15:53] (03PS1) 10RobH: setting neptunium dns entries [dns] - 10https://gerrit.wikimedia.org/r/162482 [22:16:09] (03CR) 10jenkins-bot: [V: 04-1] setting neptunium dns entries [dns] - 10https://gerrit.wikimedia.org/r/162482 (owner: 10RobH) [22:16:19] bah, thats never good [22:18:00] (03PS2) 10RobH: setting neptunium dns entries [dns] - 10https://gerrit.wikimedia.org/r/162482 [22:19:20] (03CR) 10RobH: [C: 032] setting neptunium dns entries [dns] - 10https://gerrit.wikimedia.org/r/162482 
(owner: 10RobH) [22:20:16] something is duplicate on neon, about new HTTPS monitoring [22:20:23] Duplicate definition found for service 'HTTPS' on host 'cp1008' .. a bunch of those [22:20:44] also, why did the service i just renamed just disappear , heh :p [22:20:57] MaxSem: There are 2 things I know of slowing scap down: fenari and the increased load on the job runners. One of the rsync slaves is a jobrunner (I don't remember which one) and it is taking longer to handle requests. [22:21:06] i think that may be bblack's testing mutante [22:21:16] cuz thats a low # cp, i think its the temp host he spun up last week [22:21:53] but not sure [22:22:22] robh: makes sense, it's just that one host, thx [22:22:39] bd808, what's slowing fenari down? [22:22:44] yea, i imagine it'll fail for ssl tests since its a new config for ssl and sni use [22:23:03] MaxSem: 2 cpus cores that are maxed most of the time the last time I looked. [22:23:11] yea, it makes sense, if there is one check for each server name now [22:23:17] doing what? :P [22:23:20] but they are all called "HTTPS" ..then they are duplicates for icinga [22:23:34] just warnings though.. there is more important stuff [22:24:12] MaxSem: Maybe just network lag? -- https://ganglia.wikimedia.org/latest/graph_all_periods.php?h=fenari.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1411511004&g=cpu_report&z=large&c=Miscellaneous%20pmtpa [22:24:46] That's a lot of iowait [22:24:54] (03CR) 10Dzahn: "recovered" [puppet] - 10https://gerrit.wikimedia.org/r/162463 (owner: 10Dzahn) [22:25:30] But it will be gone soon \o/ [22:26:03] Then we just need to be smart about adding plenty of rsync slaves as we ramp up codfw [22:26:11] and mayb eadd a few more in pmtpa [22:26:16] *maybe add [22:26:43] OR SWITCH TO BITTORRENT! [22:27:01] also, what pmtpa? 
:P [22:27:06] eqiad [22:27:09] whatever [22:27:14] :p [22:27:59] I talked to an ops from spotify who said they are actually using bittorrent internally for large file transfers [22:28:08] "fix monitoring of poolcounter service" [22:28:12] bittorrent might help with cdbs and allow us to drop the build step at the end [22:28:31] ^ any hints who would possibly work on that in the future? [22:28:33] It might also let us add hhbc to the mix [22:28:40] just for triaging [22:28:58] let's start like this: is it RT or Bugzilla or phab?:) [22:29:04] chasemp: twitter and FB do all of their deploys with BT [22:29:10] (03PS1) 10EBernhardson: Flow enable mw:Talk:Mediawiki UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162486 [22:29:28] British Telecom? [22:29:30] interesting twitter I had heard, but not fb [22:29:49] I have notes about the FB process from a one of their release engineers. It's pretty slick [22:30:15] YuviPanda: fucking British Telecom [22:30:27] Reedy: :D Sky seems worse [22:30:50] BT get the blame for a lot of the basic infrastructure too [22:31:03] They build a squashfs image of hhvm, the hhbc cache and all the other things they need on the servers, sync it with bittorrent and then mount the image [22:31:05] (03PS2) 10Prtksxna: Flow enable mw:Talk:Mediawiki UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162486 (https://bugzilla.wikimedia.org/71204) (owner: 10EBernhardson) [22:31:09] YuviPanda: Leave sky alone; it has the limit :( [22:31:22] YuviPanda: hey, we need a new check_graphite i think [22:31:42] bd808: how often do they sync I wonder 30xdaily or 1x daily etc [22:31:45] YuviPanda: " Determin good enough value for check_graphite MediaWiki.PoolCounter.Client.acquireForAnyone.tp99" ?:) [22:32:30] chasemp: 2x daily [22:32:52] or 3x when the 2x goes wrong :) [22:33:25] (03PS1) 10RobH: setting neptunium install params [puppet] - 10https://gerrit.wikimedia.org/r/162489 [22:33:44] They run a mixed cluster too. 
At any given time they have "several" versions live on their cluster, but only one per server. [22:34:26] They have a fancy request routing layer that lets them pick how much traffic to send to which group of servers [22:34:36] mutante: too sleepy to figure that one out :) [22:34:48] cookie routing? I've done this in the past [22:34:53] I don't have many details on that but would be interested in knowing more about it [22:35:02] both for having a "live" staging for known users, and for balancing across farms [22:35:26] mutante: I should consider upstreaming our changes as well, at some point [22:36:25] YuviPanda: would you be interested in that ticket in general? it's an RT about adding poolcounter monitoring and it says to use check_graphite [22:36:29] (03CR) 10RobH: [C: 032] setting neptunium install params [puppet] - 10https://gerrit.wikimedia.org/r/162489 (owner: 10RobH) [22:36:33] mutante: oh, sure, add me! [22:36:49] YuviPanda: cool, that's all i wanted for now :) [22:36:54] mutante: :D [22:36:58] I can take a look tomorrow [22:37:08] all i do is triage the queue.. cool, thx [22:37:30] :) [22:39:17] * YuviPanda hopes to have killed files/icinga in a few days, and misc/icinga.pp in a week or two [22:39:37] bd808: I know a guy at FB that works on their "Buck" build system, if that would be helpful. http://facebook.github.io/buck/ [22:40:39] YuviPanda: wow, nice [22:40:58] mutante: files/icinga/plugin-config has been emptied of 35 files :) [22:41:15] and icinga.pp is shorter by 200 [22:41:39] heh, but they really were not in use, right [22:41:45] mutante: they were! [22:41:52] some were, at least [22:41:56] check_http lived there, I think [22:42:03] dns checks too [22:42:21] I should probably do a grep to see if any of them are used [22:42:22] wait, what? 
check_http is installed from distro package [22:42:24] and kill unused ones [22:42:38] you are not removing check_http, are you [22:43:06] no, moved them into a module [22:43:18] chrismcmahon: greg-g has a contact (and I guess I do too) on their release team. I haven't bugged him much since he was nice enough to drive up to SF and talk to us in May. [22:43:26] YuviPanda: ah, i see [22:44:03] bd808: caltrain, but yeah. Nice guy. [22:44:11] :D [22:44:12] am off now [22:44:14] * YuviPanda waves [22:44:15] chrismcmahon: buck is only for android, right? [22:44:42] greg-g: and Java too pretty sure [22:45:21] * greg-g nods [22:45:41] not part of our team's bailiwick (yet?) ;) [22:51:54] (PS1) Reedy: Bump wgMaxImageArea to 75MP [mediawiki-config] - https://gerrit.wikimedia.org/r/162492 [22:57:01] (PS1) Dzahn: NTP service aliases, switch eqiad, add esams [dns] - https://gerrit.wikimedia.org/r/162496 [22:57:12] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [22:57:12] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [22:57:49] (CR) Dzahn: "how about ntp.codfw ?" [dns] - https://gerrit.wikimedia.org/r/162496 (owner: Dzahn) [22:59:13] (PS2) Dzahn: NTP service aliases, switch eqiad, add esams [dns] - https://gerrit.wikimedia.org/r/162496 [23:00:37] (CR) Dzahn: "see -> https://gerrit.wikimedia.org/r/#/c/162496/2" [puppet] - https://gerrit.wikimedia.org/r/162175 (owner: Dzahn) [23:00:39] * MaxSem volunteers to swat [23:01:35] * legoktm is here [23:02:47] MaxSem: We're pulling Roan's SWAT item for today. [23:02:54] MaxSem: Doing tomorrow morning instead. [23:03:01] MaxSem: Sorry! 
[23:03:06] okay [23:04:56] (PS3) Dzahn: NTP client config - use rubidium/eeden as servers [puppet] - https://gerrit.wikimedia.org/r/162175 [23:05:42] (CR) Dzahn: "now using $servers =['ntp.eqiad.wmnet', 'ntp.esams.wmnet']," [puppet] - https://gerrit.wikimedia.org/r/162175 (owner: Dzahn) [23:06:34] !log maxsem Synchronized php-1.24wmf21/extensions/MassMessage/: https://gerrit.wikimedia.org/r/#/c/161002/ (duration: 00m 03s) [23:06:39] Logged the message, Master [23:06:45] did we intentionally kill http://bits.wikimedia.org/skins/common/images/poweredby_mediawiki_88x31.png today? [23:06:51] (replaced by some better image?) [23:06:56] legoktm, ^ [23:07:13] bblack, it got moved to a different location [23:07:40] I assume the links to it were updated long before right? [23:07:47] nice, that is the monitoring image :) [23:08:25] legoktm, spam appears gone [23:08:30] yay :) [23:08:39] how about something that isn't even in /skins/ [23:09:12] resources/assets I think [23:09:58] is there a link to the relevant change somewhere we can dig up? [23:10:01] (CR) MaxSem: [C: 2] Experiment: disable gadgets caching [mediawiki-config] - https://gerrit.wikimedia.org/r/162297 (owner: MaxSem) [23:10:11] (Merged) jenkins-bot: Experiment: disable gadgets caching [mediawiki-config] - https://gerrit.wikimedia.org/r/162297 (owner: MaxSem) [23:10:13] hrmm [23:10:19] we need to reference a static image in the check [23:10:36] MaxSem: know of an image that will likely not change location anytime soon? [23:10:49] (just asking you since you knew the other answer ;) [23:10:57] http://bits.wikimedia.org/favicon/commons.ico ? [23:11:00] hmm, everything changes so fast:) [23:11:04] favicons ok? 
[23:11:45] favicons are served by favicon.php :P [23:11:45] http://bits.wikimedia.org/favicon/wikitech.ico [23:11:56] i'd say ideally another .png image [23:12:01] so it's testing exactly the same thing [23:12:01] well, i can also link to the ico like above [23:12:08] without .php [23:12:16] but i guess it would work [23:12:17] they are just files in docroot/bits/favicon/ [23:12:44] bblack: https://gerrit.wikimedia.org/r/#/c/161030/ [23:12:44] at first it was a new toplevel /assets but then it got moved under resources/ [23:12:44] https://gerrit.wikimedia.org/r/#/c/158648/ created top level /assets [23:12:58] !log maxsem Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/162297/ (duration: 00m 03s) [23:13:03] Logged the message, Master [23:13:17] icinga-wm, now tell us I broke something:P [23:13:38] docroot/bits/DolphinBrowser/blackberry/ ... :) [23:13:51] legoktm: so the new URL didn't exist until a few days ago right? this isn't a caching issue? [23:13:53] PROBLEM - puppet last run on db2011 is CRITICAL: CRITICAL: Epic puppet fail [23:14:58] maybe? [23:15:04] (PS1) MaxSem: Revert "Experiment: disable gadgets caching" [mediawiki-config] - https://gerrit.wikimedia.org/r/162501 [23:15:13] (CR) MaxSem: [C: 2] Revert "Experiment: disable gadgets caching" [mediawiki-config] - https://gerrit.wikimedia.org/r/162501 (owner: MaxSem) [23:15:19] (Merged) jenkins-bot: Revert "Experiment: disable gadgets caching" [mediawiki-config] - https://gerrit.wikimedia.org/r/162501 (owner: MaxSem) [23:15:27] MaxSem: it didn't work? [23:15:40] !log maxsem Synchronized wmf-config/CommonSettings.php: fail! (duration: 00m 04s) [23:15:45] Logged the message, Master [23:15:52] gadgets were just gone:P [23:15:59] eh [23:16:12] uhoh. 
[23:16:34] fixed [23:20:18] (PS1) Dzahn: delete entire DolphinBrowser directory from bits [mediawiki-config] - https://gerrit.wikimedia.org/r/162505 [23:20:41] (PS1) Rush: phabricator add header module [puppet] - https://gerrit.wikimedia.org/r/162506 [23:21:19] +0, -53424 .. good for ohloh [23:22:18] (CR) Rush: [C: 2] phabricator add header module [puppet] - https://gerrit.wikimedia.org/r/162506 (owner: Rush) [23:23:34] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [23:23:34] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [23:27:30] (CR) Reedy: [C: 1] "lol. Should get Mobile people to confirm we can actually kill it :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/162505 (owner: Dzahn) [23:33:14] RECOVERY - puppet last run on db2011 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [23:42:46] (CR) Aklapper: "So we can agree on "edit: false" and "view: true" for the time being?" [puppet] - https://gerrit.wikimedia.org/r/162161 (owner: Chad) [23:47:03] !log maxsem Synchronized php-1.24wmf22/extensions/MobileFrontend/: (no message) (duration: 00m 04s) [23:47:08] Logged the message, Master [23:47:42] (PS2) Kaldari: Enable WikiGrok for prototype testing on enwiki mobile [mediawiki-config] - https://gerrit.wikimedia.org/r/158512 [23:47:59] (CR) MaxSem: [C: 1] "Provisional +1 pending approval from Zero team." [mediawiki-config] - https://gerrit.wikimedia.org/r/162505 (owner: Dzahn)