[00:02:08] Krenair, kaldari: can I proceed with the GlobalUsage patch? [00:02:20] or are you still working on WikiGrok? [00:02:31] still dealing with wikigrok [00:02:34] tgr: go for it. it’ll take me a minute to clean things up [00:02:40] oops, nevermind [00:02:48] 6operations: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1111678 (10Dzahn) 5Open>3declined a:3Dzahn The issue is still unchanged. I attempted another reinstall of rbf2001 and: ``` │ Malformed IP address │ │ The IP addres... [00:02:49] 7Puppet, 6operations, 5Patch-For-Review, 3wikis-in-codfw: Check that the redis roles can be applied in codfw, set up puppet. - https://phabricator.wikimedia.org/T86898#1111682 (10Dzahn) [00:02:50] 6operations, 5Patch-For-Review, 3wikis-in-codfw: Setup redis clusters in codfw - https://phabricator.wikimedia.org/T86887#1111681 (10Dzahn) [00:03:04] 7Puppet, 6operations, 5Patch-For-Review, 3wikis-in-codfw: Check that the redis roles can be applied in codfw, set up puppet. - https://phabricator.wikimedia.org/T86898#979090 (10Dzahn) [00:03:05] 6operations, 5Patch-For-Review, 3wikis-in-codfw: Setup redis clusters in codfw - https://phabricator.wikimedia.org/T86887#978964 (10Dzahn) [00:03:06] 6operations: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1111683 (10Dzahn) 5declined>3Open [00:04:19] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [500.0] [00:06:22] (03PS1) 10Tim Landscheidt: Tools: Fix and clean up generation of /etc/ssh/ssh_known_keys [puppet] - 10https://gerrit.wikimedia.org/r/196125 (https://phabricator.wikimedia.org/T92379) [00:06:45] Krenair: I’ll need you to create the submodule update for wmf20 again, unless you want to wait 10 minutes on me. 
[00:06:56] I already did [00:07:00] thanks [00:07:05] almost done with wmf21 [00:07:08] waiting for jenkins to finish testing it [00:07:32] kaldari, I assume you're just going to update https://gerrit.wikimedia.org/r/#/c/196106/ ? [00:10:51] Krenair: yes, should be ready now: https://gerrit.wikimedia.org/r/#/c/196106/ [00:11:24] !log krenair Synchronized php-1.25wmf20/extensions/WikiGrok/includes/Hooks.php: https://gerrit.wikimedia.org/r/#/c/196122/ (duration: 00m 07s) [00:11:32] Logged the message, Master [00:11:37] This seems okay. [00:11:47] yay [00:12:38] Krenair: doing a quick test... [00:12:58] Krenair: change works great [00:13:04] great [00:13:05] Krenair: ready for wmf21 [00:13:14] 6operations: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1111701 (10Dzahn) ``` BusyBox v1.22.1 (Debian 1:1.22.0-15) built-in shell (ash) Enter 'help' for a list of built-in commands. ~ # ip a s 1: lo: mtu 65536 qdisc noqueue link/loopback 00:00:00:00:00:... [00:13:15] done, waiting for jenkins [00:18:58] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:20:38] Krenair: So when you have to revert to before a submodule update, are the steps: git the previous core commit from the log, git checkout that commit, do a submodule update for the extension, re-sync the extension dir? [00:21:53] !log krenair Synchronized php-1.25wmf21/extensions/WikiGrok/includes/Hooks.php: https://gerrit.wikimedia.org/r/#/c/196106/ (duration: 00m 05s) [00:21:59] Logged the message, Master [00:22:25] seems okay... kaldari? [00:22:39] Krenair: yep, wmf21 works good [00:23:00] okay, tgr? [00:23:09] kaldari: I'm not sure I understand your question [00:23:36] Krenair: still need the config change when you have a chance [00:23:50] "revert to before a submodule update" - you mean, reverting just the submodule update? [00:23:56] kaldari, what, https://gerrit.wikimedia.org/r/#/c/196083/ ? [00:23:57] Krenair: or is that already done? 
[00:24:09] that was done [00:24:11] yes, thanks! [00:24:38] Krenair: like just now, when you had to fix en.wiki, what steps did you take exactly? [00:24:58] went to the WikiGrok directory, checked out the last known good commit [00:25:00] synced it [00:25:49] it needs to be cleaned up before the next person comes along, but it unbreaks the site which is the priority [00:26:01] that makes sense [00:27:26] 6operations, 10ops-codfw, 10hardware-requests, 3wikis-in-codfw: Procure rdb2001-2004 - onsite pending racking - https://phabricator.wikimedia.org/T86896#1111735 (10Dzahn) renamed in racktables from rbd to rdb 2001-2004 [00:31:39] 6operations, 10ops-codfw, 10hardware-requests, 3wikis-in-codfw: Procure rdb2001-2004 - onsite pending racking - https://phabricator.wikimedia.org/T86896#1111738 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/196104/ https://gerrit.wikimedia.org/r/#/c/195943/ https://gerrit.wikimedia.org/r/#/c/195868/1 and... [00:31:53] kaldari, in this case, the way we cleaned up was reverting the bad commit of the two [00:32:23] so the complete diff became just the good commit, effectively, which could be synched [00:32:50] true [00:32:57] (03PS1) 10GWicke: WIP: Merge Iaa3bbf07b6053e139dc [puppet] - 10https://gerrit.wikimedia.org/r/196128 [00:33:47] (03CR) 10jenkins-bot: [V: 04-1] WIP: Merge Iaa3bbf07b6053e139dc [puppet] - 10https://gerrit.wikimedia.org/r/196128 (owner: 10GWicke) [00:37:05] (03PS1) 10Dzahn: netboot: fix conflicting rdb entries, missing echo [puppet] - 10https://gerrit.wikimedia.org/r/196129 (https://phabricator.wikimedia.org/T92011) [00:38:11] (03CR) 10Dzahn: [C: 032] netboot: fix conflicting rdb entries, missing echo [puppet] - 10https://gerrit.wikimedia.org/r/196129 (https://phabricator.wikimedia.org/T92011) (owner: 10Dzahn) [00:38:53] Krenair: the security patches should survive a git pull --rebase, right? [00:39:07] um [00:39:17] tgr, are you not following the instructions? 
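The recovery flow Krenair describes above — go to the extension directory, check out the last known good commit, sync it to unbreak the site, then clean up by reverting the bad commit so the deployed tree matches the branch — can be sketched against a throwaway repository. Everything here (paths, commit messages) is a hypothetical stand-in; the real flow ends with a sync of the extension directory rather than a purely local checkout.

```shell
# Sketch of the hotfix-then-cleanup flow discussed above, using a
# disposable git repo. Commit contents are hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.email=ops@example.org -c user.name=ops \
    commit -q --allow-empty -m "last known good"
good=$(git rev-parse HEAD)           # note the last known good commit
git -c user.email=ops@example.org -c user.name=ops \
    commit -q --allow-empty -m "bad submodule bump"
# unbreak first: put the working tree back on the known-good commit
# (detached HEAD); in production this is followed by syncing the dir
git checkout -q "$good"
git log --oneline -1
```

The later cleanup ([00:31:53]) is then a revert of the bad commit on the branch, so the complete diff becomes just the good commit and can be synced normally.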
[00:39:52] tgr, https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Step_2:_get_the_code_on_tin [00:40:05] !log on tin: fixing ownership and permissions of /tmp/mw-cache-* [00:40:11] Logged the message, Master [00:41:06] I am otherwise [00:41:27] IIRC --rebase is equivalent to doing a manual rebase, with some small advantages [00:41:38] but I'll stick to the manual then :) [00:42:56] !log rdb2001 attempting another reinstall after fixed netboot [00:43:01] Logged the message, Master [00:46:27] 6operations, 5Patch-For-Review, 3wikis-in-codfw: setup & deploy rdb2001-2004 - https://phabricator.wikimedia.org/T92011#1111762 (10Dzahn) there was this issue: https://gerrit.wikimedia.org/r/#/c/196129/1 and what is listed on https://phabricator.wikimedia.org/T86896#1111738 [00:49:34] 6operations, 5Patch-For-Review, 3wikis-in-codfw: setup & deploy rdb2001-2004 - https://phabricator.wikimedia.org/T92011#1111770 (10Dzahn) still doesn't seem to install. does partman/raid1-lvm-ext4-srv.cfg not work? [00:54:04] (03PS1) 10BBlack: tag cp1059 + cp4010 -> jessie [puppet] - 10https://gerrit.wikimedia.org/r/196131 [00:54:23] (03CR) 10BBlack: [C: 032 V: 032] tag cp1059 + cp4010 -> jessie [puppet] - 10https://gerrit.wikimedia.org/r/196131 (owner: 10BBlack) [00:54:29] 6operations: Delete stat1002:/a/squid/archive/teahouse - https://phabricator.wikimedia.org/T92335#1111782 (10Capt_Swing) please delete this @kevinator. We haven't used these log data for a long, long time. Thanks! 
[00:55:36] !log tgr Synchronized php-1.25wmf20/extensions/GlobalUsage/refreshGlobalimagelinks.php: fix script before running for T65594 (duration: 00m 06s) [00:55:42] Logged the message, Master [00:57:02] !log doing refreshGlobalimagelinks.php test runs [00:57:07] Logged the message, Master [00:58:54] Krenair: seems ok [00:59:10] ok :) [01:00:56] !log running extensions/GlobalUsage/refreshGlobalimagelinks.php --pages=nonexisting for all wikis (T65594) [01:01:01] Logged the message, Master [01:03:34] this will spam the fatallog a bit [01:03:56] what would have been the correct way of only running it for public wikis? [01:06:37] tgr: there's a 'foreachwikiindblist' script on tin [01:06:45] that takes a dblist as the first parameter [01:06:59] yeah but we want all dbs that aren't closed, silver or private [01:07:06] yeah, but what I would need here is for each wikis not in dblist [01:08:24] use comm or grep to create a dblist file that contains all the lines in all.dblist that are not in closed.dblist, silver.dblist or private.dblist [01:08:30] (03PS1) 10Tim Starling: Fix l10nupdate, have mwscript always use www-data [puppet] - 10https://gerrit.wikimedia.org/r/196132 [01:09:02] I recall someone saying that all.dblist is not up-to-date [01:09:57] that patchset will need deploying some time in the next 50 minutes [01:10:03] i can review [01:10:10] just give it +1, I'll merge and test [01:10:31] tgr: no, it's always up to date [01:11:29] (03CR) 10Ori.livneh: [C: 031] Fix l10nupdate, have mwscript always use www-data [puppet] - 10https://gerrit.wikimedia.org/r/196132 (owner: 10Tim Starling) [01:11:32] if the current wiki is missing from all.dblist, then a "no such wiki" error message is shown [01:11:58] it's checked at the top of CommonSettings.php [01:12:16] inclusion in all.dblist is the definition of a wiki that exists [01:13:20] (03CR) 10Tim Starling: [C: 032] Fix l10nupdate, have mwscript always use www-data [puppet] - 10https://gerrit.wikimedia.org/r/196132 (owner: 
10Tim Starling) [01:13:32] that's good to know, thanks [01:13:59] tgr: python -c "print '\n'.join({db.strip() for db in open('all.dblist')} - {db.strip() for db in open('closed.dblist')} - {db.strip() for db in open('silver.dblist')})" > my.dblist [01:14:17] I wonder if it would be worth the effort to teach foreachwikiindblist to handle something like 'all !closed !private' [01:15:58] python -c "print ''.join(set(open('all.dblist')) - set(open('closed.dblist')) - set(open('silver.dblist')))" [01:16:01] that's neater [01:16:20] tgr: i doubt it [01:16:57] (03PS1) 10Eevans: enable authenticated access to Cassandra JMX [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/196133 [01:16:57] but it may be worthwhile to have the dblist update script generate a public.dblist [01:18:29] two problems with that patchset [01:18:33] (03CR) 10Ori.livneh: "1) Do you really need jmxremote.password, or can you just interpolate the password parameter directly into the script invocation?" [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/196133 (owner: 10Eevans) [01:20:25] TimStarling: ? 
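The "use comm or grep" suggestion at [01:08:24] (an alternative to ori's python one-liners) can be sketched as follows. The dblist file names are the real ones discussed above, but the contents here are made-up sample data so the example is self-contained.

```shell
# Build a dblist of wikis in all.dblist but not in closed.dblist,
# silver.dblist or private.dblist, per the suggestion above.
# Sample data only; the real files live in the config checkout on tin.
set -e
cd "$(mktemp -d)"
printf 'aawiki\nenwiki\nlabswiki\nmetawiki\n' > all.dblist
printf 'aawiki\n'   > closed.dblist
printf 'labswiki\n' > silver.dblist
: > private.dblist
sort all.dblist > all.sorted
sort -u closed.dblist silver.dblist private.dblist > excluded.sorted
comm -23 all.sorted excluded.sorted > my.dblist   # lines only in all.sorted
cat my.dblist
```

The resulting file could then be fed to `foreachwikiindblist` as its first parameter, per [01:06:37].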
[01:24:08] (03PS1) 10Tim Starling: Fix obvious bugs in previously l10nupdate fix [puppet] - 10https://gerrit.wikimedia.org/r/196134 [01:25:40] d'oh [01:26:09] (03CR) 10Ori.livneh: [C: 031] Fix obvious bugs in previously l10nupdate fix [puppet] - 10https://gerrit.wikimedia.org/r/196134 (owner: 10Tim Starling) [01:27:10] (03CR) 10Tim Starling: [C: 032] Fix obvious bugs in previously l10nupdate fix [puppet] - 10https://gerrit.wikimedia.org/r/196134 (owner: 10Tim Starling) [01:27:49] !log on tin: testing l10nupdate [01:27:55] Logged the message, Master [01:32:19] 6operations, 10RESTBase, 10RESTBase-Cassandra: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471#1111859 (10Eevans) 3NEW a:3Eevans [01:33:15] (03PS2) 10Eevans: enable authenticated access to Cassandra JMX [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/196133 (https://phabricator.wikimedia.org/T92471) [01:35:49] ori: what's the normal puppet equivalent of vagrant-puppet's upstart class? [01:36:40] tgr: upstart class? there's an upstart service provider, but that is built in to puppet [01:37:11] I must be confusing things then [01:37:31] I remember a conversation about how upstart should be replaced with something [01:37:35] systemd maybe [01:37:39] (03PS1) 10Tim Starling: Move l10nupdate output directory [puppet] - 10https://gerrit.wikimedia.org/r/196137 [01:37:43] jessie doesn't have upstart; it only has systemd [01:37:51] if you're targetting jessie for deployment, use systemd [01:37:55] if trusty, upstart [01:38:05] if both, you'll have to branch based on os [01:40:48] 6operations: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1111870 (10Dzahn) 2 is eth3 with scope global eth3 3 is eth2 without IP ? [01:41:52] (03PS2) 10Tim Starling: Move l10nupdate output directory [puppet] - 10https://gerrit.wikimedia.org/r/196137 [01:42:37] TimStarling: why not tempCacheDir="$(mktemp -d)" ? 
[01:43:19] that way it would get purged eventually [01:44:05] sounds like a job for a separate patchset [01:44:18] I am just fixing what was there already [01:44:44] also I have to do it within 16 minutes unless I can somehow figure out how to disable the cron job [01:44:50] (03CR) 10Ori.livneh: [C: 031] Move l10nupdate output directory [puppet] - 10https://gerrit.wikimedia.org/r/196137 (owner: 10Tim Starling) [01:45:09] (03PS1) 10Dzahn: rbf2001: use eth2 MAC for DHCP [puppet] - 10https://gerrit.wikimedia.org/r/196138 (https://phabricator.wikimedia.org/T86897) [01:45:18] (03CR) 10Tim Starling: [C: 032] Move l10nupdate output directory [puppet] - 10https://gerrit.wikimedia.org/r/196137 (owner: 10Tim Starling) [01:46:19] mutante: do you know why terbium is freaking out? [01:46:21] TimStarling: rm /var/spool/cron/crontabs/l10nupdate ; puppet agent --disable [01:46:23] (03CR) 10Dzahn: [C: 032] rbf2001: use eth2 MAC for DHCP [puppet] - 10https://gerrit.wikimedia.org/r/196138 (https://phabricator.wikimedia.org/T86897) (owner: 10Dzahn) [01:46:35] andrewbogott: no, i don't [01:46:48] mutante: ok, here’s a more general question… what is terbium? [01:47:38] andrewbogott: it runs mw maintenance scripts and some other stuff like people.wm [01:47:43] replaces parts of fenari [01:47:48] andrewbogott: (a) people.wikimedia.org, (b) periodic, cron-managed mediawiki jobs, (c) noc.wikimedia.org [01:48:12] ok, so probably something has changed with permissions and a cron is running [01:48:19] yeah, the challenge is to work out how to do things like that without disabling puppet ;) [01:48:46] you know there are lots of interesting ways to sabotage a system that puppet doesn't know about [01:48:48] andrewbogott: define "freaking out"? it looks ok in icinga? 
[01:49:15] mutante: > 100 emails sent to root in the last 5 minutes [01:49:25] “ www-data : user NOT in sudoers ; TTY=unknown “ [01:49:33] I’m new to getting these root emails, but — surely that’s not normal [01:50:00] (03CR) 10Eevans: "Which use of jmxremote.password are you referring to?" [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/196133 (https://phabricator.wikimedia.org/T92471) (owner: 10Eevans) [01:50:58] hm, and now silver is doing the same thing [01:51:54] !log tstarling Synchronized php-1.25wmf20/cache/l10n: (no message) (duration: 00m 02s) [01:51:59] Logged the message, Master [01:52:40] andrewbogott: https://gerrit.wikimedia.org/r/#/c/196132/ [01:52:56] TimStarling: andrewbogott i think those are related [01:52:58] yeah, I just saw that… [01:53:15] !log LocalisationUpdate completed (1.25wmf20) at 2015-03-12 01:52:11+00:00 [01:53:17] TimStarling: here’s a sample https://phabricator.wikimedia.org/P388 [01:53:20] Logged the message, Master [01:53:22] !log tstarling Synchronized php-1.25wmf21/cache/l10n: (no message) (duration: 00m 01s) [01:53:24] because those mails are from crons using mwscript [01:53:27] Logged the message, Master [01:53:42] you know what I think about sudo mails, right? [01:53:59] TimStarling: no, but I’m guessing you think that they’re never ever useful :) [01:54:13] yep [01:54:47] and they're easy enough to disable in sudo's configuration, I've recommended it in the past [01:55:00] …and you don’t think that they indicate that our crons (and job runners) are failing? Or did that email storm just happen during a transition? [01:55:01] !log LocalisationUpdate completed (1.25wmf21) at 2015-03-12 01:53:58+00:00 [01:55:03] but ops are a special breed who love spam [01:55:06] Logged the message, Master [01:55:35] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 4 unmerged changes in puppet (dir /var/lib/git/operations/puppet). 
[01:55:56] it's a problem, but it's not your problem, it's my problem [01:56:15] PROBLEM - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1459 bytes in 0.431 second response time [01:56:28] ok :) As long as you know what’s happening I will start deleting these [01:56:36] ops gets mail every time someone runs sudo? [01:57:25] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [01:57:50] no, only if the permissions are wrong, e.g. you mistyped the user [01:58:03] Krenair: https://xkcd.com/838/ [01:58:11] ah so that's the "this incident will be reported"? :) [01:58:23] yep :) [01:58:28] :D [01:58:29] TimStarling: my stab at evil: sed -i -e 's|var/spool|var/ffool|g' `which cron` [01:58:40] * andrewbogott withdraws [01:59:03] andrewbogott: this is speculative fiction, not something i'd ever actually do :P [02:00:32] !log rbf2001 reboot from busybox :p [02:00:38] Logged the message, Master [02:00:48] !log on tin: disabled puppet for l10nupdate testing [02:00:53] Logged the message, Master [02:02:26] Anyone messing with terbium? mutante ? [02:04:10] what's wrong with it? [02:04:15] Cron jobs were killed [02:04:32] And seems like the cron daemon also was stopped [02:04:37] so I don't dare to restart [02:04:46] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: puppet fail [02:08:05] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [02:08:32] 6operations, 5Patch-For-Review: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1111880 (10Dzahn) i tried to use eth2 and " Network autoconfiguration failed Your network is probably not using the DHCP protocol. " [02:09:02] OOM? [02:09:10] probably needs a reboot [02:09:37] TimStarling: Terbium? 
[02:10:36] yes, if the cron daemon was killed then maybe oom-killer killed it [02:10:49] in which case rebooting would be a safe way to make sure all daemons are running [02:10:52] No, doesn't look like it [02:12:07] !log tstarling Synchronized README: (no message) (duration: 00m 01s) [02:12:13] Logged the message, Master [02:15:15] !log tstarling Synchronized README: (no message) (duration: 00m 01s) [02:16:29] scap is failing with an SSH error [02:16:39] I am trying to figure out why [02:17:13] what is the error? [02:17:40] 02:15:14 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'README', 'mw1033.eqiad.wmnet', 'mw1070.eqiad.wmnet', 'mw1097.eqiad.wmnet', 'mw1216.eqiad.wmnet', 'mw1161.eqiad.wmnet', 'mw1201.eqiad.wmnet'] on mw1121 returned [255]: Permission denied (publickey). [02:17:58] and the same for every server [02:18:05] 02:15:14 267 apaches had sync errors [02:18:15] but I can ssh to them from the terminal [02:18:24] using l10nupdate's private key [02:18:26] problem with the shared ssh-agent? [02:18:30] are you not in wikidev? [02:18:52] I think l10nupdate's keep has been borked since December [02:18:55] I am in l10nupdate only [02:18:58] *key [02:19:06] [0215][l10nupdate@tin:~]$ groups [02:19:06] l10nupdate [02:19:13] that won't work [02:19:21] why not? [02:19:53] scap is configured to use /var/run/keyholder/proxy.sock [02:20:12] since December? [02:20:25] time flies [02:20:41] so should it not do that for l10nupdate, or should it be made to work? [02:20:50] https://phabricator.wikimedia.org/T76061 [02:21:45] probably neither. we scap regularly enough, why do messages have to be synced automatically? [02:22:14] so people don't open "why is the message not updated" bugs [02:22:26] "because we haven't scapped since then" [02:22:34] we've been fine since december [02:23:30] ori: Could you take a look at terbium? The cron daemon is running (it's cron on Ubuntu, not crond, doh)... 
but for some reason not starting any Wikidata crons [02:24:22] "Urgency just became High+1 for us, it turns out there is no workaround, LocalisationUpdate is overriding newer, manually deployed extension messages." [02:24:46] so we have two bugs? [02:25:05] TimStarling: followed by "Okay, scap does in fact update the messages. Thanks! This means we have a workaround, I'm lowering the urgency again. " [02:25:06] That comment was operator error [02:25:45] They did not scap because they did not know all the strangeness of l10n cache [02:26:36] So right now l10nupdarte does update the cdb files on tin and then they ship whenever the next scap is run by a member of the wikidev group [02:26:57] seems fine to me [02:27:03] I've tried to fix this in scap/scyn-* but with no luck [02:27:05] having l10nupdate scap was always creepy [02:27:12] i could have ssh-agent-proxy get the remote socket's credentials via getsockopt() SO_PEERCRED [02:27:13] so currently you don't know why it is happening? [02:27:31] I have scanned through the bug comments and that is the current status? [02:27:48] Correct. My last investigation was https://phabricator.wikimedia.org/T76061#1060333 [02:27:48] it happens because scap hard-codes an SSH_AUTH_SOCK that is not readable for l10nupdate [02:28:38] I actually made a fix for that -- https://gerrit.wikimedia.org/r/#/c/191248/1/scap/cli.py,unified [02:28:46] but it seems not to make a difference [02:28:56] and I'm really not exactly sure why... [02:29:21] Oh! because the ssh isn't done as user l10nupdate on the far end! 
[editor's note: "l10nupdarte" at 02:26:36 is l10nupdate; "scap/scyn-*" at 02:27:03 refers to the scap sync-* scripts]
[02:29:48] let's run it under strace and see what ssh command it runs [02:30:19] The far end user is going to be mwdeploy I bet [02:30:38] boo, no -b [02:30:45] and mwdeploy isn't going to accept the l10nupdate key [02:31:40] !log tstarling Synchronized README: (no message) (duration: 00m 06s) [02:31:45] Logged the message, Master [02:31:58] yeah it uses -lmwdeploy [02:32:19] which also changed when we added the ssh-agent proxy [02:32:50] I could patch scap to allow the user to be set with a command line arg [02:33:10] right now we take it from the config file [02:33:41] then l10nupdate could add `--ssh-user l10nupdate` or something [02:37:48] ok [02:37:49] It doesn't even need a patch. `-D ssh_user=l10nupdate` should do it I think [02:38:13] I will just comment out the scap for now so that I can test my own changes without seeing so many error messages [02:39:05] `-D ssh_user:l10nupdate` [02:41:15] (03PS2) 10Nuria: Adding a Last-Access cookie to text and mobile requests [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T92435) [02:41:21] !log Manually started 7 Wikibase dispatchChanges instances on terbium after cron failed to start them. 
[02:41:29] Logged the message, Master [02:42:24] (03PS1) 10BryanDavis: l10nupdate: connect to remote hosts as l10nupdate user [puppet] - 10https://gerrit.wikimedia.org/r/196143 (https://phabricator.wikimedia.org/T76061) [02:44:29] (03CR) 10Nuria: "> re: appending to the cookie, we have a vmod" [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T92435) (owner: 10Nuria) [02:45:37] TimStarling, ori: I think https://gerrit.wikimedia.org/r/#/c/196143/1 will fix this so l10nupdate works again [02:46:46] ok, may as well test that while I'm up to my elbows in it [02:47:27] (03CR) 10Tim Starling: [C: 032] l10nupdate: connect to remote hosts as l10nupdate user [puppet] - 10https://gerrit.wikimedia.org/r/196143 (https://phabricator.wikimedia.org/T76061) (owner: 10BryanDavis) [02:49:00] mutante: ok to deploy your change? [02:49:32] TimStarling: yes please [02:49:37] sorry for leaving it there [02:50:14] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [02:50:42] there is still a permission issue on terbium [02:50:44] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [02:51:45] hoo: terbium CRON[10133]: (www-data) CMD (/usr/local/bin/mwscript extensions/Wikidata/extensions/Wikibase... 
this looks like it works again [02:56:30] PHP Warning: fopen(/tmp/mw-cache-1.25wmf20/conf-aawiki): failed to open stream: Permission denied in /srv/mediawiki/wmf-config/CommonSettings.php on line 192 [02:56:37] !log tstarling Synchronized php-1.25wmf20/cache/l10n: (no message) (duration: 00m 03s) [02:56:45] Logged the message, Master [02:57:29] bd808: seems to work [02:57:56] Running rsync command: `sudo -u mwdeploy -n -- /usr/bin/rsync --archive --delete-delay --delay-updates --compress --delete --exclude=**/.svn/lock --exclude=**/.git/objects --exclude=**/.git/**/objects --exclude=**/cache/l10n/*.cdb --no-perms --include=/php-1.25wmf20 --include=/php-1.25wmf20/cache --include=/php-1.25wmf20/cache/l10n --include=/php-1.25wmf20/cache/l10n/*** --exclude=* mw1216.eqiad.wmnet::common /srv/mediawiki` [02:58:04] !log LocalisationUpdate completed (1.25wmf20) at 2015-03-12 02:57:01+00:00 [02:58:04] w00t [02:58:09] Logged the message, Master [02:59:18] I was following the scap logs with -- tail -f /a/mw-log/scap.log | python ~bd808/scaplog.py -- from fluorine [03:04:12] Wikidata dispatch lag is growing right now (45 minutes already) because the dispatchers aren't running :( ... Any news? [03:08:45] Manually started 8 instances again... 
[03:11:19] !log tstarling Synchronized php-1.25wmf21/cache/l10n: (no message) (duration: 00m 04s) [03:11:27] Logged the message, Master [03:13:01] !log LocalisationUpdate completed (1.25wmf21) at 2015-03-12 03:11:58+00:00 [03:13:06] Logged the message, Master [03:25:11] (03PS1) 10Hoo man: Don't rely on USER to be correct in mwscript [puppet] - 10https://gerrit.wikimedia.org/r/196145 [03:25:17] TimStarling: mutante: ori ^ fix [03:31:52] (03CR) 10BryanDavis: [C: 031] Don't rely on USER to be correct in mwscript [puppet] - 10https://gerrit.wikimedia.org/r/196145 (owner: 10Hoo man) [03:34:59] TimStarling: Apparently cron jobs don't set $USER so mwscript needs a tweak [03:35:08] Yes [03:35:18] hoo was nice enough to make a patch [03:35:30] just needs a merge and sync [03:35:50] https://gerrit.wikimedia.org/r/#/c/196145/ [03:39:09] (03CR) 10Ori.livneh: [C: 032] Don't rely on USER to be correct in mwscript [puppet] - 10https://gerrit.wikimedia.org/r/196145 (owner: 10Hoo man) [03:39:15] sorry, I was away [03:39:22] Thank you [03:39:27] thx ori [03:39:29] i'll force a puppet run on terbium [03:39:47] bd808: nice fix, though I really do still think it's not a great idea to have l10nupdate scap [03:40:26] it doesn't run a full scap, just a sync-dir of the cdb json dumps [03:40:53] But it is creepy [03:41:49] hoo: ran puppet on terbium. do you need me to do anything else? 
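The root cause bd808 names above ([03:34:59]) is that cron starts jobs without `$USER` in the environment, so a script that trusts `$USER` misbehaves. A minimal illustration of the failure mode and one common fallback is below; whether hoo's actual patch (gerrit 196145) uses exactly this construct is not visible in the log.

```shell
# Cron omits $USER from the job environment; a robust script derives
# the invoking user itself instead of relying on $USER being set.
unset USER                     # simulate the cron environment
user="${USER:-$(id -un)}"      # fall back to the effective user name
echo "$user"
```

With `$USER` unset, the parameter expansion falls through to `id -un`, so the script still sees a correct, non-empty user name.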
[03:42:00] ori: No, I guess that's it :) Thank you [03:42:15] Works, a new instance just popped up :) [03:45:42] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Mar 12 03:44:38 UTC 2015 (duration 50m 35s) [03:45:48] Logged the message, Master [04:08:03] (03PS1) 10Gage: IPsec: refactor hiera data, clean up template [puppet] - 10https://gerrit.wikimedia.org/r/196155 [04:09:19] 6operations: Make puppet the sole manager of user keys - https://phabricator.wikimedia.org/T92475#1112128 (10yuvipanda) 3NEW [04:09:52] (03CR) 10Gage: [C: 032] IPsec: refactor hiera data, clean up template [puppet] - 10https://gerrit.wikimedia.org/r/196155 (owner: 10Gage) [04:11:57] yay, that worked [04:39:32] (03CR) 10Yuvipanda: ssh: introduce ssh::userkey resource (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/183814 (owner: 10Faidon Liambotis) [04:39:41] (03PS2) 10Yuvipanda: ssh: remove .ssh/authorized_keys support from prod [puppet] - 10https://gerrit.wikimedia.org/r/183824 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:39:43] (03PS3) 10Yuvipanda: ssh: introduce ssh::userkey resource [puppet] - 10https://gerrit.wikimedia.org/r/183814 (owner: 10Faidon Liambotis) [04:39:45] (03PS2) 10Yuvipanda: ssh: recurse/purge => true for /etc/ssh/userkeys [puppet] - 10https://gerrit.wikimedia.org/r/183815 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:39:47] (03PS2) 10Yuvipanda: reprepro: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183818 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:39:49] (03PS2) 10Yuvipanda: openstack: transition nova to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183819 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:39:51] (03PS2) 10Yuvipanda: ssh: change userkeys' path hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/183816 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) 
[04:39:53] (03PS2) 10Yuvipanda: ssh: support /etc/ssh/userkeys in production too [puppet] - 10https://gerrit.wikimedia.org/r/183817 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:39:55] (03PS2) 10Yuvipanda: puppet: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183822 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:39:57] (03PS2) 10Yuvipanda: admin: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183823 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:39:59] (03PS2) 10Yuvipanda: mediawiki: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183820 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:40:01] (03PS2) 10Yuvipanda: authdns: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183821 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:40:36] (03CR) 10jenkins-bot: [V: 04-1] ssh: remove .ssh/authorized_keys support from prod [puppet] - 10https://gerrit.wikimedia.org/r/183824 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:42:22] (03CR) 10jenkins-bot: [V: 04-1] authdns: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183821 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:42:39] (03CR) 10jenkins-bot: [V: 04-1] admin: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183823 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:42:42] (03CR) 10jenkins-bot: [V: 04-1] mediawiki: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183820 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:42:44] (03CR) 10jenkins-bot: [V: 04-1] puppet: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183822 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:43:09] (03PS3) 
10Yuvipanda: ssh: remove .ssh/authorized_keys support from prod [puppet] - 10https://gerrit.wikimedia.org/r/183824 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:43:11] (03PS3) 10Yuvipanda: reprepro: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183818 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:43:13] (03PS3) 10Yuvipanda: openstack: transition nova to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183819 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:43:15] (03PS3) 10Yuvipanda: puppet: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183822 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:43:17] (03PS3) 10Yuvipanda: admin: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183823 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:43:19] (03PS3) 10Yuvipanda: mediawiki: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183820 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:43:21] (03PS3) 10Yuvipanda: authdns: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183821 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:50:26] (03PS4) 10Yuvipanda: ssh: introduce ssh::userkey resource [puppet] - 10https://gerrit.wikimedia.org/r/183814 (owner: 10Faidon Liambotis) [04:50:51] (03CR) 10Yuvipanda: [C: 032] "Beta only change, so is good." 
[puppet] - 10https://gerrit.wikimedia.org/r/183814 (owner: 10Faidon Liambotis) [04:55:34] (03PS3) 10Yuvipanda: ssh: recurse/purge => true for /etc/ssh/userkeys [puppet] - 10https://gerrit.wikimedia.org/r/183815 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:56:45] RECOVERY - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1442 bytes in 0.275 second response time [05:01:42] (03PS3) 10Yuvipanda: ssh: change userkeys' path hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/183816 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [05:01:57] (03CR) 10Yuvipanda: [C: 032] ssh: recurse/purge => true for /etc/ssh/userkeys [puppet] - 10https://gerrit.wikimedia.org/r/183815 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [05:02:08] (03PS1) 10BBlack: update esams cache domainnames in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/196156 [05:02:21] (03PS2) 10BBlack: update esams cache domainnames in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/196156 [05:02:51] (03CR) 10BBlack: [C: 032 V: 032] update esams cache domainnames in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/196156 (owner: 10BBlack) [05:03:04] (03CR) 10Yuvipanda: [C: 032] ssh: change userkeys' path hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/183816 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [05:03:18] go for it :) [05:03:24] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 5 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [05:04:25] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[05:07:08] (03PS3) 10Yuvipanda: ssh: support /etc/ssh/userkeys in production too [puppet] - 10https://gerrit.wikimedia.org/r/183817 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [05:07:36] (03PS4) 10Yuvipanda: mediawiki: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183820 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [05:12:34] (03CR) 10Yuvipanda: [C: 032] ssh: support /etc/ssh/userkeys in production too [puppet] - 10https://gerrit.wikimedia.org/r/183817 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [05:20:05] (03CR) 10Yuvipanda: [C: 032] mediawiki: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183820 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [05:23:17] !log legoktm Synchronized README: testing that Yuvi didnt break anything (duration: 00m 05s) [05:23:25] Logged the message, Master [05:23:29] Guest56166: all good [05:23:33] legoktm: sweet. [05:23:42] legoktm: can you also try sshing to mw1219 yourself, see if that works? [05:23:48] you have access there, no? 
[05:23:54] * Guest56166 doesn't remember [05:24:08] legoktm@mw1219:~$ echo "hi" [05:24:08] hi [05:24:17] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [05:24:28] Guest56166: ^ [05:24:47] strange [05:24:49] logs show nothing [05:25:23] uh [05:25:26] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [05:25:42] false alarm [05:26:15] 6operations, 10ops-ulsfo: cp4009 hardware fault - https://phabricator.wikimedia.org/T92476#1112175 (10BBlack) 3NEW [05:29:36] PROBLEM - SSH on nescio is CRITICAL: Connection refused [05:30:05] (03PS1) 10BBlack: cp1046/cp4009 -> jessie; depool cp4009 T92476 [puppet] - 10https://gerrit.wikimedia.org/r/196157 [05:30:27] (03PS2) 10BBlack: cp1046/cp4009 -> jessie; depool cp4009 T92476 [puppet] - 10https://gerrit.wikimedia.org/r/196157 [05:30:37] PROBLEM - SSH on sodium is CRITICAL: Connection refused [05:30:39] (03CR) 10BBlack: [C: 032 V: 032] cp1046/cp4009 -> jessie; depool cp4009 T92476 [puppet] - 10https://gerrit.wikimedia.org/r/196157 (owner: 10BBlack) [05:30:41] uh, [05:30:44] that’s our lucid boxen. [05:30:48] yup [05:30:56] did the ssh changes break lucid? [05:31:02] I'm gonna guess ssh::userkey doesn't like lucid, yeah [05:31:08] sigh [05:31:28] wait, does that mean sshd didn’t start and the boxes are unconnectable to now? 
[05:31:42] I think puppet will be able to fix it, puppet doesn't rely on ssh [05:31:44] yeah [05:31:53] but, something has to be fixed in puppet first, without ssh access to look at the issue :) [05:31:58] I guess I’ll put an os guard [05:32:04] I wonder if salt can reach them [05:32:09] then I kind of have pseudo ssh [05:32:09] probably [05:33:38] yup, yup [05:34:29] we should really get those boxes reinstalled :/ [05:34:36] yeah [05:35:13] bblack: there's a lone lucid box in labs as well, to support this [05:37:02] (03PS1) 10Yuvipanda: ssh: Don't give lucid nice things, because lucid is not nice [puppet] - 10https://gerrit.wikimedia.org/r/196159 (https://phabricator.wikimedia.org/T92475) [05:37:04] bblack: ^ [05:37:40] if what I suspect is true then this should work [05:38:06] (03CR) 10Yuvipanda: [C: 032] ssh: Don't give lucid nice things, because lucid is not nice [puppet] - 10https://gerrit.wikimedia.org/r/196159 (https://phabricator.wikimedia.org/T92475) (owner: 10Yuvipanda) [05:39:27] RECOVERY - SSH on sodium is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7.1 (protocol 2.0) [05:39:39] bblack: ^ yay. [05:39:42] also boo lucid [05:39:51] awesome [05:40:02] now to see if I can fix it for lucid as well [05:41:30] (03PS1) 10BBlack: depool cp3009 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196160 [05:41:41] (03CR) 10BBlack: [C: 032 V: 032] depool cp3009 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196160 (owner: 10BBlack) [05:43:42] heh, git-gc run during puppet-merge [05:43:45] this may take a while :P [05:44:34] ditto on strontium, I guess it makes sense they'd reach the same point at the same time [05:44:37] 6operations, 5Patch-For-Review: Make puppet the sole manager of user keys - https://phabricator.wikimedia.org/T92475#1112197 (10yuvipanda) So this fails on lucid because ssh refuses to start when there's more than one path on the AuthorizedKeys directive. So it should be fine once everything is merged. 
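The lucid breakage above comes down to sshd version support: per the phabricator note, lucid's OpenSSH 5.3 refuses to start when `AuthorizedKeysFile` lists more than one path (multi-path support arrived in OpenSSH 5.9). A hedged sketch of the kind of version guard the "Don't give lucid nice things" patch implies; `keyfile_directive` and the key paths are illustrative, not the actual puppet code:

```shell
# Decide which AuthorizedKeysFile directive is safe for a given sshd.
# OpenSSH >= 5.9 accepts multiple paths; older sshd (lucid's 5.3)
# refuses to start on a multi-path directive.
keyfile_directive() {
    # $1 is an "ssh -V"-style banner, e.g. "OpenSSH_5.3p1 Debian-3ubuntu7.1"
    ver=$(printf '%s\n' "$1" | sed -n 's/^OpenSSH_\([0-9][0-9]*\)\.\([0-9][0-9]*\).*/\1 \2/p')
    set -- $ver
    if [ "$1" -gt 5 ] || { [ "$1" -eq 5 ] && [ "$2" -ge 9 ]; }; then
        echo 'AuthorizedKeysFile /etc/ssh/userkeys/%u .ssh/authorized_keys'
    else
        # old sshd: single path only, or it refuses to start
        echo 'AuthorizedKeysFile .ssh/authorized_keys'
    fi
}

keyfile_directive 'OpenSSH_5.3p1 Debian-3ubuntu7.1'
```

In the real repo this decision is made in puppet (keyed on the distro release rather than the sshd banner), but the version cutoff is the same.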
[05:44:48] PROBLEM - puppet last run on cp1046 is CRITICAL: CRITICAL: Puppet has 1 failures [05:45:55] second time now that initial puppet has hit: [05:45:55] Mar 12 05:27:03 cp1046 puppet-agent[1110]: (/Stage[main]/Base::Monitoring::Host/Package[mpt-status]/ensure) change from purged to latest failed: Could not update: Execution of '/usr/bin/apt-get -q -y -o DPkg::Opt [05:46:00] ions::=--force-confold install mpt-status' returned 100: Reading package lists... [05:46:03] Mar 12 05:27:03 cp1046 puppet-agent[1110]: (/Stage[main]/Base::Monitoring::Host/Package[mpt-status]/ensure) E: Problem renaming the file /var/cache/apt/pkgcache.bin.QaB5Ef to /var/cache/apt/pkgcache.bin - rename [05:46:07] (2: No such file or directory) [05:46:23] I'm guessing some apt command is running in parallel from outside of puppet and screwing that up, but I don't think the puppet-run cron would be going by then... [05:46:58] RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [05:47:07] RECOVERY - SSH on nescio is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7.1 (protocol 2.0) [05:47:33] oh no, it did happen exactly at the puppet-run mark heh [05:47:59] so, on initial puppet run, puppet installs the cron, then collides with cron running apt-get update through that and fails a package install [05:48:02] how lovely [05:49:19] ah, hehe [05:50:03] well this is all part of the same problem with manual puppet runs colliding with it too [05:50:41] given that puppet-run doesn't exist before the initial puppet run, I think we have to solve this by having puppet-run check for the lockfile used by the actual puppet agent (before even doing the preactions like apt-get update) [05:51:08] PROBLEM - RAID on analytics1010 is CRITICAL: CRITICAL: Active: 6, Working: 6, Failed: 2, Spare: 0 [05:52:10] (of course, we can check it but can't lock it I guess, so it will still be racy, just less-racy) [06:02:02] brb [06:04:58] (03PS1) 10BBlack: 
puppet-run: pre-check agent lock before apt [puppet] - 10https://gerrit.wikimedia.org/r/196162 [06:05:10] (03CR) 10BBlack: [C: 032 V: 032] puppet-run: pre-check agent lock before apt [puppet] - 10https://gerrit.wikimedia.org/r/196162 (owner: 10BBlack) [06:15:23] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [06:16:05] puppet-merge, you suck [06:17:33] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [06:22:59] (03PS1) 10BBlack: repool cp3009 [puppet] - 10https://gerrit.wikimedia.org/r/196165 [06:23:21] (03CR) 10BBlack: [C: 032 V: 032] repool cp3009 [puppet] - 10https://gerrit.wikimedia.org/r/196165 (owner: 10BBlack) [06:27:30] bblack: how are changes synchronized to strontium anyway? I don't see anything about that in puppet-merge itself. [06:28:23] PROBLEM - puppet last run on db1002 is CRITICAL: CRITICAL: Puppet has 2 failures [06:28:23] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 3 failures [06:28:45] ori: it's through some git hook on palladium [06:28:52] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:21] we've looked into that problem a few times in the past, but nobody found a definitive answer for how to fix it. it's some kind of race condition. [06:29:52] challenge accepted [06:29:58] :) [06:30:11] I wish I saved the output to paste at you [06:30:31] 10Ops-Access-Requests, 6operations, 10MediaWiki-extensions-ContentTranslation, 3LE-Sprint-84, 5Patch-For-Review: Access to stat1003 for Niklas and Kartik - https://phabricator.wikimedia.org/T91625#1112230 (10Nikerabbit) I have signed now as well. 
[06:30:46] but basically over on strontium, it ends up saying something like expected "5a454cb5 but got 4aecfa4f" when it fails [06:31:18] and the ordering of the hashes in that failure message make it sound like (to me) that strontium was expecting to see the older commit from before this merge, but instead saw the new one, in whatever it was looking at for whatever [06:31:32] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:52] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:52] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:09] I guess it's time for nightly puppetfail spam :P [06:32:24] <_joe_> ori: http://www.oriblindforest.com/ [06:33:07] _joe_: i am middle-eastern, but i'm not that hairy [06:33:30] ori: I tend to have better "luck" at triggering it if I operate very quickly, I think. but it could be my imagination (e.g. hit the submit button on gerrit in the browser and immediately fire off puppet-merge on palladium without waiting - sometimes you have to re-run a couple times before it even sees it) [06:34:23] bblack: you wouldn't happen to have a log of that error occurring anywhere? [06:35:10] nope :/ [06:35:18] unless the machines themselves are logging it somewhere [06:36:08] seems not [06:36:13] the hook is at palladium:/var/lib/git/operations/puppet/.git/hook [06:36:31] the hook is at palladium:/var/lib/git/operations/puppet/.git/hooks/post-merge , I mean [06:37:00] thanks for the overview! [06:37:48] it's also possible that gerrit is the culprit. that for whatever reason in that short window of time, it presents different commits to the two hosts, or presents some inconsistent state to strontium. 
[06:40:23] it's coming from https://github.com/git/git/blob/master/refs.c#L2146-2152 [06:40:25] <_joe_> bblack: that happens sometimes [06:40:35] <_joe_> but it's not the common problem [06:40:55] <_joe_> ori: that is one of the errors [06:41:05] well sure there are other problems. there used to be one when using sudo incorrectly, and sometimes the git connection just fails, etc [06:41:09] what else have you seen? [06:41:20] <_joe_> but the most common one is the "I don't know who you are" git error [06:41:21] but this one I saw earlier this evening was the pure one, just a hash mismatch on strontium [06:41:37] <_joe_> bblack: hash mismatch is because gerrit [06:41:49] <_joe_> you must season the patch for a while after you merged it [06:42:11] yeah sometimes I rush the process a bit when I know I'm right :P [06:42:19] it's not surprising my fingers can outrun java [06:43:23] <_joe_> lol [06:43:23] while I was blindly typing "yes" to a diff I didn't read, java was still instantiating a CommitReviewPusherContainerObjectInstantiatorForMergeAbstractHolderFunctor [06:44:19] ??? <_joe_> but the most common one is the "I don't know who you are" git error [06:44:27] that's not a real error message, is it?
[06:45:19] <_joe_> ori: not sure [06:45:35] <_joe_> ori: IIRC, yes [06:45:52] RECOVERY - puppet last run on db1002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:46:07] (¬_¬) [06:46:21] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:46:30] might be: http://stackoverflow.com/questions/11656761/git-please-tell-me-who-you-are-error [06:46:51] although that would seem like not it based on SO content [06:46:52] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:47:02] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:47:12] the suggestion to run git fetch / git reset --hard origin/master [06:47:17] (because I swear when that issue was first common on palladium, a common cause was doing "sudo puppet-merge" instead of sudo -i [06:47:18] instead of git-pull is a good one [06:48:20] one of the problems with solving these is they're not easy to reproduce, and easy to work around by retrying things [06:50:22] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:50:45] the post-merge git hook on palladium isn't puppetized [06:51:11] oh no, that's not true [06:51:12] it's puppetmaster/templates/post-merge.erb [06:51:31] we were just bitten by $USER not being set in non-login shells earlier [06:55:24] ugh [06:55:30] deployment-prep puppetmaster seems dead [06:59:42] 6operations, 5Patch-For-Review, 3wikis-in-codfw: setup & deploy rdb2001-2004 - https://phabricator.wikimedia.org/T92011#1112251 (10Joe) @Dzahn thanks for looking into it - it may well be that my partman recipe was wrong, I'll look into it! 
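The `git fetch` + `git reset --hard origin/master` suggestion above (in place of `git pull`) fits the strontium case well: a mirror that must exactly track the canonical repo can simply discard any local state instead of attempting a merge. A minimal sketch, where `sync_mirror` is a hypothetical helper, not the actual post-merge hook:

```shell
# Bring a checkout to exactly match its origin branch, discarding any
# local divergence -- appropriate for a read-only mirror like the
# strontium puppet clone, where a half-propagated ref can otherwise
# wedge a pull/merge.
sync_mirror() {
    repo=$1 branch=$2
    git -C "$repo" fetch --quiet origin &&
    git -C "$repo" reset --quiet --hard "origin/$branch"
}
```

Usage would be along the lines of `sync_mirror /var/lib/git/operations/puppet production` from the hook.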
[07:05:28] 6operations, 5Patch-For-Review, 3wikis-in-codfw: setup & deploy rdb2001-2004 - https://phabricator.wikimedia.org/T92011#1112257 (10Joe) I just tried to reinstall rdb2002, and after loading the kernel the console went completely dark, so this happens well before the partman stage I guess. Anyway, I'll try to... [07:06:41] PROBLEM - Host cp4009 is DOWN: PING CRITICAL - Packet loss = 100% [07:08:17] (03PS1) 10Giuseppe Lavagetto: redis: revert partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/196166 [07:08:46] (03CR) 10Giuseppe Lavagetto: [C: 032] redis: revert partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/196166 (owner: 10Giuseppe Lavagetto) [07:08:53] (03CR) 10Giuseppe Lavagetto: [V: 032] redis: revert partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/196166 (owner: 10Giuseppe Lavagetto) [07:09:42] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [07:18:33] (03PS1) 10Ori.livneh: puppetmaster post-merge hook fixes [puppet] - 10https://gerrit.wikimedia.org/r/196167 [07:18:41] bblack, _joe_ ^ [07:20:02] <_joe_> ori: I'm not sure git reset --hard origin/production is a good idea [07:20:23] <_joe_> but lemme think about that [07:20:36] <_joe_> right now I'm working on debian-installer bugs [07:20:48] kk [07:20:51] i'm off to bed anyhow [07:20:59] good $TIME_OF_DAY [07:22:46] <_joe_> ori: night! [07:23:12] (03PS1) 10Giuseppe Lavagetto: redis: install rdb2001 as trusty [puppet] - 10https://gerrit.wikimedia.org/r/196168 [07:23:38] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] redis: install rdb2001 as trusty [puppet] - 10https://gerrit.wikimedia.org/r/196168 (owner: 10Giuseppe Lavagetto) [07:36:08] 7Puppet, 6operations, 5Patch-For-Review: Resource attributes are quoted inconsistently - https://phabricator.wikimedia.org/T91908#1112282 (10Joe) So can we agree on using ensure => present and ensure => directory ? As for the "more correct" nature of using quoted values... 
whatever is legitimate in a... [07:49:16] (03PS1) 10Giuseppe Lavagetto: redis: use ubuntu, correct partman scheme [puppet] - 10https://gerrit.wikimedia.org/r/196170 [07:50:34] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] redis: use ubuntu, correct partman scheme [puppet] - 10https://gerrit.wikimedia.org/r/196170 (owner: 10Giuseppe Lavagetto) [07:54:55] debian-installer bugs? [07:55:16] and ubuntu? [07:55:19] need help? :) [08:06:00] <_joe_> paravoid: if you wish :) [08:07:09] <_joe_> paravoid: atm, if I try to install the new redis servers with jessie, I get https://dpaste.de/f2mu/raw and it freezes [08:07:34] <_joe_> even trying to boot the installer in debug mode didn't give me any additional info [08:17:28] 6operations, 6Multimedia, 7HHVM: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1112289 (10Kopiersperre) May you add some additional fonts, e.g. [[https://wiki.debian.org/SubstitutingCalibriAndCambriaFonts|Caladea and Carlito]] for easier SVG rendering? See also https://comm... 
[08:19:13] (03PS1) 10Yuvipanda: ssh: Make userkeys be world readable [puppet] - 10https://gerrit.wikimedia.org/r/196171 (https://phabricator.wikimedia.org/T92475) [08:19:14] paravoid: ^ [08:19:39] (03CR) 10Faidon Liambotis: [C: 032] ssh: Make userkeys be world readable [puppet] - 10https://gerrit.wikimedia.org/r/196171 (https://phabricator.wikimedia.org/T92475) (owner: 10Yuvipanda) [08:20:23] (03PS2) 10Yuvipanda: ssh: Make userkeys be world readable [puppet] - 10https://gerrit.wikimedia.org/r/196171 (https://phabricator.wikimedia.org/T92475) [08:20:31] (03CR) 10Yuvipanda: [V: 032] ssh: Make userkeys be world readable [puppet] - 10https://gerrit.wikimedia.org/r/196171 (https://phabricator.wikimedia.org/T92475) (owner: 10Yuvipanda) [08:32:12] PROBLEM - DPKG on mw1152 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [08:34:11] PROBLEM - HHVM processes on mw1152 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [08:34:11] PROBLEM - Apache HTTP on mw1152 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.020 second response time [08:34:31] PROBLEM - HHVM rendering on mw1152 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.010 second response time [08:37:45] <_joe_> that's me dist-upgrading it [08:37:51] RECOVERY - HHVM rendering on mw1152 is OK: HTTP OK: HTTP/1.1 200 OK - 66654 bytes in 6.128 second response time [08:38:32] RECOVERY - HHVM processes on mw1152 is OK: PROCS OK: 1 process with command name hhvm [08:38:32] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.104 second response time [08:38:52] RECOVERY - DPKG on mw1152 is OK: All packages OK [08:44:17] (03PS1) 10Giuseppe Lavagetto: mediawiki: install fonts metric-compatible with Calibri and Cambria [puppet] - 10https://gerrit.wikimedia.org/r/196173 (https://phabricator.wikimedia.org/T84842) [08:50:08] (03PS20) 10Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] 
- 10https://gerrit.wikimedia.org/r/195340 [08:53:17] <_joe_> !log pooling mw1152, the HHVM imagescaler, into production [08:53:24] Logged the message, Master [08:54:12] <_joe_> well, 8 minutes early, but still :) [08:56:12] (03PS21) 10Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - 10https://gerrit.wikimedia.org/r/195340 [08:57:15] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1112318 (10Joe) The imagescaler is now into rotation. [08:57:42] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1112319 (10Joe) @brion problem confirmed, but this would be relevant for the videoscalers, right? [09:00:32] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds [09:01:03] (03PS22) 10Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - 10https://gerrit.wikimedia.org/r/195340 [09:02:03] (03PS1) 10Hashar: contint: Remove integration/kss.git [puppet] - 10https://gerrit.wikimedia.org/r/196174 (https://phabricator.wikimedia.org/T92482) [09:02:47] PS22?? [09:03:06] paravoid: I’ve split off about 7 patches from it so far. [09:03:52] paravoid: and I test by making a patch, pushing to gerrit, and then pulling... [09:04:16] (03PS23) 10Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - 10https://gerrit.wikimedia.org/r/195340 [09:04:18] ^ is a typo fix, for example. 
[09:04:29] paravoid: when merged this will also hit tin, so I want to be vewy vewy careful [09:04:30] (03PS4) 10Faidon Liambotis: ssh: remove .ssh/authorized_keys support from prod [puppet] - 10https://gerrit.wikimedia.org/r/183824 (https://phabricator.wikimedia.org/T92475) [09:04:32] (03PS4) 10Faidon Liambotis: reprepro: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183818 (https://phabricator.wikimedia.org/T92475) [09:04:34] (03PS4) 10Faidon Liambotis: openstack: transition nova to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183819 (https://phabricator.wikimedia.org/T92475) [09:04:36] (03PS4) 10Faidon Liambotis: puppet: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183822 (https://phabricator.wikimedia.org/T92475) [09:04:38] (03PS4) 10Faidon Liambotis: admin: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183823 (https://phabricator.wikimedia.org/T92475) [09:04:40] (03PS4) 10Faidon Liambotis: authdns: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183821 (https://phabricator.wikimedia.org/T92475) [09:04:42] (03PS1) 10Faidon Liambotis: jenkins: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/196175 (https://phabricator.wikimedia.org/T92475) [09:04:44] (03PS1) 10Faidon Liambotis: gerrit: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/196176 (https://phabricator.wikimedia.org/T92475) [09:04:46] (03PS1) 10Faidon Liambotis: mha: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/196177 (https://phabricator.wikimedia.org/T92475) [09:04:48] (03PS1) 10Faidon Liambotis: datasets: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/196178 (https://phabricator.wikimedia.org/T92475) [09:04:50] (03PS1) 10Faidon Liambotis: logging: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/196179 (https://phabricator.wikimedia.org/T92475) [09:05:01] there [09:05:25] did I 
just crash grrrit-wm? [09:05:33] haha [09:05:38] paravoid: it probably got kicked off for flooding [09:05:43] no [09:05:50] that has a different quit message [09:05:58] oh [09:05:59] true [09:06:37] you did indeed [09:06:45] gerrit’s ssh stuff has been flaky, apparently [09:06:57] (03CR) 10jenkins-bot: [V: 04-1] gerrit: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/196176 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [09:07:08] aww crap [09:07:08] oh it came back [09:07:10] I also rebooted it [09:07:12] sigh [09:07:21] ah, maybe we’ll be spared a lot of -1s? [09:08:24] anyway, time to pick up my passport! [09:09:35] (03PS5) 10Faidon Liambotis: ssh: remove .ssh/authorized_keys support from prod [puppet] - 10https://gerrit.wikimedia.org/r/183824 (https://phabricator.wikimedia.org/T92475) [09:09:37] (03PS2) 10Faidon Liambotis: gerrit: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/196176 (https://phabricator.wikimedia.org/T92475) [09:09:39] (03PS2) 10Faidon Liambotis: mha: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/196177 (https://phabricator.wikimedia.org/T92475) [09:09:41] (03PS2) 10Faidon Liambotis: datasets: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/196178 (https://phabricator.wikimedia.org/T92475) [09:09:43] (03PS2) 10Faidon Liambotis: logging: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/196179 (https://phabricator.wikimedia.org/T92475) [09:09:53] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [09:10:19] <_joe_> mh that doesn't look good [09:15:12] <_joe_> !log depooling the HHVM imagescaler [09:15:19] Logged the message, Master [09:19:00] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1112402 (10Joe) I had to depool the server, as quite a lot of 
unexpected errors were spawning. E.G. when rescaling http://commons.wikimedia.org/w/thumb_handler.php/c/ca/Alari... [09:26:24] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1112432 (10Gilles) --no-external-files is an option that was coming from our custom patches to librsvg [09:31:02] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:36:41] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1112438 (10Joe) @Gilles I imagined that - given our svg code now supports the security model in the more modern rsvg versions I guessed it would work automagically. It is not... [09:41:33] 6operations: Higher Targeted Traffic: Wmflabs.Org - https://phabricator.wikimedia.org/T92485#1112447 (10emailbot) [09:41:59] (03PS1) 10Gilles: Re-enable thumbnail prerendering in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196183 [09:42:55] (03PS2) 10Gilles: Re-enable thumbnail prerendering in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196183 (https://phabricator.wikimedia.org/T92484) [09:46:01] <_joe_> gi11es: I was looking at includes/media/SVG.php for when external resources are handled, but I can't find that no-external-files option [09:46:57] <_joe_> oh nevermind, it is in the frigging mediawiki-config :) [09:49:33] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1112459 (10Gilles) If the new rsvg version doesn't need a specific option turned on to avoid external resources, then it's just a matter of having a slightly different mediawi... [09:50:03] _joe_: yeah, just left a comment to that effect [09:51:21] <_joe_> gi11es: working on it, I'll ask you to review [09:56:06] _joe_: has someone reviewed the security aspect of the new rsvg version? i.e. 
actually tried it with an svg that had external resources, etc.? [09:56:46] <_joe_> gi11es: I didn't try that, but me, chris and aaron looked into it at the time [09:57:01] ok cool [09:57:13] <_joe_> gi11es: fundamentally, rsvg refuses to load any external resource that is not in the working directory [09:57:25] <_joe_> and we generate a tempdir when converting IIRC [09:57:38] <_joe_> if you have an example, I can do a test anyways [09:57:38] 6operations, 5Patch-For-Review, 3wikis-in-codfw: setup & deploy rdb2001-2004 - https://phabricator.wikimedia.org/T92011#1112465 (10faidon) rdb2002 was misconfigured in the BIOS. Per [[ https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_PowerEdge_RN10#Initial_System_Setup | our platform... [10:06:20] I'll try to make one that phones home to me, this way we can make sure that it's not happening [10:07:53] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [10:11:56] (03PS1) 10Giuseppe Lavagetto: imagescaling: remove legacy option for HHVM imagescalers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196186 (https://phabricator.wikimedia.org/T84842) [10:12:27] <_joe_> gi11es: ^^ [10:17:11] _joe_: can you access my wiki here: http://montpellier.dubuc.fr:8080 ? [10:17:17] just to check that my firewall is open [10:17:34] <_joe_> gi11es: nope [10:17:57] meh, gotta dig into my router for a sec [10:22:13] (03CR) 10Liuxinyu970226: [C: 031] "OK thx" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/193827 (https://phabricator.wikimedia.org/T91223) (owner: 10Gerrit Patch Uploader) [10:26:02] (03CR) 10Gilles: [C: 031] imagescaling: remove legacy option for HHVM imagescalers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196186 (https://phabricator.wikimedia.org/T84842) (owner: 10Giuseppe Lavagetto) [10:28:05] (03CR) 10Giuseppe Lavagetto: [C: 032] "Let's see if this solves the HHVM imagescaler problems." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/196186 (https://phabricator.wikimedia.org/T84842) (owner: 10Giuseppe Lavagetto) [10:28:42] (03Merged) 10jenkins-bot: imagescaling: remove legacy option for HHVM imagescalers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196186 (https://phabricator.wikimedia.org/T84842) (owner: 10Giuseppe Lavagetto) [10:29:42] PROBLEM - puppet last run on sodium is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [10:32:42] !log oblivian Synchronized wmf-config/CommonSettings.php: Fix for svg conversion on HHVM (duration: 00m 05s) [10:32:49] Logged the message, Master [10:33:07] <_joe_> !log repooling mw1152 [10:33:12] Logged the message, Master [10:34:11] (03PS5) 10Mobrovac: Add restbase job runners [puppet] - 10https://gerrit.wikimedia.org/r/195364 (owner: 10GWicke) [10:36:01] <_joe_> mobrovac: oh btw, I have like a ton of questions on that patch [10:36:11] <_joe_> are you around in ~ 1 hour? [10:36:27] yep, ought to be [10:36:42] <_joe_> I'd really like to merge it during my working hours [10:36:57] lunch time? [10:36:58] :) [10:37:12] <_joe_> no, HHVM imagescalers time [10:37:25] ah, lunch time is much more fun [10:37:29] ok, just ping me [10:37:32] np [10:40:06] Setting up zuul (2.0.0-304-g685ca22-wmf1) ... [10:40:06] Adding system user for Zuul [10:40:06] * Zuul Merger: /etc/default/zuul-merger is not set to START_DAEMON=1: exiting [10:40:08] !!!! [10:42:17] eh looks like you've got a default or env var to deal with [10:42:31] yeah I have added that function [10:42:41] I am busy building a debian package for Zuul :ÿ [10:44:06] yey [10:45:25] btw hashar, the services team is really really interested in your automatic pkg building stuff [10:45:52] mobrovac: should be straightforward with git-buildpackage and jenkins debian glue shell scripts. 
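The "phones home" SVG gi11es proposes earlier for probing the rsvg sandbox can be as small as the sketch below: an SVG whose only content is an external image reference, so a properly sandboxed rasterizer should render it without ever opening a network connection. The URL uses the reserved `.invalid` TLD as a placeholder, and `rsvg-convert` in the trailing comment is an assumption about the tool under test:

```shell
# Write a minimal SVG that references an external image; a sandboxed
# renderer must not fetch the URL (the hostname is a non-resolvable
# placeholder, not a real beacon endpoint).
svg=$(mktemp)
cat > "$svg" <<'EOF'
<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink"
     width="10" height="10">
  <image xlink:href="http://beacon.example.invalid/hit.png"
         width="10" height="10"/>
</svg>
EOF
echo "wrote $svg"
# To exercise the scaler sandbox (assuming rsvg-convert is installed):
#   rsvg-convert "$svg" > /dev/null   # should complete with no network traffic
```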
[10:46:09] mobrovac: we can pair after lunch if you want [10:47:49] hm, that's way too early [10:47:58] need to set up stuff on my end for that first [10:47:59] an example is https://integration.wikimedia.org/ci/job/operations-debs-pybal-debian-glue/ [10:48:09] hashar: maybe sometime next week? [10:48:13] sure! [10:48:17] cool thnx i'll check that out [10:48:21] gr8! [10:48:37] which is only 3 lines of code to add in integration/config.git :-) [10:49:19] where can i find the src of the linked prj? [10:50:56] mobrovac: https://git.wikimedia.org/blob/integration%2Fconfig/19f889a67def3f120559c5d1430e73aa4eee81e6/jjb%2Foperations-debs.yaml has all the cruft [10:51:13] mobrovac: browse to the bottom for the 5 lines that adds a job for a project [10:51:14] good [10:51:15] merci [10:51:49] you can even generate the jobs yourself by following the tutorial at http://www.mediawiki.org/wiki/CI/JJB [10:52:33] <_joe_> !log depooled mw1152 again [10:52:38] Logged the message, Master [10:53:56] hashar: wow that's really neat [10:53:58] :) [10:59:13] (03PS1) 10Hashar: zuul: init scripts now have START_DAEMON [puppet] - 10https://gerrit.wikimedia.org/r/196198 [11:00:00] mobrovac: and there are a few more tutorials under http://www.mediawiki.org/wiki/Continuous_integration/Tutorials [11:00:08] though none related to deb packages [11:01:06] reading, learning, yey [11:01:09] * mobrovac happy [11:01:15] cheers hashar [11:02:13] (03CR) 10Hashar: "That is merely a copy paste of the work I have done on the Debian package." [puppet] - 10https://gerrit.wikimedia.org/r/196198 (owner: 10Hashar) [11:24:49] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1112607 (10Joe) I re-pooled the HHVM imagescaler and got confirmation that SVG files rendered correctly. However, I was still seeing a significant amount of 503s, but nothing... [11:29:14] <_joe_> mobrovac: around? 
[11:29:35] and about [11:30:28] <_joe_> so, do we really need 4 jobrunners for restbase and 4 for parsoid? [11:31:17] <_joe_> and, how many mediawiki API calls can I expect to be caused by the restbase jobrunners? [11:31:18] you'd squeeze them all? [11:31:52] mw api - one per job run [11:31:57] <_joe_> mobrovac: I have no idea what the restbase jobrunners will do [11:32:07] <_joe_> what rate of jobs do you expect [11:32:07] let me explain [11:32:16] <_joe_> yes please :) [11:32:23] they are exactly the same as the parsoid ones [11:32:42] we have hooks to notify restbase on rev changes [11:32:47] <_joe_> so we have 1 restbase job per parsoid job? [11:33:15] <_joe_> because that is potentially an issue, I've seen parsoid jobs tear down the mediawiki api in the past [11:33:28] ah [11:33:31] hm hm [11:33:35] <_joe_> heh [11:33:36] <_joe_> :) [11:33:53] <_joe_> what api calls is restbase going to do? [11:33:53] springle: around? [11:34:07] _joe_: https://www.mediawiki.org/wiki/Extension:RestBaseUpdateJobs [11:34:13] revprop [11:34:20] <_joe_> mobrovac: let's do one thing: I'll ask for further clarification in the patchset [11:34:33] springle: https://gerrit.wikimedia.org/r/#/c/195853 done in Production, right? [11:34:39] (schema change) [11:34:48] <_joe_> mobrovac: do you happen to know which api calls the parsoid jobs do? [11:35:02] _joe_: are you suggesting we merge the parsoid and restbase to the current 4 parsoid job runners? 
[11:35:15] <_joe_> no [11:35:26] _joe_: euh, not really, i can check the src and get back to you on that one [11:35:38] <_joe_> I just want to understand the relative weight of the jobs [11:35:54] ok, so [11:36:07] <_joe_> that's because I really want to merge your patchset but ATM I'm not confident with its consequences [11:36:28] altogether you'll have one rev api call for rb for each revchange [11:36:46] hm no, let me rephrase that [11:37:27] one revprop api call per job, knowing that one revchange may spring multiple jobs (if it's a revchange for a template, e.g.) [11:38:02] <_joe_> yeah I think that situation (template change) is what caused issues with parsoid as well in the past [11:38:52] also, i should note that this way of getting updates from the core should be considered as a fix, not a solution [11:39:10] <_joe_> what do you mean? [11:39:19] until we get around to working on https://phabricator.wikimedia.org/T84923 [11:40:15] <_joe_> why is rcstream not a good candidate for that? [11:40:20] <_joe_> but I digress :) [11:40:22] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.0133333333333 [11:40:25] so jobrunners in this context would become useless [11:40:49] i yet have to play with it to assess its potential [11:41:09] oh you mean, using rcstream now instead of the jobrunner? [11:41:14] <_joe_> but I'll read that ticket thorougly [11:41:15] <_joe_> no no [11:41:38] <_joe_> as a changes stream provider. but I don't think honestly you'd eliminate jobrunners [11:41:49] <_joe_> you want to be able to schedule async jobs [11:42:04] <_joe_> that are easily rate-limited globally [11:42:12] <_joe_> I think we could create a better jobrunner [11:42:21] right, ofc, i meant jobrunners would become obsolete for getting events out of the core [11:42:27] <_joe_> that e.g. 
can rate-limit requests to a single service [11:43:02] <_joe_> or throttle those based on the rate-limiting imposed by said service [11:43:10] +1 [11:43:26] <_joe_> anyways, I have some doubts about the consequences of merging this [11:44:28] if it's of any consolation, currently the extension is enabled only on test.wp [11:44:48] but yeah, eventually it's going to be enabled everywhere [11:44:51] <_joe_> yeah I know [11:45:04] so, with the parsoid jobs, the problem was too many calls to the api? [11:45:09] or the jobrunners themselves? [11:45:17] <_joe_> too many api calls [11:45:22] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [11:45:25] <_joe_> from the jobrunners of course [11:45:32] ah hm [11:45:43] <_joe_> I'm sure gabriel has more details [11:49:40] (03CR) 10Giuseppe Lavagetto: "While the patch is formally correct, I have some doubts regarding possible consequences, if not now, in the near future when restbase will" [puppet] - 10https://gerrit.wikimedia.org/r/195364 (owner: 10GWicke) [11:53:42] _joe_: ah, ok, now I see why the parsoid jobs might have had problems with overloading the api [11:54:05] this is especially problematic for templates transcluded in many pages [11:54:32] kart_: correct [11:54:36] <_joe_> yes, tipically one of the enwiki or dewiki main templates changes => hours of fear and loathing in apiland [11:54:59] because for each of them, parsoid fetches the src of both the template and the page as it needs both [11:55:06] these are pretty heavy requests [11:55:35] <_joe_> what will happen with restbase in that scenario? 
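[editor's note: a jobrunner that "can rate-limit requests to a single service", as floated above, is classically a token bucket; a minimal sketch with illustrative parameters — this is not how the MediaWiki jobrunner actually throttles]

```python
# Minimal token-bucket sketch of the per-service rate limiting a "better
# jobrunner" could apply. All parameters are illustrative.
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        # refill proportionally to elapsed time, then try to spend one token
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# e.g. cap jobs hitting one backend service at ~10/s with bursts of 5
bucket = TokenBucket(rate=10, capacity=5)
allowed = sum(bucket.allow() for _ in range(100))
print(f"{allowed} of 100 back-to-back jobs let through")
```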
[11:55:56] <_joe_> I admit my ignorance of restbase internals [11:56:28] so, in that case, RB will be much more light-weight [11:56:54] as it will emit exactly one revprop per transcluded page + one for the template itself [11:57:06] the payload of which is not big, as seen in https://phabricator.wikimedia.org/P389 [11:57:26] <_joe_> ok my question is [11:57:34] <_joe_> think a template is for 10K pages [11:57:53] <_joe_> now, before you had the hammering parsoid requests [11:58:12] <_joe_> and now you still have all those parsoid requests, plus the restbase ones [11:58:24] yes [11:58:26] oh wait [11:58:27] damn [11:58:35] no no this is noooot good [11:58:38] so [11:59:07] apart from the revprop stuff, in order to be up-to-date, rb needs the content as well, which it gets from parsoid [11:59:17] <_joe_> aha. [11:59:27] <_joe_> that was my next question [11:59:28] <_joe_> :) [11:59:52] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1112668 (10Gilles) Special:UploadStash has a known 503 issue happening at the moment: T90599 [12:00:17] and the no-bueno stanza here is that both requests from the jobrunners use no-cache, which means the api would get hammers twice as hard, once from parsoid directly, and once more when rb requests the new version (as it fwd the no-cache to parsoid) [12:00:53] so for the same thing, parsoid would call the api twice [12:00:57] ay [12:01:06] <_joe_> mh so [12:01:41] <_joe_> maybe a solution would be to just use the restbase job and make it build the parsoid cache? 
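[editor's note: the doubling being worked out above can be made concrete with back-of-the-envelope arithmetic; the function and numbers below are illustrative, not measured — the real queues batch and deduplicate in ways this ignores]

```python
# Rough model of the job fan-out discussed above: a change to a template
# transcluded in N pages springs one job per transcluded page plus one for
# the template itself, each job makes one MediaWiki API call, and running
# both the Parsoid and RESTBase job queues doubles the total.
def api_calls_for_template_change(transclusions, queues=2):
    jobs_per_queue = transclusions + 1  # one per page + the template itself
    return jobs_per_queue * queues

# the "template used in 10K pages" scenario from the discussion
print(api_calls_for_template_change(10_000))           # 20002 with both queues
print(api_calls_for_template_change(10_000, queues=1)) # 10001 with one queue
```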
[12:02:01] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [12:02:03] maybe we could simply remove the parsoid jobs and use only the RB ones [12:02:06] <_joe_> maybe this doesn't even make sense, I'm just working with what I got now [12:02:10] <_joe_> mobrovac: exactly [12:02:15] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1112671 (10Gilles) What happens with those requests is that the web server queries the image scalers internally to have them scale an image which isn't public yet. What I don... [12:02:45] that would be a solution once updates to RB are coming from all domains, not just test.wp [12:02:53] s/would/could/ [12:03:30] <_joe_> mobrovac: so my proposal is - we move forward the patch, but enable just 1 restbase jobrunner, but you think if the solution you just proposed is feasible [12:03:38] <_joe_> and if it's feasible on a per-wiki basis [12:05:48] from the correctness standpoint, it should be feasible given that both rb and parsoid get the exact same updated and that rb calls parsoid for of them [12:06:07] would need to ensure that with the parsoid guys though [12:06:21] <_joe_> mobrovac: yes of course [12:06:39] <_joe_> we should open a task on this [12:06:39] as for the per-wiki basis, probably with a lot of c/p that could be achieved [12:07:40] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1112689 (10Joe) ok, then it was probably a false positive. 
I'm a bit in the dark about what is timing out here then, will need to investigate further [12:08:10] <_joe_> mobrovac: I'd open a task, but I guess my formulation of the problem would be less clear than yours [12:08:38] <_joe_> (yes, this is an elegant way to try not to do work :)) [12:09:51] _joe_: sure, no pb, i'll type it up [12:09:58] haha [12:10:38] <_joe_> once that is up, I'll amend gabriel's patch and merge it citing the task [12:11:07] gr8, was about to say that we still should test the jobrunner conf [12:11:09] thnx [12:12:16] <_joe_> yes I see the value for you guys and I don't want to block you until it's harmful :) [12:32:17] _joe_: hm, actually, upon better studying, RB is not actually sending the no-cache header to parsoid [12:32:36] however, that does not fully alleviate the problem of double-calling the MW API [12:33:14] since RB might call Parsoid before the original parsoid jobrunner req [12:33:46] that is, if RB manages to execute its stuff prior to Parsoid for the same rev, MW API is going to be called twice for the exact same thing [12:33:48] <_joe_> it's very probable btw [12:34:03] <_joe_> it's a race condition at least [12:34:06] right [12:34:25] the point is we cannot rely on parsoid winning most of the time [12:34:35] <_joe_> so yeah either we make the two jobs depend on each other (and I don't think we can do that right now, or if it's advisable) [12:34:40] <_joe_> I'd guess it wont [12:34:54] <_joe_> as restbase jobs are going to be faster in general, right? [12:36:31] euh not really [12:36:44] we still rely on parsoid to generate us the new content [12:36:54] <_joe_> ok [12:37:19] <_joe_> also, parsoid will have no extrnal cache when called from restbase, right? 
[12:37:34] <_joe_> restbase doesn't call varnish in front of parsoid, I guess [12:40:47] 6operations, 10Parsoid, 10RESTBase, 6Services: Revision updates with Jobrunner for Parsoid and RESTBase - https://phabricator.wikimedia.org/T92490#1112711 (10mobrovac) 3NEW [12:41:11] _joe_: ^^ [12:41:18] let me check that [12:42:03] _joe_: it does, RB places calls to http://parsoid-lb.eqiad.wikimedia.org [12:43:48] 6operations, 10Parsoid, 10RESTBase, 6Services: Revision updates with Jobrunner for Parsoid and RESTBase - https://phabricator.wikimedia.org/T92490#1112721 (10mobrovac) p:5Triage>3High [12:44:26] <_joe_> mh [12:44:53] <_joe_> I thought restbase would call parsoid directly, it makes more sense but probably this is just temporary [12:45:32] not really, makes more sense to see if the varnish layer has the info RB needs [12:45:33] <_joe_> mobrovac: thanks a lot btw! [12:45:51] if it were to call parsoid directly, a MW API call would be made each time [12:45:57] _joe_: np, my pleasure [12:46:20] <_joe_> mobrovac: well, in the long run I'd expect restbase to be the main way to access parsoid, maybe the only one [12:46:36] <_joe_> at that point, I'd concentrate the cache at the restbase level [12:46:43] the rationale for this is that when something is requested and RB doesn't have it in storage, it has no way of knowing whether it's new or old content [12:46:55] _joe_: correct [12:47:12] <_joe_> mobrovac: good! 
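[editor's note: the duplicate-fetch race discussed above — the RESTBase and Parsoid jobs both hitting the MW API for the exact same revision — can be sketched as below; the dedup approach and all names are illustrative, since neither jobrunner coordinates like this today, and real dedup would need shared state (e.g. in the job queue), not an in-process set]

```python
# The Parsoid and RESTBase job queues run independently, so both can ask
# the MediaWiki API for the same (page, revision). A shared "already
# handled" set collapses the pair into one call.
api_calls = 0
seen = set()

def fetch_revision(page, rev):
    """Stand-in for the revprop/content call to the MediaWiki API."""
    global api_calls
    api_calls += 1
    return f"content of {page}@{rev}"

def run_job(page, rev):
    key = (page, rev)
    if key in seen:       # the slower queue arrives second: skip the call
        return
    seen.add(key)
    fetch_revision(page, rev)

# the same revision change lands in both queues
for queue in ("parsoid", "restbase"):
    run_job("Template:Cite", 12345)

print(api_calls)  # 1 with dedup; it would be 2 without the `seen` check
```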
[12:47:20] <_joe_> I'll amend gabriel's patch and merge it [12:48:27] when that happens, we simply change https://github.com/wikimedia/operations-puppet/blob/production/modules/restbase/templates/config.yaml.erb#L118 and wait for the next puppet run :) [12:48:36] _joe_: cool, thnx [12:55:16] (03PS6) 10Giuseppe Lavagetto: Add restbase job runners [puppet] - 10https://gerrit.wikimedia.org/r/195364 (owner: 10GWicke) [12:58:05] (03CR) 10Giuseppe Lavagetto: [C: 032] "As stated in the discussion on IRC, I am merging this to allow the services team to test all the moving parts, but I don't think this shou" [puppet] - 10https://gerrit.wikimedia.org/r/195364 (owner: 10GWicke) [12:59:06] <_joe_> mobrovac: merged and puppet running [12:59:45] beautiful [12:59:46] grzie [12:59:58] 6operations, 6Commons, 10Wikimedia-General-or-Unknown: Upload varnish cp1063 not responding to purges - https://phabricator.wikimedia.org/T54864#1112755 (10mark) 5declined>3Resolved a:3mark [13:04:28] 6operations, 10ops-esams: decom amslvs1-4 (dc work) - https://phabricator.wikimedia.org/T87790#1112762 (10mark) They were part of the same batch as all misc servers in esams, and I think we won't recycle them yet (later, in one go). We may or may not reuse them for something else until that time. [13:04:38] <_joe_> mobrovac: jobs should be running everywhere [13:04:47] cool [13:04:48] :) [13:05:00] <_joe_> whoa I'm not used to see people from SF before I have lunch [13:05:13] question: how can i obtain / get the revdelete right on testwiki? [13:05:18] <^d> _joe_: Most SF people wake up late :p [13:05:24] * ^d is an early riser [13:05:27] hello ^d [13:05:34] <_joe_> ^d: yeah but now you're one hour early because DST [13:05:36] <^d> YuviPanda: morning bro [13:05:41] <_joe_> mobrovac: no idea [13:05:46] ^d: bro bro bro [13:05:46] :D [13:06:01] ^d: deployment-prep is almost close to being done, and I have removed one more dependency on NFS... 
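[editor's note: the one-line change mobrovac points at above (config.yaml.erb#L118) is roughly the sketch below; the direct backend URL shown is a placeholder, not a hostname confirmed anywhere in this discussion]

```yaml
# modules/restbase/templates/config.yaml.erb (sketch)
# current: Parsoid is reached through the two-layer Varnish cache
parsoidHost: http://parsoid-lb.eqiad.wikimedia.org
# direct alternative discussed above (hostname/port hypothetical):
# parsoidHost: http://parsoid.svc.eqiad.wmnet:8000
```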
[13:06:04] err [13:06:09] staging-tin, not deployment-prep [13:06:18] <^d> \o/ [13:06:20] ^d: I just have to fix failing scap and we’ll all be good :D [13:06:27] <^d> death to nfs! [13:06:43] ^d: :D [13:07:06] <_joe_> staging-tin? [13:07:07] ^d: with this I think deployment-prep is NFS free except for upload [13:07:19] _joe_: this is the ‘recreate beta from scratch, but properly done’ project [13:07:21] staging [13:07:28] <_joe_> staging-deployment would've been better [13:07:46] the work I’m doing would probably help a lot with putting up deployment hosts in the new DC if we want to do that [13:07:53] deployment-prep, betalabs, staging-deployment... [13:08:02] <_joe_> YuviPanda: yeah I've seen, thanks for that [13:08:06] <^d> _joe_: One of our goals is "do it just like prod, even if prod is silly" [13:08:09] <^d> Hence tin :) [13:09:03] _joe_: still battling some scap issues, however. [13:09:09] need to figure out how to make it use the proxy properly [13:09:10] well [13:09:13] the ssh-agent properly [13:22:21] RECOVERY - puppet last run on sodium is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [13:31:22] (03PS24) 10Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - 10https://gerrit.wikimedia.org/r/195340 [13:32:39] kart_: is there a cxserver deployment today? [13:38:49] _joe_: btw, the lucid hosts look ok to me [13:38:54] (puppet runs successfully, etc [13:38:55] ) [13:39:39] ^d: [13:39:56] (since your name is on the cxserver deploy thing as well) [13:40:58] <^d> heck if I know! 
:D [13:41:07] <^d> I've just been around in case kart_ needs help [13:41:16] <^d> no clue if a deploy happens or not [13:45:18] ^d: if it doesn’t happen there’s like a 6-7h window in which that patch can be merged [13:50:11] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [13:55:11] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [14:00:11] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [14:01:02] anyone up to review a Zuul change please? I need a new variable START_DAEMON in the zuul init scripts https://gerrit.wikimedia.org/r/#/c/196198/ :( [14:01:04] :) [14:02:12] 6operations, 10Parsoid, 10RESTBase, 6Services: Revision updates with Jobrunner for Parsoid and RESTBase - https://phabricator.wikimedia.org/T92490#1112974 (10mobrovac) [14:05:12] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [14:10:11] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [14:12:51] (03CR) 10Ottomata: Adding a Last-Access cookie to text and mobile requests (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T92435) (owner: 10Nuria) [14:15:11] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [14:16:01] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [14:19:17] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) 
- https://phabricator.wikimedia.org/T91347#1113013 (10mforns) a:5mforns>3Ottomata [14:20:11] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [14:20:20] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: Eventlogging JS client should warn users when serialized event is more than "N" chars long and not sent the event - https://phabricator.wikimedia.org/T91918#1113018 (10mforns) a:5Nuria>3mforns [14:23:15] (03PS1) 10Giuseppe Lavagetto: Revert "redis: use ubuntu, correct partman scheme" [puppet] - 10https://gerrit.wikimedia.org/r/196220 [14:24:43] 7Blocked-on-Operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1113028 (10Ottomata) Hm, Yuri, do you need this data outside of Hadoop/Hive? Why don't we just add this to the refined webrequest Hive table as part of the... [14:25:12] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [14:25:41] (03PS2) 10Giuseppe Lavagetto: redis: use debian in codfw [puppet] - 10https://gerrit.wikimedia.org/r/196220 [14:27:01] 7Blocked-on-Operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1113029 (10Nuria) @Yurik: Both andrew and myself think this is very possible to do in the cluster (just like geo coding is being done now) so we can try t... 
[14:27:21] _joe_, mobrovac, re Parsoid & caching: the v2 API that's called by RB is intentionally uncached, so the base load on Parsoid *will* double [14:28:13] however, RB also eliminates most of the extra load on Parsoid from dumps [14:28:21] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:28:38] (03PS3) 10Nuria: Adding a Last-Access cookie to text and mobile requests [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T92435) [14:29:07] (03PS3) 10Giuseppe Lavagetto: redis: use debian in codfw [puppet] - 10https://gerrit.wikimedia.org/r/196220 [14:29:42] _joe_, mobrovac: if the base load turns out to be too high as we gradually enable updates for more wikis we can still dial down the template update rate (which is >90% of the load) [14:29:56] gwicke: the concern is not parsoid per se, but the mw api calls made by it [14:30:11] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [14:30:57] mobrovac: we know that the dbs and app servers are fine even with the Parsoid cluster close to 100% (which didn't use to be the case) [14:31:02] <_joe_> gwicke: yes, what we figured out earlier is that you basically call parsoid twice, and that going on we can maybe just drop the parsoid jobs and use restbase jobs to update parsoid as well [14:31:37] that won't work [14:31:43] <_joe_> why? [14:31:51] v1 / v2 mismatch [14:31:52] we are also switching API versions on the Parsoid side [14:32:09] RB requests a bundle of data-parsoid and html [14:32:28] while v1 returns that all mixed within the (bloated) HTML document [14:32:44] <_joe_> so say that the restbase job spawns before the parsoid one ran [14:33:03] they are two different jobs [14:33:08] <_joe_> you can either use the cached content (and get an outdated one), or run the job twice, right? 
[14:33:09] doesn't matter, different URI, different content, varnish won't cache that [14:33:35] _joe_: correct [14:33:38] <_joe_> ok so we basically doubled the api calls we do for each change of revision? [14:33:44] we intentionally don't cache v2 as that would only dilute the cache for v1 responses [14:33:58] <_joe_> gwicke: I think it's a sensible choice [14:34:04] yes [14:34:27] <_joe_> but still, what I just stated kind of worries me on the long run. [14:34:31] _joe_: yes, for the transition period we'll have a higher base load on parsoid and the API [14:34:51] <_joe_> gwicke: how long do you expect the transition to be? [14:34:53] but we know that even 100% is okay, so I'm not too worried about the API part [14:35:11] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [14:35:20] and we have several parameters that we can tune if that turned out to be a problem [14:35:26] <_joe_> you may be not, I am :) [14:35:36] <_joe_> and this is good [14:36:18] we have been hitting Parsoid pretty hard, and there were not issues [14:36:21] app cluster load is low [14:36:36] <_joe_> api [14:36:42] *nod* [14:36:56] transition period should be a month perhaps [14:37:04] <_joe_> gwicke: oh ok! [14:37:12] <_joe_> that's good. [14:37:13] we are testing the VE integration on test.wikipedia.org currently [14:37:31] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: Eventlogging JS client should warn users when serialized event is more than "N" chars long and not sent the event - https://phabricator.wikimedia.org/T91918#1113048 (10Nuria) [14:37:33] <_joe_> so in ~ 1/2 months we can turn the parsoid jobs off, basically? 
[14:37:33] it's working like a charm, without ever having been tested in combination before [14:37:47] good API testing in CI ftw ;) [14:38:12] <_joe_> if that's the case, I'm pretty confident we can work with it [14:38:12] _joe_: once VE is migrated over [14:38:26] <_joe_> I thought it would take at least a quarter :) [14:38:37] and yeah, target for that is end of this month [14:39:43] so are we saying here that in 15 days varnish won't serve parsoid's content any more? [14:40:12] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [14:40:13] once VE is no longer using it we can at least disable the update job [14:40:29] Flow isn't using the cached content anyway [14:40:29] right [14:40:35] perfect [14:41:00] now, does that mean we can get rid of varnish / re-purpose it? [14:41:03] :P [14:41:11] and dump consumers are more than happy to use a faster API [14:41:32] those varnishes are currently used for several tasks [14:41:50] for RB a double-layer Varnish setup doesn't make that much sense [14:42:00] on average, it adds about 10ms latency [14:42:01] <_joe_> yes [14:42:10] that's why i'm asking actually [14:42:22] because i'd like us not to proxy service requests through it [14:42:32] currently done for rb, citoid, zotero [14:42:40] <_joe_> gwicke: wouldn't it be better to point restbase directly to parsoid then? [14:42:41] and i'd like us to stop that "trend" [14:42:47] a single cache layer would be nice to have though [14:43:27] _joe_: you mean to the lb? 
[14:43:39] if we don't do that yes, then yes we should [14:43:43] *yet [14:44:13] parsoidHost: http://parsoid-lb.eqiad.wikimedia.org [14:44:23] so yeah, lets change that [14:44:32] <_joe_> that is varnish I think [14:44:36] yup [14:44:37] <_joe_> nod [14:44:41] yes, let's procure a new varnish layer for those services [14:44:44] single layer [14:45:11] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [14:45:37] I was wondering what the state of nginx caching is like these days [14:45:44] <_joe_> gwicke: don't [14:45:49] <_joe_> :) [14:45:58] wonder all you want, that doesn't seem useful at this point [14:46:03] i know there have been some improvements there [14:46:09] <_joe_> gwicke: I have a caching layer with nginx + memcache + some lua dark magic [14:46:39] <_joe_> in another project, and well comparing that to varnish is unfair [14:46:50] I agree that single-layer Varnish is more straightforward and the logical next step [14:47:15] just wondered about the potential to save one hop if all we want to do is in-memory caching [14:48:06] (03PS1) 10Andrew Bogott: Refresh nfs exports much more often. [puppet] - 10https://gerrit.wikimedia.org/r/196225 [14:49:23] imagine the potential to save hops if we had a monolithic application ;) [14:49:48] he ;) [14:49:59] (03CR) 10coren: [C: 031] "Still reasonable." [puppet] - 10https://gerrit.wikimedia.org/r/196225 (owner: 10Andrew Bogott) [14:50:12] RECOVERY - check_mysql on db1008 is OK: Uptime: 48 Threads: 1 Questions: 26670 Slow queries: 0 Opens: 59 Flush tables: 2 Open tables: 64 Queries per second avg: 555.625 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [14:50:34] (03CR) 10Andrew Bogott: [C: 032] Refresh nfs exports much more often. 
[puppet] - 10https://gerrit.wikimedia.org/r/196225 (owner: 10Andrew Bogott) [14:51:43] mlitn: Ping for SWAT in about 8.5 minutes [14:51:45] (03CR) 10Nuria: Adding a Last-Access cookie to text and mobile requests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T92435) (owner: 10Nuria) [14:51:51] anomie: I’m around! [14:51:57] marktraceur, ^d, thcipriani: Who wants to SWAT today? [14:53:17] I hesitate to say I "want" to... [14:53:23] But I would do it if needed. [14:53:58] I could do it [14:54:13] Given that I've been pawning off evening SWATs on other people recently [14:57:02] So it looks like special guest RoanKattouw will SWAT today (: [14:57:12] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] redis: use debian in codfw [puppet] - 10https://gerrit.wikimedia.org/r/196220 (owner: 10Giuseppe Lavagetto) [14:57:21] :P [14:57:56] I still have puppet disabled on carbon [14:58:27] Krenair: I keep forgetting to ping you too on Thursdays, because you're not listed in the column on the Deployments page. [14:58:35] <_joe_> paravoid: I moved back to debian [14:58:45] <_joe_> so it shouldn't interfere with you right? [14:58:48] I don't really know how that list is made [14:59:02] <_joe_> I just switched back to here seeing "ask faidon" [14:59:04] <_joe_> :) [14:59:07] heh :) [14:59:10] I think greg-g makes it [14:59:11] I think greg-g copy-pastes it from a template or something. [14:59:17] <^d> anomie: So you can be the voiceover announcer for RoanKattouw's guest appearance :p [14:59:18] I'm deep into d-i trying to find the race [14:59:24] I have a modified initrd [14:59:27] <_joe_> good luck with that [14:59:28] <_joe_> :) [14:59:34] I've made good progress actually [14:59:42] <_joe_> if you manage to pull that off, we'd all be VERY VERY grateful [15:00:01] <_joe_> and some millions of sysadmins around the globe too [15:00:05] manybubbles, anomie, ^d, thcipriani, marktraceur, mlitn: Dear anthropoid, the time has come. 
Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150312T1500). [15:00:12] I'm pretty confident I know where the race is and I have a few candidates to see how it's triggered [15:00:18] I just eliminated one [15:00:32] <_joe_> paravoid: so, ok to re-enable? [15:00:38] yes for now [15:00:43] I'll redisable if need be [15:00:44] <_joe_> I want to install at least 2 servers today [15:00:48] <_joe_> ok [15:00:49] 6operations, 6Commons, 10MediaWiki-Uploading, 6Multimedia, 5Patch-For-Review: http-bad-status - 403 Forbidden when trying to upload file from tools.wikimedia.pl by URL to Commons - https://phabricator.wikimedia.org/T75200#1113120 (10Steinsplitter) [15:00:57] can you leave rdb2002 to me a little while longer? [15:01:02] <_joe_> yes [15:01:05] thank you [15:01:12] <_joe_> I'll install rdb2001 and 2003 [15:01:18] <_joe_> who are master-slave [15:03:07] RoanKattouw, are you doing it? [15:03:44] kart_: ping? [15:03:47] (03PS1) 10Mobrovac: Hit Parsoid directly [puppet] - 10https://gerrit.wikimedia.org/r/196229 [15:07:00] Krenair: Yes [15:07:04] ok [15:07:12] Still waiting for Jeknins [15:07:40] Oh sorry Jenkins merged my thing 8 minutes ago :D [15:07:59] I was going to say :p [15:08:25] 7Blocked-on-Operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1113146 (10Yurik) @Nuria, @Ottomata, @faidon First, for the reasoning: banners, compression, analytics. # We need to show Zero banners on non-mobile Wiki... 
[15:08:40] But grrrit-wm somehow didn't ping me about it [15:09:52] * YuviPanda has a feeling grrrit-wm doesn’t like jenkins as much anymore [15:10:19] YuviPanda: Maybe it's a repo specific setting though [15:10:31] don’t think we ever added one [15:10:35] Because I do see a merge notification for UploadWizard [15:10:39] Hmm OK [15:10:49] I mean I know there's some filtering of jenkins-bot stuff going on [15:10:56] Like, V:2 isn't reported but V:-1 is [15:11:51] I suspect breakage, actually [15:12:07] grrrit-wm went from relying on redis to report gerrit events to handling the ssh stream itself. [15:12:15] maybe it doesn’t handle everything peachy [15:12:47] isn't gerrit event streaming a privileged thing? [15:12:53] yup [15:12:58] the bot has the privilage [15:13:17] but runs in tools with random people having access? [15:13:30] (03PS1) 10Andrew Bogott: Sleep in firstboot to avoid that NFS race condition. [puppet] - 10https://gerrit.wikimedia.org/r/196233 [15:14:39] (03CR) 10Andrew Bogott: "This isn't urgent enough to merit a rebuild of images but we might as well have it for the next time we need to rebuild." [puppet] - 10https://gerrit.wikimedia.org/r/196233 (owner: 10Andrew Bogott) [15:15:12] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 110, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/3: down - Core: pfw2-codfw:xe-6/0/0 {#10901} [10Gbps DF]BR [15:15:16] (03CR) 10GWicke: Hit Parsoid directly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/196229 (owner: 10Mobrovac) [15:16:34] 7Blocked-on-Operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1113155 (10Ottomata) I can't comment on the first 2 points, as those make this seem like it actually has to be done in varnish, if you need to respond to a r... [15:20:00] Krenair: yes [15:20:10] Krenair: it’s… not much of a privilage. no private info comes through. 
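[editor's note: the jenkins-bot filtering Krenair describes above ("V:2 isn't reported but V:-1 is") amounts to a small predicate over Gerrit stream-events; the sketch below uses a simplified event shape, and the field names are illustrative rather than grrrit-wm's actual code]

```python
# Suppress routine positive Verified votes from CI, relay failures.
def should_report(event):
    if event.get("type") != "comment-added":
        return True
    author = event.get("author", {}).get("username")
    approvals = {a["type"]: int(a["value"])
                 for a in event.get("approvals", [])}
    if author == "jenkins-bot" and approvals.get("Verified", 0) > 0:
        return False  # routine V:+2 from CI: stay quiet
    return True       # V:-1 and everything else gets relayed

ok = {"type": "comment-added", "author": {"username": "jenkins-bot"},
      "approvals": [{"type": "Verified", "value": "2"}]}
fail = {"type": "comment-added", "author": {"username": "jenkins-bot"},
        "approvals": [{"type": "Verified", "value": "-1"}]}
print(should_report(ok), should_report(fail))  # False True
```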
[15:20:14] Why is it restricted then? :/ [15:20:27] Krenair: I asked the saaaaaame question. [15:20:28] no idea [15:20:31] upstreaddidit [15:21:01] !log catrope Synchronized php-1.25wmf21/extensions/Flow/: SWAT (duration: 00m 07s) [15:21:11] Logged the message, Master [15:21:12] mlitn: Please verify -----^^ [15:21:15] (03PS1) 10Ottomata: Update gbp to work with tags. Add debian/README with build instructions [debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/196235 [15:22:01] 7Blocked-on-Operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1113164 (10faidon) Regarding (1) and "proper on-wiki notification" — is this task about *Varying* the cache per X-CS, not just tagging it for Analytics purpo... [15:23:38] (03PS1) 10Ottomata: Update gbp to work with tags. Add debian/README with build instructions [debs/python-kafka] - 10https://gerrit.wikimedia.org/r/196237 [15:24:08] (03Abandoned) 10Ottomata: Update gbp to work with tags. Add debian/README with build instructions [debs/python-kafka] - 10https://gerrit.wikimedia.org/r/196237 (owner: 10Ottomata) [15:24:11] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 114, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/3: down - Core: pfw1-codfw:xe-6/0/0 {#10900} [10Gbps DF]BR [15:24:17] RoanKattouw: checking [15:24:59] (03PS2) 10Mobrovac: Hit Parsoid directly [puppet] - 10https://gerrit.wikimedia.org/r/196229 [15:25:03] (03PS2) 10Ottomata: Update gbp to work with tags. Add debian/README with build instructions [debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/196235 [15:26:29] (03PS3) 10Ottomata: Update gbp to work with tags. Add debian/README with build instructions [debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/196235 [15:26:35] papaul: are you working on the pfws? [15:26:46] (03CR) 10Ottomata: [C: 032 V: 032] Update gbp to work with tags. 
Add debian/README with build instructions [debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/196235 (owner: 10Ottomata) [15:26:48] yes [15:26:49] RoanKattouw: works fine, thanks! [15:26:58] paravoid: yes [15:27:04] (03CR) 10GWicke: [C: 031] Hit Parsoid directly [puppet] - 10https://gerrit.wikimedia.org/r/196229 (owner: 10Mobrovac) [15:27:07] okay [15:27:17] 7Puppet, 6operations, 5Patch-For-Review: Resource attributes are quoted inconsistently - https://phabricator.wikimedia.org/T91908#1113186 (10akosiaris) >>! In T91908#1112282, @Joe wrote: > So can we agree on using > > ensure => present > > and > > ensure => directory > > ? +1 > > As for the "more co... [15:27:34] papaul: ok, thanks -- let me know when you're done so I can make sure everything's okay :) [15:27:47] ok will [15:28:54] paravoid: done, i need a test please. can you test and see if you can get RX and TX now? [15:29:03] i moved it to scs-c1 [15:29:19] we've lost the network on both of them [15:29:34] 15mins + 5mins ago [15:32:01] paravoid: they are back up now [15:32:25] I don't see them [15:32:28] (03PS1) 10Gage: IPsec: deploy on amssq* [puppet] - 10https://gerrit.wikimedia.org/r/196240 [15:32:52] paravoid they are not on scs-c8 but on scs-c1 [15:33:00] no, we've lost the *network* [15:33:05] (03CR) 10coren: [C: 032] Tools: Properly puppetize crontab replacement [puppet] - 10https://gerrit.wikimedia.org/r/186627 (https://phabricator.wikimedia.org/T86445) (owner: 10Tim Landscheidt) [15:33:23] I haven't checked the serial console yet, will do so in a minute [15:33:29] ok [15:33:42] did you accidentally remove their network cables perhaps? [15:33:47] no [15:33:54] just the console [15:34:59] are you sure that's the serial console port and not the network management port? [15:35:21] (03CR) 10Mobrovac: "I may be ignorant, but why put a password on it at all?" 
[puppet/cassandra] - 10https://gerrit.wikimedia.org/r/196133 (https://phabricator.wikimedia.org/T92471) (owner: 10Eevans) [15:35:26] did you reboot them? [15:35:27] serial and ethernet are very different, perhaps it's caused the boxes to crash somehow... [15:35:27] sure [15:35:59] the ethernet ports are in the front and the serial is the only cable at the back [15:36:11] (03CR) 10Gage: [C: 032] IPsec: deploy on amssq* [puppet] - 10https://gerrit.wikimedia.org/r/196240 (owner: 10Gage) [15:36:40] jgage_: does that mean I can put amssq31 back in prod traffic flow too? [15:36:55] (03CR) 10Thcipriani: [C: 031] "All looks sane—all present and accounted for :)" [puppet] - 10https://gerrit.wikimedia.org/r/195340 (owner: 10Yuvipanda) [15:37:11] (03PS1) 10Ottomata: Bump debian/changelog to 0.9.3 [debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/196242 [15:37:29] (03PS25) 10Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - 10https://gerrit.wikimedia.org/r/195340 [15:37:33] (03CR) 10Ottomata: [C: 032 V: 032] Bump debian/changelog to 0.9.3 [debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/196242 (owner: 10Ottomata) [15:37:36] bblack gimme just a minute to confirm, then yes :) [15:38:02] papaul: what did you replace exactly? [15:38:03] none of them talk to cp1008 for prod traffic anyways, so I don't think any real traffic will flow over this yet till you include some other cp1xxx [15:38:08] yeah [15:38:28] paravoid: the serial console card [15:38:47] these two have booted up with an empty config, it's unclear to me why [15:39:05] perhaps it's not just serial console? [15:39:09] perhaps they also hold the config [15:39:12] (03PS26) 10Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - 10https://gerrit.wikimedia.org/r/195340 [15:39:14] perhaps replacing with the old cards brings it back? 
[15:39:23] do you want me to do that [15:39:31] put back the old card [15:39:33] and see [15:40:30] 6operations, 10Analytics, 6Scrum-of-Scrums, 10Wikipedia-App-Android-App, and 2 others: Avoid cache fragmenting URLs for Share a Fact shares - https://phabricator.wikimedia.org/T90606#1113222 (10Fjalapeno) @dr0ptp4kt thanks! [15:40:42] bblack: ok, amssq31 is ready to go back into service. thanks. [15:40:49] 6operations, 10Analytics, 6Scrum-of-Scrums, 10Wikipedia-App-Android-App, and 2 others: Avoid cache fragmenting URLs for Share a Fact shares - https://phabricator.wikimedia.org/T90606#1113225 (10Fjalapeno) a:3Fjalapeno [15:41:24] papaul: yes please [15:41:31] ok doing that on pfw2 [15:41:38] (03PS1) 10BBlack: repool amssq31 [puppet] - 10https://gerrit.wikimedia.org/r/196244 [15:41:48] (03CR) 10BBlack: [C: 032 V: 032] repool amssq31 [puppet] - 10https://gerrit.wikimedia.org/r/196244 (owner: 10BBlack) [15:42:22] (03CR) 10Tim Landscheidt: Sleep in firstboot to avoid that NFS race condition. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/196233 (owner: 10Andrew Bogott) [15:42:47] (03CR) 10Alexandros Kosiaris: [C: 04-1] Don't include a node in its own seeds (032 comments) [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/195483 (https://phabricator.wikimedia.org/T91617) (owner: 10GWicke) [15:43:37] (03CR) 10coren: [C: 032] "This is actually more paranoid than strictly required, and should do nicely." [puppet] - 10https://gerrit.wikimedia.org/r/190978 (https://phabricator.wikimedia.org/T87527) (owner: 10Tim Landscheidt) [15:43:46] 7Blocked-on-Operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1113241 (10Yurik) @Faidon, we have solved the cache variance a long time ago - see the [[ http://git.wikimedia.org/blob/operations%2Fpuppet.git/433adf01c101b... 
[15:44:15] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: Eventlogging JS client should warn users when serialized event is more than "N" chars long and not sent the event [8 pts] - https://phabricator.wikimedia.org/T91918#1113246 (10kevinator) [15:45:15] paravoid: the old card is back in pfw2 [15:45:43] I don't see it in the console server [15:45:48] bblack: can cp1008 go back into service as well? i'm ready for it if you are. [15:46:00] is it still scs-c1 port 34? [15:46:05] jgage: it was never in service, it's a test-only host that doesn't go into service [15:46:08] ok [15:46:22] jgage: is the ipsec optional at this point due to some config? [15:46:46] jgage: (I mean, I see ipsec.conf entries for the cp10xx, but clearly it's still able to talk them even though they're not yet configured) [15:47:23] hosts will use ipsec if it's configured on both ends, but will continue to use regular transport otherwise [15:47:36] there's no policy to enforce use of ipsec [15:48:06] does that make it defeatable? [15:48:21] papaul: ? [15:48:35] (can a MITM force ipsec on each side to think the other isn't configured for it?) [15:48:51] no, IKE protects us against that [15:49:08] how? 
[15:49:26] (03PS27) 10Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - 10https://gerrit.wikimedia.org/r/195340 [15:49:39] well if the MITM was able to filter IKE packets then yes, but the MITM isn't able to talk to the IKE agents because it would lack the proper keys [15:50:02] yeah but cp10xx right now aren't talking to IKE, they didn't need proper keys to effectively be disabled [15:50:11] PROBLEM - Host cp1047 is DOWN: PING CRITICAL - Packet loss = 100% [15:50:13] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: puppet fail [15:50:17] (03CR) 10Yuvipanda: [C: 032] deployment: Combine labs/prod deployment server roles [puppet] - 10https://gerrit.wikimedia.org/r/195340 (owner: 10Yuvipanda) [15:50:52] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 112, down: 0, dormant: 0, excluded: 0, unused: 0 [15:51:02] RECOVERY - Host cp1047 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [15:52:12] 6operations, 10ops-eqiad: cp1047 down - https://phabricator.wikimedia.org/T88045#1113287 (10Cmjohnson) a:5Cmjohnson>3BBlack Checked for errors again today and did not see any. Rebooted and nothing shows in post. Assigning to bblack to add back. [15:52:50] bblack: right, IPsec transport is not used unless both sides are configured. so because we're not firewalling non-ipsec traffic, currently it will just fall back to regular transport. we'll probably use firewall rules to enforce IPsec transport later. 
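The enforcement step mentioned above ("we'll probably use firewall rules to enforce IPsec transport later") can be sketched as generated iptables rules using the real `-m policy` match, which accepts a peer's traffic only when it arrived through an IPsec policy and drops the cleartext fallback. The peer addresses, chain choice, and rule layout here are illustrative assumptions, not production config:

```python
def ipsec_enforcement_rules(peers):
    """Sketch: for each peer, accept only packets that matched an IPsec
    policy (iptables' `policy` match) and drop everything else from that
    peer, removing the silent fallback to cleartext transport."""
    rules = []
    for peer in peers:
        # accept packets that arrived via an IPsec policy from this peer...
        rules.append(f"iptables -A INPUT -s {peer} "
                     f"-m policy --dir in --pol ipsec -j ACCEPT")
        # ...and drop anything else from it (the cleartext fallback path)
        rules.append(f"iptables -A INPUT -s {peer} -j DROP")
    return rules
```

With rules like these in place, a misconfigured or disabled IPsec endpoint fails loudly (traffic dropped) instead of silently reverting to plaintext.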
[15:53:18] ok, that works [15:54:25] (03CR) 10Alexandros Kosiaris: Hit Parsoid directly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/196229 (owner: 10Mobrovac) [15:54:35] (03PS1) 10coren: Update maintain-replicas with recent schema changes [software] - 10https://gerrit.wikimedia.org/r/196249 [15:54:40] 6operations, 5Patch-For-Review: reclaim lsearchd hosts - https://phabricator.wikimedia.org/T86149#1113295 (10Cmjohnson) [15:54:41] 6operations, 10ops-eqiad: wipe search* and searchidx* hosts - https://phabricator.wikimedia.org/T92434#1113293 (10Cmjohnson) 5Open>3Resolved This was already part of an older phab ticket and is done. All servers have been wiped, they've been changed to asset tags and added to the spares list on wikitech.... [15:57:40] 6operations, 10ops-eqiad: dysprosium net / disk issues for reuse as cache box - https://phabricator.wikimedia.org/T83070#1113299 (10Cmjohnson) 5Open>3Resolved I must've disconnected the DAC cable from the switch side. Fixed xe-8/0/32 up up dysprosium [15:58:15] 7Blocked-on-Operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Some traffic is not identified as Zero in Varnish - https://phabricator.wikimedia.org/T88366#1113301 (10Yurik) [15:58:25] 6operations, 10ops-eqiad: mc1014 server has been flaking out and dropping connectivity - https://phabricator.wikimedia.org/T91773#1113302 (10Cmjohnson) a:5Cmjohnson>3chasemp Haven't seen any more icinga alerts an all appears to be normal. Assigning to Chase to review and resolve as necessary. [15:59:46] (03CR) 10Mobrovac: "@akosiaris, well, I wouldn't like to hard-code the hostname either, but can't see a way to get it nor the port out of hiera ATM. Have you " [puppet] - 10https://gerrit.wikimedia.org/r/196229 (owner: 10Mobrovac) [16:00:05] kart_, ^d: Dear anthropoid, the time has come. Please deploy Content Translation/cxserver (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150312T1600). 
[16:01:08] 6operations, 6Labs, 10hardware-requests: Hardware for Designate - https://phabricator.wikimedia.org/T91277#1113310 (10RobH) server holmium is now allocated for this task. I'll create the linked tickets for its setup. [16:01:41] (03CR) 10Tim Landscheidt: "Somewhat in here or the accompanying changes triggered that now at least in Tools, each Puppet run leaves:" [puppet] - 10https://gerrit.wikimedia.org/r/183814 (owner: 10Faidon Liambotis) [16:02:54] YuviPanda: ^^, I'm a little busy atm [16:02:57] 6operations, 10ops-eqiad: cp1047 down - https://phabricator.wikimedia.org/T88045#1113318 (10BBlack) a:5BBlack>3Cmjohnson Just rebooted into bios setup to enable HT, then rebooted for PXE, and saw: ``` Error: Memory initialization warning detected. MEMBIST Memory Test failure DIMM B5 ``` Then it halts to... [16:03:07] YuviPanda: probably needs a salt rm -rf /etc/ssh/userkeys/ubuntu [16:03:20] YuviPanda: I don't think there's anything creating that anymore (I hope) [16:04:17] paravoid: yeah, I’ll take care of labs [16:04:25] thx [16:06:24] paravoid: can you see it now? 
[16:06:34] papaul: hey [16:06:45] are you able to see it [16:06:49] on scs-c8 [16:06:57] so yeah, you didn't replace a "console port", the part is the whole routing engine :) [16:07:03] ok [16:07:14] so put the old one back in pfw1 as well [16:07:23] paravoid: [16:07:33] ok [16:07:48] paravoid: so what are we going to do next [16:07:58] we'll have to plan this a bit better [16:08:05] paravoid: on [16:08:07] ok [16:08:28] 6operations, 6Labs, 10hardware-requests: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1113340 (10RobH) 3NEW a:3RobH [16:08:35] paravoid: i have 10 days to return the old parts [16:08:44] so i let you decide [16:08:49] 6operations, 6Labs, 10hardware-requests: Hardware for Designate - https://phabricator.wikimedia.org/T91277#1113347 (10RobH) [16:08:50] 6operations, 6Labs, 10hardware-requests: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1113340 (10RobH) [16:08:56] papaul: since when? [16:09:00] 6operations, 6Labs, 10hardware-requests: Hardware for Designate - https://phabricator.wikimedia.org/T91277#1078794 (10RobH) [16:09:01] 6operations, 6Labs, 10hardware-requests: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1113340 (10RobH) [16:09:03] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:09:05] hi, user reporting that commons is very slow from europe (i can confirm) [16:09:09] papaul: I mean, when did those 10 days start counting? 
[16:09:11] paravoid: that is what the email mentioned [16:09:14] 6operations, 6Labs, 10hardware-requests: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1113340 (10RobH) [16:09:15] 6operations, 6Labs, 10hardware-requests: Hardware for Designate - https://phabricator.wikimedia.org/T91277#1113354 (10RobH) 5Open>3Resolved a:3RobH [16:09:24] paravoid: since monday [16:09:34] paravoid: this monday [16:09:36] 6operations, 6Labs, 10hardware-requests: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1113340 (10RobH) [16:09:38] ok [16:09:54] paravoid: on i am putting back the old one in pfw [16:09:57] 1 [16:10:00] okay, thanks! [16:10:02] bblack: are you working on esams upload caches perhaps? [16:10:21] (03CR) 10Alexandros Kosiaris: [C: 04-1] "LGTM, minor typos/comments" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/196198 (owner: 10Hashar) [16:10:27] Steinsplitter: you mean the loading of images or commons wiki itself? [16:10:34] upload [16:11:58] mark: many "The upload succeeded, but the server could not get a preview thumbnail." [16:13:32] does a reload work usually? [16:14:27] no [16:14:32] just tested [16:14:38] matanya: URL? [16:14:42] are you saying -uploading- of images is failing? [16:14:46] or do you mean upload.wikimedia.org? [16:15:24] paravoid: i just looked at the serial console cards, i have a 2GB flash memory card that i can take out. do you think that the configuration is on that flash memory card? if so we can just remove the memory card from the old card and put it in the new card and maybe it will load up the configuration [16:15:42] mark: uploading of images [16:15:43] papaul: no let's not do that [16:15:58] paravoid: ok [16:16:04] paravoid: no url, it is in the process of upload [16:16:26] Steinsplitter: is this what you're seeing as well? [16:16:43] chasemp: btw, image uploads would be a good monitoring test in addition to editing... 
but let's handle that later [16:16:46] i.e. problems with *uploading* images, rather than loading? [16:18:00] paravoid: yes [16:18:31] so to confirm, you can browse the site fine, but you're experiencing slowness while trying to upload, right? [16:18:43] _joe_: still here? [16:19:22] confirm. [16:20:21] gi11es: here? [16:20:32] this sounds potentially related to T84842 [16:20:55] Steinsplitter: so what exactly is slow? [16:21:19] uploading [16:21:59] matanya said that the upload worked but the preview thumbnail was the issue [16:22:10] it uploads at last [16:22:12] paravoid: pfw1 is back up on scs-c8 on port 1 [16:22:13] but very slow [16:22:22] and without a thumbnail [16:22:38] paravoid: i scrolled back and overlooked that. sorry for being not specific enough. [16:23:08] (03CR) 10Alexandros Kosiaris: [C: 04-1] "LGTM, minor comment." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/195896 (https://phabricator.wikimedia.org/T89875) (owner: 10Mobrovac) [16:23:18] ^d: around? [16:23:35] It looks like I messed up the timezone. [16:24:03] anyone else deploying anything? [16:24:34] <_joe_> paravoid: yes, I got a coffee [16:24:40] <_joe_> whatsup? [16:25:02] <_joe_> the HHVM imagescaler is out of rotation ATM [16:25:22] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 116, down: 0, dormant: 0, excluded: 0, unused: 0 [16:26:33] <_joe_> I was about to repool it, but I'll abstain from doing so [16:27:25] <_joe_> Steinsplitter, matanya can you give me the url of one of said images? [16:27:48] mark understood, will make a note [16:27:56] _joe_: https://commons.wikimedia.org/w/index.php?title=Special:ListFiles/Matanya&ilshowall=1 [16:28:02] all those for today [16:28:26] <_joe_> matanya: you don't see the thumbnails? 
[16:28:38] after the upload it do [16:28:43] but not during the upload [16:28:52] 6operations, 10ops-eqiad: mc1014 server has been flaking out and dropping connectivity - https://phabricator.wikimedia.org/T91773#1113434 (10chasemp) Thanks! [16:29:02] https://phabricator.wikimedia.org/T90599 ? [16:29:04] during the upload I get: The upload succeeded, but the server could not get a preview thumbnail. [16:29:25] and it takes quite some time to upload [16:29:58] <_joe_> paravoid: that's the known bug with uploadwizard that gilles named in the imagescaler bug [16:30:01] Krinkle: Thanks for the hint (jscs), I thought I looked for something like that [16:30:27] when i'm not using UW, it is faster, is that possible ? [16:31:12] PROBLEM - Graphite Carbon on graphite2001 is CRITICAL: CRITICAL: Not all configured Carbon instances are running. [16:31:21] PROBLEM - uWSGI web apps on graphite2001 is CRITICAL: CRITICAL: Not all configured uWSGI apps are running. [16:34:01] 6operations, 10ops-codfw: codw pfw* serial connections problem - https://phabricator.wikimedia.org/T84737#1113443 (10Papaul) Replacing the serial console card on the system caused the whole system to lose the initial set-up the reason being, the routing engine is part of the card. As discuss on IRC with Faidon... [16:35:01] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1113448 (10GWicke) [16:35:34] robh: you around? [16:35:40] Yep [16:35:47] hey [16:36:05] I have a question re the process for the HTML dump host [16:36:09] https://phabricator.wikimedia.org/T91853 [16:36:53] do you think we could use one of those spares, and if so, what's the process / timeline? 
[16:37:43] This is the first I've heard about this particular server request (seems new so thats normal) [16:38:54] usually its fast (a day to three) [16:39:23] cool ;) [16:40:00] those spares are a bit overkill really, but they seem to be the only ones available with the storage [16:40:20] !log kartik Started scap: Update ContentTranslation [16:40:27] Logged the message, Master [16:40:43] Nikerabbit: ^d ^^ [16:41:12] gwicke: so the specs of the spare you chose are primarily due to the large disks right? [16:41:44] since we really have nothing else even close to that capacity (not sure if the high memory or cpu count is super relevant for this role) [16:41:46] (03CR) 10Yuvipanda: "salted the ubuntu keys out, things should be all good now." [puppet] - 10https://gerrit.wikimedia.org/r/183814 (owner: 10Faidon Liambotis) [16:41:50] mark: no, I'm not, what's the issue with esams caches? [16:41:53] robh: could consider using one of them temporarily only & then switching to another box later [16:41:54] paravoid: ^ salt done, things seem ok, etc [16:41:55] im not likely to hold up allocation for that reason, but i'd note it on the task [16:42:11] so something with a single cpu and half the memory would likely also work? [16:42:21] yes [16:42:27] it's primarily about disks [16:42:33] bblack: I don't think it's the esams caches [16:42:39] well, not the upload ones at least [16:42:46] !log kartik Finished scap: Update ContentTranslation (duration: 02m 25s) [16:42:51] possibly the text ones? [16:42:51] Logged the message, Master [16:42:56] possibly [16:43:03] cmjohnson1: Do we have any 3TB or larger spare sata disks on site? (I ask due to old lsearch hosts being good for use but small disks) [16:43:04] more users are experiencing https://phabricator.wikimedia.org/T90599 [16:43:12] scap was fast. [16:43:41] gwicke: so yea, i'm checking if we have spare disks. 
i dont really think its worth ordering spare disks to slap in an old out of warranty server for this, when we have the three slightly overprovisioned boxes for this [16:43:44] but checking [16:43:55] could be unrelated to varnish/nginx entirely [16:44:03] robh: we have some SATA disks...lemme check on the amount [16:44:29] paravoid: only known work on prod esams cache clusters is: gage deployed the ipsec stack on the texts (but it's not on the corresponding esams boxes, so no traffic should be using it yet) [16:44:38] cmjohnson1: cool, the followup is if we do have 3tb spares, how many and can they be installed into one of the old lsearch wmf3152 to wmf3175 [16:44:42] this morning, I mean. there were reinstalls yesterday [16:45:11] "corresponding eqiad boxes" you mean? [16:45:21] i was about to reinstall another one, should i wait? [16:45:35] robh, cmjohnson1: that sounds great [16:45:53] paravoid: yes [16:46:40] (03PS1) 10Dzahn: depool cp1066 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196254 [16:46:55] intra-DC frontend->backend should be affected, though, no? [16:47:14] ah there are no config stanzas for it [16:47:31] right it's only config'd for x-dc [16:47:50] (03CR) 10Tim Landscheidt: "Thanks, seems to work." [puppet] - 10https://gerrit.wikimedia.org/r/183814 (owner: 10Faidon Liambotis) [16:49:28] robh: so no 3TB's left. I have 9 1TB and many 500GB [16:49:49] yea... i dont think its worth ordering spares to use, better to just allocate a spare system entirely [16:50:26] (03CR) 10Alexandros Kosiaris: "@mobrovac, no I am fine with it as long as it is what you intended. The default has no port hence my question." [puppet] - 10https://gerrit.wikimedia.org/r/196229 (owner: 10Mobrovac) [16:51:44] (03PS1) 10Cmjohnson: fixing typo in wmnet file [dns] - 10https://gerrit.wikimedia.org/r/196256 [16:51:59] jgage: here? 
[16:52:09] hi [16:52:13] hi [16:52:22] (03CR) 10Cmjohnson: [C: 032] fixing typo in wmnet file [dns] - 10https://gerrit.wikimedia.org/r/196256 (owner: 10Cmjohnson) [16:52:25] did you deploy ipsec in all of amssq*? [16:52:35] (see above) [16:53:25] checking scrollback now. i did configure the esams side but the transports are not established because the eqiad side is not configured. [16:53:39] 7Blocked-on-Operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1113513 (10Ottomata) >>>* Tag all traffic with other information like https and proxy >>https has to be done on varnishes, for sure. We can't infer this very... [16:55:15] that was quite a risky move [16:55:34] 6operations, 5Patch-For-Review, 3wikis-in-codfw: setup & deploy rdb2001-2004 - https://phabricator.wikimedia.org/T92011#1113521 (10Dzahn) ah! thanks @faidon. I will grab rdb2004 joe took rdb2003. [16:55:37] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1113522 (10RobH) I've chatted with Gabriel and Ariel about this particular request in IRC (plus reviewed the linked tasks.) This is an actual need, and discu... [16:56:00] let's start with two hosts, first depooled and then slowly ramp up prod traffic to them [16:56:17] also, is there a reason why we're doing it only for cross-DC traffic? [16:56:19] ok. just discussed this with mark. [16:56:32] where else would we do it? intra-dc? [16:56:35] yes [16:56:47] that has never been the plan, but we could easily do that if desired [16:57:08] why intra-dc? [16:57:37] that has never been the plan indeed [16:58:04] hrm, ok [16:58:18] more importantly, how are we going to secure kafka? 
[16:58:39] (and udp2log, although I guess there's little point in doing that now) [16:59:06] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1113534 (10RobH) Also perhaps @arielglenn could offer insight as one of the opsen with knowledge of dumps? We chatted some in IRC, but on task is ideal. My... [16:59:07] we haven't discussed that before, but it would be easy to apply ipsec to kafka traffic [16:59:31] well either that or TLS [16:59:38] if we don't do that, there's little point to the whole project :) [16:59:39] 7Blocked-on-Operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1113538 (10Yurik) >>! In T89177#1113513, @Ottomata wrote: >>Some traffic goes through proxies like Opera-mini (compressing proxy specifically for mobile vers... [16:59:42] agreed [16:59:52] gwicke: I've updated the task with the info on which system to use and such. I need to get feedback from apergos (and then mark needs to be ok with the allocation), since its hardware and all [17:00:04] but once thats done its fast if we go with the onsite system [17:00:29] https://cwiki.apache.org/confluence/display/KAFKA/Security [17:00:30] if we have to order something, it is 3-4 days (for hard disks) or 2-3 weeks for dell server (alternatives to using the spare server onsite) [17:00:50] but, it has movement now and its on my radar [17:01:00] robh: cool, thank you! 
[17:01:08] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1113541 (10RobH) a:3RobH [17:01:36] (03PS1) 10Cmjohnson: Adding dhcp entries for cp1071-74 with jessie installer [puppet] - 10https://gerrit.wikimedia.org/r/196258 [17:01:47] 6operations, 10ops-esams: Rack, cable, prepare cp3023-3046 - https://phabricator.wikimedia.org/T92514#1113544 (10mark) 3NEW [17:01:57] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1113552 (10ArielGlenn) I chatted with Gabriel about this and we agreed that locally generating the dumps on a host plus keeping one run around to use as input... [17:02:14] jgage: do you have any plans for monitoring the state of ipsec? [17:02:15] robh: we could also consider using one of the spares temporarily until one of the older boxes is fitted with disks [17:02:30] jgage: I find it a bit surprising that while traffic is not encrypted now, we are not getting any alerts [17:02:51] this is intentional now of course, but what if it wasn't and was a misconfiguration or something [17:03:09] robh: the first iteration of those dumps will basically just be running a script manually, so easy to move that to another node [17:03:25] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1113556 (10RobH) But the cpu and memory requirements for the intended host are not as high as wmf4543 (Xeon E5-2450 v2, 64GB ram) right? [17:04:19] gwicke: please note that on task. (i could but then its me saying im quoting you from irc ;) [17:04:26] that does make allocation easier (imo) [17:04:44] paravoid: yeah monitoring is definitely needed and is part of my plan. will open a ticket for that today to discuss details. 
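The IPsec monitoring being asked for above could take the shape of a nagios-style check that counts established IKE security associations against the expected peer count. `ipsec status` and its ESTABLISHED lines are real strongSwan output; the expected count, sample text, and exit-code policy below are an illustrative sketch, not the check that was eventually written:

```python
def check_ipsec_sas(status_output, expected):
    """Parse `ipsec status` (strongSwan) text and return a nagios-style
    (exit_code, message) pair: 0/OK when all expected IKE SAs are up,
    2/CRITICAL otherwise (so missing encryption alerts instead of
    silently falling back to cleartext)."""
    up = sum(1 for line in status_output.splitlines() if "ESTABLISHED" in line)
    if up < expected:
        return 2, f"CRITICAL: {up}/{expected} IKE SAs established"
    return 0, f"OK: {up}/{expected} IKE SAs established"

# Sample output loosely modeled on strongSwan's status listing:
sample = (
    "Security Associations (1 up, 1 connecting):\n"
    "   cp3030-v4[2]: ESTABLISHED 14 minutes ago\n"
    "   cp3031-v4[1]: CONNECTING\n"
)
```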
[17:04:49] as most dump stuff is fire and let it sit, i'd assume transitioning to a different server would be painful [17:04:56] seems this isnt that =] [17:04:58] jgage: great, thanks [17:05:00] monitoring is a very good point [17:05:05] that should be done before deployment [17:05:05] robh: *nod* [17:05:20] agreed [17:06:02] jgage: finally, please !log when deploying :) I was quite scared for a moment there when I saw that commit message in my backlog [17:06:31] ah yeah. will do. [17:06:54] thank you! [17:07:09] ok, now back to square 1 about that slowness issue [17:07:35] !log rdb2004 - changed serial settings in bios, boots into installer now (T92011) [17:07:42] Logged the message, Master [17:08:04] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1113574 (10GWicke) @RobH, for plain HTML dumps only compression will use significant CPU, and basically nothing significant memory. For other formats there co... [17:08:05] !log rdb2004 .. but then gets the 'malformed IP address' warning like on rbf2001 [17:08:09] Logged the message, Master [17:08:54] <_joe_> mutante: did you try the "server debug" hack?' [17:10:04] <_joe_> if that doesn't work, let's go with trusty for the redis hosts for now [17:10:14] _joe_: not yet, i will [17:10:16] ok [17:10:38] <_joe_> just sync with paravoid, he may want to play with rdb2002 a bit longer [17:10:43] 6operations, 5Patch-For-Review, 3wikis-in-codfw: setup & deploy rdb2001-2004 - https://phabricator.wikimedia.org/T92011#1113589 (10Dzahn) rbf2004 - serial communication was set to off, changed to on via com2, serial port addres was device1=com2, device2=com1, switched around red... [17:10:49] yes please [17:12:02] yes, just touching rdb2004 for now [17:12:15] hrmm. 
the same issue now as on rbf2001 though [17:16:40] (03CR) 10Alexandros Kosiaris: [C: 032] reprepro: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183818 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [17:17:11] akosiaris: \o/ [17:17:21] akosiaris: assuming you’re merging as well :D [17:17:26] * YuviPanda finds C+2 without merge confusing [17:18:13] RECOVERY - Graphite Carbon on graphite2001 is OK: OK: All defined Carbon jobs are runnning. [17:21:42] PROBLEM - Graphite Carbon on graphite2001 is CRITICAL: CRITICAL: Not all configured Carbon instances are running. [17:22:19] 7Blocked-on-Operations, 6operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1113679 (10mark) [17:22:43] YuviPanda: yeah, I noticed that discussion on that ticket [17:22:59] <_joe_> YuviPanda: I think what alex does is right [17:23:19] nobody expects it though [17:23:27] yeah, like the spanish inquisition [17:23:33] +2 means for some reason "I'll shepherd into production" [17:23:44] <_joe_> YuviPanda: you ruined my youtube link [17:23:44] generally, across all repos, if you +2, then either jenkins merges for you, or you merge it. [17:23:51] _joe_: haha :D [17:24:29] yeah, I 'll merge that one as well [17:24:33] submitting is not really tied to the vote though [17:24:50] (03CR) 10Alexandros Kosiaris: "Cleaned up /var/lib/reprepo/.ssh manually on caesium." [puppet] - 10https://gerrit.wikimedia.org/r/183818 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [17:25:31] i'm used to them from alex now and i appreciated the difference between "+1 as in "that sounds like a good idea" vs. 
"+2 i actually checked this, but you go ahead and merge your change" [17:26:45] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1113701 (10Kelson) @RobH, for the ZIM files, I need more CPU resources than for HTML only Parsoid dumps; see my previous emails to get more details about t... [17:27:26] (03CR) 10coren: [C: 032] "Tested to match status-quo" [software] - 10https://gerrit.wikimedia.org/r/196249 (owner: 10coren) [17:27:29] so much easier to do these discussions about hardware in our post phabricator world. [17:27:37] * robh is still quite enamored by phabricator. [17:27:59] * gwicke concurs with robh [17:28:00] (03PS1) 10coren: Generate db mapping from maintain-replicas.pl [software] - 10https://gerrit.wikimedia.org/r/196271 [17:30:44] 6operations, 10Analytics, 6Scrum-of-Scrums, 10Wikipedia-App-Android-App, and 3 others: Avoid cache fragmenting URLs for Share a Fact shares - https://phabricator.wikimedia.org/T90606#1113713 (10dr0ptp4kt) The iOS client side code is good. Merged that in. [17:31:13] (03PS1) 10coren: Actually populate meta_p.wiki.size [software] - 10https://gerrit.wikimedia.org/r/196272 (https://phabricator.wikimedia.org/T90084) [17:31:32] 6operations, 10Analytics, 6Scrum-of-Scrums, 10Wikipedia-App-Android-App, and 3 others: Avoid cache fragmenting URLs for Share a Fact shares - https://phabricator.wikimedia.org/T90606#1113721 (10dr0ptp4kt) a:5Fjalapeno>3dr0ptp4kt [17:31:40] 6operations, 10Analytics, 6Scrum-of-Scrums, 10Wikipedia-App-Android-App, and 3 others: Avoid cache fragmenting URLs for Share a Fact shares - https://phabricator.wikimedia.org/T90606#1063059 (10dr0ptp4kt) a:5dr0ptp4kt>3None [17:34:58] 7Blocked-on-Operations, 6operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1113730 (10mark) >>! 
In T89177#1113146, @Yurik wrote: > First, for the reasoning: banners, compression, analytics. > > # We need to show Ze... [17:37:32] (03PS1) 10RobH: setting holmium (designate server) dns entry [dns] - 10https://gerrit.wikimedia.org/r/196275 [17:38:12] PROBLEM - puppet last run on sodium is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [17:38:36] 6operations, 6Labs, 10hardware-requests: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1113740 (10RobH) [17:40:07] 7Blocked-on-Operations, 6operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1113743 (10Yurik) >>! In T89177#1113730, @mark wrote: > > @Yurik: could you elaborate on this? Why has this become a much bigger issue recent... [17:41:12] (03PS1) 10Yuvipanda: beta: Remove unused wmf-beta-scap file [puppet] - 10https://gerrit.wikimedia.org/r/196276 [17:41:14] (03PS1) 10Yuvipanda: beta: Kill deployment-rsync01 [puppet] - 10https://gerrit.wikimedia.org/r/196277 [17:41:31] (03PS2) 10Yuvipanda: beta: Remove unused wmf-beta-scap file [puppet] - 10https://gerrit.wikimedia.org/r/196276 [17:41:40] (03PS2) 10Yuvipanda: beta: Kill deployment-rsync01 [puppet] - 10https://gerrit.wikimedia.org/r/196277 [17:42:05] (03PS1) 10RobH: setting holmium install params [puppet] - 10https://gerrit.wikimedia.org/r/196278 [17:42:28] (03CR) 10Yuvipanda: [C: 032 V: 032] beta: Remove unused wmf-beta-scap file [puppet] - 10https://gerrit.wikimedia.org/r/196276 (owner: 10Yuvipanda) [17:42:46] (03PS3) 10Yuvipanda: beta: Kill deployment-rsync01 [puppet] - 10https://gerrit.wikimedia.org/r/196277 [17:42:56] 6operations, 6Labs, 10hardware-requests: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1113761 (10RobH) [17:43:00] (03CR) 10Yuvipanda: [C: 032 V: 032] beta: Kill deployment-rsync01 [puppet] - 10https://gerrit.wikimedia.org/r/196277 (owner: 10Yuvipanda) [17:43:21] (03CR) 10RobH: [C: 
032] setting holmium install params [puppet] - 10https://gerrit.wikimedia.org/r/196278 (owner: 10RobH) [17:43:43] (03CR) 10RobH: [C: 032] setting holmium (designate server) dns entry [dns] - 10https://gerrit.wikimedia.org/r/196275 (owner: 10RobH) [17:45:00] (03CR) 10Alexandros Kosiaris: "I 've been indeed doing what Daniel describes in the past. Then I realized I am the only one making that distinction (thanks Daniel for po" [puppet] - 10https://gerrit.wikimedia.org/r/195913 (owner: 10Faidon Liambotis) [17:45:41] (03CR) 10Yuvipanda: "YESSSSSS" [puppet] - 10https://gerrit.wikimedia.org/r/195913 (owner: 10Faidon Liambotis) [17:45:42] 6operations, 6Phabricator, 7Mail: Phabricator mails Message-ID has localhost.localdomain - https://phabricator.wikimedia.org/T75713#1113777 (10chasemp) I spoke with upstream about this and the general idea is that Amazon http://aws.amazon.com/ses/ mangles the message-id in some circumstances and so they pref... [17:46:02] PROBLEM - puppet last run on xenon is CRITICAL: CRITICAL: puppet fail [17:49:48] (03CR) 10Dzahn: "i think akosaris was right but having the same expectations is also important" [puppet] - 10https://gerrit.wikimedia.org/r/195913 (owner: 10Faidon Liambotis) [17:50:21] gwicke, heheh https://wikitech.wikimedia.org/w/index.php?title=Squids&oldid=17057 <3 [17:51:51] !log restarted populateListOfUsersToBeRenamed.php on terbium (CentralAuth) [17:51:58] Logged the message, Master [17:52:54] (03PS4) 10Nuria: Adding a Last-Access cookie to text and mobile requests [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T92435) [17:55:21] (03CR) 10Nuria: Adding a Last-Access cookie to text and mobile requests (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T92435) (owner: 10Nuria) [17:56:14] (03PS1) 10Rush: enable mc1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196279 [17:57:02] PROBLEM - Host cp1047 is DOWN: PING CRITICAL - Packet loss 
= 100% [17:57:19] <_joe_> akosiaris: you're not the only one btw [17:57:54] (03CR) 10Mobrovac: "@akosiaris Yep, it is. It's actually set in files/misc/parsoid.upstart#23 ." [puppet] - 10https://gerrit.wikimedia.org/r/196229 (owner: 10Mobrovac) [17:58:05] MaxSem: ye olden days ;) [17:58:23] I bothered mark about this quite a bit when I started and the outcome was "a +1 from an ops means ready to merge / I approve" [17:58:33] RECOVERY - uWSGI web apps on graphite2001 is OK: OK: All defined uWSGI apps are runnning. [17:59:11] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1113836 (10brion) >>! In T84842#1112319, @Joe wrote: > @brion problem confirmed, but this would be relevant for the videoscalers, right? ah good if video scalers are a separa... [17:59:35] (03PS1) 10Rush: enable mc1014 [puppet] - 10https://gerrit.wikimedia.org/r/196281 [17:59:55] ^ cp1047 is ok, I'll re-up the downtime on it, sorry [18:00:44] _joe_: paravoid you might like the current size of the beta/ module :) [18:00:56] I can probably still trim a file or two [18:01:52] PROBLEM - uWSGI web apps on graphite2001 is CRITICAL: CRITICAL: Not all configured uWSGI apps are running. 
[18:04:30] !log started uWSGI on graphite2001 [18:04:36] Logged the message, Master [18:04:41] RECOVERY - graphite.wikimedia.org on graphite2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.211 second response time [18:04:50] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:05:32] (03CR) 10Mobrovac: Puppetise Citoid's configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/195896 (https://phabricator.wikimedia.org/T89875) (owner: 10Mobrovac) [18:05:53] (03PS4) 10Mobrovac: Puppetise Citoid's configuration [puppet] - 10https://gerrit.wikimedia.org/r/195896 (https://phabricator.wikimedia.org/T89875) [18:06:01] (03CR) 10Mattflaschen: [C: 04-1] "Code looks good and works locally. However, blocked on https://phabricator.wikimedia.org/T91086 and https://phabricator.wikimedia.org/T89" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196068 (https://phabricator.wikimedia.org/T90670) (owner: 10EBernhardson) [18:06:17] 6operations, 6Labs, 10hardware-requests: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1113867 (10RobH) [18:06:32] 6operations, 6Labs, 10hardware-requests: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1113340 (10RobH) os install in progress, all previous steps complete [18:06:50] (03CR) 10Rush: [C: 032] enable mc1014 [puppet] - 10https://gerrit.wikimedia.org/r/196281 (owner: 10Rush) [18:07:12] 6operations, 10ops-eqiad: mc1014 server has been flaking out and dropping connectivity - https://phabricator.wikimedia.org/T91773#1113870 (10chasemp) [18:07:13] (03CR) 10Rush: [C: 032] enable mc1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196279 (owner: 10Rush) [18:07:18] 6operations, 10ops-eqiad: mc1014 server has been flaking out and dropping connectivity - https://phabricator.wikimedia.org/T91773#1095950 (10chasemp) [18:09:02] (03CR) 10Alexandros Kosiaris: "I was referring to" 
[puppet] - 10https://gerrit.wikimedia.org/r/196229 (owner: 10Mobrovac) [18:09:52] !log rush Synchronized wmf-config/session.php: mc1014 enable (duration: 00m 06s) [18:09:59] Logged the message, Master [18:10:14] <_joe_> I should count the time I lose every day waiting for a) gerrit b) jenkins [18:10:22] (03PS1) 10Giuseppe Lavagetto: memcached: add host entry for mc2001 [puppet] - 10https://gerrit.wikimedia.org/r/196284 [18:10:56] api down for anyone else? [18:11:01] akosiaris: hehe, it seems we've entered an endless loop with these comments :) [18:11:01] cp1065 is giving me an error message [18:11:06] everything down [18:11:09] I thing [18:11:12] sites 503 for me [18:11:14] mobrovac: my point exactly :-) [18:11:15] All down for me [18:11:20] Wikimedia Error [18:11:28] (03PS2) 10coren: Actually populate meta_p.wiki.size [software] - 10https://gerrit.wikimedia.org/r/196272 (https://phabricator.wikimedia.org/T90084) [18:11:39] Basically 503 [18:11:41] chasemp: ^ [18:11:44] Erm... did you guys have plannned downtime today? [18:11:48] no [18:11:51] can you be more specific? [18:11:52] (03CR) 10Tim Landscheidt: setting holmium (designate server) dns entry (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/196275 (owner: 10RobH) [18:11:59] well great [18:12:04] Request: GET http://en.wikipedia.org/w/index.php?title=Michael_Pawlyn&action=delete, from 10.20.0.155 via cp1065 cp1065 ([10.64.0.102]:3128), Varnish XID 3092852003 [18:12:09] "Request: GET http://en.wikisource.org/wiki/User_talk:Hrishikes, from 10.20.0.135 via cp1055 cp1055 ([10.64.32.107]:3128), Varnish XID 864516636 [18:12:10] bblack, en.wikipedia.org should load, right? :) [18:12:11] Forwarded for: 80.176.129.180, 10.20.0.151, 10.20.0.151, 10.20.0.135 [18:12:12] Error: 503, Service Unavailable at Thu, 12 Mar 2015 18:11:10 GMT " [18:12:15] it loads for me [18:12:18] so that's why I'm asking [18:12:22] I get 503s too [18:12:28] I think the people complaining are all in Europe [18:12:29] Europe problem ? 
[18:12:30] 503 for me as well [18:12:32] (Including me right now) [18:12:36] probably [18:12:37] * Steinsplitter as well [18:12:39] 503s, in Canada. [18:12:39] I'm esams, yes [18:12:42] same error as Qcoder00 here. I'm in UK [18:12:45] akosiaris: i'm good with the change too, i was basically complaining at the fact that parsoid module is rather non-parametrised [18:12:46] Erroring here, too [18:12:50] from south america [18:12:50] (wfm India as well) [18:12:50] "Error: 503, Service Unavailable at Thu, 12 Mar 2015 18:12:35 GMT" [18:12:51] PROBLEM - HHVM rendering on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:12:58] ^ it begins [18:12:58] 503 for me and I'm not europe [18:13:04] I'm in the UK [18:13:05] 503 in france [18:13:05] what is 503'ing [18:13:06] ? [18:13:10] PROBLEM - Apache HTTP on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:13:10] I'm in SFbay and seeing 503s [18:13:11] en.wikipedia.org [18:13:12] Wikimedia sites [18:13:13] chasemp: requests to wikis [18:13:15] Yeah esams is 503ing [18:13:16] So is ulsfo [18:13:19] chasemp, everything [18:13:19] But eqiad seems fine [18:13:19] chasemp: everything? [18:13:25] <_joe_> everything is 503ing for me [18:13:35] same here [18:13:40] it's ok for me so that's interesting [18:13:43] wget -H "Host: en.wikipedia.org" http://text-lb.esams.wikimedia.org [18:13:43] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [500.0] [18:13:52] 7Blocked-on-Operations, 6operations, 6Scrum-of-Scrums, 6Zero, and 2 others: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1113898 (10Yurik) [18:13:55] That 503s for me, ulsfo also 503s, and eqiad 200s [18:13:58] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Application+servers+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [18:14:12] jgage: ping? 
[18:14:14] greg-g: ^ [18:14:21] bah [18:14:23] i suspect mc1014 enable ( [18:14:30] <_joe_> chasemp: maybe after your merge? [18:14:41] PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [500.0] [18:14:48] so deployer people I ran sync-file wmf-config/session.php "mc1014 enable" from /srv/mediawiki [18:14:48] (03PS1) 10Ori.livneh: Revert "enable mc1014" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196286 [18:14:54] i'm on the same page as RoanKattouw [18:14:56] seems to have completed but possibly bad news [18:14:57] <_joe_> ori: yea let's revert that [18:15:02] (03CR) 10Ori.livneh: [C: 032 V: 032] Revert "enable mc1014" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196286 (owner: 10Ori.livneh) [18:15:11] ori: thanks [18:15:17] whitespace before the <?php oh man [18:15:34] !log ori Synchronized wmf-config/session.php: I29542c0965 (duration: 00m 08s) [18:15:37] I would suspect the ipsec stuff, but it's ulsfo too? [18:15:39] Logged the message, Master [18:15:45] wfm now [18:15:46] guillom: not sure what to tweet other than a general "we're experiencing technical issues and we're currently investigating" [18:15:50] WP is 503... guess we shouldn't have had that NSA lawsuit... [18:15:52] working [18:15:55] HaeB: ^ [18:15:58] yeah, things are back [18:16:00] Ouch. [18:16:00] guillom: nvm :) [18:16:06] working for me. thanks! :) [18:16:07] HaeB: nvm on the tweet, we're back it seems [18:16:12] yup [18:16:17] ok thanks guys clearly that was a bad sync for me [18:16:28] so, what went wrong ? [18:16:30] (03CR) 10Se4598: "is this whitespace causing the outage?" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196279 (owner: 10Rush) [18:16:32] http://en.wikipedia.org/wiki/Downtime is my test page, works [18:16:42] mutante++ [18:16:48] that whitespace looks wrong [18:16:54] how the hell did mc1014 affect ulsfo/esams but not eqiad?
[18:16:55] Visiting http://commons.wikimedia.org/wiki/File:Trie_example.svg causes 503 Service Unavailable. Interesting. I can consistently reproduce it. Doesn't happen to ?action=history or other files. [18:17:13] bblack, the whitespace thing at the beginning of the file, perhaps? [18:17:17] What was the content of the file? [18:17:19] bblack: eqiad probably still has a cached version of the main page [18:17:21] PROBLEM - HHVM queue size on mw1120 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [80.0] [18:17:23] <_joe_> bblack: now that I don't understand [18:17:24] "We were briefly experiencing some technical issues, but should be back up now" [18:17:24] Krinkle: seems fine to me [18:17:26] Yes that change is broken [18:17:26] does that work? [18:17:31] ouch =[ [18:17:33] It should be reverted immediately [18:17:37] already reverted [18:17:39] <_joe_> HaeB: I do see the sites [18:17:39] RoanKattouw: yeah but esams/ulsfo backend caches go to eqiad backend caches [18:17:40] and synced [18:17:40] Whitespace before the opening <?php i went to get a drink and a site outage happened and was repaired [18:17:52] <_joe_> ori: I do see the sites from europe [18:17:57] Oh we're back already [18:17:57] robh: That's the best kind, right?
HaeB: yeah [18:17:59] Nice [18:18:08] I go and respond to a different chat and the outage is fixed [18:18:10] HaeB: that's fine [18:18:11] _joe_: hence the "back up now" [18:18:11] guillom: yes, the ouch was for the folks who fixed it, not me ;D [18:18:13] deleting stuff works as well, fwiw [18:18:22] Back for me now [18:18:29] <_joe_> HaeB: ok thanks :) [18:18:48] one more shirt from bd808 :D [18:19:09] (03PS4) 10coren: Labs: puppetize replica-addusers [puppet] - 10https://gerrit.wikimedia.org/r/135445 [18:19:14] HaeB: looks like a 6 minute outage, btw [18:19:16] * bd808 will need to start a fundraising campaign [18:19:17] so I don't see what was broken in that change [18:19:18] now for the post-mortem -- sync-file ought to catch syntax errors [18:19:28] I don't think it was a syntax error [18:19:31] twentyafterfour: Whitespace before the opening <?php it should? It runs php -l [18:19:36] It causes "Headers already sent" errors [18:19:45] RoanKattouw: doh! :) [18:19:47] It's not syntactically invalid [18:19:50] oh, right [18:19:54] yeah, runtime error rather than syntax [18:19:54] It's just something you want to do basically never [18:19:54] * legoktm notes that phpcs catches anything before <?php oh, that sucks, 50 files i uploaded didn't go through, 1 hour work of description and stuff :/ [18:20:14] legoktm: But apparently phpcs doesn't run on wmf-config [18:20:21] It would probably have a seizure anyway [18:20:30] So many lines over 100 chars :D [18:20:30] we should just check for this very specific thing in scap [18:20:52] if filename.endswith('.php') and not file.startswith('<?php') Let's make phpcs voting [18:21:03] at deviantart we had a very strict commit hook that wouldn't accept any file with whitespace before the <?php :D [18:22:17] 6operations, 10MediaWiki-File-management, 6Multimedia: http://commons.wikimedia.org/wiki/File:Trie_example.svg causes 503 Service Unavailable - https://phabricator.wikimedia.org/T92529#1113927 (10Krinkle) 3NEW [18:22:37] akosiaris: What
about http://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Trie_example.svg/213px-Trie_example.svg.png [18:23:27] (03PS5) 10coren: Labs: puppetize replica-addusers [puppet] - 10https://gerrit.wikimedia.org/r/135445 [18:23:30] Krinkle: Headers already sent, terminating is what it says [18:23:35] YuviPanda: You around for ^^ ? [18:23:42] akosiaris: Fun [18:24:38] cache problem by the images? [18:25:21] mobrovac: I 'll merge https://gerrit.wikimedia.org/r/#/c/195896/4 [18:25:49] 6operations, 10Continuous-Integration, 6MediaWiki-Core-Team: add a check for whitespace before leading <?php akosiaris: yep, please [18:25:58] (03CR) 10Yuvipanda: [C: 04-1] "Another 'new'-perl-in-ops-puppet dilemma :( But I think it's ok to put this in as long as we commit to moving it off perl at some point." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/135445 (owner: 10coren) [18:26:13] (03CR) 10Alexandros Kosiaris: [C: 032] Puppetise Citoid's configuration [puppet] - 10https://gerrit.wikimedia.org/r/195896 (https://phabricator.wikimedia.org/T89875) (owner: 10Mobrovac) [18:26:20] YuviPanda: "new". :-) [18:26:26] akosiaris: Krinkle with a clean profile, i can view it, but when refreshing, i get this error [18:26:27] Coren: hence dilemma [18:26:44] terrible + puppetized vs terrible + unpuppetized, so former wins :) [18:27:10] 6operations, 10Continuous-Integration, 6MediaWiki-Core-Team: add a check for whitespace before leading <?php 6operations, 10MediaWiki-File-management, 6Multimedia: http://commons.wikimedia.org/wiki/File:Trie_example.svg causes 503 Service Unavailable - https://phabricator.wikimedia.org/T92529#1113973 (10Tgr) Caused by the outage mentioned in T92531. Varnish apparently cached the error for file URLs (the file page w... [18:27:37] I'm looking at Trie_example.svg... [18:27:45] YuviPanda: /have/ we stopped using generic::upstart_job?
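The scap check floated above (legoktm's `if filename.endswith('.php') and not file.startswith('<?php')` one-liner, which became https://gerrit.wikimedia.org/r/196306) can be sketched in a few lines of Python. The function name and the temp-file demo are illustrative only, not the actual scap code:

```python
# Sketch of the scap check proposed above: reject any .php file whose first
# bytes are not exactly the opening <?php tag. php -l passes such files
# (they are syntactically valid); the leading bytes only bite at runtime,
# when PHP emits them as output and later header() calls fail with
# "Headers already sent". Function name and demo files are hypothetical,
# not the actual scap code (that is Gerrit change 196306).
import tempfile


def has_content_before_php_tag(filename):
    if not filename.endswith('.php'):
        return False
    with open(filename, 'rb') as f:
        return not f.read(5).startswith(b'<?php')


with tempfile.NamedTemporaryFile(suffix='.php', delete=False) as good:
    good.write(b'<?php\necho "hi";\n')
with tempfile.NamedTemporaryFile(suffix='.php', delete=False) as bad:
    bad.write(b' <?php\necho "hi";\n')  # one leading space: breaks headers

print(has_content_before_php_tag(good.name))  # False
print(has_content_before_php_tag(bad.name))   # True
```

Checking raw bytes (rather than decoded text) also catches a UTF-8 BOM before the tag, which triggers the same runtime failure while still passing `php -l`.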
I remember someone ripped it all out [18:28:00] YuviPanda: Not like that's an issue to switch it to a simple file[] [18:28:24] Coren: yup. [18:28:26] 10Ops-Access-Requests, 6operations: Give CCogdill an account on barium - https://phabricator.wikimedia.org/T92533#1113976 (10CCogdill_WMF) 3NEW a:3Jgreen [18:28:30] Coren: in fact, I can’t even find the generic module…? [18:28:56] can we just do a varnish ban for all 503s? [18:29:22] Coren: https://gerrit.wikimedia.org/r/#/c/184296/ [18:29:24] (03CR) 10Mobrovac: "@akosiaris, the default has no port attached to it because it points to the Varnish layer, while the fact in hiera diverts that to hit Par" [puppet] - 10https://gerrit.wikimedia.org/r/196229 (owner: 10Mobrovac) [18:29:45] (03CR) 10coren: Labs: puppetize replica-addusers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/135445 (owner: 10coren) [18:29:51] PROBLEM - HHVM busy threads on mw1120 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [86.4] [18:30:01] tgr: it's not a 503 in varnish, it's a 200 [18:30:08] (503 isn't cached anyways) [18:30:45] 6operations, 10Continuous-Integration, 6MediaWiki-Core-Team: add a check for whitespace before leading <?php >>! In T92531#1113969, @Krinkle wrote: > See {T46875}. > > This and various other errors are already caught by phpcs. Please ensure proj... [18:30:55] YuviPanda: Also the patch is old but got forgotten, I just freshened it up.
:-) [18:31:01] RECOVERY - HHVM busy threads on mw1120 is OK: OK: Less than 30.00% above the threshold [57.6] [18:31:07] (03CR) 10Yuvipanda: Labs: puppetize replica-addusers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/135445 (owner: 10coren) [18:31:11] Coren: that makes sense :) [18:31:21] RECOVERY - HTTP 5xx req/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:31:28] is mediawiki.org broken [18:31:31] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:32:39] aude: shouldn't be now, though there was a 5 min outage [18:32:49] 6operations, 10Continuous-Integration, 6MediaWiki-Core-Team: add a check for whitespace before leading <?php >>! In T92531#1113969, @Krinkle wrote: > See {T46875}. > > This and various other errors are already caught by phpcs. Please ensure project(... [18:32:50] 2Fatal error: clone called on non-object in /srv/mediawiki/php-1.25wmf21/extensions/Flow/includes/Formatter/RecentChanges.php on line 181 [18:32:53] it is [18:32:56] (03CR) 10Jgreen: [C: 04-1] "Shopify has a Primary Domain setting with an option to redirect all other hostnames there. Maybe I'm missing something, but I can't think " (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/196007 (https://phabricator.wikimedia.org/T92438) (owner: 10John F. Lewis) [18:33:38] 6operations, 10Continuous-Integration, 10Incident-20150312-whitespace, 6MediaWiki-Core-Team: add a check for whitespace before leading <?php aude: url? [18:33:52] 7Blocked-on-Operations, 6operations, 6Scrum-of-Scrums, 6Zero, and 2 others: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1114013 (10Nuria) @Yurik My 2 cents and last post in this ticket, as I think decisions here belong to ops team. I think we should separat... [18:34:01] YuviPanda: openstack contains no less than four classes with dashes already.
Unifying them is worthwhile for a subsequent cleanup, but doing it partway or mashing it in with this one patch is bad mojo imo [18:34:08] https://www.mediawiki.org/wiki/Special:RecentChanges [18:34:08] paravoid: have found out something about my slow connection to wmf infrastructure? It's about 8 kB/s, it's terrible :( [18:34:22] Coren: well, this is a new class, and we should’t perpetrate the -shes [18:34:24] consistent with the error logs [18:34:30] filed a bug [18:35:14] bblack: true. A ban on the exact content-length, then? [18:35:27] 6operations, 10Continuous-Integration, 10Incident-20150312-whitespace, 6MediaWiki-Core-Team: add a check for whitespace before leading <?php 3High [18:35:33] YuviPanda: Hm. When was dashness decided against? [18:35:41] * aude does not have flow installed [18:35:56] aude: I can't reproduce [18:36:05] really [18:36:13] it's up for me too, but then again could be I'm hitting eqiad and you are not? [18:36:18] I tried esams too [18:36:30] PROBLEM - HHVM busy threads on mw1120 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [18:36:32] it may be specific to your login or a cookie? [18:36:42] stuff like https://www.mediawiki.org/wiki/MediaWiki_Developer_Summit_2015 is ok [18:36:43] or to a specific cache endpoint... [18:36:49] i am logged in [18:37:20] hm, now it works in firefox but not logged in [18:37:36] Coren: http://projects.puppetlabs.com/projects/puppet/wiki/Allowed_Characters_in_Identifier_Names and others [18:37:38] 6operations, 10Continuous-Integration, 10Incident-20150312-whitespace, 6MediaWiki-Core-Team: add a check for whitespace before leading <?php >>! In T92531#1114010, @ori wrote: >>>! In T92531#1113969, @Krinkle wrote: >> See {T46875}. >> >> This...
[18:37:39] and broken [18:37:41] RECOVERY - HHVM busy threads on mw1120 is OK: OK: Less than 30.00% above the threshold [57.6] [18:37:41] when logged in [18:37:46] 6operations: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1114033 (10Jgreen) Victoria asked for www.store.wikipedia.org in 1:1 conversation before opening the ticket, we need to add that CNAME too. Also does anyone know the history behind the decision to route some of the h... [18:37:48] (it’s old, but still) [18:37:53] plus rest of our code uses dashes [18:38:30] mlitn: [19:32] aude 2Fatal error: clone called on non-object in /srv/mediawiki/php-1.25wmf21/extensions/Flow/includes/Formatter/RecentChanges.php on line 181 [18:38:42] Coren: and see https://docs.puppetlabs.com/puppet/latest/reference/lang_reserved.html#acceptable-characters-in-names [18:39:44] * aude heads home and for food [18:41:55] (03CR) 10Alexandros Kosiaris: [C: 032] "And you are absolutely right. I should have seen it. Merging" [puppet] - 10https://gerrit.wikimedia.org/r/196229 (owner: 10Mobrovac) [18:42:11] PROBLEM - HHVM busy threads on mw1120 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [18:42:40] 6operations, 10MediaWiki-File-management, 6Multimedia: http://commons.wikimedia.org/wiki/File:Trie_example.svg causes 503 Service Unavailable - https://phabricator.wikimedia.org/T92529#1114067 (10Tgr) The deeper problem is that thumb.php returns an error message with a HTTP 200 status if there is output befo... [18:43:20] RECOVERY - HHVM busy threads on mw1120 is OK: OK: Less than 30.00% above the threshold [57.6] [18:43:36] guys, does the class passwords::puppet::database sound familiar to you? [18:43:48] is it a puppet class from the private repo perhaps? [18:44:28] mobrovac: lemme check [18:45:15] mobrovac: yes, it's in the private repo [18:45:50] YuviPanda: Where "the rest" is defined as "some fraction". 
I count no less than 47 classes with dashes in 'em -- but okay, I'll fix this one. [18:45:50] hm, so we're trying to test a patch for the cassandra submodule on cerium [18:46:06] YuviPanda: Fixing the other 46 is left as an exercise to the reader. :-P [18:46:10] it holds the production db pass [18:46:15] any known way of circumventing that class? [18:46:19] urandom: ^^ [18:46:25] if this were a test in labs the way to go would be to add it to labs/private with a fake value [18:47:12] tgr: I've banned the upload caches on content-length==52 [18:47:20] seems to work for the test example with the Trie image [18:47:29] Coren: the language reference I linked to says ‘lowercase letters, numbers, underscores’ :) [18:47:38] Coren: also says 'Note: In some cases, names containing unsupported characters will still work. These cases should be considered bugs, and may cease to work at any time. Removal of these bug cases will not be limited to major releases.' [18:47:43] so no, this is not ‘tyranny of the majority’ :) [18:47:47] cam anyone confirm/provide any other test case for borked images? [18:47:49] you can call it stupidity of the puppet, if you want. [18:48:58] YuviPanda: It contradicts the first document you pointed to as well. So yeah. Just silly puppet. [18:49:01] mobrovac: how are you testing? local puppetmaster? puppet apply? [18:49:11] Coren: yeah, the first one is outdated. [18:49:13] puppet apply [18:49:25] Coren: but ‘dashes work now but that is a bug’ is what _joe_ told me as well, so that matches. [18:49:29] (03PS6) 10coren: Labs: puppetize replica-addusers [puppet] - 10https://gerrit.wikimedia.org/r/135445 [18:49:35] but --noop is not enough, we want to see the stuff applied [18:49:51] mobrovac: could you just set $puppet_production_db_pass = 'snakeoil' ? 
[18:49:57] that's all it does [18:50:16] 6operations, 10MediaWiki-File-management, 6Multimedia: http://commons.wikimedia.org/wiki/File:Trie_example.svg causes 503 Service Unavailable - https://phabricator.wikimedia.org/T92529#1114106 (10Tgr) p:5Triage>3High [18:50:17] ah ok, will try adding that [18:50:19] thnx mutante [18:50:22] yw [18:50:33] urandom: ^^ [18:52:09] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia: http://commons.wikimedia.org/wiki/File:Trie_example.svg causes 503 Service Unavailable - https://phabricator.wikimedia.org/T92529#1114112 (10Steinsplitter) [18:52:39] (03CR) 10Yuvipanda: [C: 031] "I also just realized that there's no docs on what exactly this does, so a bunch of lines about that (docs on the puppet class, maybe?) see" [puppet] - 10https://gerrit.wikimedia.org/r/135445 (owner: 10coren) [18:53:09] (03CR) 10Yuvipanda: "(+1 on perl + puppetized > perl + unpuppetized)" [puppet] - 10https://gerrit.wikimedia.org/r/135445 (owner: 10coren) [18:53:13] YuviPanda: Not entirely surprising, this adds users to the replicas. :-) But yeah. [18:53:40] (03CR) 10Alexandros Kosiaris: [C: 032] puppet: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183822 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [18:53:50] Coren: true, but what users? which replicas? why is it on the NFS servers? etc, etc [18:53:51] Krinkle: ping re images, were there known cases aside from http://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Trie_example.svg/213px-Trie_example.svg.png ? [18:54:55] Labs users, the db replicats, and it's there because that's the only place from which you can write the replica.my.cnf to all projects. But yes, that could bear being written down. [18:55:03] yup :) [18:55:43] (03CR) 10coren: [C: 032] "Puppetize status-quo." [puppet] - 10https://gerrit.wikimedia.org/r/135445 (owner: 10coren) [18:56:12] bblack: I dont know. Maybe there's logs? 
[18:56:14] Coren: can you document the answerse you gave me as well? [18:56:25] bblack: I imagine any image that was a cache miss during the outage. [18:56:40] * YuviPanda goes to sleep. [18:56:41] Probably at least a couple 100 [18:57:02] bblack: How was http://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Trie_example.svg/213px-Trie_example.svg.png fixed? [18:57:10] YuviPanda|zzz: Will do so in a further patch. [18:57:18] Coren: cool :) [18:57:51] PROBLEM - HHVM busy threads on mw1120 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [86.4] [18:58:32] Krinkle: I cleared from the upload caches all files with size==52 (which was the length of that "Header sent" or whatever content in the Trie example) [18:58:50] tgr: since you're here [18:59:00] bblack: Interesting. [18:59:02] PROBLEM - puppet last run on labstore1001 is CRITICAL: CRITICAL: Puppet has 1 failures [18:59:04] bblack: How does that work? [18:59:15] do you mean how does it work internally in varnish? [18:59:34] bblack: Yeah, how did you find those and purge them. [18:59:42] it basically filters the existing cache objects, if their size==52 and they existing before the ban was implemented they're ignored [19:00:10] using varnishadm like this: https://www.varnish-cache.org/docs/3.0/tutorial/purging.html#bans [19:00:17] but on obj.http.content-length==52 [19:00:31] Krinkle: Yeah it's not about going out to find them as much as it is about checking at use time [19:00:42] (03PS1) 10coren: replica-addusers: moar documentation [puppet] - 10https://gerrit.wikimedia.org/r/196300 [19:00:43] 10Ops-Access-Requests, 6operations: Give CCogdill an account on barium - https://phabricator.wikimedia.org/T92533#1114168 (10Jgreen) I got some additional background info from Adam, and I can't support putting regular users on a production system to intervene on broken tools. I think we should be able to come... 
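bblack's explanation above (filter the existing cache objects, ignoring any whose size matches and which existed before the ban was implemented, per the linked Varnish bans tutorial) can be illustrated with a toy model. This is a sketch of the lazy-ban semantics only, not Varnish internals, and the exact `varnishadm` ban expression used (presumably something along the lines of `ban obj.http.content-length == 52`) is an assumption based on bblack's description:

```python
# Toy model of the lazy "ban" semantics described above, NOT Varnish's real
# implementation: a ban is recorded with a timestamp and evaluated at lookup
# time; only objects cached *before* the ban was added are discarded.
class ToyCache:
    def __init__(self):
        self.clock = 0     # logical clock to order stores and bans
        self.objects = {}  # url -> (stored_at, body)
        self.bans = []     # list of (added_at, predicate)

    def tick(self):
        self.clock += 1
        return self.clock

    def store(self, url, body):
        self.objects[url] = (self.tick(), body)

    def ban(self, predicate):
        # predicate inspects the cached body, e.g. content-length == 52
        self.bans.append((self.tick(), predicate))

    def lookup(self, url):
        if url not in self.objects:
            return None                # cache miss
        stored_at, body = self.objects[url]
        for added_at, pred in self.bans:
            if added_at > stored_at and pred(body):
                del self.objects[url]  # discard, proceed as a cache miss
                return None
        return body


cache = ToyCache()
cache.store('/213px-Trie_example.svg.png', b'x' * 52)  # cached error body
cache.ban(lambda body: len(body) == 52)                # ban added afterwards
print(cache.lookup('/213px-Trie_example.svg.png'))     # None: object banned
cache.store('/213px-Trie_example.svg.png', b'y' * 52)  # re-stored after ban
print(cache.lookup('/213px-Trie_example.svg.png'))     # kept: newer than ban
```

This is why a ban is cheap to issue ("not about going out to find them as much as checking at use time"): nothing is walked or deleted up front, and the last store above survives even though its length matches, because it postdates the ban.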
[19:00:48] I don't know for sure that all affected images had that exact same error/length though [19:01:10] "I'm about to use this cached object, but it satisfies this ban rule and it existed before the rule was put in place, so I'll ignore/discard this cached object and proceed as if this was a cache miss" [19:01:20] right [19:01:38] tgr: we got two reports for https://phabricator.wikimedia.org/T90599 today [19:01:48] (03CR) 10Alexandros Kosiaris: [C: 032] openstack: transition nova to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183819 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [19:02:44] 10Ops-Access-Requests, 6operations: Give CCogdill an account on barium - https://phabricator.wikimedia.org/T92533#1114181 (10CCogdill_WMF) @jgreen I'm totally happy with a workaround. I'll work with @awight to see if there are any other options. [19:02:50] (03CR) 10coren: [C: 032] "Tested and working." [software] - 10https://gerrit.wikimedia.org/r/196272 (https://phabricator.wikimedia.org/T90084) (owner: 10coren) [19:03:21] paravoid: yeah, that's an ongoing bug [19:03:59] probably just the standard issue of scaling of large files timing out, except something somewhere in the stashed upload call chain makes timeout limits more aggressive [19:04:36] (03CR) 10BBlack: [C: 031] Adding dhcp entries for cp1071-74 with jesse installer [puppet] - 10https://gerrit.wikimedia.org/r/196258 (owner: 10Cmjohnson) [19:05:06] (03PS2) 10Cmjohnson: Adding dhcp entries for cp1071-74 with jesse installer [puppet] - 10https://gerrit.wikimedia.org/r/196258 [19:05:39] (03PS1) 10Rush: Revert "sessions: temporarily disable mc1014" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196302 [19:06:37] (03CR) 10Cmjohnson: [C: 032] Adding dhcp entries for cp1071-74 with jesse installer [puppet] - 10https://gerrit.wikimedia.org/r/196258 (owner: 10Cmjohnson) [19:07:52] (03CR) 10Alexandros Kosiaris: "Cleaned up on virt10*" [puppet] - 
10https://gerrit.wikimedia.org/r/183819 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [19:08:11] RECOVERY - HHVM busy threads on mw1120 is OK: OK: Less than 30.00% above the threshold [57.6] [19:09:17] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia: http://commons.wikimedia.org/wiki/File:Trie_example.svg causes 503 Service Unavailable - https://phabricator.wikimedia.org/T92529#1114205 (10Tgr) 5Open>3Resolved a:3Tgr @BBlack fixed this by banning all files with content-length==52. Fol... [19:10:50] PROBLEM - puppet last run on labstore1002 is CRITICAL: CRITICAL: Puppet has 1 failures [19:11:33] (03CR) 10Alexandros Kosiaris: [C: 032] authdns: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183821 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [19:11:50] 6operations, 6Commons, 10Incident-20150312-whitespace, 10MediaWiki-File-management, 6Multimedia: http://commons.wikimedia.org/wiki/File:Trie_example.svg causes 503 Service Unavailable - https://phabricator.wikimedia.org/T92529#1114218 (10Se4598) [19:12:07] (03PS1) 10Legoktm: check_php_syntax: Check for any content before opening <?php 6operations, 6Commons, 10Incident-20150312-whitespace, 10MediaWiki-File-management, 6Multimedia: http://commons.wikimedia.org/wiki/File:Trie_example.svg causes 503 Service Unavailable - https://phabricator.wikimedia.org/T92529#1114224 (10Tgr) (I wonder how that content length is calculated? `\nHeaders al... [19:13:14] greg-g, add Incident-20150312-whitespace also to https://phabricator.wikimedia.org/T92545 ?
[19:13:25] (03CR) 10Legoktm: "Untested besides running "tox"" [tools/scap] - 10https://gerrit.wikimedia.org/r/196306 (https://phabricator.wikimedia.org/T92534) (owner: 10Legoktm) [19:14:39] (03CR) 10Alexandros Kosiaris: "cleaned up on baham,iridium,eeden" [puppet] - 10https://gerrit.wikimedia.org/r/183821 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [19:14:51] (03CR) 10Catrope: [C: 031] Revert "sessions: temporarily disable mc1014" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196302 (owner: 10Rush) [19:15:06] se4598: looks like tgr did :) [19:15:57] 6operations, 10Citoid, 10VisualEditor, 3VisualEditor 2014/15 Q3 blockers: Improve citoid production service - https://phabricator.wikimedia.org/T90281#1114251 (10mobrovac) [19:15:58] 6operations, 10Citoid, 5Patch-For-Review: Configure citoid to use the new zotero service - https://phabricator.wikimedia.org/T89873#1114248 (10mobrovac) 5Open>3Resolved a:3mobrovac Fixed with https://gerrit.wikimedia.org/r/#/c/195896/ , resolving. 
[19:16:02] (03CR) 10Dzahn: [C: 032] depool cp1066 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196254 (owner: 10Dzahn) [19:16:48] 6operations, 7domains: add support for wikimedia.xyz - https://phabricator.wikimedia.org/T92547#1114259 (10RobH) 3NEW [19:16:51] 6operations, 10Citoid, 10VisualEditor, 3VisualEditor 2014/15 Q3 blockers: Improve citoid production service - https://phabricator.wikimedia.org/T90281#1055318 (10mobrovac) [19:16:55] 6operations, 10Citoid, 5Patch-For-Review, 3VisualEditor 2014/15 Q3 blockers: Configure citoid to use outbound proxy - https://phabricator.wikimedia.org/T89875#1114266 (10mobrovac) 5Open>3Resolved [19:17:45] (03CR) 10Alexandros Kosiaris: "cleaned up on strontium, palladium, virt1000" [puppet] - 10https://gerrit.wikimedia.org/r/183822 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [19:18:25] 6operations, 7network: Very slow connection to wmf engineering infrastructure - https://phabricator.wikimedia.org/T92548#1114279 (10Florian) 3NEW [19:19:39] (03CR) 10Rush: [C: 032] "seriously hope this works out :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196302 (owner: 10Rush) [19:20:54] !log rush Synchronized wmf-config/session.php: re-reenable mc1014 (duration: 00m 06s) [19:21:00] Logged the message, Master [19:21:18] 6operations, 7domains: add support for wikimedia.xyz - https://phabricator.wikimedia.org/T92547#1114290 (10Slaporte) @robh: Please point to wikimedia.org. 
[19:22:56] (03PS5) 10Andrew Bogott: Roughed in designate class [puppet] - 10https://gerrit.wikimedia.org/r/191471 [19:23:01] 6operations, 6Labs, 10hardware-requests: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1114295 (10RobH) [19:23:03] (03CR) 10Legoktm: check_php_syntax: Check for any content before opening <?php 6operations, 6Labs, 10hardware-requests: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1114296 (10RobH) p:5High>3Normal [19:23:28] (03CR) 10GWicke: Don't include a node in its own seeds (032 comments) [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/195483 (https://phabricator.wikimedia.org/T91617) (owner: 10GWicke) [19:24:18] 6operations, 6Labs, 10hardware-requests: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1114303 (10RobH) a:5RobH>3Andrew @Andrew, System OS installed and awaiting service implementation. puppet/salt keys have NOT been accepted at this time. [19:24:25] 6operations, 6Labs: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1114306 (10RobH) [19:25:10] andrewbogott: ^ holmium is all yours for designate [19:25:21] robh: thank you!
[19:25:31] quite welcome [19:25:38] (03PS2) 10Legoktm: check_php_syntax: Check for any content before opening <?php (03PS1) 10BBlack: configure (but do not pool) cp107[1-4] for upload [puppet] - 10https://gerrit.wikimedia.org/r/196310 [19:30:14] (03PS6) 10GWicke: Don't include a node in its own seeds [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/195483 (https://phabricator.wikimedia.org/T91617) [19:30:23] (03CR) 10BBlack: [C: 032 V: 032] configure (but do not pool) cp107[1-4] for upload [puppet] - 10https://gerrit.wikimedia.org/r/196310 (owner: 10BBlack) [19:30:41] PROBLEM - puppet last run on mw1051 is CRITICAL: CRITICAL: puppet fail [19:32:48] (03PS1) 10RobH: adding wikimedia.xyz domain support [dns] - 10https://gerrit.wikimedia.org/r/196312 [19:33:29] 6operations, 7domains: add support for wikimedia.xyz - https://phabricator.wikimedia.org/T92547#1114363 (10RobH) dns https://gerrit.wikimedia.org/r/#/ [19:35:42] bblack: it's interesting that backends have such low hit (and request) rates [19:36:03] yes, it is [19:36:31] PROBLEM - HHVM busy threads on mw1120 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [86.4] [19:37:50] ori: ^ is that the second time in a short time mw1120 has done this?
[19:38:59] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review, 7user-notice: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1114397 (10gpaumier) [19:41:15] (03PS1) 10RobH: adding support to redirect wikimedia.xyz to wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/196321 [19:41:23] !log restarting hhvm on mw1120 [19:41:23] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.471 second response time [19:41:23] RECOVERY - HHVM rendering on mw1120 is OK: HTTP OK: HTTP/1.1 200 OK - 66892 bytes in 4.661 second response time [19:41:28] Logged the message, Master [19:41:58] 6operations, 5Patch-For-Review, 7domains: add support for wikimedia.xyz - https://phabricator.wikimedia.org/T92547#1114415 (10RobH) [19:42:00] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review, and 2 others: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1114416 (10greg) [19:42:32] 6operations, 10ops-eqiad: mc1014 server has been flaking out and dropping connectivity - https://phabricator.wikimedia.org/T91773#1114420 (10chasemp) 5Open>3Resolved resolved ever so painfully [19:43:21] RECOVERY - HHVM busy threads on mw1120 is OK: OK: Less than 30.00% above the threshold [57.6] [19:45:37] 6operations, 10OTRS, 6Security: Make OTRS sessions IP-address-agnostic - https://phabricator.wikimedia.org/T87217#1114431 (10Aschmidt) Today, I have got the advice to set network.dns.disableIPv6 in Firefox about:config to "true". I have tested this, and I have not been logged out for about half an hour. It... 
[19:48:47] 6operations, 5Patch-For-Review: Make puppet the sole manager of user keys - https://phabricator.wikimedia.org/T92475#1114444 (10RobH) p:5Triage>3High [19:49:51] RECOVERY - puppet last run on mw1051 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [19:49:58] 6operations, 10Continuous-Integration, 10Incident-20150312-whitespace, 6MediaWiki-Core-Team: add a check for whitespace before leading <?php RECOVERY - HHVM queue size on mw1120 is OK: OK: Less than 30.00% above the threshold [10.0] [19:50:39] 10Ops-Access-Requests, 6operations: Give CCogdill an account on barium - https://phabricator.wikimedia.org/T92533#1114466 (10RobH) 5Open>3declined I'm putting the status of this to declined, since the access request has been denied. [19:51:32] 6operations, 10ops-esams: cp3011 hardware fault - https://phabricator.wikimedia.org/T92306#1114471 (10RobH) [19:52:35] 6operations, 10Wikimedia-General-or-Unknown, 7Regression: svn.wikimedia.org security certificate expired - https://phabricator.wikimedia.org/T88731#1114473 (10RobH) [19:53:11] 6operations, 10ops-ulsfo: cp4009 hardware fault - https://phabricator.wikimedia.org/T92476#1114486 (10RobH) [19:57:55] (03CR) 10Alexandros Kosiaris: Don't include a node in its own seeds (032 comments) [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/195483 (https://phabricator.wikimedia.org/T91617) (owner: 10GWicke) [19:58:54] (03PS1) 10BBlack: pool cp107[12] backends [puppet] - 10https://gerrit.wikimedia.org/r/196326 [19:58:56] (03PS1) 10BBlack: pool cp107[34] backends [puppet] - 10https://gerrit.wikimedia.org/r/196327 [19:59:09] (03CR) 10BBlack: [C: 032 V: 032] pool cp107[12] backends [puppet] - 10https://gerrit.wikimedia.org/r/196326 (owner: 10BBlack) [20:03:51] (03PS1) 10Dzahn: Revert "rbf2001: use eth2 MAC for DHCP" [puppet] - 10https://gerrit.wikimedia.org/r/196328 [20:04:48] (03CR) 10BryanDavis: check_php_syntax: Check for any content before opening <?php (03PS2) 10Dzahn: Revert "rbf2001: use eth2 MAC
for DHCP" [puppet] - 10https://gerrit.wikimedia.org/r/196328 [20:05:38] (03CR) 10Dzahn: [C: 032] "eth2 and eth3 appear to be switched around here, while on rdf2004 2 is actually eth2 and 3 is actually eth3" [puppet] - 10https://gerrit.wikimedia.org/r/196328 (owner: 10Dzahn) [20:05:51] mutante: no don't do that [20:05:57] there's a workaround, to pass "debug" [20:06:15] that may work and give you eth0/eth1, although _joe_ said that it didn't work for him today [20:06:24] I'm trying to work on a more permanent fix [20:07:00] paravoid: yes, i was just going to revert that attempt though and go back to the lower MAC address [20:07:55] to undo what i changed [20:08:00] nod [20:08:08] (03CR) 10BBlack: [C: 032] pool cp107[34] backends [puppet] - 10https://gerrit.wikimedia.org/r/196327 (owner: 10BBlack) [20:09:22] that's amazing [20:09:29] from racking them up to pooling them in what, 2-3 hours? [20:09:38] yup :) [20:09:55] good times [20:09:55] it helps when all the netboot/disk/etc stuff has been refactored and sorted out to work generically [20:10:07] (and puppet works on the first try always and finishes everything) [20:10:13] yup! [20:10:14] :) [20:11:10] new hardware config and nothing manual happened. I just logged in to look afterwards and verify/stare [20:11:47] how many precises left? :) [20:11:54] like 9 [20:11:56] (03PS1) 10Cmjohnson: Adding dns entries for cp3030-3053 include mgmt and ipv6 [dns] - 10https://gerrit.wikimedia.org/r/196332 [20:12:00] hehe [20:12:05] incl. random clusters? [20:12:07] misc/parsoid/etc.? 
[20:12:12] I could wipe them all out today, except for the esams/ulsfo uploads needing spacing (that's most of what's left) [20:12:27] no, I've only been doing/counting bits/text/upload/mobile so far [20:12:28] bblack: looking forward to SPDY for restbase [20:12:32] when that's good, I'll get the rest [20:13:10] that's what, another 4 :) [20:13:15] yeah something like that [20:13:47] was really happy last night when I saw my browser using spdy for enwiki and bits [20:14:05] great work! [20:14:15] although it brings you very little atm [20:14:23] mutante has cp1066 in-progress (last eqiad text), there's 1x ulsfo text I'll do today, and then there's still 3x esams-upload and 4x ulsfo-upload to go [20:14:33] you still need 3 connections/RTTs [20:14:47] err I meant to say 3x esams-upload and 4x eqiad-upload [20:14:47] yeah, but it's a good step forward [20:15:21] once we're past the new hardware setup for this, I'm gonna start on making a raft of tickets about various related fixups [20:15:31] (e.g. cache hitrate related things, killing bits., etc) [20:15:38] and lets us design new APIs / IP layouts with this in mind [20:15:51] 6operations: Delete gadolinium:/a/log/fundraising/ - https://phabricator.wikimedia.org/T92336#1114572 (10atgo) Hey @jgreen could you confirm that we have this covered elsewhere? [20:15:53] 6operations, 10RESTBase, 10RESTBase-Cassandra: move cassandra submodule into puppet repo - https://phabricator.wikimedia.org/T92560#1114574 (10Eevans) 3NEW [20:16:21] (03PS2) 10Cmjohnson: Adding dns entries for cp3030-3053 include mgmt and ipv6 [dns] - 10https://gerrit.wikimedia.org/r/196332 [20:16:37] 6operations: Delete stat1002:/a/squid/archive/bannerImpressions - https://phabricator.wikimedia.org/T92330#1114581 (10atgo) Hey @jgreen @ellery could you look at this and confirm that we don't need it? [20:16:50] does it? [20:16:57] besides browsers, how's the client support for spdy?
[20:17:03] I don't think libcurl supports it [20:17:18] cmjohnson1: oh, I forgot that amssq clash with them numerically [20:17:20] so for APIs consumed by MW or bots etc. it doesn't provide you much [20:18:13] bblack: do you wanna go back to starting at cp3023? [20:18:14] so much for my careful 1:1 mapping of cp30xx IP last-octets [20:18:28] no, it doesn't matter in practice, I just liked it [20:18:48] e.g. cp3030 could've been 10.20.0.130 and such, but amssq are already on those IPs [20:18:53] 6operations, 10ops-codfw: setup and deploy mw2135 through mw2215 - https://phabricator.wikimedia.org/T86806#1114584 (10Papaul) @Joe Redirection after boot for mw2135-mw2214 changed from disable to enable. Please let me know if there is anything else. i will start working on mw2001-mw2134 tomorrow. Thanks [20:19:15] 6operations, 10ops-esams: Rack, cable, prepare cp3030-3053 - https://phabricator.wikimedia.org/T92514#1114587 (10Cmjohnson) [20:19:17] !log cp1066 - comment in pybal, reinstall [20:19:20] (03PS1) 10Eevans: move cassandra submodule into puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/196335 (https://phabricator.wikimedia.org/T92560) [20:19:22] Logged the message, Master [20:19:24] gwicke: so not sure how it would affect your api design [20:20:13] (03CR) 10Cmjohnson: [C: 032] Adding dns entries for cp3030-3053 include mgmt and ipv6 [dns] - 10https://gerrit.wikimedia.org/r/196332 (owner: 10Cmjohnson) [20:20:53] paravoid: with http/1 you have to heavily optimize for least number of requests [20:20:59] see above [20:21:15] what's your target audience for APIs? 
[20:21:37] for MW core it doesn't matter, as the latencies are low anyway [20:21:40] MW as a consumer won't support it anytime soon, neither will independent tool authors [20:21:49] 6operations, 10ops-esams: Rack, cable, prepare cp3030-3053 - https://phabricator.wikimedia.org/T92514#1114594 (10Cmjohnson) The following mgmt ip's have been assigned cp3030 1H IN A 10.21.0.151 cp3031 1H IN A 10.21.0.152 cp3032 1H IN A 10.21.0.153 cp3033 1H IN A... [20:21:57] it primarily matters for browsers and apps [20:22:31] exactly, and APIs are useful for a third category of tools/bots [20:22:51] those are gaining spdy support as well [20:23:09] libcurl's isn't even coded yet, so I'd expect at least 3-4 years before it becomes mainstream [20:23:31] the node request module for example already pulls in spdy support [20:23:45] not sure if it uses it by default yet [20:23:47] but I guess speed there matters less since there's usually no UX involved [20:24:03] yeah, on clients you can use higher concurrency [20:24:28] browsers normally use six parallel connections per host [20:24:38] depends on the browser [20:24:43] once you exceed that parallelism, you get head-of-line blocking [20:25:01] IIRC six is used by FF and Chrome [20:25:02] you're confused [20:25:11] HOL blocking has nothing to do with domain sharding [20:25:56] I didn't say that it had [20:26:04] you might be confused ;) [20:26:10] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471#1114605 (10Eevans) [20:26:19] 22:24 < gwicke> once you exceed that parallelism, you get head-of-line blocking [20:26:25] that's a false statement [20:26:32] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471#1111859 (10Eevans) [20:26:53] paravoid: without pipelining clients send one request at a time [20:27:08] 6operations, 10Citoid, 
10VisualEditor, 3VisualEditor 2014/15 Q3 blockers: Improve citoid production service - https://phabricator.wikimedia.org/T90281#1114609 (10Jdforrester-WMF) [20:27:09] 6operations, 10Citoid, 5Patch-For-Review: Update the citoid/deploy branch to not contain zotero deploy - https://phabricator.wikimedia.org/T89872#1114608 (10Jdforrester-WMF) 5Open>3Resolved [20:27:09] which means that following requests are blocked by the head-of-line [20:27:17] without pipelining and without parallelism, yes [20:27:22] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: move cassandra submodule into puppet repo - https://phabricator.wikimedia.org/T92560#1114611 (10Eevans) [20:27:22] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471#1111859 (10Eevans) [20:27:53] paravoid: once you exceed six parallel requests but only have six connections, that's the situation you are in [20:28:15] some of those requests end up being blocked & delayed [20:28:53] what are you talking about [20:29:30] 6 is a magic value that's configurable in most browsers but generally set to 6 [20:29:34] max_connections_per_domain or something [20:29:35] with one connection and no pipelining you would get hol blocked because of interlinked resources (html referring to a stylesheet etc.) [20:30:08] browsers generally don't pipeline because of proxy issues [20:30:21] parallelism exists in browsers to fix this, trying to keep those e.g.
6 connections busy [20:30:28] with spdy they do [20:30:30] Dependent resources isn't really HOL blocking [20:30:42] *nod* [20:30:54] we are talking about the situation where you have more than six resources to retrieve [20:30:58] HOL blocking means you have something that's ready to go but is blocked because the thing at the head of the line is not ready to go [20:30:59] in parallel [20:31:17] I'm well aware of what HOL blocking is [20:31:40] it's a very common concept in networking, and by far a larger problem with TCP than HTTP [20:31:50] s/with/in/ [20:35:20] I think what gwicke's saying makes sense. if a page loads 24 items and there's 6 connections, things are HOL-blocked [20:35:37] what would be the HOL here? [20:35:50] only one item can be fetched for each of those 6 connections at a time [20:36:07] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "LGTM, and I do not second ori's comment (interpolating the password exposes it while the process runs). Just a couple of nitpicks." (032 comments) [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/196133 (https://phabricator.wikimedia.org/T92471) (owner: 10Eevans) [20:36:10] so if 24 urls are requested from http://foo in a given page, 3/4 of them are queued behind the first 6 [20:36:11] Assuming all 24 items could theoretically be fetched in parallel if you had 24 connections [20:36:25] sure, but queueing != HOL blocked [20:37:00] OK that's true, that's also not HOL blocking [20:37:02] bblack: equivalent to the # Jessie comment are also the icons on the host group overview :) : https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?hostgroup=cache_text_eqiad&style=overview&nostatusheader [20:37:06] what is the thing at the *head* that is not there yet? 
[20:37:07] What you said (dependent resources) isn't either though [20:37:22] paravoid: the response from the previous request [20:37:26] the first request on each connection [20:37:49] https://en.wikipedia.org/wiki/Head-of-line_blocking first summary sentence even mentions HTTP pipelining as an example [20:38:17] OK I see what they mean re HTTP pipelining [20:38:35] In HTTP/2, requests and responses can be interleaved [20:38:40] (03CR) 10Ottomata: "NoooOoooo then we can't use this in vagrant!" [puppet] - 10https://gerrit.wikimedia.org/r/196335 (https://phabricator.wikimedia.org/T92560) (owner: 10Eevans) [20:38:47] So once you have sent request 1, you can then send request 2 on the same connection [20:38:48] the client can fire off 4x requests to each of the 6x connections immediately, but responses 1-4 on each connection come back serially [20:39:16] Whereas in HTTP/1.1, even when pipelining, request 2 (or 7, if you have 6 conns) is blocked on the response for request 1 [20:39:55] *nod*, without pipelining even sending the second request is blocked on the response [20:39:55] well in http/2 it's not pipelining, it's true multiplexing [20:40:17] you can do several independent transactions on one spdy/http2 conn and none depend on the others [20:40:40] you can do server push, but it'd be years before we're able to :) [20:40:42] and by using https only they side-step the proxy problem [20:40:52] which proxy problem? [20:41:01] proxies messing up pipelining [20:41:19] that's the reason why all desktop browsers default to no pipelining [20:41:21] proxies wouldn't be able to talk http/2 anyway, it's a different protocol [20:41:25] (03PS1) 10Dzahn: repool cp1066 after jessie reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196411 [20:41:38] paravoid: HTTP 1.1 pipelining [20:41:57] I'm actually curious if that's also true for HTTPS [20:41:58] wait, I'm confused [20:42:00] who's "they"?
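The arithmetic being argued over above (24 resources, 6 connections, no pipelining) can be sketched as a toy latency model. This is an illustrative Python sketch under idealized assumptions (every response takes exactly one RTT, bandwidth and slow start ignored), not a browser simulation:

```python
import math

def http1_fetch_rtts(n_resources, n_connections=6):
    # HTTP/1.1 without pipelining: each connection carries one
    # request/response at a time, so requests queue in rounds
    # behind the request at the head of each connection.
    return math.ceil(n_resources / n_connections)

def http2_fetch_rtts(n_resources):
    # HTTP/2/SPDY multiplexing: all streams interleave on a single
    # connection, so (ignoring bandwidth) they complete together.
    return 1
```

With 24 images over the usual 6 connections this gives 4 serial rounds versus 1: three quarters of the requests wait in a queue, which is the delay being discussed whether or not one labels it head-of-line blocking.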
[20:42:53] the designers of spdy / http/2 [20:43:25] existing proxies would not be able to understand http/2 anyway (pipelining is the least of the problems) [20:43:46] and multiplexing is so essential to how spdy/http2 is designed that it'd be highly improbable for an http2 proxy to break it [20:44:04] (03CR) 10Dzahn: [C: 032] repool cp1066 after jessie reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196411 (owner: 10Dzahn) [20:44:32] so, no, this is not why spdy defaults to a TLS transport [20:44:43] I didn't say it was [20:44:46] ;) [20:44:59] but I did say that it helps to avoid the issues of HTTP proxies interfering [20:45:01] no they do not side-step the proxy problem by using https only [20:45:15] (for pipelining, at least) [20:46:58] RoanKattouw: so, those additional requests would be for dependent resources, no? [20:47:23] Well, dependent on the HTML response yes [20:47:26] But not dependent on each other [20:47:44] You could have an HTML response that then triggers dozens of dependent resource requests all at the same time [20:48:11] (That is in fact what happens, especially on pages with a lot of images) [20:48:34] (03CR) 10Ori.livneh: "You can just copy changes between the two repositories. 
It's not elegant, but it beats the pain of working with submodules in our current " [puppet] - 10https://gerrit.wikimedia.org/r/196335 (https://phabricator.wikimedia.org/T92560) (owner: 10Eevans) [20:48:37] (03PS1) 10BBlack: depool cp3006 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196414 [20:48:48] dozens, hardly :) [20:48:53] (03CR) 10BBlack: [C: 032 V: 032] depool cp3006 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196414 (owner: 10BBlack) [20:49:00] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 77 data above and 9 below the confidence bounds [20:49:01] PROBLEM - HTTP error ratio anomaly detection on graphite2001 is CRITICAL: CRITICAL: Anomaly detected: 77 data above and 9 below the confidence bounds [20:49:03] it's 4-6 connections per domain [20:49:04] !log repooled cp1066 in pybal - text varnishes in eqiad now 100% Debian [20:49:11] Logged the message, Master [20:49:15] but yeah, this indeed happens [20:49:22] (03PS1) 10GWicke: Roll out RESTBase updates & VE use to all phase0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196415 [20:49:40] these dependent resources are blocking the page being viewed [20:49:47] and all of these are blocked by the HTML being loaded [20:49:49] I think the anomaly alert just tends to trend in much later, and is responding to the 5m issue earlier, FWIW [20:49:59] Yeah [20:50:04] the HOL here is the HTML [20:50:11] Images don't block the page from being viewed, but CSS and JS do [20:50:19] SPDY solves this by utilizing server push [20:50:29] you request the HTML and you get back HTML+assets [20:50:35] if your stack supports that [20:50:36] and image downloads still often HOL block on css / scripts [20:50:51] as those are loaded first [20:50:54] that's what the whole SPDY-solves-HOL-blocking claim is [20:51:04] deviantart serves their images from a list of random subdomains that get translated in nginx rules etc to bypass the 4-6 limit 
fwiw [20:51:14] yeah, that's common [20:51:15] paravoid: even without server push HOL blocking is eliminated [20:51:28] but this can achieve the opposite result actually [20:51:40] TCP connections take a while to reach optimum speed (slow start etc.) [20:51:55] so "infinite" parallelization can be hurtful [20:52:08] are we talking about domain sharding for upload.wm.o? [20:52:12] I think it's 3x i.e. up to 18 [20:52:13] Yeah presumably that's part of why max_client_connections exists [20:52:15] sort of :) [20:52:15] that they do [20:52:19] 6operations, 3HTTPS-by-default, 5Patch-For-Review: Upgrade all HTTP frontends to Debian jessie - https://phabricator.wikimedia.org/T86648#1114695 (10Dzahn) text-eqiad are 100% complete [20:52:21] paravoid: and the answer to that is... QUIC? :) [20:52:32] bblack: google's answer is :) [20:52:36] to TCP HOL blocking too [20:52:43] single TCP connection [20:52:43] they want to do some evil stuff [20:52:54] where the browser renders page fragments as they arrive or something [20:53:05] or starts parsing at least [20:53:09] (03CR) 10Jforrester: [C: 031] Roll out RESTBase updates & VE use to all phase0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196415 (owner: 10GWicke) [20:53:14] not everyone can benefit from pipelining, so domain sharding on upload.wm.o would be very useful [20:53:18] mutante: \o/ [20:53:18] paravoid: Ahm don't browsers already do that? [20:53:35] mutante: if you have time, can you continue serially on eqiad uploads? [20:53:40] ori: it would be worse for the majority, better for a minority [20:53:46] I mean in practice,
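paravoid's point that "infinite" parallelization can hurt because of TCP slow start can also be illustrated with a toy model. This is an idealized Python sketch (congestion window doubles every RTT, no loss, made-up initial window), not a real TCP implementation:

```python
def rtts_to_send(segments, initial_cwnd=10):
    # Idealized slow start: the congestion window doubles each
    # round trip until all segments have been sent.
    cwnd, sent, rtts = initial_cwnd, 0, 0
    while sent < segments:
        sent += cwnd
        cwnd *= 2
        rtts += 1
    return rtts
```

One warm connection carrying 60 segments finishes in 3 RTTs (10 + 20 + 40), while six freshly sharded connections each carrying 10 segments pay a full TCP (and possibly TLS) handshake apiece before their first RTT of data, which is why domain sharding can end up slower than fewer, warmer connections.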