[00:02:08] Krenair, kaldari: can I proceed with the GlobalUsage patch? [00:02:20] or are you still working on WikiGrok? [00:02:31] still dealing with wikigrok [00:02:34] tgr: go for it. it’ll take me a minute to clean things up [00:02:40] oops, nevermind [00:02:48] 6operations: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1111678 (10Dzahn) 5Open>3declined a:3Dzahn The issue is still unchanged. I attempted another reinstall of rbf2001 and: ``` │ Malformed IP address │ │ The IP addres... [00:02:49] 7Puppet, 6operations, 5Patch-For-Review, 3wikis-in-codfw: Check that the redis roles can be applied in codfw, set up puppet. - https://phabricator.wikimedia.org/T86898#1111682 (10Dzahn) [00:02:50] 6operations, 5Patch-For-Review, 3wikis-in-codfw: Setup redis clusters in codfw - https://phabricator.wikimedia.org/T86887#1111681 (10Dzahn) [00:03:04] 7Puppet, 6operations, 5Patch-For-Review, 3wikis-in-codfw: Check that the redis roles can be applied in codfw, set up puppet. - https://phabricator.wikimedia.org/T86898#979090 (10Dzahn) [00:03:05] 6operations, 5Patch-For-Review, 3wikis-in-codfw: Setup redis clusters in codfw - https://phabricator.wikimedia.org/T86887#978964 (10Dzahn) [00:03:06] 6operations: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1111683 (10Dzahn) 5declined>3Open [00:04:19] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [500.0] [00:06:22] (03PS1) 10Tim Landscheidt: Tools: Fix and clean up generation of /etc/ssh/ssh_known_keys [puppet] - 10https://gerrit.wikimedia.org/r/196125 (https://phabricator.wikimedia.org/T92379) [00:06:45] Krenair: I’ll need you to create the submodule update for wmf20 again, unless you want to wait 10 minutes on me. 
[00:06:56] I already did [00:07:00] thanks [00:07:05] almost done with wmf21 [00:07:08] waiting for jenkins to finish testing it [00:07:32] kaldari, I assume you're just going to update https://gerrit.wikimedia.org/r/#/c/196106/ ? [00:10:51] Krenair: yes, should be ready now: https://gerrit.wikimedia.org/r/#/c/196106/ [00:11:24] !log krenair Synchronized php-1.25wmf20/extensions/WikiGrok/includes/Hooks.php: https://gerrit.wikimedia.org/r/#/c/196122/ (duration: 00m 07s) [00:11:32] Logged the message, Master [00:11:37] This seems okay. [00:11:47] yay [00:12:38] Krenair: doing a quick test... [00:12:58] Krenair: change works great [00:13:04] great [00:13:05] Krenair: ready for wmf21 [00:13:14] 6operations: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1111701 (10Dzahn) ``` BusyBox v1.22.1 (Debian 1:1.22.0-15) built-in shell (ash) Enter 'help' for a list of built-in commands. ~ # ip a s 1: lo: mtu 65536 qdisc noqueue link/loopback 00:00:00:00:00:... [00:13:15] done, waiting for jenkins [00:18:58] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:20:38] Krenair: So when you have to revert to before a submodule update, are the steps: git the previous core commit from the log, git checkout that commit, do a submodule update for the extension, re-sync the extension dir? [00:21:53] !log krenair Synchronized php-1.25wmf21/extensions/WikiGrok/includes/Hooks.php: https://gerrit.wikimedia.org/r/#/c/196106/ (duration: 00m 05s) [00:21:59] Logged the message, Master [00:22:25] seems okay... kaldari? [00:22:39] Krenair: yep, wmf21 works good [00:23:00] okay, tgr? [00:23:09] kaldari: I'm not sure I understand your question [00:23:36] Krenair: still need the config change when you have a chance [00:23:50] "revert to before a submodule update" - you mean, reverting just the submodule update? [00:23:56] kaldari, what, https://gerrit.wikimedia.org/r/#/c/196083/ ? [00:23:57] Krenair: or is that already done? 
[00:24:09] that was done [00:24:11] yes, thanks! [00:24:38] Krenair: like just now, when you had to fix en.wiki, what steps did you take exactly? [00:24:58] went to the WikiGrok directory, checked out the last known good commit [00:25:00] synced it [00:25:49] it needs to be cleaned up before the next person comes along, but it unbreaks the site which is the priority [00:26:01] that makes sense [00:27:26] 6operations, 10ops-codfw, 10hardware-requests, 3wikis-in-codfw: Procure rdb2001-2004 - onsite pending racking - https://phabricator.wikimedia.org/T86896#1111735 (10Dzahn) renamed in racktables from rbd to rdb 2001-2004 [00:31:39] 6operations, 10ops-codfw, 10hardware-requests, 3wikis-in-codfw: Procure rdb2001-2004 - onsite pending racking - https://phabricator.wikimedia.org/T86896#1111738 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/196104/ https://gerrit.wikimedia.org/r/#/c/195943/ https://gerrit.wikimedia.org/r/#/c/195868/1 and... [00:31:53] kaldari, in this case, the way we cleaned up was reverting the bad commit of the two [00:32:23] so the complete diff became just the good commit, effectively, which could be synched [00:32:50] true [00:32:57] (03PS1) 10GWicke: WIP: Merge Iaa3bbf07b6053e139dc [puppet] - 10https://gerrit.wikimedia.org/r/196128 [00:33:47] (03CR) 10jenkins-bot: [V: 04-1] WIP: Merge Iaa3bbf07b6053e139dc [puppet] - 10https://gerrit.wikimedia.org/r/196128 (owner: 10GWicke) [00:37:05] (03PS1) 10Dzahn: netboot: fix conflicting rdb entries, missing echo [puppet] - 10https://gerrit.wikimedia.org/r/196129 (https://phabricator.wikimedia.org/T92011) [00:38:11] (03CR) 10Dzahn: [C: 032] netboot: fix conflicting rdb entries, missing echo [puppet] - 10https://gerrit.wikimedia.org/r/196129 (https://phabricator.wikimedia.org/T92011) (owner: 10Dzahn) [00:38:53] Krenair: the security patches should survive a git pull --rebase, right? [00:39:07] um [00:39:17] tgr, are you not following the instructions? 
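The recovery flow Krenair describes above — go to the extension directory, check out the last known good commit, sync it to unbreak the site, then clean up by reverting the bad commit so the deployed tree matches the branch — can be sketched against a throwaway repository. Everything here (paths, commit messages) is a hypothetical stand-in; the real flow ends with a sync of the extension directory rather than a purely local checkout.

```shell
# Sketch of the hotfix-then-cleanup flow discussed above, using a
# disposable git repo. Commit contents are hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.email=ops@example.org -c user.name=ops \
    commit -q --allow-empty -m "last known good"
good=$(git rev-parse HEAD)           # note the last known good commit
git -c user.email=ops@example.org -c user.name=ops \
    commit -q --allow-empty -m "bad submodule bump"
# unbreak first: put the working tree back on the known-good commit
# (detached HEAD); in production this is followed by syncing the dir
git checkout -q "$good"
git log --oneline -1
```

The later cleanup ([00:31:53]) is then a revert of the bad commit on the branch, so the complete diff becomes just the good commit and can be synced normally.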
[00:39:52] tgr, https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Step_2:_get_the_code_on_tin [00:40:05] !log on tin: fixing ownership and permissions of /tmp/mw-cache-* [00:40:11] Logged the message, Master [00:41:06] I am otherwise [00:41:27] IIRC --rebase is equivalent to doing a manual rebase, with some small advantages [00:41:38] but I'll stick to the manual then :) [00:42:56] !log rdb2001 attempting another reinstall after fixed netboot [00:43:01] Logged the message, Master [00:46:27] 6operations, 5Patch-For-Review, 3wikis-in-codfw: setup & deploy rdb2001-2004 - https://phabricator.wikimedia.org/T92011#1111762 (10Dzahn) there was this issue: https://gerrit.wikimedia.org/r/#/c/196129/1 and what is listed on https://phabricator.wikimedia.org/T86896#1111738 [00:49:34] 6operations, 5Patch-For-Review, 3wikis-in-codfw: setup & deploy rdb2001-2004 - https://phabricator.wikimedia.org/T92011#1111770 (10Dzahn) still doesn't seem to install. does partman/raid1-lvm-ext4-srv.cfg not work? [00:54:04] (03PS1) 10BBlack: tag cp1059 + cp4010 -> jessie [puppet] - 10https://gerrit.wikimedia.org/r/196131 [00:54:23] (03CR) 10BBlack: [C: 032 V: 032] tag cp1059 + cp4010 -> jessie [puppet] - 10https://gerrit.wikimedia.org/r/196131 (owner: 10BBlack) [00:54:29] 6operations: Delete stat1002:/a/squid/archive/teahouse - https://phabricator.wikimedia.org/T92335#1111782 (10Capt_Swing) please delete this @kevinator. We haven't used these log data for a long, long time. Thanks! 
[00:55:36] !log tgr Synchronized php-1.25wmf20/extensions/GlobalUsage/refreshGlobalimagelinks.php: fix script before running for T65594 (duration: 00m 06s) [00:55:42] Logged the message, Master [00:57:02] !log doing refreshGlobalimagelinks.php test runs [00:57:07] Logged the message, Master [00:58:54] Krenair: seems ok [00:59:10] ok :) [01:00:56] !log running extensions/GlobalUsage/refreshGlobalimagelinks.php --pages=nonexisting for all wikis (T65594) [01:01:01] Logged the message, Master [01:03:34] this will spam the fatallog a bit [01:03:56] what would have been the correct way of only running it for public wikis? [01:06:37] tgr: there's a 'foreachwikiindblist' script on tin [01:06:45] that takes a dblist as the first parameter [01:06:59] yeah but we want all dbs that aren't closed, silver or private [01:07:06] yeah, but what I would need here is for each wikis not in dblist [01:08:24] use comm or grep to create a dblist file that contains all the lines in all.dblist that are not in closed.dblist, silver.dblist or private.dblist [01:08:30] (03PS1) 10Tim Starling: Fix l10nupdate, have mwscript always use www-data [puppet] - 10https://gerrit.wikimedia.org/r/196132 [01:09:02] I recall someone saying that all.dblist is not up-to-date [01:09:57] that patchset will need deploying some time in the next 50 minutes [01:10:03] i can review [01:10:10] just give it +1, I'll merge and test [01:10:31] tgr: no, it's always up to date [01:11:29] (03CR) 10Ori.livneh: [C: 031] Fix l10nupdate, have mwscript always use www-data [puppet] - 10https://gerrit.wikimedia.org/r/196132 (owner: 10Tim Starling) [01:11:32] if the current wiki is missing from all.dblist, then a "no such wiki" error message is shown [01:11:58] it's checked at the top of CommonSettings.php [01:12:16] inclusion in all.dblist is the definition of a wiki that exists [01:13:20] (03CR) 10Tim Starling: [C: 032] Fix l10nupdate, have mwscript always use www-data [puppet] - 10https://gerrit.wikimedia.org/r/196132 (owner: 
10Tim Starling) [01:13:32] that's good to know, thanks [01:13:59] tgr: python -c "print '\n'.join({db.strip() for db in open('all.dblist')} - {db.strip() for db in open('closed.dblist')} - {db.strip() for db in open('silver.dblist')})" > my.dblist [01:14:17] I wonder if it would be worth the effort to teach foreachwikiindblist to handle something like 'all !closed !private' [01:15:58] python -c "print ''.join(set(open('all.dblist')) - set(open('closed.dblist')) - set(open('silver.dblist')))" [01:16:01] that's neater [01:16:20] tgr: i doubt it [01:16:57] (03PS1) 10Eevans: enable authenticated access to Cassandra JMX [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/196133 [01:16:57] but it may be worthwhile to have the dblist update script generate a public.dblist [01:18:29] two problems with that patchset [01:18:33] (03CR) 10Ori.livneh: "1) Do you really need jmxremote.password, or can you just interpolate the password parameter directly into the script invocation?" [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/196133 (owner: 10Eevans) [01:20:25] TimStarling: ? 
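The "use comm or grep" suggestion at [01:08:24] (an alternative to ori's python one-liners) can be sketched as follows. The dblist file names are the real ones discussed above, but the contents here are made-up sample data so the example is self-contained.

```shell
# Build a dblist of wikis in all.dblist but not in closed.dblist,
# silver.dblist or private.dblist, per the suggestion above.
# Sample data only; the real files live in the config checkout on tin.
set -e
cd "$(mktemp -d)"
printf 'aawiki\nenwiki\nlabswiki\nmetawiki\n' > all.dblist
printf 'aawiki\n'   > closed.dblist
printf 'labswiki\n' > silver.dblist
: > private.dblist
sort all.dblist > all.sorted
sort -u closed.dblist silver.dblist private.dblist > excluded.sorted
comm -23 all.sorted excluded.sorted > my.dblist   # lines only in all.sorted
cat my.dblist
```

The resulting file could then be fed to `foreachwikiindblist` as its first parameter, per [01:06:37].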
[01:24:08] (03PS1) 10Tim Starling: Fix obvious bugs in previously l10nupdate fix [puppet] - 10https://gerrit.wikimedia.org/r/196134 [01:25:40] d'oh [01:26:09] (03CR) 10Ori.livneh: [C: 031] Fix obvious bugs in previously l10nupdate fix [puppet] - 10https://gerrit.wikimedia.org/r/196134 (owner: 10Tim Starling) [01:27:10] (03CR) 10Tim Starling: [C: 032] Fix obvious bugs in previously l10nupdate fix [puppet] - 10https://gerrit.wikimedia.org/r/196134 (owner: 10Tim Starling) [01:27:49] !log on tin: testing l10nupdate [01:27:55] Logged the message, Master [01:32:19] 6operations, 10RESTBase, 10RESTBase-Cassandra: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471#1111859 (10Eevans) 3NEW a:3Eevans [01:33:15] (03PS2) 10Eevans: enable authenticated access to Cassandra JMX [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/196133 (https://phabricator.wikimedia.org/T92471) [01:35:49] ori: what's the normal puppet equivalent of vagrant-puppet's upstart class? [01:36:40] tgr: upstart class? there's an upstart service provider, but that is built in to puppet [01:37:11] I must be confusing things then [01:37:31] I remember a conversation about how upstart should be replaced with something [01:37:35] systemd maybe [01:37:39] (03PS1) 10Tim Starling: Move l10nupdate output directory [puppet] - 10https://gerrit.wikimedia.org/r/196137 [01:37:43] jessie doesn't have upstart; it only has systemd [01:37:51] if you're targetting jessie for deployment, use systemd [01:37:55] if trusty, upstart [01:38:05] if both, you'll have to branch based on os [01:40:48] 6operations: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1111870 (10Dzahn) 2 is eth3 with scope global eth3 3 is eth2 without IP ? [01:41:52] (03PS2) 10Tim Starling: Move l10nupdate output directory [puppet] - 10https://gerrit.wikimedia.org/r/196137 [01:42:37] TimStarling: why not tempCacheDir="$(mktemp -d)" ? 
[01:43:19] that way it would get purged eventually [01:44:05] sounds like a job for a separate patchset [01:44:18] I am just fixing what was there already [01:44:44] also I have to do it within 16 minutes unless I can somehow figure out how to disable the cron job [01:44:50] (03CR) 10Ori.livneh: [C: 031] Move l10nupdate output directory [puppet] - 10https://gerrit.wikimedia.org/r/196137 (owner: 10Tim Starling) [01:45:09] (03PS1) 10Dzahn: rbf2001: use eth2 MAC for DHCP [puppet] - 10https://gerrit.wikimedia.org/r/196138 (https://phabricator.wikimedia.org/T86897) [01:45:18] (03CR) 10Tim Starling: [C: 032] Move l10nupdate output directory [puppet] - 10https://gerrit.wikimedia.org/r/196137 (owner: 10Tim Starling) [01:46:19] mutante: do you know why terbium is freaking out? [01:46:21] TimStarling: rm /var/spool/cron/crontabs/l10nupdate ; puppet agent --disable [01:46:23] (03CR) 10Dzahn: [C: 032] rbf2001: use eth2 MAC for DHCP [puppet] - 10https://gerrit.wikimedia.org/r/196138 (https://phabricator.wikimedia.org/T86897) (owner: 10Dzahn) [01:46:35] andrewbogott: no, i don't [01:46:48] mutante: ok, here’s a more general question… what is terbium? [01:47:38] andrewbogott: it runs mw maintenance scripts and some other stuff like people.wm [01:47:43] replaces parts of fenari [01:47:48] andrewbogott: (a) people.wikimedia.org, (b) periodic, cron-managed mediawiki jobs, (c) noc.wikimedia.org [01:48:12] ok, so probably something has changed with permissions and a cron is running [01:48:19] yeah, the challenge is to work out how to do things like that without disabling puppet ;) [01:48:46] you know there are lots of interesting ways to sabotage a system that puppet doesn't know about [01:48:48] andrewbogott: define "freaking out"? it looks ok in icinga? 
[01:49:15] mutante: > 100 emails sent to root in the last 5 minutes [01:49:25] “ www-data : user NOT in sudoers ; TTY=unknown “ [01:49:33] I’m new to getting these root emails, but — surely that’s not normal [01:50:00] (03CR) 10Eevans: "Which use of jmxremote.password are you referring to?" [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/196133 (https://phabricator.wikimedia.org/T92471) (owner: 10Eevans) [01:50:58] hm, and now silver is doing the same thing [01:51:54] !log tstarling Synchronized php-1.25wmf20/cache/l10n: (no message) (duration: 00m 02s) [01:51:59] Logged the message, Master [01:52:40] andrewbogott: https://gerrit.wikimedia.org/r/#/c/196132/ [01:52:56] TimStarling: andrewbogott i think those are related [01:52:58] yeah, I just saw that… [01:53:15] !log LocalisationUpdate completed (1.25wmf20) at 2015-03-12 01:52:11+00:00 [01:53:17] TimStarling: here’s a sample https://phabricator.wikimedia.org/P388 [01:53:20] Logged the message, Master [01:53:22] !log tstarling Synchronized php-1.25wmf21/cache/l10n: (no message) (duration: 00m 01s) [01:53:24] because those mails are from crons using mwscript [01:53:27] Logged the message, Master [01:53:42] you know what I think about sudo mails, right? [01:53:59] TimStarling: no, but I’m guessing you think that they’re never ever useful :) [01:54:13] yep [01:54:47] and they're easy enough to disable in sudo's configuration, I've recommended it in the past [01:55:00] …and you don’t think that they indicate that our crons (and job runners) are failing? Or did that email storm just happen during a transition? [01:55:01] !log LocalisationUpdate completed (1.25wmf21) at 2015-03-12 01:53:58+00:00 [01:55:03] but ops are a special breed who love spam [01:55:06] Logged the message, Master [01:55:35] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 4 unmerged changes in puppet (dir /var/lib/git/operations/puppet). 
[01:55:56] it's a problem, but it's not your problem, it's my problem [01:56:15] PROBLEM - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1459 bytes in 0.431 second response time [01:56:28] ok :) As long as you know what’s happening I will start deleting these [01:56:36] ops gets mail every time someone runs sudo? [01:57:25] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [01:57:50] no, only if the permissions are wrong, e.g. you mistyped the user [01:58:03] Krenair: https://xkcd.com/838/ [01:58:11] ah so that's the "this incident will be reported"? :) [01:58:23] yep :) [01:58:28] :D [01:58:29] TimStarling: my stab at evil: sed -i -e 's|var/spool|var/ffool|g' `which cron` [01:58:40] * andrewbogott withdraws [01:59:03] andrewbogott: this is speculative fiction, not something i'd ever actually do :P [02:00:32] !log rbf2001 reboot from busybox :p [02:00:38] Logged the message, Master [02:00:48] !log on tin: disabled puppet for l10nupdate testing [02:00:53] Logged the message, Master [02:02:26] Anyone messing with terbium? mutante ? [02:04:10] what's wrong with it? [02:04:15] Cron jobs were killed [02:04:32] And seems like the cron daemon also was stopped [02:04:37] so I don't dare to restart [02:04:46] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: puppet fail [02:08:05] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [02:08:32] 6operations, 5Patch-For-Review: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1111880 (10Dzahn) i tried to use eth2 and " Network autoconfiguration failed Your network is probably not using the DHCP protocol. " [02:09:02] OOM? [02:09:10] probably needs a reboot [02:09:37] TimStarling: Terbium? 
[02:10:36] yes, if the cron daemon was killed then maybe oom-killer killed it [02:10:49] in which case rebooting would be a safe way to make sure all daemons are running [02:10:52] No, doesn't look like it [02:12:07] !log tstarling Synchronized README: (no message) (duration: 00m 01s) [02:12:13] Logged the message, Master [02:15:15] !log tstarling Synchronized README: (no message) (duration: 00m 01s) [02:16:29] scap is failing with an SSH error [02:16:39] I am trying to figure out why [02:17:13] what is the error? [02:17:40] 02:15:14 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'README', 'mw1033.eqiad.wmnet', 'mw1070.eqiad.wmnet', 'mw1097.eqiad.wmnet', 'mw1216.eqiad.wmnet', 'mw1161.eqiad.wmnet', 'mw1201.eqiad.wmnet'] on mw1121 returned [255]: Permission denied (publickey). [02:17:58] and the same for every server [02:18:05] 02:15:14 267 apaches had sync errors [02:18:15] but I can ssh to them from the terminal [02:18:24] using l10nupdate's private key [02:18:26] problem with the shared ssh-agent? [02:18:30] are you not in wikidev? [02:18:52] I think l10nupdate's keep has been borked since December [02:18:55] I am in l10nupdate only [02:18:58] *key [02:19:06] [0215][l10nupdate@tin:~]$ groups [02:19:06] l10nupdate [02:19:13] that won't work [02:19:21] why not? [02:19:53] scap is configured to use /var/run/keyholder/proxy.sock [02:20:12] since December? [02:20:25] time flies [02:20:41] so should it not do that for l10nupdate, or should it be made to work? [02:20:50] https://phabricator.wikimedia.org/T76061 [02:21:45] probably neither. we scap regularly enough, why do messages have to be synced automatically? [02:22:14] so people don't open "why is the message not updated" bugs [02:22:26] "because we haven't scapped since then" [02:22:34] we've been fine since december [02:23:30] ori: Could you take a look at terbium? The cron daemon is running (it's cron on Ubuntu, not crond, doh)... 
but for some reason not starting any Wikidata crons [02:24:22] "Urgency just became High+1 for us, it turns out there is no workaround, LocalisationUpdate is overriding newer, manually deployed extension messages." [02:24:46] so we have two bugs? [02:25:05] TimStarling: followed by "Okay, scap does in fact update the messages. Thanks! This means we have a workaround, I'm lowering the urgency again. " [02:25:06] That comment was operator error [02:25:45] They did not scap because they did not know all the strangeness of l10n cache [02:26:36] So right now l10nupdarte does update the cdb files on tin and then they ship whenever the next scap is run by a member of the wikidev group [02:26:57] seems fine to me [02:27:03] I've tried to fix this in scap/scyn-* but with no luck [02:27:05] having l10nupdate scap was always creepy [02:27:12] i could have ssh-agent-proxy get the remote socket's credentials via getsockopt() SO_PEERCRED [02:27:13] so currently you don't know why it is happening? [02:27:31] I have scanned through the bug comments and that is the current status? [02:27:48] Correct. My last investigation was https://phabricator.wikimedia.org/T76061#1060333 [02:27:48] it happens because scap hard-codes an SSH_AUTH_SOCK that is not readable for l10nupdate [02:28:38] I actually made a fix for that -- https://gerrit.wikimedia.org/r/#/c/191248/1/scap/cli.py,unified [02:28:46] but it seems not to make a difference [02:28:56] and I'm really not exactly sure why... [02:29:21] Oh! because the ssh isn't done as user l10nupdate on the far end! 
[editor's note: "l10nupdarte" at 02:26:36 is l10nupdate; "scap/scyn-*" at 02:27:03 refers to the scap sync-* scripts]
[02:29:48] let's run it under strace and see what ssh command it runs [02:30:19] The far end user is going to be mwdeploy I bet [02:30:38] boo, no -b [02:30:45] and mwdeploy isn't going to accept the l10nupdate key [02:31:40] !log tstarling Synchronized README: (no message) (duration: 00m 06s) [02:31:45] Logged the message, Master [02:31:58] yeah it uses -lmwdeploy [02:32:19] which also changed when we added the ssh-agent proxy [02:32:50] I could patch scap to allow the user to be set with a command line arg [02:33:10] right now we take it from the config file [02:33:41] then l10nupdate could add `--ssh-user l10nupdate` or something [02:37:48] ok [02:37:49] It doesn't even need a patch. `-D ssh_user=l10nupdate` should do it I think [02:38:13] I will just comment out the scap for now so that I can test my own changes without seeing so many error messages [02:39:05] `-D ssh_user:l10nupdate` [02:41:15] (03PS2) 10Nuria: Adding a Last-Access cookie to text and mobile requests [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T92435) [02:41:21] !log Manually started 7 Wikibase dispatchChanges instances on terbium after cron failed to start them. 
[02:41:29] Logged the message, Master [02:42:24] (03PS1) 10BryanDavis: l10nupdate: connect to remote hosts as l10nupdate user [puppet] - 10https://gerrit.wikimedia.org/r/196143 (https://phabricator.wikimedia.org/T76061) [02:44:29] (03CR) 10Nuria: "> re: appending to the cookie, we have a vmod" [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T92435) (owner: 10Nuria) [02:45:37] TimStarling, ori: I think https://gerrit.wikimedia.org/r/#/c/196143/1 will fix this so l10nupdate works again [02:46:46] ok, may as well test that while I'm up to my elbows in it [02:47:27] (03CR) 10Tim Starling: [C: 032] l10nupdate: connect to remote hosts as l10nupdate user [puppet] - 10https://gerrit.wikimedia.org/r/196143 (https://phabricator.wikimedia.org/T76061) (owner: 10BryanDavis) [02:49:00] mutante: ok to deploy your change? [02:49:32] TimStarling: yes please [02:49:37] sorry for leaving it there [02:50:14] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [02:50:42] there is still a permission issue on terbium [02:50:44] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [02:51:45] hoo: terbium CRON[10133]: (www-data) CMD (/usr/local/bin/mwscript extensions/Wikidata/extensions/Wikibase... 
this looks like it works again [02:56:30] PHP Warning: fopen(/tmp/mw-cache-1.25wmf20/conf-aawiki): failed to open stream: Permission denied in /srv/mediawiki/wmf-config/CommonSettings.php on line 192 [02:56:37] !log tstarling Synchronized php-1.25wmf20/cache/l10n: (no message) (duration: 00m 03s) [02:56:45] Logged the message, Master [02:57:29] bd808: seems to work [02:57:56] Running rsync command: `sudo -u mwdeploy -n -- /usr/bin/rsync --archive --delete-delay --delay-updates --compress --delete --exclude=**/.svn/lock --exclude=**/.git/objects --exclude=**/.git/**/objects --exclude=**/cache/l10n/*.cdb --no-perms --include=/php-1.25wmf20 --include=/php-1.25wmf20/cache --include=/php-1.25wmf20/cache/l10n --include=/php-1.25wmf20/cache/l10n/*** --exclude=* mw1216.eqiad.wmnet::common /srv/mediawiki` [02:58:04] !log LocalisationUpdate completed (1.25wmf20) at 2015-03-12 02:57:01+00:00 [02:58:04] w00t [02:58:09] Logged the message, Master [02:59:18] I was following the scap logs with -- tail -f /a/mw-log/scap.log | python ~bd808/scaplog.py -- from fluorine [03:04:12] Wikidata dispatch lag is growing right now (45 minutes already) because the dispatchers aren't running :( ... Any news? [03:08:45] Manually started 8 instances again... 
[03:11:19] !log tstarling Synchronized php-1.25wmf21/cache/l10n: (no message) (duration: 00m 04s) [03:11:27] Logged the message, Master [03:13:01] !log LocalisationUpdate completed (1.25wmf21) at 2015-03-12 03:11:58+00:00 [03:13:06] Logged the message, Master [03:25:11] (03PS1) 10Hoo man: Don't rely on USER to be correct in mwscript [puppet] - 10https://gerrit.wikimedia.org/r/196145 [03:25:17] TimStarling: mutante: ori ^ fix [03:31:52] (03CR) 10BryanDavis: [C: 031] Don't rely on USER to be correct in mwscript [puppet] - 10https://gerrit.wikimedia.org/r/196145 (owner: 10Hoo man) [03:34:59] TimStarling: Apparently cron jobs don't set $USER so mwscript needs a tweak [03:35:08] Yes [03:35:18] hoo was nice enough to make a patch [03:35:30] just needs a merge and sync [03:35:50] https://gerrit.wikimedia.org/r/#/c/196145/ [03:39:09] (03CR) 10Ori.livneh: [C: 032] Don't rely on USER to be correct in mwscript [puppet] - 10https://gerrit.wikimedia.org/r/196145 (owner: 10Hoo man) [03:39:15] sorry, I was away [03:39:22] Thank you [03:39:27] thx ori [03:39:29] i'll force a puppet run on terbium [03:39:47] bd808: nice fix, though I really do still think it's not a great idea to have l10nupdate scap [03:40:26] it doesn't run a full scap, just a sync-dir of the cdb json dumps [03:40:53] But it is creepy [03:41:49] hoo: ran puppet on terbium. do you need me to do anything else? 
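The root cause bd808 names above ([03:34:59]) is that cron starts jobs without `$USER` in the environment, so a script that trusts `$USER` misbehaves. A minimal illustration of the failure mode and one common fallback is below; whether hoo's actual patch (gerrit 196145) uses exactly this construct is not visible in the log.

```shell
# Cron omits $USER from the job environment; a robust script derives
# the invoking user itself instead of relying on $USER being set.
unset USER                     # simulate the cron environment
user="${USER:-$(id -un)}"      # fall back to the effective user name
echo "$user"
```

With `$USER` unset, the parameter expansion falls through to `id -un`, so the script still sees a correct, non-empty user name.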
[03:42:00] ori: No, I guess that's it :) Thank you [03:42:15] Works, a new instance just popped up :) [03:45:42] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Mar 12 03:44:38 UTC 2015 (duration 50m 35s) [03:45:48] Logged the message, Master [04:08:03] (03PS1) 10Gage: IPsec: refactor hiera data, clean up template [puppet] - 10https://gerrit.wikimedia.org/r/196155 [04:09:19] 6operations: Make puppet the sole manager of user keys - https://phabricator.wikimedia.org/T92475#1112128 (10yuvipanda) 3NEW [04:09:52] (03CR) 10Gage: [C: 032] IPsec: refactor hiera data, clean up template [puppet] - 10https://gerrit.wikimedia.org/r/196155 (owner: 10Gage) [04:11:57] yay, that worked [04:39:32] (03CR) 10Yuvipanda: ssh: introduce ssh::userkey resource (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/183814 (owner: 10Faidon Liambotis) [04:39:41] (03PS2) 10Yuvipanda: ssh: remove .ssh/authorized_keys support from prod [puppet] - 10https://gerrit.wikimedia.org/r/183824 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:39:43] (03PS3) 10Yuvipanda: ssh: introduce ssh::userkey resource [puppet] - 10https://gerrit.wikimedia.org/r/183814 (owner: 10Faidon Liambotis) [04:39:45] (03PS2) 10Yuvipanda: ssh: recurse/purge => true for /etc/ssh/userkeys [puppet] - 10https://gerrit.wikimedia.org/r/183815 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:39:47] (03PS2) 10Yuvipanda: reprepro: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183818 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:39:49] (03PS2) 10Yuvipanda: openstack: transition nova to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183819 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:39:51] (03PS2) 10Yuvipanda: ssh: change userkeys' path hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/183816 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) 
[04:39:53] (03PS2) 10Yuvipanda: ssh: support /etc/ssh/userkeys in production too [puppet] - 10https://gerrit.wikimedia.org/r/183817 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:39:55] (03PS2) 10Yuvipanda: puppet: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183822 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:39:57] (03PS2) 10Yuvipanda: admin: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183823 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:39:59] (03PS2) 10Yuvipanda: mediawiki: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183820 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:40:01] (03PS2) 10Yuvipanda: authdns: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183821 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:40:36] (03CR) 10jenkins-bot: [V: 04-1] ssh: remove .ssh/authorized_keys support from prod [puppet] - 10https://gerrit.wikimedia.org/r/183824 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:42:22] (03CR) 10jenkins-bot: [V: 04-1] authdns: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183821 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:42:39] (03CR) 10jenkins-bot: [V: 04-1] admin: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183823 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:42:42] (03CR) 10jenkins-bot: [V: 04-1] mediawiki: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183820 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:42:44] (03CR) 10jenkins-bot: [V: 04-1] puppet: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183822 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:43:09] (03PS3) 
10Yuvipanda: ssh: remove .ssh/authorized_keys support from prod [puppet] - 10https://gerrit.wikimedia.org/r/183824 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:43:11] (03PS3) 10Yuvipanda: reprepro: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183818 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:43:13] (03PS3) 10Yuvipanda: openstack: transition nova to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183819 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:43:15] (03PS3) 10Yuvipanda: puppet: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183822 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:43:17] (03PS3) 10Yuvipanda: admin: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183823 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:43:19] (03PS3) 10Yuvipanda: mediawiki: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183820 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:43:21] (03PS3) 10Yuvipanda: authdns: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183821 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:50:26] (03PS4) 10Yuvipanda: ssh: introduce ssh::userkey resource [puppet] - 10https://gerrit.wikimedia.org/r/183814 (owner: 10Faidon Liambotis) [04:50:51] (03CR) 10Yuvipanda: [C: 032] "Beta only change, so is good." 
[puppet] - 10https://gerrit.wikimedia.org/r/183814 (owner: 10Faidon Liambotis) [04:55:34] (03PS3) 10Yuvipanda: ssh: recurse/purge => true for /etc/ssh/userkeys [puppet] - 10https://gerrit.wikimedia.org/r/183815 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [04:56:45] RECOVERY - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1442 bytes in 0.275 second response time [05:01:42] (03PS3) 10Yuvipanda: ssh: change userkeys' path hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/183816 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [05:01:57] (03CR) 10Yuvipanda: [C: 032] ssh: recurse/purge => true for /etc/ssh/userkeys [puppet] - 10https://gerrit.wikimedia.org/r/183815 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [05:02:08] (03PS1) 10BBlack: update esams cache domainnames in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/196156 [05:02:21] (03PS2) 10BBlack: update esams cache domainnames in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/196156 [05:02:51] (03CR) 10BBlack: [C: 032 V: 032] update esams cache domainnames in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/196156 (owner: 10BBlack) [05:03:04] (03CR) 10Yuvipanda: [C: 032] ssh: change userkeys' path hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/183816 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [05:03:18] go for it :) [05:03:24] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 5 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [05:04:25] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[05:07:08] (03PS3) 10Yuvipanda: ssh: support /etc/ssh/userkeys in production too [puppet] - 10https://gerrit.wikimedia.org/r/183817 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [05:07:36] (03PS4) 10Yuvipanda: mediawiki: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183820 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [05:12:34] (03CR) 10Yuvipanda: [C: 032] ssh: support /etc/ssh/userkeys in production too [puppet] - 10https://gerrit.wikimedia.org/r/183817 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [05:20:05] (03CR) 10Yuvipanda: [C: 032] mediawiki: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183820 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [05:23:17] !log legoktm Synchronized README: testing that Yuvi didnt break anything (duration: 00m 05s) [05:23:25] Logged the message, Master [05:23:29] Guest56166: all good [05:23:33] legoktm: sweet. [05:23:42] legoktm: can you also try sshing to mw1219 yourself, see if that works? [05:23:48] you have access there, no? 
[05:23:54] * Guest56166 doesn't remember [05:24:08] legoktm@mw1219:~$ echo "hi" [05:24:08] hi [05:24:17] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [05:24:28] Guest56166: ^ [05:24:47] strange [05:24:49] logs show nothing [05:25:23] uh [05:25:26] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [05:25:42] false alarm [05:26:15] 6operations, 10ops-ulsfo: cp4009 hardware fault - https://phabricator.wikimedia.org/T92476#1112175 (10BBlack) 3NEW [05:29:36] PROBLEM - SSH on nescio is CRITICAL: Connection refused [05:30:05] (03PS1) 10BBlack: cp1046/cp4009 -> jessie; depool cp4009 T92476 [puppet] - 10https://gerrit.wikimedia.org/r/196157 [05:30:27] (03PS2) 10BBlack: cp1046/cp4009 -> jessie; depool cp4009 T92476 [puppet] - 10https://gerrit.wikimedia.org/r/196157 [05:30:37] PROBLEM - SSH on sodium is CRITICAL: Connection refused [05:30:39] (03CR) 10BBlack: [C: 032 V: 032] cp1046/cp4009 -> jessie; depool cp4009 T92476 [puppet] - 10https://gerrit.wikimedia.org/r/196157 (owner: 10BBlack) [05:30:41] uh, [05:30:44] that’s our lucid boxen. [05:30:48] yup [05:30:56] did the ssh changes break lucid? [05:31:02] I'm gonna guess ssh::userkey doesn't like lucid, yeah [05:31:08] sigh [05:31:28] wait, does that mean sshd didn’t start and the boxes are unconnectable to now? 
[05:31:42] I think puppet will be able to fix it, puppet doesn't rely on ssh [05:31:44] yeah [05:31:53] but, something has to be fixed in puppet first, without ssh access to look at the issue :) [05:31:58] I guess I’ll put an os guard [05:32:04] I wonder if salt can reach them [05:32:09] then I kind of have pseudo ssh [05:32:09] probably [05:33:38] yup, yup [05:34:29] we should really get those boxes reinstalled :/ [05:34:36] yeah [05:35:13] bblack: there's a lone lucid box in labs as well, to support this [05:37:02] (03PS1) 10Yuvipanda: ssh: Don't give lucid nice things, because lucid is not nice [puppet] - 10https://gerrit.wikimedia.org/r/196159 (https://phabricator.wikimedia.org/T92475) [05:37:04] bblack: ^ [05:37:40] if what I suspect is true then this should work [05:38:06] (03CR) 10Yuvipanda: [C: 032] ssh: Don't give lucid nice things, because lucid is not nice [puppet] - 10https://gerrit.wikimedia.org/r/196159 (https://phabricator.wikimedia.org/T92475) (owner: 10Yuvipanda) [05:39:27] RECOVERY - SSH on sodium is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7.1 (protocol 2.0) [05:39:39] bblack: ^ yay. [05:39:42] also boo lucid [05:39:51] awesome [05:40:02] now to see if I can fix it for lucid as well [05:41:30] (03PS1) 10BBlack: depool cp3009 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196160 [05:41:41] (03CR) 10BBlack: [C: 032 V: 032] depool cp3009 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196160 (owner: 10BBlack) [05:43:42] heh, git-gc run during puppet-merge [05:43:45] this may take a while :P [05:44:34] ditto on strontium, I guess it makes sense they'd reach the same point at the same time [05:44:37] 6operations, 5Patch-For-Review: Make puppet the sole manager of user keys - https://phabricator.wikimedia.org/T92475#1112197 (10yuvipanda) So this fails on lucid because ssh refuses to start when there's more than one path on the AuthorizedKeys directive. So it should be fine once everything is merged. 
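The lucid breakage above comes down to sshd version support: per the phabricator note, lucid's OpenSSH 5.3 refuses to start when `AuthorizedKeysFile` lists more than one path (multi-path support arrived in OpenSSH 5.9). A hedged sketch of the kind of version guard the "Don't give lucid nice things" patch implies; `keyfile_directive` and the key paths are illustrative, not the actual puppet code:

```shell
# Decide which AuthorizedKeysFile directive is safe for a given sshd.
# OpenSSH >= 5.9 accepts multiple paths; older sshd (lucid's 5.3)
# refuses to start on a multi-path directive.
keyfile_directive() {
    # $1 is an "ssh -V"-style banner, e.g. "OpenSSH_5.3p1 Debian-3ubuntu7.1"
    ver=$(printf '%s\n' "$1" | sed -n 's/^OpenSSH_\([0-9][0-9]*\)\.\([0-9][0-9]*\).*/\1 \2/p')
    set -- $ver
    if [ "$1" -gt 5 ] || { [ "$1" -eq 5 ] && [ "$2" -ge 9 ]; }; then
        echo 'AuthorizedKeysFile /etc/ssh/userkeys/%u .ssh/authorized_keys'
    else
        # old sshd: single path only, or it refuses to start
        echo 'AuthorizedKeysFile .ssh/authorized_keys'
    fi
}

keyfile_directive 'OpenSSH_5.3p1 Debian-3ubuntu7.1'
```

In the real repo this decision is made in puppet (keyed on the distro release rather than the sshd banner), but the version cutoff is the same.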
[05:44:48] PROBLEM - puppet last run on cp1046 is CRITICAL: CRITICAL: Puppet has 1 failures [05:45:55] second time now that initial puppet has hit: [05:45:55] Mar 12 05:27:03 cp1046 puppet-agent[1110]: (/Stage[main]/Base::Monitoring::Host/Package[mpt-status]/ensure) change from purged to latest failed: Could not update: Execution of '/usr/bin/apt-get -q -y -o DPkg::Opt [05:46:00] ions::=--force-confold install mpt-status' returned 100: Reading package lists... [05:46:03] Mar 12 05:27:03 cp1046 puppet-agent[1110]: (/Stage[main]/Base::Monitoring::Host/Package[mpt-status]/ensure) E: Problem renaming the file /var/cache/apt/pkgcache.bin.QaB5Ef to /var/cache/apt/pkgcache.bin - rename [05:46:07] (2: No such file or directory) [05:46:23] I'm guessing some apt command is running in parallel from outside of puppet and screwing that up, but I don't think the puppet-run cron would be going by then... [05:46:58] RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [05:47:07] RECOVERY - SSH on nescio is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7.1 (protocol 2.0) [05:47:33] oh no, it did happen exactly at the puppet-run mark heh [05:47:59] so, on initial puppet run, puppet installs the cron, then collides with cron running apt-get update through that and fails a package install [05:48:02] how lovely [05:49:19] ah, hehe [05:50:03] well this is all part of the same problem with manual puppet runs colliding with it too [05:50:41] given that puppet-run doesn't exist before the initial puppet run, I think we have to solve this by having puppet-run check for the lockfile used by the actual puppet agent (before even doing the preactions like apt-get update) [05:51:08] PROBLEM - RAID on analytics1010 is CRITICAL: CRITICAL: Active: 6, Working: 6, Failed: 2, Spare: 0 [05:52:10] (of course, we can check it but can't lock it I guess, so it will still be racy, just less-racy) [06:02:02] brb [06:04:58] (03PS1) 10BBlack: 
puppet-run: pre-check agent lock before apt [puppet] - 10https://gerrit.wikimedia.org/r/196162 [06:05:10] (03CR) 10BBlack: [C: 032 V: 032] puppet-run: pre-check agent lock before apt [puppet] - 10https://gerrit.wikimedia.org/r/196162 (owner: 10BBlack) [06:15:23] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [06:16:05] puppet-merge, you suck [06:17:33] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [06:22:59] (03PS1) 10BBlack: repool cp3009 [puppet] - 10https://gerrit.wikimedia.org/r/196165 [06:23:21] (03CR) 10BBlack: [C: 032 V: 032] repool cp3009 [puppet] - 10https://gerrit.wikimedia.org/r/196165 (owner: 10BBlack) [06:27:30] bblack: how are changes synchronized to strontium anyway? I don't see anything about that in puppet-merge itself. [06:28:23] PROBLEM - puppet last run on db1002 is CRITICAL: CRITICAL: Puppet has 2 failures [06:28:23] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 3 failures [06:28:45] ori: it's through some git hook on palladium [06:28:52] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:21] we've looked into that problem a few times in the past, but nobody found a definitive answer for how to fix it. it's some kind of race condition. [06:29:52] challenge accepted [06:29:58] :) [06:30:11] I wish I saved the output to paste at you [06:30:31] 10Ops-Access-Requests, 6operations, 10MediaWiki-extensions-ContentTranslation, 3LE-Sprint-84, 5Patch-For-Review: Access to stat1003 for Niklas and Kartik - https://phabricator.wikimedia.org/T91625#1112230 (10Nikerabbit) I have signed now as well. 
[06:30:46] but basically over on strontium, it ends up saying something like expected "5a454cb5 but got 4aecfa4f" when it fails [06:31:18] and the ordering of the hashes in that failure message make it sound like (to me) that strontium was expecting to see the older commit from before this merge, but instead saw the new one, in whatever it was looking at for whatever [06:31:32] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:52] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:52] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:09] I guess it's time for nightly puppetfail spam :P [06:32:24] <_joe_> ori: http://www.oriblindforest.com/ [06:33:07] _joe_: i am middle-eastern, but i'm not that hairy [06:33:30] ori: I tend to have better "luck" at triggering it if I operate very quickly, I think. but it could be my imagination (e.g. hit the submit button on gerrit in the browser and immediately fire off puppet-merge on palladium without waiting - sometimes you have to re-run a couple times before it even sees it) [06:34:23] bblack: you wouldn't happen to have a log of that error occurring anywhere? [06:35:10] nope :/ [06:35:18] unless the machines themselves are logging it somewhere [06:36:08] seems not [06:36:13] the hook is at palladium:/var/lib/git/operations/puppet/.git/hook [06:36:31] the hook is at palladium:/var/lib/git/operations/puppet/.git/hooks/post-merge , I mean [06:37:00] thanks for the overview! [06:37:48] it's also possible that gerrit is the culprit. that for whatever reason in that short window of time, it presents different commits to the two hosts, or presents some inconsistent state to strontium. 
[06:40:23] it's coming from https://github.com/git/git/blob/master/refs.c#L2146-2152 [06:40:25] <_joe_> bblack: that happens sometimes [06:40:35] <_joe_> but it's not the common problem [06:40:55] <_joe_> ori: that is one of the errors [06:41:05] well sure there are other problems. there used to be one when using sudo incorrectly, and sometimes the git connection just fails, etc [06:41:09] what else have you seen? [06:41:20] <_joe_> but the most common one is the "I don't know who you are" git error [06:41:21] but this one I saw earlier this evening was the pure one, just a hash mismatch on strontium [06:41:37] <_joe_> bblack: hash mismatch is because gerrit [06:41:49] <_joe_> you must season the patch for a while after you merged it [06:42:11] yeah sometimes I rush the process a bit when I know I'm right :P [06:42:19] it's not surprising my fingers can outrun java [06:43:23] <_joe_> lol [06:43:23] while I was blindly typing "yes" to a diff I didn't read, java was still instantiating a CommitReviewPusherContainerObjectInstantiatorForMergeAbstractHolderFunctor [06:44:19] ??? <_joe_> but the most common one is the "I don't know who you are" git error [06:44:27] that's not a real error message, is it?
[06:45:19] <_joe_> ori: not sure [06:45:35] <_joe_> ori: IIRC, yes [06:45:52] RECOVERY - puppet last run on db1002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:46:07] (¬_¬) [06:46:21] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:46:30] might be: http://stackoverflow.com/questions/11656761/git-please-tell-me-who-you-are-error [06:46:51] although that would seem like not it based on SO content [06:46:52] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:47:02] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:47:12] the suggestion to run git fetch / git reset --hard origin/master [06:47:17] (because I swear when that issue was first common on palladium, a common cause was doing "sudo puppet-merge" instead of sudo -i [06:47:18] instead of git-pull is a good one [06:48:20] one of the problems with solving these is they're not easy to reproduce, and easy to work around by retrying things [06:50:22] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:50:45] the post-merge git hook on palladium isn't puppetized [06:51:11] oh no, that's not true [06:51:12] it's puppetmaster/templates/post-merge.erb [06:51:31] we were just bitten by $USER not being set in non-login shells earlier [06:55:24] ugh [06:55:30] deployment-prep puppetmaster seems dead [06:59:42] 6operations, 5Patch-For-Review, 3wikis-in-codfw: setup & deploy rdb2001-2004 - https://phabricator.wikimedia.org/T92011#1112251 (10Joe) @Dzahn thanks for looking into it - it may well be that my partman recipe was wrong, I'll look into it! 
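The `git fetch` + `git reset --hard origin/master` suggestion above (in place of `git pull`) fits the strontium case well: a mirror that must exactly track the canonical repo can simply discard any local state instead of attempting a merge. A minimal sketch, where `sync_mirror` is a hypothetical helper, not the actual post-merge hook:

```shell
# Bring a checkout to exactly match its origin branch, discarding any
# local divergence -- appropriate for a read-only mirror like the
# strontium puppet clone, where a half-propagated ref can otherwise
# wedge a pull/merge.
sync_mirror() {
    repo=$1 branch=$2
    git -C "$repo" fetch --quiet origin &&
    git -C "$repo" reset --quiet --hard "origin/$branch"
}
```

Usage would be along the lines of `sync_mirror /var/lib/git/operations/puppet production` from the hook.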
[07:05:28] 6operations, 5Patch-For-Review, 3wikis-in-codfw: setup & deploy rdb2001-2004 - https://phabricator.wikimedia.org/T92011#1112257 (10Joe) I just tried to reinstall rdb2002, and after loading the kernel the console went completely dark, so this happens well before the partman stage I guess. Anyway, I'll try to... [07:06:41] PROBLEM - Host cp4009 is DOWN: PING CRITICAL - Packet loss = 100% [07:08:17] (03PS1) 10Giuseppe Lavagetto: redis: revert partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/196166 [07:08:46] (03CR) 10Giuseppe Lavagetto: [C: 032] redis: revert partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/196166 (owner: 10Giuseppe Lavagetto) [07:08:53] (03CR) 10Giuseppe Lavagetto: [V: 032] redis: revert partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/196166 (owner: 10Giuseppe Lavagetto) [07:09:42] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [07:18:33] (03PS1) 10Ori.livneh: puppetmaster post-merge hook fixes [puppet] - 10https://gerrit.wikimedia.org/r/196167 [07:18:41] bblack, _joe_ ^ [07:20:02] <_joe_> ori: I'm not sure git reset --hard origin/production is a good idea [07:20:23] <_joe_> but lemme think about that [07:20:36] <_joe_> right now I'm working on debian-installer bugs [07:20:48] kk [07:20:51] i'm off to bed anyhow [07:20:59] good $TIME_OF_DAY [07:22:46] <_joe_> ori: night! [07:23:12] (03PS1) 10Giuseppe Lavagetto: redis: install rdb2001 as trusty [puppet] - 10https://gerrit.wikimedia.org/r/196168 [07:23:38] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] redis: install rdb2001 as trusty [puppet] - 10https://gerrit.wikimedia.org/r/196168 (owner: 10Giuseppe Lavagetto) [07:36:08] 7Puppet, 6operations, 5Patch-For-Review: Resource attributes are quoted inconsistently - https://phabricator.wikimedia.org/T91908#1112282 (10Joe) So can we agree on using ensure => present and ensure => directory ? As for the "more correct" nature of using quoted values... 
whatever is legitimate in a... [07:49:16] (03PS1) 10Giuseppe Lavagetto: redis: use ubuntu, correct partman scheme [puppet] - 10https://gerrit.wikimedia.org/r/196170 [07:50:34] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] redis: use ubuntu, correct partman scheme [puppet] - 10https://gerrit.wikimedia.org/r/196170 (owner: 10Giuseppe Lavagetto) [07:54:55] debian-installer bugs? [07:55:16] and ubuntu? [07:55:19] need help? :) [08:06:00] <_joe_> paravoid: if you wish :) [08:07:09] <_joe_> paravoid: atm, if I try to install the new redis servers with jessie, I get https://dpaste.de/f2mu/raw and it freezes [08:07:34] <_joe_> even trying to boot the installer in debug mode didn't give me any additional info [08:17:28] 6operations, 6Multimedia, 7HHVM: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1112289 (10Kopiersperre) May you add some additional fonts, e.g. [[https://wiki.debian.org/SubstitutingCalibriAndCambriaFonts|Caladea and Carlito]] for easier SVG rendering? See also https://comm... 
[08:19:13] (03PS1) 10Yuvipanda: ssh: Make userkeys be world readable [puppet] - 10https://gerrit.wikimedia.org/r/196171 (https://phabricator.wikimedia.org/T92475) [08:19:14] paravoid: ^ [08:19:39] (03CR) 10Faidon Liambotis: [C: 032] ssh: Make userkeys be world readable [puppet] - 10https://gerrit.wikimedia.org/r/196171 (https://phabricator.wikimedia.org/T92475) (owner: 10Yuvipanda) [08:20:23] (03PS2) 10Yuvipanda: ssh: Make userkeys be world readable [puppet] - 10https://gerrit.wikimedia.org/r/196171 (https://phabricator.wikimedia.org/T92475) [08:20:31] (03CR) 10Yuvipanda: [V: 032] ssh: Make userkeys be world readable [puppet] - 10https://gerrit.wikimedia.org/r/196171 (https://phabricator.wikimedia.org/T92475) (owner: 10Yuvipanda) [08:32:12] PROBLEM - DPKG on mw1152 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [08:34:11] PROBLEM - HHVM processes on mw1152 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [08:34:11] PROBLEM - Apache HTTP on mw1152 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.020 second response time [08:34:31] PROBLEM - HHVM rendering on mw1152 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.010 second response time [08:37:45] <_joe_> that's me dist-upgrading it [08:37:51] RECOVERY - HHVM rendering on mw1152 is OK: HTTP OK: HTTP/1.1 200 OK - 66654 bytes in 6.128 second response time [08:38:32] RECOVERY - HHVM processes on mw1152 is OK: PROCS OK: 1 process with command name hhvm [08:38:32] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.104 second response time [08:38:52] RECOVERY - DPKG on mw1152 is OK: All packages OK [08:44:17] (03PS1) 10Giuseppe Lavagetto: mediawiki: install fonts metric-compatible with Calibri and Cambria [puppet] - 10https://gerrit.wikimedia.org/r/196173 (https://phabricator.wikimedia.org/T84842) [08:50:08] (03PS20) 10Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] 
- 10https://gerrit.wikimedia.org/r/195340 [08:53:17] <_joe_> !log pooling mw1152, the HHVM imagescaler, into production [08:53:24] Logged the message, Master [08:54:12] <_joe_> well, 8 minutes early, but still :) [08:56:12] (03PS21) 10Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - 10https://gerrit.wikimedia.org/r/195340 [08:57:15] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1112318 (10Joe) The imagescaler is now into rotation. [08:57:42] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1112319 (10Joe) @brion problem confirmed, but this would be relevant for the videoscalers, right? [09:00:32] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds [09:01:03] (03PS22) 10Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - 10https://gerrit.wikimedia.org/r/195340 [09:02:03] (03PS1) 10Hashar: contint: Remove integration/kss.git [puppet] - 10https://gerrit.wikimedia.org/r/196174 (https://phabricator.wikimedia.org/T92482) [09:02:47] PS22?? [09:03:06] paravoid: I’ve split off about 7 patches from it so far. [09:03:52] paravoid: and I test by making a patch, pushing to gerrit, and then pulling... [09:04:16] (03PS23) 10Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - 10https://gerrit.wikimedia.org/r/195340 [09:04:18] ^ is a typo fix, for example. 
[09:04:29] paravoid: when merged this will also hit tin, so I want to be vewy vewy careful [09:04:30] (03PS4) 10Faidon Liambotis: ssh: remove .ssh/authorized_keys support from prod [puppet] - 10https://gerrit.wikimedia.org/r/183824 (https://phabricator.wikimedia.org/T92475) [09:04:32] (03PS4) 10Faidon Liambotis: reprepro: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183818 (https://phabricator.wikimedia.org/T92475) [09:04:34] (03PS4) 10Faidon Liambotis: openstack: transition nova to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183819 (https://phabricator.wikimedia.org/T92475) [09:04:36] (03PS4) 10Faidon Liambotis: puppet: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183822 (https://phabricator.wikimedia.org/T92475) [09:04:38] (03PS4) 10Faidon Liambotis: admin: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183823 (https://phabricator.wikimedia.org/T92475) [09:04:40] (03PS4) 10Faidon Liambotis: authdns: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183821 (https://phabricator.wikimedia.org/T92475) [09:04:42] (03PS1) 10Faidon Liambotis: jenkins: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/196175 (https://phabricator.wikimedia.org/T92475) [09:04:44] (03PS1) 10Faidon Liambotis: gerrit: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/196176 (https://phabricator.wikimedia.org/T92475) [09:04:46] (03PS1) 10Faidon Liambotis: mha: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/196177 (https://phabricator.wikimedia.org/T92475) [09:04:48] (03PS1) 10Faidon Liambotis: datasets: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/196178 (https://phabricator.wikimedia.org/T92475) [09:04:50] (03PS1) 10Faidon Liambotis: logging: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/196179 (https://phabricator.wikimedia.org/T92475) [09:05:01] there [09:05:25] did I 
just crash grrrit-wm? [09:05:33] haha [09:05:38] paravoid: it probably got kicked off for flooding [09:05:43] no [09:05:50] that has a different quit message [09:05:58] oh [09:05:59] true [09:06:37] you did indeed [09:06:45] gerrit’s ssh stuff has been flaky, apparently [09:06:57] (03CR) 10jenkins-bot: [V: 04-1] gerrit: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/196176 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [09:07:08] aww crap [09:07:08] oh it came back [09:07:10] I also rebooted it [09:07:12] sigh [09:07:21] ah, maybe we’ll be spared a lot of -1s? [09:08:24] anyway, time to pick up my passport! [09:09:35] (03PS5) 10Faidon Liambotis: ssh: remove .ssh/authorized_keys support from prod [puppet] - 10https://gerrit.wikimedia.org/r/183824 (https://phabricator.wikimedia.org/T92475) [09:09:37] (03PS2) 10Faidon Liambotis: gerrit: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/196176 (https://phabricator.wikimedia.org/T92475) [09:09:39] (03PS2) 10Faidon Liambotis: mha: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/196177 (https://phabricator.wikimedia.org/T92475) [09:09:41] (03PS2) 10Faidon Liambotis: datasets: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/196178 (https://phabricator.wikimedia.org/T92475) [09:09:43] (03PS2) 10Faidon Liambotis: logging: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/196179 (https://phabricator.wikimedia.org/T92475) [09:09:53] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [09:10:19] <_joe_> mh that doesn't look good [09:15:12] <_joe_> !log depooling the HHVM imagescaler [09:15:19] Logged the message, Master [09:19:00] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1112402 (10Joe) I had to depool the server, as quite a lot of 
unexpected errors were spawning. E.G. when rescaling http://commons.wikimedia.org/w/thumb_handler.php/c/ca/Alari... [09:26:24] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1112432 (10Gilles) --no-external-files is an option that was coming from our custom patches to librsvg [09:31:02] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:36:41] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1112438 (10Joe) @Gilles I imagined that - given our svg code now supports the security model in the more modern rsvg versions I guessed it would work automagically. It is not... [09:41:33] 6operations: Higher Targeted Traffic: Wmflabs.Org - https://phabricator.wikimedia.org/T92485#1112447 (10emailbot) [09:41:59] (03PS1) 10Gilles: Re-enable thumbnail prerendering in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196183 [09:42:55] (03PS2) 10Gilles: Re-enable thumbnail prerendering in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196183 (https://phabricator.wikimedia.org/T92484) [09:46:01] <_joe_> gi11es: I was looking at includes/media/SVG.php for when external resources are handled, but I can't find that no-external-files option [09:46:57] <_joe_> oh nevermind, it is in the frigging mediawiki-config :) [09:49:33] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1112459 (10Gilles) If the new rsvg version doesn't need a specific option turned on to avoid external resources, then it's just a matter of having a slightly different mediawi... [09:50:03] _joe_: yeah, just left a comment to that effect [09:51:21] <_joe_> gi11es: working on it, I'll ask you to review [09:56:06] _joe_: has someone reviewed the security aspect of the new rsvg version? i.e. 
actually tried it with an svg that had external resources, etc.? [09:56:46] <_joe_> gi11es: I didn't try that, but me, chris and aaron looked into it at the time [09:57:01] ok cool [09:57:13] <_joe_> gi11es: fundamentally, rsvg refuses to load any external resource that is not in the working directory [09:57:25] <_joe_> and we generate a tempdir when converting IIRC [09:57:38] <_joe_> if you have an example, I can do a test anyways [09:57:38] 6operations, 5Patch-For-Review, 3wikis-in-codfw: setup & deploy rdb2001-2004 - https://phabricator.wikimedia.org/T92011#1112465 (10faidon) rdb2002 was misconfigured in the BIOS. Per [[ https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_PowerEdge_RN10#Initial_System_Setup | our platform... [10:06:20] I'll try to make one that phones home to me, this way we can make sure that it's not happening [10:07:53] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [10:11:56] (03PS1) 10Giuseppe Lavagetto: imagescaling: remove legacy option for HHVM imagescalers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196186 (https://phabricator.wikimedia.org/T84842) [10:12:27] <_joe_> gi11es: ^^ [10:17:11] _joe_: can you access my wiki here: http://montpellier.dubuc.fr:8080 ? [10:17:17] just to check that my firewall is open [10:17:34] <_joe_> gi11es: nope [10:17:57] meh, gotta dig into my router for a sec [10:22:13] (03CR) 10Liuxinyu970226: [C: 031] "OK thx" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/193827 (https://phabricator.wikimedia.org/T91223) (owner: 10Gerrit Patch Uploader) [10:26:02] (03CR) 10Gilles: [C: 031] imagescaling: remove legacy option for HHVM imagescalers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196186 (https://phabricator.wikimedia.org/T84842) (owner: 10Giuseppe Lavagetto) [10:28:05] (03CR) 10Giuseppe Lavagetto: [C: 032] "Let's see if this solves the HHVM imagescaler problems." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/196186 (https://phabricator.wikimedia.org/T84842) (owner: 10Giuseppe Lavagetto) [10:28:42] (03Merged) 10jenkins-bot: imagescaling: remove legacy option for HHVM imagescalers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196186 (https://phabricator.wikimedia.org/T84842) (owner: 10Giuseppe Lavagetto) [10:29:42] PROBLEM - puppet last run on sodium is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [10:32:42] !log oblivian Synchronized wmf-config/CommonSettings.php: Fix for svg conversion on HHVM (duration: 00m 05s) [10:32:49] Logged the message, Master [10:33:07] <_joe_> !log repooling mw1152 [10:33:12] Logged the message, Master [10:34:11] (03PS5) 10Mobrovac: Add restbase job runners [puppet] - 10https://gerrit.wikimedia.org/r/195364 (owner: 10GWicke) [10:36:01] <_joe_> mobrovac: oh btw, I have like a ton of questions on that patch [10:36:11] <_joe_> are you around in ~ 1 hour? [10:36:27] yep, ought to be [10:36:42] <_joe_> I'd really like to merge it during my working hours [10:36:57] lunch time? [10:36:58] :) [10:37:12] <_joe_> no, HHVM imagescalers time [10:37:25] ah, lunch time is much more fun [10:37:29] ok, just ping me [10:37:32] np [10:40:06] Setting up zuul (2.0.0-304-g685ca22-wmf1) ... [10:40:06] Adding system user for Zuul [10:40:06] * Zuul Merger: /etc/default/zuul-merger is not set to START_DAEMON=1: exiting [10:40:08] !!!! [10:42:17] eh looks like you've got a default or env var to deal with [10:42:31] yeah I have added that function [10:42:41] I am busy building a debian package for Zuul :ÿ [10:44:06] yey [10:45:25] btw hashar, the services team is really really interested in your automatic pkg building stuff [10:45:52] mobrovac: should be straightforward with git-buildpackage and jenkins debian glue shell scripts. 
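The "phones home" SVG gi11es proposes earlier for probing the rsvg sandbox can be as small as the sketch below: an SVG whose only content is an external image reference, so a properly sandboxed rasterizer should render it without ever opening a network connection. The URL uses the reserved `.invalid` TLD as a placeholder, and `rsvg-convert` in the trailing comment is an assumption about the tool under test:

```shell
# Write a minimal SVG that references an external image; a sandboxed
# renderer must not fetch the URL (the hostname is a non-resolvable
# placeholder, not a real beacon endpoint).
svg=$(mktemp)
cat > "$svg" <<'EOF'
<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink"
     width="10" height="10">
  <image xlink:href="http://beacon.example.invalid/hit.png"
         width="10" height="10"/>
</svg>
EOF
echo "wrote $svg"
# To exercise the scaler sandbox (assuming rsvg-convert is installed):
#   rsvg-convert "$svg" > /dev/null   # should complete with no network traffic
```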
[10:46:09] mobrovac: we can pair after lunch if you want [10:47:49] hm, that's way too early [10:47:58] need to set up stuff on my end for that first [10:47:59] an example is https://integration.wikimedia.org/ci/job/operations-debs-pybal-debian-glue/ [10:48:09] hashar: maybe sometime next week? [10:48:13] sure! [10:48:17] cool thnx i'll check that out [10:48:21] gr8! [10:48:37] which is only 3 lines of code to add in integration/config.git :-) [10:49:19] where can i find the src of the linked prj? [10:50:56] mobrovac: https://git.wikimedia.org/blob/integration%2Fconfig/19f889a67def3f120559c5d1430e73aa4eee81e6/jjb%2Foperations-debs.yaml has all the cruft [10:51:13] mobrovac: browse to the bottom for the 5 lines that adds a job for a project [10:51:14] good [10:51:15] merci [10:51:49] you can even generate the jobs yourself by following the tutorial at http://www.mediawiki.org/wiki/CI/JJB [10:52:33] <_joe_> !log depooled mw1152 again [10:52:38] Logged the message, Master [10:53:56] hashar: wow that's really neat [10:53:58] :) [10:59:13] (03PS1) 10Hashar: zuul: init scripts now have START_DAEMON [puppet] - 10https://gerrit.wikimedia.org/r/196198 [11:00:00] mobrovac: and there are a few more tutorials under http://www.mediawiki.org/wiki/Continuous_integration/Tutorials [11:00:08] though none related to deb packages [11:01:06] reading, learning, yey [11:01:09] * mobrovac happy [11:01:15] cheers hashar [11:02:13] (03CR) 10Hashar: "That is merely a copy paste of the work I have done on the Debian package." [puppet] - 10https://gerrit.wikimedia.org/r/196198 (owner: 10Hashar) [11:24:49] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1112607 (10Joe) I re-pooled the HHVM imagescaler and got confirmation that SVG files rendered correctly. However, I was still seeing a significant amount of 503s, but nothing... [11:29:14] <_joe_> mobrovac: around? 
[11:29:35] and about [11:30:28] <_joe_> so, do we really need 4 jobrunners for restbase and 4 for parsoid? [11:31:17] <_joe_> and, how many mediawiki API calls can I expect to be caused by the restbase jobrunners? [11:31:18] you'd squeeze them all? [11:31:52] mw api - one per job run [11:31:57] <_joe_> mobrovac: I have no idea what the restbase jobrunners will do [11:32:07] <_joe_> what rate of jobs do you expect [11:32:07] let me explain [11:32:16] <_joe_> yes please :) [11:32:23] they are exactly the same as the parsoid ones [11:32:42] we have hooks to notify restbase on rev changes [11:32:47] <_joe_> so we have 1 restbase job per parsoid job? [11:33:15] <_joe_> because that is potentially an issue, I've seen parsoid jobs tear down the mediawiki api in the past [11:33:28] ah [11:33:31] hm hm [11:33:35] <_joe_> heh [11:33:36] <_joe_> :) [11:33:53] <_joe_> what api calls is restbase going to do? [11:33:53] springle: around? [11:34:07] _joe_: https://www.mediawiki.org/wiki/Extension:RestBaseUpdateJobs [11:34:13] revprop [11:34:20] <_joe_> mobrovac: let's do one thing: I'll ask for further clarification in the patchset [11:34:33] springle: https://gerrit.wikimedia.org/r/#/c/195853 done in Production, right? [11:34:39] (schema change) [11:34:48] <_joe_> mobrovac: do you happen to know which api calls the parsoid jobs do? [11:35:02] _joe_: are you suggesting we merge the parsoid and restbase to the current 4 parsoid job runners? 
[11:35:15] <_joe_> no [11:35:26] _joe_: euh, not really, i can check the src and get back to you on that one [11:35:38] <_joe_> I just want to understand the relative weight of the jobs [11:35:54] ok, so [11:36:07] <_joe_> that's because I really want to merge your patchset but ATM I'm not confident with its consequences [11:36:28] altogether you'll have one rev api call for rb for each revchange [11:36:46] hm no, let me rephrase that [11:37:27] one revprop api call per job, knowing that one revchange may spring multiple jobs (if it's a revchange for a template, e.g.) [11:38:02] <_joe_> yeah I think that situation (template change) is what caused issues with parsoid as well in the past [11:38:52] also, i should note that this way of getting updates from the core should be considered as a fix, not a solution [11:39:10] <_joe_> what do you mean? [11:39:19] until we get around to working on https://phabricator.wikimedia.org/T84923 [11:40:15] <_joe_> why is rcstream not a good candidate for that? [11:40:20] <_joe_> but I digress :) [11:40:22] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.0133333333333 [11:40:25] so jobrunners in this context would become useless [11:40:49] i yet have to play with it to assess its potential [11:41:09] oh you mean, using rcstream now instead of the jobrunner? [11:41:14] <_joe_> but I'll read that ticket thorougly [11:41:15] <_joe_> no no [11:41:38] <_joe_> as a changes stream provider. but I don't think honestly you'd eliminate jobrunners [11:41:49] <_joe_> you want to be able to schedule async jobs [11:42:04] <_joe_> that are easily rate-limited globally [11:42:12] <_joe_> I think we could create a better jobrunner [11:42:21] right, ofc, i meant jobrunners would become obsolete for getting events out of the core [11:42:27] <_joe_> that e.g. 
can rate-limit requests to a single service [11:43:02] <_joe_> or throttle those based on the rate-limiting imposed by said service [11:43:10] +1 [11:43:26] <_joe_> anyways, I have some doubts about the consequences of merging this [11:44:28] if it's of any consolation, currently the extension is enabled only on test.wp [11:44:48] but yeah, eventually it's going to be enabled everywhere [11:44:51] <_joe_> yeah I know [11:45:04] so, with the parsoid jobs, the problem was too many calls to the api? [11:45:09] or the jobrunners themselves? [11:45:17] <_joe_> too many api calls [11:45:22] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [11:45:25] <_joe_> from the jobrunners of course [11:45:32] ah hm [11:45:43] <_joe_> I'm sure gabriel has more details [11:49:40] (03CR) 10Giuseppe Lavagetto: "While the patch is formally correct, I have some doubts regarding possible consequences, if not now, in the near future when restbase will" [puppet] - 10https://gerrit.wikimedia.org/r/195364 (owner: 10GWicke) [11:53:42] _joe_: ah, ok, now I see why the parsoid jobs might have had problems with overloading the api [11:54:05] this is especially problematic for templates transcluded in many pages [11:54:32] kart_: correct [11:54:36] <_joe_> yes, tipically one of the enwiki or dewiki main templates changes => hours of fear and loathing in apiland [11:54:59] because for each of them, parsoid fetches the src of both the template and the page as it needs both [11:55:06] these are pretty heavy requests [11:55:35] <_joe_> what will happen with restbase in that scenario? 
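[editor's note: a jobrunner that "can rate-limit requests to a single service", as floated above, is classically a token bucket; a minimal sketch with illustrative parameters — this is not how the MediaWiki jobrunner actually throttles]

```python
# Minimal token-bucket sketch of the per-service rate limiting a "better
# jobrunner" could apply. All parameters are illustrative.
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        # refill proportionally to elapsed time, then try to spend one token
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# e.g. cap jobs hitting one backend service at ~10/s with bursts of 5
bucket = TokenBucket(rate=10, capacity=5)
allowed = sum(bucket.allow() for _ in range(100))
print(f"{allowed} of 100 back-to-back jobs let through")
```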
[11:55:56] <_joe_> I admit my ignorance of restbase internals [11:56:28] so, in that case, RB will be much more light-weight [11:56:54] as it will emit exactly one revprop per transcluded page + one for the template itself [11:57:06] the payload of which is not big, as seen in https://phabricator.wikimedia.org/P389 [11:57:26] <_joe_> ok my question is [11:57:34] <_joe_> think a template is for 10K pages [11:57:53] <_joe_> now, before you had the hammering parsoid requests [11:58:12] <_joe_> and now you still have all those parsoid requests, plus the restbase ones [11:58:24] yes [11:58:26] oh wait [11:58:27] damn [11:58:35] no no this is noooot good [11:58:38] so [11:59:07] apart from the revprop stuff, in order to be up-to-date, rb needs the content as well, which it gets from parsoid [11:59:17] <_joe_> aha. [11:59:27] <_joe_> that was my next question [11:59:28] <_joe_> :) [11:59:52] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1112668 (10Gilles) Special:UploadStash has a known 503 issue happening at the moment: T90599 [12:00:17] and the no-bueno stanza here is that both requests from the jobrunners use no-cache, which means the api would get hammers twice as hard, once from parsoid directly, and once more when rb requests the new version (as it fwd the no-cache to parsoid) [12:00:53] so for the same thing, parsoid would call the api twice [12:00:57] ay [12:01:06] <_joe_> mh so [12:01:41] <_joe_> maybe a solution would be to just use the restbase job and make it build the parsoid cache? 
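[editor's note: the doubling being worked out above can be made concrete with back-of-the-envelope arithmetic; the function and numbers below are illustrative, not measured — the real queues batch and deduplicate in ways this ignores]

```python
# Rough model of the job fan-out discussed above: a change to a template
# transcluded in N pages springs one job per transcluded page plus one for
# the template itself, each job makes one MediaWiki API call, and running
# both the Parsoid and RESTBase job queues doubles the total.
def api_calls_for_template_change(transclusions, queues=2):
    jobs_per_queue = transclusions + 1  # one per page + the template itself
    return jobs_per_queue * queues

# the "template used in 10K pages" scenario from the discussion
print(api_calls_for_template_change(10_000))           # 20002 with both queues
print(api_calls_for_template_change(10_000, queues=1)) # 10001 with one queue
```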
[12:02:01] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [12:02:03] maybe we could simply remove the parsoid jobs and use only the RB ones [12:02:06] <_joe_> maybe this doesn't even make sense, I'm just working with what I got now [12:02:10] <_joe_> mobrovac: exactly [12:02:15] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1112671 (10Gilles) What happens with those requests is that the web server queries the image scalers internally to have them scale an image which isn't public yet. What I don... [12:02:45] that would be a solution once updates to RB are coming from all domains, not just test.wp [12:02:53] s/would/could/ [12:03:30] <_joe_> mobrovac: so my proposal is - we move forward the patch, but enable just 1 restbase jobrunner, but you think if the solution you just proposed is feasible [12:03:38] <_joe_> and if it's feasible on a per-wiki basis [12:05:48] from the correctness standpoint, it should be feasible given that both rb and parsoid get the exact same updated and that rb calls parsoid for of them [12:06:07] would need to ensure that with the parsoid guys though [12:06:21] <_joe_> mobrovac: yes of course [12:06:39] <_joe_> we should open a task on this [12:06:39] as for the per-wiki basis, probably with a lot of c/p that could be achieved [12:07:40] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1112689 (10Joe) ok, then it was probably a false positive. 
I'm a bit in the dark about what is timing out here then, will need to investigate further [12:08:10] <_joe_> mobrovac: I'd open a task, but I guess my formulation of the problem would be less clear than yours [12:08:38] <_joe_> (yes, this is an elegant way to try not to do work :)) [12:09:51] _joe_: sure, no pb, i'll type it up [12:09:58] haha [12:10:38] <_joe_> once that is up, I'll amend gabriel's patch and merge it citing the task [12:11:07] gr8, was about to say that we still should test the jobrunner conf [12:11:09] thnx [12:12:16] <_joe_> yes I see the value for you guys and I don't want to block you until it's harmful :) [12:32:17] _joe_: hm, actually, upon better studying, RB is not actually sending the no-cache header to parsoid [12:32:36] however, that does not fully alleviate the problem of double-calling the MW API [12:33:14] since RB might call Parsoid before the original parsoid jobrunner req [12:33:46] that is, if RB manages to execute its stuff prior to Parsoid for the same rev, MW API is going to be called twice for the exact same thing [12:33:48] <_joe_> it's very probable btw [12:34:03] <_joe_> it's a race condition at least [12:34:06] right [12:34:25] the point is we cannot rely on parsoid winning most of the time [12:34:35] <_joe_> so yeah either we make the two jobs depend on each other (and I don't think we can do that right now, or if it's advisable) [12:34:40] <_joe_> I'd guess it wont [12:34:54] <_joe_> as restbase jobs are going to be faster in general, right? [12:36:31] euh not really [12:36:44] we still rely on parsoid to generate us the new content [12:36:54] <_joe_> ok [12:37:19] <_joe_> also, parsoid will have no extrnal cache when called from restbase, right? 
[12:37:34] <_joe_> restbase doesn't call varnish in front of parsoid, I guess [12:40:47] 6operations, 10Parsoid, 10RESTBase, 6Services: Revision updates with Jobrunner for Parsoid and RESTBase - https://phabricator.wikimedia.org/T92490#1112711 (10mobrovac) 3NEW [12:41:11] _joe_: ^^ [12:41:18] let me check that [12:42:03] _joe_: it does, RB places calls to http://parsoid-lb.eqiad.wikimedia.org [12:43:48] 6operations, 10Parsoid, 10RESTBase, 6Services: Revision updates with Jobrunner for Parsoid and RESTBase - https://phabricator.wikimedia.org/T92490#1112721 (10mobrovac) p:5Triage>3High [12:44:26] <_joe_> mh [12:44:53] <_joe_> I thought restbase would call parsoid directly, it makes more sense but probably this is just temporary [12:45:32] not really, makes more sense to see if the varnish layer has the info RB needs [12:45:33] <_joe_> mobrovac: thanks a lot btw! [12:45:51] if it were to call parsoid directly, a MW API call would be made each time [12:45:57] _joe_: np, my pleasure [12:46:20] <_joe_> mobrovac: well, in the long run I'd expect restbase to be the main way to access parsoid, maybe the only one [12:46:36] <_joe_> at that point, I'd concentrate the cache at the restbase level [12:46:43] the rationale for this is that when something is requested and RB doesn't have it in storage, it has no way of knowing whether it's new or old content [12:46:55] _joe_: correct [12:47:12] <_joe_> mobrovac: good! 
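[editor's note: the duplicate-fetch race discussed above — the RESTBase and Parsoid jobs both hitting the MW API for the exact same revision — can be sketched as below; the dedup approach and all names are illustrative, since neither jobrunner coordinates like this today, and real dedup would need shared state (e.g. in the job queue), not an in-process set]

```python
# The Parsoid and RESTBase job queues run independently, so both can ask
# the MediaWiki API for the same (page, revision). A shared "already
# handled" set collapses the pair into one call.
api_calls = 0
seen = set()

def fetch_revision(page, rev):
    """Stand-in for the revprop/content call to the MediaWiki API."""
    global api_calls
    api_calls += 1
    return f"content of {page}@{rev}"

def run_job(page, rev):
    key = (page, rev)
    if key in seen:       # the slower queue arrives second: skip the call
        return
    seen.add(key)
    fetch_revision(page, rev)

# the same revision change lands in both queues
for queue in ("parsoid", "restbase"):
    run_job("Template:Cite", 12345)

print(api_calls)  # 1 with dedup; it would be 2 without the `seen` check
```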
[12:47:20] <_joe_> I'll amend gabriel's patch and merge it [12:48:27] when that happens, we simply change https://github.com/wikimedia/operations-puppet/blob/production/modules/restbase/templates/config.yaml.erb#L118 and wait for the next puppet run :) [12:48:36] _joe_: cool, thnx [12:55:16] (03PS6) 10Giuseppe Lavagetto: Add restbase job runners [puppet] - 10https://gerrit.wikimedia.org/r/195364 (owner: 10GWicke) [12:58:05] (03CR) 10Giuseppe Lavagetto: [C: 032] "As stated in the discussion on IRC, I am merging this to allow the services team to test all the moving parts, but I don't think this shou" [puppet] - 10https://gerrit.wikimedia.org/r/195364 (owner: 10GWicke) [12:59:06] <_joe_> mobrovac: merged and puppet running [12:59:45] beautiful [12:59:46] grzie [12:59:58] 6operations, 6Commons, 10Wikimedia-General-or-Unknown: Upload varnish cp1063 not responding to purges - https://phabricator.wikimedia.org/T54864#1112755 (10mark) 5declined>3Resolved a:3mark [13:04:28] 6operations, 10ops-esams: decom amslvs1-4 (dc work) - https://phabricator.wikimedia.org/T87790#1112762 (10mark) They were part of the same batch as all misc servers in esams, and I think we won't recycle them yet (later, in one go). We may or may not reuse them for something else until that time. [13:04:38] <_joe_> mobrovac: jobs should be running everywhere [13:04:47] cool [13:04:48] :) [13:05:00] <_joe_> whoa I'm not used to see people from SF before I have lunch [13:05:13] question: how can i obtain / get the revdelete right on testwiki? [13:05:18] <^d> _joe_: Most SF people wake up late :p [13:05:24] * ^d is an early riser [13:05:27] hello ^d [13:05:34] <_joe_> ^d: yeah but now you're one hour early because DST [13:05:36] <^d> YuviPanda: morning bro [13:05:41] <_joe_> mobrovac: no idea [13:05:46] ^d: bro bro bro [13:05:46] :D [13:06:01] ^d: deployment-prep is almost close to being done, and I have removed one more dependency on NFS... 
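[editor's note: the one-line change mobrovac points at above (config.yaml.erb#L118) is roughly the sketch below; the direct backend URL shown is a placeholder, not a hostname confirmed anywhere in this discussion]

```yaml
# modules/restbase/templates/config.yaml.erb (sketch)
# current: Parsoid is reached through the two-layer Varnish cache
parsoidHost: http://parsoid-lb.eqiad.wikimedia.org
# direct alternative discussed above (hostname/port hypothetical):
# parsoidHost: http://parsoid.svc.eqiad.wmnet:8000
```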
[13:06:04] err [13:06:09] staging-tin, not deployment-prep [13:06:18] <^d> \o/ [13:06:20] ^d: I just have to fix failing scap and we’ll all be good :D [13:06:27] <^d> death to nfs! [13:06:43] ^d: :D [13:07:06] <_joe_> staging-tin? [13:07:07] ^d: with this I think deployment-prep is NFS free except for upload [13:07:19] _joe_: this is the ‘recreate beta from scratch, but properly done’ project [13:07:21] staging [13:07:28] <_joe_> staging-deployment would've been better [13:07:46] the work I’m doing would probably help a lot with putting up deployment hosts in the new DC if we want to do that [13:07:53] deployment-prep, betalabs, staging-deployment... [13:08:02] <_joe_> YuviPanda: yeah I've seen, thanks for that [13:08:06] <^d> _joe_: One of our goals is "do it just like prod, even if prod is silly" [13:08:09] <^d> Hence tin :) [13:09:03] _joe_: still battling some scap issues, however. [13:09:09] need to figure out how to make it use the proxy properly [13:09:10] well [13:09:13] the ssh-agent properly [13:22:21] RECOVERY - puppet last run on sodium is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [13:31:22] (03PS24) 10Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - 10https://gerrit.wikimedia.org/r/195340 [13:32:39] kart_: is there a cxserver deployment today? [13:38:49] _joe_: btw, the lucid hosts look ok to me [13:38:54] (puppet runs successfully, etc [13:38:55] ) [13:39:39] ^d: [13:39:56] (since your name is on the cxserver deploy thing as well) [13:40:58] <^d> heck if I know! 
:D [13:41:07] <^d> I've just been around in case kart_ needs help [13:41:16] <^d> no clue if a deploy happens or not [13:45:18] ^d: if it doesn’t happen there’s like a 6-7h window in which that patch can be merged [13:50:11] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [13:55:11] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [14:00:11] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [14:01:02] anyone up to review a Zuul change please? I need a new variable START_DAEMON in the zuul init scripts https://gerrit.wikimedia.org/r/#/c/196198/ :( [14:01:04] :) [14:02:12] 6operations, 10Parsoid, 10RESTBase, 6Services: Revision updates with Jobrunner for Parsoid and RESTBase - https://phabricator.wikimedia.org/T92490#1112974 (10mobrovac) [14:05:12] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [14:10:11] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [14:12:51] (03CR) 10Ottomata: Adding a Last-Access cookie to text and mobile requests (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T92435) (owner: 10Nuria) [14:15:11] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [14:16:01] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [14:19:17] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) 
- https://phabricator.wikimedia.org/T91347#1113013 (10mforns) a:5mforns>3Ottomata [14:20:11] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [14:20:20] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: Eventlogging JS client should warn users when serialized event is more than "N" chars long and not sent the event - https://phabricator.wikimedia.org/T91918#1113018 (10mforns) a:5Nuria>3mforns [14:23:15] (03PS1) 10Giuseppe Lavagetto: Revert "redis: use ubuntu, correct partman scheme" [puppet] - 10https://gerrit.wikimedia.org/r/196220 [14:24:43] 7Blocked-on-Operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1113028 (10Ottomata) Hm, Yuri, do you need this data outside of Hadoop/Hive? Why don't we just add this to the refined webrequest Hive table as part of the... [14:25:12] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [14:25:41] (03PS2) 10Giuseppe Lavagetto: redis: use debian in codfw [puppet] - 10https://gerrit.wikimedia.org/r/196220 [14:27:01] 7Blocked-on-Operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1113029 (10Nuria) @Yurik: Both andrew and myself think this is very possible to do in the cluster (just like geo coding is being done now) so we can try t... 
[14:27:21] _joe_, mobrovac, re Parsoid & caching: the v2 API that's called by RB is intentionally uncached, so the base load on Parsoid *will* double [14:28:13] however, RB also eliminates most of the extra load on Parsoid from dumps [14:28:21] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:28:38] (03PS3) 10Nuria: Adding a Last-Access cookie to text and mobile requests [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T92435) [14:29:07] (03PS3) 10Giuseppe Lavagetto: redis: use debian in codfw [puppet] - 10https://gerrit.wikimedia.org/r/196220 [14:29:42] _joe_, mobrovac: if the base load turns out to be too high as we gradually enable updates for more wikis we can still dial down the template update rate (which is >90% of the load) [14:29:56] gwicke: the concern is not parsoid per se, but the mw api calls made by it [14:30:11] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [14:30:57] mobrovac: we know that the dbs and app servers are fine even with the Parsoid cluster close to 100% (which didn't use to be the case) [14:31:02] <_joe_> gwicke: yes, what we figured out earlier is that you basically call parsoid twice, and that going on we can maybe just drop the parsoid jobs and use restbase jobs to update parsoid as well [14:31:37] that won't work [14:31:43] <_joe_> why? [14:31:51] v1 / v2 mismatch [14:31:52] we are also switching API versions on the Parsoid side [14:32:09] RB requests a bundle of data-parsoid and html [14:32:28] while v1 returns that all mixed within the (bloated) HTML document [14:32:44] <_joe_> so say that the restbase job spawns before the parsoid one ran [14:33:03] they are two different jobs [14:33:08] <_joe_> you can either use the cached content (and get an outdated one), or run the job twice, right? 
[14:33:09] doesn't matter, different URI, different content, varnish won't cache that [14:33:35] _joe_: correct [14:33:38] <_joe_> ok so we basically doubled the api calls we do for each change of revision? [14:33:44] we intentionally don't cache v2 as that would only dilute the cache for v1 responses [14:33:58] <_joe_> gwicke: I think it's a sensible choice [14:34:04] yes [14:34:27] <_joe_> but still, what I just stated kind of worries me on the long run. [14:34:31] _joe_: yes, for the transition period we'll have a higher base load on parsoid and the API [14:34:51] <_joe_> gwicke: how long do you expect the transition to be? [14:34:53] but we know that even 100% is okay, so I'm not too worried about the API part [14:35:11] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [14:35:20] and we have several parameters that we can tune if that turned out to be a problem [14:35:26] <_joe_> you may be not, I am :) [14:35:36] <_joe_> and this is good [14:36:18] we have been hitting Parsoid pretty hard, and there were not issues [14:36:21] app cluster load is low [14:36:36] <_joe_> api [14:36:42] *nod* [14:36:56] transition period should be a month perhaps [14:37:04] <_joe_> gwicke: oh ok! [14:37:12] <_joe_> that's good. [14:37:13] we are testing the VE integration on test.wikipedia.org currently [14:37:31] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: Eventlogging JS client should warn users when serialized event is more than "N" chars long and not sent the event - https://phabricator.wikimedia.org/T91918#1113048 (10Nuria) [14:37:33] <_joe_> so in ~ 1/2 months we can turn the parsoid jobs off, basically? 
[14:37:33] it's working like a charm, without ever having been tested in combination before [14:37:47] good API testing in CI ftw ;) [14:38:12] <_joe_> if that's the case, I'm pretty confident we can work with it [14:38:12] _joe_: once VE is migrated over [14:38:26] <_joe_> I thought it would take at least a quarter :) [14:38:37] and yeah, target for that is end of this month [14:39:43] so are we saying here that in 15 days varnish won't serve parsoid's content any more? [14:40:12] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [14:40:13] once VE is no longer using it we can at least disable the update job [14:40:29] Flow isn't using the cached content anyway [14:40:29] right [14:40:35] perfect [14:41:00] now, does that mean we can get rid of varnish / re-purpose it? [14:41:03] :P [14:41:11] and dump consumers are more than happy to use a faster API [14:41:32] those varnishes are currently used for several tasks [14:41:50] for RB a double-layer Varnish setup doesn't make that much sense [14:42:00] on average, it adds about 10ms latency [14:42:01] <_joe_> yes [14:42:10] that's why i'm asking actually [14:42:22] because i'd like us not to proxy service requests through it [14:42:32] currently done for rb, citoid, zotero [14:42:40] <_joe_> gwicke: wouldn't it be better to point restbase directly to parsoid then? [14:42:41] and i'd like us to stop that "trend" [14:42:47] a single cache layer would be nice to have though [14:43:27] _joe_: you mean to the lb? 
[14:43:39] if we don't do that yes, then yes we should [14:43:43] *yet [14:44:13] parsoidHost: http://parsoid-lb.eqiad.wikimedia.org [14:44:23] so yeah, lets change that [14:44:32] <_joe_> that is varnish I think [14:44:36] yup [14:44:37] <_joe_> nod [14:44:41] yes, let's procure a new varnish layer for those services [14:44:44] single layer [14:45:11] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [14:45:37] I was wondering what the state of nginx caching is like these days [14:45:44] <_joe_> gwicke: don't [14:45:49] <_joe_> :) [14:45:58] wonder all you want, that doesn't seem useful at this point [14:46:03] i know there have been some improvements there [14:46:09] <_joe_> gwicke: I have a caching layer with nginx + memcache + some lua dark magic [14:46:39] <_joe_> in another project, and well comparing that to varnish is unfair [14:46:50] I agree that single-layer Varnish is more straightforward and the logical next step [14:47:15] just wondered about the potential to save one hop if all we want to do is in-memory caching [14:48:06] (03PS1) 10Andrew Bogott: Refresh nfs exports much more often. [puppet] - 10https://gerrit.wikimedia.org/r/196225 [14:49:23] imagine the potential to save hops if we had a monolithic application ;) [14:49:48] he ;) [14:49:59] (03CR) 10coren: [C: 031] "Still reasonable." [puppet] - 10https://gerrit.wikimedia.org/r/196225 (owner: 10Andrew Bogott) [14:50:12] RECOVERY - check_mysql on db1008 is OK: Uptime: 48 Threads: 1 Questions: 26670 Slow queries: 0 Opens: 59 Flush tables: 2 Open tables: 64 Queries per second avg: 555.625 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [14:50:34] (03CR) 10Andrew Bogott: [C: 032] Refresh nfs exports much more often. 
[puppet] - 10https://gerrit.wikimedia.org/r/196225 (owner: 10Andrew Bogott) [14:51:43] mlitn: Ping for SWAT in about 8.5 minutes [14:51:45] (03CR) 10Nuria: Adding a Last-Access cookie to text and mobile requests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T92435) (owner: 10Nuria) [14:51:51] anomie: I’m around! [14:51:57] marktraceur, ^d, thcipriani: Who wants to SWAT today? [14:53:17] I hesitate to say I "want" to... [14:53:23] But I would do it if needed. [14:53:58] I could do it [14:54:13] Given that I've been pawning off evening SWATs on other people recently [14:57:02] So it looks like special guest RoanKattouw will SWAT today (: [14:57:12] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] redis: use debian in codfw [puppet] - 10https://gerrit.wikimedia.org/r/196220 (owner: 10Giuseppe Lavagetto) [14:57:21] :P [14:57:56] I still have puppet disabled on carbon [14:58:27] Krenair: I keep forgetting to ping you too on Thursdays, because you're not listed in the column on the Deployments page. [14:58:35] <_joe_> paravoid: I moved back to debian [14:58:45] <_joe_> so it shouldn't interfere with you right? [14:58:48] I don't really know how that list is made [14:59:02] <_joe_> I just switched back to here seeing "ask faidon" [14:59:04] <_joe_> :) [14:59:07] heh :) [14:59:10] I think greg-g makes it [14:59:11] I think greg-g copy-pastes it from a template or something. [14:59:17] <^d> anomie: So you can be the voiceover announcer for RoanKattouw's guest appearance :p [14:59:18] I'm deep into d-i trying to find the race [14:59:24] I have a modified initrd [14:59:27] <_joe_> good luck with that [14:59:28] <_joe_> :) [14:59:34] I've made good progress actually [14:59:42] <_joe_> if you manage to pull that off, we'd all be VERY VERY grateful [15:00:01] <_joe_> and some millions of sysadmins around the globe too [15:00:05] manybubbles, anomie, ^d, thcipriani, marktraceur, mlitn: Dear anthropoid, the time has come. 
Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150312T1500). [15:00:12] I'm pretty confident I know where the race is and I have a few candidates to see how it's triggered [15:00:18] I just eliminated one [15:00:32] <_joe_> paravoid: so, ok to re-enable? [15:00:38] yes for now [15:00:43] I'll redisable if need be [15:00:44] <_joe_> I want to install at least 2 servers today [15:00:48] <_joe_> ok [15:00:49] 6operations, 6Commons, 10MediaWiki-Uploading, 6Multimedia, 5Patch-For-Review: http-bad-status - 403 Forbidden when trying to upload file from tools.wikimedia.pl by URL to Commons - https://phabricator.wikimedia.org/T75200#1113120 (10Steinsplitter) [15:00:57] can you leave rdb2002 to me a little while longer? [15:01:02] <_joe_> yes [15:01:05] thank you [15:01:12] <_joe_> I'll install rdb2001 and 2003 [15:01:18] <_joe_> who are master-slave [15:03:07] RoanKattouw, are you doing it? [15:03:44] kart_: ping? [15:03:47] (03PS1) 10Mobrovac: Hit Parsoid directly [puppet] - 10https://gerrit.wikimedia.org/r/196229 [15:07:00] Krenair: Yes [15:07:04] ok [15:07:12] Still waiting for Jeknins [15:07:40] Oh sorry Jenkins merged my thing 8 minutes ago :D [15:07:59] I was going to say :p [15:08:25] 7Blocked-on-Operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1113146 (10Yurik) @Nuria, @Ottomata, @faidon First, for the reasoning: banners, compression, analytics. # We need to show Zero banners on non-mobile Wiki... 
[15:08:40] But grrrit-wm somehow didn't ping me about it [15:09:52] * YuviPanda has a feeling grrrit-wm doesn’t like jenkins as much anymore [15:10:19] YuviPanda: Maybe it's a repo specific setting though [15:10:31] don’t think we ever added one [15:10:35] Because I do see a merge notification for UploadWizard [15:10:39] Hmm OK [15:10:49] I mean I know there's some filtering of jenkins-bot stuff going on [15:10:56] Like, V:2 isn't reported but V:-1 is [15:11:51] I suspect breakage, actually [15:12:07] grrrit-wm went from relying on redis to report gerrit events to handling the ssh stream itself. [15:12:15] maybe it doesn’t handle everything peachy [15:12:47] isn't gerrit event streaming a privileged thing? [15:12:53] yup [15:12:58] the bot has the privilage [15:13:17] but runs in tools with random people having access? [15:13:30] (03PS1) 10Andrew Bogott: Sleep in firstboot to avoid that NFS race condition. [puppet] - 10https://gerrit.wikimedia.org/r/196233 [15:14:39] (03CR) 10Andrew Bogott: "This isn't urgent enough to merit a rebuild of images but we might as well have it for the next time we need to rebuild." [puppet] - 10https://gerrit.wikimedia.org/r/196233 (owner: 10Andrew Bogott) [15:15:12] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 110, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/3: down - Core: pfw2-codfw:xe-6/0/0 {#10901} [10Gbps DF]BR [15:15:16] (03CR) 10GWicke: Hit Parsoid directly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/196229 (owner: 10Mobrovac) [15:16:34] 7Blocked-on-Operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1113155 (10Ottomata) I can't comment on the first 2 points, as those make this seem like it actually has to be done in varnish, if you need to respond to a r... [15:20:00] Krenair: yes [15:20:10] Krenair: it’s… not much of a privilage. no private info comes through. 
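[editor's note: the jenkins-bot filtering Krenair describes above ("V:2 isn't reported but V:-1 is") amounts to a small predicate over Gerrit stream-events; the sketch below uses a simplified event shape, and the field names are illustrative rather than grrrit-wm's actual code]

```python
# Suppress routine positive Verified votes from CI, relay failures.
def should_report(event):
    if event.get("type") != "comment-added":
        return True
    author = event.get("author", {}).get("username")
    approvals = {a["type"]: int(a["value"])
                 for a in event.get("approvals", [])}
    if author == "jenkins-bot" and approvals.get("Verified", 0) > 0:
        return False  # routine V:+2 from CI: stay quiet
    return True       # V:-1 and everything else gets relayed

ok = {"type": "comment-added", "author": {"username": "jenkins-bot"},
      "approvals": [{"type": "Verified", "value": "2"}]}
fail = {"type": "comment-added", "author": {"username": "jenkins-bot"},
        "approvals": [{"type": "Verified", "value": "-1"}]}
print(should_report(ok), should_report(fail))  # False True
```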
[15:20:14] Why is it restricted then? :/ [15:20:27] Krenair: I asked the saaaaaame question. [15:20:28] no idea [15:20:31] upstreaddidit [15:21:01] !log catrope Synchronized php-1.25wmf21/extensions/Flow/: SWAT (duration: 00m 07s) [15:21:11] Logged the message, Master [15:21:12] mlitn: Please verify -----^^ [15:21:15] (03PS1) 10Ottomata: Update gbp to work with tags. Add debian/README with build instructions [debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/196235 [15:22:01] 7Blocked-on-Operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1113164 (10faidon) Regarding (1) and "proper on-wiki notification" — is this task about *Varying* the cache per X-CS, not just tagging it for Analytics purpo... [15:23:38] (03PS1) 10Ottomata: Update gbp to work with tags. Add debian/README with build instructions [debs/python-kafka] - 10https://gerrit.wikimedia.org/r/196237 [15:24:08] (03Abandoned) 10Ottomata: Update gbp to work with tags. Add debian/README with build instructions [debs/python-kafka] - 10https://gerrit.wikimedia.org/r/196237 (owner: 10Ottomata) [15:24:11] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 114, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/3: down - Core: pfw1-codfw:xe-6/0/0 {#10900} [10Gbps DF]BR [15:24:17] RoanKattouw: checking [15:24:59] (03PS2) 10Mobrovac: Hit Parsoid directly [puppet] - 10https://gerrit.wikimedia.org/r/196229 [15:25:03] (03PS2) 10Ottomata: Update gbp to work with tags. Add debian/README with build instructions [debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/196235 [15:26:29] (03PS3) 10Ottomata: Update gbp to work with tags. Add debian/README with build instructions [debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/196235 [15:26:35] papaul: are you working on the pfws? [15:26:46] (03CR) 10Ottomata: [C: 032 V: 032] Update gbp to work with tags. 
Add debian/README with build instructions [debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/196235 (owner: 10Ottomata) [15:26:48] yes [15:26:49] RoanKattouw: works fine, thanks! [15:26:58] paravoid: yes [15:27:04] (03CR) 10GWicke: [C: 031] Hit Parsoid directly [puppet] - 10https://gerrit.wikimedia.org/r/196229 (owner: 10Mobrovac) [15:27:07] okay [15:27:17] 7Puppet, 6operations, 5Patch-For-Review: Resource attributes are quoted inconsistently - https://phabricator.wikimedia.org/T91908#1113186 (10akosiaris) >>! In T91908#1112282, @Joe wrote: > So can we agree on using > > ensure => present > > and > > ensure => directory > > ? +1 > > As for the "more co... [15:27:34] papaul: ok, thanks -- let me know when you're done so I can make sure everything's okay :) [15:27:47] ok will [15:28:54] paravoid: done, i need a test please. can you test and see if you can get RX and TX now? [15:29:03] i moved it to scs-c1 [15:29:19] we've lost the network on both of them [15:29:34] 15mins + 5mins ago [15:32:01] paravoid: they are back up now [15:32:25] I don't see them [15:32:28] (03PS1) 10Gage: IPsec: deploy on amssq* [puppet] - 10https://gerrit.wikimedia.org/r/196240 [15:32:52] paravoid they are not on scs-c8 but on scs-c1 [15:33:00] no, we've lost the *network* [15:33:05] (03CR) 10coren: [C: 032] Tools: Properly puppetize crontab replacement [puppet] - 10https://gerrit.wikimedia.org/r/186627 (https://phabricator.wikimedia.org/T86445) (owner: 10Tim Landscheidt) [15:33:23] I haven't checked the serial console yet, will do so in a minute [15:33:29] ok [15:33:42] did you accidentally remove their network cables perhaps? [15:33:47] no [15:33:54] just the console [15:34:59] are you sure that's the serial console port and not the network management port? [15:35:21] (03CR) 10Mobrovac: "I may be ignorant, but why put a password on it at all?" 
[puppet/cassandra] - 10https://gerrit.wikimedia.org/r/196133 (https://phabricator.wikimedia.org/T92471) (owner: 10Eevans) [15:35:26] did you reboot them? [15:35:27] serial and ethernet are very different, perhaps it's caused the boxes to crash somehow... [15:35:27] sure [15:35:59] the ethernet ports are in the front and the serial is the only cable at the back [15:36:11] (03CR) 10Gage: [C: 032] IPsec: deploy on amssq* [puppet] - 10https://gerrit.wikimedia.org/r/196240 (owner: 10Gage) [15:36:40] jgage_: does that mean I can put amssq31 back in prod traffic flow too? [15:36:55] (03CR) 10Thcipriani: [C: 031] "All looks sane—all present and accounted for :)" [puppet] - 10https://gerrit.wikimedia.org/r/195340 (owner: 10Yuvipanda) [15:37:11] (03PS1) 10Ottomata: Bump debian/changelog to 0.9.3 [debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/196242 [15:37:29] (03PS25) 10Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - 10https://gerrit.wikimedia.org/r/195340 [15:37:33] (03CR) 10Ottomata: [C: 032 V: 032] Bump debian/changelog to 0.9.3 [debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/196242 (owner: 10Ottomata) [15:37:36] bblack gimme just a minute to confirm, then yes :) [15:38:02] papaul: what did you replace exactly? [15:38:03] none of them talk to cp1008 for prod traffic anyways, so I don't think any real traffic will flow over this yet till you include some other cp1xxx [15:38:08] yeah [15:38:28] paravoid: the serial console card [15:38:47] these two have booted up with an empty config, it's unclear to me why [15:39:05] perhaps it's not just serial console? [15:39:09] perhaps they also hold the config [15:39:12] (03PS26) 10Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - 10https://gerrit.wikimedia.org/r/195340 [15:39:14] perhaps replacing with the old cards brings it back? 
[15:39:23] do you want me to do that [15:39:31] put back the old card [15:39:33] and see [15:40:30] 6operations, 10Analytics, 6Scrum-of-Scrums, 10Wikipedia-App-Android-App, and 2 others: Avoid cache fragmenting URLs for Share a Fact shares - https://phabricator.wikimedia.org/T90606#1113222 (10Fjalapeno) @dr0ptp4kt thanks! [15:40:42] bblack: ok, amssq31 is ready to go back into service. thanks. [15:40:49] 6operations, 10Analytics, 6Scrum-of-Scrums, 10Wikipedia-App-Android-App, and 2 others: Avoid cache fragmenting URLs for Share a Fact shares - https://phabricator.wikimedia.org/T90606#1113225 (10Fjalapeno) a:3Fjalapeno [15:41:24] papaul: yes please [15:41:31] ok doing that on pfw2 [15:41:38] (03PS1) 10BBlack: repool amssq31 [puppet] - 10https://gerrit.wikimedia.org/r/196244 [15:41:48] (03CR) 10BBlack: [C: 032 V: 032] repool amssq31 [puppet] - 10https://gerrit.wikimedia.org/r/196244 (owner: 10BBlack) [15:42:22] (03CR) 10Tim Landscheidt: Sleep in firstboot to avoid that NFS race condition. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/196233 (owner: 10Andrew Bogott) [15:42:47] (03CR) 10Alexandros Kosiaris: [C: 04-1] Don't include a node in its own seeds (032 comments) [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/195483 (https://phabricator.wikimedia.org/T91617) (owner: 10GWicke) [15:43:37] (03CR) 10coren: [C: 032] "This is actually more paranoid than strictly required, and should do nicely." [puppet] - 10https://gerrit.wikimedia.org/r/190978 (https://phabricator.wikimedia.org/T87527) (owner: 10Tim Landscheidt) [15:43:46] 7Blocked-on-Operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1113241 (10Yurik) @Faidon, we have solved the cache variance a long time ago - see the [[ http://git.wikimedia.org/blob/operations%2Fpuppet.git/433adf01c101b... 
[15:44:15] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: Eventlogging JS client should warn users when serialized event is more than "N" chars long and not sent the event [8 pts] - https://phabricator.wikimedia.org/T91918#1113246 (10kevinator) [15:45:15] paravoid: the old card is back in pfw2 [15:45:43] I don't see it in the console server [15:45:48] bblack: can cp1008 go back into service as well? i'm ready for it if you are. [15:46:00] is it still scs-c1 port 34? [15:46:05] jgage: it was never in service, it's a test-only host that doesn't go into service [15:46:08] ok [15:46:22] jgage: is the ipsec optional at this point due to some config? [15:46:46] jgage: (I mean, I see ipsec.conf entries for the cp10xx, but clearly it's still able to talk them even though they're not yet configured) [15:47:23] hosts will use ipsec if it's configured on both ends, but will continue to use regular transport otherwise [15:47:36] there's no policy to enforce use of ipsec [15:48:06] does that make it defeatable? [15:48:21] papaul: ? [15:48:35] (can a MITM force ipsec on each side to think the other isn't configured for it?) [15:48:51] no, IKE protects us against that [15:49:08] how? 
[15:49:26] (03PS27) 10Yuvipanda: deployment: Combine labs/prod deployment server roles [puppet] - 10https://gerrit.wikimedia.org/r/195340 [15:49:39] well if the MITM was able to filter IKE packets then yes, but the MITM isn't able to talk to the IKE agents because it would lack the proper keys [15:50:02] yeah but cp10xx right now aren't talking to IKE, they didn't need proper keys to effectively be disabled [15:50:11] PROBLEM - Host cp1047 is DOWN: PING CRITICAL - Packet loss = 100% [15:50:13] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: puppet fail [15:50:17] (03CR) 10Yuvipanda: [C: 032] deployment: Combine labs/prod deployment server roles [puppet] - 10https://gerrit.wikimedia.org/r/195340 (owner: 10Yuvipanda) [15:50:52] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 112, down: 0, dormant: 0, excluded: 0, unused: 0 [15:51:02] RECOVERY - Host cp1047 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [15:52:12] 6operations, 10ops-eqiad: cp1047 down - https://phabricator.wikimedia.org/T88045#1113287 (10Cmjohnson) a:5Cmjohnson>3BBlack Checked for errors again today and did not see any. Rebooted and nothing shows in post. Assigning to bblack to add back. [15:52:50] bblack: right, IPsec transport is not used unless both sides are configured. so because we're not firewalling non-ipsec traffic, currently it will just fall back to regular transport. we'll probably use firewall rules to enforce IPsec transport later. 
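The enforcement step mentioned above ("we'll probably use firewall rules to enforce IPsec transport later") can be sketched as generated iptables rules using the real `-m policy` match, which accepts a peer's traffic only when it arrived through an IPsec policy and drops the cleartext fallback. The peer addresses, chain choice, and rule layout here are illustrative assumptions, not production config:

```python
def ipsec_enforcement_rules(peers):
    """Sketch: for each peer, accept only packets that matched an IPsec
    policy (iptables' `policy` match) and drop everything else from that
    peer, removing the silent fallback to cleartext transport."""
    rules = []
    for peer in peers:
        # accept packets that arrived via an IPsec policy from this peer...
        rules.append(f"iptables -A INPUT -s {peer} "
                     f"-m policy --dir in --pol ipsec -j ACCEPT")
        # ...and drop anything else from it (the cleartext fallback path)
        rules.append(f"iptables -A INPUT -s {peer} -j DROP")
    return rules
```

With rules like these in place, a misconfigured or disabled IPsec endpoint fails loudly (traffic dropped) instead of silently reverting to plaintext.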
[15:53:18] ok, that works [15:54:25] (03CR) 10Alexandros Kosiaris: Hit Parsoid directly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/196229 (owner: 10Mobrovac) [15:54:35] (03PS1) 10coren: Update maintain-replicas with recent schema changes [software] - 10https://gerrit.wikimedia.org/r/196249 [15:54:40] 6operations, 5Patch-For-Review: reclaim lsearchd hosts - https://phabricator.wikimedia.org/T86149#1113295 (10Cmjohnson) [15:54:41] 6operations, 10ops-eqiad: wipe search* and searchidx* hosts - https://phabricator.wikimedia.org/T92434#1113293 (10Cmjohnson) 5Open>3Resolved This was already part of an older phab ticket and is done. All servers have been wiped, they've been changed to asset tags and added to the spares list on wikitech.... [15:57:40] 6operations, 10ops-eqiad: dysprosium net / disk issues for reuse as cache box - https://phabricator.wikimedia.org/T83070#1113299 (10Cmjohnson) 5Open>3Resolved I must've disconnected the DAC cable from the switch side. Fixed xe-8/0/32 up up dysprosium [15:58:15] 7Blocked-on-Operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Some traffic is not identified as Zero in Varnish - https://phabricator.wikimedia.org/T88366#1113301 (10Yurik) [15:58:25] 6operations, 10ops-eqiad: mc1014 server has been flaking out and dropping connectivity - https://phabricator.wikimedia.org/T91773#1113302 (10Cmjohnson) a:5Cmjohnson>3chasemp Haven't seen any more icinga alerts an all appears to be normal. Assigning to Chase to review and resolve as necessary. [15:59:46] (03CR) 10Mobrovac: "@akosiaris, well, I wouldn't like to hard-code the hostname either, but can't see a way to get it nor the port out of hiera ATM. Have you " [puppet] - 10https://gerrit.wikimedia.org/r/196229 (owner: 10Mobrovac) [16:00:05] kart_, ^d: Dear anthropoid, the time has come. Please deploy Content Translation/cxserver (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150312T1600). 
[16:01:08] 6operations, 6Labs, 10hardware-requests: Hardware for Designate - https://phabricator.wikimedia.org/T91277#1113310 (10RobH) server holmium is now allocated for this task. I'll create the linked tickets for its setup. [16:01:41] (03CR) 10Tim Landscheidt: "Somewhat in here or the accompanying changes triggered that now at least in Tools, each Puppet run leaves:" [puppet] - 10https://gerrit.wikimedia.org/r/183814 (owner: 10Faidon Liambotis) [16:02:54] YuviPanda: ^^, I'm a little busy atm [16:02:57] 6operations, 10ops-eqiad: cp1047 down - https://phabricator.wikimedia.org/T88045#1113318 (10BBlack) a:5BBlack>3Cmjohnson Just rebooted into bios setup to enable HT, then rebooted for PXE, and saw: ``` Error: Memory initialization warning detected. MEMBIST Memory Test failure DIMM B5 ``` Then it halts to... [16:03:07] YuviPanda: probably needs a salt rm -rf /etc/ssh/userkeys/ubuntu [16:03:20] YuviPanda: I don't think there's anything creating that anymore (I hope) [16:04:17] paravoid: yeah, I’ll take care of labs [16:04:25] thx [16:06:24] paravoid: can you see it now? 
[16:06:34] papaul: hey [16:06:45] are you able to see it [16:06:49] on scs-c8 [16:06:57] so yeah, you didn't replace a "console port", the part is the whole routing engine :) [16:07:03] ok [16:07:14] so put the old one back in pfw1 as well [16:07:23] paravoid: [16:07:33] ok [16:07:48] paravoid: so what are we going to do next [16:07:58] we'll have to plan this a bit better [16:08:05] paravoid: on [16:08:07] ok [16:08:28] 6operations, 6Labs, 10hardware-requests: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1113340 (10RobH) 3NEW a:3RobH [16:08:35] paravoid: i have 10 days to return the old parts [16:08:44] so i let you decide [16:08:49] 6operations, 6Labs, 10hardware-requests: Hardware for Designate - https://phabricator.wikimedia.org/T91277#1113347 (10RobH) [16:08:50] 6operations, 6Labs, 10hardware-requests: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1113340 (10RobH) [16:08:56] papaul: since when? [16:09:00] 6operations, 6Labs, 10hardware-requests: Hardware for Designate - https://phabricator.wikimedia.org/T91277#1078794 (10RobH) [16:09:01] 6operations, 6Labs, 10hardware-requests: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1113340 (10RobH) [16:09:03] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:09:05] hi, user reporting that commons is very slow from europe (i can confirm) [16:09:09] papaul: I mean, when did those 10 days start counting? 
[16:09:11] paravoid: that is what the email mentioned [16:09:14] 6operations, 6Labs, 10hardware-requests: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1113340 (10RobH) [16:09:15] 6operations, 6Labs, 10hardware-requests: Hardware for Designate - https://phabricator.wikimedia.org/T91277#1113354 (10RobH) 5Open>3Resolved a:3RobH [16:09:24] paravoid: since monday [16:09:34] paravoid: this monday [16:09:36] 6operations, 6Labs, 10hardware-requests: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1113340 (10RobH) [16:09:38] ok [16:09:54] paravoid: on i am putting back the old one in pfw [16:09:57] 1 [16:10:00] okay, thanks! [16:10:02] bblack: are you working on esams upload caches perhaps? [16:10:21] (03CR) 10Alexandros Kosiaris: [C: 04-1] "LGTM, minor typos/comments" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/196198 (owner: 10Hashar) [16:10:27] Steinsplitter: you mean the loading of images or commons wiki itself? [16:10:34] upload [16:11:58] mark: many "The upload succeeded, but the server could not get a preview thumbnail." [16:13:32] does a reload work usually? [16:14:27] no [16:14:32] just tested [16:14:38] matanya: URL? [16:14:42] are you saying -uploading- of images is failing? [16:14:46] or do you mean upload.wikimedia.org? [16:15:24] paravoid: i just looked at the serial console cards, i have a 2GB flash memory card that i can take out. do you think that the configuration is on that flash memory card? if so we can just remove the memory card from the old card and put it in the new card and maybe it will load up the configuration [16:15:42] mark: uploading of images [16:15:43] papaul: no let's not do that [16:15:58] paravoid: ok [16:16:04] paravoid: no url, it is in the process of upload [16:16:26] Steinsplitter: is this what you're seeing as well? [16:16:43] chasemp: btw, image uploads would be a good monitoring test in addition to editing... 
but let's handle that later [16:16:46] i.e. problems with *uploading* images, rather than loading? [16:18:00] paravoid: yes [16:18:31] so to confirm, you can browse the site fine, but you're experiencing slowness while trying to upload, right? [16:18:43] _joe_: still here? [16:19:22] confirm. [16:20:21] gi11es: here? [16:20:32] this sounds potentially related to T84842 [16:20:55] Steinsplitter: so what exactly is slow? [16:21:19] uploading [16:21:59] matanya said that the upload worked but the preview thumbnail was the issue [16:22:10] it uploads at last [16:22:12] paravoid: pfw1 is back up on scs-c8 on port 1 [16:22:13] but very slow [16:22:22] and without a thumbnail [16:22:38] paravoid: i scrolled back and overlooked that. sorry for being not specific enough. [16:23:08] (03CR) 10Alexandros Kosiaris: [C: 04-1] "LGTM, minor comment." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/195896 (https://phabricator.wikimedia.org/T89875) (owner: 10Mobrovac) [16:23:18] ^d: around? [16:23:35] It looks like I messed up the timezone. [16:24:03] anyone else deploying anything? [16:24:34] <_joe_> paravoid: yes, I got a coffee [16:24:40] <_joe_> whatsup? [16:25:02] <_joe_> the HHVM imagescaler is out of rotation ATM [16:25:22] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 116, down: 0, dormant: 0, excluded: 0, unused: 0 [16:26:33] <_joe_> I was about to repool it, but I'll abstain from doing so [16:27:25] <_joe_> Steinsplitter, matanya can you give me the url of one of said images? [16:27:48] mark understood, will make a note [16:27:56] _joe_: https://commons.wikimedia.org/w/index.php?title=Special:ListFiles/Matanya&ilshowall=1 [16:28:02] all those for today [16:28:26] <_joe_> matanya: you don't see the thumbnails? 
[16:28:38] after the upload it do [16:28:43] but not during the upload [16:28:52] 6operations, 10ops-eqiad: mc1014 server has been flaking out and dropping connectivity - https://phabricator.wikimedia.org/T91773#1113434 (10chasemp) Thanks! [16:29:02] https://phabricator.wikimedia.org/T90599 ? [16:29:04] during the upload I get: The upload succeeded, but the server could not get a preview thumbnail. [16:29:25] and it takes quite some time to upload [16:29:58] <_joe_> paravoid: that's the known bug with uploadwizard that gilles named in the imagescaler bug [16:30:01] Krinkle: Thanks for the hint (jscs), I thought I looked for something like that [16:30:27] when i'm not using UW, it is faster, is that possible ? [16:31:12] PROBLEM - Graphite Carbon on graphite2001 is CRITICAL: CRITICAL: Not all configured Carbon instances are running. [16:31:21] PROBLEM - uWSGI web apps on graphite2001 is CRITICAL: CRITICAL: Not all configured uWSGI apps are running. [16:34:01] 6operations, 10ops-codfw: codw pfw* serial connections problem - https://phabricator.wikimedia.org/T84737#1113443 (10Papaul) Replacing the serial console card on the system caused the whole system to lose the initial set-up the reason being, the routing engine is part of the card. As discuss on IRC with Faidon... [16:35:01] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1113448 (10GWicke) [16:35:34] robh: you around? [16:35:40] Yep [16:35:47] hey [16:36:05] I have a question re the process for the HTML dump host [16:36:09] https://phabricator.wikimedia.org/T91853 [16:36:53] do you think we could use one of those spares, and if so, what's the process / timeline? 
[16:37:43] This is the first I've heard about this particular server request (seems new so thats normal) [16:38:54] usually its fast (a day to three) [16:39:23] cool ;) [16:40:00] those spares are a bit overkill really, but they seem to be the only ones available with the storage [16:40:20] !log kartik Started scap: Update ContentTranslation [16:40:27] Logged the message, Master [16:40:43] Nikerabbit: ^d ^^ [16:41:12] gwicke: so the specs of the spare you chose are primarily due to the large disks right? [16:41:44] since we really have nothing else even close to that capacity (not sure if the high memory or cpu count is super relevant for this role) [16:41:46] (03CR) 10Yuvipanda: "salted the ubuntu keys out, things should be all good now." [puppet] - 10https://gerrit.wikimedia.org/r/183814 (owner: 10Faidon Liambotis) [16:41:50] mark: no, I'm not, what's the issue with esams caches? [16:41:53] robh: could consider using one of them temporarily only & then switching to another box later [16:41:54] paravoid: ^ salt done, things seem ok, etc [16:41:55] im not likely to hold up allocation for that reason, but i'd note it on the task [16:42:11] so something with a single cpu and half the memory would likely also work? [16:42:21] yes [16:42:27] it's primarily about disks [16:42:33] bblack: I don't think it's the esams caches [16:42:39] well, not the upload ones at least [16:42:46] !log kartik Finished scap: Update ContentTranslation (duration: 02m 25s) [16:42:51] possibly the text ones? [16:42:51] Logged the message, Master [16:42:56] possibly [16:43:03] cmjohnson1: Do we have any 3TB or larger spare sata disks on site? (I ask due to old lsearch hosts being good for use but small disks) [16:43:04] more users are experiencing https://phabricator.wikimedia.org/T90599 [16:43:12] scap was fast. [16:43:41] gwicke: so yea, i'm checking if we have spare disks. 
i dont really think its worth ordering spare disks to slap in an old out of warranty server for this, when we have the three slightly overprovisioned boxes for this [16:43:44] but checking [16:43:55] could be unrelated to varnish/nginx entirely [16:44:03] robh: we have some SATA disks...lemme check on the amount [16:44:29] paravoid: only known work on prod esams cache clusters is: gage deployed the ipsec stack on the texts (but it's not on the corresponding esams boxes, so no traffic should be using it yet) [16:44:38] cmjohnson1: cool, the followup is if we do have 3tb spares, how many and can they be installed into one of the old lsearch wmf3152 to wmf3175 [16:44:42] this morning, I mean. there were reinstalls yesterday [16:45:11] "corresponding eqiad boxes" you mean? [16:45:21] i was about to reinstall another one, should i wait? [16:45:35] robh, cmjohnson1: that sounds great [16:45:53] paravoid: yes [16:46:40] (03PS1) 10Dzahn: depool cp1066 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196254 [16:46:55] intra-DC frontend->backend should be affected, though, no? [16:47:14] ah there are no config stanzas for it [16:47:31] right it's only config'd for x-dc [16:47:50] (03CR) 10Tim Landscheidt: "Thanks, seems to work." [puppet] - 10https://gerrit.wikimedia.org/r/183814 (owner: 10Faidon Liambotis) [16:49:28] robh: so no 3TB's left. I have 9 1TB and many 500GB [16:49:49] yea... i dont think its worth ordering spares to use, better to just allocate a spare system entirely [16:50:26] (03CR) 10Alexandros Kosiaris: "@mobrovac, no I am fine with it as long as it is what you intended. The default has no port hence my question." [puppet] - 10https://gerrit.wikimedia.org/r/196229 (owner: 10Mobrovac) [16:51:44] (03PS1) 10Cmjohnson: fixing typo in wmnet file [dns] - 10https://gerrit.wikimedia.org/r/196256 [16:51:59] jgage: here? 
[16:52:09] hi [16:52:13] hi [16:52:22] (03CR) 10Cmjohnson: [C: 032] fixing typo in wmnet file [dns] - 10https://gerrit.wikimedia.org/r/196256 (owner: 10Cmjohnson) [16:52:25] did you deploy ipsec in all of amssq*? [16:52:35] (see above) [16:53:25] checking scrollback now. i did configure the esams side but the transports are not established because the eqiad side is not configured. [16:53:39] 7Blocked-on-Operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1113513 (10Ottomata) >>>* Tag all traffic with other information like https and proxy >>https has to be done on varnishes, for sure. We can't infer this very... [16:55:15] that was quite a risky move [16:55:34] 6operations, 5Patch-For-Review, 3wikis-in-codfw: setup & deploy rdb2001-2004 - https://phabricator.wikimedia.org/T92011#1113521 (10Dzahn) ah! thanks @faidon. I will grab rdb2004 joe took rdb2003. [16:55:37] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1113522 (10RobH) I've chatted with Gabriel and Ariel about this particular request in IRC (plus reviewed the linked tasks.) This is an actual need, and discu... [16:56:00] let's start with two hosts, first depooled and then slowly ramp up prod traffic to them [16:56:17] also, is there a reason why we're doing it only for cross-DC traffic? [16:56:19] ok. just discussed this with mark. [16:56:32] where else would we do it? intra-dc? [16:56:35] yes [16:56:47] that has never been the plan, but we could easily do that if desired [16:57:08] why intra-dc? [16:57:37] that has never been the plan indeed [16:58:04] hrm, ok [16:58:18] more importantly, how are we going to secure kafka? 
[16:58:39] (and udp2log, although I guess there's little point in doing that now) [16:59:06] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1113534 (10RobH) Also perhaps @arielglenn could offer insight as one of the opsen with knowledge of dumps? We chatted some in IRC, but on task is ideal. My... [16:59:07] we haven't discussed that before, but it would be easy to apply ipsec to kafka traffic [16:59:31] well either that or TLS [16:59:38] if we don't do that, there's little point to the whole project :) [16:59:39] 7Blocked-on-Operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1113538 (10Yurik) >>! In T89177#1113513, @Ottomata wrote: >>Some traffic goes through proxies like Opera-mini (compressing proxy specifically for mobile vers... [16:59:42] agreed [16:59:52] gwicke: I've updated the task with the info on which system to use and such. I need to get feedback from apergos (and then mark needs to be ok with the allocation), since its hardware and all [17:00:04] but once thats done its fast if we go with the onsite system [17:00:29] https://cwiki.apache.org/confluence/display/KAFKA/Security [17:00:30] if we have to order something, it is 3-4 days (for hard disks) or 2-3 weeks for dell server (alternatives to using the spare server onsite) [17:00:50] but, it has movement now and its on my radar [17:01:00] robh: cool, thank you! 
[17:01:08] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1113541 (10RobH) a:3RobH [17:01:36] (03PS1) 10Cmjohnson: Adding dhcp entries for cp1071-74 with jessie installer [puppet] - 10https://gerrit.wikimedia.org/r/196258 [17:01:47] 6operations, 10ops-esams: Rack, cable, prepare cp3023-3046 - https://phabricator.wikimedia.org/T92514#1113544 (10mark) 3NEW [17:01:57] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1113552 (10ArielGlenn) I chatted with Gabriel about this and we agreed that locally generating the dumps on a host plus keeping one run around to use as input... [17:02:14] jgage: do you have any plans for monitoring the state of ipsec? [17:02:15] robh: we could also consider using one of the spares temporarily until one of the older boxes is fitted with disks [17:02:30] jgage: I find it a bit surprising that while traffic is not encrypted now, we are not getting any alerts [17:02:51] this is intentional now of course, but what if it wasn't and was a misconfiguration or something [17:03:09] robh: the first iteration of those dumps will basically just be running a script manually, so easy to move that to another node [17:03:25] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1113556 (10RobH) But the cpu and memory requirements for the intended host are not as high as wmf4543 (Xeon E5-2450 v2, 64GB ram) right? [17:04:19] gwicke: please note that on task. (i could but then its me saying im quoting you from irc ;) [17:04:26] that does make allocation easier (imo) [17:04:44] paravoid: yeah monitoring is definitely needed and is part of my plan. will open a ticket for that today to discuss details. 
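The IPsec monitoring being asked for above could take the shape of a nagios-style check that counts established IKE security associations against the expected peer count. `ipsec status` and its ESTABLISHED lines are real strongSwan output; the expected count, sample text, and exit-code policy below are an illustrative sketch, not the check that was eventually written:

```python
def check_ipsec_sas(status_output, expected):
    """Parse `ipsec status` (strongSwan) text and return a nagios-style
    (exit_code, message) pair: 0/OK when all expected IKE SAs are up,
    2/CRITICAL otherwise (so missing encryption alerts instead of
    silently falling back to cleartext)."""
    up = sum(1 for line in status_output.splitlines() if "ESTABLISHED" in line)
    if up < expected:
        return 2, f"CRITICAL: {up}/{expected} IKE SAs established"
    return 0, f"OK: {up}/{expected} IKE SAs established"

# Sample output loosely modeled on strongSwan's status listing:
sample = (
    "Security Associations (1 up, 1 connecting):\n"
    "   cp3030-v4[2]: ESTABLISHED 14 minutes ago\n"
    "   cp3031-v4[1]: CONNECTING\n"
)
```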
[17:04:49] as most dump stuff is fire and let it sit, i'd assume transitioning to a different server would be painful [17:04:56] seems this isnt that =] [17:04:58] jgage: great, thanks [17:05:00] monitoring is a very good point [17:05:05] that should be done before deployment [17:05:05] robh: *nod* [17:05:20] agreed [17:06:02] jgage: finally, please !log when deploying :) I was quite scared for a moment there when I saw that commit message in my backlog [17:06:31] ah yeah. will do. [17:06:54] thank you! [17:07:09] ok, now back to square 1 about that slowness issue [17:07:35] !log rdb2004 - changed serial settings in bios, boots into installer now (T92011) [17:07:42] Logged the message, Master [17:08:04] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1113574 (10GWicke) @RobH, for plain HTML dumps only compression will use significant CPU, and basically nothing significant memory. For other formats there co... [17:08:05] !log rdb2004 .. but then gets the 'malformed IP address' warning like on rbf2001 [17:08:09] Logged the message, Master [17:08:54] <_joe_> mutante: did you try the "server debug" hack?' [17:10:04] <_joe_> if that doesn't work, let's go with trusty for the redis hosts for now [17:10:14] _joe_: not yet, i will [17:10:16] ok [17:10:38] <_joe_> just sync with paravoid, he may want to play with rdb2002 a bit longer [17:10:43] 6operations, 5Patch-For-Review, 3wikis-in-codfw: setup & deploy rdb2001-2004 - https://phabricator.wikimedia.org/T92011#1113589 (10Dzahn) rbf2004 - serial communication was set to off, changed to on via com2, serial port addres was device1=com2, device2=com1, switched around red... [17:10:49] yes please [17:12:02] yes, just touching rdb2004 for now [17:12:15] hrmm. 
the same issue now as on rbf2001 though [17:16:40] (03CR) 10Alexandros Kosiaris: [C: 032] reprepro: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183818 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [17:17:11] akosiaris: \o/ [17:17:21] akosiaris: assuming you’re merging as well :D [17:17:26] * YuviPanda finds C+2 without merge confusing [17:18:13] RECOVERY - Graphite Carbon on graphite2001 is OK: OK: All defined Carbon jobs are runnning. [17:21:42] PROBLEM - Graphite Carbon on graphite2001 is CRITICAL: CRITICAL: Not all configured Carbon instances are running. [17:22:19] 7Blocked-on-Operations, 6operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1113679 (10mark) [17:22:43] YuviPanda: yeah, I noticed that discussion on that ticket [17:22:59] <_joe_> YuviPanda: I think what alex does is right [17:23:19] nobody expects it though [17:23:27] yeah, like the spanish inquisition [17:23:33] +2 means for some reason "I'll shepherd into production" [17:23:44] <_joe_> YuviPanda: you ruined my youtube link [17:23:44] generally, across all repos, if you +2, then either jenkins merges for you, or you merge it. [17:23:51] _joe_: haha :D [17:24:29] yeah, I 'll merge that one as well [17:24:33] submitting is not really tied to the vote though [17:24:50] (03CR) 10Alexandros Kosiaris: "Cleaned up /var/lib/reprepo/.ssh manually on caesium." [puppet] - 10https://gerrit.wikimedia.org/r/183818 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [17:25:31] i'm used to them from alex now and i appreciated the difference between "+1 as in "that sounds like a good idea" vs. 
"+2 i actually checked this, but you go ahead and merge your change" [17:26:45] 6operations, 10Datasets-General-or-Unknown, 6Services, 10hardware-requests: Hardware for HTML / zim dumps - https://phabricator.wikimedia.org/T91853#1113701 (10Kelson) @RobH, for the ZIM files, I need more CPU resources than for HTML only Parsoid dumps; see my previous emails to get more details about t... [17:27:26] (03CR) 10coren: [C: 032] "Tested to match status-quo" [software] - 10https://gerrit.wikimedia.org/r/196249 (owner: 10coren) [17:27:29] so much easier to do these discussions about hardware in our post phabricator world. [17:27:37] * robh is still quite enamored by phabricator. [17:27:59] * gwicke concurs with robh [17:28:00] (03PS1) 10coren: Generate db mapping from maintain-replicas.pl [software] - 10https://gerrit.wikimedia.org/r/196271 [17:30:44] 6operations, 10Analytics, 6Scrum-of-Scrums, 10Wikipedia-App-Android-App, and 3 others: Avoid cache fragmenting URLs for Share a Fact shares - https://phabricator.wikimedia.org/T90606#1113713 (10dr0ptp4kt) The iOS client side code is good. Merged that in. [17:31:13] (03PS1) 10coren: Actually populate meta_p.wiki.size [software] - 10https://gerrit.wikimedia.org/r/196272 (https://phabricator.wikimedia.org/T90084) [17:31:32] 6operations, 10Analytics, 6Scrum-of-Scrums, 10Wikipedia-App-Android-App, and 3 others: Avoid cache fragmenting URLs for Share a Fact shares - https://phabricator.wikimedia.org/T90606#1113721 (10dr0ptp4kt) a:5Fjalapeno>3dr0ptp4kt [17:31:40] 6operations, 10Analytics, 6Scrum-of-Scrums, 10Wikipedia-App-Android-App, and 3 others: Avoid cache fragmenting URLs for Share a Fact shares - https://phabricator.wikimedia.org/T90606#1063059 (10dr0ptp4kt) a:5dr0ptp4kt>3None [17:34:58] 7Blocked-on-Operations, 6operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1113730 (10mark) >>! 
In T89177#1113146, @Yurik wrote: > First, for the reasoning: banners, compression, analytics. > > # We need to show Ze... [17:37:32] (03PS1) 10RobH: setting holmium (designate server) dns entry [dns] - 10https://gerrit.wikimedia.org/r/196275 [17:38:12] PROBLEM - puppet last run on sodium is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [17:38:36] 6operations, 6Labs, 10hardware-requests: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1113740 (10RobH) [17:40:07] 7Blocked-on-Operations, 6operations, 6Scrum-of-Scrums, 6Zero, 7Varnish: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1113743 (10Yurik) >>! In T89177#1113730, @mark wrote: > > @Yurik: could you elaborate on this? Why has this become a much bigger issue recent... [17:41:12] (03PS1) 10Yuvipanda: beta: Remove unused wmf-beta-scap file [puppet] - 10https://gerrit.wikimedia.org/r/196276 [17:41:14] (03PS1) 10Yuvipanda: beta: Kill deployment-rsync01 [puppet] - 10https://gerrit.wikimedia.org/r/196277 [17:41:31] (03PS2) 10Yuvipanda: beta: Remove unused wmf-beta-scap file [puppet] - 10https://gerrit.wikimedia.org/r/196276 [17:41:40] (03PS2) 10Yuvipanda: beta: Kill deployment-rsync01 [puppet] - 10https://gerrit.wikimedia.org/r/196277 [17:42:05] (03PS1) 10RobH: setting holmium install params [puppet] - 10https://gerrit.wikimedia.org/r/196278 [17:42:28] (03CR) 10Yuvipanda: [C: 032 V: 032] beta: Remove unused wmf-beta-scap file [puppet] - 10https://gerrit.wikimedia.org/r/196276 (owner: 10Yuvipanda) [17:42:46] (03PS3) 10Yuvipanda: beta: Kill deployment-rsync01 [puppet] - 10https://gerrit.wikimedia.org/r/196277 [17:42:56] 6operations, 6Labs, 10hardware-requests: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1113761 (10RobH) [17:43:00] (03CR) 10Yuvipanda: [C: 032 V: 032] beta: Kill deployment-rsync01 [puppet] - 10https://gerrit.wikimedia.org/r/196277 (owner: 10Yuvipanda) [17:43:21] (03CR) 10RobH: [C: 
032] setting holmium install params [puppet] - 10https://gerrit.wikimedia.org/r/196278 (owner: 10RobH) [17:43:43] (03CR) 10RobH: [C: 032] setting holmium (designate server) dns entry [dns] - 10https://gerrit.wikimedia.org/r/196275 (owner: 10RobH) [17:45:00] (03CR) 10Alexandros Kosiaris: "I 've been indeed doing what Daniel describes in the past. Then I realized I am the only one making that distinction (thanks Daniel for po" [puppet] - 10https://gerrit.wikimedia.org/r/195913 (owner: 10Faidon Liambotis) [17:45:41] (03CR) 10Yuvipanda: "YESSSSSS" [puppet] - 10https://gerrit.wikimedia.org/r/195913 (owner: 10Faidon Liambotis) [17:45:42] 6operations, 6Phabricator, 7Mail: Phabricator mails Message-ID has localhost.localdomain - https://phabricator.wikimedia.org/T75713#1113777 (10chasemp) I spoke with upstream about this and the general idea is that Amazon http://aws.amazon.com/ses/ mangles the message-id in some circumstances and so they pref... [17:46:02] PROBLEM - puppet last run on xenon is CRITICAL: CRITICAL: puppet fail [17:49:48] (03CR) 10Dzahn: "i think akosaris was right but having the same expectations is also important" [puppet] - 10https://gerrit.wikimedia.org/r/195913 (owner: 10Faidon Liambotis) [17:50:21] gwicke, heheh https://wikitech.wikimedia.org/w/index.php?title=Squids&oldid=17057 <3 [17:51:51] !log restarted populateListOfUsersToBeRenamed.php on terbium (CentralAuth) [17:51:58] Logged the message, Master [17:52:54] (03PS4) 10Nuria: Adding a Last-Access cookie to text and mobile requests [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T92435) [17:55:21] (03CR) 10Nuria: Adding a Last-Access cookie to text and mobile requests (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/196009 (https://phabricator.wikimedia.org/T92435) (owner: 10Nuria) [17:56:14] (03PS1) 10Rush: enable mc1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196279 [17:57:02] PROBLEM - Host cp1047 is DOWN: PING CRITICAL - Packet loss 
= 100% [17:57:19] <_joe_> akosiaris: you're not the only one btw [17:57:54] (03CR) 10Mobrovac: "@akosiaris Yep, it is. It's actually set in files/misc/parsoid.upstart#23 ." [puppet] - 10https://gerrit.wikimedia.org/r/196229 (owner: 10Mobrovac) [17:58:05] MaxSem: ye olden days ;) [17:58:23] I bothered mark about this quite a bit when I started and the outcome was "a +1 from an ops means ready to merge / I approve" [17:58:33] RECOVERY - uWSGI web apps on graphite2001 is OK: OK: All defined uWSGI apps are runnning. [17:59:11] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1113836 (10brion) >>! In T84842#1112319, @Joe wrote: > @brion problem confirmed, but this would be relevant for the videoscalers, right? ah good if video scalers are a separa... [17:59:35] (03PS1) 10Rush: enable mc1014 [puppet] - 10https://gerrit.wikimedia.org/r/196281 [17:59:55] ^ cp1047 is ok, I'll re-up the downtime on it, sorry [18:00:44] _joe_: paravoid you might like the current size of the beta/ module :) [18:00:56] I can probably still trim a file or two [18:01:52] PROBLEM - uWSGI web apps on graphite2001 is CRITICAL: CRITICAL: Not all configured uWSGI apps are running. 
[18:04:30] !log started uWSGI on graphite2001 [18:04:36] Logged the message, Master [18:04:41] RECOVERY - graphite.wikimedia.org on graphite2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.211 second response time [18:04:50] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:05:32] (03CR) 10Mobrovac: Puppetise Citoid's configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/195896 (https://phabricator.wikimedia.org/T89875) (owner: 10Mobrovac) [18:05:53] (03PS4) 10Mobrovac: Puppetise Citoid's configuration [puppet] - 10https://gerrit.wikimedia.org/r/195896 (https://phabricator.wikimedia.org/T89875) [18:06:01] (03CR) 10Mattflaschen: [C: 04-1] "Code looks good and works locally. However, blocked on https://phabricator.wikimedia.org/T91086 and https://phabricator.wikimedia.org/T89" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196068 (https://phabricator.wikimedia.org/T90670) (owner: 10EBernhardson) [18:06:17] 6operations, 6Labs, 10hardware-requests: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1113867 (10RobH) [18:06:32] 6operations, 6Labs, 10hardware-requests: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1113340 (10RobH) os install in progress, all previous steps complete [18:06:50] (03CR) 10Rush: [C: 032] enable mc1014 [puppet] - 10https://gerrit.wikimedia.org/r/196281 (owner: 10Rush) [18:07:12] 6operations, 10ops-eqiad: mc1014 server has been flaking out and dropping connectivity - https://phabricator.wikimedia.org/T91773#1113870 (10chasemp) [18:07:13] (03CR) 10Rush: [C: 032] enable mc1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196279 (owner: 10Rush) [18:07:18] 6operations, 10ops-eqiad: mc1014 server has been flaking out and dropping connectivity - https://phabricator.wikimedia.org/T91773#1095950 (10chasemp) [18:09:02] (03CR) 10Alexandros Kosiaris: "I was referring to" 
[puppet] - 10https://gerrit.wikimedia.org/r/196229 (owner: 10Mobrovac) [18:09:52] !log rush Synchronized wmf-config/session.php: mc1014 enable (duration: 00m 06s) [18:09:59] Logged the message, Master [18:10:14] <_joe_> I should count the time I lose every day waiting for a) gerrit b) jenkins [18:10:22] (03PS1) 10Giuseppe Lavagetto: memcached: add host entry for mc2001 [puppet] - 10https://gerrit.wikimedia.org/r/196284 [18:10:56] api down for anyone else? [18:11:01] akosiaris: hehe, it seems we've entered an endless loop with these comments :) [18:11:01] cp1065 is giving me an error message [18:11:06] everything down [18:11:09] I thing [18:11:12] sites 503 for me [18:11:14] mobrovac: my point exactly :-) [18:11:15] All down for me [18:11:20] Wikimedia Error [18:11:28] (03PS2) 10coren: Actually populate meta_p.wiki.size [software] - 10https://gerrit.wikimedia.org/r/196272 (https://phabricator.wikimedia.org/T90084) [18:11:39] Basically 503 [18:11:41] chasemp: ^ [18:11:44] Erm... did you guys have plannned downtime today? [18:11:48] no [18:11:51] can you be more specific? [18:11:52] (03CR) 10Tim Landscheidt: setting holmium (designate server) dns entry (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/196275 (owner: 10RobH) [18:11:59] well great [18:12:04] Request: GET http://en.wikipedia.org/w/index.php?title=Michael_Pawlyn&action=delete, from 10.20.0.155 via cp1065 cp1065 ([10.64.0.102]:3128), Varnish XID 3092852003 [18:12:09] "Request: GET http://en.wikisource.org/wiki/User_talk:Hrishikes, from 10.20.0.135 via cp1055 cp1055 ([10.64.32.107]:3128), Varnish XID 864516636 [18:12:10] bblack, en.wikipedia.org should load, right? :) [18:12:11] Forwarded for: 80.176.129.180, 10.20.0.151, 10.20.0.151, 10.20.0.135 [18:12:12] Error: 503, Service Unavailable at Thu, 12 Mar 2015 18:11:10 GMT " [18:12:15] it loads for me [18:12:18] so that's why I'm asking [18:12:22] I get 503s too [18:12:28] I think the people complaining are all in Europe [18:12:29] Europe problem ? 
[18:12:30] 503 for me as well [18:12:32] (Including me right now) [18:12:36] probably [18:12:37] * Steinsplitter as well [18:12:39] 503s, in Canada. [18:12:39] I'm esams, yes [18:12:42] same error as Qcoder00 here. I'm in UK [18:12:45] akosiaris: i'm good with the change too, i was basically complaining at the fact that parsoid module is rather non-parametrised [18:12:46] Erroring here, too [18:12:50] from south america [18:12:50] (wfm India as well) [18:12:50] "Error: 503, Service Unavailable at Thu, 12 Mar 2015 18:12:35 GMT" [18:12:51] PROBLEM - HHVM rendering on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:12:58] ^ it begins [18:12:58] 503 for me and I'm not europe [18:13:04] I'm in the UK [18:13:05] 503 in france [18:13:05] what is 503'ing [18:13:06] ? [18:13:10] PROBLEM - Apache HTTP on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:13:10] I'm in SFbay and seeing 503s [18:13:11] en.wikipedia.org [18:13:12] Wikimedia sites [18:13:13] chasemp: requests to wikis [18:13:15] Yeah esams is 503ing [18:13:16] So is ulsfo [18:13:19] chasemp, everything [18:13:19] But eqiad seems fine [18:13:19] chasemp: everything? [18:13:25] <_joe_> everything is 503ing for me [18:13:35] same here [18:13:40] it's ok for me so that's interesting [18:13:43] wget -H "Host: en.wikipedia.org" http://text-lb.esams.wikimedia.org [18:13:43] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [500.0] [18:13:52] 7Blocked-on-Operations, 6operations, 6Scrum-of-Scrums, 6Zero, and 2 others: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1113898 (10Yurik) [18:13:55] That 503s for me, ulsfo also 503s, and eqiad 200s [18:13:58] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Application+servers+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [18:14:12] jgage: ping? 
[18:14:14] greg-g: ^ [18:14:21] bah [18:14:23] i suspect mc1014 enable ( [18:14:30] <_joe_> chasemp: maybe after your merge? [18:14:41] PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [500.0] [18:14:48] so deployer people I ran sync-file wmf-config/session.php "mc1014 enable" from /srv/mediawiki [18:14:48] (03PS1) 10Ori.livneh: Revert "enable mc1014" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196286 [18:14:54] i'm on the same page as RoanKattouw [18:14:56] seems to have completed but possibly bad news [18:14:57] <_joe_> ori: yea let's revert that [18:15:02] (03CR) 10Ori.livneh: [C: 032 V: 032] Revert "enable mc1014" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196286 (owner: 10Ori.livneh) [18:15:11] ori: thanks [18:15:17] whitespace before the <?php oh man [18:15:34] !log ori Synchronized wmf-config/session.php: I29542c0965 (duration: 00m 08s) [18:15:37] I would suspect the ipsec stuff, but it's ulsfo too? [18:15:39] Logged the message, Master [18:15:45] wfm now [18:15:46] guillom: not sure what to tweet other than a general "we're experiencing technical issues and we're currently investigating" [18:15:50] WP is 503... guess we shouldn't have had that NSA lawsuit... [18:15:52] working [18:15:55] HaeB: ^ [18:15:58] yeah, things are back [18:16:00] Ouch. [18:16:00] guillom: nvm :) [18:16:06] working for me. thanks! :) [18:16:07] HaeB: nvm on the tweet, we're back it seems [18:16:12] yup [18:16:17] ok thanks guys clearly that was a bad sync for me [18:16:28] so, what went wrong ? [18:16:30] (03CR) 10Se4598: "is this whitespace causing the outage?" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196279 (owner: 10Rush) [18:16:32] http://en.wikipedia.org/wiki/Downtime is my test page, works [18:16:42] mutante++ [18:16:48] that whitespace looks wrong [18:16:54] how the hell did mc1014 affect ulsfo/esams but not eqiad?
[18:16:55] Visiting http://commons.wikimedia.org/wiki/File:Trie_example.svg causes 503 Service Unavailable. Interesting. I can consistently reproduce it. Doesn't happen to ?action=history or other files. [18:17:13] bblack, the whitespace thing at the beginning of the file, perhaps? [18:17:17] What was the content of the file? [18:17:19] bblack: eqiad probably still has a cached version of the main page [18:17:21] PROBLEM - HHVM queue size on mw1120 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [80.0] [18:17:23] <_joe_> bblack: now that I don't understand [18:17:24] "We were briefly experiencing some technical issues, but should be back up now" [18:17:24] Krinkle: seems fine to me [18:17:26] Yes that change is broken [18:17:26] does that work? [18:17:31] ouch =[ [18:17:33] It should be reverted immediately [18:17:37] already reverted [18:17:39] <_joe_> HaeB: I do see the sites [18:17:39] RoanKattouw: yeah but esams/ulsfo backend caches go to eqiad backend caches [18:17:40] and synced [18:17:40] Whitespace before the opening <?php i went to get a drink and a site outage happened and was repaired [18:17:52] <_joe_> ori: I do see the sites from europe [18:17:57] Oh we're back already [18:17:57] robh: That's the best kind, right?
HaeB: yeah [18:17:59] Nice [18:18:08] I go and respond to a different chat and the outage is fixed [18:18:10] HaeB: that's fine [18:18:11] _joe_: hence the "back up now" [18:18:11] guillom: yes, the ouch was for the folks who fixed it, not me ;D [18:18:13] deleting stuff works as well, fwiw [18:18:22] Back for me now [18:18:29] <_joe_> HaeB: ok thanks :) [18:18:48] one more shirt from bd808 :D [18:19:09] (03PS4) 10coren: Labs: puppetize replica-addusers [puppet] - 10https://gerrit.wikimedia.org/r/135445 [18:19:14] HaeB: looks like a 6 minute outage, btw [18:19:16] * bd808 will need to start a fundraising campaign [18:19:17] so I don't see what was broken in that change [18:19:18] now for the post-mortem -- sync-file ought to catch syntax errors [18:19:28] I don't think it was a syntax error [18:19:31] twentyafterfour: Whitespace before the opening <?php it should? It runs php -l [18:19:36] It causes "Headers already sent" errors [18:19:45] RoanKattouw: doh! :) [18:19:47] It's not syntactically invalid [18:19:50] oh, right [18:19:54] yeah, runtime error rather than syntax [18:19:54] It's just something you want to do basically never [18:19:54] * legoktm notes that phpcs catches anything before <?php oh, that sucks, 50 files i uploaded didn't go through, 1 hour work of description and stuff :/ [18:20:14] legoktm: But apparently phpcs doesn't run on wmf-config [18:20:21] It would probably have a seizure anyway [18:20:30] So many lines over 100 chars :D [18:20:30] we should just check for this very specific thing in scap [18:20:52] if filename.endswith('.php') and not file.startswith('<?php') Let's make phpcs voting [18:21:03] at deviantart we had a very strict commit hook that wouldn't accept any file with whitespace before the <?php :D [18:22:17] 6operations, 10MediaWiki-File-management, 6Multimedia: http://commons.wikimedia.org/wiki/File:Trie_example.svg causes 503 Service Unavailable - https://phabricator.wikimedia.org/T92529#1113927 (10Krinkle) 3NEW [18:22:37] akosiaris: What
about http://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Trie_example.svg/213px-Trie_example.svg.png [18:23:27] (03PS5) 10coren: Labs: puppetize replica-addusers [puppet] - 10https://gerrit.wikimedia.org/r/135445 [18:23:30] Krinkle: Headers already sent, terminating is what it says [18:23:35] YuviPanda: You around for ^^ ? [18:23:42] akosiaris: Fun [18:24:38] cache problem by the images? [18:25:21] mobrovac: I 'll merge https://gerrit.wikimedia.org/r/#/c/195896/4 [18:25:49] 6operations, 10Continuous-Integration, 6MediaWiki-Core-Team: add a check for whitespace before leading <?php akosiaris: yep, please [18:25:58] (03CR) 10Yuvipanda: [C: 04-1] "Another 'new'-perl-in-ops-puppet dilemma :( But I think it's ok to put this in as long as we commit to moving it off perl at some point." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/135445 (owner: 10coren) [18:26:13] (03CR) 10Alexandros Kosiaris: [C: 032] Puppetise Citoid's configuration [puppet] - 10https://gerrit.wikimedia.org/r/195896 (https://phabricator.wikimedia.org/T89875) (owner: 10Mobrovac) [18:26:20] YuviPanda: "new". :-) [18:26:26] akosiaris: Krinkle with a clean profile, i can view it, but when refreshing, i get this error [18:26:27] Coren: hence dilemma [18:26:44] terrible + puppetized vs terrible + unpuppetized, so former wins :) [18:27:10] 6operations, 10Continuous-Integration, 6MediaWiki-Core-Team: add a check for whitespace before leading <?php 6operations, 10MediaWiki-File-management, 6Multimedia: http://commons.wikimedia.org/wiki/File:Trie_example.svg causes 503 Service Unavailable - https://phabricator.wikimedia.org/T92529#1113973 (10Tgr) Caused by the outage mentioned in T92531. Varnish apparently cached the error for file URLs (the file page w... [18:27:37] I'm looking at Trie_example.svg... [18:27:45] YuviPanda: /have/ we stopped using generic::upstart_job?
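The scap check floated above (legoktm's `if filename.endswith('.php') and not file.startswith('<?php')` one-liner, which became https://gerrit.wikimedia.org/r/196306) can be sketched in a few lines of Python. The function name and the temp-file demo are illustrative only, not the actual scap code:

```python
# Sketch of the scap check proposed above: reject any .php file whose first
# bytes are not exactly the opening <?php tag. php -l passes such files
# (they are syntactically valid); the leading bytes only bite at runtime,
# when PHP emits them as output and later header() calls fail with
# "Headers already sent". Function name and demo files are hypothetical,
# not the actual scap code (that is Gerrit change 196306).
import tempfile


def has_content_before_php_tag(filename):
    if not filename.endswith('.php'):
        return False
    with open(filename, 'rb') as f:
        return not f.read(5).startswith(b'<?php')


with tempfile.NamedTemporaryFile(suffix='.php', delete=False) as good:
    good.write(b'<?php\necho "hi";\n')
with tempfile.NamedTemporaryFile(suffix='.php', delete=False) as bad:
    bad.write(b' <?php\necho "hi";\n')  # one leading space: breaks headers

print(has_content_before_php_tag(good.name))  # False
print(has_content_before_php_tag(bad.name))   # True
```

Checking raw bytes (rather than decoded text) also catches a UTF-8 BOM before the tag, which triggers the same runtime failure while still passing `php -l`.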
I remember someone ripped it all out [18:28:00] YuviPanda: Not like that's an issue to switch it to a simple file[] [18:28:24] Coren: yup. [18:28:26] 10Ops-Access-Requests, 6operations: Give CCogdill an account on barium - https://phabricator.wikimedia.org/T92533#1113976 (10CCogdill_WMF) 3NEW a:3Jgreen [18:28:30] Coren: in fact, I can’t even find the generic module…? [18:28:56] can we just do a varnish ban for all 503s? [18:29:22] Coren: https://gerrit.wikimedia.org/r/#/c/184296/ [18:29:24] (03CR) 10Mobrovac: "@akosiaris, the default has no port attached to it because it points to the Varnish layer, while the fact in hiera diverts that to hit Par" [puppet] - 10https://gerrit.wikimedia.org/r/196229 (owner: 10Mobrovac) [18:29:45] (03CR) 10coren: Labs: puppetize replica-addusers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/135445 (owner: 10coren) [18:29:51] PROBLEM - HHVM busy threads on mw1120 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [86.4] [18:30:01] tgr: it's not a 503 in varnish, it's a 200 [18:30:08] (503 isn't cached anyways) [18:30:45] 6operations, 10Continuous-Integration, 6MediaWiki-Core-Team: add a check for whitespace before leading <?php >>! In T92531#1113969, @Krinkle wrote: > See {T46875}. > > This and various other errors are already caught by phpcs. Please ensure proj... [18:30:55] YuviPanda: Also the patch is old but got forgotten, I just freshened it up.
:-) [18:31:01] RECOVERY - HHVM busy threads on mw1120 is OK: OK: Less than 30.00% above the threshold [57.6] [18:31:07] (03CR) 10Yuvipanda: Labs: puppetize replica-addusers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/135445 (owner: 10coren) [18:31:11] Coren: that makes sense :) [18:31:21] RECOVERY - HTTP 5xx req/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:31:28] is mediawiki.org broken [18:31:31] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:32:39] aude: shouldn't be now, though there was a 5 min outage [18:32:49] 6operations, 10Continuous-Integration, 6MediaWiki-Core-Team: add a check for whitespace before leading <?php >>! In T92531#1113969, @Krinkle wrote: > See {T46875}. > > This and various other errors are already caught by phpcs. Please ensure project(... [18:32:50] 2Fatal error: clone called on non-object in /srv/mediawiki/php-1.25wmf21/extensions/Flow/includes/Formatter/RecentChanges.php on line 181 [18:32:53] it is [18:32:56] (03CR) 10Jgreen: [C: 04-1] "Shopify has a Primary Domain setting with an option to redirect all other hostnames there. Maybe I'm missing something, but I can't think " (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/196007 (https://phabricator.wikimedia.org/T92438) (owner: 10John F. Lewis) [18:33:38] 6operations, 10Continuous-Integration, 10Incident-20150312-whitespace, 6MediaWiki-Core-Team: add a check for whitespace before leading <?php aude: url? [18:33:52] 7Blocked-on-Operations, 6operations, 6Scrum-of-Scrums, 6Zero, and 2 others: Tag all Zero traffic with X-Analytics xcs value - https://phabricator.wikimedia.org/T89177#1114013 (10Nuria) @Yurik My 2 cents and last post in this ticket, as I think decisions here belong to ops team. I think we should separat... [18:34:01] YuviPanda: openstack contains no less than four classes with dashes already.
Unifying them is worthwhile for a subsequent cleanup, but doing it partway or mashing it in with this one patch is bad mojo imo [18:34:08] https://www.mediawiki.org/wiki/Special:RecentChanges [18:34:08] paravoid: have found out something about my slow connection to wmf infrastructure? It's about 8 kB/s, it's terrible :( [18:34:22] Coren: well, this is a new class, and we should’t perpetrate the -shes [18:34:24] consistent with the error logs [18:34:30] filed a bug [18:35:14] bblack: true. A ban on the exact content-length, then? [18:35:27] 6operations, 10Continuous-Integration, 10Incident-20150312-whitespace, 6MediaWiki-Core-Team: add a check for whitespace before leading <?php 3High [18:35:33] YuviPanda: Hm. When was dashness decided against? [18:35:41] * aude does not have flow installed [18:35:56] aude: I can't reproduce [18:36:05] really [18:36:13] it's up for me too, but then again could be I'm hitting eqiad and you are not? [18:36:18] I tried esams too [18:36:30] PROBLEM - HHVM busy threads on mw1120 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [18:36:32] it may be specific to your login or a cookie? [18:36:42] stuff like https://www.mediawiki.org/wiki/MediaWiki_Developer_Summit_2015 is ok [18:36:43] or to a specific cache endpoint... [18:36:49] i am logged in [18:37:20] hm, now it works in firefox but not logged in [18:37:36] Coren: http://projects.puppetlabs.com/projects/puppet/wiki/Allowed_Characters_in_Identifier_Names and others [18:37:38] 6operations, 10Continuous-Integration, 10Incident-20150312-whitespace, 6MediaWiki-Core-Team: add a check for whitespace before leading <?php >>! In T92531#1114010, @ori wrote: >>>! In T92531#1113969, @Krinkle wrote: >> See {T46875}. >> >> This...
[18:37:39] and broken [18:37:41] RECOVERY - HHVM busy threads on mw1120 is OK: OK: Less than 30.00% above the threshold [57.6] [18:37:41] when logged in [18:37:46] 6operations: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1114033 (10Jgreen) Victoria asked for www.store.wikipedia.org in 1:1 conversation before opening the ticket, we need to add that CNAME too. Also does anyone know the history behind the decision to route some of the h... [18:37:48] (it’s old, but still) [18:37:53] plus rest of our code uses dashes [18:38:30] mlitn: [19:32] aude 2Fatal error: clone called on non-object in /srv/mediawiki/php-1.25wmf21/extensions/Flow/includes/Formatter/RecentChanges.php on line 181 [18:38:42] Coren: and see https://docs.puppetlabs.com/puppet/latest/reference/lang_reserved.html#acceptable-characters-in-names [18:39:44] * aude heads home and for food [18:41:55] (03CR) 10Alexandros Kosiaris: [C: 032] "And you are absolutely right. I should have seen it. Merging" [puppet] - 10https://gerrit.wikimedia.org/r/196229 (owner: 10Mobrovac) [18:42:11] PROBLEM - HHVM busy threads on mw1120 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [18:42:40] 6operations, 10MediaWiki-File-management, 6Multimedia: http://commons.wikimedia.org/wiki/File:Trie_example.svg causes 503 Service Unavailable - https://phabricator.wikimedia.org/T92529#1114067 (10Tgr) The deeper problem is that thumb.php returns an error message with a HTTP 200 status if there is output befo... [18:43:20] RECOVERY - HHVM busy threads on mw1120 is OK: OK: Less than 30.00% above the threshold [57.6] [18:43:36] guys, does the class passwords::puppet::database sound familiar to you? [18:43:48] is it a puppet class from the private repo perhaps? [18:44:28] mobrovac: lemme check [18:45:15] mobrovac: yes, it's in the private repo [18:45:50] YuviPanda: Where "the rest" is defined as "some fraction". 
I count no less than 47 classes with dashes in 'em -- but okay, I'll fix this one. [18:45:50] hm, so we're trying to test a patch for the cassandra submodule on cerium [18:46:06] YuviPanda: Fixing the other 46 is left as an exercise to the reader. :-P [18:46:10] it holds the production db pass [18:46:15] any known way of circumventing that class? [18:46:19] urandom: ^^ [18:46:25] if this were a test in labs the way to go would be to add it to labs/private with a fake value [18:47:12] tgr: I've banned the upload caches on content-length==52 [18:47:20] seems to work for the test example with the Trie image [18:47:29] Coren: the language reference I linked to says ‘lowercase letters, numbers, underscores’ :) [18:47:38] Coren: also says 'Note: In some cases, names containing unsupported characters will still work. These cases should be considered bugs, and may cease to work at any time. Removal of these bug cases will not be limited to major releases.' [18:47:43] so no, this is not ‘tyranny of the majority’ :) [18:47:47] cam anyone confirm/provide any other test case for borked images? [18:47:49] you can call it stupidity of the puppet, if you want. [18:48:58] YuviPanda: It contradicts the first document you pointed to as well. So yeah. Just silly puppet. [18:49:01] mobrovac: how are you testing? local puppetmaster? puppet apply? [18:49:11] Coren: yeah, the first one is outdated. [18:49:13] puppet apply [18:49:25] Coren: but ‘dashes work now but that is a bug’ is what _joe_ told me as well, so that matches. [18:49:29] (03PS6) 10coren: Labs: puppetize replica-addusers [puppet] - 10https://gerrit.wikimedia.org/r/135445 [18:49:35] but --noop is not enough, we want to see the stuff applied [18:49:51] mobrovac: could you just set $puppet_production_db_pass = 'snakeoil' ? 
[18:49:57] that's all it does [18:50:16] 6operations, 10MediaWiki-File-management, 6Multimedia: http://commons.wikimedia.org/wiki/File:Trie_example.svg causes 503 Service Unavailable - https://phabricator.wikimedia.org/T92529#1114106 (10Tgr) p:5Triage>3High [18:50:17] ah ok, will try adding that [18:50:19] thnx mutante [18:50:22] yw [18:50:33] urandom: ^^ [18:52:09] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia: http://commons.wikimedia.org/wiki/File:Trie_example.svg causes 503 Service Unavailable - https://phabricator.wikimedia.org/T92529#1114112 (10Steinsplitter) [18:52:39] (03CR) 10Yuvipanda: [C: 031] "I also just realized that there's no docs on what exactly this does, so a bunch of lines about that (docs on the puppet class, maybe?) see" [puppet] - 10https://gerrit.wikimedia.org/r/135445 (owner: 10coren) [18:53:09] (03CR) 10Yuvipanda: "(+1 on perl + puppetized > perl + unpuppetized)" [puppet] - 10https://gerrit.wikimedia.org/r/135445 (owner: 10coren) [18:53:13] YuviPanda: Not entirely surprising, this adds users to the replicas. :-) But yeah. [18:53:40] (03CR) 10Alexandros Kosiaris: [C: 032] puppet: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183822 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [18:53:50] Coren: true, but what users? which replicas? why is it on the NFS servers? etc, etc [18:53:51] Krinkle: ping re images, were there known cases aside from http://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Trie_example.svg/213px-Trie_example.svg.png ? [18:54:55] Labs users, the db replicats, and it's there because that's the only place from which you can write the replica.my.cnf to all projects. But yes, that could bear being written down. [18:55:03] yup :) [18:55:43] (03CR) 10coren: [C: 032] "Puppetize status-quo." [puppet] - 10https://gerrit.wikimedia.org/r/135445 (owner: 10coren) [18:56:12] bblack: I dont know. Maybe there's logs? 
[18:56:14] Coren: can you document the answerse you gave me as well? [18:56:25] bblack: I imagine any image that was a cache miss during the outage. [18:56:40] * YuviPanda goes to sleep. [18:56:41] Probably at least a couple 100 [18:57:02] bblack: How was http://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Trie_example.svg/213px-Trie_example.svg.png fixed? [18:57:10] YuviPanda|zzz: Will do so in a further patch. [18:57:18] Coren: cool :) [18:57:51] PROBLEM - HHVM busy threads on mw1120 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [86.4] [18:58:32] Krinkle: I cleared from the upload caches all files with size==52 (which was the length of that "Header sent" or whatever content in the Trie example) [18:58:50] tgr: since you're here [18:59:00] bblack: Interesting. [18:59:02] PROBLEM - puppet last run on labstore1001 is CRITICAL: CRITICAL: Puppet has 1 failures [18:59:04] bblack: How does that work? [18:59:15] do you mean how does it work internally in varnish? [18:59:34] bblack: Yeah, how did you find those and purge them. [18:59:42] it basically filters the existing cache objects, if their size==52 and they existing before the ban was implemented they're ignored [19:00:10] using varnishadm like this: https://www.varnish-cache.org/docs/3.0/tutorial/purging.html#bans [19:00:17] but on obj.http.content-length==52 [19:00:31] Krinkle: Yeah it's not about going out to find them as much as it is about checking at use time [19:00:42] (03PS1) 10coren: replica-addusers: moar documentation [puppet] - 10https://gerrit.wikimedia.org/r/196300 [19:00:43] 10Ops-Access-Requests, 6operations: Give CCogdill an account on barium - https://phabricator.wikimedia.org/T92533#1114168 (10Jgreen) I got some additional background info from Adam, and I can't support putting regular users on a production system to intervene on broken tools. I think we should be able to come... 
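bblack's explanation above (filter the existing cache objects, ignoring any whose size matches and which existed before the ban was implemented, per the linked Varnish bans tutorial) can be illustrated with a toy model. This is a sketch of the lazy-ban semantics only, not Varnish internals, and the exact `varnishadm` ban expression used (presumably something along the lines of `ban obj.http.content-length == 52`) is an assumption based on bblack's description:

```python
# Toy model of the lazy "ban" semantics described above, NOT Varnish's real
# implementation: a ban is recorded with a timestamp and evaluated at lookup
# time; only objects cached *before* the ban was added are discarded.
class ToyCache:
    def __init__(self):
        self.clock = 0     # logical clock to order stores and bans
        self.objects = {}  # url -> (stored_at, body)
        self.bans = []     # list of (added_at, predicate)

    def tick(self):
        self.clock += 1
        return self.clock

    def store(self, url, body):
        self.objects[url] = (self.tick(), body)

    def ban(self, predicate):
        # predicate inspects the cached body, e.g. content-length == 52
        self.bans.append((self.tick(), predicate))

    def lookup(self, url):
        if url not in self.objects:
            return None                # cache miss
        stored_at, body = self.objects[url]
        for added_at, pred in self.bans:
            if added_at > stored_at and pred(body):
                del self.objects[url]  # discard, proceed as a cache miss
                return None
        return body


cache = ToyCache()
cache.store('/213px-Trie_example.svg.png', b'x' * 52)  # cached error body
cache.ban(lambda body: len(body) == 52)                # ban added afterwards
print(cache.lookup('/213px-Trie_example.svg.png'))     # None: object banned
cache.store('/213px-Trie_example.svg.png', b'y' * 52)  # re-stored after ban
print(cache.lookup('/213px-Trie_example.svg.png'))     # kept: newer than ban
```

This is why a ban is cheap to issue ("not about going out to find them as much as checking at use time"): nothing is walked or deleted up front, and the last store above survives even though its length matches, because it postdates the ban.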
[19:00:48] I don't know for sure that all affected images had that exact same error/length though [19:01:10] "I'm about to use this cached object, but it satisfies this ban rule and it existed before the rule was put in place, so I'll ignore/discard this cached object and proceed as if this was a cache miss" [19:01:20] right [19:01:38] tgr: we got two reports for https://phabricator.wikimedia.org/T90599 today [19:01:48] (03CR) 10Alexandros Kosiaris: [C: 032] openstack: transition nova to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183819 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [19:02:44] 10Ops-Access-Requests, 6operations: Give CCogdill an account on barium - https://phabricator.wikimedia.org/T92533#1114181 (10CCogdill_WMF) @jgreen I'm totally happy with a workaround. I'll work with @awight to see if there are any other options. [19:02:50] (03CR) 10coren: [C: 032] "Tested and working." [software] - 10https://gerrit.wikimedia.org/r/196272 (https://phabricator.wikimedia.org/T90084) (owner: 10coren) [19:03:21] paravoid: yeah, that's an ongoing bug [19:03:59] probably just the standard issue of scaling of large files timing out, except something somewhere in the stashed upload call chain makes timeout limits more aggressive [19:04:36] (03CR) 10BBlack: [C: 031] Adding dhcp entries for cp1071-74 with jesse installer [puppet] - 10https://gerrit.wikimedia.org/r/196258 (owner: 10Cmjohnson) [19:05:06] (03PS2) 10Cmjohnson: Adding dhcp entries for cp1071-74 with jesse installer [puppet] - 10https://gerrit.wikimedia.org/r/196258 [19:05:39] (03PS1) 10Rush: Revert "sessions: temporarily disable mc1014" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196302 [19:06:37] (03CR) 10Cmjohnson: [C: 032] Adding dhcp entries for cp1071-74 with jesse installer [puppet] - 10https://gerrit.wikimedia.org/r/196258 (owner: 10Cmjohnson) [19:07:52] (03CR) 10Alexandros Kosiaris: "Cleaned up on virt10*" [puppet] - 
10https://gerrit.wikimedia.org/r/183819 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [19:08:11] RECOVERY - HHVM busy threads on mw1120 is OK: OK: Less than 30.00% above the threshold [57.6] [19:09:17] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia: http://commons.wikimedia.org/wiki/File:Trie_example.svg causes 503 Service Unavailable - https://phabricator.wikimedia.org/T92529#1114205 (10Tgr) 5Open>3Resolved a:3Tgr @BBlack fixed this by banning all files with content-length==52. Fol... [19:10:50] PROBLEM - puppet last run on labstore1002 is CRITICAL: CRITICAL: Puppet has 1 failures [19:11:33] (03CR) 10Alexandros Kosiaris: [C: 032] authdns: transition to ssh::userkey [puppet] - 10https://gerrit.wikimedia.org/r/183821 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [19:11:50] 6operations, 6Commons, 10Incident-20150312-whitespace, 10MediaWiki-File-management, 6Multimedia: http://commons.wikimedia.org/wiki/File:Trie_example.svg causes 503 Service Unavailable - https://phabricator.wikimedia.org/T92529#1114218 (10Se4598) [19:12:07] (03PS1) 10Legoktm: check_php_syntax: Check for any content before opening <?php 6operations, 6Commons, 10Incident-20150312-whitespace, 10MediaWiki-File-management, 6Multimedia: http://commons.wikimedia.org/wiki/File:Trie_example.svg causes 503 Service Unavailable - https://phabricator.wikimedia.org/T92529#1114224 (10Tgr) (I wonder how that content length is calculated? `\nHeaders al... [19:13:14] greg-g, add Incident-20150312-whitespace also to https://phabricator.wikimedia.org/T92545 ?
[19:13:25] (03CR) 10Legoktm: "Untested besides running "tox"" [tools/scap] - 10https://gerrit.wikimedia.org/r/196306 (https://phabricator.wikimedia.org/T92534) (owner: 10Legoktm) [19:14:39] (03CR) 10Alexandros Kosiaris: "cleaned up on baham,iridium,eeden" [puppet] - 10https://gerrit.wikimedia.org/r/183821 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [19:14:51] (03CR) 10Catrope: [C: 031] Revert "sessions: temporarily disable mc1014" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196302 (owner: 10Rush) [19:15:06] se4598: looks like tgr did :) [19:15:57] 6operations, 10Citoid, 10VisualEditor, 3VisualEditor 2014/15 Q3 blockers: Improve citoid production service - https://phabricator.wikimedia.org/T90281#1114251 (10mobrovac) [19:15:58] 6operations, 10Citoid, 5Patch-For-Review: Configure citoid to use the new zotero service - https://phabricator.wikimedia.org/T89873#1114248 (10mobrovac) 5Open>3Resolved a:3mobrovac Fixed with https://gerrit.wikimedia.org/r/#/c/195896/ , resolving. 
[19:16:02] (03CR) 10Dzahn: [C: 032] depool cp1066 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196254 (owner: 10Dzahn) [19:16:48] 6operations, 7domains: add support for wikimedia.xyz - https://phabricator.wikimedia.org/T92547#1114259 (10RobH) 3NEW [19:16:51] 6operations, 10Citoid, 10VisualEditor, 3VisualEditor 2014/15 Q3 blockers: Improve citoid production service - https://phabricator.wikimedia.org/T90281#1055318 (10mobrovac) [19:16:55] 6operations, 10Citoid, 5Patch-For-Review, 3VisualEditor 2014/15 Q3 blockers: Configure citoid to use outbound proxy - https://phabricator.wikimedia.org/T89875#1114266 (10mobrovac) 5Open>3Resolved [19:17:45] (03CR) 10Alexandros Kosiaris: "cleaned up on strontium, palladium, virt1000" [puppet] - 10https://gerrit.wikimedia.org/r/183822 (https://phabricator.wikimedia.org/T92475) (owner: 10Faidon Liambotis) [19:18:25] 6operations, 7network: Very slow connection to wmf engineering infrastructure - https://phabricator.wikimedia.org/T92548#1114279 (10Florian) 3NEW [19:19:39] (03CR) 10Rush: [C: 032] "seriously hope this works out :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196302 (owner: 10Rush) [19:20:54] !log rush Synchronized wmf-config/session.php: re-reenable mc1014 (duration: 00m 06s) [19:21:00] Logged the message, Master [19:21:18] 6operations, 7domains: add support for wikimedia.xyz - https://phabricator.wikimedia.org/T92547#1114290 (10Slaporte) @robh: Please point to wikimedia.org. 
[19:22:56] (03PS5) 10Andrew Bogott: Roughed in designate class [puppet] - 10https://gerrit.wikimedia.org/r/191471 [19:23:01] 6operations, 6Labs, 10hardware-requests: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1114295 (10RobH) [19:23:03] (03CR) 10Legoktm: check_php_syntax: Check for any content before opening <?php 6operations, 6Labs, 10hardware-requests: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1114296 (10RobH) p:5High>3Normal [19:23:28] (03CR) 10GWicke: Don't include a node in its own seeds (032 comments) [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/195483 (https://phabricator.wikimedia.org/T91617) (owner: 10GWicke) [19:24:18] 6operations, 6Labs, 10hardware-requests: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1114303 (10RobH) a:5RobH>3Andrew @Andrew, System OS installed and awaiting service implementation. puppet/salt keys have NOT been accepted at this time. [19:24:25] 6operations, 6Labs: setup / deploy holmium as designate server - https://phabricator.wikimedia.org/T92507#1114306 (10RobH) [19:25:10] andrewbogott: ^ holmium is all yours for designate [19:25:21] robh: thank you!
[19:25:31] quite welcome [19:25:38] (03PS2) 10Legoktm: check_php_syntax: Check for any content before opening <?php (03PS1) 10BBlack: configure (but do not pool) cp107[1-4] for upload [puppet] - 10https://gerrit.wikimedia.org/r/196310 [19:30:14] (03PS6) 10GWicke: Don't include a node in its own seeds [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/195483 (https://phabricator.wikimedia.org/T91617) [19:30:23] (03CR) 10BBlack: [C: 032 V: 032] configure (but do not pool) cp107[1-4] for upload [puppet] - 10https://gerrit.wikimedia.org/r/196310 (owner: 10BBlack) [19:30:41] PROBLEM - puppet last run on mw1051 is CRITICAL: CRITICAL: puppet fail [19:32:48] (03PS1) 10RobH: adding wikimedia.xyz domain support [dns] - 10https://gerrit.wikimedia.org/r/196312 [19:33:29] 6operations, 7domains: add support for wikimedia.xyz - https://phabricator.wikimedia.org/T92547#1114363 (10RobH) dns https://gerrit.wikimedia.org/r/#/ [19:35:42] bblack: it's interesting that backends have such low hit (and request) rates [19:36:03] yes, it is [19:36:31] PROBLEM - HHVM busy threads on mw1120 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [86.4] [19:37:50] ori: ^ is that the second time in a short time mw1120 has done this?
[19:38:59] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review, 7user-notice: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1114397 (10gpaumier) [19:41:15] (03PS1) 10RobH: adding support to redirect wikimedia.xyz to wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/196321 [19:41:23] !log restarting hhvm on mw1120 [19:41:23] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.471 second response time [19:41:23] RECOVERY - HHVM rendering on mw1120 is OK: HTTP OK: HTTP/1.1 200 OK - 66892 bytes in 4.661 second response time [19:41:28] Logged the message, Master [19:41:58] 6operations, 5Patch-For-Review, 7domains: add support for wikimedia.xyz - https://phabricator.wikimedia.org/T92547#1114415 (10RobH) [19:42:00] 6operations, 6Multimedia, 7HHVM, 5Patch-For-Review, and 2 others: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1114416 (10greg) [19:42:32] 6operations, 10ops-eqiad: mc1014 server has been flaking out and dropping connectivity - https://phabricator.wikimedia.org/T91773#1114420 (10chasemp) 5Open>3Resolved resolved ever so painfully [19:43:21] RECOVERY - HHVM busy threads on mw1120 is OK: OK: Less than 30.00% above the threshold [57.6] [19:45:37] 6operations, 10OTRS, 6Security: Make OTRS sessions IP-address-agnostic - https://phabricator.wikimedia.org/T87217#1114431 (10Aschmidt) Today, I have got the advice to set network.dns.disableIPv6 in Firefox about:config to "true". I have tested this, and I have not been logged out for about half an hour. It... 
[19:48:47] 6operations, 5Patch-For-Review: Make puppet the sole manager of user keys - https://phabricator.wikimedia.org/T92475#1114444 (10RobH) p:5Triage>3High [19:49:51] RECOVERY - puppet last run on mw1051 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [19:49:58] 6operations, 10Continuous-Integration, 10Incident-20150312-whitespace, 6MediaWiki-Core-Team: add a check for whitespace before leading <?php RECOVERY - HHVM queue size on mw1120 is OK: OK: Less than 30.00% above the threshold [10.0] [19:50:39] 10Ops-Access-Requests, 6operations: Give CCogdill an account on barium - https://phabricator.wikimedia.org/T92533#1114466 (10RobH) 5Open>3declined I'm putting the status of this to declined, since the access request has been denied. [19:51:32] 6operations, 10ops-esams: cp3011 hardware fault - https://phabricator.wikimedia.org/T92306#1114471 (10RobH) [19:52:35] 6operations, 10Wikimedia-General-or-Unknown, 7Regression: svn.wikimedia.org security certificate expired - https://phabricator.wikimedia.org/T88731#1114473 (10RobH) [19:53:11] 6operations, 10ops-ulsfo: cp4009 hardware fault - https://phabricator.wikimedia.org/T92476#1114486 (10RobH) [19:57:55] (03CR) 10Alexandros Kosiaris: Don't include a node in its own seeds (032 comments) [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/195483 (https://phabricator.wikimedia.org/T91617) (owner: 10GWicke) [19:58:54] (03PS1) 10BBlack: pool cp107[12] backends [puppet] - 10https://gerrit.wikimedia.org/r/196326 [19:58:56] (03PS1) 10BBlack: pool cp107[34] backends [puppet] - 10https://gerrit.wikimedia.org/r/196327 [19:59:09] (03CR) 10BBlack: [C: 032 V: 032] pool cp107[12] backends [puppet] - 10https://gerrit.wikimedia.org/r/196326 (owner: 10BBlack) [20:03:51] (03PS1) 10Dzahn: Revert "rbf2001: use eth2 MAC for DHCP" [puppet] - 10https://gerrit.wikimedia.org/r/196328 [20:04:48] (03CR) 10BryanDavis: check_php_syntax: Check for any content before opening <?php (03PS2) 10Dzahn: Revert "rbf2001: use eth2 MAC
for DHCP" [puppet] - 10https://gerrit.wikimedia.org/r/196328 [20:05:38] (03CR) 10Dzahn: [C: 032] "eth2 and eth3 appear to be switched around here, while on rdf2004 2 is actually eth2 and 3 is actually eth3" [puppet] - 10https://gerrit.wikimedia.org/r/196328 (owner: 10Dzahn) [20:05:51] mutante: no don't do that [20:05:57] there's a workaround, to pass "debug" [20:06:15] that may work and give you eth0/eth1, although _joe_ said that it didn't work for him today [20:06:24] I'm trying to work on a more permanent fix [20:07:00] paravoid: yes, i was just going to revert that attempt though and go back to the lower MAC address [20:07:55] to undo what i changed [20:08:00] nod [20:08:08] (03CR) 10BBlack: [C: 032] pool cp107[34] backends [puppet] - 10https://gerrit.wikimedia.org/r/196327 (owner: 10BBlack) [20:09:22] that's amazing [20:09:29] from racking them up to pooling them in what, 2-3 hours? [20:09:38] yup :) [20:09:55] good times [20:09:55] it helps when all the netboot/disk/etc stuff has been refactored and sorted out to work generically [20:10:07] (and puppet works on the first try always and finishes everything) [20:10:13] yup! [20:10:14] :) [20:11:10] new hardware config and nothing manual happened. I just logged in to look afterwards and verify/stare [20:11:47] how many precises left? :) [20:11:54] like 9 [20:11:56] (03PS1) 10Cmjohnson: Adding dns entries for cp3030-3053 include mgmt and ipv6 [dns] - 10https://gerrit.wikimedia.org/r/196332 [20:12:00] hehe [20:12:05] incl. random clusters? [20:12:07] misc/parsoid/etc.? 
[20:12:12] I could wipe them all out today, except for the esams/ulsfo uploads needing spacing (that's most of what's left) [20:12:27] no, I've only been doing/counting bits/text/upload/mobile so far [20:12:28] bblack: looking forward to SPDY for restbase [20:12:32] when that's good, I'll get the rest [20:13:10] that's what, another 4 :) [20:13:15] yeah something like that [20:13:47] was really happy last night when I saw my browser using spdy for enwiki and bits [20:14:05] great work! [20:14:15] although it brings you very little atm [20:14:23] mutante has cp1066 in-progress (last eqiad text), there's 1x ulsfo text I'll do today, and then there's still 3x esams-upload and 4x ulsfo-upload to go [20:14:33] you still need 3 connections/RTTs [20:14:47] err I meant to say 3x esams-upload and 4x eqiad-upload [20:14:47] yeah, but it's a good step forward [20:15:21] once we're past the new hardware setup for this, I'm gonna start on making a raft of tickets about various related fixups [20:15:31] (e.g. cache hitrate related things, killing bits., etc) [20:15:38] and lets us design new APIs / IP layouts with this in mind [20:15:51] 6operations: Delete gadolinium:/a/log/fundraising/ - https://phabricator.wikimedia.org/T92336#1114572 (10atgo) Hey @jgreen could you confirm that we have this covered elsewhere? [20:15:53] 6operations, 10RESTBase, 10RESTBase-Cassandra: move cassandra submodule into puppet repo - https://phabricator.wikimedia.org/T92560#1114574 (10Eevans) 3NEW [20:16:21] (03PS2) 10Cmjohnson: Adding dns entries for cp3030-3053 include mgmt and ipv6 [dns] - 10https://gerrit.wikimedia.org/r/196332 [20:16:37] 6operations: Delete stat1002:/a/squid/archive/bannerImpressions - https://phabricator.wikimedia.org/T92330#1114581 (10atgo) Hey @jgreen @ellery could you look at this and confirm that we don't need it? [20:16:50] does it? [20:16:57] besides browsers, how's the client support for spdy?
[20:17:03] I don't think libcurl supports it [20:17:18] cmjohnson1: oh, I forgot that amssq clash with them numerically [20:17:20] so for APIs consumed by MW or bots etc. it doesn't provide you much [20:18:13] bblack: do you wanna go back to starting at cp3023? [20:18:14] so much for my careful 1:1 mapping of cp30xx IP last-octets [20:18:28] no, it doesn't matter in practice, I just liked it [20:18:48] e.g. cp3030 could've been 10.20.0.130 and such, but amssq are already on those IPs [20:18:53] 6operations, 10ops-codfw: setup and deploy mw2135 through mw2215 - https://phabricator.wikimedia.org/T86806#1114584 (10Papaul) @Joe Redirection after boot for mw2135-mw2214 changed from disable to enable. Please let me know if there is anything else. i will start working on mw2001-mw2134 tomorrow. Thanks [20:19:15] 6operations, 10ops-esams: Rack, cable, prepare cp3030-3053 - https://phabricator.wikimedia.org/T92514#1114587 (10Cmjohnson) [20:19:17] !log cp1066 - comment in pybal, reinstall [20:19:20] (03PS1) 10Eevans: move cassandra submodule into puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/196335 (https://phabricator.wikimedia.org/T92560) [20:19:22] Logged the message, Master [20:19:24] gwicke: so not sure how it would affect your api design [20:20:13] (03CR) 10Cmjohnson: [C: 032] Adding dns entries for cp3030-3053 include mgmt and ipv6 [dns] - 10https://gerrit.wikimedia.org/r/196332 (owner: 10Cmjohnson) [20:20:53] paravoid: with http/1 you have to heavily optimize for least number of requests [20:20:59] see above [20:21:15] what's your target audience for APIs? 
[20:21:37] for MW core it doesn't matter, as the latencies are low anyway [20:21:40] MW as a consumer won't support it anytime soon, neither will independent tool authors [20:21:49] 6operations, 10ops-esams: Rack, cable, prepare cp3030-3053 - https://phabricator.wikimedia.org/T92514#1114594 (10Cmjohnson) The following mgmt ip's have been assigned cp3030 1H IN A 10.21.0.151 cp3031 1H IN A 10.21.0.152 cp3032 1H IN A 10.21.0.153 cp3033 1H IN A... [20:21:57] it primarily matters for browsers and apps [20:22:31] exactly, and APIs are useful for a third category of tools/bots [20:22:51] those are gaining spdy support as well [20:23:09] libcurl's isn't even coded yet, so I'd expect at least 3-4 years before it becomes mainstream [20:23:31] the node request module for example already pulls in spdy support [20:23:45] not sure if it uses it by default yet [20:23:47] but I guess speed there matters less since there's usually no UX involved [20:24:03] yeah, on clients you can use higher concurrency [20:24:28] browsers normally use six parallel connections per host [20:24:38] depends on the browser [20:24:43] once you exceed that parallelism, you get head-of-line blocking [20:25:01] IIRC six is used by FF and Chrome [20:25:02] you're confused [20:25:11] HOL blocking has nothing to do with domain sharding [20:25:56] I didn't say that it had [20:26:04] you might be confused ;) [20:26:10] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471#1114605 (10Eevans) [20:26:19] 22:24 < gwicke> once you exceed that parallelism, you get head-of-line blocking [20:26:25] that's a false statement [20:26:32] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471#1111859 (10Eevans) [20:26:53] paravoid: without pipelining clients send one request at a time [20:27:08] 6operations, 10Citoid, 
10VisualEditor, 3VisualEditor 2014/15 Q3 blockers: Improve citoid production service - https://phabricator.wikimedia.org/T90281#1114609 (10Jdforrester-WMF) [20:27:09] 6operations, 10Citoid, 5Patch-For-Review: Update the citoid/deploy branch to not contain zotero deploy - https://phabricator.wikimedia.org/T89872#1114608 (10Jdforrester-WMF) 5Open>3Resolved [20:27:09] which means that following requests are blocked by the head-of-line [20:27:17] without pipelining and without parallelism, yes [20:27:22] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: move cassandra submodule into puppet repo - https://phabricator.wikimedia.org/T92560#1114611 (10Eevans) [20:27:22] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471#1111859 (10Eevans) [20:27:53] paravoid: once you exceed six parallel requests but only have six connections, that's the situation you are in [20:28:15] some of those requests end up being blocked & delayed [20:28:53] what are you talking about [20:29:30] 6 is a magic value that's configurable in most browsers but generally set to 6 [20:29:34] max_connections_per_domain or something [20:29:35] with one connection and no pipelining you would get hol blocked because of interlinked resources (html referring to a stylesheet etc.) [20:30:08] browsers generally don't pipeline because of proxy issues [20:30:21] parallelism exists in browsers to fix this, trying to keep those e.g.
6 connections busy [20:30:28] with spdy they do [20:30:30] Dependent resources isn't really HOL blocking [20:30:42] *nod* [20:30:54] we are talking about the situation where you have more than six resources to retrieve [20:30:58] HOL blocking means you have something that's ready to go but is blocked because the thing at the head of the line is not ready to go [20:30:59] in parallel [20:31:17] I'm well aware of what HOL blocking is [20:31:40] it's a very common concept in networking, and by far a larger problem with TCP than HTTP [20:31:50] s/with/in/ [20:35:20] I think what gwicke's saying makes sense. if a page loads 24 items and there's 6 connections, things are HOL-blocked [20:35:37] what would be the HOL here? [20:35:50] only one item can be fetched for each of those 6 connections at a time [20:36:07] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "LGTM, and I do not second ori's comment (interpolating the password exposes it while the process runs). Just a couple of nitpicks." (032 comments) [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/196133 (https://phabricator.wikimedia.org/T92471) (owner: 10Eevans) [20:36:10] so if 24 urls are requested from http://foo in a given page, 3/4 of them are queued behind the first 6 [20:36:11] Assuming all 24 items could theoretically be fetched in parallel if you had 24 connections [20:36:25] sure, but queueing != HOL blocked [20:37:00] OK that's true, that's also not HOL blocking [20:37:02] bblack: equivalent to the # Jessie comment are also the icons on the host group overview :) : https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?hostgroup=cache_text_eqiad&style=overview&nostatusheader [20:37:06] what is the thing at the *head* that is not there yet? 
[20:37:07] What you said (dependent resources) isn't either though [20:37:22] paravoid: the response from the previous request [20:37:26] the first request on each connection [20:37:49] https://en.wikipedia.org/wiki/Head-of-line_blocking first summary sentence even mentions HTTP pipelining as an example [20:38:17] OK I see what they mean re HTTP pipelining [20:38:35] In HTTP/2, requests and responses can be interleaved [20:38:40] (03CR) 10Ottomata: "NoooOoooo then we can't use this in vagrant!" [puppet] - 10https://gerrit.wikimedia.org/r/196335 (https://phabricator.wikimedia.org/T92560) (owner: 10Eevans) [20:38:47] So once you have sent request 1, you can then send request 2 on the same connection [20:38:48] the client can fire off 4x requests to each of the 6x connections immediately, but responses 1-4 on each connection come back serially [20:39:16] Whereas in HTTP/1.1, even when pipelining, request 2 (or 7, if you have 6 conns) is blocked on the response for request 1 [20:39:55] *nod*, without pipelining even sending the second request is blocked on the response [20:39:55] well in http/2 it's not pipelining, it's true multiplexing [20:40:17] you can do several independent transactions on one spdy/http2 conn and none depend on the others [20:40:40] you can do server push, but it'd be years before we're able to :) [20:40:42] and by using https only they side-step the proxy problem [20:40:52] which proxy problem? [20:41:01] proxies messing up pipelining [20:41:19] that's the reason why all desktop browsers default to no pipelining [20:41:21] proxies wouldn't be able to talk http/2 anyway, it's a different protocol [20:41:25] (03PS1) 10Dzahn: repool cp1066 after jessie reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196411 [20:41:38] paravoid: HTTP 1.1 pipelining [20:41:57] I'm actually curious if that's also true for HTTPS [20:41:58] wait, I'm confused [20:42:00] who's "they"?
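The arithmetic being argued over above (24 resources, 6 connections, no pipelining) can be sketched as a toy latency model. This is an illustrative Python sketch under idealized assumptions (every response takes exactly one RTT, bandwidth and slow start ignored), not a browser simulation:

```python
import math

def http1_fetch_rtts(n_resources, n_connections=6):
    # HTTP/1.1 without pipelining: each connection carries one
    # request/response at a time, so requests queue in rounds
    # behind the request at the head of each connection.
    return math.ceil(n_resources / n_connections)

def http2_fetch_rtts(n_resources):
    # HTTP/2/SPDY multiplexing: all streams interleave on a single
    # connection, so (ignoring bandwidth) they complete together.
    return 1
```

With 24 images over the usual 6 connections this gives 4 serial rounds versus 1: three quarters of the requests wait in a queue, which is the delay being discussed whether or not one labels it head-of-line blocking.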
[20:42:53] the designers of spdy / http/2 [20:43:25] existing proxies would not be able to understand http/2 anyway (pipelining is the least of the problems) [20:43:46] and multiplexing is so essential to how spdy/http2 is designed that it'd be highly improbable for an http2 proxy to break it [20:44:04] (03CR) 10Dzahn: [C: 032] repool cp1066 after jessie reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196411 (owner: 10Dzahn) [20:44:32] so, no, this is not why spdy defaults to a TLS transport [20:44:43] I didn't say it was [20:44:46] ;) [20:44:59] but I did say that it helps to avoid the issues of HTTP proxies interfering [20:45:01] no they do not side-step the proxy problem by using https only [20:45:15] (for pipelining, at least) [20:46:58] RoanKattouw: so, those additional requests would be for dependent resources, no? [20:47:23] Well, dependent on the HTML response yes [20:47:26] But not dependent on each other [20:47:44] You could have an HTML response that then triggers dozens of dependent resource requests all at the same time [20:48:11] (That is in fact what happens, especially on pages with a lot of images) [20:48:34] (03CR) 10Ori.livneh: "You can just copy changes between the two repositories. 
It's not elegant, but it beats the pain of working with submodules in our current " [puppet] - 10https://gerrit.wikimedia.org/r/196335 (https://phabricator.wikimedia.org/T92560) (owner: 10Eevans) [20:48:37] (03PS1) 10BBlack: depool cp3006 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196414 [20:48:48] dozens, hardly :) [20:48:53] (03CR) 10BBlack: [C: 032 V: 032] depool cp3006 for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/196414 (owner: 10BBlack) [20:49:00] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 77 data above and 9 below the confidence bounds [20:49:01] PROBLEM - HTTP error ratio anomaly detection on graphite2001 is CRITICAL: CRITICAL: Anomaly detected: 77 data above and 9 below the confidence bounds [20:49:03] it's 4-6 connections per domain [20:49:04] !log repooled cp1066 in pybal - text varnishes in eqiad now 100% Debian [20:49:11] Logged the message, Master [20:49:15] but yeah, this indeed happens [20:49:22] (03PS1) 10GWicke: Roll out RESTBase updates & VE use to all phase0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196415 [20:49:40] these dependent resources are blocking the page being viewed [20:49:47] and all of these are blocked by the HTML being loaded [20:49:49] I think the anomaly alert just tends to trend in much later, and is responding to the 5m issue earlier, FWIW [20:49:59] Yeah [20:50:04] the HOL here is the HTML [20:50:11] Images don't block the page from being viewed, but CSS and JS do [20:50:19] SPDY solves this by utilizing server push [20:50:29] you request the HTML and you get back HTML+assets [20:50:35] if your stack supports that [20:50:36] and image downloads still often HOL block on css / scripts [20:50:51] as those are loaded first [20:50:54] that's what the whole SPDY-solves-HOL-blocking claim is [20:51:04] deviantart serves their images from a list of random subdomains that get translated in nginx rules etc to bypass the 4-6 limit 
fwiw [20:51:14] yeah, that's common [20:51:15] paravoid: even without server push HOL blocking is eliminated [20:51:28] but this can achieve the opposite result actually [20:51:40] TCP connections take a while to reach optimum speed (slow start etc.) [20:51:55] so "infinite" parallelization can be hurtful [20:52:08] are we talking about domain sharding for upload.wm.o? [20:52:12] I think it's 3x i.e. up to 18 [20:52:13] Yeah presumably that's part of why max_client_connections exists [20:52:15] sort of :) [20:52:15] that they do [20:52:19] 6operations, 3HTTPS-by-default, 5Patch-For-Review: Upgrade all HTTP frontends to Debian jessie - https://phabricator.wikimedia.org/T86648#1114695 (10Dzahn) text-eqiad are 100% complete [20:52:21] paravoid: and the answer to that is... QUIC? :) [20:52:32] bblack: google's answer is :) [20:52:36] to TCP HOL blocking too [20:52:43] single TCP connection [20:52:43] they want to do some evil stuff [20:52:54] where the browser renders page fragments as they arrive or something [20:53:05] or starts parsing at least [20:53:09] (03CR) 10Jforrester: [C: 031] Roll out RESTBase updates & VE use to all phase0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/196415 (owner: 10GWicke) [20:53:14] not everyone can benefit from pipelining, so domain sharding on upload.wm.o would be very useful [20:53:18] mutante: \o/ [20:53:18] paravoid: Ahm don't browsers already do that? [20:53:35] mutante: if you have time, can you continue serially on eqiad uploads? [20:53:40] ori: it would be worse for the majority, better for a minority [20:53:46] I mean in practice,
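paravoid's point that "infinite" parallelization can hurt because of TCP slow start can also be illustrated with a toy model. This is an idealized Python sketch (congestion window doubles every RTT, no loss, made-up initial window), not a real TCP implementation:

```python
def rtts_to_send(segments, initial_cwnd=10):
    # Idealized slow start: the congestion window doubles each
    # round trip until all segments have been sent.
    cwnd, sent, rtts = initial_cwnd, 0, 0
    while sent < segments:
        sent += cwnd
        cwnd *= 2
        rtts += 1
    return rtts
```

One warm connection carrying 60 segments finishes in 3 RTTs (10 + 20 + 40), while six freshly sharded connections each carrying 10 segments pay a full TCP (and possibly TLS) handshake apiece before their first RTT of data, which is why domain sharding can end up slower than fewer, warmer connections.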