[00:02:52] hm. deployment is working with grains, but it's slower for some reason… I wonder what the deal with that is
[00:03:17] I wonder if the timeout is too long
[00:03:17] notpeter_: thank you! i really appreciate it!
[00:03:40] ori-l: I tested with a force sync of fluorine, btw
[00:03:47] I'd imagine that's ok?
[00:04:10] notpeter_: for context, this was the default output of 'top' when i was trying to debug memory bloating in EventLogging: http://i.imgur.com/G0e5azP.png
[00:04:19] Ryan_Lane: yes, nothing is depending on it
[00:04:22] * Ryan_Lane nods
[00:04:40] notpeter_: so this will make my life a lot easier :)
[00:04:55] ori-l: oh god, that's terrible
[00:05:00] glad to help :)
[00:06:28] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[00:06:36] <^demon> !log dist-upgrading formey lol
[00:06:46] Logged the message, Master
[00:06:59] Ryan_Lane: I'm wrapping up for the day… I think things are working properly.
[00:07:44] andrewbogott: \o/
[00:07:46] andrewbogott: great work
[00:07:53] time will tell :/
[00:07:56] g'night
[00:07:57] one less place we need forwarded keys
[00:08:00] night
[00:11:34] <^demon> !log rebooting formey
[00:11:37] Logged the message, Master
[00:18:40] (PS4) Physikerwelt: Creating initial debianization [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75513 (owner: AzaToth)
[00:24:58] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 00:24:48 UTC 2013
[00:25:28] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[00:34:28] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 207 seconds
[00:35:28] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 16 seconds
[00:46:25] (CR) AzaToth: [C: -1] "(2 comments)" [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75513 (owner: AzaToth)
[00:55:01] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 00:54:51 UTC 2013
[00:55:21] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[01:00:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:01:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[01:11:20] (PS1) TTO: (bug 51803) set up flood flag for ckbwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75538
[01:23:31] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 202 seconds
[01:25:01] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 01:24:58 UTC 2013
[01:25:21] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[01:25:31] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 22 seconds
[01:28:31] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 202 seconds
[01:29:15] (PS1) TTO: (bug 49600) add Portal namespace for sowiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75540
[01:30:31] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 22 seconds
[01:48:25] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 194 seconds
[01:50:25] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 13 seconds
[01:54:45] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 01:54:36 UTC 2013
[01:55:25] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[01:56:47] (PS1) Jforrester: Hide the new 'visualeditor-betatempdisable' preference if in alpha [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75542
[01:56:49] (PS1) Jforrester: VisualEditor into beta on de/es/fr/he/it/pl/ru/svwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75543
[02:08:47] !log LocalisationUpdate completed (1.22wmf11) at Wed Jul 24 02:08:46 UTC 2013
[02:09:00] Logged the message, Master
[02:15:05] !log LocalisationUpdate completed (1.22wmf10) at Wed Jul 24 02:15:04 UTC 2013
[02:15:15] Logged the message, Master
[02:18:21] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 194 seconds
[02:20:21] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 13 seconds
[02:23:22] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 193 seconds
[02:24:21] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 21 seconds
[02:24:51] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 02:24:44 UTC 2013
[02:25:21] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[02:26:45] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Jul 24 02:26:44 UTC 2013
[02:26:56] Logged the message, Master
[02:39:23] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 195 seconds
[02:40:23] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 14 seconds
[02:40:53] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[02:54:43] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 02:54:38 UTC 2013
[02:55:23] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[03:08:42] (PS2) Jforrester: VisualEditor into beta on de/es/fr/he/it/pl/ru/svwiki for logged-in only [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75543
[03:18:26] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 194 seconds
[03:20:19] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 14 seconds
[03:23:19] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 194 seconds
[03:24:49] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 03:24:46 UTC 2013
[03:25:19] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 14 seconds
[03:25:29] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[03:31:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:32:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time
[03:33:19] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 194 seconds
[03:35:19] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 14 seconds
[03:35:40] interesting topic
[03:46:33] Who's on RT duty?
[03:47:58] I'm not sure; the topic was blank so I scanned my recent logs for the basic template
[03:48:20] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 194 seconds
[03:50:20] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 14 seconds
[03:50:46] According to https://wikitech.wikimedia.org/wiki/Interrupts_Rotation it's Ryan_Lane
[03:53:20] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 194 seconds
[03:54:50] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 03:54:42 UTC 2013
[03:55:20] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 14 seconds
[03:55:20] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[03:56:02] Damn, beaten.
[04:13:19] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 194 seconds
[04:13:52] yes, it's me
[04:14:06] Elsie: why? what's up?
[04:14:29] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer:
[04:15:19] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 14 seconds
[04:15:30] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[04:16:59] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours
[04:18:29] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer:
[04:19:19] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 225 seconds
[04:20:19] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 14 seconds
[04:21:31] Ryan_Lane: nothing; the topic was blanked so we reset it. Sorry for the ping.
[04:22:14] ah. no worries
[04:24:49] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 04:24:47 UTC 2013
[04:25:30] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[04:26:29] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[04:42:19] PROBLEM - Puppet freshness on gadolinium is CRITICAL: No successful Puppet run in the last 10 hours
[04:42:32] :-)
[04:42:45] I read the pages on wikitech.wikimedia.org. Yay documentation!
[04:54:49] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 04:54:46 UTC 2013
[04:55:29] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[05:13:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:14:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time
[05:19:56] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 182 seconds
[05:23:56] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 225 seconds
[05:24:56] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 05:24:50 UTC 2013
[05:25:26] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[05:25:56] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 4 seconds
[05:32:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:33:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time
[05:33:56] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 225 seconds
[05:36:56] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 24 seconds
[05:39:48] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 204 seconds
[05:43:48] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 225 seconds
[05:48:48] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 26 seconds
[05:53:48] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 225 seconds
[05:54:48] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 05:54:43 UTC 2013
[05:55:28] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[05:58:48] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 224 seconds
[06:00:48] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 20 seconds
[06:18:50] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 202 seconds
[06:20:50] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 24 seconds
[06:23:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:24:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[06:24:50] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 06:24:48 UTC 2013
[06:25:30] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[06:28:50] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 224 seconds
[06:34:50] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 217 seconds
[06:36:50] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 20 seconds
[06:55:43] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 06:55:34 UTC 2013
[06:56:23] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[07:02:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:03:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[07:24:54] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 07:24:52 UTC 2013
[07:25:24] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[07:48:32] goood morniiing
[07:48:49] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[07:48:49] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[07:48:49] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[07:48:49] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[07:48:49] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[07:48:50] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[07:48:50] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[07:53:49] PROBLEM - Puppet freshness on ms-fe1002 is CRITICAL: No successful Puppet run in the last 10 hours
[07:54:39] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 07:54:35 UTC 2013
[07:55:29] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[07:57:06] mark: morning. Following gitblit (serving git.wikimedia.org) issue yesterday: there is now an init script, jstack is fixed in puppet, puppet will ensure service is running AND the nasty web spiders have been blocked on apache frontend =)
[07:59:49] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: No successful Puppet run in the last 10 hours
[08:04:49] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: No successful Puppet run in the last 10 hours
[08:20:55] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: No successful Puppet run in the last 10 hours
[08:24:55] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 08:24:49 UTC 2013
[08:25:25] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[08:54:51] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 08:54:46 UTC 2013
[08:55:31] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[09:09:16] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 182 seconds
[09:10:16] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 6 seconds
[09:13:46] RECOVERY - search indices - check lucene status page on search1013 is OK: HTTP OK: HTTP/1.1 200 OK - 747 bytes in 0.006 second response time
[09:25:06] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 09:24:58 UTC 2013
[09:25:26] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[09:28:23] (PS4) Hashar: replicate Gerrit repos to Jenkins slave lanthanum [operations/puppet] - https://gerrit.wikimedia.org/r/75499
[09:28:24] (PS4) Hashar: replicate Gerrit repos to Jenkins slave gallium [operations/puppet] - https://gerrit.wikimedia.org/r/75500
[09:29:02] (CR) Hashar: "renamed Gerrit replication group 'jenkins-lanthanum' to 'jenkins-slaves'" [operations/puppet] - https://gerrit.wikimedia.org/r/75499 (owner: Hashar)
[09:29:44] (CR) Hashar: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/75500 (owner: Hashar)
[09:30:21] (CR) Ori.livneh: "> I haven't looked, but I'd assume since so much effort went in to this it is more complete and robust than git::clone." [operations/puppet] - https://gerrit.wikimedia.org/r/74099 (owner: Andrew Bogott)
[09:30:43] (CR) Hashar: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/75500 (owner: Hashar)
[09:43:17] PROBLEM - udp2log log age for gadolinium on gadolinium is CRITICAL: CRITICAL: log files /a/log/webrequest/packet-loss.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days.
[09:46:39] (CR) TTO: "In http://lists.wikimedia.org/pipermail/wikitech-l/2013-July/070716.html , James said he would be instating an "opt-out" user preference. " [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/73565 (owner: Odder)
[09:53:19] (CR) Matmarex: "No, that's in fact exactly what this patch does." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/73565 (owner: Odder)
[09:54:47] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 09:54:37 UTC 2013
[09:55:27] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[10:06:59] (PS1) MaxSem: Switch translation memory from vanadium to zinc [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75566
[10:16:37] (CR) MaxSem: [C: 2] Switch translation memory from vanadium to zinc [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75566 (owner: MaxSem)
[10:17:01] (Merged) jenkins-bot: Switch translation memory from vanadium to zinc [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75566 (owner: MaxSem)
[10:24:18] !log Attempting to migrate TTM to zinc
[10:24:29] Logged the message, Master
[10:24:52] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 10:24:48 UTC 2013
[10:25:32] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[10:25:54] o_0
[10:37:16] PROBLEM - udp2log log age for gadolinium on gadolinium is CRITICAL: CRITICAL: log files /a/log/webrequest/packet-loss.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days.
[10:48:56] hashar: cool :)
[10:49:21] :D
[10:54:52] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 10:54:39 UTC 2013
[10:55:26] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[10:55:47] (PS1) Mark Bergsma: Install new servers as Text Varnish caches [operations/puppet] - https://gerrit.wikimedia.org/r/75572
[10:55:49] (CR) TTO: [C: 1] "yep, probably needs a rebase by now though" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/67953 (owner: Raimond Spekking)
[10:56:05] (CR) jenkins-bot: [V: -1] Install new servers as Text Varnish caches [operations/puppet] - https://gerrit.wikimedia.org/r/75572 (owner: Mark Bergsma)
[10:56:49] yeahhhh err: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not parse for environment production: No file(s) found for import of '../private/manifests/passwords.pp' at /etc/puppet/manifests/base.pp:10 on node i-00000778.pmtpa.wmflabs
[10:57:11] (PS2) Mark Bergsma: Install new servers as Text Varnish caches [operations/puppet] - https://gerrit.wikimedia.org/r/75572
[10:57:28] (CR) jenkins-bot: [V: -1] Install new servers as Text Varnish caches [operations/puppet] - https://gerrit.wikimedia.org/r/75572 (owner: Mark Bergsma)
[10:57:51] (PS1) MaxSem: Revert "Switch translation memory from vanadium to zinc" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75573
[10:58:07] (CR) MaxSem: [C: 2] Revert "Switch translation memory from vanadium to zinc" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75573 (owner: MaxSem)
[10:58:14] (Merged) jenkins-bot: Revert "Switch translation memory from vanadium to zinc" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75573 (owner: MaxSem)
[10:58:33] (PS3) Mark Bergsma: Install new servers as Text Varnish caches [operations/puppet] - https://gerrit.wikimedia.org/r/75572
[10:58:49] !log Aborted TTM migration, will file a bug report
[10:59:00] Logged the message, Master
[10:59:34] (CR) Mark Bergsma: [C: 2] Install new servers as Text Varnish caches [operations/puppet] - https://gerrit.wikimedia.org/r/75572 (owner: Mark Bergsma)
[10:59:35] (Merged) Mark Bergsma: Install new servers as Text Varnish caches [operations/puppet] - https://gerrit.wikimedia.org/r/75572 (owner: Mark Bergsma)
[11:12:28] (PS1) Mark Bergsma: Generate ganglia plugin conf after starting Varnish [operations/puppet] - https://gerrit.wikimedia.org/r/75575
[11:12:31] MaxSem: are you happy with that mobile vary change now?
[11:13:25] (PS2) Mark Bergsma: Generate ganglia plugin conf after starting Varnish [operations/puppet] - https://gerrit.wikimedia.org/r/75575
[11:13:26] mark, I've prepared the PHP counterpart in https://gerrit.wikimedia.org/r/#/c/75362/
[11:13:39] thanks
[11:13:52] so varnish won't use XVO (yet), but at least we can use the info from the header to create the VCL
[11:14:00] and squid will work ;)
[11:14:10] mark, my only concern for your change is that we currently have another cookie mf_alpha which we will deprecate soon
[11:14:19] (aka most likely next week)
[11:14:20] I can add that
[11:14:22] I just didn't know about it
[11:15:01] hrm
[11:15:04] I just realized something
[11:15:12] using one regsuball does not always give the same order
[11:15:18] mark, hmm, and after we ditch it optin will have values other than 1
[11:15:21] perhaps I should do one regsub per cookie
[11:15:37] MaxSem: that's ok, just change the regex
[11:16:17] (CR) Mark Bergsma: [C: 2] Generate ganglia plugin conf after starting Varnish [operations/puppet] - https://gerrit.wikimedia.org/r/75575 (owner: Mark Bergsma)
[11:16:18] (Merged) Mark Bergsma: Generate ganglia plugin conf after starting Varnish [operations/puppet] - https://gerrit.wikimedia.org/r/75575 (owner: Mark Bergsma)
[11:16:33] we'll have to let mf_alpha pass through varnish for some time after that, for the migration period
[11:21:10] I got a really weird varnish cache issue. An http://foo/ url is a redirect to the https version. But the https cached version ends up being the same as the http version
[11:21:27] so https version is cached as being a redirect to … the https version. Hence a nice loop
[11:21:34] and I have zero clue where it comes from :(
[11:22:14] the good thing is I managed to reproduce it constantly: https://bugzilla.wikimedia.org/show_bug.cgi?id=51700#c8 :-]
[11:22:46] hashar: so we use protocol relative URLs right, no separate http/https caching
[11:24:19] so if I ask varnish for http://foo or https://foo it serves the exact same cached copy ?
[11:24:22] like it is shared?
[11:24:35] yes
[11:24:45] hmm
[11:24:59] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 11:24:49 UTC 2013
[11:24:59] the https is handled by nginx anyway, which does the http:// with X-Fowarded-Proto: https
[11:25:22] so varnish might have to vary on X-Fowarded-Proto maybe
[11:25:29] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[11:26:11] it could, but we generally avoid it
[11:26:16] as this would fragment the cache
[11:26:50] redirects are varied though, iirc
[11:26:51] (PS1) Mark Bergsma: Install extra VCL files before starting Varnish [operations/puppet] - https://gerrit.wikimedia.org/r/75576
[11:27:00] yeah
[11:27:32] (CR) Mark Bergsma: [C: 2] Install extra VCL files before starting Varnish [operations/puppet] - https://gerrit.wikimedia.org/r/75576 (owner: Mark Bergsma)
[11:27:34] (Merged) Mark Bergsma: Install extra VCL files before starting Varnish [operations/puppet] - https://gerrit.wikimedia.org/r/75576 (owner: Mark Bergsma)
[11:33:44] any idea where the redirect vary is handled ?
[11:37:56] depends on what sends out the redirect eh
[11:38:04] mediawiki or apache
[11:39:00] the way I reproduce the issue is listed at https://bugzilla.wikimedia.org/show_bug.cgi?id=51700#c8 . Basically on a purged cache, the http:// query yields Location: https. When querying the https:// I am served the Location: https.
[11:39:11] makes a lot of sense
[11:39:35] the http from the client goes directly to the varnish frontend. the https:// one passes via nginx, which sets X-Fowarded-Proto: https and then does a http query on the varnish frontend
[11:39:57] yes
[11:40:00] the http:// + X-Fowarded-For: https ought to not be a redirect to https.
[11:40:07] so whatever serves that page (check the headers) needs to add a vary header
[11:41:01] Vary: X-Fowarded-Proto ?
[11:41:06] yep
[11:41:19] X-FoRwarded-Proto
[11:41:21] I think that is my apache conf which does the 301
[11:41:32] upper case R ?
[11:41:36] * hashar smiles
[11:41:40] no, but you forgot it a few times ;)
[11:41:51] ah yeah
[11:42:09] * hashar attempts to blame Apple spelling check
[11:42:18] * mark blames the french
[11:43:18] of course
[11:43:25] the redirect needs to be conditional as well
[11:43:36] a vary header makes no difference if apache is configured to redirect https to https ;)
[11:43:45] so it should only do that for XFP == "http"
[11:44:11] RewriteEngine On
[11:44:12] RewriteCond %{HTTP:X-Forwarded-Proto} !https
[11:44:12] RewriteRule ^/(.*)$ https://login.wikimedia.beta.wmflabs.org/$1 [R=301,L]
[11:44:49] that comes from our apache conf https://git.wikimedia.org/blob/operations%2Fapache-config.git/ca2f6fd3740adcf81ec0de87c9fd30d66a802f40/wikimedia.conf#L157
[11:45:54] so either apache does not receive the X-Fwd-Proto or it does not properly handle it. fun.
[11:46:20] or it doesn't also send a vary header
[11:47:24] so apache might serve different content but varnish would never query it because it was not instructed to vary ?
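[Editor's note] The diagnosis above is that nginx terminates TLS and re-issues the request over plain http with X-Forwarded-Proto: https, so without a Vary header the http→https redirect and the https page share one cache object. One way to express the discussed fix on the Varnish side is to force 30x responses with an absolute Location to vary on X-Forwarded-Proto. A minimal sketch, assuming Varnish 3 VCL syntax as used on these caches at the time; this is an illustration of the idea, not the actual content of the Gerrit change:

```vcl
# Sketch only: cache the http and https variants of a redirect separately.
sub vcl_fetch {
    if (beresp.status >= 300 && beresp.status < 400
        && beresp.http.Location ~ "^http") {
        if (beresp.http.Vary) {
            # Preserve any Vary the origin already sent.
            set beresp.http.Vary = beresp.http.Vary + ",X-Forwarded-Proto";
        } else {
            set beresp.http.Vary = "X-Forwarded-Proto";
        }
    }
}
```

The `Location ~ "^http"` guard matches only absolute-URI redirects; a hypothetical protocol-relative redirect (e.g. `Location: //host/path`) would not need the extra Vary, which is the point debated further down in the log.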
[11:49:35] varnish would cache either the http or the https version
[11:49:39] and not distinguish them
[11:49:48] so it depends on the object currently cached, whether you get a redirect loop or not
[11:50:00] that is what I noticed
[11:50:07] figuring out what apache sends me back
[11:54:39] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 11:54:37 UTC 2013
[11:55:30] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[11:56:13] well
[11:56:19] if the vary header config is not right up there with the rest
[11:56:25] i don't see how it would send a Vary header ;)
[12:01:08] so apache works properly but does not send the Vary:
[12:01:26] I am afraid we have the same issue in production :/
[12:01:27] https://git.wikimedia.org/blob/operations%2Fapache-config.git/ca2f6fd3740adcf81ec0de87c9fd30d66a802f40/wikimedia.conf#L157
[12:01:33] since I copied the apache conf from there
[12:02:56] curl -s -i https://login.wikimedia.org/|grep Vary:
[12:02:57] Vary: Accept-Encoding,X-Forwarded-Proto,Cookie
[12:02:57] \O/
[12:08:08] PROBLEM - Varnish HTTP text-backend on cp1065 is CRITICAL: Connection refused
[12:10:08] yes likely
[12:10:24] i can't find how it is setup
[12:14:45] (PS1) Mark Bergsma: Add new Text Varnish servers [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75581
[12:15:29] (PS2) Mark Bergsma: Add new Text Varnish servers [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75581
[12:15:57] (CR) Mark Bergsma: [C: 2] Add new Text Varnish servers [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75581 (owner: Mark Bergsma)
[12:15:58] (Merged) Mark Bergsma: Add new Text Varnish servers [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75581 (owner: Mark Bergsma)
[12:16:58] !log mark synchronized wmf-config/squid.php
[12:17:09] Logged the message, Master
[12:19:08] RECOVERY - Varnish HTTP text-backend on cp1065 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.000 second response time
[12:22:52] mark: I think i found out. In production the http:// redirecting url is not cached by the squid caches. curl http://login.wikimedia.org/ -i yields MISS
[12:23:04] whereas on the beta varnish cache, that is cached
[12:23:18] that's probably the reason why it wasn't noticed before, indeed
[12:23:21] so on squid, the http:// always redirects to the https:// version which is always the version served by apache (and sends you to the main page)
[12:23:29] whereas on varnish, the http:// is cached
[12:23:34] right
[12:23:37] and overrides the https:// version
[12:23:38] so we should fix the apache config
[12:23:46] so that would potentially be an issue on varnish text whenever you deploy them.
[12:23:56] in an hour ;)
[12:23:59] you want to get the redirect cached dont you ?
[12:24:06] yes
[12:24:43] so the apache boxes need mod_headers and we have to add a Vary to all the redirects :/
[12:24:48] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 12:24:39 UTC 2013
[12:24:55] damn it took me a while to figure out all of that
[12:25:29] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[12:27:57] yes
[12:28:07] do you want to write an email to some lists about that?
[12:28:17] wrong question I guess
[12:28:21] you don't want to, but could you? ;)
[12:28:59] I think someone was trying to figure that out in the past
[12:29:01] I have updated my bug as I was investigating the issue. Wrote a quick summary at https://bugzilla.wikimedia.org/show_bug.cgi?id=51700#c11
[12:29:14] and ended up deciding that it's not possible with apache or something
[12:29:20] I don't remember more details
[12:29:38] PROBLEM - Host cp1052 is DOWN: PING CRITICAL - Packet loss = 100%
[12:29:48] sorry, just saw the backlog
[12:29:55] don't hate me :>
[12:30:12] well I have already spent a bunch of yesterday afternoon on that issue
[12:30:35] is that your way of saying that you do hate me?
[12:30:35] :) [12:30:44] ROFL [12:30:48] RECOVERY - Host cp1052 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [12:30:48] PROBLEM - Host cp1053 is DOWN: PING CRITICAL - Packet loss = 100% [12:31:03] I would have hated you if you said that after I change the patch in Gerrit :-] [12:31:18] … I submit the patch in Gerrit .. [12:31:28] not a bit deal, I have learned a lot along the way [12:31:48] RECOVERY - Host cp1053 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [12:31:58] PROBLEM - Host cp1054 is DOWN: PING CRITICAL - Packet loss = 100% [12:32:28] PROBLEM - Host cp1055 is DOWN: PING CRITICAL - Packet loss = 100% [12:32:48] RECOVERY - Host cp1054 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [12:33:38] RECOVERY - Host cp1055 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [12:33:59] PROBLEM - Host cp1065 is DOWN: PING CRITICAL - Packet loss = 100% [12:34:58] PROBLEM - Host cp1066 is DOWN: PING CRITICAL - Packet loss = 100% [12:35:08] RECOVERY - Host cp1065 is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [12:35:09] (PS1) Mark Bergsma: Fix up Vary headers on 30x redirects from Apache [operations/puppet] - https://gerrit.wikimedia.org/r/75583 [12:35:10] hashar: ^ [12:35:52] (PS2) Hashar: Fix up Vary headers on 30x redirects from Apache [operations/puppet] - https://gerrit.wikimedia.org/r/75583 (owner: Mark Bergsma) [12:35:58] RECOVERY - Host cp1066 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [12:35:59] PROBLEM - Host cp1067 is DOWN: PING CRITICAL - Packet loss = 100% [12:36:01] referenced bug 51700 :) [12:36:20] :) [12:36:47] in apache we could potentially set an env variable in the RewriteRule and send the Vary: header whenever that env is set [12:36:56] but that is scary and need to be done all other the place [12:37:08] PROBLEM - Host cp1068 is DOWN: PING CRITICAL - Packet loss = 100% [12:37:15] another way would be to let MediaWiki handle them, but that is too far in the layers i guess [12:37:18] PROBLEM - Varnish HTTP text-frontend on cp1065 is 
CRITICAL: Connection refused [12:37:24] mark: why obj.http.Location ~ "^http" ? [12:37:28] PROBLEM - Varnish traffic logger on cp1065 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [12:37:28] RECOVERY - Host cp1067 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [12:37:43] i don't know if protocol relative redirects are possible [12:37:49] but if they are, we don't need the vary header there [12:38:04] I don't think so [12:38:18] RECOVERY - Host cp1068 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [12:38:27] so potentially we could test that patch in beta but puppet is broken in labs currently. [12:38:45] (PS3) Mark Bergsma: Fix up Vary headers on 30x redirects from Apache [operations/puppet] - https://gerrit.wikimedia.org/r/75583 [12:39:18] RECOVERY - Varnish HTTP text-frontend on cp1065 is OK: HTTP OK: HTTP/1.1 200 OK - 197 bytes in 0.001 second response time [12:39:29] RECOVERY - Varnish traffic logger on cp1065 is OK: PROCS OK: 2 processes with command name varnishncsa [12:39:48] PROBLEM - NTP on cp1068 is CRITICAL: NTP CRITICAL: Offset unknown [12:39:48] The [12:39:48] field value consists of a single absolute URI. 
[12:39:48] Location = "Location" ":" absoluteURI [12:39:53] http://tools.ietf.org/html/rfc2616#section-14.30 [12:40:06] wikipedia says though [12:40:07] This example, is incorrect according to the current standard, which specifies the URI returned to be absolute.[7] However, all popular browsers will accept a relative URL[citation needed], and it is correct according to the upcoming revision of HTTP/1.1.[8] [12:40:29] i was just looking up the same [12:40:31] then you get some weird mobile browser that does not support it :D [12:40:45] so if relative URLs will be possible, let's keep that in there ;) [12:40:52] it doesn't hurt [12:41:48] RECOVERY - NTP on cp1068 is OK: NTP OK: Offset 0.001034259796 secs [12:41:48] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [12:42:21] nod [12:47:19] PROBLEM - NTP on cp1054 is CRITICAL: NTP CRITICAL: Offset unknown [12:47:48] (CR) Mark Bergsma: [C: 2] Fix up Vary headers on 30x redirects from Apache [operations/puppet] - https://gerrit.wikimedia.org/r/75583 (owner: Mark Bergsma) [12:48:02] (Merged) Mark Bergsma: Fix up Vary headers on 30x redirects from Apache [operations/puppet] - https://gerrit.wikimedia.org/r/75583 (owner: Mark Bergsma) [12:48:18] PROBLEM - NTP on cp1055 is CRITICAL: NTP CRITICAL: Offset unknown [12:48:33] argh [12:49:10] I can't test out your fix in beta, the labs puppetmaster is broken :D [12:49:21] i can test it in production :) [12:49:26] (PS1) Mark Bergsma: Fix typo [operations/puppet] - https://gerrit.wikimedia.org/r/75584 [12:49:30] i have all these test servers to play with [12:50:08] (CR) Mark Bergsma: [C: 2] Fix typo [operations/puppet] - https://gerrit.wikimedia.org/r/75584 (owner: Mark Bergsma) [12:50:12] (Merged) Mark Bergsma: Fix typo [operations/puppet] - https://gerrit.wikimedia.org/r/75584 (owner: Mark Bergsma) [12:51:44] argh [12:52:07] out for a snack [12:52:18] RECOVERY - NTP on cp1054 is OK: NTP OK: Offset 0.001305222511 secs [12:52:53] 
(PS1) Mark Bergsma: Use beresp. instead of obj. in vcl_fetch [operations/puppet] - https://gerrit.wikimedia.org/r/75585 [12:52:54] * mark in with noodles [12:53:18] RECOVERY - NTP on cp1055 is OK: NTP OK: Offset 0.001746296883 secs [12:53:56] when will jenkins do VCL tests? ;-) [12:54:14] whenever someone figures out how to expand the templates [12:54:16] (CR) Mark Bergsma: [C: 2] Use beresp. instead of obj. in vcl_fetch [operations/puppet] - https://gerrit.wikimedia.org/r/75585 (owner: Mark Bergsma) [12:54:20] (Merged) Mark Bergsma: Use beresp. instead of obj. in vcl_fetch [operations/puppet] - https://gerrit.wikimedia.org/r/75585 (owner: Mark Bergsma) [12:54:35] I think varnish has some unit testing suite [12:54:41] more after I grab a snack [12:54:58] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 12:54:52 UTC 2013 [12:55:28] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [12:59:57] (PS1) Mark Bergsma: Use hit_for_pass for TTL <= 0s objects [operations/puppet] - https://gerrit.wikimedia.org/r/75586 [13:00:04] i wonder if/why we're not seeing that problem on mobile [13:00:50] (CR) Mark Bergsma: [C: 2] Use hit_for_pass for TTL <= 0s objects [operations/puppet] - https://gerrit.wikimedia.org/r/75586 (owner: Mark Bergsma) [13:00:53] (Merged) Mark Bergsma: Use hit_for_pass for TTL <= 0s objects [operations/puppet] - https://gerrit.wikimedia.org/r/75586 (owner: Mark Bergsma) [13:06:34] re [13:07:10] mark: so if you figure out a way to generate vcl files, we could run the varnish syntax check on each of them [13:07:51] yeah kinda difficult since they're erb [13:11:24] varnish has some testing support (varnishtest), example input https://github.com/varnish/Varnish-Cache/blob/master/bin/varnishtest/tests/v00006.vtc [13:12:11] (PS1) Mark Bergsma: Do XFF appends, use pass for POST requests [operations/puppet] - https://gerrit.wikimedia.org/r/75587 [13:12:54] i think varnish is not the problem, but getting 
the VCL files expanded from the erb templates [13:14:41] (PS2) Mark Bergsma: Do XFF appends, use pass for POST requests [operations/puppet] - https://gerrit.wikimedia.org/r/75587 [13:14:56] hashar, I'm looking at the puppet/password issue [13:15:03] \O/ [13:15:19] (CR) Manybubbles: "I'm not really sure about this commit because it looks like it defines a second copy of the role::db::labsdb. I added Peter because he se" [operations/puppet] - https://gerrit.wikimedia.org/r/74158 (owner: coren) [13:15:25] andrewbogott: the bug is https://bugzilla.wikimedia.org/show_bug.cgi?id=51955 [13:15:37] (CR) Mark Bergsma: [C: 2] Do XFF appends, use pass for POST requests [operations/puppet] - https://gerrit.wikimedia.org/r/75587 (owner: Mark Bergsma) [13:15:38] (Merged) Mark Bergsma: Do XFF appends, use pass for POST requests [operations/puppet] - https://gerrit.wikimedia.org/r/75587 (owner: Mark Bergsma) [13:15:39] yeps, I see it. Thanks. [13:16:42] heh, of course it runs fine on the instance I was using as a canary yesterday :( [13:17:11] (PS1) Mark Bergsma: Restore cookies on pass requests as well, to be sure. [operations/puppet] - https://gerrit.wikimedia.org/r/75588 [13:17:50] (CR) Mark Bergsma: [C: 2] Restore cookies on pass requests as well, to be sure. [operations/puppet] - https://gerrit.wikimedia.org/r/75588 (owner: Mark Bergsma) [13:17:51] (Merged) Mark Bergsma: Restore cookies on pass requests as well, to be sure. 
[operations/puppet] - https://gerrit.wikimedia.org/r/75588 (owner: Mark Bergsma) [13:23:55] (PS1) Mark Bergsma: Properly handle request restarts with cookie munging and restoration [operations/puppet] - https://gerrit.wikimedia.org/r/75589 [13:24:51] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 13:24:41 UTC 2013 [13:25:32] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [13:30:03] (PS1) Mark Bergsma: Filter out the Orig-Cookie header if coming from clients [operations/puppet] - https://gerrit.wikimedia.org/r/75590 [13:30:55] (CR) Mark Bergsma: [C: 2] Properly handle request restarts with cookie munging and restoration [operations/puppet] - https://gerrit.wikimedia.org/r/75589 (owner: Mark Bergsma) [13:30:56] (Merged) Mark Bergsma: Properly handle request restarts with cookie munging and restoration [operations/puppet] - https://gerrit.wikimedia.org/r/75589 (owner: Mark Bergsma) [13:31:17] (CR) Mark Bergsma: [C: 2] Filter out the Orig-Cookie header if coming from clients [operations/puppet] - https://gerrit.wikimedia.org/r/75590 (owner: Mark Bergsma) [13:31:31] (Merged) Mark Bergsma: Filter out the Orig-Cookie header if coming from clients [operations/puppet] - https://gerrit.wikimedia.org/r/75590 (owner: Mark Bergsma) [13:32:13] hmm mobile uses X-Orig-Cookie [13:32:19] perhaps we should just use the same for consistency [13:32:55] MaxSem: do you know if MobileFrontend inspects/uses the X-Orig-Cookie header? 
[13:33:32] mark, it doesn't [13:33:54] so perhaps it would be better to do what we do on text varnish: modify the cookie header for caching, but restore before sending to mediawiki [13:34:30] so mediawiki always receives the original cookie sent by the client, but varnish has a cleaned up version for caching (vary) [13:35:33] (PS1) Andrew Bogott: On the labs puppetmaster, link to labs private repo [operations/puppet] - https://gerrit.wikimedia.org/r/75592 [13:35:38] makes sense [13:36:36] afk [13:36:37] (CR) Andrew Bogott: [C: 2] On the labs puppetmaster, link to labs private repo [operations/puppet] - https://gerrit.wikimedia.org/r/75592 (owner: Andrew Bogott) [13:36:38] (Merged) Andrew Bogott: On the labs puppetmaster, link to labs private repo [operations/puppet] - https://gerrit.wikimedia.org/r/75592 (owner: Andrew Bogott) [13:37:42] mark, I just merged one of your changes [13:38:34] (CR) Hashar: "That is for bug https://bugzilla.wikimedia.org/show_bug.cgi?id=51955" [operations/puppet] - https://gerrit.wikimedia.org/r/75592 (owner: Andrew Bogott) [13:41:41] hashar, now I get a new and different error. Is that legit, or still a puppetmaster failure? 
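The scheme mark describes for text varnish, caching against a cleaned-up Cookie while MediaWiki always receives the original, might look roughly like this in VCL. This is a hedged sketch: the header name, hook placement, and the fact that everything is stripped (rather than whitelisted) are illustrative, not the production config.

```vcl
sub vcl_recv {
    /* Stash the client's cookies, then reduce Cookie to whatever
       should participate in cache variance. (Illustrative: the real
       config keeps a whitelist rather than unsetting everything.) */
    set req.http.Orig-Cookie = req.http.Cookie;
    unset req.http.Cookie;
}

sub vcl_miss {
    /* Restore before fetching, so the backend sees exactly what the
       client sent. Per the "Restore cookies on pass requests as well"
       change above, the same restore belongs in vcl_pass too. */
    if (req.http.Orig-Cookie) {
        set req.http.Cookie = req.http.Orig-Cookie;
        unset req.http.Orig-Cookie;
    }
}
```

The later "Filter out the Orig-Cookie header if coming from clients" change closes the obvious hole in this pattern: a client could send Orig-Cookie itself, so it has to be stripped on entry.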
[13:42:26] (PS1) Mark Bergsma: Split off pass requests from cookie munging [operations/puppet] - https://gerrit.wikimedia.org/r/75593 [13:42:28] andrewbogott: ok, sorry [13:42:34] * andrewbogott thinks it is probably legit [13:43:01] mark, no problem, I'm just working too fast trying to get hashar unstuck [13:43:28] and i'm doing many changes in succession, so one got forgotten;) [13:44:18] (PS2) Mark Bergsma: Split off pass requests from cookie munging [operations/puppet] - https://gerrit.wikimedia.org/r/75593 [13:45:36] (CR) Mark Bergsma: [C: 2] Split off pass requests from cookie munging [operations/puppet] - https://gerrit.wikimedia.org/r/75593 (owner: Mark Bergsma) [13:45:37] (Merged) Mark Bergsma: Split off pass requests from cookie munging [operations/puppet] - https://gerrit.wikimedia.org/r/75593 (owner: Mark Bergsma) [13:53:47] (PS1) Andrew Bogott: Virt0 puppet symlinks should point to /root/testrepo/* [operations/puppet] - https://gerrit.wikimedia.org/r/75596 [13:54:02] (CR) jenkins-bot: [V: -1] Virt0 puppet symlinks should point to /root/testrepo/* [operations/puppet] - https://gerrit.wikimedia.org/r/75596 (owner: Andrew Bogott) [13:54:46] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 13:54:41 UTC 2013 [13:55:27] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [13:59:14] (PS2) Andrew Bogott: Virt0 puppet symlinks should point to /root/testrepo/* [operations/puppet] - https://gerrit.wikimedia.org/r/75596 [13:59:20] andrewbogott: btw, the private repo is a good candidate for a module ;) [13:59:36] indeed [14:00:13] (PS3) Andrew Bogott: Virt0 puppet symlinks should point to /root/testrepo/* [operations/puppet] - https://gerrit.wikimedia.org/r/75596 [14:01:11] (CR) Andrew Bogott: [C: 2] Virt0 puppet symlinks should point to /root/testrepo/* [operations/puppet] - https://gerrit.wikimedia.org/r/75596 (owner: Andrew Bogott) [14:01:12] (Merged) Andrew Bogott: Virt0 puppet symlinks should point 
to /root/testrepo/* [operations/puppet] - https://gerrit.wikimedia.org/r/75596 (owner: Andrew Bogott) [14:03:16] so... [14:03:25] maybe while I'm at it i should add nginx on all text varnish servers [14:03:40] for https only, not ipv6 of course [14:04:07] a bit easier than when there's traffic on them [14:08:36] sh is going to suck [14:08:54] why? [14:09:00] btw, I played a bit with weight since the ssl1005/6 are faster boxes [14:09:03] it makes no difference at all [14:09:17] I even had it at 75 vs. 25 and they got the exact same amount of traffic [14:09:30] I should refresh my wcsh balancer [14:09:34] oh also I suggested that HT might make a huge difference for SSL and Ryan tested that [14:09:54] and it made a 2x difference [14:10:00] in terms of what? [14:10:02] amazingly [14:10:16] that's what he said, I presume it halved the CPU load? [14:10:21] lol [14:10:24] well [14:10:33] I guess it would do that if ganglia counted those cpus [14:10:33] of course it's half the cpu load [14:10:35] as separate [14:10:58] did requests/s go up? [14:11:17] I have no idea what tests he ran [14:11:25] and I'm clearly not in a state to relay information :) [14:11:35] (PS1) Ottomata: Including fundraising::udp2log_rotation and accounts::file_mover on erbium. [operations/puppet] - https://gerrit.wikimedia.org/r/75599 [14:12:00] (CR) Ottomata: [C: 2 V: 2] Including fundraising::udp2log_rotation and accounts::file_mover on erbium. [operations/puppet] - https://gerrit.wikimedia.org/r/75599 (owner: Ottomata) [14:12:00] (Merged) Ottomata: Including fundraising::udp2log_rotation and accounts::file_mover on erbium. 
[operations/puppet] - https://gerrit.wikimedia.org/r/75599 (owner: Ottomata) [14:12:49] (PS1) Mark Bergsma: Fix indentation [operations/puppet] - https://gerrit.wikimedia.org/r/75600 [14:13:47] (CR) Mark Bergsma: [C: 2] Fix indentation [operations/puppet] - https://gerrit.wikimedia.org/r/75600 (owner: Mark Bergsma) [14:13:48] (Merged) Mark Bergsma: Fix indentation [operations/puppet] - https://gerrit.wikimedia.org/r/75600 (owner: Mark Bergsma) [14:14:04] PROBLEM - RAID on erbium is CRITICAL: Connection refused by host [14:14:24] RECOVERY - udp2log log age for erbium on erbium is OK: OK: all log files active [14:14:35] PROBLEM - Puppetmaster HTTPS on virt0 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error [14:15:34] so now it's 50 vs. 25 [14:15:45] 50 for 1005/1006, 25 for 1001-4 [14:16:05] mark: I noticed yesterday we still have a varnishhtcpd upstart job deployed. It must be a dupe of the vhtcpd init script provided by the Debian package ( https://gerrit.wikimedia.org/r/#/c/75323/ ) :) [14:16:28] and they're getting about the same traffic [14:16:33] +/- 5% [14:17:03] hashar: isn't that for the old perl script? [14:17:04] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours [14:17:26] mark: I have no clue [14:17:40] I don't think we ever used the name varnishhtcpd for the C version [14:17:53] exec /usr/local/bin/varnishhtcpd [14:18:08] might have been from a time when we compiled our own version and did not rely on a deb package? [14:18:14] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 186 seconds [14:18:46] I see a .conf but I don't see it referenced from anywhere? 
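The check being made here: in this repo, a file under files/upstart/ only lands on a host if some resource references it, via the upstart_job pattern mentioned just above. A sketch of what such a reference would look like (hypothetical usage, mirroring the syntax quoted later in the log):

```puppet
# Hypothetical reference; nothing like this remains after change 68687
# removed the Perl daemon, so files/upstart/varnishhtcpd.conf is dead code.
upstart_job { 'varnishhtcpd': }
```

With no such resource left anywhere in the manifests, the orphaned .conf can be deleted outright, which is what hashar's change 75323 does.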
[14:19:14] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 15 seconds [14:19:25] there's files/upstart/varnishhtcpd.conf but no upstart_job { 'varnishtcpd': } that I can see [14:19:32] I guess the upstart job got forgotten when the script got removed [14:20:16] https://gerrit.wikimedia.org/r/#/c/68687/ remove obsolete Perl HTCP purger daemon [14:20:54] (PS2) Hashar: get rid of varnishhtcpd upstart job [operations/puppet] - https://gerrit.wikimedia.org/r/75323 [14:21:05] heh, I was about to push [14:21:06] amended to reference the removal change [14:22:02] (CR) Faidon: [C: 2] get rid of varnishhtcpd upstart job [operations/puppet] - https://gerrit.wikimedia.org/r/75323 (owner: Hashar) [14:22:07] (Merged) Faidon: get rid of varnishhtcpd upstart job [operations/puppet] - https://gerrit.wikimedia.org/r/75323 (owner: Hashar) [14:22:12] \O/ [14:25:07] (PS1) Mark Bergsma: Setup NGINX for HTTPS on the Varnish servers [operations/puppet] - https://gerrit.wikimedia.org/r/75601 [14:25:14] PROBLEM - udp2log log age for gadolinium on gadolinium is CRITICAL: CRITICAL: log files /a/log/webrequest/packet-loss.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. [14:26:04] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 14:26:01 UTC 2013 [14:26:34] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [14:28:40] (PS1) Rillke: Update CommonSettings to reflect changes in UpWiz [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75602 [14:28:52] hmm [14:29:10] hm? 
[14:29:11] so the nginx configuration template for the protoproxies doesn't live in the protoproxy module, but in the nginx module [14:29:14] that's yuck [14:29:23] ew [14:30:16] the template itself is really yuck too [14:30:26] it has a case statement per $::site [14:32:03] ah it's not in the nginx module [14:32:07] it's in the main templates/nginx dir [14:32:15] that's less bad ;-) [14:35:34] (PS2) Mark Bergsma: Setup NGINX for HTTPS on the Varnish servers [operations/puppet] - https://gerrit.wikimedia.org/r/75601 [14:35:51] (CR) jenkins-bot: [V: -1] Setup NGINX for HTTPS on the Varnish servers [operations/puppet] - https://gerrit.wikimedia.org/r/75601 (owner: Mark Bergsma) [14:36:03] i'm gonna have to rewrite that template [14:39:23] paravoid, do you know anything about how the puppetmaster on virt0 is/was set up? [14:41:14] PROBLEM - udp2log log age for gadolinium on gadolinium is CRITICAL: CRITICAL: log files /a/log/webrequest/packet-loss.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. [14:41:46] not much, no [14:41:49] what are you looking for [14:42:42] My refactors yesterday broke it and I can't figure out what's happening. Right now it responds with a big block of html [14:42:52] complaining that it can't create /etc/puppet/manifests [14:43:00] which, I don't understand why it would want to create it... [14:43:04] PROBLEM - Puppet freshness on gadolinium is CRITICAL: No successful Puppet run in the last 10 hours [14:43:20] …from which I conclude that it is configured /very/ differently from sockpuppet or stafford [14:43:44] (CR) Yuvipanda: [C: 1] "Whoops, I missed that." 
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75602 (owner: Rillke) [14:47:24] PROBLEM - Puppetmaster HTTPS on virt1000 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error [14:48:15] (PS1) Ottomata: Putting erbium in ganglia udp2log view [operations/puppet] - https://gerrit.wikimedia.org/r/75604 [14:48:27] (CR) Ottomata: [C: 2 V: 2] Putting erbium in ganglia udp2log view [operations/puppet] - https://gerrit.wikimedia.org/r/75604 (owner: Ottomata) [14:48:28] (Merged) Ottomata: Putting erbium in ganglia udp2log view [operations/puppet] - https://gerrit.wikimedia.org/r/75604 (owner: Ottomata) [14:52:44] PROBLEM - Varnish HTTP mobile-backend on cp3012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:52:44] PROBLEM - Varnish HTCP daemon on cp3012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:53:14] PROBLEM - Varnish traffic logger on cp3012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:54:14] PROBLEM - udp2log log age for gadolinium on gadolinium is CRITICAL: CRITICAL: log files /a/log/webrequest/packet-loss.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. 
[14:55:04] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 14:54:54 UTC 2013 [14:55:34] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [14:56:32] (PS1) Mark Bergsma: Add a lean definition (and custom template) for SSL proxies to localhost [operations/puppet] - https://gerrit.wikimedia.org/r/75606 [14:56:48] (CR) jenkins-bot: [V: -1] Add a lean definition (and custom template) for SSL proxies to localhost [operations/puppet] - https://gerrit.wikimedia.org/r/75606 (owner: Mark Bergsma) [14:58:08] (PS2) Mark Bergsma: Add a lean definition (and custom template) for SSL proxies to localhost [operations/puppet] - https://gerrit.wikimedia.org/r/75606 [14:58:14] PROBLEM - SSH on pdf3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:00:22] (PS1) Hashar: contint: logs directory to hold Jenkins console logs [operations/puppet] - https://gerrit.wikimedia.org/r/75608 [15:03:14] PROBLEM - udp2log log age for gadolinium on gadolinium is CRITICAL: CRITICAL: log files /a/log/webrequest/packet-loss.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. 
[15:03:58] (PS3) Mark Bergsma: Setup NGINX for HTTPS on the Varnish servers [operations/puppet] - https://gerrit.wikimedia.org/r/75601 [15:04:04] RECOVERY - SSH on pdf3 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [15:05:30] (CR) ArielGlenn: [C: 2] contint: logs directory to hold Jenkins console logs [operations/puppet] - https://gerrit.wikimedia.org/r/75608 (owner: Hashar) [15:05:31] (Merged) ArielGlenn: contint: logs directory to hold Jenkins console logs [operations/puppet] - https://gerrit.wikimedia.org/r/75608 (owner: Hashar) [15:09:00] (PS4) Mark Bergsma: Setup NGINX for HTTPS on the Varnish servers [operations/puppet] - https://gerrit.wikimedia.org/r/75601 [15:09:01] (PS3) Mark Bergsma: Add a lean definition (and custom template) for SSL proxies to localhost [operations/puppet] - https://gerrit.wikimedia.org/r/75606 [15:10:16] PROBLEM - udp2log log age for gadolinium on gadolinium is CRITICAL: CRITICAL: log files /a/log/webrequest/packet-loss.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. 
[15:10:22] (CR) jenkins-bot: [V: -1] Setup NGINX for HTTPS on the Varnish servers [operations/puppet] - https://gerrit.wikimedia.org/r/75601 (owner: Mark Bergsma) [15:11:06] (CR) jenkins-bot: [V: -1] Add a lean definition (and custom template) for SSL proxies to localhost [operations/puppet] - https://gerrit.wikimedia.org/r/75606 (owner: Mark Bergsma) [15:11:44] (PS5) Mark Bergsma: Setup NGINX for HTTPS on the Varnish servers [operations/puppet] - https://gerrit.wikimedia.org/r/75601 [15:11:45] (PS4) Mark Bergsma: Add a lean definition (and custom template) for SSL proxies to localhost [operations/puppet] - https://gerrit.wikimedia.org/r/75606 [15:13:06] RECOVERY - Puppet freshness on gadolinium is OK: puppet ran at Wed Jul 24 15:13:02 UTC 2013 [15:15:40] (PS1) Tzafrir: hewikivoyage: also sortPrepend en [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75617 [15:18:12] (PS6) Mark Bergsma: Setup NGINX for HTTPS on the Varnish servers [operations/puppet] - https://gerrit.wikimedia.org/r/75601 [15:18:13] (PS5) Mark Bergsma: Add a lean definition (and custom template) for SSL proxies to localhost [operations/puppet] - https://gerrit.wikimedia.org/r/75606 [15:19:42] (CR) Reedy: [C: 2] Update CommonSettings to reflect changes in UpWiz [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75602 (owner: Rillke) [15:19:51] (Merged) jenkins-bot: Update CommonSettings to reflect changes in UpWiz [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75602 (owner: Rillke) [15:24:53] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 15:24:44 UTC 2013 [15:25:36] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [15:30:51] !log reedy synchronized wmf-config/ [15:31:01] Logged the message, Master [15:31:51] andrewbogott: why didn't you just keep the git stuff under /var/lib/git/operations? [15:32:13] mark, which? [15:32:28] the git repos I mean [15:32:34] I think I did. 
[15:32:39] what's the need to have them under /etc/puppet? [15:32:46] with symlinks :) [15:33:30] also... might this be related to anything you're working on? [15:33:31] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Puppet::Parser::AST::Resource failed with error ArgumentError: Invalid resource type files at /etc/puppet/modules/varnish/manifests/htcppurger.pp:23 on node cp1052.eqiad.wmnet [15:33:45] Yeah, that might not be needed; /etc/puppet is where puppetmaster looks by default, and I was following the pattern established on sockpuppet. [15:34:02] We could point the puppetmaster to /var/lib/git/operations with a bunch of extra config settings. [15:34:19] sockpuppet's setup was really much older and was redone on stafford [15:34:36] (and unused) [15:34:52] yeah, we could move it. [15:34:58] so if you want to stay consistent, it's best to keep stafford's setup [15:35:05] sockpuppet is just CA stuff [15:35:17] peter is redoing some stuff anyway, so it doesn't matter a lot [15:35:29] but if stuff is broken now, i'd say, just go back to how it was [15:35:43] Virt0 is broken but for unrelated reasons... [15:35:50] ok [15:35:58] mark: i think that would be [15:35:58] me [15:35:58] ah [15:35:59] It had yet a third organization… I'd love to have the three systems agree, for starters :) [15:36:05] yeah so [15:36:11] sockpuppet was the original puppetmaster [15:36:16] setup before we had puppet (obviously) [15:36:22] hashar actually [15:36:23] its configuration changed drastically a few times while we experimented [15:36:28] but my fault for not seeing it [15:36:33] then virt0 was installed for labs, with a rather different setup [15:36:35] and jenkins fault for not complaining [15:36:35] + files { '/etc/init/varnishhtcpd.conf': [15:36:48] and then I setup stafford for performance reasons, also trying to bring a little bit of sanity in the two setups [15:37:10] paravoid: I got something wrong ? 
[15:37:11] i never attempted to fix sockpuppet, it was meant to go away, unfortunately that didn't happen [15:37:15] hashar: yes :) [15:37:20] but at least it wasn't really used except for CA stuff [15:37:24] mark, sockpuppet had /three/ live copies of the puppet repo on it, and I was bound and determined to whittle that down to one. the rearrange on stafford was kind of a side-effect :) [15:37:35] yeah [15:38:04] i wish I had killed sockpuppet back then, the only reason I didn't was because I had some issues with getting the certs to work across the cluster iirc [15:38:13] and then had more important things to do [15:38:14] (PS1) Faidon: Fix typo, files -> file [operations/puppet] - https://gerrit.wikimedia.org/r/75624 [15:38:16] I'll try to de-symlink stafford once I have virt0 on board. [15:38:28] paravoid: sorry :( Ah yeah I wanted to get rid of /etc/init/varnishhtcpd.conf since it is for the upstart job. [15:38:45] (CR) Faidon: [C: 2] Fix typo, files -> file [operations/puppet] - https://gerrit.wikimedia.org/r/75624 (owner: Faidon) [15:38:49] ohh [15:38:50] a typo [15:38:51] ah right, you guys were working on that varnishhtcpd thing [15:38:55] but, yeah, hopefully we can kill sockpuppet entirely sometime soon, that'll make the orgchart much simpler! [15:39:52] well, we were planning for multiple appservers and a load balancer though, so that'll make it complex again :) [15:40:27] no different from how it is now really [15:40:27] (CR) Faidon: [V: 2] Fix typo, files -> file [operations/puppet] - https://gerrit.wikimedia.org/r/75624 (owner: Faidon) [15:40:28] (Merged) Faidon: Fix typo, files -> file [operations/puppet] - https://gerrit.wikimedia.org/r/75624 (owner: Faidon) [15:40:31] well, not much [15:41:30] Yeah, that seems ok, will just have to change puppet-merge so it can run on an arbitrary master. (which it almost can already.) 
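The earlier "Invalid resource type files" failure on cp1052 and Faidon's "Fix typo, files -> file" change are the same bug: puppet has no resource type named `files`, so the parser rejects the whole catalog. The corrected resource from the varnishhtcpd cleanup presumably reads something like the following (the log only shows the resource's opening line; `ensure => absent` is my reading of the intent, since the change removes the upstart job):

```puppet
# 'file' (singular) is the built-in type; 'files' makes the parser
# fail with "Invalid resource type files".
file { '/etc/init/varnishhtcpd.conf':
    ensure => absent,
}
```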
[15:49:55] mark@fenari:~$ ./firstbyte.py cp1052.eqiad.wmnet 80 www.wikidata.org / [15:49:55] GET / HTTP/1.1 [15:49:55] Host: www.wikidata.org [15:49:55] User-Agent: firstbyte.py [15:49:55] Connection: close [15:49:57] HTTP/1.1 200 OK [15:49:59] X-Powered-By: Express [15:50:03] by what? [15:52:02] eh?! [15:52:06] (PS1) Hashar: contint: python dependency for publish-console.py [operations/puppet] - https://gerrit.wikimedia.org/r/75632 [15:52:39] mark: Express = nodejs web server [15:52:44] i know [15:52:47] RECOVERY - Puppetmaster HTTPS on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [15:52:52] But ... wtf is that doing on wikidata.org [15:52:54] but but, wikidata? [15:52:59] is it contacting parsoid perhaps [15:53:12] That looks like the kind of header Parsoid would send [15:53:30] If you look at the full response you can tell if it's Parsoid pretty easily [15:53:42] it is [15:53:45]
Welcome to the alpha test web service for the Parsoid project. [15:53:48] Usage: GET /title for the DOM. Example: Main Page [15:53:58] misconfigured varnish? :) [15:54:57] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 15:54:50 UTC 2013 [15:55:15] X-Cache: cp1058 miss (0), cp1052 frontend hit (1) [15:55:21] (PS5) Ottomata: Fixing automated hue SSL generation and permissions [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/74686 [15:55:24] cp1058 being a parsoid varnish [15:55:27] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [15:55:28] cp1058 is a Parsoid Varnish backend [15:55:29] Yeah [15:55:46] wikimedia_text-frontend.vcl:backend cp1058 { [15:55:46] wikimedia_text-frontend.vcl: .host = "cp1058.eqiad.wmnet"; [15:56:03] nice [15:56:26] andrewbogott: i think puppet-merge can run anywhere [15:56:40] all it does is put in a review step instead of merging, with diffs for submodules [15:56:54] ottomata: yeah, I guess it's the post-merge hook that would have to be made generic [15:56:54] anything fancy is handled by git hooks [15:56:59] aye ja [15:57:21] i was tempted to not even put 'puppet' in the name of that script [15:57:26] since really it would work with any git repo [15:57:41] (PS1) Mark Bergsma: Use the correct Varnish backends [operations/puppet] - https://gerrit.wikimedia.org/r/75634 [15:58:36] paravoid, if I fixed those most recent comments on the hue ssl commit, s'ok to merge? 
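The smoking gun is the VCL grep quoted above: wikimedia_text-frontend.vcl declared cp1058, a Parsoid cache, as one of the text frontend's backends, so frontend misses on cp1052 were fetched from Parsoid's Express server instead of a text backend. In VCL terms the stray entry would have looked roughly like this (only the two quoted lines are from the log; the port is an assumption):

```vcl
/* What wikimedia_text-frontend.vcl effectively contained: a backend
   entry pointing at a Parsoid cache rather than a text cache. */
backend cp1058 {
    .host = "cp1058.eqiad.wmnet";
    .port = "80";  /* assumed; the log doesn't show the port */
}
```

Hence mark's "Use the correct Varnish backends" fix just below, and his grumble that the affected caches now need purging.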
[15:58:39] (CR) Mark Bergsma: [C: 2] Use the correct Varnish backends [operations/puppet] - https://gerrit.wikimedia.org/r/75634 (owner: Mark Bergsma) [15:58:40] (Merged) Mark Bergsma: Use the correct Varnish backends [operations/puppet] - https://gerrit.wikimedia.org/r/75634 (owner: Mark Bergsma) [15:58:58] that's annoying, now i'll need to purge those varnish servers [15:59:29] ottomata: yeah [15:59:35] (CR) Faidon: [C: 1] Fixing automated hue SSL generation and permissions [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/74686 (owner: Ottomata) [16:00:28] k danke [16:01:11] (CR) Ottomata: [C: 2 V: 2] Fixing automated hue SSL generation and permissions [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/74686 (owner: Ottomata) [16:01:12] (Merged) Ottomata: Fixing automated hue SSL generation and permissions [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/74686 (owner: Ottomata) [16:01:47] PROBLEM - Puppetmaster HTTPS on virt0 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error [16:03:41] !log catrope Started syncing Wikimedia installation... 
: Deploy code ahead of VE roll-out to another 8 wikis [16:03:51] Logged the message, Master [16:04:30] (PS1) Ottomata: Updating modules/cdh4 to latest hue commit [operations/puppet] - https://gerrit.wikimedia.org/r/75635 [16:04:49] (CR) Ottomata: [C: 2 V: 2] Updating modules/cdh4 to latest hue commit [operations/puppet] - https://gerrit.wikimedia.org/r/75635 (owner: Ottomata) [16:04:54] (Merged) Ottomata: Updating modules/cdh4 to latest hue commit [operations/puppet] - https://gerrit.wikimedia.org/r/75635 (owner: Ottomata) [16:09:15] (CR) Ottomata: [C: 2 V: 2] Adding role::analytics::hue [operations/puppet] - https://gerrit.wikimedia.org/r/74388 (owner: Ottomata) [16:09:15] (Merged) Ottomata: Adding role::analytics::hue [operations/puppet] - https://gerrit.wikimedia.org/r/74388 (owner: Ottomata) [16:11:39] heya mark, i know you did this kinda recently for gadolinium, but could you do it for erbium now too? [16:11:39] https://rt.wikimedia.org/Ticket/Display.html?id=5510 [16:14:12] ok [16:14:23] (PS7) Mark Bergsma: Setup NGINX for HTTPS on the Varnish servers [operations/puppet] - https://gerrit.wikimedia.org/r/75601 [16:14:24] (PS6) Mark Bergsma: Add a lean definition (and custom template) for SSL proxies to localhost [operations/puppet] - https://gerrit.wikimedia.org/r/75606 [16:14:27] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:16:31] done [16:20:20] thanks! [16:21:34] !log catrope synchronized php-1.22wmf10/resources/startup.js 'touch' [16:21:43] Logged the message, Master [16:23:44] !log catrope synchronized php-1.22wmf10/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.ViewPageTarget.js 'touch' [16:23:54] Logged the message, Master [16:24:57] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 16:24:56 UTC 2013 [16:25:27] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [16:27:33] !log catrope Finished syncing Wikimedia installation... 
: Deploy code ahead of VE roll-out to another 8 wikis [16:27:42] Logged the message, Master [16:30:58] !log Fixed /etc/dsh/group/bits ... again :( [16:31:08] Logged the message, Mr. Obvious [16:33:14] RoanKattouw: errors on dewiki: "novenamespace: VisualEditor is not enabled in namespace 4" [16:33:17] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [16:34:17] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 6 seconds [16:34:19] Raymond_: Yeah, we know [16:34:26] ok. [16:34:32] It's a caching issue [16:34:35] It should be fixed now [16:35:27] I am off *wave* [16:36:07] RoanKattouw: friendly reminder to pick up https://gerrit.wikimedia.org/r/#/c/75140/ if you haven't already [16:36:17] chrismcmahon: Will do once I do config [16:36:21] but I'm not there yet [16:36:30] RoanKattouw: gotcha, thanks [16:38:16] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 181 seconds [16:38:36] Ugh, Gerrit is sloooowwww [16:40:12] (PS1) Mark Bergsma: Reference the actual Cookie header [operations/puppet] - https://gerrit.wikimedia.org/r/75640 [16:40:16] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 2 seconds [16:40:20] !log catrope synchronized php-1.22wmf10/extensions/VisualEditor 'Fix nowiki bug' [16:40:30] Logged the message, Master [16:40:45] !log catrope synchronized php-1.22wmf11/extensions/VisualEditor 'Fix nowiki bug' [16:40:55] Logged the message, Master [16:43:12] yeah gerrit is having its daily crappiness [16:44:19] :-( [16:45:30] nice [16:45:37] mobile caches don't set X-Forwarded-Proto [16:45:43] so it's https when nginx, empty when not [16:45:46] Ahm, oops? 
[16:45:53] !log catrope synchronized php-1.22wmf10/extensions/VisualEditor/modules/ve-mw/dm/annotations/ve.dm.MWExternalLinkAnnotation.js 'touch' [16:45:57] Well that's how it works for MW too [16:46:04] Or used to at least [16:46:04] Logged the message, Master [16:46:16] !log catrope synchronized php-1.22wmf11/extensions/VisualEditor/modules/ve-mw/dm/annotations/ve.dm.MWExternalLinkAnnotation.js 'touch' [16:46:26] Logged the message, Master [16:46:38] !log catrope synchronized php-1.22wmf10/extensions/VisualEditor/modules/ve-mw/dm/annotations/ve.dm.MWNowikiAnnotation.js 'touch' [16:46:48] Logged the message, Master [16:47:00] !log catrope synchronized php-1.22wmf11/extensions/VisualEditor/modules/ve-mw/dm/annotations/ve.dm.MWNowikiAnnotation.js 'touch' [16:47:10] Logged the message, Master [16:48:22] hrm [16:48:24] bits doesn't either [16:48:51] I don't think text does either [16:49:41] indeed [16:49:44] text varnish does though [16:49:47] so I guess i'll have to change that [16:50:22] Yeah I'm not sure MW will even handle XFP: http correctly. I think it should, but I'm not sure it's been tested [16:52:45] i didn't notice anything in my casual browsing [16:52:48] but I don't want to risk it now [16:52:58] yurik, i told RoanKattouw to go ahead and use our window in 7 minutes, as we can't be deploying anything for wikipedia zero today anyway. cc: greg-g [16:53:13] yep [16:53:24] * greg-g nods [16:54:46] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 16:54:40 UTC 2013 [16:55:26] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [16:55:47] aight, who's killing gerrit [16:56:09] * greg-g puts away his revolver [16:56:16] no idea [16:56:35] 174.0.4.10.in-addr.arpa domain name pointer i-0000081e.pmtpa.wmflabs. [16:56:35] 174.0.4.10.in-addr.arpa domain name pointer wikidata-test-multi.pmtpa.wmflabs.
[16:56:54] these wikidata folks are delaying their own varnish deployment [16:59:13] (CR) Mark Bergsma: [C: 2] Reference the actual Cookie header [operations/puppet] - https://gerrit.wikimedia.org/r/75640 (owner: Mark Bergsma) [16:59:19] (Merged) Mark Bergsma: Reference the actual Cookie header [operations/puppet] - https://gerrit.wikimedia.org/r/75640 (owner: Mark Bergsma) [16:59:32] (CR) Catrope: [C: 2] Enable VisualEditor for all users on test2wiki, experimental also [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75140 (owner: Cmcmahon) [16:59:36] (CR) Catrope: [C: 2] Hide the new 'visualeditor-betatempdisable' preference if in alpha [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75542 (owner: Jforrester) [16:59:40] (CR) Catrope: [C: 2] VisualEditor into beta on de/es/fr/he/it/pl/ru/svwiki for logged-in only [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75543 (owner: Jforrester) [17:00:10] (Merged) jenkins-bot: Enable VisualEditor for all users on test2wiki, experimental also [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75140 (owner: Cmcmahon) [17:00:12] (Merged) jenkins-bot: Hide the new 'visualeditor-betatempdisable' preference if in alpha [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75542 (owner: Jforrester) [17:00:45] (PS1) Mark Bergsma: X-Forwarded-Proto is empty for straight http [operations/puppet] - https://gerrit.wikimedia.org/r/75642 [17:02:03] (CR) Mark Bergsma: [C: 2] X-Forwarded-Proto is empty for straight http [operations/puppet] - https://gerrit.wikimedia.org/r/75642 (owner: Mark Bergsma) [17:02:04] (Merged) Mark Bergsma: X-Forwarded-Proto is empty for straight http [operations/puppet] - https://gerrit.wikimedia.org/r/75642 (owner: Mark Bergsma) [17:04:43] (PS1) Mark Bergsma: Move vcl_deliver after vcl_error (more consistent) [operations/puppet] - https://gerrit.wikimedia.org/r/75644 [17:05:29] (PS3) Catrope: VisualEditor into beta on de/es/fr/he/it/pl/ru/svwiki 
for logged-in only [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75543 (owner: Jforrester) [17:05:36] (CR) jenkins-bot: [V: -1] VisualEditor into beta on de/es/fr/he/it/pl/ru/svwiki for logged-in only [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75543 (owner: Jforrester) [17:06:10] (PS4) Catrope: VisualEditor into beta on de/es/fr/he/it/pl/ru/svwiki for logged-in only [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75543 (owner: Jforrester) [17:06:26] (CR) Catrope: [C: 2] VisualEditor into beta on de/es/fr/he/it/pl/ru/svwiki for logged-in only [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75543 (owner: Jforrester) [17:06:35] (Merged) jenkins-bot: VisualEditor into beta on de/es/fr/he/it/pl/ru/svwiki for logged-in only [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75543 (owner: Jforrester) [17:07:27] (CR) Mark Bergsma: [C: 2] Move vcl_deliver after vcl_error (more consistent) [operations/puppet] - https://gerrit.wikimedia.org/r/75644 (owner: Mark Bergsma) [17:07:28] (Merged) Mark Bergsma: Move vcl_deliver after vcl_error (more consistent) [operations/puppet] - https://gerrit.wikimedia.org/r/75644 (owner: Mark Bergsma) [17:08:16] !log catrope synchronized wmf-config/CommonSettings.php 'Show either visualeditor-enable or visualeditor-betatempdisable, not both' [17:08:27] Logged the message, Master [17:09:30] !log catrope synchronized wmf-config/InitialiseSettings.php 'Enable VisualEditor on 8 more wikis' [17:09:40] Logged the message, Master [17:11:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.780 second response time [17:15:17] (PS1) Mark Bergsma: Pass test.* requests through all caching layers [operations/puppet] - https://gerrit.wikimedia.org/r/75647 [17:15:27] notpeter_ & ori-l, today I attempted to migrate TTM 
to zinc but encountered a bug and had to bail out, so this'll have to wait for Nikerabbit to finish his vacation [17:18:13] (CR) Mark Bergsma: [C: 2] Pass test.* requests through all caching layers [operations/puppet] - https://gerrit.wikimedia.org/r/75647 (owner: Mark Bergsma) [17:18:14] (Merged) Mark Bergsma: Pass test.* requests through all caching layers [operations/puppet] - https://gerrit.wikimedia.org/r/75647 (owner: Mark Bergsma) [17:19:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:24] !log torrus deadlocked, fixing [17:20:36] Logged the message, RobH [17:20:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [17:23:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.515 second response time [17:25:30] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 17:25:28 UTC 2013 [17:26:30] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [17:27:12] MaxSem: ok. anything I can help with? [17:28:07] no, it's a PHP issue [17:30:28] !log catrope synchronized php-1.22wmf10/extensions/FlaggedRevs 'Fix excessive FlaggedRevs notices' [17:30:39] Logged the message, Master [17:30:47] RoanKattouw: Reedy: How long should we keep it in Common.js. 30 days is too short, right? Last I checked we've begun removing old wmf branches on bits again, what do we end up using as the reliable cache rollover time? [17:30:53] !log catrope synchronized php-1.22wmf11/extensions/FlaggedRevs 'Fix excessive FlaggedRevs notices' [17:31:04] Logged the message, Master [17:31:09] Krinkle: Right now I don't think there's a reliable cache rollover time [17:31:11] Which is a problem [17:31:25] And yet we've begun removing wmf branches again?
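The cache-rollover question above hinges on how caches revalidate stale objects: on a 304 the cache may reuse the stored body, but it must refresh its freshness metadata (Expires, Cache-Control) from the new response rather than simply extending the old expiry. A minimal sketch of that rule under standard HTTP caching semantics; the function and dict shapes are illustrative, not Squid or MediaWiki code:

```python
def revalidate(cached, response):
    """Merge a revalidation response into a cached entry.

    On 304 the stored body is reused, but freshness headers are taken
    from the *new* response; on 200 the entry is replaced wholesale.
    A cache that renews the old expiry on 304 without honouring the
    new headers can keep stale content alive indefinitely.
    """
    if response["status"] == 304:
        headers = dict(cached["headers"])
        headers.update(response["headers"])  # refresh Expires/Cache-Control
        return {"headers": headers, "body": cached["body"]}
    return {"headers": dict(response["headers"]), "body": response["body"]}
```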
[17:31:26] * Krinkle looks up the bug report [17:31:31] I think this is due to 304 refreshing and I've discussed this with Aaron and Asher but nothing ever happened [17:31:44] Yes, but Tim fixed that, right? [17:32:01] When? [17:32:01] In the past ~6 months? [17:32:01] We used to allow Squid to essentially renew the 304 expiration without taking new content [17:32:05] Yes [17:32:06] yes, 1-2 months back [17:32:11] Oh? Link? [17:32:26] Sure, will take a few minutes. Yield :) [17:33:26] I3889f300012aeabd37e228653279ad19b296e4ae ? [17:33:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:35:40] AaronSchulz: I think so, yes. https://gerrit.wikimedia.org/r/#/c/58415/3/includes/OutputPage.php [17:35:40] https://bugzilla.wikimedia.org/show_bug.cgi?id=44570 was closed [17:35:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [17:37:10] I'll just say "remove before September 1". Nicely ambiguous. [17:41:00] !log catrope synchronized php-1.22wmf10/extensions/FlaggedRevs 'Fix FlaggedRevs fatal' [17:41:10] Logged the message, Master [17:41:25] !log catrope synchronized php-1.22wmf11/extensions/FlaggedRevs 'Fix FlaggedRevs fatal' [17:41:39] Logged the message, Master [17:41:43] * andrewbogott ambushes Ryan_Lane [17:41:51] :D [17:41:56] andrewbogott: so, virt0 [17:41:59] yeah [17:42:05] the crons update from /root [17:42:12] they need to be changed to update /var/lib... [17:42:36] I moved the /etc/puppet symlinks to point at things in /root/testrepo [17:43:06] is virt0's puppetmaster set to use /root? [17:43:06] I thought it was using /etc [17:43:22] right [17:43:32] and /root rsync'd to /etc [17:43:52] hm… who would be doing that rsync? 
[17:44:04] Here's what I think is happening: [17:44:06] a git hook in /root I think [17:44:24] someone changed how sockpuppet worked ages ago and no one updated virt0 [17:45:32] - cron does 'git pull' in /root/testrepo/puppet [17:45:41] - links from root/testrepo/puppet/* to /etc/puppet/* [17:45:41] - puppet master looks for files in /etc/puppet/* [17:46:28] so, we just need to make /etc point to /var [17:46:30] and update the cron [17:47:15] Eventually I would like to move the repo from /root/testrepo to /var, but it should be working as it is now. [17:47:17] And it isn't [17:47:43] I'd say let's not try to figure out why it's not working in root [17:47:48] and just make it work in var [17:48:00] sure, ok. [17:48:10] it would be a more consistent way anyway :) [17:48:16] d'you mind changing the crons? I don't know how to do that offhand. [17:48:20] sure [17:48:20] I'll move the links. [17:49:16] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [17:49:16] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:16] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:16] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:16] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:17] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:17] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [17:50:48] I wonder if those crons aren't puppetized [17:50:52] ottomata: So I know you used to use the install server docs on wikitech, yesterday i updated and rolled into https://wikitech.wikimedia.org/wiki/Server_Lifecycle [17:50:55] just fyi ;] [17:51:14] Ryan_Lane, I'm going to expect the labs private repo to be in 
/var/lib/git/operations/labs/private [17:51:19] which included stripping out some vendor specific information to their respective platform subpages, but its overall now more in line with reality. [17:51:30] those crons are indeed not puppetized [17:51:40] andrewbogott: sounds good [17:52:39] (PS1) Andrew Bogott: Move /etc/puppet symlinks back to a normal place on virt0 [operations/puppet] - https://gerrit.wikimedia.org/r/75661 [17:53:16] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [17:53:42] (CR) Andrew Bogott: [C: 2] Move /etc/puppet symlinks back to a normal place on virt0 [operations/puppet] - https://gerrit.wikimedia.org/r/75661 (owner: Andrew Bogott) [17:53:43] (Merged) Andrew Bogott: Move /etc/puppet symlinks back to a normal place on virt0 [operations/puppet] - https://gerrit.wikimedia.org/r/75661 (owner: Andrew Bogott) [17:53:59] hm [17:54:16] PROBLEM - Puppet freshness on ms-fe1002 is CRITICAL: No successful Puppet run in the last 10 hours [17:54:55] <^d> manybubbles: I was going to start setting up the ES hosts in beta. What port do the apaches need to access ES on? [17:55:11] andrewbogott: so, we need to make this work somewhat similarly to puppetmaster::self [17:55:12] 9200! [17:55:16] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [17:55:24] <^d> manybubbles: mmk, thanks. [17:55:51] andrewbogott: because we need a private ssh key to checkout private [17:55:53] ^d: ah - I'm messing around with automated tests! I found a bug.... They are pretty easy to write. [17:55:56] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 17:55:46 UTC 2013 [17:56:13] Ryan_Lane, we don't want private though, do we? Just labs-private [17:56:22] yes, that's what I mean :) [17:56:24] I'd like to set up elasticsearch with puppet in labs (currently still debed). Puppet is done. I'm in the zone on the tests though. 
[17:56:26] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [17:56:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:57:09] Hm… what has changed w/respect to labs private? [17:57:20] nothing, why? [17:57:32] we're using novaadmin's ssh keys on virt0 for this [17:57:36] I'm not a huge fan of this [17:57:42] <^d> manybubbles: How many hosts do we want for the initial setup? 4 maybe? [17:57:46] RECOVERY - Puppetmaster HTTPS on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [17:57:50] in fact, I'd like to remove any key from novaadmin [17:57:55] it's somewhat dangerous [17:57:59] Ah, ok. So you're proposing a further improvement, not part of getting us back to how we were before [17:58:05] * Ryan_Lane nods [17:58:09] ^d: I was thinking 3 or 4. We only use 2 for the current search now. [17:58:16] well, it seems this cron stuff isn't puppetized at all [17:58:29] <^d> manybubbles: I'll do 4. They'll be deployment-es[0-3] [17:58:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [17:58:50] cool [18:00:00] really, It would be nice if this wasn't done via cron at all [18:00:16] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: No successful Puppet run in the last 10 hours [18:00:25] but that's a further improvement :) [18:01:46] Yeah, it should be a post-commit hook on gerrit [18:01:48] or post-merge rather [18:02:01] well, neither of those exist ;) [18:02:13] we could have a daemon that reads gerrit's stream events, though [18:02:43] I guess there's a post merge. 
but we actually removed all of gerrit's hooks the other day [18:02:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:02:47] stream events are better [18:03:04] !log catrope synchronized php-1.22wmf10/resources/mediawiki/mediawiki.notification.js 'Fix mw.notify positioning bug' [18:03:14] Logged the message, Master [18:03:27] !log catrope synchronized php-1.22wmf11/resources/mediawiki/mediawiki.notification.js 'Fix mw.notify positioning bug' [18:03:31] <^d> Ryan_Lane: Only thing with stream-events is making sure the user who's accessing it is added to the stream-events group, it's not exposed to all users by default. [18:03:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [18:03:36] Logged the message, Master [18:03:41] ^d: no? [18:03:50] that's lame [18:04:15] <^d> It was made an explicit permission. And I've never really looked into how much "private" data it exposes :) [18:04:23] heh [18:04:57] well, either way, we'll do this the cron way now and do stream events later [18:05:08] (PS1) Pyoungmeister: move jobqueue check from neon to hume/terbium [operations/puppet] - https://gerrit.wikimedia.org/r/75667 [18:05:16] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: No successful Puppet run in the last 10 hours [18:07:18] rawr. where is virt0 actually including the puppetmaster? [18:07:50] Ryan_Lane: did you check under the couch? sometimes puppetmasters wind up there [18:08:13] or maybe between the cushions? :) [18:08:38] Ryan_Lane, in the nova role I think [18:08:45] <^d> Ryan_Lane: While you're looking in the couch, look for my keys. 
[18:09:08] ah [18:09:09] class { "role::puppet::server::labs": } [18:09:18] andrewbogott: indeed it was :) [18:10:23] (CR) Pyoungmeister: [C: 2] move jobqueue check from neon to hume/terbium [operations/puppet] - https://gerrit.wikimedia.org/r/75667 (owner: Pyoungmeister) [18:10:24] (Merged) Pyoungmeister: move jobqueue check from neon to hume/terbium [operations/puppet] - https://gerrit.wikimedia.org/r/75667 (owner: Pyoungmeister) [18:12:16] RECOVERY - Puppetmaster HTTPS on virt1000 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.976 second response time [18:13:16] RECOVERY - Varnish traffic logger on cp3012 is OK: PROCS OK: 2 processes with command name varnishncsa [18:13:16] RECOVERY - Varnish HTCP daemon on cp3012 is OK: PROCS OK: 1 process with UID = 110 (vhtcpd), args vhtcpd [18:13:36] RECOVERY - Varnish HTTP mobile-backend on cp3012 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.176 second response time [18:14:10] Ryan_Lane, ok, I've cloned puppet and private by hand and the symlinks are back to normal [18:14:20] And, dammit, the puppetmaster is working fine. So my problems from this morning remain a mystery. [18:14:24] heh [18:14:36] ok. I'll manually modify the crons for now [18:14:47] fixing that is looking to be….. complicated [18:15:16] mostly because of the private repo [18:15:58] * Ryan_Lane hates the private repo [18:16:59] (CR) Aude: [C: 1] "looks perfect" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75617 (owner: Tzafrir) [18:17:49] Ryan_Lane: So make it public. :-) [18:17:55] Elsie: ;) [18:20:52] Elsie: it's just passwords fyi [18:21:11] (PS1) Petr Onderka: made indexes into trees [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/75668 [18:21:46] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: No successful Puppet run in the last 10 hours [18:22:03] What could go wrong? 
[18:24:47] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 18:24:43 UTC 2013 [18:25:28] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [18:32:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:33:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [18:54:52] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 18:54:51 UTC 2013 [18:55:32] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [19:00:40] (PS1) Pyoungmeister: derp. this obviously needs to be a nrpe check [operations/puppet] - https://gerrit.wikimedia.org/r/75674 [19:02:30] !log catrope synchronized php-1.22wmf10/extensions/VisualEditor 'Fix copyright notice bug' [19:02:41] Logged the message, Master [19:02:57] !log catrope synchronized php-1.22wmf11/extensions/VisualEditor 'Fix copyright notice bug' [19:03:06] Logged the message, Master [19:05:10] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 222 seconds [19:08:06] (CR) Pyoungmeister: [C: 2] derp. this obviously needs to be a nrpe check [operations/puppet] - https://gerrit.wikimedia.org/r/75674 (owner: Pyoungmeister) [19:08:07] (Merged) Pyoungmeister: derp. 
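For reference, an NRPE-backed check like the check_job_queue one being moved to hume/terbium pairs a command definition on the monitored host with a check_nrpe invocation from the icinga side. The paths and names below are illustrative, not the actual Wikimedia puppetization:

```
# on the monitored host (e.g. terbium), an nrpe command definition:
command[check_job_queue]=/usr/local/lib/nagios/plugins/check_job_queue

# from the icinga server, the service check runs roughly:
#   check_nrpe -H terbium -c check_job_queue -t 10
# a "CHECK_NRPE: Socket timeout after 10 seconds" is that timeout expiring
```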
this obviously needs to be a nrpe check [operations/puppet] - https://gerrit.wikimedia.org/r/75674 (owner: Pyoungmeister) [19:09:10] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 1 seconds [19:14:10] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 229 seconds [19:16:10] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 24 seconds [19:20:00] (CR) Hashar: "That indeed fixed the redirect loop issue on http://login.wikimedia.beta.wmflabs.org/ (bug 51700)" [operations/puppet] - https://gerrit.wikimedia.org/r/75583 (owner: Mark Bergsma) [19:24:47] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 19:24:46 UTC 2013 [19:25:27] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [19:26:37] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (148738), enwiki (73602), frwiki (19806), Total (253979) [19:28:35] hey, look at that, we have jobqueue monitoring again [19:30:49] :-] [19:33:17] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [19:35:17] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [19:45:36] is the job queue ever going to be uncritical ? [19:46:49] LeslieCarr, it's going to - if someone adjusts the check to be against 1M which is a much more reasonable number:P [19:47:14] hehe [19:54:06] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:54:46] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 19:54:37 UTC 2013 [19:55:26] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [19:56:51] hashar: login broken on beta labs? [20:07:26] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:10:48] (PS2) MaxSem: $wgMFRemovableClasses overhaul [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/66891 [20:11:22] (PS1) Ottomata: Adding process icinga nrpe check for webstats collector and filter processes [operations/puppet] - https://gerrit.wikimedia.org/r/75766 [20:13:54] (CR) MaxSem: [C: 2] $wgMFRemovableClasses overhaul [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/66891 (owner: MaxSem) [20:13:57] (CR) Ottomata: [C: 2 V: 2] Adding process icinga nrpe check for webstats collector and filter processes [operations/puppet] - https://gerrit.wikimedia.org/r/75766 (owner: Ottomata) [20:14:16] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 193 seconds [20:15:16] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [20:15:51] (Merged) jenkins-bot: $wgMFRemovableClasses overhaul [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/66891 (owner: MaxSem) [20:19:48] !log authdns-update: change ns0's IP to new service IP; glues were updated [20:19:58] Logged the message, Master [20:20:03] RobH: ^ [20:22:45] (PS2) Hashar: contint: python dependency for publish-console.py [operations/puppet] - https://gerrit.wikimedia.org/r/75632 [20:23:16] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [20:24:46] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 20:24:45 UTC 2013 [20:25:16] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [20:25:27] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [20:25:38] oh, watchmouse [20:25:55] don't worry about the alert, what happened is normal [20:26:00] I just didn't know we had an alert :) [20:26:15] i was just about to say " ah looks like they switched it over ? 
" [20:26:34] I'm trying to figure out watchmouse now [20:28:16] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [20:30:17] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [20:33:03] (PS2) Demon: Enable CirrusSearch in beta. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75507 (owner: Manybubbles) [20:38:42] PROBLEM - RAID on analytics1017 is CRITICAL: Timeout while attempting connection [20:39:23] (PS1) Demon: Add icinga monitoring for Gerrit and Gitblit [operations/puppet] - https://gerrit.wikimedia.org/r/75777 [20:40:02] PROBLEM - Host analytics1017 is DOWN: PING CRITICAL - Packet loss = 100% [20:41:08] <^demon> LeslieCarr: There's the icinga stuff I did ^. Dunno if it's right, just copy+pasted what was done for jenkins and zuul with some tweaks. [20:41:29] (PS1) Ottomata: Puppetizing analytics1017 as hadoop worker [operations/puppet] - https://gerrit.wikimedia.org/r/75779 [20:41:38] (CR) Ottomata: [C: 2 V: 2] Puppetizing analytics1017 as hadoop worker [operations/puppet] - https://gerrit.wikimedia.org/r/75779 (owner: Ottomata) [20:44:28] (CR) Lcarr: [C: -1] "If we are to run nrpe on a server with a public ip, it is required that iptables be running on the server." [operations/puppet] - https://gerrit.wikimedia.org/r/75777 (owner: Demon) [20:45:13] RECOVERY - Host analytics1017 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [20:45:44] LeslieCarr: you know this isn't the case out there, right? 
:) [20:47:12] PROBLEM - Disk space on analytics1017 is CRITICAL: Connection refused by host [20:47:22] PROBLEM - DPKG on analytics1017 is CRITICAL: Connection refused by host [20:47:32] PROBLEM - SSH on analytics1017 is CRITICAL: Connection refused [20:47:40] paravoid: just because we havent done it right all the time, we shouldn't do it wrong again [20:48:18] I didn't disagree, just pointing it out :) [20:48:55] I'm hoping that I'll work on ferm as soon as these two storms pass [20:49:00] dns/ulsfo & media storage [20:49:05] the bulk of them anyway [20:49:12] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 188 seconds [20:50:12] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [20:54:02] PROBLEM - Puppet freshness on professor is CRITICAL: No successful Puppet run in the last 10 hours [20:54:42] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 20:54:37 UTC 2013 [20:55:32] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [20:56:02] PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours [20:56:43] (PS2) Demon: Add icinga monitoring for Gerrit and Gitblit [operations/puppet] - https://gerrit.wikimedia.org/r/75777 [20:57:37] (CR) Demon: "Added nrpe to antimony and manganese in PS2. Will have a look at the iptables stuff after this meeting." 
[operations/puppet] - https://gerrit.wikimedia.org/r/75777 (owner: Demon) [20:59:33] PROBLEM - NTP on analytics1017 is CRITICAL: NTP CRITICAL: No response from NTP server [21:13:38] (PS1) Lcarr: removing nagios redirects :( [operations/puppet] - https://gerrit.wikimedia.org/r/75786 [21:14:42] (CR) Lcarr: [C: 2] removing nagios redirects :( [operations/puppet] - https://gerrit.wikimedia.org/r/75786 (owner: Lcarr) [21:15:26] RECOVERY - SSH on analytics1017 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [21:16:32] (PS1) Bsitu: Set $wgAllowHTMLEmail default to true [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75787 [21:24:56] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 21:24:51 UTC 2013 [21:25:27] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [21:28:51] (PS1) Catrope: Add eqiad bits caches to /etc/dsh/group/bits [operations/puppet] - https://gerrit.wikimedia.org/r/75791 [21:29:24] !log dropped all databases on s3 that were migrated to s7 [rt5506] [21:29:34] Logged the message, Master [21:33:00] (PS1) Bsitu: Preparation of Echo and Thanks for metawiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75795 [21:33:23] (CR) Hashar: "Should we reopen bug 45926 which requested the nagios redirection to icinga ? :-]" [operations/puppet] - https://gerrit.wikimedia.org/r/75786 (owner: Lcarr) [21:33:42] * MaxSem scaps [21:36:04] (CR) Lcarr: "we could reopen it and reclose it as a won'tfix ?" [operations/puppet] - https://gerrit.wikimedia.org/r/75786 (owner: Lcarr) [21:36:21] ugh Symbolic link not allowed or link target not accessible: /usr/local/apache/common/docroot/bits/static-1.22wmf5/extensions, referer: http://de.m.wikipedia.org/wiki/Scheibenwelt-Romane [21:36:43] our cache isn't supposed to last that long... [21:39:51] wmf5.... [21:43:10] !log maxsem Started syncing Wikimedia installation... 
: Weekly mobile deployment [21:43:22] Logged the message, Master [21:50:24] Ryan_Lane: yo [21:50:35] one channel at a time please ;) [21:51:06] we're going to need a cache flush soon as requested - are you still the point person for that? [21:51:35] ^ Ryan_Lane [21:51:52] MaxSem will be able to say when [21:52:19] Peter told me to ping you around the time ;-) [21:54:46] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 21:54:40 UTC 2013 [21:55:27] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [21:55:56] Ryan_Lane: poke.. [21:56:00] Ryan_Lane: reassure me :) [21:57:09] jdlrobson: no reassurances!!! :D [21:57:10] muaahahahahaha [21:57:16] RECOVERY - Puppet freshness on erzurumi is OK: puppet ran at Wed Jul 24 21:57:06 UTC 2013 [21:57:37] phew.. MaxSem is scapping now but when it's done we are going to need that cache flushed so please don't go awol on me lol [21:57:47] this deployment is scary enough as it is ;-) [21:57:54] !log maxsem Finished syncing Wikimedia installation... : Weekly mobile deployment [21:57:54] * Ryan_Lane disappears [21:58:04] Logged the message, Master [21:58:08] Ryan_Lane: are fortune cookie told us to expect a miracle: https://twitter.com/rakugojon/status/360150740238495744/photo/1 [21:58:09] tell me when [21:58:18] *our fortune cookie [21:58:52] looks like it finished. should I flush it now? [21:58:59] (PS1) Lcarr: getting rid of nagios.wikimedia.org trap [operations/puppet] - https://gerrit.wikimedia.org/r/75796 [21:59:27] would someone like to check this above^^ ? [21:59:41] jdlrobson: so. if I can't disappear you can't either.... [21:59:44] binasher, it surfaced that a wikipedia zero partner's press release is promoting their launch as covering all languages. so we'll be putting in a VCL submission shortly to update their existing rule. hoping you can review so that we can get this in place quickly. their launch is around midnight tonight. [21:59:46] ^ MaxSem ? 
[21:59:54] Ryan_Lane: i'm waiting for MaxSem to confirm :) [22:00:07] MaxSem: ... [22:00:31] 1 sec, waiting for QA [22:00:33] dr0ptp4kt: sure [22:00:36] dr0ptp4kt:why specifically binasher ? [22:01:03] LeslieCarr, he's usually on the vcl submissions. are you able to review this, too? [22:01:15] i can do basic ones, i think more on our team can as well [22:01:15] * jdlrobson plays catch with Ryan_Lane  [22:01:23] LeslieCarr, didn't want to wake up our two friends in europe, either :) [22:01:26] really you shouldn't necessarily rely on 1 person [22:01:37] Ryan_Lane: why are you not in the office? this is a whiskey moment.. [22:01:37] (CR) Bsitu: [C: -2] Preparation of Echo and Thanks for metawiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75795 (owner: Bsitu) [22:01:51] (CR) Bsitu: [C: -2] Set $wgAllowHTMLEmail default to true [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75787 (owner: Bsitu) [22:02:00] jdlrobson: socialize? [22:02:14] you guys have whisky? [22:02:16] Ryan_Lane: you know i have 40 year old whiskey right ;-) [22:02:28] save me a glass for friday :) [22:02:41] Ryan_Lane: will have to be tomorrow - flying out of the country friday [22:02:44] jdlrobson: what is it? 
[22:02:51] (CR) Lcarr: [C: 2] getting rid of nagios.wikimedia.org trap [operations/puppet] - https://gerrit.wikimedia.org/r/75796 (owner: Lcarr)
[22:03:15] it's some old Glenfiddich and Glemorangie from my granddad's attic
[22:05:20] yum
[22:05:27] Ryan_Lane, fire away!:P
[22:05:53] \o/
[22:06:25] Ryan_Lane: push that red button :)
[22:06:29] (i assume it is red)
[22:06:37] waiting on qa
[22:06:39] give me a week
[22:06:40] :)
[22:06:48] jdlrobson, uttons are usually black on laptop keyboards
[22:06:53] (I had already done it, btw ;) )
[22:07:09] Ryan_Lane: WOOOO
[22:07:42] (PS1) Pyoungmeister: removing unused enwikijobqueue check [operations/puppet] - https://gerrit.wikimedia.org/r/75797
[22:08:55] Ryan_Lane, thanksalot!:)
[22:09:45] yw
[22:12:35] (PS1) Dr0ptp4kt: Expanding Aircel whitelist to cover all langs. [operations/puppet] - https://gerrit.wikimedia.org/r/75798
[22:15:45] LeslieCarr, binasher, would you please review and merge https://gerrit.wikimedia.org/r/75798 ? this is the change for tonight's deployment with the wikipedia zero carrier.
[22:16:03] taking a look
[22:16:57] (CR) Pyoungmeister: "http://dayofthejedi.com/wp-content/uploads/2011/03/27.jpg" [operations/puppet] - https://gerrit.wikimedia.org/r/75796 (owner: Lcarr)
[22:17:13] hahaha
[22:17:50] (CR) Asher: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/75798 (owner: Dr0ptp4kt)
[22:18:40] dr0ptp4kt: does that comment make sense?
[22:25:02] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 22:24:59 UTC 2013
[22:25:32] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[22:27:11] binasher, it's a defense against domains other than mdot and zerodot. or are you saying that it's guaranteed that the requests will be mdot/zerodot?
[22:28:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:29:40] if someone is crafting fordged requests directly to the m/z ip, does it matter if x-cs is internally set? and besides, it matches *.org anyways, so if the answer isn't no, that's a problem for all of the rules
[22:29:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time
[22:31:25] binasher, do we currently serve anything via text varnishes? if we do, https://bugzilla.wikimedia.org/show_bug.cgi?id=51988 looks urgent
[22:31:33] (CR) Dr0ptp4kt: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/75798 (owner: Dr0ptp4kt)
[22:31:47] binasher, i updated the gerrit comment. what do you think?
[22:32:39] MaxSem: not sure, but i don't think so
[22:33:12] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 186 seconds
[22:34:12] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 6 seconds
[22:37:13] dr0ptp4kt: so if the domain is one that shouldn't ever hit the mobile varnish servers, then what?
[22:40:03] binasher, under the current code, the traffic simply wouldn't get the X-CS tagged onto the header. if you think it's safe to just comment out that line, i'm cool with that. funny thing, i made the rule update somewhat tight because i thought, asher's gonna want this thing restrictive :)
[22:41:07] binasher, should i just resubmit with that line commented out?
[22:42:35] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[22:42:54] binasher, i should say…shall i just remove the if and wrapping curly braces, and then unindent the 'set req.http.X-CS' assignment?
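[Editor's note: the VCL question being debated above, whether the `set req.http.X-CS` assignment needs its guarding `if`, looks roughly like the fragment below. This is an illustrative sketch only, not the actual operations/puppet change in Gerrit 75798; the host regex and carrier code are made-up placeholder values.]

```vcl
# Guarded form, as in dr0ptp4kt's patch: only tag requests whose Host
# header looks like an m-dot or zero-dot domain.
if (req.http.Host ~ "\.(m|zero)\.wikipedia\.org$") {
    set req.http.X-CS = "250-99";  # carrier code: illustrative value only
}

# Unguarded alternative binasher raised: drop the if and its braces and
# unindent the assignment, on the grounds that only mobile/zero traffic
# should reach this path in the first place:
# set req.http.X-CS = "250-99";
```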
[22:43:35] RECOVERY - Solr on vanadium is OK: All OK
[22:43:47] dr0ptp4kt: it looks like *.wap.wiki*.org requests can hit that path, so if those shouldn't get x-cs, the current patch is ok
[22:43:51] !log maxsem synchronized php-1.22wmf11/extensions/MobileFrontend/
[22:44:02] Logged the message, Master
[22:45:20] !log maxsem synchronized php-1.22wmf10/extensions/MobileFrontend/
[22:45:35] Logged the message, Master
[22:46:19] (CR) Asher: [C: 2 V: 2] Expanding Aircel whitelist to cover all langs. [operations/puppet] - https://gerrit.wikimedia.org/r/75798 (owner: Dr0ptp4kt)
[22:49:50] binasher, thanks.
[22:51:21] no prob
[22:54:35] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 22:54:30 UTC 2013
[22:55:25] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[23:11:15] !log maxsem synchronized php-1.22wmf10/extensions/MobileFrontend/
[23:13:36] !log maxsem synchronized php-1.22wmf11/extensions/MobileFrontend/
[23:13:38] (CR) Demon: [C: 1] replicate Gerrit repos to Jenkins slave lanthanum [operations/puppet] - https://gerrit.wikimedia.org/r/75499 (owner: Hashar)
[23:14:15] !log upgrading wikitech to 1.22wmf11
[23:14:26] Logged the message, Master
[23:24:54] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 23:24:46 UTC 2013
[23:25:24] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[23:33:09] greg-g: ping
[23:34:49] Krinkle: hi, I'm heading out in a couple, what's up?
[23:35:17] I'd like a lighting deploy window to push out a bugfix for a regression in VE.
[23:35:19] greg-g:
[23:36:11] Krinkle: ok. cherry-picking it should be fine
[23:37:04] There've been no other changes so updating to master only includes this bugfix by Roan (who left for the day as he was in late yesterday) and an i18n change
[23:37:05] update*
[23:37:20] So update to latest master and cherry-picking submodule update on wmf10/11 is OK?
[23:38:27] yeah, so there is code in both VE and other places?
[23:38:34] (just making sure I understand)
[23:38:41] so what you said should be fine
[23:39:08] greg-g: no, just VE. But since VE is deployed as a mediawiki extension I need to do a submodule update in the wmf branch of core (as we always do, just saying it weirdly I guess)
[23:39:26] oh, I see
[23:39:37] https://gerrit.wikimedia.org/r/#/c/75816/
[23:42:38] so, going ahead in a few minutes. Waiting for jenkins to verify one more time.
[23:42:48] :)
[23:46:41] Ryan_Lane: There's undeployed changes for OpenStackManager
[23:46:50] Krinkle: where?
[23:46:54] wmf11
[23:46:59] dirty git status
[23:47:03] where?
[23:47:05] submodule updated in repo, but not updated locally
[23:47:06] tin
[23:47:14] !g cde8daa3c3365bc36acc9de3adc8fc21f1a4f1de
[23:47:14] https://gerrit.wikimedia.org/r/#q,cde8daa3c3365bc36acc9de3adc8fc21f1a4f1de,n,z
[23:47:31] AFAIK I did everything properly in gerrit
[23:48:17] did someone do a git pull without doing a git submodule update?
[23:48:33] yes, you probably
[23:48:37] not me
[23:48:40] I don't deploy from tin
[23:48:46] and persumaby you did a sync afterwards which was a no-op
[23:48:49] !log krinkle synchronized php-1.22wmf10/extensions/VisualEditor 'Idc7b094f8eb2788c48'
[23:48:56] Ryan_Lane: Hm.. what do you mean
[23:48:57] again. I don't deploy from tin ;)
[23:49:00] Logged the message, Master
[23:49:07] I use the wmf branches, but not via tin
[23:49:11] for wikitech
[23:49:16] Right, this submodule is not used in mediawiki config
[23:50:10] Ryan_Lane: So on tin it is the (imho good) habit to not blindly git submodule after git pull, but only what you intend to deploy (the 2 should be equal), just like doing sync-file instead of full on scap.
[23:50:29] However that means whenever you update OSM, we get a dirty status.
I'll just submodule update that one then
[23:50:37] * Ryan_Lane nods
[23:52:14] Reedy: ping
[23:52:28] Reedy: https://gist.github.com/anonymous/75c9a722506058e0ac1b
[23:52:49] at least they're empty so probably just artifacts.
[23:52:52] alright going ahead now
[23:53:16] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds
[23:53:41] !log krinkle synchronized php-1.22wmf11/extensions/VisualEditor 'Idc7b094f8eb2788c48'
[23:54:06] Logged the message, Master
[23:54:56] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 23:54:54 UTC 2013
[23:55:16] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds
[23:55:26] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[23:57:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
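[Editor's note: the tin mix-up Krinkle and Ryan_Lane diagnose above (a `git pull` that moves a submodule pointer without a matching `git submodule update`) can be reproduced in a throwaway sandbox. Everything below is illustrative: `ext` stands in for the OpenStackManager extension repo, `core` for the wmf branch of MediaWiki core, and `tin` for the deployment checkout; none of the paths or names are the real infrastructure.]

```shell
# Sandbox reproduction of the "dirty git status" from the log above.
set -e
work=$(mktemp -d)
# Wrapper so commits work without global git config; protocol.file.allow
# keeps file-path submodule clones working on newer git versions.
g() { git -c user.name=demo -c user.email=demo@example.org \
          -c protocol.file.allow=always "$@"; }

# Upstream extension repo (stands in for OpenStackManager).
g init -q "$work/ext"
(cd "$work/ext" && g commit -q --allow-empty -m 'rev 1')

# Core repo embedding it as a submodule (stands in for the wmf branch).
g init -q "$work/core"
(cd "$work/core" && g submodule add -q "$work/ext" ext && g commit -q -m 'add ext')

# Deployment checkout (stands in for tin).
g clone -q --recurse-submodules "$work/core" "$work/tin"

# Upstream advances the extension, and core bumps the submodule pointer...
(cd "$work/ext" && g commit -q --allow-empty -m 'rev 2')
(cd "$work/core/ext" && g pull -q)
(cd "$work/core" && g commit -qam 'bump ext')

# ...then the deployment checkout pulls core but skips the submodule update:
cd "$work/tin"
g pull -q
status_before=$(git status --porcelain ext)   # submodule now shows as modified
g submodule update ext                        # realign the submodule working copy
status_after=$(git status --porcelain ext)    # clean again
echo "before update: ${status_before:-clean}"
echo "after update:  ${status_after:-clean}"
```

This also illustrates why a subsequent sync was a no-op for Krinkle: the sync tooling copies the submodule's working tree, and until `git submodule update` runs, that tree is still at the old revision even though the superproject's pointer has moved.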