[00:02:52] hm. deployment is working with grains, but it's slower for some reason… I wonder what the deal with that is
[00:03:17] I wonder if the timeout is too long
[00:03:17] notpeter_: thank you! i really appreciate it!
[00:03:40] ori-l: I tested with a force sync of fluorine, btw
[00:03:47] I'd imagine that's ok?
[00:04:10] notpeter_: for context, this was the default output of 'top' when i was trying to debug memory bloating in EventLogging: http://i.imgur.com/G0e5azP.png
[00:04:19] Ryan_Lane: yes, nothing is depending on it
[00:04:22] * Ryan_Lane nods
[00:04:40] notpeter_: so this will make my life a lot easier :)
[00:04:55] ori-l: oh god, that's terrible
[00:05:00] glad to help :)
[00:06:28] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[00:06:36] <^demon> !log dist-upgrading formey lol
[00:06:46] Logged the message, Master
[00:06:59] Ryan_Lane: I'm wrapping up for the day… I think things are working properly.
[00:07:44] andrewbogott: \o/
[00:07:46] andrewbogott: great work
[00:07:53] time will tell :/
[00:07:56] g'night
[00:07:57] one less place we need forwarded keys
[00:08:00] night
[00:11:34] <^demon> !log rebooting formey
[00:11:37] Logged the message, Master
[00:18:40] (PS4) Physikerwelt: Creating initial debianization [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75513 (owner: AzaToth)
[00:24:58] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 00:24:48 UTC 2013
[00:25:28] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[00:34:28] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 207 seconds
[00:35:28] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 16 seconds
[00:46:25] (CR) AzaToth: [C: -1] "(2 comments)" [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75513 (owner: AzaToth)
[00:55:01] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 00:54:51 UTC 2013
[00:55:21] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[01:00:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:01:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[01:11:20] (PS1) TTO: (bug 51803) set up flood flag for ckbwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75538
[01:23:31] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 202 seconds
[01:25:01] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 01:24:58 UTC 2013
[01:25:21] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[01:25:31] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 22 seconds
[01:28:31] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 202 seconds
[01:29:15] (PS1) TTO: (bug 49600) add Portal namespace for sowiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75540
[01:30:31] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 22 seconds
[01:48:25] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 194 seconds
[01:50:25] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 13 seconds
[01:54:45] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 01:54:36 UTC 2013
[01:55:25] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[01:56:47] (PS1) Jforrester: Hide the new 'visualeditor-betatempdisable' preference if in alpha [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75542
[01:56:49] (PS1) Jforrester: VisualEditor into beta on de/es/fr/he/it/pl/ru/svwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75543
[02:08:47] !log LocalisationUpdate completed (1.22wmf11) at Wed Jul 24 02:08:46 UTC 2013
[02:09:00] Logged the message, Master
[02:15:05] !log LocalisationUpdate completed (1.22wmf10) at Wed Jul 24 02:15:04 UTC 2013
[02:15:15] Logged the message, Master
[02:18:21] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 194 seconds
[02:20:21] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 13 seconds
[02:23:22] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 193 seconds
[02:24:21] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 21 seconds
[02:24:51] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 02:24:44 UTC 2013
[02:25:21] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[02:26:45] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Jul 24 02:26:44 UTC 2013
[02:26:56] Logged the message, Master
[02:39:23] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 195 seconds
[02:40:23] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 14 seconds
[02:40:53] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[02:54:43] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 02:54:38 UTC 2013
[02:55:23] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[03:08:42] (PS2) Jforrester: VisualEditor into beta on de/es/fr/he/it/pl/ru/svwiki for logged-in only [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75543
[03:18:26] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 194 seconds
[03:20:19] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 14 seconds
[03:23:19] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 194 seconds
[03:24:49] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 03:24:46 UTC 2013
[03:25:19] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 14 seconds
[03:25:29] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[03:31:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:32:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time
[03:33:19] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 194 seconds
[03:35:19] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 14 seconds
[03:35:40] interesting topic
[03:46:33] Who's on RT duty?
[03:47:58] I'm not sure; the topic was blank so I scanned my recent logs for the basic template
[03:48:20] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 194 seconds
[03:50:20] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 14 seconds
[03:50:46] According to https://wikitech.wikimedia.org/wiki/Interrupts_Rotation it's Ryan_Lane
[03:53:20] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 194 seconds
[03:54:50] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 03:54:42 UTC 2013
[03:55:20] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 14 seconds
[03:55:20] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[03:56:02] Damn, beaten.
[04:13:19] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 194 seconds
[04:13:52] yes, it's me
[04:14:06] Elsie: why? what's up?
[04:14:29] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer:
[04:15:19] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 14 seconds
[04:15:30] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[04:16:59] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours
[04:18:29] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer:
[04:19:19] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 225 seconds
[04:20:19] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 14 seconds
[04:21:31] Ryan_Lane: nothing; the topic was blanked so we reset it. Sorry for the ping.
[04:22:14] ah. no worries
[04:24:49] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 04:24:47 UTC 2013
[04:25:30] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[04:26:29] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[04:42:19] PROBLEM - Puppet freshness on gadolinium is CRITICAL: No successful Puppet run in the last 10 hours
[04:42:32] :-)
[04:42:45] I read the pages on wikitech.wikimedia.org. Yay documentation!
[04:54:49] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 04:54:46 UTC 2013
[04:55:29] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[05:13:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:14:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time
[05:19:56] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 182 seconds
[05:23:56] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 225 seconds
[05:24:56] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 05:24:50 UTC 2013
[05:25:26] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[05:25:56] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 4 seconds
[05:32:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:33:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time
[05:33:56] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 225 seconds
[05:36:56] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 24 seconds
[05:39:48] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 204 seconds
[05:43:48] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 225 seconds
[05:48:48] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 26 seconds
[05:53:48] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 225 seconds
[05:54:48] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 05:54:43 UTC 2013
[05:55:28] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[05:58:48] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 224 seconds
[06:00:48] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 20 seconds
[06:18:50] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 202 seconds
[06:20:50] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 24 seconds
[06:23:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:24:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[06:24:50] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 06:24:48 UTC 2013
[06:25:30] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[06:28:50] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 224 seconds
[06:34:50] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 217 seconds
[06:36:50] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 20 seconds
[06:55:43] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 06:55:34 UTC 2013
[06:56:23] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[07:02:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:03:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[07:24:54] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 07:24:52 UTC 2013
[07:25:24] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[07:48:32] goood morniiing
[07:48:49] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[07:48:49] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[07:48:49] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[07:48:49] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[07:48:49] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[07:48:50] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[07:48:50] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[07:53:49] PROBLEM - Puppet freshness on ms-fe1002 is CRITICAL: No successful Puppet run in the last 10 hours
[07:54:39] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 07:54:35 UTC 2013
[07:55:29] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[07:57:06] mark: morning. Following gitblit (serving git.wikimedia.org) issue yesterday: there is now an init script, jstack is fixed in puppet, puppet will ensure service is running AND the nasty web spiders have been blocked on apache frontend =)
[07:59:49] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: No successful Puppet run in the last 10 hours
[08:04:49] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: No successful Puppet run in the last 10 hours
[08:20:55] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: No successful Puppet run in the last 10 hours
[08:24:55] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 08:24:49 UTC 2013
[08:25:25] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[08:54:51] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 08:54:46 UTC 2013
[08:55:31] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[09:09:16] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 182 seconds
[09:10:16] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 6 seconds
[09:13:46] RECOVERY - search indices - check lucene status page on search1013 is OK: HTTP OK: HTTP/1.1 200 OK - 747 bytes in 0.006 second response time
[09:25:06] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 09:24:58 UTC 2013
[09:25:26] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[09:28:23] (PS4) Hashar: replicate Gerrit repos to Jenkins slave lanthanum [operations/puppet] - https://gerrit.wikimedia.org/r/75499
[09:28:24] (PS4) Hashar: replicate Gerrit repos to Jenkins slave gallium [operations/puppet] - https://gerrit.wikimedia.org/r/75500
[09:29:02] (CR) Hashar: "renamed Gerrit replication group 'jenkins-lanthanum' to 'jenkins-slaves'" [operations/puppet] - https://gerrit.wikimedia.org/r/75499 (owner: Hashar)
[09:29:44] (CR) Hashar: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/75500 (owner: Hashar)
[09:30:21] (CR) Ori.livneh: "> I haven't looked, but I'd assume since so much effort went in to this it is more complete and robust than git::clone." [operations/puppet] - https://gerrit.wikimedia.org/r/74099 (owner: Andrew Bogott)
[09:30:43] (CR) Hashar: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/75500 (owner: Hashar)
[09:43:17] PROBLEM - udp2log log age for gadolinium on gadolinium is CRITICAL: CRITICAL: log files /a/log/webrequest/packet-loss.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days.
[09:46:39] (CR) TTO: "In http://lists.wikimedia.org/pipermail/wikitech-l/2013-July/070716.html , James said he would be instating an "opt-out" user preference. " [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/73565 (owner: Odder)
[09:53:19] (CR) Matmarex: "No, that's in fact exactly what this patch does." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/73565 (owner: Odder)
[09:54:47] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 09:54:37 UTC 2013
[09:55:27] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[10:06:59] (PS1) MaxSem: Switch translation memory from vanadium to zinc [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75566
[10:16:37] (CR) MaxSem: [C: 2] Switch translation memory from vanadium to zinc [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75566 (owner: MaxSem)
[10:17:01] (Merged) jenkins-bot: Switch translation memory from vanadium to zinc [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75566 (owner: MaxSem)
[10:24:18] !log Attempting to migrate TTM to zinc
[10:24:29] Logged the message, Master
[10:24:52] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 10:24:48 UTC 2013
[10:25:32] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[10:25:54] o_0
[10:37:16] PROBLEM - udp2log log age for gadolinium on gadolinium is CRITICAL: CRITICAL: log files /a/log/webrequest/packet-loss.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days.
[10:48:56] hashar: cool :)
[10:49:21] :D
[10:54:52] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 10:54:39 UTC 2013
[10:55:26] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[10:55:47] (PS1) Mark Bergsma: Install new servers as Text Varnish caches [operations/puppet] - https://gerrit.wikimedia.org/r/75572
[10:55:49] (CR) TTO: [C: 1] "yep, probably needs a rebase by now though" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/67953 (owner: Raimond Spekking)
[10:56:05] (CR) jenkins-bot: [V: -1] Install new servers as Text Varnish caches [operations/puppet] - https://gerrit.wikimedia.org/r/75572 (owner: Mark Bergsma)
[10:56:49] yeahhhh err: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not parse for environment production: No file(s) found for import of '../private/manifests/passwords.pp' at /etc/puppet/manifests/base.pp:10 on node i-00000778.pmtpa.wmflabs
[10:57:11] (PS2) Mark Bergsma: Install new servers as Text Varnish caches [operations/puppet] - https://gerrit.wikimedia.org/r/75572
[10:57:28] (CR) jenkins-bot: [V: -1] Install new servers as Text Varnish caches [operations/puppet] - https://gerrit.wikimedia.org/r/75572 (owner: Mark Bergsma)
[10:57:51] (PS1) MaxSem: Revert "Switch translation memory from vanadium to zinc" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75573
[10:58:07] (CR) MaxSem: [C: 2] Revert "Switch translation memory from vanadium to zinc" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75573 (owner: MaxSem)
[10:58:14] (Merged) jenkins-bot: Revert "Switch translation memory from vanadium to zinc" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75573 (owner: MaxSem)
[10:58:33] (PS3) Mark Bergsma: Install new servers as Text Varnish caches [operations/puppet] - https://gerrit.wikimedia.org/r/75572
[10:58:49] !log Aborted TTM migration, will file a bug report
[10:59:00] Logged the message, Master
[10:59:34] (CR) Mark Bergsma: [C: 2] Install new servers as Text Varnish caches [operations/puppet] - https://gerrit.wikimedia.org/r/75572 (owner: Mark Bergsma)
[10:59:35] (Merged) Mark Bergsma: Install new servers as Text Varnish caches [operations/puppet] - https://gerrit.wikimedia.org/r/75572 (owner: Mark Bergsma)
[11:12:28] (PS1) Mark Bergsma: Generate ganglia plugin conf after starting Varnish [operations/puppet] - https://gerrit.wikimedia.org/r/75575
[11:12:31] MaxSem: are you happy with that mobile vary change now?
[11:13:25] (PS2) Mark Bergsma: Generate ganglia plugin conf after starting Varnish [operations/puppet] - https://gerrit.wikimedia.org/r/75575
[11:13:26] mark, I've prepared the PHP counterpart in https://gerrit.wikimedia.org/r/#/c/75362/
[11:13:39] thanks
[11:13:52] so varnish won't use XVO (yet), but at least we can use the info from the header to create the VCL
[11:14:00] and squid will work ;)
[11:14:10] mark, my only concern for your change is that we currently have another cookie mf_alpha which we will deprecate soon
[11:14:19] (aka most likely next week)
[11:14:20] I can add that
[11:14:22] I just didn't know about it
[11:15:01] hrm
[11:15:04] I just realized something
[11:15:12] using one regsuball does not always give the same order
[11:15:18] mark, hmm, and after we ditch it optin will have values other than 1
[11:15:21] perhaps I should do one regsub per cookie
[11:15:37] MaxSem: that's ok, just change the regex
[11:16:17] (CR) Mark Bergsma: [C: 2] Generate ganglia plugin conf after starting Varnish [operations/puppet] - https://gerrit.wikimedia.org/r/75575 (owner: Mark Bergsma)
[11:16:18] (Merged) Mark Bergsma: Generate ganglia plugin conf after starting Varnish [operations/puppet] - https://gerrit.wikimedia.org/r/75575 (owner: Mark Bergsma)
[11:16:33] we'll have to let mf_alpha pass through varnish for some time after that, for the migration period
[11:21:10] I got a really weird varnish cache issue. An http://foo/ url is a redirect to the https version. But the https cached version ends up being the same as the http version
[11:21:27] so https version is cached as being a redirect to … the https version. Hence a nice loop
[11:21:34] and I have zero clue where it comes from :(
[11:22:14] the good thing is I managed to reproduce it constantly: https://bugzilla.wikimedia.org/show_bug.cgi?id=51700#c8 :-]
[11:22:46] hashar: so we use protocol relative URLs right, no separate http/https caching
[11:24:19] so if I ask varnish for http://foo or https://foo it serves the exact same cached copy ?
[11:24:22] like it is shared?
[11:24:35] yes
[11:24:45] hmm
[11:24:59] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 11:24:49 UTC 2013
[11:24:59] the https is handled by nginx anyway, which does the http:// with X-Fowarded-Proto: https
[11:25:22] so varnish might have to vary on X-Fowarded-Proto maybe
[11:25:29] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[11:26:11] it could, but we generally avoid it
[11:26:16] as this would fragment the cache
[11:26:50] redirects are varied though, iirc
[11:26:51] (PS1) Mark Bergsma: Install extra VCL files before starting Varnish [operations/puppet] - https://gerrit.wikimedia.org/r/75576
[11:27:00] yeah
[11:27:32] (CR) Mark Bergsma: [C: 2] Install extra VCL files before starting Varnish [operations/puppet] - https://gerrit.wikimedia.org/r/75576 (owner: Mark Bergsma)
[11:27:34] (Merged) Mark Bergsma: Install extra VCL files before starting Varnish [operations/puppet] - https://gerrit.wikimedia.org/r/75576 (owner: Mark Bergsma)
[11:33:44] any idea where the redirect vary is handled ?
[11:37:56] depends on what sends out the redirect eh
[11:38:04] mediawiki or apache
[11:39:00] the way I reproduce the issue is listed at https://bugzilla.wikimedia.org/show_bug.cgi?id=51700#c8 . Basically on a purged cache, the http:// query yields Location: https. When querying the https:// I am served the Location: https.
[11:39:11] makes a lot of sense
[11:39:35] the http from the client goes directly to the varnish frontend. the https:// one passes via nginx, which sets X-Fowarded-Proto: https and then does a http query on the varnish frontend
[11:39:57] yes
[11:40:00] the http:// + X-Fowarded-For: https ought to not be a redirect to https.
[11:40:07] so whatever serves that page (check the headers) needs to add a vary header
[11:41:01] Vary: X-Fowarded-Proto ?
[11:41:06] yep
[11:41:19] X-FoRwarded-Proto
[11:41:21] I think that is my apache conf which does the 301
[11:41:32] upper case R ?
[11:41:36] * hashar smiles
[11:41:40] no, but you forgot it a few times ;)
[11:41:51] ah yeah
[11:42:09] * hashar attempts to blame Apple spelling check
[11:42:18] * mark blames the french
[11:43:18] of course
[11:43:25] the redirect needs to be conditional as well
[11:43:36] a vary header makes no difference if apache is configured to redirect https to https ;)
[11:43:45] so it should only do that for XFP == "http"
[11:44:11] RewriteEngine On
[11:44:12] RewriteCond %{HTTP:X-Forwarded-Proto} !https
[11:44:12] RewriteRule ^/(.*)$ https://login.wikimedia.beta.wmflabs.org/$1 [R=301,L]
[11:44:49] that comes from our apache conf https://git.wikimedia.org/blob/operations%2Fapache-config.git/ca2f6fd3740adcf81ec0de87c9fd30d66a802f40/wikimedia.conf#L157
[11:45:54] so either apache does not receive the X-Fwd-Proto or it does not properly handle it. fun.
[11:46:20] or it doesn't also send a vary header
[11:47:24] so apache might serve different content but varnish would never query it because it was not instructed to vary ?
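[Editor's note] The diagnosis above is that nginx terminates TLS and re-issues the request over plain http with X-Forwarded-Proto: https, so without a Vary header the http→https redirect and the https page share one cache object. One way to express the discussed fix on the Varnish side is to force 30x responses with an absolute Location to vary on X-Forwarded-Proto. A minimal sketch, assuming Varnish 3 VCL syntax as used on these caches at the time; this is an illustration of the idea, not the actual content of the Gerrit change:

```vcl
# Sketch only: cache the http and https variants of a redirect separately.
sub vcl_fetch {
    if (beresp.status >= 300 && beresp.status < 400
        && beresp.http.Location ~ "^http") {
        if (beresp.http.Vary) {
            # Preserve any Vary the origin already sent.
            set beresp.http.Vary = beresp.http.Vary + ",X-Forwarded-Proto";
        } else {
            set beresp.http.Vary = "X-Forwarded-Proto";
        }
    }
}
```

The `Location ~ "^http"` guard matches only absolute-URI redirects; a hypothetical protocol-relative redirect (e.g. `Location: //host/path`) would not need the extra Vary, which is the point debated further down in the log.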
[11:49:35] varnish would cache either the http or the https version
[11:49:39] and not distinguish them
[11:49:48] so it depends on the object currently cached, whether you get a redirect loop or not
[11:50:00] that is what I noticed
[11:50:07] figuring out what apache sends me back
[11:54:39] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 11:54:37 UTC 2013
[11:55:30] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[11:56:13] well
[11:56:19] if the vary header config is not right up there with the rest
[11:56:25] i don't see how it would send a Vary header ;)
[12:01:08] so apache works properly but does not send the Vary:
[12:01:26] I am afraid we have the same issue in production :/
[12:01:27] https://git.wikimedia.org/blob/operations%2Fapache-config.git/ca2f6fd3740adcf81ec0de87c9fd30d66a802f40/wikimedia.conf#L157
[12:01:33] since I copied the apache conf from there
[12:02:56] curl -s -i https://login.wikimedia.org/|grep Vary:
[12:02:57] Vary: Accept-Encoding,X-Forwarded-Proto,Cookie
[12:02:57] \O/
[12:08:08] PROBLEM - Varnish HTTP text-backend on cp1065 is CRITICAL: Connection refused
[12:10:08] yes likely
[12:10:24] i can't find how it is setup
[12:14:45] (PS1) Mark Bergsma: Add new Text Varnish servers [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75581
[12:15:29] (PS2) Mark Bergsma: Add new Text Varnish servers [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75581
[12:15:57] (CR) Mark Bergsma: [C: 2] Add new Text Varnish servers [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75581 (owner: Mark Bergsma)
[12:15:58] (Merged) Mark Bergsma: Add new Text Varnish servers [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75581 (owner: Mark Bergsma)
[12:16:58] !log mark synchronized wmf-config/squid.php
[12:17:09] Logged the message, Master
[12:19:08] RECOVERY - Varnish HTTP text-backend on cp1065 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.000 second response time
[12:22:52] mark: I think i found out. In production the http:// redirecting url is not cached by the squid caches. curl http://login.wikimedia.org/ -i yields MISS
[12:23:04] whereas on the beta varnish cache, that is cached
[12:23:18] that's probably the reason why it wasn't noticed before, indeed
[12:23:21] so on squid, the http:// always redirects to the https:// version which is always the version served by apache (and sends you to the main page)
[12:23:29] whereas on varnish, the http:// is cached
[12:23:34] right
[12:23:37] and overrides the https:// version
[12:23:38] so we should fix the apache config
[12:23:46] so that would potentially be an issue on varnish text whenever you deploy them.
[12:23:56] in an hour ;)
[12:23:59] you want to get the redirect cached dont you ?
[12:24:06] yes
[12:24:43] so the apache boxes need mod_headers and we have to add a Vary to all the redirects :/
[12:24:48] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 12:24:39 UTC 2013
[12:24:55] damn it took me a while to figure out all of that
[12:25:29] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[12:27:57] yes
[12:28:07] do you want to write an email to some lists about that?
[12:28:17] wrong question I guess
[12:28:21] you don't want to, but could you? ;)
[12:28:59] I think someone was trying to figure that out in the past
[12:29:01] I have updated my bug as I was investigating the issue. Wrote a quick summary at https://bugzilla.wikimedia.org/show_bug.cgi?id=51700#c11
[12:29:14] and ended up deciding that it's not possible with apache or something
[12:29:20] I don't remember more details
[12:29:38] PROBLEM - Host cp1052 is DOWN: PING CRITICAL - Packet loss = 100%
[12:29:48] sorry, just saw the backlog
[12:29:55] don't hate me :>
[12:30:12] well I have already spent a bunch of yesterday afternoon on that issue
[12:30:35] is that your way of saying that you do hate me?
[12:30:35] :) [12:30:44] ROFL [12:30:48] RECOVERY - Host cp1052 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [12:30:48] PROBLEM - Host cp1053 is DOWN: PING CRITICAL - Packet loss = 100% [12:31:03] I would have hated you if you said that after I change the patch in Gerrit :-] [12:31:18] … I submit the patch in Gerrit .. [12:31:28] not a bit deal, I have learned a lot along the way [12:31:48] RECOVERY - Host cp1053 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [12:31:58] PROBLEM - Host cp1054 is DOWN: PING CRITICAL - Packet loss = 100% [12:32:28] PROBLEM - Host cp1055 is DOWN: PING CRITICAL - Packet loss = 100% [12:32:48] RECOVERY - Host cp1054 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [12:33:38] RECOVERY - Host cp1055 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [12:33:59] PROBLEM - Host cp1065 is DOWN: PING CRITICAL - Packet loss = 100% [12:34:58] PROBLEM - Host cp1066 is DOWN: PING CRITICAL - Packet loss = 100% [12:35:08] RECOVERY - Host cp1065 is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [12:35:09] (PS1) Mark Bergsma: Fix up Vary headers on 30x redirects from Apache [operations/puppet] - https://gerrit.wikimedia.org/r/75583 [12:35:10] hashar: ^ [12:35:52] (PS2) Hashar: Fix up Vary headers on 30x redirects from Apache [operations/puppet] - https://gerrit.wikimedia.org/r/75583 (owner: Mark Bergsma) [12:35:58] RECOVERY - Host cp1066 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [12:35:59] PROBLEM - Host cp1067 is DOWN: PING CRITICAL - Packet loss = 100% [12:36:01] referenced bug 51700 :) [12:36:20] :) [12:36:47] in apache we could potentially set an env variable in the RewriteRule and send the Vary: header whenever that env is set [12:36:56] but that is scary and need to be done all other the place [12:37:08] PROBLEM - Host cp1068 is DOWN: PING CRITICAL - Packet loss = 100% [12:37:15] another way would be to let MediaWiki handle them, but that is too far in the layers i guess [12:37:18] PROBLEM - Varnish HTTP text-frontend on cp1065 is 
CRITICAL: Connection refused [12:37:24] mark: why obj.http.Location ~ "^http" ? [12:37:28] PROBLEM - Varnish traffic logger on cp1065 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [12:37:28] RECOVERY - Host cp1067 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [12:37:43] i don't know if protocol relative redirects are possible [12:37:49] but if they are, we don't need the vary header there [12:38:04] I don't think so [12:38:18] RECOVERY - Host cp1068 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [12:38:27] so potentially we could test that patch in beta but puppet is broken in labs currently. [12:38:45] (PS3) Mark Bergsma: Fix up Vary headers on 30x redirects from Apache [operations/puppet] - https://gerrit.wikimedia.org/r/75583 [12:39:18] RECOVERY - Varnish HTTP text-frontend on cp1065 is OK: HTTP OK: HTTP/1.1 200 OK - 197 bytes in 0.001 second response time [12:39:29] RECOVERY - Varnish traffic logger on cp1065 is OK: PROCS OK: 2 processes with command name varnishncsa [12:39:48] PROBLEM - NTP on cp1068 is CRITICAL: NTP CRITICAL: Offset unknown [12:39:48] The [12:39:48] field value consists of a single absolute URI. 
[12:39:48] Location = "Location" ":" absoluteURI [12:39:53] http://tools.ietf.org/html/rfc2616#section-14.30 [12:40:06] wikipedia says though [12:40:07] This example, is incorrect according to the current standard, which specifies the URI returned to be absolute.[7] However, all popular browsers will accept a relative URL[citation needed], and it is correct according to the upcoming revision of HTTP/1.1.[8] [12:40:29] i was just looking up the same [12:40:31] then you get some weird mobile browser that does not support it :D [12:40:45] so if relative URLs will be possible, let's keep that in there ;) [12:40:52] it doesn't hurt [12:41:48] RECOVERY - NTP on cp1068 is OK: NTP OK: Offset 0.001034259796 secs [12:41:48] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [12:42:21] nod [12:47:19] PROBLEM - NTP on cp1054 is CRITICAL: NTP CRITICAL: Offset unknown [12:47:48] (CR) Mark Bergsma: [C: 2] Fix up Vary headers on 30x redirects from Apache [operations/puppet] - https://gerrit.wikimedia.org/r/75583 (owner: Mark Bergsma) [12:48:02] (Merged) Mark Bergsma: Fix up Vary headers on 30x redirects from Apache [operations/puppet] - https://gerrit.wikimedia.org/r/75583 (owner: Mark Bergsma) [12:48:18] PROBLEM - NTP on cp1055 is CRITICAL: NTP CRITICAL: Offset unknown [12:48:33] argh [12:49:10] I can't test out your fix in beta, the labs puppetmaster is broken :D [12:49:21] i can test it in production :) [12:49:26] (PS1) Mark Bergsma: Fix typo [operations/puppet] - https://gerrit.wikimedia.org/r/75584 [12:49:30] i have all these test servers to play with [12:50:08] (CR) Mark Bergsma: [C: 2] Fix typo [operations/puppet] - https://gerrit.wikimedia.org/r/75584 (owner: Mark Bergsma) [12:50:12] (Merged) Mark Bergsma: Fix typo [operations/puppet] - https://gerrit.wikimedia.org/r/75584 (owner: Mark Bergsma) [12:51:44] argh [12:52:07] out for a snack [12:52:18] RECOVERY - NTP on cp1054 is OK: NTP OK: Offset 0.001305222511 secs [12:52:53] 
(PS1) Mark Bergsma: Use beresp. instead of obj. in vcl_fetch [operations/puppet] - https://gerrit.wikimedia.org/r/75585 [12:52:54] * mark in with noodles [12:53:18] RECOVERY - NTP on cp1055 is OK: NTP OK: Offset 0.001746296883 secs [12:53:56] when will jenkins do VCL tests? ;-) [12:54:14] whenever someone figures out how to expand the templates [12:54:16] (CR) Mark Bergsma: [C: 2] Use beresp. instead of obj. in vcl_fetch [operations/puppet] - https://gerrit.wikimedia.org/r/75585 (owner: Mark Bergsma) [12:54:20] (Merged) Mark Bergsma: Use beresp. instead of obj. in vcl_fetch [operations/puppet] - https://gerrit.wikimedia.org/r/75585 (owner: Mark Bergsma) [12:54:35] I think varnish has some unit testing suite [12:54:41] more after I grab a snack [12:54:58] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 12:54:52 UTC 2013 [12:55:28] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [12:59:57] (PS1) Mark Bergsma: Use hit_for_pass for TTL <= 0s objects [operations/puppet] - https://gerrit.wikimedia.org/r/75586 [13:00:04] i wonder if/why we're not seeing that problem on mobile [13:00:50] (CR) Mark Bergsma: [C: 2] Use hit_for_pass for TTL <= 0s objects [operations/puppet] - https://gerrit.wikimedia.org/r/75586 (owner: Mark Bergsma) [13:00:53] (Merged) Mark Bergsma: Use hit_for_pass for TTL <= 0s objects [operations/puppet] - https://gerrit.wikimedia.org/r/75586 (owner: Mark Bergsma) [13:06:34] re [13:07:10] mark: so if you figure out a way to generate vcl files, we could run the varnish syntax check on each of them [13:07:51] yeah kinda difficult since they're erb [13:11:24] varnish has some testing support (varnishtest), example input https://github.com/varnish/Varnish-Cache/blob/master/bin/varnishtest/tests/v00006.vtc [13:12:11] (PS1) Mark Bergsma: Do XFF appends, use pass for POST requests [operations/puppet] - https://gerrit.wikimedia.org/r/75587 [13:12:54] i think varnish is not the problem, but getting 
the VCL files expanded from the erb templates [13:14:41] (PS2) Mark Bergsma: Do XFF appends, use pass for POST requests [operations/puppet] - https://gerrit.wikimedia.org/r/75587 [13:14:56] hashar, I'm looking at the puppet/password issue [13:15:03] \O/ [13:15:19] (CR) Manybubbles: "I'm not really sure about this commit because it looks like it defines a second copy of the role::db::labsdb. I added Peter because he se" [operations/puppet] - https://gerrit.wikimedia.org/r/74158 (owner: coren) [13:15:25] andrewbogott: the bug is https://bugzilla.wikimedia.org/show_bug.cgi?id=51955 [13:15:37] (CR) Mark Bergsma: [C: 2] Do XFF appends, use pass for POST requests [operations/puppet] - https://gerrit.wikimedia.org/r/75587 (owner: Mark Bergsma) [13:15:38] (Merged) Mark Bergsma: Do XFF appends, use pass for POST requests [operations/puppet] - https://gerrit.wikimedia.org/r/75587 (owner: Mark Bergsma) [13:15:39] yeps, I see it. Thanks. [13:16:42] heh, of course it runs fine on the instance I was using as a canary yesterday :( [13:17:11] (PS1) Mark Bergsma: Restore cookies on pass requests as well, to be sure. [operations/puppet] - https://gerrit.wikimedia.org/r/75588 [13:17:50] (CR) Mark Bergsma: [C: 2] Restore cookies on pass requests as well, to be sure. [operations/puppet] - https://gerrit.wikimedia.org/r/75588 (owner: Mark Bergsma) [13:17:51] (Merged) Mark Bergsma: Restore cookies on pass requests as well, to be sure. 
[operations/puppet] - https://gerrit.wikimedia.org/r/75588 (owner: Mark Bergsma) [13:23:55] (PS1) Mark Bergsma: Properly handle request restarts with cookie munging and restoration [operations/puppet] - https://gerrit.wikimedia.org/r/75589 [13:24:51] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 13:24:41 UTC 2013 [13:25:32] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [13:30:03] (PS1) Mark Bergsma: Filter out the Orig-Cookie header if coming from clients [operations/puppet] - https://gerrit.wikimedia.org/r/75590 [13:30:55] (CR) Mark Bergsma: [C: 2] Properly handle request restarts with cookie munging and restoration [operations/puppet] - https://gerrit.wikimedia.org/r/75589 (owner: Mark Bergsma) [13:30:56] (Merged) Mark Bergsma: Properly handle request restarts with cookie munging and restoration [operations/puppet] - https://gerrit.wikimedia.org/r/75589 (owner: Mark Bergsma) [13:31:17] (CR) Mark Bergsma: [C: 2] Filter out the Orig-Cookie header if coming from clients [operations/puppet] - https://gerrit.wikimedia.org/r/75590 (owner: Mark Bergsma) [13:31:31] (Merged) Mark Bergsma: Filter out the Orig-Cookie header if coming from clients [operations/puppet] - https://gerrit.wikimedia.org/r/75590 (owner: Mark Bergsma) [13:32:13] hmm mobile uses X-Orig-Cookie [13:32:19] perhaps we should just use the same for consistency [13:32:55] MaxSem: do you know if MobileFrontend inspects/uses the X-Orig-Cookie header? 
[13:33:32] mark, it doesn't [13:33:54] so perhaps it would be better to do what we do on text varnish: modify the cookie header for caching, but restore before sending to mediawiki [13:34:30] so mediawiki always receives the original cookie sent by the client, but varnish has a cleaned up version for caching (vary) [13:35:33] (PS1) Andrew Bogott: On the labs puppetmaster, link to labs private repo [operations/puppet] - https://gerrit.wikimedia.org/r/75592 [13:35:38] makes sense [13:36:36] afk [13:36:37] (CR) Andrew Bogott: [C: 2] On the labs puppetmaster, link to labs private repo [operations/puppet] - https://gerrit.wikimedia.org/r/75592 (owner: Andrew Bogott) [13:36:38] (Merged) Andrew Bogott: On the labs puppetmaster, link to labs private repo [operations/puppet] - https://gerrit.wikimedia.org/r/75592 (owner: Andrew Bogott) [13:37:42] mark, I just merged one of your changes [13:38:34] (CR) Hashar: "That is for bug https://bugzilla.wikimedia.org/show_bug.cgi?id=51955" [operations/puppet] - https://gerrit.wikimedia.org/r/75592 (owner: Andrew Bogott) [13:41:41] hashar, now I get a new and different error. Is that legit, or still a puppetmaster failure? 
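The scheme mark describes for text varnish, caching against a cleaned-up Cookie while MediaWiki always receives the original, might look roughly like this in VCL. This is a hedged sketch: the header name, hook placement, and the fact that everything is stripped (rather than whitelisted) are illustrative, not the production config.

```vcl
sub vcl_recv {
    /* Stash the client's cookies, then reduce Cookie to whatever
       should participate in cache variance. (Illustrative: the real
       config keeps a whitelist rather than unsetting everything.) */
    set req.http.Orig-Cookie = req.http.Cookie;
    unset req.http.Cookie;
}

sub vcl_miss {
    /* Restore before fetching, so the backend sees exactly what the
       client sent. Per the "Restore cookies on pass requests as well"
       change above, the same restore belongs in vcl_pass too. */
    if (req.http.Orig-Cookie) {
        set req.http.Cookie = req.http.Orig-Cookie;
        unset req.http.Orig-Cookie;
    }
}
```

The later "Filter out the Orig-Cookie header if coming from clients" change closes the obvious hole in this pattern: a client could send Orig-Cookie itself, so it has to be stripped on entry.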
[13:42:26] (PS1) Mark Bergsma: Split off pass requests from cookie munging [operations/puppet] - https://gerrit.wikimedia.org/r/75593 [13:42:28] andrewbogott: ok, sorry [13:42:34] * andrewbogott thinks it is probably legit [13:43:01] mark, no problem, I'm just working too fast trying to get hashar unstuck [13:43:28] and i'm doing many changes in succession, so one got forgotten;) [13:44:18] (PS2) Mark Bergsma: Split off pass requests from cookie munging [operations/puppet] - https://gerrit.wikimedia.org/r/75593 [13:45:36] (CR) Mark Bergsma: [C: 2] Split off pass requests from cookie munging [operations/puppet] - https://gerrit.wikimedia.org/r/75593 (owner: Mark Bergsma) [13:45:37] (Merged) Mark Bergsma: Split off pass requests from cookie munging [operations/puppet] - https://gerrit.wikimedia.org/r/75593 (owner: Mark Bergsma) [13:53:47] (PS1) Andrew Bogott: Virt0 puppet symlinks should point to /root/testrepo/* [operations/puppet] - https://gerrit.wikimedia.org/r/75596 [13:54:02] (CR) jenkins-bot: [V: -1] Virt0 puppet symlinks should point to /root/testrepo/* [operations/puppet] - https://gerrit.wikimedia.org/r/75596 (owner: Andrew Bogott) [13:54:46] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 13:54:41 UTC 2013 [13:55:27] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [13:59:14] (PS2) Andrew Bogott: Virt0 puppet symlinks should point to /root/testrepo/* [operations/puppet] - https://gerrit.wikimedia.org/r/75596 [13:59:20] andrewbogott: btw, the private repo is a good candidate for a module ;) [13:59:36] indeed [14:00:13] (PS3) Andrew Bogott: Virt0 puppet symlinks should point to /root/testrepo/* [operations/puppet] - https://gerrit.wikimedia.org/r/75596 [14:01:11] (CR) Andrew Bogott: [C: 2] Virt0 puppet symlinks should point to /root/testrepo/* [operations/puppet] - https://gerrit.wikimedia.org/r/75596 (owner: Andrew Bogott) [14:01:12] (Merged) Andrew Bogott: Virt0 puppet symlinks should point 
to /root/testrepo/* [operations/puppet] - https://gerrit.wikimedia.org/r/75596 (owner: Andrew Bogott) [14:03:16] so... [14:03:25] maybe while I'm at it i should add nginx on all text varnish servers [14:03:40] for https only, not ipv6 of course [14:04:07] a bit easier than when there's traffic on them [14:08:36] sh is going to suck [14:08:54] why? [14:09:00] btw, I played a bit with weight since the ssl1005/6 are faster boxes [14:09:03] it makes no difference at all [14:09:17] I even had it at 75 vs. 25 and they got the exact same amount of traffic [14:09:30] I should refresh my wcsh balancer [14:09:34] oh also I suggested that HT might make a huge difference for SSL and Ryan tested that [14:09:54] and it made a 2x difference [14:10:00] in terms of what? [14:10:02] amazingly [14:10:16] that's what he said, I presume it halved the CPU load? [14:10:21] lol [14:10:24] well [14:10:33] I guess it would do that if ganglia counted those cpus [14:10:33] of course it's half the cpu load [14:10:35] as separate [14:10:58] did requests/s go up? [14:11:17] I have no idea what tests he ran [14:11:25] and I'm clearly not in a state to relay information :) [14:11:35] (PS1) Ottomata: Including fundraising::udp2log_rotation and accounts::file_mover on erbium. [operations/puppet] - https://gerrit.wikimedia.org/r/75599 [14:12:00] (CR) Ottomata: [C: 2 V: 2] Including fundraising::udp2log_rotation and accounts::file_mover on erbium. [operations/puppet] - https://gerrit.wikimedia.org/r/75599 (owner: Ottomata) [14:12:00] (Merged) Ottomata: Including fundraising::udp2log_rotation and accounts::file_mover on erbium. 
[operations/puppet] - https://gerrit.wikimedia.org/r/75599 (owner: Ottomata) [14:12:49] (PS1) Mark Bergsma: Fix indentation [operations/puppet] - https://gerrit.wikimedia.org/r/75600 [14:13:47] (CR) Mark Bergsma: [C: 2] Fix indentation [operations/puppet] - https://gerrit.wikimedia.org/r/75600 (owner: Mark Bergsma) [14:13:48] (Merged) Mark Bergsma: Fix indentation [operations/puppet] - https://gerrit.wikimedia.org/r/75600 (owner: Mark Bergsma) [14:14:04] PROBLEM - RAID on erbium is CRITICAL: Connection refused by host [14:14:24] RECOVERY - udp2log log age for erbium on erbium is OK: OK: all log files active [14:14:35] PROBLEM - Puppetmaster HTTPS on virt0 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error [14:15:34] so now it's 50 vs. 25 [14:15:45] 50 for 1005/1006, 25 for 1001-4 [14:16:05] mark: I noticed yesterday we still have a varnishhtcpd upstart job deployed. It must be a dupe of the vhtcpd init script provided by the Debian package ( https://gerrit.wikimedia.org/r/#/c/75323/ ) :) [14:16:28] and they're getting about the same traffic [14:16:33] +/- 5% [14:17:03] hashar: isn't that for the old perl script? [14:17:04] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours [14:17:26] mark: I have no clue [14:17:40] I don't think we ever used the name varnishhtcpd for the C version [14:17:53] exec /usr/local/bin/varnishhtcpd [14:18:08] might have been from a time when we compiled our own version and did not rely on a deb package? [14:18:14] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 186 seconds [14:18:46] I see a .conf but I don't see it referenced from anywhere? 
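The check being made here: in this repo, a file under files/upstart/ only lands on a host if some resource references it, via the upstart_job pattern mentioned just above. A sketch of what such a reference would look like (hypothetical usage, mirroring the syntax quoted later in the log):

```puppet
# Hypothetical reference; nothing like this remains after change 68687
# removed the Perl daemon, so files/upstart/varnishhtcpd.conf is dead code.
upstart_job { 'varnishhtcpd': }
```

With no such resource left anywhere in the manifests, the orphaned .conf can be deleted outright, which is what hashar's change 75323 does.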
[14:19:14] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 15 seconds [14:19:25] there's files/upstart/varnishhtcpd.conf but no upstart_job { 'varnishtcpd': } that I can see [14:19:32] I guess the upstart job got forgotten when the script got removed [14:20:16] https://gerrit.wikimedia.org/r/#/c/68687/ remove obsolete Perl HTCP purger daemon [14:20:54] (PS2) Hashar: get rid of varnishhtcpd upstart job [operations/puppet] - https://gerrit.wikimedia.org/r/75323 [14:21:05] heh, I was about to push [14:21:06] amended to reference the removal change [14:22:02] (CR) Faidon: [C: 2] get rid of varnishhtcpd upstart job [operations/puppet] - https://gerrit.wikimedia.org/r/75323 (owner: Hashar) [14:22:07] (Merged) Faidon: get rid of varnishhtcpd upstart job [operations/puppet] - https://gerrit.wikimedia.org/r/75323 (owner: Hashar) [14:22:12] \O/ [14:25:07] (PS1) Mark Bergsma: Setup NGINX for HTTPS on the Varnish servers [operations/puppet] - https://gerrit.wikimedia.org/r/75601 [14:25:14] PROBLEM - udp2log log age for gadolinium on gadolinium is CRITICAL: CRITICAL: log files /a/log/webrequest/packet-loss.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. [14:26:04] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 14:26:01 UTC 2013 [14:26:34] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [14:28:40] (PS1) Rillke: Update CommonSettings to reflect changes in UpWiz [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75602 [14:28:52] hmm [14:29:10] hm? 
[14:29:11] so the nginx configuration template for the protoproxies doesn't live in the protoproxy module, but in the nginx module [14:29:14] that's yuck [14:29:23] ew [14:30:16] the template itself is really yuck too [14:30:26] it has a case statement per $::site [14:32:03] ah it's not in the nginx module [14:32:07] it's in the main templates/nginx dir [14:32:15] that's less bad ;-) [14:35:34] (PS2) Mark Bergsma: Setup NGINX for HTTPS on the Varnish servers [operations/puppet] - https://gerrit.wikimedia.org/r/75601 [14:35:51] (CR) jenkins-bot: [V: -1] Setup NGINX for HTTPS on the Varnish servers [operations/puppet] - https://gerrit.wikimedia.org/r/75601 (owner: Mark Bergsma) [14:36:03] i'm gonna have to rewrite that template [14:39:23] paravoid, do you know anything about how the puppetmaster on virt0 is/was set up? [14:41:14] PROBLEM - udp2log log age for gadolinium on gadolinium is CRITICAL: CRITICAL: log files /a/log/webrequest/packet-loss.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. [14:41:46] not much, no [14:41:49] what are you looking for [14:42:42] My refactors yesterday broke it and I can't figure out what's happening. Right now it responds with a big block of html [14:42:52] complaining that it can't create /etc/puppet/manifests [14:43:00] which, I don't understand why it would want to create it... [14:43:04] PROBLEM - Puppet freshness on gadolinium is CRITICAL: No successful Puppet run in the last 10 hours [14:43:20] …from which I conclude that it is configured /very/ differently from sockpuppet or stafford [14:43:44] (CR) Yuvipanda: [C: 1] "Whoops, I missed that." 
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75602 (owner: Rillke) [14:47:24] PROBLEM - Puppetmaster HTTPS on virt1000 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error [14:48:15] (PS1) Ottomata: Putting erbium in ganglia udp2log view [operations/puppet] - https://gerrit.wikimedia.org/r/75604 [14:48:27] (CR) Ottomata: [C: 2 V: 2] Putting erbium in ganglia udp2log view [operations/puppet] - https://gerrit.wikimedia.org/r/75604 (owner: Ottomata) [14:48:28] (Merged) Ottomata: Putting erbium in ganglia udp2log view [operations/puppet] - https://gerrit.wikimedia.org/r/75604 (owner: Ottomata) [14:52:44] PROBLEM - Varnish HTTP mobile-backend on cp3012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:52:44] PROBLEM - Varnish HTCP daemon on cp3012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:53:14] PROBLEM - Varnish traffic logger on cp3012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:54:14] PROBLEM - udp2log log age for gadolinium on gadolinium is CRITICAL: CRITICAL: log files /a/log/webrequest/packet-loss.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. 
[14:55:04] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 14:54:54 UTC 2013 [14:55:34] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [14:56:32] (PS1) Mark Bergsma: Add a lean definition (and custom template) for SSL proxies to localhost [operations/puppet] - https://gerrit.wikimedia.org/r/75606 [14:56:48] (CR) jenkins-bot: [V: -1] Add a lean definition (and custom template) for SSL proxies to localhost [operations/puppet] - https://gerrit.wikimedia.org/r/75606 (owner: Mark Bergsma) [14:58:08] (PS2) Mark Bergsma: Add a lean definition (and custom template) for SSL proxies to localhost [operations/puppet] - https://gerrit.wikimedia.org/r/75606 [14:58:14] PROBLEM - SSH on pdf3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:00:22] (PS1) Hashar: contint: logs directory to hold Jenkins console logs [operations/puppet] - https://gerrit.wikimedia.org/r/75608 [15:03:14] PROBLEM - udp2log log age for gadolinium on gadolinium is CRITICAL: CRITICAL: log files /a/log/webrequest/packet-loss.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. 
[15:03:58] (PS3) Mark Bergsma: Setup NGINX for HTTPS on the Varnish servers [operations/puppet] - https://gerrit.wikimedia.org/r/75601 [15:04:04] RECOVERY - SSH on pdf3 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [15:05:30] (CR) ArielGlenn: [C: 2] contint: logs directory to hold Jenkins console logs [operations/puppet] - https://gerrit.wikimedia.org/r/75608 (owner: Hashar) [15:05:31] (Merged) ArielGlenn: contint: logs directory to hold Jenkins console logs [operations/puppet] - https://gerrit.wikimedia.org/r/75608 (owner: Hashar) [15:09:00] (PS4) Mark Bergsma: Setup NGINX for HTTPS on the Varnish servers [operations/puppet] - https://gerrit.wikimedia.org/r/75601 [15:09:01] (PS3) Mark Bergsma: Add a lean definition (and custom template) for SSL proxies to localhost [operations/puppet] - https://gerrit.wikimedia.org/r/75606 [15:10:16] PROBLEM - udp2log log age for gadolinium on gadolinium is CRITICAL: CRITICAL: log files /a/log/webrequest/packet-loss.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. 
[15:10:22] (CR) jenkins-bot: [V: -1] Setup NGINX for HTTPS on the Varnish servers [operations/puppet] - https://gerrit.wikimedia.org/r/75601 (owner: Mark Bergsma) [15:11:06] (CR) jenkins-bot: [V: -1] Add a lean definition (and custom template) for SSL proxies to localhost [operations/puppet] - https://gerrit.wikimedia.org/r/75606 (owner: Mark Bergsma) [15:11:44] (PS5) Mark Bergsma: Setup NGINX for HTTPS on the Varnish servers [operations/puppet] - https://gerrit.wikimedia.org/r/75601 [15:11:45] (PS4) Mark Bergsma: Add a lean definition (and custom template) for SSL proxies to localhost [operations/puppet] - https://gerrit.wikimedia.org/r/75606 [15:13:06] RECOVERY - Puppet freshness on gadolinium is OK: puppet ran at Wed Jul 24 15:13:02 UTC 2013 [15:15:40] (PS1) Tzafrir: hewikivoyage: also sortPrepend en [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75617 [15:18:12] (PS6) Mark Bergsma: Setup NGINX for HTTPS on the Varnish servers [operations/puppet] - https://gerrit.wikimedia.org/r/75601 [15:18:13] (PS5) Mark Bergsma: Add a lean definition (and custom template) for SSL proxies to localhost [operations/puppet] - https://gerrit.wikimedia.org/r/75606 [15:19:42] (CR) Reedy: [C: 2] Update CommonSettings to reflect changes in UpWiz [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75602 (owner: Rillke) [15:19:51] (Merged) jenkins-bot: Update CommonSettings to reflect changes in UpWiz [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75602 (owner: Rillke) [15:24:53] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 15:24:44 UTC 2013 [15:25:36] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [15:30:51] !log reedy synchronized wmf-config/ [15:31:01] Logged the message, Master [15:31:51] andrewbogott: why didn't you just keep the git stuff under /var/lib/git/operations? [15:32:13] mark, which? [15:32:28] the git repos I mean [15:32:34] I think I did. 
[15:32:39] what's the need to have them under /etc/puppet? [15:32:46] with symlinks :) [15:33:30] also... might this be related to anything you're working on? [15:33:31] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Puppet::Parser::AST::Resource failed with error ArgumentError: Invalid resource type files at /etc/puppet/modules/varnish/manifests/htcppurger.pp:23 on node cp1052.eqiad.wmnet [15:33:45] Yeah, that might not be needed; /etc/puppet is where puppetmaster looks by default, and I was following the pattern established on sockpuppet. [15:34:02] We could point the puppetmaster to /var/lib/git/operations with a bunch of extra config settings. [15:34:19] sockpuppet's setup was really much older and was redone on stafford [15:34:36] (and unused) [15:34:52] yeah, we could move it. [15:34:58] so if you want to stay consistent, it's best to keep stafford's setup [15:35:05] sockpuppet is just CA stuff [15:35:17] peter is redoing some stuff anyway, so it doesn't matter a lot [15:35:29] but if stuff is broken now, i'd say, just go back to how it was [15:35:43] Virt0 is broken but for unrelated reasons... [15:35:50] ok [15:35:58] mark: i think that would be [15:35:58] me [15:35:58] ah [15:35:59] It had yet a third organization… I'd love to have the three systems agree, for starters :) [15:36:05] yeah so [15:36:11] sockpuppet was the original puppetmaster [15:36:16] setup before we had puppet (obviously) [15:36:22] hashar actually [15:36:23] its configuration changed drastically a few times while we experimented [15:36:28] but my fault for not seeing it [15:36:33] then virt0 was installed for labs, with a rather different setup [15:36:35] and jenkins fault for not complaining [15:36:35] + files { '/etc/init/varnishhtcpd.conf': [15:36:48] and then I setup stafford for performance reasons, also trying to bring a little bit of sanity in the two setups [15:37:10] paravoid: I got something wrong ? 
[15:37:11] i never attempted to fix sockpuppet, it was meant to go away, unfortunately that didn't happen [15:37:15] hashar: yes :) [15:37:20] but at least it wasn't really used except for CA stuff [15:37:24] mark, sockpuppet had /three/ live copies of the puppet repo on it, and I was bound and determined to whittle that down to one. the rearrange on stafford was kind of a side-effect :) [15:37:35] yeah [15:38:04] i wish I had killed sockpuppet back then, the only reason I didn't was because I had some issues with getting the certs to work across the cluster iirc [15:38:13] and then had more important things to do [15:38:14] (PS1) Faidon: Fix typo, files -> file [operations/puppet] - https://gerrit.wikimedia.org/r/75624 [15:38:16] I'll try to de-symlink stafford once I have virt0 on board. [15:38:28] paravoid: sorry :( Ah yeah I wanted to get rid of /etc/init/varnishhtcpd.conf since it is for the upstart job. [15:38:45] (CR) Faidon: [C: 2] Fix typo, files -> file [operations/puppet] - https://gerrit.wikimedia.org/r/75624 (owner: Faidon) [15:38:49] ohh [15:38:50] a typo [15:38:51] ah right, you guys were working on that varnishhtcpd thing [15:38:55] but, yeah, hopefully we can kill sockpuppet entirely sometime soon, that'll make the orgchart much simpler! [15:39:52] well, we were planning for multiple appservers and a load balancer though, so that'll make it complex again :) [15:40:27] no different from how it is now really [15:40:27] (CR) Faidon: [V: 2] Fix typo, files -> file [operations/puppet] - https://gerrit.wikimedia.org/r/75624 (owner: Faidon) [15:40:28] (Merged) Faidon: Fix typo, files -> file [operations/puppet] - https://gerrit.wikimedia.org/r/75624 (owner: Faidon) [15:40:31] well, not much [15:41:30] Yeah, that seems ok, will just have to change puppet-merge so it can run on an arbitrary master. (which it almost can already.) 
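The earlier "Invalid resource type files" failure on cp1052 and Faidon's "Fix typo, files -> file" change are the same bug: puppet has no resource type named `files`, so the parser rejects the whole catalog. The corrected resource from the varnishhtcpd cleanup presumably reads something like the following (the log only shows the resource's opening line; `ensure => absent` is my reading of the intent, since the change removes the upstart job):

```puppet
# 'file' (singular) is the built-in type; 'files' makes the parser
# fail with "Invalid resource type files".
file { '/etc/init/varnishhtcpd.conf':
    ensure => absent,
}
```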
[15:49:55] mark@fenari:~$ ./firstbyte.py cp1052.eqiad.wmnet 80 www.wikidata.org / [15:49:55] GET / HTTP/1.1 [15:49:55] Host: www.wikidata.org [15:49:55] User-Agent: firstbyte.py [15:49:55] Connection: close [15:49:57] HTTP/1.1 200 OK [15:49:59] X-Powered-By: Express [15:50:03] by what? [15:52:02] eh?! [15:52:06] (PS1) Hashar: contint: python dependency for publish-console.py [operations/puppet] - https://gerrit.wikimedia.org/r/75632 [15:52:39] mark: Express = nodejs web server [15:52:44] i know [15:52:47] RECOVERY - Puppetmaster HTTPS on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [15:52:52] But ... wtf is that doing on wikidata.org [15:52:54] but but, wikidata? [15:52:59] is it contacting parsoid perhaps [15:53:12] That looks like the kind of header Parsoid would send [15:53:30] If you look at the full response you can tell if it's Parsoid pretty easily [15:53:42] it is [15:53:45]
Welcome to the alpha test web service for the Parsoid project. [15:53:48] Usage: GET /title for the DOM. Example: Main Page [15:53:58] misconfigured varnish? :) [15:54:57] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 15:54:50 UTC 2013 [15:55:15] X-Cache: cp1058 miss (0), cp1052 frontend hit (1) [15:55:21] (PS5) Ottomata: Fixing automated hue SSL generation and permissions [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/74686 [15:55:24] cp1058 being a parsoid varnish [15:55:27] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [15:55:28] cp1058 is a Parsoid Varnish backend [15:55:29] Yeah [15:55:46] wikimedia_text-frontend.vcl:backend cp1058 { [15:55:46] wikimedia_text-frontend.vcl: .host = "cp1058.eqiad.wmnet"; [15:56:03] nice [15:56:26] andrewbogott: i think puppet-merge can run anywhere [15:56:40] all it does is put in a review step instead of merging, with diffs for submodules [15:56:54] ottomata: yeah, I guess it's the post-merge hook that would have to be made generic [15:56:54] anything fancy is handled by git hooks [15:56:59] aye ja [15:57:21] i was tempted to not even put 'puppet' in the name of that script [15:57:26] since really it would work with any git repo [15:57:41] (PS1) Mark Bergsma: Use the correct Varnish backends [operations/puppet] - https://gerrit.wikimedia.org/r/75634 [15:58:36] paravoid, if I fixed those most recent comments on the hue ssl commit, s'ok to merge? 
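The smoking gun is the VCL grep quoted above: wikimedia_text-frontend.vcl declared cp1058, a Parsoid cache, as one of the text frontend's backends, so frontend misses on cp1052 were fetched from Parsoid's Express server instead of a text backend. In VCL terms the stray entry would have looked roughly like this (only the two quoted lines are from the log; the port is an assumption):

```vcl
/* What wikimedia_text-frontend.vcl effectively contained: a backend
   entry pointing at a Parsoid cache rather than a text cache. */
backend cp1058 {
    .host = "cp1058.eqiad.wmnet";
    .port = "80";  /* assumed; the log doesn't show the port */
}
```

Hence mark's "Use the correct Varnish backends" fix just below, and his grumble that the affected caches now need purging.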
[15:58:39] (CR) Mark Bergsma: [C: 2] Use the correct Varnish backends [operations/puppet] - https://gerrit.wikimedia.org/r/75634 (owner: Mark Bergsma) [15:58:40] (Merged) Mark Bergsma: Use the correct Varnish backends [operations/puppet] - https://gerrit.wikimedia.org/r/75634 (owner: Mark Bergsma) [15:58:58] that's annoying, now i'll need to purge those varnish servers [15:59:29] ottomata: yeah [15:59:35] (CR) Faidon: [C: 1] Fixing automated hue SSL generation and permissions [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/74686 (owner: Ottomata) [16:00:28] k danke [16:01:11] (CR) Ottomata: [C: 2 V: 2] Fixing automated hue SSL generation and permissions [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/74686 (owner: Ottomata) [16:01:12] (Merged) Ottomata: Fixing automated hue SSL generation and permissions [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/74686 (owner: Ottomata) [16:01:47] PROBLEM - Puppetmaster HTTPS on virt0 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error [16:03:41] !log catrope Started syncing Wikimedia installation... 
: Deploy code ahead of VE roll-out to another 8 wikis [16:03:51] Logged the message, Master [16:04:30] (PS1) Ottomata: Updating modules/cdh4 to latest hue commit [operations/puppet] - https://gerrit.wikimedia.org/r/75635 [16:04:49] (CR) Ottomata: [C: 2 V: 2] Updating modules/cdh4 to latest hue commit [operations/puppet] - https://gerrit.wikimedia.org/r/75635 (owner: Ottomata) [16:04:54] (Merged) Ottomata: Updating modules/cdh4 to latest hue commit [operations/puppet] - https://gerrit.wikimedia.org/r/75635 (owner: Ottomata) [16:09:15] (CR) Ottomata: [C: 2 V: 2] Adding role::analytics::hue [operations/puppet] - https://gerrit.wikimedia.org/r/74388 (owner: Ottomata) [16:09:15] (Merged) Ottomata: Adding role::analytics::hue [operations/puppet] - https://gerrit.wikimedia.org/r/74388 (owner: Ottomata) [16:11:39] heya mark, i know you did this kinda recently for gadolinium, but could you do it for erbium now too? [16:11:39] https://rt.wikimedia.org/Ticket/Display.html?id=5510 [16:14:12] ok [16:14:23] (PS7) Mark Bergsma: Setup NGINX for HTTPS on the Varnish servers [operations/puppet] - https://gerrit.wikimedia.org/r/75601 [16:14:24] (PS6) Mark Bergsma: Add a lean definition (and custom template) for SSL proxies to localhost [operations/puppet] - https://gerrit.wikimedia.org/r/75606 [16:14:27] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:16:31] done [16:20:20] thanks! [16:21:34] !log catrope synchronized php-1.22wmf10/resources/startup.js 'touch' [16:21:43] Logged the message, Master [16:23:44] !log catrope synchronized php-1.22wmf10/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.ViewPageTarget.js 'touch' [16:23:54] Logged the message, Master [16:24:57] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 16:24:56 UTC 2013 [16:25:27] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [16:27:33] !log catrope Finished syncing Wikimedia installation... 
: Deploy code ahead of VE roll-out to another 8 wikis [16:27:42] Logged the message, Master [16:30:58] !log Fixed /etc/dsh/group/bits ... again :( [16:31:08] Logged the message, Mr. Obvious [16:33:14] RoanKattouw: errors on dewiki: "novenamespace: VisualEditor is not enabled in namespace 4" [16:33:17] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [16:34:17] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 6 seconds [16:34:19] Raymond_: Yeah, we know [16:34:26] ok. [16:34:32] It's a caching issue [16:34:35] It should be fixed now [16:35:27] I am off *wave* [16:36:07] RoanKattouw: friendly reminder to pick up https://gerrit.wikimedia.org/r/#/c/75140/ if you haven't already [16:36:17] chrismcmahon: Will do once I do config [16:36:21] but I'm not there yet [16:36:30] RoanKattouw: gotcha, thanks [16:38:16] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 181 seconds [16:38:36] Ugh, Gerrit is sloooowwww [16:40:12] (PS1) Mark Bergsma: Reference the actual Cookie header [operations/puppet] - https://gerrit.wikimedia.org/r/75640 [16:40:16] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 2 seconds [16:40:20] !log catrope synchronized php-1.22wmf10/extensions/VisualEditor 'Fix nowiki bug' [16:40:30] Logged the message, Master [16:40:45] !log catrope synchronized php-1.22wmf11/extensions/VisualEditor 'Fix nowiki bug' [16:40:55] Logged the message, Master [16:43:12] yeah gerrit is having its daily crappiness [16:44:19] :-( [16:45:30] nice [16:45:37] mobile caches don't set X-Forwarded-Proto [16:45:43] so it's https when nginx, empty when not [16:45:46] Ahm, oops? 
[16:45:53] !log catrope synchronized php-1.22wmf10/extensions/VisualEditor/modules/ve-mw/dm/annotations/ve.dm.MWExternalLinkAnnotation.js 'touch' [16:45:57] Well that's how it works for MW too [16:46:04] Or used to at least [16:46:04] Logged the message, Master [16:46:16] !log catrope synchronized php-1.22wmf11/extensions/VisualEditor/modules/ve-mw/dm/annotations/ve.dm.MWExternalLinkAnnotation.js 'touch' [16:46:26] Logged the message, Master [16:46:38] !log catrope synchronized php-1.22wmf10/extensions/VisualEditor/modules/ve-mw/dm/annotations/ve.dm.MWNowikiAnnotation.js 'touch' [16:46:48] Logged the message, Master [16:47:00] !log catrope synchronized php-1.22wmf11/extensions/VisualEditor/modules/ve-mw/dm/annotations/ve.dm.MWNowikiAnnotation.js 'touch' [16:47:10] Logged the message, Master [16:48:22] hrm [16:48:24] bits doesn't either [16:48:51] I don't think text does either [16:49:41] indeed [16:49:44] text varnish does though [16:49:47] so I guess i'll have to change that [16:50:22] Yeah I'm not sure MW will even handle XFP: http correctly. I think it should, but I'm not sure it's been tested [16:52:45] i didn't notice anything in my casual browsing [16:52:48] but I don't want to risk it now [16:52:58] yurik, i told RoanKattouw to go ahead and use our window in 7 minutes, as we can't be deploying anything for wikipedia zero today anyway. cc: greg-g [16:53:13] yep [16:53:24] * greg-g nods [16:54:46] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 16:54:40 UTC 2013 [16:55:26] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [16:55:47] aight, who's killing gerrit [16:56:09] * greg-g puts away his revolver [16:56:16] no idea [16:56:35] 174.0.4.10.in-addr.arpa domain name pointer i-0000081e.pmtpa.wmflabs. [16:56:35] 174.0.4.10.in-addr.arpa domain name pointer wikidata-test-multi.pmtpa.wmflabs.
[16:56:54] these wikidata folks are delaying their own varnish deployment [16:59:13] (CR) Mark Bergsma: [C: 2] Reference the actual Cookie header [operations/puppet] - https://gerrit.wikimedia.org/r/75640 (owner: Mark Bergsma) [16:59:19] (Merged) Mark Bergsma: Reference the actual Cookie header [operations/puppet] - https://gerrit.wikimedia.org/r/75640 (owner: Mark Bergsma) [16:59:32] (CR) Catrope: [C: 2] Enable VisualEditor for all users on test2wiki, experimental also [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75140 (owner: Cmcmahon) [16:59:36] (CR) Catrope: [C: 2] Hide the new 'visualeditor-betatempdisable' preference if in alpha [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75542 (owner: Jforrester) [16:59:40] (CR) Catrope: [C: 2] VisualEditor into beta on de/es/fr/he/it/pl/ru/svwiki for logged-in only [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75543 (owner: Jforrester) [17:00:10] (Merged) jenkins-bot: Enable VisualEditor for all users on test2wiki, experimental also [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75140 (owner: Cmcmahon) [17:00:12] (Merged) jenkins-bot: Hide the new 'visualeditor-betatempdisable' preference if in alpha [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75542 (owner: Jforrester) [17:00:45] (PS1) Mark Bergsma: X-Forwarded-Proto is empty for straight http [operations/puppet] - https://gerrit.wikimedia.org/r/75642 [17:02:03] (CR) Mark Bergsma: [C: 2] X-Forwarded-Proto is empty for straight http [operations/puppet] - https://gerrit.wikimedia.org/r/75642 (owner: Mark Bergsma) [17:02:04] (Merged) Mark Bergsma: X-Forwarded-Proto is empty for straight http [operations/puppet] - https://gerrit.wikimedia.org/r/75642 (owner: Mark Bergsma) [17:04:43] (PS1) Mark Bergsma: Move vcl_deliver after vcl_error (more consistent) [operations/puppet] - https://gerrit.wikimedia.org/r/75644 [17:05:29] (PS3) Catrope: VisualEditor into beta on de/es/fr/he/it/pl/ru/svwiki 
for logged-in only [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75543 (owner: Jforrester) [17:05:36] (CR) jenkins-bot: [V: -1] VisualEditor into beta on de/es/fr/he/it/pl/ru/svwiki for logged-in only [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75543 (owner: Jforrester) [17:06:10] (PS4) Catrope: VisualEditor into beta on de/es/fr/he/it/pl/ru/svwiki for logged-in only [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75543 (owner: Jforrester) [17:06:26] (CR) Catrope: [C: 2] VisualEditor into beta on de/es/fr/he/it/pl/ru/svwiki for logged-in only [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75543 (owner: Jforrester) [17:06:35] (Merged) jenkins-bot: VisualEditor into beta on de/es/fr/he/it/pl/ru/svwiki for logged-in only [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75543 (owner: Jforrester) [17:07:27] (CR) Mark Bergsma: [C: 2] Move vcl_deliver after vcl_error (more consistent) [operations/puppet] - https://gerrit.wikimedia.org/r/75644 (owner: Mark Bergsma) [17:07:28] (Merged) Mark Bergsma: Move vcl_deliver after vcl_error (more consistent) [operations/puppet] - https://gerrit.wikimedia.org/r/75644 (owner: Mark Bergsma) [17:08:16] !log catrope synchronized wmf-config/CommonSettings.php 'Show either visualeditor-enable or visualeditor-betatempdisable, not both' [17:08:27] Logged the message, Master [17:09:30] !log catrope synchronized wmf-config/InitialiseSettings.php 'Enable VisualEditor on 8 more wikis' [17:09:40] Logged the message, Master [17:11:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.780 second response time [17:15:17] (PS1) Mark Bergsma: Pass test.* requests through all caching layers [operations/puppet] - https://gerrit.wikimedia.org/r/75647 [17:15:27] notpeter_ & ori-l, today I attempted to migrate TTM 
to zinc but encountered a bug and had to bail out, so this'll have to wait for Nikerabbit to finish his vacation [17:18:13] (CR) Mark Bergsma: [C: 2] Pass test.* requests through all caching layers [operations/puppet] - https://gerrit.wikimedia.org/r/75647 (owner: Mark Bergsma) [17:18:14] (Merged) Mark Bergsma: Pass test.* requests through all caching layers [operations/puppet] - https://gerrit.wikimedia.org/r/75647 (owner: Mark Bergsma) [17:19:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:24] !log torrus deadlocked, fixing [17:20:36] Logged the message, RobH [17:20:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [17:23:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.515 second response time [17:25:30] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 17:25:28 UTC 2013 [17:26:30] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [17:27:12] MaxSem: ok. anything I can help with? [17:28:07] no, it's a PHP issue [17:30:28] !log catrope synchronized php-1.22wmf10/extensions/FlaggedRevs 'Fix excessive FlaggedRevs notices' [17:30:39] Logged the message, Master [17:30:47] RoanKattouw: Reedy: How long should we keep it in Common.js. 30 days is too short, right? Last I checked we've begun removing old wmf branches on bits again, what do we end up using as the reliable cache rollover time? [17:30:53] !log catrope synchronized php-1.22wmf11/extensions/FlaggedRevs 'Fix excessive FlaggedRevs notices' [17:31:04] Logged the message, Master [17:31:09] Krinkle: Right now I don't think there's a reliable cache rollover time [17:31:11] Which is a problem [17:31:25] And yet we've begun removing wmf branches again?
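The cache-rollover question above hinges on how caches revalidate stale objects: on a 304 the cache may reuse the stored body, but it must refresh its freshness metadata (Expires, Cache-Control) from the new response rather than simply extending the old expiry. A minimal sketch of that rule under standard HTTP caching semantics; the function and dict shapes are illustrative, not Squid or MediaWiki code:

```python
def revalidate(cached, response):
    """Merge a revalidation response into a cached entry.

    On 304 the stored body is reused, but freshness headers are taken
    from the *new* response; on 200 the entry is replaced wholesale.
    A cache that renews the old expiry on 304 without honouring the
    new headers can keep stale content alive indefinitely.
    """
    if response["status"] == 304:
        headers = dict(cached["headers"])
        headers.update(response["headers"])  # refresh Expires/Cache-Control
        return {"headers": headers, "body": cached["body"]}
    return {"headers": dict(response["headers"]), "body": response["body"]}
```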
[17:31:26] * Krinkle looks up the bug report [17:31:31] I think this is due to 304 refreshing and I've discussed this with Aaron and Asher but nothing ever happened [17:31:44] Yes, but Tim fixed that, right? [17:32:01] When? [17:32:01] In the past ~6 months? [17:32:01] We used to allow Squid to essentially renew the 304 expiration without taking new content [17:32:05] Yes [17:32:06] yes, 1-2 months back [17:32:11] Oh? Link? [17:32:26] Sure, will take a few minutes. Yield :) [17:33:26] I3889f300012aeabd37e228653279ad19b296e4ae ? [17:33:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:35:40] AaronSchulz: I think so, yes. https://gerrit.wikimedia.org/r/#/c/58415/3/includes/OutputPage.php [17:35:40] https://bugzilla.wikimedia.org/show_bug.cgi?id=44570 was closed [17:35:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [17:37:10] I'll just say "remove before September 1". Nicely ambiguous. [17:41:00] !log catrope synchronized php-1.22wmf10/extensions/FlaggedRevs 'Fix FlaggedRevs fatal' [17:41:10] Logged the message, Master [17:41:25] !log catrope synchronized php-1.22wmf11/extensions/FlaggedRevs 'Fix FlaggedRevs fatal' [17:41:39] Logged the message, Master [17:41:43] * andrewbogott ambushes Ryan_Lane [17:41:51] :D [17:41:56] andrewbogott: so, virt0 [17:41:59] yeah [17:42:05] the crons update from /root [17:42:12] they need to be changed to update /var/lib... [17:42:36] I moved the /etc/puppet symlinks to point at things in /root/testrepo [17:43:06] is virt0's puppetmaster set to use /root? [17:43:06] I thought it was using /etc [17:43:22] right [17:43:32] and /root rsync'd to /etc [17:43:52] hm… who would be doing that rsync? 
[17:44:04] Here's what I think is happening: [17:44:06] a git hook in /root I think [17:44:24] someone changed how sockpuppet worked ages ago and no one updated virt0 [17:45:32] - cron does 'git pull' in /root/testrepo/puppet [17:45:41] - links from root/testrepo/puppet/* to /etc/puppet/* [17:45:41] - puppet master looks for files in /etc/puppet/* [17:46:28] so, we just need to make /etc point to /var [17:46:30] and update the cron [17:47:15] Eventually I would like to move the repo from /root/testrepo to /var, but it should be working as it is now. [17:47:17] And it isn't [17:47:43] I'd say let's not try to figure out why it's not working in root [17:47:48] and just make it work in var [17:48:00] sure, ok. [17:48:10] it would be a more consistent way anyway :) [17:48:16] d'you mind changing the crons? I don't know how to do that offhand. [17:48:20] sure [17:48:20] I'll move the links. [17:49:16] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [17:49:16] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:16] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:16] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:16] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:17] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [17:49:17] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [17:50:48] I wonder if those crons aren't puppetized [17:50:52] ottomata: So I know you used to use the install server docs on wikitech, yesterday i updated and rolled into https://wikitech.wikimedia.org/wiki/Server_Lifecycle [17:50:55] just fyi ;] [17:51:14] Ryan_Lane, I'm going to expect the labs private repo to be in 
/var/lib/git/operations/labs/private [17:51:19] which included stripping out some vendor specific information to their respective platform subpages, but its overall now more in line with reality. [17:51:30] those crons are indeed not puppetized [17:51:40] andrewbogott: sounds good [17:52:39] (PS1) Andrew Bogott: Move /etc/puppet symlinks back to a normal place on virt0 [operations/puppet] - https://gerrit.wikimedia.org/r/75661 [17:53:16] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [17:53:42] (CR) Andrew Bogott: [C: 2] Move /etc/puppet symlinks back to a normal place on virt0 [operations/puppet] - https://gerrit.wikimedia.org/r/75661 (owner: Andrew Bogott) [17:53:43] (Merged) Andrew Bogott: Move /etc/puppet symlinks back to a normal place on virt0 [operations/puppet] - https://gerrit.wikimedia.org/r/75661 (owner: Andrew Bogott) [17:53:59] hm [17:54:16] PROBLEM - Puppet freshness on ms-fe1002 is CRITICAL: No successful Puppet run in the last 10 hours [17:54:55] <^d> manybubbles: I was going to start setting up the ES hosts in beta. What port do the apaches need to access ES on? [17:55:11] andrewbogott: so, we need to make this work somewhat similarly to puppetmaster::self [17:55:12] 9200! [17:55:16] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [17:55:24] <^d> manybubbles: mmk, thanks. [17:55:51] andrewbogott: because we need a private ssh key to checkout private [17:55:53] ^d: ah - I'm messing around with automated tests! I found a bug.... They are pretty easy to write. [17:55:56] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 17:55:46 UTC 2013 [17:56:13] Ryan_Lane, we don't want private though, do we? Just labs-private [17:56:22] yes, that's what I mean :) [17:56:24] I'd like to set up elasticsearch with puppet in labs (currently still debed). Puppet is done. I'm in the zone on the tests though. 
[17:56:26] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [17:56:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:57:09] Hm… what has changed w/respect to labs private? [17:57:20] nothing, why? [17:57:32] we're using novaadmin's ssh keys on virt0 for this [17:57:36] I'm not a huge fan of this [17:57:42] <^d> manybubbles: How many hosts do we want for the initial setup? 4 maybe? [17:57:46] RECOVERY - Puppetmaster HTTPS on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [17:57:50] in fact, I'd like to remove any key from novaadmin [17:57:55] it's somewhat dangerous [17:57:59] Ah, ok. So you're proposing a further improvement, not part of getting us back to how we were before [17:58:05] * Ryan_Lane nods [17:58:09] ^d: I was thinking 3 or 4. We only use 2 for the current search now. [17:58:16] well, it seems this cron stuff isn't puppetized at all [17:58:29] <^d> manybubbles: I'll do 4. They'll be deployment-es[0-3] [17:58:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [17:58:50] cool [18:00:00] really, It would be nice if this wasn't done via cron at all [18:00:16] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: No successful Puppet run in the last 10 hours [18:00:25] but that's a further improvement :) [18:01:46] Yeah, it should be a post-commit hook on gerrit [18:01:48] or post-merge rather [18:02:01] well, neither of those exist ;) [18:02:13] we could have a daemon that reads gerrit's stream events, though [18:02:43] I guess there's a post merge. 
but we actually removed all of gerrit's hooks the other day [18:02:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:02:47] stream events are better [18:03:04] !log catrope synchronized php-1.22wmf10/resources/mediawiki/mediawiki.notification.js 'Fix mw.notify positioning bug' [18:03:14] Logged the message, Master [18:03:27] !log catrope synchronized php-1.22wmf11/resources/mediawiki/mediawiki.notification.js 'Fix mw.notify positioning bug' [18:03:31] <^d> Ryan_Lane: Only thing with stream-events is making sure the user who's accessing it is added to the stream-events group, it's not exposed to all users by default. [18:03:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [18:03:36] Logged the message, Master [18:03:41] ^d: no? [18:03:50] that's lame [18:04:15] <^d> It was made an explicit permission. And I've never really looked into how much "private" data it exposes :) [18:04:23] heh [18:04:57] well, either way, we'll do this the cron way now and do stream events later [18:05:08] (PS1) Pyoungmeister: move jobqueue check from neon to hume/terbium [operations/puppet] - https://gerrit.wikimedia.org/r/75667 [18:05:16] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: No successful Puppet run in the last 10 hours [18:07:18] rawr. where is virt0 actually including the puppetmaster? [18:07:50] Ryan_Lane: did you check under the couch? sometimes puppetmasters wind up there [18:08:13] or maybe between the cushions? :) [18:08:38] Ryan_Lane, in the nova role I think [18:08:45] <^d> Ryan_Lane: While you're looking in the couch, look for my keys. 
[18:09:08] ah [18:09:09] class { "role::puppet::server::labs": } [18:09:18] andrewbogott: indeed it was :) [18:10:23] (CR) Pyoungmeister: [C: 2] move jobqueue check from neon to hume/terbium [operations/puppet] - https://gerrit.wikimedia.org/r/75667 (owner: Pyoungmeister) [18:10:24] (Merged) Pyoungmeister: move jobqueue check from neon to hume/terbium [operations/puppet] - https://gerrit.wikimedia.org/r/75667 (owner: Pyoungmeister) [18:12:16] RECOVERY - Puppetmaster HTTPS on virt1000 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.976 second response time [18:13:16] RECOVERY - Varnish traffic logger on cp3012 is OK: PROCS OK: 2 processes with command name varnishncsa [18:13:16] RECOVERY - Varnish HTCP daemon on cp3012 is OK: PROCS OK: 1 process with UID = 110 (vhtcpd), args vhtcpd [18:13:36] RECOVERY - Varnish HTTP mobile-backend on cp3012 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.176 second response time [18:14:10] Ryan_Lane, ok, I've cloned puppet and private by hand and the symlinks are back to normal [18:14:20] And, dammit, the puppetmaster is working fine. So my problems from this morning remain a mystery. [18:14:24] heh [18:14:36] ok. I'll manually modify the crons for now [18:14:47] fixing that is looking to be….. complicated [18:15:16] mostly because of the private repo [18:15:58] * Ryan_Lane hates the private repo [18:16:59] (CR) Aude: [C: 1] "looks perfect" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75617 (owner: Tzafrir) [18:17:49] Ryan_Lane: So make it public. :-) [18:17:55] Elsie: ;) [18:20:52] Elsie: it's just passwords fyi [18:21:11] (PS1) Petr Onderka: made indexes into trees [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/75668 [18:21:46] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: No successful Puppet run in the last 10 hours [18:22:03] What could go wrong? 
[18:24:47] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 18:24:43 UTC 2013 [18:25:28] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [18:32:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:33:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [18:54:52] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 18:54:51 UTC 2013 [18:55:32] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [19:00:40] (PS1) Pyoungmeister: derp. this obviously needs to be a nrpe check [operations/puppet] - https://gerrit.wikimedia.org/r/75674 [19:02:30] !log catrope synchronized php-1.22wmf10/extensions/VisualEditor 'Fix copyright notice bug' [19:02:41] Logged the message, Master [19:02:57] !log catrope synchronized php-1.22wmf11/extensions/VisualEditor 'Fix copyright notice bug' [19:03:06] Logged the message, Master [19:05:10] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 222 seconds [19:08:06] (CR) Pyoungmeister: [C: 2] derp. this obviously needs to be a nrpe check [operations/puppet] - https://gerrit.wikimedia.org/r/75674 (owner: Pyoungmeister) [19:08:07] (Merged) Pyoungmeister: derp. 
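For reference, an NRPE-backed check like the check_job_queue one being moved to hume/terbium pairs a command definition on the monitored host with a check_nrpe invocation from the icinga side. The paths and names below are illustrative, not the actual Wikimedia puppetization:

```
# on the monitored host (e.g. terbium), an nrpe command definition:
command[check_job_queue]=/usr/local/lib/nagios/plugins/check_job_queue

# from the icinga server, the service check runs roughly:
#   check_nrpe -H terbium -c check_job_queue -t 10
# a "CHECK_NRPE: Socket timeout after 10 seconds" is that timeout expiring
```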
this obviously needs to be a nrpe check [operations/puppet] - https://gerrit.wikimedia.org/r/75674 (owner: Pyoungmeister) [19:09:10] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 1 seconds [19:14:10] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 229 seconds [19:16:10] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 24 seconds [19:20:00] (CR) Hashar: "That indeed fixed the redirect loop issue on http://login.wikimedia.beta.wmflabs.org/ (bug 51700)" [operations/puppet] - https://gerrit.wikimedia.org/r/75583 (owner: Mark Bergsma) [19:24:47] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 19:24:46 UTC 2013 [19:25:27] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [19:26:37] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (148738), enwiki (73602), frwiki (19806), Total (253979) [19:28:35] hey, look at that, we have jobqueue monitoring again [19:30:49] :-] [19:33:17] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [19:35:17] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [19:45:36] is the job queue ever going to be uncritical ? [19:46:49] LeslieCarr, it's going to - if someone adjusts the check to be against 1M which is a much more reasonable number:P [19:47:14] hehe [19:54:06] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:54:46] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 19:54:37 UTC 2013 [19:55:26] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [19:56:51] hashar: login broken on beta labs? [20:07:26] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:10:48] (PS2) MaxSem: $wgMFRemovableClasses overhaul [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/66891 [20:11:22] (PS1) Ottomata: Adding process icinga nrpe check for webstats collector and filter processes [operations/puppet] - https://gerrit.wikimedia.org/r/75766 [20:13:54] (CR) MaxSem: [C: 2] $wgMFRemovableClasses overhaul [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/66891 (owner: MaxSem) [20:13:57] (CR) Ottomata: [C: 2 V: 2] Adding process icinga nrpe check for webstats collector and filter processes [operations/puppet] - https://gerrit.wikimedia.org/r/75766 (owner: Ottomata) [20:14:16] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 193 seconds [20:15:16] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [20:15:51] (Merged) jenkins-bot: $wgMFRemovableClasses overhaul [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/66891 (owner: MaxSem) [20:19:48] !log authdns-update: change ns0's IP to new service IP; glues were updated [20:19:58] Logged the message, Master [20:20:03] RobH: ^ [20:22:45] (PS2) Hashar: contint: python dependency for publish-console.py [operations/puppet] - https://gerrit.wikimedia.org/r/75632 [20:23:16] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [20:24:46] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 20:24:45 UTC 2013 [20:25:16] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [20:25:27] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [20:25:38] oh, watchmouse [20:25:55] don't worry about the alert, what happened is normal [20:26:00] I just didn't know we had an alert :) [20:26:15] i was just about to say " ah looks like they switched it over ? 
" [20:26:34] I'm trying to figure out watchmouse now [20:28:16] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [20:30:17] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [20:33:03] (PS2) Demon: Enable CirrusSearch in beta. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75507 (owner: Manybubbles) [20:38:42] PROBLEM - RAID on analytics1017 is CRITICAL: Timeout while attempting connection [20:39:23] (PS1) Demon: Add icinga monitoring for Gerrit and Gitblit [operations/puppet] - https://gerrit.wikimedia.org/r/75777 [20:40:02] PROBLEM - Host analytics1017 is DOWN: PING CRITICAL - Packet loss = 100% [20:41:08] <^demon> LeslieCarr: There's the icinga stuff I did ^. Dunno if it's right, just copy+pasted what was done for jenkins and zuul with some tweaks. [20:41:29] (PS1) Ottomata: Puppetizing analytics1017 as hadoop worker [operations/puppet] - https://gerrit.wikimedia.org/r/75779 [20:41:38] (CR) Ottomata: [C: 2 V: 2] Puppetizing analytics1017 as hadoop worker [operations/puppet] - https://gerrit.wikimedia.org/r/75779 (owner: Ottomata) [20:44:28] (CR) Lcarr: [C: -1] "If we are to run nrpe on a server with a public ip, it is required that iptables be running on the server." [operations/puppet] - https://gerrit.wikimedia.org/r/75777 (owner: Demon) [20:45:13] RECOVERY - Host analytics1017 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [20:45:44] LeslieCarr: you know this isn't the case out there, right? 
:) [20:47:12] PROBLEM - Disk space on analytics1017 is CRITICAL: Connection refused by host [20:47:22] PROBLEM - DPKG on analytics1017 is CRITICAL: Connection refused by host [20:47:32] PROBLEM - SSH on analytics1017 is CRITICAL: Connection refused [20:47:40] paravoid: just because we havent done it right all the time, we shouldn't do it wrong again [20:48:18] I didn't disagree, just pointing it out :) [20:48:55] I'm hoping that I'll work on ferm as soon as these two storms pass [20:49:00] dns/ulsfo & media storage [20:49:05] the bulk of them anyway [20:49:12] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 188 seconds [20:50:12] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [20:54:02] PROBLEM - Puppet freshness on professor is CRITICAL: No successful Puppet run in the last 10 hours [20:54:42] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 20:54:37 UTC 2013 [20:55:32] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [20:56:02] PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours [20:56:43] (PS2) Demon: Add icinga monitoring for Gerrit and Gitblit [operations/puppet] - https://gerrit.wikimedia.org/r/75777 [20:57:37] (CR) Demon: "Added nrpe to antimony and manganese in PS2. Will have a look at the iptables stuff after this meeting." 
[operations/puppet] - https://gerrit.wikimedia.org/r/75777 (owner: Demon) [20:59:33] PROBLEM - NTP on analytics1017 is CRITICAL: NTP CRITICAL: No response from NTP server [21:13:38] (PS1) Lcarr: removing nagios redirects :( [operations/puppet] - https://gerrit.wikimedia.org/r/75786 [21:14:42] (CR) Lcarr: [C: 2] removing nagios redirects :( [operations/puppet] - https://gerrit.wikimedia.org/r/75786 (owner: Lcarr) [21:15:26] RECOVERY - SSH on analytics1017 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [21:16:32] (PS1) Bsitu: Set $wgAllowHTMLEmail default to true [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75787 [21:24:56] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 21:24:51 UTC 2013 [21:25:27] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [21:28:51] (PS1) Catrope: Add eqiad bits caches to /etc/dsh/group/bits [operations/puppet] - https://gerrit.wikimedia.org/r/75791 [21:29:24] !log dropped all databases on s3 that were migrated to s7 [rt5506] [21:29:34] Logged the message, Master [21:33:00] (PS1) Bsitu: Preparation of Echo and Thanks for metawiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75795 [21:33:23] (CR) Hashar: "Should we reopen bug 45926 which requested the nagios redirection to icinga ? :-]" [operations/puppet] - https://gerrit.wikimedia.org/r/75786 (owner: Lcarr) [21:33:42] * MaxSem scaps [21:36:04] (CR) Lcarr: "we could reopen it and reclose it as a won'tfix ?" [operations/puppet] - https://gerrit.wikimedia.org/r/75786 (owner: Lcarr) [21:36:21] ugh Symbolic link not allowed or link target not accessible: /usr/local/apache/common/docroot/bits/static-1.22wmf5/extensions, referer: http://de.m.wikipedia.org/wiki/Scheibenwelt-Romane [21:36:43] our cache isn't supposed to last that long... [21:39:51] wmf5.... [21:43:10] !log maxsem Started syncing Wikimedia installation... 
: Weekly mobile deployment [21:43:22] Logged the message, Master [21:50:24] Ryan_Lane: yo [21:50:35] one channel at a time please ;) [21:51:06] we're going to need a cache flush soon as requested - are you still the point person for that? [21:51:35] ^ Ryan_Lane [21:51:52] MaxSem will be able to say when [21:52:19] Peter told me to ping you around the time ;-) [21:54:46] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 21:54:40 UTC 2013 [21:55:27] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [21:55:56] Ryan_Lane: poke.. [21:56:00] Ryan_Lane: reassure me :) [21:57:09] jdlrobson: no reassurances!!! :D [21:57:10] muaahahahahaha [21:57:16] RECOVERY - Puppet freshness on erzurumi is OK: puppet ran at Wed Jul 24 21:57:06 UTC 2013 [21:57:37] phew.. MaxSem is scapping now but when it's done we are going to need that cache flushed so please don't go awol on me lol [21:57:47] this deployment is scary enough as it is ;-) [21:57:54] !log maxsem Finished syncing Wikimedia installation... : Weekly mobile deployment [21:57:54] * Ryan_Lane disappears [21:58:04] Logged the message, Master [21:58:08] Ryan_Lane: are fortune cookie told us to expect a miracle: https://twitter.com/rakugojon/status/360150740238495744/photo/1 [21:58:09] tell me when [21:58:18] *our fortune cookie [21:58:52] looks like it finished. should I flush it now? [21:58:59] (PS1) Lcarr: getting rid of nagios.wikimedia.org trap [operations/puppet] - https://gerrit.wikimedia.org/r/75796 [21:59:27] would someone like to check this above^^ ? [21:59:41] jdlrobson: so. if I can't disappear you can't either.... [21:59:44] binasher, it surfaced that a wikipedia zero partner's press release is promoting their launch as covering all languages. so we'll be putting in a VCL submission shortly to update their existing rule. hoping you can review so that we can get this in place quickly. their launch is around midnight tonight. [21:59:46] ^ MaxSem ? 
[21:59:54] Ryan_Lane: i'm waiting for MaxSem to confirm :) [22:00:07] MaxSem: ... [22:00:31] 1 sec, waiting for QA [22:00:33] dr0ptp4kt: sure [22:00:36] dr0ptp4kt:why specifically binasher ? [22:01:03] LeslieCarr, he's usually on the vcl submissions. are you able to review this, too? [22:01:15] i can do basic ones, i think more on our team can as well [22:01:15] * jdlrobson plays catch with Ryan_Lane  [22:01:23] LeslieCarr, didn't want to wake up our two friends in europe, either :) [22:01:26] really you shouldn't necessarily rely on 1 person [22:01:37] Ryan_Lane: why are you not in the office? this is a whiskey moment.. [22:01:37] (CR) Bsitu: [C: -2] Preparation of Echo and Thanks for metawiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75795 (owner: Bsitu) [22:01:51] (CR) Bsitu: [C: -2] Set $wgAllowHTMLEmail default to true [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75787 (owner: Bsitu) [22:02:00] jdlrobson: socialize? [22:02:14] you guys have whisky? [22:02:16] Ryan_Lane: you know i have 40 year old whiskey right ;-) [22:02:28] save me a glass for friday :) [22:02:41] Ryan_Lane: will have to be tomorrow - flying out of the country friday [22:02:44] jdlrobson: what is it? 
[22:02:51] (CR) Lcarr: [C: 2] getting rid of nagios.wikimedia.org trap [operations/puppet] - https://gerrit.wikimedia.org/r/75796 (owner: Lcarr)
[22:03:15] it's some old Glenfiddich and Glemorangie from my granddad's attic
[22:05:20] yum
[22:05:27] Ryan_Lane, fire away!:P
[22:05:53] \o/
[22:06:25] Ryan_Lane: push that red button :)
[22:06:29] (i assume it is red)
[22:06:37] waiting on qa
[22:06:39] give me a week
[22:06:40] :)
[22:06:48] jdlrobson, uttons are usually black on laptop keyboards
[22:06:53] (I had already done it, btw ;) )
[22:07:09] Ryan_Lane: WOOOO
[22:07:42] (PS1) Pyoungmeister: removing unused enwikijobqueue check [operations/puppet] - https://gerrit.wikimedia.org/r/75797
[22:08:55] Ryan_Lane, thanksalot!:)
[22:09:45] yw
[22:12:35] (PS1) Dr0ptp4kt: Expanding Aircel whitelist to cover all langs. [operations/puppet] - https://gerrit.wikimedia.org/r/75798
[22:15:45] LeslieCarr, binasher, would you please review and merge https://gerrit.wikimedia.org/r/75798 ? this is the change for tonight's deployment with the wikipedia zero carrier.
[22:16:03] taking a look
[22:16:57] (CR) Pyoungmeister: "http://dayofthejedi.com/wp-content/uploads/2011/03/27.jpg" [operations/puppet] - https://gerrit.wikimedia.org/r/75796 (owner: Lcarr)
[22:17:13] hahaha
[22:17:50] (CR) Asher: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/75798 (owner: Dr0ptp4kt)
[22:18:40] dr0ptp4kt: does that comment make sense?
[22:25:02] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 22:24:59 UTC 2013
[22:25:32] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[22:27:11] binasher, it's a defense against domains other than mdot and zerodot. or are you saying that it's guaranteed that the requests will be mdot/zerodot?
[22:28:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:29:40] if someone is crafting fordged requests directly to the m/z ip, does it matter if x-cs is internally set? and besides, it matches *.org anyways, so if the answer isn't no, that's a problem for all of the rules
[22:29:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time
[22:31:25] binasher, do we currently serve anything via text varnishes? if we do, https://bugzilla.wikimedia.org/show_bug.cgi?id=51988 looks urgent
[22:31:33] (CR) Dr0ptp4kt: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/75798 (owner: Dr0ptp4kt)
[22:31:47] binasher, i updated the gerrit comment. what do you think?
[22:32:39] MaxSem: not sure, but i don't think so
[22:33:12] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 186 seconds
[22:34:12] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 6 seconds
[22:37:13] dr0ptp4kt: so if the domain is one that shouldn't ever hit the mobile varnish servers, then what?
[22:40:03] binasher, under the current code, the traffic simply wouldn't get the X-CS tagged onto the header. if you think it's safe to just comment out that line, i'm cool with that. funny thing, i made the rule update somewhat tight because i thought, asher's gonna want this thing restrictive :)
[22:41:07] binasher, should i just resubmit with that line commented out?
[22:42:35] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[22:42:54] binasher, i should say…shall i just remove the if and wrapping curly braces, and then unindent the 'set req.http.X-CS' assignment?
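[Editor's note: the VCL question being debated above, whether the `set req.http.X-CS` assignment needs its guarding `if`, looks roughly like the fragment below. This is an illustrative sketch only, not the actual operations/puppet change in Gerrit 75798; the host regex and carrier code are made-up placeholder values.]

```vcl
# Guarded form, as in dr0ptp4kt's patch: only tag requests whose Host
# header looks like an m-dot or zero-dot domain.
if (req.http.Host ~ "\.(m|zero)\.wikipedia\.org$") {
    set req.http.X-CS = "250-99";  # carrier code: illustrative value only
}

# Unguarded alternative binasher raised: drop the if and its braces and
# unindent the assignment, on the grounds that only mobile/zero traffic
# should reach this path in the first place:
# set req.http.X-CS = "250-99";
```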
[22:43:35] RECOVERY - Solr on vanadium is OK: All OK
[22:43:47] dr0ptp4kt: it looks like *.wap.wiki*.org requests can hit that path, so if those shouldn't get x-cs, the current patch is ok
[22:43:51] !log maxsem synchronized php-1.22wmf11/extensions/MobileFrontend/
[22:44:02] Logged the message, Master
[22:45:20] !log maxsem synchronized php-1.22wmf10/extensions/MobileFrontend/
[22:45:35] Logged the message, Master
[22:46:19] (CR) Asher: [C: 2 V: 2] Expanding Aircel whitelist to cover all langs. [operations/puppet] - https://gerrit.wikimedia.org/r/75798 (owner: Dr0ptp4kt)
[22:49:50] binasher, thanks.
[22:51:21] no prob
[22:54:35] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 22:54:30 UTC 2013
[22:55:25] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[23:11:15] !log maxsem synchronized php-1.22wmf10/extensions/MobileFrontend/
[23:13:36] !log maxsem synchronized php-1.22wmf11/extensions/MobileFrontend/
[23:13:38] (CR) Demon: [C: 1] replicate Gerrit repos to Jenkins slave lanthanum [operations/puppet] - https://gerrit.wikimedia.org/r/75499 (owner: Hashar)
[23:14:15] !log upgrading wikitech to 1.22wmf11
[23:14:26] Logged the message, Master
[23:24:54] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 23:24:46 UTC 2013
[23:25:24] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[23:33:09] greg-g: ping
[23:34:49] Krinkle: hi, I'm heading out in a couple, what's up?
[23:35:17] I'd like a lighting deploy window to push out a bugfix for a regression in VE.
[23:35:19] greg-g:
[23:36:11] Krinkle: ok. cherry-picking it should be fine
[23:37:04] There've been no other changes so updating to master only includes this bugfix by Roan (who left for the day as he was in late yesterday) and an i18n change
[23:37:05] update*
[23:37:20] So update to latest master and cherry-picking submodule update on wmf10/11 is OK?
[23:38:27] yeah, so there is code in both VE and other places?
[23:38:34] (just making sure I understand)
[23:38:41] so what you said should be fine
[23:39:08] greg-g: no, just VE. But since VE is deployed as a mediawiki extension I need to do a submodule update in the wmf branch of core (as we always do, just saying it weirdly I guess)
[23:39:26] oh, I see
[23:39:37] https://gerrit.wikimedia.org/r/#/c/75816/
[23:42:38] so, going ahead in a few minutes. Waiting for jenkins to verify one more time.
[23:42:48] :)
[23:46:41] Ryan_Lane: There's undeployed changes for OpenStackManager
[23:46:50] Krinkle: where?
[23:46:54] wmf11
[23:46:59] dirty git status
[23:47:03] where?
[23:47:05] submodule updated in repo, but not updated locally
[23:47:06] tin
[23:47:14] !g cde8daa3c3365bc36acc9de3adc8fc21f1a4f1de
[23:47:14] https://gerrit.wikimedia.org/r/#q,cde8daa3c3365bc36acc9de3adc8fc21f1a4f1de,n,z
[23:47:31] AFAIK I did everything properly in gerrit
[23:48:17] did someone do a git pull without doing a git submodule update?
[23:48:33] yes, you probably
[23:48:37] not me
[23:48:40] I don't deploy from tin
[23:48:46] and persumaby you did a sync afterwards which was a no-op
[23:48:49] !log krinkle synchronized php-1.22wmf10/extensions/VisualEditor 'Idc7b094f8eb2788c48'
[23:48:56] Ryan_Lane: Hm.. what do you mean
[23:48:57] again. I don't deploy from tin ;)
[23:49:00] Logged the message, Master
[23:49:07] I use the wmf branches, but not via tin
[23:49:11] for wikitech
[23:49:16] Right, this submodule is not used in mediawiki config
[23:50:10] Ryan_Lane: So on tin it is the (imho good) habit to not blindly git submodule after git pull, but only what you intend to deploy (the 2 should be equal), just like doing sync-file instead of full on scap.
[23:50:29] However that means whenever you update OSM, we get a dirty status.
I'll just submodule update that one then
[23:50:37] * Ryan_Lane nods
[23:52:14] Reedy: ping
[23:52:28] Reedy: https://gist.github.com/anonymous/75c9a722506058e0ac1b
[23:52:49] at least they're empty so probably just artifacts.
[23:52:52] alright going ahead now
[23:53:16] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds
[23:53:41] !log krinkle synchronized php-1.22wmf11/extensions/VisualEditor 'Idc7b094f8eb2788c48'
[23:54:06] Logged the message, Master
[23:54:56] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Wed Jul 24 23:54:54 UTC 2013
[23:55:16] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds
[23:55:26] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours
[23:57:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
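[Editor's note: the tin mix-up Krinkle and Ryan_Lane diagnose above (a `git pull` that moves a submodule pointer without a matching `git submodule update`) can be reproduced in a throwaway sandbox. Everything below is illustrative: `ext` stands in for the OpenStackManager extension repo, `core` for the wmf branch of MediaWiki core, and `tin` for the deployment checkout; none of the paths or names are the real infrastructure.]

```shell
# Sandbox reproduction of the "dirty git status" from the log above.
set -e
work=$(mktemp -d)
# Wrapper so commits work without global git config; protocol.file.allow
# keeps file-path submodule clones working on newer git versions.
g() { git -c user.name=demo -c user.email=demo@example.org \
          -c protocol.file.allow=always "$@"; }

# Upstream extension repo (stands in for OpenStackManager).
g init -q "$work/ext"
(cd "$work/ext" && g commit -q --allow-empty -m 'rev 1')

# Core repo embedding it as a submodule (stands in for the wmf branch).
g init -q "$work/core"
(cd "$work/core" && g submodule add -q "$work/ext" ext && g commit -q -m 'add ext')

# Deployment checkout (stands in for tin).
g clone -q --recurse-submodules "$work/core" "$work/tin"

# Upstream advances the extension, and core bumps the submodule pointer...
(cd "$work/ext" && g commit -q --allow-empty -m 'rev 2')
(cd "$work/core/ext" && g pull -q)
(cd "$work/core" && g commit -qam 'bump ext')

# ...then the deployment checkout pulls core but skips the submodule update:
cd "$work/tin"
g pull -q
status_before=$(git status --porcelain ext)   # submodule now shows as modified
g submodule update ext                        # realign the submodule working copy
status_after=$(git status --porcelain ext)    # clean again
echo "before update: ${status_before:-clean}"
echo "after update:  ${status_after:-clean}"
```

This also illustrates why a subsequent sync was a no-op for Krinkle: the sync tooling copies the submodule's working tree, and until `git submodule update` runs, that tree is still at the old revision even though the superproject's pointer has moved.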