[12:33:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:34:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.758 seconds [13:09:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:11:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.049 seconds [13:42:16] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [13:44:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:31] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.005 second response time on port 11000 [13:58:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.059 seconds [14:11:38] PROBLEM - Puppet freshness on labstore4 is CRITICAL: Puppet has not run in the last 10 hours [14:20:38] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [14:20:39] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [14:24:26] New review: Reedy; "Might aswell just remove them all. I can re-run the size script to generate the small/medium/large d..." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/38054 [14:29:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:40:17] New patchset: Jgreen; "fundraising cron hour/min typo fixed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38714 [14:40:57] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38714 [14:42:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.477 seconds [14:53:46] New patchset: Jgreen; "exit if already running" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38717 [14:54:09] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38717 [15:16:53] New patchset: Dereckson; "(bug 42933) Initial configuration for es.wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/38054 [15:17:20] New review: Dereckson; "Done." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/38054 [15:17:39] New patchset: Dereckson; "(bug 42934) Initial configuration for pt.wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/38057 [15:18:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:20:05] New patchset: Dereckson; "(bug 42934) Initial configuration for pt.wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/38057 [15:20:37] New review: Dereckson; "Done." 
[operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/38057 [15:23:14] New patchset: Dereckson; "(bug 42933) Initial configuration for es.wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/38054 [15:23:43] New patchset: Dereckson; "(bug 42934) Initial configuration for pt.wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/38057 [15:24:24] New patchset: Dereckson; "(bug 42934) Initial configuration for pt.wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/38057 [15:33:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds [15:36:54] mark looks like db1025 snapshot is an overrun after all: snap12140018 tank Swi-I- 100.00g data   100.00%  [15:37:09] hm, hey, are user classes in admins.pp supposed to be in alphabetical order? or uid order? [15:37:17] Jeff_Green: that explains everything then [15:37:41] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [15:37:42] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [15:37:42] ottomata: those two plus cronological order :-) [15:38:26] ha [15:39:55] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/38631 [15:40:43] !log demon synchronized wmf-config/throttle.php 'Deploying I330230e6' [15:40:51] Logged the message, Master [15:43:47] New patchset: Ottomata; "Adding 4 users and giving access to stat1. See RT 4106." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38718 [15:44:30] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38718 [15:48:45] New patchset: Ottomata; "Fixing ssh key for jgonera" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38719 [15:48:54] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38719 [15:49:23] New patchset: Mark Bergsma; "Revert "Adding 4 users and giving access to stat1. See RT 4106."" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38720 [15:49:36] uh oh, mark, sorry [15:49:37] what's up? [15:49:44] CT told me specifically to do that [15:50:12] to do what? [15:50:20] add those 4 users to stat1 [15:50:27] that's not reflected on the RT ticket [15:50:55] will forward [15:51:00] our normal process is that an RT request is filed, and then we give approval if there are no objections for a few days [15:51:55] ok. [15:52:00] !log sq48 going down for HDD controller card replacement-https://rt.wikimedia.org/Ticket/Display.html?id=4041 [15:52:04] if CT has any reasons to override that policy then that should at least be noted on the ticket [15:52:07] sorry [15:52:10] no its cool [15:52:14] i just forwarded you the email [15:52:52] your revert looks like it failed because i had just committed a second change to fix an ssh key type [15:52:54] typo [15:53:16] well you can revert it then [15:53:18] since it it now two patchsets, should I just manually commit a change to not include them on stat1 [15:53:36] can I leave the accounts there, but leave them off of stat1? 
[15:53:46] or, wait, actually, I already ran puppet there [15:53:51] so i'll ensure the accounts are absent [15:53:53] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [15:53:53] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [15:53:54] thanks [15:55:30] New patchset: Ottomata; "Ensuring RT 4106 accounts are absent until approval is given on the RT ticket." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38721 [15:56:08] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38721 [15:57:38] PROBLEM - Host sq48 is DOWN: PING CRITICAL - Packet loss = 100% [16:00:29] New patchset: Aude; "(bug 40036) add central auth icon for wikimania2013 wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/38722 [16:04:09] New patchset: Aude; "(bug 40036) add central auth icon for wikimania2013 wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/38722 [16:05:05] New patchset: Cmjohnson; "Changing mac address on mc1002 to reflect h/w change (new card)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38723 [16:06:13] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38723 [16:07:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:07:39] New patchset: Aude; "Fix proportions of commons central auth image" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/38724 [16:20:53] RECOVERY - Host sq48 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [16:21:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.047 seconds [16:25:50] PROBLEM - SSH on sq48 is CRITICAL: Connection refused [16:40:05] PROBLEM - Host sq48 is DOWN: PING CRITICAL - Packet loss = 100% [16:48:51] sbernardin: i think the !log needs to be the very first thing on the line (you had a leading space). also maybe you want {{RT|NNNN}} instead of a full URL. 
but that's your call :) [16:51:47] RECOVERY - Host sq48 is UP: PING WARNING - Packet loss = 50%, RTA = 0.19 ms [16:52:53] jeremyb: ok...thanks for the heads up...will make sure there is no leading spaces and will only include the RT# [16:53:25] sbernardin: anyway, the point was that your !log was never seen beyond this channel [16:53:44] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 198 seconds [16:53:50] jeremyb: ok [16:54:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:55:23] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [16:56:50] !log sq48 going down for HDD controller card replacement-rt4041 [16:56:58] Logged the message, Master [16:57:29] PROBLEM - Host ms-be3001 is DOWN: PING CRITICAL - Packet loss = 100% [17:01:14] RECOVERY - Host ms-be3001 is UP: PING OK - Packet loss = 0%, RTA = 110.35 ms [17:09:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [17:16:41] PROBLEM - NTP on sq48 is CRITICAL: NTP CRITICAL: No response from NTP server [17:18:11] PROBLEM - Host ms-be3002 is DOWN: PING CRITICAL - Packet loss = 100% [17:21:38] RECOVERY - Host ms-be3002 is UP: PING OK - Packet loss = 0%, RTA = 109.29 ms [17:26:53] PROBLEM - Host ms-be3002 is DOWN: PING CRITICAL - Packet loss = 100% [17:28:05] RECOVERY - Host ms-be3002 is UP: PING OK - Packet loss = 0%, RTA = 110.24 ms [17:41:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:35] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [17:59:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.032 seconds [18:00:20] PROBLEM - Host ms-be3003 is DOWN: PING CRITICAL - Packet loss = 100% [18:09:47] RECOVERY - Host ms-be3003 is UP: PING OK - Packet loss = 0%, RTA = 109.35 ms [18:21:11] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [18:22:14] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [18:31:39] !log reedy synchronized wmf-config/InitialiseSettings.php 'Enable clicktracking on test2wiki' [18:31:48] Logged the message, Master [18:31:53] srv266: rsync: change_dir#3 "/apache/common-local" failed: No such file or directory (2) [18:31:53] srv266: rsync error: errors selecting input/output files, dirs (code 3) at main.c(643) [Receiver=3.0.9] [18:32:10] New patchset: Reedy; "Enable clicktracking on test2wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/38737 [18:32:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:32:28] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/38737 [18:34:11] New patchset: Demon; "Clean wikivoyage namespaces configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/38063 [18:34:41] PROBLEM - Puppet freshness on erzurumi is CRITICAL: Puppet has not run in the last 10 hours [18:41:05] New patchset: Pyoungmeister; "removing commnet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38739 [18:45:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.359 seconds [18:46:02] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38739 [18:54:00] mark: sbernardin put a new hdd controller into sq48 and saved a 
squid for you [18:54:06] yay [18:54:08] thanks [18:54:17] !log appyling mini patch on bugzilla per http://bzr.mozilla.org/bugzilla/4.2/revision/8176 [18:54:25] Logged the message, Master [18:54:36] (this should fix "#4122: Searching by commenter is slow" [19:19:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:23] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [19:22:24] RECOVERY - Host ms-be1002 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [19:25:50] PROBLEM - swift-object-auditor on ms-be1002 is CRITICAL: Connection refused by host [19:25:50] PROBLEM - swift-container-auditor on ms-be1002 is CRITICAL: Connection refused by host [19:25:51] PROBLEM - swift-account-auditor on ms-be1002 is CRITICAL: Connection refused by host [19:26:17] PROBLEM - swift-account-reaper on ms-be1002 is CRITICAL: Connection refused by host [19:26:17] PROBLEM - swift-object-replicator on ms-be1002 is CRITICAL: Connection refused by host [19:26:17] PROBLEM - swift-container-replicator on ms-be1002 is CRITICAL: Connection refused by host [19:26:35] PROBLEM - Host ms-be3004 is DOWN: PING CRITICAL - Packet loss = 100% [19:26:40] apergos: how's swift btw? [19:26:44] PROBLEM - swift-account-replicator on ms-be1002 is CRITICAL: Connection refused by host [19:26:44] PROBLEM - swift-object-server on ms-be1002 is CRITICAL: Connection refused by host [19:26:53] PROBLEM - swift-container-server on ms-be1002 is CRITICAL: Connection refused by host [19:26:59] I've been on ceph land all week :) [19:27:02] PROBLEM - SSH on ms-be1002 is CRITICAL: Connection refused [19:27:02] PROBLEM - swift-account-server on ms-be1002 is CRITICAL: Connection refused by host [19:27:03] PROBLEM - swift-container-updater on ms-be1002 is CRITICAL: Connection refused by host [19:27:03] PROBLEM - swift-object-updater on ms-be1002 is CRITICAL: Connection refused by host [19:27:58] paravoid: it had an issue a couple days ago [19:28:05] that caused pages [19:28:13] I didn't have time to fully investigate it [19:28:20] but images were loading incredibly slowly [19:28:43] a positive, is that text was loading perfectly fine :) [19:28:44] swift is crawling along, moved out new rings today and waiting for the next obj repl run to complete [19:29:01] in the past when images weren't available the entire site died [19:29:22] I have been monitoring the back ends and they don't seem to have had any spikes (including the period you mentioned) [19:29:57] just increased activity according to the ring changes, that's about it [19:34:23] RECOVERY - Host ms-be3004 is UP: PING OK - Packet loss = 0%, RTA = 109.32 ms [19:35:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.017 seconds [19:41:55] New patchset: MaxSem; "WIP: OSM module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36222 [19:46:05] apergos: problems again with latex on it.wikibooks, I'll give you srv number asap [19:46:41] PROBLEM - NTP on ms-be1002 is CRITICAL: NTP CRITICAL: No response from NTP server [19:46:43] Vito: ok, though it's getting a bit late for me [19:46:48] (almost 10 pm) [19:46:59] 21:46 :D [19:47:05] I'm basically hainging out waiting for AaronSchulz to be here (and not in a meeting) [19:47:34] no pb, is there someone who can take care of srv279? [19:48:29] well if you describe the problem in here and no one is responsive in a little while, [19:48:48] hmm can other folks submit rt tickets by email? 
maybe not [19:48:49] meh [19:49:04] then bugzilla it and leave the link in here I guess [19:53:04] nn apergos. Happy belated wiktionary day. [19:53:21] oh? [19:53:29] vito what's up w/srv279 [19:53:30] when was it then? [19:53:34] 12th. [19:53:38] nic [19:53:39] e [20:05:00] gerrit down again? [20:05:08] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [20:05:30] back again [20:06:32] apergos: hey hey [20:06:34] <^demon> AaronSchulz: We're doing some replication work, I've been restarting it periodically. [20:06:52] hey AaronSchulz [20:07:00] if ( mt_rand( 0, 99 ) restart_gerrit(); [20:07:13] * if ( mt_rand( 0, 99 ) ) restart_gerrit(); [20:07:14] we're supposed to chat about copying thumbs from swift to the netapp, I hear [20:08:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:09:34] cmjohnson1: problems with latex rendering [20:09:37] apergos: I don't see the point but I said I could start it [20:09:39] a missing component [20:09:59] it happened with other apaches and apergos saw it depends on ubuntu version upgrade [20:10:35] uhh [20:10:49] ok what is your take on it AaronSchulz? [20:11:22] apergos: I'll just run it anyway :) [20:11:33] hopefully all this will go away and there will just be ceph [20:12:49] hahaha [20:12:51] ok well [20:13:11] what was your plan? I mean, we need a list of what to copy (it seems silly to copy them all) [20:13:14] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [20:13:49] AaronSchulz: [20:14:17] apergos: I thought it was requested to copy them all [20:14:31] though I recall floating some mt_rand() regeneration ideas [20:14:47] ok well I didn't know about either of those approaches [20:14:56] hmm...something like copying thumbs on cache miss when they hit thumb.php some portion of the time [20:15:13] in my ideal world we would know which thumbs were requsted for the last month (say), and if they weren't on the netapp we'd stuff em in [20:15:23] but in the real world we don't have that list [20:15:29] given that we have upload squids with stuff on disk, I still can't see the point though [20:15:56] it's like this swift migration is becoming busy work ;) [20:16:08] is becoming? [20:17:03] so are those squids or varnishes any more, and how do they revalidate? that's the ppice I need [20:17:14] cause if they recheck content every x days for some reasonable value [20:17:41] and if the recheck will cause multiwritebackend to write to the netapp if it ain't there [20:17:58] we could just wait for x days + 1 month to go by and it would be done for us [20:18:15] but those are a lot of if [20:19:42] ok what's the simplest thing we can do to get this done, AaronSchulz? :-D [20:21:34] yeah, I'd want to know the squid/varnish revalidate time [20:22:06] soo by simplest I don't mean my idea, I mean copy some reasonable amount of thumbs and be done. [20:22:42] like just a %, regardless of which ones are used more? [20:22:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.812 seconds [20:23:03] I suppose I can randomize the copy, heh [20:23:07] well, if that is enough to make us feel safe about surviving a failover [20:23:13] then yeah. [20:24:23] apergos: are you worried about the varnish caches revalidating and causing a regen on HEAD? 
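The "copying thumbs on cache miss when they hit thumb.php some portion of the time" idea floated above is essentially a probabilistic copy-on-miss: let the thumbnails that are actually requested migrate to the NetApp gradually instead of bulk-copying a mostly cold thumb store. A minimal sketch of that idea, assuming hypothetical helpers (renderThumb, copyToNetApp) and an arbitrary 1-in-100 rate; this is not MediaWiki's real thumb.php 404 handler or FileBackendMultiWrite code path:

```php
<?php
// Sketch only: stand-in helpers, not MediaWiki's actual thumbnail pipeline.
function renderThumb( string $name, int $width ): string {
	// Regenerate the thumbnail and return a local path (stubbed here).
	return "/tmp/{$width}px-{$name}";
}

function copyToNetApp( string $localPath, string $destRel ): void {
	// Stand-in for a write to the NetApp-backed store.
}

function handleThumbMiss( string $name, int $width ): string {
	$local = renderThumb( $name, $width ); // normal regeneration on a miss

	// Opportunistically mirror roughly 1% of regenerated thumbs to the
	// NetApp, so frequently requested thumbnails accumulate there over
	// time without copying the whole (largely unused) thumb set.
	if ( mt_rand( 0, 99 ) === 0 ) {
		copyToNetApp( $local, "thumb/{$name}/{$width}px-{$name}" );
	}
	return $local;
}
```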
[20:24:26] our thumb strategy is ridiculous [20:24:32] no [20:24:33] say if we switched to eqiad swift [20:24:39] I'm not worried about them particularly [20:24:51] we keep something like 9 thumbs on average [20:24:57] per original [20:25:01] yep [20:25:02] I'd think the heavily used stuff is in eqiad varnish now [20:29:12] so we have two options: 1) decide no copy is needed, tell ct why [20:29:31] 2) copy some stuff anyways (random percent is quickest?) [20:29:41] votes please? [20:30:08] apergos: relying on varnish to avoid overloads is fine as long as there is no revalidation that causes excess regens on HEAD or changing the origin to eqiad does not clear all the cache [20:30:26] if that works, then I'd stick to that, if not, then some stuff has to be copied [20:30:38] right [20:30:41] I think some mt_rand() thing might be the best bet in the later case [20:33:16] ok, the pmtpa ones are squids and the eqiad ones are varnishes so I guess we need to know about revalidation for both [20:38:26] refresh_pattern . 300 80% 5184000 stale-while-revalidate=21600 ignore-reload [20:38:26] which is a hard cutoff of 60 days, [20:38:44] New patchset: MarkTraceur; "wikibugs migrated from svn to git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26325 [20:38:44] New patchset: MarkTraceur; "git::clone now support a specific sha1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27175 [20:39:08] hashar, ^^ [20:39:29] paravoid: You too, ^^ [20:40:39] marktraceur: fill in a comment on each patchset about what you have done :-] [20:40:45] Righto [20:41:13] New review: MarkTraceur; "Rebased" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/27175 [20:41:24] marktraceur: thankkkks [20:41:28] New review: MarkTraceur; "Rebased, changed the specific hash to latest." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/26325 [20:44:06] New review: Hashar; "This is following the remarks by Faidon that we should simply use master. Changes to wikimedia/bugzi..." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/26325 [20:45:23] not exactly what I said, I said package it or use the deployment system to push it rather than pupet :-) [20:46:15] oh [20:46:40] paravoid: that is true, I forgot about that sorry :/ [20:46:49] np [20:46:52] it was a while ago [20:47:25] So....I have to package it, now, to get a one-line patch merged? [20:47:35] paravoid: while you are interrupted, could you possibly rm a dir for us on gallium please ? [20:48:08] paravoid: ssh gallium.wikimedia.org and then rm -R /var/lib/git/integration . It belongs to root:root and its content need to be replaced by a bare git repo ;) [20:49:07] hashar: fyi 23:28 mutante: installing package upgrades on gallium, incl. newer mysql packages [20:49:14] 23:40 mutante: gallium mysql (jenkins/CI) back up and running after brief db server issue [20:49:32] mutante: thanks for the warning :) [20:50:04] hashar: that happened yesterday, it broke for a brief period due to mysql vs mariadb package but was fixed soon [20:50:08] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [20:50:15] and now you got newer packages [20:50:41] mutante: I don't think we still use mysql on that host anyway. Though I will start using it next year :D [20:51:05] hashar: in case you want to see details ... tail /var/log/apt/history.log [20:51:44] mutante: that explains the entries in dmesg [20:52:13] mutante: something about mysql respawning. 
Thanks for the upgrade! [20:52:38] yw [20:53:01] mutante: would you mind deleting a dir for me on gallium please ? [20:53:15] mutante: only if you have lunched and not in the middle of something though :-) [20:53:30] /var/lib/git/integration right [20:53:39] on gallium yeah [20:53:44] I used that with git::clone() [20:54:07] !log deleting /var/lib/git/integration on gallium [20:54:08] done [20:54:11] but now the /var/lib/git/ dir is used for replication of Gerrit repositories so we end up with a conflict :) [20:54:16] mutante: danke :-] [20:54:17] Logged the message, Master [20:54:20] de rien [20:54:33] mutante: I have filled a RT to get sudo rights on gallium. That will save some time to ops team :-] [20:54:55] hashar: i saw it, no worries [20:57:01] apergos: what is it for varnish? [20:57:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:57:43] trying to find that out but don't knw much about varnish config [20:57:52] reading docs and looking at the same time [21:03:06] because it's lru I would want to see the age of a few objects being evicted to get a sense [21:03:10] but I don't know how to do that [21:07:30] there's really no point copying those thumbs to the netapp [21:07:37] :-D [21:07:53] well that would save me trying to figure out this varnish thing that I really have no idea about [21:08:01] it caches 30 days [21:08:11] where is that? [21:08:17] in the VCL [21:08:24] oh ttl [21:08:28] yes [21:08:37] ok [21:08:48] well never mind the rest then [21:09:47] so what about aaron's question above (well requirement)? [21:10:04] which one? [21:10:06] relying on varnish to avoid overloads is fine as long as there is no revalidation that causes excess regens on HEAD or changing the origin to eqiad does not clear all the cache [21:10:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.665 seconds [21:10:15] neither of those are issues? [21:10:24] changing the origin won't clear the cache no [21:10:37] "that causes excess regens on HEAD" - I don't get that [21:10:37] mutante: what is RT 1991 ? should be duped to 3071 ? [21:10:59] a HEAD to eqiad swift proxy will result in 404 handling [21:11:12] oh [21:11:18] nothing would exist there [21:11:19] but we're probably not even gonna have swift in eqiad [21:11:25] heh [21:11:28] and we'd copy thumbs to it before we do that [21:11:40] mutante: also, i can't see 2402 apparently? [21:11:42] (like the vast majority of them anyway) [21:12:10] i'm copying to esams ceph cluster now too [21:12:16] so this is all about 'suppose we have to fll over to the netapp' [21:12:49] afaik mediawiki is now writing new thumbs to the netapp as well [21:12:52] before ceph is in place (if it works out) etc [21:12:55] new ones, yup [21:13:05] and it's filling up quite fast? [21:13:20] 1 TB so far [21:13:45] jeremyb: 1991 is "jobs.wm serves wrong SSL cert", 3071 is "change DNS to wikimedia-lb"... 2402 is "create/redirect http://labs.wikimedia.org" .. could you see them via the SelfService URL? [21:13:48] the issue is all the old ones for existing pages [21:14:00] a lot of those images probably don't get changed [21:14:10] mutante: i can only see 3071 [21:14:26] let me look closer [21:15:00] jeremyb: yeah, that is because 3071 is in the public queue and the others are not..we can talk about moving stuff or giving you more perms... [21:15:21] oh, didn't realize those were in other places [21:15:33] mutante: anyway, the question stands. should those two be merged? 
[21:16:15] gerrit maintenance? 503 [21:20:12] jeremyb: 1991 can be closed then i think, let me hit resolve [21:20:21] it just redirects to http://wikimediafoundation.org/wiki/Work_with_us [21:20:29] right [21:21:43] we actually moved stuff out of "ops-requests" into "core-ops" to show we are picking it up.. but yeah, that means you could not see it anymore [21:22:08] kinda silly that [21:22:12] i don't follow [21:22:27] to pick up, you Open, not move to a different queue [21:22:28] the only one that i could ever see (afaik) was 3071 [21:22:45] 2402 and 1991 i only first looked for today and couldn't see [21:22:57] (of those mentioned above) [21:23:05] jeremyb: yes, that is because there are different queues and permissions are per queue [21:23:11] and the whole point of ops-requests was that external people could find and see their tickets there [21:23:26] mutante: yeah. but i wouldn't have known it was moved because i don't know anything about the ticket [21:23:56] mark is right, we should not move them ... or rethink permissions [21:24:22] i don't see any point moving them [21:24:52] whatever, i'm just happy we could close a ticket and no one even needed to do much work :P [21:25:04] or at least the work was in july not today [21:26:01] when leslie recently cleaned out ops-requests it was part of the idea of bug triaging ..to assign or move stuff to appropriate queues as if "ops-request" was an "incoming" or so .. [21:26:33] ... [21:26:34] no. [21:26:39] status "new", is not triaged yet [21:27:09] yea, true [21:27:10] New patchset: Demon; "Turning replication back on, issue was host fingerprint and non-cloned repos" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38795 [21:28:20] oh, hah, 2402 is one of mine anyway [21:28:36] so i guess i did know where that started [21:40:23] New review: Hashar; "yes please! :-)" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/38795 [21:40:44] <^demon> Could someone poke that for me and hashar? ^ [21:43:07] ^demon: I have updated the related bug and assigned it to you so you get the well deserved credit : https://bugzilla.wikimedia.org/show_bug.cgi?id=42958 [21:45:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:48:02] AaronSchulz: any suggestions on how often a parser cache purging script should actually run? [21:48:18] without context, no :) [21:48:36] !change 38275 | AaronSchulz [21:48:36] AaronSchulz: https://gerrit.wikimedia.org/r/#q,38275,n,z [21:48:55] oh, for pc*? [21:48:55] its a cronjob to run usr/local/bin/mwscript purgeParserCache.php on hume [21:49:02] i got the command line from Tim [21:49:10] i just dont know when the cron should run [21:49:35] paravoid: So re: wikibugs, is it really going to need to be packaged before we can deploy tiny fixes like adding a channel, etc.? [21:49:41] I think asher fixed the page purge problem via config (as well as some timeout issue) [21:49:48] AaronSchulz: oooh! [21:49:53] marktraceur: or use the deployment system I'd say [21:49:56] mutante: I would double check with him [21:50:13] AaronSchulz: okay, will do, he just might be gone on vacation [21:50:21] paravoid: Are there docs on setting up a deployment system for a project? 
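For context on the purgeParserCache.php cron being discussed (change 38275): the job is an age-based purge of the SQL-backed parser cache, so the "how often" question is mostly about how quickly expired rows should be reclaimed. A rough sketch of the underlying idea, assuming a hypothetical parsercache table with an exptime column in MediaWiki-style YYYYMMDDHHMMSS format and placeholder credentials; production runs mwscript purgeParserCache.php rather than ad-hoc SQL like this:

```php
<?php
// Conceptual sketch of an age-based parser cache purge. Table name, column
// layout, DSN and credentials are all placeholders.
$pdo    = new PDO( 'mysql:host=localhost;dbname=parsercache', 'user', 'secret' );
$cutoff = gmdate( 'YmdHis' ); // anything already past its expiry time
$batch  = 1000;               // small batches keep transactions short and replag low

do {
	$stmt = $pdo->prepare(
		"DELETE FROM parsercache WHERE exptime < :cutoff LIMIT {$batch}"
	);
	$stmt->execute( [ ':cutoff' => $cutoff ] );
	$deleted = $stmt->rowCount();
	sleep( 1 ); // breathe between batches
} while ( $deleted === $batch );
```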
[21:50:33] tiny fixes like adding a channel sounds like "configuration" though, I think this fits puppet well [21:50:33] s/ a / the / [21:50:41] but yea, afaik this was simply to make sure it does not run full again, thanks Aaron [21:50:45] not much yet I think, it's like days old [21:50:52] Ryan_Lane: ^^^ [21:50:57] Oh, the git-deploy things [21:51:53] yep [21:58:53] New patchset: Hashar; "jenkins: OpenStack jenkins-job-builder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24620 [22:00:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.025 seconds [22:00:47] New review: Aaron Schulz; "Asher should OK this too" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/38275 [22:01:19] New patchset: Hashar; "(bug 43141) jenkins: OpenStack jenkins-job-builder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24620 [22:03:00] New review: Hashar; "PS6: rebase" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/24620 [22:03:15] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/24620 [22:08:49] New patchset: Dzahn; "sort nodes at the end of site.pp alphabetically" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38797 [22:17:35] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38795 [22:18:04] paravoid, marktraceur: yeah, could use the deployment system [22:18:54] I really need to get grains going [22:19:05] Ryan_Lane: I'm sure you have some timeline for pushing out git-deploy, and I don't want to disrupt it, so whenever you're able to do that would be great :) [22:19:38] it's ready for use, in general [22:20:04] and getting more people using it helps get me feedback [22:20:47] <^demon> Ryan_Lane: Thanks. [22:20:51] yw [22:21:33] parsoid is using git-deploy for deployment already [22:21:45] Yuup, I saw that [22:22:13] So I could very readily *use* the deployment system if it were set up and I had shell access [22:24:12] you don't have shell? [22:24:40] should probably request that through your manager :D [22:24:50] Ryan_Lane: It's in the pipes as we speak [22:25:00] cool [22:25:15] Ryan_Lane: The catalyst being that I'm available for emergencies through 99.9% of the holidays :) [22:25:30] heh [22:26:36] New patchset: Demon; "Go ahead and ensure the plugins directory for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38798 [22:28:51] marktraceur: oh, you don't sleep? [22:28:54] cool [22:29:01] jeremyb: Pretty much... [22:29:31] can I get pep8 (a python linter) on gallium please ? https://gerrit.wikimedia.org/r/#/c/35666/ :) [22:29:37] would let us lint our python scripts :) [22:30:50] !log restarting Zuul on gallium [22:30:59] Logged the message, Master [22:31:23] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/38798 [22:32:08] New review: Demon; "This can be merged whenever (it's harmless), but needs to be in sometime before 2.5/2.6" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/38798 [22:33:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:34:28] !log fixing symlink on kaulen to reflect new bugzilla version 4.2.4 [22:34:37] Logged the message, Master [22:40:44] !log I am not sure what is happening between Gerrit and Zuul. Seems Gerrit might not have sent all its event via stream-event. 
I have restarted Zuul and it seems to be happy (test: https://gerrit.wikimedia.org/r/#/c/38800/ ). Will monitor over the weekend. [22:40:52] Logged the message, Master [22:46:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.324 seconds [23:03:10] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35666 [23:18:48] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [23:20:28] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.003 second response time on port 11000 [23:21:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:34:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.487 seconds [23:37:43] New patchset: Ryan Lane; "Change registration link" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38808 [23:39:07] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38808