[00:02:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:03:35] Reedy, TimStarling: why isn't Special:MergeHistory enabled on WMF wikis?
[00:05:26] dunno
[00:07:32] it would save sysops a lot of work on wikis where history merges are common
[00:14:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.036 seconds
[00:22:05] Jasper_Deng: there's been a recent thread about it on wikitech
[00:22:13] Nemo_bis: link?
[00:22:15] history merges mustn't be common
[00:22:19] lazy boy
[00:22:53] Jasper_Deng: http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/56356
[00:22:57] not hard ;)
[00:23:05] * Jasper_Deng never knew of that list
[00:23:21] :-OO
[00:23:32] oh
[00:23:37] it's still experimental, eh?
[00:24:02] in short
[00:24:05] plus very scary
[00:24:13] scary?
[00:25:32] as br.ion says
[00:26:54] Jasper_Deng, you never knew of wikitech-l?
[00:27:00] Krenair: I did
[00:27:09] but I've always been looking at lists.wikimedia.org
[00:28:03] Oh yeah
[00:28:09] Some people look at it on gmane. Not sure why
[00:45:52] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (37253)
[00:47:04] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (35772)
[00:48:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:51:38] Krenair: for the search and pretty threading
[00:51:50] oh, also because the archive is never messed up and links are stable of course
[00:52:34] the core features are supposed to be 1) using lists with newsreaders, 2) replying from gmane directly
[00:52:40] AFAIK
[00:54:25] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[00:54:25] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[00:59:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.303 seconds
[01:20:56] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000
[01:22:53] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[01:34:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:40:17] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 220 seconds
[01:41:29] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 294 seconds
[01:44:36] New patchset: Tim Starling; "Do not run jobs on precise, due to bug 40462" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24769
[01:45:35] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24769
[01:46:17] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 25 seconds
[01:46:26] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 5 seconds
[01:48:23] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours
[01:48:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds
[02:20:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:34:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds
[02:38:20] RECOVERY - Puppet freshness on search1017 is OK: puppet ran at Mon Sep 24 02:38:04 UTC 2012
[02:40:55] Nemo_bis is the only person capable of navigating Gmane's UI.
[02:41:07] I always get trapped in drop-down hell.
[02:42:14] RECOVERY - Puppet freshness on cp1034 is OK: puppet ran at Mon Sep 24 02:42:02 UTC 2012
[02:43:17] RECOVERY - Puppet freshness on search25 is OK: puppet ran at Mon Sep 24 02:42:45 UTC 2012
[02:43:26] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours
[02:45:14] RECOVERY - Puppet freshness on sq54 is OK: puppet ran at Mon Sep 24 02:44:57 UTC 2012
[02:47:47] RECOVERY - Puppet freshness on search14 is OK: puppet ran at Mon Sep 24 02:47:38 UTC 2012
[02:48:14] RECOVERY - Puppet freshness on search1015 is OK: puppet ran at Mon Sep 24 02:48:07 UTC 2012
[02:48:14] RECOVERY - Puppet freshness on search15 is OK: puppet ran at Mon Sep 24 02:48:08 UTC 2012
[02:48:50] RECOVERY - Puppet freshness on sq77 is OK: puppet ran at Mon Sep 24 02:48:34 UTC 2012
[02:50:20] RECOVERY - Puppet freshness on search1005 is OK: puppet ran at Mon Sep 24 02:49:55 UTC 2012
[02:50:47] RECOVERY - Puppet freshness on search1003 is OK: puppet ran at Mon Sep 24 02:50:34 UTC 2012
[02:51:14] RECOVERY - Puppet freshness on search1011 is OK: puppet ran at Mon Sep 24 02:51:14 UTC 2012
[02:54:50] RECOVERY - Puppet freshness on search27 is OK: puppet ran at Mon Sep 24 02:54:19 UTC 2012
[02:54:50] RECOVERY - Puppet freshness on sq80 is OK: puppet ran at Mon Sep 24 02:54:25 UTC 2012
[02:55:26] RECOVERY - Puppet freshness on search19 is OK: puppet ran at Mon Sep 24 02:55:15 UTC 2012
[02:55:35] RECOVERY - Puppet freshness on search1018 is OK: puppet ran at Mon Sep 24 02:55:20 UTC 2012
[02:58:17] RECOVERY - Puppet freshness on search1014 is OK: puppet ran at Mon Sep 24 02:57:53 UTC 2012
[03:02:20] RECOVERY - Puppet freshness on sq53 is OK: puppet ran at Mon Sep 24 03:01:46 UTC 2012
[03:02:20] RECOVERY - Puppet freshness on lvs5 is OK: puppet ran at Mon Sep 24 03:01:49 UTC 2012
[03:03:05] RECOVERY - Puppet freshness on search1002 is OK: puppet ran at Mon Sep 24 03:02:53 UTC 2012
[03:03:05] RECOVERY - Puppet freshness on search30 is OK: puppet ran at Mon Sep 24 03:02:54 UTC 2012
[03:04:44] RECOVERY - Puppet freshness on search23 is OK: puppet ran at Mon Sep 24 03:04:34 UTC 2012
[03:06:05] RECOVERY - Puppet freshness on sq85 is OK: puppet ran at Mon Sep 24 03:05:45 UTC 2012
[03:06:59] RECOVERY - Puppet freshness on analytics1009 is OK: puppet ran at Mon Sep 24 03:06:47 UTC 2012
[03:07:35] RECOVERY - Puppet freshness on search29 is OK: puppet ran at Mon Sep 24 03:07:23 UTC 2012
[03:21:32] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , plwiktionary (45574)
[03:23:02] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , plwiktionary (47688)
[05:21:47] Brooke: never heard of dropdowns being needed :)
[05:23:26] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[05:23:26] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[05:23:26] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[05:23:26] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[05:23:26] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[05:23:26] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[07:23:02] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[08:11:02] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[08:21:17] New patchset: Dereckson; "(bug 40186) Favicon for es.wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23610
[08:21:34] New patchset: Dereckson; "(bug 40186) Favicon for es.wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23610
[08:22:11] New review: Dereckson; "PS2 > Use https://bits.wikimedia.org/favicon/piece.ico" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/23610
[08:33:12] PROBLEM - Puppet freshness on dobson is CRITICAL: Puppet has not run in the last 10 hours
[09:18:27] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23610
[09:22:45] New patchset: Hashar; "(bug 40436) Namespaces configuration for se.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24656
[09:23:11] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24656
[09:30:43] New patchset: Hashar; "(bug 40419) extension assets not available on beta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24525
[09:30:59] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24525
[09:33:56] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[09:40:07] hello
[10:55:27] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[10:55:27] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[11:00:42] New patchset: Mark Bergsma; "Allow Varnish director options to be set from the manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24785
[11:01:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24785
[11:05:52] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24785
[11:11:11] New patchset: Mark Bergsma; "Use .inspect" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24786
[11:12:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24786
[11:14:48] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24786
[11:20:35] New patchset: Mark Bergsma; "Perhaps not use .inspect ;)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24788
[11:21:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24788
[11:21:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24788
[11:29:43] mark: hello :-] while you are in varnish config, there is my lame bits.beta.wmflabs.org patch at https://gerrit.wikimedia.org/r/#/c/13304/ ;)
[11:29:58] up to PS 25, maybe I should resend it
[11:30:36] or probably split it in smaller commits
[11:31:06] i think that's not enough patch sets
[11:31:12] hehe
[11:31:13] you obviously didn't test it enough
[11:32:38] bah bits on beta is dead
[11:33:25] New patchset: Mark Bergsma; "varnish config for bits.beta.wmflabs.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13304
[11:34:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13304
[11:34:27] can you get rid of the max connections thing perhaps?
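[Editor's note] The check_job_queue alerts above follow a simple threshold rule: CRITICAL once any wiki's queue exceeds 9,999 jobs, OK otherwise, with the offending wikis and counts listed in the message. A minimal Python sketch of that logic; the function name and input shape are assumptions for illustration, not the actual Nagios plugin:

```python
# Hypothetical reconstruction of the check_job_queue threshold logic
# seen in the alerts above. Not the real plugin; names are assumed.

THRESHOLD = 9_999  # jobs per wiki before the check goes CRITICAL


def check_job_queue(queue_sizes, threshold=THRESHOLD):
    """Return (status, message) in the Nagios OK/CRITICAL convention.

    queue_sizes: mapping of wiki name -> number of queued jobs.
    """
    over = {wiki: n for wiki, n in queue_sizes.items() if n > threshold}
    if over:
        detail = ", ".join(f"{wiki} ({n})" for wiki, n in sorted(over.items()))
        return ("CRITICAL",
                f"the following wikis have more than {threshold:,} jobs: {detail}")
    return "OK", f"all job queues below {threshold + 1:,}"
```

For example, `check_job_queue({"zhwiki": 37253})` yields a CRITICAL status naming zhwiki, matching the shape of the bot messages in the log.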
[11:34:38] just keep it at 10k in labs too
[11:34:49] sure
[11:35:29] i want to minimize the amount of differences between labs and production
[11:35:35] and that one doesn't seem really necessary
[11:35:44] sure, labs doesn't have the same capacity
[11:35:49] but it also doesn't need to stay up that much ;)
[11:36:41] New patchset: Hashar; "varnish config for bits.beta.wmflabs.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13304
[11:37:00] and 50 was probably a bit too low anyway
[11:37:34] New review: Hashar; "Removed the max_connections exception for labs, so it is using 10k too." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/13304
[11:37:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13304
[11:37:57] mark: done in PS 27 :)
[11:40:47] yeah i'm looking at it
[11:43:27] the instance is deployment-cache-bits02 if you want to have a look at it
[11:44:24] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13304
[11:47:22] PROBLEM - Puppet freshness on mw10 is CRITICAL: Puppet has not run in the last 10 hours
[11:47:22] PROBLEM - Puppet freshness on mw1 is CRITICAL: Puppet has not run in the last 10 hours
[11:47:22] PROBLEM - Puppet freshness on mw12 is CRITICAL: Puppet has not run in the last 10 hours
[11:47:22] PROBLEM - Puppet freshness on mw13 is CRITICAL: Puppet has not run in the last 10 hours
[11:47:22] PROBLEM - Puppet freshness on mw14 is CRITICAL: Puppet has not run in the last 10 hours
[11:47:22] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours
[11:47:22] PROBLEM - Puppet freshness on mw11 is CRITICAL: Puppet has not run in the last 10 hours
[11:47:23] PROBLEM - Puppet freshness on mw16 is CRITICAL: Puppet has not run in the last 10 hours
[11:47:23] PROBLEM - Puppet freshness on mw4 is CRITICAL: Puppet has not run in the last 10 hours
[11:47:24] PROBLEM - Puppet freshness on mw7 is CRITICAL: Puppet has not run in the last 10 hours
[11:47:24] PROBLEM - Puppet freshness on mw3 is CRITICAL: Puppet has not run in the last 10 hours
[11:47:25] PROBLEM - Puppet freshness on mw2 is CRITICAL: Puppet has not run in the last 10 hours
[11:47:25] PROBLEM - Puppet freshness on mw6 is CRITICAL: Puppet has not run in the last 10 hours
[11:47:26] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours
[11:47:26] PROBLEM - Puppet freshness on mw9 is CRITICAL: Puppet has not run in the last 10 hours
[11:47:27] PROBLEM - Puppet freshness on mw5 is CRITICAL: Puppet has not run in the last 10 hours
[11:47:28] see what you did
[11:47:35] :-D
[11:47:37] ;-)
[11:47:44] good job
[11:47:49] the config on production didn't change at all
[11:47:57] hurrah!!
[11:48:18] no I need to find out how to revert puppetmaster::self :-)
[11:48:41] why do you need to revert it?
[11:49:04] I have enabled it on my instance
[11:49:19] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours
[11:49:19] and would like to make it use the regular repo again instead of the local copy of operations/puppet
[11:49:26] just keep updating it hehe
[11:49:46] I will just create a new instance
[11:52:05] mark: have you tested varnish on Precise ?
[11:52:16] what do you mean? tested how?
[11:52:37] like setting up a Precise box to use varnish and add it to the pool
[11:52:56] we have several precise varnish boxes running in production
[11:53:20] like 10 of them, I think
[11:54:10] great
[11:54:20] so my bits instance will use Precise too :-]
[11:54:24] sure
[12:43:05] mark: Compiled VCL program failed to load: undefined symbol: GeoIPRecord_delete
[12:43:10] any idea what could cause that ?
[12:43:18] the symbol seems to be defined in the libgeoip.so
[12:43:24] is it linked?
[12:43:44] what do you mean ?
[12:43:51] I am C illiterate
[12:44:00] check the varnish init scripts
[12:44:03] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours
[12:45:45] !log restarting powerdns on ns2
[12:45:55] Logged the message, Mistress of the network gear.
[12:46:04] exec cc -fpic -shared -Wl,-x -L/usr/local/lib/ -lGeoIP -o %o %s
[12:46:12] guess I am missing it from /usr/local/lib
[12:48:24] PROBLEM - SSH on lvs6 is CRITICAL: Server answer:
[12:51:24] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[12:51:51] eep who killed lvs6 ?
[12:51:59] it's ok
[12:52:04] some ssh scan
[12:52:04] oh okay
[13:00:01] mark: so CC does load the /usr/lib/libGeoIP.so , which does seem to contain the GeoIPRecord_delete symbol
[13:01:30] yes
[13:01:58] but I still get the undefined symbol :
[13:04:37] """Change the cc_command to include the library after the source file, ie move -lGeoIP to the end of the command."""
[13:04:39] seriously
[13:11:06] New patchset: Hashar; "put library after sourcefile in varnish cc command" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24797
[13:12:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24797
[13:17:42] \O/ http://bits.beta.wmflabs.org/geoiplookup
[13:17:47] !log switching unnumbered GRE tunnel to numbered GRE Tunnel between cr2-knams and cr2-eqiad for routability reasons
[13:17:57] Logged the message, Mistress of the network gear.
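[Editor's note] The fix Hashar lands above (https://gerrit.wikimedia.org/r/24797) is a classic GNU linker argument-ordering issue: symbols are resolved left to right, and on newer Ubuntu toolchains (as on Precise) linking with --as-needed can drop a library listed before the object that needs it. Reconstructed from the cc_command quoted in the log, where %s and %o are varnish's source and output placeholders:

```sh
# fails: -lGeoIP precedes the source file, so GeoIPRecord_delete stays undefined
cc -fpic -shared -Wl,-x -L/usr/local/lib/ -lGeoIP -o %o %s

# works: the library follows the source file whose symbols it satisfies
cc -fpic -shared -Wl,-x -L/usr/local/lib/ -o %o %s -lGeoIP
```

This is the varnish cc_command parameter, not a standalone command, so it is shown here only as a config fragment.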
[13:18:08] New review: Hashar; "That did fix the issue on deployment-cache-bits03 :" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/24797
[13:50:16] !log updated distribute to 0.6.28
[13:50:26] Logged the message, Master
[13:50:56] !log updated pip to 1.2.1
[13:51:06] Logged the message, Master
[13:51:14] !log updated timelib to 0.2.4
[13:51:24] Logged the message, Master
[13:52:04] !log update mwlib to 0.14.1
[13:52:14] Logged the message, Master
[13:52:28] !log updated mwlib.rl to 014.1
[13:52:38] Logged the message, Master
[13:52:49] !log updated mwlib.epub to 0.14.2
[13:52:59] Logged the message, Master
[13:53:09] !log restart all services
[13:53:19] Logged the message, Master
[13:56:59] lucky us that the other three of those are disabled :-/
[14:20:53] wow that's gotta be the shittiest IRC bot i've ever seen
[14:22:39] ?
[14:55:02] notpeter: a couple of things...search32 down again! ....mc15 nic card was replaced/flashed and enabled for you
[14:56:42] hey cmjohnson1
[14:56:49] didn't see you come in
[14:57:02] hi apergos
[14:57:13] oh...stealthy i guess
[14:57:21] so I'm just now seeing the emails from this weekend
[14:57:28] (yeah you are :-P)
[14:57:50] I won't actually be on line most of this evening, I had a previous engagement
[14:58:03] I gotta leave at 8 pm which is um
[14:58:24] which is 1p here
[14:58:27] well in a couple hours anyways. so my overlap with the Dell guy might not be a lot
[14:58:44] dell guy should be here in the next 30-60mins
[14:58:50] yeah, I see that in the mail
[14:59:23] did you guys chat at all outside of email or is the whole discussion there?
[14:59:57] no, i kept all my communication w/ dell tech using email
[15:00:12] ok
[15:00:15] guess we'll see
[15:01:31] k...i will be afk for a few ...going upstairs
[15:01:49] see ya
[15:15:38] cmjohnson1: oh ? anything of interest with the dell tech ?
[15:16:27] he's supposed to show up to look at the swift servers
[15:18:31] ooo physically ?
[15:18:35] uh huh
[15:19:07] cmjohnson1: how much would it cost for you to tar and feather him and put that up on youtube? ;)
[15:19:28] notpeter: So are you working on the memcached machines in eqiad when I finish the cable install?
[15:19:36] I think we want to tar and feather some other people over there, but this guy is actually coming on site
[15:19:46] yeah, it would just be shooting the messenger
[15:19:48] I have the cables, going to label and run them. However, I also need to take down the original 1001-1008 to install their memory upgrades
[15:20:30] lesliecarr: not sure what this guy can do ...verify it is a server and it's connected?
[15:21:02] well if he's a uber tech he can do voltage testing on parts of the mobo
[15:21:07] power quality tests
[15:21:11] wonder if we'll need to uncable the ssds again, guess we'll see about that too
[15:21:11] heatmaps
[15:21:21] so...
[15:21:24] hrmm...we probably should and reinstall
[15:21:35] let's do that now
[15:21:47] he is a Solutions Architect
[15:21:58] and a Social Media and Community Professional
[15:22:06] I don't know what that nets us exactly
[15:22:18] another ticket punched ;-]
[15:22:24] cmjohnson1: actually, let's see what he says when he shows up
[15:22:47] k...at this rate...Dell is digging in and not going to do much for us anyway
[15:22:52] * cmjohnson1 opinion only!
[15:23:23] I hope he will want to do some checks with their own diag tools or whatever they have in the closets in the back there
[15:23:50] hrm, that doesn't sound super techy but i'll keep my fingers crossed
[15:24:40] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[15:24:40] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[15:24:40] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[15:24:40] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[15:24:40] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[15:24:41] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[15:25:55] ocg3 has bad DIMM
[15:28:32] a simple volt meter doesn't suffice to do proper power quality testing eh ;)
[15:29:53] New patchset: Demon; "Redo gerrit replication template so we can set any option" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24812
[15:30:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24812
[15:31:34] New patchset: Cmjohnson; "adding hfung to admins.pp and site.pp for stat1 access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24813
[15:32:20] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/24813
[15:34:02] Change abandoned: Cmjohnson; "more unk errors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24813
[15:51:48] !log mc1001-mc1008 are having memory upgrades installed. hosts already offline.
[15:51:58] Logged the message, RobH
[16:27:38] cmjohnson1: not surprising... and great! thank you!
[16:28:12] RobH: I want to get the mc10[01[0-9] hosts up today, but I need to figure out why not pxe booting
[16:28:18] lemme know when you're done upgrading them?
[16:35:17] !log ms-be6 shutting down for h/w testing
[16:35:24] !log mc1001-1008 memory upgrade complete
[16:35:27] Logged the message, Master
[16:35:37] Logged the message, RobH
[16:35:37] oh yay
[16:35:41] notpeter: Ok, well, mc1001-1013 have the DAC cables that we have confirmed work with the switches
[16:35:48] and mc1014 has the new DAC cable to test
[16:35:53] if it works, I will wire mc1015+
[16:36:01] (i dont want to open the cables if they dont work, so we can return them)
[16:36:53] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100%
[16:38:35] RobH: kk, sounds good
[16:45:17] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 53.96 ms
[16:46:07] I gotta go in 15 minutes, cmjohnson1, so if there is anything the Dell tech needs from me, now is the time
[16:46:43] apergos: no, he won't need anything from you now
[16:46:45] thx
[16:47:04] ok, good luck!
[17:02:13] New review: Brian Wolff; "(Not an ops person, take with grain of salt. probably totally wrong)." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/24561
[17:03:48] !log powering up and toying with ms-be8 for h/w checks
[17:03:58] Logged the message, Master
[17:04:04] New review: Catrope; "Brian is right on both counts: you need to use an intermediate wmg variable for this to work, and th..." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/24561
[17:08:21] New patchset: Ottomata; "Creating cron job to rsync edit logs to stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24815
[17:08:32] PROBLEM - NTP on ms-be6 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:09:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24815
[17:10:29] RECOVERY - Host ms-be8 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms
[17:24:26] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[17:30:03] RobH: i've tried to change the privacy options on https://lists.wikimedia.org/mailman/admin/mobile-feedback-l/privacy/spam a couple of times but none of them stick. why is that ?
[17:31:39] tfinc: I really have not messed with it a lot recently (the admins now tend to admin the staff lists), but that should be writable by the list admin =/
[17:31:56] if its not, you may wanna drop an RT ticket so an op can investigate, cuz thats not normal behavior afaik
[17:32:30] tfinc: you adding something to the regex in the second field?
[17:32:32] New patchset: Aaron Schulz; "Configure the math container as sharded." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24818
[17:33:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24818
[17:33:26] PROBLEM - NTP on ms-be8 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:36:30] RobH: i'm just trying to switch it from holding suspect mails to just rejecting them
[17:38:42] tfinc: Can you send me the regex you want in the first panel so i can confirm behavior?
[17:38:59] or you just want whats in the hold field moved to the reject one?
[17:39:08] RobH: i wonder if this is a chrome issue. I've seen that happen before.
[17:39:17] i was about to test in FF
[17:39:44] PROBLEM - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100%
[17:40:13] RobH: oh wait. i'm thinking incorrectly about this feature. i was reading it as an action when it hit a preconfigured spam filter. not a new one from me
[17:40:29] stupid of it to not complain that i hadn't put in a regex
[17:40:50] RobH: so what i wanted to do was to reject all "Reason: Message has implicit destination"
[17:40:53] where is that set?
[17:42:34] tfinc: hrmm, not sure, looking at the content filtering option now
[17:42:42] all attachment based
[17:42:43] bleh
[17:43:04] tfinc: we may wanna ask mutante as he has a bit more mailman experience
[18:08:13] New patchset: Asher; "set director retries to two for mobile-backend instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24822
[18:08:27] mark: does this make sense ^^^
[18:09:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24822
[18:12:26] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[18:13:39] New patchset: Hashar; "beta: disable commons as a default foreign file repo" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24823
[18:14:20] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24823
[18:20:53] !log temp stopping puppet on brewster
[18:21:03] Logged the message, notpeter
[18:21:55] binasher: not sure that's necessary
[18:22:09] but yeah, backend does use the old hash director
[18:22:20] so it would make the _backend_ mobiles work exactly like they did before
[18:23:14] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 250 seconds
[18:23:41] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 276 seconds
[18:25:04] just going for a small but >0 number of retries from the varnish backends to apache
[18:27:53] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 8 seconds
[18:28:44] binasher: are you going to merge and push https://gerrit.wikimedia.org/r/#/c/24822/
[18:29:06] preilly: srsly?
[18:29:06] binasher, mark: We are still seeing a good number of 503 responses on mobile
[18:29:11] fyi (esp hashar) i'm installing the Sonatype artifact repo for maven support in jenkins atm
[18:29:30] dschoon: on gallium?:-)
[18:29:36] yessir
[18:29:50] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 29 seconds
[18:29:59] binasher: pm'ed
[18:30:34] dschoon: let me know if you need any details about gallium. On maven/Sonatype, I am afraid I will not be able to help that much :-D
[18:30:41] mark: did you change the hash director at some point?
[18:30:46] thanks, hashar! will do
[18:31:20] preilly: the code? no
[18:31:25] the configuration, yes, this morning
[18:31:44] increased retries from 2 to 40 after rereading the chash code
[18:31:56] preilly: are you seeing certain urls consistently 503 still?
[18:32:53] mark: was this the change that you made Change I51ec1bd4
[18:32:57] mark: https://gerrit.wikimedia.org/r/#/c/24785/
[18:33:22] yes, with 1-2 subsequent commits to fix the syntax
[18:33:59] PROBLEM - Puppet freshness on dobson is CRITICAL: Puppet has not run in the last 10 hours
[18:34:06] binasher: I'm not sure let me check with the people reporting the issue
[18:34:26] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 304 seconds
[18:34:59] mark: so do you think with the new director_options set the issue is resolved?
[18:35:02] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 340 seconds
[18:35:18] that was an assumption
[18:35:24] binasher: also Jon Robson got one on a search request
[18:35:36] i wasn't here during the actual problem and only went off asher's mail
[18:35:44] mark: okay
[18:35:46] if you're still seeing many 503s, obviously it's not solved
[18:36:22] mark: well the reports might be old from before you made the last configuration change
[18:36:30] that particular issue isn't what's going on if all backends are up, regardless of mark's changes
[18:36:32] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 0 seconds
[18:36:45] ok
[18:36:54] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24822
[18:36:58] hmm
[18:37:02] just got one on http://en.m.wikipedia.org/wiki/Special:Random
[18:37:17] Guru Meditation: XID: 2910971993
[18:37:26] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 5 seconds
[18:37:35] although, the backends being able to retry to apache was originally put in to avoid 503's from occasional apache connect timeouts
[18:37:41] and on http://en.m.wikipedia.org/wiki/Cape_Thompson
[18:39:10] preilly: did you capture response headers on any of those 503s?
[18:39:33] i'm pushing the change to allow backend retries
[18:40:25] binasher: http://pastebin.mozilla.org/1840398
[18:41:23] cool
[18:42:34] that verifies that it's different than the hash director problem sunday - the request is actually hitting a backend varnish, so the 503 is related to the request between it and apache
[18:42:43] binasher so mark had it set to 'retries' => 40, and you changed it to 'retries' => 2,
[18:42:48] no i didn't
[18:43:12] frontend retries are unchanged from 40
[18:43:15] backend retries were 0
[18:43:18] now 2
[18:43:42] that is deployed now, try to get more 503s
[18:43:55] ah I was confused by https://gerrit.wikimedia.org/r/#/c/24785/1/manifests/role/cache.pp vs https://gerrit.wikimedia.org/r/#/c/24822/1/manifests/role/cache.pp
[18:44:55] binasher: got one http://pastebin.mozilla.org/1840400
[18:45:08] same here
[18:46:35] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 231 seconds
[18:47:11] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 267 seconds
[18:47:42] binasher: isn't it a bit weird that it returns a 503 so quickly
[18:48:10] I would think that the connect_timeout or first_byte_timeout or something would take a bit longer
[18:49:35] preilly: it may not be related to timeouts at all
[18:49:51] binasher: yeah that makes sense
[18:50:15] binasher: can we use varnish log and just look at our requests from the office and see if we spot anything weird
[18:51:50] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 5 seconds
[18:55:44] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 240 seconds
[18:56:20] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 277 seconds
[18:57:23] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds
[18:57:58] I also wonder if we could see FetchError responses from varnish log
[18:58:08] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 23 seconds
[19:02:12] seeing the backend varnish log "FetchError c no backend connection" on the 503's
[19:02:20] and yet the same varnish instance shows:
[19:02:21] backend_unhealthy 0 0.00 Backend conn. not attempted
[19:02:21] backend_busy 0 0.00 Backend conn. too many
[19:02:22] backend_fail 0 0.00 Backend conn. failures
[19:04:58] binasher: are we using Varnish saint mode at all?
[19:05:50] because Varnish saint mode will blacklist a complete backend server silently (!) once a specific number of blacklisted URLs for this backend is exceeded.
[19:06:05] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100%
[19:06:09] could we maybe try setting the saintmode_threshold higher
[19:06:14] preilly: we do not use saintmode.
[19:06:44] binasher: hmm okay
[19:09:28] binasher: was the output above from varnishlog -i Backend_health ?
[19:09:51] cmjohnson1: mc15 still won't pass traffic. it was all looking good when you left it?
[19:10:41] notpeter: i will check it
[19:13:48] cmjohnson1: thanks!
[19:20:23] New review: Catrope; "Yes, this is now safe" [operations/mediawiki-config] (master); V: 0 C: 1; - https://gerrit.wikimedia.org/r/18125
[19:34:23] mark: this is weird.. backend varnish on cp1041 still shows "backend_unhealthy 0 0.00" and backend_fail isn't increasing ever
[19:34:46] and yet varnishlog shows
[19:34:47] 0 Backend_health - ipv4_10_2_1_1 Still healthy 4--X-RH 2 2 3 0.091484 0.091619 HTTP/1.1 200 OK
[19:34:48] 0 Backend_health - ipv4_10_2_1_1 Went sick 4--X--- 1 2 3 0.000000 0.091619
[19:34:49] 0 Backend_health - ipv4_10_2_1_1 Still sick 4--X--- 1 2 3 0.000000 0.091619
[19:34:50] 0 Backend_health - ipv4_10_2_1_1 Still sick 4--X-RH 1 2 3 0.087626 0.090621 HTTP/1.1 200 OK
[19:35:03] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[19:35:52] binasher: is that with varnishadm -T stats ?
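[Editor's note] For readers following the retries discussion: in Varnish 3, a director's retry count is set as a parameter in VCL, and it bounds how many backends the director will try before the request fails with a 503 ("no backend connection"). A minimal sketch of what "backend retries: 0 -> 2" could look like, assuming the stock random director; the backend names are hypothetical, and the WMF setup actually generated this from puppet (director_options) with a custom chash director:

```
# hypothetical VCL 3 sketch -- not the actual WMF configuration
backend apache1 { .host = "10.2.1.1"; .port = "80"; }
backend apache2 { .host = "10.2.1.2"; .port = "80"; }

director mobile_backend random {
    .retries = 2;   # attempts to find a usable backend before erroring out
    { .backend = apache1; .weight = 1; }
    { .backend = apache2; .weight = 1; }
}
```

With .retries = 0 the director gives up on the first unusable pick, which matches the quick 503s being reported above.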
[19:36:01] um
[19:36:04] something like that
[19:36:16] !log db1012 shutting down for pci card install per rt3558
[19:36:25] Logged the message, RobH
[19:39:05] PROBLEM - Host db1012 is DOWN: PING CRITICAL - Packet loss = 100%
[19:42:15] notpeter: mark: is the 0/14 enabled on asw2-d3-sdtpa?
[19:44:54] !log db1012 card install done, system powered down for asher
[19:44:54] binasher: ^ db1012 has the flash card installed
[19:45:04] Logged the message, RobH
[19:50:19] Change abandoned: Catrope; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10669
[19:50:35] mark: about?
[19:53:06] New patchset: Hashar; "import zuul manifest from OpenStack (WIP)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24878
[19:54:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24878