[00:07:47] New patchset: MarkTraceur; "Move parsoid IRC channel to #mediawiki-parsoid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24390 [00:08:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24390 [00:13:06] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [00:13:06] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24390 [00:26:10] New patchset: Faidon; "swift: allow sync between {pmtpa,eqiad}-prod" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24395 [00:27:06] New patchset: Faidon; "swift: add Content-Disposition to the header whitelist" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24396 [00:27:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24395 [00:27:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24396 [00:28:39] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24395 [00:33:31] AaronSchulz: are you also storing content-disposition using cloudfiles? [00:33:50] yes, when it fits [00:34:04] okay [00:34:11] that's a good reason to keep a whitelist btw [00:34:23] you wouldn't want the MW responses and the swift responses to diverge [00:34:29] yeah, makes sense [00:34:48] so care must be taken to store into swift the headers that are in the whitelist [00:34:53] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24396 [00:38:51] RobH: around?
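The whitelist discussion above (keeping MediaWiki-served and swift-served responses from diverging) can be sketched as a simple filter. This is an illustrative sketch only, not the actual puppet/swift proxy configuration; the whitelist contents here are placeholders:

```shell
# Sketch: only headers on the whitelist get stored into swift, so responses
# served straight from swift carry the same headers MediaWiki would send.
WHITELIST="Content-Disposition Content-Type"

filter_headers() {
    # read "Name: value" lines on stdin, emit only the whitelisted ones
    while IFS= read -r header; do
        name=${header%%:*}
        for allowed in $WHITELIST; do
            [ "$name" = "$allowed" ] && printf '%s\n' "$header"
        done
    done
    return 0
}

printf 'Content-Disposition: inline; filename="x.ogg"\nX-Debug: 1\n' | filter_headers
```

The point of the exchange is that the filter must run on the *store* path too: a header stored into swift but absent from the whitelist (or vice versa) is exactly the divergence being warned about.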
[01:14:00] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [01:14:00] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [01:14:00] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [01:14:00] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [01:14:00] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [01:14:00] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [01:42:03] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 266 seconds [01:43:33] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 308 seconds [01:43:33] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 15 seconds [01:46:33] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 12 seconds [01:54:55] New patchset: Jgreen; "switch aggregator hosts for eqiad Fundraising ganglia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24404 [01:55:56] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24404 [02:03:18] New patchset: Jeremyb; "change bugzilla redir target to HTTPS; consolidate" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/24405 [02:05:02] New patchset: Pyoungmeister; "correcting macs for (most) pmtpa MC hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24406 [02:05:56] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24406 [02:06:37] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24406 [02:07:42] PROBLEM - Squid on brewster is CRITICAL: Connection refused [02:13:29] New patchset: Jeremyb; "redirects for develop{,er{,s}}.wiki{p,m}edia.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/24407 [02:15:39] New review: Jeremyb; "See also http://lists.wikimedia.org/pipermail/wikitech-l/2012-September/063328.html" [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/24407 [02:41:27] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Thu Sep 20 02:41:23 UTC 2012 [03:11:00] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [03:13:59] who is still around [03:14:00] ? [03:21:14] "around"? [03:23:54] !log purged search data from dataset2 on diederik's request [03:24:05] Logged the message, Master [03:32:54] RECOVERY - Puppet freshness on spence is OK: puppet ran at Thu Sep 20 03:32:34 UTC 2012 [03:36:30] RECOVERY - Puppet freshness on analytics1003 is OK: puppet ran at Thu Sep 20 03:36:16 UTC 2012 [03:47:06] New patchset: Pyoungmeister; "temporarily disabling lucene logs cron by diederik's request" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24409 [03:48:01] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24409 [03:48:10] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24409 [04:02:21] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [05:16:14] PROBLEM - Apache HTTP on mw58 is CRITICAL: Connection refused [05:21:02] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [05:38:23] New patchset: Tim Starling; "Update modeline" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24414 [05:38:23] New patchset: Tim Starling; "Add generated-pp-node-count debug log group" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24415 [05:38:42] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24414 [05:38:57] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24415 [05:43:30] New patchset: Tim Starling; "Increase generated node count limit to 4M" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24416 [05:43:42] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24416 [06:02:04] Change restored: Parent5446; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21322 [06:02:12] New patchset: Parent5446; "(bug 39380) Enabling secure login (HTTPS)." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21322 [06:02:20] New patchset: Parent5446; "(bug 39380) Enabling secure login (HTTPS)." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21322 [06:32:43] New review: Nikerabbit; "Ah, $msgOpts is now unused?" 
[operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/24180 [06:44:26] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [06:44:26] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [07:15:38] RECOVERY - Squid on brewster is OK: TCP OK - 0.012 second response time on port 8080 [07:36:44] New patchset: Hashar; "(bug 33464) developer.wikimedia.org redirect" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/24419 [08:02:31] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [08:03:55] !log Jenkins: installing "Gerrit Trigger plugin" 2.6 which replaces our 2.5.3 snapshot [08:04:05] Logged the message, Master [08:21:30] PROBLEM - LVS HTTPS IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:22:51] RECOVERY - LVS HTTPS IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 59974 bytes in 7.036 seconds [08:29:45] PROBLEM - Puppet freshness on mw58 is CRITICAL: Puppet has not run in the last 10 hours [08:31:54] New review: Siebrand; "@Niklas: Still used (lines 59-66)" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/24180 [08:33:39] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [08:33:57] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:35:18] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 77504 bytes in 1.105 seconds [09:28:25] New patchset: Hashar; "jenkins requires an apache2 installation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24423 [09:28:48] apergos1: mark: could you potentially merge the very simple https://gerrit.wikimedia.org/r/24423 [09:28:53] apache2 is needed to install jenkins :) 
[09:29:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24423 [09:40:41] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24423 [10:41:18] !log UploadWizard has been broken in some cases following the release of wmf12. Restored by reverting a change. See {{bug|40380}} for details. [10:41:28] Logged the message, Master [11:07:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:08:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.392 seconds [11:15:24] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [11:15:24] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [11:15:24] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [11:15:24] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [11:15:24] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [11:15:24] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [11:37:23] Change abandoned: Hashar; "Jeremy already did it in https://gerrit.wikimedia.org/r/#/c/24407/" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/24419 [11:37:52] poor virt**** [11:43:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:47:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.875 seconds [12:18:38] New patchset: Hashar; "update jenkins default init file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24427 [12:18:57] LeslieCarr: mark apergos if anyone around, can you please merge in https://gerrit.wikimedia.org/r/24427 ? 
:-D [12:19:03] fix yet another issue with jenkins :D [12:19:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24427 [12:19:51] bah [12:19:57] don't even have the correct jenkins version [12:22:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:31:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.656 seconds [13:06:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:07:10] New patchset: Jeremyb; "redirects for develop{,er{,s}}.wiki{p,m}edia.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/24407 [13:08:10] New review: Jeremyb; "PS2: just added a list of all the domain names to the commit msg in case someone's searching for the..." [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/24407 [13:12:06] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [13:18:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.961 seconds [13:48:06] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [13:53:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:57:17] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23770 [14:03:06] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [14:06:11] hiyyyya mark, you around? [14:06:19] q about precise and linux-host-entries [14:06:26] yes [14:06:30] I want to reinstall analytics1001-1010 with precise [14:06:34] they were originaly lucid [14:06:39] but perhaps I should make you wait for SF people to wake up [14:06:42] precise is now the default, right? 
[14:06:45] yeah i'm not going to do it now [14:06:46] just curious [14:06:48] since apparently that's what you always have to do ;-) [14:07:10] yeah precise is now the default [14:07:26] cool, so if I don't change anything, and don't specify [14:07:33] yup [14:07:38] and reboot these thing to reinstall, they'll just pick up precise [14:07:39] cool [14:07:40] danke [14:09:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds [14:34:00] New patchset: Ottomata; "Updating DTAC Thailand Zero Filter IPs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24431 [14:34:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24431 [14:40:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:43:29] New patchset: Hashar; "mw udp2log filter did not honor $log_directory" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24432 [14:44:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24432 [14:46:01] apergos: any more info on the c2100? [14:46:50] nope [14:47:19] am I right in assuming we do nt yet have a ticket on ms-be10 that we are working with dell on? [14:47:28] i.e. the one active ticket is ms-be6? [14:48:22] apergos: that is correct [14:48:52] ok [14:49:15] well the disks did not resolve the issue that was outstanding on ms-be10. since it's not open we will have to put it off I guess [14:49:16] so [14:49:23] would you mind doing the following [14:50:10] shut down ms be6, recable the ssds, pull the new drives from ms-be10 and drop em in, full reinstall over there? [14:50:29] sure..simple enough [14:50:30] if it whines we are going to have to pull the ssds again, that's the thing [14:50:34] and then reinstall [14:50:36] *again* [14:50:53] :-\ [14:50:53] ok [14:51:00] yeah. see? 
[14:51:10] but if you can stand it, [14:51:26] let's do that [14:52:55] what I would like to do is get ms-be6 to a state where it either appears to be trouble free (no disks slow to show up, no degraded arrays, etc) [14:53:36] or to a state where (without the ssds) it's still unhappy [14:53:57] i know...i ams still curious to know if the slowness for the disks is normal or is going on w/known good systems. [14:54:02] uh huh [14:54:11] see generally that is not "normal" behavior [14:54:42] i know but since non of us did the install on be1-5...we don't have a baseline [14:54:50] uh huh [14:55:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds [14:56:12] let me know when you get to the puppet piece in the reinstall so I can pick it up from there [14:58:45] ok [15:04:03] PROBLEM - Host analytics1010 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:36] New patchset: Hashar; "varnish config for bits.beta.wmflabs.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13304 [15:06:36] New review: Hashar; "Reinstate patchset 18 which, albeit ugly, is working on labs." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/13304 [15:06:37] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13304 [15:09:45] RECOVERY - Host analytics1010 is UP: PING OK - Packet loss = 0%, RTA = 26.79 ms [15:12:45] PROBLEM - SSH on analytics1010 is CRITICAL: Connection refused [15:16:27] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class passwords::ldap::labs for i-0000031c.pmtpa.wmflabs at /etc/puppet/manifests/role/ldap.pp:2 on node i-0000031c.pmtpa.wmflabs [15:16:30] stupiiiid puppet [15:21:36] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [15:28:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:31:48] PROBLEM - NTP on analytics1010 is CRITICAL: NTP CRITICAL: No response from NTP server [15:40:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.608 seconds [15:45:32] PROBLEM - Host ms-be10 is DOWN: PING CRITICAL - Packet loss = 100% [15:48:56] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24431 [15:56:47] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [15:58:48] New patchset: Demon; "Remove unncessary quotes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24440 [15:59:44] New patchset: Demon; "Remove unnecessary quotes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24440 [16:00:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24440 [16:03:39] hiya all (maybe notpeter or RobH?) 
[16:03:50] I want to reinstall the 10 analytics ciscos with Precise [16:04:10] I've started doing an10, but it is unhappy with the RAID or partman stuff that is currently configed [16:13:56] I cannot really help, in a training course today, sorry [16:16:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:16:32] aye ok, thanks Rob [16:17:41] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23155 [16:21:02] I also cannot find the damned video on google mapping meta data [16:21:05] and it bums me out. [16:21:48] apergos [16:21:53] Error while setting up RAID │ [16:21:53] │ An unexpected error occurred while setting up a preseeded RAID │ [16:21:53] │ configuration. │ [16:21:54] │ │ [16:21:56] │ Check /var/log/syslog or see virtual console 4 for the details. [16:22:17] is it still using the non-ssd confg? [16:25:08] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [16:25:34] New review: Nikerabbit; "@Siebrand. but it ends up unused if you merge both this and the patchset which this logically follow..." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/24180 [16:25:35] rats [16:25:43] yes [16:28:48] cmjohnson1: where are we in the install now? 
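The "error occurred while setting up a preseeded RAID configuration" dialog above usually traces back to the software-RAID recipe in the per-host partman preseed file. As an illustrative sketch only (device names, filesystem, and layout are placeholders, not the actual WMF recipe), a debian-installer RAID preseed is shaped roughly like:

```
d-i partman-auto/method string raid
d-i partman-auto/disk string /dev/sda /dev/sdb
# recipe fields: raid-type device-count spare-count fstype mountpoint members
d-i partman-auto-raid/recipe string \
    1 2 0 ext4 / /dev/sda1#/dev/sdb1 .
```

The recipe must terminate with a bare `.`; a leftover continuation character such as a trailing `\` after it is enough to abort partitioning with exactly this kind of dialog (compare the "remove trailing \ from recipe. works now :)" patchset later in this log).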
[16:29:11] PROBLEM - SSH on ms-be6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:25] i am giving it another go and going to watch the log and see what errors [16:29:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.432 seconds [16:29:42] that error came in the middle of the install [16:30:44] yeah well I need to fix the preseed file [16:31:16] migh as well do site.pp while I'm at it [16:31:35] k..the errors pop up in the partitioning [16:32:08] uh huh [16:32:12] give me a minute [16:33:12] k [16:33:35] New patchset: ArielGlenn; "ms-be6 disk layout with ssds again" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24443 [16:34:03] hmm this is the longest I've been on in days without dropping (now I will jinx it)... maybe their reset of the libe from their end actually had an impact [16:34:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24443 [16:35:16] notpeter: can you merge and push https://gerrit.wikimedia.org/r/#/c/24444/ [16:36:10] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24443 [16:36:28] paravoid: ping [16:37:06] almost ready [16:37:28] ok...let me know and I will restart the install [16:37:53] PROBLEM - Host analytics1010 is DOWN: PING CRITICAL - Packet loss = 100% [16:39:31] ok I think we are good, take it away [16:40:15] preilly: pong [16:40:24] k [16:40:24] paravoid: can you merge and push https://gerrit.wikimedia.org/r/#/c/24444/ [16:40:39] paravoid: I was trying to get notpeter to do it but he isn't around right now [16:40:55] I'm about to start walking towards the office, can it wait 20'? [16:42:02] preilly: i can get it ifi can get 10 barrells of oil from ST [16:42:29] LeslieCarr: ha ha [16:42:35] LeslieCarr: I submit the request [16:43:04] :) [16:43:14] is this the normal push then purge mobile cache ? 
[16:43:35] LeslieCarr: but, to be fair that's only $922.90 U.S. dollars worth of crude oil at today's market price [16:43:45] LeslieCarr: just merge and push [16:43:52] LeslieCarr: no need to purge the cache [16:44:05] i'm planning on holding onto it for a few years, should be mega money [16:44:26] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24444 [16:44:46] can't get my nick back yet though it;s been released [16:44:50] so irritating [16:45:32] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [16:45:32] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [16:45:33] New patchset: Andrew Bogott; "Update nova.conf.erb template" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24445 [16:46:29] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24445 [16:46:39] preilly: merged and puppet ran [16:47:22] LeslieCarr: cool thanks [16:48:05] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Thu Sep 20 16:47:39 UTC 2012 [16:48:53] apergos1: [16:48:57] ──────┤ [!!] Configuring openssh-server ├────────────────┐ [16:48:57] ┌───│ │ ─┐ [16:48:57] │ │ Failed to run preseeded command │ │ [16:48:58] │ │ Execution of preseeded command "wget -O /tmp/late_command │ │ [16:49:00] │ │ http://apt.wikimedia.org/autoinstall/scripts/late_command && sh │ │ [16:49:02] │ Ru│ /tmp/late_command" failed with exit code 1. │ │ [16:49:04] │ │ │ │ [16:49:06] └───│ │ ─┘ [16:49:08] RECOVERY - Host analytics1010 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [16:49:08] │ [16:49:10] don't worry [16:49:14] just continue [16:49:15] ? 
[16:49:21] I bet it fails on the last thing in that shell which is [16:49:23] um [16:49:54] [ -d /sys/module/ixgbe ] && apt-install ixgbe-dkms [16:50:18] which we don't need anyways [16:50:28] that should be cleaned up at some point but for right now, [16:50:42] ssh installs ok (or it should be installed ok, I'll find out soon) [16:50:54] ok...finished w/ install ...booting now [16:51:00] sweet [16:51:11] hrm, but isn't that important ? [16:51:12] * apergos1 camps on sockpuppet ready to do the first run [16:51:21] on the c2100s? [16:51:23] like if the ixgbe-dkms doesn't run, we'll have some problems [16:51:26] not on those specifically [16:51:50] it might run fine on the hosts that need it [16:52:09] all 12 disk show in post [16:52:15] uh huh [16:52:21] just fyi [16:52:40] expected but hey it's better than them not showing up [16:52:57] :-P [16:53:48] os is not showing up [16:53:56] orilly [16:54:11] no login prompt [16:54:18] what do you have? [16:54:22] last message [16:54:56] checking nvram [16:55:06] really? meeehhh [16:55:36] I really hate how long dell's take to boot, they ruddy post at least twice [16:55:52] the c2100's take 2x longer [16:56:20] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [16:56:29] ( LeslieCarr: these c2100s don't have /sys/module/ixgbe so that line in th shell is a noop, it's why I'm not worried about it ) [16:56:42] ah, yes :) [16:56:59] but it would be nice to fix the script so it still returns the right value in those cases [16:57:05] anyhoo [16:57:18] I once used a model of dell that if you plgged a kvm in it took about 45min to get past the 'loading' screen, unplug it and it flew past... [16:57:30] great [16:57:55] not even ping. [16:59:08] well err [16:59:16] care to powercycle it the hard way? 
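The "exit code 1" analysis above is a standard shell pitfall: when a guarded command like `[ -d /sys/module/ixgbe ] && apt-install ixgbe-dkms` is the *last* line of late_command and the guard fails, the script's own exit status becomes 1, and d-i reports the whole preseeded command as failed. A minimal simulation (using `echo` as a stand-in for d-i's `apt-install`, and a path chosen to never exist so the behaviour is the same on any host):

```shell
# last command of the script is a guarded install; guard fails => script exits 1
rc=0
sh -c '[ -d /sys/module/no-such-module ] && echo install-dkms' || rc=$?
echo "guarded only: exit=$rc"

# an explicit fallback keeps the script's exit status 0 when the guard fails,
# which is the "still returns the right value" fix mentioned below
rc=0
sh -c '[ -d /sys/module/no-such-module ] && echo install-dkms || true' || rc=$?
echo "with fallback: exit=$rc"
```

The first invocation prints `exit=1`, the second `exit=0`, which is why the install proceeds fine despite the scary dialog: only the final, optional step "failed".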
[17:00:33] i can do that [17:00:54] if it's not moving at all [17:01:59] i will quit console...so it's all yours [17:02:30] ok [17:02:32] lemme get there [17:02:47] New patchset: Andrew Bogott; "Point the autostatus notifier to labsconsole.wmflags.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24447 [17:02:52] lmk and I will turn on [17:03:31] nm it's on [17:03:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24447 [17:04:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:04:27] !log temp stopping puppet on brewster [17:04:37] Logged the message, notpeter [17:04:46] New patchset: Andrew Bogott; "Point the autostatus notifier to labsconsole.wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24447 [17:05:30] how long can it take to check the nvram [17:05:32] seriously [17:05:43] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24447 [17:07:42] apergos1...i know what the problem is....we need to go into bios and fix the boot settings to the ssd's [17:07:49] and to boot ahci [17:07:49] :-D [17:07:58] that's someting I never had to mess with [17:08:08] just occurred to me [17:08:10] I guess I should get the heck off the console [17:08:27] have at [17:08:39] great..give me a few ...ping u shortly [17:09:44] sure [17:12:58] 2012-09-20 13:30:37 mw6 nlwiki: InvalidResponseException in 'SwiftFileBackend::doPrepareInternal' (given '{"dir":"mwstore:\/\/local-swift\/timeline-render"}'): Unexpected response (): [17:13:05] not very helpful [17:14:01] blah [17:14:16] very descriptive response [17:14:28] lots and lots of similar stat errors [17:17:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.181 seconds [17:18:59] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [17:19:53] apergos1: 
looks like that only happens on authentication [17:20:10] hmmm [17:20:14] cloudfiles is getting no response or can't parse the status out of the header [17:21:39] New patchset: Pyoungmeister; "remove trailing \ from recipe. works now :)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24452 [17:22:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24452 [17:23:20] can you have it dump the headers in that case (see if we get any)? [17:24:39] apergos1: ms-be6 is ready for puppet [17:24:47] ok [17:24:51] thanks! [17:24:56] RECOVERY - SSH on ms-be6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [17:24:56] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24452 [17:28:32] RECOVERY - Puppet freshness on ms-be6 is OK: puppet ran at Thu Sep 20 17:28:03 UTC 2012 [17:29:44] RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [17:29:44] RECOVERY - swift-container-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [17:29:53] RECOVERY - swift-object-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [17:29:54] yeah yeah lies, but whatever [17:30:02] RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [17:30:02] RECOVERY - swift-container-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [17:30:02] RECOVERY - swift-account-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [17:30:20] RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [17:30:20] 
RECOVERY - swift-account-reaper on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [17:30:20] RECOVERY - swift-object-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [17:30:32] random auth error still going on [17:30:47] RECOVERY - swift-container-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [17:30:47] RECOVERY - swift-account-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [17:30:56] RECOVERY - swift-object-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [17:32:34] oh awesoe all the disk labels are wrong :-D [17:32:37] *sigh* [17:33:32] ok I'm going t ignore that for a minute and just see about the rest of the install, like do we get good reboots [17:34:07] big things first eh? [17:34:59] PROBLEM - Host analytics1010 is DOWN: PING CRITICAL - Packet loss = 100% [17:35:12] yup [17:35:44] reboot in progress, watching [17:40:06] and not watching, got disconnected right at the start of the reboot >_< [17:40:26] guess we'll repeat that [17:40:33] RECOVERY - Host analytics1010 is UP: PING OK - Packet loss = 0%, RTA = 26.78 ms [17:44:35] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [17:45:49] sdm1 .. wait a couple secs... shows. [17:45:56] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [17:46:00] rebooting again to see what it does this time [17:48:11] PROBLEM - Host analytics1010 is DOWN: PING CRITICAL - Packet loss = 100% [17:49:32] m again [17:49:35] same deal [17:50:12] so I'm going to disable puppet over there, stop swift services, and we shoudl report this to dell [17:50:31] what happened? 
[17:51:22] we get sdm claiming not to be ready during reboot [17:51:36] it doesnt' require much wait but [17:51:44] it shoouldn't require any [17:52:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:52:08] guess I'll make the labels match up with the disk ids too while I'm att it [17:54:20] PROBLEM - swift-account-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [17:54:20] PROBLEM - swift-object-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [17:54:20] PROBLEM - swift-account-reaper on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [17:54:20] PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [17:54:29] PROBLEM - swift-container-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [17:54:38] PROBLEM - swift-object-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [17:54:47] PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [17:54:47] PROBLEM - swift-container-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [17:54:56] PROBLEM - swift-account-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [17:55:05] PROBLEM - swift-object-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [17:55:14] PROBLEM - swift-container-updater on ms-be6 is CRITICAL: 
PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [17:55:23] PROBLEM - swift-account-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [18:03:51] and gone [18:03:53] rat [18:03:53] s [18:07:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [18:07:59] apergos: what does mw\d+ stand for? [18:08:05] as server names [18:08:14] just mediawiki or something? [18:08:54] all the swift errors are coming from boxes with that name convention [18:09:06] I guess those are app servers [18:09:10] just running apache, the usual [18:09:23] why don't I see srvxxx in there? [18:10:11] no idea [18:10:41] hiii mutante, you around? [18:10:47] hey .i am [18:10:49] i'm having some trouble with the ciscos [18:11:10] i've been able to resinstall one with precise [18:11:27] it looks like pxe boot remains the default boot process after reinstall [18:11:29] so, i manually change it [18:11:34] to hdd [18:11:38] but then when I reboot [18:11:48] it hangs here: [18:11:49] Press or to enter Emulex BIOS configuration [18:11:49] utility. Press to skip Emulex BIOS [18:11:49] Emulex BIOS is Disabled on Adapter 1 [18:11:49] Emulex BIOS is Disabled on Adapter 2 [18:12:03] nothing happens after that [18:12:14] and if you skip it with "s"? [18:12:36] and then reboot it again..does it just do that once or at every boot [18:13:00] actually i did not change the default to PXE [18:13:07] i just hit F12 to go to PXE [18:13:25] hmm [18:13:31] oh on boot, hmmmm [18:13:46] (how do I send an F12, btw?) [18:14:21] rebooting, gonna try to skip... [18:14:30] notpeter: is there a difference between mw\d+ and srv\d+ apaches? [18:14:53] ottomata: eh, for some reason i am lucky with that, i just hit litearally F12 when i am connected and it goes right through [18:14:59] can you give me the name of one of he mws that's doing it by the way? 
[18:15:08] AaronSchulz: nein [18:15:17] I mean, mw's are newer and faster [18:15:21] but they are set up the same [18:15:22] why? [18:15:33] they are giving weird errors in the swift log [18:15:43] what numbers? [18:15:45] mediawiki tries to auth and gets a response it can't parse [18:15:54] mw1-mw16 perhaps [18:15:58] ah [18:16:07] those have apache off [18:16:10] and are only jobrunners [18:16:28] seems like they fail when curling to swift then [18:16:34] indeed [18:19:59] notpeter: authenticating eval.php from mw12 seems to work [18:21:06] *authenticating in [18:21:18] which is cli, just like job runners [18:22:13] hhhmmmm, they should have all of the same packages, confs, etc [18:22:33] notpeter: maybe you can watch the packets? ;) [18:22:49] I'm curious just what swift is responding with [18:23:21] sure, I can tcpdump for you [18:23:23] maybe it just fails the "if ( preg_match( "/^(HTTP\/1\.[01]) (\d{3}) (.*)/", $header, $matches ) ) {" check in CF [18:23:38] do mw3 [18:23:47] seems to be popular :) [18:24:31] what port?
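The status-line check AaronSchulz quotes from the CloudFiles bindings is easy to exercise in isolation; a minimal Python sketch of the same regex (the input strings here are made up for illustration, not taken from any capture):

```python
import re

# The same pattern CF applies to the first header line of the auth
# response: HTTP/1.0 or 1.1, a three-digit status code, a reason phrase.
STATUS_LINE = re.compile(r"^(HTTP/1\.[01]) (\d{3}) (.*)")

def parse_status_line(header):
    m = STATUS_LINE.match(header)
    if m is None:
        return None  # the "response it can't parse" case
    version, code, reason = m.groups()
    return version, int(code), reason

print(parse_status_line("HTTP/1.1 200 OK"))   # ('HTTP/1.1', 200, 'OK')
print(parse_status_line("x"))                 # None -> unparseable response
```

Anything that is not a well-formed status line (an empty read, a proxy error page, truncated bytes) falls through to None, which matches the "gets a response it can't parse" symptom.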
[18:25:33] I believe 80 (to swift) [18:26:15] so there should be GET reqs to an /auth/v1 url [18:26:29] there's basically no traffic on port 80 [18:27:51] New patchset: Jdlrobson; "move from deprecated wfMsg to wfMessage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24243 [18:28:13] wait, one sec [18:28:34] * AaronSchulz likes how none of the mw boxes are at server roles [18:28:42] PROBLEM - Host analytics1009 is DOWN: PING CRITICAL - Packet loss = 100% [18:30:27] Change abandoned: Jdlrobson; "Actually covered by https://gerrit.wikimedia.org/r/#/c/24243/" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24180 [18:31:05] PROBLEM - Puppet freshness on mw58 is CRITICAL: Puppet has not run in the last 10 hours [18:34:14] RECOVERY - Host analytics1009 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [18:34:21] AaronSchulz: emailed dump to you [18:34:30] just saw [18:34:33] didn't want to pastebin, as I'm not sure if that counts as private data or not [18:34:41] is that what you neeeeeed? [18:35:08] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [18:35:57] New review: Brion VIBBER; "How is this on operations/mediawiki-config project? When I try to check it out per the above directi..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/24243 [18:37:27] notpeter: I'm trying to see the headers that swift uses in its response to the auth reqs [18:37:47] New patchset: Cmjohnson; "Removing decommissioned servers from the dhcpd file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24464 [18:37:48] I see some object responses for GETs [18:37:50] PROBLEM - SSH on analytics1009 is CRITICAL: Connection refused [18:38:41] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24464 [18:39:13] so I guess auth most work sometimes, but I still figure whats wrong with the auth response sometimes [18:39:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:41:01] New review: Brion VIBBER; "On second look this is actually fine, I just thought it was MobileFrontend but it's a config extensi..." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/24243 [18:42:08] ottomata: sooo.. i can confirm your report, it hangs on reboot in a weird way :/, still trying though, it just takes that long to boot [18:42:47] not sure if this is relevant [18:42:56] at the end of the install it does this: [18:42:56] │ Execution of preseeded command "wget -O /tmp/late_command │ │ [18:42:56] │ │ http://apt.wikimedia.org/autoinstall/scripts/late_command && sh │ │ [18:42:56] │ Ru│ /tmp/late_command" failed with exit code 1. [18:43:01] i just hit continue [18:44:36] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24243 [18:46:47] ottomata: so you are also installing 1009 right now, hm? wanna wait to see if it behaves the same first? [18:47:17] after the freeze i cant get console output anymore [18:47:17] PROBLEM - Host analytics1009 is DOWN: PING CRITICAL - Packet loss = 100% [18:47:37] and i never saw anything from the actual OS yet, so i doubt its related to the late_command from OS install [18:47:38] yeah [18:47:42] i just installed it [18:47:50] and i set boot order back to hdd [18:47:57] rebooted it [18:47:59] now waiting for memtest [18:48:01] we will see... [18:48:01] also did not get grub shell or anything [18:48:07] ok, brb then [18:48:26] New patchset: Faidon; "autoinstall: make late-command -e happy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24467 [18:49:27] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24467 [18:49:34] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24467 [18:50:06] ottomata: fixed (re: preseed error) [18:51:22] oh! [18:51:25] 09 worked too! [18:51:31] thanks paravoid! [18:51:38] RECOVERY - SSH on analytics1009 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:51:39] mutante, i have a login prompt on 09 [18:51:47] RECOVERY - Host analytics1009 is UP: PING OK - Packet loss = 0%, RTA = 26.80 ms [18:51:59] lemme try to reinstall 1010 from scratch again and see if it still doesn't work [18:53:10] notpeter: nic card may be unseated....will let you know [18:54:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds [18:54:54] I forced run puppet on brewster and it replaced a couple of mac addresses on mc* boxes [18:54:58] binasher: is that you? [18:55:24] I saw a .swp, so someone is editing it live and just lost their changes [18:55:30] paravoid: i think it was notpeter [18:55:38] i saw he !log'd about disabling puppet on brewster [18:55:51] oh indeed, silly me [18:55:58] sorry [18:56:36] mc2-14,16 are imaged, hopefully their macs are in git [18:57:00] host mc11 { [18:57:00] - hardware ethernet 90:e2:ba:19:53:fc; [18:57:00] + hardware ethernet d4:be:d9:f7:c2:ca; [18:57:05] host mc15 { [18:57:06] - hardware ethernet 90:e2:ba:19:51:b9; [18:57:06] + hardware ethernet 90:e2:ba:19:51:b8; [18:58:58] !log ran scap on mw58 and returning to the lvs pool [18:59:08] Logged the message, Master [19:02:32] New patchset: Andrew Bogott; "Define instance_status values in role." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24469 [19:02:41] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [19:03:27] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24469 [19:04:02] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24469 [19:04:04] I am tired of init.d scripts :-D [19:04:13] they keep hiding error messages [19:05:49] hm mutante, once the OS installed and it has booted [19:06:00] I should be able to log in as root with my ssh key, right? [19:06:02] about to scap [19:11:22] New patchset: Asher; "install php redis extension on precise and lucid apaches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24470 [19:11:46] paravoid: those were me [19:11:51] thanks for grabbing the diff! [19:12:09] hmm, mutante, also, now analyics1010 seems to hang even on pxe boot [19:12:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24470 [19:12:35] New patchset: Pyoungmeister; "correcting mac for mc11" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24472 [19:13:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24472 [19:13:38] PROBLEM - NTP on analytics1009 is CRITICAL: NTP CRITICAL: No response from NTP server [19:14:58] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24472 [19:15:45] notpeter: not sure what is going on with the nic on mc15..i am going reflash the card [19:16:43] New patchset: Pyoungmeister; "lucene.php: moving all search traffic back to eqiad" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24473 [19:16:52] cmjohnson1: cool! thanks. [19:17:34] notepeter, after the thing installs and boots, I should be able to login as root with my ssh key, right? [19:17:35] cmjohnson1: for mc1, has dell shipped replacement parts yet? 
[19:18:04] ottomata: i have disabled all other methods besides HDD in bios for that one [19:18:20] ottomata: you have to use sockpuppet:/root/.ssh/new_install until puppet has first run [19:18:38] binasher: not yet...i haven't called yet. the c2100's have been dominating my time...i will call in a few now that I know we don't need new sfp's [19:18:47] aslo, get puppet cert signed by sockpuppet [19:18:51] oh, didn't know that, cool [19:19:12] mutante, ah I see, so I can't use the mgmt interface to set boot order? [19:19:14] cmjohnson1: great, thanks. let me know how the call goes [19:19:18] an an1010? [19:20:25] ottomata: i have never done that via mgmt on the Ciscos, i just hit F12.. what is 1009 doing ? [19:20:41] installed and booted! [19:20:45] trying to log in now to make sure all is well [19:20:52] i was going to just try reinstall on 1010 again [19:20:57] i'm suspicious that maybe [19:20:57] New patchset: Hashar; "varnish config for bits.beta.wmflabs.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13304 [19:21:07] since it was PXE booting even after the install finished [19:21:13] that it was doing something that borked the install [19:21:31] on an1009, I switched it back to HDD and powercycled before it had a chance to finish the memory test [19:21:49] on an1010 I didn't get a chance to try that until the prompt about partitioning the drives came up [19:21:53] New review: Hashar; "Fixed up $test_wikipedia (aka srv193) which was not properly expanded in the vcl templates." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/13304 [19:21:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13304 [19:22:12] is anyone familiar with GeoIP ? 
[19:22:24] hehe, mutante, you cisco wikitech page says: [19:22:25] - Do not disable boot options via BIOS screens [19:22:33] hashar: in what sense [19:22:43] err: /Stage[main]/Geoip::Data::Sync/File[/usr/share/GeoIP]: Failed to generate additional resources using 'eval_generate: Error 400 on SERVER: Not authorized to call search on /file_metadata/volatile/GeoIP with {:checksum_type=>"md5", :links=>"manage", :recurse=>true} [19:22:46] I am a little bit, but more from the ops side, not so much the usage [19:22:50] looks like some permission error [19:22:58] though I have no idea what the volatile stuff is about [19:23:01] yeah that's my stuff [19:23:10] so volatile is basically a private puppet fileserver [19:23:31] kinda like the private puppet repo [19:23:36] ottomata: hah, ok, well, cool, if 1009 worked, i say reinstalling 1010 one more time makes sense. was about to suggest that too [19:23:46] ahh [19:23:49] yeah, i think I can't do it though, since you changed the bootorder in bios [19:23:52] can you switch it back [19:23:52] I thought we had a public version [19:24:08] * hashar looks at the geoip classes [19:24:10] ummmm, do you know which file it is trying to put in place? [19:24:13] is that GeoIP.conf? [19:24:14] ottomata: sure, will do [19:24:27] oh [19:24:31] ohhh [19:24:33] ok yeah [19:24:35] those are the db files [19:26:29] I am not sure what it tries to install [19:26:35] I am installing a varnish instance on labs [19:26:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:26:51] hmm [19:27:09] at least I got some data in /usr/share/GeoIP [19:27:14] oh you did? [19:27:17] there are .dat files there? 
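The `Not authorized to call search` 400 above is the puppetmaster's fileserver ACL rejecting the labs client. A volatile mount of the kind hashar describes is declared along these lines in the master's fileserver.conf (mount path and allow patterns here are illustrative, not the production values):

```
[volatile]
    # Private payloads (e.g. the paid Maxmind .dat files) live outside
    # the public puppet repo; only production hosts are allowed in.
    path /var/lib/puppet/volatile
    allow *.wikimedia.org
    allow *.pmtpa.wmnet
    allow *.eqiad.wmnet
```

A labs instance matches none of the allow patterns, so recursing into the volatile GeoIP directory fails with exactly this 400 while the rest of the catalog still applies.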
[19:27:30] GeoIP.dat and GeoIPv6.dat \O/ [19:27:35] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 192 seconds [19:27:45] that is a bit crazy ;) [19:27:47] looks right to me [19:27:56] not sure what your error is then [19:28:01] so maybe it installs some basic version [19:28:08] and then just fail to update them [19:28:29] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 219 seconds [19:28:54] its supposed to copy them via puppetmaster fileserver to the puppet client [19:29:04] that's it though [19:29:08] so it looks like it worked [19:29:18] what are the file sizes [19:29:21] should look like this: [19:29:21] -rw-r--r-- 1 root root 1664511 Jan 3 2012 GeoIP.dat [19:29:21] -rw-r--r-- 1 root root 5349447 Jan 3 2012 GeoIPv6.dat [19:29:36] 1204947 and 109251 for v6 [19:29:46] date January 18th 2010 [19:29:54] ah, hm, weird [19:29:59] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 196 seconds [19:30:01] that's /usr/share/geoip [19:30:01] ? [19:30:08] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 199 seconds [19:30:08] ? [19:30:10] oh yeah [19:30:14] it is [19:30:17] sorry for being laggy :/ [19:30:18] hmmm [19:30:19] that's not good [19:30:35] jan 18th 2010?? weird [19:30:49] notpeter: where did you say you put that dump? 
[19:30:52] that is the Ubuntu package geoip-database [19:30:58] it is installed :-) [19:31:31] aye ok [19:31:31] AaronSchulz: in your home dir on mw3 [19:31:50] actually, yeah, hashar, i'm searching puppet for inclusion of geoip class [19:31:59] ahh, mw, I was looking on fenari ;) [19:32:07] i only see it in role::statistics::cruncher [19:32:25] !log pushing all search traffic back to eqiad [19:32:34] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24473 [19:32:34] Logged the message, notpeter [19:32:42] hashar: ottomata there is "# geoip::data::download" [19:33:01] ahh searching poorly [19:33:05] which installs a cronjob to update it weekly [19:33:11] by downloading from maxmind [19:33:13] yes, but that is on puppetmaster only [19:33:29] ah [19:33:29] require geoip [19:33:33] instead of include geoip [19:33:37] in cache.pp line 467 [19:33:47] that's why hashar is getting it on varnish machines [19:34:05] ottomata: ana1010. BOOT options enabled again. Hit F12, boots into PXE now [19:34:16] ok, cool, i can't F12, so i'm going to set order with mgmt [19:34:33] well, you can just take over the console now [19:34:42] ok, its booting? [19:34:42] it already does it and after install you dont want it to, right [19:34:45] yea [19:34:49] yeah k [19:35:04] k will watch it and hopefully catch it soon enough [19:35:19] Ubuntu 12.04 Precise Pangolin AMD64 (Wikimedia edition [19:35:24] there is the installer.. [19:35:25] sorry been laggy [19:35:32] yup, watching it now [19:35:40] ah, we can both be on it? cool [19:35:45] yeah cool! [19:35:59] hashar, ummmmm hm [19:36:04] yeah something is broken for sure then [19:36:07] you should get the newer files [19:36:07] hmm [19:36:08] oh [19:36:12] well I am on labs [19:36:14] maybe its because puppet won't replace those files recursively? 
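Telling stock package data apart from a fresh Maxmind sync is exactly what the sizes and dates above show; a small Python sketch of that check (the threshold and the idea of a helper are assumptions for illustration, not WMF tooling):

```python
import datetime
import os

def stale_geoip_files(directory="/usr/share/GeoIP", max_age_days=60):
    """Report GeoIP data files that look like old stock copies
    (e.g. the 2010-dated files from Ubuntu's geoip-database package)
    rather than recent downloads from Maxmind."""
    stale = []
    now = datetime.datetime.now()
    for name in ("GeoIP.dat", "GeoIPv6.dat"):
        path = os.path.join(directory, name)
        if not os.path.exists(path):
            stale.append((name, "missing"))
            continue
        mtime = datetime.datetime.fromtimestamp(os.path.getmtime(path))
        age_days = (now - mtime).days
        if age_days > max_age_days:
            stale.append((name, "%d days old" % age_days))
    return stale
```

On a labs box with only the geoip-database package installed, both files would show up as years old, matching the January 2010 date hashar reports.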
[19:36:15] oh good [19:36:17] try this then [19:36:21] so I probably don't have access to volatile :-) [19:36:27] oh [19:36:28] grep the puppet log for the filenames? [19:36:28] oh [19:36:31] yup that is probably right [19:36:38] there are some dummy files anyway, so that is probably enough for my use case [19:36:55] yeah they will work if they are from the .deb [19:37:00] they just won't be the full paid versions [19:37:08] fine :-) [19:37:11] RECOVERY - Host analytics1010 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [19:39:10] notpeter: mw58: rsync: mkstemp "/apache/common-local/php-1.20wmf12/includes/filebackend/.SwiftFileBackend.php.tu25FB" failed: Permission denied (13) [19:39:53] that's binasher's fault [19:40:06] yep [19:40:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.670 seconds [19:40:38] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds [19:40:38] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [19:41:46] um, notpeter, i have maybe a stupid question [19:41:53] how do I clear the ssh host key? [19:42:01] I'm used to the hostname or IP being in .ssh/known_hosts [19:42:17] but it looks hashed on sockpuppet:/root/.ssh/known_hosts [19:42:18] or something [19:44:24] oh doh, it tells me in the ssh output [19:44:32] ottomata: fixed [19:44:37] New patchset: Hashar; "Maxmind geoIP data files are only for production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24479 [19:44:38] ottomata: i cleared it for you for ana1009 [19:44:46] ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R analytics1009.eqiad.wmnet [19:44:46] AaronSchulz: mw58 should be fixed, want me to run scap on it? [19:44:47] ? [19:44:59] ottomata: that is deleted it from /etc/ssh/ssh_known_hosts on fenari (as root) [19:45:15] aye, thanks [19:45:19] yayyy i'm in! [19:45:31] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24479 [19:45:31] binasher: sure [19:45:35] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [19:45:55] New review: Hashar; "This is to prevents a warning when installing varnish on labs:" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/24479 [19:46:26] ottomata: change set 24479 should get rid of the warning on labs :-) Cant add you to review! [19:46:29] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [19:47:33] New patchset: Demon; "Perform daily backups of gerrit for amanda to pick up" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24481 [19:48:18] hashar, reviewing [19:48:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24481 [19:48:58] New patchset: Andrew Bogott; "Move python-mwclient into a generic class." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24482 [19:49:17] ahh Guru Meditation: XID: 293871250 …; Varnish is starting up :-) [19:49:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24482 [19:50:09] hashar, can I be really picky about this one? since I wrote it and want to keep it pretty? :) [19:50:14] PROBLEM - Host analytics1010 is DOWN: PING CRITICAL - Packet loss = 100% [19:50:18] ottomata: go ahead :) [19:50:27] binasher: is it done? 
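Back to the hashed known_hosts question above: with HashKnownHosts enabled, OpenSSH stores `|1|base64(salt)|base64(HMAC-SHA1(salt, hostname))` in place of the plain hostname, which is why grepping for the name finds nothing and `ssh-keygen -R` is the right tool. A Python sketch of the matching that `-R` performs (for illustration only, not a replacement for ssh-keygen):

```python
import base64
import hashlib
import hmac
import os

# Hashed known_hosts entries: |1|b64(salt)|b64(HMAC-SHA1(key=salt, msg=host))
def hashed_entry(hostname, salt=None):
    salt = salt if salt is not None else os.urandom(20)
    digest = hmac.new(salt, hostname.encode(), hashlib.sha1).digest()
    return "|1|%s|%s" % (base64.b64encode(salt).decode(),
                         base64.b64encode(digest).decode())

def entry_matches(entry, hostname):
    _, magic, b64salt, b64digest = entry.split("|")
    if magic != "1":
        return False
    salt = base64.b64decode(b64salt)
    digest = hmac.new(salt, hostname.encode(), hashlib.sha1).digest()
    return base64.b64decode(b64digest) == digest
```

Since each entry carries its own random salt, you can only find a host by re-hashing the name against every stored salt, which is what makes `ssh-keygen -f /etc/ssh/ssh_known_hosts -R <host>` the practical way to clear one.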
[19:50:29] what you have will work for sure, but it kinda ruins the semantics of the provider parameter [19:50:44] AaronSchulz: yeah [19:50:50] basically, if realm is not production, then provider won't work [19:50:51] hmm [19:51:24] im' not sure if this would work [19:51:27] but it might be more elegant [19:51:27] if [19:51:32] instead of modifying the geoip::data class [19:51:38] you put a conditional in the geoip class [19:51:40] that says [19:52:23] if realm == production { [19:52:23] class { "geoip::data": data_directory => $data_directory } [19:52:23] } [19:52:23] else { [19:52:23] package { "geoip-database": ensure => "installed" } [19:52:23] } [19:52:44] i just searched real quick, and I don't think anything is requiring Class["geoip::data"] [19:52:59] ottomata: analytics1010 login: yay [19:53:06] yayyy! [19:53:21] ottomata: amending :) [19:53:25] aye, but hmm [19:53:39] wait, I do see requires for File["$data_directory"] [19:53:41] RECOVERY - SSH on analytics1010 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:53:48] so you shoudl add that in the else { } as well [19:53:50] RECOVERY - Host analytics1010 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [19:54:00] # Make sure the volatile GeoIP directory exists. [19:54:00] # Data files will be downloaded by geoipupdate into [19:54:01] # this directory. [19:54:01] file { "$data_directory": [19:54:01] ensure => "directory", [19:54:01] } [19:54:11] just that bit in the else where you install the package [19:55:34] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24482 [19:56:21] New patchset: Hashar; "Maxmind geoIP data files are only for production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24479 [19:56:24] ottomata: ssh keys cleared. want me to sign puppet certs really quick and run puppet? 
[19:56:26] ottomata: updated :) [19:56:46] oh men the $data_directory forgot about it [19:56:51] mutante, i'm doing it now :) [19:56:53] is there any reason that configchange wouldn't have hit all of the apaches? [19:57:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24479 [19:57:36] hashar, I think you forgot your else { } too? [19:57:44] else install geoip-database package [19:57:51] and make sure $data_directory exists [19:58:11] ex: mw54 is still sending search traffic to pmtpa [19:58:18] even though I updated lucene.php [19:58:27] it's in the mediawiki-installations dsh group [20:02:38] alright, just abandoned the very last patch set that was in the test branch. bye bye "test" [20:03:19] (yay!) [20:04:00] foodbot: @find Thai [20:04:56] PROBLEM - Host analytics1002 is DOWN: PING CRITICAL - Packet loss = 100% [20:06:16] New patchset: Hashar; "Maxmind geoIP data files are only for production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24479 [20:06:37] ottomata: and now fallback to instal the geoip-database package and copy them to $log_directory if needed [20:06:40] not nice though [20:07:10] New review: Dzahn; "yep, redirect bugzilla to https" [operations/apache-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/24405 [20:07:10] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/24405 [20:07:11] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24479 [20:07:20] hmm, hashar, i don't think you need to do anything other than ensure that the directory exists [20:07:27] the geoip-database package ensures that the files are there [20:07:29] PROBLEM - Host analytics1004 is DOWN: PING CRITICAL - Packet loss = 100% [20:07:38] the only reason I said you'd need the file resource is really to make puppet happy [20:07:38] PROBLEM - Host analytics1003 is DOWN: PING CRITICAL - Packet loss = 100% [20:07:38] PROBLEM - Host analytics1006 is DOWN: PING CRITICAL - Packet loss = 100% [20:07:44] annnnnnd actually [20:07:46] the way you are doing it [20:07:53] only ensureing that the symlink exists in produciong [20:07:57] you might not even need it [20:08:05] PROBLEM - Host analytics1008 is DOWN: PING CRITICAL - Packet loss = 100% [20:08:06] i think that is the only place that would require that the file resource is defined [20:08:10] ottomata: but the package install the files in /usr/share/GeoIP [20:08:14] PROBLEM - Host analytics1005 is DOWN: PING CRITICAL - Packet loss = 100% [20:08:14] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [20:08:16] yes [20:08:24] the package does, and it will create the directory too [20:08:28] i was just saying to define the file in puppoet [20:08:39] so that if anything else in puppet referenced it [20:08:43] you wouldn't get an undefined error [20:08:44] but then $data_directory could be set to something else like /srv/host/something [20:08:52] ahhhhhhh [20:08:52] i see [20:09:12] wtf [20:09:20] maybe I should have added a comment like : "honor $data_directory" [20:09:29] notpeter: so the SwiftFileBackend class is different on fenari vs mw12 [20:09:35] hmmmmm [20:09:48] I was wondering why auths were still coming in as much from runners [20:09:56] i see [20:10:04] hmmm, maybe just symlink? [20:10:07] instead of copy? 
[20:10:12] !log sync-apache and pushing out redirects.conf change for Bugzilla redirects to https [20:10:19] * AaronSchulz tries a sync again [20:10:22] Logged the message, Master [20:10:29] RECOVERY - Host analytics1002 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [20:10:36] AaronSchulz: I was a dumbass and didn't re-add them to the dsh group after broght them back up.... [20:10:41] try syncing them now :/ [20:10:48] if( $data_directory != '/usr/share/GeoIP' ) { [20:10:48] file { "$data_directory" ensure => "/usr/share/GeoIP" } [20:10:48] } [20:10:49] ? [20:10:53] notpeter: which ones? [20:10:58] mw2-16 [20:11:09] they all need a sync-common, or what ahser ran [20:11:20] I can do sync-common on all of them [20:11:24] who knows what else is out of date [20:11:31] yeah [20:11:34] sorry about that [20:11:36] will fix now [20:12:35] also, hashar, if you do that, add a comment as to why [20:12:36] somethign like [20:13:02] RECOVERY - Host analytics1004 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms [20:13:11] RECOVERY - Host analytics1006 is UP: PING OK - Packet loss = 0%, RTA = 26.44 ms [20:13:11] RECOVERY - Host analytics1003 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms [20:13:20] # The geoip-database package always installs at /usr/share/GeoIP. [20:13:21] # Make sure $data_directory points here if a different location has been specified. 
[20:13:38] RECOVERY - Host analytics1008 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms [20:13:47] RECOVERY - Host analytics1007 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [20:13:47] RECOVERY - Host analytics1005 is UP: PING OK - Packet loss = 0%, RTA = 26.48 ms [20:14:24] PROBLEM - SSH on analytics1002 is CRITICAL: Connection refused [20:14:25] New review: Dzahn; "pushed out to cluster and done" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/24405 [20:15:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:50] New patchset: Hashar; "Maxmind geoIP data files are only for production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24479 [20:16:56] PROBLEM - SSH on analytics1005 is CRITICAL: Connection refused [20:16:56] PROBLEM - SSH on analytics1007 is CRITICAL: Connection refused [20:16:58] ottomata: done :) [20:17:23] PROBLEM - SSH on analytics1004 is CRITICAL: Connection refused [20:17:23] PROBLEM - SSH on analytics1003 is CRITICAL: Connection refused [20:17:32] PROBLEM - SSH on analytics1006 is CRITICAL: Connection refused [20:17:56] notpeter: finished? [20:17:59] PROBLEM - SSH on analytics1008 is CRITICAL: Connection refused [20:18:10] AaronSchulz: still going. the rsync is not fast... [20:18:17] RECOVERY - NTP on analytics1009 is OK: NTP OK: Offset -0.0346814394 secs [20:18:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24479 [20:19:24] aye hashar, maybe you missed the bit about symlinking? :) [20:19:35] symlinking would be better than co pying, right? [20:20:13] if $data_directory != /usr/share/GeoIP then symlink /usr/share/GeoIP -> $data_directory [20:20:14] right? 
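Pulling the pieces of this review together, the shape ottomata is suggesting looks roughly like the following (a sketch against the existing geoip manifests; $data_directory and the geoip::data class are assumed from geoip.pp, and this is not the merged change 24479):

```puppet
class geoip($data_directory = '/usr/share/GeoIP') {
    if $::realm == 'production' {
        # Paid Maxmind files, synced from the puppetmaster's volatile share.
        class { 'geoip::data':
            data_directory => $data_directory,
        }
    } else {
        # Labs and friends: fall back to the free files shipped by Ubuntu.
        package { 'geoip-database':
            ensure => installed,
        }
        # The geoip-database package always installs to /usr/share/GeoIP;
        # honor a non-default $data_directory with a symlink, not a copy.
        if $data_directory != '/usr/share/GeoIP' {
            file { $data_directory:
                ensure  => link,
                target  => '/usr/share/GeoIP',
                require => Package['geoip-database'],
            }
        }
    }
}
```

The symlink keeps anything that requires File[$data_directory] satisfied on labs without recursively copying the package's data files.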
[20:20:22] I hate sum links [20:20:32] RECOVERY - NTP on analytics1010 is OK: NTP OK: Offset -0.03044784069 secs [20:20:33] and not sure how to do them recursively [20:20:37] though we could define both files [20:20:40] you don't need to do it recursively [20:20:42] right? [20:21:23] if( $data_directory != '/usr/share/GeoIP' ) { [20:21:23]   file { "$data_directory": [20:21:23] ensure => "/usr/share/GeoIP", [20:21:23] require => Package["geoip-database"] } [20:21:24] } [20:21:57] what does ensure => some/path do ? [20:22:09] ensures that the file is a symlink to that path [20:22:22] http://docs.puppetlabs.com/references/stable/type.html#file [20:22:28] ohh [20:22:37] ensure [20:22:38] Anything other than the above values will create a symlink; [20:22:38] hashar: i think i agree to you that supporting ALL these subdomains (developer,develop,developers and on wp and wm) seems a bit much.. but unsure [20:22:52] you can also do [20:23:05] mutante: we had a chat about it with jeremy. I guess we will let the community decide :-) [20:23:09] ensure => link, [20:23:09] target => '/usr/share/GeoIP', [20:23:11] same thing [20:23:23] PROBLEM - Host analytics1002 is DOWN: PING CRITICAL - Packet loss = 100% [20:23:34] AaronSchulz: ok, should be all sync'd up now [20:23:36] sorry about that [20:24:00] hashar: yeah, less discussion and changes aftewards:) [20:24:02] bbl, food [20:24:18] mutante: i have no strong opinion, but that's what i would do if i were going to be bold and do it without waiting for consensus [20:24:28] (that=what i did) [20:24:43] thanks for the bugzilla merge ;) [20:25:56] PROBLEM - Host analytics1004 is DOWN: PING CRITICAL - Packet loss = 100% [20:26:05] PROBLEM - Host analytics1003 is DOWN: PING CRITICAL - Packet loss = 100% [20:26:50] PROBLEM - Host analytics1008 is DOWN: PING CRITICAL - Packet loss = 100% [20:26:50] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [20:26:50] PROBLEM - Host analytics1005 is DOWN: PING CRITICAL - 
Packet loss = 100% [20:27:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.973 seconds [20:27:30] New patchset: Hashar; "varnish config for bits.beta.wmflabs.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13304 [20:28:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13304 [20:28:37] New review: Hashar; "Fix the bits backend probe that were querying en.wikipedia.org and bits.wikimedia.org." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/13304 [20:28:56] RECOVERY - Host analytics1002 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [20:31:47] PROBLEM - Host analytics1006 is DOWN: PING CRITICAL - Packet loss = 100% [20:32:05] New review: Hashar; "PS25 deployed on deployment-cache-bits02 and happily serving files: http://bits.beta.wmflabs.org/fav..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/13304 [20:32:07] !log reinstalling analytics1002-analytics1010 with Ubuntu precise [20:32:16] Logged the message, Master [20:34:25] PROBLEM - Host analytics1002 is DOWN: PING CRITICAL - Packet loss = 100% [20:35:10] RECOVERY - SSH on analytics1008 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:35:19] RECOVERY - Host analytics1008 is UP: PING OK - Packet loss = 0%, RTA = 26.49 ms [20:36:40] RECOVERY - SSH on analytics1006 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:36:49] RECOVERY - Host analytics1006 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms [20:37:25] RECOVERY - SSH on analytics1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:37:34] RECOVERY - Host analytics1003 is UP: PING OK - Packet loss = 0%, RTA = 26.47 ms [20:38:10] RECOVERY - SSH on analytics1004 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:38:19] RECOVERY - Host analytics1004 is UP: PING OK - Packet loss = 0%, RTA = 26.47 ms [20:44:01] 
RECOVERY - Host analytics1007 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [20:44:01] RECOVERY - Host analytics1005 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [20:45:40] RECOVERY - Host analytics1002 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [20:54:14] heya notpeter, I'm having the problem we had with analytics1010 this morning again [20:54:18] this time on analytics1002 [20:54:23] all of the others installed just fine though [20:54:29] An unexpected error occurred while setting up a preseeded RAID configuration. [20:54:29] Check /var/log/syslog or see virtual console 4 for the details. [20:55:22] hmm [20:55:22] Sep 20 20:53:52 partman-auto-raid: mdadm: cannot open /dev/sdc1: Device or resource busy [20:55:22] Sep 20 20:53:52 partman-auto-raid: Error creating array /dev/md0 [20:55:26] New patchset: Hashar; "(bug 39701) beta: automatic MediaWiki update" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22116 [20:56:21] New review: Hashar; "Fixed up a dependency, we need to get the mw-update-l10n script" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/22116 [20:56:22] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22116 [20:57:13] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [20:57:40] PROBLEM - NTP on analytics1008 is CRITICAL: NTP CRITICAL: Offset unknown [20:57:47] can someone look at this https://gerrit.wikimedia.org/r/#/c/24464/ [20:58:07] PROBLEM - Host analytics1005 is DOWN: PING CRITICAL - Packet loss = 100% [20:59:01] PROBLEM - NTP on analytics1006 is CRITICAL: NTP CRITICAL: Offset unknown [20:59:37] PROBLEM - NTP on analytics1004 is CRITICAL: NTP CRITICAL: Offset unknown [20:59:46] PROBLEM - NTP on analytics1003 is CRITICAL: NTP CRITICAL: Offset unknown [20:59:55] RECOVERY - SSH on analytics1005 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:00:04] RECOVERY - Host analytics1005 is UP: PING OK - Packet loss = 0%, RTA = 26.48 ms [21:02:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:05:03] notpeter: around? [21:05:37] RECOVERY - NTP on analytics1004 is OK: NTP OK: Offset -0.03013217449 secs [21:05:46] RECOVERY - NTP on analytics1003 is OK: NTP OK: Offset -0.03670918941 secs [21:06:31] RECOVERY - NTP on analytics1006 is OK: NTP OK: Offset -0.03039526939 secs [21:06:41] RECOVERY - NTP on analytics1008 is OK: NTP OK: Offset -0.03352963924 secs [21:07:25] hey cmjohnson1, thanks for the update, are those emails on the ticket? (if not they should be pasted in I guess) [21:07:33] er about the c2100s of course [21:08:33] no, they are not in the ticket. I was hoping you could elaborate a little more (when you get a chance) [21:09:29] oh. well I updated the ticket with the results of today's test (which they should have got well before this last email from them) [21:09:36] I think it was cced to all the right people [21:10:52] it should be...which ticket did you update? 
[21:10:59] the ms-be6 one, just a sec [21:11:12] yep 3452 [21:11:29] New review: Krinkle; "-1 for the removal of entries that match the default but were explicitly set so, and with a bugzilla..." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/23059 [21:11:48] it did not go through, how weird is that [21:11:57] I will try again (I have the tab still open in my browser) [21:11:59] 2012-09-20 21:08:59 mw12 zhwiki: InvalidResponseException in 'SwiftFileBackend::doGetFileStat' (given '{"src":"mwstore:\/\/local-swift\/timeline-render\/215594eaaee4b06a3bb4a5a07436f220.map"}'): Invalid response (0): (curl error: 6) Couldn't resolve host 'ms-fe.pmtpa.wmnet': Failed to obtain valid HTTP response. [21:12:06] oh.... I bet this was during a disconnect dang it [21:12:45] man that sucks [21:12:51] I can't wait for them to replace this router [21:13:03] (I have a call into the isp and the landlord, it's a matter of who is first) [21:13:30] er... failed to obtain valid http response? wtf? [21:13:40] odd [21:14:16] (anyways I saved it again and it took it for the ticket, if it weren't for the connection they would have had it hours ago) [21:14:37] PROBLEM - NTP on analytics1002 is CRITICAL: NTP CRITICAL: No response from NTP server [21:15:16] paravoid: ping [21:15:24] apergos: why would it not be able to resolve the host?
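For context on the exception above: curl error 6 means name resolution failed at the libc/NSS level, before any HTTP traffic. A minimal sketch of the first check one might run on an affected jobrunner; the hostname is the one from the log, and `check_host` is a helper invented here, not part of the cluster's tooling:

```shell
# curl resolves names through libc/NSS, so "getent hosts" succeeds and
# fails in the same cases curl error 6 would. check_host is a
# hypothetical helper for illustration only.
check_host() {
    if getent hosts "$1" > /dev/null; then
        echo "libc resolves $1"
    else
        echo "libc cannot resolve $1 (same failure mode as curl error 6)"
    fi
}

check_host ms-fe.pmtpa.wmnet
```

Comparing this against a direct query to the DNS server (e.g. `dig @<resolver> ms-fe.pmtpa.wmnet`) would separate a resolver-config problem on the hosts from a DNS-server problem, which matters here since only some machines were affected.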
[21:15:26] notpeter: ping [21:15:36] binasher: ping [21:16:08] apergos: fairly high amount of those, all from job-runners [21:16:10] I seriously have no idea, all three hosts that should be in the pool look ok [21:16:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.295 seconds [21:16:21] see that's suspicious that it's only from them [21:16:34] if there were a general problem we would see it from other hosts [21:16:52] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [21:16:52] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [21:16:52] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [21:16:52] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [21:16:52] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [21:16:53] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [21:19:42] I do see some out of memory messages on the one random jobrunner I looked at but the timing isn't right [21:20:50] notpeter: can you merge this please https://gerrit.wikimedia.org/r/#/c/24464/ [21:22:03] one more scap... [21:22:16] PROBLEM - NTP on analytics1005 is CRITICAL: NTP CRITICAL: No response from NTP server [21:27:22] Are there any operations folks available [21:27:42] I need to get this change https://gerrit.wikimedia.org/r/#/c/24493/ approved, merged, and pushed ASAP [21:28:19] apergos: have you tailed swift-backend.log? [21:28:26] no [21:28:30] New patchset: Andrew Bogott; "Clean up a bit when an instance is getting deleted." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24496 [21:28:53] how would the backends be implicated just by the jobrunners? [21:29:23] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24496 [21:29:28] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24496 [21:29:44] apergos: well the mw* servers are newer, maybe situated differently [21:29:45] wtf [21:29:54] sorry just looking at the log now and wondering [21:29:55] there is something beyond coincidence [21:30:27] these are all zhwiki right? [21:31:09] from what I can tell [21:31:21] for the last several hours anyways [21:31:34] are these somehow produced by queued jobs? [21:31:42] not sure why that matters...maybe a job is in a loop [21:31:46] maybe that's how this timeline stuff is handled? [21:31:55] greping for '42983552dc252c7e27508bb31d6940a3.map' shows a lot [21:32:16] cause say around 13:50 there are a batch that are all nl wiki [21:32:20] apergos: timeline doesn't just the jq, but it might be hit during parse [21:32:43] kaldari - http://www.maxmind.com/app/ipv6 [21:32:56] hm [21:33:00] thanks [21:33:49] note - What happens when I try to use IPv6 addresses with your current retail products? [21:33:49] Currently, IPv6 addresses will return a generic error message. [21:33:49] *doesn't use the jq [21:34:06] :( [21:35:15] weird that they have an IPv6 database available, but don't use it in their own API yet [21:39:15] notpeter: please merge and push https://gerrit.wikimedia.org/r/#/c/24493/ [21:39:30] woosters, mark: Do we use MaxMind's downloadable database or their web service? 
for GeoIP [21:39:44] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24493 [21:39:47] so the next thing we could possibly try to do is see what some of these jobs that fail actually are, I guess [21:40:29] I dunno how that will help actually, nm [21:41:05] it's strange, logging into one of those boxes works fine in every way, including auth and host resolution [21:41:20] I don't see why it keeps randomly failing [21:41:24] well I assume that not all the jobs are failing [21:41:29] only a relative few [21:43:02] so the only things I can think of would be: memory, network (but why only those?), lvs pool (but again then why only those?), dns server (and yet again why only those?) ... it's at least plausible that memory could play a role on the job runners [21:44:17] notpeter will know, are there other job runners besides mw 1 through 15? [21:46:30] 1-16 sorry [21:46:33] yep [21:46:48] apergos: /etc/dsh/groups/job-runners or something [21:46:51] lots [21:46:58] take a look at site.pp [21:46:59] I don't trust our dsh lists any more [21:47:04] yeah I'm in it now [21:47:13] node /mw(1[7-9]|[2-4][0-9]|5[0-4])\.pmtpa\.wmnet/ [21:47:31] node /^srv(23[1-9]|24[0-7])\.pmtpa\.wmnet$/ [21:47:42] so yeah lots more [21:49:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:23] are those the only ones on precise? (grasping at straws) [21:55:17] New patchset: Ryan Lane; "Revoke Sara's cluster access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24497 [21:56:11] New review: gerrit2; "Lint check passed."
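The two site.pp node regexes quoted above expand to concrete host ranges, which shows how many jobrunners exist beyond mw1-16. A quick sketch of that expansion (the `jobrunner_hosts` function name is made up for illustration):

```shell
# Expand the two quoted site.pp node regexes into hostnames:
#   /mw(1[7-9]|[2-4][0-9]|5[0-4])\.pmtpa\.wmnet/  -> mw17..mw54    (38 hosts)
#   /^srv(23[1-9]|24[0-7])\.pmtpa\.wmnet$/        -> srv231..srv247 (17 hosts)
jobrunner_hosts() {
    for i in $(seq 17 54); do echo "mw$i.pmtpa.wmnet"; done
    for i in $(seq 231 247); do echo "srv$i.pmtpa.wmnet"; done
}

jobrunner_hosts | wc -l
```

That is 55 additional runners beyond mw1-16, which is why a failure appearing only on mw1-16 (confirmed below as the only jobrunners on precise) pointed at something specific to those installs rather than a general problem.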
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24497 [21:58:34] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24497 [21:58:36] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23761 [22:00:05] New patchset: Asher; "adding new es clusters (but not moving writes)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24498 [22:03:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.763 seconds [22:04:07] woosters: just send an email to MaxMind support asking for a timetable on their IPv6 support [22:04:13] send = sent [22:04:16] RECOVERY - Puppet freshness on mw58 is OK: puppet ran at Thu Sep 20 22:03:58 UTC 2012 [22:04:33] ok thks - kaldari [22:06:54] apergos: maybe peter knows? [22:13:48] yes, mw1-16 are the only jobrunners that are on precise [22:26:20] * apergos is giving up, besides it's late and once again I have not done anything but work related crap today [22:26:37] apergos: yes, mw1-16 are the only jobrunners that are on precise [22:27:41] ok thanks [22:27:55] and goodnight [22:28:07] have a good night! [22:29:18] notpeter: are the image scalers getting the Precise upgrade this week, or is that currently unscheduled? [22:29:58] that is currently unscheduled [22:30:23] that's a blocker for the deployment of Timed Media Handler [22:31:12] short of a Precise upgrade, we'll need to upgrade the version of ffmpeg on those machines [22:31:21] New review: Dzahn; "yeah, it should be changed to chat.freenode, but i don't want to cause gerrit to try and merge it ov..." [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/22698 [22:31:25] alright. I shall push ahead with precise [22:33:41] New review: Dzahn; "It has been pointed out that if we are going to create a subdomain for each possible synonym that op..." 
[operations/apache-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/24407 [22:36:40] !log stopping puppet on brewster [22:36:50] Logged the message, notpeter [22:37:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:47:00] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 218 seconds [22:49:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.335 seconds [22:51:29] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 15 seconds [22:58:20] !log restarting gerrit [22:58:30] Logged the message, Master [23:00:09] mutante: could you please merge this https://gerrit.wikimedia.org/r/#/c/24464/ [23:00:24] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24498 [23:00:37] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24470 [23:07:56] New patchset: Andrew Bogott; "Include project name in instance status." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24508 [23:09:09] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24508 [23:13:32] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [23:24:05] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24464 [23:24:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:24:39] cmjohnson1: done [23:24:47] thx [23:24:51] yw [23:25:23] quite a few servers there that have been decom'ed..yeah [23:27:43] oh yeah...several over the last year [23:27:47] New patchset: Asher; "movnig es2/3 to core db classes, bool cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24511 [23:27:59] mutante: quick question... 
i am getting this git error Errors running git rebase -i remotes/gerrit/production [23:28:20] any idea how to fix this? [23:28:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24511 [23:30:22] cmjohnson1: what are you trying to do? have an existing change that has been sitting in git and now can't merge ? [23:30:55] no..just made the change [23:31:43] i thought maybe it was because i had an unmerged change in my local branch but you merged it in gerrit [23:31:45] using "git review" ? [23:31:51] yes [23:31:58] hmmm [23:32:31] i may just blow it out and try again...may be easier [23:33:54] i guess my best bet for now is "git reset --hard origin", then git pull [23:34:12] yep...my thoughts exactly [23:38:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.694 seconds [23:40:55] New review: Krinkle; ""develop" and "developers" seems redundant, why do we need the aliases? Its not like we have a histo..." [operations/apache-config] (master) C: -1; - https://gerrit.wikimedia.org/r/24407 [23:45:14] AaronSchulz: I have something that should work btw [23:48:46] paravoid: is it in gerrit? [23:48:54] no [23:48:59] (not yet)
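The recovery cmjohnson1 and mutante settled on above (hard-reset to the remote, then pull) can be reproduced end to end in a scratch repo. A sketch, with all paths and identities invented for the example; note that `--hard` discards local commits and uncommitted work, which is exactly the "blow it out" intent:

```shell
# Simulate a local change that Gerrit has already dealt with upstream,
# then recover with the reset-and-pull approach from the log.
# Everything here runs in throwaway temp directories.
set -e
tmp=$(mktemp -d)

# A scratch "origin" repo with one commit.
git init -q "$tmp/origin"
cd "$tmp/origin"
git config user.email dev@example.org
git config user.name dev
echo base > file && git add file && git commit -qm base

# A working clone with a diverged local commit, standing in for the
# stale change that made "git review" fail to rebase.
git clone -q "$tmp/origin" "$tmp/work"
cd "$tmp/work"
git config user.email dev@example.org
git config user.name dev
echo local > file && git commit -aqm "stale local change"

# The recovery: throw away local state, then pull fresh history.
# '@{upstream}' avoids hardcoding the remote branch name.
git reset --hard -q '@{upstream}'
git pull -q
```

After the reset and pull the working tree matches origin again and a subsequent `git review` rebase has nothing stale to trip over.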