[00:07:47] New patchset: MarkTraceur; "Move parsoid IRC channel to #mediawiki-parsoid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24390 [00:08:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24390 [00:13:06] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [00:13:06] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24390 [00:26:10] New patchset: Faidon; "swift: allow sync between {pmtpa,eqiad}-prod" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24395 [00:27:06] New patchset: Faidon; "swift: add Content-Disposition to the header whitelist" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24396 [00:27:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24395 [00:27:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24396 [00:28:39] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24395 [00:33:31] AaronSchulz: are you also storing content-disposition using cloudfiles? [00:33:50] yes, when it fits [00:34:04] okay [00:34:11] that's a good reason to keep a whitelist btw [00:34:23] you wouldn't want the MW responses and the swift responses to diverge [00:34:29] yeah, makes sense [00:34:48] so care must be taken to store into swift the headers that are in the whitelist [00:34:53] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24396 [00:38:51] RobH: around?
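The whitelist discussion above (keeping MediaWiki-served and swift-served responses from diverging) can be sketched as a simple filter. This is an illustrative sketch only, not the actual puppet/swift proxy configuration; the whitelist contents here are placeholders:

```shell
# Sketch: only headers on the whitelist get stored into swift, so responses
# served straight from swift carry the same headers MediaWiki would send.
WHITELIST="Content-Disposition Content-Type"

filter_headers() {
    # read "Name: value" lines on stdin, emit only the whitelisted ones
    while IFS= read -r header; do
        name=${header%%:*}
        for allowed in $WHITELIST; do
            [ "$name" = "$allowed" ] && printf '%s\n' "$header"
        done
    done
    return 0
}

printf 'Content-Disposition: inline; filename="x.ogg"\nX-Debug: 1\n' | filter_headers
```

The point of the exchange is that the filter must run on the *store* path too: a header stored into swift but absent from the whitelist (or vice versa) is exactly the divergence being warned about.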
[01:14:00] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [01:14:00] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [01:14:00] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [01:14:00] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [01:14:00] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [01:14:00] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [01:42:03] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 266 seconds [01:43:33] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 308 seconds [01:43:33] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 15 seconds [01:46:33] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 12 seconds [01:54:55] New patchset: Jgreen; "switch aggregator hosts for eqiad Fundraising ganglia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24404 [01:55:56] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24404 [02:03:18] New patchset: Jeremyb; "change bugzilla redir target to HTTPS; consolidate" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/24405 [02:05:02] New patchset: Pyoungmeister; "correcting macs for (most) pmtpa MC hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24406 [02:05:56] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24406 [02:06:37] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24406 [02:07:42] PROBLEM - Squid on brewster is CRITICAL: Connection refused [02:13:29] New patchset: Jeremyb; "redirects for develop{,er{,s}}.wiki{p,m}edia.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/24407 [02:15:39] New review: Jeremyb; "See also http://lists.wikimedia.org/pipermail/wikitech-l/2012-September/063328.html" [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/24407 [02:41:27] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Thu Sep 20 02:41:23 UTC 2012 [03:11:00] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [03:13:59] who is still around [03:14:00] ? [03:21:14] "around"? [03:23:54] !log purged search data from dataset2 on diederik's request [03:24:05] Logged the message, Master [03:32:54] RECOVERY - Puppet freshness on spence is OK: puppet ran at Thu Sep 20 03:32:34 UTC 2012 [03:36:30] RECOVERY - Puppet freshness on analytics1003 is OK: puppet ran at Thu Sep 20 03:36:16 UTC 2012 [03:47:06] New patchset: Pyoungmeister; "temporarily disabling lucene logs cron by diederik's request" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24409 [03:48:01] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24409 [03:48:10] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24409 [04:02:21] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [05:16:14] PROBLEM - Apache HTTP on mw58 is CRITICAL: Connection refused [05:21:02] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [05:38:23] New patchset: Tim Starling; "Update modeline" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24414 [05:38:23] New patchset: Tim Starling; "Add generated-pp-node-count debug log group" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24415 [05:38:42] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24414 [05:38:57] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24415 [05:43:30] New patchset: Tim Starling; "Increase generated node count limit to 4M" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24416 [05:43:42] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24416 [06:02:04] Change restored: Parent5446; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21322 [06:02:12] New patchset: Parent5446; "(bug 39380) Enabling secure login (HTTPS)." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21322 [06:02:20] New patchset: Parent5446; "(bug 39380) Enabling secure login (HTTPS)." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21322 [06:32:43] New review: Nikerabbit; "Ah, $msgOpts is now unused?" 
[operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/24180 [06:44:26] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [06:44:26] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [07:15:38] RECOVERY - Squid on brewster is OK: TCP OK - 0.012 second response time on port 8080 [07:36:44] New patchset: Hashar; "(bug 33464) developer.wikimedia.org redirect" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/24419 [08:02:31] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [08:03:55] !log Jenkins: installing "Gerrit Trigger plugin" 2.6 which replaces our 2.5.3 snapshot [08:04:05] Logged the message, Master [08:21:30] PROBLEM - LVS HTTPS IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:22:51] RECOVERY - LVS HTTPS IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 59974 bytes in 7.036 seconds [08:29:45] PROBLEM - Puppet freshness on mw58 is CRITICAL: Puppet has not run in the last 10 hours [08:31:54] New review: Siebrand; "@Niklas: Still used (lines 59-66)" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/24180 [08:33:39] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [08:33:57] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:35:18] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 77504 bytes in 1.105 seconds [09:28:25] New patchset: Hashar; "jenkins requires an apache2 installation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24423 [09:28:48] apergos1: mark: could you potentially merge the very simple https://gerrit.wikimedia.org/r/24423 [09:28:53] apache2 is needed to install jenkins :) 
[09:29:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24423 [09:40:41] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24423 [10:41:18] !log UploadWizard has been broken in some cases following the release of wmf12. Restored by reverting a change. See {{bug|40380}} for details. [10:41:28] Logged the message, Master [11:07:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:08:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.392 seconds [11:15:24] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [11:15:24] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [11:15:24] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [11:15:24] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [11:15:24] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [11:15:24] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [11:37:23] Change abandoned: Hashar; "Jeremy already did it in https://gerrit.wikimedia.org/r/#/c/24407/" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/24419 [11:37:52] poor virt**** [11:43:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:47:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.875 seconds [12:18:38] New patchset: Hashar; "update jenkins default init file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24427 [12:18:57] LeslieCarr: mark apergos if anyone around, can you please merge in https://gerrit.wikimedia.org/r/24427 ? 
:-D [12:19:03] fix yet another issue with jenkins :D [12:19:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24427 [12:19:51] bah [12:19:57] don't even have the correct jenkins version [12:22:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:31:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.656 seconds [13:06:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:07:10] New patchset: Jeremyb; "redirects for develop{,er{,s}}.wiki{p,m}edia.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/24407 [13:08:10] New review: Jeremyb; "PS2: just added a list of all the domain names to the commit msg in case someone's searching for the..." [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/24407 [13:12:06] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [13:18:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.961 seconds [13:48:06] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [13:53:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:57:17] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23770 [14:03:06] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [14:06:11] hiyyyya mark, you around? [14:06:19] q about precise and linux-host-entries [14:06:26] yes [14:06:30] I want to reinstall analytics1001-1010 with precise [14:06:34] they were originaly lucid [14:06:39] but perhaps I should make you wait for SF people to wake up [14:06:42] precise is now the default, right? 
[14:06:45] yeah i'm not going to do it now [14:06:46] just curious [14:06:48] since apparently that's what you always have to do ;-) [14:07:10] yeah precise is now the default [14:07:26] cool, so if I don't change anything, and don't specify [14:07:33] yup [14:07:38] and reboot these thing to reinstall, they'll just pick up precise [14:07:39] cool [14:07:40] danke [14:09:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds [14:34:00] New patchset: Ottomata; "Updating DTAC Thailand Zero Filter IPs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24431 [14:34:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24431 [14:40:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:43:29] New patchset: Hashar; "mw udp2log filter did not honor $log_directory" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24432 [14:44:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24432 [14:46:01] apergos: any more info on the c2100? [14:46:50] nope [14:47:19] am I right in assuming we do nt yet have a ticket on ms-be10 that we are working with dell on? [14:47:28] i.e. the one active ticket is ms-be6? [14:48:22] apergos: that is correct [14:48:52] ok [14:49:15] well the disks did not resolve the issue that was outstanding on ms-be10. since it's not open we will have to put it off I guess [14:49:16] so [14:49:23] would you mind doing the following [14:50:10] shut down ms be6, recable the ssds, pull the new drives from ms-be10 and drop em in, full reinstall over there? [14:50:29] sure..simple enough [14:50:30] if it whines we are going to have to pull the ssds again, that's the thing [14:50:34] and then reinstall [14:50:36] *again* [14:50:53] :-\ [14:50:53] ok [14:51:00] yeah. see? 
[14:51:10] but if you can stand it, [14:51:26] let's do that [14:52:55] what I would like to do is get ms-be6 to a state where it either appears to be trouble free (no disks slow to show up, no degraded arrays, etc) [14:53:36] or to a state where (without the ssds) it's still unhappy [14:53:57] i know...i ams still curious to know if the slowness for the disks is normal or is going on w/known good systems. [14:54:02] uh huh [14:54:11] see generally that is not "normal" behavior [14:54:42] i know but since non of us did the install on be1-5...we don't have a baseline [14:54:50] uh huh [14:55:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds [14:56:12] let me know when you get to the puppet piece in the reinstall so I can pick it up from there [14:58:45] ok [15:04:03] PROBLEM - Host analytics1010 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:36] New patchset: Hashar; "varnish config for bits.beta.wmflabs.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13304 [15:06:36] New review: Hashar; "Reinstate patchset 18 which, albeit ugly, is working on labs." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/13304 [15:06:37] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13304 [15:09:45] RECOVERY - Host analytics1010 is UP: PING OK - Packet loss = 0%, RTA = 26.79 ms [15:12:45] PROBLEM - SSH on analytics1010 is CRITICAL: Connection refused [15:16:27] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class passwords::ldap::labs for i-0000031c.pmtpa.wmflabs at /etc/puppet/manifests/role/ldap.pp:2 on node i-0000031c.pmtpa.wmflabs [15:16:30] stupiiiid puppet [15:21:36] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [15:28:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:31:48] PROBLEM - NTP on analytics1010 is CRITICAL: NTP CRITICAL: No response from NTP server [15:40:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.608 seconds [15:45:32] PROBLEM - Host ms-be10 is DOWN: PING CRITICAL - Packet loss = 100% [15:48:56] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24431 [15:56:47] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [15:58:48] New patchset: Demon; "Remove unncessary quotes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24440 [15:59:44] New patchset: Demon; "Remove unnecessary quotes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24440 [16:00:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24440 [16:03:39] hiya all (maybe notpeter or RobH?) 
[16:03:50] I want to reinstall the 10 analytics ciscos with Precise [16:04:10] I've started doing an10, but it is unhappy with the RAID or partman stuff that is currently configed [16:13:56] I cannot really help, in a training course today, sorry [16:16:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:16:32] aye ok, thanks Rob [16:17:41] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23155 [16:21:02] I also cannot find the damned video on google mapping meta data [16:21:05] and it bums me out. [16:21:48] apergos [16:21:53] Error while setting up RAID │ [16:21:53] │ An unexpected error occurred while setting up a preseeded RAID │ [16:21:53] │ configuration. │ [16:21:54] │ │ [16:21:56] │ Check /var/log/syslog or see virtual console 4 for the details. [16:22:17] is it still using the non-ssd confg? [16:25:08] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [16:25:34] New review: Nikerabbit; "@Siebrand. but it ends up unused if you merge both this and the patchset which this logically follow..." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/24180 [16:25:35] rats [16:25:43] yes [16:28:48] cmjohnson1: where are we in the install now? 
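The "error occurred while setting up a preseeded RAID configuration" dialog above usually traces back to the software-RAID recipe in the per-host partman preseed file. As an illustrative sketch only (device names, filesystem, and layout are placeholders, not the actual WMF recipe), a debian-installer RAID preseed is shaped roughly like:

```
d-i partman-auto/method string raid
d-i partman-auto/disk string /dev/sda /dev/sdb
# recipe fields: raid-type device-count spare-count fstype mountpoint members
d-i partman-auto-raid/recipe string \
    1 2 0 ext4 / /dev/sda1#/dev/sdb1 .
```

The recipe must terminate with a bare `.`; a leftover continuation character such as a trailing `\` after it is enough to abort partitioning with exactly this kind of dialog (compare the "remove trailing \ from recipe. works now :)" patchset later in this log).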
[16:29:11] PROBLEM - SSH on ms-be6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:25] i am giving it another go and going to watch the log and see what errors [16:29:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.432 seconds [16:29:42] that error came in the middle of the install [16:30:44] yeah well I need to fix the preseed file [16:31:16] migh as well do site.pp while I'm at it [16:31:35] k..the errors pop up in the partitioning [16:32:08] uh huh [16:32:12] give me a minute [16:33:12] k [16:33:35] New patchset: ArielGlenn; "ms-be6 disk layout with ssds again" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24443 [16:34:03] hmm this is the longest I've been on in days without dropping (now I will jinx it)... maybe their reset of the libe from their end actually had an impact [16:34:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24443 [16:35:16] notpeter: can you merge and push https://gerrit.wikimedia.org/r/#/c/24444/ [16:36:10] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24443 [16:36:28] paravoid: ping [16:37:06] almost ready [16:37:28] ok...let me know and I will restart the install [16:37:53] PROBLEM - Host analytics1010 is DOWN: PING CRITICAL - Packet loss = 100% [16:39:31] ok I think we are good, take it away [16:40:15] preilly: pong [16:40:24] k [16:40:24] paravoid: can you merge and push https://gerrit.wikimedia.org/r/#/c/24444/ [16:40:39] paravoid: I was trying to get notpeter to do it but he isn't around right now [16:40:55] I'm about to start walking towards the office, can it wait 20'? [16:42:02] preilly: i can get it ifi can get 10 barrells of oil from ST [16:42:29] LeslieCarr: ha ha [16:42:35] LeslieCarr: I submit the request [16:43:04] :) [16:43:14] is this the normal push then purge mobile cache ? 
[16:43:35] LeslieCarr: but, to be fair that's only $922.90 U.S. dollars worth of crude oil at today's market price [16:43:45] LeslieCarr: just merge and push [16:43:52] LeslieCarr: no need to purge the cache [16:44:05] i'm planning on holding onto it for a few years, should be mega money [16:44:26] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24444 [16:44:46] can't get my nick back yet though it;s been released [16:44:50] so irritating [16:45:32] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [16:45:32] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [16:45:33] New patchset: Andrew Bogott; "Update nova.conf.erb template" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24445 [16:46:29] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24445 [16:46:39] preilly: merged and puppet ran [16:47:22] LeslieCarr: cool thanks [16:48:05] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Thu Sep 20 16:47:39 UTC 2012 [16:48:53] apergos1: [16:48:57] ──────┤ [!!] Configuring openssh-server ├────────────────┐ [16:48:57] ┌───│ │ ─┐ [16:48:57] │ │ Failed to run preseeded command │ │ [16:48:58] │ │ Execution of preseeded command "wget -O /tmp/late_command │ │ [16:49:00] │ │ http://apt.wikimedia.org/autoinstall/scripts/late_command && sh │ │ [16:49:02] │ Ru│ /tmp/late_command" failed with exit code 1. │ │ [16:49:04] │ │ │ │ [16:49:06] └───│ │ ─┘ [16:49:08] RECOVERY - Host analytics1010 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [16:49:08] │ [16:49:10] don't worry [16:49:14] just continue [16:49:15] ? 
[16:49:21] I bet it fails on the last thing in that shell which is [16:49:23] um [16:49:54] [ -d /sys/module/ixgbe ] && apt-install ixgbe-dkms [16:50:18] which we don't need anyways [16:50:28] that should be cleaned up at some point but for right now, [16:50:42] ssh installs ok (or it should be installed ok, I'll find out soon) [16:50:54] ok...finished w/ install ...booting now [16:51:00] sweet [16:51:11] hrm, but isn't that important ? [16:51:12] * apergos1 camps on sockpuppet ready to do the first run [16:51:21] on the c2100s? [16:51:23] like if the ixgbe-dkms doesn't run, we'll have some problems [16:51:26] not on those specifically [16:51:50] it might run fine on the hosts that need it [16:52:09] all 12 disk show in post [16:52:15] uh huh [16:52:21] just fyi [16:52:40] expected but hey it's better than them not showing up [16:52:57] :-P [16:53:48] os is not showing up [16:53:56] orilly [16:54:11] no login prompt [16:54:18] what do you have? [16:54:22] last message [16:54:56] checking nvram [16:55:06] really? meeehhh [16:55:36] I really hate how long dell's take to boot, they ruddy post at least twice [16:55:52] the c2100's take 2x longer [16:56:20] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [16:56:29] ( LeslieCarr: these c2100s don't have /sys/module/ixgbe so that line in th shell is a noop, it's why I'm not worried about it ) [16:56:42] ah, yes :) [16:56:59] but it would be nice to fix the script so it still returns the right value in those cases [16:57:05] anyhoo [16:57:18] I once used a model of dell that if you plgged a kvm in it took about 45min to get past the 'loading' screen, unplug it and it flew past... [16:57:30] great [16:57:55] not even ping. [16:59:08] well err [16:59:16] care to powercycle it the hard way? 
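The "exit code 1" analysis above is a standard shell pitfall: when a guarded command like `[ -d /sys/module/ixgbe ] && apt-install ixgbe-dkms` is the *last* line of late_command and the guard fails, the script's own exit status becomes 1, and d-i reports the whole preseeded command as failed. A minimal simulation (using `echo` as a stand-in for d-i's `apt-install`, and a path chosen to never exist so the behaviour is the same on any host):

```shell
# last command of the script is a guarded install; guard fails => script exits 1
rc=0
sh -c '[ -d /sys/module/no-such-module ] && echo install-dkms' || rc=$?
echo "guarded only: exit=$rc"

# an explicit fallback keeps the script's exit status 0 when the guard fails,
# which is the "still returns the right value" fix mentioned below
rc=0
sh -c '[ -d /sys/module/no-such-module ] && echo install-dkms || true' || rc=$?
echo "with fallback: exit=$rc"
```

The first invocation prints `exit=1`, the second `exit=0`, which is why the install proceeds fine despite the scary dialog: only the final, optional step "failed".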
[17:00:33] i can do that [17:00:54] if it's not moving at all [17:01:59] i will quit console...so it's all yours [17:02:30] ok [17:02:32] lemme get there [17:02:47] New patchset: Andrew Bogott; "Point the autostatus notifier to labsconsole.wmflags.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24447 [17:02:52] lmk and I will turn on [17:03:31] nm it's on [17:03:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24447 [17:04:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:04:27] !log temp stopping puppet on brewster [17:04:37] Logged the message, notpeter [17:04:46] New patchset: Andrew Bogott; "Point the autostatus notifier to labsconsole.wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24447 [17:05:30] how long can it take to check the nvram [17:05:32] seriously [17:05:43] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24447 [17:07:42] apergos1...i know what the problem is....we need to go into bios and fix the boot settings to the ssd's [17:07:49] and to boot ahci [17:07:49] :-D [17:07:58] that's someting I never had to mess with [17:08:08] just occurred to me [17:08:10] I guess I should get the heck off the console [17:08:27] have at [17:08:39] great..give me a few ...ping u shortly [17:09:44] sure [17:12:58] 2012-09-20 13:30:37 mw6 nlwiki: InvalidResponseException in 'SwiftFileBackend::doPrepareInternal' (given '{"dir":"mwstore:\/\/local-swift\/timeline-render"}'): Unexpected response (): [17:13:05] not very helpful [17:14:01] blah [17:14:16] very descriptive response [17:14:28] lots and lots of similar stat errors [17:17:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.181 seconds [17:18:59] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [17:19:53] apergos1: 
looks like that only happens on authentication [17:20:10] hmmm [17:20:14] cloudfiles is getting no response or can't parse the status out of the header [17:21:39] New patchset: Pyoungmeister; "remove trailing \ from recipe. works now :)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24452 [17:22:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24452 [17:23:20] can you have it dump the headers in that case (see if we get any)? [17:24:39] apergos1: ms-be6 is ready for puppet [17:24:47] ok [17:24:51] thanks! [17:24:56] RECOVERY - SSH on ms-be6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [17:24:56] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24452 [17:28:32] RECOVERY - Puppet freshness on ms-be6 is OK: puppet ran at Thu Sep 20 17:28:03 UTC 2012 [17:29:44] RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [17:29:44] RECOVERY - swift-container-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [17:29:53] RECOVERY - swift-object-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [17:29:54] yeah yeah lies, but whatever [17:30:02] RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [17:30:02] RECOVERY - swift-container-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [17:30:02] RECOVERY - swift-account-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [17:30:20] RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [17:30:20] 
RECOVERY - swift-account-reaper on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [17:30:20] RECOVERY - swift-object-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [17:30:32] random auth error still going on [17:30:47] RECOVERY - swift-container-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [17:30:47] RECOVERY - swift-account-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [17:30:56] RECOVERY - swift-object-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [17:32:34] oh awesoe all the disk labels are wrong :-D [17:32:37] *sigh* [17:33:32] ok I'm going t ignore that for a minute and just see about the rest of the install, like do we get good reboots [17:34:07] big things first eh? [17:34:59] PROBLEM - Host analytics1010 is DOWN: PING CRITICAL - Packet loss = 100% [17:35:12] yup [17:35:44] reboot in progress, watching [17:40:06] and not watching, got disconnected right at the start of the reboot >_< [17:40:26] guess we'll repeat that [17:40:33] RECOVERY - Host analytics1010 is UP: PING OK - Packet loss = 0%, RTA = 26.78 ms [17:44:35] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [17:45:49] sdm1 .. wait a couple secs... shows. [17:45:56] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [17:46:00] rebooting again to see what it does this time [17:48:11] PROBLEM - Host analytics1010 is DOWN: PING CRITICAL - Packet loss = 100% [17:49:32] m again [17:49:35] same deal [17:50:12] so I'm going to disable puppet over there, stop swift services, and we shoudl report this to dell [17:50:31] what happened? 
[17:51:22] we get sdm claiming not to be ready during reboot [17:51:36] it doesnt' require much wait but [17:51:44] it shoouldn't require any [17:52:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:52:08] guess I'll make the labels match up with the disk ids too while I'm att it [17:54:20] PROBLEM - swift-account-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [17:54:20] PROBLEM - swift-object-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [17:54:20] PROBLEM - swift-account-reaper on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [17:54:20] PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [17:54:29] PROBLEM - swift-container-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [17:54:38] PROBLEM - swift-object-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [17:54:47] PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [17:54:47] PROBLEM - swift-container-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [17:54:56] PROBLEM - swift-account-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [17:55:05] PROBLEM - swift-object-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [17:55:14] PROBLEM - swift-container-updater on ms-be6 is CRITICAL: 
PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [17:55:23] PROBLEM - swift-account-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [18:03:51] and gone [18:03:53] rat [18:03:53] s [18:07:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [18:07:59] apergos: what does mw\d+ stand for? [18:08:05] as server names [18:08:14] just mediawiki or something? [18:08:54] all the swift errors are coming from boxes with that name convention [18:09:06] I guess those are app servers [18:09:10] just running apache, the usual [18:09:23] why don't I see srvxxx in there? [18:10:11] no idea [18:10:41] hiii mutante, you around? [18:10:47] hey .i am [18:10:49] i'm having some trouble with the ciscos [18:11:10] i've been able to resinstall one with precise [18:11:27] it looks like pxe boot remains the default boot process after reinstall [18:11:29] so, i manually change it [18:11:34] to hdd [18:11:38] but then when I reboot [18:11:48] it hangs here: [18:11:49] Press or to enter Emulex BIOS configuration [18:11:49] utility. Press to skip Emulex BIOS [18:11:49] Emulex BIOS is Disabled on Adapter 1 [18:11:49] Emulex BIOS is Disabled on Adapter 2 [18:12:03] nothing happens after that [18:12:14] and if you skip it with "s"? [18:12:36] and then reboot it again..does it just do that once or at every boot [18:13:00] actually i did not change the default to PXE [18:13:07] i just hit F12 to go to PXE [18:13:25] hmm [18:13:31] oh on boot, hmmmm [18:13:46] (how do I send an F12, btw?) [18:14:21] rebooting, gonna try to skip... [18:14:30] notpeter: is there a difference between mw\d+ and srv\d+ apaches? [18:14:53] ottomata: eh, for some reason i am lucky with that, i just hit litearally F12 when i am connected and it goes right through [18:14:59] can you give me the name of one of he mws that's doing it by the way? 
[18:15:08] AaronSchulz: nein [18:15:17] I mean, mw's are newer and faster [18:15:21] but they are set up the same [18:15:22] why? [18:15:33] they are giving weird errors in the swift log [18:15:43] what numbers? [18:15:45] mediawiki tries to auth and gets a response it can't parse [18:15:54] mw1-mw16 perhaps [18:15:58] ah [18:16:07] those have apache off [18:16:10] and are only jobrunners [18:16:28] seems like they fail when curling to swift then [18:16:34] indeed [18:19:59] notpeter: authenticating eval.php from mw12 seems to work [18:21:06] *authenticating in [18:21:18] which is cli, just like job runners [18:22:13] hhhmmmm, they should have all of the same packages, confs, etc [18:22:33] notpeter: maybe you can watch the packets? ;) [18:22:49] I'm curious just what swift is responding with [18:23:21] sure, I can tcpdump for you [18:23:23] maybe it just fails the "if ( preg_match( "/^(HTTP\/1\.[01]) (\d{3}) (.*)/", $header, $matches ) ) {" check in CF [18:23:38] do mw3 [18:23:47] seems to be popular :) [18:24:31] what port?
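The status-line check AaronSchulz quotes from the CloudFiles bindings is easy to exercise in isolation; a minimal Python sketch of the same regex (the input strings here are made up for illustration, not taken from any capture):

```python
import re

# The same pattern CF applies to the first header line of the auth
# response: HTTP/1.0 or 1.1, a three-digit status code, a reason phrase.
STATUS_LINE = re.compile(r"^(HTTP/1\.[01]) (\d{3}) (.*)")

def parse_status_line(header):
    m = STATUS_LINE.match(header)
    if m is None:
        return None  # the "response it can't parse" case
    version, code, reason = m.groups()
    return version, int(code), reason

print(parse_status_line("HTTP/1.1 200 OK"))   # ('HTTP/1.1', 200, 'OK')
print(parse_status_line("x"))                 # None -> unparseable response
```

Anything that is not a well-formed status line (an empty read, a proxy error page, truncated bytes) falls through to None, which matches the "gets a response it can't parse" symptom.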
[18:25:33] I believe 80 (to swift) [18:26:15] so there should be GET reqs to an /auth/v1 url [18:26:29] there's basically no traffic on port 80 [18:27:51] New patchset: Jdlrobson; "move from deprecated wfMsg to wfMessage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24243 [18:28:13] wait, one sec [18:28:34] * AaronSchulz likes how none of the mw boxes are at server roles [18:28:42] PROBLEM - Host analytics1009 is DOWN: PING CRITICAL - Packet loss = 100% [18:30:27] Change abandoned: Jdlrobson; "Actually covered by https://gerrit.wikimedia.org/r/#/c/24243/" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24180 [18:31:05] PROBLEM - Puppet freshness on mw58 is CRITICAL: Puppet has not run in the last 10 hours [18:34:14] RECOVERY - Host analytics1009 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [18:34:21] AaronSchulz: emailed dump to you [18:34:30] just saw [18:34:33] didn't want to pastebin, as I'm not sure if that counts as private data or not [18:34:41] is that what you neeeeeed? [18:35:08] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [18:35:57] New review: Brion VIBBER; "How is this on operations/mediawiki-config project? When I try to check it out per the above directi..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/24243 [18:37:27] notpeter: I'm trying to see the headers that swift uses in its response to the auth reqs [18:37:47] New patchset: Cmjohnson; "Removing decommissioned servers from the dhcpd file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24464 [18:37:48] I see some object responses for GETs [18:37:50] PROBLEM - SSH on analytics1009 is CRITICAL: Connection refused [18:38:41] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24464 [18:39:13] so I guess auth most work sometimes, but I still figure whats wrong with the auth response sometimes [18:39:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:41:01] New review: Brion VIBBER; "On second look this is actually fine, I just thought it was MobileFrontend but it's a config extensi..." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/24243 [18:42:08] ottomata: sooo.. i can confirm your report, it hangs on reboot in a weird way :/, still trying though, it just takes that long to boot [18:42:47] not sure if this is relevant [18:42:56] at the end of the install it does this: [18:42:56] │ Execution of preseeded command "wget -O /tmp/late_command │ │ [18:42:56] │ │ http://apt.wikimedia.org/autoinstall/scripts/late_command && sh │ │ [18:42:56] │ Ru│ /tmp/late_command" failed with exit code 1. [18:43:01] i just hit continue [18:44:36] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24243 [18:46:47] ottomata: so you are also installing 1009 right now, hm? wanna wait to see if it behaves the same first? [18:47:17] after the freeze i cant get console output anymore [18:47:17] PROBLEM - Host analytics1009 is DOWN: PING CRITICAL - Packet loss = 100% [18:47:37] and i never saw anything from the actual OS yet, so i doubt its related to the late_command from OS install [18:47:38] yeah [18:47:42] i just installed it [18:47:50] and i set boot order back to hdd [18:47:57] rebooted it [18:47:59] now waiting for memtest [18:48:01] we will see... [18:48:01] also did not get grub shell or anything [18:48:07] ok, brb then [18:48:26] New patchset: Faidon; "autoinstall: make late-command -e happy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24467 [18:49:27] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24467 [18:49:34] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24467 [18:50:06] ottomata: fixed (re: preseed error) [18:51:22] oh! [18:51:25] 09 worked too! [18:51:31] thanks paravoid! [18:51:38] RECOVERY - SSH on analytics1009 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:51:39] mutante, i have a login prompt on 09 [18:51:47] RECOVERY - Host analytics1009 is UP: PING OK - Packet loss = 0%, RTA = 26.80 ms [18:51:59] lemme try to reinstall 1010 from scratch again and see if it still doesn't work [18:53:10] notpeter: nic card may be unseated....will let you know [18:54:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds [18:54:54] I forced run puppet on brewster and it replaced a couple of mac addresses on mc* boxes [18:54:58] binasher: is that you? [18:55:24] I saw a .swp, so someone is editing it live and just lost their changes [18:55:30] paravoid: i think it was notpeter [18:55:38] i saw he !log'd about disabling puppet on brewster [18:55:51] oh indeed, silly me [18:55:58] sorry [18:56:36] mc2-14,16 are imaged, hopefully their macs are in git [18:57:00] host mc11 { [18:57:00] - hardware ethernet 90:e2:ba:19:53:fc; [18:57:00] + hardware ethernet d4:be:d9:f7:c2:ca; [18:57:05] host mc15 { [18:57:06] - hardware ethernet 90:e2:ba:19:51:b9; [18:57:06] + hardware ethernet 90:e2:ba:19:51:b8; [18:58:58] !log ran scap on mw58 and returning to the lvs pool [18:59:08] Logged the message, Master [19:02:32] New patchset: Andrew Bogott; "Define instance_status values in role." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24469 [19:02:41] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [19:03:27] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24469 [19:04:02] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24469 [19:04:04] I am tired of init.d scripts :-D [19:04:13] they keep hiding error messages [19:05:49] hm mutante, once the OS installed and it has booted [19:06:00] I should be able to log in as root with my ssh key, right? [19:06:02] about to scap [19:11:22] New patchset: Asher; "install php redis extension on precise and lucid apaches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24470 [19:11:46] paravoid: those were me [19:11:51] thanks for grabbing the diff! [19:12:09] hmm, mutante, also, now analyics1010 seems to hang even on pxe boot [19:12:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24470 [19:12:35] New patchset: Pyoungmeister; "correcting mac for mc11" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24472 [19:13:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24472 [19:13:38] PROBLEM - NTP on analytics1009 is CRITICAL: NTP CRITICAL: No response from NTP server [19:14:58] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24472 [19:15:45] notpeter: not sure what is going on with the nic on mc15..i am going reflash the card [19:16:43] New patchset: Pyoungmeister; "lucene.php: moving all search traffic back to eqiad" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24473 [19:16:52] cmjohnson1: cool! thanks. [19:17:34] notepeter, after the thing installs and boots, I should be able to login as root with my ssh key, right? [19:17:35] cmjohnson1: for mc1, has dell shipped replacement parts yet? 
[19:18:04] ottomata: i have disabled all other methods besides HDD in bios for that one [19:18:20] ottomata: you have to use sockpuppet:/root/.ssh/new_install until puppet has first run [19:18:38] binasher: not yet...i haven't called yet. the c2100's have been dominating my time...i will call in a few now that I know we don't need new sfp's [19:18:47] aslo, get puppet cert signed by sockpuppet [19:18:51] oh, didn't know that, cool [19:19:12] mutante, ah I see, so I can't use the mgmt interface to set boot order? [19:19:14] cmjohnson1: great, thanks. let me know how the call goes [19:19:18] an an1010? [19:20:25] ottomata: i have never done that via mgmt on the Ciscos, i just hit F12.. what is 1009 doing ? [19:20:41] installed and booted! [19:20:45] trying to log in now to make sure all is well [19:20:52] i was going to just try reinstall on 1010 again [19:20:57] i'm suspicious that maybe [19:20:57] New patchset: Hashar; "varnish config for bits.beta.wmflabs.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13304 [19:21:07] since it was PXE booting even after the install finished [19:21:13] that it was doing something that borked the install [19:21:31] on an1009, I switched it back to HDD and powercycled before it had a chance to finish the memory test [19:21:49] on an1010 I didn't get a chance to try that until the prompt about partitioning the drives came up [19:21:53] New review: Hashar; "Fixed up $test_wikipedia (aka srv193) which was not properly expanded in the vcl templates." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/13304 [19:21:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13304 [19:22:12] is anyone familiar with GeoIP ? 
[19:22:24] hehe, mutante, you cisco wikitech page says: [19:22:25] - Do not disable boot options via BIOS screens [19:22:33] hashar: in what sense [19:22:43] err: /Stage[main]/Geoip::Data::Sync/File[/usr/share/GeoIP]: Failed to generate additional resources using 'eval_generate: Error 400 on SERVER: Not authorized to call search on /file_metadata/volatile/GeoIP with {:checksum_type=>"md5", :links=>"manage", :recurse=>true} [19:22:46] I am a little bit, but more from the ops side, not so much the usage [19:22:50] looks like some permission error [19:22:58] though I have no idea what the volatile stuff is about [19:23:01] yeah that's my stuff [19:23:10] so volatile is basically a private puppet fileserver [19:23:31] kinda like the private puppet repo [19:23:36] ottomata: hah, ok, well, cool, if 1009 worked, i say reinstalling 1010 one more time makes sense. was about to suggest that too [19:23:46] ahh [19:23:49] yeah, i think I can't do it though, since you changed the bootorder in bios [19:23:52] can you switch it back [19:23:52] I thought we had a public version [19:24:08] * hashar looks at the geoip classes [19:24:10] ummmm, do you know which file it is trying to put in place? [19:24:13] is that GeoIP.conf? [19:24:14] ottomata: sure, will do [19:24:27] oh [19:24:31] ohhh [19:24:33] ok yeah [19:24:35] those are the db files [19:26:29] I am not sure what it tries to install [19:26:35] I am installing a varnish instance on labs [19:26:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:26:51] hmm [19:27:09] at least I got some data in /usr/share/GeoIP [19:27:14] oh you did? [19:27:17] there are .dat files there? 
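The `Not authorized to call search` 400 above is the puppetmaster's fileserver ACL rejecting the labs client. A volatile mount of the kind hashar describes is declared along these lines in the master's fileserver.conf (mount path and allow patterns here are illustrative, not the production values):

```
[volatile]
    # Private payloads (e.g. the paid Maxmind .dat files) live outside
    # the public puppet repo; only production hosts are allowed in.
    path /var/lib/puppet/volatile
    allow *.wikimedia.org
    allow *.pmtpa.wmnet
    allow *.eqiad.wmnet
```

A labs instance matches none of the allow patterns, so recursing into the volatile GeoIP directory fails with exactly this 400 while the rest of the catalog still applies.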
[19:27:30] GeoIP.dat and GeoIPv6.dat \O/ [19:27:35] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 192 seconds [19:27:45] that is a bit crazy ;) [19:27:47] looks right to me [19:27:56] not sure what your error is then [19:28:01] so maybe it installs some basic version [19:28:08] and then just fail to update them [19:28:29] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 219 seconds [19:28:54] its supposed to copy them via puppetmaster fileserver to the puppet client [19:29:04] that's it though [19:29:08] so it looks like it worked [19:29:18] what are the file sizes [19:29:21] should look like this: [19:29:21] -rw-r--r-- 1 root root 1664511 Jan 3 2012 GeoIP.dat [19:29:21] -rw-r--r-- 1 root root 5349447 Jan 3 2012 GeoIPv6.dat [19:29:36] 1204947 and 109251 for v6 [19:29:46] date January 18th 2010 [19:29:54] ah, hm, weird [19:29:59] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 196 seconds [19:30:01] that's /usr/share/geoip [19:30:01] ? [19:30:08] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 199 seconds [19:30:08] ? [19:30:10] oh yeah [19:30:14] it is [19:30:17] sorry for being laggy :/ [19:30:18] hmmm [19:30:19] that's not good [19:30:35] jan 18th 2010?? weird [19:30:49] notpeter: where did you say you put that dump? 
[19:30:52] that is the Ubuntu package geoip-database [19:30:58] it is installed :-) [19:31:31] aye ok [19:31:31] AaronSchulz: in your home dir on mw3 [19:31:50] actually, yeah, hashar, i'm searching puppet for inclusion of geoip class [19:31:59] ahh, mw, I was looking on fenari ;) [19:32:07] i only see it in role::statistics::cruncher [19:32:25] !log pushing all search traffic back to eqiad [19:32:34] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24473 [19:32:34] Logged the message, notpeter [19:32:42] hashar: ottomata there is "# geoip::data::download" [19:33:01] ahh searching poorly [19:33:05] which installs a cronjob to update it weekly [19:33:11] by downloading from maxmind [19:33:13] yes, but that is on puppetmaster only [19:33:29] ah [19:33:29] require geoip [19:33:33] instead of include geoip [19:33:37] in cache.pp line 467 [19:33:47] that's why hashar is getting it on varnish machines [19:34:05] ottomata: ana1010. BOOT options enabled again. Hit F12, boots into PXE now [19:34:16] ok, cool, i can't F12, so i'm going to set order with mgmt [19:34:33] well, you can just take over the console now [19:34:42] ok, its booting? [19:34:42] it already does it and after install you dont want it to, right [19:34:45] yea [19:34:49] yeah k [19:35:04] k will watch it and hopefully catch it soon enough [19:35:19] Ubuntu 12.04 Precise Pangolin AMD64 (Wikimedia edition [19:35:24] there is the installer.. [19:35:25] sorry been laggy [19:35:32] yup, watching it now [19:35:40] ah, we can both be on it? cool [19:35:45] yeah cool! [19:35:59] hashar, ummmmm hm [19:36:04] yeah something is broken for sure then [19:36:07] you should get the newer files [19:36:07] hmm [19:36:08] oh [19:36:12] well I am on labs [19:36:14] maybe its because puppet won't replace those files recursively? 
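Telling stock package data apart from a fresh Maxmind sync is exactly what the sizes and dates above show; a small Python sketch of that check (the threshold and the idea of a helper are assumptions for illustration, not WMF tooling):

```python
import datetime
import os

def stale_geoip_files(directory="/usr/share/GeoIP", max_age_days=60):
    """Report GeoIP data files that look like old stock copies
    (e.g. the 2010-dated files from Ubuntu's geoip-database package)
    rather than recent downloads from Maxmind."""
    stale = []
    now = datetime.datetime.now()
    for name in ("GeoIP.dat", "GeoIPv6.dat"):
        path = os.path.join(directory, name)
        if not os.path.exists(path):
            stale.append((name, "missing"))
            continue
        mtime = datetime.datetime.fromtimestamp(os.path.getmtime(path))
        age_days = (now - mtime).days
        if age_days > max_age_days:
            stale.append((name, "%d days old" % age_days))
    return stale
```

On a labs box with only the geoip-database package installed, both files would show up as years old, matching the January 2010 date hashar reports.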
[19:36:15] oh good [19:36:17] try this then [19:36:21] so I probably don't have access to volatile :-) [19:36:27] oh [19:36:28] grep the puppet log for the filenames? [19:36:28] oh [19:36:31] yup that is probably right [19:36:38] there are some dummy files anyway, so that is probably enough for my use case [19:36:55] yeah they will work if they are from the .deb [19:37:00] they just won't be the full paid versions [19:37:08] fine :-) [19:37:11] RECOVERY - Host analytics1010 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [19:39:10] notpeter: mw58: rsync: mkstemp "/apache/common-local/php-1.20wmf12/includes/filebackend/.SwiftFileBackend.php.tu25FB" failed: Permission denied (13) [19:39:53] that's binasher's fault [19:40:06] yep [19:40:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.670 seconds [19:40:38] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds [19:40:38] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [19:41:46] um, notpeter, i have maybe a stupid question [19:41:53] how do I clear the ssh host key? [19:42:01] I'm used to the hostname or IP being in .ssh/known_hosts [19:42:17] but it looks hashed on sockpuppet:/root/.ssh/known_hosts [19:42:18] or something [19:44:24] oh doh, it tells me in the ssh output [19:44:32] ottomata: fixed [19:44:37] New patchset: Hashar; "Maxmind geoIP data files are only for production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24479 [19:44:38] ottomata: i cleared it for you for ana1009 [19:44:46] ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R analytics1009.eqiad.wmnet [19:44:46] AaronSchulz: mw58 should be fixed, want me to run scap on it? [19:44:47] ? [19:44:59] ottomata: that is deleted it from /etc/ssh/ssh_known_hosts on fenari (as root) [19:45:15] aye, thanks [19:45:19] yayyy i'm in! [19:45:31] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24479 [19:45:31] binasher: sure [19:45:35] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [19:45:55] New review: Hashar; "This is to prevents a warning when installing varnish on labs:" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/24479 [19:46:26] ottomata: change set 24479 should get rid of the warning on labs :-) Cant add you to review! [19:46:29] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [19:47:33] New patchset: Demon; "Perform daily backups of gerrit for amanda to pick up" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24481 [19:48:18] hashar, reviewing [19:48:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24481 [19:48:58] New patchset: Andrew Bogott; "Move python-mwclient into a generic class." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24482 [19:49:17] ahh Guru Meditation: XID: 293871250 …; Varnish is starting up :-) [19:49:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24482 [19:50:09] hashar, can I be really picky about this one? since I wrote it and want to keep it pretty? :) [19:50:14] PROBLEM - Host analytics1010 is DOWN: PING CRITICAL - Packet loss = 100% [19:50:18] ottomata: go ahead :) [19:50:27] binasher: is it done? 
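Back to the hashed known_hosts question above: with HashKnownHosts enabled, OpenSSH stores `|1|base64(salt)|base64(HMAC-SHA1(salt, hostname))` in place of the plain hostname, which is why grepping for the name finds nothing and `ssh-keygen -R` is the right tool. A Python sketch of the matching that `-R` performs (for illustration only, not a replacement for ssh-keygen):

```python
import base64
import hashlib
import hmac
import os

# Hashed known_hosts entries: |1|b64(salt)|b64(HMAC-SHA1(key=salt, msg=host))
def hashed_entry(hostname, salt=None):
    salt = salt if salt is not None else os.urandom(20)
    digest = hmac.new(salt, hostname.encode(), hashlib.sha1).digest()
    return "|1|%s|%s" % (base64.b64encode(salt).decode(),
                         base64.b64encode(digest).decode())

def entry_matches(entry, hostname):
    _, magic, b64salt, b64digest = entry.split("|")
    if magic != "1":
        return False
    salt = base64.b64decode(b64salt)
    digest = hmac.new(salt, hostname.encode(), hashlib.sha1).digest()
    return base64.b64decode(b64digest) == digest
```

Since each entry carries its own random salt, you can only find a host by re-hashing the name against every stored salt, which is what makes `ssh-keygen -f /etc/ssh/ssh_known_hosts -R <host>` the practical way to clear one.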
[19:50:29] what you have will work for sure, but it kinda ruins the semantics of the provider parameter [19:50:44] AaronSchulz: yeah [19:50:50] basically, if realm is not production, then provider won't work [19:50:51] hmm [19:51:24] im' not sure if this would work [19:51:27] but it might be more elegant [19:51:27] if [19:51:32] instead of modifying the geoip::data class [19:51:38] you put a conditional in the geoip class [19:51:40] that says [19:52:23] if realm == production { [19:52:23] class { "geoip::data": data_directory => $data_directory } [19:52:23] } [19:52:23] else { [19:52:23] package { "geoip-database": ensure => "installed" } [19:52:23] } [19:52:44] i just searched real quick, and I don't think anything is requiring Class["geoip::data"] [19:52:59] ottomata: analytics1010 login: yay [19:53:06] yayyy! [19:53:21] ottomata: amending :) [19:53:25] aye, but hmm [19:53:39] wait, I do see requires for File["$data_directory"] [19:53:41] RECOVERY - SSH on analytics1010 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:53:48] so you shoudl add that in the else { } as well [19:53:50] RECOVERY - Host analytics1010 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [19:54:00] # Make sure the volatile GeoIP directory exists. [19:54:00] # Data files will be downloaded by geoipupdate into [19:54:01] # this directory. [19:54:01] file { "$data_directory": [19:54:01] ensure => "directory", [19:54:01] } [19:54:11] just that bit in the else where you install the package [19:55:34] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24482 [19:56:21] New patchset: Hashar; "Maxmind geoIP data files are only for production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24479 [19:56:24] ottomata: ssh keys cleared. want me to sign puppet certs really quick and run puppet? 
[19:56:26] ottomata: updated :) [19:56:46] oh men the $data_directory forgot about it [19:56:51] mutante, i'm doing it now :) [19:56:53] is there any reason that configchange wouldn't have hit all of the apaches? [19:57:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24479 [19:57:36] hashar, I think you forgot your else { } too? [19:57:44] else install geoip-database package [19:57:51] and make sure $data_directory exists [19:58:11] ex: mw54 is still sending search traffic to pmtpa [19:58:18] even though I updated lucene.php [19:58:27] it's in the mediawiki-installations dsh group [20:02:38] alright, just abandoned the very last patch set that was in the test branch. bye bye "test" [20:03:19] (yay!) [20:04:00] foodbot: @find Thai [20:04:56] PROBLEM - Host analytics1002 is DOWN: PING CRITICAL - Packet loss = 100% [20:06:16] New patchset: Hashar; "Maxmind geoIP data files are only for production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24479 [20:06:37] ottomata: and now fallback to instal the geoip-database package and copy them to $log_directory if needed [20:06:40] not nice though [20:07:10] New review: Dzahn; "yep, redirect bugzilla to https" [operations/apache-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/24405 [20:07:10] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/24405 [20:07:11] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24479 [20:07:20] hmm, hashar, i don't think you need to do anything other than ensure that the directory exists [20:07:27] the geoip-database package ensures that the files are there [20:07:29] PROBLEM - Host analytics1004 is DOWN: PING CRITICAL - Packet loss = 100% [20:07:38] the only reason I said you'd need the file resource is really to make puppet happy [20:07:38] PROBLEM - Host analytics1003 is DOWN: PING CRITICAL - Packet loss = 100% [20:07:38] PROBLEM - Host analytics1006 is DOWN: PING CRITICAL - Packet loss = 100% [20:07:44] annnnnnd actually [20:07:46] the way you are doing it [20:07:53] only ensureing that the symlink exists in produciong [20:07:57] you might not even need it [20:08:05] PROBLEM - Host analytics1008 is DOWN: PING CRITICAL - Packet loss = 100% [20:08:06] i think that is the only place that would require that the file resource is defined [20:08:10] ottomata: but the package install the files in /usr/share/GeoIP [20:08:14] PROBLEM - Host analytics1005 is DOWN: PING CRITICAL - Packet loss = 100% [20:08:14] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [20:08:16] yes [20:08:24] the package does, and it will create the directory too [20:08:28] i was just saying to define the file in puppoet [20:08:39] so that if anything else in puppet referenced it [20:08:43] you wouldn't get an undefined error [20:08:44] but then $data_directory could be set to something else like /srv/host/something [20:08:52] ahhhhhhh [20:08:52] i see [20:09:12] wtf [20:09:20] maybe I should have added a comment like : "honor $data_directory" [20:09:29] notpeter: so the SwiftFileBackend class is different on fenari vs mw12 [20:09:35] hmmmmm [20:09:48] I was wondering why auths were still coming in as much from runners [20:09:56] i see [20:10:04] hmmm, maybe just symlink? [20:10:07] instead of copy? 
[20:10:12] !log sync-apache and pushing out redirects.conf change for Bugzilla redirects to https [20:10:19] * AaronSchulz tries a sync again [20:10:22] Logged the message, Master [20:10:29] RECOVERY - Host analytics1002 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [20:10:36] AaronSchulz: I was a dumbass and didn't re-add them to the dsh group after broght them back up.... [20:10:41] try syncing them now :/ [20:10:48] if( $data_directory != '/usr/share/GeoIP' ) { [20:10:48] file { "$data_directory" ensure => "/usr/share/GeoIP" } [20:10:48] } [20:10:49] ? [20:10:53] notpeter: which ones? [20:10:58] mw2-16 [20:11:09] they all need a sync-common, or what ahser ran [20:11:20] I can do sync-common on all of them [20:11:24] who knows what else is out of date [20:11:31] yeah [20:11:34] sorry about that [20:11:36] will fix now [20:12:35] also, hashar, if you do that, add a comment as to why [20:12:36] somethign like [20:13:02] RECOVERY - Host analytics1004 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms [20:13:11] RECOVERY - Host analytics1006 is UP: PING OK - Packet loss = 0%, RTA = 26.44 ms [20:13:11] RECOVERY - Host analytics1003 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms [20:13:20] # The geoip-database package always installs at /usr/share/GeoIP. [20:13:21] # Make sure $data_directory points here if a different location has been specified. 
[20:13:38] RECOVERY - Host analytics1008 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms [20:13:47] RECOVERY - Host analytics1007 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [20:13:47] RECOVERY - Host analytics1005 is UP: PING OK - Packet loss = 0%, RTA = 26.48 ms [20:14:24] PROBLEM - SSH on analytics1002 is CRITICAL: Connection refused [20:14:25] New review: Dzahn; "pushed out to cluster and done" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/24405 [20:15:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:50] New patchset: Hashar; "Maxmind geoIP data files are only for production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24479 [20:16:56] PROBLEM - SSH on analytics1005 is CRITICAL: Connection refused [20:16:56] PROBLEM - SSH on analytics1007 is CRITICAL: Connection refused [20:16:58] ottomata: done :) [20:17:23] PROBLEM - SSH on analytics1004 is CRITICAL: Connection refused [20:17:23] PROBLEM - SSH on analytics1003 is CRITICAL: Connection refused [20:17:32] PROBLEM - SSH on analytics1006 is CRITICAL: Connection refused [20:17:56] notpeter: finished? [20:17:59] PROBLEM - SSH on analytics1008 is CRITICAL: Connection refused [20:18:10] AaronSchulz: still going. the rsync is not fast... [20:18:17] RECOVERY - NTP on analytics1009 is OK: NTP OK: Offset -0.0346814394 secs [20:18:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24479 [20:19:24] aye hashar, maybe you missed the bit about symlinking? :) [20:19:35] symlinking would be better than co pying, right? [20:20:13] if $data_directory != /usr/share/GeoIP then symlink /usr/share/GeoIP -> $data_directory [20:20:14] right? 
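Pulling the pieces of this review together, the shape ottomata is suggesting looks roughly like the following (a sketch against the existing geoip manifests; $data_directory and the geoip::data class are assumed from geoip.pp, and this is not the merged change 24479):

```puppet
class geoip($data_directory = '/usr/share/GeoIP') {
    if $::realm == 'production' {
        # Paid Maxmind files, synced from the puppetmaster's volatile share.
        class { 'geoip::data':
            data_directory => $data_directory,
        }
    } else {
        # Labs and friends: fall back to the free files shipped by Ubuntu.
        package { 'geoip-database':
            ensure => installed,
        }
        # The geoip-database package always installs to /usr/share/GeoIP;
        # honor a non-default $data_directory with a symlink, not a copy.
        if $data_directory != '/usr/share/GeoIP' {
            file { $data_directory:
                ensure  => link,
                target  => '/usr/share/GeoIP',
                require => Package['geoip-database'],
            }
        }
    }
}
```

The symlink keeps anything that requires File[$data_directory] satisfied on labs without recursively copying the package's data files.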
[20:20:22] I hate sum links [20:20:32] RECOVERY - NTP on analytics1010 is OK: NTP OK: Offset -0.03044784069 secs [20:20:33] and not sure how to do them recursively [20:20:37] though we could define both files [20:20:40] you don't need to do it recursively [20:20:42] right? [20:21:23] if( $data_directory != '/usr/share/GeoIP' ) { [20:21:23]   file { "$data_directory": [20:21:23] ensure => "/usr/share/GeoIP", [20:21:23] require => Package["geoip-database"] } [20:21:24] } [20:21:57] what does ensure => some/path do ? [20:22:09] ensures that the file is a symlink to that path [20:22:22] http://docs.puppetlabs.com/references/stable/type.html#file [20:22:28] ohh [20:22:37] ensure [20:22:38] Anything other than the above values will create a symlink; [20:22:38] hashar: i think i agree to you that supporting ALL these subdomains (developer,develop,developers and on wp and wm) seems a bit much.. but unsure [20:22:52] you can also do [20:23:05] mutante: we had a chat about it with jeremy. I guess we will let the community decide :-) [20:23:09] ensure => link, [20:23:09] target => '/usr/share/GeoIP', [20:23:11] same thing [20:23:23] PROBLEM - Host analytics1002 is DOWN: PING CRITICAL - Packet loss = 100% [20:23:34] AaronSchulz: ok, should be all sync'd up now [20:23:36] sorry about that [20:24:00] hashar: yeah, less discussion and changes aftewards:) [20:24:02] bbl, food [20:24:18] mutante: i have no strong opinion, but that's what i would do if i were going to be bold and do it without waiting for consensus [20:24:28] (that=what i did) [20:24:43] thanks for the bugzilla merge ;) [20:25:56] PROBLEM - Host analytics1004 is DOWN: PING CRITICAL - Packet loss = 100% [20:26:05] PROBLEM - Host analytics1003 is DOWN: PING CRITICAL - Packet loss = 100% [20:26:50] PROBLEM - Host analytics1008 is DOWN: PING CRITICAL - Packet loss = 100% [20:26:50] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [20:26:50] PROBLEM - Host analytics1005 is DOWN: PING CRITICAL - 
Packet loss = 100% [20:27:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.973 seconds [20:27:30] New patchset: Hashar; "varnish config for bits.beta.wmflabs.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13304 [20:28:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13304 [20:28:37] New review: Hashar; "Fix the bits backend probe that were querying en.wikipedia.org and bits.wikimedia.org." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/13304 [20:28:56] RECOVERY - Host analytics1002 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [20:31:47] PROBLEM - Host analytics1006 is DOWN: PING CRITICAL - Packet loss = 100% [20:32:05] New review: Hashar; "PS25 deployed on deployment-cache-bits02 and happily serving files: http://bits.beta.wmflabs.org/fav..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/13304 [20:32:07] !log reinstalling analytics1002-analytics1010 with Ubuntu precise [20:32:16] Logged the message, Master [20:34:25] PROBLEM - Host analytics1002 is DOWN: PING CRITICAL - Packet loss = 100% [20:35:10] RECOVERY - SSH on analytics1008 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:35:19] RECOVERY - Host analytics1008 is UP: PING OK - Packet loss = 0%, RTA = 26.49 ms [20:36:40] RECOVERY - SSH on analytics1006 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:36:49] RECOVERY - Host analytics1006 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms [20:37:25] RECOVERY - SSH on analytics1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:37:34] RECOVERY - Host analytics1003 is UP: PING OK - Packet loss = 0%, RTA = 26.47 ms [20:38:10] RECOVERY - SSH on analytics1004 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:38:19] RECOVERY - Host analytics1004 is UP: PING OK - Packet loss = 0%, RTA = 26.47 ms [20:44:01] 
RECOVERY - Host analytics1007 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [20:44:01] RECOVERY - Host analytics1005 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [20:45:40] RECOVERY - Host analytics1002 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [20:54:14] heya notpeter, I'm having the problem we had with analytics1010 this morning again [20:54:18] this time on analytics1002 [20:54:23] all of the others installed just fine though [20:54:29] An unexpected error occurred while setting up a preseeded RAID configuration. [20:54:29] Check /var/log/syslog or see virtual console 4 for the details. [20:55:22] hmm [20:55:22] Sep 20 20:53:52 partman-auto-raid: mdadm: cannot open /dev/sdc1: Device or resource busy [20:55:22] Sep 20 20:53:52 partman-auto-raid: Error creating array /dev/md0 [20:55:26] New patchset: Hashar; "(bug 39701) beta: automatic MediaWiki update" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22116 [20:56:21] New review: Hashar; "Fixed up a dependency, we need to get the mw-update-l10n script" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/22116 [20:56:22] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22116 [20:57:13] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [20:57:40] PROBLEM - NTP on analytics1008 is CRITICAL: NTP CRITICAL: Offset unknown [20:57:47] can someone look at this https://gerrit.wikimedia.org/r/#/c/24464/ [20:58:07] PROBLEM - Host analytics1005 is DOWN: PING CRITICAL - Packet loss = 100% [20:59:01] PROBLEM - NTP on analytics1006 is CRITICAL: NTP CRITICAL: Offset unknown [20:59:37] PROBLEM - NTP on analytics1004 is CRITICAL: NTP CRITICAL: Offset unknown [20:59:46] PROBLEM - NTP on analytics1003 is CRITICAL: NTP CRITICAL: Offset unknown [20:59:55] RECOVERY - SSH on analytics1005 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:00:04] RECOVERY - Host analytics1005 is UP: PING OK - Packet loss = 0%, RTA = 26.48 ms [21:02:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:05:03] notpeter: around? [21:05:37] RECOVERY - NTP on analytics1004 is OK: NTP OK: Offset -0.03013217449 secs [21:05:46] RECOVERY - NTP on analytics1003 is OK: NTP OK: Offset -0.03670918941 secs [21:06:31] RECOVERY - NTP on analytics1006 is OK: NTP OK: Offset -0.03039526939 secs [21:06:41] RECOVERY - NTP on analytics1008 is OK: NTP OK: Offset -0.03352963924 secs [21:07:25] hey cmjohnson1, thanks for the update, are those emails on the ticket? (if not they should be pasted in I guess) [21:07:33] er about the c2100s of course [21:08:33] no, they are not in the ticket. I was hoping you could elaborate a little more (when you get a chance) [21:09:29] oh. well I updated the ticket with the results of today's test (which they should have got well before this last email from them) [21:09:36] I think it was cced to all the right people [21:10:52] it should be...which ticket did you update? 
[21:10:59] the ms-be6 one, just a sec [21:11:12] yep 3452 [21:11:29] New review: Krinkle; "-1 for the removal of entries that match the default but were explicitly set so, and with a bugzilla..." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/23059 [21:11:48] it did not go through, how weird is that [21:11:57] I will try again (I have the tab still open in my browser) [21:11:59] 2012-09-20 21:08:59 mw12 zhwiki: InvalidResponseException in 'SwiftFileBackend::doGetFileStat' (given '{"src":"mwstore:\/\/local-swift\/timeline-render\/215594eaaee4b06a3bb4a5a07436f220.map"}'): Invalid response (0): (curl error: 6) Couldn't resolve host 'ms-fe.pmtpa.wmnet': Failed to obtain valid HTTP response. [21:12:06] oh.... I bet this was during a disconnect dang it [21:12:45] man that sucks [21:12:51] I can't wait for them to replace this router [21:13:03] (I have a call into the isp and the landlord, it's a matter of who is first) [21:13:30] er... failed to obtain valid http response? wtf? [21:13:40] odd [21:14:16] (anyways I saved it again and it took it for the ticket, if it weren't for the connection they would have had it hours ago) [21:14:37] PROBLEM - NTP on analytics1002 is CRITICAL: NTP CRITICAL: No response from NTP server [21:15:16] paravoid: ping [21:15:24] apergos: why would it not be able to resolve the host?
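For context on the exception above: curl error 6 means name resolution failed at the libc/NSS level, before any HTTP traffic. A minimal sketch of the first check one might run on an affected jobrunner; the hostname is the one from the log, and `check_host` is a helper invented here, not part of the cluster's tooling:

```shell
# curl resolves names through libc/NSS, so "getent hosts" succeeds and
# fails in the same cases curl error 6 would. check_host is a
# hypothetical helper for illustration only.
check_host() {
    if getent hosts "$1" > /dev/null; then
        echo "libc resolves $1"
    else
        echo "libc cannot resolve $1 (same failure mode as curl error 6)"
    fi
}

check_host ms-fe.pmtpa.wmnet
```

Comparing this against a direct query to the DNS server (e.g. `dig @<resolver> ms-fe.pmtpa.wmnet`) would separate a resolver-config problem on the hosts from a DNS-server problem, which matters here since only some machines were affected.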
[21:15:26] notpeter: ping [21:15:36] binasher: ping [21:16:08] apergos: fairly high amount of those, all from job-runners [21:16:10] I seriously have no idea, all three hosts that should be in the pool look ok [21:16:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.295 seconds [21:16:21] see that's suspicious that it's only from them [21:16:34] if there were a general problem we would see it from other hosts [21:16:52] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [21:16:52] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [21:16:52] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [21:16:52] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [21:16:52] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [21:16:53] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [21:19:42] I do see some out of memory messages on the one random jobrunner I looked at but the timing isn't right [21:20:50] notpeter: can you merge this please https://gerrit.wikimedia.org/r/#/c/24464/ [21:22:03] one more scap... [21:22:16] PROBLEM - NTP on analytics1005 is CRITICAL: NTP CRITICAL: No response from NTP server [21:27:22] Are there any operations folks available [21:27:42] I need to get this change https://gerrit.wikimedia.org/r/#/c/24493/ approved, merged, and pushed ASAP [21:28:19] apergos: have you tailed swift-backend.log? [21:28:26] no [21:28:30] New patchset: Andrew Bogott; "Clean up a bit when an instance is getting deleted." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24496 [21:28:53] how would the backends be implicated just by the jobrunners? [21:29:23] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24496 [21:29:28] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24496 [21:29:44] apergos: well the mw* servers are newer, maybe situated differently [21:29:45] wtf [21:29:54] sorry just looking at the log now and wondering [21:29:55] there is something beyond coincidence [21:30:27] these are all zhwiki right? [21:31:09] from what I can tell [21:31:21] for the last several hours anyways [21:31:34] are these somehow produced by queued jobs? [21:31:42] not sure why that matters...maybe a job is in a loop [21:31:46] maybe that's how this timeline stuff is handled? [21:31:55] greping for '42983552dc252c7e27508bb31d6940a3.map' shows a lot [21:32:16] cause say around 13:50 there are a batch that are all nl wiki [21:32:20] apergos: timeline doesn't just the jq, but it might be hit during parse [21:32:43] kaldari - http://www.maxmind.com/app/ipv6 [21:32:56] hm [21:33:00] thanks [21:33:49] note - What happens when I try to use IPv6 addresses with your current retail products? [21:33:49] Currently, IPv6 addresses will return a generic error message. [21:33:49] *doesn't use the jq [21:34:06] :( [21:35:15] weird that they have an IPv6 database available, but don't use it in their own API yet [21:39:15] notpeter: please merge and push https://gerrit.wikimedia.org/r/#/c/24493/ [21:39:30] woosters, mark: Do we use MaxMind's downloadable database or their web service? 
for GeoIP [21:39:44] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24493 [21:39:47] so the next thing we could possibly try to do is see what some of these jobs that fail actually are, I guess [21:40:29] I dunno how that will help actually, nm [21:41:05] it's strange, logging into one of those boxes works fine in every way, including auth and host resolution [21:41:20] I don't see why it keeps randomly failing [21:41:24] well I assume that not all the jobs are failing [21:41:29] only a relative few [21:43:02] so the only things I can think of would be: memory, network (but why only those?), lvs pool (but again then why only those?), dns server (and yet again why only those?) ... it's at least plausible that memory could play a role on the job runners [21:44:17] notpeter will know, are there other job runners besides mw 1 through 15? [21:46:30] 1-16 sorry [21:46:33] yep [21:46:48] apergos: /etc/dsh/groups/job-runners or something [21:46:51] lots [21:46:58] take a look at site.pp [21:46:59] I don't trust our dsh lists any more [21:47:04] yeah I'm in it now [21:47:13] node /mw(1[7-9]|[2-4][0-9]|5[0-4])\.pmtpa\.wmnet/ [21:47:31] node /^srv(23[1-9]|24[0-7])\.pmtpa\.wmnet$/ [21:47:42] so yeah lots more [21:49:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:23] are those the only ones on precise? (grasping at straws) [21:55:17] New patchset: Ryan Lane; "Revoke Sara's cluster access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24497 [21:56:11] New review: gerrit2; "Lint check passed."
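The two site.pp node regexes quoted above expand to concrete host ranges, which shows how many jobrunners exist beyond mw1-16. A quick sketch of that expansion (the `jobrunner_hosts` function name is made up for illustration):

```shell
# Expand the two quoted site.pp node regexes into hostnames:
#   /mw(1[7-9]|[2-4][0-9]|5[0-4])\.pmtpa\.wmnet/  -> mw17..mw54    (38 hosts)
#   /^srv(23[1-9]|24[0-7])\.pmtpa\.wmnet$/        -> srv231..srv247 (17 hosts)
jobrunner_hosts() {
    for i in $(seq 17 54); do echo "mw$i.pmtpa.wmnet"; done
    for i in $(seq 231 247); do echo "srv$i.pmtpa.wmnet"; done
}

jobrunner_hosts | wc -l
```

That is 55 additional runners beyond mw1-16, which is why a failure appearing only on mw1-16 (confirmed below as the only jobrunners on precise) pointed at something specific to those installs rather than a general problem.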
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24497 [21:58:34] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24497 [21:58:36] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23761 [22:00:05] New patchset: Asher; "adding new es clusters (but not moving writes)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24498 [22:03:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.763 seconds [22:04:07] woosters: just send an email to MaxMind support asking for a timetable on their IPv6 support [22:04:13] send = sent [22:04:16] RECOVERY - Puppet freshness on mw58 is OK: puppet ran at Thu Sep 20 22:03:58 UTC 2012 [22:04:33] ok thks - kaldari [22:06:54] apergos: maybe peter knows? [22:13:48] yes, mw1-16 are the only jobrunners that are on precise [22:26:20] * apergos is giving up, besides it's late and once again I have not done anything but work related crap today [22:26:37] apergos: yes, mw1-16 are the only jobrunners that are on precise [22:27:41] ok thanks [22:27:55] and goodnight [22:28:07] have a good night! [22:29:18] notpeter: are the image scalers getting the Precise upgrade this week, or is that currently unscheduled? [22:29:58] that is currently unscheduled [22:30:23] that's a blocker for the deployment of Timed Media Handler [22:31:12] short of a Precise upgrade, we'll need to upgrade the version of ffmpeg on those machines [22:31:21] New review: Dzahn; "yeah, it should be changed to chat.freenode, but i don't want to cause gerrit to try and merge it ov..." [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/22698 [22:31:25] alright. I shall push ahead with precise [22:33:41] New review: Dzahn; "It has been pointed out that if we are going to create a subdomain for each possible synonym that op..." 
[operations/apache-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/24407 [22:36:40] !log stopping puppet on brewster [22:36:50] Logged the message, notpeter [22:37:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:47:00] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 218 seconds [22:49:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.335 seconds [22:51:29] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 15 seconds [22:58:20] !log restarting gerrit [22:58:30] Logged the message, Master [23:00:09] mutante: could you please merge this https://gerrit.wikimedia.org/r/#/c/24464/ [23:00:24] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24498 [23:00:37] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24470 [23:07:56] New patchset: Andrew Bogott; "Include project name in instance status." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24508 [23:09:09] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24508 [23:13:32] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [23:24:05] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24464 [23:24:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:24:39] cmjohnson1: done [23:24:47] thx [23:24:51] yw [23:25:23] quite a few servers there that have been decom'ed..yeah [23:27:43] oh yeah...several over the last year [23:27:47] New patchset: Asher; "movnig es2/3 to core db classes, bool cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24511 [23:27:59] mutante: quick question... 
i am getting this git error Errors running git rebase -i remotes/gerrit/production [23:28:20] any idea how to fix this? [23:28:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24511 [23:30:22] cmjohnson1: what are you trying to do? have an existing change that has been sitting in git and now can't merge ? [23:30:55] no..just made the change [23:31:43] i thought maybe it was because i had an unmerged change in my local branch but you merged it in gerrit [23:31:45] using "git review" ? [23:31:51] yes [23:31:58] hmmm [23:32:31] i may just blow it out and try again...may be easier [23:33:54] i guess my best bet for now is "git reset --hard origin", then git pull [23:34:12] yep...my thoughts exactly [23:38:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.694 seconds [23:40:55] New review: Krinkle; ""develop" and "developers" seems redundant, why do we need the aliases? Its not like we have a histo..." [operations/apache-config] (master) C: -1; - https://gerrit.wikimedia.org/r/24407 [23:45:14] AaronSchulz: I have something that should work btw [23:48:46] paravoid: is it in gerrit? [23:48:54] no [23:48:59] (not yet)
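The recovery cmjohnson1 and mutante settled on above (hard-reset to the remote, then pull) can be reproduced end to end in a scratch repo. A sketch, with all paths and identities invented for the example; note that `--hard` discards local commits and uncommitted work, which is exactly the "blow it out" intent:

```shell
# Simulate a local change that Gerrit has already dealt with upstream,
# then recover with the reset-and-pull approach from the log.
# Everything here runs in throwaway temp directories.
set -e
tmp=$(mktemp -d)

# A scratch "origin" repo with one commit.
git init -q "$tmp/origin"
cd "$tmp/origin"
git config user.email dev@example.org
git config user.name dev
echo base > file && git add file && git commit -qm base

# A working clone with a diverged local commit, standing in for the
# stale change that made "git review" fail to rebase.
git clone -q "$tmp/origin" "$tmp/work"
cd "$tmp/work"
git config user.email dev@example.org
git config user.name dev
echo local > file && git commit -aqm "stale local change"

# The recovery: throw away local state, then pull fresh history.
# '@{upstream}' avoids hardcoding the remote branch name.
git reset --hard -q '@{upstream}'
git pull -q
```

After the reset and pull the working tree matches origin again and a subsequent `git review` rebase has nothing stale to trip over.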