[00:00:03] RECOVERY - Host palladium is UP: PING OK - Packet loss = 0%, RTA = 26.66 ms [00:00:04] do we have a means of saying "this IP address can go this much more"? [00:00:57] I don't remember fielding that type of request before, and I suspect that we'll probably ask that whoever this is start using data dumps instead of hammering the cluster, but I thought I'd ask what the process is for requesting this [00:03:48] PROBLEM - SSH on palladium is CRITICAL: Connection refused [00:04:10] PROBLEM - Varnish HTTP bits on palladium is CRITICAL: Connection refused [00:04:29] ah...wait, I see, I think I found the answer. this is probably just an admin request to be put in the right group [00:04:32] robla: The API doesn't enforce per-IP-per-minute request limits, at least not on the MW side. Are they talking about rate limits on editing perhaps? [00:04:39] Or limits re how many items per request? [00:06:08] I imagine items per request. BTW, this is the requestor: http://meta.wikimedia.org/wiki/Research:Non-finite_Processes_in_Human_Social_Phenomena [00:06:45] I think they want the apihighlimits permission [00:07:13] * robla looks on enwiki for what group that translates to [00:08:14] ah....they just need to be added to the "Researchers" group on enwiki, or whatever [00:08:24] Special permissions for research purposes (including high API request limits or access to non-public data) can be granted by the Research Committee on a temporary basis. [00:08:26] NOTE [00:08:37] The 'researchers' group on enwiki also grants access to PRIVATE DATA [00:08:46] Like certain parts of deleted history metadata or something [00:08:46] oh, there's that.... [00:09:05] I originally created that group specifically for [[User:DarTar]], it's funny that he's running it now :) [00:09:40] paravoid: ERROR with Object server 10.0.6.202:6000/sda4 re: Trying to GET /v1/AUTH_43651b15-ed7a-40b6-b745-47666abf8dfe/global-data-math-render.60/6/0/8/608b3cd307410c615ee3af9df0e46da9.png: ConnectionTimeout (0.5s) (txn: txb7ded627e92c49e694619c86f4337194) (client_ip: 69.115.255.211) [00:09:50] Bots would also work for this purpose [00:10:28] maybe that explains those annoying weird error log entries [00:10:50] thought you'd think MW would get at least 50x back though [00:10:57] what? [00:11:13] paravoid: swift-backend.log in fluorine [00:12:54] actually I see lots of these for thumbs too, so that probably isn't it [00:13:05] that's ms-be3 and it seems unreachable [00:13:08] and nagios hasn't noticed [00:13:14] wtf [00:13:30] tail -n 1000 swift | grep 'proxy-server' | grep ' ERROR ' [00:13:42] RECOVERY - SSH on palladium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:16:35] sigh [00:20:50] nagios didn't think you wanted to be bothered by that [00:21:24] New patchset: Dzahn; "enable nl, sv, ru wikivoyage in Apache" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/32329 [00:22:17] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/32329 [00:23:00] robla: would like to make another change now and graceful Apaches etc [00:24:30] PROBLEM - Host palladium is DOWN: PING CRITICAL - Packet loss = 100% [00:24:48] mutante: just check here before doing that sort of thing: http://wikitech.wikimedia.org/view/Deployments [00:25:33] powercycling it, although something tells me it won't come up [00:26:10] !log powercycling ms-be3 [00:26:10] paravoid: how are the replacements going anyway? [00:26:12] Logged the message, Master [00:26:26] dzahn is doing a graceful restart of all apaches [00:26:43] AaronSchulz: the first one arrived today [00:26:46] !log dzahn gracefulled all apaches [00:26:51] but not its iDRAC license [00:26:53] Logged the message, Master [00:27:21] RECOVERY - Host palladium is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [00:28:15] oh wow [00:28:17] it's up [00:28:18] amazing [00:28:33] RECOVERY - Host ms-be3 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [00:30:03] AaronSchulz: you'll have to bear with those errors for a few more hours [00:30:17] !log aaron synchronized php-1.21wmf3/extensions/SwiftCloudFiles/php-cloudfiles-wmf/cloudfiles.php [00:30:21] Logged the message, Master [00:30:21] I'm going to stop it from serving for a while [00:30:40] leave it run rsyncs and start it up properly tomorrow noon/afternoon my time [00:31:42] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [00:33:21] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.002 second response time on port 11000 [00:34:24] !log starting memcached on virt0 [00:34:28] Logged the message, Master [00:34:34] New patchset: Reedy; "Add sv, nl and ru wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32330 [00:34:43] !log temporarily stopping {account,container,object}-server on ms-be3 while it syncs up from its downtime [00:34:52] Logged the message, Master [00:34:55] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32330 [00:35:26] !log aaron synchronized php-1.21wmf3/extensions/SwiftCloudFiles/php-cloudfiles-wmf/cloudfiles.php [00:35:34] Logged the message, Master [00:36:10] New patchset: Reedy; "Revert "Add sv, nl and ru wikivoyage"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32331 [00:36:18] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32331 [00:37:07] !log reedy synchronized all.dblist [00:37:13] Logged the message, Master [00:37:42] PROBLEM - swift-account-server on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [00:38:19] PROBLEM - swift-object-server on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [00:38:27] PROBLEM - swift-container-server on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [00:38:47] RECOVERY - Varnish HTTP bits on palladium is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.053 seconds [00:39:03] !log aaron synchronized php-1.21wmf3/extensions/SwiftCloudFiles/php-cloudfiles-wmf/cloudfiles.php 'removed debug logging' [00:39:09] Logged the message, Master [00:40:09] !log wikivoyage imports: "fr" - done [00:40:13] Logged the message, Master [00:41:01] paravoid: actually that seemed to have fixed the swift-backend.log errors [00:42:27] I'm sure there are going to be more errors until I start the servers [00:42:29] thanks btw for spotting it [00:42:42] !log repooling palladium [00:42:43] oh I still see 10.0.6.202:6000 in the "swift" log though [00:42:48] Logged the message, notpeter [00:42:49] just not the other one [00:43:12] yeah [00:43:27] that one has been driving me crazy for weeks [00:49:30] binasher: eqiad bits caches upgraded as requested [00:49:52] thank you! [00:52:34] !log reedy synchronized wmf-config/ [00:52:40] Logged the message, Master [00:54:17] WebVideoTranscode::updateJobQueue 10.0.6.41 1213 Deadlock found when trying to get lock; try restarting transaction (10.0.6.41) DELETE FROM `transcode` WHERE transcode_image_name = 'ComputerHotline_-_Vidéo_de_la_Lune_(survol)_(by).OGG' AND transcode_key = '360p.webm' AND (transcode_id != 29660) [00:54:46] * AaronSchulz wonders how much activity happens with that table [00:56:56] lots? [01:00:57] paravoid: heh, the quick batch write function for MW saw a ~100ms latency drop after your change [01:01:21] wha?! [01:02:10] * AaronSchulz is looking at graphite [01:04:14] actually over 120ms [01:04:14] RECOVERY - swift-container-server on ms-be3 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [01:04:14] RECOVERY - swift-object-server on ms-be3 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [01:04:14] RECOVERY - swift-account-server on ms-be3 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [01:04:28] paravoid: are you doing something again? [01:04:32] no, cron is [01:04:34] heh [01:04:42] ahh [01:04:42] forgot that we have swift restart in cron [01:04:57] was that to work around leaks? [01:06:52] hah, no that was not it [01:06:54] it was puppet :) [01:07:00] we have ensure => running [01:07:04] or that [01:10:45] PROBLEM - swift-account-server on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [01:12:00] PROBLEM - swift-object-server on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [01:12:15] PROBLEM - swift-container-server on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [01:42:15] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 278 seconds [01:47:12] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [02:00:24] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 272 seconds [02:00:33] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 281 seconds [02:00:45] !log LocalisationUpdate failed: git pull of extensions failed [02:00:49] Logged the message, Master [02:01:14] :O [02:01:16] Reedy: ---^^ [02:51:07] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Thu Nov 8 02:50:52 UTC 2012 [02:55:20] PROBLEM - SSH on ms1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:03:15] !log aaron synchronized php-1.21wmf3/includes/objectcache/MemcachedPeclBagOStuff.php 'deployed efda0b27befc28ec13dae23f48e8baea42f4a7f3' [03:03:19] Logged the message, Master [03:54:24] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [04:04:26] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 20 seconds [04:20:37] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [04:20:37] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [04:20:37] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [04:24:26] hrmmm, akapoor is still silent [05:39:06] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [05:40:01] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out [05:41:30] RECOVERY - Lucene on search1016 is OK: TCP OK - 9.025 second response time on port 8123 [05:42:09] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [06:03:26] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out [06:16:36] RECOVERY - Lucene on search1016 is OK: TCP OK - 9.033 second response time on port 8123 [06:24:50] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:26:17] RECOVERY - LVS Lucene on search-pool1.svc.eqiad.wmnet is OK: TCP OK - 9.022 second response time on port 8123 [06:26:29] !log restarted lucene search on searchpool1016 [06:26:35] Logged the message, Master [06:31:06] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [07:16:46] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [07:20:03] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [07:20:18] !log restarted lucene search on search1015 [07:20:20] Logged the message, Master [07:32:36] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours [08:02:36] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [08:02:36] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [08:02:36] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [08:50:00] PROBLEM - Varnish HTTP mobile-backend on cp1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:50:45] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:51:23] PROBLEM - Varnish HTCP daemon on cp1041 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:54:01] http://frr.wikipedia.org/ is reported to be down (see -tech) since sometime yesterday - perhaps after an update yesterday [09:04:36] https://bugzilla.wikimedia.org/show_bug.cgi?id=41872 [09:06:21] New patchset: Tpt; "(bug 41872) Configure page and index namespaces for frr wiki." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32350 [09:07:11] Merlissimo: ^ [09:07:39] Oh, you saw. [09:07:40] :) [09:07:49] I think. [09:08:49] Thx RD [09:11:48] RD: i am still wondering why test2.wikipedia.org is ok, because it also uses proofreadpage and has this config missing, too [09:15:21] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [09:18:19] unfortunately I don't know how that extensino works (or I would just +2 the changes and push themout directly) [09:20:42] [10243032.208628] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250) [09:20:56] oh joy [09:21:12] never seen that xfs whine before [09:21:24] !log Power cycled cp1041 [09:21:28] Logged the message, Master [09:24:59] RECOVERY - Varnish HTCP daemon on cp1041 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [09:28:45] RECOVERY - Varnish HTTP mobile-backend on cp1041 is OK: HTTP OK HTTP/1.1 200 OK - 696 bytes in 0.054 seconds [09:29:03] New review: Platonides; "Perfectly fine." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/32350 [09:29:12] PROBLEM - Varnish HTTP mobile-frontend on cp1041 is CRITICAL: Connection refused [09:29:34] the varnish cronjob on cp3001 is spamming once a minute [09:30:22] it looks for this: /var/lib/varnish/cp3001/_.vsm [09:30:29] it's because varnish is not running [09:38:52] i should kill it [09:38:52] thanks [09:38:53] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:38:54] morning [09:42:23] PROBLEM - HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:42:23] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [09:42:23] RECOVERY - HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 1.926 seconds [09:44:14] bugzilla back? [09:44:14] not quite [09:44:14] apergos: can you fix it? [09:46:44] http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20pmtpa&h=kaulen.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [09:46:50] something running on kaulen :] [09:48:05] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [09:48:59] RECOVERY - Varnish HTTP mobile-frontend on cp1041 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds [09:49:08] !log Killed show_bug.cgi processes on kaulen [09:49:13] thanks mark! [09:49:16] Logged the message, Master [09:50:13] thanks [09:52:14] sorry, I was in another window looking at cronspam [09:52:17] and despairing [09:59:05] apergos: mark: do you happen to know how to add someone in an LDAP group? krinkle is not in the LDAP 'wmf' group. [09:59:18] what would it takes to add him there ? [09:59:40] I think Chad took care of most of those? [10:01:27] I would fix it if I knew how to do it ;-D [10:02:32] I need it in order to operate Jenkins from the web interface. I used to do in, but since I got ssh access I somehow lost it (I'm in mortals and integration (for fenari and gallium respectively)). So I can ssh into the jenkins server, but when logging in with ldap from the web, I can't do anything since that one uses thew wmf group [10:46:24] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32350 [10:53:17] !log reedy synchronized php-1.21wmf3/extensions/ProofreadPage/ [10:53:26] oh good [10:53:31] Reedy: I think there were two changes? [10:53:38] Yup [10:53:41] I can only do one at once ;) [10:53:43] ok nm if you're aware [10:53:51] !log reedy synchronized wmf-config/InitialiseSettings.php [10:53:54] heh, thanks for checking [10:54:00] sure [10:57:25] Logged the message, Master [10:57:30] Logged the message, Master [10:59:15] thx Reedy [11:03:34] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [12:05:18] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [12:17:37] PROBLEM - Apache HTTP on mw57 is CRITICAL: Connection refused [12:20:09] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.005 second response time on port 11000 [12:22:33] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.055 second response time [12:23:56] !log Reinstalling sq70 with Precise [12:24:02] Logged the message, Master [12:28:51] RECOVERY - Host sq70 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms [12:33:50] PROBLEM - Varnish HTTP bits on sq70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:41:20] RECOVERY - Varnish HTTP bits on sq70 is OK: HTTP OK HTTP/1.1 200 OK - 630 bytes in 0.002 seconds [12:46:37] ciao [12:46:45] !list [12:46:46] we have a mailing list labs-l@lists.wikimedia.org feel free to send a message there, don't forget to subscribe [12:47:25] ciao [12:47:40] pippone: su freenode non si scarica niente [12:47:40] !list [12:47:40] we have a mailing list labs-l@lists.wikimedia.org feel free to send a message there, don't forget to subscribe [12:47:45] pippone: ripeto [12:47:51] su freenode non si scarica niente [12:48:01] quindi in qualsiasi canale tu entrerai NON troverai niente [12:48:13] a margine ha poco senso provarci in canali specifici di progetto [12:48:23] e te l'ho detto un sacco di volte mi sa [12:50:01] boring people [12:52:26] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [12:55:26] New patchset: Matthias Mullie; "Increase abusefilter emergency shutdown threshold for feedback to 20%" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32362 [13:51:17] mark: remember how you were saying that varnish's partman is different than anything else? [13:51:30] I'm considering merging ms-be-with-ssd.cfg with varnish, they're about the same [13:51:43] the only difference is that ms-be reserves 60GB for /, while varnish only 10GB [13:51:58] since there was a deliberate decision to keep logs on the swift boxes that I still don't understand [13:54:29] how can you merge them then? [13:55:09] add logrotate and stop keeping dozens of gigabytes of useless logs? [13:55:18] ok [13:55:44] were you afraid that I was going to switch varnish to 60gb-sized / ? :) [13:57:54] yes [13:58:05] hehe [13:59:38] not that crazy yet [14:00:14] New patchset: Faidon; "partman: separate common settings and be DRY" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32366 [14:00:33] when you have a minute, want to have a look at that and give me your opinion? [14:01:05] New patchset: Faidon; "Remove generic::apt::pin-package and its callsites" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32367 [14:01:05] New patchset: Faidon; "Remove apt::ppa-req and apt::key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32368 [14:01:05] New patchset: Faidon; "Initial attempt for an apt module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32369 [14:01:21] also this, mainly the latest one :) [14:02:31] now now [14:02:37] a high maintenance employee today, eh [14:02:42] ;-) [14:07:08] make sure that the installer doesn't automatically remove partitions/lvm volumes/whatever when there's no partman recipe defined [14:07:12] (but partman/common is, now) [14:10:42] hm, that's a very good point [14:11:25] as it is it'll remove everything I think [14:11:42] not sure how d-i would behave if remove is defined but no new recipe is, but it's not worth risking it [14:12:04] do we also have cases where we reformat boxes but keep e.g. LVM on a different disk? [14:13:30] yes, when we manually reinstall without a recipe [14:13:42] it's rare but it happens [14:13:54] !log LocalisationUpdate completed (1.21wmf3) at Thu Nov 8 14:13:54 UTC 2012 [14:14:01] Logged the message, Master [14:14:11] but we never do that with a recipe, right? [14:15:02] I mean, reformat e.g. sda with the recipe but keep sdb as-is [14:15:24] nah that's too risky [14:15:35] i mean, we do it with squids [14:15:42] when the partitioning remains the same, sometimes the cache data is still in tact [14:15:52] and if it's a quick reinstall that's fine and squid will reuse it after reinstall [14:15:54] but that's about all [14:17:18] nod [14:17:39] great [14:17:52] thanks, that was very useful input [14:18:12] completely missed the case of a non-partman install [14:18:43] i'll look at the others later [14:18:49] (not sure if you heard, the first 720xd arrived in pmtpa yesterday) [14:18:59] (but not its iDRAC license, sigh) [14:21:23] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [14:21:23] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [14:21:23] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [14:21:31] ah [14:21:41] ok [14:21:45] grr [14:21:51] i can't figure out what this thread pileup is [14:34:09] :/ [14:41:07] ah~. [14:42:45] !log LocalisationUpdate completed (1.21wmf3) at Thu Nov 8 14:42:45 UTC 2012 [14:42:50] Logged the message, Master [14:45:27] hm? [14:46:39] nothing [14:47:19] hashar: around? [14:47:28] I am paravoid :-) [14:47:43] no you're not [14:47:56] New review: Faidon; "Mark rightfully commented that we shouldn't remove LVM/md if no partman recipe is defined. Will fix." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/32366 [14:47:56] ,,, [14:48:02] repaired that comma key [14:48:19] hashar: we don't get verified on gerrit puppet pushes [14:48:26] I've been told this has been moved to jenkins [14:48:31] so thought you might know something about it [14:48:40] ^demon migrated it indeed [14:48:44] the gerrit hook have been removed [14:48:50] yeah [14:48:55] it worked [14:48:59] now it doesn't though [14:49:00] the jenkins job is at https://integration.mediawiki.org/ci/job/operations-puppet/ [14:49:08] maybe there is a connectivity issue between jenkins and gerrit [14:49:39] 17 hrs ago it says [14:49:41] <^demon> Why do I have to keep repeating this? [14:49:47] <^demon> It's not a connection issue. [14:49:49] <^demon> It's not precise. [14:49:54] <^demon> Gerrit is flapping. [14:49:55] I pushed stuff almost an hour ago [14:50:16] <^demon> If gerrit was down when Jenkins tried polling, it gives up. [14:50:17] wouldn't jenkins retry if it got a 500 from gerrit? [14:50:23] <^demon> You would think? [14:50:29] I would :) [14:50:39] apparently it does not :( [14:50:39] <^demon> It gets a 503. [14:50:46] <^demon> Jenkins should retry on that. [14:51:02] <^demon> (Seeing as it's usually back up within ~10s) [14:51:56] ok, so it doesn't retry automatically (it'd be nice to fix that) [14:51:59] how can I poke it manually? [14:52:37] so you could go to https://integration.mediawiki.org/ci/gerrit_manual_trigger/ [14:52:53] login? labs? [14:52:57] then search for puppet change which have not been verified : project:operations/puppet is:open -verified=1 -verified=-1 [14:53:00] yeah labs credentials [14:53:16] <^demon> Triggered all of them. [14:53:25] you are ruining the fun ;-] [14:53:36] haha [14:53:43] found how though [14:53:48] so that's the important part [14:53:49] thanks to both of you [14:54:08] <^demon> You're welcome :) [14:54:13] * ^demon goes back to doing laundry [15:10:20] New patchset: Hashar; "puppet module for nodejs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32372 [15:10:20] New patchset: Hashar; "Node module gruntjs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32373 [15:15:24] i've broken bits like 20 times today ;) [15:18:37] mark: you might consider using 'beta' has a staging area :_D [15:19:30] no that's boring [15:22:53] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 1.78 ms [15:26:56] PROBLEM - SSH on ms-be6 is CRITICAL: Connection refused [15:34:59] RECOVERY - SSH on ms-be6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:39:54] mark: paravoid: anyone up for some review of a change I have tested out in labs already ? ;-D [15:40:02] yeah [15:40:06] I wrote a very basic puppet module that provides nodejs [15:40:10] https://gerrit.wikimedia.org/r/#/c/32372/ [15:40:34] and then amend that module to provide the grunt npm module https://gerrit.wikimedia.org/r/#/c/32373/ [15:40:41] RECOVERY - Puppet freshness on ms-be6 is OK: puppet ran at Thu Nov 8 15:40:35 UTC 2012 [15:40:59] which comes from a WMF hosted git repository instead of the npm repo (since we are not going to trust npm repository just like we don't trust gem or pip ;-D) [15:41:06] been working on that with Timo [15:45:00] npm? uuuuuuughhhh [15:45:06] paravoid: and timo just told to me the grunt nodejs module could be made an independent puppet module named simply "grunt" instead of "nodejs::grunt" [15:45:07] hehe [15:45:47] paravoid: so what we did is that we installed the software locally and ran npm dependencies installer locally. Then git added everything in a dedicated WMF repository integration/gruntjs [15:46:09] this way we just have to git::clone it and don't need npm nor any request to be sent to some un trusted third party [15:46:14] PROBLEM - NTP on ms-be6 is CRITICAL: NTP CRITICAL: Offset unknown [15:46:16] an own hosted clone, like we do for gerrit (operations/debs/gerrit) [15:46:32] nak [15:46:32] package this [15:46:52] this is a typical case for things that we package [15:47:01] I don't see how it's any different from all the rest [15:47:18] and why you should git::clone it from a trivial module [15:47:42] Its hosted on Gerrit [15:47:58] so should we create a package like wikimedia-gruntjs wich would include all dependencies ? [15:48:14] isn't gruntjs a generic software? [15:48:16] can be in puppet with a 1-liner gitclone [15:49:10] paravoid: that is a node.js module which provides a CLI command [15:49:11] Krinkle: the same can be said for many things, you really don't want puppet to deploy everything everywhere [15:49:51] if we package this it would still be puppet deploying this..? [15:50:26] RECOVERY - swift-object-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [15:50:35] RECOVERY - swift-container-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:50:35] RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [15:50:44] RECOVERY - swift-account-reaper on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:50:52] apergos: ^ [15:50:52] isn't grunt-js something very generic? [15:50:59] oh! [15:51:00] wow [15:51:02] RECOVERY - swift-object-server on ms-be6 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [15:51:02] RECOVERY - swift-container-server on ms-be6 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [15:51:09] apergos: WARNING: don't build ring files on ms-fe* [15:51:22] this is VERY important [15:51:31] RECOVERY - swift-object-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [15:51:31] RECOVERY - swift-container-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [15:51:38] RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [15:51:56] RECOVERY - swift-account-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [15:51:58] cmjohnson1: you even provisioned it for us. thanks!!! [15:52:05] RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [15:52:08] yw [15:53:04] * apergos looks at the scrollback [15:53:09] fyi: the license is only a 30 day license....i have yet to get the new licenses from Dell...i can update on the mgmt's CMC w/out downtime once I get it [15:53:27] iDRAC license [15:53:38] oh wow, so it's all installed? it's running precise I guess? [15:53:47] that's excellent! [15:53:58] yes [15:54:23] thefront ens ms-fe are still version 1.7 or something right while the back end are 1.5? [15:54:27] yes [15:54:37] I want to do a 1.7 upgrade on backends too [15:54:47] I hadn't planned to build any files now anyways, figured you might have something to say about actual deployment [15:54:47] but let's progress on the hardware front a bit first [15:55:06] not really [15:55:51] !log starting ms-be3 {object,container,account}-server [15:55:57] Logged the message, Master [15:56:44] RECOVERY - swift-container-server on ms-be3 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [15:57:11] RECOVERY - swift-object-server on ms-be3 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [15:57:20] RECOVERY - swift-account-server on ms-be3 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [16:02:34] so we need to confirm whether this r720xd works well with swift before they send the rest or something? [16:03:23] that's the plan [16:03:23] the ssds are on /dev/sdm and /dev/sdn and they show up as scsi 0:0:12:0: and scsi 0:0:13:0: ( cmjohnson1 ) [16:03:48] how are they cabled up? [16:04:49]