[00:00:03] RECOVERY - Host palladium is UP: PING OK - Packet loss = 0%, RTA = 26.66 ms [00:00:04] do we have a means of saying "this IP address can go this much more"? [00:00:57] I don't remember fielding that type of request before, and I suspect that we'll probably ask that whoever this is start using data dumps instead of hammering the cluster, but I thought I'd ask what the process is for requesting this [00:03:48] PROBLEM - SSH on palladium is CRITICAL: Connection refused [00:04:10] PROBLEM - Varnish HTTP bits on palladium is CRITICAL: Connection refused [00:04:29] ah...wait, I see, I think I found the answer. this is probably just an admin request to be put in the right group [00:04:32] robla: The API doesn't enforce per-IP-per-minute request limits, at least not on the MW side. Are they talking about rate limits on editing perhaps? [00:04:39] Or limits re how many items per request? [00:06:08] I imagine items per request. BTW, this is the requestor: http://meta.wikimedia.org/wiki/Research:Non-finite_Processes_in_Human_Social_Phenomena [00:06:45] I think they want the apihighlimits permission [00:07:13] * robla looks on enwiki for what group that translates to [00:08:14] ah....they just need to be added to the "Researchers" group on enwiki, or whatever [00:08:24] Special permissions for research purposes (including high API request limits or access to non-public data) can be granted by the Research Committee on a temporary basis. [00:08:26] NOTE [00:08:37] The 'researchers' group on enwiki also grants access to PRIVATE DATA [00:08:46] Like certain parts of deleted history metadata or something [00:08:46] oh, there's that.... [00:09:05] I originally created that group specifically for [[User:DarTar]], it's funny that he's running it now :) [00:09:40] paravoid: ERROR with Object server 10.0.6.202:6000/sda4 re: Trying to GET /v1/AUTH_43651b15-ed7a-40b6-b745-47666abf8dfe/global-data-math-render.60/6/0/8/608b3cd307410c615ee3af9df0e46da9.png: ConnectionTimeout (0.5s) (txn: txb7ded627e92c49e694619c86f4337194) (client_ip: 69.115.255.211) [00:09:50] Bots would also work for this purpose [00:10:28] maybe that explains those annoying weird error log entries [00:10:50] thought you'd think MW would get at least 50x back though [00:10:57] what? [00:11:13] paravoid: swift-backend.log in fluorine [00:12:54] actually I see lots of these for thumbs too, so that probably isn't it [00:13:05] that's ms-be3 and it seems unreachable [00:13:08] and nagios hasn't noticed [00:13:14] wtf [00:13:30] tail -n 1000 swift | grep 'proxy-server' | grep ' ERROR ' [00:13:42] RECOVERY - SSH on palladium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:16:35] sigh [00:20:50] nagios didn't think you wanted to be bothered by that [00:21:24] New patchset: Dzahn; "enable nl, sv, ru wikivoyage in Apache" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/32329 [00:22:17] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/32329 [00:23:00] robla: would like to make another change now and graceful Apaches etc [00:24:30] PROBLEM - Host palladium is DOWN: PING CRITICAL - Packet loss = 100% [00:24:48] mutante: just check here before doing that sort of thing: http://wikitech.wikimedia.org/view/Deployments [00:25:33] powercycling it, although something tells me it won't come up [00:26:10] !log powercycling ms-be3 [00:26:10] paravoid: how are the replacements going anyway? 
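A note on the apihighlimits question above: the permission does not change any per-IP rate limit, it only raises the per-request item caps (MediaWiki's defaults are 500 items for normal accounts and 5000 for holders of apihighlimits such as bots or the enwiki 'researchers' group). A rough way to see which cap applies from the command line; the limit values and the aplimit=max behaviour are assumed MediaWiki defaults, not something stated in this log:

```
# Ask for more items than a normal account is allowed; the API answers with
# a warning and truncates the list.  With apihighlimits the same request is
# honoured up to 5000.  (Limit values are assumed MediaWiki defaults.)
curl -s 'https://en.wikipedia.org/w/api.php?action=query&list=allpages&aplimit=5000&format=json' | head -c 300

# 'max' means "whatever my rights allow" -- 500 normally, 5000 with
# apihighlimits -- so the size of the returned batch shows which limit applies.
curl -s 'https://en.wikipedia.org/w/api.php?action=query&list=allpages&aplimit=max&format=json' | head -c 300
```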
[00:26:12] Logged the message, Master [00:26:26] dzahn is doing a graceful restart of all apaches [00:26:43] AaronSchulz: the first one arrived today [00:26:46] !log dzahn gracefulled all apaches [00:26:51] but not its iDRAC license [00:26:53] Logged the message, Master [00:27:21] RECOVERY - Host palladium is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [00:28:15] oh wow [00:28:17] it's up [00:28:18] amazing [00:28:33] RECOVERY - Host ms-be3 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [00:30:03] AaronSchulz: you'll have to bear with those errors for a few more hours [00:30:17] !log aaron synchronized php-1.21wmf3/extensions/SwiftCloudFiles/php-cloudfiles-wmf/cloudfiles.php [00:30:21] Logged the message, Master [00:30:21] I'm going to stop it from serving for a while [00:30:40] leave it run rsyncs and start it up properly tomorrow noon/afternoon my time [00:31:42] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [00:33:21] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.002 second response time on port 11000 [00:34:24] !log starting memcached on virt0 [00:34:28] Logged the message, Master [00:34:34] New patchset: Reedy; "Add sv, nl and ru wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32330 [00:34:43] !log temporarily stopping {account,container,object}-server on ms-be3 while it syncs up from its downtime [00:34:52] Logged the message, Master [00:34:55] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32330 [00:35:26] !log aaron synchronized php-1.21wmf3/extensions/SwiftCloudFiles/php-cloudfiles-wmf/cloudfiles.php [00:35:34] Logged the message, Master [00:36:10] New patchset: Reedy; "Revert "Add sv, nl and ru wikivoyage"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32331 [00:36:18] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32331 [00:37:07] !log reedy synchronized all.dblist [00:37:13] Logged the message, Master [00:37:42] PROBLEM - swift-account-server on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [00:38:19] PROBLEM - swift-object-server on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [00:38:27] PROBLEM - swift-container-server on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [00:38:47] RECOVERY - Varnish HTTP bits on palladium is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.053 seconds [00:39:03] !log aaron synchronized php-1.21wmf3/extensions/SwiftCloudFiles/php-cloudfiles-wmf/cloudfiles.php 'removed debug logging' [00:39:09] Logged the message, Master [00:40:09] !log wikivoyage imports: "fr" - done [00:40:13] Logged the message, Master [00:41:01] paravoid: actually that seemed to have fixed the swift-backend.log errors [00:42:27] I'm sure there are going to be more errors until I start the servers [00:42:29] thanks btw for spotting it [00:42:42] !log repooling palladium [00:42:43] oh I still see 10.0.6.202:6000 in the "swift" log though [00:42:48] Logged the message, notpeter [00:42:49] just not the other one [00:43:12] yeah [00:43:27] that one has been driving me crazy for weeks [00:49:30] binasher: eqiad bits caches upgraded as requested [00:49:52] thank you! 
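The "temporarily stopping {account,container,object}-server on ms-be3 while it syncs up" step above might look roughly like the sketch below. It assumes the stock swift-init helper that ships with Swift, and adds a puppet-disable step as an assumption, since (as comes up later in this log) puppet enforces ensure => running and would otherwise just start the daemons again:

```
# Take ms-be3's storage services out of rotation while it catches up.
puppet agent --disable          # assumption: stop puppet from restarting the daemons
for svc in account-server container-server object-server; do
    swift-init "$svc" stop
done

# ...once the rsyncs/replicators have caught up, bring it all back:
for svc in account-server container-server object-server; do
    swift-init "$svc" start
done
puppet agent --enable
```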
[00:52:34] !log reedy synchronized wmf-config/ [00:52:40] Logged the message, Master [00:54:17] WebVideoTranscode::updateJobQueue 10.0.6.41 1213 Deadlock found when trying to get lock; try restarting transaction (10.0.6.41) DELETE FROM `transcode` WHERE transcode_image_name = 'ComputerHotline_-_Vidéo_de_la_Lune_(survol)_(by).OGG' AND transcode_key = '360p.webm' AND (transcode_id != 29660) [00:54:46] * AaronSchulz wonders how much activity happens with that table [00:56:56] lots? [01:00:57] paravoid: heh, the quick batch write function for MW saw a ~100ms latency drop after your change [01:01:21] wha?! [01:02:10] * AaronSchulz is looking at graphite [01:04:14] actually over 120ms [01:04:14] RECOVERY - swift-container-server on ms-be3 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [01:04:14] RECOVERY - swift-object-server on ms-be3 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [01:04:14] RECOVERY - swift-account-server on ms-be3 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [01:04:28] paravoid: are you doing something again? [01:04:32] no, cron is [01:04:34] heh [01:04:42] ahh [01:04:42] forgot that we have swift restart in cron [01:04:57] was that to work around leaks? [01:06:52] hah, no that was not it [01:06:54] it was puppet :) [01:07:00] we have ensure => running [01:07:04] or that [01:10:45] PROBLEM - swift-account-server on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [01:12:00] PROBLEM - swift-object-server on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [01:12:15] PROBLEM - swift-container-server on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [01:42:15] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 278 seconds [01:47:12] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [02:00:24] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 272 seconds [02:00:33] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 281 seconds [02:00:45] !log LocalisationUpdate failed: git pull of extensions failed [02:00:49] Logged the message, Master [02:01:14] :O [02:01:16] Reedy: ---^^ [02:51:07] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Thu Nov 8 02:50:52 UTC 2012 [02:55:20] PROBLEM - SSH on ms1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:03:15] !log aaron synchronized php-1.21wmf3/includes/objectcache/MemcachedPeclBagOStuff.php 'deployed efda0b27befc28ec13dae23f48e8baea42f4a7f3' [03:03:19] Logged the message, Master [03:54:24] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [04:04:26] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 20 seconds [04:20:37] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [04:20:37] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [04:20:37] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [04:24:26] hrmmm, akapoor is still silent [05:39:06] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [05:40:01] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out [05:41:30] RECOVERY - Lucene on search1016 is 
OK: TCP OK - 9.025 second response time on port 8123 [05:42:09] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [06:03:26] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out [06:16:36] RECOVERY - Lucene on search1016 is OK: TCP OK - 9.033 second response time on port 8123 [06:24:50] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:26:17] RECOVERY - LVS Lucene on search-pool1.svc.eqiad.wmnet is OK: TCP OK - 9.022 second response time on port 8123 [06:26:29] !log restarted lucene search on searchpool1016 [06:26:35] Logged the message, Master [06:31:06] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [07:16:46] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [07:20:03] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [07:20:18] !log restarted lucene search on search1015 [07:20:20] Logged the message, Master [07:32:36] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours [08:02:36] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [08:02:36] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [08:02:36] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [08:50:00] PROBLEM - Varnish HTTP mobile-backend on cp1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:50:45] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:51:23] PROBLEM - Varnish HTCP daemon on cp1041 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:54:01] http://frr.wikipedia.org/ is reported to be down (see -tech) since sometime yesterday - perhaps after an update yesterday [09:04:36] https://bugzilla.wikimedia.org/show_bug.cgi?id=41872 [09:06:21] New patchset: Tpt; "(bug 41872) Configure page and index namespaces for frr wiki." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32350 [09:07:11] Merlissimo: ^ [09:07:39] Oh, you saw. [09:07:40] :) [09:07:49] I think. [09:08:49] Thx RD [09:11:48] RD: i am still wondering why test2.wikipedia.org is ok, because it also uses proofreadpage and has this config missing, too [09:15:21] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [09:18:19] unfortunately I don't know how that extensino works (or I would just +2 the changes and push themout directly) [09:20:42] [10243032.208628] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250) [09:20:56] oh joy [09:21:12] never seen that xfs whine before [09:21:24] !log Power cycled cp1041 [09:21:28] Logged the message, Master [09:24:59] RECOVERY - Varnish HTCP daemon on cp1041 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [09:28:45] RECOVERY - Varnish HTTP mobile-backend on cp1041 is OK: HTTP OK HTTP/1.1 200 OK - 696 bytes in 0.054 seconds [09:29:03] New review: Platonides; "Perfectly fine." 
[operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/32350 [09:29:12] PROBLEM - Varnish HTTP mobile-frontend on cp1041 is CRITICAL: Connection refused [09:29:34] the varnish cronjob on cp3001 is spamming once a minute [09:30:22] it looks for this: /var/lib/varnish/cp3001/_.vsm [09:30:29] it's because varnish is not running [09:38:52] i should kill it [09:38:52] thanks [09:38:53] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:38:54] morning [09:42:23] PROBLEM - HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:42:23] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [09:42:23] RECOVERY - HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 1.926 seconds [09:44:14] bugzilla back? [09:44:14] not quite [09:44:14] apergos: can you fix it? [09:46:44] http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20pmtpa&h=kaulen.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [09:46:50] something running on kaulen :] [09:48:05] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [09:48:59] RECOVERY - Varnish HTTP mobile-frontend on cp1041 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds [09:49:08] !log Killed show_bug.cgi processes on kaulen [09:49:13] thanks mark! [09:49:16] Logged the message, Master [09:50:13] thanks [09:52:14] sorry, I was in another window looking at cronspam [09:52:17] and despairing [09:59:05] apergos: mark: do you happen to know how to add someone in an LDAP group? krinkle is not in the LDAP 'wmf' group. [09:59:18] what would it takes to add him there ? [09:59:40] I think Chad took care of most of those? [10:01:27] I would fix it if I knew how to do it ;-D [10:02:32] I need it in order to operate Jenkins from the web interface. I used to do in, but since I got ssh access I somehow lost it (I'm in mortals and integration (for fenari and gallium respectively)). So I can ssh into the jenkins server, but when logging in with ldap from the web, I can't do anything since that one uses thew wmf group [10:46:24] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32350 [10:53:17] !log reedy synchronized php-1.21wmf3/extensions/ProofreadPage/ [10:53:26] oh good [10:53:31] Reedy: I think there were two changes? 
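The kaulen recovery above ("Killed show_bug.cgi processes on kaulen") amounts to finding the stuck Bugzilla CGI processes that were pegging the CPU in the ganglia graph and killing them. A sketch; the process name comes from the !log line, everything else is illustrative:

```
# Find and kill the runaway Bugzilla CGI processes on kaulen.
ps -eo pid,pcpu,etime,args --sort=-pcpu | grep '[s]how_bug.cgi' | head
pkill -f show_bug.cgi           # escalate to pkill -9 -f if they ignore SIGTERM
```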
[10:53:38] Yup [10:53:41] I can only do one at once ;) [10:53:43] ok nm if you're aware [10:53:51] !log reedy synchronized wmf-config/InitialiseSettings.php [10:53:54] heh, thanks for checking [10:54:00] sure [10:57:25] Logged the message, Master [10:57:30] Logged the message, Master [10:59:15] thx Reedy [11:03:34] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [12:05:18] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [12:17:37] PROBLEM - Apache HTTP on mw57 is CRITICAL: Connection refused [12:20:09] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.005 second response time on port 11000 [12:22:33] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.055 second response time [12:23:56] !log Reinstalling sq70 with Precise [12:24:02] Logged the message, Master [12:28:51] RECOVERY - Host sq70 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms [12:33:50] PROBLEM - Varnish HTTP bits on sq70 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:41:20] RECOVERY - Varnish HTTP bits on sq70 is OK: HTTP OK HTTP/1.1 200 OK - 630 bytes in 0.002 seconds [12:46:37] hi [12:46:45] !list [12:46:46] we have a mailing list labs-l@lists.wikimedia.org feel free to send a message there, don't forget to subscribe [12:47:25] hi [12:47:40] pippone: nothing gets downloaded on freenode [12:47:40] !list [12:47:40] we have a mailing list labs-l@lists.wikimedia.org feel free to send a message there, don't forget to subscribe [12:47:45] pippone: I repeat [12:47:51] nothing gets downloaded on freenode [12:48:01] so whatever channel you join, you will NOT find anything [12:48:13] besides, there's little point in trying project-specific channels [12:48:23] and I think I've told you that plenty of times already [12:50:01] boring people [12:52:26] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [12:55:26] New patchset: Matthias Mullie; "Increase abusefilter emergency shutdown threshold for feedback to 20%" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32362 [13:51:17] mark: remember how you were saying that varnish's partman is different than anything else? [13:51:30] I'm considering merging ms-be-with-ssd.cfg with varnish, they're about the same [13:51:43] the only difference is that ms-be reserves 60GB for /, while varnish only 10GB [13:51:58] since there was a deliberate decision to keep logs on the swift boxes that I still don't understand [13:54:29] how can you merge them then? [13:55:09] add logrotate and stop keeping dozens of gigabytes of useless logs? [13:55:18] ok [13:55:44] were you afraid that I was going to switch varnish to 60gb-sized / ? :) [13:57:54] yes [13:58:05] hehe [13:59:38] not that crazy yet
[14:01:05] New patchset: Faidon; "Remove generic::apt::pin-package and its callsites" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32367 [14:01:05] New patchset: Faidon; "Remove apt::ppa-req and apt::key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32368 [14:01:05] New patchset: Faidon; "Initial attempt for an apt module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32369 [14:01:21] also this, mainly the latest one :) [14:02:31] now now [14:02:37] a high maintenance employee today, eh [14:02:42] ;-) [14:07:08] make sure that the installer doesn't automatically remove partitions/lvm volumes/whatever when there's no partman recipe defined [14:07:12] (but partman/common is, now) [14:10:42] hm, that's a very good point [14:11:25] as it is it'll remove everything I think [14:11:42] not sure how d-i would behave if remove is defined but no new recipe is, but it's not worth risking it [14:12:04] do we also have cases where we reformat boxes but keep e.g. LVM on a different disk? [14:13:30] yes, when we manually reinstall without a recipe [14:13:42] it's rare but it happens [14:13:54] !log LocalisationUpdate completed (1.21wmf3) at Thu Nov 8 14:13:54 UTC 2012 [14:14:01] Logged the message, Master [14:14:11] but we never do that with a recipe, right? [14:15:02] I mean, reformat e.g. sda with the recipe but keep sdb as-is [14:15:24] nah that's too risky [14:15:35] i mean, we do it with squids [14:15:42] when the partitioning remains the same, sometimes the cache data is still in tact [14:15:52] and if it's a quick reinstall that's fine and squid will reuse it after reinstall [14:15:54] but that's about all [14:17:18] nod [14:17:39] great [14:17:52] thanks, that was very useful input [14:18:12] completely missed the case of a non-partman install [14:18:43] i'll look at the others later [14:18:49] (not sure if you heard, the first 720xd arrived in pmtpa yesterday) [14:18:59] (but not its iDRAC license, sigh) [14:21:23] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [14:21:23] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [14:21:23] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [14:21:31] ah [14:21:41] ok [14:21:45] grr [14:21:51] i can't figure out what this thread pileup is [14:34:09] :/ [14:41:07] ah~. [14:42:45] !log LocalisationUpdate completed (1.21wmf3) at Thu Nov 8 14:42:45 UTC 2012 [14:42:50] Logged the message, Master [14:45:27] hm? [14:46:39] nothing [14:47:19] hashar: around? [14:47:28] I am paravoid :-) [14:47:43] no you're not [14:47:56] New review: Faidon; "Mark rightfully commented that we shouldn't remove LVM/md if no partman recipe is defined. Will fix." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/32366 [14:47:56] ,,, [14:48:02] repaired that comma key [14:48:19] hashar: we don't get verified on gerrit puppet pushes [14:48:26] I've been told this has been moved to jenkins [14:48:31] so thought you might know something about it [14:48:40] ^demon migrated it indeed [14:48:44] the gerrit hook have been removed [14:48:50] yeah [14:48:55] it worked [14:48:59] now it doesn't though [14:49:00] the jenkins job is at https://integration.mediawiki.org/ci/job/operations-puppet/ [14:49:08] maybe there is a connectivity issue between jenkins and gerrit [14:49:39] 17 hrs ago it says [14:49:41] <^demon> Why do I have to keep repeating this? 
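The backlog of puppet changes that never got a Verified vote, which is what this Jenkins exchange is about, can also be listed from the command line through Gerrit's SSH query interface, using the same search expression that gets pasted into the Jenkins manual-trigger page a little further down. A sketch, assuming an account with SSH access to Gerrit on port 29418:

```
# List the open operations/puppet changes that Jenkins never verified,
# via Gerrit's SSH query API.  USERNAME is whatever your gerrit account is.
ssh -p 29418 USERNAME@gerrit.wikimedia.org \
    gerrit query --format=TEXT \
    'project:operations/puppet is:open -verified=1 -verified=-1'
```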
[14:49:47] <^demon> It's not a connection issue. [14:49:49] <^demon> It's not precise. [14:49:54] <^demon> Gerrit is flapping. [14:49:55] I pushed stuff almost an hour ago [14:50:16] <^demon> If gerrit was down when Jenkins tried polling, it gives up. [14:50:17] wouldn't jenkins retry if it got a 500 from gerrit? [14:50:23] <^demon> You would think? [14:50:29] I would :) [14:50:39] apparently it does not :( [14:50:39] <^demon> It gets a 503. [14:50:46] <^demon> Jenkins should retry on that. [14:51:02] <^demon> (Seeing as it's usually back up within ~10s) [14:51:56] ok, so it doesn't retry automatically (it'd be nice to fix that) [14:51:59] how can I poke it manually? [14:52:37] so you could go to https://integration.mediawiki.org/ci/gerrit_manual_trigger/ [14:52:53] login? labs? [14:52:57] then search for puppet change which have not been verified : project:operations/puppet is:open -verified=1 -verified=-1 [14:53:00] yeah labs credentials [14:53:16] <^demon> Triggered all of them. [14:53:25] you are ruining the fun ;-] [14:53:36] haha [14:53:43] found how though [14:53:48] so that's the important part [14:53:49] thanks to both of you [14:54:08] <^demon> You're welcome :) [14:54:13] * ^demon goes back to doing laundry [15:10:20] New patchset: Hashar; "puppet module for nodejs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32372 [15:10:20] New patchset: Hashar; "Node module gruntjs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32373 [15:15:24] i've broken bits like 20 times today ;) [15:18:37] mark: you might consider using 'beta' has a staging area :_D [15:19:30] no that's boring [15:22:53] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 1.78 ms [15:26:56] PROBLEM - SSH on ms-be6 is CRITICAL: Connection refused [15:34:59] RECOVERY - SSH on ms-be6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:39:54] mark: paravoid: anyone up for some review of a change I have tested out in labs already ? ;-D [15:40:02] yeah [15:40:06] I wrote a very basic puppet module that provides nodejs [15:40:10] https://gerrit.wikimedia.org/r/#/c/32372/ [15:40:34] and then amend that module to provide the grunt npm module https://gerrit.wikimedia.org/r/#/c/32373/ [15:40:41] RECOVERY - Puppet freshness on ms-be6 is OK: puppet ran at Thu Nov 8 15:40:35 UTC 2012 [15:40:59] which comes from a WMF hosted git repository instead of the npm repo (since we are not going to trust npm repository just like we don't trust gem or pip ;-D) [15:41:06] been working on that with Timo [15:45:00] npm? uuuuuuughhhh [15:45:06] paravoid: and timo just told to me the grunt nodejs module could be made an independent puppet module named simply "grunt" instead of "nodejs::grunt" [15:45:07] hehe [15:45:47] paravoid: so what we did is that we installed the software locally and ran npm dependencies installer locally. 
Then git added everything in a dedicated WMF repository integration/gruntjs [15:46:09] this way we just have to git::clone it and don't need npm nor any request to be sent to some un trusted third party [15:46:14] PROBLEM - NTP on ms-be6 is CRITICAL: NTP CRITICAL: Offset unknown [15:46:16] an own hosted clone, like we do for gerrit (operations/debs/gerrit) [15:46:32] nak [15:46:32] package this [15:46:52] this is a typical case for things that we package [15:47:01] I don't see how it's any different from all the rest [15:47:18] and why you should git::clone it from a trivial module [15:47:42] Its hosted on Gerrit [15:47:58] so should we create a package like wikimedia-gruntjs wich would include all dependencies ? [15:48:14] isn't gruntjs a generic software? [15:48:16] can be in puppet with a 1-liner gitclone [15:49:10] paravoid: that is a node.js module which provides a CLI command [15:49:11] Krinkle: the same can be said for many things, you really don't want puppet to deploy everything everywhere [15:49:51] if we package this it would still be puppet deploying this..? [15:50:26] RECOVERY - swift-object-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [15:50:35] RECOVERY - swift-container-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:50:35] RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [15:50:44] RECOVERY - swift-account-reaper on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:50:52] apergos: ^ [15:50:52] isn't grunt-js something very generic? [15:50:59] oh! [15:51:00] wow [15:51:02] RECOVERY - swift-object-server on ms-be6 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [15:51:02] RECOVERY - swift-container-server on ms-be6 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [15:51:09] apergos: WARNING: don't build ring files on ms-fe* [15:51:22] this is VERY important [15:51:31] RECOVERY - swift-object-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [15:51:31] RECOVERY - swift-container-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [15:51:38] RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [15:51:56] RECOVERY - swift-account-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [15:51:58] cmjohnson1: you even provisioned it for us. thanks!!! [15:52:05] RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [15:52:08] yw [15:53:04] * apergos looks at the scrollback [15:53:09] fyi: the license is only a 30 day license....i have yet to get the new licenses from Dell...i can update on the mgmt's CMC w/out downtime once I get it [15:53:27] iDRAC license [15:53:38] oh wow, so it's all installed? it's running precise I guess? [15:53:47] that's excellent! [15:53:58] yes [15:54:23] thefront ens ms-fe are still version 1.7 or something right while the back end are 1.5? 
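The workflow hashar describes just above (run npm once to resolve grunt's dependency tree, commit the result into a WMF-hosted repository, and have production only ever git-clone it) might look roughly like the following. The integration/gruntjs repository name comes from the log; the clone URL layout and the directory names are assumptions:

```
# One-time vendoring of grunt and its npm dependency tree into the
# integration/gruntjs repo, so deployment never talks to the npm registry.
git clone ssh://USERNAME@gerrit.wikimedia.org:29418/integration/gruntjs.git
cd gruntjs
npm install grunt                       # pulls grunt plus its dependencies into node_modules/
git add node_modules
git commit -m "Import grunt and its npm dependencies for offline deployment"
git push origin HEAD:refs/for/master    # or straight to master, depending on the repo's policy
# Servers and CI then only need a plain git clone (e.g. puppet's git::clone).
```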
[15:54:27] yes [15:54:37] I want to do a 1.7 upgrade on backends too [15:54:47] I hadn't planned to build any files now anyways, figured you might have something to say about actual deployment [15:54:47] but let's progress on the hardware front a bit first [15:55:06] not really [15:55:51] !log starting ms-be3 {object,container,account}-server [15:55:57] Logged the message, Master [15:56:44] RECOVERY - swift-container-server on ms-be3 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [15:57:11] RECOVERY - swift-object-server on ms-be3 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [15:57:20] RECOVERY - swift-account-server on ms-be3 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [16:02:34] so we need to confirm whether this r720xd works well with swift before they send the rest or something? [16:03:23] that's the plan [16:03:23] the ssds are on /dev/sdm and /dev/sdn and they show up as scsi 0:0:12:0: and scsi 0:0:13:0: ( cmjohnson1 ) [16:03:48] how are they cabled up? [16:04:49] with the 720 there are 2 drive slots in the rear....so they will be 12 and 13 [16:05:34] uh oh [16:05:51] so it looks like it built the OS on 0 and 1 [16:05:56] yep [16:06:02] RECOVERY - NTP on ms-be6 is OK: NTP OK: Offset -0.02630293369 secs [16:06:08] plus the container listings would automatically go there too [16:06:24] NP...what we do is put the ssd's in 0 and 1 and populate the 2 back drives w/standard [16:06:37] sounds good [16:07:11] ok, going to have to provision again [16:07:40] hm? [16:07:45] does that mean we can't hot swap those disks? [16:08:25] paravoid: so the idea is that we should get a debian package under operations/debs/gruntjs that would provide/copy/install the node module we need? So puppet would just be all about using : [ "gruntjs" ]: ensure => latest; [16:08:54] yes, that's the idea [16:09:01] isn't gruntjs a separate piece of software? [16:09:07] is there anything wikimedia-specific to it? [16:09:19] no it is a copy from upstream [16:09:49] and grunts needs node.js + some other node modules [16:10:03] I don't understand [16:10:05] did you fork it? [16:10:09] nop [16:10:19] just copied the files from upstream [16:10:29] to avoid us having to use the npm package management system [16:10:32] that is an exact copy of upstream [16:10:32] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [16:10:41] what does grunt do? [16:11:00] it is the equivalent of ruby rake or java ant but for nodejs world [16:11:06] let you define tasks having dependencies [16:11:11] upon other tasks [16:12:04] I am wondering whereas there is a Debian helper to package Node modules [16:12:31] apergos: paravoid: i don't know what i was thinking....the new 720's have 3.5" disk...i can't move the ssd's [16:12:44] blahhh [16:12:44] maybe we should just git clone our copy in some local directory and use it directly. [16:13:01] why?! [16:13:09] they are 2.5"? don't we have adapters? [16:13:23] hashar: http://bugs.debian.org/673727 [16:13:43] request for package; see? let's package it up and then perhaps contribute it back to Debian [16:13:43] you are awesome :-] [16:13:50] no the adapters won't work for these [16:14:05] we had this talk w/Dell already [16:14:05] can we get two that will work? [16:14:19] I'm not sure I am, I'd prefer if someone else did the initial packaging attempt ;-) [16:14:20] wait so they say they don't have any? [16:14:40] that there are none in existence? 
(except icky grey market things that will be iffy)? [16:14:50] exactly [16:14:55] jesus [16:14:59] paravoid: oh I though there was already a package in debian .. [16:15:14] no, there's someone that asked for it exist [16:15:19] and then the disk may not fit correctly if they fit at all [16:15:20] ok can we ask them about disk order? maybe there are some jumpers or some *&^% thing we can change [16:15:46] hashar: try https://github.com/arikon/npm2debian it might do the trick for you [16:16:16] apergos: fix partman & puppet to install the OS in sdm/sdn [16:16:16] apergos: why do the SSDs need to be first? [16:16:24] heh [16:16:33] or even keep the OS on sda and just make sure those specific containers end up on sdm/sdn [16:16:36] apergos: we can try and move them in bios...boot from the ssd [16:16:37] drdee: ahh nice I will have a look at it [16:16:45] or that, yes [16:50:51] mark or paravoid, what would you recommend for me adding a stanza that just covers the swift 720xds? how can I do that properly? [16:50:51] I need this for swift::create_filesystem [16:50:51] so it won't blithely create on sdm and n [16:50:53] from a quick glance, it seems to be all defined in site.pp [16:50:53] $all_drives et all [16:50:53] separate sections per class of machines [16:50:53] have a look and play a bit [16:50:53] that's a different bit [16:50:53] wouldnt you just make two partman recipies? [16:50:53] so swift::createfilesystem will create on anything not in $base::platform::startup_drives [16:50:53] and then in netboot put some servers to one by name [16:50:53] I already have the partman and netboot [16:50:55] that's the easy part [16:51:28] well, we will have other 720s in use in the future [16:51:32] PROBLEM - NTP on ms-be6 is CRITICAL: NTP CRITICAL: No response from NTP server [16:51:36] and they can change, already have three versions ;] [16:51:52] (enwiki test, analytics, and new swift) [16:51:53] cmjohnson1: thx [16:53:21] I need to extend $base::platform::startup_drives some clever way so that only for the swift 720xds it lists the startup drives as /dev/sdm and /dev/sdn, but I don't know a good way to do that [16:57:00] I don't understand the startup_drives logic at all [16:57:21] ok [16:57:22] we don't do the same for squid coss disks [16:57:32] so just get rid of that I'd say [16:57:39] mark made that I think, so he may have a different opinion [16:57:46] apergos: maybe i am simplifying it too much but wouldn't you just change the current recipe to to be /dev/sdm and n instead of a and b [16:58:00] I already did that ( cmjohnson1 ) or at least I will when I commit this [16:58:17] but puppet also does this clever thing where it creates all the xfs filesystems on the rest of the drives [16:58:23] we want it not to clobber the ones with the os is all [16:58:38] ah ok [16:59:44] are we going to stop using COSS? it's been removed from squid 3.1 [16:59:53] mark: is it ok to get rid of the ! ($title in $base::platform::startup_drives) bit in swift::create_filesystem (swift.pp) ? [17:02:52] andrewbogott, i might know about the GeoIP.dat thing [17:03:02] where is puppet cloned on stafford? [17:03:20] /var/lib/git/operations/puppet [17:07:06] ottomata: thoughts? [17:09:15] hmmm, i was going to check the mod date, if it was up to date I was going to say this was because puppetmasters have a cron job to dl that file [17:09:16] but nope [17:09:25] the files are downloaded to /var/lib/puppet/volatile [17:09:28] so I don't know [17:09:40] uhhh, will respond to your email for posterity [17:10:04] thanks. 
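Before hard-wiring /dev/sdm and /dev/sdn into the partman recipe and the puppet stanza being discussed here, it is worth confirming that those two devices really are the SSDs. A small sketch using the kernel's rotational flag; note that behind a hardware RAID controller the flag can be misreported, so treat it as a hint rather than proof:

```
# Which block devices does the kernel think are SSDs?
# rotational is 0 for flash, 1 for spinning disks.
for d in /sys/block/sd*; do
    printf '%-6s rotational=%s\n' "${d##*/}" "$(cat "$d/queue/rotational")"
done
```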
[17:20:27] apergos: are you installing on be6? [17:21:12] no, I have a one on one with ct right now [17:21:58] k [17:22:10] sorry about that [17:23:05] no...it's ok..gonna get some food than [17:23:10] enjoy [17:28:00] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [17:33:44] /away [17:50:21] Change abandoned: Hashar; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32373 [17:50:32] Change abandoned: Hashar; "indeed :-]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32372 [17:52:17] so what's been going on with lucene ? [17:52:24] it's been paging me a lot every night [17:56:11] LeslieCarr: it needs work from a software engineer.... [17:56:57] no one has really put work into the codebase in about 5 years [17:58:19] has anyone volunteered to help ? [17:58:37] if not, i want to point the pagers at all the eng managers phones instead of ours unti lit is fixed [17:58:58] sadly, no [17:59:12] no one wants to take repsonsibility for this feature [17:59:18] so the only people who care are ops [17:59:23] as we have pagers [17:59:44] at this point, I would rather us just contract with google to index our stuff [17:59:52] it wouldn't be open source, but it would work and stay up! [18:00:12] haha yeah [18:00:29] I mean, I bet we wouldn't have to pay [18:00:33] do you know any of the technical details of what's going on ? [18:00:35] so at least that would be free like beer :) [18:00:40] actually google offers it for free to everyone so yeah [18:00:55] reminds of the bugzilla extension we installed to submit sitemaps to search engines, but with varying results [18:01:04] every night there's some kinda rebuild of the indexes [18:01:11] this then gets rsynced over to the frontend nodes [18:01:27] as search pool 4 is the "everything else" pool [18:01:34] it has to rsync over a lot of files [18:01:57] rsync then slows down responses to the api by enough that it's deemed "down" [18:02:11] ok [18:02:34] there are probably things that could be done, such as with the niceness of rsync, or limiting the bandwidth [18:02:42] i'll email engineering ... [18:02:45] but that functionality is buried in the java code [18:03:07] do not want loud beeps when i try to sleep :) [18:03:07] and, truth be told, the real solution is to use something that's not 5 years old [18:03:11] and totally unmaintained [18:03:19] fair enough [18:03:57] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [18:03:57] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [18:03:57] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [18:04:33] yeah [18:04:39] sphinx has some supporters [18:04:50] there are lots of good options [18:12:10] I;ve been restarting different ones of those every morning [18:12:42] the process gets into some hung state where it won't actually stop with init.d [18:12:53] and you have to shoot it a few times [18:13:09] I guess that's when it stops being responsive for nagios as well [18:13:14] notpeter: ^ [18:15:40] notpeter, LeslieCarr, I have been helping CT with finding a contractor and i am pretty confident we will pick someone by next week [18:15:45] !log removing wm-nl@wikimedia.org as list admin of "press-nl" because it is an OTRS queue (per thehelpfulone) [18:15:49] ok [18:15:51] Logged the message, Master [18:16:07] do you want to get paged until then? 
;) [18:16:24] :-D [18:16:28] !log reedy synchronized php-1.21wmf3/extensions/TimedMediaHandler [18:16:34] Logged the message, Master [18:18:33] LeslieCarr: i'll soon have my own personal pager aka baby 1.0 [18:19:20] haha [18:20:09] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 1.37 ms [18:21:03] New patchset: ArielGlenn; "ms-be6 is a 720xd with different ssd layout, update partman, swift, site files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32398 [18:21:14] I would love it if someone else would review that [18:21:59] LeslieCarr, I wanted to tell you, ganglia on analytics machines now magically works [18:22:03] after not looking at it for a week or two [18:23:15] LeslieCarr: I don't know if that's actually the problem, because afaict the problem of unresponsive does not go away unless and untilsomeone restarts the errant search process on whichever host it has got stuck [18:24:21] grrr [18:24:22] hate magic [18:24:50] sadly, I think that an ops contractor is a very poor solution for this [18:25:24] we need at least one, if not more software devs who are going to actually develope and maintain a codebase... [18:28:14] Reedy: or someone, I see I now no longer get useful exception output when a dumprun fails, instead I get [18:28:16] Set $wgShowExceptionDetails = true; in LocalSettings.php to show detailed debugging information. [18:29:00] this is bad because eswiki fails, and I have no idea why. what can we do about this? [18:29:38] Hmm [18:29:56] i wonder if it'd even appear in the fatal/exception logs... [18:30:09] not having any sort of a message to go on really isn't helpful [18:30:15] no, it isn't [18:30:18] We actually have a bug saying that sucks [18:30:22] :-D [18:30:34] well it sucks for me cause I have no idea why es wiki isn't working and the rest are [18:30:51] Have you tried inserting $wgShowExceptionDetails on one of the runners, then forcing it to run eswiki jobs? [18:31:26] you mean shove it into whichever copy of commonsettings for the version of mw being run? [18:31:29] no, haven't done that [18:32:01] of course it will get overwritten the next time that file is synced [18:32:34] everything is on 1.21wmf3 now at least [18:32:41] ok [18:32:56] this step that fails takes 7 hours to run so it's a bit annoying [18:33:07] Ouch [18:33:09] ie it fails after some time [18:33:24] Do you get a partially written dump file? 
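On the lucene index rsync that was paging people earlier in this log: this is roughly what "the niceness of rsync, or limiting the bandwidth" would look like, if the invocation were not buried in the Java search code. Purely a sketch; the host name, module path and numbers are made up:

```
# Throttled index copy: idle I/O priority, lowest CPU priority and a
# bandwidth cap, so the nightly sync stops starving the searcher.
ionice -c3 nice -n19 \
    rsync -a --bwlimit=40000 \
    searchidx2::search-indexes/enwiki/ /a/search/indexes/enwiki/
# --bwlimit is in KiB/s, so 40000 caps the copy at roughly 40 MB/s.
```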
[18:33:30] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [18:33:47] just wondering if it would have managed to add the page title before it dies [18:38:56] I don't think so [18:39:06] the abstract file has a complete write, there's no partial anything [18:39:12] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [18:39:46] * apergos eyes ms-be6 and wonders [18:40:27] apergos: it keeps cycling through the install process [18:40:45] bios [18:40:56] hard drive boot before network [18:41:00] i just checked...set to boot hard drive [18:41:07] uuhh [18:41:26] crap [18:44:29] apergos: powering it down [18:44:40] ok [18:45:11] if I don't get any takers in a while I guess I'll self review (the puppet changes) and push them aroun to brewster [18:45:44] sounds good to me [18:46:00] worse case...fix them afterwards [18:46:46] yep [18:46:51] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [18:49:25] Reedy: so the stubs files also have completepages written, no partials [18:49:26] dead end [18:51:38] :( [18:52:02] I read that as "push them anon to brewster" [18:52:07] I'll have to live-hack comonsettings but I hate it [18:52:18] yeah you know me, anon ssh in :-P [18:53:13] hmm [18:53:17] What host does the dumps? [18:53:39] I was just thinking it's worth just checking the logs for any hits.. [18:53:59] snapshot[1-4] [18:54:02] in this case 1 [18:54:29] aha [18:54:30] 2012-11-08 15:59:21 snapshot1 eswiki: [87974a13] [no req] Exception from line 36 of /a/usr/local/apache/common-local/php-1.21wmf3/includes/content/TextContent.php: TextContent expects a string in the constructor. [18:54:46] 2012-11-08 12:29:22 snapshot1 eswiki: [64544654] [no req] Exception from line 36 of /a/usr/local/apache/common-local/php-1.21wmf3/includes/content/TextContent.php: TextContent expects a string in the constructor. [18:55:07] Would you like a stack trace to go with that? ;) [18:55:16] ContentHandler/Wikidata related [18:55:23] no, I'd like someone to fix TextContext please :-P [18:56:01] apergos: fyi, that was in /a/mw-log/exception.log on fluorine [18:56:10] oh yay at least it is recorded someplace [18:56:14] http://p.defau.lt/?7wqGCAZ_YrQa9hEeBMBRTg [18:57:12] I wish we had the page so the person could test [18:57:18] (whoever will fix textcontent :-P) [18:57:51] https://bugzilla.wikimedia.org/show_bug.cgi?id=41900 for starters [19:00:31] I've just pinged Daniel in #wikimedia-wikidata [19:01:02] it will be after pageid 2355690 [19:01:18] I'll see if I can find out which one does it. meh [19:01:20] or at least limit it severely [19:02:32] 5479777 is the highest pageid [19:02:42] somewhere right in the middle [19:06:42] apergos: I'm guessing this is when it's doing a full history dump? [19:06:49] no [19:06:57] lol [19:06:58] abstract dump is the first round [19:07:03] reduced use case at least [19:07:03] ahh [19:07:04] [19:05:35] Reedy: first guess from the stack trace: there's a broken revision. it tries to load it from ES or something and gets null. then tries to turn it into a TextContent object. and dies. 
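Narrowing down which page id makes the abstract dump throw (apergos sets out to do exactly this above, knowing it sits somewhere between 2355690 and the top id 5479777) can be scripted as a bisection over dumpBackup.php's --start/--end window. The invocation below is trimmed from the full command quoted a little further down, and the whole thing is only a sketch; it also assumes a single, deterministic failing page:

```
# Bisect the page-id range until the failing id is isolated.  The bounds
# here are the already-narrowed window from the log; each probe reruns the
# dump over the lower half, so tighter bounds mean faster iterations.
lo=2355690 hi=2360000
dump_ok() {
    /usr/bin/php -q /apache/common/multiversion/MWScript.php dumpBackup.php \
        --wiki=eswiki --current --filter=namespace:NS_MAIN \
        --start="$1" --end="$2" --output=file:/tmp/abstract-bisect.xml \
        >/dev/null 2>&1
}
while (( hi - lo > 1 )); do
    mid=$(( (lo + hi) / 2 ))
    if dump_ok "$lo" "$mid"; then
        lo=$mid      # lower half is clean, the bad page is in [mid, hi)
    else
        hi=$mid      # lower half already fails, keep [lo, mid)
    fi
done
echo "failing page id: $lo"
```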
[19:07:20] /usr/bin/php -q /apache/common/multiversion/MWScript.php dumpBackup.php --wiki=eswiki --plugin=AbstractFilter:/apache/common/php-1.21wmf3/extensions/ActiveAbstract/AbstractFilter.php --current --report=1000 --force-normal --output=file:/mnt/data/xmldatadumps/public/eswiki/20121108/eswiki-20121108-abstract.xml.testing --filter=namespace:NS_MAIN --filter=noredirect --filter=abstract --start=2355690 --end=2360000 [19:07:20] Set $wgShowExceptionDetails = true; in LocalSettings.php to show detailed debugging information. [19:07:25] fails within 10 seconds [19:07:36] aha [19:07:40] you can write the utput file wherever you like [19:07:47] so have at, debuggers :-) [19:07:55] I'm going AFK for a while [19:07:58] might look a bit later ;) [19:08:08] ok. I'll see if I can narrow down the page id a bit more [19:08:22] that'd be great [19:08:52] --start=2355690 --end=2355700 [19:08:55] :-D [19:09:05] 10 page ids is good :P [19:10:04] Add it to the bug when you find out which it is :p [19:10:06] --start=2355692 --end=2355693 [19:10:20] can't do better, when I ask for same start or same end I dobn't get the error [19:10:41] haha [19:10:47] 2 page ids is pretty specific [19:11:32] yep, thoguht you'd like it [19:12:22] ah I beet it's ...2 [19:13:14] New patchset: Platonides; "Wikimedia-ES workshop in Oviedo tomorrow." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32404 [19:13:24] Userbox/amigojimmy [19:14:13] yes! [19:14:19] http://es.wikipedia.org/w/index.php?title=Especial%3ABuscar&profile=advanced&search=Userbox%2Famigojimmy&fulltext=Search&ns0=1&ns1=1&ns2=1&ns3=1&ns4=1&ns5=1&ns6=1&ns7=1&ns8=1&ns9=1&ns10=1&ns11=1&ns12=1&ns13=1&ns14=1&ns15=1&ns100=1&ns101=1&ns102=1&ns103=1&ns104=1&ns105=1&redirs=1&profile=advanced [19:14:25] love it [19:15:37] http://es.wikipedia.org/wiki/Usuario:Userbox/amigojimmy [19:16:14] <^demon|lunch> I've got a user who can't connect to gerrit (ssh -vvv: http://pastebin.com/7hxDzPuK, traceroute: http://pastebin.com/zJ38sRkX). It's nothing on the gerrit side as far as I can tell. Thoughts on how to proceed? [19:17:00] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [19:18:20] ^demon|lunch: traceroute gets all the way to manganese, so i guess it is firewalling on his side somewhere because of the high port, maybe his "personal firewall" or anything thinks 29418 is suspicious or something [19:18:42] <^demon|lunch> Hmm, hadn't thought of that. [19:19:59] <^demon|lunch> It's been awhile since someone had complained about 29418 :p [19:20:05] ^demon|lunch: you could let him try ssh to 22 on manganese, not for a login, just to see [19:23:31] self reviewing >_< [19:23:40] <^demon|lunch> mutante: That was it. Thanks :) [19:23:45] oh, hmm [19:23:59] jenkins didn'tlike it [19:24:22] ^demon|lunch, you could ask him to do traceroute -M tcp -p 29418 gerrit.wikimedia.org [19:24:41] -M tcp needs root on the local machine, though [19:26:05] quote [19:26:56] New review: Dzahn; "yep. cachewww1.uniovi.es.cachewww2.uniovi.es cachewww3.uniovi.es." 
[operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/32404 [19:26:56] Change merged: Dzahn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32404 [19:27:57] New patchset: ArielGlenn; "ms-be6 is a 720xd with different ssd layout, update partman, swift, site files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32398 [19:28:24] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32398 [19:29:32] !log dzahn synchronized ./wmf-config/throttle.php [19:29:39] Logged the message, Master [19:30:21] cmjohnson1: wanna give it another shot with the install? puppet changes live [19:30:58] yep [19:31:36] thanks! [19:32:12] * apergos thinks it is dinnertime [19:34:34] <^demon|lunch> apergos: What's for dinner tonight? [19:34:40] good question [19:34:47] something fast which probably means an easy pasta [19:34:50] I'm past starving [19:37:04] New patchset: Lcarr; "allowing new fundraising IP range access to ncsa fun" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32406 [19:37:34] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32406 [19:42:46] chamge of plans, boiled baby zucchini, some tomato and cheese and bread with that, while I wait I have a perfectly ripe pear [19:44:05] mmmm [19:44:11] apergos: want to come cook at my house ? :) [19:44:26] yes, if you move your house here temporarily! :-) [19:45:37] * apergos remembersto defrost a pie with greens and various cheese in it so it will be ready for tomorrow [19:46:17] <^demon|lunch> apergos: Your dinners always sound so yummy :) [19:46:21] <^demon|lunch> And lunches. [19:46:21] heh [19:46:24] <^demon|lunch> And meals, generally. [19:46:54] and they are all meat and fish free [19:48:01] what are you eating, I notice you are tagged as lunch [19:49:42] <^demon|lunch> I got a falafel pita today. [19:49:52] anything on it? [19:50:09] <^demon|lunch> Tomatoes, lettuce, pickles. [19:50:19] no sauce of some sort? [19:50:26] <^demon|lunch> Tzatziki [19:50:30] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [19:50:31] ah ha [19:50:39] hmm there comes the 720xd [19:50:47] let's see if it tries to reinstall again [19:51:40] <^demon|lunch> Best falafel in richmond, hands-down. Probably some of the best falafel I've had ever. [19:52:37] we were just discussin the fact that there are no falafel places around here. maybe one a fair distance away [19:52:54] apergos: it shouldn't if it placed the OS on the right disk [19:53:21] we'll see [19:53:25] * apergos crosses fingers [19:53:34] <^demon|lunch> Anyway, lunchtime is over. [19:53:38] ok [19:53:43] I'd say enjoy but too late [19:53:51] the recipe calls for grub to be on /dev/sdm and sdn...the bios is set for disk 12 to be the boot loader that should be dev/sdm [19:54:04] yep [19:54:20] as long as we start with disk 0 [19:55:47] so basic install is done...wanna to puppet or should i? [19:56:27] feel free! 
it shooooould just work [19:56:53] okay [19:56:53] (also I'm about to have my dinner here in a minute) [19:58:00] RECOVERY - SSH on ms-be6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:05:31] PROBLEM - LVS HTTPS IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused [20:05:48] PROBLEM - LVS HTTPS IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused [20:05:48] PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused [20:06:15] βλαη [20:06:15] PROBLEM - LVS HTTPS IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused [20:06:17] <-- nginx missing IPs in esams [20:06:21] just 3001 [20:06:21] ah [20:06:24] PROBLEM - LVS HTTPS IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused [20:06:26] you'rem on it? [20:06:33] PROBLEM - HTTPS on ssl3001 is CRITICAL: Connection refused [20:06:34] yeah, but what to use [20:06:40] we used localhost / 127.0.0.1 [20:06:48] but it does not like that as a valid host [20:06:53] is esams ipv6 only going to nginx on ssl3001? [20:07:18] RECOVERY - swift-object-server on ms-be6 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [20:07:27] RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [20:07:27] RECOVERY - swift-container-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [20:07:27] RECOVERY - swift-container-server on ms-be6 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [20:07:40] restarting without wikidata and wikivoyage config, temp [20:07:54] RECOVERY - swift-container-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [20:07:54] RECOVERY - swift-account-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [20:07:54] RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [20:07:54] RECOVERY - LVS HTTPS IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 60060 bytes in 0.836 seconds [20:08:03] RECOVERY - LVS HTTPS IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 60060 bytes in 0.847 seconds [20:08:12] RECOVERY - HTTPS on ssl3001 is OK: OK - Certificate will expire on 07/19/2016 16:14. 
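A triage sketch for the ssl3001 episode above, where nginx refused connections because vhosts referenced listen addresses that do not exist in esams. The commands are generic; the error-log path is the stock nginx default:

```
# Why is nginx on ssl3001 refusing connections?
nginx -t                             # syntax check of the generated config
ip -6 addr show                      # which v6 addresses this host actually has
tail -n 50 /var/log/nginx/error.log  # bind() failures for missing IPs land here on startup
netstat -tlnp | grep ':443 '         # what, if anything, is listening on 443
```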
[20:08:12] RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [20:08:13] RECOVERY - swift-object-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [20:08:13] RECOVERY - swift-account-reaper on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [20:08:21] RECOVERY - swift-object-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [20:08:48] RECOVERY - LVS HTTPS IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 60060 bytes in 0.856 seconds [20:09:06] RECOVERY - LVS HTTPS IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 60060 bytes in 0.840 seconds [20:09:06] RECOVERY - LVS HTTPS IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 60060 bytes in 0.850 seconds [20:09:25] missind for /dev/sda and /dev/sdb, rats [20:09:37] but at least it didn't do them for m and n [20:10:05] os is in the right plac [20:10:07] e [20:15:00] apergos: how many disk do you see? [20:15:10] New patchset: Dzahn; "comment wikidata/voyage esams proxy IPs, nginx does not like us using the 127.0.0.1 or just :443 work around" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32410 [20:15:15] 12 [20:15:23] well wait a sec [20:15:30] only seeing 10 [20:15:36] oh [20:15:40] I;mm blind ignore me [20:15:42] no /dev/sda and /dev/sdb [20:15:46] I see em [20:15:53] no [20:15:58] I see em in the fstab [20:16:02] I really am blind [20:16:03] sheesh [20:16:05] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32410 [20:16:22] yeah they aren't mounted and I bet ther eis no filesystem on em [20:16:48] i think you are right [20:17:13] * apergos does apuppet run over there to see what it thinks [20:17:27] apergos: fyi, the other issue with the wikipedia cert, it was just different on ssl1, checked all with md5sum ..ugh [20:17:43] "different"? wtf [20:18:01] well you get the cert and the key right, and the puppet creates the "chained" version [20:18:31] i got the same md5sums for keys and certs but the "chained" one was not updated on that one host that was off .. [20:19:00] RECOVERY - swift-account-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [20:19:12] it tries to mount them without making them [20:19:14] checking [20:21:29] /dev/sda has the old three partitions on it, new ones never got written [20:24:21] New patchset: Asher; "don't call nginx proxy_configuration class in esams while we don't have ip's" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32411 [20:24:39] mutante: ^^^ i think that will fix it [20:25:18] RECOVERY - NTP on ms-be6 is OK: NTP OK: Offset -0.02005827427 secs [20:26:21] tried by hand, [20:26:38] New patchset: Demon; "Enable lucene for wikidatawiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32412 [20:26:57] 'partition has been written but we have been unable to inform the kernel of the change, probably because it/they are in use. 
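The reinstall trouble in this stretch of the log ("/dev/sda has the old three partitions on it, new ones never got written", "unable to inform the kernel of the change, probably because it/they are in use") is typically a symptom of the previous install's RAID arrays and filesystems still being active, so the disks cannot be re-partitioned. A sketch of the manual cleanup, from a rescue or installer shell, before retrying; device names are examples only:

```
# Free the disks so the installer can rewrite the partition tables.
cat /proc/mdstat                 # which old arrays are still assembled
mdadm --stop --scan              # stop all of them
mdadm --zero-superblock /dev/sda1 /dev/sdb1   # wipe RAID metadata off the old members (example devices)
swapoff -a                       # old swap partitions also keep a disk "in use"
```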
[20:27:01] nice [20:28:04] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32412 [20:28:55] why is there a /dev/md0 and a /dev/md1 [20:28:59] this seems like a problem [20:30:58] !log demon synchronized wmf-config/CommonSettings.php 'Enabling lucene for wikidatawiki' [20:31:00] yeah..i am looking at it now I think the last install is conflicting with the new install [20:31:05] Logged the message, Master [20:31:23] yep it kept the old entries for md0 [20:31:25] bah [20:31:29] partman-md/confirm_nooverwrite boolean true [20:31:46] would that be it? [20:32:48] PROBLEM - HTTP on singer is CRITICAL: Connection refused [20:33:38] New review: Dzahn; "testing this on ssl3001 (just commenting creates puppet run error with template)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/32411 [20:33:39] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32411 [20:35:00] no [20:35:09] I don't think so [20:37:26] tried some cleanup going to reboot now [20:37:55] cmjohnson1: you off? [20:38:19] yes [20:40:26] RobH hiiiiii! [20:43:56] New patchset: Dzahn; "don't call nginx proxy_configuration class in esams while we don't have ip's - do the same for wikivoyage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32413 [20:44:32] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32413 [20:44:57] PROBLEM - SSH on ms-be6 is CRITICAL: Connection refused [20:45:15] PROBLEM - swift-account-server on ms-be6 is CRITICAL: Connection refused by host [20:45:15] PROBLEM - swift-account-replicator on ms-be6 is CRITICAL: Connection refused by host [20:45:15] PROBLEM - swift-container-server on ms-be6 is CRITICAL: Connection refused by host [20:45:15] PROBLEM - swift-object-server on ms-be6 is CRITICAL: Connection refused by host [20:46:00] PROBLEM - swift-account-auditor on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:46:00] PROBLEM - swift-object-replicator on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:46:00] PROBLEM - swift-container-updater on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:46:00] PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:46:09] PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:46:18] PROBLEM - swift-account-reaper on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:46:27] PROBLEM - swift-container-auditor on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:46:28] PROBLEM - swift-object-updater on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:47:30] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [20:48:02] so the good news is that I *think* I have a proper partition table on those two disks now so... 
a reinstall (if we had to) might just fix the whole thing :-/ [20:49:21] * cmjohnson1 grumbles [20:50:04] !log ignore the fail event on ms-be6, this was caused manually and is a bogus warning [20:50:10] Logged the message, Master [20:51:51] !log fixed wikipedia ssl chained cert on ssl1 (all other hosts had it recreated as expected) [20:51:57] Logged the message, Master [20:52:00] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [20:54:02] !log ms-be6 another OS install [20:54:08] Logged the message, Master [20:55:18] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [20:55:40] !log removing wikidata/voyage sites config from esams ssl hosts [20:55:45] Logged the message, Master [20:56:55] New patchset: Aaron Schulz; "Added memc-pecl to parser & session caches." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32416 [20:58:33] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [20:59:23] binasher: ^ [21:00:30] RECOVERY - HTTP on singer is OK: HTTP OK - HTTP/1.1 302 Found - 0.001 second response time [21:00:48] AaronSchulz: oh.. i have something i was going to commit that removes the session multiwrite and just uses redis [21:00:55] can you revise? [21:01:00] yeah I wasn't sure about sessions [21:01:09] on many sites raw php code is sent instead of parsed php [21:01:36] !log restarted nginx on all SSL proxies without issues now. puppet runs again and does not recreate configs for data/voyage on esams. [21:01:37] !log olivneh synchronized php-1.21wmf3/extensions/E3Experiments [21:01:41] Logged the message, Master [21:01:47] Logged the message, Master [21:03:44] Merlissimo: ? [21:03:50] New patchset: Aaron Schulz; "Added memc-pecl to parser cache and made session cache redis-only." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32416 [21:04:25] i got raw php code on http://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php (but that seems to be fixed) and http://wikidata-test-repo.wikimedia.de/w/index.php [21:04:33] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [21:05:53] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32416 [21:06:28] AaronSchulz: just noticed a problem with the jobqueue - there are jobs in the enwiki queue from before wmf3 was deployed, which will never run since they don't have job_random values [21:06:39] RECOVERY - SSH on ms-be6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:06:55] well 0 is a valid value [21:06:57] i wonder if other wikis have the same problem [21:07:16] hmm [21:08:19] !log olivneh synchronized php-1.21wmf3/extensions/EventLogging [21:08:20] apergos, why is that revision missing? [21:08:24] Logged the message, Master [21:08:36] no idea, that's way outside my scope [21:08:51] apergos: we are good to go [21:08:54] active raid1 sdm1[0] sdn1[1] [21:08:54] 58559360 blocks super 1.2 [2/2] [UU] [21:08:54] [=======>.............] resync = 38.5% (22576704/58559360) finish=6.2min speed=96313K/sec [21:08:55] (another day if I had some spare cycles I might poke around at it) [21:08:55] [21:09:01] yay [21:09:16] while you are there can you [21:09:24] fdisk /dev/sda and print the partition table?
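The /proc/mdstat paste just above is the healthy case: both raid1 members present ([2/2] [UU]) and the initial resync at 38.5%. A little later the degraded case shows up ([2/1] [_U], one mirror missing). A minimal sketch for pulling that status out of /proc/mdstat; it is illustrative only, not the check actually used here.

import re

with open("/proc/mdstat") as fh:
    mdstat = fh.read()

# Array header lines look like "md127 : active raid1 sdm1[0] sdn1[1]".
for match in re.finditer(r"^(md\S+) : (\S+) (\S+) (.+)$", mdstat, re.MULTILINE):
    name, state, level, members = match.groups()
    print(f"{name}: {state} {level}, members: {members}")

# "[UU]" means both raid1 members are present; "[_U]" or "[U_]" means degraded.
if "[_U]" in mdstat or "[U_]" in mdstat:
    print("WARNING: at least one mirror is running degraded")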
[21:09:29] wanna see what it has [21:09:34] same for /dev/sdb [21:09:40] cmjohnson1: [21:09:45] k [21:09:49] thanks [21:10:30] AaronSchulz: there are refreshLinks2 jobs in the enwiki queue from 11/5 for 'Location_map+' that the job runners are never trying to run, though other enwiki refreshLinks2 jobs are running [21:13:39] apergos: seen this before [21:13:43] fdisk /dev/sda [21:13:43] WARNING: GPT (GUID Partition Table) detected on '/dev/sda'! The util fdisk doesn't support GPT. Use GNU Parted. [21:14:12] !log asher synchronized wmf-config/CommonSettings.php 'multiwrite pcache to new memcached servers, only use redis for sessions' [21:14:18] Logged the message, Master [21:14:18] nevermind ... [21:14:41] that's fine [21:14:47] just p to show the partition table [21:15:05] you're not going to change anything, only look at it [21:15:23] cmjohnson1: [21:15:31] Disk /dev/sda: 2000.4 GB, 2000398934016 bytes [21:15:31] 255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors [21:15:31] Units = sectors of 1 * 512 = 512 bytes [21:15:33] Sector size (logical/physical): 512 bytes / 512 bytes [21:15:35] I/O size (minimum/optimal): 512 bytes / 512 bytes [21:15:37] Disk identifier: 0x00000000 [21:15:39] Device Boot Start End Blocks Id System [21:15:41] /dev/sda1 1 3907029167 1953514583+ ee GPT [21:15:45] awesome [21:15:48] q to quit [21:16:23] then the same for /dev/sdb and you're just looking for the line that starts /dev/sdxx with the start and end sector numbers and the GPT at the end [21:16:35] if you see that then we are good for doing an initial puppet run [21:17:24] ok we are good [21:17:50] sweet [21:17:56] take it away [21:18:23] !log ms-be6 initial puppet run [21:18:28] Logged the message, Master [21:24:12] RECOVERY - swift-object-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [21:24:12] RECOVERY - swift-container-server on ms-be6 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [21:24:21] RECOVERY - swift-account-reaper on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [21:24:30] it's alive! [21:24:30] RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [21:24:30] RECOVERY - swift-container-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [21:24:30] RECOVERY - swift-object-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [21:25:06] RECOVERY - swift-object-server on ms-be6 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [21:25:06] RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [21:25:15] RECOVERY - swift-container-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:25:33] RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [21:25:33] RECOVERY - swift-account-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [21:28:08] puppet run complete?
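The fdisk walkthrough above (print the table, look for the single protective-MBR entry tagged "ee GPT") is a read-only sanity check: this fdisk cannot edit GPT labels, hence its warning pointing at GNU Parted, but the protective entry is enough to confirm a GPT label exists before kicking off the initial puppet run. A minimal sketch of scripting that same read-only check; it shells out to fdisk -l, needs root, and is not the actual procedure used here.

import subprocess

for disk in ("/dev/sda", "/dev/sdb"):
    # "fdisk -l" only prints the partition table; nothing is written to disk.
    result = subprocess.run(["fdisk", "-l", disk],
                            capture_output=True, text=True)
    output = result.stdout + result.stderr
    if "GPT" in output:
        print(f"{disk}: GPT protective entry found, looks good")
    else:
        print(f"{disk}: no GPT entry; the old partition table may still be there")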
[21:28:17] yes [21:28:50] still have to do fs manually, this will all go away with new fresh disks on the next host [21:30:55] meh [21:30:57] md127 : active raid1 sdb1[1] [21:30:57] 58559360 blocks super 1.2 [2/1] [_U] [21:30:57] [21:30:57] PROBLEM - NTP on ms-be6 is CRITICAL: NTP CRITICAL: Offset unknown [21:31:05] that's a problem [21:31:08] yep [21:31:35] should i wipe the disk? [21:32:07] 0 and 1 [21:32:27] fyi: that didn't show up before puppet [21:32:43] no, just wait a bit [21:42:21] RECOVERY - NTP on ms-be6 is OK: NTP OK: Offset -0.02238750458 secs [21:45:16] cmjohnson1: wanna keep an eye on the console? [21:45:18] *sigh* [21:45:23] really really tired of this box [21:45:29] sure [21:45:36] or really, of working way past my off time [21:45:48] yes you are [21:47:18] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [21:48:12] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [21:48:48] RECOVERY - swift-account-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [21:50:12] binasher: if you're keeping score, "duplicate" events turned out to have been just a dumb python error on my part: for line in data.split('\n'): q.put(data) [21:55:08] New patchset: Hashar; "get nodejs on the continuous integration server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32462 [21:57:12] PROBLEM - MySQL Replication Heartbeat on db1003 is CRITICAL: CRIT replication delay 181 seconds [21:57:39] PROBLEM - MySQL Slave Delay on db1003 is CRITICAL: CRIT replication delay 197 seconds [21:59:46] ori-l: good to know. the esams and eqiad bits servers all have the new varnish build with the logging fix [22:01:16] ori-l: i need to revisit the email thread that ottomata started re: a new log format, but i think that will happen tomorrow or early next week and you'll both get the streams [22:01:26] binasher: woot! that's awesome. we should have some results from the first experiment to use this (task system on enwiki's community portal) pretty soon, i'll CC them to you [22:02:03] binasher: re: the log format, reminder to give me a few hours' notice so i can migrate my parsing scripts [22:02:09] RECOVERY - MySQL Replication Heartbeat on db1003 is OK: OK replication delay 0 seconds [22:02:36] RECOVERY - MySQL Slave Delay on db1003 is OK: OK replication delay 0 seconds [22:02:44] cooooool guys! [22:02:48] ori-l: will do [22:02:53] ori-l you ok with that format, btw? [22:04:39] ottomata: it's what we discussed no? let me take a quick look just to make sure [22:05:30] ottomata: works for me [22:06:32] ori-l, binasher; the change of the log format only affects event data right? not regular udp2log data? [22:06:43] drdee: right [22:06:49] cool [22:07:54] yeah just event data [22:10:51] cmjohnson1: finally they are all there and mounted [22:10:53] what a waste of time [22:11:15] like I say with fresh disks this won't be an issue [22:12:27] yeah it was...lesson learned...hopefully the next 11 will be better [22:12:32] yep [22:12:39] I am now leaving this to other folks cause I am [22:12:45] done for the night. have a good one [22:13:43] New patchset: Demon; "(bug 30720) Add links to Special:Code on mw.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32465 [22:17:08] apergos: g'night thx for the help [22:18:12] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
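The "dumb python error" quoted above is worth spelling out: the loop iterates over the payload line by line but enqueues the whole buffer each time, so every event gets duplicated once per line. A minimal corrected version (the queue and the payload string are stand-ins; the surrounding script is not shown in the log):

import queue

q = queue.Queue()
data = "event1\nevent2\nevent3"  # stand-in for the real payload

for line in data.split('\n'):
    q.put(line)  # the bug was q.put(data), which enqueued the full buffer once per line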
[22:19:51] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 191 seconds [22:25:09] New patchset: Asher; "having pybal monitor search-pool4 hosts with an search query to one of their indices" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32467 [22:25:13] notpeter: ^^ [22:26:18] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 194 seconds [22:26:27] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 183 seconds [22:26:50] binasher: works for me [22:27:14] binasher: oh, actually [22:27:44] sure, I'm down to give it a shot [22:28:00] i don't think it'll make much difference but you were saying that pybal was sending queries to the host that was getting the index rsync'd even tho the other host was fine [22:28:30] i think we need to break pool4 up into at least two pools [22:29:01] pybal isn't depooling the hosts... [22:29:09] yeah [22:29:30] iunno, if you think that'll actually get non-responsive boxes depooled, great! [22:29:36] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [22:29:36] also, as for depooling... I disagree [22:29:45] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [22:29:51] those boxes are very underutilized for almost the entire day [22:30:01] and then when the rebuild occurs they fall over [22:30:11] throwing hardware at this seems really wasteful [22:30:54] just spacing out the index rebuilds, and thus when they're rsynced over, might solve this [22:37:37] It should get extra hosts anyways [22:37:44] two hosts isn't enough for redundancy [22:38:25] if a single host dies, we can't push indexes without an outage, even if everything else worked fine [22:38:52] http://ganglia.wikimedia.org/latest/?c=Search%20eqiad&h=search1015.eqiad.wmnet&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [22:39:03] pretty under-utilized [22:39:28] yah, maybe just needs an extra host vs. a new pool [22:39:29] I mean, yes, I agree with your assessment [22:39:40] sure, that could work [22:40:08] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32467 [22:40:46] I mean, even during the "outages" the boxes aren't working that hard: http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&c=Search+eqiad&h=search1015.eqiad.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [22:41:48] do you think it would be effective to change niceness of rsync? [22:48:38] New patchset: Mwalker; "CentralNotice: New projects support, X-Domain data" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32469 [22:49:20] in case you find fenari slow, it is me, need to unpack and repack a large dump file. i nice'd it 19 first, then reniced to 0 because it took forever.. if it really bothers you maybe bast1001 is an option or bug me about it. better to do it once on fenari though than on all the db servers, which would also cause issues there [22:53:53] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [22:54:40] New patchset: Asher; "lower pybal depool-thresholds for search pools based on number of hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32471 [23:10:41] PROBLEM - HTTPS on formey is CRITICAL: Connection refused [23:11:26] PROBLEM - HTTP on formey is CRITICAL: Connection refused [23:12:32] Reedy: can you tell me about the sitecode for wikivoyage?
-- planning on adding a wikivoyage target to centralnotice; but confused as to what entry I need to add to the wmgNoticeProject map. Just 'wikivoyage' => centralnotice name, or 'nlwikivoyage'=> ... 'frwikivoyage' => ... etc [23:29:33] !log olivneh synchronized php-1.21wmf3/extensions/E3Experiments [23:29:39] Logged the message, Master [23:32:02] ori-l: moooaaar experiments \O/ [23:35:16] mwalker: wikivoyage will make it apply on all wikivoyage projects... [23:35:30] OK, that's what I thought [23:35:38] it just appears a little strangely in the sitematrix api [23:36:17] errm [23:36:44] If sitematrix is weird, most likely blame Matthais ;) [23:36:55] hashar: :) [23:37:18] mwalker: what's strange about it? [23:37:42] Change merged: Pgehres; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32469 [23:37:48] Oh [23:37:54] They all appear as special [23:37:58] That should probably be fixed [23:38:47] at some point or another [23:39:21] AFAIK we have no plans to target them, in fact, we want to get them set up so that we can explicitly ignore them, i believe [23:40:27] I'm sure we will in the future [23:40:33] mwalker: sooner rather than later though! https://bugzilla.wikimedia.org/show_bug.cgi?id=41904 [23:42:06] yeah, we had a bunch of people yell at us when wikidata got banners since it was improperly configured as "wikipedia" [23:45:16] mutante: https://labsconsole.wikimedia.org is down. Can you help? [23:46:27] o_0 [23:46:36] That looks seriously broken [23:46:42] pgehres: Pffft. Everything is a wikipedia [23:47:13] * Damianz browses to reedy.wikipedia and nods [23:48:20] sumanah: ooh :/ i am actually in the middle of doing db imports for wikivoyage, let me take a look [23:48:30] I'm sorry for the interruption mutante [23:48:30] brings a 404 :( [23:48:37] New patchset: Hashar; "get nodejs on the continuous integration server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32462 [23:49:11] I do need to be able to use labsconsole within the next 12 hours or so, to make some accounts for the Bangalore dev camp which starts then [23:49:30] sleep time. have a good evening [23:49:31] (Labs/Gerrit accounts) [23:49:33] night hashar [23:49:59] RECOVERY - HTTPS on formey is OK: OK - Certificate will expire on 08/22/2015 22:23. [23:50:35] RECOVERY - HTTP on formey is OK: HTTP OK HTTP/1.1 200 OK - 3596 bytes in 0.003 seconds [23:51:28] !log pgehres synchronized php-1.21wmf3/extensions/CentralNotice/ 'Updating CentralNotice to master' [23:51:34] Logged the message, Master [23:51:53] Reedy: any tricks to syncing Initialise and Common settings or is it just two sync-files? [23:52:00] sync-dir wmf-config [23:52:14] that works too [23:52:20] essentially the same thing, and it won't transfer files unnecessarily [23:54:34] New patchset: Demon; "Fixing SSLCACertificateFile for Gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32476 [23:55:57] New review: Demon; "This is what caused SVN's https to flap earlier. Don't know why it hadn't broken before now." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/32476 [23:56:33] !log pgehres synchronized wmf-config [23:56:39] Logged the message, Master