[00:00:13] I thought about keeping the same mountpoint for OCD reasons [00:00:18] but thought it isn't worth it [00:00:41] it'd be the same amount of work, but we'd run into the risk that puppet didn't run somewhere for some reason [00:01:12] and /mnt/thumbs would still be ms5 and all the debugging that this would entail [00:03:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.861 seconds [00:03:57] RobH: to mention earlier stuff - we have 10g interfaces on cr1 and cr2 actually reserved for the NAS - unless mark has changed his mind [00:04:10] * Reedy kicks db1035 [00:13:16] AaronSchulz: if it works for WebM we could enable it for Ogg too. Would have to reevalute the ogg seeking issues in avconv compared to oggThumb [00:13:29] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 251 seconds [00:14:17] PROBLEM - MySQL Replication Heartbeat on db1010 is CRITICAL: CRIT replication delay 278 seconds [00:14:26] RECOVERY - MySQL Replication Heartbeat on db1010 is OK: OK replication delay 0 seconds [00:14:35] PROBLEM - MySQL Replication Heartbeat on db11 is CRITICAL: CRIT replication delay 316 seconds [00:14:53] PROBLEM - MySQL Slave Delay on db11 is CRITICAL: CRIT replication delay 335 seconds [00:15:47] if what works? [00:15:48] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [00:16:12] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33289 [00:17:14] Reedy: db1035 does lvm snapshots so it's a little write sluggish :) [00:17:31] paravoid: sorry, switched channel, was responding to a question on #mediawiki [00:17:41] binasher: looking at the database size... I dropped about 9GB of data :D [00:17:52] oh not in there :) [00:18:02] maybe I should [00:18:07] well, "data" [00:18:10] Reedy: that's what i like to hear! 
[00:18:34] It was about 21GB originally [00:18:37] And there's more to go [00:19:20] !log asher synchronized wmf-config/CommonSettings.php 'remove apache memchached's from the parsercache multiwriter' [00:19:27] Logged the message, Master [00:21:30] !log asher synchronized wmf-config/mc.php 'disable general memcached multiwriting: nothing but pecl' [00:21:36] Logged the message, Master [00:23:11] Awesome [00:23:20] wow, really [00:23:21] cool! [00:23:58] running tcpdump on apaches to see if there's any port 11000 activity (the new hosts are using the standard port of 11211) and it's nothing but nagios [00:24:01] does anyone know how long is de wikivoyage going to be DB locked for please? [00:24:20] Till the slaves catch up [00:24:20] Again [00:24:26] Same as everything else on s3 [00:24:55] Another 10 million rows to go bye bye [00:25:13] Reedy: what's your delete chunk size? [00:25:18] a month [00:25:31] what are you deleting out of pure curiosity? [00:25:33] Thehelpfulone: 2 slaves have caught up... 1 is catching up [00:25:42] oai "audit" logs [00:26:07] It had 82 million rows [00:26:11] binasher: awesome! do you have any numbers on the performance benefits of that? [00:26:20] KEY (oa_client,oa_timestamp), [00:26:20] KEY (oa_timestamp,oa_client) [00:26:20] ok [00:26:25] ^ the indexes were beyond useless [00:27:10] !log aaron synchronized wmf-config/PrivateSettings.php [00:27:16] Logged the message, Master [00:27:28] binasher: is it worth adding a simple auto increment PK to the table when I get the row count down enough? [00:28:06] New patchset: Aaron Schulz; "Set swiftTempUrlKey parameter." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33296 [00:28:34] !log Set X-Account-Meta-Temp-Url-Key for swift mw:thumb account [00:28:40] Logged the message, Master [00:29:35] the hope is when the table is not backlogged with crap, I can actually get a few metrics from it... [00:29:45] AaronSchulz: hm? [00:31:28] binasher: where is pc in ganglia? 
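[Editor's note: the chunked-delete approach Reedy describes above — deleting a bounded number of rows per pass so each transaction stays small and replication lag stays manageable — can be sketched roughly as below. The table and column names are hypothetical, and sqlite3 stands in for MySQL purely so the sketch is self-contained.]

```python
import sqlite3

def delete_in_chunks(conn, table, cutoff, chunk_size):
    """Delete rows with oa_timestamp < cutoff in bounded batches,
    so each transaction stays small and replicas can keep up."""
    total = 0
    while True:
        cur = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN "
            f"(SELECT rowid FROM {table} WHERE oa_timestamp < ? LIMIT ?)",
            (cutoff, chunk_size),
        )
        conn.commit()
        if cur.rowcount == 0:
            break
        total += cur.rowcount
        # In production you would pause here until slave lag recovers.
    return total

# Toy data: 25 rows, delete everything with a timestamp below 20.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE oaiaudit (oa_timestamp INTEGER)")
conn.executemany("INSERT INTO oaiaudit VALUES (?)",
                 [(ts,) for ts in range(1, 26)])
deleted = delete_in_chunks(conn, "oaiaudit", cutoff=20, chunk_size=10)
```

The key design point is the bounded subquery: each pass touches at most `chunk_size` rows, so replication on the slaves never falls far behind the way a single multi-million-row DELETE would cause.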
[00:31:36] * AaronSchulz hates it when he can't find stuff easily [00:31:50] ahh, mysql group [00:31:56] * AaronSchulz was looking in misc [00:32:11] AaronSchulz: what's the Temp URL thing? [00:33:50] http://docs.rackspace.com/files/api/v1/cf-devguide/content/TempURL-d1a4450.html [00:34:16] AaronSchulz: this is a good sign http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=MySQL+pmtpa&h=pc1.pmtpa.wmnet&v=42684218&m=mysql_bytes_sent&jr=&js=&vl=bytes&ti=mysql_bytes_sent [00:34:19] also see https://gerrit.wikimedia.org/r/#/c/31666/ [00:34:29] I've read what it does, I'm wondering what are you going to use it to [00:35:02] looking [00:35:29] ah, great! [00:35:36] binasher: are you thinking about killing something else? :) [00:36:24] Reedy: wtf is oaiaudit? heh [00:37:31] yes, please do add an auto-inc pk [00:37:32] RECOVERY - MySQL Replication Heartbeat on db11 is OK: OK replication delay 0 seconds [00:37:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:37:50] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [00:37:50] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [00:38:08] RECOVERY - MySQL Slave Delay on db11 is OK: OK replication delay 0 seconds [00:38:08] unless oa_client is unique and the table creator just forgot about a pk.. i'm sure that's not the case, but it'd be nice [00:38:10] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33296 [00:38:22] log of requests to the OAI extension... So every time lucene requests anything... or any of the 3rd party users.. [00:38:34] Nope, there's no unique key :( [00:39:00] !log aaron synchronized wmf-config/filebackend.php 'Set "swiftTempUrlKey"' [00:39:07] Logged the message, Master [00:39:26] Ideally I want to add a couple of indexes too so we can get some useful information from it... 
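[Editor's note: the Swift TempURL mechanism discussed above (what the swiftTempUrlKey / X-Account-Meta-Temp-Url-Key settings enable) signs an HTTP method, an expiry timestamp, and an object path with HMAC-SHA1 of the shared account key. A minimal sketch, using a made-up key and object path rather than any real Wikimedia value:]

```python
import hmac
from hashlib import sha1
from time import time

def make_temp_url(method, path, key, seconds_valid):
    """Build a Swift-style temporary URL query string.

    `path` is the object path as the Swift proxy sees it,
    e.g. /v1/AUTH_account/container/object.
    """
    expires = int(time()) + seconds_valid
    # The signed body is exactly: METHOD \n expires \n path
    body = f"{method}\n{expires}\n{path}"
    sig = hmac.new(key.encode(), body.encode(), sha1).hexdigest()
    return f"{path}?temp_url_sig={sig}&temp_url_expires={expires}"

# Hypothetical values, for illustration only:
url = make_temp_url("GET", "/v1/AUTH_mw/thumb/c/c8/Example.jpg", "secret", 300)
```

The proxy recomputes the same HMAC from its stored key and rejects the request if the signature mismatches or the expiry has passed, which is what lets thumbnails be served without exposing the account credentials.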
[00:39:40] Obviously to be done when it hasn't got quite so many rows ;) [00:39:43] frwiki Revision::fetchFromConds 10.0.6.53 2008 MySQL client ran out of memory (10.0.6.53) [00:39:47] binasher: lol [00:40:14] :D [00:40:40] autoinc that shit [00:41:02] wtf is with UserDailyContribsHooks on incubator wiki [00:41:14] editcountitis? [00:41:24] Lock wait timeout exceeded; try restarting transaction (10.0.6.44) UPDATE `user_daily_contribs` SET contribs=contribs+1 WHERE day = '20121113' AND user_id = '17032' [00:41:34] Reedy: maybe it's a bot ;) [00:41:36] hell yes [00:42:12] that "MySQL client ran out of memory" query had a limit 1 on it! [00:42:33] it was probably the straw that broke the parsing camel's back [00:43:05] [user_name] => Base [00:43:11] [user_editcount] => 65 [00:43:17] that doesn't sound like a bot to me [00:50:44] mutante: when you pop up and have a little bit of time, could you please take a look at http://ftp.mozilla.org/pub/mozilla.org/webtools/bugzilla-4.0.8-to-4.0.9-nodocs.diff.gz ? 
TIA [00:52:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.578 seconds [00:54:00] !log aaron synchronized php-1.21wmf4/extensions/TimedMediaHandler 'deployed 2784d7ab1b0e0a56b905ff4b9c4070f63f867d25' [00:54:06] Logged the message, Master [00:54:24] binasher: too bad cache multiwrite doesn't backfill, ah well :) [00:54:57] yeah, i like the trend though [00:56:43] there's also a trend i don't like much, get times from MWMemcached were much more erratic and had huge spikes that are absent from the pecl client, but on average, it was quicker [00:57:14] that might be because more of the gets are actually returning larger pcache objects which were previously misses though [00:58:04] binasher: it was slower even before when added it to the parser cache group [00:58:43] * AaronSchulz tries to recall the order of events [00:58:47] and the tp50 time is something like 1.5ms vs 3ms [00:59:49] network latency is greater from the apaches (new and old racks) to the mc hosts than to other apaches though [01:00:16] mw24 to srv245: rtt min/avg/max/mdev = 0.121/0.145/0.158/0.011 ms [01:00:22] yeah, profiling was one for a while before we added pcache [01:00:26] vs mw24 to mc3: rtt min/avg/max/mdev = 0.095/0.209/0.362/0.074 ms [01:00:29] * AaronSchulz double checked [01:02:16] AaronSchulz: i wonder if you could get data on instantiation overhead for the pecl client without persistent connections [01:03:33] tp90s are a lot different :/ [01:04:38] oh, go further back.. times for MWMemcached dropped drastically once the initial multiwrite to pecl (without pcache) went into place [01:05:09] so really need to compare current pecl client times to times from before we started touching anything [01:05:20] gah, duh [01:06:20] current tp99 is actually much better than the old [01:06:26] yeah, tp90 and tp50 are better [01:06:28] phew! 
i thought it was worse, heh [01:06:28] yep [01:06:40] * AaronSchulz looks at the 14 day graph [01:07:22] don't have to go republican, hide the bad graphs, and claim mission accomplished [01:07:59] tp99 was hitting 290ms, now tops at 72ms [01:08:49] binasher: do you still want to remove pc1 eventually? [01:09:14] yes, but no :( [01:09:30] we're actually going to deploy pc2-3 [01:10:07] slaves or sharded? [01:10:13] the initial reason for this is for pcache replication, with the idea that all reads would generally be from memcached [01:10:27] sharded, with binlogging to replicate to eqiad [01:10:41] was sharding support ever added? [01:10:49] * AaronSchulz doesn't remember that [01:11:04] binasher: I guess you could use mysql federated tables :p [01:11:08] no, tim said it'd be easy though [01:11:25] there's already table sharding [01:12:03] the better logging from the pecl client shows that some pcache objects are >1MB though, so that's another reason to have something backing memcached [01:12:23] sigh...I guess that makes sense [01:14:29] http://bit.ly/T05KXq [01:14:55] paravoid: re: question on performance, see the max times ^^ [01:15:21] yeah, I've been following :-) [01:15:30] thanks [01:15:45] impressive [01:16:16] you should mail that! [01:19:11] writing an email now :) [01:19:19] New patchset: Kaldari; "Enableing upload_by_url on test2 for Flickr uploading tests" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33300 [01:20:35] kaldari: \o/ [01:21:02] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33300 [01:21:30] ori-l: The future is coming! [01:22:26] kaldari: browsing CC content on flickr and uploading it to commons is how i procrastinated on at least a half-dozen papers in grad school [01:22:37] so i'm sure you're destroying someone's academic career somewhere :P [01:22:56] just wait until Fabrice finds out about it! 
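[Editor's note: the tp50/tp90/tp99 figures binasher and AaronSchulz compare above are percentiles over the raw memcached get() timings. A quick sketch of how such a figure is derived, using the simple nearest-rank definition and synthetic timings:]

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample value such that
    at least p percent of the samples are <= it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]

timings_ms = list(range(1, 101))  # stand-in for real request timings
tp50 = percentile(timings_ms, 50)
tp99 = percentile(timings_ms, 99)
```

This is why tp99 is so sensitive to a handful of slow outliers (the 290ms-to-72ms improvement mentioned below) while tp50 barely moves: the top percentiles only look at the worst tail of the distribution.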
[01:23:07] he has like 20,000 free photos on Flickr [01:23:17] that's over 9000!? [01:23:17] heh [01:24:02] Reedy: for the last time, we can't turn on image uploads from /b/ [01:24:13] That's what you think... [01:24:14] you remember what happened last time. [01:24:18] http://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&m=swift_object_count&h=Swift+pmtpa+prod&c=Swift+pmtpa fwiw [01:24:29] but think of the LULZ! [01:24:35] kind of an interesting slope [01:24:50] you can see the effect of WLM I think [01:25:38] that's a lot of objects [01:25:50] this includes thumbs [01:25:59] glad you guys are ready for the flood ;) [01:26:01] that we never delete [01:26:42] Personally, I think all thumbs should have TTLs [01:27:09] there have been discussions about that [01:27:13] it's not that simple [01:27:22] yes, easier said than done [01:27:25] TTLs probably wouldn't work, we'd need some kind of LRU [01:27:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:27:40] We do at least have our wgThumbnailEpoch ;) [01:27:43] LRU? [01:27:52] least recently used [01:28:16] Even commonly used thumbs should be regenerated every few years [01:29:04] for updates to ImageMagic and whatever we use for SVG rendering [01:29:30] kaldari: that's what the thumbnail epoch is for, sort of [01:29:49] Oh! [01:30:00] $wgThumbnailEpoch = '20120101000000'; [01:30:24] sleep now [01:30:28] talk to you tomorrow [01:30:38] that's exciting [01:30:45] for me at least :) [01:30:45] which reminds me [01:30:58] I think there was one of the other epochs that wanted bumping up again [01:34:56] binasher: https://gerrit.wikimedia.org/r/#/c/29736/ [01:35:14] binasher: can you run that schema change (after truncating zhwiki again or something)? [01:35:30] AaronSchulz: sure.. mind if its tomorrow? [01:35:34] Reedy: is there a bot making daily edits there or what? 
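[Editor's note: the LRU eviction policy suggested above for thumbnails — evict by recency of use rather than by a fixed TTL — can be sketched in a few lines. This is an in-memory toy for illustration, not anything Swift supports out of the box:]

```python
from collections import OrderedDict

class LRUCache:
    """Toy least-recently-used cache: reads refresh an entry's
    recency, and inserting past capacity evicts the stalest entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # touching "a" makes "b" the eviction candidate
cache.put("c", 3)    # evicts "b"
```

The advantage over a TTL, as the discussion notes, is that a commonly requested thumb never expires under load, while thumbs nobody requests age out naturally.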
[01:35:55] binasher: that's fine [01:36:35] No idea [01:36:38] Possibly [01:38:36] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [01:39:23] AaronSchulz: Maybe the great firewall of china is broken [01:39:56] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 263 seconds [01:40:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.456 seconds [01:41:53] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 255 seconds [01:43:32] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [01:43:36] who moderates the engineering list? [01:44:15] my email went into moderation due to the attached graph being >40k [01:44:42] and was apparently rejected on wikitech-l due to "The message's content type was not explicitly allowed", though it was just typed in gmail's web interface [01:46:32] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [01:51:16] binasher: rachel, iirc [02:00:16] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 267 seconds [02:00:25] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 276 seconds [02:14:59] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [02:15:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:15:34] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [02:23:13] PROBLEM - Squid on brewster is CRITICAL: Connection refused [02:29:53] !log LocalisationUpdate completed (1.21wmf3) at Wed Nov 14 02:29:53 UTC 2012 [02:30:01] Logged the message, Master [02:31:19] RECOVERY - Squid on brewster is OK: TCP OK - 0.007 second response time on port 8080 [02:31:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [02:47:40] PROBLEM - 
Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [02:49:19] PROBLEM - Squid on brewster is CRITICAL: Connection refused [02:50:05] !log LocalisationUpdate completed (1.21wmf4) at Wed Nov 14 02:50:04 UTC 2012 [02:50:12] Logged the message, Master [02:55:28] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Wed Nov 14 02:55:18 UTC 2012 [03:12:34] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [03:17:31] RECOVERY - Squid on brewster is OK: TCP OK - 0.003 second response time on port 8080 [03:22:29] PROBLEM - Squid on brewster is CRITICAL: Connection refused [03:23:13] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 21 seconds [04:17:40] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [04:17:40] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [04:17:40] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [05:34:23] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [06:50:00] RECOVERY - Squid on brewster is OK: TCP OK - 0.001 second response time on port 8080 [08:24:09] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [08:35:33] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.005 second response time on port 11000 [08:36:14] morning :-] [08:40:40] apergos: mark: paravoid: hello :-]  The gerrit service running on manganese keep restarting and I am wondering if that might be caused by puppet. The service is subscribed to three files. Would it be possible to run puppetd -tv on manganese and paste me the result please ? [08:41:08] The last Puppet run was at Wed Nov 14 08:13:56 UTC 2012 (27 minutes ago). 
[08:41:10] running now [08:41:54] \O/ [08:42:08] I am not sure how puppet handled file subscribption [08:42:25] I suspect it regenerate the templates on every run and end up restarting the service [08:44:27] run completed [08:44:56] http://p.defau.lt/?Uf15oRqRbFY3MIgJa1lp2A [08:45:01] you rock [08:45:18] :-) [08:45:43] so it end up refreshing replication.config and restarting it :( [08:45:47] info: /Stage[main]/Gerrit::Jetty/File[/var/lib/gerrit2/review_site/etc/replication.config]: Scheduling refresh of Service[gerrit] [08:45:56] apergos: mind running it again? [08:46:21] maybe gerrit rewrite the configuration file when it is started [08:46:32] ok [08:48:57] ohhh [08:49:16] the template use a foreach loop on an array of settings to generate the file [08:49:37] that might end up using some random order [08:49:44] http://p.defau.lt/?DoykSQyCFWBZoFEIAbAP6g [08:49:49] which would lead to the file configuration changing from time to time [08:51:17] apergos: will debug it on my local config and submit a patch. Thanks a lot! [08:52:45] ok, thanks for working on this, it's been an annoying bug! [08:53:06] definitely [08:53:27] we have been exchanging a few emails with Chad over the week-end [08:53:39] and it magically appeared to me this morning that puppet might be restarting it [08:56:34] sweet! 
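[Editor's note: the Gerrit fix hashar describes above boils down to iterating the settings hash in a stable order before writing the template, so successive puppet runs emit byte-identical files and never trigger a spurious service refresh. The idea, sketched in Python with made-up setting names rather than the actual erb template:]

```python
def render_config(settings):
    """Render key = value lines in sorted key order, so the output is
    byte-for-byte identical regardless of how the mapping was built."""
    return "".join(f"{key} = {value}\n"
                   for key, value in sorted(settings.items()))

# Two mappings with the same content but different build order:
a = {"url": "git@example.org", "threads": 4}
b = {"threads": 4, "url": "git@example.org"}
assert render_config(a) == render_config(b)
```

Without the sort, a template that loops over an unordered hash can emit the same settings in a different order on each run; puppet then sees the file as changed, schedules a refresh of the subscribed service, and the service restarts on every puppet run — exactly the symptom seen with Gerrit on manganese.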
[08:58:28] just have to find out how to pass variables to an erb template from the command line [10:13:44] New patchset: Tim Starling; "Disable UniversalLanguageSelector" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33318 [10:38:25] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [10:38:25] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [10:43:10] New patchset: Hashar; "prevents puppet from restarting Gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33321 [10:44:02] ahoho need to find out the bug report [10:45:49] apergos: https://gerrit.wikimedia.org/r/33321 should fix up the Gerrit random restart [10:58:36] New review: Nikerabbit; "We call this throwing out the baby with the bath water." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/33318 [11:30:49] PROBLEM - SSH on singer is CRITICAL: Server answer: [11:31:43] PROBLEM - SSH on virt0 is CRITICAL: Server answer: [11:31:52] PROBLEM - SSH on hydrogen is CRITICAL: Server answer: [11:32:10] PROBLEM - SSH on chromium is CRITICAL: Server answer: [11:32:19] RECOVERY - SSH on singer is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [11:32:46] PROBLEM - SSH on kaulen is CRITICAL: Server answer: [11:33:13] PROBLEM - SSH on pdf2 is CRITICAL: Server answer: [11:33:31] PROBLEM - SSH on fenari is CRITICAL: Server answer: [11:33:31] RECOVERY - SSH on hydrogen is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:34:07] RECOVERY - SSH on chromium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:34:52] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [11:34:52] RECOVERY - SSH on virt0 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:35:10] RECOVERY - SSH on fenari is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [11:36:13] RECOVERY - SSH on kaulen is OK: SSH OK - 
OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [11:36:13] PROBLEM - SSH on zirconium is CRITICAL: Server answer: [11:36:31] PROBLEM - SSH on sq76 is CRITICAL: Server answer: [11:37:52] RECOVERY - SSH on zirconium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:38:10] PROBLEM - SSH on sq53 is CRITICAL: Server answer: [11:38:10] RECOVERY - SSH on sq76 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [11:39:49] RECOVERY - SSH on sq53 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [11:40:35] PROBLEM - SSH on spence is CRITICAL: Server answer: [11:42:13] RECOVERY - SSH on spence is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [11:47:43] New review: Siebrand; ">Diederik 00:53" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/12188 [11:50:41] New review: Aude; "Regardless of the merits of this change or not, we have a handful of places in Wikibase that use bit..." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/33318 [11:51:27] New review: Aude; "of course, getting a random language is not nice but the entire JS breaking is worse." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/33318 [11:52:16] PROBLEM - SSH on lvs1003 is CRITICAL: Server answer: [11:53:55] RECOVERY - SSH on lvs1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:54:22] PROBLEM - SSH on nitrogen is CRITICAL: Server answer: [11:56:01] RECOVERY - SSH on nitrogen is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:59:58] New review: Hashar; "Since lot of files got renamed and tweaked at the same time, this patchset is best viewed by tweakin..." 
[operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/32167 [12:00:13] PROBLEM - SSH on ssl1001 is CRITICAL: Server answer: [12:01:52] RECOVERY - SSH on ssl1001 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [12:06:49] New patchset: Dereckson; "(bug 41831) Set autoconfirm on ru.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33329 [12:07:07] PROBLEM - SSH on sq33 is CRITICAL: Server answer: [12:08:46] RECOVERY - SSH on sq33 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [12:10:55] New review: Hashar; "The shell script MWRealm.sh should reuse the code from MWRealm.php, we really want to avoid duplicat..." [operations/mediawiki-multiversion] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/32168 [12:16:34] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [12:30:49] PROBLEM - check_gcsip on payments2 is CRITICAL: Connection timed out [12:30:49] PROBLEM - check_gcsip on payments4 is CRITICAL: Connection timed out [12:30:49] PROBLEM - check_gcsip on payments3 is CRITICAL: Connection timed out [12:30:49] PROBLEM - check_gcsip on payments1 is CRITICAL: Connection timed out [12:31:18] PROBLEM - check_gcsip on payments1001 is CRITICAL: CRITICAL - Socket timeout after 61 seconds [12:31:18] PROBLEM - check_gcsip on payments1002 is CRITICAL: CRITICAL - Socket timeout after 61 seconds [12:31:18] PROBLEM - check_gcsip on payments1003 is CRITICAL: CRITICAL - Socket timeout after 61 seconds [12:31:18] PROBLEM - check_gcsip on payments1004 is CRITICAL: CRITICAL - Socket timeout after 61 seconds [12:48:52] New patchset: Mark Bergsma; "Install ms-be30xx without a large RAID1 partition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33332 [12:49:04] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [12:49:07] Change merged: Mark Bergsma; [operations/puppet] (production) - 
https://gerrit.wikimedia.org/r/33332 [12:52:22] !log Reinstall ms-be3001 [12:52:29] Logged the message, Master [12:57:01] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [12:59:27] New patchset: Mark Bergsma; "Add coffee" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33333 [13:00:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33333 [13:01:12] apergos: did you rebalance account/container on ms-be6? [13:01:42] no, it wouldn't rebalance them, (i.e. I ran it and it refused to do a rebalance) [13:03:04] strange [13:03:20] the strange part is that it says balance -33 on the new entries [13:04:26] *sigh* [13:09:14] Cowardly refusing to save rebalance as it did not change at least 1% [13:09:21] that's what it told me when I tried to rebalance them [13:09:36] because it just was a change to the port number, not to the partitions themselves I guess [13:09:59] maybe if we changed the weights [13:10:59] apergos: I probably fixed Gerrit restart madness https://gerrit.wikimedia.org/r/#/c/33321/ [13:11:22] I saw you pinged me on the change, I'll have a look at it [13:11:27] nice :-] [13:11:31] hopefully that will fix it [13:11:47] that would be pretty great alright [13:11:48] there might be other places where we have a similar issue [13:13:43] New review: Diederik; "Siebrand, this Change is On The critical path of The analytics team, there are some more changes tha..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/12188 [13:18:10] hashar: do you want demon to look at this change or can I just +2 it and merge it?
[13:20:39] apergos: I guess just +2 [13:20:49] apergos: Chad is attending a Gerrit conference for the whole week [13:20:54] ok [13:20:58] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33321 [13:21:13] that is merely making sure the order of parameters is sane [13:21:21] yep, I got that [13:21:27] mind applying it to manganese and running puppetd -tv twice ? [13:21:35] the second run should not show any conf change [13:21:48] (nor a third run if you are willing to double test it :-) [13:22:08] shall I puppetd on manganese then? [13:22:14] ah I see you already asked me :-D [13:22:15] sec [13:22:48] yeah I'll do three runs, no extra work long as I'm there [13:25:59] yep, second run no changes, running one more time for double-check [13:26:08] and third run is good too, good job! [13:28:29] apergos: you are the best [13:28:36] that should fix it hopefully [13:28:40] actually, you are: you fixed gerrit! [13:29:06] I'm going to keep that lesson in mind though when generating files based on hashes in puppet [14:18:56] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [14:18:56] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [14:18:56] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [14:20:18] apergos: Which cache epoch was it that we needed to update again?
[14:20:38] we don't want to I think [14:20:48] Oh, fair enough then [14:21:25] there was a 60 day period, I have to check my notes, anyways in early dec when we should see if it's all ok [14:47:17] New patchset: Hashar; "Unmount /mnt/thumbs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33202 [15:01:22] New patchset: Faidon; "reprepro: use the Ceph repository as an upstream" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33352 [15:04:01] hashar: as I said earlier, I'd prefer it if labs was going to use /data/foo directly from MW's config... [15:04:24] I mean, this whole indirection is useless, and it doesn't get you closer to production anyway [15:04:58] the reason is that the paths are hardcoded in various places :-D [15:05:04] anyway /mnt/thumbs never got used on beta [15:05:19] are they? [15:05:28] I mounted netapp at /mnt/thumbs2 [15:05:41] so, if they're hardcoded, this is going to be a problem :) [15:13:33] /mnt/thumbs is still described in the wmf-config/filebackend.php file though it might not be actually used [15:16:03] 2012-11-14 15:08:47 mw68 commonswiki: FileBackendMultiWrite failed sync check: ["mwstore:\/\/local-multiwrite\/local-public\/c\/c8\/Segwun2009.jpg"] [15:16:05] in production [15:16:06] doh [15:16:30] apparently the local repository is configured to use a multi write file backend [15:16:40] which write to both swift and /mnt/thumbs/ [15:24:24] New review: Hashar; "/mnt/thumbs is still described in the wmf-config/filebackend.php file though it might not be actual..." 
[operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/33202 [15:30:34] New patchset: Pyoungmeister; "removing erroneous comments in db-secondary.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33355 [15:32:26] New patchset: Pyoungmeister; "removing erroneous comments in db-secondary.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33355 [15:36:06] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [15:41:28] New patchset: Pyoungmeister; "removing erroneous comments in db-secondary.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33355 [15:45:19] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13429 [16:00:12] New patchset: Faidon; "secure.wikimedia.org: also redirect /" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33360 [16:01:04] New patchset: Jgreen; "banner logging adjustment" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33361 [16:02:12] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33361 [16:05:31] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33360 [16:09:56] Oooh [16:10:03] paravoid: Is that going live? [16:12:01] heh [16:12:06] so should i name a server Schickard? [16:12:16] * RobH is having to assign a new tampa based misc server [16:12:45] * RobH is out of encyclopedians and is onto computer scientists [16:13:00] Cimrman [16:13:37] thats not a real person ;p [16:13:49] you know him? ;-) [16:13:58] well, the enwiki page says its not [16:14:05] and who doesnt trust wikipedia. [16:15:12] https://en.wikipedia.org/wiki/Jan_Otto [16:15:38] ohh [16:15:40] i like it. 
[16:15:41] the biggest Czech encyclopedia [16:15:43] ever [16:15:49] and the person is dead, so thats good for server names [16:16:01] (before wikipedia which beat it quite recently) [16:16:16] Reedy: it did [16:16:21] I'm mailing wikitech now [16:16:32] Danny_B|backup: well, now new otrs dev box is called otto. [16:16:37] you have named a wiki server ;] [16:17:23] when is baptism? [16:17:38] for a server, thats when the label with its name is applied? [16:17:41] cuz then later today [16:17:42] do I get automatic access? [16:17:50] lol [16:17:52] ottomata: nope, but i figured you would like the name ;] [16:17:57] hah, i like it [16:18:04] unlike apergos who made us kill the server named ariel [16:18:07] but it means that whenever it goes down I'm going to get chat notifications about it! [16:18:15] I did. and I enjoyed it [16:18:19] heh, i didnt think about that [16:18:25] ottomata: i can find a different name if you care =] [16:18:29] haha, not really [16:18:30] is it any special server? [16:18:31] but then im gonna make you help [16:18:40] how about Treccani [16:18:41] Danny_B|backup: new otrs dev box [16:18:50] so we can actually upgrade otrs [16:18:50] we should also have a Mill server I think [16:19:13] so the new otrs will be running on otto? [16:19:25] dunno if new for users will be [16:19:32] but new for the dev who is trying to make an upgrade happen [16:19:34] 50% name match :o) [16:19:43] it may or may not become new actual server, dunno [16:19:49] "actually upgrade otrs" <---- am I dreaming [16:20:20] Nemo_bis: ask somebody around to hit you and you'll recognize...
;-) [16:21:23] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33355 [16:21:24] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:21:51] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:21:51] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:21:51] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:22:23] !log py synchronized wmf-config/db-secondary.php 'updating comments in db-secondary.php' [16:22:30] Logged the message, Master [16:22:43] Nemo_bis: dont quote me on that. [16:22:50] i just assign the servers, im not doing the upgrade [16:25:30] RobH: I know :D [16:25:46] but still it's a miracle to see even the slightest move on this after so many years [16:26:54] yep, i just assigned a new server, and it will be put into our dev sandbox lan by network admins, then we are letting some dev go to town on the box [16:28:36] RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 3 processes with command name varnishncsa [16:30:39] New patchset: Pyoungmeister; "setting db61 to use coredb module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33377 [16:31:49] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33377 [16:36:51] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [16:39:22] !log server wmf5815 named 'otto' after Jan Otto, name suggested by DannyB [16:39:29] Logged the message, RobH [16:41:17] hi robh! [16:41:24] heh [16:41:26] everytime i get a chat notification for that i'm going to say hi to you! [16:41:27] hehehh [16:41:46] you are going to make me rename the server by doing that =P [16:41:50] i asked you! 
[16:41:53] you had the option! [16:42:01] (of breaking Danny_B|backup's heart) [16:42:33] ottomata: does your pattern match with server_otto too? [16:42:43] cuz if not im putting that in logs when i do. [16:42:45] ;] [16:43:03] wait, try again [16:43:14] otto [16:43:16] server_otto [16:43:26] yup [16:43:28] oh [16:43:29] someone said otto [16:43:31] invalid test [16:43:33] do it again [16:43:34] server_otto [16:43:41] nop [16:43:42] e [16:43:43] heh [16:43:45] doesn't match [16:43:46] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [16:43:46] cool! [16:43:47] success \o/ [16:43:49] haha [16:43:54] um, but I don't really care [16:43:56] i think its kinda funny [16:44:08] you do whatchu want [16:44:34] i wanna name them things like eiximenis [16:44:38] but folks didnt like that ;] [16:44:51] so when people say otto is down you wll have only yourself to blame! [16:45:01] but more likely the conversation will go like this [16:45:03] what is eiximenis actually? [16:45:12] "otto is acting up again, can someone take a look?" [16:45:23] Danny_B|backup: it was the name of a server [16:45:26] eximenisadpoetewt8965 is .... 
I forget what it is [16:45:27] i know [16:45:32] except that I could never spell it :-D [16:45:34] but the meaning of the word [16:45:34] famous encyclopedian [16:45:37] aha [16:45:48] * Danny_B|backup should wikipedia it [16:46:10] Ahh, found it [16:46:11] http://en.wikipedia.org/wiki/Category:Encyclopedists [16:46:16] and then I will have to really act up [16:46:23] * RobH rebookmarks that [16:46:24] http://en.wikipedia.org/wiki/Francesc_Eiximenis [16:46:37] https://en.wikipedia.org/wiki/Category:Encyclopedists gives many suggestions [16:46:48] RobH: how about Chmielowski [16:46:48] Nemo_bis: yep, i just pasted that ;] [16:46:57] Nemo_bis: i like it [16:47:04] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [16:47:07] but i dont think others will agree, heh [16:47:12] al-Qalqashandi [16:47:18] ahh man, good thing its already called otto [16:47:20] it's not hard [16:47:22] cuz i would call it mexia now [16:47:26] thats a good name. [16:47:33] :D [16:47:51] He maintained correspondence with Erasmus of Rotterdam, Luis Vives and Juan Gines de Sepulveda [16:47:52] RobH: do we have "wales" server as well? ;-) [16:47:55] <^demon> I miss eiximenis :p [16:48:04] ^demon: no one appreciated it like we did. [16:48:21] I sure didn't. [16:48:33] wasn't eiximenis where etherpad is? [16:48:33] <^demon> Elements are boring names for servers. [16:48:34] Danny_B|backup: i dont name them after live folks [16:48:41] <^demon> Danny_B|backup: Was, but yes. [16:48:43] ^demon: we disagree, that was pretty much all me. [16:48:51] i like elements, i dont have to dig for names. [16:49:01] plus carbon is the install server [16:49:05] that is comedy people. [16:49:21] (only biochem nerds like that one) [16:49:23] RobH: which element number are we on so far? [16:49:33] i mean how many elements we used?
[16:50:23] lots [16:50:46] 49 [16:50:50] we are up to indium [16:50:59] tin is next [16:51:09] I can't wait for unobtainium [16:51:21] so can't i ;-) [16:51:28] we need to use that for ... hmm... how about the new search system [16:51:30] http://en.wikipedia.org/wiki/Ununpentium [16:51:30] :-P [16:52:19] PROBLEM - MySQL Replication Heartbeat on db1003 is CRITICAL: CRIT replication delay 232 seconds [16:52:19] PROBLEM - MySQL Replication Heartbeat on db11 is CRITICAL: CRIT replication delay 232 seconds [16:52:19] PROBLEM - MySQL Replication Heartbeat on db1010 is CRITICAL: CRIT replication delay 232 seconds [16:52:34] That'll be me, again [16:52:37] PROBLEM - MySQL Slave Delay on db11 is CRITICAL: CRIT replication delay 251 seconds [16:52:40] It'll sort itself out in a few [16:52:46] PROBLEM - MySQL Slave Delay on db39 is CRITICAL: CRIT replication delay 259 seconds [16:52:56] PROBLEM - MySQL Replication Heartbeat on db39 is CRITICAL: CRIT replication delay 269 seconds [16:53:31] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 304 seconds [16:53:58] RECOVERY - MySQL Replication Heartbeat on db1003 is OK: OK replication delay 0 seconds [16:53:58] RECOVERY - MySQL Replication Heartbeat on db1010 is OK: OK replication delay 0 seconds [16:54:34] RECOVERY - MySQL Replication Heartbeat on db39 is OK: OK replication delay 0 seconds [16:54:34] RECOVERY - MySQL Slave Delay on db39 is OK: OK replication delay 0 seconds [16:54:41] <^demon> RobH: You and I should get californium. [16:54:52] im on the page right now [16:54:57] heh [16:55:05] i was thinking 'i wanna go to califorium' [16:55:10] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [16:55:15] was discovered at berkley! [16:55:23] berkeley even [16:55:23] RobH: you can use all names up to "Buo" [16:55:48] isn't there berkelium? 
[16:56:00] yea, bk [16:56:37] it decays into califorium [16:56:47] this amuses me [16:59:08] * Danny_B|backup needs to wait for dubnium for his server [17:09:20] I've just been talking to some people about the secure.wm.o deprecation [17:09:42] Apparently one of them was abusing it to work around cross-domain restrictions for JS requests for global renaming [17:09:59] indeed [17:10:59] hmm? [17:13:55] PROBLEM - MySQL Replication Heartbeat on db11 is CRITICAL: CRIT replication delay 264 seconds [17:13:55] PROBLEM - MySQL Slave Delay on db11 is CRITICAL: CRIT replication delay 265 seconds [17:16:15] Krenair: It's not like the move away from secure... wasn't telegraphed for the past year. [17:16:23] Krenair- Have they looked into CORS instead? [17:18:41] heyaaaaa, LeslieCarr, I know we don't know what magic made the analytics ganglia stuff start working [17:18:52] but I've got a buncha new servers I'd like to get in the ganglia group now [17:19:00] not sure if there was something special we did to give them a little kick [17:19:33] they've been up for a couple of days now [17:19:49] and gmond.conf is configured properly [17:19:52] was hoping they'd just join [17:19:57] but maybe I have to wait for more magic [17:23:20] Hmm [17:23:34] The number of these oai audit rows is growing by a lot per week from mid 2012 [17:25:53] Meh, just july [17:30:30] New review: Aaron Schulz; "Those sync errors are about originals on nas vs swift...it doesn't have to do with this change. The ..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/33202 [17:36:20] ottomata: i don't think there was any magic [17:36:26] they should just show up ... [17:36:51] ok, that's what I thought. They don't yet, but maybe they will one day [17:38:08] what's one of the servers? 
[17:38:32] ottomata: [17:38:48] analytics1021 [17:38:52] or analytics1011 [17:38:58] i'm working with those to figure things out [17:39:00] ok, let me hunt around some [17:39:02] ok coo [17:39:03] danke [17:39:07] I'll look at 1011, ok? [17:39:08] sure [17:39:23] if you tcpdump, you can see it sending packets to the correct multicast addy [17:39:27] analytics1011.eqiad.wmnet.46321 > 239.192.1.32.8649 [17:39:31] ok [17:39:39] the aggregator is analytics1003 [17:39:58] if you tcpdump there, you will only see traffic from the current nodes that ganglia knows about [17:40:03] analytics1001-analytics1010 [17:40:51] hrm … [17:41:34] this was pretty much what happened to us when LeslieCarr and I were trying to set up this ganglia group with an01-an10 [17:41:36] yeah I see that in lsof (the address) [17:41:48] and then it started working [17:41:51] wtf [17:41:54] everything looked like it should work, but it didn't [17:41:56] we took a break [17:41:58] i mean, restarting ganglia worked for a minute [17:42:01] i came back to check it a week later [17:42:04] and then they were there [17:42:12] yeah, i tried restarting gmond on an21 [17:42:13] ok, so these are in a different rack [17:42:15] didn't restart it on an03 though [17:42:16] lemme double check the configs [17:42:18] ok [17:42:24] i do want to set up another aggregator [17:42:27] probably on one of these guys [17:42:31] if it helps I'm ready to do that [17:42:40] i want to solve the issue first :) [17:42:42] ok [17:42:51] besides, i'm not sure what ganglia does if aggregators have different info [17:43:45] maybe the aggregator drops em [17:44:26] different info?
[17:46:09] sigh [17:46:11] wtf asw-c [17:46:28] !log oai.oaiaudit cleaned out on S3 [17:46:33] note that 1002 doesn't report [17:46:34] Logged the message, Master [17:46:40] it's also a 10.64.x host [17:47:51] nah, irrelevant [17:48:29] 1002 is kinda offline [17:48:30] but 10.64.36 != 10.64.21 so maybe there's an issue there [17:48:35] it just came back up yesterday [17:48:36] for the new hosts I mean [17:48:38] with RobH's help [17:48:41] it isn't puppetized yet [17:48:54] servers fear me. [17:49:00] they oughta [17:51:58] ottomata: asw-c isn't bringing up one of the aggregated ethernets either [17:52:05] looks like me and it need to have a nice long "chat" [17:52:07] perhaps with a hammer [17:52:09] ooooo [17:52:14] hammer time [17:52:14] uh oh [17:52:33] LeslieCarr: can i beg https://rt.wikimedia.org/Ticket/Display.html?id=3687 off you? [17:52:44] just some labels and vlan assignments for 3 servers in eqiad [17:52:57] then i can install and deploy them =] [17:53:03] ok [17:53:34] are their second interfaces no longer in use ? [17:53:47] they had two interfaces? [17:54:04] i dont recall, if so can you relabel and set both just in case? [17:54:18] i will check and see if the secondary is still needed, but if not i can drop ticket and unplug [17:54:22] and it wont hold up use [17:54:51] also tampa server vlan https://rt.wikimedia.org/Ticket/Display.html?id=3912 would be nice to hand off but is less pressing [17:55:06] as i just assigned it today, its not been waiting on networking at all =] [17:55:37] well one way they'd need to be in lacp bundles, the other way, no bundling [17:56:03] oh, lets just kill it off then [17:56:03] oh, they've already been done [17:56:07] disable the secondary port [17:56:08] the networking for pc1001-1003 [17:56:10] oh?
[17:56:20] !log aaron synchronized php-1.21wmf3/extensions/TimedMediaHandler 'deployed 275aa9524ea6e1e15d7eef3d81c68dde3ffd5693' [17:56:26] the parsercache in tampa doesnt have two network connections i dont thin [17:56:26] Logged the message, Master [17:56:26] k [17:56:28] lemme check [17:56:41] i think the two is just legacy from labs. [17:56:53] yea, it is [17:57:07] LeslieCarr: So the networking for port1 on each is already done and the secondary connection is killed? [17:57:17] cuz i dont wanna bridge labs to production via a system =P [17:57:20] yeah [17:57:21] hehehe [17:57:25] good to check :) [17:57:26] awesome, thank you! [17:58:07] New review: Anomie; "> The shell script MWRealm.sh should reuse the code from MWRealm.php, we really" [operations/mediawiki-multiversion] (master) C: 0; - https://gerrit.wikimedia.org/r/32168 [18:00:00] New patchset: RobH; "renamed virt1001-1003 to pc1001-1003" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33389 [18:00:14] cmjohnson1: heya, i dropped a ticket for a server label in pmtpa queue for you while you were away [18:00:15] just fyii [18:01:04] yep...i see it...it's port 33 (btw) [18:01:05] so since this is a "sending on one end but packets not seen on other end" problem and not a puppet problem, I'm off this case, right otto? [18:01:57] RobH: wondering, how's the ulsfo dell order coming along ? [18:02:17] New review: Anomie; "As you commented over there, it really seems that multiversion should just be merged here." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/32167 [18:02:26] universal language selector foo? 
:-P [18:03:10] New patchset: Anomie; "Allow per-realm and per-datacenter configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32167 [18:04:06] er ottomata [18:05:02] New patchset: Nemo bis; "(bug 42105) Restore normal bureaucrat permissions where changed without consensus" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33390 [18:05:09] someone other than me review https://gerrit.wikimedia.org/r/#/c/33389/1 [18:05:15] or im just gonna self review [18:05:19] (bad rob) [18:05:47] New review: RobH; "i cant help myself" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/33389 [18:05:48] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33389 [18:06:11] hahaha [18:06:15] I was gonna +2 it too [18:06:21] hehe [18:06:26] i thought it would be funny. [18:06:30] but make you do your own merge on sockpuppet :-P [18:06:37] I'll smileyface it [18:12:08] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [18:18:43] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.002 second response time on port 11000 [18:19:01] I am wondering if memcached is restarted by puppet :-] [18:19:08] dinner time anywa [18:19:08] y [18:24:42] !log patching Bugzilla to 4.0.9 [18:24:48] Logged the message, Master [18:28:15] !log temp stopping puppet on brewster [18:28:20] Logged the message, notpeter [18:30:35] apergos: ms7 cruft? saw the cortado update yesterday? [18:30:50] I did not see it [18:31:12] it was already moved though I thought [18:31:38] well, it wasn't with TMH, but should be now [18:31:42] ah [18:31:55] well that's good to know (it's in the "already moved" list anyways) [18:32:27] did the timeline stuff you and aaron were working on happen? 
[18:34:21] paravoid: http://commons.wikimedia.org/wiki/Commons:Village_pump#File:Lamppost-singapore.jpg [18:34:34] hmm, some exception should probably get caught there or something [18:44:02] New patchset: Mwalker; "CN Fundraising Cookie Expiration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33393 [18:46:26] Change merged: Pgehres; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33393 [19:03:53] !log Running sync-common on mw48 [19:03:59] Logged the message, Master [19:07:07] AaronSchulz: hey [19:07:14] yeah, that's the urlopen exception [19:07:23] we're handling only the http exceptions, not the socket refused [19:08:46] I'm not sure what the correct behavior would be though... [19:17:04] does it currently give a 503 for that exception? [19:20:27] AaronSchulz: https://gerrit.wikimedia.org/r/#/c/33202/ ? [19:20:54] you replied on hashar's comment but didn't +1/+2, is that on purpose? [19:21:09] well, +1, I guess you don't have +2 on puppet :) [19:22:57] New patchset: Pyoungmeister; "coredb fix 1: snapshot needs to inherit role::coredb::common" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33397 [19:23:16] hey LeslieCarr, you still around?
[19:23:25] I have a related but slightly different ganglia problem [19:23:30] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33397 [19:24:15] ottomata: i left and came back [19:25:21] so, the machines that are working with ganglia right now, all work great, except for an01 [19:25:39] hadoop is configured to send ganglia metrics to the mcast addy, and an03 (the aggregator) sends those stats to gmetad [19:25:45] and they show up at ganglia.wikimedia.org [19:25:47] just fine [19:25:53] for everything except analytics1001 [19:25:54] now [19:25:56] on an01 [19:26:06] if I do [19:26:07] tcpdump -A 'udp port 8649' [19:26:18] I can see a bunch of java and hadoop related metrics going out [19:26:25] on an03 [19:26:26] i do [19:26:45] tcpdump -A 'udp port 8649 and host 239.192.1.32' | grep analytics1001 [19:27:21] and I only see default ganglia stats showing up there [19:27:36] load, heartbeats, memory, swap space [19:27:36] etc. [19:27:41] (although, now that I'm looking at it, maybe grepping for analytics1001 in the output isn't going to show me everything?) [19:28:07] ottomata: does it show you all the junk for an1002? [19:28:18] like, all the stats you want on one that is sending them? [19:28:25] an1002 is down, but i will check an05 (pretty sure it does) [19:28:53] notpeter, yes it does [19:29:07] problem != grep [19:29:09] yay! [19:29:09] jvm.JvmMetrics.ThreadsRunnable, yarn.NodeManagerMetrics.ContainersRunning etc [19:29:14] hehe, thanks for the double check! [19:29:14] heheh [19:29:27] what about when you grep on an1001 ? [19:29:30] do you see it sending? [19:29:36] yes [19:29:39] ah [19:29:47] this is network magic. my powers are useless [19:29:56] sorry :( [19:30:00] analytics1001.wikimedia.org.6429 > 239.192.1.32.8649 [19:30:00] analytics1001.wikimedia.org....)dfs.FSNamesystem.PendingReplicationBlocks [19:30:01] etc. [19:30:04] yeah [19:30:04] rats!
[19:30:06] yeah me too [19:30:30] ganglia seems quite difficult to debug! i guess its all the mcast stuff it likes to do [19:30:36] if things get lost, its pretty hard to find out why [19:30:43] yeah [19:30:44] or at least, for my ganglia noob self [19:30:57] well, sounds like ganglia is sending just fine [19:31:18] right, since it is getting some stats [19:31:59] apergos, any ideas? [19:32:02] well, and because you see it sending from an1001 [19:32:48] right, well, i see it sending all metrics (java and default) to the mcast addy from an01 [19:33:04] I didn't see anything coming in from an1011 on the aggregator [19:33:24] if we're on another issue, [19:33:33] unless it's an emergency I'm pretty done for the day [19:33:43] New patchset: Pyoungmeister; "coredb fix 2: needs moar scope" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33401 [19:34:00] and I see an03 (the aggregator) seeing the default metrics from an01 on the mcast addy [19:34:00] yeah [19:34:02] me neither, that's something else at the moment [19:34:02] LeslieCarr says she needs to bash a switch or something [19:34:04] but, i have another problem [19:34:07] analytics1001 is generating a buncha metrics for hadoop, as well as the default ones (load, mem, etc.) [19:34:12] awwww ok! [19:34:14] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33401 [19:34:15] yeah its late over there [19:34:41] then why did you make it generate more metrics ? [19:34:42] duh [19:35:20] duhhhhhh [19:37:59] LeslieCarr, do you know, in this line: [19:38:01] 19:32:11.549537 IP analytics1001.wikimedia.org.40547 > 239.192.1.32.8649: UDP, length 184 [19:38:14] what the 549537 bit is? [19:38:18] is that part of the timestamp? [19:38:20] microseconds?
[19:38:51] yeah [19:38:55] ok [19:38:57] if micro is enough digits [19:39:03] or whatev, nano whatev [19:40:19] !log depooling ms-fe1 for rewrite.py testing [19:40:27] Logged the message, Master [19:43:14] afk for a few mins, going to post office [19:44:24] New patchset: Reedy; "Add backup folder and gitignore to match" [operations/mediawiki-multiversion] (master) - https://gerrit.wikimedia.org/r/33406 [19:44:41] Change merged: Reedy; [operations/mediawiki-multiversion] (master) - https://gerrit.wikimedia.org/r/33406 [19:45:19] New patchset: Reedy; "Update multiversion" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33408 [19:45:31] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33408 [19:46:52] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: private, readonly, closed and special to 1.21wmf4 [19:46:58] Logged the message, Master [19:50:12] New patchset: Reedy; "?> is bad, mmkay?" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33409 [19:50:30] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33409 [19:52:09] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikibooks and wikinews to 1.21wmf4 [19:52:15] Logged the message, Master [19:55:34] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Everything else (non wikipedia) to 1.21wmf4 [19:55:40] Logged the message, Master [19:56:22] AaronSchulz: it's a 500 [19:56:36] !log srv269 has a full /tmp. Cleaning up [19:56:42] Logged the message, Master [19:57:14] we could retry it two times, but I'm not sure if I want it to [19:57:23] mutante: What's that command for deleting large numbers of stuff? [19:58:03] find --delete ? :-D [19:58:25] It's something like that, yup :P [19:59:10] find .
--some --or --expression [19:59:21] make sure to double check the output [19:59:25] then add --delete at the very end [19:59:48] (putting --delete first will simply delete everything under the current dir, exactly like rm * .*) [20:00:11] Reedy: find .. | xargs rm [20:00:19] might work better if you run into "too many arguments" [20:00:26] mutante: nope [20:00:27] yeah [20:00:34] find .. -print0 .. | xargs -0 [20:00:35] nope? [20:00:40] ah,ok [20:00:43] and for the too many arguments [20:00:46] there's xargs -n N [20:00:53] to do it in batches [20:00:55] like 200 or so [20:01:04] although this has been fixed in newer kernels iirc [20:01:05] xargs ? --delete should be enough really, it just finds the files and applies unlink() to them [20:01:21] afair with --delete we ran into "too many args" [20:01:30] hashar: you mean -delete [20:01:33] or was it just because it was slower? hmmm [20:01:36] find doesn't have double dashes [20:01:54] the problem with find's delete is that it'll do it one at a time (again, iirc) [20:01:55] indeed [20:02:20] pick a file, apply action [20:02:23] yea, ok, so xargs came up to speed it up [20:02:28] would need to benchmark that one day [20:02:49] find .
-type f -print0 | xargs -0 -n 200 rm -v [20:02:59] that's probably what you want [20:03:09] ;-D [20:03:15] o_0 [20:05:05] ahh [20:05:31] got a fresh new Gerrit instance to play with \O/ [20:06:59] sudo -u apache find /tmp -name '*.png' -type f -ctime +1 -exec rm -f {} \; [20:07:45] that works too, still one at a time though [20:07:53] lol [20:08:02] it's quicker than doing 100s of rm commands ;) [20:08:03] I haven't benchmarked it myself tbh [20:10:21] !log Running sync-common on mw53 [20:10:27] Logged the message, Master [20:17:14] New patchset: Reedy; "Everything non wikipedia to 1.21wmf4" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33415 [20:17:55] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33415 [20:20:08] !log installing some package upgrades on sodium (lists) [20:20:14] Logged the message, Master [20:21:00] Reedy: is this wmf4 with or without AaronSchulz's patch for job queue? [20:21:09] Which patch? [20:21:22] https://gerrit.wikimedia.org/r/#/c/33411/ [20:21:28] not merged I guess [20:21:38] Without [20:21:41] It's not even merged in core ;) [20:22:01] I don't think it was hitting us, though it would be nice to be merged [20:22:27] AaronSchulz: how can you discover if it did? [20:22:29] there was some totally broken third party extensions though [20:22:37] drop of queue jobs? [20:24:09] it's at least 2 reported extensions btw [20:25:35] AaronSchulz: if you are still around. /mnt/thumbs has been unmounted in production but is still used as a wgLocalFileRepo ( the backend-multi-write uses both swift and /mnt/thumbs [20:26:18] I already said that's not used...but yes it can be changed as prep [20:27:58] mind doing it ? 
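The deletion approaches being compared above can be sketched side by side. This is a minimal, self-contained sketch: the temp directory and file names are illustrative stand-ins, not the production /tmp paths from the log.

```shell
# Sketch of the batch-deletion patterns discussed above.
# Paths here are a throwaway temp dir, not production /tmp.
dir=$(mktemp -d)
for i in $(seq 1 500); do touch "$dir/file$i.png"; done

# Option 1: find -delete (single dash, as hashar notes) unlinks each
# match itself, one at a time, with no extra processes:
#   find "$dir" -type f -name '*.png' -delete
# Option 2: find -exec ... + batches arguments like xargs, no pipe:
#   find "$dir" -type f -name '*.png' -exec rm -f {} +

# Option 3: find | xargs in batches of 200, as suggested in the log.
# -print0/-0 keep odd filenames safe; -n 200 sidesteps "argument list
# too long" and amortizes rm startup cost across batches.
find "$dir" -type f -name '*.png' -print0 | xargs -0 -n 200 rm -f

remaining=$(find "$dir" -type f | wc -l)
echo "$remaining files left"   # all 500 are gone
rmdir "$dir"
```

Note the contrast with the `-exec rm -f {} \;` form quoted later in the log: `\;` runs one `rm` per file, while `+` and `xargs` both batch many files per invocation.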
that is spamming the filebackend-ops.log udp2log file [20:28:17] New patchset: Pyoungmeister; "coredb fix 3: use 'in' correctly and quote correctly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33418 [20:29:08] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33418 [20:29:37] hashar: none of those sync errors are related [20:29:49] * AaronSchulz thought he sent an email or comment about this somewhere already [20:29:52] paravoid: got a sec for a puppet question? [20:29:59] yes [20:30:30] so, i'm troubleshooting the coredb module that I checked in [20:30:32] and I'm seeing this: [20:30:33] Error 400 on SERVER: role::coredb::config::topology is not an hash or array when accessing it with s1 at /var/lib/git/operations/puppet/modules/coredb/manifests/snapshot.pp:11 on node db61.pmtpa.wmnet [20:30:37] I are confused [20:30:43] I don't know how that array could not be an array.... [20:30:51] especially when accessing it with one of its keys [20:32:01] I feel like either a) puppet's being dumb or b) I'm being dumb in the form of a typo of lack of/too many quotes [20:32:10] *or [20:33:56] do you have insights or magic? [20:33:56] why not both ? [20:33:57] :) [20:34:00] could be both!
[20:34:04] I'm trying to find why [20:34:04] i'll check out and see if i can see anything [20:34:04] can't rule it out [20:34:14] but in any case, the whole thing is ewww :) [20:34:20] you're referencing the role class from the module [20:34:25] that's bad [20:34:35] you should parameterize the module instead [20:34:41] sure, I could [20:34:41] yeah [20:34:42] or else the abstraction makes no sense [20:35:02] you have a role class including a module which depends on the role class [20:35:07] New patchset: Asher; "ensure existence of xml dump target dir on search indexers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33419 [20:35:09] that's too entangled [20:35:13] so you're saying we shouldn't have ONE HASH TO RULE THEM ALL? [20:35:39] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33419 [20:35:45] binasher: well, the hash is useful for mha to know the topology [20:35:52] but being explicit is reasonable [20:35:55] yeah [20:36:17] paravoid: ok, I'll just make a class param [20:36:23] that's very reasonable [20:36:44] and if you have more review, I am all ears :) [20:36:48] maybe we could keep it but also parameterize the module, and call it with dbmodulethingy{ $bigdbhash::s1 } [20:36:59] I realize how I didn't actually reply to your question [20:37:15] well, restructuring is reasonable as well [20:37:21] doubly reasonable, even [20:38:05] !log Killing /tmp/mw-cache-1.20wmf* from all apaches [20:38:11] Logged the message, Master [20:38:12] Looks like we're back to the /tmp full again time [20:38:57] basically, the abstraction we should be using is site.pp -> role class -> high-level module (e.g.
appserver) or manifest (for now) -> low-level modules (apache/mysql/redis) [20:39:11] apergos: You might want to kill /tmp/mw-cache-1.20wmf* on the snapshot boxen [20:39:21] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [20:39:21] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [20:39:24] paravoid: I don't see /thumb2 on fenari, just apaches [20:39:29] with modules avoiding depending on each other, unless otherwise needed [20:39:40] paravoid: ya [20:39:53] not on hume either [20:40:26] is it needed there? [20:40:37] (silly question time) [20:40:40] it should be on those, yes [20:40:46] how come? [20:40:52] someone will run a script one day and it won't work [20:41:07] I assume the job runners at least have it [20:41:14] as well [20:42:22] !log fixed private wiki search index builds in pmtpa (standby cluster) and building initial wikivoyagedata indices there [20:43:15] Logged the message, Master [20:44:28] !log removing srv200-srv213 from apaches pool for upgrade to precise [20:44:34] Logged the message, notpeter [20:45:20] ooh [20:45:58] Reedy: oh, I'll send an email to eng@ :) [20:50:53] New review: Alex Monk; "See the problems I've raised on the bug." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/33390 [20:57:22] AaronSchulz: done [21:01:08] AaronSchulz: did you just fix the double-escaped error? [21:01:16] like today? [21:06:22] AaronSchulz: hm?
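The refactor paravoid describes above (role class hands data down; the module never reaches back into role::coredb::config::topology itself) can be sketched roughly as follows. All class and parameter names here are invented for illustration and do not match the real coredb module.

```puppet
# Hypothetical sketch only: names are illustrative, not the actual module.
# The module takes its shard topology as a class parameter, so it has
# no dependency on any role class.
class coredb_sketch::snapshot($topology) {
    $master = $topology['master']
    # ... snapshot resources built from $master, $topology['slaves'], etc.
}

# The role class, which owns the big topology hash, passes in just the
# slice this node needs.
class role::db::s1_snapshot {
    include role::coredb::config
    class { 'coredb_sketch::snapshot':
        topology => $role::coredb::config::topology['s1'],
    }
}
```

This keeps the one-hash-to-rule-them-all convenience (mha can still read the full hash from the role config class) while the module stays a reusable, self-contained unit.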
[21:06:39] PROBLEM - Host srv202 is DOWN: PING CRITICAL - Packet loss = 100% [21:06:39] PROBLEM - Host srv203 is DOWN: PING CRITICAL - Packet loss = 100% [21:06:39] PROBLEM - Host srv204 is DOWN: PING CRITICAL - Packet loss = 100% [21:06:45] I was trying to fix that but I can't reproduce it anymore [21:07:51] RECOVERY - Host srv202 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [21:08:00] RECOVERY - Host srv203 is UP: PING OK - Packet loss = 0%, RTA = 1.24 ms [21:09:03] RECOVERY - Host srv204 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [21:10:08] New patchset: Hashar; "change lab instance for Zuul on labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33465 [21:10:24] PROBLEM - Apache HTTP on srv200 is CRITICAL: Connection refused [21:11:45] New patchset: Ottomata; "Creating class misc::contint::analytics::packages to ensure that packages needed to build udp-filter are installed." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33466 [21:11:45] PROBLEM - Apache HTTP on srv205 is CRITICAL: Connection refused [21:11:45] PROBLEM - Apache HTTP on srv202 is CRITICAL: Connection refused [21:12:03] PROBLEM - Apache HTTP on srv209 is CRITICAL: Connection refused [21:12:12] PROBLEM - Apache HTTP on srv208 is CRITICAL: Connection refused [21:12:21] PROBLEM - Apache HTTP on srv203 is CRITICAL: Connection refused [21:13:15] PROBLEM - Apache HTTP on srv210 is CRITICAL: Connection refused [21:13:15] PROBLEM - Apache HTTP on srv207 is CRITICAL: Connection refused [21:13:27] paravoid: I didn't do anything [21:13:38] RobH: you around? [21:13:42] PROBLEM - Apache HTTP on srv213 is CRITICAL: Connection refused [21:14:00] PROBLEM - Apache HTTP on srv204 is CRITICAL: Connection refused [21:14:04] what's the "boot once from pxe" command for older dracs? [21:14:21] it's telling me ERROR: Invalid group name specified. 
there isn't one [21:14:44] New patchset: Ottomata; "Creating class misc::contint::analytics::packages to ensure that packages needed to build udp-filter are installed." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33466 [21:14:45] PROBLEM - Apache HTTP on srv211 is CRITICAL: Connection refused [21:14:49] you have to press F12 or whatever that maps to [21:14:50] when booting [21:15:00] noooooooooooooo [21:15:03] PROBLEM - Apache HTTP on srv201 is CRITICAL: Connection refused [21:15:05] there is for the newer ones [21:15:34] racadm config -g cfgServerInfo -o cfgServerFirstBootDevice PXE [21:15:45] New review: Ottomata; "This change was originally attempted over at " [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/33466 [21:15:45] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33466 [21:15:48] RECOVERY - Apache HTTP on srv205 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time [21:16:02] Change abandoned: Ottomata; "This is done at https://gerrit.wikimedia.org/r/#/c/33466/ instead." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32192 [21:17:00] RECOVERY - Apache HTTP on srv200 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.050 second response time [21:18:39] RECOVERY - Apache HTTP on srv213 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.074 second response time [21:19:09] !log restarted lsearchd on all pmtpa lucene hosts [21:19:15] Logged the message, Master [21:19:42] RECOVERY - Apache HTTP on srv211 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.078 second response time [21:20:44] notpeter: sup? [21:20:53] f12 [21:21:32] or Use the <@> key sequence for [21:22:00] the one time drac cli command may work, but i dont think it does. [21:24:59] doesn't [21:25:02] doh. [21:25:12] well, it's 4:30.
I'm going to punt on this until tomorrow [21:25:47] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33465 [21:26:27] RECOVERY - Apache HTTP on srv202 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.087 second response time [21:26:37] RECOVERY - Apache HTTP on srv201 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.097 second response time [21:26:39] ottomata: am i good to merge your change on sockpuppet ? [21:27:12] RECOVERY - Apache HTTP on srv204 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.051 second response time [21:27:21] RECOVERY - Apache HTTP on srv203 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.063 second response time [21:28:06] RECOVERY - Apache HTTP on srv207 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.070 second response time [21:28:33] RECOVERY - Apache HTTP on srv209 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.085 second response time [21:28:33] RECOVERY - Apache HTTP on srv208 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.091 second response time [21:28:42] oh yes [21:28:43] sorry about that [21:28:47] LeslieCarr ^ [21:28:53] go right ahead [21:28:55] np ottomata [21:29:04] mergetastic! 
[21:29:29] New patchset: Dzahn; "add account for anomie (Brad), add to mortals (deployers) and some minor tab fixes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33469 [21:29:36] RECOVERY - Apache HTTP on srv210 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.054 second response time [21:31:18] New review: Dzahn; "approved per RT-3880" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/33469 [21:31:18] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33469 [21:32:19] !log repooling srv200-srv213 due to lack of automated imaging :( [21:32:26] Logged the message, notpeter [21:41:24] !log pgehres synchronized php-1.21wmf4/extensions/ContributionReporting/ [21:41:30] Logged the message, Master [21:48:57] !log pgehres synchronized wmf-config/CommonSettings.php [21:49:03] Logged the message, Master [21:53:03] LeslieCarr, just so I know [21:53:12] binasher: https://gdash.wikimedia.org/dashboards/filebackend/ combining lockmanager and streamfile graphs isn't really useful [21:53:12] what needs to be done to get the new servers talking with ganglia? [21:53:13] New patchset: Hashar; "zuul: setup.py require python-setuptools package" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33475 [21:53:17] some fancy network tricks? [21:53:46] the lock stuff should go its own graphs [21:54:38] ottomata: honestly, i'm not sure, that switch is having some problems and i'm talking with juniper in a minute [21:54:48] i've been staring at my phone, hoping it would explode so i wouldn't have to call them [21:54:58] sadly my laser eye beams are broken today [21:55:12] !log reedy synchronized php-1.21wmf4/extensions/Echo/ [21:55:16] rats, ok thanks! 
[21:55:18] Logged the message, Master [21:57:50] LeslieCarr, just so you know, (you probably already know), this def seems to be a multicast only communications problem [21:57:58] i can do multicast between machiens in this rack [21:58:06] but not between the ciscos (in another rack) and these [21:58:39] yeah [21:58:44] on the phone, awful hold music and all [21:58:55] cool :) [22:07:12] New patchset: Aaron Schulz; "Replaced /thumbs -> /thumbs2 and re-enabled multwrite for "quick" operations." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33479 [22:08:45] * AaronSchulz greats TimStarling with https://gerrit.wikimedia.org/r/#/c/33411/ [22:10:18] paravoid: 2.1T, heh, I wonder how long that will last ;) [22:10:29] we can always resize it [22:10:36] yeah, but still :) [22:10:56] ms5 had what? 5.5T? [22:11:02] and we're not copying thumbs right now, right? [22:11:08] we're just going to copy new thumb requests [22:13:00] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: Connection refused [22:13:44] (that'd be me) [22:17:23] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33479 [22:17:30] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [22:19:45] !log aaron synchronized wmf-config/filebackend.php 'Multiwrite thumbs to /thumbs2 and math/timeline files to nas as well.' [22:19:45] AaronSchulz: what if an unhandled exception is thrown after onTransactionIdle() is called? 
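The multicast problem ottomata describes, and the aggregator question that follows, both hinge on how ganglia's gmond daemons communicate: every node sends to one multicast group, and any node whose receive channel joins the same group can aggregate, so multiple aggregators "just work" as long as multicast is routable between the racks. An illustrative gmond.conf fragment (these are the ganglia defaults, not necessarily this cluster's values):

```
/* Illustrative fragment, ganglia default group/port.
   All nodes send to the multicast group... */
udp_send_channel {
  mcast_join = 239.2.11.71
  port = 8649
  ttl = 1
}
/* ...and any node that also joins the group on receive
   sees every metric, so aggregators are interchangeable. */
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
}
```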
[22:19:51] Logged the message, Master [22:20:04] MWExceptionHandler::handle() just calls exit() [22:20:21] in the past, we have relied on the fact that that is an implicit rollback [22:20:33] but with your change, it will throw an exception in the destructor [22:21:25] maybe that exception will even find its way back to the same exception handler and infinite recursion will result [22:21:51] because exit() is called outside of the try/catch block which normally protects it from exceptions thrown from exception handlers [22:22:42] !log aaron synchronized wmf-config/filebackend.php 'revert for a moment' [22:22:48] Logged the message, Master [22:23:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:25:22] paravoid: yeah, so we may need counterparts for those top-level dirs in /thumbs [22:26:38] TimStarling: yeah I wasn't totally sure about an exception there, a simple warning might be better [22:27:59] a warning would work, yes [22:28:15] * AaronSchulz wonders where paravoid went [22:28:56] New review: Dereckson; "en.wiktionary community have discussed this and would like to keep their permission" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/33390 [22:29:34] hmmm, notpeter, another ganglia q for you [22:29:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds [22:29:50] how do multiple aggregators for the same cluster work? [22:30:01] they will both join the same multicast addy [22:30:22] paravoid: ping [22:30:34] it is half past midnight for paravoid, according to CTCP [22:30:42] does ganglia just know how to deal with duplicate metrics? [22:30:45] maybe he found a life or something [22:30:51] TimStarling: yeah, but he is always up [22:31:09] * preilly knows that it's 12:30am Thursday (EET) - Time in Athens, Greece [22:31:45] TimStarling: doubtful [22:32:20] TimStarling: want to create a few empty dirs in /mnt/thumbs2 ?
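The warn-instead-of-throw approach Tim and Aaron settle on can be sketched language-agnostically. This is illustrative Python, not the MediaWiki PHP code under discussion: an exception raised from a destructor during shutdown risks re-entering the very exception handler that triggered the teardown, whereas a warning degrades gracefully while still flagging the open transaction.

```python
import warnings

class Connection:
    """Illustrative sketch (class name and fields are hypothetical):
    if a transaction is still open when the object is torn down,
    warn and roll back rather than raise."""

    def __init__(self):
        self.trx_open = True  # pretend a transaction was started

    def rollback(self):
        self.trx_open = False

    def __del__(self):
        if self.trx_open:
            # Raising here could recurse into the exception handler;
            # a warning is the safer choice discussed above.
            warnings.warn("implicit rollback: transaction still open")
            self.rollback()
```

A connection that was rolled back explicitly tears down silently; one abandoned mid-transaction produces the warning.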
[22:32:45] TimStarling: have you heard back from Rob Richards? [22:33:04] AaronSchulz: I'm your backup root after paravoid disappears? [22:33:09] New review: Dereckson; "fi.wikipedia too" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/33390 [22:34:43] TimStarling: do you want to be the primary root? ;) [22:34:45] preilly: no, I haven't [22:35:03] TimStarling: I talked to him last week [22:35:10] er, I was trying to write some python [22:35:21] TimStarling: nope, no life [22:35:40] TimStarling: he said that he would look at it soon [22:35:42] AaronSchulz: counterpart what? [22:35:44] preilly: pong [22:35:48] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.012 seconds [22:35:57] TimStarling: and that he was affected by Sandy [22:36:05] paravoid: do ls for /mnt/thumbs [22:36:06] TimStarling: I'll ping him again tomorrow [22:36:29] yes? [22:36:34] paravoid: can you check that the redis on silver and zhen is the same as production and has the same config [22:36:35] we need to mkdir the first-level? [22:36:36] or all levels? [22:36:40] first [22:36:51] enough to not get permission error spam [22:37:03] TimStarling: I'll let you know what I find out [22:37:24] TimStarling: he did do a quick scan of it and was leaning towards not accepting it in its current form [22:37:44] TimStarling: as it would more than likely affect other things in his mind [22:38:44] AaronSchulz: better? [22:40:03] should work, I wonder why they were 0777 before (aka "bright green") [22:40:11] meh [22:40:14] paravoid: are you able to take a look at redis? [22:40:38] paravoid: Dan Foy is in India still and is going to their office today to work with them [22:40:49] paravoid: and we wanted to confirm that it was okay before he gets there [22:41:24] look at what exactly?
[22:41:26] !log aaron synchronized wmf-config/filebackend.php [22:41:32] Logged the message, Master [22:42:08] paravoid: did you not see what I messaged you above, " can you check that the redis on silver and zhen is the same as production and has the same config"?!? [22:42:50] what do you mean "production"? [22:43:03] silver and zhen are production [22:43:04] paravoid: the memcached redis instance [22:43:18] paravoid: please try to not be difficult [22:43:32] I'm not being difficult, not on purpose anyway [22:43:51] !log aaron synchronized wmf-config/filebackend.php [22:43:56] paravoid: okay good text communicated is so terse [22:43:58] Logged the message, Master [22:44:09] s/communicated/communication [22:44:42] I don't know anything about our redis setup [22:44:46] either of the two setups [22:44:47] * AaronSchulz sighs [22:44:54] paravoid: okay that's fine [22:44:58] I could find out, but I'm occupied right now with some other things [22:45:05] (and after that I plan on getting some sleep) [22:45:07] paravoid: basically it appears that redis keeps crashing on silver [22:45:21] paravoid: okay [22:45:23] seems to be the same version [22:45:31] paravoid: it was 777 since MW writes as apache not www-data, so the perms are still wonk [22:45:35] well, seems to be the latest version according to the repository [22:46:14] * AaronSchulz wonders why the same things get him every time :) [22:46:54] paravoid: just make the perms match the /thumbs ones [22:48:32] sigh, 777, really... [22:48:45] I know, it feels like an antipattern sometimes :) [22:48:55] I made it apache:apache [22:49:09] I'm wondering if we're creating that user/group with the same uid/gid everywhere [22:49:11] why was it www-data? [22:49:16] what will break now? [22:49:29] because that was the owner on /mnt/thumbs too [22:49:35] I just didn't copy 777 on purpose [22:49:37] paravoid: thanks I'll just ask CT to find someone [22:50:01] preilly: it'd help if you'd ask in advance... 
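paravoid's worry about whether apache is created with the same uid/gid everywhere matters because NFS matches ownership by numeric id, not by name: "apache" owning a file on one client only looks like "apache" on another if the ids agree. A hedged sketch of how one might audit that (the helper and the data shape are hypothetical; input is each host's passwd content, e.g. from `getent passwd`):

```python
def uid_mismatches(host_passwd, user):
    """Given {host: passwd-file text}, return the hosts where `user`
    has a numeric uid different from the first host seen. On an NFS
    mount, ownership is matched by uid, so these must all agree."""
    uids = {}
    for host, text in host_passwd.items():
        for line in text.splitlines():
            fields = line.split(':')
            if fields[0] == user:
                uids[host] = int(fields[2])  # third field is the uid
    baseline = next(iter(uids.values()), None)
    return {h: u for h, u in uids.items() if u != baseline}
```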
[22:50:16] paravoid: I wish I knew in advance [22:50:25] paravoid: Dan Foy just emailed me about it [22:50:26] preilly: also, we have RT and this ops monkey/rt duty role nowadays, this may help [22:50:30] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [22:50:37] well, the files would be apache anyway (since it's not a sticky owner) [22:50:38] paravoid: okay sounds good [22:50:51] looking in /thumbs I see a mix of both even for shard dirs [22:50:57] paravoid: I appreciate your assistance [22:51:26] preilly: I'm sure CT would work out too, whichever you prefer [22:52:04] !log aaron synchronized wmf-config/filebackend.php 're-enabled writes to thumbs2/' [22:52:10] Logged the message, Master [22:52:29] AaronSchulz: do you mind if we try it like that and chase down bugs as they come? [22:52:34] I know this isn't the most productive use of our time [22:52:49] but otoh, 777 with the directory mounted on fenari no less is... not ideal. [22:52:59] heh, ok [22:53:07] * AaronSchulz looks for nas in ganglia [22:53:16] yeah good luck with that :P [22:53:28] good luck I was not having [22:53:40] what are you looking for? [22:53:58] besides fancy graphs that I can't give to you :) [22:54:07] oh, well in that case, nothing :) [22:54:26] heh [22:54:33] so, I'm patching up rewrite.py [22:54:38] fixed the connection refused thing [22:54:46] and I'm looking at the double-escaping error [22:54:55] so, basically I'm at a crossroad [22:55:16] !log Belatedly starting VisualEditor dark launch, going a bit over time [22:55:23] Logged the message, Mr. Obvious [22:55:26] should I just return what the imagescalers returned (fancy errors, correct error code etc.) [22:55:44] !log aaron synchronized wmf-config/filebackend.php [22:55:50] Logged the message, Master [22:55:56] but this may be counter-intuitive [22:56:02] i.e. 
we won't know what produced the 404 [22:56:04] well the logs are not spamming [22:56:07] or 403 or whatever [22:56:18] what do you think? [22:56:55] what do you mean "we won't know what produced the 404"? [22:57:29] if I'm just copying the imagescaler's errors verbatim [22:57:46] and returning those, as-is [22:57:47] the message bodies are usually different [22:58:00] so you know what's what...and I can always change thumb.php too [22:58:06] true [22:59:48] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [23:03:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:05:34] AaronSchulz: do you happen to have an example of these long URLs that you're returning 301/302 for? [23:06:02] http://test.wikipedia.org/wiki/Special:Contributions/Aaron_Schulz [23:06:26] paravoid: you can modify the thumb urls for that file to the long format [23:06:33] hehe useful [23:06:34] (my last edit) [23:06:34] thanks [23:07:00] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: Connection refused [23:08:39] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.011 seconds [23:10:54] !log catrope synchronized php-1.21wmf3/extensions/VisualEditor 'Updating VE to master' [23:11:00] Logged the message, Master [23:11:26] !log catrope synchronized php-1.21wmf4/extensions/VisualEditor 'Updating VE to master' [23:11:32] Logged the message, Master [23:15:15] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: Connection refused [23:15:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds [23:17:04] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.010 seconds [23:17:27] ha! just found another bug [23:17:35] we're not varying on protocol with those redirects [23:17:52] so we're returning 301 http://upload/...
even for https:// requests [23:18:17] Yeah [23:18:17] that's at least a rewrite.py bug, haven't verified if it's a MW bug too (unlikely) [23:18:37] A similar bug exists in the Apache redirect rules for HTTPS-only domains. I think binasher was looking at that [23:21:17] New patchset: MaxSem; "Support for multiple Solr cores" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29827 [23:23:40] New patchset: Catrope; "Dark launch VisualEditor on enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33488 [23:24:06] Reedy: You up for reviewing that real quick? ---^^ [23:24:14] * RoanKattouw could self-review but avoids doing that if practical [23:24:33] AaronSchulz: CT just pointed me to http://torrus.wikimedia.org/torrus/Storage?path=/Storage/nas1-a.pmtpa.wmnet/NetApp_General/&view=expanded-dir-html [23:24:37] it's not ganglia, but it's better than nothing [23:24:41] yep [23:26:23] Meh [23:26:31] * RoanKattouw is lazy, gives up waiting for Reedy and self-reviews [23:26:36] Change merged: Catrope; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33488 [23:26:44] lol [23:26:51] A whole 2 minutes and 24 seconds ;) [23:27:03] I did say I was lazy :) [23:27:19] Besides I'm already half an hour over my window [23:27:33] What is ParsoidPrefix? [23:27:47] !log catrope synchronized wmf-config/InitialiseSettings.php 'Dark launch VisualEditor on enwiki' [23:27:48] Brooke: Interwiki prefix-esque thing for Parsoid [23:27:52] Weird. [23:27:53] Logged the message, Master [23:28:01] <^demon> RoanKattouw: You self reviewed? For shame. [23:28:04] I'd think those types of prefixen would already be defined somewhere. [23:28:05] <^demon> A plague on your house.
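The protocol-varying bug paravoid found (a 301 to http://upload/... even for https:// requests, addressed by change 33492, "swift: fix https for short thumb URL redirects") presumably comes down to deriving the redirect scheme from the request instead of hard-coding it. A minimal WSGI-style sketch, assuming the standard `wsgi.url_scheme` environ key; the function and fallback host are illustrative, not the actual rewrite.py code:

```python
def redirect_location(environ, path):
    """Build a Location header that preserves the request's scheme,
    so HTTPS clients aren't redirected back to plain http://."""
    scheme = environ.get('wsgi.url_scheme', 'http')
    host = environ.get('HTTP_HOST', 'upload.wikimedia.org')
    return '%s://%s%s' % (scheme, host, path)
```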
[23:28:10] New patchset: Faidon; "swift: also handle URLErrors from imagescalers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33490 [23:28:10] New patchset: Faidon; "swift: passthrough all imgscalers errors as-is" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33491 [23:28:10] New patchset: Faidon; "swift: fix https for short thumb URL redirects" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33492 [23:28:10] New patchset: Faidon; "swift: removed code to hide the ETag." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23392 [23:28:11] New patchset: Faidon; "swift: removed copy2() and friends from rewrite.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25410 [23:28:27] AaronSchulz: are you up for some reviewing? :) [23:28:39] I also merged and tested your two changes too [23:29:11] RoanKattouw: No namespace definitions? [23:29:21] PROBLEM - Apache HTTP on mw1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [23:29:24] Brooke: We're not gonna have a VisualEditor namespace on enwiki [23:29:38] It'll operate in the main namespace, if the user enables the preference [23:29:39] Hmmm. Default is NS_MAIN. [23:29:44] Does that mean it's on now? [23:29:45] The "dark" part is that the preference is hidden right now [23:29:50] Ah. [23:29:51] And defaults to false [23:29:58] So you can't turn it on unless you have DB access [23:30:04] The API can't set prefs? [23:30:12] Which I do, so I'm gonna proceed to turn it on for the VE team [23:30:13] There's a bug about that. (tm) [23:30:15] Not that I know of [23:30:21] Lame. [23:30:24] I thought it could now. [23:30:29] paravoid: did you just rebase them? [23:30:32] yes [23:30:58] Brooke: Dang it, you're right [23:31:07] * RoanKattouw tries to see if that works [23:31:38] Humanity weeps when I'm right. [23:36:20] Yup, you were right [23:36:29] You can set it with &action=options, right? 
[23:36:35] Yes [23:37:25] New patchset: Faidon; "swift: also handle URLErrors from imagescalers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33490 [23:37:26] New patchset: Faidon; "swift: passthrough all imgscalers errors as-is" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33491 [23:37:26] New patchset: Faidon; "swift: fix https for short thumb URL redirects" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33492 [23:37:36] I'm smiling just a little to myself that I knew about an API feature and you didn't. :P [23:37:37] that's better [23:38:33] You're entitled to that :) [23:38:45] Hmm, it's not respecting it for some reason [23:39:00] I set the pref for myself, but VE doesn't turn on [23:39:36] RoanKattouw: You verified that the pref is set with ... whatever that option is? [23:39:43] Yes [23:39:46] OHAI [23:39:59] jello there [23:40:03] https://en.wikipedia.org/w/api.php?action=query&meta=userinfo&uiprop=options&format=jsonfm [23:40:03] >>> mw.user.options.get('visualeditor-enable') [23:40:05] "1" [23:40:11] yum jello :3 [23:40:24] Quoted string? [23:40:39] my grandfather invented jello [23:40:42] COW BONE [23:40:46] really?? [23:40:49] yep [23:40:49] das so coo [23:40:51] he da bes [23:40:52] mhm [23:41:15] feenga in da air [23:41:26] awh yea! [23:41:32] thekharatekid, Mi`: Please use #wikipedia or another social channel for off-topic chat. [23:41:44] RoanKattouw: It seems options can be integers or strings? Bizarre... [23:41:56] "enotifusertalkpages": 1, [23:42:00] "forceeditsummary": "1", [23:42:15] Yeah it's weird [23:44:21] RoanKattouw: Where's the options check in VE? [23:44:38] In PHP [23:44:42] And it didn't have $ignoreHidden set [23:44:42] Which file? [23:44:48] Ah. [23:44:49] VisualEditor.hooks.php [23:44:55] See also https://gerrit.wikimedia.org/r/33493 [23:45:10] You're too fast.
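The int-vs-string inconsistency Roan and Brooke spot ("enotifusertalkpages": 1 but "forceeditsummary": "1") is easy to trip over when testing a preference's truthiness, since the string "0" is truthy in most languages. A hedged sketch of a defensive check (the helper is hypothetical, not MediaWiki code; it only assumes the mixed int/string shape shown in the log):

```python
def option_enabled(options, name, default=False):
    """Treat 1, "1", and True as enabled, and 0, "0", and "" as
    disabled, since the options API mixes ints and numeric strings."""
    value = options.get(name)
    if value is None:
        return default
    if isinstance(value, str):
        return value not in ('', '0')
    return bool(value)
```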
[23:48:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:49:43] !log catrope synchronized php-1.21wmf3/extensions/VisualEditor 'Updating VE to master' [23:49:49] Logged the message, Master [23:50:01] !log catrope synchronized php-1.21wmf4/extensions/VisualEditor 'Updating VE to master' [23:50:09] Logged the message, Master