[00:00:13] I thought about keeping the same mountpoint for OCD reasons [00:00:18] but thought it isn't worth it [00:00:41] it'd be the same amount of work, but we'd run into the risk that puppet didn't run somewhere for some reason [00:01:12] and /mnt/thumbs would still be ms5 and all the debugging that this would entail [00:03:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.861 seconds [00:03:57] RobH: to mention earlier stuff - we have 10g interfaces on cr1 and cr2 actually reserved for the NAS - unless mark has changed his mind [00:04:10] * Reedy kicks db1035 [00:13:16] AaronSchulz: if it works for WebM we could enable it for Ogg too. Would have to reevalute the ogg seeking issues in avconv compared to oggThumb [00:13:29] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 251 seconds [00:14:17] PROBLEM - MySQL Replication Heartbeat on db1010 is CRITICAL: CRIT replication delay 278 seconds [00:14:26] RECOVERY - MySQL Replication Heartbeat on db1010 is OK: OK replication delay 0 seconds [00:14:35] PROBLEM - MySQL Replication Heartbeat on db11 is CRITICAL: CRIT replication delay 316 seconds [00:14:53] PROBLEM - MySQL Slave Delay on db11 is CRITICAL: CRIT replication delay 335 seconds [00:15:47] if what works? [00:15:48] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [00:16:12] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33289 [00:17:14] Reedy: db1035 does lvm snapshots so it's a little write sluggish :) [00:17:31] paravoid: sorry, switched channel, was responding to a question on #mediawiki [00:17:41] binasher: looking at the database size... I dropped about 9GB of data :D [00:17:52] oh not in there :) [00:18:02] maybe I should [00:18:07] well, "data" [00:18:10] Reedy: that's what i like to hear! 
[00:18:34] It was about 21GB originally [00:18:37] And there's more to go [00:19:20] !log asher synchronized wmf-config/CommonSettings.php 'remove apache memchached's from the parsercache multiwriter' [00:19:27] Logged the message, Master [00:21:30] !log asher synchronized wmf-config/mc.php 'disable general memcached multiwriting: nothing but pecl' [00:21:36] Logged the message, Master [00:23:11] Awesome [00:23:20] wow, really [00:23:21] cool! [00:23:58] running tcpdump on apaches to see if there's any port 11000 activity (the new hosts are using the standard port of 11211) and it's nothing but nagios [00:24:01] does anyone know how long is de wikivoyage going to be DB locked for please? [00:24:20] Till the slaves catch up [00:24:20] Again [00:24:26] Same as everything else on s3 [00:24:55] Another 10 million rows to go bye bye [00:25:13] Reedy: what's your delete chunk size? [00:25:18] a month [00:25:31] what are you deleting out of pure curiosity? [00:25:33] Thehelpfulone: 2 slaves have caught up... 1 is catching up [00:25:42] oai "audit" logs [00:26:07] It had 82 million rows [00:26:11] binasher: awesome! do you have any numbers on the performance benefits of that? [00:26:20] KEY (oa_client,oa_timestamp), [00:26:20] KEY (oa_timestamp,oa_client) [00:26:20] ok [00:26:25] ^ the indexes were beyond useless [00:27:10] !log aaron synchronized wmf-config/PrivateSettings.php [00:27:16] Logged the message, Master [00:27:28] binasher: is it worth adding a simple auto increment PK to the table when I get the row count down enough? [00:28:06] New patchset: Aaron Schulz; "Set swiftTempUrlKey parameter." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33296 [00:28:34] !log Set X-Account-Meta-Temp-Url-Key for swift mw:thumb account [00:28:40] Logged the message, Master [00:29:35] the hope is when the table is not backlogged with crap, I can actually get a few metrics from it... [00:29:45] AaronSchulz: hm? [00:31:28] binasher: where is pc in ganglia? 
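[Editor's note: the chunked-delete approach Reedy describes above — deleting a bounded number of rows per pass so each transaction stays small and replication lag stays manageable — can be sketched roughly as below. The table and column names are hypothetical, and sqlite3 stands in for MySQL purely so the sketch is self-contained.]

```python
import sqlite3

def delete_in_chunks(conn, table, cutoff, chunk_size):
    """Delete rows with oa_timestamp < cutoff in bounded batches,
    so each transaction stays small and replicas can keep up."""
    total = 0
    while True:
        cur = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN "
            f"(SELECT rowid FROM {table} WHERE oa_timestamp < ? LIMIT ?)",
            (cutoff, chunk_size),
        )
        conn.commit()
        if cur.rowcount == 0:
            break
        total += cur.rowcount
        # In production you would pause here until slave lag recovers.
    return total

# Toy data: 25 rows, delete everything with a timestamp below 20.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE oaiaudit (oa_timestamp INTEGER)")
conn.executemany("INSERT INTO oaiaudit VALUES (?)",
                 [(ts,) for ts in range(1, 26)])
deleted = delete_in_chunks(conn, "oaiaudit", cutoff=20, chunk_size=10)
```

The key design point is the bounded subquery: each pass touches at most `chunk_size` rows, so replication on the slaves never falls far behind the way a single multi-million-row DELETE would cause.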
[00:31:36] * AaronSchulz hates it when he can't find stuff easily [00:31:50] ahh, mysql group [00:31:56] * AaronSchulz was looking in misc [00:32:11] AaronSchulz: what's the Temp URL thing? [00:33:50] http://docs.rackspace.com/files/api/v1/cf-devguide/content/TempURL-d1a4450.html [00:34:16] AaronSchulz: this is a good sign http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=MySQL+pmtpa&h=pc1.pmtpa.wmnet&v=42684218&m=mysql_bytes_sent&jr=&js=&vl=bytes&ti=mysql_bytes_sent [00:34:19] also see https://gerrit.wikimedia.org/r/#/c/31666/ [00:34:29] I've read what it does, I'm wondering what are you going to use it to [00:35:02] looking [00:35:29] ah, great! [00:35:36] binasher: are you thinking about killing something else? :) [00:36:24] Reedy: wtf is oaiaudit? heh [00:37:31] yes, please do add an auto-inc pk [00:37:32] RECOVERY - MySQL Replication Heartbeat on db11 is OK: OK replication delay 0 seconds [00:37:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:37:50] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [00:37:50] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [00:38:08] RECOVERY - MySQL Slave Delay on db11 is OK: OK replication delay 0 seconds [00:38:08] unless oa_client is unique and the table creator just forgot about a pk.. i'm sure that's not the case, but it'd be nice [00:38:10] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33296 [00:38:22] log of requests to the OAI extension... So every time lucene requests anything... or any of the 3rd party users.. [00:38:34] Nope, there's no unique key :( [00:39:00] !log aaron synchronized wmf-config/filebackend.php 'Set "swiftTempUrlKey"' [00:39:07] Logged the message, Master [00:39:26] Ideally I want to add a couple of indexes too so we can get some useful information from it... 
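[Editor's note: the Swift TempURL mechanism discussed above (what the swiftTempUrlKey / X-Account-Meta-Temp-Url-Key settings enable) signs an HTTP method, an expiry timestamp, and an object path with HMAC-SHA1 of the shared account key. A minimal sketch, using a made-up key and object path rather than any real Wikimedia value:]

```python
import hmac
from hashlib import sha1
from time import time

def make_temp_url(method, path, key, seconds_valid):
    """Build a Swift-style temporary URL query string.

    `path` is the object path as the Swift proxy sees it,
    e.g. /v1/AUTH_account/container/object.
    """
    expires = int(time()) + seconds_valid
    # The signed body is exactly: METHOD \n expires \n path
    body = f"{method}\n{expires}\n{path}"
    sig = hmac.new(key.encode(), body.encode(), sha1).hexdigest()
    return f"{path}?temp_url_sig={sig}&temp_url_expires={expires}"

# Hypothetical values, for illustration only:
url = make_temp_url("GET", "/v1/AUTH_mw/thumb/c/c8/Example.jpg", "secret", 300)
```

The proxy recomputes the same HMAC from its stored key and rejects the request if the signature mismatches or the expiry has passed, which is what lets thumbnails be served without exposing the account credentials.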
[00:39:40] Obviously to be done when it hasn't got quite so many rows ;) [00:39:43] frwiki Revision::fetchFromConds 10.0.6.53 2008 MySQL client ran out of memory (10.0.6.53) [00:39:47] binasher: lol [00:40:14] :D [00:40:40] autoinc that shit [00:41:02] wtf is with UserDailyContribsHooks on incubator wiki [00:41:14] editcountitis? [00:41:24] Lock wait timeout exceeded; try restarting transaction (10.0.6.44) UPDATE `user_daily_contribs` SET contribs=contribs+1 WHERE day = '20121113' AND user_id = '17032' [00:41:34] Reedy: maybe it's a bot ;) [00:41:36] hell yes [00:42:12] that "MySQL client ran out of memory" query had a limit 1 on it! [00:42:33] it was probably the straw that broke the parsing camel's back [00:43:05] [user_name] => Base [00:43:11] [user_editcount] => 65 [00:43:17] that doesn't sound like a bot to me [00:50:44] mutante: when you pop up and have a little bit of time, could you please take a look at http://ftp.mozilla.org/pub/mozilla.org/webtools/bugzilla-4.0.8-to-4.0.9-nodocs.diff.gz ? 
TIA [00:52:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.578 seconds [00:54:00] !log aaron synchronized php-1.21wmf4/extensions/TimedMediaHandler 'deployed 2784d7ab1b0e0a56b905ff4b9c4070f63f867d25' [00:54:06] Logged the message, Master [00:54:24] binasher: too bad cache multiwrite doesn't backfill, ah well :) [00:54:57] yeah, i like the trend though [00:56:43] there's also a trend i don't like much, get times from MWMemcached were much more erratic and had huge spikes that are absent from the pecl client, but on average, it was quicker [00:57:14] that might be because more of the gets are actually returning larger pcache objects which were previously misses though [00:58:04] binasher: it was slower even before when added it to the parser cache group [00:58:43] * AaronSchulz tries to recall the order of events [00:58:47] and the tp50 time is something like 1.5ms vs 3ms [00:59:49] network latency is greater from the apaches (new and old racks) to the mc hosts than to other apaches though [01:00:16] mw24 to srv245: rtt min/avg/max/mdev = 0.121/0.145/0.158/0.011 ms [01:00:22] yeah, profiling was one for a while before we added pcache [01:00:26] vs mw24 to mc3: rtt min/avg/max/mdev = 0.095/0.209/0.362/0.074 ms [01:00:29] * AaronSchulz double checked [01:02:16] AaronSchulz: i wonder if you could get data on instantiation overhead for the pecl client without persistent connections [01:03:33] tp90s are a lot different :/ [01:04:38] oh, go further back.. times for MWMemcached dropped drastically once the initial multiwrite to pecl (without pcache) went into place [01:05:09] so really need to compare current pecl client times to times from before we started touching anything [01:05:20] gah, duh [01:06:20] current tp99 is actually much better than the old [01:06:26] yeah, tp90 and tp50 are better [01:06:28] phew! 
i thought it was worse, heh [01:06:28] yep [01:06:40] * AaronSchulz looks at the 14 day graph [01:07:22] don't have to go republican, hide the bad graphs, and claim mission accomplished [01:07:59] tp99 was hitting 290ms, now tops at 72ms [01:08:49] binasher: do you still want to remove pc1 eventually? [01:09:14] yes, but no :( [01:09:30] we're actually going to deploy pc2-3 [01:10:07] slaves or sharded? [01:10:13] the initial reason for this is for pcache replication, with the idea that all reads would generally be from memcached [01:10:27] sharded, with binlogging to replicate to eqiad [01:10:41] was sharding support ever added? [01:10:49] * AaronSchulz doesn't remember that [01:11:04] binasher: I guess you could use mysql federated tables :p [01:11:08] no, tim said it'd be easy though [01:11:25] there's already table sharding [01:12:03] the better logging from the pecl client shows that some pcache objects are >1MB though, so that's another reason to have something backing memcached [01:12:23] sigh...I guess that makes sense [01:14:29] http://bit.ly/T05KXq [01:14:55] paravoid: re: question on performance, see the max times ^^ [01:15:21] yeah, I've been following :-) [01:15:30] thanks [01:15:45] impressive [01:16:16] you should mail that! [01:19:11] writing an email now :) [01:19:19] New patchset: Kaldari; "Enableing upload_by_url on test2 for Flickr uploading tests" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33300 [01:20:35] kaldari: \o/ [01:21:02] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33300 [01:21:30] ori-l: The future is coming! [01:22:26] kaldari: browsing CC content on flickr and uploading it to commons is how i procrastinated on at least a half-dozen papers in grad school [01:22:37] so i'm sure you're destroying someone's academic career somewhere :P [01:22:56] just wait until Fabrice finds out about it! 
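[Editor's note: the tp50/tp90/tp99 figures binasher and AaronSchulz compare above are percentiles over the raw memcached get() timings. A quick sketch of how such a figure is derived, using the simple nearest-rank definition and synthetic timings:]

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample value such that
    at least p percent of the samples are <= it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]

timings_ms = list(range(1, 101))  # stand-in for real request timings
tp50 = percentile(timings_ms, 50)
tp99 = percentile(timings_ms, 99)
```

This is why tp99 is so sensitive to a handful of slow outliers (the 290ms-to-72ms improvement mentioned below) while tp50 barely moves: the top percentiles only look at the worst tail of the distribution.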
[01:23:07] he has like 20,000 free photos on Flickr [01:23:17] that's over 9000!? [01:23:17] heh [01:24:02] Reedy: for the last time, we can't turn on image uploads from /b/ [01:24:13] That's what you think... [01:24:14] you remember what happened last time. [01:24:18] http://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&m=swift_object_count&h=Swift+pmtpa+prod&c=Swift+pmtpa fwiw [01:24:29] but think of the LULZ! [01:24:35] kind of an interesting slope [01:24:50] you can see the effect of WLM I think [01:25:38] that's a lot of objects [01:25:50] this includes thumbs [01:25:59] glad you guys are ready for the flood ;) [01:26:01] that we never delete [01:26:42] Personally, I think all thumbs should have TTLs [01:27:09] there have been discussions about that [01:27:13] it's not that simple [01:27:22] yes, easier said than done [01:27:25] TTLs probably wouldn't work, we'd need some kind of LRU [01:27:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:27:40] We do at least have our wgThumbnailEpoch ;) [01:27:43] LRU? [01:27:52] least recently used [01:28:16] Even commonly used thumbs should be regenerated every few years [01:29:04] for updates to ImageMagic and whatever we use for SVG rendering [01:29:30] kaldari: that's what the thumbnail epoch is for, sort of [01:29:49] Oh! [01:30:00] $wgThumbnailEpoch = '20120101000000'; [01:30:24] sleep now [01:30:28] talk to you tomorrow [01:30:38] that's exciting [01:30:45] for me at least :) [01:30:45] which reminds me [01:30:58] I think there was one of the other epochs that wanted bumping up again [01:34:56] binasher: https://gerrit.wikimedia.org/r/#/c/29736/ [01:35:14] binasher: can you run that schema change (after truncating zhwiki again or something)? [01:35:30] AaronSchulz: sure.. mind if its tomorrow? [01:35:34] Reedy: is there a bot making daily edits there or what? 
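[Editor's note: the LRU eviction policy suggested above for thumbnails — evict by recency of use rather than by a fixed TTL — can be sketched in a few lines. This is an in-memory toy for illustration, not anything Swift supports out of the box:]

```python
from collections import OrderedDict

class LRUCache:
    """Toy least-recently-used cache: reads refresh an entry's
    recency, and inserting past capacity evicts the stalest entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # touching "a" makes "b" the eviction candidate
cache.put("c", 3)    # evicts "b"
```

The advantage over a TTL, as the discussion notes, is that a commonly requested thumb never expires under load, while thumbs nobody requests age out naturally.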
[01:35:55] binasher: that's fine [01:36:35] No idea [01:36:38] Possibly [01:38:36] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [01:39:23] AaronSchulz: Maybe the great firewall of china is broken [01:39:56] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 263 seconds [01:40:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.456 seconds [01:41:53] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 255 seconds [01:43:32] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [01:43:36] who moderates the engineering list? [01:44:15] my email went into moderation due to the attached graph being >40k [01:44:42] and was apparently rejected on wikitech-l due to "The message's content type was not explicitly allowed", though it was just typed in gmail's web interface [01:46:32] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [01:51:16] binasher: rachel, iirc [02:00:16] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 267 seconds [02:00:25] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 276 seconds [02:14:59] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [02:15:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:15:34] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [02:23:13] PROBLEM - Squid on brewster is CRITICAL: Connection refused [02:29:53] !log LocalisationUpdate completed (1.21wmf3) at Wed Nov 14 02:29:53 UTC 2012 [02:30:01] Logged the message, Master [02:31:19] RECOVERY - Squid on brewster is OK: TCP OK - 0.007 second response time on port 8080 [02:31:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [02:47:40] PROBLEM - 
Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [02:49:19] PROBLEM - Squid on brewster is CRITICAL: Connection refused [02:50:05] !log LocalisationUpdate completed (1.21wmf4) at Wed Nov 14 02:50:04 UTC 2012 [02:50:12] Logged the message, Master [02:55:28] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Wed Nov 14 02:55:18 UTC 2012 [03:12:34] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [03:17:31] RECOVERY - Squid on brewster is OK: TCP OK - 0.003 second response time on port 8080 [03:22:29] PROBLEM - Squid on brewster is CRITICAL: Connection refused [03:23:13] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 21 seconds [04:17:40] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [04:17:40] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [04:17:40] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [05:34:23] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [06:50:00] RECOVERY - Squid on brewster is OK: TCP OK - 0.001 second response time on port 8080 [08:24:09] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [08:35:33] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.005 second response time on port 11000 [08:36:14] morning :-] [08:40:40] apergos: mark: paravoid: hello :-]  The gerrit service running on manganese keep restarting and I am wondering if that might be caused by puppet. The service is subscribed to three files. Would it be possible to run puppetd -tv on manganese and paste me the result please ? [08:41:08] The last Puppet run was at Wed Nov 14 08:13:56 UTC 2012 (27 minutes ago). 
[08:41:10] running now [08:41:54] \O/ [08:42:08] I am not sure how puppet handled file subscribption [08:42:25] I suspect it regenerate the templates on every run and end up restarting the service [08:44:27] run completed [08:44:56] http://p.defau.lt/?Uf15oRqRbFY3MIgJa1lp2A [08:45:01] you rock [08:45:18] :-) [08:45:43] so it end up refreshing replication.config and restarting it :( [08:45:47] info: /Stage[main]/Gerrit::Jetty/File[/var/lib/gerrit2/review_site/etc/replication.config]: Scheduling refresh of Service[gerrit] [08:45:56] apergos: mind running it again? [08:46:21] maybe gerrit rewrite the configuration file when it is started [08:46:32] ok [08:48:57] ohhh [08:49:16] the template use a foreach loop on an array of settings to generate the file [08:49:37] that might end up using some random order [08:49:44] http://p.defau.lt/?DoykSQyCFWBZoFEIAbAP6g [08:49:49] which would lead to the file configuration changing from time to time [08:51:17] apergos: will debug it on my local config and submit a patch. Thanks a lot! [08:52:45] ok, thanks for working on this, it's been an annoying bug! [08:53:06] definitely [08:53:27] we have been exchanging a few emails with Chad over the week-end [08:53:39] and it magically appeared to me this morning that puppet might be restarting it [08:56:34] sweet! 
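[Editor's note: the Gerrit fix hashar describes above boils down to iterating the settings hash in a stable order before writing the template, so successive puppet runs emit byte-identical files and never trigger a spurious service refresh. The idea, sketched in Python with made-up setting names rather than the actual erb template:]

```python
def render_config(settings):
    """Render key = value lines in sorted key order, so the output is
    byte-for-byte identical regardless of how the mapping was built."""
    return "".join(f"{key} = {value}\n"
                   for key, value in sorted(settings.items()))

# Two mappings with the same content but different build order:
a = {"url": "git@example.org", "threads": 4}
b = {"threads": 4, "url": "git@example.org"}
assert render_config(a) == render_config(b)
```

Without the sort, a template that loops over an unordered hash can emit the same settings in a different order on each run; puppet then sees the file as changed, schedules a refresh of the subscribed service, and the service restarts on every puppet run — exactly the symptom seen with Gerrit on manganese.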
[08:58:28] just have to find out how to pass variables to an erb template from the command line [10:13:44] New patchset: Tim Starling; "Disable UniversalLanguageSelector" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33318 [10:38:25] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [10:38:25] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [10:43:10] New patchset: Hashar; "prevents puppet from restarting Gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33321 [10:44:02] ahoho need to find out the bug report [10:45:49] apergos: https://gerrit.wikimedia.org/r/33321 should fix up the Gerrit random restart [10:58:36] New review: Nikerabbit; "We call this throwing out the baby with the bath water." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/33318 [11:30:49] PROBLEM - SSH on singer is CRITICAL: Server answer: [11:31:43] PROBLEM - SSH on virt0 is CRITICAL: Server answer: [11:31:52] PROBLEM - SSH on hydrogen is CRITICAL: Server answer: [11:32:10] PROBLEM - SSH on chromium is CRITICAL: Server answer: [11:32:19] RECOVERY - SSH on singer is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [11:32:46] PROBLEM - SSH on kaulen is CRITICAL: Server answer: [11:33:13] PROBLEM - SSH on pdf2 is CRITICAL: Server answer: [11:33:31] PROBLEM - SSH on fenari is CRITICAL: Server answer: [11:33:31] RECOVERY - SSH on hydrogen is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:34:07] RECOVERY - SSH on chromium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:34:52] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [11:34:52] RECOVERY - SSH on virt0 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:35:10] RECOVERY - SSH on fenari is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [11:36:13] RECOVERY - SSH on kaulen is OK: SSH OK - 
OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [11:36:13] PROBLEM - SSH on zirconium is CRITICAL: Server answer: [11:36:31] PROBLEM - SSH on sq76 is CRITICAL: Server answer: [11:37:52] RECOVERY - SSH on zirconium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:38:10] PROBLEM - SSH on sq53 is CRITICAL: Server answer: [11:38:10] RECOVERY - SSH on sq76 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [11:39:49] RECOVERY - SSH on sq53 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [11:40:35] PROBLEM - SSH on spence is CRITICAL: Server answer: [11:42:13] RECOVERY - SSH on spence is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [11:47:43] New review: Siebrand; ">Diederik 00:53" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/12188 [11:50:41] New review: Aude; "Regardless of the merits of this change or not, we have a handful of places in Wikibase that use bit..." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/33318 [11:51:27] New review: Aude; "of course, getting a random language is not nice but the entire JS breaking is worse." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/33318 [11:52:16] PROBLEM - SSH on lvs1003 is CRITICAL: Server answer: [11:53:55] RECOVERY - SSH on lvs1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:54:22] PROBLEM - SSH on nitrogen is CRITICAL: Server answer: [11:56:01] RECOVERY - SSH on nitrogen is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:59:58] New review: Hashar; "Since lot of files got renamed and tweaked at the same time, this patchset is best viewed by tweakin..." 
[operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/32167 [12:00:13] PROBLEM - SSH on ssl1001 is CRITICAL: Server answer: [12:01:52] RECOVERY - SSH on ssl1001 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [12:06:49] New patchset: Dereckson; "(bug 41831) Set autoconfirm on ru.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33329 [12:07:07] PROBLEM - SSH on sq33 is CRITICAL: Server answer: [12:08:46] RECOVERY - SSH on sq33 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [12:10:55] New review: Hashar; "The shell script MWRealm.sh should reuse the code from MWRealm.php, we really want to avoid duplicat..." [operations/mediawiki-multiversion] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/32168 [12:16:34] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [12:30:49] PROBLEM - check_gcsip on payments2 is CRITICAL: Connection timed out [12:30:49] PROBLEM - check_gcsip on payments4 is CRITICAL: Connection timed out [12:30:49] PROBLEM - check_gcsip on payments3 is CRITICAL: Connection timed out [12:30:49] PROBLEM - check_gcsip on payments1 is CRITICAL: Connection timed out [12:31:18] PROBLEM - check_gcsip on payments1001 is CRITICAL: CRITICAL - Socket timeout after 61 seconds [12:31:18] PROBLEM - check_gcsip on payments1002 is CRITICAL: CRITICAL - Socket timeout after 61 seconds [12:31:18] PROBLEM - check_gcsip on payments1003 is CRITICAL: CRITICAL - Socket timeout after 61 seconds [12:31:18] PROBLEM - check_gcsip on payments1004 is CRITICAL: CRITICAL - Socket timeout after 61 seconds [12:48:52] New patchset: Mark Bergsma; "Install ms-be30xx without a large RAID1 partition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33332 [12:49:04] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [12:49:07] Change merged: Mark Bergsma; [operations/puppet] (production) - 
https://gerrit.wikimedia.org/r/33332 [12:52:22] !log Reinstall ms-be3001 [12:52:29] Logged the message, Master [12:57:01] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [12:59:27] New patchset: Mark Bergsma; "Add coffee" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33333 [13:00:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33333 [13:01:12] apergos: did you rebalance account/container on ms-be6? [13:01:42] no, it wouldn't rebalance them, (i.e. I ran it and it refused to do a rebalance) [13:03:04] strange [13:03:20] the strange part is that it says balance -33 on the new entries [13:04:26] *sigh* [13:09:14] Cowardly refusing to save rebalance as it did not change at least 1% [13:09:21] that's what it told me when I tried to rebalance them [13:09:36] because it just was a change to the port number, not to the partitions themselves I guess [13:09:59] maybe if we changed the weights [13:10:59] apergos: I probably fixed Gerrit restart madness https://gerrit.wikimedia.org/r/#/c/33321/ [13:11:22] I saw you pinged me on the change, I'll have a look at it [13:11:27] nice :-] [13:11:31] hopefully that will fix it [13:11:47] that would be pretty great alright [13:11:48] there might be other places where we have a similar issue [13:13:43] New review: Diederik; "Siebrand, this Change is On The critical path of The analytics team, there are some more changes tha..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/12188 [13:18:10] hashar: do you want demon to look at this change or can I just +2 it and merge it?
[13:20:39] apergos: I guess just +2 [13:20:49] apergos: Chad is attending a Gerrit conference for the whole week [13:20:54] ok [13:20:58] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33321 [13:21:13] that is merely making sure the order of parameters is sane [13:21:21] yep, I got that [13:21:27] mind applying it to manganese and running puppetd -tv twice ? [13:21:35] the second run should not show any conf change [13:21:48] (nor a third run if you are willing to double test it :-) [13:22:08] shall I puppetd on manganese then? [13:22:14] ah I see you already asked me :-D [13:22:15] sec [13:22:48] yeah I'll do three runs, no extra work long as I'm there [13:25:59] yep, second run no changes, running one more time for double-check [13:26:08] and third run is good too, good job! [13:28:29] apergos: you are the best [13:28:36] that should fix it hopefully [13:28:40] actually, you are: you fixed gerrit! [13:29:06] I'm going to keep that lesson in mind though when generating files based on hashes in puppet [14:18:56] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [14:18:56] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [14:18:56] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [14:20:18] apergos: Which cache epoch was it that we needed to update again?
[14:20:38] we don't want to I think [14:20:48] Oh, fair enough then [14:21:25] there was a 60 day period, I have to check my notes, anyways in early dec when we should see if it's all ok [14:47:17] New patchset: Hashar; "Unmount /mnt/thumbs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33202 [15:01:22] New patchset: Faidon; "reprepro: use the Ceph repository as an upstream" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33352 [15:04:01] hashar: as I said earlier, I'd prefer it if labs was going to use /data/foo directly from MW's config... [15:04:24] I mean, this whole indirection is useless, and it doesn't get you closer to production anyway [15:04:58] the reason is that the paths are hardcoded in various places :-D [15:05:04] anyway /mnt/thumbs never got used on beta [15:05:19] are they? [15:05:28] I mounted netapp at /mnt/thumbs2 [15:05:41] so, if they're hardcoded, this is going to be a problem :) [15:13:33] /mnt/thumbs is still described in the wmf-config/filebackend.php file though it might not be actually used [15:16:03] 2012-11-14 15:08:47 mw68 commonswiki: FileBackendMultiWrite failed sync check: ["mwstore:\/\/local-multiwrite\/local-public\/c\/c8\/Segwun2009.jpg"] [15:16:05] in production [15:16:06] doh [15:16:30] apparently the local repository is configured to use a multi write file backend [15:16:40] which write to both swift and /mnt/thumbs/ [15:24:24] New review: Hashar; "/mnt/thumbs is still described in the wmf-config/filebackend.php file though it might not be actual..." 
[operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/33202 [15:30:34] New patchset: Pyoungmeister; "removing erroneous comments in db-secondary.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33355 [15:32:26] New patchset: Pyoungmeister; "removing erroneous comments in db-secondary.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33355 [15:36:06] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [15:41:28] New patchset: Pyoungmeister; "removing erroneous comments in db-secondary.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33355 [15:45:19] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13429 [16:00:12] New patchset: Faidon; "secure.wikimedia.org: also redirect /" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33360 [16:01:04] New patchset: Jgreen; "banner logging adjustment" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33361 [16:02:12] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33361 [16:05:31] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33360 [16:09:56] Oooh [16:10:03] paravoid: Is that going live? [16:12:01] heh [16:12:06] so should i name a server Schickard? [16:12:16] * RobH is having to assign a new tampa based misc server [16:12:45] * RobH is out of encyclopedians and is onto computer scientists [16:13:00] Cimrman [16:13:37] thats not a real person ;p [16:13:49] you know him? ;-) [16:13:58] well, the enwiki page says its not [16:14:05] and who doesnt trust wikipedia. [16:15:12] https://en.wikipedia.org/wiki/Jan_Otto [16:15:38] ohh [16:15:40] i like it. 
[16:15:41] the biggest Czech encyclopedia [16:15:43] ever [16:15:49] and the person is dead, so thats good for server names [16:16:01] (before wikipedia which beat it quite recently) [16:16:16] Reedy: it did [16:16:21] I'm mailing wikitech now [16:16:32] Danny_B|backup: well, now new otrs dev box is called otto. [16:16:37] you have named a wiki server ;] [16:17:23] when is baptism? [16:17:38] for a server, thats when the label with its name is applied? [16:17:41] cuz then later today [16:17:42] do I get automatic access? [16:17:50] lol [16:17:52] ottomata: nope, but i figured you would like the name ;] [16:17:57] hah, i like it [16:18:04] unlike apergos who made us kill the server named ariel [16:18:07] but it means that whenever it goes down I'm going to get chat notifications about it! [16:18:15] I did. and I enjoyed it [16:18:19] heh, i didnt think about that [16:18:25] ottomata: i can find a different name if you care =] [16:18:29] haha, not really [16:18:30] is it any special server? [16:18:31] but then im gonna make you help [16:18:40] how about Treccani [16:18:41] Danny_B|backup: new otrs dev box [16:18:50] so we can actually upgrade otrs [16:18:50] we should also have a Mill server I think [16:19:13] so the new otrs will be running on otto? [16:19:25] dunno if new for users will be [16:19:32] but new for the dev who is trying to make an upgrade happen [16:19:34] 50% name match :o) [16:19:43] it may or may not become new actual server, dunno [16:19:49] "actually upgrade otrs" <---- am I dreaming [16:20:20] Nemo_bis: ask somebody around to hit you and you'll recognize...
;-) [16:21:23] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33355 [16:21:24] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:21:51] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:21:51] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:21:51] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:22:23] !log py synchronized wmf-config/db-secondary.php 'updating comments in db-secondary.php' [16:22:30] Logged the message, Master [16:22:43] Nemo_bis: dont quote me on that. [16:22:50] i just assign the servers, im not doing the upgrade [16:25:30] RobH: I know :D [16:25:46] but still it's a miracle to see even the slightest move on this after so many years [16:26:54] yep, i just assigned a new server, and it will be put into our dev sandbox lan by network admins, then we are letting some dev go to town on the box [16:28:36] RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 3 processes with command name varnishncsa [16:30:39] New patchset: Pyoungmeister; "setting db61 to use coredb module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33377 [16:31:49] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33377 [16:36:51] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [16:39:22] !log server wmf5815 named 'otto' after Jan Otto, name suggested by DannyB [16:39:29] Logged the message, RobH [16:41:17] hi robh! [16:41:24] heh [16:41:26] everytime i get a chat notification for that i'm going to say hi to you! [16:41:27] hehehh [16:41:46] you are going to make me rename the server by doing that =P [16:41:50] i asked you! 
[16:41:53] you had the option! [16:42:01] (of breaking Danny_B|backup's heart) [16:42:33] ottomata: does your pattern match with server_otto too? [16:42:43] cuz if not im putting that in logs when i do. [16:42:45] ;] [16:43:03] wait, try again [16:43:14] otto [16:43:16] server_otto [16:43:26] yup [16:43:28] oh [16:43:29] someone said otto [16:43:31] invalid test [16:43:33] do it again [16:43:34] server_otto [16:43:41] nop [16:43:42] e [16:43:43] heh [16:43:45] doesn't match [16:43:46] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [16:43:46] cool! [16:43:47] success \o/ [16:43:49] haha [16:43:54] um, but I don't really care [16:43:56] i think its kinda funny [16:44:08] you do whatchu want [16:44:34] i wanna name them things like eiximenis [16:44:38] but folks didnt like that ;] [16:44:51] so when people say otto is down you wll have only yourself to blame! [16:45:01] but more likely the conversation will go like this [16:45:03] what is eiximenis actually? [16:45:12] "otto is acting up again, can someone take a look?" [16:45:23] Danny_B|backup: it was the name of a server [16:45:26] eximenisadpoetewt8965 is .... 
I forget what it is [16:45:27] i know [16:45:32] except that I could never spell it :-D [16:45:34] but the meaning of the word [16:45:34] famous encyclopedian [16:45:37] aha [16:45:48] * Danny_B|backup should wikipedia it [16:46:10] Ahh, found it [16:46:11] http://en.wikipedia.org/wiki/Category:Encyclopedists [16:46:16] and then I will have to really act up [16:46:23] * RobH rebookmarks that [16:46:24] http://en.wikipedia.org/wiki/Francesc_Eiximenis [16:46:37] https://en.wikipedia.org/wiki/Category:Encyclopedists gives many suggestions [16:46:48] RobH: how about Chmielowski [16:46:48] Nemo_bis: yep, i just pasted that ;] [16:46:57] Nemo_bis: i like it [16:47:04] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [16:47:07] but i dont think others will agree, heh [16:47:12] al-Qalqashandi [16:47:18] ahh man, good thing its already called otto [16:47:20] it's not hard [16:47:22] cuz i would call it mexia now [16:47:26] thats a good name. [16:47:33] :D [16:47:51] He maintained correspondence with Erasmus of Rotterdam, Luis Vives and Juan Gines de Sepulveda [16:47:52] RobH: do we have "wales" server as well? ;-) [16:47:55] <^demon> I miss eiximenis :p [16:48:04] ^demon: no one appreciated it like we did. [16:48:21] I sure didn't. [16:48:33] wasn't eiximenis where etherpad is? [16:48:33] <^demon> Elements are boring names for servers. [16:48:34] Danny_B|backup: i dont name them after live folks [16:48:41] <^demon> Danny_B|backup: Was, but yes. [16:48:43] ^demon: we disagree, that was pretty much all me. [16:48:51] i like elements, i dont have to dig for names. [16:49:01] plus carbon is the install server [16:49:05] that is comedy people. [16:49:21] (only biochem nerds like that one) [16:49:23] RobH: which element number are we on so far? [16:49:33] i mean how many elements we used?
[16:50:23] lots [16:50:46] 49 [16:50:50] we are up to indium [16:50:59] tin is next [16:51:09] I can't wait for unobtainium [16:51:21] so can't i ;-) [16:51:28] we need to use that for ... hmm... how about the new search system [16:51:30] http://en.wikipedia.org/wiki/Ununpentium [16:51:30] :-P [16:52:19] PROBLEM - MySQL Replication Heartbeat on db1003 is CRITICAL: CRIT replication delay 232 seconds [16:52:19] PROBLEM - MySQL Replication Heartbeat on db11 is CRITICAL: CRIT replication delay 232 seconds [16:52:19] PROBLEM - MySQL Replication Heartbeat on db1010 is CRITICAL: CRIT replication delay 232 seconds [16:52:34] That'll be me, again [16:52:37] PROBLEM - MySQL Slave Delay on db11 is CRITICAL: CRIT replication delay 251 seconds [16:52:40] It'll sort itself out in a few [16:52:46] PROBLEM - MySQL Slave Delay on db39 is CRITICAL: CRIT replication delay 259 seconds [16:52:56] PROBLEM - MySQL Replication Heartbeat on db39 is CRITICAL: CRIT replication delay 269 seconds [16:53:31] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 304 seconds [16:53:58] RECOVERY - MySQL Replication Heartbeat on db1003 is OK: OK replication delay 0 seconds [16:53:58] RECOVERY - MySQL Replication Heartbeat on db1010 is OK: OK replication delay 0 seconds [16:54:34] RECOVERY - MySQL Replication Heartbeat on db39 is OK: OK replication delay 0 seconds [16:54:34] RECOVERY - MySQL Slave Delay on db39 is OK: OK replication delay 0 seconds [16:54:41] <^demon> RobH: You and I should get californium. [16:54:52] im on the page right now [16:54:57] heh [16:55:05] i was thinking 'i wanna go to califorium' [16:55:10] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [16:55:15] was discovered at berkley! [16:55:23] berkeley even [16:55:23] RobH: you can use all names up to "Buo" [16:55:48] isn't there berkelium? 
[16:56:00] yea, bk [16:56:37] it decays into califorium [16:56:47] this amuses me [16:59:08] * Danny_B|backup needs to wait for dubnium for his server [17:09:20] I've just been talking to some people about the secure.wm.o deprecation [17:09:42] Apparently one of them was abusing it to work around cross-domain restrictions for JS requests for global renaming [17:09:59] indeed [17:10:59] hmm? [17:13:55] PROBLEM - MySQL Replication Heartbeat on db11 is CRITICAL: CRIT replication delay 264 seconds [17:13:55] PROBLEM - MySQL Slave Delay on db11 is CRITICAL: CRIT replication delay 265 seconds [17:16:15] Krenair: It's not like the move away from secure... wasn't telegraphed for the past year. [17:16:23] Krenair- Have they looked into CORS instead? [17:18:41] heyaaaaa, LeslieCarr, I know we don't know what magic made the analytics ganglia stuff start working [17:18:52] but I've got a buncha new servers I'd like to get in the ganglia group now [17:19:00] not sure if there was something special we did to give them a little kick [17:19:33] they've been up for a couple of days now [17:19:49] and gmond.conf is configured properly [17:19:52] was hoping they'd just join [17:19:57] but maybe I have to wait for more magic [17:23:20] Hmm [17:23:34] The number of these oai audit rows is growing by a lot per week from mid 2012 [17:25:53] Meh, just july [17:30:30] New review: Aaron Schulz; "Those sync errors are about originals on nas vs swift...it doesn't have to do with this change. The ..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/33202 [17:36:20] ottomata: i don't think there was any magic [17:36:26] they should just show up ... [17:36:51] ok, that's what I thought. They don't yet, but maybe they will one day [17:38:08] what's one of the servers? 
[17:38:32] ottomata: [17:38:48] analytics1021 [17:38:52] or analytics1011 [17:38:58] i'm working with those to figure things out [17:39:00] ok, let me hunt around some [17:39:02] ok coo [17:39:03] danke [17:39:07] I'll look at 1011, ok? [17:39:08] sure [17:39:23] if you tcpdump, you can see it sending packets to the correct multicast addy [17:39:27] analytics1011.eqiad.wmnet.46321 > 239.192.1.32.8649 [17:39:31] ok [17:39:39] the aggregator is analytics1003 [17:39:58] if you tcpdump there, you will only see traffic from the current nodes that ganglia knows about [17:40:03] analytics1001-analytics1010 [17:40:51] hrm … [17:41:34] this was pretty much what happened to us when LeslieCarr and I were trying to set up this ganglia group with an01-an10 [17:41:36] yeah I see that in lsof (the address) [17:41:48] and then it started working [17:41:51] wtf [17:41:54] everything looked like it should work, but it didn't [17:41:56] we took a break [17:41:58] i mean, restarting ganglia worked for a minute [17:42:01] i came back to check it a week later [17:42:04] and then they were there [17:42:12] yeah, i tried restarting gmond on an21 [17:42:13] ok, so these are in a different rack [17:42:15] didn't restart it on an03 though [17:42:16] lemme double check the configs [17:42:18] ok [17:42:24] i do want to set up another aggregator [17:42:27] probably on one of these guys [17:42:31] if it helps I'm ready to do that [17:42:40] i want to solve the issue first :) [17:42:42] ok [17:42:51] besides, i'm not sure what ganglia does if aggregators have different info [17:43:45] maybe the aggregator drops em [17:44:26] different info?
[17:46:09] sigh [17:46:11] wtf asw-c [17:46:28] !log oai.oaiaudit cleaned out on S3 [17:46:33] note that 1002 doesn't report [17:46:34] Logged the message, Master [17:46:40] it's also a 10.64.x host [17:47:51] nah, irrelevant [17:48:29] 1002 is kinda offline [17:48:30] but 10.64.36 != 10.64.21 so maybe there's an issue there [17:48:35] it just came back up yesterday [17:48:36] for the new hosts I mean [17:48:38] with RobH's help [17:48:41] it isn't puppetized yet [17:48:54] servers fear me. [17:49:00] they oughta [17:51:58] ottomata: asw-c isn't bringing up one of the aggregated ethernets either [17:52:05] looks like me and it need to have a nice long "chat" [17:52:07] perhaps with a hammer [17:52:09] ooooo [17:52:14] hammer time [17:52:14] uh oh [17:52:33] LeslieCarr: can i beg https://rt.wikimedia.org/Ticket/Display.html?id=3687 off you? [17:52:44] just some labels and vlan assignments for 3 servers in eqiad [17:52:57] then i can install and deploy them =] [17:53:03] ok [17:53:34] are their second interfaces no longer in use ? [17:53:47] they had two interfaces? [17:54:04] i dont recall, if so can you relabel and set both just in case? [17:54:18] i will check and see if the secondary is still needed, but if not i can drop ticket and unplug [17:54:22] and it wont hold up use [17:54:51] also tampa server vlan https://rt.wikimedia.org/Ticket/Display.html?id=3912 would be nice to hand off but is less pressing [17:55:06] as i just assigned it today, its not been waiting on networking at all =] [17:55:37] well one way they'd need to be in lacp bundles, the other way, no bundling [17:56:03] oh, lets just kill it off then [17:56:03] oh, they've already been done [17:56:07] disable the secondary port [17:56:08] the networking for pc1001-1003 [17:56:10] oh?
[17:56:20] !log aaron synchronized php-1.21wmf3/extensions/TimedMediaHandler 'deployed 275aa9524ea6e1e15d7eef3d81c68dde3ffd5693' [17:56:26] the parsercache in tampa doesnt have two network connections i dont thin [17:56:26] Logged the message, Master [17:56:26] k [17:56:28] lemme check [17:56:41] i think the two is just legacy from labs. [17:56:53] yea, it is [17:57:07] LeslieCarr: So the networking for port1 on each is already done and the secondary connection is killed? [17:57:17] cuz i dont wanna bridge labs to production via a system =P [17:57:20] yeah [17:57:21] hehehe [17:57:25] good to check :) [17:57:26] awesome, thank you! [17:58:07] New review: Anomie; "> The shell script MWRealm.sh should reuse the code from MWRealm.php, we really" [operations/mediawiki-multiversion] (master) C: 0; - https://gerrit.wikimedia.org/r/32168 [18:00:00] New patchset: RobH; "renamed virt1001-1003 to pc1001-1003" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33389 [18:00:14] cmjohnson1: heya, i dropped a ticket for a server label in pmtpa queue for you while you were away [18:00:15] just fyii [18:01:04] yep...i see it...it's port 33 (btw) [18:01:05] so since this is a "sending on one end but packets not seen on other end" problem and not a puppet problem, I'm off this case, right otto? [18:01:57] RobH: wondering, how's the ulsfo dell order coming along ? [18:02:17] New review: Anomie; "As you commented over there, it really seems that multiversion should just be merged here." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/32167 [18:02:26] universal language selector foo? 
:-P [18:03:10] New patchset: Anomie; "Allow per-realm and per-datacenter configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32167 [18:04:06] er ottomata [18:05:02] New patchset: Nemo bis; "(bug 42105) Restore normal bureaucrat permissions where changed without consensus" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33390 [18:05:09] someone other than me review https://gerrit.wikimedia.org/r/#/c/33389/1 [18:05:15] or im just gonna self review [18:05:19] (bad rob) [18:05:47] New review: RobH; "i cant help myself" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/33389 [18:05:48] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33389 [18:06:11] hahaha [18:06:15] I was gonna +2 it too [18:06:21] hehe [18:06:26] i thought it would be funny. [18:06:30] but make you do your own merge on sockpuppet :-P [18:06:37] I'll smileyface it [18:12:08] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [18:18:43] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.002 second response time on port 11000 [18:19:01] I am wondering if memcached is restarted by puppet :-] [18:19:08] dinner time anywa [18:19:08] y [18:24:42] !log patching Bugzilla to 4.0.9 [18:24:48] Logged the message, Master [18:28:15] !log temp stopping puppet on brewster [18:28:20] Logged the message, notpeter [18:30:35] apergos: ms7 cruft? saw the cortado update yesterday? [18:30:50] I did not see it [18:31:12] it was already moved though I thought [18:31:38] well, it wasn't with TMH, but should be now [18:31:42] ah [18:31:55] well that's good to know (it's in the "already moved" list anyways) [18:32:27] did the timeline stuff you and aaron were working on happen? 
[18:34:21] paravoid: http://commons.wikimedia.org/wiki/Commons:Village_pump#File:Lamppost-singapore.jpg [18:34:34] hmm, some exception should probably get caught there or something [18:44:02] New patchset: Mwalker; "CN Fundraising Cookie Expiration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33393 [18:46:26] Change merged: Pgehres; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33393 [19:03:53] !log Running sync-common on mw48 [19:03:59] Logged the message, Master [19:07:07] AaronSchulz: hey [19:07:14] yeah, that's the urlopen exception [19:07:23] we're handling only the http exceptions, not the socket refused [19:08:46] I'm not sure what the correct behavior would be though... [19:17:04] does it currently give a 503 for that exception? [19:20:27] AaronSchulz: https://gerrit.wikimedia.org/r/#/c/33202/ ? [19:20:54] you replied on hashar's comment but didn't +1/+2, is that on purpose? [19:21:09] well, +1, I guess you don't have +2 on puppet :) [19:22:57] New patchset: Pyoungmeister; "coredb fix 1: snapshot needs to inherit role::coredb::common" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33397 [19:23:16] hey LeslieCarr, you still around?
[19:23:25] I have a related but slightly different ganglia problem [19:23:30] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33397 [19:24:15] ottomata: i left and came back [19:25:21] so, the machines that are working with ganglia right now, all work great, except for an01 [19:25:39] hadoop is configured to send ganglia metrics to the mcast addy, and an03 (the aggregator) sends those stats to gmetad [19:25:45] and they show up at ganglia.wikimedia.org [19:25:47] just fine [19:25:53] for everything except analytics1001 [19:25:54] now [19:25:56] on an01 [19:26:06] if I do [19:26:07] tcpdump -A 'udp port 8649' [19:26:18] I can see a bunch of java and hadoop related metrics going out [19:26:25] on an03 [19:26:26] i do [19:26:45] tcpdump -A 'udp port 8649 and host 239.192.1.32' | grep analytics1001 [19:27:21] and I only see default ganglia stats showing up there [19:27:36] load, heartbeats, memory, swap space [19:27:36] etc. [19:27:41] (although, now that I'm looking at it, maybe grepping for analytics1001 in the output isn't going to show me everything?) [19:28:07] ottomata: does it show you all the junk for an1002? [19:28:18] like, all the stats you want on one that is sending them? [19:28:25] an1002 is down, but i will check an05 (pretty sure it does) [19:28:53] notpeter, yes it does [19:29:07] problem != grep [19:29:09] yay! [19:29:09] jvm.JvmMetrics.ThreadsRunnable, yarn.NodeManagerMetrics.ContainersRunning etc [19:29:14] hehe, thanks for the double check! [19:29:14] heheh [19:29:27] what about when you grep on an1001 ? [19:29:30] do you see it sending? [19:29:36] yes [19:29:39] ah [19:29:47] this is network magic. my powers are useless [19:29:56] sorry :( [19:30:00] analytics1001.wikimedia.org.6429 > 239.192.1.32.8649 [19:30:00] analytics1001.wikimedia.org....)dfs.FSNamesystem.PendingReplicationBlocks [19:30:01] etc. [19:30:04] yeah [19:30:04] rats!
[19:30:06] yeah me too [19:30:30] ganglia seems quite difficult to debug! i guess its all the mcast stuff it likes to do [19:30:36] if things get lost, its pretty hard to find out why [19:30:43] yeah [19:30:44] or at least, for my ganglia noob self [19:30:57] well, sounds like ganglia is sending just fine [19:31:18] right, since it is getting some stats [19:31:59] apergos, any ideas? [19:32:02] well, and because you see it sending from an1001 [19:32:48] right, well, i see it sending all metrics (java and default) to the mcast addy from an01 [19:33:04] I didn't see anything coming in from an1011 on the aggregator [19:33:24] if we're on another issue, [19:33:33] unless it's an emergency I'm pretty done for the day [19:33:43] New patchset: Pyoungmeister; "coredb fix 2: needs moar scope" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33401 [19:34:00] and I see an03 (the aggregator) seeing the default metrics from an01 on the mcast addy [19:34:00] yeah [19:34:02] me neither, that's something else at the moment [19:34:02] LeslieCarr says she needs to bash a switch or something [19:34:04] but, i have another problem [19:34:07] analytics1001 is generating a buncha metrics for hadoop, as well as the default ones (load, mem, etc.) [19:34:12] awwww ok! [19:34:14] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33401 [19:34:15] yeah its late over there [19:34:41] then why did you make it generate more metrics ? [19:34:42] duh [19:35:20] duhhhhhh [19:37:59] LeslieCarr, do you know, in this line: [19:38:01] 19:32:11.549537 IP analytics1001.wikimedia.org.40547 > 239.192.1.32.8649: UDP, length 184 [19:38:14] what the 549537 bit is? [19:38:18] is that part of the timestamp? [19:38:20] microseconds?
[19:38:51] yeah [19:38:55] ok [19:38:57] if micro is enough digits [19:39:03] or whatev, nano whatev [19:40:19] !log depooling ms-fe1 for rewrite.py testing [19:40:27] Logged the message, Master [19:43:14] afk for a few mins, going to post office [19:44:24] New patchset: Reedy; "Add backup folder and gitignore to match" [operations/mediawiki-multiversion] (master) - https://gerrit.wikimedia.org/r/33406 [19:44:41] Change merged: Reedy; [operations/mediawiki-multiversion] (master) - https://gerrit.wikimedia.org/r/33406 [19:45:19] New patchset: Reedy; "Update multiversion" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33408 [19:45:31] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33408 [19:46:52] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: private, readonly, closed and special to 1.21wmf4 [19:46:58] Logged the message, Master [19:50:12] New patchset: Reedy; "?> is bad, mmkay?" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33409 [19:50:30] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33409 [19:52:09] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikibooks and wikinews to 1.21wmf4 [19:52:15] Logged the message, Master [19:55:34] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Everything else (non wikipedia) to 1.21wmf4 [19:55:40] Logged the message, Master [19:56:22] AaronSchulz: it's a 500 [19:56:36] !log srv269 has a full /tmp. Cleaning up [19:56:42] Logged the message, Master [19:57:14] we could retry it two times, but I'm not sure if I want it to [19:57:23] mutante: What's that command for deleting large numbers of stuff? [19:58:03] find --delete ? :-D [19:58:25] It's something like that, yup :P [19:59:10] find .
--some --or --expression [19:59:21] make sure to double check the output [19:59:25] then add --delete at the very end [19:59:48] (putting --delete first will simply delete everything under the current dir, exactly like rm * .*) [20:00:11] Reedy: find .. | xargs rm [20:00:19] might work better if you run into "too many arguments" [20:00:26] mutante: nope [20:00:27] yeah [20:00:34] find .. -print0 .. | xargs -0 [20:00:35] nope? [20:00:40] ah,ok [20:00:43] and for the too many arguments [20:00:46] there's xargs -n N [20:00:53] to do it in batches [20:00:55] like 200 or so [20:01:04] although this has been fixed in newer kernels iirc [20:01:05] xargs ? --delete should be enough really, it just finds the files and applies unlink() to them [20:01:21] afair with --delete we ran into "too many args" [20:01:30] hashar: you mean -delete [20:01:33] or was it just because it was slower? hmmm [20:01:36] find doesn't have double dashes [20:01:54] the problem with find's delete is that it'll do it one at a time (again, iirc) [20:01:55] indeed [20:02:20] pick a file, apply action [20:02:23] yea, ok, so xargs came up to speed it up [20:02:28] would need to benchmark that one day [20:02:49] find .
-type f -print0 | xargs -0 -n 200 rm -v [20:02:59] that's probably what you want [20:03:09] ;-D [20:03:15] o_0 [20:05:05] ahh [20:05:31] got a fresh new Gerrit instance to play with \O/ [20:06:59] sudo -u apache find /tmp -name '*.png' -type f -ctime +1 -exec rm -f {} \; [20:07:45] that works too, still one at a time though [20:07:53] lol [20:08:02] it's quicker than doing 100s of rm commands ;) [20:08:03] I haven't benchmarked it myself tbh [20:10:21] !log Running sync-common on mw53 [20:10:27] Logged the message, Master [20:17:14] New patchset: Reedy; "Everything non wikipedia to 1.21wmf4" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33415 [20:17:55] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33415 [20:20:08] !log installing some package upgrades on sodium (lists) [20:20:14] Logged the message, Master [20:21:00] Reedy: is this wmf4 with or without AaronSchulz's patch for job queue? [20:21:09] Which patch? [20:21:22] https://gerrit.wikimedia.org/r/#/c/33411/ [20:21:28] not merged I guess [20:21:38] Without [20:21:41] It's not even merged in core ;) [20:22:01] I don't think it was hitting us, though it would be nice to be merged [20:22:27] AaronSchulz: how can you discover if it did? [20:22:29] there was some totally broken third party extensions though [20:22:37] drop of queue jobs? [20:24:09] it's at least 2 reported extensions btw [20:25:35] AaronSchulz: if you are still around. /mnt/thumbs has been unmounted in production but is still used as a wgLocalFileRepo ( the backend-multi-write uses both swift and /mnt/thumbs [20:26:18] I already said that's not used...but yes it can be changed as prep [20:27:58] mind doing it ? 
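The deletion approaches being compared above can be sketched side by side. This is a minimal, self-contained sketch: the temp directory and file names are illustrative stand-ins, not the production /tmp paths from the log.

```shell
# Sketch of the batch-deletion patterns discussed above.
# Paths here are a throwaway temp dir, not production /tmp.
dir=$(mktemp -d)
for i in $(seq 1 500); do touch "$dir/file$i.png"; done

# Option 1: find -delete (single dash, as hashar notes) unlinks each
# match itself, one at a time, with no extra processes:
#   find "$dir" -type f -name '*.png' -delete
# Option 2: find -exec ... + batches arguments like xargs, no pipe:
#   find "$dir" -type f -name '*.png' -exec rm -f {} +

# Option 3: find | xargs in batches of 200, as suggested in the log.
# -print0/-0 keep odd filenames safe; -n 200 sidesteps "argument list
# too long" and amortizes rm startup cost across batches.
find "$dir" -type f -name '*.png' -print0 | xargs -0 -n 200 rm -f

remaining=$(find "$dir" -type f | wc -l)
echo "$remaining files left"   # all 500 are gone
rmdir "$dir"
```

Note the contrast with the `-exec rm -f {} \;` form quoted later in the log: `\;` runs one `rm` per file, while `+` and `xargs` both batch many files per invocation.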
that is spamming the filebackend-ops.log udp2log file [20:28:17] New patchset: Pyoungmeister; "coredb fix 3: use 'in' correctly and quote correctly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33418 [20:29:08] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33418 [20:29:37] hashar: none of those sync errors are related [20:29:49] * AaronSchulz thought he sent an email or comment about this somewhere already [20:29:52] paravoid: got a sec for a puppet question? [20:29:59] yes [20:30:30] so, i'm troubleshooting the coredb module that I checked in [20:30:32] and I'm seeing this: [20:30:33] Error 400 on SERVER: role::coredb::config::topology is not an hash or array when accessing it with s1 at /var/lib/git/operations/puppet/modules/coredb/manifests/snapshot.pp:11 on node db61.pmtpa.wmnet [20:30:37] I are confused [20:30:43] I don't know how that array could not be an array.... [20:30:51] especially when accessing it with one of its keys [20:32:01] I feel like either a) puppet's being dumb or b) I'm being dumb in the form of a typo of lack of/too many quotes [20:32:10] *or [20:33:56] do you have insights or magic? [20:33:56] why not both ? [20:33:57] :) [20:34:00] could be both!
[20:34:04] I'm trying to find why [20:34:04] i'll check out and see if i can see anything [20:34:04] can't rule it out [20:34:14] but in any case, the whole thing is ewww :) [20:34:20] you're referencing the role class from the module [20:34:25] that's bad [20:34:35] you should parameterize the module instead [20:34:41] sure, I could [20:34:41] yeah [20:34:42] or else the abstraction makes no sense [20:35:02] you have a role class including a module which depends on the role class [20:35:07] New patchset: Asher; "ensure existence of xml dump target dir on search indexers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33419 [20:35:09] that's too entangled [20:35:13] so you're saying we shouldn't have ONE HASH TO RULE THEM ALL? [20:35:39] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33419 [20:35:45] binasher: well, the hash is useful for mha to know the topology [20:35:52] but being explicit is reasonable [20:35:55] yeah [20:36:17] paravoid: ok, I'll just make a class param [20:36:23] that's very reasonable [20:36:44] and if you have more review, I am all ears :) [20:36:48] maybe we could keep it but also parameterize the module, and call it with dbmodulethingy{ $bigdbhash::s1 } [20:36:59] I realize how I didn't actually reply to your question [20:37:15] well, restructuring is reasonable as well [20:37:21] doubly reasonable, even [20:38:05] !log Killing /tmp/mw-cache-1.20wmf* from all apaches [20:38:11] Logged the message, Master [20:38:12] Looks like we're back to the /tmp full again time [20:38:57] basically, the abstraction we should be using is site.pp -> role class -> high-level module (e.g.
appserver) or manifest (for now) -> low-level modules (apache/mysql/redis) [20:39:11] apergos: You might want to kill /tmp/mw-cache-1.20wmf* on the snapshot boxen [20:39:21] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [20:39:21] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [20:39:24] paravoid: I don't see /thumb2 on fenari, just apaches [20:39:29] with modules avoiding depending on each other, unless otherwise needed [20:39:40] paravoid: ya [20:39:53] not on hume either [20:40:26] is it needed there? [20:40:37] (silly question time) [20:40:40] it should be on those, yes [20:40:46] how come? [20:40:52] someone will run a script one day and it won't work [20:41:07] I assume the job runners at least have it [20:41:14] as well [20:42:22] !log fixed private wiki search index builds in pmtpa (standby cluster) and building initial wikivoyagedata indices there [20:43:15] Logged the message, Master [20:44:28] !log removing srv200-srv213 from apaches pool for upgrade to precise [20:44:34] Logged the message, notpeter [20:45:20] ooh [20:45:58] Reedy: oh, I'll send an email to eng@ :) [20:50:53] New review: Alex Monk; "See the problems I've raised on the bug." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/33390 [20:57:22] AaronSchulz: done [21:01:08] AaronSchulz: did you just fix the double-escaped error? [21:01:16] like today? [21:06:22] AaronSchulz: hm?
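The refactor paravoid describes above (role class hands data down; the module never reaches back into role::coredb::config::topology itself) can be sketched roughly as follows. All class and parameter names here are invented for illustration and do not match the real coredb module.

```puppet
# Hypothetical sketch only: names are illustrative, not the actual module.
# The module takes its shard topology as a class parameter, so it has
# no dependency on any role class.
class coredb_sketch::snapshot($topology) {
    $master = $topology['master']
    # ... snapshot resources built from $master, $topology['slaves'], etc.
}

# The role class, which owns the big topology hash, passes in just the
# slice this node needs.
class role::db::s1_snapshot {
    include role::coredb::config
    class { 'coredb_sketch::snapshot':
        topology => $role::coredb::config::topology['s1'],
    }
}
```

This keeps the one-hash-to-rule-them-all convenience (mha can still read the full hash from the role config class) while the module stays a reusable, self-contained unit.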
[21:06:39] PROBLEM - Host srv202 is DOWN: PING CRITICAL - Packet loss = 100% [21:06:39] PROBLEM - Host srv203 is DOWN: PING CRITICAL - Packet loss = 100% [21:06:39] PROBLEM - Host srv204 is DOWN: PING CRITICAL - Packet loss = 100% [21:06:45] I was trying to fix that but I can't reproduce it anymore [21:07:51] RECOVERY - Host srv202 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [21:08:00] RECOVERY - Host srv203 is UP: PING OK - Packet loss = 0%, RTA = 1.24 ms [21:09:03] RECOVERY - Host srv204 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [21:10:08] New patchset: Hashar; "change lab instance for Zuul on labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33465 [21:10:24] PROBLEM - Apache HTTP on srv200 is CRITICAL: Connection refused [21:11:45] New patchset: Ottomata; "Creating class misc::contint::analytics::packages to ensure that packages needed to build udp-filter are installed." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33466 [21:11:45] PROBLEM - Apache HTTP on srv205 is CRITICAL: Connection refused [21:11:45] PROBLEM - Apache HTTP on srv202 is CRITICAL: Connection refused [21:12:03] PROBLEM - Apache HTTP on srv209 is CRITICAL: Connection refused [21:12:12] PROBLEM - Apache HTTP on srv208 is CRITICAL: Connection refused [21:12:21] PROBLEM - Apache HTTP on srv203 is CRITICAL: Connection refused [21:13:15] PROBLEM - Apache HTTP on srv210 is CRITICAL: Connection refused [21:13:15] PROBLEM - Apache HTTP on srv207 is CRITICAL: Connection refused [21:13:27] paravoid: I didn't do anything [21:13:38] RobH: you around? [21:13:42] PROBLEM - Apache HTTP on srv213 is CRITICAL: Connection refused [21:14:00] PROBLEM - Apache HTTP on srv204 is CRITICAL: Connection refused [21:14:04] what's the "boot once from pxe" command for older dracs? [21:14:21] it's telling me ERROR: Invalid group name specified. 
there isn't one [21:14:44] New patchset: Ottomata; "Creating class misc::contint::analytics::packages to ensure that packages needed to build udp-filter are installed." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33466 [21:14:45] PROBLEM - Apache HTTP on srv211 is CRITICAL: Connection refused [21:14:49] you have to press F12 or whatever that maps to [21:14:50] when booting [21:15:00] noooooooooooooo [21:15:03] PROBLEM - Apache HTTP on srv201 is CRITICAL: Connection refused [21:15:05] there is for the newer ones [21:15:34] racadm config -g cfgServerInfo -o cfgServerFirstBootDevice PXE [21:15:45] New review: Ottomata; "This change was originally attempted over at " [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/33466 [21:15:45] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33466 [21:15:48] RECOVERY - Apache HTTP on srv205 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time [21:16:02] Change abandoned: Ottomata; "This is done at https://gerrit.wikimedia.org/r/#/c/33466/ instead." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32192 [21:17:00] RECOVERY - Apache HTTP on srv200 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.050 second response time [21:18:39] RECOVERY - Apache HTTP on srv213 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.074 second response time [21:19:09] !log restarted lsearchd on all pmtpa lucene hosts [21:19:15] Logged the message, Master [21:19:42] RECOVERY - Apache HTTP on srv211 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.078 second response time [21:20:44] notpeter: sup? [21:20:53] f12 [21:21:32] or Use the <@> key sequence for [21:22:00] the one time drac cli command may work, but i dont think it does. [21:24:59] doesn't [21:25:02] doh. [21:25:12] well, it's 4:30.
I'm going to punt on this until tomorrow [21:25:47] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33465 [21:26:27] RECOVERY - Apache HTTP on srv202 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.087 second response time [21:26:37] RECOVERY - Apache HTTP on srv201 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.097 second response time [21:26:39] ottomata: am i good to merge your change on sockpuppet ? [21:27:12] RECOVERY - Apache HTTP on srv204 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.051 second response time [21:27:21] RECOVERY - Apache HTTP on srv203 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.063 second response time [21:28:06] RECOVERY - Apache HTTP on srv207 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.070 second response time [21:28:33] RECOVERY - Apache HTTP on srv209 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.085 second response time [21:28:33] RECOVERY - Apache HTTP on srv208 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.091 second response time [21:28:42] oh yes [21:28:43] sorry about that [21:28:47] LeslieCarr ^ [21:28:53] go right ahead [21:28:55] np ottomata [21:29:04] mergetastic! 
[21:29:29] New patchset: Dzahn; "add account for anomie (Brad), add to mortals (deployers) and some minor tab fixes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33469 [21:29:36] RECOVERY - Apache HTTP on srv210 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.054 second response time [21:31:18] New review: Dzahn; "approved per RT-3880" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/33469 [21:31:18] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33469 [21:32:19] !log repooling srv200-srv213 due to lack of automated imaging :( [21:32:26] Logged the message, notpeter [21:41:24] !log pgehres synchronized php-1.21wmf4/extensions/ContributionReporting/ [21:41:30] Logged the message, Master [21:48:57] !log pgehres synchronized wmf-config/CommonSettings.php [21:49:03] Logged the message, Master [21:53:03] LeslieCarr, just so I know [21:53:12] binasher: https://gdash.wikimedia.org/dashboards/filebackend/ combining lockmanager and streamfile graphs isn't really useful [21:53:12] what needs to be done to get the new servers talking with ganglia? [21:53:13] New patchset: Hashar; "zuul: setup.py require python-setuptools package" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33475 [21:53:17] some fancy network tricks? [21:53:46] the lock stuff should go its own graphs [21:54:38] ottomata: honestly, i'm not sure, that switch is having some problems and i'm talking with juniper in a minute [21:54:48] i've been staring at my phone, hoping it would explode so i wouldn't have to call them [21:54:58] sadly my laser eye beams are broken today [21:55:12] !log reedy synchronized php-1.21wmf4/extensions/Echo/ [21:55:16] rats, ok thanks! 
[21:55:18] Logged the message, Master [21:57:50] LeslieCarr, just so you know, (you probably already know), this def seems to be a multicast only communications problem [21:57:58] i can do multicast between machiens in this rack [21:58:06] but not between the ciscos (in another rack) and these [21:58:39] yeah [21:58:44] on the phone, awful hold music and all [21:58:55] cool :) [22:07:12] New patchset: Aaron Schulz; "Replaced /thumbs -> /thumbs2 and re-enabled multwrite for "quick" operations." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33479 [22:08:45] * AaronSchulz greats TimStarling with https://gerrit.wikimedia.org/r/#/c/33411/ [22:10:18] paravoid: 2.1T, heh, I wonder how long that will last ;) [22:10:29] we can always resize it [22:10:36] yeah, but still :) [22:10:56] ms5 had what? 5.5T? [22:11:02] and we're not copying thumbs right now, right? [22:11:08] we're just going to copy new thumb requests [22:13:00] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: Connection refused [22:13:44] (that'd be me) [22:17:23] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33479 [22:17:30] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [22:19:45] !log aaron synchronized wmf-config/filebackend.php 'Multiwrite thumbs to /thumbs2 and math/timeline files to nas as well.' [22:19:45] AaronSchulz: what if an unhandled exception is thrown after onTransactionIdle() is called? 
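The multicast problem ottomata describes, and the aggregator question that follows, both hinge on how ganglia's gmond daemons communicate: every node sends to one multicast group, and any node whose receive channel joins the same group can aggregate, so multiple aggregators "just work" as long as multicast is routable between the racks. An illustrative gmond.conf fragment (these are the ganglia defaults, not necessarily this cluster's values):

```
/* Illustrative fragment, ganglia default group/port.
   All nodes send to the multicast group... */
udp_send_channel {
  mcast_join = 239.2.11.71
  port = 8649
  ttl = 1
}
/* ...and any node that also joins the group on receive
   sees every metric, so aggregators are interchangeable. */
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
}
```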
[22:19:51] Logged the message, Master [22:20:04] MWExceptionHandler::handle() just calls exit() [22:20:21] in the past, we have relied on the fact that that is an implicit rollback [22:20:33] but with your change, it will throw an exception in the destructor [22:21:25] maybe that exception will even find its way back to the same exception handler and infinite recursion will result [22:21:51] because exit() is called outside of the try/catch block which normally protects it from exceptions thrown from exception handlers [22:22:42] !log aaron synchronized wmf-config/filebackend.php 'revert for a moment' [22:22:48] Logged the message, Master [22:23:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:25:22] paravoid: yeah, so we may need counterparts for those top-level dirs in /thumbs [22:26:38] TimStarling: yeah I wasn't totally sure about an exception there, a simple warning might be better [22:27:59] a warning would work, yes [22:28:15] * AaronSchulz wonders where paravoid went [22:28:56] New review: Dereckson; "en.wiktionary community have discussed this and would like to keep their permission" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/33390 [22:29:34] hmmm, notpeter, another ganglia q for you [22:29:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds [22:29:50] how do multiple aggregators for the same cluster work? [22:30:01] they will both join the same multicast addy [22:30:22] paravoid: ping [22:30:34] it is half past midnight for paravoid, according to CTCP [22:30:42] does ganglia just know how to deal with duplicate metrics? [22:30:45] maybe he found a life or something [22:30:51] TimStarling: yeah, but he is always up [22:31:09] * preilly knows that it's 12:30am Thursday (EET) - Time in Athens, Greece [22:31:45] TimStarling: doubtful [22:32:20] TimStarling: want to create a few empty dirs in /mnt/thumbs2 ?
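The warn-instead-of-throw approach Tim and Aaron settle on can be sketched language-agnostically. This is illustrative Python, not the MediaWiki PHP code under discussion: an exception raised from a destructor during shutdown risks re-entering the very exception handler that triggered the teardown, whereas a warning degrades gracefully while still flagging the open transaction.

```python
import warnings

class Connection:
    """Illustrative sketch (class name and fields are hypothetical):
    if a transaction is still open when the object is torn down,
    warn and roll back rather than raise."""

    def __init__(self):
        self.trx_open = True  # pretend a transaction was started

    def rollback(self):
        self.trx_open = False

    def __del__(self):
        if self.trx_open:
            # Raising here could recurse into the exception handler;
            # a warning is the safer choice discussed above.
            warnings.warn("implicit rollback: transaction still open")
            self.rollback()
```

A connection that was rolled back explicitly tears down silently; one abandoned mid-transaction produces the warning.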
[22:32:45] TimStarling: have you heard back from Rob Richards? [22:33:04] AaronSchulz: I'm your backup root after paravoid disappears? [22:33:09] New review: Dereckson; "fi.wikipedia too" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/33390 [22:34:43] TimStarling: do you want to be the primary root? ;) [22:34:45] preilly: no, I haven't [22:35:03] TimStarling: I talked to him last week [22:35:10] er, I was trying to write some python [22:35:21] TimStarling: nope, no life [22:35:40] TimStarling: he said that he would look at it soon [22:35:42] AaronSchulz: counterpart what? [22:35:44] preilly: pong [22:35:48] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.012 seconds [22:35:57] TimStarling: and that he was affected by Sandy [22:36:05] paravoid: do ls for /mnt/thumbs [22:36:06] TimStarling: I'll ping him again tomorrow [22:36:29] yes? [22:36:34] paravoid: can you check that the redis on silver and zhen is the same as production and has the same config [22:36:35] we need to mkdir the first-level? [22:36:36] or all levels? [22:36:40] first [22:36:51] enough to not get permission error spam [22:37:03] TimStarling: I'll let you know what I find out [22:37:24] TimStarling: he did do a quick scan of it and was leaning towards not accepting it in its current form [22:37:44] TimStarling: as it would more than likely affect other things in his mind [22:38:44] AaronSchulz: better? [22:40:03] should work, I wonder why they were 0777 before (aka "bright green") [22:40:11] meh [22:40:14] paravoid: are you able to take a look at redis? [22:40:38] paravoid: Dan Foy is in India still and is going to their office today to work with them [22:40:49] paravoid: and we wanted to confirm that it was okay before he gets there [22:41:24] look at what exactly?
[22:41:26] !log aaron synchronized wmf-config/filebackend.php [22:41:32] Logged the message, Master [22:42:08] paravoid: did you not see what I messaged you above, " can you check that the redis on silver and zhen is the same as production and has the same config"?!? [22:42:50] what do you mean "production"? [22:43:03] silver and zhen are production [22:43:04] paravoid: the memcached redis instance [22:43:18] paravoid: please try to not be difficult [22:43:32] I'm not being difficult, not on purpose anyway [22:43:51] !log aaron synchronized wmf-config/filebackend.php [22:43:56] paravoid: okay good text communicated is so terse [22:43:58] Logged the message, Master [22:44:09] s/communicated/communication [22:44:42] I don't know anything about our redis setup [22:44:46] either of the two setups [22:44:47] * AaronSchulz sighs [22:44:54] paravoid: okay that's fine [22:44:58] I could find out, but I'm occupied right now with some other things [22:45:05] (and after that I plan on getting some sleep) [22:45:07] paravoid: basically it appears that redis keeps crashing on silver [22:45:21] paravoid: okay [22:45:23] seems to be the same version [22:45:31] paravoid: it was 777 since MW writes as apache not www-data, so the perms are still wonk [22:45:35] well, seems to be the latest version according to the repository [22:46:14] * AaronSchulz wonders why the same things get him every time :) [22:46:54] paravoid: just make the perms match the /thumbs ones [22:48:32] sigh, 777, really... [22:48:45] I know, it feels like an antipattern sometimes :) [22:48:55] I made it apache:apache [22:49:09] I'm wondering if we're creating that user/group with the same uid/gid everywhere [22:49:11] why was it www-data? [22:49:16] what will break now? [22:49:29] because that was the owner on /mnt/thumbs too [22:49:35] I just didn't copy 777 on purpose [22:49:37] paravoid: thanks I'll just ask CT to find someone [22:50:01] preilly: it'd help if you'd ask in advance... 
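paravoid's worry about whether apache is created with the same uid/gid everywhere matters because NFS matches ownership by numeric id, not by name: "apache" owning a file on one client only looks like "apache" on another if the ids agree. A hedged sketch of how one might audit that (the helper and the data shape are hypothetical; input is each host's passwd content, e.g. from `getent passwd`):

```python
def uid_mismatches(host_passwd, user):
    """Given {host: passwd-file text}, return the hosts where `user`
    has a numeric uid different from the first host seen. On an NFS
    mount, ownership is matched by uid, so these must all agree."""
    uids = {}
    for host, text in host_passwd.items():
        for line in text.splitlines():
            fields = line.split(':')
            if fields[0] == user:
                uids[host] = int(fields[2])  # third field is the uid
    baseline = next(iter(uids.values()), None)
    return {h: u for h, u in uids.items() if u != baseline}
```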
[22:50:16] paravoid: I wish I knew in advance [22:50:25] paravoid: Dan Foy just emailed me about it [22:50:26] preilly: also, we have RT and this ops monkey/rt duty role nowadays, this may help [22:50:30] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [22:50:37] well, the files would be apache anyway (since it's not a sticky owner) [22:50:38] paravoid: okay sounds good [22:50:51] looking in /thumbs I see a mix of both even for shard dirs [22:50:57] paravoid: I appreciate your assistance [22:51:26] preilly: I'm sure CT would work out too, whichever you prefer [22:52:04] !log aaron synchronized wmf-config/filebackend.php 're-enabled writes to thumbs2/' [22:52:10] Logged the message, Master [22:52:29] AaronSchulz: do you mind if we try it like that and chase down bugs as they come? [22:52:34] I know this isn't the most productive use of our time [22:52:49] but otoh, 777 with the directory mounted on fenari no less is... not ideal. [22:52:59] heh, ok [22:53:07] * AaronSchulz looks for nas in ganglia [22:53:16] yeah good luck with that :P [22:53:28] good luck I was not having [22:53:40] what are you looking for? [22:53:58] besides fancy graphs that I can't give to you :) [22:54:07] oh, well in that case, nothing :) [22:54:26] heh [22:54:33] so, I'm patching up rewrite.py [22:54:38] fixed the connection refused thing [22:54:46] and I'm looking at the double-escaping error [22:54:55] so, basically I'm at a crossroad [22:55:16] !log Belatedly starting VisualEditor dark launch, going a bit over time [22:55:23] Logged the message, Mr. Obvious [22:55:26] should I just return what the imagescalers returned (fancy errors, correct error code etc.) [22:55:44] !log aaron synchronized wmf-config/filebackend.php [22:55:50] Logged the message, Master [22:55:56] but this may be counter-intuitive [22:56:02] i.e. 
we won't know what produced the 404 [22:56:04] well the logs are not spamming [22:56:07] or 403 or whatever [22:56:18] what do you think? [22:56:55] what do you mean "we won't know what produced the 404"? [22:57:29] if I'm just copying the imagescaler's errors verbatim [22:57:46] and returning those, as-is [22:57:47] the message bodies are usually different [22:58:00] so you know what's what...and I can always change thumb.php too [22:58:06] true [22:59:48] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [23:03:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:05:34] AaronSchulz: do you happen to have an example of these long URLs that you're returning 301/302 for? [23:06:02] http://test.wikipedia.org/wiki/Special:Contributions/Aaron_Schulz [23:06:26] paravoid: you can modify the thumb urls for that file to the long format [23:06:33] hehe useful [23:06:34] (my last edit) [23:06:34] thanks [23:07:00] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: Connection refused [23:08:39] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.011 seconds [23:10:54] !log catrope synchronized php-1.21wmf3/extensions/VisualEditor 'Updating VE to master' [23:11:00] Logged the message, Master [23:11:26] !log catrope synchronized php-1.21wmf4/extensions/VisualEditor 'Updating VE to master' [23:11:32] Logged the message, Master [23:15:15] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: Connection refused [23:15:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds [23:17:04] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.010 seconds [23:17:27] ha! just found another bug [23:17:35] we're not varying on protocol with those redirects [23:17:52] so we're returning 301 http://upload/...
even for https:// requests [23:18:17] Yeah [23:18:17] that's at least a rewrite.py bug, haven't verified if it's a MW bug too (unlikely) [23:18:37] A similar bug exists in the Apache redirect rules for HTTPS-only domains. I think binasher was looking at that [23:21:17] New patchset: MaxSem; "Support for multiple Solr cores" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29827 [23:23:40] New patchset: Catrope; "Dark launch VisualEditor on enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33488 [23:24:06] Reedy: You up for reviewing that real quick? ---^^ [23:24:14] * RoanKattouw could self-review but avoids doing that if practical [23:24:33] AaronSchulz: CT just pointed me to http://torrus.wikimedia.org/torrus/Storage?path=/Storage/nas1-a.pmtpa.wmnet/NetApp_General/&view=expanded-dir-html [23:24:37] it's not ganglia, but it's better than nothing [23:24:41] yep [23:26:23] Meh [23:26:31] * RoanKattouw is lazy, gives up waiting for Reedy and self-reviews [23:26:36] Change merged: Catrope; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33488 [23:26:44] lol [23:26:51] A whole 2 minutes and 24 seconds ;) [23:27:03] I did say I was lazy :) [23:27:19] Besides I'm already half an hour over my window [23:27:33] What is ParsoidPrefix? [23:27:47] !log catrope synchronized wmf-config/InitialiseSettings.php 'Dark launch VisualEditor on enwiki' [23:27:48] Brooke: Interwiki prefix-esque thing for Parsoid [23:27:52] Weird. [23:27:53] Logged the message, Master [23:28:01] <^demon> RoanKattouw: You self reviewed? For shame. [23:28:04] I'd think those types of prefixen would already be defined somewhere. [23:28:05] <^demon> A plague on your house.
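The protocol-varying bug paravoid found (a 301 to http://upload/... even for https:// requests, addressed by change 33492, "swift: fix https for short thumb URL redirects") presumably comes down to deriving the redirect scheme from the request instead of hard-coding it. A minimal WSGI-style sketch, assuming the standard `wsgi.url_scheme` environ key; the function and fallback host are illustrative, not the actual rewrite.py code:

```python
def redirect_location(environ, path):
    """Build a Location header that preserves the request's scheme,
    so HTTPS clients aren't redirected back to plain http://."""
    scheme = environ.get('wsgi.url_scheme', 'http')
    host = environ.get('HTTP_HOST', 'upload.wikimedia.org')
    return '%s://%s%s' % (scheme, host, path)
```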
[23:28:10] New patchset: Faidon; "swift: also handle URLErrors from imagescalers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33490 [23:28:10] New patchset: Faidon; "swift: passthrough all imgscalers errors as-is" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33491 [23:28:10] New patchset: Faidon; "swift: fix https for short thumb URL redirects" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33492 [23:28:10] New patchset: Faidon; "swift: removed code to hide the ETag." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23392 [23:28:11] New patchset: Faidon; "swift: removed copy2() and friends from rewrite.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25410 [23:28:27] AaronSchulz: are you up for some reviewing? :) [23:28:39] I also merged and tested your two changes too [23:29:11] RoanKattouw: No namespace definitions? [23:29:21] PROBLEM - Apache HTTP on mw1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [23:29:24] Brooke: We're not gonna have a VisualEditor namespace on enwiki [23:29:38] It'll operate in the main namespace, if the user enables the preference [23:29:39] Hmmm. Default is NS_MAIN. [23:29:44] Does that mean it's on now? [23:29:45] The "dark" part is that the preference is hidden right now [23:29:50] Ah. [23:29:51] And defaults to false [23:29:58] So you can't turn it on unless you have DB access [23:30:04] The API can't set prefs? [23:30:12] Which I do, so I'm gonna proceed to turn it on for the VE team [23:30:13] There's a bug about that. (tm) [23:30:15] Not that I know of [23:30:21] Lame. [23:30:24] I thought it could now. [23:30:29] paravoid: did you just rebase them? [23:30:32] yes [23:30:58] Brooke: Dang it, you're right [23:31:07] * RoanKattouw tries to see if that works [23:31:38] Humanity weeps when I'm right. [23:36:20] Yup, you were right [23:36:29] You can set it with &action=options, right? 
[23:36:35] Yes [23:37:25] New patchset: Faidon; "swift: also handle URLErrors from imagescalers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33490 [23:37:26] New patchset: Faidon; "swift: passthrough all imgscalers errors as-is" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33491 [23:37:26] New patchset: Faidon; "swift: fix https for short thumb URL redirects" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33492 [23:37:36] I'm smiling just a little to myself that I knew about an API feature and you didn't. :P [23:37:37] that's better [23:38:33] You're entitled to that :) [23:38:45] Hmm, it's not respecting it for some reason [23:39:00] I set the pref for myself, but VE doesn't turn on [23:39:36] RoanKattouw: You verified that the pref is set with ... whatever that option is? [23:39:43] Yes [23:39:46] OHAI [23:39:59] jello there [23:40:03] https://en.wikipedia.org/w/api.php?action=query&meta=userinfo&uiprop=options&format=jsonfm [23:40:03] >>> mw.user.options.get('visualeditor-enable') [23:40:05] "1" [23:40:11] yum jello :3 [23:40:24] Quoted string? [23:40:39] my grandfather invented jello [23:40:42] COW BONE [23:40:46] really?? [23:40:49] yep [23:40:49] das so coo [23:40:51] he da bes [23:40:52] mhm [23:41:15] feenga in da air [23:41:26] awh yea! [23:41:32] thekharatekid, Mi`: Please use #wikipedia or another social channel for off-topic chat. [23:41:44] RoanKattouw: It seems options can be integers or strings? Bizarre... [23:41:56] "enotifusertalkpages": 1, [23:42:00] "forceeditsummary": "1", [23:42:15] Yeah it's weird [23:44:21] RoanKattouw: Where's the options check in VE? [23:44:38] In PHP [23:44:42] And it didn't have $ignoreHidden set [23:44:42] Which file? [23:44:48] Ah. [23:44:49] VisualEditor.hooks.php [23:44:55] See also https://gerrit.wikimedia.org/r/33493 [23:45:10] You're too fast.
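The int-vs-string inconsistency Roan and Brooke spot ("enotifusertalkpages": 1 but "forceeditsummary": "1") is easy to trip over when testing a preference's truthiness, since the string "0" is truthy in most languages. A hedged sketch of a defensive check (the helper is hypothetical, not MediaWiki code; it only assumes the mixed int/string shape shown in the log):

```python
def option_enabled(options, name, default=False):
    """Treat 1, "1", and True as enabled, and 0, "0", and "" as
    disabled, since the options API mixes ints and numeric strings."""
    value = options.get(name)
    if value is None:
        return default
    if isinstance(value, str):
        return value not in ('', '0')
    return bool(value)
```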
[23:48:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:49:43] !log catrope synchronized php-1.21wmf3/extensions/VisualEditor 'Updating VE to master' [23:49:49] Logged the message, Master [23:50:01] !log catrope synchronized php-1.21wmf4/extensions/VisualEditor 'Updating VE to master' [23:50:09] Logged the message, Master