[00:07:30] Got another 503 again for uploads [00:07:31] Failed to load resource: the server responded with a status of 503 (Service Unavailable) https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/Default_Mode_rules.png/440px-Default_Mode_rules.png [00:07:42] not consistently reproducible though [00:07:48] wfm :D [00:08:07] yeah, me too. It just shows up from time to time for random thumbnails. [00:08:15] and the second time always works of course :-/ [00:08:25] This has been going on for several weeks now, something is broken. [00:08:31] a bad server maybe? [00:08:53] see if you can grab the headers of a bad request [00:09:38] I'd have to catch it with the dev tools open, I only see it when I see a broken image on the page, when I inspect it I only get it http code and url, headers only on refresh, at which point it doesn't happen :) [00:09:57] so will probably take a few more hours before I get one again [00:11:15] There must be a way to get it to always save the headers [00:18:28] (03PS1) 10Ori.livneh: Enable localStorage module caching on enwi^H^H^H beta cluster [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92460 [00:19:06] (03CR) 10Ori.livneh: [C: 032] Enable localStorage module caching on enwi^H^H^H beta cluster [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92460 (owner: 10Ori.livneh) [00:23:14] (03Merged) 10jenkins-bot: Enable localStorage module caching on enwi^H^H^H beta cluster [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92460 (owner: 10Ori.livneh) [00:25:19] !log ori synchronized wmf-config/CommonSettings-labs.php 'I6ca3517dc: Enable localStorage module caching (If2ad2d80d) on beta cluster' [00:25:34] Logged the message, Master [00:26:04] (03PS1) 10Reedy: Update CentralAuth RC2UDP config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92463 [00:26:57] (03CR) 10jenkins-bot: [V: 04-1] Update CentralAuth RC2UDP config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92463 (owner: 10Reedy) [00:28:10] paravoid: ping [00:28:10] (03PS2) 10Reedy: Update CentralAuth RC2UDP config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92463 [00:37:31] gwicke: pong [00:37:39] what's up? [00:37:48] (I'm about to go to sleep :) [00:37:55] just saw your question re the public IP for the parsoid service [00:38:00] oh [00:38:15] but it's not urgent & I responded on the ticket [00:38:33] so don't let me keep you from sleeping! [00:38:38] hehe, don't worry :) [00:39:19] it is basically a stop-gap to let mobile and others who are eager to get their hands on Parsoid HTML to access the existing internal web service [00:39:53] ok [00:40:11] not intended to be published widely [00:40:29] gwicke: liink to ticket? [00:40:34] the Kiwix folks recently dumped all of the French Wikipedia through our tiny labs instance [00:40:42] YuviPanda: RT #6107 [00:41:42] ah, alright [00:43:01] MaxSem: do you remember which graph you were looking at that showed a memcached perf issue starting in august / sept? [00:43:18] it's in my mail [00:43:23] back from september [00:43:39] the "site issues" one [00:43:51] ori-l, https://graphite.wikimedia.org/dashboard/temporary-33 [00:44:26] tim replied with a guess there [00:44:40] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Memcached%20eqiad&m=cpu_report&r=year&s=by%20name&hc=4&mc=2&st=1383007424&g=network_report&z=large [00:56:00] sigh sigh sigh, anybody want to help me troubleshoot the ever annoying stuck ganglia metric? 
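The stuck-metric hunt that starts here comes down to comparing what the plugin computes against what the aggregator's gmond is actually holding (the netcat-into-the-aggregator check described below). A rough sketch of that comparison, assuming gmond's standard XML dump on TCP 8649 and using the host and metric names mentioned later in this log:

```python
"""Sketch of the gmond-vs-plugin comparison described below.

Connects to a gmond aggregator's XML port (8649), pulls the value it
currently holds for one metric on one host, and prints it with a
timestamp; run it in a loop to see whether the value ever changes.
Host and metric names are just the ones mentioned in this log.
"""
import socket
import time
import xml.etree.ElementTree as ET

AGGREGATOR = 'ms1004.eqiad.wmnet'    # assumption: the aggregator being queried
PORT = 8649                          # gmond's standard XML dump port
HOST = 'testsearch1001.eqiad.wmnet'  # host whose metric looks stuck
METRIC = 'es_gc_time'

def fetch_metric(aggregator, host, metric):
    """Return the value gmond currently reports for (host, metric), or None."""
    sock = socket.create_connection((aggregator, PORT), timeout=10)
    chunks = []
    try:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    finally:
        sock.close()
    tree = ET.fromstring(b''.join(chunks))
    for host_el in tree.iter('HOST'):
        if host_el.get('NAME', '').startswith(host.split('.')[0]):
            for metric_el in host_el.iter('METRIC'):
                if metric_el.get('NAME') == metric:
                    return metric_el.get('VAL')
    return None

if __name__ == '__main__':
    while True:
        print(time.strftime('%H:%M:%S'), fetch_metric(AGGREGATOR, HOST, METRIC))
        time.sleep(15)
```

Run side by side with the plugin's own output (the spreadsheet approach used below): a value that keeps changing in the plugin but never changes on the aggregator points at the send path rather than the plugin itself.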
[00:56:10] cmmooon, it'll be fuuuuuuuUUuuun! [00:58:05] (03CR) 10MZMcBride: "Ori: I'd be happy to kill this entire idea by fixing bug 50422 instead. :-)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/84743 (owner: 10QChris) [02:15:32] !log LocalisationUpdate completed (1.23wmf1) at Tue Oct 29 02:15:32 UTC 2013 [02:15:51] Logged the message, Master [02:29:03] !log LocalisationUpdate completed (1.22wmf22) at Tue Oct 29 02:29:03 UTC 2013 [02:29:16] Logged the message, Master [03:01:47] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Oct 29 03:01:47 UTC 2013 [03:02:02] Logged the message, Master [03:28:18] ottomata: which metric? [03:28:36] i'm working on any of the es metrics on testsearch1001 [03:28:38] elasticsearh [03:28:44] particularly es_gc_time right now [03:28:48] because it changes relatively often [03:29:12] https://docs.google.com/spreadsheet/ccc?key=0AtLjsFovAGuvdERqVUd2TnZta0pQLTNYZmNfcXVqMkE#gid=0 [03:29:24] so, these are the values at different points, queried every 15 seconds [03:29:38] you can see that the value reported by the ganglia plugin directly changes kinda often [03:29:58] but the value from gmond, obtained by netcat into ms1004.eqiad.wmnet [03:30:01] is stuck [03:30:07] the other values lag behind [03:30:11] and eventually catch up to what is in gmond [03:30:20] but since gmond never changes, they don't etierh [03:30:53] if I restart gmond on testsearch1001 [03:31:11] usually at least one new value makes it to gmond on the aggregator [03:31:21] but it always gets stuck again somewhere [03:31:31] where's the script? [03:31:38] the script to output this? [03:31:47] the metric module [03:31:50] or the python module [03:31:51] ah [03:32:00] i suppose i could just ssh into testsearch1001 and take a look [03:32:16] yeah its in puppet too [03:32:34] modules/elasticsearch/files/ganglia [03:32:48] elasticsearch_monitoring.py [03:38:27] using this to check stuff, ori-l [03:38:28] https://gist.github.com/ottomata/7208847 [03:40:00] how often do the values change in elasticsearch? [03:40:24] ottomata: ^ [03:40:59] this value I see changing at least once every 30 seconds, maybe more often [03:41:06] maybe I can find one that changes more frequently [03:41:15] they all go stale [03:43:07] es_index_*_docs_count changes [03:43:17] havent' seen the others change yet [03:44:27] es_gc_time changes [03:45:25] eyah but not that often, maybe every 30 seconds - 1 min [03:59:52] seeing anything i'm not, ori-l? [04:00:40] ottomata: not yet [04:04:28] ottomata: try /root/el.py on testsearch1001 [04:07:21] oo [04:07:50] well, i guess i'm not giving it the same params [04:08:29] your'e not seeing anything change? [04:08:36] i don't see anyting change with your script :/ [04:08:45] no, me neither now [04:09:21] ottomata: where did the '*' notation for all indices come from? [04:09:26] it's not mentioned in the docs http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-stats.html [04:09:51] and if you try to curl it, although you get JSON back, the http status code is 404 [04:09:51] ? [04:09:56] was I doing that? 
[04:10:04] param indices { [04:10:04] value = "*" [04:10:04] # value = "index1 index2 index3,index4" [04:10:06] } [04:10:10] oh, I di dn't write it [04:10:12] dunno [04:10:15] in /etc/ganglia/conf.d/elasticsearch.pyconf [04:10:37] https://github.com/ganglia/gmond_python_modules/blob/master/elasticsearch/conf.d/elasticsearch.pyconf [04:10:48] i think nik grabbed it from there [04:16:00] ok, thanks for the help ori-l [04:16:03] gotta go to sleep [04:16:19] oh more is changing more often right now :) [04:16:21] ok niggghters [04:16:30] np, i think you need to specify the indexes [04:54:11] morning [04:55:04] I'm just making TODO lists to get back on track and realized that between all the requested backports, I have to build *15* Debian packages [05:02:53] how hard are they actually to do? [05:38:11] paravoid: gonna put that DD badge of yours to good use ;) [05:41:59] Reedy: some of them are backports, should be easy enough [05:43:20] i should try sometime when its not 0543 ;) [05:44:23] Reedy: I see that hour of the day some times, just, on the other end of my day :) [05:44:52] I packaged most of cowsay for optware ;) [05:55:08] PROBLEM - MySQL Slave Running on db32 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Table _archive_new already exists on query. Default databas [05:57:08] RECOVERY - MySQL Slave Running on db32 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [06:13:52] (03PS2) 10ArielGlenn: remove entries for db5,6,7,26,27 long since decommed [operations/dns] - 10https://gerrit.wikimedia.org/r/92272 [06:15:34] (03PS3) 10ArielGlenn: remove entries for db5,7,8,26,27 long since decommed [operations/dns] - 10https://gerrit.wikimedia.org/r/92272 [06:19:37] (03CR) 10ArielGlenn: [C: 032] remove entries for db5,7,8,26,27 long since decommed [operations/dns] - 10https://gerrit.wikimedia.org/r/92272 (owner: 10ArielGlenn) [06:41:08] (03PS1) 10ArielGlenn: remove last of sq38-42, 45-47, decommed long ago [operations/dns] - 10https://gerrit.wikimedia.org/r/92491 [06:44:24] PROBLEM - Disk space on copper is CRITICAL: DISK CRITICAL - free space: / 355 MB (3% inode=90%): [06:44:47] meh [06:48:24] RECOVERY - Disk space on copper is OK: DISK OK [06:48:36] ok, I created a 100G LV [06:48:38] should be enough for now [06:52:42] apergos: ^^^ thanks for all the pings, fixed permanently. [06:52:53] yep saw, that's perfect [06:53:19] even if that were to fill up it won't kill regular system operation (like / filling) [06:53:21] (03CR) 10TTO: "(1 comment)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92314 (owner: 10Dereckson) [06:54:13] (03PS1) 10ArielGlenn: remove sq38-42, 45-47 from remaining dsh files [operations/puppet] - 10https://gerrit.wikimedia.org/r/92493 [06:56:01] (03CR) 10ArielGlenn: [C: 032] remove sq38-42, 45-47 from remaining dsh files [operations/puppet] - 10https://gerrit.wikimedia.org/r/92493 (owner: 10ArielGlenn) [07:10:17] Reedy: about? [07:11:48] foreachwiki is broken on terbium [07:11:58] /usr/local/lib/mw-deployment-vars.sh has MW_COMMON_SOURCE=/a/common (etc.) 
[07:12:07] but there's no /a, there, just /apache [07:13:14] !log "ln -s /apache /a" on terbium; foreachwiki and friends was broken (for a while) due to /usr/local/lib/mw-deployment-vars.sh pointing to non-existent /a [07:13:26] it'll definitely work on tin [07:13:26] judging from the 5T of temp files on swift, quite a while [07:13:31] haha [07:13:32] Logged the message, Master [07:13:55] we have a maintenance script using foreachwiki [07:14:09] on a cron job, so that's terbium [07:16:12] and now the php script is broken [07:16:14] throws exception [07:16:15] oh joy [07:17:08] which script? [07:17:20] hm, maybe we're missing a proper config on terbium? [07:18:40] no, config looks fine [07:18:46] "/usr/local/bin/foreachwiki maintenance/cleanupUploadStash.php" is what I'm trying to run [07:20:02] what's the exception? [07:20:39] bswikibooks: UploadStash::getFile No user is logged in, files must belong to users [07:20:47] if ( !$noAuth && !$this->isLoggedIn ) { [07:20:47] throw new UploadStashNotLoggedInException( __METHOD__ . [07:20:50] ' No user is logged in, files must belong to users' ); [07:20:53] } [07:21:09] it's not urgent, I'll file a bug [07:21:30] I guess noauth needs to be true there [07:21:37] no [07:21:43] well, maybe, dunno [07:21:48] (03CR) 10Nemo bis: "Now tracked at https://bugzilla.wikimedia.org/show_bug.cgi?id=56287" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/68417 (owner: 10Nemo bis) [07:21:57] but it needs to login at some point if it will actually clean up :) [07:23:54] I guess its been broken for a while [07:23:56] yeah [07:25:14] https://bugzilla.wikimedia.org/show_bug.cgi?id=56298 [08:02:12] mark, do you have few min now? [08:52:53] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 316 seconds [08:53:12] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 324 seconds [08:53:28] springle: that you? [08:53:43] (db1047 replag) [08:53:56] don't think so. i [08:54:20] i'm hammering s4 atm, not db1047 [08:54:51] DELETE /* SolrUpdate::safeExecute [08:54:51] ok [08:54:53] some job [08:55:00] from? terbium? [08:56:13] apache 13200 0.0 0.0 12308 1424 ? S 08:54 0:00 /bin/bash /usr/local/bin/foreachwikiindblist /usr/local/apache/common/special.dblist extensions/GeoData/solrupdate.php --clear-killlist 3 --noindex [08:56:17] looks like it [08:56:31] ya.. master has already finished it [08:57:09] or rather, no terbium->wikiadmin doing much right this second [08:57:18] I wonder if my fix has anything to do it with it [08:57:28] yeah it probably did [08:57:34] foreachwiki/foreachwikiindblist was broken on terbium [08:57:37] who knows for how long [08:57:40] jobs suddenly started running because you fixed the maintenance host? :) [08:57:43] yeah [08:57:43] lol [08:59:29] this happened yesterday at around this time, but recovered in 5 minutes, while I was still lookingat it [08:59:44] db1047? [08:59:45] same db [08:59:47] yup [08:59:51] hmm [09:00:31] didn't save the processlist since it was already done [09:01:56] ---TRANSACTION 854075953, ACTIVE 389792 sec fetching rows [09:02:03] that will slow it down [09:02:14] one of the research queries [09:02:42] the COUNT(DISTINCT linked_page.page_id) was going, I remember that [09:03:35] can't tell you about the other two, sorry. it was a short list though, only a few things like today [09:04:22] db1047 isn't used by MW, only research. 
not much can be done about replag in this case [09:15:39] You can hurt the researchers >.> <.< [09:59:48] lol [10:00:01] that might explain a few things.... [10:02:42] to the special pages are now fixed? [10:07:59] ? [10:21:42] paravoid: https://bugzilla.wikimedia.org/show_bug.cgi?id=53227 [10:23:20] speaking of which, https://gerrit.wikimedia.org/r/#/c/90117/ is a rather trivial change and helps making things easier to understand [10:24:21] ah, but update-special-pages doesn't use foreachwiki/foreachwikiindblist [10:38:40] (03CR) 10Faidon Liambotis: "(2 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/92288 (owner: 10Matanya) [10:38:49] (03CR) 10Faidon Liambotis: [C: 04-1] removing cache clean up patch [operations/puppet] - 10https://gerrit.wikimedia.org/r/92288 (owner: 10Matanya) [10:39:24] (03PS2) 10Dereckson: DynamicPageList extension configuration maintenance [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92314 [10:41:39] (03CR) 10Dereckson: "(1 comment)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92314 (owner: 10Dereckson) [10:43:57] (03PS2) 10Matanya: removing cache clean up patch [operations/puppet] - 10https://gerrit.wikimedia.org/r/92288 [10:46:41] (03PS3) 10Matanya: removing cache clean up patch [operations/puppet] - 10https://gerrit.wikimedia.org/r/92288 [11:08:12] Are there known issues with the blog? [11:08:24] http://blog.wikimedia.org/2013/10/24/airtel-wikipedia-zero-text-trial/#comments says there are 41 comments in 3 pages. None are displayed. [11:08:41] I remember seeing this before [11:08:53] maybe they're held for moderation? [11:09:07] we don't really do much with blog to be honest [11:09:16] just bugzilla is? [11:09:20] s/is/it/ [11:09:36] no idea [11:09:56] last time we were discussing about blog issues it was decided for us to give it to an external contractor [11:09:59] that was months ago [11:10:14] I think the communications is doing moderations and such? [11:11:55] I've reported https://bugzilla.wikimedia.org/show_bug.cgi?id=56308 [11:12:00] paravoid: is this file using tabs? [11:12:07] which file? [11:12:20] templates/varnish/mobile-frontend.inc.vcl.erb [11:12:24] siebrand: I think it's the pingbacks [11:17:24] paravoid: templates/varnish/mobile-frontend.inc.vcl.erb [11:17:42] why are you asking me? :P [11:18:16] $ grep -q '\t' templates/varnish/mobile-frontend.inc.vcl.erb && echo yes [11:18:19] yes [11:18:24] you edited it last before me ... [11:19:09] paravoid: so should i follow that scheme or the normal 4 space in my patch? [11:19:11] the answer is "yes, it's indented with tabs" :) [11:24:02] Thanks paravoid :) [11:24:09] :) [11:24:25] you may merge my patch if you wish [11:24:51] why did you convert to spaces? 
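Since the tabs-versus-spaces question keeps coming up in this log (the VCL template here, the zone files later, and mark's wish for a lint check that rejects stray tabs), here is a minimal standalone sketch of such a check. It is illustrative only, not the lint job CI actually runs; the mixed-indentation rule mirrors the grep test quoted above.

```python
"""Minimal whitespace lint in the spirit of the checks discussed here:
flag files that mix tab- and space-indented lines, or optionally any
tab at all. Illustrative only, not the check Jenkins actually runs.
"""
import sys

def indent_styles(path):
    """Return the set of indentation styles seen at line starts."""
    styles = set()
    with open(path) as f:
        for line in f:
            if line.startswith('\t'):
                styles.add('tab')
            elif line.startswith(' '):
                styles.add('space')
    return styles

def main(paths, forbid_tabs=False):
    bad = []
    for path in paths:
        styles = indent_styles(path)
        if len(styles) > 1:
            bad.append((path, 'mixes tabs and spaces'))
        elif forbid_tabs and 'tab' in styles:
            bad.append((path, 'uses tab indentation'))
    for path, reason in bad:
        print('%s: %s' % (path, reason))
    return 1 if bad else 0

if __name__ == '__main__':
    sys.exit(main(sys.argv[1:]))
```

Pointed at, say, templates/varnish/*.vcl.erb it would print any file that mixes styles and exit non-zero, which is the property being argued for above.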
[11:25:41] i noted it too late [11:26:10] that is why i asked you in the first place if i should use tabs [11:26:22] yes, you should [11:26:34] in general, try to keep indentation the same as the rest of the file [11:26:46] so the whole file is indented with tabs (as is all of our .vcl files) [11:26:53] so we shouldn't mix tab/spaces [11:27:37] you should never ever use both in a single file [11:27:52] * matanya is fixing [11:27:52] and if you decide you want to change it, which may or may not be warrantied, ALWAYS do it as a separate commit with no other changes but whitespace [11:29:51] warranted [11:30:38] (03PS4) 10Matanya: removing cache clean up patch [operations/puppet] - 10https://gerrit.wikimedia.org/r/92288 [11:30:41] thanks for this mark [11:31:12] paravoid: ^ [11:31:15] ok [11:31:23] we should make sure it's 30 days indeed [11:31:31] agreed [11:36:42] !log gerrit: creating integration/phpunit to hold PHPUnit files for deployment [11:36:57] Logged the message, Master [11:53:49] (03PS1) 10Hashar: deployment: integration/phpunit for Jenkins CI slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/92510 [11:56:39] !log lanthanum.eqiad.wmnet running apt-get upgrade [11:56:53] Logged the message, Master [12:09:27] !log jenkins refreshing jobs to let them recurse in git submodules {{bug|55614}} {{gerrit|92511}} [12:09:41] Logged the message, Master [12:10:07] addshore: I am refreshing the Jenkins jobs to let them recurse in submodules [12:12:11] :) [12:29:44] addshore: deployed :) [12:29:47] !b 55614 [12:29:47] https://bugzilla.wikimedia.org/55614 [12:30:17] :) [12:32:12] (03PS2) 10Mark Bergsma: Repartition esams LVS service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/92344 [12:32:13] (03PS2) 10Mark Bergsma: Repartition eqiad LVS service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/92343 [12:32:14] (03PS2) 10Mark Bergsma: Repartition ulsfo LVS service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/92342 [12:32:55] * mark violates his own rules on mixing tabs/spaces [12:33:10] bad mark [12:33:36] zonefiles really should be made spaces soon [12:35:33] want a lint check on them that reject the change whenever a tab is encountered ? [12:35:51] after we migrate all of them, yes [12:35:53] not before ;) [12:37:09] You could [12:37:17] see who it annoys enough to change them all ;) [12:38:10] MS-DOS Application (.com) [12:38:12] gj Windows [12:41:49] akosiaris: ahh alexanders :-] Are you familiar with the deployment system ? I would like to be able to deploy a repo containing phpunit on the CI slaves https://gerrit.wikimedia.org/r/92510 [12:43:48] hashar: I used to be a couple of weeks ago [12:44:05] then a ryan started pushing a ton of changes and now i lag behind [12:44:16] doh [12:44:36] let me get something done first and I 'll try to figure it out in some 30 mins. OK ? [12:44:41] sure [12:44:46] will grab a snack meanwhile [13:01:15] (03PS1) 10Reedy: Tabs to spaces [operations/dns] - 10https://gerrit.wikimedia.org/r/92516 [13:01:26] +299, -299 [13:02:26] (03PS1) 10Mark Bergsma: Add new ulsfo upload LVS service IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/92517 [13:02:40] not now reedy [13:02:49] i'm doing a ton of lvs service ip changes, am not gonna rebase em now [13:02:50] It's alright, I didn't to it by hand [13:03:42] (03CR) 10Mark Bergsma: [C: 04-1] "Please not now, I'm doing zero repartitioning atm..." 
[operations/dns] - 10https://gerrit.wikimedia.org/r/92516 (owner: 10Reedy) [13:06:30] (03PS2) 10Reedy: Tabs to spaces [operations/dns] - 10https://gerrit.wikimedia.org/r/92516 [13:06:40] (03Abandoned) 10Reedy: Tabs to spaces [operations/dns] - 10https://gerrit.wikimedia.org/r/92516 (owner: 10Reedy) [13:07:28] I like tabs in DNS anyway :) [13:09:18] (03CR) 10Mark Bergsma: [C: 032] Add new ulsfo upload LVS service IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/92517 (owner: 10Mark Bergsma) [13:15:41] !log reprepro: include backported graphite-carbon, graphite-web, python-whisper (plus deps python-django-tagging, flot) from saucy; replace custom-built python-whisper [13:15:58] Logged the message, Master [13:19:07] (03CR) 10Faidon Liambotis: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/92271 (owner: 10Ori.livneh) [13:27:20] mark, do you have time to look at it now? [13:27:32] at what, yurik? [13:27:35] ESI [13:28:29] mark, enabling ESI on beta cluster causes it to create tons of instances [13:28:44] what makes you think that? [13:30:32] mark, the varnish frontend log shows tons of strange "creating new thread" entries [13:30:46] can you work on it now? i can show you how i get there [13:30:46] yeah, because varnish crashes and it restarts :) [13:31:00] of course it makes a lot of new threads then [13:32:37] oh, ok, so could you take a look - connect to the bastion / 10.4.1.82 [13:32:53] i did have a look last week [13:33:19] is there anything new? [13:33:46] did you see anything weird there? [13:34:16] last week i asked you to look but then thought that it might have to do with the backend producing relative paths [13:34:26] so i did a few hacks - but it crashes reagrdless [13:35:13] i think it's more likely that it's simply https://www.varnish-cache.org/trac/ticket/1184 [13:35:43] that link crashes too ;) [13:37:12] hmm... something is weird with the russian internet - i constantly have issues connecting to a large number of sites (like google) - regardless of where I am [13:38:37] mark, so how can we verify its that bug? [13:38:44] it doesn't seem the same [13:39:07] check if it asserts, in the system logs [13:39:13] if the message is different, let's check the source [13:39:20] then it's yet another bug, probably unfixed [13:39:43] mark, could you check it with me - it would take you far less time - i will simulate the issue [13:40:18] yurik, i don't really have much time for this right now [13:40:27] i'm going on a 3 week holiday in a few weeks, and need to finish some stuff before then [13:40:32] ESI isn't one of them I'm afraid [13:41:07] what the segfault or assert message in the log if you crash it? [13:41:24] what log file should i look at? [13:41:54] /var/log/syslog [13:41:55] or dmesg [13:42:49] [1261947.673830] varnishd[10196]: segfault at 7f3a845c0000 ip 00007f3a9268d20b sp 00007f3a845bc158 error 6 in libvgz.so[7f3a92687000+13000] [13:42:51] [1262024.112711] varnishd[10959]: segfault at 7f3a825a2280 ip 000000000042022b sp 00007f3a825a2280 error 6 in varnishd[400000+83000] [13:42:52] [1262025.099503] varnishd[11254]: segfault at 7f3a845bf000 ip 00007f3a9268d20b sp 00007f3a845bb158 error 6 in libvgz.so[7f3a92687000+13000] [13:42:54] [1309764.416207] varnishd[510]: segfault at 7fca24d40280 ip 000000000042022b sp 00007fca24d40280 error 6 in varnishd[400000+83000] [13:43:00] gdb it? [13:43:01] these are the last messages [13:43:17] get a backtrace, try to find why it's crashing? [13:43:40] paravoid: would love to, but any hints on the steps to do it? 
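The answer to yurik's question follows just below: install varnish-dbg, attach gdb, reproduce the crash, then bt full. The fiddly part is picking the right process, since each varnishd instance is a manager plus a worker child and only the child is worth attaching to. A small sketch of automating that selection; the 'frontend' name match is an assumption about how the frontend instance is labelled on this host, so adjust it to whatever ps -waux actually shows.

```python
"""Sketch: find the varnishd child process worth attaching gdb to.

Each varnishd instance is a parent (manager) plus a child (worker); the
parent only restarts the child, so gdb must attach to the child of the
frontend instance. The 'frontend' hint below is an assumption about how
that instance is started on this host.
"""
import subprocess

def varnish_processes():
    """Return (pid, ppid, args) for every varnishd process."""
    out = subprocess.check_output(['ps', '-eo', 'pid,ppid,args'])
    procs = []
    for line in out.decode().splitlines()[1:]:
        pid, ppid, args = line.strip().split(None, 2)
        if 'varnishd' in args:
            procs.append((int(pid), int(ppid), args))
    return procs

def frontend_child_pid(name_hint='frontend'):
    procs = varnish_processes()
    varnish_pids = set(pid for pid, _, _ in procs)
    for pid, ppid, args in procs:
        # a child's parent is itself a varnishd; filter on the instance name
        if ppid in varnish_pids and name_hint in args:
            return pid
    return None

if __name__ == '__main__':
    pid = frontend_child_pid()
    print(pid if pid else 'no matching varnishd child found')
    # then: gdb -p <pid>, reproduce the crash, and "bt full" as described below
```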
[13:43:50] it certainly sounds like something to do with compress/decompress and esi [13:43:52] how to gdb? [13:44:27] paravoid: yes, and more specifically - how to gdb a specific varnish service [13:45:21] Attach to a varnish process and wait for it to die? [13:46:07] apt-get install varnish-dbg; gdb -p $(pidof varnishd); (make it crash); bt full [13:46:10] or thread apply all bt full [13:47:05] in labs you can set a low thread_max count in varnish to ease debugging [13:47:11] there are 4 varnishd processes [13:47:26] any way to find the right one? [13:47:34] 2 for frontend, 2 for backend [13:47:39] each has a child process and a parent [13:47:45] the parent does nothing but restarting the child [13:47:50] so you need to attach to the child [13:47:54] of the -frontend process [13:48:14] mark - I used "ps -A | grep varn" [13:48:22] good [13:48:30] any options to see parent & frontend status? [13:48:36] like a startup command, etc? [13:48:37] now use "ps -waux" as well as your eyes [13:48:43] thx [14:02:09] (03PS3) 10Mark Bergsma: Repartition esams LVS service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/92344 [14:02:10] (03PS3) 10Mark Bergsma: Repartition eqiad LVS service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/92343 [14:02:11] (03PS3) 10Mark Bergsma: Repartition ulsfo LVS service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/92342 [14:03:36] (03PS1) 10Mark Bergsma: Update ulsfo Text LVS service IPs to new Zero scheme [operations/puppet] - 10https://gerrit.wikimedia.org/r/92527 [14:04:28] btw, mark, i just spoke with one of our biggest partners, they are totally ok to switch to IP, and just need the ranges [14:04:41] they zero-rate both text & media [14:05:00] if they zero rate everything we can give them the full ranges soon [14:05:28] cool, as long as we don't give them labs where ppl could in theory run proxies [14:06:40] just the LVS ranges I mean, not all our ips [14:06:56] but please don't give anyone anything until we confirm we're gonna use these ranges [14:07:07] i'll mail, probably this week still [14:07:13] i'm doing a bit of a trial run with ulsfo now [14:07:27] (03CR) 10Mark Bergsma: [C: 032] Update ulsfo Text LVS service IPs to new Zero scheme [operations/puppet] - 10https://gerrit.wikimedia.org/r/92527 (owner: 10Mark Bergsma) [14:07:47] np, won't give anything untill i receive an email from you [14:16:52] paravoid: when i attach gdb -p , it seems like already crashes, without waiting for me to hit the service [14:17:14] either that, or gdb -p causes a break that i can't continue [14:23:48] (03PS1) 10Mark Bergsma: Swap old/new upload LVS service IPs in ulsfo [operations/puppet] - 10https://gerrit.wikimedia.org/r/92531 [14:24:13] yurik_: disable the watchdog timer in varnish [14:24:18] i.e. varnish checks whether the client responds [14:24:27] if you attach to it, gdb blocks the child, and the parent will restart [14:24:31] there's a runtime parameter to disable that [14:25:13] mark, i'm trying to run the process in shell, by copying the entire string from the ps -waux, but it shows the params list [14:25:35] (03CR) 10Mark Bergsma: [C: 032] Swap old/new upload LVS service IPs in ulsfo [operations/puppet] - 10https://gerrit.wikimedia.org/r/92531 (owner: 10Mark Bergsma) [14:27:53] akosiaris: have you finished? 
:D https://gerrit.wikimedia.org/r/92510 to deploy a repo containing phpunit on the CI slaves [14:31:34] (03PS1) 10Mark Bergsma: Change ulsfo upload-lb IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/92533 [14:33:57] hiiiii mark, so i'm trying to verify that the 2 analytics racks are seeing different multicast traffic [14:34:21] I've counted packets by node seen on each aggregator [14:34:45] I've also tried to look directly at the traffic and count occurrences of things, but that is less conclusive because its binary and harder to examine [14:35:08] i can verify that if I listen on the ganglia multicast addy and try to send between racks [14:35:15] i get the same behavior i saw yesterday [14:35:22] i can't send from one rack to the other [14:41:07] at all? [14:41:22] so how do some core metrics get exchanged then? [14:45:34] i don't know, all i'm saying is that when I do it manually, i can't send the traffic through [14:45:53] weird [14:45:55] so how about [14:46:00] we temp disable that analytics ACL [14:46:04] and see if the problem goes away with that [14:46:51] ok [14:46:56] (03PS1) 10Cmjohnson: Removing mgmt dns entries for ms2 [operations/dns] - 10https://gerrit.wikimedia.org/r/92535 [14:47:00] hashar: I have not finished, but I just started waiting on something, so I have some time [14:47:39] ottomata: done [14:48:56] (03PS1) 10Cmjohnson: Removin ms2 from dhcpd [operations/puppet] - 10https://gerrit.wikimedia.org/r/92536 [14:49:30] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns entries for ms2 [operations/dns] - 10https://gerrit.wikimedia.org/r/92535 (owner: 10Cmjohnson) [14:49:48] !log dns update [14:50:05] Logged the message, Master [14:51:48] (03CR) 10Cmjohnson: [C: 032] Removin ms2 from dhcpd [operations/puppet] - 10https://gerrit.wikimedia.org/r/92536 (owner: 10Cmjohnson) [14:52:01] PROBLEM - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.381 second response time [14:52:36] hrm [14:52:49] akosiaris: so basically my aim is to be able to deploy integration/phpunit.git content (that is a copy of the PHP unit framework) to all jenkins slaves in production (aka gallium and lanthanum). [14:53:00] RECOVERY - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 575 bytes in 0.378 second response time [14:53:05] weird [14:53:27] akosiaris: so I though instead of using git::clone( ensure => latest ); i could look at the deployment system we have. And that is apparently very simple to add a new repo / group of servers as target [14:53:44] seems easier than last time [14:53:49] mark, acl change doesn't seem to make a difference, although i can send multicast traffic manually now [14:53:51] ryan changed it a lot [14:54:01] can you turn the acl back on again? i want to verify i wasn't doing that wrong before [14:54:11] akosiaris: yeah that is part of the refactoring, we talked a couple weeks ago how it looked messy to add something new in [14:54:24] akosiaris: ryan promply refactored the puppet manifest :] [14:54:56] ottomata: what ttl did you set? [14:55:00] it's limited to 3 in the acl [14:55:10] i'm using iperf, hm [14:55:40] 32 [14:56:00] set to 2 or 3 [14:56:08] that would have been the reason I think [14:56:21] should perhaps widen that up a bit [14:56:26] hashar: so... it seems ok [14:56:33] the gerrit patchset I mean [14:56:50] btw this is happening because of old packages in Ubuntu ? 
[14:57:35] yup [14:57:44] and I don't feel like packaging PHPUnit for Debian [14:57:49] ok, so mark is acl back on now? [14:57:51] that is way above my competencies :] [14:57:58] no [14:58:01] will turn it back on [14:58:28] doe [14:58:29] done [14:59:11] I was considering what mark said about inter-LTS versions. Maybe it makes sense to make CI 13.10 ? Not sure yet. But it could help. [14:59:29] ok yeah, can't get traffic through now, sorry, is the acl allowing any multicast traffic on the addy, or just port 8649? [14:59:44] so hashar. I am gonna merge. You have access to tin to deploy that ? [14:59:47] or should I ? [14:59:53] i'm not using 8649 to test, its hard to tell because there is already a bunch of ganglia traffic on that port [15:00:10] akosiaris: I got access [15:00:17] akosiaris: no clue how to deploy but I will figure it out :] [15:00:38] nice [15:00:43] (03CR) 10Akosiaris: [C: 032] "Talked with hashar on IRC and then reviewed previous changes to get a hang of what changed and all seems well." [operations/puppet] - 10https://gerrit.wikimedia.org/r/92510 (owner: 10Hashar) [15:01:00] PROBLEM - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.379 second response time [15:01:32] gah [15:01:40] so we need to restart nginx on ip changes [15:02:15] on lvs ip changes? [15:02:16] bwer [15:02:50] RECOVERY - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 573 bytes in 0.388 second response time [15:04:13] maybe not [15:06:06] PROBLEM - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.380 second response time [15:09:22] (03PS1) 10Mark Bergsma: Restore internal svc address for ulsfo upload [operations/puppet] - 10https://gerrit.wikimedia.org/r/92538 [15:09:53] (03CR) 10Mark Bergsma: [C: 032 V: 032] Restore internal svc address for ulsfo upload [operations/puppet] - 10https://gerrit.wikimedia.org/r/92538 (owner: 10Mark Bergsma) [15:09:56] PROBLEM - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.399 second response time [15:10:06] RECOVERY - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 575 bytes in 0.386 second response time [15:10:46] RECOVERY - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 575 bytes in 0.378 second response time [15:15:51] hashar: all ok with Ryan deployment system ? [15:16:11] akosiaris: the repo got cloned on tin, waiting for puppet to finish on the target hosts [15:16:19] info: Salt::Grain[deployment_target_contint-production-slaves]: Scheduling refresh of Exec[deployment_target_refresh_pillars] [15:16:19] :-) [15:17:10] :-) [15:18:19] akosiaris: annnnd [15:18:23] THAT WORKS!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! [15:18:25] \O/ [15:18:26] * hashar dance [15:18:29] damn [15:18:31] I am happy really [15:18:41] that is a proof that sometime stuff just works! [15:18:54] (03PS1) 10Mark Bergsma: Connect to the system backend ip instead of a (volatile) LVS ip [operations/puppet] - 10https://gerrit.wikimedia.org/r/92539 [15:19:40] :) :) [15:20:09] hashar: now i am gonna have deploy everything through that and stop packaging :P [15:20:10] Continue? 
([d]etailed/[C]oncise report,[y]es,[n]o,[r]etry): [15:20:12] huhu [15:20:23] akosiaris: na na :-] [15:20:32] the python packages can land in debian, they are not hard topackage [15:21:31] # YAY : 'finish' for 'integration/phpunit' completed successfully (now at integration/phpunit-20131029-151743) [15:21:32] hehe [15:22:01] mark, can you confirm if I should be able to send multicast traffic on 239.192.1.32 any port? [15:22:03] or just 8649? [15:23:50] just 8649 [15:24:03] so the acl says: 239.192/16, ttl max 3, udp port 8649 [15:24:57] hm oh ok, so i should be able to test using something else in the / 16 on 8649 [15:25:04] yup [15:26:36] ok so then that works [15:26:44] false alarm on the not being able to send multicast traffic between racks at all [15:26:50] but then, i still have a mystery to solve [15:28:54] sorry ;) [15:29:17] to the mystery machine! [15:29:18] check ttl of all ganglia messages perhaps? [15:29:20] and dest port [15:29:29] perhaps the acl catches some if they're not all equal [15:29:36] although turning it off didn't fix it I guess [15:29:40] ok, so [15:29:52] i just edited gmetad.conf on nickel [15:30:01] and took out the row b aggregator (analytics1009) [15:30:12] so gmetad would be forced to talk to the aggregator in row c analytics1011 [15:30:16] and now, [15:30:22] the metrics show up. [15:30:32] but all metrics are in row c too right? [15:30:38] so that makes sense then [15:30:51] if multicast messages don't make it to row b [15:31:18] the metrics i was trying to get to work originate in row c [15:31:20] right [15:31:35] yeah [15:31:43] ttl = 3 [15:32:33] but, this is still bad behavior, right? [15:32:35] oh i was wrong [15:32:38] ganglia doesn't have a ttl check [15:32:41] it was pim that does [15:32:49] term ganglia { [15:32:49] from { [15:32:49] destination-address { [15:32:49] 239.192.0.0/16; [15:32:49] } [15:32:50] protocol udp; [15:32:52] destination-port 8649; [15:32:54] } [15:32:56] then accept; [15:32:58] } [15:33:14] hm, ok, so that works [15:33:20] still confused though [15:33:28] so, if I have two data sources set in gmetad.conf [15:33:41] and each of those are non-deaf aggregators for the cluster [15:33:48] hmmm [15:34:05] how does a gmetad datasource use multiple aggregators…googling [15:34:15] they should be equal [15:34:23] so i believe it just talks to the first unless it doesn't respond [15:34:27] (but do check) [15:34:29] If gmetad cannot pull the data from the first source, it will continue trying the other sources in order. 
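Given that gmetad only polls the first data_source host unless it stops answering, a stale aggregator in row B can mask perfectly fresh values held by the row C one, which is exactly the "they should be equal ... and they are not" situation worked out below. A quick consistency check along those lines, a sketch that reuses gmond's XML dump on port 8649 and the two aggregator hostnames mentioned in this log:

```python
"""Sketch: do the two row aggregators in a gmetad data_source agree?

Fetches the full metric dump from each aggregator and prints anything
that differs. Fast-moving metrics will always drift a little between two
snapshots; the interesting signal is metrics that are missing from, or
frozen on, one side. Hostnames are the ones named in this log.
"""
import socket
import xml.etree.ElementTree as ET

AGGREGATORS = ['analytics1009.eqiad.wmnet', 'analytics1011.eqiad.wmnet']
PORT = 8649

def dump(aggregator):
    """Return {(host, metric): value} as reported by one aggregator."""
    sock = socket.create_connection((aggregator, PORT), timeout=10)
    data = b''
    try:
        while True:
            chunk = sock.recv(65536)
            if not chunk:
                break
            data += chunk
    finally:
        sock.close()
    values = {}
    for host in ET.fromstring(data).iter('HOST'):
        for metric in host.iter('METRIC'):
            values[(host.get('NAME'), metric.get('NAME'))] = metric.get('VAL')
    return values

if __name__ == '__main__':
    a, b = (dump(agg) for agg in AGGREGATORS)
    for key in sorted(set(a) & set(b)):
        if a[key] != b[key]:
            print('%s/%s: %s vs %s' % (key[0], key[1], a[key], b[key]))
```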
[15:34:31] hm [15:34:33] yeah [15:34:42] so row b was listed first [15:34:45] and they are not equal [15:34:48] so something is wrong [15:34:50] so if it doesn't have the metrics from row c, then we don't see metrics [15:34:51] with the multicast channel [15:34:53] that explains that [15:34:54] yeah [15:34:59] yes [15:35:13] you can reverse the order in gmetad but it's not a real fix [15:35:17] but i can send things via multicast manually, and we know that some of the metrics get through fine [15:35:19] right, no its not [15:35:23] because we do have machines we use in row b [15:35:28] right now really only the hadoop namenodes [15:35:30] so carefully examine the packets for the other metrics [15:35:32] but those are important ones [15:35:34] and see if you notice anything different [15:35:39] hm [15:35:51] udp ports, ttl, etc [15:36:03] different source address perhaps [15:41:13] HMMMMMMMMM [15:41:33] I think I see the non kafka metrics traffic on both aggregators [15:41:44] i wonder if the kafka ganglia client doesn't do multicast well [15:42:32] it's not a ganglia python plugin? [15:42:37] no [15:42:43] how does it send data to gmond? [15:42:48] https://github.com/criteo/kafka-ganglia [15:43:34] yeah doesn't look like it does multicast [15:44:18] although I haven't seen all classes it uses obviously [15:44:34] but somehow it does work cross-host within the same subnet [15:44:36] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000 [15:44:45] so check traffic from a non-aggregator host to the aggregator in row C [15:44:51] perhaps it's ttl 1 [15:44:57] and then it can't pass the router [15:45:02] yeah that's probably it [15:45:16] multicast senders use ttl 1 by default, you need to raise it explicitly [15:45:22] if this doesn't do that, then it can't go beyond one subnet [15:45:54] oh i do see ttls [15:46:05] ttl 1 [15:46:31] yeah, so the router decrements it, finds it 0, discards the packet [15:46:33] yeah from kafka metrics [15:46:41] explains it all [15:47:36] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
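For reference, the fix the diagnosis above points at (the Kafka GangliaReporter never raises the multicast TTL, so its datagrams die at the first router hop between rows) is a one-line socket option on the sender. A minimal sketch, using the ganglia group and port from this log and a placeholder payload rather than the XDR encoding gmond really expects:

```python
"""What "raise the multicast TTL explicitly" means at the socket level.

By default a UDP sender uses a multicast TTL of 1, so packets are dropped
at the first router hop, which is exactly the cross-row behaviour seen
above. Group and port are ganglia's from this log; the payload is a
placeholder, not real gmond XDR.
"""
import socket

GANGLIA_GROUP = '239.192.1.32'
GANGLIA_PORT = 8649

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
# The bit the Kafka GangliaReporter is missing: without this the kernel
# sends multicast datagrams with TTL 1.
sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 3)
sock.sendto(b'placeholder, not real gmond XDR', (GANGLIA_GROUP, GANGLIA_PORT))
```

A TTL of 3 matches the limit already allowed by the router ACL quoted earlier in the log.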
[15:55:12] (03CR) 10GWicke: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/90733 (owner: 10Physikerwelt) [16:00:13] paravoid: https://gerrit.wikimedia.org/r/#/c/92542/ [16:00:28] I'm sure I said something about the parameter being missing ;) [16:00:31] (03PS2) 10Chad: Wikidatawiki gets Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92352 [16:00:59] (03CR) 10Chad: [C: 032] "Updated summary to clarify this is secondary so no one panics :)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92352 (owner: 10Chad) [16:01:12] Reedy: you did :) [16:03:37] pssh, i'm going to have to go back to jmxtrans for this :/ [16:04:33] !log demon synchronized wmf-config/InitialiseSettings.php 'Cirrus on wikidatawiki as secondary index' [16:04:49] Logged the message, Master [16:06:33] ottomata: GangliaReporter class indeed doesn't set the multicast ttl [16:06:33] (doesn't do anything with multicast really) [16:07:52] !log reedy synchronized php-1.22wmf22/includes/upload/UploadStash.php [16:07:56] yeah [16:08:06] Logged the message, Master [16:08:14] i'm reading the Metrics docs (that Kafka Ganglia Reporter is using) [16:08:18] and it is using an old version [16:08:24] so I could port it over I suppose [16:08:25] !log reedy synchronized php-1.23wmf1/includes/upload/UploadStash.php [16:08:26] hm [16:08:35] that is probably more flexible [16:08:38] Logged the message, Master [16:09:01] Does anyone besides Hashar have Jenkins permission? Or know where Hashar is? [16:09:12] He's @ hasharCall [16:09:16] hexmode: What permissions? [16:09:48] Reedy: he told me to hit "build now" button, but I don't see one [16:09:57] Have you logged in? [16:10:01] Reedy: https://integration.wikimedia.org/ci/job/mediawiki-core-release/ [16:10:02] yes [16:10:51] Reedy: need to add him :D [16:11:01] hasharCall: :) [16:11:01] i am in call right now [16:11:15] * Reedy looks [16:13:03] <^d> !log elasticsearch: created wikidatawiki index, running indexing from 8 processes on terbium. [16:13:13] yay! [16:13:21] Logged the message, Master [16:14:50] hexmode: is your username on jenkins hexmode? [16:15:01] (03CR) 10Physikerwelt: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/90733 (owner: 10Physikerwelt) [16:15:07] Reedy: MarkAHershberger [16:15:33] hexmode is my sockpuppet [16:15:36] ROLE_PROJECT-MEDIAWIKI-RELEASE [16:17:15] * hexmode wonders if he should re-login [16:18:35] You're already in that apparently [16:19:18] MarkA is? [16:19:26] I am confused. [16:20:29] Logged out and back in. Still nothing. [16:21:18] trying chrome [16:25:10] hrm... now I've got the spinner of doom. Jenkins is taking forever to load. [16:29:12] Reedy: loaded in chrome ... 
still no button for MarkAHershberger [16:40:59] greg-g: just fyi, trying to finish cleaning up the installer and then make RC0 today [16:41:52] hexmode: weeee [16:42:08] jenkins says no though, unfortunately [16:43:43] maybe he isn't happy buttling for me :( [16:44:05] never can find good help these days [16:46:29] Some of the user config stuff just don't seem to want to load [16:49:37] !log reedy synchronized php-1.23wmf1/extensions/Wikibase [16:49:50] Logged the message, Master [17:02:03] Reedy: cleanup upload stash works fwiw [17:02:05] thanks :) [17:02:10] it's running now [17:03:01] For anyone interested: Analytics is doing their quarterly review now, GH link: https://plus.google.com/hangouts/_/calendar/cmZhcnJhbmRAd2lraW1lZGlhLm9yZw.us2ntil2r91uec5t9dm7l8162s [17:03:49] oh [17:03:50] damn [17:03:56] You're not allowed to join this video call. [17:04:08] Is it going to youtube? [17:04:30] ugh. Let me find out [17:06:07] I think it would be the first public quarterly review? The others were all private IIRC. Some didn't even publish notes (or only after some months embargo) [17:17:20] Reedy: no we are not putting this on youtube [17:22:14] makes sense :) [17:37:06] (03PS1) 10Bsitu: Switch all small wikis to use extension1 db [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92549 [17:41:21] (03PS2) 10Bsitu: Switch all small wikis to use extension1 db [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92549 [17:42:09] (03CR) 10Bsitu: [C: 032] Switch all small wikis to use extension1 db [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92549 (owner: 10Bsitu) [17:42:18] (03Merged) 10jenkins-bot: Switch all small wikis to use extension1 db [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92549 (owner: 10Bsitu) [17:45:58] !log bsitu synchronized wmf-config/InitialiseSettings.php 'Switch all small wikis to use extension1 db' [17:46:15] Logged the message, Master [17:55:57] dr0ptp4kt: ping [17:56:11] paravoid, just sent a "chat" invitation from gmail. you available? [17:57:20] haven't got anything [17:58:18] see it now? [18:09:16] apergos: heya, I asume you're close to crashing time, could I ask you for help on https://bugzilla.wikimedia.org/show_bug.cgi?id=48501 ? I just cc'd you. [18:11:07] I'm definitely on my 'done with work' time, man this is a long bug report [18:11:26] :) [18:11:35] see comment 46 for what needs to happen [18:12:07] I'm looking at it now, having read about 20 other comments to figure that out [18:12:40] so already I'm wondering how step 1 is going to work [18:13:05] right, the comment I just added is: "I think the problem is that no one is quite sure how to do those things. I'm [18:13:08] it seems that as ryan says near the end this is not blocked on ops [18:13:08] not sure otherwise I would JFDI." [18:13:32] well, just no one knows how to do those things other than the opsen cc'd on the list, from what I can tell [18:13:40] no, I'm saying it's never been blocked on ops [18:13:41] anyone that has projectadmin in deployment-prep ... 
[18:13:44] and I've been saying that for months now [18:14:00] but no one is taking ownership of the bug [18:14:07] right, anyone with a labs account *can* make changes, but I don't think people know what changes to make [18:14:12] so I guess "someone" would make a list of the current projectadmins who are volunteers [18:14:20] scramble around to get them all to sign ndas [18:14:27] or remove the ones who say they are't interested [18:15:10] rob would be the cert buying person but someone would have to rt ticket that [18:15:24] basically: I'm offering to to things, but I don't know what to do. [18:15:51] are you a project admin on deployment-prep? [18:16:01] no, but I assume I could be [18:16:07] :) [18:16:11] yes that's true [18:16:24] (guh, bz is fighting me, sorry for all the bugmail) [18:16:27] so here is what I would do if I were you and wanting to move this along [18:16:45] 1) get hashar to make you a project admin (or somene else over there who's active and is an admin ) [18:17:11] hexmode: Reedy: have you managed to grant MarkAHershberger build authorization in jenkins ? [18:17:16] 2) look at the current list of who is and see who is willing to sign an nda and who isn't, find out from legal what that nda looks like, etc etc [18:17:24] hashar: Nope... [18:17:28] Half of the pages wouldn't load [18:17:34] until everybody has them or is off [18:17:44] Reedy: poor jenkins :( [18:17:49] apergos: 2) will take a bit :) I guess I'll start on that [18:18:08] ok [18:18:25] Reedy: for reference, that is done in https://integration.wikimedia.org/ci/configureSecurity/? [18:18:31] <^d> greg-g: Who needs removing? [18:18:34] then 3) would be in rt 'hey rob, we need these certs, etc' [18:18:57] <^d> I only see 3 non-staff in here anyway. [18:19:12] <^d> Meh, 5, can't count. [18:19:30] apergos: /me nods, thanks [18:19:44] ^d: where's the list? /me is lazy to look for it right now [18:19:44] math is hard... [18:19:55] <^d> https://wikitech.wikimedia.org/wiki/Special:NovaProject [18:20:00] <^d> deployment-prep in the filter [18:20:06] * greg-g nods [18:20:10] <^d> People who have projectadmin. [18:20:21] Reedy: hexmode should have access now :) if not follow up by email :] [18:20:40] uhhh, I don't see it listed, I guess I have to be a member of it first? [18:20:44] I only see bastion [18:20:46] greg-g: apergos: are you referring to the SSL certificates on beta ? [18:20:51] yeah [18:21:05] potentially we "just" have to forgive access on the varnish caches [18:21:25] I'm looking at this bug hashar: [18:21:29] hashar: heya, can you add me as an admin to deployment-prep? :) [18:21:39] <^d> greg-g: Added you to the project. [18:21:40] https://bugzilla.wikimedia.org/show_bug.cgi?id=48501 world;s longest but tl;dr: see comment 46 [18:21:40] greg-g: username ? [18:21:57] which one? wikitech or shell? [18:21:58] thanks Chad! [18:22:19] wikitech: Greg Grossmeier, shell: gjg [18:22:25] <^d> Ryan_Lane: Random stupid bug, you can't add "Foo_Bar" as a user to a project, just "Foo Bar". I would expect them to be interchangeable since they're wiki usernames. [18:22:40] yeah [18:22:47] ^d: open a bug in OpenStackManager extension :) [18:23:02] greg-g: apergos potentially we could tweak the sudo policy to prevent root access on cache to people we don't trust. [18:23:40] greg-g: I created a under_NDA group in https://wikitech.wikimedia.org/wiki/Special:NovaSudoer [18:23:45] anyway, dinner timer sorry [18:23:59] yes, that's listed as a must do [18:24:01] hey ^d mind adding me as an admin to deployment-prep? 
:) [18:24:08] cya [18:24:14] hashar: ooh, nice, we've wanted that for quite a while [18:24:22] yeah I am eating my dessert first (bad) and now need to think about dinner [18:24:35] <^d> greg-g: Somebody beat me to it :) [18:24:35] ^d: ah, you did, thanks echo [18:24:44] Chad added you to project Nova Resource:Deployment-prep [18:24:44] 2 minutes ago [18:24:54] <^d> Yeah I added you to project. [18:24:58] <^d> Someone else must've made you admin. [18:25:04] <^d> I suspect hashar :) [18:25:20] ahhh, I see [18:25:52] I wonder who has dealt with legal about ndas before, they would probably have some tips (I have not) [18:26:06] * apergos eyes mutante [18:26:55] :) [18:28:20] (03PS1) 10Brion VIBBER: Whitelist API mobileview for robots.txt [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92552 [18:28:56] (03PS1) 10Cmjohnson: Removing decommissioned db's db11-db28, db30 from dhcpd, dsh groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/92553 [18:31:42] (03CR) 10MaxSem: "Note that while parser cache is indeed used, processed mobileview results smaller than 64k aren't cached to reduce memcached usage, so Goo" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92552 (owner: 10Brion VIBBER) [18:32:51] (03CR) 10jenkins-bot: [V: 04-1] Whitelist API mobileview for robots.txt [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92552 (owner: 10Brion VIBBER) [18:35:11] kaldari: can I assing you to the ssl bug for this part of it? (scrubing deploy-prep sudoers) [18:36:07] er, "1. Remove projectadmin permissions from volunteers" I mean [18:36:24] which really should be "1. Remove projectadmin permissions from non-NDA'd people" [18:37:23] legal says Jan Gerber is cool [18:37:31] i.e. J [18:37:32] cool [18:37:43] kaldari: how'd you get that info? [18:37:50] talking to Luis [18:37:55] (03CR) 10Cmjohnson: [C: 032] Removing decommissioned db's db11-db28, db30 from dhcpd, dsh groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/92553 (owner: 10Cmjohnson) [18:37:55] do they have a list they can put on office. or something? [18:38:47] how did you find out what legal says so fast? [18:38:49] kaldari: [18:39:24] apergos: I assume he walked over there [18:39:26] oh.. the old 'walk over to the desk' trick? [18:39:27] meh [18:39:29] :) [18:39:58] the "my question is way more important than whatever you're doing right now" trick [18:39:59] ok it would be really good for that to be recorded in an email in rt or something, before the end of all thi [18:40:08] s [18:40:09] +1, kaldari ^ [18:40:18] OK, Luis says TheDJ (Dirk-Jan) has no NDA [18:40:33] and Mdale is in no-man's land [18:40:43] so probably best to remove both of them [18:41:27] * greg-g nods [18:41:36] Luis says he owes Ken an email about this, but he hasn't had time to put it together yet [18:41:52] ok so ... if I were you, which I am not, I would ask them whether they are willing to sign ndas before just removing them [18:42:11] (because if I were a volunteer projectadmin I might actually care) [18:42:15] apergos: sure, I'll send an email to them now [18:42:34] at least the list is short [18:43:35] this is so great, so next time somebody asks for stuff like mailman info i can just lookup in LDAP, that's what you're doing right [18:43:45] i mean, checking NDA status [18:43:57] and I assume that aude is OK, correct? [18:44:35] or do I need to email her as well? 
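The "gpg: Good signature" check done by hand above before publishing the 1.22 files can also be scripted. A sketch follows; the artifact filenames are assumptions, since only the http://dumps.wikimedia.org/mediawiki/1.22/ directory appears in the log, and the verification is simply shelled out to gpg.

```python
"""Scripted version of the "gpg: Good signature" check done above.

Downloads a release artifact plus its detached signature and shells out
to gpg to verify. Filenames are assumed; only the 1.22 directory URL
appears in this log.
"""
import subprocess
import urllib.request

BASE = 'http://dumps.wikimedia.org/mediawiki/1.22/'
TARBALL = 'mediawiki-1.22.0rc0.tar.gz'   # assumed name
SIGNATURE = TARBALL + '.sig'             # assumed detached signature

for name in (TARBALL, SIGNATURE):
    urllib.request.urlretrieve(BASE + name, name)

# Assumes the release manager's public key is already in the local keyring;
# gpg exits non-zero on a bad or unknown signature, so check_call raises then.
subprocess.check_call(['gpg', '--verify', SIGNATURE, TARBALL])
print('gpg reported a good signature for', TARBALL)
```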
[18:44:50] no idea [18:44:57] she is WMDE staff [18:45:10] mutante: no, he's talking to lvilla [18:45:21] apergos: in the past i asked Philippe for that usually [18:45:21] (03PS1) 10Cmjohnson: Removing all dns entries for decom'd db11-28, 30 [operations/dns] - 10https://gerrit.wikimedia.org/r/92558 [18:45:33] but we talked about wanting it in LDAP forever [18:45:44] wasn't sumanah going to manage that? [18:46:04] Ryan_Lane: are chapter staff kosher for projectadmin access on beta labs? [18:46:09] as long as the process to keep it updated works ... [18:46:19] without "walk over to desk" [18:46:19] if they have an NDA, yes [18:46:25] otherwise, no [18:46:37] OK, I'll double check [18:46:44] um yeah I dunno what chapter staff have by default if anything [18:46:46] maybe nothing [18:46:54] hey aude, have you signed an NDA? [18:47:02] with the WMF [18:47:22] Reedy: can you deploy this tomorrow https://gerrit.wikimedia.org/r/#/c/92552/ ? see Ops list :) [18:48:04] * greg-g goes to make lunch [18:49:23] (03CR) 10Cmjohnson: [C: 032] Removing all dns entries for decom'd db11-28, 30 [operations/dns] - 10https://gerrit.wikimedia.org/r/92558 (owner: 10Cmjohnson) [18:49:51] !log dns update [18:50:04] Logged the message, Master [18:50:05] Yeah, I'm still waiting on official word on this, but for now, lets just make sure there's a signed NDA *nods* [18:50:14] ori-l: hey; packages are in apt, not sure if you saw the gerrit comment [18:51:39] Oh! Luis' ears must have been burning, he just reached out. ;) [18:54:29] apergos: just emailed TheDJ and aude. Looks like ^d already removed Mdale, which is probably fine since he's inactive. [18:54:44] (as far as I know) [18:54:55] <^d> Yeah I hadn't seen him around for the better part of a year. [18:59:42] (03PS1) 10Cmjohnson: Decomming mw50, never been used removing dns entries [operations/dns] - 10https://gerrit.wikimedia.org/r/92559 [19:01:30] I can't open an RT ticket b/c I can't see the queues. But I need get a tarfile to download.wm.o [19:01:53] hexmode: hi, i can help, what's the ticket number [19:02:12] mutante: no ticket b/c I can't create one [19:02:13] you say you need a file attached to a ticket? [19:02:19] no [19:02:34] just need files uploaded to download.wm.o [19:02:54] mutante: these: http://mah.everybody.org/tmp/ [19:03:45] (03PS1) 10Cmjohnson: Removing dhcp entry for mw50 [operations/puppet] - 10https://gerrit.wikimedia.org/r/92561 [19:04:32] hexmode: hold on, i'm on it [19:04:37] :) [19:06:33] (03CR) 10Cmjohnson: [C: 032] Decomming mw50, never been used removing dns entries [operations/dns] - 10https://gerrit.wikimedia.org/r/92559 (owner: 10Cmjohnson) [19:06:55] !log dns update [19:07:27] (03CR) 10Cmjohnson: [C: 032] Removing dhcp entry for mw50 [operations/puppet] - 10https://gerrit.wikimedia.org/r/92561 (owner: 10Cmjohnson) [19:12:56] hexmode: http://dumps.wikimedia.org/mediawiki/1.22/ .. (gpg: Good signature ... ) [19:13:09] paravoid: wooooo! no, i missed the comment -- that's awesome! [19:13:12] thank you! 
[19:14:02] !log create mw/1.22 dir on dumps.wm.org, publish mw 1.22.0rc0 - http://dumps.wikimedia.org/mediawiki/1.22/ [19:14:18] Logged the message, Master [19:14:42] Sending an email :) [19:14:54] ori-l: completely untested [19:14:57] ty ty [19:15:25] paravoid: that's how we like it [19:37:44] (03PS2) 10Reedy: Whitelist API mobileview for robots.txt [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92552 (owner: 10Brion VIBBER) [19:38:20] (03PS1) 10Reedy: Minor changes to robots.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92566 [19:39:04] (03PS1) 10Cmjohnson: More dsh files for removing db11-28,30 [operations/puppet] - 10https://gerrit.wikimedia.org/r/92567 [19:50:07] (03CR) 10Brion VIBBER: "Max, do we want to run some load stress tests or something to make sure this won't be a problem?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92552 (owner: 10Brion VIBBER) [19:55:35] (03CR) 10Cmjohnson: [C: 032] More dsh files for removing db11-28,30 [operations/puppet] - 10https://gerrit.wikimedia.org/r/92567 (owner: 10Cmjohnson) [19:59:28] (03CR) 10MaxSem: "Per that email thread, this should be like +30% to our current mobile request rate. Doesn't sound like a showstopper, and we had no object" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92552 (owner: 10Brion VIBBER) [20:03:11] (03PS1) 10Dzahn: install gnupg on download-wikimedia server [operations/puppet] - 10https://gerrit.wikimedia.org/r/92570 [20:05:13] (03PS1) 10Cmjohnson: fixing mw50 dns removal [operations/dns] - 10https://gerrit.wikimedia.org/r/92571 [20:05:41] (03CR) 10Cmjohnson: [C: 032] fixing mw50 dns removal [operations/dns] - 10https://gerrit.wikimedia.org/r/92571 (owner: 10Cmjohnson) [20:06:08] !log dns update [20:08:13] PROBLEM - RAID on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:14:13] RECOVERY - RAID on terbium is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [20:20:13] PROBLEM - RAID on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:21:23] PROBLEM - SSH on terbium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:03] RECOVERY - RAID on terbium is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [20:22:13] RECOVERY - SSH on terbium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [20:26:13] PROBLEM - RAID on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:29:33] what's up with terbium? [20:31:03] RECOVERY - RAID on terbium is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [20:31:59] seems ok now... odd though [20:32:12] its our hume replacement, so its not a trivial server. [20:32:43] oh [20:32:47] someone is running some heavy job [20:32:50] cmjohnson1: ^ [20:32:57] which makes sense, since its a script host [20:33:03] but yea, not awesome. [20:33:13] ah..okay but not cool indeed [20:33:49] well, if its deploy related, its oen time stuff [20:33:55] so it just is what it is [20:34:17] its only really an issue if entire box locks up, or if its a regularly run maintainance script [20:36:10] granted, our monitoring for that consists mostly of 'if lots of folks are complaining then someone has a bad script and needs to get it killed' [20:36:26] and/or/including icinga warnings occurring regularly. [20:36:32] Possibly paravoid? 
[20:36:40] He is/was running cleanupUploadStash.php [20:36:42] I know he was running some stuff recently [20:36:48] and that looks liek its there yes [20:36:50] so i think its fine [20:36:59] box is up, a few false positives are not end of world. [20:37:09] script has to run someplace. [20:37:26] true [20:37:27] it's very unresponsive [20:37:40] very slow reedy [20:37:42] there are cirus search updates running as well [20:37:48] load average: 26.14, 23.79, 26.13 [20:37:54] which i thought was over already [20:38:10] the update window for that is over, but the scripts fired for it are not. [20:38:16] check with ^demon|lunch [20:38:34] heh, you just did for me =] [20:38:35] <^demon|lunch> Hm? [20:38:47] terbium is slow, im looking at jobs and i see cirus search stuff [20:38:52] but dunno if its causing it, just asking about them [20:39:03] apache 8918 77.1 27.2 9350428 8972504 pts/27 Dl+ 16:11 206:15 php /a/common/multiversion/MWScript.php extensions/CirrusSearch/maintenance/forceSearchIndex.php --wiki wikidatawiki --fromId 0 --toId 2096950 --batch-size 100 [20:39:05] long running [20:39:08] <^demon|lunch> Yeah. [20:39:13] PROBLEM - RAID on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:39:13] expected? [20:39:14] <^demon|lunch> It's got...a ways...to go [20:39:26] terbium is suffering under load [20:39:37] would those be causing this, we could offload to a different server. [20:39:46] apache 4704 84.2 34.6 11664224 11373612 pts/33 D+ 17:32 157:07 php /a/common/multiversion/MWScript.php copyFileBackend.php commonswiki --src local-swift --dst local-swift-eqiad --syncviadelete --prestat --containers local-public local-transcoded --skiphash [20:39:49] <^demon|lunch> Very likely to be the cause, yes. [20:40:01] <^demon|lunch> RobH: I'm fine with offloading it somewhere if you've got a better place :) [20:40:54] ^demon|lunch: soooooo, what does that need to run on antoher server if i spin up some 1u box for this with same puppet stuff as terbium should be fine? [20:40:54] and will this be a regular search thing or just during implementation? [20:40:54] ie: short or long term server allocation [20:40:54] i'm more prone to spec something specifically for this job if its longer term [20:41:17] <^demon|lunch> Just during implementation. [20:41:20] i can toss a high performance dual cpu at it [20:41:23] 16gb memory [20:41:24] <^demon|lunch> And recovery from disaster :\ [20:41:37] ^demon|lunch: search suggestions seem to be down again on beta, manybubbles fixed it this morning. I don't suppose there is some easy/automate-able way to maintain that? [20:42:07] <^demon|lunch> chrismcmahon: Not right now, no. [20:42:13] PROBLEM - DPKG on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:42:23] PROBLEM - SSH on terbium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:42:24] <^demon|lunch> RobH: We've toyed with a few ideas to help spread the load out across $something [20:42:30] <^demon|lunch> Ok, I'm terminating. [20:42:33] ^demon|lunch: Ok, I am going to drop an RT ticket for this and link you in as requestor [20:42:43] we'll get a box spun up today for you for those scripts. [20:42:58] <^demon|lunch> Long term we need a better way to deal with this. 
[20:43:03] RECOVERY - DPKG on terbium is OK: All packages OK [20:43:03] RECOVERY - RAID on terbium is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [20:43:11] yep, but i dont wanna slow down your project, and i dont wanna let terbium die [20:43:13] RECOVERY - SSH on terbium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [20:43:16] so this is the stopgap for today =] [20:43:46] (unless you think you'll have a more scalable solution for this in next 48 hours, I assume you need those scripts to run to move forward) [20:43:50] ? [20:43:58] or next X hours [20:44:04] i dunno what timeframe you guys have for this [20:44:18] I just don't want operations to be a blocker in any fashion. [20:48:41] <^demon|lunch> RobH: I definitely won't have it within the next day or two. [20:48:47] cool [20:48:53] so yea, a short term server is needed then right? [20:49:00] I just dont wanna be misrepresenting in rt ticket. [20:49:09] <^demon|lunch> That'd be great as a stopgap, yeah. [20:49:16] <^demon|lunch> Keep me from hurting terbium :) [20:49:35] https://rt.wikimedia.org/Ticket/Display.html?id=6120 [20:49:47] cool, just reply confirming for record of server allocations if you dont mind [20:49:50] RobH: so, the bugzilla cert issue.. the root cause is.. bug-attachment.wm.org.key is not a key, it's a cert [20:50:00] this can benefit from multicpu yes? [20:50:08] <^demon|lunch> !log elasticsearch: stopped wikidatawiki indexing processes on terbium, causing too much load. working with RobH on getting an alternative place to run these scripts. [20:50:13] <^demon|lunch> RobH: Absolutely :) [20:50:26] Logged the message, Master [20:51:10] cool, dual cpu 16gb ram r610 it is. [20:51:11] arsenic [20:51:11] assuming ops agrees and no one blocks. [20:51:11] <^demon|lunch> RobH: "No permission to view ticket" [20:51:31] hrmm [20:51:36] well fu3124 [20:51:39] @!#$!@ even. [20:51:41] oh procurement [20:51:46] indeed. [20:52:00] ^demon|lunch: should have emailed you a copy of it [20:52:05] you can just reply to email [20:52:13] i dont feel like messing with rt root account at the moment. [20:52:26] im moving ahead as if i have approval ;] [20:52:47] 'be bold' [20:53:32] <^demon|lunch> You know, there's a second half to the "Be bold" statement ;-) [20:53:52] the second half doesn't help me in as many situations. [20:54:17] is the second half "be italic" ? [20:54:26] i like to emulate the mainstream media and just pick and choose what i like from statements to support whatever viewpoint I formed well before the conversation. [20:54:47] be a sockpuppet! [20:55:16] (WMF) sockpuppet required [20:55:19] RobH: shhh! the channel is logged! [20:55:32] RobH: so.. i need the private key for a cert :) [20:55:41] to fix it on sockpuppet [20:55:48] mutante: .... [20:55:55] check on tridge? [20:55:58] ssl certs directory [20:56:12] man i hope its in there. [20:56:30] if not i get to plug in my time machine drive. [20:56:32] /data/ssl_certs/ .no luck [20:56:37] well, fuck. [20:56:52] the file that is called .key and on sockpuppet.. it's not a key [20:57:14] yea, and no doubt i had it, but once i thought i put in repo [20:57:23] i deleted it. [20:57:29] cuz i shouldnt store private keys for cluster on laptop! [20:57:48] ........shit shit shit. [20:58:01] i am pretty sure i keep those in a non time machine directory too for just that reason. [20:58:03] RobH: it's just the bug-attachments, not both [20:58:07] but perhaps it put it in default home.
[20:58:09] they might let you rekey it [20:58:19] they prolly will if i reach out to the rep [20:58:25] but blarg [20:58:28] bad mistake on my part. [20:58:38] sounds like it should be rekeyed [20:58:43] shit happens, it's obviously just mixed up filenames [20:59:00] yea, lemme pull up the receipt for that order and start the process. [20:59:02] no worries, it can wait a little while longer [20:59:13] actually, lemme finish arsenic allocation, then that [20:59:18] cuz im half done in this [20:59:20] yes, sure, one by one [20:59:26] go ahead with that [21:00:04] <^demon|lunch> LeslieCarr: "... when updating the encyclopedia" [21:00:13] hehe [21:01:06] RobH: as long as it wasn't switched around both ways and private key was on public puppet, but it's not, i checked :) [21:01:06] <^demon|lunch> RobH: Responded via e-mail to the ticket. [21:03:26] (03PS1) 10RobH: arsenic dns entry [operations/dns] - 10https://gerrit.wikimedia.org/r/92579 [21:03:33] ah, or is that the same private key for bugzilla and bug-attachment [21:03:42] no, different keys [21:03:50] gotcha, just thinking out loud [21:03:52] seemed easier [21:04:03] and likely better, yep [21:05:29] !log one of the sdtpa <-> eqiad links is down [21:05:43] Logged the message, Mistress of the network gear. [21:06:02] LeslieCarr: was that expected? [21:07:00] no [21:07:03] not expected [21:07:13] sadness [21:10:20] (03CR) 10RobH: [C: 032] arsenic dns entry [operations/dns] - 10https://gerrit.wikimedia.org/r/92579 (owner: 10RobH) [21:12:51] !log maxsem synchronized php-1.22wmf22/extensions/MobileFrontend/ 'https://gerrit.wikimedia.org/r/#/c/92474/' [21:13:07] Logged the message, Master [21:14:50] !log maxsem synchronized php-1.23wmf1/extensions/MobileFrontend/ 'https://gerrit.wikimedia.org/r/#/c/92474/' [21:15:06] Logged the message, Master [21:18:27] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000 [21:21:37] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:22:26] and it could even be true (not sure about 2nd)
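On the key/cert mixup above: openssl makes it quick to confirm that a file named .key actually holds a certificate, and to test whether any candidate private key that turns up really belongs to the bug-attachment certificate. The file names below are placeholders, not paths from the log.

    # A PEM private key begins with "-----BEGIN (RSA) PRIVATE KEY-----",
    # a certificate with "-----BEGIN CERTIFICATE-----".
    head -1 bug-attachment.wikimedia.org.key

    # Only one of these two should parse cleanly.
    openssl rsa  -in bug-attachment.wikimedia.org.key -noout -check
    openssl x509 -in bug-attachment.wikimedia.org.key -noout -subject

    # A private key matches a certificate when their RSA moduli are identical.
    openssl x509 -noout -modulus -in bug-attachment.wikimedia.org.crt | openssl md5
    openssl rsa  -noout -modulus -in candidate.key                    | openssl md5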
ah, heh [22:21:35] using Trebuchet MS [22:22:36] well, ok, couple of points: one, that etherpad came out of not knowing what the current frontend looks like [22:22:46] so if it appears to be disconnected from facts on the ground, well :) [22:24:16] i think we haven't drawn up explicit requirements and in the absence of that we don't really know what is done and what needs to be done, so any specs would be good IMO [22:24:35] even there's a lot of RESOLVED WONTFIX that comes out of that [22:24:37] *even if [22:24:56] (03PS1) 10RobH: arsenic was in wrong section [operations/dns] - 10https://gerrit.wikimedia.org/r/92587 [22:25:41] (03CR) 10RobH: [C: 032] arsenic was in wrong section [operations/dns] - 10https://gerrit.wikimedia.org/r/92587 (owner: 10RobH) [22:34:02] cmjohnson1: can i paste you a list of servers that appear decom'ed to me for crosscheck [22:34:08] sure [22:34:17] doing the 'what's left in tampa' thing [22:34:20] ok, query [22:34:30] ori-l: the frontend is exactly like it currently is [22:34:35] git deploy start [22:34:38] git deploy sync [22:34:41] git deploy abort [22:35:31] i'm confused -- why are we swapping the perl git-deploy script with sartoris if they're both front-ends that provide the same interface? [22:35:53] 1: we don't want to have both perl and python [22:38:56] 2: because we wanted a smaller set of functionality [22:38:59] 2: there is no 2 [22:39:00] dangit [22:39:04] and we wanted to control the output [22:40:32] also, the initial spec was fine [22:40:47] if we want to re-write the frontend again we can [22:41:45] <^d> The annoying part was never the frontend, it was trying to push those huge git repos around to a bajillion apaches. [22:43:07] not rewrite, but tweak? anyway, again: that came out of a core meeting where we weren't sure what is up to us to implement and what is something we can just expect to be in place [22:44:13] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000 [22:44:56] so can we define some point where from your perspective things are set up and it's up to us to iterate, when it comes to the whole mw deploy process? [22:46:43] there are things that obviously fall within the scope of trebuchet, like using git to manage deltas [22:47:03] and there are things that obviously fall outside, like writing the script to build cdb files on each apache (if we go that route) [22:47:23] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:47:45] but there's a big gap in between where it's not clear who's doing what [22:49:01] all the frontend currently does is this: [22:49:05] git deploy start [22:49:09] ^^ writes a lock file [22:49:16] git deploy sync [22:49:25] ^^ runs a sync script (that we control) [22:49:29] git deploy abort [22:49:50] ^^ rolls back to some state, which I think is a tag written after start [22:50:06] sorry, let me re-write that [22:50:09] git deploy start [22:50:30] ^^ writes a lock file, writes out a tag that defines the state of things when deploy started [22:50:34] git deploy sync [22:50:57] ^^ writes a tag that defines the state of things when the sync began and calls a script that we control [22:51:01] git deploy abort [22:51:10] ^^ rolls back to the tag at the start [22:51:26] both abort and sync will automatically call: git deploy finish [22:51:30] which will remove the lock file [22:51:35] that's all the frontend really does [22:52:07] it has some other function calls that will do things like list deployment tags and such [22:53:20] the sync script calls the deploy.fetch runner, then will go into feedback mode. it'll then call the deploy.checkout runner, and go into feedback mode [22:54:13] btw, I responded to your questions on the etherpad [22:54:24] right, but that seems like a pretty low-level API, no reason not to wrap it in some higher-level set of scripts, right? [22:54:29] ori-l: ^ check them out as well, plz [22:54:52] that kind of mirrors git, with the whole porcelain / non-porcelain subcommands distinction [22:57:16] we're going to write a frontend to the frontend? :) [22:57:53] or make the frontend more clearer for deployers [22:57:58] more clearer? [22:58:00] whatever [22:58:14] basically, it is way too verbose right now [22:58:23] too verbose in which way? [22:58:26] we already have info overload in the current system, let's not make that mistake again :) [22:58:37] too much text that is unneeded [22:58:40] yeah [22:58:45] the new frontend doesn't have any of that [22:58:49] that's the crappy perl frontend [22:58:58] and you can't suppress the messaging [22:59:06] ah, then let's take a look at that :) [22:59:07] that was my original motivation for killing it off [22:59:41] ok, so, I'll ignore this etherpad for a bit until I see the new frontend in action [23:00:23] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000 [23:02:24] ok, headache/eye strain is coming back, I'll be online later [23:02:33] greg-g: feel better [23:03:23] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:04:41] (03CR) 10Yurik: "A while back I heard something about 6 months, not 30 days...
Will need to verify" [operations/puppet] - 10https://gerrit.wikimedia.org/r/92288 (owner: 10Matanya) [23:05:04] the new frontend is exactly like the old one, but the BS output isn't there [23:05:17] it will still go into the fetch/checkout confirmation interface, though [23:05:23] that occurs in the sync script [23:09:53] (03Abandoned) 10Dzahn: remove search21-50, keep search13-20 [operations/dns] - 10https://gerrit.wikimedia.org/r/91638 (owner: 10Dzahn) [23:11:00] ^d: so the initial puppet run is finishing up on arsenic, once its done you'll have same rights on it as on terbium [23:11:10] but its a temp dedicated server for you [23:11:26] I'm putting a follow up RT task to check with you in 90 days about its continued use =] [23:58:56] (03PS1) 10Dzahn: dsh: split the misc-servers into dc-specific files [operations/puppet] - 10https://gerrit.wikimedia.org/r/92599
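Pulling the pieces of that explanation together, the deployer-facing workflow described above amounts to the short session sketched below. The command names come straight from the discussion; the repository path is a placeholder, and the comments paraphrase the behaviour as described in this conversation rather than the tool's documentation.

    cd /srv/deployment/example/repo   # placeholder path to a deployment checkout

    git deploy start    # grabs the lock and tags the pre-deploy state
    # ...update the working copy: fetch, cherry-pick, edit config, etc...
    git deploy sync     # tags the state at sync time, then runs the sync script,
                        # which drives the fetch/checkout runners and the
                        # feedback/confirmation interface

    # If something goes wrong before syncing is complete:
    git deploy abort    # rolls back to the tag written at 'start'

    # Both sync and abort end by calling 'git deploy finish', which drops the lock.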