[00:07:30] Got another 503 again for uploads [00:07:31] Failed to load resource: the server responded with a status of 503 (Service Unavailable) https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/Default_Mode_rules.png/440px-Default_Mode_rules.png [00:07:42] not consistently reproducible though [00:07:48] wfm :D [00:08:07] yeah, me too. It just shows up from time to time for random thumbnails. [00:08:15] and the second time always works of course :-/ [00:08:25] This has been going on for several weeks now, something is broken. [00:08:31] a bad server maybe? [00:08:53] see if you can grab the headers of a bad request [00:09:38] I'd have to catch it with the dev tools open, I only see it when I see a broken image on the page, when I inspect it I only get it http code and url, headers only on refresh, at which point it doesn't happen :) [00:09:57] so will probably take a few more hours before I get one again [00:11:15] There must be a way to get it to always save the headers [00:18:28] (03PS1) 10Ori.livneh: Enable localStorage module caching on enwi^H^H^H beta cluster [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92460 [00:19:06] (03CR) 10Ori.livneh: [C: 032] Enable localStorage module caching on enwi^H^H^H beta cluster [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92460 (owner: 10Ori.livneh) [00:23:14] (03Merged) 10jenkins-bot: Enable localStorage module caching on enwi^H^H^H beta cluster [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92460 (owner: 10Ori.livneh) [00:25:19] !log ori synchronized wmf-config/CommonSettings-labs.php 'I6ca3517dc: Enable localStorage module caching (If2ad2d80d) on beta cluster' [00:25:34] Logged the message, Master [00:26:04] (03PS1) 10Reedy: Update CentralAuth RC2UDP config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92463 [00:26:57] (03CR) 10jenkins-bot: [V: 04-1] Update CentralAuth RC2UDP config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92463 (owner: 10Reedy) [00:28:10] paravoid: ping [00:28:10] (03PS2) 10Reedy: Update CentralAuth RC2UDP config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92463 [00:37:31] gwicke: pong [00:37:39] what's up? [00:37:48] (I'm about to go to sleep :) [00:37:55] just saw your question re the public IP for the parsoid service [00:38:00] oh [00:38:15] but it's not urgent & I responded on the ticket [00:38:33] so don't let me keep you from sleeping! [00:38:38] hehe, don't worry :) [00:39:19] it is basically a stop-gap to let mobile and others who are eager to get their hands on Parsoid HTML to access the existing internal web service [00:39:53] ok [00:40:11] not intended to be published widely [00:40:29] gwicke: liink to ticket? [00:40:34] the Kiwix folks recently dumped all of the French Wikipedia through our tiny labs instance [00:40:42] YuviPanda: RT #6107 [00:41:42] ah, alright [00:43:01] MaxSem: do you remember which graph you were looking at that showed a memcached perf issue starting in august / sept? [00:43:18] it's in my mail [00:43:23] back from september [00:43:39] the "site issues" one [00:43:51] ori-l, https://graphite.wikimedia.org/dashboard/temporary-33 [00:44:26] tim replied with a guess there [00:44:40] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Memcached%20eqiad&m=cpu_report&r=year&s=by%20name&hc=4&mc=2&st=1383007424&g=network_report&z=large [00:56:00] sigh sigh sigh, anybody want to help me troubleshoot the ever annoying stuck ganglia metric? 
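The stuck-metric hunt that starts here comes down to comparing what the plugin computes against what the aggregator's gmond is actually holding (the netcat-into-the-aggregator check described below). A rough sketch of that comparison, assuming gmond's standard XML dump on TCP 8649 and using the host and metric names mentioned later in this log:

```python
"""Sketch of the gmond-vs-plugin comparison described below.

Connects to a gmond aggregator's XML port (8649), pulls the value it
currently holds for one metric on one host, and prints it with a
timestamp; run it in a loop to see whether the value ever changes.
Host and metric names are just the ones mentioned in this log.
"""
import socket
import time
import xml.etree.ElementTree as ET

AGGREGATOR = 'ms1004.eqiad.wmnet'    # assumption: the aggregator being queried
PORT = 8649                          # gmond's standard XML dump port
HOST = 'testsearch1001.eqiad.wmnet'  # host whose metric looks stuck
METRIC = 'es_gc_time'

def fetch_metric(aggregator, host, metric):
    """Return the value gmond currently reports for (host, metric), or None."""
    sock = socket.create_connection((aggregator, PORT), timeout=10)
    chunks = []
    try:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    finally:
        sock.close()
    tree = ET.fromstring(b''.join(chunks))
    for host_el in tree.iter('HOST'):
        if host_el.get('NAME', '').startswith(host.split('.')[0]):
            for metric_el in host_el.iter('METRIC'):
                if metric_el.get('NAME') == metric:
                    return metric_el.get('VAL')
    return None

if __name__ == '__main__':
    while True:
        print(time.strftime('%H:%M:%S'), fetch_metric(AGGREGATOR, HOST, METRIC))
        time.sleep(15)
```

Run side by side with the plugin's own output (the spreadsheet approach used below): a value that keeps changing in the plugin but never changes on the aggregator points at the send path rather than the plugin itself.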
[00:56:10] cmmooon, it'll be fuuuuuuuUUuuun! [00:58:05] (03CR) 10MZMcBride: "Ori: I'd be happy to kill this entire idea by fixing bug 50422 instead. :-)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/84743 (owner: 10QChris) [02:15:32] !log LocalisationUpdate completed (1.23wmf1) at Tue Oct 29 02:15:32 UTC 2013 [02:15:51] Logged the message, Master [02:29:03] !log LocalisationUpdate completed (1.22wmf22) at Tue Oct 29 02:29:03 UTC 2013 [02:29:16] Logged the message, Master [03:01:47] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Oct 29 03:01:47 UTC 2013 [03:02:02] Logged the message, Master [03:28:18] ottomata: which metric? [03:28:36] i'm working on any of the es metrics on testsearch1001 [03:28:38] elasticsearh [03:28:44] particularly es_gc_time right now [03:28:48] because it changes relatively often [03:29:12] https://docs.google.com/spreadsheet/ccc?key=0AtLjsFovAGuvdERqVUd2TnZta0pQLTNYZmNfcXVqMkE#gid=0 [03:29:24] so, these are the values at different points, queried every 15 seconds [03:29:38] you can see that the value reported by the ganglia plugin directly changes kinda often [03:29:58] but the value from gmond, obtained by netcat into ms1004.eqiad.wmnet [03:30:01] is stuck [03:30:07] the other values lag behind [03:30:11] and eventually catch up to what is in gmond [03:30:20] but since gmond never changes, they don't etierh [03:30:53] if I restart gmond on testsearch1001 [03:31:11] usually at least one new value makes it to gmond on the aggregator [03:31:21] but it always gets stuck again somewhere [03:31:31] where's the script? [03:31:38] the script to output this? [03:31:47] the metric module [03:31:50] or the python module [03:31:51] ah [03:32:00] i suppose i could just ssh into testsearch1001 and take a look [03:32:16] yeah its in puppet too [03:32:34] modules/elasticsearch/files/ganglia [03:32:48] elasticsearch_monitoring.py [03:38:27] using this to check stuff, ori-l [03:38:28] https://gist.github.com/ottomata/7208847 [03:40:00] how often do the values change in elasticsearch? [03:40:24] ottomata: ^ [03:40:59] this value I see changing at least once every 30 seconds, maybe more often [03:41:06] maybe I can find one that changes more frequently [03:41:15] they all go stale [03:43:07] es_index_*_docs_count changes [03:43:17] havent' seen the others change yet [03:44:27] es_gc_time changes [03:45:25] eyah but not that often, maybe every 30 seconds - 1 min [03:59:52] seeing anything i'm not, ori-l? [04:00:40] ottomata: not yet [04:04:28] ottomata: try /root/el.py on testsearch1001 [04:07:21] oo [04:07:50] well, i guess i'm not giving it the same params [04:08:29] your'e not seeing anything change? [04:08:36] i don't see anyting change with your script :/ [04:08:45] no, me neither now [04:09:21] ottomata: where did the '*' notation for all indices come from? [04:09:26] it's not mentioned in the docs http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-stats.html [04:09:51] and if you try to curl it, although you get JSON back, the http status code is 404 [04:09:51] ? [04:09:56] was I doing that? 
[04:10:04] param indices { [04:10:04] value = "*" [04:10:04] # value = "index1 index2 index3,index4" [04:10:06] } [04:10:10] oh, I di dn't write it [04:10:12] dunno [04:10:15] in /etc/ganglia/conf.d/elasticsearch.pyconf [04:10:37] https://github.com/ganglia/gmond_python_modules/blob/master/elasticsearch/conf.d/elasticsearch.pyconf [04:10:48] i think nik grabbed it from there [04:16:00] ok, thanks for the help ori-l [04:16:03] gotta go to sleep [04:16:19] oh more is changing more often right now :) [04:16:21] ok niggghters [04:16:30] np, i think you need to specify the indexes [04:54:11] morning [04:55:04] I'm just making TODO lists to get back on track and realized that between all the requested backports, I have to build *15* Debian packages [05:02:53] how hard are they actually to do? [05:38:11] paravoid: gonna put that DD badge of yours to good use ;) [05:41:59] Reedy: some of them are backports, should be easy enough [05:43:20] i should try sometime when its not 0543 ;) [05:44:23] Reedy: I see that hour of the day some times, just, on the other end of my day :) [05:44:52] I packaged most of cowsay for optware ;) [05:55:08] PROBLEM - MySQL Slave Running on db32 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Table _archive_new already exists on query. Default databas [05:57:08] RECOVERY - MySQL Slave Running on db32 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [06:13:52] (03PS2) 10ArielGlenn: remove entries for db5,6,7,26,27 long since decommed [operations/dns] - 10https://gerrit.wikimedia.org/r/92272 [06:15:34] (03PS3) 10ArielGlenn: remove entries for db5,7,8,26,27 long since decommed [operations/dns] - 10https://gerrit.wikimedia.org/r/92272 [06:19:37] (03CR) 10ArielGlenn: [C: 032] remove entries for db5,7,8,26,27 long since decommed [operations/dns] - 10https://gerrit.wikimedia.org/r/92272 (owner: 10ArielGlenn) [06:41:08] (03PS1) 10ArielGlenn: remove last of sq38-42, 45-47, decommed long ago [operations/dns] - 10https://gerrit.wikimedia.org/r/92491 [06:44:24] PROBLEM - Disk space on copper is CRITICAL: DISK CRITICAL - free space: / 355 MB (3% inode=90%): [06:44:47] meh [06:48:24] RECOVERY - Disk space on copper is OK: DISK OK [06:48:36] ok, I created a 100G LV [06:48:38] should be enough for now [06:52:42] apergos: ^^^ thanks for all the pings, fixed permanently. [06:52:53] yep saw, that's perfect [06:53:19] even if that were to fill up it won't kill regular system operation (like / filling) [06:53:21] (03CR) 10TTO: "(1 comment)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92314 (owner: 10Dereckson) [06:54:13] (03PS1) 10ArielGlenn: remove sq38-42, 45-47 from remaining dsh files [operations/puppet] - 10https://gerrit.wikimedia.org/r/92493 [06:56:01] (03CR) 10ArielGlenn: [C: 032] remove sq38-42, 45-47 from remaining dsh files [operations/puppet] - 10https://gerrit.wikimedia.org/r/92493 (owner: 10ArielGlenn) [07:10:17] Reedy: about? [07:11:48] foreachwiki is broken on terbium [07:11:58] /usr/local/lib/mw-deployment-vars.sh has MW_COMMON_SOURCE=/a/common (etc.) 
[07:12:07] but there's no /a, there, just /apache [07:13:14] !log "ln -s /apache /a" on terbium; foreachwiki and friends was broken (for a while) due to /usr/local/lib/mw-deployment-vars.sh pointing to non-existent /a [07:13:26] it'll definitely work on tin [07:13:26] judging from the 5T of temp files on swift, quite a while [07:13:31] haha [07:13:32] Logged the message, Master [07:13:55] we have a maintenance script using foreachwiki [07:14:09] on a cron job, so that's terbium [07:16:12] and now the php script is broken [07:16:14] throws exception [07:16:15] oh joy [07:17:08] which script? [07:17:20] hm, maybe we're missing a proper config on terbium? [07:18:40] no, config looks fine [07:18:46] "/usr/local/bin/foreachwiki maintenance/cleanupUploadStash.php" is what I'm trying to run [07:20:02] what's the exception? [07:20:39] bswikibooks: UploadStash::getFile No user is logged in, files must belong to users [07:20:47] if ( !$noAuth && !$this->isLoggedIn ) { [07:20:47] throw new UploadStashNotLoggedInException( __METHOD__ . [07:20:50] ' No user is logged in, files must belong to users' ); [07:20:53] } [07:21:09] it's not urgent, I'll file a bug [07:21:30] I guess noauth needs to be true there [07:21:37] no [07:21:43] well, maybe, dunno [07:21:48] (03CR) 10Nemo bis: "Now tracked at https://bugzilla.wikimedia.org/show_bug.cgi?id=56287" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/68417 (owner: 10Nemo bis) [07:21:57] but it needs to login at some point if it will actually clean up :) [07:23:54] I guess its been broken for a while [07:23:56] yeah [07:25:14] https://bugzilla.wikimedia.org/show_bug.cgi?id=56298 [08:02:12] mark, do you have few min now? [08:52:53] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 316 seconds [08:53:12] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 324 seconds [08:53:28] springle: that you? [08:53:43] (db1047 replag) [08:53:56] don't think so. i [08:54:20] i'm hammering s4 atm, not db1047 [08:54:51] DELETE /* SolrUpdate::safeExecute [08:54:51] ok [08:54:53] some job [08:55:00] from? terbium? [08:56:13] apache 13200 0.0 0.0 12308 1424 ? S 08:54 0:00 /bin/bash /usr/local/bin/foreachwikiindblist /usr/local/apache/common/special.dblist extensions/GeoData/solrupdate.php --clear-killlist 3 --noindex [08:56:17] looks like it [08:56:31] ya.. master has already finished it [08:57:09] or rather, no terbium->wikiadmin doing much right this second [08:57:18] I wonder if my fix has anything to do it with it [08:57:28] yeah it probably did [08:57:34] foreachwiki/foreachwikiindblist was broken on terbium [08:57:37] who knows for how long [08:57:40] jobs suddenly started running because you fixed the maintenance host? :) [08:57:43] yeah [08:57:43] lol [08:59:29] this happened yesterday at around this time, but recovered in 5 minutes, while I was still lookingat it [08:59:44] db1047? [08:59:45] same db [08:59:47] yup [08:59:51] hmm [09:00:31] didn't save the processlist since it was already done [09:01:56] ---TRANSACTION 854075953, ACTIVE 389792 sec fetching rows [09:02:03] that will slow it down [09:02:14] one of the research queries [09:02:42] the COUNT(DISTINCT linked_page.page_id) was going, I remember that [09:03:35] can't tell you about the other two, sorry. it was a short list though, only a few things like today [09:04:22] db1047 isn't used by MW, only research. 
not much can be done about replag in this case [09:15:39] You can hurt the researchers >.> <.< [09:59:48] lol [10:00:01] that might explain a few things.... [10:02:42] to the special pages are now fixed? [10:07:59] ? [10:21:42] paravoid: https://bugzilla.wikimedia.org/show_bug.cgi?id=53227 [10:23:20] speaking of which, https://gerrit.wikimedia.org/r/#/c/90117/ is a rather trivial change and helps making things easier to understand [10:24:21] ah, but update-special-pages doesn't use foreachwiki/foreachwikiindblist [10:38:40] (03CR) 10Faidon Liambotis: "(2 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/92288 (owner: 10Matanya) [10:38:49] (03CR) 10Faidon Liambotis: [C: 04-1] removing cache clean up patch [operations/puppet] - 10https://gerrit.wikimedia.org/r/92288 (owner: 10Matanya) [10:39:24] (03PS2) 10Dereckson: DynamicPageList extension configuration maintenance [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92314 [10:41:39] (03CR) 10Dereckson: "(1 comment)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92314 (owner: 10Dereckson) [10:43:57] (03PS2) 10Matanya: removing cache clean up patch [operations/puppet] - 10https://gerrit.wikimedia.org/r/92288 [10:46:41] (03PS3) 10Matanya: removing cache clean up patch [operations/puppet] - 10https://gerrit.wikimedia.org/r/92288 [11:08:12] Are there known issues with the blog? [11:08:24] http://blog.wikimedia.org/2013/10/24/airtel-wikipedia-zero-text-trial/#comments says there are 41 comments in 3 pages. None are displayed. [11:08:41] I remember seeing this before [11:08:53] maybe they're held for moderation? [11:09:07] we don't really do much with blog to be honest [11:09:16] just bugzilla is? [11:09:20] s/is/it/ [11:09:36] no idea [11:09:56] last time we were discussing about blog issues it was decided for us to give it to an external contractor [11:09:59] that was months ago [11:10:14] I think the communications is doing moderations and such? [11:11:55] I've reported https://bugzilla.wikimedia.org/show_bug.cgi?id=56308 [11:12:00] paravoid: is this file using tabs? [11:12:07] which file? [11:12:20] templates/varnish/mobile-frontend.inc.vcl.erb [11:12:24] siebrand: I think it's the pingbacks [11:17:24] paravoid: templates/varnish/mobile-frontend.inc.vcl.erb [11:17:42] why are you asking me? :P [11:18:16] $ grep -q '\t' templates/varnish/mobile-frontend.inc.vcl.erb && echo yes [11:18:19] yes [11:18:24] you edited it last before me ... [11:19:09] paravoid: so should i follow that scheme or the normal 4 space in my patch? [11:19:11] the answer is "yes, it's indented with tabs" :) [11:24:02] Thanks paravoid :) [11:24:09] :) [11:24:25] you may merge my patch if you wish [11:24:51] why did you convert to spaces? 
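Since the tabs-versus-spaces question keeps coming up in this log (the VCL template here, the zone files later, and mark's wish for a lint check that rejects stray tabs), here is a minimal standalone sketch of such a check. It is illustrative only, not the lint job CI actually runs; the mixed-indentation rule mirrors the grep test quoted above.

```python
"""Minimal whitespace lint in the spirit of the checks discussed here:
flag files that mix tab- and space-indented lines, or optionally any
tab at all. Illustrative only, not the check Jenkins actually runs.
"""
import sys

def indent_styles(path):
    """Return the set of indentation styles seen at line starts."""
    styles = set()
    with open(path) as f:
        for line in f:
            if line.startswith('\t'):
                styles.add('tab')
            elif line.startswith(' '):
                styles.add('space')
    return styles

def main(paths, forbid_tabs=False):
    bad = []
    for path in paths:
        styles = indent_styles(path)
        if len(styles) > 1:
            bad.append((path, 'mixes tabs and spaces'))
        elif forbid_tabs and 'tab' in styles:
            bad.append((path, 'uses tab indentation'))
    for path, reason in bad:
        print('%s: %s' % (path, reason))
    return 1 if bad else 0

if __name__ == '__main__':
    sys.exit(main(sys.argv[1:]))
```

Pointed at, say, templates/varnish/*.vcl.erb it would print any file that mixes styles and exit non-zero, which is the property being argued for above.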
[11:25:41] i noted it too late [11:26:10] that is why i asked you in the first place if i should use tabs [11:26:22] yes, you should [11:26:34] in general, try to keep indentation the same as the rest of the file [11:26:46] so the whole file is indented with tabs (as is all of our .vcl files) [11:26:53] so we shouldn't mix tab/spaces [11:27:37] you should never ever use both in a single file [11:27:52] * matanya is fixing [11:27:52] and if you decide you want to change it, which may or may not be warrantied, ALWAYS do it as a separate commit with no other changes but whitespace [11:29:51] warranted [11:30:38] (03PS4) 10Matanya: removing cache clean up patch [operations/puppet] - 10https://gerrit.wikimedia.org/r/92288 [11:30:41] thanks for this mark [11:31:12] paravoid: ^ [11:31:15] ok [11:31:23] we should make sure it's 30 days indeed [11:31:31] agreed [11:36:42] !log gerrit: creating integration/phpunit to hold PHPUnit files for deployment [11:36:57] Logged the message, Master [11:53:49] (03PS1) 10Hashar: deployment: integration/phpunit for Jenkins CI slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/92510 [11:56:39] !log lanthanum.eqiad.wmnet running apt-get upgrade [11:56:53] Logged the message, Master [12:09:27] !log jenkins refreshing jobs to let them recurse in git submodules {{bug|55614}} {{gerrit|92511}} [12:09:41] Logged the message, Master [12:10:07] addshore: I am refreshing the Jenkins jobs to let them recurse in submodules [12:12:11] :) [12:29:44] addshore: deployed :) [12:29:47] !b 55614 [12:29:47] https://bugzilla.wikimedia.org/55614 [12:30:17] :) [12:32:12] (03PS2) 10Mark Bergsma: Repartition esams LVS service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/92344 [12:32:13] (03PS2) 10Mark Bergsma: Repartition eqiad LVS service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/92343 [12:32:14] (03PS2) 10Mark Bergsma: Repartition ulsfo LVS service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/92342 [12:32:55] * mark violates his own rules on mixing tabs/spaces [12:33:10] bad mark [12:33:36] zonefiles really should be made spaces soon [12:35:33] want a lint check on them that reject the change whenever a tab is encountered ? [12:35:51] after we migrate all of them, yes [12:35:53] not before ;) [12:37:09] You could [12:37:17] see who it annoys enough to change them all ;) [12:38:10] MS-DOS Application (.com) [12:38:12] gj Windows [12:41:49] akosiaris: ahh alexanders :-] Are you familiar with the deployment system ? I would like to be able to deploy a repo containing phpunit on the CI slaves https://gerrit.wikimedia.org/r/92510 [12:43:48] hashar: I used to be a couple of weeks ago [12:44:05] then a ryan started pushing a ton of changes and now i lag behind [12:44:16] doh [12:44:36] let me get something done first and I 'll try to figure it out in some 30 mins. OK ? [12:44:41] sure [12:44:46] will grab a snack meanwhile [13:01:15] (03PS1) 10Reedy: Tabs to spaces [operations/dns] - 10https://gerrit.wikimedia.org/r/92516 [13:01:26] +299, -299 [13:02:26] (03PS1) 10Mark Bergsma: Add new ulsfo upload LVS service IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/92517 [13:02:40] not now reedy [13:02:49] i'm doing a ton of lvs service ip changes, am not gonna rebase em now [13:02:50] It's alright, I didn't to it by hand [13:03:42] (03CR) 10Mark Bergsma: [C: 04-1] "Please not now, I'm doing zero repartitioning atm..." 
[operations/dns] - 10https://gerrit.wikimedia.org/r/92516 (owner: 10Reedy) [13:06:30] (03PS2) 10Reedy: Tabs to spaces [operations/dns] - 10https://gerrit.wikimedia.org/r/92516 [13:06:40] (03Abandoned) 10Reedy: Tabs to spaces [operations/dns] - 10https://gerrit.wikimedia.org/r/92516 (owner: 10Reedy) [13:07:28] I like tabs in DNS anyway :) [13:09:18] (03CR) 10Mark Bergsma: [C: 032] Add new ulsfo upload LVS service IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/92517 (owner: 10Mark Bergsma) [13:15:41] !log reprepro: include backported graphite-carbon, graphite-web, python-whisper (plus deps python-django-tagging, flot) from saucy; replace custom-built python-whisper [13:15:58] Logged the message, Master [13:19:07] (03CR) 10Faidon Liambotis: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/92271 (owner: 10Ori.livneh) [13:27:20] mark, do you have time to look at it now? [13:27:32] at what, yurik? [13:27:35] ESI [13:28:29] mark, enabling ESI on beta cluster causes it to create tons of instances [13:28:44] what makes you think that? [13:30:32] mark, the varnish frontend log shows tons of strange "creating new thread" entries [13:30:46] can you work on it now? i can show you how i get there [13:30:46] yeah, because varnish crashes and it restarts :) [13:31:00] of course it makes a lot of new threads then [13:32:37] oh, ok, so could you take a look - connect to the bastion / 10.4.1.82 [13:32:53] i did have a look last week [13:33:19] is there anything new? [13:33:46] did you see anything weird there? [13:34:16] last week i asked you to look but then thought that it might have to do with the backend producing relative paths [13:34:26] so i did a few hacks - but it crashes reagrdless [13:35:13] i think it's more likely that it's simply https://www.varnish-cache.org/trac/ticket/1184 [13:35:43] that link crashes too ;) [13:37:12] hmm... something is weird with the russian internet - i constantly have issues connecting to a large number of sites (like google) - regardless of where I am [13:38:37] mark, so how can we verify its that bug? [13:38:44] it doesn't seem the same [13:39:07] check if it asserts, in the system logs [13:39:13] if the message is different, let's check the source [13:39:20] then it's yet another bug, probably unfixed [13:39:43] mark, could you check it with me - it would take you far less time - i will simulate the issue [13:40:18] yurik, i don't really have much time for this right now [13:40:27] i'm going on a 3 week holiday in a few weeks, and need to finish some stuff before then [13:40:32] ESI isn't one of them I'm afraid [13:41:07] what the segfault or assert message in the log if you crash it? [13:41:24] what log file should i look at? [13:41:54] /var/log/syslog [13:41:55] or dmesg [13:42:49] [1261947.673830] varnishd[10196]: segfault at 7f3a845c0000 ip 00007f3a9268d20b sp 00007f3a845bc158 error 6 in libvgz.so[7f3a92687000+13000] [13:42:51] [1262024.112711] varnishd[10959]: segfault at 7f3a825a2280 ip 000000000042022b sp 00007f3a825a2280 error 6 in varnishd[400000+83000] [13:42:52] [1262025.099503] varnishd[11254]: segfault at 7f3a845bf000 ip 00007f3a9268d20b sp 00007f3a845bb158 error 6 in libvgz.so[7f3a92687000+13000] [13:42:54] [1309764.416207] varnishd[510]: segfault at 7fca24d40280 ip 000000000042022b sp 00007fca24d40280 error 6 in varnishd[400000+83000] [13:43:00] gdb it? [13:43:01] these are the last messages [13:43:17] get a backtrace, try to find why it's crashing? [13:43:40] paravoid: would love to, but any hints on the steps to do it? 
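The answer to yurik's question follows just below: install varnish-dbg, attach gdb, reproduce the crash, then bt full. The fiddly part is picking the right process, since each varnishd instance is a manager plus a worker child and only the child is worth attaching to. A small sketch of automating that selection; the 'frontend' name match is an assumption about how the frontend instance is labelled on this host, so adjust it to whatever ps -waux actually shows.

```python
"""Sketch: find the varnishd child process worth attaching gdb to.

Each varnishd instance is a parent (manager) plus a child (worker); the
parent only restarts the child, so gdb must attach to the child of the
frontend instance. The 'frontend' hint below is an assumption about how
that instance is started on this host.
"""
import subprocess

def varnish_processes():
    """Return (pid, ppid, args) for every varnishd process."""
    out = subprocess.check_output(['ps', '-eo', 'pid,ppid,args'])
    procs = []
    for line in out.decode().splitlines()[1:]:
        pid, ppid, args = line.strip().split(None, 2)
        if 'varnishd' in args:
            procs.append((int(pid), int(ppid), args))
    return procs

def frontend_child_pid(name_hint='frontend'):
    procs = varnish_processes()
    varnish_pids = set(pid for pid, _, _ in procs)
    for pid, ppid, args in procs:
        # a child's parent is itself a varnishd; filter on the instance name
        if ppid in varnish_pids and name_hint in args:
            return pid
    return None

if __name__ == '__main__':
    pid = frontend_child_pid()
    print(pid if pid else 'no matching varnishd child found')
    # then: gdb -p <pid>, reproduce the crash, and "bt full" as described below
```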
[13:43:50] it certainly sounds like something to do with compress/decompress and esi [13:43:52] how to gdb? [13:44:27] paravoid: yes, and more specifically - how to gdb a specific varnish service [13:45:21] Attach to a varnish process and wait for it to die? [13:46:07] apt-get install varnish-dbg; gdb -p $(pidof varnishd); (make it crash); bt full [13:46:10] or thread apply all bt full [13:47:05] in labs you can set a low thread_max count in varnish to ease debugging [13:47:11] there are 4 varnishd processes [13:47:26] any way to find the right one? [13:47:34] 2 for frontend, 2 for backend [13:47:39] each has a child process and a parent [13:47:45] the parent does nothing but restarting the child [13:47:50] so you need to attach to the child [13:47:54] of the -frontend process [13:48:14] mark - I used "ps -A | grep varn" [13:48:22] good [13:48:30] any options to see parent & frontend status? [13:48:36] like a startup command, etc? [13:48:37] now use "ps -waux" as well as your eyes [13:48:43] thx [14:02:09] (03PS3) 10Mark Bergsma: Repartition esams LVS service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/92344 [14:02:10] (03PS3) 10Mark Bergsma: Repartition eqiad LVS service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/92343 [14:02:11] (03PS3) 10Mark Bergsma: Repartition ulsfo LVS service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/92342 [14:03:36] (03PS1) 10Mark Bergsma: Update ulsfo Text LVS service IPs to new Zero scheme [operations/puppet] - 10https://gerrit.wikimedia.org/r/92527 [14:04:28] btw, mark, i just spoke with one of our biggest partners, they are totally ok to switch to IP, and just need the ranges [14:04:41] they zero-rate both text & media [14:05:00] if they zero rate everything we can give them the full ranges soon [14:05:28] cool, as long as we don't give them labs where ppl could in theory run proxies [14:06:40] just the LVS ranges I mean, not all our ips [14:06:56] but please don't give anyone anything until we confirm we're gonna use these ranges [14:07:07] i'll mail, probably this week still [14:07:13] i'm doing a bit of a trial run with ulsfo now [14:07:27] (03CR) 10Mark Bergsma: [C: 032] Update ulsfo Text LVS service IPs to new Zero scheme [operations/puppet] - 10https://gerrit.wikimedia.org/r/92527 (owner: 10Mark Bergsma) [14:07:47] np, won't give anything untill i receive an email from you [14:16:52] paravoid: when i attach gdb -p , it seems like already crashes, without waiting for me to hit the service [14:17:14] either that, or gdb -p causes a break that i can't continue [14:23:48] (03PS1) 10Mark Bergsma: Swap old/new upload LVS service IPs in ulsfo [operations/puppet] - 10https://gerrit.wikimedia.org/r/92531 [14:24:13] yurik_: disable the watchdog timer in varnish [14:24:18] i.e. varnish checks whether the client responds [14:24:27] if you attach to it, gdb blocks the child, and the parent will restart [14:24:31] there's a runtime parameter to disable that [14:25:13] mark, i'm trying to run the process in shell, by copying the entire string from the ps -waux, but it shows the params list [14:25:35] (03CR) 10Mark Bergsma: [C: 032] Swap old/new upload LVS service IPs in ulsfo [operations/puppet] - 10https://gerrit.wikimedia.org/r/92531 (owner: 10Mark Bergsma) [14:27:53] akosiaris: have you finished? 
:D https://gerrit.wikimedia.org/r/92510 to deploy a repo containing phpunit on the CI slaves [14:31:34] (03PS1) 10Mark Bergsma: Change ulsfo upload-lb IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/92533 [14:33:57] hiiiii mark, so i'm trying to verify that the 2 analytics racks are seeing different multicast traffic [14:34:21] I've counted packets by node seen on each aggregator [14:34:45] I've also tried to look directly at the traffic and count occurrences of things, but that is less conclusive because its binary and harder to examine [14:35:08] i can verify that if I listen on the ganglia multicast addy and try to send between racks [14:35:15] i get the same behavior i saw yesterday [14:35:22] i can't send from one rack to the other [14:41:07] at all? [14:41:22] so how do some core metrics get exchanged then? [14:45:34] i don't know, all i'm saying is that when I do it manually, i can't send the traffic through [14:45:53] weird [14:45:55] so how about [14:46:00] we temp disable that analytics ACL [14:46:04] and see if the problem goes away with that [14:46:51] ok [14:46:56] (03PS1) 10Cmjohnson: Removing mgmt dns entries for ms2 [operations/dns] - 10https://gerrit.wikimedia.org/r/92535 [14:47:00] hashar: I have not finished, but I just started waiting on something, so I have some time [14:47:39] ottomata: done [14:48:56] (03PS1) 10Cmjohnson: Removin ms2 from dhcpd [operations/puppet] - 10https://gerrit.wikimedia.org/r/92536 [14:49:30] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns entries for ms2 [operations/dns] - 10https://gerrit.wikimedia.org/r/92535 (owner: 10Cmjohnson) [14:49:48] !log dns update [14:50:05] Logged the message, Master [14:51:48] (03CR) 10Cmjohnson: [C: 032] Removin ms2 from dhcpd [operations/puppet] - 10https://gerrit.wikimedia.org/r/92536 (owner: 10Cmjohnson) [14:52:01] PROBLEM - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.381 second response time [14:52:36] hrm [14:52:49] akosiaris: so basically my aim is to be able to deploy integration/phpunit.git content (that is a copy of the PHP unit framework) to all jenkins slaves in production (aka gallium and lanthanum). [14:53:00] RECOVERY - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 575 bytes in 0.378 second response time [14:53:05] weird [14:53:27] akosiaris: so I though instead of using git::clone( ensure => latest ); i could look at the deployment system we have. And that is apparently very simple to add a new repo / group of servers as target [14:53:44] seems easier than last time [14:53:49] mark, acl change doesn't seem to make a difference, although i can send multicast traffic manually now [14:53:51] ryan changed it a lot [14:54:01] can you turn the acl back on again? i want to verify i wasn't doing that wrong before [14:54:11] akosiaris: yeah that is part of the refactoring, we talked a couple weeks ago how it looked messy to add something new in [14:54:24] akosiaris: ryan promply refactored the puppet manifest :] [14:54:56] ottomata: what ttl did you set? [14:55:00] it's limited to 3 in the acl [14:55:10] i'm using iperf, hm [14:55:40] 32 [14:56:00] set to 2 or 3 [14:56:08] that would have been the reason I think [14:56:21] should perhaps widen that up a bit [14:56:26] hashar: so... it seems ok [14:56:33] the gerrit patchset I mean [14:56:50] btw this is happening because of old packages in Ubuntu ? 
[14:57:35] yup [14:57:44] and I don't feel like packaging PHPUnit for Debian [14:57:49] ok, so mark is acl back on now? [14:57:51] that is way above my competencies :] [14:57:58] no [14:58:01] will turn it back on [14:58:28] doe [14:58:29] done [14:59:11] I was considering what mark said about inter-LTS versions. Maybe it makes sense to make CI 13.10 ? Not sure yet. But it could help. [14:59:29] ok yeah, can't get traffic through now, sorry, is the acl allowing any multicast traffic on the addy, or just port 8649? [14:59:44] so hashar. I am gonna merge. You have access to tin to deploy that ? [14:59:47] or should I ? [14:59:53] i'm not using 8649 to test, its hard to tell because there is already a bunch of ganglia traffic on that port [15:00:10] akosiaris: I got access [15:00:17] akosiaris: no clue how to deploy but I will figure it out :] [15:00:38] nice [15:00:43] (03CR) 10Akosiaris: [C: 032] "Talked with hashar on IRC and then reviewed previous changes to get a hang of what changed and all seems well." [operations/puppet] - 10https://gerrit.wikimedia.org/r/92510 (owner: 10Hashar) [15:01:00] PROBLEM - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.379 second response time [15:01:32] gah [15:01:40] so we need to restart nginx on ip changes [15:02:15] on lvs ip changes? [15:02:16] bwer [15:02:50] RECOVERY - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 573 bytes in 0.388 second response time [15:04:13] maybe not [15:06:06] PROBLEM - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.380 second response time [15:09:22] (03PS1) 10Mark Bergsma: Restore internal svc address for ulsfo upload [operations/puppet] - 10https://gerrit.wikimedia.org/r/92538 [15:09:53] (03CR) 10Mark Bergsma: [C: 032 V: 032] Restore internal svc address for ulsfo upload [operations/puppet] - 10https://gerrit.wikimedia.org/r/92538 (owner: 10Mark Bergsma) [15:09:56] PROBLEM - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.399 second response time [15:10:06] RECOVERY - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 575 bytes in 0.386 second response time [15:10:46] RECOVERY - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 575 bytes in 0.378 second response time [15:15:51] hashar: all ok with Ryan deployment system ? [15:16:11] akosiaris: the repo got cloned on tin, waiting for puppet to finish on the target hosts [15:16:19] info: Salt::Grain[deployment_target_contint-production-slaves]: Scheduling refresh of Exec[deployment_target_refresh_pillars] [15:16:19] :-) [15:17:10] :-) [15:18:19] akosiaris: annnnd [15:18:23] THAT WORKS!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! [15:18:25] \O/ [15:18:26] * hashar dance [15:18:29] damn [15:18:31] I am happy really [15:18:41] that is a proof that sometime stuff just works! [15:18:54] (03PS1) 10Mark Bergsma: Connect to the system backend ip instead of a (volatile) LVS ip [operations/puppet] - 10https://gerrit.wikimedia.org/r/92539 [15:19:40] :) :) [15:20:09] hashar: now i am gonna have deploy everything through that and stop packaging :P [15:20:10] Continue? 
([d]etailed/[C]oncise report,[y]es,[n]o,[r]etry): [15:20:12] huhu [15:20:23] akosiaris: na na :-] [15:20:32] the python packages can land in debian, they are not hard topackage [15:21:31] # YAY : 'finish' for 'integration/phpunit' completed successfully (now at integration/phpunit-20131029-151743) [15:21:32] hehe [15:22:01] mark, can you confirm if I should be able to send multicast traffic on 239.192.1.32 any port? [15:22:03] or just 8649? [15:23:50] just 8649 [15:24:03] so the acl says: 239.192/16, ttl max 3, udp port 8649 [15:24:57] hm oh ok, so i should be able to test using something else in the / 16 on 8649 [15:25:04] yup [15:26:36] ok so then that works [15:26:44] false alarm on the not being able to send multicast traffic between racks at all [15:26:50] but then, i still have a mystery to solve [15:28:54] sorry ;) [15:29:17] to the mystery machine! [15:29:18] check ttl of all ganglia messages perhaps? [15:29:20] and dest port [15:29:29] perhaps the acl catches some if they're not all equal [15:29:36] although turning it off didn't fix it I guess [15:29:40] ok, so [15:29:52] i just edited gmetad.conf on nickel [15:30:01] and took out the row b aggregator (analytics1009) [15:30:12] so gmetad would be forced to talk to the aggregator in row c analytics1011 [15:30:16] and now, [15:30:22] the metrics show up. [15:30:32] but all metrics are in row c too right? [15:30:38] so that makes sense then [15:30:51] if multicast messages don't make it to row b [15:31:18] the metrics i was trying to get to work originate in row c [15:31:20] right [15:31:35] yeah [15:31:43] ttl = 3 [15:32:33] but, this is still bad behavior, right? [15:32:35] oh i was wrong [15:32:38] ganglia doesn't have a ttl check [15:32:41] it was pim that does [15:32:49] term ganglia { [15:32:49] from { [15:32:49] destination-address { [15:32:49] 239.192.0.0/16; [15:32:49] } [15:32:50] protocol udp; [15:32:52] destination-port 8649; [15:32:54] } [15:32:56] then accept; [15:32:58] } [15:33:14] hm, ok, so that works [15:33:20] still confused though [15:33:28] so, if I have two data sources set in gmetad.conf [15:33:41] and each of those are non-deaf aggregators for the cluster [15:33:48] hmmm [15:34:05] how does a gmetad datasource use multiple aggregators…googling [15:34:15] they should be equal [15:34:23] so i believe it just talks to the first unless it doesn't respond [15:34:27] (but do check) [15:34:29] If gmetad cannot pull the data from the first source, it will continue trying the other sources in order. 
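Given that gmetad only polls the first data_source host unless it stops answering, a stale aggregator in row B can mask perfectly fresh values held by the row C one, which is exactly the "they should be equal ... and they are not" situation worked out below. A quick consistency check along those lines, a sketch that reuses gmond's XML dump on port 8649 and the two aggregator hostnames mentioned in this log:

```python
"""Sketch: do the two row aggregators in a gmetad data_source agree?

Fetches the full metric dump from each aggregator and prints anything
that differs. Fast-moving metrics will always drift a little between two
snapshots; the interesting signal is metrics that are missing from, or
frozen on, one side. Hostnames are the ones named in this log.
"""
import socket
import xml.etree.ElementTree as ET

AGGREGATORS = ['analytics1009.eqiad.wmnet', 'analytics1011.eqiad.wmnet']
PORT = 8649

def dump(aggregator):
    """Return {(host, metric): value} as reported by one aggregator."""
    sock = socket.create_connection((aggregator, PORT), timeout=10)
    data = b''
    try:
        while True:
            chunk = sock.recv(65536)
            if not chunk:
                break
            data += chunk
    finally:
        sock.close()
    values = {}
    for host in ET.fromstring(data).iter('HOST'):
        for metric in host.iter('METRIC'):
            values[(host.get('NAME'), metric.get('NAME'))] = metric.get('VAL')
    return values

if __name__ == '__main__':
    a, b = (dump(agg) for agg in AGGREGATORS)
    for key in sorted(set(a) & set(b)):
        if a[key] != b[key]:
            print('%s/%s: %s vs %s' % (key[0], key[1], a[key], b[key]))
```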
[15:34:31] hm [15:34:33] yeah [15:34:42] so row b was listed first [15:34:45] and they are not equal [15:34:48] so something is wrong [15:34:50] so if it doesn't have the metrics from row c, then we don't see metrics [15:34:51] with the multicast channel [15:34:53] that explains that [15:34:54] yeah [15:34:59] yes [15:35:13] you can reverse the order in gmetad but it's not a real fix [15:35:17] but i can send things via multicast manually, and we know that some of the metrics get through fine [15:35:19] right, no its not [15:35:23] because we do have machines we use in row b [15:35:28] right now really only the hadoop namenodes [15:35:30] so carefully examine the packets for the other metrics [15:35:32] but those are important ones [15:35:34] and see if you notice anything different [15:35:39] hm [15:35:51] udp ports, ttl, etc [15:36:03] different source address perhaps [15:41:13] HMMMMMMMMM [15:41:33] I think I see the non kafka metrics traffic on both aggregators [15:41:44] i wonder if the kafka ganglia client doesn't do multicast well [15:42:32] it's not a ganglia python plugin? [15:42:37] no [15:42:43] how does it send data to gmond? [15:42:48] https://github.com/criteo/kafka-ganglia [15:43:34] yeah doesn't look like it does multicast [15:44:18] although I haven't seen all classes it uses obviously [15:44:34] but somehow it does work cross-host within the same subnet [15:44:36] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000 [15:44:45] so check traffic from a non-aggregator host to the aggregator in row C [15:44:51] perhaps it's ttl 1 [15:44:57] and then it can't pass the router [15:45:02] yeah that's probably it [15:45:16] multicast senders use ttl 1 by default, you need to raise it explicitly [15:45:22] if this doesn't do that, then it can't go beyond one subnet [15:45:54] oh i do see ttls [15:46:05] ttl 1 [15:46:31] yeah, so the router decrements it, finds it 0, discards the packet [15:46:33] yeah from kafka metrics [15:46:41] explains it all [15:47:36] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
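For reference, the fix the diagnosis above points at (the Kafka GangliaReporter never raises the multicast TTL, so its datagrams die at the first router hop between rows) is a one-line socket option on the sender. A minimal sketch, using the ganglia group and port from this log and a placeholder payload rather than the XDR encoding gmond really expects:

```python
"""What "raise the multicast TTL explicitly" means at the socket level.

By default a UDP sender uses a multicast TTL of 1, so packets are dropped
at the first router hop, which is exactly the cross-row behaviour seen
above. Group and port are ganglia's from this log; the payload is a
placeholder, not real gmond XDR.
"""
import socket

GANGLIA_GROUP = '239.192.1.32'
GANGLIA_PORT = 8649

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
# The bit the Kafka GangliaReporter is missing: without this the kernel
# sends multicast datagrams with TTL 1.
sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 3)
sock.sendto(b'placeholder, not real gmond XDR', (GANGLIA_GROUP, GANGLIA_PORT))
```

A TTL of 3 matches the limit already allowed by the router ACL quoted earlier in the log.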
[15:55:12] (03CR) 10GWicke: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/90733 (owner: 10Physikerwelt) [16:00:13] paravoid: https://gerrit.wikimedia.org/r/#/c/92542/ [16:00:28] I'm sure I said something about the parameter being missing ;) [16:00:31] (03PS2) 10Chad: Wikidatawiki gets Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92352 [16:00:59] (03CR) 10Chad: [C: 032] "Updated summary to clarify this is secondary so no one panics :)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92352 (owner: 10Chad) [16:01:12] Reedy: you did :) [16:03:37] pssh, i'm going to have to go back to jmxtrans for this :/ [16:04:33] !log demon synchronized wmf-config/InitialiseSettings.php 'Cirrus on wikidatawiki as secondary index' [16:04:49] Logged the message, Master [16:06:33] ottomata: GangliaReporter class indeed doesn't set the multicast ttl [16:06:33] (doesn't do anything with multicast really) [16:07:52] !log reedy synchronized php-1.22wmf22/includes/upload/UploadStash.php [16:07:56] yeah [16:08:06] Logged the message, Master [16:08:14] i'm reading the Metrics docs (that Kafka Ganglia Reporter is using) [16:08:18] and it is using an old version [16:08:24] so I could port it over I suppose [16:08:25] !log reedy synchronized php-1.23wmf1/includes/upload/UploadStash.php [16:08:26] hm [16:08:35] that is probably more flexible [16:08:38] Logged the message, Master [16:09:01] Does anyone besides Hashar have Jenkins permission? Or know where Hashar is? [16:09:12] He's @ hasharCall [16:09:16] hexmode: What permissions? [16:09:48] Reedy: he told me to hit "build now" button, but I don't see one [16:09:57] Have you logged in? [16:10:01] Reedy: https://integration.wikimedia.org/ci/job/mediawiki-core-release/ [16:10:02] yes [16:10:51] Reedy: need to add him :D [16:11:01] hasharCall: :) [16:11:01] i am in call right now [16:11:15] * Reedy looks [16:13:03] <^d> !log elasticsearch: created wikidatawiki index, running indexing from 8 processes on terbium. [16:13:13] yay! [16:13:21] Logged the message, Master [16:14:50] hexmode: is your username on jenkins hexmode? [16:15:01] (03CR) 10Physikerwelt: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/90733 (owner: 10Physikerwelt) [16:15:07] Reedy: MarkAHershberger [16:15:33] hexmode is my sockpuppet [16:15:36] ROLE_PROJECT-MEDIAWIKI-RELEASE [16:17:15] * hexmode wonders if he should re-login [16:18:35] You're already in that apparently [16:19:18] MarkA is? [16:19:26] I am confused. [16:20:29] Logged out and back in. Still nothing. [16:21:18] trying chrome [16:25:10] hrm... now I've got the spinner of doom. Jenkins is taking forever to load. [16:29:12] Reedy: loaded in chrome ... 
still no button for MarkAHershberger [16:40:59] greg-g: just fyi, trying to finish cleaning up the installer and then make RC0 today [16:41:52] hexmode: weeee [16:42:08] jenkins says no though, unfortunately [16:43:43] maybe he isn't happy buttling for me :( [16:44:05] never can find good help these days [16:46:29] Some of the user config stuff just don't seem to want to load [16:49:37] !log reedy synchronized php-1.23wmf1/extensions/Wikibase [16:49:50] Logged the message, Master [17:02:03] Reedy: cleanup upload stash works fwiw [17:02:05] thanks :) [17:02:10] it's running now [17:03:01] For anyone interested: Analytics is doing their quarterly review now, GH link: https://plus.google.com/hangouts/_/calendar/cmZhcnJhbmRAd2lraW1lZGlhLm9yZw.us2ntil2r91uec5t9dm7l8162s [17:03:49] oh [17:03:50] damn [17:03:56] You're not allowed to join this video call. [17:04:08] Is it going to youtube? [17:04:30] ugh. Let me find out [17:06:07] I think it would be the first public quarterly review? The others were all private IIRC. Some didn't even publish notes (or only after some months embargo) [17:17:20] Reedy: no we are not putting this on youtube [17:22:14] makes sense :) [17:37:06] (03PS1) 10Bsitu: Switch all small wikis to use extension1 db [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92549 [17:41:21] (03PS2) 10Bsitu: Switch all small wikis to use extension1 db [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92549 [17:42:09] (03CR) 10Bsitu: [C: 032] Switch all small wikis to use extension1 db [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92549 (owner: 10Bsitu) [17:42:18] (03Merged) 10jenkins-bot: Switch all small wikis to use extension1 db [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92549 (owner: 10Bsitu) [17:45:58] !log bsitu synchronized wmf-config/InitialiseSettings.php 'Switch all small wikis to use extension1 db' [17:46:15] Logged the message, Master [17:55:57] dr0ptp4kt: ping [17:56:11] paravoid, just sent a "chat" invitation from gmail. you available? [17:57:20] haven't got anything [17:58:18] see it now? [18:09:16] apergos: heya, I asume you're close to crashing time, could I ask you for help on https://bugzilla.wikimedia.org/show_bug.cgi?id=48501 ? I just cc'd you. [18:11:07] I'm definitely on my 'done with work' time, man this is a long bug report [18:11:26] :) [18:11:35] see comment 46 for what needs to happen [18:12:07] I'm looking at it now, having read about 20 other comments to figure that out [18:12:40] so already I'm wondering how step 1 is going to work [18:13:05] right, the comment I just added is: "I think the problem is that no one is quite sure how to do those things. I'm [18:13:08] it seems that as ryan says near the end this is not blocked on ops [18:13:08] not sure otherwise I would JFDI." [18:13:32] well, just no one knows how to do those things other than the opsen cc'd on the list, from what I can tell [18:13:40] no, I'm saying it's never been blocked on ops [18:13:41] anyone that has projectadmin in deployment-prep ... 
[18:13:44] and I've been saying that for months now [18:14:00] but no one is taking ownership of the bug [18:14:07] right, anyone with a labs account *can* make changes, but I don't think people know what changes to make [18:14:12] so I guess "someone" would make a list of the current projectadmins who are volunteers [18:14:20] scramble around to get them all to sign ndas [18:14:27] or remove the ones who say they are't interested [18:15:10] rob would be the cert buying person but someone would have to rt ticket that [18:15:24] basically: I'm offering to to things, but I don't know what to do. [18:15:51] are you a project admin on deployment-prep? [18:16:01] no, but I assume I could be [18:16:07] :) [18:16:11] yes that's true [18:16:24] (guh, bz is fighting me, sorry for all the bugmail) [18:16:27] so here is what I would do if I were you and wanting to move this along [18:16:45] 1) get hashar to make you a project admin (or somene else over there who's active and is an admin ) [18:17:11] hexmode: Reedy: have you managed to grant MarkAHershberger build authorization in jenkins ? [18:17:16] 2) look at the current list of who is and see who is willing to sign an nda and who isn't, find out from legal what that nda looks like, etc etc [18:17:24] hashar: Nope... [18:17:28] Half of the pages wouldn't load [18:17:34] until everybody has them or is off [18:17:44] Reedy: poor jenkins :( [18:17:49] apergos: 2) will take a bit :) I guess I'll start on that [18:18:08] ok [18:18:25] Reedy: for reference, that is done in https://integration.wikimedia.org/ci/configureSecurity/? [18:18:31] <^d> greg-g: Who needs removing? [18:18:34] then 3) would be in rt 'hey rob, we need these certs, etc' [18:18:57] <^d> I only see 3 non-staff in here anyway. [18:19:12] <^d> Meh, 5, can't count. [18:19:30] apergos: /me nods, thanks [18:19:44] ^d: where's the list? /me is lazy to look for it right now [18:19:44] math is hard... [18:19:55] <^d> https://wikitech.wikimedia.org/wiki/Special:NovaProject [18:20:00] <^d> deployment-prep in the filter [18:20:06] * greg-g nods [18:20:10] <^d> People who have projectadmin. [18:20:21] Reedy: hexmode should have access now :) if not follow up by email :] [18:20:40] uhhh, I don't see it listed, I guess I have to be a member of it first? [18:20:44] I only see bastion [18:20:46] greg-g: apergos: are you referring to the SSL certificates on beta ? [18:20:51] yeah [18:21:05] potentially we "just" have to forgive access on the varnish caches [18:21:25] I'm looking at this bug hashar: [18:21:29] hashar: heya, can you add me as an admin to deployment-prep? :) [18:21:39] <^d> greg-g: Added you to the project. [18:21:40] https://bugzilla.wikimedia.org/show_bug.cgi?id=48501 world;s longest but tl;dr: see comment 46 [18:21:40] greg-g: username ? [18:21:57] which one? wikitech or shell? [18:21:58] thanks Chad! [18:22:19] wikitech: Greg Grossmeier, shell: gjg [18:22:25] <^d> Ryan_Lane: Random stupid bug, you can't add "Foo_Bar" as a user to a project, just "Foo Bar". I would expect them to be interchangeable since they're wiki usernames. [18:22:40] yeah [18:22:47] ^d: open a bug in OpenStackManager extension :) [18:23:02] greg-g: apergos potentially we could tweak the sudo policy to prevent root access on cache to people we don't trust. [18:23:40] greg-g: I created a under_NDA group in https://wikitech.wikimedia.org/wiki/Special:NovaSudoer [18:23:45] anyway, dinner timer sorry [18:23:59] yes, that's listed as a must do [18:24:01] hey ^d mind adding me as an admin to deployment-prep? 
:) [18:24:08] cya [18:24:14] hashar: ooh, nice, we've wanted that for quite a while [18:24:22] yeah I am eating my dessert first (bad) and now need to think about dinner [18:24:35] <^d> greg-g: Somebody beat me to it :) [18:24:35] ^d: ah, you did, thanks echo [18:24:44] Chad added you to project Nova Resource:Deployment-prep [18:24:44] 2 minutes ago [18:24:54] <^d> Yeah I added you to project. [18:24:58] <^d> Someone else must've made you admin. [18:25:04] <^d> I suspect hashar :) [18:25:20] ahhh, I see [18:25:52] I wonder who has dealt with legal about ndas before, they would probably have some tips (I have not) [18:26:06] * apergos eyes mutante [18:26:55] :) [18:28:20] (03PS1) 10Brion VIBBER: Whitelist API mobileview for robots.txt [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92552 [18:28:56] (03PS1) 10Cmjohnson: Removing decommissioned db's db11-db28, db30 from dhcpd, dsh groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/92553 [18:31:42] (03CR) 10MaxSem: "Note that while parser cache is indeed used, processed mobileview results smaller than 64k aren't cached to reduce memcached usage, so Goo" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92552 (owner: 10Brion VIBBER) [18:32:51] (03CR) 10jenkins-bot: [V: 04-1] Whitelist API mobileview for robots.txt [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92552 (owner: 10Brion VIBBER) [18:35:11] kaldari: can I assing you to the ssl bug for this part of it? (scrubing deploy-prep sudoers) [18:36:07] er, "1. Remove projectadmin permissions from volunteers" I mean [18:36:24] which really should be "1. Remove projectadmin permissions from non-NDA'd people" [18:37:23] legal says Jan Gerber is cool [18:37:31] i.e. J [18:37:32] cool [18:37:43] kaldari: how'd you get that info? [18:37:50] talking to Luis [18:37:55] (03CR) 10Cmjohnson: [C: 032] Removing decommissioned db's db11-db28, db30 from dhcpd, dsh groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/92553 (owner: 10Cmjohnson) [18:37:55] do they have a list they can put on office. or something? [18:38:47] how did you find out what legal says so fast? [18:38:49] kaldari: [18:39:24] apergos: I assume he walked over there [18:39:26] oh.. the old 'walk over to the desk' trick? [18:39:27] meh [18:39:29] :) [18:39:58] the "my question is way more important than whatever you're doing right now" trick [18:39:59] ok it would be really good for that to be recorded in an email in rt or something, before the end of all thi [18:40:08] s [18:40:09] +1, kaldari ^ [18:40:18] OK, Luis says TheDJ (Dirk-Jan) has no NDA [18:40:33] and Mdale is in no-man's land [18:40:43] so probably best to remove both of them [18:41:27] * greg-g nods [18:41:36] Luis says he owes Ken an email about this, but he hasn't had time to put it together yet [18:41:52] ok so ... if I were you, which I am not, I would ask them whether they are willing to sign ndas before just removing them [18:42:11] (because if I were a volunteer projectadmin I might actually care) [18:42:15] apergos: sure, I'll send an email to them now [18:42:34] at least the list is short [18:43:35] this is so great, so next time somebody asks for stuff like mailman info i can just lookup in LDAP, that's what you're doing right [18:43:45] i mean, checking NDA status [18:43:57] and I assume that aude is OK, correct? [18:44:35] or do I need to email her as well? 
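The "gpg: Good signature" check done by hand above before publishing the 1.22 files can also be scripted. A sketch follows; the artifact filenames are assumptions, since only the http://dumps.wikimedia.org/mediawiki/1.22/ directory appears in the log, and the verification is simply shelled out to gpg.

```python
"""Scripted version of the "gpg: Good signature" check done above.

Downloads a release artifact plus its detached signature and shells out
to gpg to verify. Filenames are assumed; only the 1.22 directory URL
appears in this log.
"""
import subprocess
import urllib.request

BASE = 'http://dumps.wikimedia.org/mediawiki/1.22/'
TARBALL = 'mediawiki-1.22.0rc0.tar.gz'   # assumed name
SIGNATURE = TARBALL + '.sig'             # assumed detached signature

for name in (TARBALL, SIGNATURE):
    urllib.request.urlretrieve(BASE + name, name)

# Assumes the release manager's public key is already in the local keyring;
# gpg exits non-zero on a bad or unknown signature, so check_call raises then.
subprocess.check_call(['gpg', '--verify', SIGNATURE, TARBALL])
print('gpg reported a good signature for', TARBALL)
```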
[18:44:50] no idea [18:44:57] she is WMDE staff [18:45:10] mutante: no, he's talking to lvilla [18:45:21] apergos: in the past i asked Philippe for that usually [18:45:21] (03PS1) 10Cmjohnson: Removing all dns entries for decom'd db11-28, 30 [operations/dns] - 10https://gerrit.wikimedia.org/r/92558 [18:45:33] but we talked about wanting it in LDAP forever [18:45:44] wasn't sumanah going to manage that? [18:46:04] Ryan_Lane: are chapter staff kosher for projectadmin access on beta labs? [18:46:09] as long as the process to keep it updated works ... [18:46:19] without "walk over to desk" [18:46:19] if they have an NDA, yes [18:46:25] otherwise, no [18:46:37] OK, I'll double check [18:46:44] um yeah I dunno what chapter staff have by default if anything [18:46:46] maybe nothing [18:46:54] hey aude, have you signed an NDA? [18:47:02] with the WMF [18:47:22] Reedy: can you deploy this tomorrow https://gerrit.wikimedia.org/r/#/c/92552/ ? see Ops list :) [18:48:04] * greg-g goes to make lunch [18:49:23] (03CR) 10Cmjohnson: [C: 032] Removing all dns entries for decom'd db11-28, 30 [operations/dns] - 10https://gerrit.wikimedia.org/r/92558 (owner: 10Cmjohnson) [18:49:51] !log dns update [18:50:04] Logged the message, Master [18:50:05] Yeah, I'm still waiting on official word on this, but for now, lets just make sure there's a signed NDA *nods* [18:50:14] ori-l: hey; packages are in apt, not sure if you saw the gerrit comment [18:51:39] Oh! Luis' ears must have been burning, he just reached out. ;) [18:54:29] apergos: just emailed TheDJ and aude. Looks like ^d already removed Mdale, which is probably fine since he's inactive. [18:54:44] (as far as I know) [18:54:55] <^d> Yeah I hadn't seen him around for the better part of a year. [18:59:42] (03PS1) 10Cmjohnson: Decomming mw50, never been used removing dns entries [operations/dns] - 10https://gerrit.wikimedia.org/r/92559 [19:01:30] I can't open an RT ticket b/c I can't see the queues. But I need get a tarfile to download.wm.o [19:01:53] hexmode: hi, i can help, what's the ticket number [19:02:12] mutante: no ticket b/c I can't create one [19:02:13] you say you need a file attached to a ticket? [19:02:19] no [19:02:34] just need files uploaded to download.wm.o [19:02:54] mutante: these: http://mah.everybody.org/tmp/ [19:03:45] (03PS1) 10Cmjohnson: Removing dhcp entry for mw50 [operations/puppet] - 10https://gerrit.wikimedia.org/r/92561 [19:04:32] hexmode: hold on, i'm on it [19:04:37] :) [19:06:33] (03CR) 10Cmjohnson: [C: 032] Decomming mw50, never been used removing dns entries [operations/dns] - 10https://gerrit.wikimedia.org/r/92559 (owner: 10Cmjohnson) [19:06:55] !log dns update [19:07:27] (03CR) 10Cmjohnson: [C: 032] Removing dhcp entry for mw50 [operations/puppet] - 10https://gerrit.wikimedia.org/r/92561 (owner: 10Cmjohnson) [19:12:56] hexmode: http://dumps.wikimedia.org/mediawiki/1.22/ .. (gpg: Good signature ... ) [19:13:09] paravoid: wooooo! no, i missed the comment -- that's awesome! [19:13:12] thank you! 
[19:14:02] !log create mw/1.22 dir on dumps.wm.org, publish mw 1.22.0rc0 - http://dumps.wikimedia.org/mediawiki/1.22/ [19:14:18] Logged the message, Master [19:14:42] Sending an email :) [19:14:54] ori-l: completely untested [19:14:57] ty ty [19:15:25] paravoid: that's how we like it [19:37:44] (03PS2) 10Reedy: Whitelist API mobileview for robots.txt [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92552 (owner: 10Brion VIBBER) [19:38:20] (03PS1) 10Reedy: Minor changes to robots.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92566 [19:39:04] (03PS1) 10Cmjohnson: More dsh files for removing db11-28,30 [operations/puppet] - 10https://gerrit.wikimedia.org/r/92567 [19:50:07] (03CR) 10Brion VIBBER: "Max, do we want to run some load stress tests or something to make sure this won't be a problem?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92552 (owner: 10Brion VIBBER) [19:55:35] (03CR) 10Cmjohnson: [C: 032] More dsh files for removing db11-28,30 [operations/puppet] - 10https://gerrit.wikimedia.org/r/92567 (owner: 10Cmjohnson) [19:59:28] (03CR) 10MaxSem: "Per that email thread, this should be like +30% to our current mobile request rate. Doesn't sound like a showstopper, and we had no object" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92552 (owner: 10Brion VIBBER) [20:03:11] (03PS1) 10Dzahn: install gnupg on download-wikimedia server [operations/puppet] - 10https://gerrit.wikimedia.org/r/92570 [20:05:13] (03PS1) 10Cmjohnson: fixing mw50 dns removal [operations/dns] - 10https://gerrit.wikimedia.org/r/92571 [20:05:41] (03CR) 10Cmjohnson: [C: 032] fixing mw50 dns removal [operations/dns] - 10https://gerrit.wikimedia.org/r/92571 (owner: 10Cmjohnson) [20:06:08] !log dns update [20:08:13] PROBLEM - RAID on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:14:13] RECOVERY - RAID on terbium is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [20:20:13] PROBLEM - RAID on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:21:23] PROBLEM - SSH on terbium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:03] RECOVERY - RAID on terbium is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [20:22:13] RECOVERY - SSH on terbium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [20:26:13] PROBLEM - RAID on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:29:33] what's up with terbium? [20:31:03] RECOVERY - RAID on terbium is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [20:31:59] seems ok now... odd though [20:32:12] its our hume replacement, so its not a trivial server. [20:32:43] oh [20:32:47] someone is running some heavy job [20:32:50] cmjohnson1: ^ [20:32:57] which makes sense, since its a script host [20:33:03] but yea, not awesome. [20:33:13] ah..okay but not cool indeed [20:33:49] well, if its deploy related, its oen time stuff [20:33:55] so it just is what it is [20:34:17] its only really an issue if entire box locks up, or if its a regularly run maintainance script [20:36:10] granted, our monitoring for that consists mostly of 'if lots of folks are complaining then someone has a bad script and needs to get it killed' [20:36:26] and/or/including icinga warnings occurring regularly. [20:36:32] Possibly paravoid? 
[20:36:40] He is/was running cleanupUploadStash.php [20:36:42] I know he was running some stuff recently [20:36:48] and that looks liek its there yes [20:36:50] so i think its fine [20:36:59] box is up, a few false positives are not end of world. [20:37:09] script has to run someplace. [20:37:26] true [20:37:27] it's very unresponsive [20:37:40] very slow reedy [20:37:42] there are cirus search updates running as well [20:37:48] load average: 26.14, 23.79, 26.13 [20:37:54] which i thought was over already [20:38:10] the update window for that is over, but the scripts fired for it are not. [20:38:16] check with ^demon|lunch [20:38:34] heh, you just did for me =] [20:38:35] <^demon|lunch> Hm? [20:38:47] terbium is slow, im looking at jobs and i see cirus search stuff [20:38:52] but dunno if its causing it, just asking about them [20:39:03] apache 8918 77.1 27.2 9350428 8972504 pts/27 Dl+ 16:11 206:15 php /a/common/multiversion/MWScript.php extensions/CirrusSearch/maintenance/forceSearchIndex.php --wiki wikidatawiki --fromId 0 --toId 2096950 --batch-size 100 [20:39:05] long running [20:39:08] <^demon|lunch> Yeah. [20:39:13] PROBLEM - RAID on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:39:13] expected? [20:39:14] <^demon|lunch> It's got...a ways...to go [20:39:26] terbium is suffering under load [20:39:37] would those be causing this, we could offload to a different server. [20:39:46] apache 4704 84.2 34.6 11664224 11373612 pts/33 D+ 17:32 157:07 php /a/common/multiversion/MWScript.php copyFileBackend.php commonswiki --src local-swift --dst local-swift-eqiad --syncviadelete --prestat --containers local-public local-transcoded --skiphash [20:39:49] <^demon|lunch> Very likely to be the cause, yes. [20:40:01] <^demon|lunch> RobH: I'm fine with offloading it somewhere if you've got a better place :) [20:40:54] ^demon|lunch: soooooo, what does that need to run on antoher server if i spin up some 1u box for this with same puppet stuff as terbium should be fine? [20:40:54] and will this be a regular search thing or just during implementation? [20:40:54] ie: short or long term server allocation [20:40:54] i'm more prone to spec something specifically for this job if its longer term [20:41:17] <^demon|lunch> Just during implementation. [20:41:20] i can toss a high performance dual cpu at it [20:41:23] 16gb memory [20:41:24] <^demon|lunch> And recovery from disaster :\ [20:41:37] ^demon|lunch: search suggestions seem to be down again on beta, manybubbles fixed it this morning. I don't suppose there is some easy/automate-able way to maintain that? [20:42:07] <^demon|lunch> chrismcmahon: Not right now, no. [20:42:13] PROBLEM - DPKG on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:42:23] PROBLEM - SSH on terbium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:42:24] <^demon|lunch> RobH: We've toyed with a few ideas to help spread the load out across $something [20:42:30] <^demon|lunch> Ok, I'm terminating. [20:42:33] ^demon|lunch: Ok, I am going to drop an RT ticket for this and link you in as requestor [20:42:43] we'll get a box spun up today for you for those scripts. [20:42:58] <^demon|lunch> Long term we need a better way to deal with this. 
[20:43:03] RECOVERY - DPKG on terbium is OK: All packages OK [20:43:03] RECOVERY - RAID on terbium is OK: OK: State is Optimal, checked 1 logical drive(s), 2 physical drive(s) [20:43:11] yep, but i dont wanna slow down your project, and i dont wanna let terbium die [20:43:13] RECOVERY - SSH on terbium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [20:43:16] so this is the stopgap for today =] [20:43:46] (unless you think you'll have a more scalable solution for this in next 48 hours, I assume you need those scripts to run to move forward) [20:43:50] ? [20:43:58] or next X hours [20:44:04] i dunno what timeframe you guys have for this [20:44:18] I just don't want operations to be a blocker in any fashion. [20:48:41] <^demon|lunch> RobH: I definitely won't have it within the next day or two. [20:48:47] cool [20:48:53] so yea, a short term server is needed then right? [20:49:00] I just dont wanna be misrepresenting in rt ticket. [20:49:09] <^demon|lunch> That'd be great as a stopgap, yeah. [20:49:16] <^demon|lunch> Keep me from hurting terbium :) [20:49:35] https://rt.wikimedia.org/Ticket/Display.html?id=6120 [20:49:47] cool, just reply confirming for record of server allocations if you dont mind [20:49:50] RobH: so, the bugzilla cert issue.. the root cause is.. bug-attachment.wm.org.key is not a key, it's a cert [20:50:00] this can benefit from multicpu yes? [20:50:08] <^demon|lunch> !log elasticsearch: stopped wikidatawiki indexing processes on terbium, causing too much load. working with RobH on getting an alternative place to run these scripts. [20:50:13] <^demon|lunch> RobH: Absolutely :) [20:50:26] Logged the message, Master [20:51:10] cool, dual cpu 16gb ram r610 it is. [20:51:11] arsenic [20:51:11] assuming ops agrees and no one blocks. [20:51:11] <^demon|lunch> RobH: "No permission to view ticket" [20:51:31] hrmm [20:51:36] well fu3124 [20:51:39] @!#$!@ even. [20:51:41] oh procurement [20:51:46] indeed. [20:52:00] ^demon|lunch: should have emailed you a copy of it [20:52:05] you can just reply to email [20:52:13] i dont feel like messing with rt root account at the moment. [20:52:26] im moving ahead as if i have approval ;] [20:52:47] 'be bold' [20:53:32] <^demon|lunch> You know, there's a second half to the "Be bold" statement ;-) [20:53:52] the second half doesn't help me in as many situations. [20:54:17] is the second half "be italic" ? [20:54:26] i like to emulate the mainstream media and just pick and choose what i like from statements to support whatever viewpoint I formed well before the conversation. [20:54:47] be a sockpuppet! [20:55:16] (WMF) sockpuppet required [20:55:19] RobH: shhh! the channel is logged! [20:55:32] RobH: so.. i need the private key for a cert :) [20:55:41] to fix it on sockpuppet [20:55:48] mutante: .... [20:55:55] check on tridge? [20:55:58] ssl certs directory [20:56:12] man i hope its in there. [20:56:30] if not i get to plug in my time machine drive. [20:56:32] /data/ssl_certs/ .no luck [20:56:37] well, fuck. [20:56:52] the file that is called .key and on sockpuppet.. it's not a key [20:57:14] yea, and no doubt i had it, but once i thought i put in repo [20:57:23] i deleted it. [20:57:29] cuz i shouldnt store private keys for cluster on laptop! [20:57:48] ........shit shit shit. [20:58:01] i am pretty sure i keep those in a non time machine directory too for just that reason. [20:58:03] RobH: it's just the bug-attachments, not both [20:58:07] but perhaps it put it in default home.
[20:58:09] they might let you rekey it [20:58:19] they prolly will if i reach out to the rep [20:58:25] but blarg [20:58:28] bad mistake on my part. [20:58:38] sounds like it should be rekeyed [20:58:43] shit happens, it's obviously just mixed up filenames [20:59:00] yea, lemme pull up the receipt for that order and start the process. [20:59:02] no worries, it can wait a little while longer [20:59:13] actually, lemme finish arsenic allocation, then that [20:59:18] cuz im half done in this [20:59:20] yes, sure, one by one [20:59:26] go ahead with that [21:00:04] <^demon|lunch> LeslieCarr: "... when updating the encyclopedia" [21:00:13] hehe [21:01:06] RobH: as long as it wasn't switched around both ways and private key was on public puppet, but it's not, i checked :) [21:01:06] <^demon|lunch> RobH: Responded via e-mail to the ticket. [21:03:26] (03PS1) 10RobH: arsenic dns entry [operations/dns] - 10https://gerrit.wikimedia.org/r/92579 [21:03:33] ah, or is that the same private key for bugzilla and bug-attachment [21:03:42] no, different keys [21:03:50] gotcha, just thinking out loud [21:03:52] seemed easier [21:04:03] and likely better, yep [21:05:29] !log one of the sdtpa <-> eqiad links is down [21:05:43] Logged the message, Mistress of the network gear. [21:06:02] LeslieCarr: was that expected? [21:07:00] no [21:07:03] not expected [21:07:13] sadness [21:10:20] (03CR) 10RobH: [C: 032] arsenic dns entry [operations/dns] - 10https://gerrit.wikimedia.org/r/92579 (owner: 10RobH) [21:12:51] !log maxsem synchronized php-1.22wmf22/extensions/MobileFrontend/ 'https://gerrit.wikimedia.org/r/#/c/92474/' [21:13:07] Logged the message, Master [21:14:50] !log maxsem synchronized php-1.23wmf1/extensions/MobileFrontend/ 'https://gerrit.wikimedia.org/r/#/c/92474/' [21:15:06] Logged the message, Master [21:18:27] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000 [21:21:37] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:22:26] and it could even be true (not sure about 2nd)
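On the key/cert mixup above: openssl makes it quick to confirm that a file named .key actually holds a certificate, and to test whether any candidate private key that turns up really belongs to the bug-attachment certificate. The file names below are placeholders, not paths from the log.

    # A PEM private key begins with "-----BEGIN (RSA) PRIVATE KEY-----",
    # a certificate with "-----BEGIN CERTIFICATE-----".
    head -1 bug-attachment.wikimedia.org.key

    # Only one of these two should parse cleanly.
    openssl rsa  -in bug-attachment.wikimedia.org.key -noout -check
    openssl x509 -in bug-attachment.wikimedia.org.key -noout -subject

    # A private key matches a certificate when their RSA moduli are identical.
    openssl x509 -noout -modulus -in bug-attachment.wikimedia.org.crt | openssl md5
    openssl rsa  -noout -modulus -in candidate.key                    | openssl md5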
ah, heh [22:21:35] using Trebuchet MS [22:22:36] well, ok, couple of points: one, that etherpad came out of not knowing what the current frontend looks like [22:22:46] so if it appears to be disconnected from facts on the ground, well :) [22:24:16] i think we haven't drawn up explicit requirements and in the absence of that we don't really know what is done and what needs to be done, so any specs would be good IMO [22:24:35] even there's a lot of RESOLVED WONTFIX that comes out of that [22:24:37] *even if [22:24:56] (03PS1) 10RobH: arsenic was in wrong section [operations/dns] - 10https://gerrit.wikimedia.org/r/92587 [22:25:41] (03CR) 10RobH: [C: 032] arsenic was in wrong section [operations/dns] - 10https://gerrit.wikimedia.org/r/92587 (owner: 10RobH) [22:34:02] cmjohnson1: can i paste you a list of servers that appear decom'ed to me for crosscheck [22:34:08] sure [22:34:17] doing the 'what's left in tampa' thing [22:34:20] ok, query [22:34:30] ori-l: the frontend is exactly like it currently is [22:34:35] git deploy start [22:34:38] git deploy sync [22:34:41] git deploy abort [22:35:31] i'm confused -- why are we swapping the perl git-deploy script with sartoris if they're both front-ends that provide the same interface? [22:35:53] 1: we don't want to have both perl and python [22:38:56] 2: because we wanted a smaller set of functionality [22:38:59] 2: there is no 2 [22:39:00] dangit [22:39:04] and we wanted to control the output [22:40:32] also, the initial spec was fine [22:40:47] if we want to re-write the frontend again we can [22:41:45] <^d> The annoying part was never the frontend, it was trying to push those huge git repos around to a bajillion apaches. [22:43:07] not rewrite, but tweak? anyway, again: that came out of a core meeting where we weren't sure what is up to us to implement and what is something we can just expect to be in place [22:44:13] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000 [22:44:56] so can we define some point where from your perspective things are set up and it's up to us to iterate, when it comes to the whole mw deploy process? [22:46:43] there are things that obviously fall within the scope of trebuchet, like using git to manage deltas [22:47:03] and there are things that obviously fall outside, like writing the script to build cdb files on each apache (if we go that route) [22:47:23] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:47:45] but there's a big gap in between where it's not clear who's doing what [22:49:01] all the frontend currently does is this: [22:49:05] git deploy start [22:49:09] ^^ writes a lock file [22:49:16] git deploy sync [22:49:25] ^^ runs a sync script (that we control) [22:49:29] git deploy abort [22:49:50] ^^ rolls back to some state, which I think is a tag written after start [22:50:06] sorry, let me re-write that [22:50:09] git deploy start [22:50:30] ^^ writes a lock file, writes out a tag that defines the state of things when deploy started [22:50:34] git deploy sync [22:50:57] ^^ writes a tag that defines the state of things when the sync began and calls a script that we control [22:51:01] git deploy abort [22:51:10] ^^ rolls back to the tag at the start [22:51:26] both abort and sync will automatically call: git deploy finish [22:51:30] which will remove the lock file [22:51:35] that's all the frontend really does [22:52:07] it has some other function calls that will do things like list deployment tags and such [22:53:20] the sync script calls the deploy.fetch runner, then will go into feedback mode. it'll then call the deploy.checkout runner, and go into feedback mode [22:54:13] btw, I responded to your questions on the etherpad [22:54:24] right, but that seems like a pretty low-level API, no reason not to wrap it in some higher-level set of scripts, right? [22:54:29] ori-l: ^ check them out as well, plz [22:54:52] that kind of mirrors git, with the whole porcelain / non-porcelain subcommands distinction [22:57:16] we're going to write a frontend to the frontend? :) [22:57:53] or make the frontend more clearer for deployers [22:57:58] more clearer? [22:58:00] whatever [22:58:14] basically, it is way too verbose right now [22:58:23] too verbose in which way? [22:58:26] we already have info overload in the current system, let's not make that mistake again :) [22:58:37] too much text that is unneeded [22:58:40] yeah [22:58:45] the new frontend doesn't have any of that [22:58:49] that's the crappy perl frontend [22:58:58] and you can't suppress the messaging [22:59:06] ah, then let's take a look at that :) [22:59:07] that was my original motivation for killing it off [22:59:41] ok, so, I'll ignore this etherpad for a bit until I see the new frontend in action [23:00:23] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000 [23:02:24] ok, headache/eye strain is coming back, I'll be online later [23:02:33] greg-g: feel better [23:03:23] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:04:41] (03CR) 10Yurik: "A while back I heard something about 6 months, not 30 days...
Will need to verify" [operations/puppet] - 10https://gerrit.wikimedia.org/r/92288 (owner: 10Matanya) [23:05:04] the new frontend is exactly like the old one, but the BS output isn't there [23:05:17] it will still go into the fetch/checkout confirmation interface, though [23:05:23] that occurs in the sync script [23:09:53] (03Abandoned) 10Dzahn: remove search21-50, keep search13-20 [operations/dns] - 10https://gerrit.wikimedia.org/r/91638 (owner: 10Dzahn) [23:11:00] ^d: so the initial puppet run is finishing up on arsenic, once its done you'll have same rights on it as on terbium [23:11:10] but its a temp dedicated server for you [23:11:26] I'm putting a follow up RT task to check with you in 90 days about its continued use =] [23:58:56] (03PS1) 10Dzahn: dsh: split the misc-servers into dc-specific files [operations/puppet] - 10https://gerrit.wikimedia.org/r/92599
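Pulling the pieces of that explanation together, the deployer-facing workflow described above amounts to the short session sketched below. The command names come straight from the discussion; the repository path is a placeholder, and the comments paraphrase the behaviour as described in this conversation rather than the tool's documentation.

    cd /srv/deployment/example/repo   # placeholder path to a deployment checkout

    git deploy start    # grabs the lock and tags the pre-deploy state
    # ...update the working copy: fetch, cherry-pick, edit config, etc...
    git deploy sync     # tags the state at sync time, then runs the sync script,
                        # which drives the fetch/checkout runners and the
                        # feedback/confirmation interface

    # If something goes wrong before syncing is complete:
    git deploy abort    # rolls back to the tag written at 'start'

    # Both sync and abort end by calling 'git deploy finish', which drops the lock.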