[00:00:49] Coren|Food: around ? (i'm guessing not due to your nick) [00:01:04] I am just back. [00:01:08] whazza? [00:01:30] * Jasper_Deng repokes LeslieCarr [00:01:41] labstore access [00:01:43] binasher: I don't see too many lock wait timeouts for term_entity_type, term_entity_id queries [00:02:12] LeslieCarr: Either the problem wasn't at the network layer, or you fixed something. :-) [00:02:35] Cuz the packets started flowing once I beat up the NFS client. [00:02:44] i didn't fix anything [00:02:46] * AaronSchulz actually wonders where all the lock errors went lately...guess someone fixed them or some bots are editing less [00:02:47] AaronSchulz: wikidatawiki lock timeouts actually dropped quite a lot after s5 went mariadb [00:03:00] aha [00:03:21] that might explain [00:03:24] LeslieCarr: So the packets not getting to the server was a symptom, not the cause. Thanks for checking. :-) [00:03:35] AaronSchulz: i don't think that wb_term deletes are plentiful right now, though i expect they'll become more so over time [00:04:46] binasher: I wonder what the cost of not being unique and not using the insert buffer will be [00:04:53] *of being unique [00:11:01] AaronSchulz: that's a valid question. the best choice would be for them to only delete by term_row_id instead of on entity_type+id [00:11:45] they also want to perform updates on entity_type+id though [00:12:07] if it's unique, that avoids gap locking [00:13:27] so some of these queries are not happening yet? [00:14:39] * AaronSchulz should head out soon [00:30:48] AaronSchulz: the write queries on wb_terms are mostly inserts and some deletes, but no updates at all. 
it's possible that daniel was speaking hypothetically about the update queries in bugzilla [00:31:13] but it seems like term_search_key would at least be subject to change [00:31:44] New patchset: MaxSem; "$wgMFRemovableClasses overhaul" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66891 [01:02:48] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset 0.004111051559 secs [01:31:56] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.002082228661 secs [02:09:55] !log LocalisationUpdate completed (1.22wmf5) at Tue Jun 4 02:09:55 UTC 2013 [02:10:05] Logged the message, Master [02:16:08] !log LocalisationUpdate completed (1.22wmf4) at Tue Jun 4 02:16:08 UTC 2013 [02:16:15] Logged the message, Master [02:36:23] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Jun 4 02:36:22 UTC 2013 [02:36:30] Logged the message, Master [03:06:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:07:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [03:55:10] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [03:55:10] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [03:55:10] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [03:55:10] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [03:55:10] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: No successful Puppet run in the last 10 hours [03:55:11] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [03:55:11] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [03:55:12] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [03:55:12] PROBLEM - 
Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours [03:55:13] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [03:55:13] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [03:55:14] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [04:56:56] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [05:00:47] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [05:45:56] New patchset: ArielGlenn; "wikiretriever can now get user info for all users of a wiki" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/66876 [05:47:48] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/66876 [06:28:20] New patchset: ArielGlenn; "description for wb_terms table in dump; this completes bug #44844" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/66897 [06:28:53] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/66897 [06:34:01] thanks apergos :) [06:34:33] yw [06:34:50] I haven't deployed it yet (though I ran it on the most recent run manually so it's there in the latets run [06:34:50] ) [06:35:23] ok [06:35:28] I feel like it's time to write term papers and instead of that I'm cleaning my cat litter box [06:35:39] heh [06:35:39] (shoud do reviews, am instead going through bugzilla :-D) [06:35:45] *should [06:36:07] * aude off to the office [06:36:12] enjoy [07:31:28] morning [07:32:48] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.00158393383 secs [07:41:30] New review: Hashar; "recheck" [operations/puppet/zookeeper] (master) - https://gerrit.wikimedia.org/r/66882 [07:42:27] New patchset: Hashar; "Jenkins job validation (DO NOT SUBMIT)" [operations/puppet/zookeeper] (master) - https://gerrit.wikimedia.org/r/66906 [07:42:55] New review: Hashar; "recheck" 
[operations/puppet/zookeeper] (master) - https://gerrit.wikimedia.org/r/66882 [07:44:16] is there someone with bugzilla admin rights who can delete https://bugzilla.wikimedia.org/show_bug.cgi?id=49099 ? [07:45:48] drdee: just close it :-D [07:46:01] yeahhhh but it's obvious spam [07:46:07] I can mark the comment private and close it [07:46:13] the mail notification already got sent though :( [07:46:33] done :) [07:46:55] New review: Hashar; "recheck" [operations/puppet/zookeeper] (master) - https://gerrit.wikimedia.org/r/66882 [07:47:33] New review: Hashar; "Sorry for the Jenkins spam on this change. We now have puppet lint + erb lint on operations/puppet/z..." [operations/puppet/zookeeper] (master) - https://gerrit.wikimedia.org/r/66882 [07:47:43] Change abandoned: Hashar; "(no reason)" [operations/puppet/zookeeper] (master) - https://gerrit.wikimedia.org/r/66906 [07:50:38] drdee: and mailed Andre (you are in cc) for his information [07:50:59] thanks hashar! [08:02:32] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.003262877464 secs [08:02:59] yeah we don't try to delete things, we mark them as private, deleting turns out to be hard. 
[08:03:13] k, didn't know [08:03:17] sure [09:34:31] New patchset: Akosiaris; "Pin cloudera packages at 4.2.1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66950 [09:46:09] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100% [09:46:59] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 3.24 ms [09:50:46] New patchset: Daniel Kinzler; "add "/entity/" redirects for wikidata per" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/65463 [09:53:35] Change merged: Akosiaris; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66950 [10:15:19] New review: Daniel Kinzler; "(1 comment)" [operations/apache-config] (master) C: -1; - https://gerrit.wikimedia.org/r/65463 [10:28:46] New review: Aude; "(1 comment)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/65463 [10:47:18] is it possible to get the number of jobs in the queue for a given type? [10:50:08] for commonswiki [11:06:19] what the. [11:06:51] why am I getting blank pages whenever I load a new page on Wikiquote, I wonder... 
[11:13:39] j^: you could probably harass Reedy to get them [11:57:44] New patchset: Hashar; "jenkins: add in ganglia monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66960 [12:23:47] PROBLEM - Disk space on ms2 is CRITICAL: NRPE: Command check_disk_space not defined [12:23:47] PROBLEM - RAID on db44 is CRITICAL: CRITICAL: Degraded [12:23:47] PROBLEM - RAID on virt1 is CRITICAL: Connection refused by host [12:23:47] PROBLEM - twemproxy process on terbium is CRITICAL: NRPE: Command check_twemproxy not defined [12:23:47] PROBLEM - DPKG on virt6 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:23:47] PROBLEM - SSH on virt1 is CRITICAL: Connection refused [12:23:47] PROBLEM - NTP on nescio is CRITICAL: NTP CRITICAL: Offset unknown [12:23:48] PROBLEM - search indices - check lucene status page on search1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 60120 bytes in 0.009 second response time [12:23:48] PROBLEM - search indices - check lucene status page on search1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 52911 bytes in 0.019 second response time [12:23:56] PROBLEM - twemproxy process on fenari is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [12:23:56] PROBLEM - Parsoid on wtp1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:23:56] PROBLEM - search indices - check lucene status page on search1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 58845 bytes in 0.015 second response time [12:23:56] PROBLEM - mysqld processes on db44 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [12:24:06] PROBLEM - twemproxy process on mw15 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [12:24:06] PROBLEM - Disk space on virt3 is CRITICAL: Connection refused by host [12:25:59] PROBLEM - twemproxy process on mw9 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 
(nobody), command name nutcracker [12:25:59] PROBLEM - twemproxy process on mw61 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [12:25:59] PROBLEM - twemproxy process on mw12 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [12:26:47] PROBLEM - swift-container-updater on ms-be1 is CRITICAL: Connection refused by host [12:26:47] PROBLEM - DPKG on ms-be1 is CRITICAL: Connection refused by host [12:26:48] PROBLEM - Disk space on ms-be1 is CRITICAL: Connection refused by host [12:26:48] PROBLEM - twemproxy process on mw7 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [12:26:56] PROBLEM - swift-account-server on ms-be1 is CRITICAL: Connection refused by host [12:26:57] PROBLEM - swift-object-auditor on ms-be1 is CRITICAL: Connection refused by host [12:31:27] Reedy: is it possible to get the number of jobs in the queue for a given type? would want to know how many webVideoTranscode jobs are in the queue on commonswiki [12:37:19] j^: yup :) [12:37:54] mwscript showJobs.php --wiki commonswiki --group webVideoTranscode [12:38:16] webVideoTranscode: 0 queued; 1179 claimed (527 active, 652 abandoned) [12:38:35] j^: webVideoTranscode: 0 queued; 1179 claimed (527 active, 652 abandoned) [12:38:38] too late [12:39:04] I am wondering whether it is send in graphite [12:39:28] thats totally wrong, so clearly jobs get abandoned, claimed by things other than the videoscalers [12:40:24] or jobs-loop.sh does something it should not [12:43:01] how can jobs get abandoned? 
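[Editor's note] The showJobs.php exchange above can be summarized as below; the state descriptions in the comments are inferred from the JobQueueRedis discussion later in this log, not stated authoritatively anywhere, so treat them as a best-effort sketch.

```shell
# Sketch of the per-type job queue query quoted above. The example output
# is copied from the log; the state meanings in the comments are inferred
# from JobQueueRedis and are an assumption, not confirmed in this channel.
cmd='mwscript showJobs.php --wiki commonswiki --group webVideoTranscode'
echo "$cmd"
# Example output: webVideoTranscode: 0 queued; 1179 claimed (527 active, 652 abandoned)
# queued:    jobs still waiting in the unclaimed list, not yet popped
# claimed:   jobs popped by some runner and moved to the claimed set
# active:    claimed jobs whose runner is presumed to still be working
# abandoned: claimed jobs that ran out of retries and were given up on
```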
[12:43:34] I have no idea what claimed / active stand for :-D [12:43:42] nor what is abandonned [12:48:35] would be nice to have that as a graph somewhere to check what the status is [12:48:57] buuuug report it :-] [12:49:07] I have no idea how to do that myself unfortunately [12:49:35] at least we have a job queue rate metric http://gdash.wikimedia.org/dashboards/jobq/ [12:50:09] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:18] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out [12:50:18] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:18] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:19] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:28] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:51:11] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out [12:51:29] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:51:29] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:51:44] dammit [12:51:59] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.062 second response time [12:52:02] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.766 second response time [12:52:10] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.052 second response time [12:52:11] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.053 second response time [12:52:11] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.068 second response time [12:52:11] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 
Moved Permanently - 747 bytes in 0.069 second response time [12:52:12] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.210 second response time [12:52:20] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.054 second response time [12:52:21] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 61822 bytes in 0.241 second response time [12:52:28] j^: we have stats in graphite :-] [12:52:49] uh [12:53:18] now I got the pages [12:54:06] sigh, we're still getting nagios too [12:55:16] I think it might abandon them after x retries, there's some point at which it just gives up [12:55:19] (job queue) [12:56:18] apergos: as far as i can tell the videoscalers never get to try them though [12:56:33] hashar: what account is needed for graphite? [12:56:40] that is odd [12:56:41] j^: labs [12:56:46] I don't think anyone here knows enough about the job queue [12:56:59] j^: """ WMF Labs (use wiki login name not shell) """ [12:56:59] I also don't think graphite has per job type information but I could be wrong (see above :) [12:57:27] j^: under Metric Type choose "Stats" [12:57:59] now whats the wiki login, the one for labsconsole? [12:58:04] j^: then in the tree view you have metrics such as job.job-insert. [13:01:20] https://doc.wikimedia.org/mediawiki-core/master/php/html/JobQueueRedis_8php_source.html this isn't too bad to read through, you can see that in recycleAndDeleteStaleJobs() where it does redis.call('zAdd',KEYS[5],timestamp,id) (that's add to abandoned if no retries left) [13:01:38] but I have no idea whatsoever about the videoscalers [13:05:30] i guess i have to catch aaron to help me debug this, videoscalers are 'just' running jobs-loop.sh for transcoding jobs. only thing special might be that they take a bit longer but that used to work [13:06:09] I might be able to help you from the ops end of things [13:06:37] i.e. 
I can check the number of jobs of type x by checking the tables directly, or I can see what things are running on a given host, etc [13:06:44] dunno how much that will actually help [13:09:13] looking at http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Video%2520scalers%2520eqiad&tab=m&vn= they are currently not doing anything [13:10:28] can i get more info on the jobs in the queue, i.e. all the data that is in the db/redis for them [13:10:46] which type of job, all of them? [13:10:47] (all webVideoTranscode jobs) [13:10:49] ah [13:10:50] :-D [13:11:20] ok give me just a sec to ... decide I am going to take a break from the 'self assessment' piece of the review, ugh [13:11:28] yeah so not going to finish that right now [13:12:43] mysql:wikiadmin@db1059 [commonswiki]> select count(*) from job where job_cmd = 'webVideoTranscode'; [13:12:49] | 615 | [13:13:00] isn't the job queue in Redis nowadays ? [13:13:21] it should be [13:13:24] yeah, I wonder why these are still in here [13:13:46] we have a copyJob.php script [13:14:03] that I think got written to migrate the job queue from SQL to Redis [13:14:17] the ones you are seeing are most probably leftovers that would need to be cleaned up one day [13:14:20] they are all from april so I'll assume they never got flushed out [13:14:24] yep [13:14:28] I have no idea how to access redis though [13:14:41] and we do not have a maintenance script to dump the job queue content [13:14:55] i would just remove the webVideoTranscode jobs [13:15:00] from the db [13:15:02] in graphite, we had a spike of abandoned jobs around May 10th iirc [13:16:23] that would be around the time i tried to reinsert jobs [13:16:54] but if they got abandoned right away, would be good to get access to the data in redis to see the full state [13:27:11] New patchset: Akosiaris; "Add myself to icinga's authorized for all lists" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66963 [13:27:59] it's just taking be a bit to 
figure out what the right query is [13:28:04] s/be/me [13:30:08] 12 million keys in there... meh [13:31:47] anyone knows about redactatron / python ? I got a very simple change pending that removes useless backslashes https://gerrit.wikimedia.org/r/60416 [13:31:57] that is to make it pass pep8 :-] [13:34:18] hashar: notpeter/binasher [13:34:42] paravoid: yeah they don't answer to my email :-] Will try again [13:34:50] oh peter too [13:34:56] will add him in the loop :] thx [13:36:14] Change merged: Akosiaris; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66963 [13:36:14] New review: Hashar; "Mailed Asher and Peter about this change." [operations/software/redactatron] (master) - https://gerrit.wikimedia.org/r/60416 [13:36:23] making notes here [13:36:25] commonswiki:jobqueue:webVideoTranscode:z-abandoned [13:37:00] apergos: somehow your mail from last week titled "Can I redo the upload.beta setup?" ended up in my spam folder :/ [13:37:47] hahaha [13:37:55] well you already answered and I am already working on it so... 
[13:38:04] zcard commonswiki:jobqueue:webVideoTranscode:z-abandoned [13:38:04] (integer) 652 [13:38:27] so it shows 653 abandoned jobs, we can look to see which ones those are if you want, j^ [13:38:32] er 652 [13:47:02] redis 127.0.0.1:6379> zcard commonswiki:jobqueue:webVideoTranscode:z-abandoned (integer) 652 [13:47:02] redis 127.0.0.1:6379> zcard commonswiki:jobqueue:webVideoTranscode:z-claimed (integer) 529 [13:47:02] redis 127.0.0.1:6379> zcard commonswiki:jobqueue:webVideoTranscode:z-delayed (integer) 0 [13:47:02] redis 127.0.0.1:6379> llen commonswiki:jobqueue:webVideoTranscode:l-unclaimed (integer) 0 [13:47:11] so no new abandoned ones [13:55:17] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [13:55:17] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [13:55:17] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [13:55:17] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [13:55:17] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [13:55:18] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: No successful Puppet run in the last 10 hours [13:55:18] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [13:55:19] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [13:55:19] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours [13:55:20] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [13:55:20] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [13:55:21] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [13:58:21] j^: did you want to see the innards of a few of the 
abandoned ones or what would you like? [13:58:36] note that the timestamps are all going to be in secs since 1 1 1970 :-/ but anyways... [14:00:38] do we not have 'how to look up job queue stuff in redis' on wikitech? cause if not I can write it [14:33:46] https://wikitech.wikimedia.org/wiki/Redis [14:33:55] * apergos goes back to self-assessment [14:40:37] New patchset: Odder; "(bug 49125) Add localised/v2 logos for Wikipedias without one IV" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66972 [14:41:13] apergos: apparently not [14:41:32] apergos: though we have https://wikitech.wikimedia.org/wiki/Redis#Examining_the_data_for_a_job [14:41:42] (which should not be under [[Redis]] hehe) [14:41:46] yeah, I just wrote that [14:41:50] *just* now. [14:41:54] ahhh [14:42:01] feel free to move it wherever etc [14:42:10] I would move that under https://wikitech.wikimedia.org/wiki/Job_queue [14:42:15] I wanted people to add the other things we use redis for [14:42:24] like isn't there session management... [14:42:35] anyways feel free to chop up, put links, etc [14:42:37] yup we hold sessions in redis too [14:43:22] the 'job queue' page appears to describe the job runners [14:43:35] maybe it should be moved too [14:43:52] yup 'Job queue runners' [14:43:58] and move 'Redis' to 'Job queue' [14:44:13] then Redis could hold the general doc about redis (like how to connect to it) [14:44:18] ok but then pull the redis specific stuff like 'where it lives' and 'commands and configuration refs' and put em in a 'redis' page [14:44:22] yeah good [14:45:08] it's a very simple command syntax, one point in its favor [14:45:40] the simpler the command, the easier it is to write doc for it :-] [14:45:57] yep! [14:47:56] apergos: will you do it or are you expecting me to do it ? :D [14:48:04] uh [14:48:20] well... you have a vision so unless you are chomping at the bit to do reviews! 
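[Editor's note] The keys counted with zcard/llen earlier all follow one naming pattern; this is a small sketch of the per-wiki, per-type Redis key layout, reconstructed from the key names seen in this log (the suffix meanings are inferred from the counts above, not authoritative).

```shell
# Sketch: the job queue's Redis key layout as seen in this log,
# "<wiki>:jobqueue:<type>:<suffix>". Suffix meanings are inferred:
#   l-unclaimed - list of jobs waiting to be popped
#   z-claimed   - sorted set of jobs a runner has taken
#   z-delayed   - sorted set of jobs scheduled for later
#   z-abandoned - sorted set of jobs that exhausted their retries
wiki="commonswiki"
type="webVideoTranscode"
for suffix in l-unclaimed z-claimed z-delayed z-abandoned; do
  echo "${wiki}:jobqueue:${type}:${suffix}"
done
# Each key can then be counted with redis-cli, as done above:
#   redis-cli llen  commonswiki:jobqueue:webVideoTranscode:l-unclaimed
#   redis-cli zcard commonswiki:jobqueue:webVideoTranscode:z-abandoned
abandoned_key="${wiki}:jobqueue:${type}:z-abandoned"
```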
but if you would rather I will do it, say so [14:49:08] apergos: do the page moves and I can amend later [14:49:13] oh no [14:49:16] apergos: I will do it [14:49:20] move keeps the history [14:49:21] I am lame [14:50:32] ok it's all you [14:50:40] bah I can't delete files [14:50:54] apergos can you delete ? https://wikitech.wikimedia.org/w/index.php?title=Job_queue&redirect=no [14:51:22] ah so you can move on top of it [14:51:31] yup moving Redis to Job queue [14:51:58] done [14:56:00] done [14:56:05] https://wikitech.wikimedia.org/wiki/Job_queue is now only about job queue [14:56:06] yay [14:56:11] https://wikitech.wikimedia.org/wiki/Redis has the overview [14:56:14] and usage [14:56:33] uh the please add should go to the redis page [14:57:30] or alternatively make a red link from redis to that... [14:57:47] it's like real docs now :-D [14:58:00] do edit :-] [14:58:05] sure [14:58:16] ah yeah I forgot to move that one hehe [14:58:23] so now we need a [[user session]] article haha [14:59:50] well I didn't sign up to write that, since I haven't looked at it at all [15:00:01] we need more doc hehe [15:00:06] Ceph: https://wikitech.wikimedia.org/wiki/Ceph that is sparse [15:00:06] we do [15:00:35] yeah I found out how sparse it was the other day [15:03:33] Change abandoned: Hashar; "redactatron is being rewritten entirely, so there is no point in keeping this change around." [operations/software/redactatron] (master) - https://gerrit.wikimedia.org/r/60416 [15:41:32] New review: Reedy; "As above" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/65860 [15:55:51] New patchset: Ottomata; "Initial commit of zookeeper module." [operations/puppet/zookeeper] (master) - https://gerrit.wikimedia.org/r/66882 [15:57:18] New review: Ottomata; "Thanks Hashar!" [operations/puppet/zookeeper] (master) - https://gerrit.wikimedia.org/r/66882 [15:57:40] paravoid ^ whenever you get a chance [15:57:49] review time... 
[15:58:16] <^demon> Oh yeah, I meant to take a gander at that too. [16:01:48] <^demon> ottomata: Looking at line 19 in https://gerrit.wikimedia.org/r/#/c/66882/2/manifests/server.pp, can you explain exactly what's happening with the ->? [16:01:50] <^demon> I've not seen that. [16:04:25] yeah that is an explicit class dependency [16:04:31] there are several ways of doing that [16:04:39] its the same as [16:04:42] ottomata: you are welcome :-) (pep8 on zookeeper puppet repo) [16:04:45] I am off! [16:04:55] laters, thank you! [16:04:55] require zookeeper [16:04:55] except [16:04:58] it doesn't auto-include the class [16:05:03] which is better for parameterized classes [16:05:06] so [16:05:12] with that dependency listed like that [16:05:17] if you try to include zookeeper::server [16:05:24] and haven't already explicitly included zookeeper [16:05:29] puppet will throw an error [16:05:36] "better" as in, "require" doesn't work with param classes :) [16:05:46] well, it does if all the parameters have defaults [16:05:52] <^demon> Gotcha. Makes sense now. I just hadn't seen the syntax before :) [16:05:55] <^demon> Thanks for clarifying [16:06:04] if they have defaults and you're not modifying the defaults [16:06:08] yeah [16:06:11] so basically it's not parameterized [16:06:26] require will work though, even if you have changed the defaults [16:06:32] it will keep the defaults you set when you included the class [16:06:33] but [16:06:39] ^demon: -> is the requires syntax, it's not limited to class dependencies [16:06:49] so you could say Package['foo'] -> File['bar'] too [16:06:58] if there was a case where you wanted to be able to maybe include zookeeper::server [16:07:03] and were ok with all of the defaults [16:07:06] you can even say package { 'foo': ... } -> file { 'bar': ... 
} iirc, but don't do that [16:07:15] you might not want to force the user to explicitly include the zookeeper class [16:07:19] in this case I do [16:07:21] so -> is better [16:07:24] in most cases -> is better [16:08:18] <^demon> *nods* [16:14:15] <^demon> ottomata: Same file, line 46: "$myid = inline_template('<%= zookeeper_hosts.index(fqdn) + 1 %>')" [16:14:27] <^demon> Is this going to preserve order or do we need to slap a sort on it like so many other places? [16:14:34] New patchset: Ottomata; "Renaming role::hadoop classes to role::analytics::hadoop" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66976 [16:14:46] hmm, its an array, so it should be ok…. [16:14:47] PROBLEM - SSH on ms-be11 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:14:47] right? [16:14:52] this isn't a hash [16:14:59] ^demon [16:15:08] <^demon> Prolly, just thought it was worth mentioning. [16:15:13] yeah, i think its cool [16:15:15] <^demon> Bit me before, jumped out as a many gotcha. [16:15:21] <^demon> *maybe [16:15:28] have you had the problem with an array before? or just hashes? [16:15:32] <^demon> Hashes. [16:15:35] k [16:16:10] New review: Ottomata; "This is tested and running on labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66976 [16:16:43] yeah arrays are fine [16:17:07] <^demon> ottomata: I'm going to install puppetmaster::self and give this a whirl with solr. I think it'll probably all work as is (and will earn a +1 from me), but let's find out for sure. [16:18:49] ok cool! [16:19:40] hmm actually [16:19:52] ^demon, lemme remove the zookeeper cdh4 .deb from our apt, i think that will be a problem [16:21:59] paravoid, you told me there was something special I had to do to remove this from our apt [16:22:31] did I tell you I have to review 11 people? [16:22:33] :) [16:23:18] haha, nope but I believe it [16:25:02] heya akosiaris, you there? 
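[Editor's note] The inline_template question above comes down to whether Array#index is order-stable, and it is: Ruby arrays preserve element order, unlike the hash case mentioned earlier. A shell sketch of the same computation (the hostnames here are illustrative, not from the log):

```shell
# Sketch of what inline_template('<%= zookeeper_hosts.index(fqdn) + 1 %>')
# computes: the 1-based position of this host in the (ordered) host array,
# used as the zookeeper myid. Hostnames below are made-up examples.
zookeeper_hosts=("zk1001.example.net" "zk1002.example.net" "zk1003.example.net")
fqdn="zk1002.example.net"
myid=0
for i in "${!zookeeper_hosts[@]}"; do
  # array index is 0-based; zookeeper myid must be >= 1
  [ "${zookeeper_hosts[$i]}" = "$fqdn" ] && myid=$((i + 1))
done
echo "$myid"
```

As long as every host sees the same array in the same order, each derives a stable, unique id without any explicit sort.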
[16:26:37] PROBLEM - swift-container-server on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:26:47] PROBLEM - RAID on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:26:47] PROBLEM - swift-container-replicator on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:26:47] PROBLEM - swift-object-server on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:26:47] PROBLEM - swift-container-auditor on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:26:47] PROBLEM - swift-account-reaper on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:26:57] PROBLEM - swift-account-replicator on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:26:57] PROBLEM - swift-container-updater on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:26:57] PROBLEM - swift-object-updater on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:26:59] apergos: ^^^ [16:27:18] PROBLEM - swift-account-auditor on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:27:18] PROBLEM - DPKG on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:27:27] PROBLEM - Disk space on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:27:27] PROBLEM - swift-object-auditor on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:27:27] PROBLEM - swift-account-server on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:27:27] PROBLEM - swift-object-replicator on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:27:59] grrr [16:30:44] New patchset: MaxSem; "Serve mobile logos from the same domain to avoid charging Zero users for them" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66167 [16:30:45] [10107517.182566] BUG: soft lockup - CPU#10 stuck for 22s! 
[xfsaild/sdb3:28665] [16:30:45] [10107517.190565] Stack: [16:30:45] [10107517.193109] Call Trace: [16:30:45] [10107517.196325] Code: 90 90 90 90 90 90 55 b8 00 00 01 00 48 89 e5 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 13 66 0f 1f 84 00 00 00 00 00 f3 90 0f b7 07 <66> 39 d0 75 f6 5d c3 0f 1f 40 00 8b 17 55 31 c0 48 89 e5 89 d1 [16:30:54] junk on console, can't log in, power cycling [16:31:10] did you turn off the Dell power management stuff in the bios? [16:32:03] I didn't do anything to the bios [16:32:35] pretty sure I didn't set these up (not 100% but pretty sure) [16:33:07] PROBLEM - Host ms-be11 is DOWN: PING CRITICAL - Packet loss = 100% [16:33:18] where is it in the menus? [16:33:39] mark: didn't set up but it's possible that dell power mgmt is on...sbernardin was doing that for some of the new apaches [16:33:58] system profile or some such [16:34:03] needs to be set to "OS", not Dell [16:34:07] checking [16:34:16] it says 'OS' [16:34:35] ok that's good [16:34:43] anything else I oughta look for while in here? [16:35:23] mark [16:35:53] hyperthreading turned off [16:36:04] other than that, not really [16:37:37] logical processor is off [16:37:42] ok bring back up then [16:37:46] *bringing [16:38:55] while I wait, why do we have those two things turned off? [16:39:35] !log powercycled ms-be11, was unresponsive on console with lots of [10107517.182566] BUG: soft lockup - CPU#10 stuck for 22s! 
[xfsaild/sdb3:28665] etc [16:41:13] RECOVERY - swift-account-replicator on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [16:41:13] RECOVERY - DPKG on ms-be11 is OK: All packages OK [16:41:13] RECOVERY - swift-account-auditor on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [16:41:23] RECOVERY - swift-container-updater on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [16:41:23] RECOVERY - Host ms-be11 is UP: PING OK - Packet loss = 0%, RTA = 26.59 ms [16:41:23] RECOVERY - Disk space on ms-be11 is OK: DISK OK [16:41:23] RECOVERY - swift-object-replicator on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [16:41:23] RECOVERY - swift-object-auditor on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [16:41:24] RECOVERY - swift-account-server on ms-be11 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [16:41:24] RECOVERY - swift-container-server on ms-be11 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [16:41:33] RECOVERY - swift-object-updater on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [16:41:44] RECOVERY - swift-container-replicator on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [16:41:44] RECOVERY - swift-object-server on ms-be11 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [16:41:44] RECOVERY - RAID on ms-be11 is OK: OK: State is Optimal, checked 1 logical device(s) [16:41:44] RECOVERY - SSH on ms-be11 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [16:41:44] RECOVERY - swift-container-auditor on ms-be11 is OK: PROCS OK: 1 process 
with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:41:44] RECOVERY - swift-account-reaper on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [16:43:39] apergos: dell power management = the BIOS is taking decisions on CPU frequency (P-state) and C-states [16:43:51] so it can randomly slow down a server because it thinks it's consuming too much power or whatever [16:44:00] while (OS) passes that control to Linux [16:44:04] and the cpufreq driver [16:44:19] I wonder how the bios makes those decisions... ok thanks [16:44:19] which is better informed to take these decisions and probably much less buggy :) [16:44:22] yep [16:45:53] meh so much noise in the kern.log that it's hard to see what set it off [16:49:03] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100% [16:49:33] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [17:01:31] ottomata: the zookeeper deb from cdh4 is an explicit dependency of the hadoop package [17:01:36] aptitude why zookeeper [17:01:36] i hadoop Depends zookeeper (>= 3.4.0) [17:02:31] at first glance this can not be satisfied by zookeeper 3.3 in standard precise repo [17:02:50] WHAAAA [17:02:51] growl. [17:02:53] hm [17:03:01] hmmm [17:03:13] hmmmmm [17:03:20] growl. [17:03:22] welp hm [17:03:23] i mean [17:03:25] we can use either package [17:03:27] ^demon [17:03:29] ahahahaha [17:03:38] i just puppetized for ubuntu .deb though [17:03:47] because I was separating out the puppetization for zookeeper [17:03:52] i thought it would be better to use the ubuntu one [17:03:54] rather than cdh4 [17:04:01] ergh [17:04:13] the packages as installed are not easily puppetizable in the same way [17:04:17] that's a fine goal [17:04:18] different launch script, etc. [17:04:23] <^demon> Argh :( [17:04:37] hmm [17:04:55] i just realized the entire extent of the problem [17:05:10] we were hoping to use the zookeeper module somewhere else correct ?
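[Editor's note] On the power-management point above ([16:43]-[16:44]): when the BIOS "System Profile" is set to OS, Linux's cpufreq driver owns frequency scaling and exposes its state under sysfs. A generic check (standard cpufreq sysfs paths; not a command anyone ran in this log):

```shell
# If Linux's cpufreq driver is in control, these sysfs files exist and show
# which driver/governor is making P-state decisions. If the firmware (Dell
# power management) is in control instead, they may be absent.
for f in scaling_driver scaling_governor scaling_cur_freq; do
    path="/sys/devices/system/cpu/cpu0/cpufreq/$f"
    if [ -r "$path" ]; then
        echo "$f: $(cat "$path")"
    else
        echo "$f: not exposed (firmware may be managing frequency)"
    fi
done
```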
[17:05:35] using ubuntu's package and not cdh4... [17:05:55] <^demon> For solr. [17:06:00] <^demon> Was the plan [17:06:03] yes [17:06:04] i mean [17:06:09] i'm sure the cdh4 package would work for solr too [17:06:13] its even a newer version in cdh4 [17:06:19] but, i have to puppetize it one way or the other [17:06:29] and since I moved the zookeeper puppet stuff out of the cdh4 module [17:06:35] i think we should use the ubuntu .deb [17:06:36] they are slightly different from what i saw [17:06:38] right? [17:06:50] yeah, slightly different config files [17:06:57] package names zookeeperd ->zookeeper-server etc [17:07:07] e.g. /etc/default/zookeeper vs /etc/zookeeper/conf/zookeeper-env.sh [17:07:08] yeah [17:07:10] that too [17:07:18] i mean [17:07:28] i can install zookeeper ubuntu by manually specifying version [17:07:29] its still available [17:07:34] i could put that in the puppetization i guess [17:07:39] package zookeeper=.... [17:07:42] or whatever it is [17:08:06] yeah but that would not be enough given the other differences [17:08:24] you would basically have to maintain 2 different modules... [17:09:32] why is hadoop depending on zookeeper? [17:09:44] does it actually use it somewhere, or is it a design dependency [17:09:51] use its files, init scripts etc. I mean [17:10:29] if it's the latter, we can create dummy "zookeeper" packages that depends on the stock packages [17:10:40] like a transitional package, that's easy [17:11:11] I would like us to try really hard to avoid maintaining two modules [17:11:26] if there are very good reasons we could do that, but I'd prefer us not to [17:11:29] I am getting the feeling that this dependency might be wrong... 
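[Editor's note] The "dummy transitional package" idea floated above ([17:10]) is typically done with equivs. A sketch under assumptions: the version string and field values here are illustrative, not taken from the actual repos, and `Depends: zookeeperd` assumes the stock precise package name discussed earlier:

```shell
# Write an equivs control file for an empty "zookeeper" package whose only
# job is to satisfy hadoop's "zookeeper (>= 3.4.0)" dependency while the
# stock Ubuntu zookeeperd actually provides the service.
cat > zookeeper-dummy.ctl <<'EOF'
Section: misc
Priority: optional
Standards-Version: 3.9.2
Package: zookeeper
Version: 3.4.0-0wmf1
Depends: zookeeperd
Description: transitional dummy package
 Satisfies hadoop's "zookeeper (>= 3.4.0)" dependency while the stock
 Ubuntu zookeeperd provides the actual service.
EOF
# equivs-build zookeeper-dummy.ctl   # (from the equivs package) emits the .deb
```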
[17:11:36] naw [17:11:38] looking at the package for hadoop [17:11:42] akosiaris: not 2 modules [17:11:54] the zk module i'm writing would use zookeeper from ubuntu [17:12:01] and i'd just manually specify that it has to match the version from ubuntu [17:12:08] that package is still available via apt [17:12:12] i find this inside: /usr/lib/hadoop/lib/zookeeper-3.4.5-cdh4.2.1.jar [17:12:13] its just not the default when you do [17:12:16] apt-get install zookeeper [17:12:21] since we apt pin the wmf apt repo [17:12:31] ottomata: yeah ok... but what about the rest of the differences ? [17:12:38] config dirs ? package names ? [17:13:12] at least for the zookeeperd <-> zookeeper-server stuff i mean [17:15:02] !log reedy synchronized wmf-config/InitialiseSettings.php 'Enable vips on testwiki and mediawikiwiki' [17:15:49] as long as the cdh4 zookeeper package is not installed [17:15:50] it will be fine [17:15:54] i puppetized using the ubuntu one [17:17:40] And in the hadoop nodes? Where the cdh4 zookeeper will be installed because of the dependency? [17:18:03] grr this dependency is confusing me [17:18:12] what do they need it for ? [17:19:26] New patchset: Reedy; "Enable Vips on testwiki and mediawikiwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66982 [17:21:17] hm akosiaris, i'm not sure if it will be installed, but if it is, i think it won't hurt [17:21:34] i don't need to apply zookeeper puppetization to the hadoop nodes (at least not right now…) [17:21:41] I *could* [17:21:43] ergh [17:21:49] i could keep the cdh4 puppetization in the cdh4 module [17:21:54] and still have a separate zookeeper module [17:21:56] but ungh [17:21:58] that's dumb [17:22:11] i agree with paravoid on that, 2 modules would be annoying [17:22:16] i mean, i already did the work for the cdh4 one [17:22:18] so it wouldn't be more work [17:22:21] but mehhhh [17:23:39] i agree on that 2. Two modules to maintain would be a burden we don't need... 
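[Editor's note] Besides `package zookeeper=<version>` one-offs, the "manually specifying version" approach above can be made sticky with an apt pin, so upgrades keep tracking the Ubuntu build even though the wmf repo is pinned by default. An illustrative preferences fragment (the actual wmf pin priorities aren't shown in this log):

```
# /etc/apt/preferences.d/zookeeper (illustrative)
Package: zookeeper zookeeperd
Pin: release o=Ubuntu
Pin-Priority: 1001
```

A priority above 1000 makes apt prefer the Ubuntu origin even over a higher version elsewhere; `apt-cache policy zookeeper` shows which candidate wins.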
[17:23:41] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66982 [17:25:29] and yeah, installing cdh4 hadoop does install cdh4 zookeeper, gr [17:25:55] can we discuss this tomorrow? [17:25:56] wait, no i'm not sure, sorry, i remember that my previous puppetization installed the zookeeper package [17:26:12] yeah no worries paravoid, i think we know what we want [17:26:13] sorry, I just don't want to spend all night writing reviews :) [17:26:41] I know you know what we want, I just wanted to help :) [17:26:45] i'm pretty sure that what we have will work if I manually specify version in zookeeper puppetization, i just won't be able to puppetize zookeeper on hadoop nodes…which may or may not be ok [17:26:46] we'll see [17:27:08] yeah, no worries, the zk puppetization review is not a priority [17:27:15] i have meetings and have to do these reviews today too [17:34:17] New patchset: Kaldari; "Turning on Disambiguator for test and test2" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66987 [17:42:48] New review: Kaldari; "Not yet" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/66987 [17:53:31] paravoid: hadoop needs zookeeper to function [17:54:12] preilly, i'm not really sure why, or if that is really true [17:54:22] i haven't configured hadoop to use zookeeper at all [17:54:24] paravoid: hbase uses it [17:54:27] ahhh [17:54:31] that makes sense [17:54:35] we haven't used hbase yet, so [17:54:38] ja [17:54:57] cluster management like locks, leader election etc. [17:55:09] Hey mutante, anything I can do to help getting https://gerrit.wikimedia.org/r/#/c/65443/ finished?
[18:11:18] paravoid: you should check if both zookeeper packages use the same port of 2181 [18:12:08] if they're configured as different ports you could have both packages side-by-side without issue [18:14:08] New patchset: MaxSem; "Mobile redirect for Commons and Wikimania2013" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66991 [18:17:51] Is the wmflabs.org server hosted by the WMF? [18:18:04] Prodego: yes [18:18:22] though technically that domain isn't pointing at anything [18:18:45] Ryan_Lane: so there is no longer any legal reason for edit counter tools hosted at http://tools.wmflabs.org/ to be opt-in [18:19:06] I'm not sure what you mean [18:20:01] Ryan_Lane: edit counters that did month by month breakdowns were opt-in on the toolserver because of some german privacy law [18:20:24] but if wikimedia de is no longer hosting the tools... [18:21:11] Ryan_Lane: this is from https://wiki.toolserver.org/view/Rules#Privacy_Policy [18:21:31] Ryan_Lane: presumably that rule does not exist on wmflabs? [18:22:13] tools.wmflabs.org has the same privacy policy as wikimedia sites [18:22:33] if an edit counter would violate a user's privacy, then it likely isn't allowed. [18:22:37] ask Coren, though [18:23:07] I deny everything! [18:23:11] Ryan_Lane: well this would not violate the wikimedia privacy policy [18:23:19] Coren: thoughts? [18:24:53] That's actually a very good question, which I am loathe to answer without some help from Legal. [18:25:15] Coren: well I'd really like to go back to using these edit counters instead of the third party ones [18:25:17] I would expect it'd be okay by the privacy policy, but that the *community* might have reservations. 
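[Editor's note] On the port question raised at [18:11] above: 2181 is ZooKeeper's standard client port, so if both the Ubuntu and CDH4 packages ship defaults, co-installed daemons would fight over it unless one config is changed. A self-contained sketch of checking (sample config text stands in for the real `/etc/zookeeper/conf/zoo.cfg`, an assumed path):

```shell
# Extract clientPort from a zoo.cfg-style config; run against each package's
# config file to spot a conflict between side-by-side installs.
zoo_cfg='tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181'
port=$(printf '%s\n' "$zoo_cfg" | awk -F= '$1 == "clientPort" {print $2}')
echo "clientPort is $port"   # -> clientPort is 2181
```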
[18:25:37] Coren: month by month breakdowns are (well, as you know) useful [18:25:55] PROBLEM - Puppet freshness on mw1149 is CRITICAL: No successful Puppet run in the last 10 hours [18:26:13] Prodego: I would expect that monthly breakdowns should be okay; the information is publicly available and isn't that revealing. [18:26:16] Coren: is TParis still the person in charge of that tool? [18:26:27] Coren: are you from the US or europe or somewhere else? [18:26:41] Prodego: If it's on labs, you can see who the maintainers are on tools.wmflabs.org [18:26:56] Prodego: Somewhere else. North of the border in Canada. :-) [18:27:37] Coren: :) It seems that nearly everyone from the US and Canada find it quite reasonable to break down publicly available information by month [18:27:51] Coren: but many of the europeans consider that an invasion of privacy [18:27:55] PROBLEM - Puppet freshness on mw1096 is CRITICAL: No successful Puppet run in the last 10 hours [18:28:03] obviously that is generalizing quite a bit [18:28:43] Prodego: Different mores. I should expect a monthly breakdown would be okay because it's unlikely to provide much information not plainly visible. A day-by-hour graph is more revealing, and is probably iffy. [18:30:51] Prodego: what third party edit counts? [18:31:18] Nemo_bis: http://en.wikichecker.com/ is a good one [18:31:43] Nemo_bis: gives you nice detailed breakouts [18:31:54] when that rule was introduced, by the way, interiot's edit counter also told at what time of the day you were active and things like that [18:32:02] Nemo_bis: which was nice [18:32:05] wikichecker does that [18:32:47] looks useless, only en.wiki and few others?
[18:33:37] omg flash player [18:33:53] "nice" is surely a word I wouldn't use in relation to that site :) [18:34:18] Nemo_bis: for sure it is a lot less useful [18:34:28] Nemo_bis: that's why I'd like the original ones back [18:34:52] The opt-in behavior really killed the usefulness [19:00:16] PROBLEM - Swift HTTP on ms-fe4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:01:02] New review: Krinkle; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62923 [19:04:36] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:14] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:23] PROBLEM - Swift HTTP on ms-fe3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:24] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out [19:11:35] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out [19:11:35] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out [19:11:35] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:35] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:43] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out [19:11:55] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out [19:12:03] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out [19:12:07] PROBLEM - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:37] grrrr [19:14:13] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.685 second response time [19:15:44] PROBLEM - DPKG on ms-fe3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:15:45] PROBLEM - RAID on ms-fe3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:15:51] !log mlitn Started syncing Wikimedia installation... 
: Update ArticleFeedbackv5 to master [19:16:33] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:34] PROBLEM - SSH on ms-fe3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:55] RECOVERY - DPKG on ms-fe3 is OK: All packages OK [19:17:04] RECOVERY - RAID on ms-fe3 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [19:17:13] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:23] RECOVERY - SSH on ms-fe3 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:20:04] PROBLEM - DPKG on ms-fe3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:20:04] PROBLEM - Disk space on ms-fe3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:20:04] PROBLEM - RAID on ms-fe3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:20:53] RECOVERY - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.085 second response time [19:21:27] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 1.017 second response time [19:21:57] RECOVERY - Disk space on ms-fe3 is OK: DISK OK [19:21:57] RECOVERY - DPKG on ms-fe3 is OK: All packages OK [19:21:57] RECOVERY - RAID on ms-fe3 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [19:22:02] I have no idea what is going on, should anyone else be thinking of looking at these [19:23:59] PROBLEM - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is CRITICAL: HTTP CRITICAL - No data received from host [19:24:57] RECOVERY - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 3.439 second response time [19:25:35] RECOVERY - Swift HTTP on ms-fe3 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 0.075 second response time [19:25:35] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.374 second response time [19:25:35] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved 
Permanently - 747 bytes in 0.083 second response time [19:25:35] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.082 second response time [19:25:36] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.083 second response time [19:25:43] !log restarted swift-proxy on ms-fe3 and 4 (swapping) [19:25:44] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.058 second response time [19:25:45] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.070 second response time [19:25:45] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.879 second response time [19:25:55] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.068 second response time [19:25:55] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 61645 bytes in 0.234 second response time [19:26:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:28:00] PROBLEM - DPKG on ms-fe3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:28:10] PROBLEM - RAID on ms-fe3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:28:30] didn't take on ms-fe3 [19:28:52] RECOVERY - DPKG on ms-fe3 is OK: All packages OK [19:29:00] RECOVERY - RAID on ms-fe3 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [19:29:20] redid [19:29:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [19:30:32] RECOVERY - Swift HTTP on ms-fe4 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 0.097 second response time [19:30:40] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 0.063 second response time [19:32:06] !log restarting ms-fe1/4 proxies, 100% cpu writing to a ENOTCONN fd [19:32:29] looks better [19:33:27] no idea what caused them to go out to lunch like that [19:33:59] New patchset: GWicke; "New Parsoid Varnish puppetization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63890 [19:34:28] back ends still look pretty unhappy [19:35:54] New review: GWicke; "Removed the duplicate purge logic in vcl_{miss,hit} as pointed out by Asher." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/63890 [19:38:18] New patchset: GWicke; "New Parsoid Varnish puppetization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63890 [19:39:10] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63890 [19:39:27] !log "swift-init all restart" on all swift backends, CPU runaway threads, see above [19:40:05] I so don't care what's wrong with it right now [19:41:57] there, better than before [19:41:59] going out [19:42:30] have fun [19:56:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [20:20:02] New patchset: Catrope; "Fix parsoid-common VCL inclusion and references" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67003 [20:21:33] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67003 [20:31:53] New patchset: ArielGlenn; "add registration info to user info wikiretriever can get" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/67005 [20:33:12] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/67005 [20:33:50] !log contacts.wm is back up, manually brought back up from tridge backups on zirconium, because singer is dead.. that also means.. singer is gone forever [20:34:34] New patchset: Catrope; "Actually use new Parsoid Varnish puppetization in production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67006 [20:35:56] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67006 [20:48:25] New patchset: Andrew Bogott; "Pep8 cleanups:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67008 [20:58:37] Change abandoned: Andrew Bogott; "I've made Jenkins changes to support per-dir .pep8 files. 
So this isn't needed anymore." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61999 [21:00:46] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61891 [21:17:15] New patchset: RobH; "RT 2640 netmon1001 as smokeping server in eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67012 [21:17:15] New patchset: RobH; "netmon1001 to be new smokeping host in eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67013 [21:17:55] Change abandoned: RobH; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67013 [21:19:21] sleep! [21:19:22] now! [21:19:25] *poof* [21:20:21] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67012 [21:24:18] mutante: Could you update the Watchmouse check for Parsoid to hit port 80 instead of port 6081 please? [21:29:03] RoanKattouw: "You have changed this monitor substantially, do you want to start a new log, or do you want to continue appending to the current log?" [21:29:22] Either is fine, I don't care [21:29:30] It's the same service, I just moved it to a different port [21:29:44] Which... I should probably update the MW config shouldn't I.... 
[21:29:59] done and saved [21:30:04] Thanks man [21:30:07] keeps old log and appends, np [21:31:02] New patchset: Catrope; "Parsoid Varnish no longer uses port 6081" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/67016 [21:31:26] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/67016 [21:32:28] !log update parsoid watchmouse monitoring from port 6081 to port 80 [21:32:59] !log catrope synchronized wmf-config/CommonSettings.php 'Update Parsoid Varnish port' [21:34:35] !log catrope synchronized wmf-config/CommonSettings.php 'Temp hack around Parsoid cache LVS breakage' [21:35:29] New patchset: Catrope; "Move parsoidcache LVS from port 6081 to port 80" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67017 [21:37:15] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67017 [21:43:35] !log Restarting pybal on lvs1006 [21:44:30] !log Restarting pybal on lvs1003 [21:45:49] !log catrope synchronized wmf-config/CommonSettings.php 'Undo temp hack' [21:46:17] New patchset: Andrew Bogott; "Pep8 cleanups:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67008 [21:46:17] New patchset: Andrew Bogott; "Pep8 cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67020 [21:46:18] New patchset: Andrew Bogott; "Pep8 cleanups; mostly whitespace." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67021 [21:46:18] New patchset: Andrew Bogott; "Pep8 cleanup stage one: tab purge!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67022 [22:09:03] andrewbogott: you are evil :-] [22:09:11] I'm only getting started [22:09:23] thanks for leading the effort on this! [22:09:30] +1 [22:09:57] andrewbogott: and your pep8 wrapper is a nice trick :-] [22:11:00] I have mixed feelings about using .pep8 exceptions… but at least this way they'll get reviewed and considered I guess. 
[22:11:31] When I was writing that yesterday I thought, "Of course, /I/ will never add exceptions" and today I am already adding a few [22:13:23] New review: Hashar; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67008 [22:13:48] andrewbogott: that is a good way to achieve progress [22:13:54] we can get rid of the exceptions later on [22:14:26] andrewbogott: left a comment on https://gerrit.wikimedia.org/r/#/c/67008/2/files/ldap/scripts/.pep8,unified [22:14:36] !log reedy synchronized php-1.22wmf5/extensions/SecurePoll/ [22:14:54] I usually add a comment line listing what the error code is such as: [22:14:54] # E501 line too long (79 chars) [22:14:55] ignore = E501 [22:15:02] Yep, good idea. [22:15:03] so we have a clue, but feel free to ignore it :-) [22:15:25] the pep8 source code has detailed explanations as well [22:17:49] andrewbogott: also some python scripts come from upstream :D [22:17:54] such as the ganglia plugins [22:18:07] there might be others which one would want to basically ignore [22:18:36] man, the pep8 tool is super bad at parsing its config [22:18:44] did a bunch of them for ganglia https://github.com/ganglia/gmond_python_modules/pull/109 [22:19:07] Oh, yeah, if they're upstream then I should leave them alone. Is there a README or something that says that? [22:19:32] not that I know [22:19:42] it seems we just cherry picked the ganglia plugins we needed [22:19:59] that led me to find out there is one to monitor jenkins :-] [22:21:22] I am off, happy tweaking!
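[Editor's note] Putting hashar's suggestion above ([22:14]) together, a per-directory `.pep8` file for the Jenkins wrapper would look roughly like this; the `[pep8]` section name and the W191 code are an assumed illustration of the format, with each ignored code documented in a comment as suggested:

```
# files/ldap/scripts/.pep8 (path from the review above; format assumed)
[pep8]
# E501 line too long (79 chars)
# W191 indentation contains tabs
ignore = E501,W191
```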
[22:22:28] New patchset: Catrope; "Fix monitoring for Parsoid caches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67029 [22:23:13] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67029 [22:23:42] New patchset: Andrew Bogott; "Pep8 cleanups:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67008 [22:23:58] New patchset: Andrew Bogott; "Pep8 cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67020 [22:24:15] New patchset: Andrew Bogott; "Pep8 cleanups; mostly whitespace." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67021 [22:25:53] New patchset: Andrew Bogott; "Pep8 cleanup stage one: tab purge!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67022 [22:35:00] New patchset: Andrew Bogott; "Suppress a jillion pep8 warnings for these files." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67030 [22:52:21] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [23:03:01] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [23:55:29] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [23:55:29] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [23:55:29] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [23:55:29] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [23:55:29] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [23:55:30] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: No successful Puppet run in the last 10 hours [23:55:30] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [23:55:31] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 
hours [23:55:31] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours [23:55:32] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [23:55:32] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [23:55:33] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours