[00:00:49] Coren|Food: around ? (i'm guessing not due to your nick) [00:01:04] I am just back. [00:01:08] whazza? [00:01:30] * Jasper_Deng repokes LeslieCarr [00:01:41] labstore access [00:01:43] binasher: I don't see too many lock wait timeouts for term_entity_type, term_entity_id queries [00:02:12] LeslieCarr: Either the problem wasn't at the network layer, or you fixed something. :-) [00:02:35] Cuz the packets started flowing once I beat up the NFS client. [00:02:44] i didn't fix anything [00:02:46] * AaronSchulz actually wonders where all the lock errors went lately...guess someone fixed them or some bots are editing less [00:02:47] AaronSchulz: wikidatawiki lock timeouts actually dropped quite a lot after s5 went mariadb [00:03:00] aha [00:03:21] that might explain [00:03:24] LeslieCarr: So the packets not getting to the server was a symptom, not the cause. Thanks for checking. :-) [00:03:35] AaronSchulz: i don't think that wb_term deletes are plentiful right now, though i expect they'll become more so over time [00:04:46] binasher: I wonder what the cost of not being unique and not using the insert buffer will be [00:04:53] *of being unique [00:11:01] AaronSchulz: that's a valid question. the best choice would be for them to only delete by term_row_id instead of on entity_type+id [00:11:45] they also want to perform updates on entity_type+id though [00:12:07] if it's unique, that avoids gap locking [00:13:27] so some of these queries are not happening yet? [00:14:39] * AaronSchulz should head out soon [00:30:48] AaronSchulz: the write queries on wb_terms are mostly inserts and some deletes, but no updates at all. 
it's possible that daniel was speaking hypothetically about the update queries in bugzilla [00:31:13] but it seems like term_search_key would at least be subject to change [00:31:44] New patchset: MaxSem; "$wgMFRemovableClasses overhaul" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66891 [01:02:48] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset 0.004111051559 secs [01:31:56] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.002082228661 secs [02:09:55] !log LocalisationUpdate completed (1.22wmf5) at Tue Jun 4 02:09:55 UTC 2013 [02:10:05] Logged the message, Master [02:16:08] !log LocalisationUpdate completed (1.22wmf4) at Tue Jun 4 02:16:08 UTC 2013 [02:16:15] Logged the message, Master [02:36:23] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Jun 4 02:36:22 UTC 2013 [02:36:30] Logged the message, Master [03:06:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:07:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [03:55:10] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [03:55:10] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [03:55:10] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [03:55:10] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [03:55:10] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: No successful Puppet run in the last 10 hours [03:55:11] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [03:55:11] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [03:55:12] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [03:55:12] PROBLEM - 
Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours [03:55:13] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [03:55:13] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [03:55:14] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [04:56:56] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [05:00:47] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [05:45:56] New patchset: ArielGlenn; "wikiretriever can now get user info for all users of a wiki" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/66876 [05:47:48] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/66876 [06:28:20] New patchset: ArielGlenn; "description for wb_terms table in dump; this completes bug #44844" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/66897 [06:28:53] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/66897 [06:34:01] thanks apergos :) [06:34:33] yw [06:34:50] I haven't deployed it yet (though I ran it on the most recent run manually so it's there in the latets run [06:34:50] ) [06:35:23] ok [06:35:28] I feel like it's time to write term papers and instead of that I'm cleaning my cat litter box [06:35:39] heh [06:35:39] (shoud do reviews, am instead going through bugzilla :-D) [06:35:45] *should [06:36:07] * aude off to the office [06:36:12] enjoy [07:31:28] morning [07:32:48] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.00158393383 secs [07:41:30] New review: Hashar; "recheck" [operations/puppet/zookeeper] (master) - https://gerrit.wikimedia.org/r/66882 [07:42:27] New patchset: Hashar; "Jenkins job validation (DO NOT SUBMIT)" [operations/puppet/zookeeper] (master) - https://gerrit.wikimedia.org/r/66906 [07:42:55] New review: Hashar; "recheck" 
[operations/puppet/zookeeper] (master) - https://gerrit.wikimedia.org/r/66882 [07:44:16] is there someone with bugzilla admin rights who can delete https://bugzilla.wikimedia.org/show_bug.cgi?id=49099 ? [07:45:48] drdee: just close it :-D [07:46:01] yeahhhh but it's obvious spam [07:46:07] I can mark the comment private and close it [07:46:13] the mail notification already got sent though :( [07:46:33] done :) [07:46:55] New review: Hashar; "recheck" [operations/puppet/zookeeper] (master) - https://gerrit.wikimedia.org/r/66882 [07:47:33] New review: Hashar; "Sorry for the Jenkins spam on this change. We now have puppet lint + erb lint on operations/puppet/z..." [operations/puppet/zookeeper] (master) - https://gerrit.wikimedia.org/r/66882 [07:47:43] Change abandoned: Hashar; "(no reason)" [operations/puppet/zookeeper] (master) - https://gerrit.wikimedia.org/r/66906 [07:50:38] drdee: and mailed Andre (you are in cc) for his information [07:50:59] thanks hashar! [08:02:32] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.003262877464 secs [08:02:59] yeah we don't try to delete things, we mark them as private, deleting turns out to be hard. 
[08:03:13] k, didn't know [08:03:17] sure [09:34:31] New patchset: Akosiaris; "Pin cloudera packages at 4.2.1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66950 [09:46:09] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100% [09:46:59] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 3.24 ms [09:50:46] New patchset: Daniel Kinzler; "add "/entity/" redirects for wikidata per" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/65463 [09:53:35] Change merged: Akosiaris; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66950 [10:15:19] New review: Daniel Kinzler; "(1 comment)" [operations/apache-config] (master) C: -1; - https://gerrit.wikimedia.org/r/65463 [10:28:46] New review: Aude; "(1 comment)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/65463 [10:47:18] is it possible to get the number of jobs in the queue for a given type? [10:50:08] for commonswiki [11:06:19] what the. [11:06:51] why am I getting blank pages whenever I load a new page on Wikiquote, I wonder... 
[11:13:39] j^: you could probably harass Reedy to get them [11:57:44] New patchset: Hashar; "jenkins: add in ganglia monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66960 [12:23:47] PROBLEM - Disk space on ms2 is CRITICAL: NRPE: Command check_disk_space not defined [12:23:47] PROBLEM - RAID on db44 is CRITICAL: CRITICAL: Degraded [12:23:47] PROBLEM - RAID on virt1 is CRITICAL: Connection refused by host [12:23:47] PROBLEM - twemproxy process on terbium is CRITICAL: NRPE: Command check_twemproxy not defined [12:23:47] PROBLEM - DPKG on virt6 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:23:47] PROBLEM - SSH on virt1 is CRITICAL: Connection refused [12:23:47] PROBLEM - NTP on nescio is CRITICAL: NTP CRITICAL: Offset unknown [12:23:48] PROBLEM - search indices - check lucene status page on search1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 60120 bytes in 0.009 second response time [12:23:48] PROBLEM - search indices - check lucene status page on search1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 52911 bytes in 0.019 second response time [12:23:56] PROBLEM - twemproxy process on fenari is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [12:23:56] PROBLEM - Parsoid on wtp1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:23:56] PROBLEM - search indices - check lucene status page on search1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 58845 bytes in 0.015 second response time [12:23:56] PROBLEM - mysqld processes on db44 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [12:24:06] PROBLEM - twemproxy process on mw15 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [12:24:06] PROBLEM - Disk space on virt3 is CRITICAL: Connection refused by host [12:25:59] PROBLEM - twemproxy process on mw9 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 
(nobody), command name nutcracker [12:25:59] PROBLEM - twemproxy process on mw61 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [12:25:59] PROBLEM - twemproxy process on mw12 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [12:26:47] PROBLEM - swift-container-updater on ms-be1 is CRITICAL: Connection refused by host [12:26:47] PROBLEM - DPKG on ms-be1 is CRITICAL: Connection refused by host [12:26:48] PROBLEM - Disk space on ms-be1 is CRITICAL: Connection refused by host [12:26:48] PROBLEM - twemproxy process on mw7 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [12:26:56] PROBLEM - swift-account-server on ms-be1 is CRITICAL: Connection refused by host [12:26:57] PROBLEM - swift-object-auditor on ms-be1 is CRITICAL: Connection refused by host [12:31:27] Reedy: is it possible to get the number of jobs in the queue for a given type? would want to know how many webVideoTranscode jobs are in the queue on commonswiki [12:37:19] j^: yup :) [12:37:54] mwscript showJobs.php --wiki commonswiki --group webVideoTranscode [12:38:16] webVideoTranscode: 0 queued; 1179 claimed (527 active, 652 abandoned) [12:38:35] j^: webVideoTranscode: 0 queued; 1179 claimed (527 active, 652 abandoned) [12:38:38] too late [12:39:04] I am wondering whether it is send in graphite [12:39:28] thats totally wrong, so clearly jobs get abandoned, claimed by things other than the videoscalers [12:40:24] or jobs-loop.sh does something it should not [12:43:01] how can jobs get abandoned? 
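[Editor's note] The showJobs.php exchange above can be summarized as below; the state descriptions in the comments are inferred from the JobQueueRedis discussion later in this log, not stated authoritatively anywhere, so treat them as a best-effort sketch.

```shell
# Sketch of the per-type job queue query quoted above. The example output
# is copied from the log; the state meanings in the comments are inferred
# from JobQueueRedis and are an assumption, not confirmed in this channel.
cmd='mwscript showJobs.php --wiki commonswiki --group webVideoTranscode'
echo "$cmd"
# Example output: webVideoTranscode: 0 queued; 1179 claimed (527 active, 652 abandoned)
# queued:    jobs still waiting in the unclaimed list, not yet popped
# claimed:   jobs popped by some runner and moved to the claimed set
# active:    claimed jobs whose runner is presumed to still be working
# abandoned: claimed jobs that ran out of retries and were given up on
```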
[12:43:34] I have no idea what claimed / active stand for :-D [12:43:42] nor what is abandonned [12:48:35] would be nice to have that as a graph somewhere to check what the status is [12:48:57] buuuug report it :-] [12:49:07] I have no idea how to do that myself unfortunately [12:49:35] at least we have a job queue rate metric http://gdash.wikimedia.org/dashboards/jobq/ [12:50:09] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:18] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out [12:50:18] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:18] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:19] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:28] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:51:11] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out [12:51:29] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:51:29] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:51:44] dammit [12:51:59] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.062 second response time [12:52:02] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.766 second response time [12:52:10] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.052 second response time [12:52:11] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.053 second response time [12:52:11] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.068 second response time [12:52:11] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 
Moved Permanently - 747 bytes in 0.069 second response time [12:52:12] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.210 second response time [12:52:20] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.054 second response time [12:52:21] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 61822 bytes in 0.241 second response time [12:52:28] j^: we have stats in graphite :-] [12:52:49] uh [12:53:18] now I got the pages [12:54:06] sigh, we're still getting nagios too [12:55:16] I think it might abandon them after x retries, there's some point at which it just gives up [12:55:19] (job queue) [12:56:18] apergos: as far as i can tell the videoscalers never get to try them though [12:56:33] hashar: what account is needed for graphite? [12:56:40] that is odd [12:56:41] j^: labs [12:56:46] I don't think anyone here knows enough about the job queue [12:56:59] j^: """ WMF Labs (use wiki login name not shell) """ [12:56:59] I also don't think graphite has per job type information but I could be wrong (see above :) [12:57:27] j^: under Metric Type choose "Stats" [12:57:59] now whats the wiki login, the one for labsconsole? [12:58:04] j^: then in the tree view you have metrics such as job.job-insert. [13:01:20] https://doc.wikimedia.org/mediawiki-core/master/php/html/JobQueueRedis_8php_source.html this isn't too bad to read through, you can see that in recycleAndDeleteStaleJobs() where it does redis.call('zAdd',KEYS[5],timestamp,id) (that's add to abandoned if no retries left) [13:01:38] but I have no idea whatsoever about the videoscalers [13:05:30] i guess i have to catch aaron to help me debug this, videoscalers are 'just' running jobs-loop.sh for transcoding jobs. only thing special might be that they take a bit longer but that used to work [13:06:09] I might be able to help you from the ops end of things [13:06:37] i.e. 
I can check the number of jobs of type x by checking the tables directly, or I can see what things are running on a given host, etc [13:06:44] dunno how much that will actually help [13:09:13] looking at http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Video%2520scalers%2520eqiad&tab=m&vn= they are currently not doing anything [13:10:28] can i get more info on the jobs in the queue, i.e. all the data that is in the db/redis for them [13:10:46] which type of job, all of them? [13:10:47] (all webVideoTranscode jobs) [13:10:49] ah [13:10:50] :-D [13:11:20] ok give me just a sec to ... decide I am going to take a break from the 'self assessment' piece of the review, ugh [13:11:28] yeah so not going to finish that right now [13:12:43] mysql:wikiadmin@db1059 [commonswiki]> select count(*) from job where job_cmd = 'webVideoTranscode'; [13:12:49] | 615 | [13:13:00] isn't the job queue in Redis nowadays ? [13:13:21] it should be [13:13:24] yeah, I wonder why these are still in here [13:13:46] we have a copyJob.php script [13:14:03] that I think got written to migrate the job queue from SQL to Redis [13:14:17] the ones you are seeing are most probably leftovers that would need to be cleaned up one day [13:14:20] they are all from april so I'll assume they never got flushed out [13:14:24] yep [13:14:28] I have no idea how to access redis though [13:14:41] and we do not have a maintenance script to dump the job queue content [13:14:55] i would just remove the webVideoTranscode jobs [13:15:00] from the db [13:15:02] in graphite, we had a spike of abandoned jobs around May 10th iirc [13:16:23] that would be around the time i tried to reinsert jobs [13:16:54] but if they got abandoned right away, would be good to get access to the data in redis to see the full state [13:27:11] New patchset: Akosiaris; "Add myself to icinga's authorized for all lists" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66963 [13:27:59] it's just taking be a bit to 
figure out what the right query is [13:28:04] s/be/me [13:30:08] 12 million keys in there... meh [13:31:47] anyone knows about redactatron / python ? I got a very simple change pending that removes useless backslashes https://gerrit.wikimedia.org/r/60416 [13:31:57] that is to make it pass pep8 :-] [13:34:18] hashar: notpeter/binasher [13:34:42] paravoid: yeah they don't answer to my email :-] Will try again [13:34:50] oh peter too [13:34:56] will add him in the loop :] thx [13:36:14] Change merged: Akosiaris; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66963 [13:36:14] New review: Hashar; "Mailed Asher and Peter about this change." [operations/software/redactatron] (master) - https://gerrit.wikimedia.org/r/60416 [13:36:23] making notes here [13:36:25] commonswiki:jobqueue:webVideoTranscode:z-abandoned [13:37:00] apergos: somehow your mail from last week titled "Can I redo the upload.beta setup?" ended up in my spam folder :/ [13:37:47] hahaha [13:37:55] well you already answered and I am already working on it so... 
[13:38:04] zcard commonswiki:jobqueue:webVideoTranscode:z-abandoned [13:38:04] (integer) 652 [13:38:27] so it shows 653 abandoned jobs, we can look to see which ones those are if you want, j^ [13:38:32] er 652 [13:47:02] redis 127.0.0.1:6379> zcard commonswiki:jobqueue:webVideoTranscode:z-abandoned (integer) 652 [13:47:02] redis 127.0.0.1:6379> zcard commonswiki:jobqueue:webVideoTranscode:z-claimed (integer) 529 [13:47:02] redis 127.0.0.1:6379> zcard commonswiki:jobqueue:webVideoTranscode:z-delayed (integer) 0 [13:47:02] redis 127.0.0.1:6379> llen commonswiki:jobqueue:webVideoTranscode:l-unclaimed (integer) 0 [13:47:11] so no new abandoned ones [13:55:17] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [13:55:17] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [13:55:17] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [13:55:17] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [13:55:17] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [13:55:18] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: No successful Puppet run in the last 10 hours [13:55:18] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [13:55:19] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [13:55:19] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours [13:55:20] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [13:55:20] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [13:55:21] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [13:58:21] j^: did you want to see the innards of a few of the 
abandoned ones or what would you like? [13:58:36] note that the timestamps are all going to be in secs since 1 1 1970 :-/ but anyways... [14:00:38] do we not have 'how to look up job queue stuff in redis' on wikitech? cause if not I can write it [14:33:46] https://wikitech.wikimedia.org/wiki/Redis [14:33:55] * apergos goes back to self-assessment [14:40:37] New patchset: Odder; "(bug 49125) Add localised/v2 logos for Wikipedias without one IV" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66972 [14:41:13] apergos: apparently not [14:41:32] apergos: though we have https://wikitech.wikimedia.org/wiki/Redis#Examining_the_data_for_a_job [14:41:42] (which should not be under [[Redis]] hehe) [14:41:46] yeah, I just wrote that [14:41:50] *just* now. [14:41:54] ahhh [14:42:01] feel free to move it wherever etc [14:42:10] I would move that under https://wikitech.wikimedia.org/wiki/Job_queue [14:42:15] I wanted people to add the other things we use redis for [14:42:24] like isn't there session management... [14:42:35] anyways feel free to chop up, put links, etc [14:42:37] yup we hold sessions in redis too [14:43:22] the 'job queue' page appears to describe the job runners [14:43:35] maybe it should be moved too [14:43:52] yup 'Job queue runners' [14:43:58] and move 'Redis' to 'Job queue' [14:44:13] then Redis could hold the general doc about redis (like how to connect to it) [14:44:18] ok but then pull the redis specific stuff like 'where it lives' and 'commands and configuration refs' and put em in a 'redis' page [14:44:22] yeah good [14:45:08] it's a very simple command syntax, one point in its favor [14:45:40] the simpler the command, the easier it is to write doc for it :-] [14:45:57] yep! [14:47:56] apergos: will you do it or are you expecting me to do it ? :D [14:48:04] uh [14:48:20] well... you have a vision so unless you are chomping at the bit to do reviews! 
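[Editor's note] The keys counted with zcard/llen earlier all follow one naming pattern; this is a small sketch of the per-wiki, per-type Redis key layout, reconstructed from the key names seen in this log (the suffix meanings are inferred from the counts above, not authoritative).

```shell
# Sketch: the job queue's Redis key layout as seen in this log,
# "<wiki>:jobqueue:<type>:<suffix>". Suffix meanings are inferred:
#   l-unclaimed - list of jobs waiting to be popped
#   z-claimed   - sorted set of jobs a runner has taken
#   z-delayed   - sorted set of jobs scheduled for later
#   z-abandoned - sorted set of jobs that exhausted their retries
wiki="commonswiki"
type="webVideoTranscode"
for suffix in l-unclaimed z-claimed z-delayed z-abandoned; do
  echo "${wiki}:jobqueue:${type}:${suffix}"
done
# Each key can then be counted with redis-cli, as done above:
#   redis-cli llen  commonswiki:jobqueue:webVideoTranscode:l-unclaimed
#   redis-cli zcard commonswiki:jobqueue:webVideoTranscode:z-abandoned
abandoned_key="${wiki}:jobqueue:${type}:z-abandoned"
```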
but if you would rather I will do it, say so [14:49:08] apergos: do the page moves and I can amend later [14:49:13] oh no [14:49:16] apergos: I will do it [14:49:20] move keeps the history [14:49:21] I am lame [14:50:32] ok it's all you [14:50:40] bah I can't delete files [14:50:54] apergos can you delete ? https://wikitech.wikimedia.org/w/index.php?title=Job_queue&redirect=no [14:51:22] ah so you can move on top of it [14:51:31] yup moving Redis to Job queue [14:51:58] done [14:56:00] done [14:56:05] https://wikitech.wikimedia.org/wiki/Job_queue is now only about job queue [14:56:06] yay [14:56:11] https://wikitech.wikimedia.org/wiki/Redis has the overview [14:56:14] and usage [14:56:33] uh the please add should go to the redis page [14:57:30] or alternatively make a red link from redis to that... [14:57:47] it's like real docs now :-D [14:58:00] do edit :-] [14:58:05] sure [14:58:16] ah yeah I forgot to move that one hehe [14:58:23] so now we need a [[user session]] article haha [14:59:50] well I didn't sign up to write that, since I haven't looked at it at all [15:00:01] we need more doc hehe [15:00:06] Ceph: https://wikitech.wikimedia.org/wiki/Ceph that is sparse [15:00:06] we do [15:00:35] yeah I found out how sparse it was the other day [15:03:33] Change abandoned: Hashar; "redactatron is being rewritten entirely, so there is no point in keeping this change around." [operations/software/redactatron] (master) - https://gerrit.wikimedia.org/r/60416 [15:41:32] New review: Reedy; "As above" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/65860 [15:55:51] New patchset: Ottomata; "Initial commit of zookeeper module." [operations/puppet/zookeeper] (master) - https://gerrit.wikimedia.org/r/66882 [15:57:18] New review: Ottomata; "Thanks Hashar!" [operations/puppet/zookeeper] (master) - https://gerrit.wikimedia.org/r/66882 [15:57:40] paravoid ^ whenever you get a chance [15:57:49] review time... 
[15:58:16] <^demon> Oh yeah, I meant to take a gander at that too. [16:01:48] <^demon> ottomata: Looking at line 19 in https://gerrit.wikimedia.org/r/#/c/66882/2/manifests/server.pp, can you explain exactly what's happening with the ->? [16:01:50] <^demon> I've not seen that. [16:04:25] yeah that is an explicit class dependency [16:04:31] there are several ways of doing that [16:04:39] its the same as [16:04:42] ottomata: you are welcome :-) (pep8 on zookeeper puppet repo) [16:04:45] I am off! [16:04:55] laters, thank you! [16:04:55] require zookeeper [16:04:55] except [16:04:58] it doesn't auto-include the class [16:05:03] which is better for parameterized classes [16:05:06] so [16:05:12] with that dependency listed like that [16:05:17] if you try to include zookeeper::server [16:05:24] and haven't already explicitly included zookeeper [16:05:29] puppet will throw an error [16:05:36] "better" as in, "require" doesn't work with param classes :) [16:05:46] well, it does if all the parameters have defaults [16:05:52] <^demon> Gotcha. Makes sense now. I just hadn't seen the syntax before :) [16:05:55] <^demon> Thanks for clarifying [16:06:04] if they have defaults and you're not modifying the defaults [16:06:08] yeah [16:06:11] so basically it's not parameterized [16:06:26] require will work though, even if you have changed the defaults [16:06:32] it will keep the defaults you set when you included the class [16:06:33] but [16:06:39] ^demon: -> is the requires syntax, it's not limited to class dependencies [16:06:49] so you could say Package['foo'] -> File['bar'] too [16:06:58] if there was a case where you wanted to be able to maybe include zookeeper::server [16:07:03] and were ok with all of the defaults [16:07:06] you can even say package { 'foo': ... } -> file { 'bar': ... 
} iirc, but don't do that [16:07:15] you might not want to force the user to explicitly include the zookeeper class [16:07:19] in this case I do [16:07:21] so -> is better [16:07:24] in most cases -> is better [16:08:18] <^demon> *nods* [16:14:15] <^demon> ottomata: Same file, line 46: "$myid = inline_template('<%= zookeeper_hosts.index(fqdn) + 1 %>')" [16:14:27] <^demon> Is this going to preserve order or do we need to slap a sort on it like so many other places? [16:14:34] New patchset: Ottomata; "Renaming role::hadoop classes to role::analytics::hadoop" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66976 [16:14:46] hmm, its an array, so it should be ok…. [16:14:47] PROBLEM - SSH on ms-be11 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:14:47] right? [16:14:52] this isn't a hash [16:14:59] ^demon [16:15:08] <^demon> Prolly, just thought it was worth mentioning. [16:15:13] yeah, i think its cool [16:15:15] <^demon> Bit me before, jumped out as a many gotcha. [16:15:21] <^demon> *maybe [16:15:28] have you had the problem with an array before? or just hashes? [16:15:32] <^demon> Hashes. [16:15:35] k [16:16:10] New review: Ottomata; "This is tested and running on labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66976 [16:16:43] yeah arrays are fine [16:17:07] <^demon> ottomata: I'm going to install puppetmaster::self and give this a whirl with solr. I think it'll probably all work as is (and will earn a +1 from me), but let's find out for sure. [16:18:49] ok cool! [16:19:40] hmm actually [16:19:52] ^demon, lemme remove the zookeeper cdh4 .deb from our apt, i think that will be a problem [16:21:59] paravoid, you told me there was something special I had to do to remove this from our apt [16:22:31] did I tell you I have to review 11 people? [16:22:33] :) [16:23:18] haha, nope but I believe it [16:25:02] heya akosiaris, you there? 
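[Editor's note] The inline_template question above comes down to whether Array#index is order-stable, and it is: Ruby arrays preserve element order, unlike the hash case mentioned earlier. A shell sketch of the same computation (the hostnames here are illustrative, not from the log):

```shell
# Sketch of what inline_template('<%= zookeeper_hosts.index(fqdn) + 1 %>')
# computes: the 1-based position of this host in the (ordered) host array,
# used as the zookeeper myid. Hostnames below are made-up examples.
zookeeper_hosts=("zk1001.example.net" "zk1002.example.net" "zk1003.example.net")
fqdn="zk1002.example.net"
myid=0
for i in "${!zookeeper_hosts[@]}"; do
  # array index is 0-based; zookeeper myid must be >= 1
  [ "${zookeeper_hosts[$i]}" = "$fqdn" ] && myid=$((i + 1))
done
echo "$myid"
```

As long as every host sees the same array in the same order, each derives a stable, unique id without any explicit sort.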
[16:26:37] PROBLEM - swift-container-server on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:26:47] PROBLEM - RAID on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:26:47] PROBLEM - swift-container-replicator on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:26:47] PROBLEM - swift-object-server on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:26:47] PROBLEM - swift-container-auditor on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:26:47] PROBLEM - swift-account-reaper on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:26:57] PROBLEM - swift-account-replicator on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:26:57] PROBLEM - swift-container-updater on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:26:57] PROBLEM - swift-object-updater on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:26:59] apergos: ^^^ [16:27:18] PROBLEM - swift-account-auditor on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:27:18] PROBLEM - DPKG on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:27:27] PROBLEM - Disk space on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:27:27] PROBLEM - swift-object-auditor on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:27:27] PROBLEM - swift-account-server on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:27:27] PROBLEM - swift-object-replicator on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:27:59] grrr [16:30:44] New patchset: MaxSem; "Serve mobile logos from the same domain to avoid charging Zero users for them" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66167 [16:30:45] [10107517.182566] BUG: soft lockup - CPU#10 stuck for 22s! 
[xfsaild/sdb3:28665] [16:30:45] [10107517.190565] Stack: [16:30:45] [10107517.193109] Call Trace: [16:30:45] [10107517.196325] Code: 90 90 90 90 90 90 55 b8 00 00 01 00 48 89 e5 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 13 66 0f 1f 84 00 00 00 00 00 f3 90 0f b7 07 <66> 39 d0 75 f6 5d c3 0f 1f 40 00 8b 17 55 31 c0 48 89 e5 89 d1 [16:30:54] junk on console, can't log in, power cycling [16:31:10] did you turn off the Dell power management stuff in the bios? [16:32:03] I didn't do anything to the bios [16:32:35] pretty sure I didn't set these up (not 100% but pretty sure) [16:33:07] PROBLEM - Host ms-be11 is DOWN: PING CRITICAL - Packet loss = 100% [16:33:18] where is it in the menus? [16:33:39] mark: didn't set up but it's possible that dell power mgmt is on...sbernardin was doing that for some of the new apaches [16:33:58] system profile or some such [16:34:03] needs to be set to "OS", not Dell [16:34:07] checking [16:34:16] it says 'OS' [16:34:35] ok that's good [16:34:43] anything else I oughta look for while in here? [16:35:23] mark [16:35:53] hyperthreading turned off [16:36:04] other than that, not really [16:37:37] logical processor is off [16:37:42] ok bring back up then [16:37:46] *bringing [16:38:55] while I wait, why do we have those two things turned off? [16:39:35] !log powercycled ms-be11, was unresponsive on console with lots of [10107517.182566] BUG: soft lockup - CPU#10 stuck for 22s! 
[xfsaild/sdb3:28665] etc [16:41:13] RECOVERY - swift-account-replicator on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [16:41:13] RECOVERY - DPKG on ms-be11 is OK: All packages OK [16:41:13] RECOVERY - swift-account-auditor on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [16:41:23] RECOVERY - swift-container-updater on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [16:41:23] RECOVERY - Host ms-be11 is UP: PING OK - Packet loss = 0%, RTA = 26.59 ms [16:41:23] RECOVERY - Disk space on ms-be11 is OK: DISK OK [16:41:23] RECOVERY - swift-object-replicator on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [16:41:23] RECOVERY - swift-object-auditor on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [16:41:24] RECOVERY - swift-account-server on ms-be11 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [16:41:24] RECOVERY - swift-container-server on ms-be11 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [16:41:33] RECOVERY - swift-object-updater on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [16:41:44] RECOVERY - swift-container-replicator on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [16:41:44] RECOVERY - swift-object-server on ms-be11 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [16:41:44] RECOVERY - RAID on ms-be11 is OK: OK: State is Optimal, checked 1 logical device(s) [16:41:44] RECOVERY - SSH on ms-be11 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [16:41:44] RECOVERY - swift-container-auditor on ms-be11 is OK: PROCS OK: 1 process 
with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:41:44] RECOVERY - swift-account-reaper on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [16:43:39] apergos: dell power management = the BIOS is taking decisions on CPU frequency (P-state) and C-states [16:43:51] so it can randomly slow down a server because it thinks it's consuming too much power or whatever [16:44:00] while (OS) passes that control to Linux [16:44:04] and the cpufreq driver [16:44:19] I wonder how the bios makes those decisions... ok thanks [16:44:19] which is better informed to take these decisions and probably much less buggy :) [16:44:22] yep [16:45:53] meh so much noise in the kern.log that it's hard to see what set it off [16:49:03] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100% [16:49:33] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [17:01:31] ottomata: the zookeeper deb from cdh4 is an explicit dependency of the hadoop package [17:01:36] aptitude why zookeeper [17:01:36] i hadoop Depends zookeeper (>= 3.4.0) [17:02:31] at first glance this can not be satisfied by zookeeper 3.3 in standard precise repo [17:02:50] WHAAAA [17:02:51] growl. [17:02:53] hm [17:03:01] hmmm [17:03:13] hmmmmm [17:03:20] growl. [17:03:22] welp hm [17:03:23] i mean [17:03:25] we can use either package [17:03:27] ^demon [17:03:29] ahahahaha [17:03:38] i just puppetized for ubuntu .deb though [17:03:47] because I was separating out the puppetization for zookeeper [17:03:52] i thought it would be better to use the ubuntu one [17:03:54] rather than cdh4 [17:04:01] ergh [17:04:13] the packages as installed are not easily puppetizable in the same way [17:04:17] that's a fine goal [17:04:18] different launch script, etc. [17:04:23] <^demon> Argh :( [17:04:37] hmm [17:04:55] i just realized the entire extent of the problem [17:05:10] we were hoping to use the zookeeper module somewhere else correct ?
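[Editor's note] On the power-management point above ([16:43]-[16:44]): when the BIOS "System Profile" is set to OS, Linux's cpufreq driver owns frequency scaling and exposes its state under sysfs. A generic check (standard cpufreq sysfs paths; not a command anyone ran in this log):

```shell
# If Linux's cpufreq driver is in control, these sysfs files exist and show
# which driver/governor is making P-state decisions. If the firmware (Dell
# power management) is in control instead, they may be absent.
for f in scaling_driver scaling_governor scaling_cur_freq; do
    path="/sys/devices/system/cpu/cpu0/cpufreq/$f"
    if [ -r "$path" ]; then
        echo "$f: $(cat "$path")"
    else
        echo "$f: not exposed (firmware may be managing frequency)"
    fi
done
```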
[17:05:35] using ubuntu's package and not cdh4... [17:05:55] <^demon> For solr. [17:06:00] <^demon> Was the plan [17:06:03] yes [17:06:04] i mean [17:06:09] i'm sure the cdh4 package would work for solr too [17:06:13] its even a newer version in cdh4 [17:06:19] but, i have to puppetize it one way or the other [17:06:29] and since I moved the zookeeper puppet stuff out of the cdh4 module [17:06:35] i think we should use the ubuntu .deb [17:06:36] they are slightly different from what i saw [17:06:38] right? [17:06:50] yeah, slightly different config files [17:06:57] package names zookeeperd ->zookeeper-server etc [17:07:07] e.g. /etc/default/zookeeper vs /etc/zookeeper/conf/zookeeper-env.sh [17:07:08] yeah [17:07:10] that too [17:07:18] i mean [17:07:28] i can install zookeeper ubuntu by manually specifying version [17:07:29] its still available [17:07:34] i could put that in the puppetization i guess [17:07:39] package zookeeper=.... [17:07:42] or whatever it is [17:08:06] yeah but that would not be enough given the other differences [17:08:24] you would basically have to maintain 2 different modules... [17:09:32] why is hadoop depending on zookeeper? [17:09:44] does it actually use it somewhere, or is it a design dependency [17:09:51] use its files, init scripts etc. I mean [17:10:29] if it's the latter, we can create dummy "zookeeper" packages that depends on the stock packages [17:10:40] like a transitional package, that's easy [17:11:11] I would like us to try really hard to avoid maintaining two modules [17:11:26] if there are very good reasons we could do that, but I'd prefer us not to [17:11:29] I am getting the feeling that this dependency might be wrong... 
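[Editor's note] The "dummy transitional package" idea floated above ([17:10]) is typically done with equivs. A sketch under assumptions: the version string and field values here are illustrative, not taken from the actual repos, and `Depends: zookeeperd` assumes the stock precise package name discussed earlier:

```shell
# Write an equivs control file for an empty "zookeeper" package whose only
# job is to satisfy hadoop's "zookeeper (>= 3.4.0)" dependency while the
# stock Ubuntu zookeeperd actually provides the service.
cat > zookeeper-dummy.ctl <<'EOF'
Section: misc
Priority: optional
Standards-Version: 3.9.2
Package: zookeeper
Version: 3.4.0-0wmf1
Depends: zookeeperd
Description: transitional dummy package
 Satisfies hadoop's "zookeeper (>= 3.4.0)" dependency while the stock
 Ubuntu zookeeperd provides the actual service.
EOF
# equivs-build zookeeper-dummy.ctl   # (from the equivs package) emits the .deb
```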
[17:11:36] naw [17:11:38] looking at the package for hadoop [17:11:42] akosiaris: not 2 modules [17:11:54] the zk module i'm writing would use zookeeper from ubuntu [17:12:01] and i'd just manually specify that it has to match the version from ubuntu [17:12:08] that package is still available via apt [17:12:12] i find this inside: /usr/lib/hadoop/lib/zookeeper-3.4.5-cdh4.2.1.jar [17:12:13] its just not the default when you do [17:12:16] apt-get install zookeeper [17:12:21] since we apt pin the wmf apt repo [17:12:31] ottomata: yeah ok... but what about the rest of the differences ? [17:12:38] config dirs ? package names ? [17:13:12] at least for the zookeeperd <-> zookeeper-server stuff i mean [17:15:02] !log reedy synchronized wmf-config/InitialiseSettings.php 'Enable vips on testwiki and mediawikiwiki' [17:15:49] as long as the cdh4 zookeeper package is not installed [17:15:50] it will be fine [17:15:54] i puppetized using the ubuntu one [17:17:40] And in the hadoop nodes? Where the cdh4 zookeeper will be installed because of the dependency? [17:18:03] grr this dependency is confusing me [17:18:12] what do they need it for ? [17:19:26] New patchset: Reedy; "Enable Vips on testwiki and mediawikiwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66982 [17:21:17] hm akosiaris, i'm not sure if it will be installed, but if it is, i think it won't hurt [17:21:34] i don't need to apply zookeeper puppetization to the hadoop nodes (at least not right now…) [17:21:41] I *could* [17:21:43] ergh [17:21:49] i could keep the cdh4 puppetization in the cdh4 module [17:21:54] and still have a separate zookeeper module [17:21:56] but ungh [17:21:58] that's dumb [17:22:11] i agree with paravoid on that, 2 modules would be annoying [17:22:16] i mean, i already did the work for the cdh4 one [17:22:18] so it wouldn't be more work [17:22:21] but mehhhh [17:23:39] i agree on that 2. Two modules to maintain would be a burden we don't need... 
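[Editor's note] Besides `package zookeeper=<version>` one-offs, the "manually specifying version" approach above can be made sticky with an apt pin, so upgrades keep tracking the Ubuntu build even though the wmf repo is pinned by default. An illustrative preferences fragment (the actual wmf pin priorities aren't shown in this log):

```
# /etc/apt/preferences.d/zookeeper (illustrative)
Package: zookeeper zookeeperd
Pin: release o=Ubuntu
Pin-Priority: 1001
```

A priority above 1000 makes apt prefer the Ubuntu origin even over a higher version elsewhere; `apt-cache policy zookeeper` shows which candidate wins.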
[17:23:41] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66982 [17:25:29] and yeah, installing cdh4 hadoop does install cdh4 zookeeper, gr [17:25:55] can we discuss this tomorrow? [17:25:56] wait, no i'm not sure, sorry, i remember that my previous puppetization installed the zookeeper package [17:26:12] yeah no worries paravoid, i think we know what we want [17:26:13] sorry, I just don't want to spend all night writing reviews :) [17:26:41] I know you know what we want, I just wanted to help :) [17:26:45] i'm pretty sure that what we have will work if I manually specify version in zookeeper puppetization, i just won't be able to puppetize zookeeper on hadoop nodes…which may or may not be ok [17:26:46] we'll see [17:27:08] yeah, no worries, the zk puppetization review is not a priority [17:27:15] i have meetings and have to do these reviews today too [17:34:17] New patchset: Kaldari; "Turning on Disambiguator for test and test2" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66987 [17:42:48] New review: Kaldari; "Not yet" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/66987 [17:53:31] paravoid: hadoop needs zookeeper to function [17:54:12] preilly, i'm not really sure why, or if that is really true [17:54:22] i haven't configured hadoop to use zookeeper at all [17:54:24] paravoid: hbase uses it [17:54:27] ahhh [17:54:31] that makes sense [17:54:35] we haven't used hbase yet, so [17:54:38] ja [17:54:57] cluster management like locks, leader election etc. [17:55:09] Hey mutante, anything I can do to help getting https://gerrit.wikimedia.org/r/#/c/65443/ finished?
[18:11:18] paravoid: you should check if both zookeeper packages use the same port of 2181 [18:12:08] if they're configured as different ports you could have both packages side-by-side without issue [18:14:08] New patchset: MaxSem; "Mobile redirect for Commons and Wikimania2013" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66991 [18:17:51] Is the wmflabs.org server hosted by the WMF? [18:18:04] Prodego: yes [18:18:22] though technically that domain isn't pointing at anything [18:18:45] Ryan_Lane: so there is no longer any legal reason for edit counter tools hosted at http://tools.wmflabs.org/ to be opt-in [18:19:06] I'm not sure what you mean [18:20:01] Ryan_Lane: edit counters that did month by month breakdowns were opt-in on the toolserver because of some german privacy law [18:20:24] but if wikimedia de is no longer hosting the tools... [18:21:11] Ryan_Lane: this is from https://wiki.toolserver.org/view/Rules#Privacy_Policy [18:21:31] Ryan_Lane: presumably that rule does not exist on wmflabs? [18:22:13] tools.wmflabs.org has the same privacy policy as wikimedia sites [18:22:33] if an edit counter would violate a user's privacy, then it likely isn't allowed. [18:22:37] ask Coren, though [18:23:07] I deny everything! [18:23:11] Ryan_Lane: well this would not violate the wikimedia privacy policy [18:23:19] Coren: thoughts? [18:24:53] That's actually a very good question, which I am loathe to answer without some help from Legal. [18:25:15] Coren: well I'd really like to go back to using these edit counters instead of the third party ones [18:25:17] I would expect it'd be okay by the privacy policy, but that the *community* might have reservations. 
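[Editor's note] On the port question raised at [18:11] above: 2181 is ZooKeeper's standard client port, so if both the Ubuntu and CDH4 packages ship defaults, co-installed daemons would fight over it unless one config is changed. A self-contained sketch of checking (sample config text stands in for the real `/etc/zookeeper/conf/zoo.cfg`, an assumed path):

```shell
# Extract clientPort from a zoo.cfg-style config; run against each package's
# config file to spot a conflict between side-by-side installs.
zoo_cfg='tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181'
port=$(printf '%s\n' "$zoo_cfg" | awk -F= '$1 == "clientPort" {print $2}')
echo "clientPort is $port"   # -> clientPort is 2181
```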
[18:25:37] Coren: month by month breakdowns are (well, as you know) useful [18:25:55] PROBLEM - Puppet freshness on mw1149 is CRITICAL: No successful Puppet run in the last 10 hours [18:26:13] Prodego: I would expect that monthly breakdowns should be okay; the information is publicly available and isn't that revealing. [18:26:16] Coren: is TParis still the person in charge of that tool? [18:26:27] Coren: are you from the US or europe or somewhere else? [18:26:41] Prodego: If it's on labs, you can see who the maintainers are on tools.wmflabs.org [18:26:56] Prodego: Somewhere else. North of the border in Canada. :-) [18:27:37] Coren: :) It seems that nearly everyone from the US and Canada find it quite reasonable to break down publicly available information by month [18:27:51] Coren: but many of the europeans consider that an invasion of privacy [18:27:55] PROBLEM - Puppet freshness on mw1096 is CRITICAL: No successful Puppet run in the last 10 hours [18:28:03] obviously that is generalizing quite a bit [18:28:43] Prodego: Different mores. I should expect a monthly breakdown would be okay because it's unlikely to provide much information not plainly visible. A day-by-hour graph is more revealing, and is probably iffy. [18:30:51] Prodego: what third party edit counts? [18:31:18] Nemo_bis: http://en.wikichecker.com/ is a good one [18:31:43] Nemo_bis: gives you nice detailed breakouts [18:31:54] when that rule was introduced, by the way, interiot's edit counter also told at what time of the day you were active and things like that [18:32:02] Nemo_bis: which was nice [18:32:05] wikichecker does that [18:32:47] looks useless, only en.wiki and few others?
[18:33:37] omg flash player [18:33:53] "nice" is surely a word I wouldn't use in relation to that site :) [18:34:18] Nemo_bis: for sure it is a lot less useful [18:34:28] Nemo_bis: that's why I'd like the original ones back [18:34:52] The opt-in behavior really killed the usefulness [19:00:16] PROBLEM - Swift HTTP on ms-fe4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:01:02] New review: Krinkle; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62923 [19:04:36] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:14] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:23] PROBLEM - Swift HTTP on ms-fe3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:24] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out [19:11:35] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out [19:11:35] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out [19:11:35] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:35] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:11:43] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out [19:11:55] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out [19:12:03] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out [19:12:07] PROBLEM - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:37] grrrr [19:14:13] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.685 second response time [19:15:44] PROBLEM - DPKG on ms-fe3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:15:45] PROBLEM - RAID on ms-fe3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:15:51] !log mlitn Started syncing Wikimedia installation... 
: Update ArticleFeedbackv5 to master [19:16:33] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:34] PROBLEM - SSH on ms-fe3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:55] RECOVERY - DPKG on ms-fe3 is OK: All packages OK [19:17:04] RECOVERY - RAID on ms-fe3 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [19:17:13] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:23] RECOVERY - SSH on ms-fe3 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:20:04] PROBLEM - DPKG on ms-fe3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:20:04] PROBLEM - Disk space on ms-fe3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:20:04] PROBLEM - RAID on ms-fe3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:20:53] RECOVERY - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.085 second response time [19:21:27] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 1.017 second response time [19:21:57] RECOVERY - Disk space on ms-fe3 is OK: DISK OK [19:21:57] RECOVERY - DPKG on ms-fe3 is OK: All packages OK [19:21:57] RECOVERY - RAID on ms-fe3 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [19:22:02] I have no idea what is going on, should anyone else be thinking of looking at these [19:23:59] PROBLEM - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is CRITICAL: HTTP CRITICAL - No data received from host [19:24:57] RECOVERY - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 3.439 second response time [19:25:35] RECOVERY - Swift HTTP on ms-fe3 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 0.075 second response time [19:25:35] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.374 second response time [19:25:35] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved 
Permanently - 747 bytes in 0.083 second response time [19:25:35] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.082 second response time [19:25:36] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.083 second response time [19:25:43] !log restarted swift-proxy on ms-fe3 and 4 (swapping) [19:25:44] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.058 second response time [19:25:45] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.070 second response time [19:25:45] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.879 second response time [19:25:55] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.068 second response time [19:25:55] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 61645 bytes in 0.234 second response time [19:26:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:28:00] PROBLEM - DPKG on ms-fe3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:28:10] PROBLEM - RAID on ms-fe3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:28:30] didn't take on ms-fe3 [19:28:52] RECOVERY - DPKG on ms-fe3 is OK: All packages OK [19:29:00] RECOVERY - RAID on ms-fe3 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [19:29:20] redid [19:29:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [19:30:32] RECOVERY - Swift HTTP on ms-fe4 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 0.097 second response time [19:30:40] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 0.063 second response time [19:32:06] !log restarting ms-fe1/4 proxies, 100% cpu writing to a ENOTCONN fd [19:32:29] looks better [19:33:27] no idea what caused them to go out to lunch like that [19:33:59] New patchset: GWicke; "New Parsoid Varnish puppetization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63890 [19:34:28] back ends still look pretty unhappy [19:35:54] New review: GWicke; "Removed the duplicate purge logic in vcl_{miss,hit} as pointed out by Asher." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/63890 [19:38:18] New patchset: GWicke; "New Parsoid Varnish puppetization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63890 [19:39:10] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63890 [19:39:27] !log "swift-init all restart" on all swift backends, CPU runaway threads, see above [19:40:05] I so don't care what's wrong with it right now [19:41:57] there, better than before [19:41:59] going out [19:42:30] have fun [19:56:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [20:20:02] New patchset: Catrope; "Fix parsoid-common VCL inclusion and references" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67003 [20:21:33] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67003 [20:31:53] New patchset: ArielGlenn; "add registration info to user info wikiretriever can get" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/67005 [20:33:12] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/67005 [20:33:50] !log contacts.wm is back up, manually brought back up from tridge backups on zirconium, because singer is dead.. that also means.. singer is gone forever [20:34:34] New patchset: Catrope; "Actually use new Parsoid Varnish puppetization in production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67006 [20:35:56] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67006 [20:48:25] New patchset: Andrew Bogott; "Pep8 cleanups:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67008 [20:58:37] Change abandoned: Andrew Bogott; "I've made Jenkins changes to support per-dir .pep8 files. 
So this isn't needed anymore." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61999 [21:00:46] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61891 [21:17:15] New patchset: RobH; "RT 2640 netmon1001 as smokeping server in eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67012 [21:17:15] New patchset: RobH; "netmon1001 to be new smokeping host in eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67013 [21:17:55] Change abandoned: RobH; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67013 [21:19:21] sleep! [21:19:22] now! [21:19:25] *poof* [21:20:21] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67012 [21:24:18] mutante: Could you update the Watchmouse check for Parsoid to hit port 80 instead of port 6081 please? [21:29:03] RoanKattouw: "You have changed this monitor substantially, do you want to start a new log, or do you want to continue appending to the current log?" [21:29:22] Either is fine, I don't care [21:29:30] It's the same service, I just moved it to a different port [21:29:44] Which... I should probably update the MW config shouldn't I.... 
[21:29:59] done and saved [21:30:04] Thanks man [21:30:07] keeps old log and appends, np [21:31:02] New patchset: Catrope; "Parsoid Varnish no longer uses port 6081" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/67016 [21:31:26] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/67016 [21:32:28] !log update parsoid watchmouse monitoring from port 6081 to port 80 [21:32:59] !log catrope synchronized wmf-config/CommonSettings.php 'Update Parsoid Varnish port' [21:34:35] !log catrope synchronized wmf-config/CommonSettings.php 'Temp hack around Parsoid cache LVS breakage' [21:35:29] New patchset: Catrope; "Move parsoidcache LVS from port 6081 to port 80" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67017 [21:37:15] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67017 [21:43:35] !log Restarting pybal on lvs1006 [21:44:30] !log Restarting pybal on lvs1003 [21:45:49] !log catrope synchronized wmf-config/CommonSettings.php 'Undo temp hack' [21:46:17] New patchset: Andrew Bogott; "Pep8 cleanups:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67008 [21:46:17] New patchset: Andrew Bogott; "Pep8 cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67020 [21:46:18] New patchset: Andrew Bogott; "Pep8 cleanups; mostly whitespace." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67021 [21:46:18] New patchset: Andrew Bogott; "Pep8 cleanup stage one: tab purge!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67022 [22:09:03] andrewbogott: you are evil :-] [22:09:11] I'm only getting started [22:09:23] thanks for leading the effort on this! [22:09:30] +1 [22:09:57] andrewbogott: and your pep8 wrapper is a nice trick :-] [22:11:00] I have mixed feelings about using .pep8 exceptions… but at least this way they'll get reviewed and considered I guess. 
[22:11:31] When I was writing that yesterday I thought, "Of course, /I/ will never add exceptions" and today I am already adding a few [22:13:23] New review: Hashar; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67008 [22:13:48] andrewbogott: that is a good way to achieve progress [22:13:54] we can get rid of the exceptions later on [22:14:26] andrewbogott: left a comment on https://gerrit.wikimedia.org/r/#/c/67008/2/files/ldap/scripts/.pep8,unified [22:14:36] !log reedy synchronized php-1.22wmf5/extensions/SecurePoll/ [22:14:54] I usually add a comment line listing what the error code is such as: [22:14:54] # E501 line too long (79 chars) [22:14:55] ignore = E501 [22:15:02] Yep, good idea. [22:15:03] so we have a clue, but feel free to ignore it :-) [22:15:25] the pep8 source code has detailed explanations as well [22:17:49] andrewbogott: also some python scripts come from upstream :D [22:17:54] such as the ganglia plugins [22:18:07] there might be others which one would want to basically ignore [22:18:36] man, the pep8 tool is super bad at parsing its config [22:18:44] did a bunch of them for ganglia https://github.com/ganglia/gmond_python_modules/pull/109 [22:19:07] Oh, yeah, if they're upstream then I should leave them alone. Is there a README or something that says that? [22:19:32] not that I know [22:19:42] it seems we just cherry picked the ganglia plugins we needed [22:19:59] that led me to find out there is one to monitor jenkins :-] [22:21:22] I am off, happy tweaking!
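[Editor's note] Putting hashar's suggestion above ([22:14]) together, a per-directory `.pep8` file for the Jenkins wrapper would look roughly like this; the `[pep8]` section name and the W191 code are an assumed illustration of the format, with each ignored code documented in a comment as suggested:

```
# files/ldap/scripts/.pep8 (path from the review above; format assumed)
[pep8]
# E501 line too long (79 chars)
# W191 indentation contains tabs
ignore = E501,W191
```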
[22:22:28] New patchset: Catrope; "Fix monitoring for Parsoid caches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67029 [22:23:13] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67029 [22:23:42] New patchset: Andrew Bogott; "Pep8 cleanups:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67008 [22:23:58] New patchset: Andrew Bogott; "Pep8 cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67020 [22:24:15] New patchset: Andrew Bogott; "Pep8 cleanups; mostly whitespace." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67021 [22:25:53] New patchset: Andrew Bogott; "Pep8 cleanup stage one: tab purge!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67022 [22:35:00] New patchset: Andrew Bogott; "Suppress a jillion pep8 warnings for these files." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67030 [22:52:21] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [23:03:01] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [23:55:29] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [23:55:29] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [23:55:29] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [23:55:29] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [23:55:29] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [23:55:30] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: No successful Puppet run in the last 10 hours [23:55:30] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [23:55:31] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 
hours [23:55:31] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours [23:55:32] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [23:55:32] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [23:55:33] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours