[00:02:36] (PS21) Catrope: Citoid puppetization [puppet] - https://gerrit.wikimedia.org/r/163068
[00:09:20] (PS1) Dzahn: delete the noc.wikimedia.org SSL cert [puppet] - https://gerrit.wikimedia.org/r/164001
[00:10:46] (PS1) Reedy: Stop sending apache syslogs to remote syslog [puppet] - https://gerrit.wikimedia.org/r/164002
[00:10:49] ori: ^^
[00:10:55] Seems redundant with them going to fluorine :)
[00:12:12] (CR) Dzahn: [C: 1] "we should not be logging to NFS anymore" [puppet] - https://gerrit.wikimedia.org/r/164002 (owner: Reedy)
[00:12:49] (CR) Dzahn: "https://rt.wikimedia.org/Ticket/Display.html?id=7295" [puppet] - https://gerrit.wikimedia.org/r/164002 (owner: Reedy)
[00:14:34] (CR) Ori.livneh: [C: 1] Stop sending apache syslogs to remote syslog [puppet] - https://gerrit.wikimedia.org/r/164002 (owner: Reedy)
[00:15:22] (CR) Dzahn: "also:" [puppet] - https://gerrit.wikimedia.org/r/164002 (owner: Reedy)
[00:15:23] I'm not sure if httpd is right.. Or it should be apache per the log file name
[00:15:31] What's the SSL cert for noc.wikimedia.org about?
[00:15:38] Carmela: deleting it
[00:15:42] Deleting noc?
[00:15:48] no, the cert
[00:15:51] Ah.
[00:16:11] noc is now: noc.wikimedia.org is an alias for misc-web-lb.eqiad.wikimedia.org.
[00:16:16] before it was fenari
[00:16:20] Ah.
[00:16:26] fenari is dying.
[00:16:29] so now it is behind that, and uses the star cert there
[00:16:31] yes, that
[00:16:38] Got it.
[00:16:51] Tim and I discussed killing the conf part of noc at some point.
[00:17:08] It could probably be a wiki page somewhere.
[00:17:25] (CR) Reedy: Stop sending apache syslogs to remote syslog (1 comment) [puppet] - https://gerrit.wikimedia.org/r/164002 (owner: Reedy)
[00:17:29] And I guess the home directories got moved, so I'm not sure what else is there.
[00:17:43] Carmela: we have 3 things now: noc - like before but minus the user dirs and pybal, people - the user public_html, config-master - pybal
[00:17:57] It's a nice memorable URL.
[00:18:01] Carmela: what else - http://noc.wikimedia.org/dbtree/
[00:18:04] I liked the nocnocnocnoc home page.
[00:18:17] config-master.wikimedia.org?
[00:18:20] Carmela: also see http://config-master.wikimedia.org/
[00:19:01] Carmela: we would still need something that keeps the config files updated now
[00:19:18] Carmela: also, does it still make sense which configs are selected to be shown there?
[00:19:33] Keep which config files updated?
[00:19:41] " a selection of Wikimedia configuration files "
[00:19:41] Not sure what you're discussing.
[00:19:51] Carmela: http://noc.wikimedia.org/conf/
[00:19:51] Well, it's a memorable URL and a pretty page.
[00:19:58] We could just make it a wiki page and point to GitBlit.
[00:20:00] Or GitHub.
[00:20:06] Or wherever else.
[00:20:21] /conf/ used to be manually updated. Eventually symlinks were put in place. Then came Git.
[00:20:26] (PS22) Catrope: Citoid puppetization [puppet] - https://gerrit.wikimedia.org/r/163068
[00:20:28] it's a "selection of files", but that must be kept current, from git
[00:20:57] can it just link to gitblit and show all files
[00:21:08] All files in what?
[00:21:22] dunno, all files as opposed to just a "selection" of them
[00:21:27] All the files?
[00:21:31] :-)
[00:21:35] The selection is useful.
[00:21:40] I think.
[00:21:52] notes how the Apache config section is empty
[00:22:11] It should probably be a wiki page.
[00:22:21] I'll say again. :-)
[00:22:35] git.wikimedia.org isn't bad, but sometimes I really just want to look up a config setting.
[00:22:42] fine with all, we just moved it for now because of the due date
[00:22:54] So noc.wikimedia.org/conf/ redirecting to a wiki page would be fine.
[00:22:55] i even copied the files over manually that were broken symlinks..
[00:23:08] Yeah, I noticed the lucene config file is broken?
[00:23:12] I meant to file a ticket.
[00:23:19] i suppose we dont use that lucene config file anymore
[00:23:30] We still use Lucene, though.
[00:23:39] http://noc.wikimedia.org/conf/highlight.php?file=lucene-common.php
[00:23:40] that?
[00:23:46] i think i fixed it when moving it, heh
[00:25:21] Is pybal a load balancer?
[00:25:44] (PS2) Dzahn: Stop sending apache syslogs to remote syslog [puppet] - https://gerrit.wikimedia.org/r/164002 (owner: Reedy)
[00:26:13] mutante: I meant https://noc.wikimedia.org/conf/highlight.php?file=lsearch-global-2.1.conf
[00:26:16] Which is linked at the bottom.
[00:26:21] I don't remember the Apache section being broken...
[00:26:33] "PyBal is a LVS monitoring script quite similar to lvsmon. It's written in Python using the Twisted framework."
[00:26:49] LVS...
[00:27:08] Apache section has moved around and stuff..
[00:27:17] operations/apache-config.git is a useless link now with them being in puppet
[00:27:40] manifests/search.pp: file { '/a/search/conf/lsearch-global-2.1.conf':
[00:27:49] /a sounds wrong nowadays
[00:28:09] NFI. The symlink points to /home/w/something I think.
[00:28:14] yea, Apache, what Reedy said, they are moved around to mw module
[00:28:36] wikitech.wikimedia.org page then, I suppose.
[00:29:33] hmm, you know what
[00:29:39] those symlinks have been broken again
[00:29:46] i guess by sync-common or something
[00:30:32] Reedy: @terbium:/usr/local/apache/common/docroot/noc/conf is it synced by scripts?
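The broken symlinks under noc/conf mentioned above could be listed with something like the following. This is only a rough sketch, not a script used in the channel: the function name find_broken_symlinks is made up for illustration, and the directory to scan is whatever the caller passes in.

```python
import os


def find_broken_symlinks(root):
    """Return sorted paths under root that are symlinks whose target is missing."""
    broken = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it to the target.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)
```

Calling find_broken_symlinks on the conf directory would return the links that need replacing in the repo, assuming the process can read that directory.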
[00:30:55] we gotta replace the links in repo then, for some reason i thought i can just fix it manually
[00:33:40] (PS2) Dzahn: redirect noc user homedirs to people.wm.org [puppet] - https://gerrit.wikimedia.org/r/163756
[00:34:35] (CR) Dzahn: "Chmarkine, yea, thanks, i removed that part here https://gerrit.wikimedia.org/r/#/c/163312/" [puppet] - https://gerrit.wikimedia.org/r/163756 (owner: Dzahn)
[00:35:06] Carmela: btw https://gerrit.wikimedia.org/r/#/c/163756/ ?
[00:36:56] mutante: Who's Chmarkine?
[00:38:56] Carmela: https://wikitech.wikimedia.org/wiki/User:Chmarkine/HTTPS
[00:39:53] somebody interested in improving SSL stuff and volunteering
[00:39:59] don't ask me why https://en.wikipedia.org/wiki/User:Chmarkine though
[00:40:51] Carmela: https://gerrit.wikimedia.org/r/#/q/owner:Chmarkine+status:merged,n,z
[00:41:08] mutante: sync-docroot I guess
[00:41:37] Ah, nice.
[00:41:40] Yay volunteers!
[00:41:50] https://github.com/wikimedia/operations-mediawiki-config/blob/master/docroot/noc/conf/index.php
[00:43:09] (CR) MZMcBride: "Other than the inline comment about the flag, this looks fine to me. I think the domain name is a bit presumptive, but that's not really r" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/163756 (owner: Dzahn)
[00:44:17] mutante: Oh, nice chart.
[00:44:32] Carmela: yea, i like it much, also see linked bugs:)
[00:44:34] thanks to both of you, be back in a little
[00:44:37] just need food
[01:07:13] (PS1) PleaseStand: Remove "Protection for unusual entry points" [mediawiki-config] - https://gerrit.wikimedia.org/r/164008
[01:07:15] (PS1) PleaseStand: Don't load php_utfnormal.so using dl() [mediawiki-config] - https://gerrit.wikimedia.org/r/164009
[01:07:17] (PS1) PleaseStand: Don't include DefaultSettings.php or set $DP [mediawiki-config] - https://gerrit.wikimedia.org/r/164010
[01:07:19] (PS1) PleaseStand: Remove obsolete profiling settings [mediawiki-config] - https://gerrit.wikimedia.org/r/164011
[01:07:21] (PS1) PleaseStand: Remove obsolete flags from $wgAntiLockFlags [mediawiki-config] - https://gerrit.wikimedia.org/r/164012
[01:07:23] (PS1) PleaseStand: Remove $wgCentralAuthSilentLogin and $wgCentralAuthUseOldAutoLogin [mediawiki-config] - https://gerrit.wikimedia.org/r/164013
[01:07:25] (PS1) PleaseStand: Remove various settings removed in mediawiki/core [mediawiki-config] - https://gerrit.wikimedia.org/r/164014
[01:07:27] (PS1) PleaseStand: Remove $wgCategoryTreeDynamicTag [mediawiki-config] - https://gerrit.wikimedia.org/r/164015
[01:07:29] (PS1) PleaseStand: Remove $wgNoticeRunMessageIndexRebuildJobImmediately [mediawiki-config] - https://gerrit.wikimedia.org/r/164016
[01:07:31] (PS1) PleaseStand: Remove $wgUploadWizardConfig['disableResourceLoader'] [mediawiki-config] - https://gerrit.wikimedia.org/r/164017
[01:08:50] (CR) PleaseStand: "https://gerrit.wikimedia.org/r/#/q/project:operations/mediawiki-config+topic:remove-cruft,n,z" [mediawiki-config] - https://gerrit.wikimedia.org/r/154442 (owner: PleaseStand)
[01:12:58] (PS23) Catrope: Citoid puppetization [puppet] - https://gerrit.wikimedia.org/r/163068
[01:19:08] I love PleaseStand.
[01:22:15] (CR) MZMcBride: "Thank you very much for this change the others in the series.
:-)" [mediawiki-config] - https://gerrit.wikimedia.org/r/164010 (owner: PleaseStand)
[01:31:12] (PS1) Ori.livneh: Grafana: use LogStash's ElasticSearch for dashboard storage [puppet] - https://gerrit.wikimedia.org/r/164019
[01:33:16] (PS2) Ori.livneh: Grafana: use LogStash's ElasticSearch for dashboard storage [puppet] - https://gerrit.wikimedia.org/r/164019
[01:33:31] (CR) MZMcBride: "I briefly looked through the "git log" output for mediawiki/extensions/UploadWizard.git and $wgUploadWizardDisableResourceLoader definitel" [mediawiki-config] - https://gerrit.wikimedia.org/r/164017 (owner: PleaseStand)
[01:35:06] (CR) Ori.livneh: [C: 2] "Verified with PCC: http://puppet-compiler.wmflabs.org//393/change/164019/html" [puppet] - https://gerrit.wikimedia.org/r/164019 (owner: Ori.livneh)
[01:41:53] (PS3) Dzahn: redirect noc user homedirs to people.wm.org [puppet] - https://gerrit.wikimedia.org/r/163756
[01:42:08] (CR) Dzahn: redirect noc user homedirs to people.wm.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/163756 (owner: Dzahn)
[01:43:11] (PS1) Ori.livneh: grafana::web: include mod_proxy_balancer [puppet] - https://gerrit.wikimedia.org/r/164020
[01:43:16] (CR) jenkins-bot: [V: -1] grafana::web: include mod_proxy_balancer [puppet] - https://gerrit.wikimedia.org/r/164020 (owner: Ori.livneh)
[01:43:19] (PS2) Ori.livneh: grafana::web: include mod_proxy_balancer [puppet] - https://gerrit.wikimedia.org/r/164020
[01:43:43] (CR) Ori.livneh: [C: 2 V: 2] grafana::web: include mod_proxy_balancer [puppet] - https://gerrit.wikimedia.org/r/164020 (owner: Ori.livneh)
[01:45:25] (CR) Dzahn: "eh, wait a minute, this already works without this?:)" [puppet] - https://gerrit.wikimedia.org/r/163756 (owner: Dzahn)
[01:48:55] (CR) Dzahn: "eh yea, http://noc.wikimedia.org/~dzahn/ vs. http://people.wikimedia.org/~dzahn/ already works without having to add the redirect.
don't " [puppet] - https://gerrit.wikimedia.org/r/163756 (owner: Dzahn)
[01:52:09] (CR) Dzahn: [C: 2] "removes comments only" [puppet] - https://gerrit.wikimedia.org/r/163994 (owner: Dzahn)
[01:53:24] (CR) Dzahn: [C: 2] "removes an icinga checkcommand that isn't used anywhere - will check neon anyways" [puppet] - https://gerrit.wikimedia.org/r/163788 (owner: Dzahn)
[01:53:48] (PS2) Dzahn: search - remove commented check_lucene_frontend [puppet] - https://gerrit.wikimedia.org/r/163994
[01:56:48] (CR) Dzahn: [C: 1] remove pmtpa from all $domain_search [puppet] - https://gerrit.wikimedia.org/r/159441 (owner: Dzahn)
[01:59:25] (CR) Dzahn: "- command_name check_lucene_frontend" [puppet] - https://gerrit.wikimedia.org/r/163788 (owner: Dzahn)
[02:06:53] (PS3) Dzahn: remove fenari from dsh and install-server [puppet] - https://gerrit.wikimedia.org/r/163315
[02:10:24] (PS2) Dzahn: remove fenari [dns] - https://gerrit.wikimedia.org/r/163313
[02:13:03] mutante: Is there a known issue with ldap? Jenkins is rejecting login
[02:14:00] Krinkle: i don't think it's known specifically to affect jenkins, but it could be related to virt0 being down
[02:14:25] virt0 was shut down earlier today and first thought would be something wasnt switched over to virt1000/ldap-eqiad
[02:15:36] yea, so most services have these AuthLDAPURL "ldaps://virt0.wikimedia.org virt1000.wikimedia.org .. and virt1000 should still work
[02:15:42] (CR) Krinkle: "It works, but that url does not redirect." [puppet] - https://gerrit.wikimedia.org/r/163756 (owner: Dzahn)
[02:16:47] AAAARRRGH
[02:17:16] precise has git version 1.7.9.5 and trusty has 1.9.1
[02:18:07] And OF COURSE there is an incompatibility between the two which means that if you're using a 1.9 client that's trying to fetch a submodule from a 1.7 clone doubling as a server, it fails
[02:18:22] mutante: well, I just know it's not working. What's being done about it?
[02:18:30] it doesn't seem to be recent
[02:18:37] Which means submodule support in trebuchet is broken if the deployment master is on precise and the target is on trusty
[02:18:38] oh, it is recent
[02:18:50] like, it broke today?
[02:18:51] Which of course is exactly what's happening with mathoid/citoid
[02:19:00] Krinkle: this " 20:55 andrewbogott: powering down virt0, just to see what breaks"
[02:19:05] wat?
[02:21:27] https://github.com/search?q=virt0+%40wikimedia&type=Code&utf8=%E2%9C%93
[02:21:39] maybe fix the known ones first?
[02:21:39] i think the reasoning is they all also have virt1000 configured
[02:21:40] and that is still up
[02:21:40] analytics-kraken looks like it would be affected
[02:21:40] Krinkle: can you see where the jenkins config bit is though?
[02:22:14] i can confirm icinga login still works. must be using virt1000
[02:23:36] is virt1000 going away, too?
[02:23:49] no
[02:24:02] I can't log in to Jenkins to fix the config
[02:24:10] it's just about tampa, 1000 things are eqiad
[02:24:32] can it be fixed on shell?
[02:24:39] maybe, I'll look into it
[02:24:53] might require me abusing some rights here and there and sql patch it
[02:25:11] !log LocalisationUpdate completed (1.24wmf22) at 2014-10-01 02:25:11+00:00
[02:25:17] hmm.. that or we gotta request that virt0 comes back temp.
[02:25:22] Logged the message, Master
[02:25:24] how urgent?
[02:26:52] or hack /etc/hosts to trick it into thinking virt1000 is virt0 so you can get in once and fix it ? hrmmm
[02:27:11] Well, Jenkins powers our CI infrastructure. While the plain boolean build result is public and readable by logged-out users. I can't log in to look at any detailed data. So nobody on staff right now can investigate a jenkins failure if they have a problem with a patch set or something.
[02:27:47] It's kind of like revoking the ability to run puppet on a server. Nothing is broken, but if you need to do anything, you can't.
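The reasoning above ("they all also have virt1000 configured") relies on Apache's mod_authnz_ldap accepting several space-separated hosts in a single AuthLDAPURL and trying them in turn. A small sketch of listing the hosts in such a value, to spot services that name only one server, might look like this. The helper name ldap_url_hosts is an assumption, and the base DN in the usage note is a placeholder, not the real one, which is elided in the log.

```python
def ldap_url_hosts(url):
    """Split an AuthLDAPURL-style value into (scheme, [hosts]).

    Hosts sit between '://' and the first '/' and are separated by spaces.
    """
    scheme, sep, rest = url.partition("://")
    if not sep:
        raise ValueError("not an LDAP URL: %r" % url)
    # Everything after the first '/' (base DN, attributes, filter) is ignored here.
    host_part = rest.split("/", 1)[0]
    return scheme, host_part.split()
```

For example, ldap_url_hosts("ldaps://virt0.wikimedia.org virt1000.wikimedia.org/dc=example,dc=org") yields two hosts, while a value that names a single host yields one and so has no fallback.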
[02:28:08] I was in the middle of some routine maintenance and wanting to log in to verify things aren't regressing
[02:28:13] h/o
[02:29:07] Hmm, I might be wrong here
[02:29:26] manual editing doesn't work indeed. It only re-reads config from disk when it knows about it. I'd have to do a restart and update the config file when it's offline. Startup takes upto an hour.
[02:29:29] I think maybe I just need to run update-server-info?
[02:29:46] I can hack /etc/hosts maybe though
[02:30:06] or rather, a root can. Can you patch hosts on gallium ?
[02:30:18] Krinkle: i can just power it back up and we see if it works
[02:30:24] virt0 that is
[02:30:39] OK
[02:30:40] it sounds urgent but not urgent enough to page andrew though
[02:30:46] Yeah, it's fixable.
[02:30:59] assuming that's even the reason Jenkins login is broken
[02:31:10] yea, i was about to say the same.. that was still guessing
[02:31:16] but it seems likely
[02:31:41] !log mw1053 flooding exception logs with: "Unrecognized job type 'EchoNotificationDeleteJob'." Disabling jobrunner & Puppet
[02:31:45] Logged the message, Master
[02:31:55] !log virt0 - powering back up, suspecting it broke jenkins login
[02:31:59] Logged the message, Master
[02:32:08] HA
[02:32:10] That's it
[02:32:12] git update-server-info
[02:33:30] Which Trebuchet totally does but I manually recloned
[02:33:32] My bad, my bad
[02:34:28] Krinkle: i'd say analytics-kafka must have been affected for sure though, from glancing at that config
[02:34:44] server coming back up.
hold on
[02:35:15]
[02:35:15] ldaps://virt0.wikimedia.org:636
[02:35:27] :p that's it
[02:35:38] (that's from shell, digging into the config)
[02:36:16] RECOVERY - Host virt0 is UP: PING OK - Packet loss = 0%, RTA = 31.09 ms
[02:36:43] !log LocalisationUpdate completed (1.25wmf1) at 2014-10-01 02:36:43+00:00
[02:36:48] Logged the message, Master
[02:37:14] RECOVERY - Host labs-ns0.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 31.09 ms
[02:37:17] Krinkle: it's fixed
[02:37:23] you can login again
[02:37:28] mutante: yep, jenkisn login, too
[02:37:32] opendj is up
[02:37:52] yea, so it is definitely that, gonna leave it at this for now
[02:37:54] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 2.446 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.155.135
[02:38:55] !log bringing virt0 back up did indeed fix login on jenkins , also analytics-kafka appears to be still using it
[02:39:00] Logged the message, Master
[02:39:55] mutante: So what should I replace ldaps://virt0.wikimedia.org:636 with
[02:40:12] Krinkle: should be virt1000
[02:42:03] !log jenkins config used virt0, login was needed though to change the config. blocked Krinkle
[02:42:09] Logged the message, Master
[02:44:02] andrewbogott_afk: for the backlog ping ^
[02:46:00] (PS3) Dzahn: decom tarin (pmtpa poolcounter) [puppet] - https://gerrit.wikimedia.org/r/152154
[02:49:12] mutante: I've updated the config. Not sure how long it takes to apply.
[02:50:35] PROBLEM - NTP on virt0 is CRITICAL: NTP CRITICAL: Offset unknown
[02:54:40] Krinkle: ok, i'm leaving it like this for now..
going to be afk
[02:54:45] RECOVERY - NTP on virt0 is OK: NTP OK: Offset 0.005471587181 secs
[02:56:30] mutante: thx
[02:58:45] (PS5) Krinkle: contint: Package 'php5-parsekit' is absent on Trusty, don't require it [puppet] - https://gerrit.wikimedia.org/r/161748 (https://bugzilla.wikimedia.org/68255)
[03:21:26] PROBLEM - puppet last run on es7 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:24:37] (PS1) Mattflaschen: Have production and Labs Redis sessions use same structure. [mediawiki-config] - https://gerrit.wikimedia.org/r/164027 (https://bugzilla.wikimedia.org/59838)
[03:24:57] (CR) Andrew Bogott: [C: 1] Point beta redis at the domain instead of ip [puppet] - https://gerrit.wikimedia.org/r/163973 (https://bugzilla.wikimedia.org/71484) (owner: EBernhardson)
[03:25:36] (Abandoned) Mattflaschen: Change how GettingStarted Redis server IP is determined [mediawiki-config] - https://gerrit.wikimedia.org/r/163547 (https://bugzilla.wikimedia.org/59838) (owner: Mattflaschen)
[03:31:52] mutante or Krinkle|detached, now that virt0 is up and you can fix jenkins, did you change the config?
[03:32:04] Or is it still relying on virt0?
[03:32:25] Oh, wait, I see Krinkle: mutante: I've updated the config. Not sure how long it takes to apply.
[03:33:06] So… please email me if this still needs attention… virt0 is likely to die tomorrow whether we like it or not. I shut it down a day early to flush issues just like this one :/
[03:34:22] thanks mutante!
[03:34:28] i'm getting error notices from Mail Delivery System wrt ops-request@rt.wikimedia.org
[03:34:36] "Delay reason: local delivery failed"
[03:34:44] This message was created automatically by mail delivery software.
[03:34:44] A message that you sent has not yet been delivered to one or more of its
[03:34:44] recipients after more than 24 hours on the queue on magnesium.wikimedia.org.
[03:35:05] The message identifier is: 1XYia1-0003Rb-24
[03:35:05] The date of the message is: Mon, 29 Sep 2014 17:33:59 -0400
[03:35:05] The subject of the message is: Re: Deploying Parsoid and OCG
[03:35:16] cscott: I don't know any details, but I believe mark was working on some mail relay issues earlier today.
[03:35:38] well, as of 53 minutes ago there still existed mail relay issues ;)
[03:35:45] Might be worth mailing the ops list so that folks in Europe can catch up before tomorrow AM
[03:40:48] RECOVERY - puppet last run on es7 is OK: OK: Puppet is currently enabled, last run 62 seconds ago with 0 failures
[03:47:15] andrewbogott: done.
[03:51:23] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Oct 1 03:51:23 UTC 2014 (duration 51m 22s)
[03:51:27] Logged the message, Master
[03:54:17] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Puppet last ran 572969 seconds ago, expected 14400
[03:54:49] ^ me
[03:55:17] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[03:59:16] (PS2) KartikMistry: Added initial Debian packaging [debs/contenttranslation/lttoolbox] - https://gerrit.wikimedia.org/r/163548
[04:41:17] (PS2) KartikMistry: Added initial Debian packaging [debs/contenttranslation/apertium] - https://gerrit.wikimedia.org/r/163577
[04:42:40] (PS1) Springle: Set join_cache_level = 2 for labsdbs. [puppet] - https://gerrit.wikimedia.org/r/164035
[04:43:28] (PS2) KartikMistry: Add initial Debian packaging [debs/contenttranslation/cg3] - https://gerrit.wikimedia.org/r/163579
[04:44:07] (CR) Springle: [C: 2] Set join_cache_level = 2 for labsdbs.
[puppet] - https://gerrit.wikimedia.org/r/164035 (owner: Springle)
[04:45:53] (PS2) KartikMistry: Add initial Debian packaging [debs/contenttranslation/apertium-es-ca] - https://gerrit.wikimedia.org/r/163578
[04:46:15] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-307236) (#3658) [10Gbps wave]
[05:57:43] <_joe_> good morning
[05:57:51] hi _joe_
[05:58:21] <_joe_> hi ori
[05:59:12] we have a bit of a problem with 503s, and i disabled the job runner on mw1053 because it hit some strange mediawiki error
[05:59:29] very vague descriptions but i'm still gathering details
[06:00:31] <_joe_> a problem with 503s meaning?
[06:00:42] <_joe_> 503s mean hhvm choked the response, probably
[06:00:50] <_joe_> or timed out
[06:05:17] not so easy, alas
[06:14:03] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 205, down: 0, dormant: 0, excluded: 0, unused: 0
[06:29:42] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:52] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:14] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:33] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:42] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:45] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:03] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:12] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:14] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:15] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1
failures
[06:31:22] PROBLEM - puppet last run on amssq46 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:23] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:23] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:12] akosiaris: I understand you're the ops liaison for the services team?
[06:45:40] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:45:59] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:46:00] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:46:09] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[06:46:18] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:46:28] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:46:28] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:46:38] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:46:38] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:46:49] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:47:09] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:47:38] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:47:38] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet
has 1 failures
[06:47:58] RECOVERY - puppet last run on amssq46 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:59:59] PROBLEM - Ubuntu mirror in sync with upstream on carbon is CRITICAL: /srv/ubuntu/project/trace/carbon.wikimedia.org is over 12 hours old.
[07:01:38] RECOVERY - Disk space on ms1001 is OK: DISK OK
[07:05:40] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: puppet fail
[07:06:03] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[07:06:44] akosiaris: I asked because I wrote a patchset to puppetize Citoid (also on the services cluster) and it's now ready for review: https://gerrit.wikimedia.org/r/#/c/163068
[07:24:57] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[07:31:16] PROBLEM - puppet last run on mw1098 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:49:21] RECOVERY - puppet last run on mw1098 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[07:54:09] (CR) Chmarkine: [C: 1] redirect noc user homedirs to people.wm.org [puppet] - https://gerrit.wikimedia.org/r/163756 (owner: Dzahn)
[08:00:06] (CR) Filippo Giunchedi: "mind changing also misc/deployment to ensure => present on absent elsewhere?"
[puppet] - https://gerrit.wikimedia.org/r/161748 (https://bugzilla.wikimedia.org/68255) (owner: Krinkle)
[08:03:51] (PS3) Filippo Giunchedi: Point beta redis at the domain instead of ip [puppet] - https://gerrit.wikimedia.org/r/163973 (https://bugzilla.wikimedia.org/71484) (owner: EBernhardson)
[08:03:56] (CR) Filippo Giunchedi: [C: 2 V: 2] Point beta redis at the domain instead of ip [puppet] - https://gerrit.wikimedia.org/r/163973 (https://bugzilla.wikimedia.org/71484) (owner: EBernhardson)
[08:06:23] (CR) Filippo Giunchedi: [C: 1] redirect noc user homedirs to people.wm.org [puppet] - https://gerrit.wikimedia.org/r/163756 (owner: Dzahn)
[08:35:55] <_joe_> !log disabling puppet on mw1018, enabling debug logging to get more details about fcgi reported errors
[08:36:00] Logged the message, Master
[08:51:04] RoanKattouw: no, I am not (unless something changed and I wasn't informed). But I suppose I can help with that review
[08:53:29] OK
[08:53:40] Feel free to bounce me to someone else
[08:53:55] Ori and Bryan looked at this patchset earlier, and Ori made some changes to it
[08:55:39] I don't think there is a liaison to the services team to bounce you to and I am probably the best qualified right now so I 'll take it from here, thanks for letting me know
[09:02:31] PROBLEM - puppet last run on db1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:04:45] !log breaking the snapmirror relationships between nas1-a, nas1001-a. Effect: no more fr_archive syncing, fenari /home no longer is synced
[09:04:51] Logged the message, Master
[09:05:26] \o/
[09:20:49] RECOVERY - puppet last run on db1007 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[09:20:52] \o/
[09:22:58] so alex, you decided to route my email to the ldap server? ;)
[09:23:22] mark: yes.
I thought it made perfect sense
[09:23:35] cause LDAP is the protocol for tomorrow mail
[09:23:44] sorry about that
[09:24:24] (CR) Mark Bergsma: [C: -2] "nfs1 is the central syslog server, and this needs to be moved before shutdown" [puppet] - https://gerrit.wikimedia.org/r/159442 (owner: Dzahn)
[09:24:28] :)
[09:31:56] don't you love netapp
[09:31:58] Too many users logged in! Please try again later.
[09:32:10] where 'too many' == '>1'
[09:33:08] ahaha
[09:33:20] well if you know the command you can issue it
[09:33:27] it just that is has a single pty
[09:33:49] also showmount -a nas1-a
[09:33:59] it still reports srv2XX:/vol/originals
[09:34:10] doesn't it ever clear its cache ?
[09:34:22] wow
[09:36:30] vfiler run vfiler0 nfsstat -l
[09:36:34] nfs1 and fenari
[09:36:55] so yeah
[09:36:57] we should move that off
[09:38:20] nobody logged in at fenari!!!
[09:38:29] quick, shut it down
[09:38:36] my point exactly :-)
[09:38:52] I 'll disable user logins and umount /home
[09:39:08] and let's see who complains
[09:40:44] yeah
[09:42:11] !log touch /etc/nologin on fenari. Non root logins disallowed
[09:42:16] Logged the message, Master
[09:47:47] !log umount /home on fenari.
fenari user homes no longer available
[09:47:52] Logged the message, Master
[09:58:06] !log destroyed baculasd1, baculasd2 and fr_archive volumes on nas1
[09:58:11] Logged the message, Master
[10:13:33] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Puppet last ran 45019 seconds ago, expected 14400
[10:15:34] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[10:16:14] !log destroyed backups aggregate on nas1-a
[10:16:20] Logged the message, Master
[10:16:34] !log started spare disk zeroing process on nas1-a
[10:16:39] Logged the message, Master
[10:16:44] !log killed cp4006's stale puppet agent_disabled.lock, ran puppet
[10:16:49] Logged the message, Master
[10:22:13] ^ this is paravoid taking a vacation
[10:22:50] paravoid: after your adventure, are you now a lean, mean, fighting machine? :)
[10:23:01] haha
[10:24:36] yeah next year we won't need rachel
[10:25:16] paravoid already did well on the organizational skills, and has beefed up the rest of the skillset
[10:31:42] (PS1) Filippo Giunchedi: allocate wmf4573 as lithium [dns] - https://gerrit.wikimedia.org/r/164041
[10:33:50] ^ easy dns change/allocation, if someone has 5 min
[10:34:07] where did the name lithium come from?
[10:34:52] the next available element in alphabetical order
[10:35:06] ok
[10:35:23] (CR) Mark Bergsma: [C: -1] "Sounds like you've allocated a (production) ip but not set reverse DNS for it?"
[dns] - https://gerrit.wikimedia.org/r/164041 (owner: Filippo Giunchedi)
[10:37:22] sigh, of course, fixing
[10:38:29] <_joe_> !log re-enabled puppet on mw1018, repooling in a few
[10:38:34] Logged the message, Master
[10:41:16] (PS2) Filippo Giunchedi: allocate wmf4573 as lithium [dns] - https://gerrit.wikimedia.org/r/164041
[10:42:05] (CR) Mark Bergsma: [C: 1] allocate wmf4573 as lithium [dns] - https://gerrit.wikimedia.org/r/164041 (owner: Filippo Giunchedi)
[10:50:17] (PS3) Filippo Giunchedi: allocate wmf4573 as lithium [dns] - https://gerrit.wikimedia.org/r/164041
[10:50:24] (CR) Filippo Giunchedi: [C: 2 V: 2] allocate wmf4573 as lithium [dns] - https://gerrit.wikimedia.org/r/164041 (owner: Filippo Giunchedi)
[11:00:21] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0
[11:07:46] (PS1) Faidon Liambotis: Merge pdns precise/trusty configs, unbreak nescio [puppet] - https://gerrit.wikimedia.org/r/164045
[11:10:29] (CR) Faidon Liambotis: [C: 2] Merge pdns precise/trusty configs, unbreak nescio [puppet] - https://gerrit.wikimedia.org/r/164045 (owner: Faidon Liambotis)
[11:12:31] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[11:14:55] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[11:15:28] oops :)
[11:53:03] (PS1) Filippo Giunchedi: add lithium to install-server [puppet] - https://gerrit.wikimedia.org/r/164048
[11:54:36] simple enough, any takers for ^ ?
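The review comment earlier about a newly allocated IP lacking reverse DNS refers to the PTR record that should accompany the forward record. As a small illustration of which name that PTR record lives under, Python's standard ipaddress module can derive it from the address. The address below is a documentation placeholder (192.0.2.0/24), not lithium's real IP, and the helper name ptr_name is made up.

```python
import ipaddress


def ptr_name(address):
    """Return the reverse-DNS (PTR) owner name for an IPv4 or IPv6 address."""
    # reverse_pointer builds the in-addr.arpa / ip6.arpa name from the address.
    return ipaddress.ip_address(address).reverse_pointer


print(ptr_name("192.0.2.10"))  # prints 10.2.0.192.in-addr.arpa
```

A change that adds a forward record for a host would normally add a PTR record under that name in the matching reverse zone as well.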
[11:54:57] merge it already :P [11:55:28] (03PS2) 10Filippo Giunchedi: add lithium to install-server [puppet] - 10https://gerrit.wikimedia.org/r/164048 [11:55:30] haha fair enough [11:55:37] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] add lithium to install-server [puppet] - 10https://gerrit.wikimedia.org/r/164048 (owner: 10Filippo Giunchedi) [11:56:12] what's the process for running some simple SQL commands for DB cleanup? do I need a deploy slot or should I just go ahead and to it? [11:57:57] greg-g: ^^ [11:58:54] !log reboot lithium for installation [11:58:58] Logged the message, Master [12:04:46] (03PS1) 10KartikMistry: Add initial Debian packaging [debs/contenttranslation/apertium-lex-tools] - 10https://gerrit.wikimedia.org/r/164050 [12:17:55] there's no facility afaict in puppet to setup a lv in production, correct? I can only see labs_lvm [12:21:05] no [12:21:15] we typically do that in either the installer or manually [12:21:19] and for these one-off boxes, why not [12:24:48] indeed, so /srv/syslog [12:25:03] I'm behind the bikeshed if you need me [12:30:04] (03PS1) 10Filippo Giunchedi: lithium: provision as syslog server [puppet] - 10https://gerrit.wikimedia.org/r/164056 [12:33:11] (03PS1) 10KartikMistry: Add .gitreview file [debs/contenttranslation/apertium-lex-tools] - 10https://gerrit.wikimedia.org/r/164057 [12:37:10] i think it should be /usr/local/opt/a/syslog [12:37:55] haha we can symlink [12:38:25] mount --bind [12:44:19] PROBLEM - puppet last run on lithium is CRITICAL: CRITICAL: Puppet has 1 failures [12:46:58] RECOVERY - Ubuntu mirror in sync with upstream on carbon is OK: /srv/ubuntu/project/trace/carbon.wikimedia.org is over 0 hours old. [12:47:39] RECOVERY - puppet last run on lithium is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [13:00:04] K4: Dear anthropoid, the time has come. Please deploy Fundraising (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141001T1300). 
[13:13:01] (03PS1) 10Aude: Rename wikibase debug log (currently unused) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164061 [13:15:03] THE TIME HAS COME [13:15:15] can we get that for tampa shutdown too? [13:15:23] impending doom [13:21:55] !log Stopped OpenDJ on sanger [13:22:01] Logged the message, Master [13:24:05] PROBLEM - LDAP on sanger is CRITICAL: Connection refused [13:24:47] PROBLEM - Certificate expiration on sanger is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6 [13:25:07] PROBLEM - LDAPS on sanger is CRITICAL: Connection refused [13:28:51] !log Stopped DNS recursor on dobson [13:28:57] Logged the message, Master [13:29:49] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] lithium: provision as syslog server [puppet] - 10https://gerrit.wikimedia.org/r/164056 (owner: 10Filippo Giunchedi) [13:31:47] PROBLEM - Recursive DNS on 208.80.152.131 is CRITICAL: CRITICAL - Plugin timed out while executing system call [13:33:07] RECOVERY - LDAPS on sanger is OK: TCP OK - 0.037 second response time on port 636 [13:33:26] RECOVERY - LDAP on sanger is OK: TCP OK - 0.036 second response time on port 389 [13:33:29] heh right [13:34:06] RECOVERY - Certificate expiration on sanger is OK: SSL_CERT OK - X.509 certificate for sanger.wikimedia.org from Wikimedia CA valid until Oct 11 20:23:26 2015 GMT (expires in 375 days) [13:36:16] PROBLEM - LDAPS on sanger is CRITICAL: Connection refused [13:36:26] PROBLEM - LDAP on sanger is CRITICAL: Connection refused [13:36:57] PROBLEM - Certificate expiration on sanger is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6 [13:38:47] RECOVERY - Recursive DNS on 208.80.152.131 is OK: DNS OK: 5.161 seconds response time. 
www.wikipedia.org returns 208.80.154.224 [13:49:20] !log temporarily override syslog.eqiad.wmnet on mw1053 for testing [13:49:26] Logged the message, Master [13:54:18] RECOVERY - LDAPS on sanger is OK: TCP OK - 0.036 second response time on port 636 [13:54:20] RECOVERY - Certificate expiration on sanger is OK: SSL_CERT OK - X.509 certificate for sanger.wikimedia.org from Wikimedia CA valid until Oct 11 20:23:26 2015 GMT (expires in 375 days) [13:54:38] RECOVERY - LDAP on sanger is OK: TCP OK - 0.031 second response time on port 389 [14:02:44] !log Shutting down dobson [14:02:49] Logged the message, Master [14:04:01] PROBLEM - Host dobson is DOWN: PING CRITICAL - Packet loss = 100% [14:05:00] PROBLEM - Host 208.80.152.131 is DOWN: PING CRITICAL - Packet loss = 100% [14:08:43] !log Stopped pdns_recursor on mchenry [14:08:47] Logged the message, Master [14:09:25] (03PS1) 10Filippo Giunchedi: lithium: exclude from remote-syslog [puppet] - 10https://gerrit.wikimedia.org/r/164070 [14:09:55] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] lithium: exclude from remote-syslog [puppet] - 10https://gerrit.wikimedia.org/r/164070 (owner: 10Filippo Giunchedi) [14:10:27] (03PS1) 10Hashar: contint: python3.4 on Trusty labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/164071 [14:10:31] PROBLEM - Recursive DNS on 208.80.152.132 is CRITICAL: CRITICAL - Plugin timed out while executing system call [14:12:01] PROBLEM - Host dataset2 is DOWN: CRITICAL - Plugin timed out after 15 seconds [14:12:45] (03PS2) 10Hashar: contint: python3.4 on Trusty labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/164071 [14:15:41] (03CR) 10Hashar: "I have installed the packages manually on integration-slave100{6,7,8}. 
Puppet is broken because of the hhvm package ( https://bugzilla.wik" [puppet] - 10https://gerrit.wikimedia.org/r/164071 (owner: 10Hashar) [14:16:36] (03PS1) 10ArielGlenn: turn off rsync to dataset2 from dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/164072 [14:18:23] (03CR) 10ArielGlenn: [C: 032] turn off rsync to dataset2 from dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/164072 (owner: 10ArielGlenn) [14:19:24] (03PS1) 10Filippo Giunchedi: syslog-server: switch log directory to present [puppet] - 10https://gerrit.wikimedia.org/r/164073 [14:19:43] (03PS2) 10Filippo Giunchedi: syslog-server: switch log directory to present [puppet] - 10https://gerrit.wikimedia.org/r/164073 [14:19:49] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] syslog-server: switch log directory to present [puppet] - 10https://gerrit.wikimedia.org/r/164073 (owner: 10Filippo Giunchedi) [14:20:27] hah apergos assuming 518dd7f is good to merge? just ran puppet-merge :) [14:21:08] it is, I had not yet hit enter [14:21:33] Merge these changes? (yes/no)? [14:21:33] :-P [14:22:00] heheh, did mine show up there too? anyways let me know when merged! [14:25:41] <^demon|away> apergos: I responded to your e-mail about fenari :) /h/w/ is historical (and hysterical) [14:25:42] <^demon|away> :) [14:25:51] in't it though [14:26:09] sec [14:26:50] godog: claims to have merged them [14:27:11] !log mexia powered off [14:27:16] Logged the message, Master [14:28:37] apergos: [14:28:41] apergos: \o/ thanks [14:30:11] hashar: overnight there was talk about jenkins depending on virt0 for ldap. I think it was changed to point to virt1000 but there's probably more to do there… do you know where that config is? [14:30:13] (03PS1) 10BBlack: mexia out of authdns-update list [puppet] - 10https://gerrit.wikimedia.org/r/164075 [14:30:16] And/or do you have a moment to tinker with it? [14:30:34] andrewbogott: unpuppetized :-( [14:30:36] ^demon|away: I'll toss those things off our remaining copy, likely tomorrow. 
yay [14:30:41] (03CR) 10BBlack: [C: 032 V: 032] mexia out of authdns-update list [puppet] - 10https://gerrit.wikimedia.org/r/164075 (owner: 10BBlack) [14:30:53] hashar: hm, ok. Can you verify that it no longer relies on virt0? I'm about to shut virt0 off again [14:31:20] andrewbogott: it uses ldaps://virt1000.wikimedia.org:636 [14:31:24] hashar: also, it should more properly use a primary of ldap-eqiad, secondary ldap-codfw. Want to try that while you're in there? [14:31:30] Ah, ok, that's better at least. [14:31:42] ldap-eqiad may or may not work for you depending on cert stuff [14:31:43] it is hosted on gallium which is in eqiad, we can change it to ldap-eqiad [14:32:10] (ideally we would use ldap.wikimedia.org which would use gdnsd to distribute the request to the nearest datacenter :D ) [14:32:24] !log turning virt0 off again. Soon we won't have a choice about this, trying to flush out issues in the meantime. [14:32:31] Logged the message, Master [14:32:36] don't you have a log of queries being made on virt0 ? [14:33:20] hashar: yes, but in most cases servers refer to virt0 and virt1000 both. So just because somethign is hitting virt0 it doesn't mean we depend on it. [14:33:27] So far jenkins has been the only thing w/out a secondary [14:34:55] I would use ldap-eqiad but i am not sure how to fix Jenkins if there is a cert issue :D [14:35:31] (03PS2) 10Alexandros Kosiaris: deploy pollux as codfw corp ldap [puppet] - 10https://gerrit.wikimedia.org/r/163867 [14:35:39] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] deploy pollux as codfw corp ldap [puppet] - 10https://gerrit.wikimedia.org/r/163867 (owner: 10Alexandros Kosiaris) [14:35:55] hashar: just switch it back if there is. 
[14:36:01] PROBLEM - Host labs-ns0.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [14:36:25] PROBLEM - Host virt0 is DOWN: CRITICAL - Plugin timed out after 15 seconds [14:36:35] hashar: it depends on if the config has a specific reference like "tls_cacertfile /etc/ssl/certs/GlobalSign_CA.pem" [14:37:13] any test command I could use ? :d [14:37:31] (03PS1) 10Jgreen: ganglia collector for OCG service [puppet] - 10https://gerrit.wikimedia.org/r/164077 [14:38:09] (03CR) 10jenkins-bot: [V: 04-1] ganglia collector for OCG service [puppet] - 10https://gerrit.wikimedia.org/r/164077 (owner: 10Jgreen) [14:38:17] hashar: I don't know what broke in the first place, only that Krinkle was alarmed. [14:38:31] It has a web interface, right? Probably just logging into that (uses wikitech password?) is what broke [14:38:36] well gallium /etc/ has no file matching GlobalSign_CA [14:40:16] (03PS2) 10Jgreen: ganglia collector for OCG service [puppet] - 10https://gerrit.wikimedia.org/r/164077 [14:40:27] !log Shutdown mchenry [14:40:32] Logged the message, Master [14:40:55] (03CR) 10jenkins-bot: [V: 04-1] ganglia collector for OCG service [puppet] - 10https://gerrit.wikimedia.org/r/164077 (owner: 10Jgreen) [14:41:47] !log testing syslog change on mw1060 [14:41:51] Logged the message, Master [14:42:04] PROBLEM - Host mchenry is DOWN: PING CRITICAL - Packet loss = 100% [14:42:19] manybubbles, marktraceur, ^demon|away: I'll SWAT today, unless one of you already had your heart set on it. [14:42:38] not set on it. have fun! [14:42:59] Hm… who set up graphite.wikimedia.org? It's weird. [14:43:56] and by 'weird' I mean it seems to use two different logins [14:43:58] (03PS3) 10Hashar: contint: python3.4 on Trusty labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/164071 [14:44:34] <^demon|away> anomie: but but, it's my favorite part of the morning! :p [14:45:06] ^demon|away: Is that sarcasm, or did you really want to do it? 
[14:45:23] andrewbogott: used to be on fenari then proxified to some other server (carbon). Nowadays I have no clue [14:45:38] anomie: Good luck! [14:45:41] * anomie has sarcasm detectors down for maintenance today [14:45:43] (03CR) 10Hashar: "Added python3.4-dev. Manually installed on the instances." [puppet] - 10https://gerrit.wikimedia.org/r/164071 (owner: 10Hashar) [14:46:15] hashar: mostly I just want to know how to reset my login (the dropdown that appears when I first visit) so I can re-test logins. Same with icinga [14:46:34] PROBLEM - Host 208.80.152.132 is DOWN: PING CRITICAL - Packet loss = 100% [14:46:50] (03PS4) 10Andrew Bogott: Move icinga ldap to the new servers. [puppet] - 10https://gerrit.wikimedia.org/r/163255 [14:46:57] andrewbogott: no clue :-( [14:48:29] (03CR) 10Andrew Bogott: [C: 032] Move icinga ldap to the new servers. [puppet] - 10https://gerrit.wikimedia.org/r/163255 (owner: 10Andrew Bogott) [14:48:49] andrewbogott: i think that's using HTTP authentication. so look in your browser options. I'm pretty sure chrome devtools has a way to do this? [14:48:57] <^demon|away> anomie: If I look like a duck and sound like a duck I'm probably being sarcastic :) [14:49:01] cscott: ok [14:49:15] ^demon|away: I can't see or hear you though ;) [14:49:26] <^demon|away> Very true [14:50:18] (03PS3) 10Jgreen: ganglia collector for OCG service [puppet] - 10https://gerrit.wikimedia.org/r/164077 [14:50:22] aude, gi11es: Ping for SWAT in 10 minutes [14:50:55] (03CR) 10jenkins-bot: [V: 04-1] ganglia collector for OCG service [puppet] - 10https://gerrit.wikimedia.org/r/164077 (owner: 10Jgreen) [14:50:55] ok [14:51:07] * aude making new build of wikibase [14:51:19] anomie: pong [14:54:17] We have things going out? 
[14:54:44] Oh, cool beans [14:54:53] * marktraceur watches intently [14:55:12] (03PS4) 10Jgreen: ganglia collector for OCG service [puppet] - 10https://gerrit.wikimedia.org/r/164077 [14:55:48] (03CR) 10jenkins-bot: [V: 04-1] ganglia collector for OCG service [puppet] - 10https://gerrit.wikimedia.org/r/164077 (owner: 10Jgreen) [14:55:50] (03PS1) 10Filippo Giunchedi: switch syslog to lithium [dns] - 10https://gerrit.wikimedia.org/r/164080 [14:56:50] * aude waiting on jenkins [14:57:26] tgr: no real official policy unless you want sean to do it for you (if you want him to review the cleanup code/etc). [14:58:24] tgr: so, "whenever is reasonable given what you are doing" [14:58:28] greg-g: it's fairly simple and it was reviewed internaly, I'll just do it then [14:58:35] * greg-g nods [14:58:42] hey SWATties, can I add something to your queue? [14:58:46] I'm supposed to that via sql.php from terbium, right? [14:59:01] !log Jenkins changed git executable path from 'git' to '/usr/bin/git' [14:59:07] Logged the message, Master [14:59:47] tgr: not positive off hand [14:59:57] cscott: Go ahead [15:00:04] manybubbles, anomie, ^d, marktraceur: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141001T1500). Please do the needful. [15:00:06] anomie: https://gerrit.wikimedia.org/r/163609 [15:00:08] * anomie is starting SWAT [15:00:16] cscott: Put it on the Deployments page, please? 
[15:00:19] gi11es: You're first [15:00:21] anomie: will do [15:00:28] oh my [15:00:30] (03PS2) 10Anomie: Thumbnail prerendering at upload time on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163836 (owner: 10Gilles) [15:00:39] (03CR) 10Anomie: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163836 (owner: 10Gilles) [15:00:50] (03Merged) 10jenkins-bot: Thumbnail prerendering at upload time on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163836 (owner: 10Gilles) [15:01:25] !log anomie Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: Enable thumbnail prerendering at upload time on Beta [[gerrit:163836]] (duration: 00m 09s) [15:01:27] gi11es: ^ Test? [15:01:30] Logged the message, Master [15:01:59] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] switch syslog to lithium [dns] - 10https://gerrit.wikimedia.org/r/164080 (owner: 10Filippo Giunchedi) [15:02:25] scap returned a sync error trying to sync to fenari [15:02:38] !log switched syslog to lithium [15:02:43] Logged the message, Master [15:02:51] (03CR) 10Cscott: "Ok, nothing's on fire, merge away." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163609 (owner: 10Cscott) [15:03:15] Is fenari finally offline? /me wants to dance on it's slow sync grave [15:03:44] it's... not accepting logins [15:04:04] a picture would be a good memory indeed [15:04:06] perhaps someone can actually shut it down in a bit :) [15:04:46] (03PS5) 10Jgreen: ganglia collector for OCG service [puppet] - 10https://gerrit.wikimedia.org/r/164077 [15:05:01] This would be good to merge if we are done scapping to it -- https://gerrit.wikimedia.org/r/#/c/163315/ [15:05:13] !log switched icinga over to the new ldap servers. Seems to still work so far... 
[15:05:19] Logged the message, Master [15:05:26] (03CR) 10jenkins-bot: [V: 04-1] ganglia collector for OCG service [puppet] - 10https://gerrit.wikimedia.org/r/164077 (owner: 10Jgreen) [15:05:58] cscott: Well, let's do yours now since gi11es's was only on Beta [15:06:14] jenkins is being slow [15:06:16] cscott: Needs rebase [15:06:25] anomie: ok, hang on. [15:06:45] * YuviPanda|zzzz waves [15:06:58] whoa, path conflict? That's surprising. Let me look. [15:07:30] mutante: re: pointing to private repo in icinga, am refactoring those out now :) [15:08:43] anomie: well apparently UploadWizard is neither on beta commons nor beta enwiki, and I can't access Special:Upload either. so can't find a way to test the change right now [15:09:26] (03PS4) 10Cscott: Disable the old mwlib PDF render service. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163609 [15:09:33] ...that's odd [15:09:39] anomie: hm, weird. no rebase conflict when i did it locally. [15:10:21] gi11es: UW is on beta commons though not beta enwiki [15:10:24] http://commons.wikimedia.beta.wmflabs.org/wiki/Special:UploadWizard [15:10:48] cscott: Were you going to turn off odf and zim too, or not yet? [15:10:53] oh, someone changed the UDP profiler port of HHVM on labs, that was the only change since i last rebased. [15:10:59] whut, I couldn't get that a second ago. must have mistyped the url [15:11:07] WTF why can't I resolve any labs domain names [15:11:12] anomie: testing... [15:11:20] Or maybe I just can't find the servers... [15:11:37] anomie: oh, hm. yeah. let's do it in a separate patch, maybe tomorrow. don't want to rush it and make a mistake. [15:11:46] ok [15:11:54] (03CR) 10Anomie: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163609 (owner: 10Cscott) [15:12:05] (03Merged) 10jenkins-bot: Disable the old mwlib PDF render service. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163609 (owner: 10Cscott) [15:12:32] what's the drac command to reboot on a dell? 
[15:12:35] !log anomie Synchronized wmf-config: SWAT: Disable the old mwlib PDF render service [[gerrit:163609]] (duration: 00m 09s) [15:12:36] cscott: ^ Test please [15:12:40] Logged the message, Master [15:13:34] anomie: doing so [15:15:56] anomie: https://gerrit.wikimedia.org/r/#/c/164061/ and https://gerrit.wikimedia.org/r/#/c/164082/ [15:16:04] which i am adding to the wiki [15:16:46] anomie: can't say if the feature is working, but nothing broke as a result of turning it on [15:16:47] aude: Does order matter? [15:16:53] gi11es: Ok [15:17:07] i don't think [15:17:22] aude: I'll do the MW change first then, then the config change [15:17:26] ok [15:22:52] PROBLEM - DPKG on stat1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:24:02] RECOVERY - Host labs-ns0.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 31.98 ms [15:24:42] RECOVERY - Host virt0 is UP: PING OK - Packet loss = 0%, RTA = 31.59 ms [15:26:01] RECOVERY - DPKG on stat1002 is OK: All packages OK [15:26:12] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [15:26:15] anomie: patch tested, looks fine. 
[15:27:15] !log anomie Synchronized php-1.25wmf1/extensions/Wikidata/: SWAT: Fix js error that breaks editing properties on Wikidata [[gerrit:164079]] (duration: 00m 16s) [15:27:20] aude: ^ Test please, unless you need the config change too [15:27:21] PROBLEM - DPKG on virt0 is CRITICAL: Timeout while attempting connection [15:27:21] PROBLEM - Disk space on virt0 is CRITICAL: Timeout while attempting connection [15:27:21] PROBLEM - Certificate expiration on virt0 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6 [15:27:22] Logged the message, Master [15:27:41] PROBLEM - LDAP on virt0 is CRITICAL: Connection timed out [15:27:42] PROBLEM - LDAPS on virt0 is CRITICAL: Connection timed out [15:27:42] PROBLEM - RAID on virt0 is CRITICAL: Timeout while attempting connection [15:27:43] PROBLEM - nutcracker process on virt0 is CRITICAL: Timeout while attempting connection [15:27:43] PROBLEM - puppetmaster https on virt0 is CRITICAL: Connection timed out [15:27:51] PROBLEM - SSH on virt0 is CRITICAL: Connection timed out [15:27:51] PROBLEM - Redis on virt0 is CRITICAL: Connection timed out [15:27:51] PROBLEM - HTTP on virt0 is CRITICAL: Connection timed out [15:27:51] PROBLEM - check configured eth on virt0 is CRITICAL: Timeout while attempting connection [15:27:51] PROBLEM - Memcached on virt0 is CRITICAL: Timeout while attempting connection [15:27:55] grrrr [15:28:01] how do I tell icinga that virt0 is gone for good? [15:28:49] disable all notifications should do it [15:29:36] I thought I did that yesterday... 
[15:29:39] maybe not [15:29:43] Maybe I acknowledged instead [15:29:45] anomie: looks fine [15:29:47] thanks [15:29:51] PROBLEM - Host virt0 is DOWN: CRITICAL - Plugin timed out after 15 seconds [15:30:02] (03PS2) 10Anomie: Rename wikibase debug log (currently unused) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164061 (owner: 10Aude) [15:30:09] (03CR) 10Anomie: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164061 (owner: 10Aude) [15:30:23] (03Merged) 10jenkins-bot: Rename wikibase debug log (currently unused) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164061 (owner: 10Aude) [15:30:41] PROBLEM - Host labs-ns0.wikimedia.org is DOWN: CRITICAL - Plugin timed out after 15 seconds [15:30:43] !log anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Rename wikibase debug log [[gerrit:164061]] (duration: 00m 12s) [15:30:45] aude: ^ I suppose if it's unused you can't really test? [15:30:48] Logged the message, Master [15:30:48] godog: also, icinga is convinced that labs-ns0 is a different ip from what it actually should be [15:30:48] !log Jenkins added jgit as a git provider under https://integration.wikimedia.org/ci/configure [15:30:50] oh, speak of the devil [15:30:53] Logged the message, Master [15:32:05] andrewbogott: could be the dns it is talking to? [15:32:25] godog: no, the old IP is set in the icinga config, I can see it there [15:32:32] !log starting upgrade of stat1002 from precise to trusty [15:32:37] Logged the message, Master [15:35:14] mutante: and thanks for taking care of the decom :) [15:35:21] mutante: have you done a trusty upgrade on a server with a private IP? is there anything speciaL I have to do here? [15:35:29] $ do-release-upgrade -d [15:35:29] Checking for a new Ubuntu release [15:35:29] No new release found [15:36:17] andrewbogott: hah, then naggen or puppet come to mind [15:36:23] naggen2 that is [15:36:30] ah, hm, webproxy is working [15:36:31] nm! 
[15:36:49] godog: I've never used naggen2, that's something I need to configure by hand? [15:37:08] I'm surprised that icinga didn't just pick up the change when I renumbered labs-ns0 [15:37:31] andrewbogott: not afaik, it is the tool that generates icinga config [15:39:04] godog: ok, found it in puppet I think [15:40:01] !log purged graphite logs for deployment-mediawiki04 by hand on labmon1001 to prevent it from causing issues on icinga, since the instance has been deleted previously [15:40:07] Logged the message, Master [15:40:22] (03PS1) 10Andrew Bogott: Moved labs-ns0 and labs-ns1. [puppet] - 10https://gerrit.wikimedia.org/r/164088 [15:40:28] godog: review me? ^ [15:40:36] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [15:41:03] andrewbogott: looking [15:41:29] (03CR) 10Filippo Giunchedi: [C: 031] Moved labs-ns0 and labs-ns1. [puppet] - 10https://gerrit.wikimedia.org/r/164088 (owner: 10Andrew Bogott) [15:42:06] (03CR) 10Andrew Bogott: [C: 032] Moved labs-ns0 and labs-ns1. [puppet] - 10https://gerrit.wikimedia.org/r/164088 (owner: 10Andrew Bogott) [15:44:25] PROBLEM - DPKG on stat1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:46:25] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 0.014 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.155.135 [15:47:56] (03PS4) 10Andrew Bogott: Move graphite ldap to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/163257 [15:48:05] (03CR) 10Andrew Bogott: [C: 032] Move graphite ldap to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/163257 (owner: 10Andrew Bogott) [15:57:40] godog: ok, icinga is still sad because it continues to produce the entry for virt0, which causes a conflict. Any idea how to make it stop? [15:59:01] andrewbogott: mh so getting an host in icinga takes two puppet runs iirc, perhaps it'll take two runs too? 
[15:59:13] I've done two already [15:59:26] Of course puppet isn't running on virt0 so it will never remove itself [16:00:18] then no idea what's going on :( what conflict btw? [16:04:40] (03PS1) 10Andrew Bogott: Add some extra acis that were set by hand on virt1000 [puppet] - 10https://gerrit.wikimedia.org/r/164096 [16:04:53] godog: labs-ns0 [16:19:07] (03CR) 10jenkins-bot: [V: 04-1] ganglia collector for OCG service [puppet] - 10https://gerrit.wikimedia.org/r/164077 (owner: 10Jgreen) [16:20:11] (03PS1) 10Alexandros Kosiaris: include admin for pollux, plutonium [puppet] - 10https://gerrit.wikimedia.org/r/164097 [16:20:14] (03PS1) 10Filippo Giunchedi: Revert "switch syslog to lithium" [dns] - 10https://gerrit.wikimedia.org/r/164098 [16:20:40] (03PS7) 10Jgreen: ganglia collector for OCG service [puppet] - 10https://gerrit.wikimedia.org/r/164077 [16:20:41] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "switch syslog to lithium" [dns] - 10https://gerrit.wikimedia.org/r/164098 (owner: 10Filippo Giunchedi) [16:22:14] (03CR) 10Jgreen: [C: 032 V: 031] ganglia collector for OCG service [puppet] - 10https://gerrit.wikimedia.org/r/164077 (owner: 10Jgreen) [16:22:54] (03PS1) 10Cscott: Fully disable all mwlib formats; use OCG service instead. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164099 [16:24:22] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.053 seconds response time. 
nagiostest.beta.wmflabs.org returns 208.80.155.135 [16:24:32] RECOVERY - Host labs-ns0.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 6.57 ms [16:25:02] RECOVERY - DPKG on stat1002 is OK: All packages OK [16:25:47] (03CR) 10Alexandros Kosiaris: [C: 032] include admin for pollux, plutonium [puppet] - 10https://gerrit.wikimedia.org/r/164097 (owner: 10Alexandros Kosiaris) [16:25:51] (03PS4) 10Andrew Bogott: Move servermon ldap to the new ldap servers [puppet] - 10https://gerrit.wikimedia.org/r/163262 [16:25:57] (03PS3) 10Andrew Bogott: Move ishmael to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/163285 [16:26:01] (03PS3) 10Andrew Bogott: Move tendril ldap to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/163286 [16:26:04] (03PS2) 10Andrew Bogott: Switch kibana to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/163290 [16:26:43] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures [16:27:17] (03CR) 10Andrew Bogott: [C: 032] Move servermon ldap to the new ldap servers [puppet] - 10https://gerrit.wikimedia.org/r/163262 (owner: 10Andrew Bogott) [16:28:54] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [16:30:30] (03PS1) 10Alexandros Kosiaris: Followup commit to I851588b1e21b569d9b750a2b49322d [puppet] - 10https://gerrit.wikimedia.org/r/164100 [16:30:53] (03PS1) 10Jgreen: add another OCG health check metric (response time for health check URI) [puppet] - 10https://gerrit.wikimedia.org/r/164101 [16:32:03] !log reverted change to syslog.eqiad.wmnet, back to nfs-home.pmtpa.wmnet [16:32:08] Logged the message, Master [16:32:27] (03CR) 10Jgreen: [C: 032 V: 031] add another OCG health check metric (response time for health check URI) [puppet] - 10https://gerrit.wikimedia.org/r/164101 (owner: 10Jgreen) [16:32:31] (03CR) 10Alexandros Kosiaris: [C: 032] Followup commit to I851588b1e21b569d9b750a2b49322d [puppet] - 
10https://gerrit.wikimedia.org/r/164100 (owner: 10Alexandros Kosiaris) [16:33:28] anomie: thanks for noticing the ODF/ZIM stuff. I added https://gerrit.wikimedia.org/r/164099 to the SWAT queue for tomorrow to mop up the rest of the pieces. [16:34:23] (03CR) 10RobH: [C: 04-1] "The mgmt entries shouldn't be removed, as this is going back into the spare's pool." (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/163992 (owner: 10Dzahn) [16:34:23] RECOVERY - puppet last run on plutonium is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:36:42] (03CR) 10Aaron Schulz: Restore "Set bloom cache config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162333 (owner: 10Aaron Schulz) [16:42:30] (03PS1) 10Yuvipanda: icinga: Add a parameter to icinga::web to parameterize SSL [puppet] - 10https://gerrit.wikimedia.org/r/164103 [16:44:53] (03CR) 10Yuvipanda: [C: 04-1] "So, the only things referred to from private repo are:" [puppet] - 10https://gerrit.wikimedia.org/r/158355 (owner: 10JanZerebecki) [16:47:38] (03CR) 10Yuvipanda: [C: 04-1] "BAH, the http clauses in there are just redirects. I need to rework the config some more." [puppet] - 10https://gerrit.wikimedia.org/r/164103 (owner: 10Yuvipanda) [16:51:02] (03CR) 10Andrew Bogott: [C: 032] Move ishmael to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/163285 (owner: 10Andrew Bogott) [16:51:34] ottomata: around? [16:52:42] (03PS1) 10RobH: setting production dns entries for ms-fe200* [dns] - 10https://gerrit.wikimedia.org/r/164104 [16:53:42] JetLaggedPanda: ja [16:54:05] ottomata: have time to do a *little* bit of icinga sleuthing for me? just want the output of 'find .' for a couple of paths on neon. [16:54:13] sure [16:54:38] ottomata: /etc/nagios, /etc/icinga, /var/lib/nagios, /var/lib/icinga [16:54:49] ottomata: hmm, I need to see perms + ownership as well [16:54:56] ls -lR? 
[16:55:09] ottomata: ah, yes [16:55:10] that [16:55:11] should do [16:56:27] !log swapping disk db1020 [16:56:34] https://gist.github.com/ottomata/4b51f6d3b2cb4a8f7d02 [16:56:35] Logged the message, Master [16:56:35] JetLaggedPanda: ^ [16:57:01] aaargh, ffs [16:57:04] those execs are still required [16:57:05] grr [16:57:08] so much 'root' [16:57:13] thanks ottomata [16:57:13] (03CR) 10Andrew Bogott: [C: 032] Move tendril ldap to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/163286 (owner: 10Andrew Bogott) [16:57:24] * JetLaggedPanda wonders how to kill the exec {} in misc/icinga.pp [17:10:05] heya apergos, yt? [17:10:15] yes but not much [17:10:25] ottomata: [17:10:37] so, analytics is now generating a slightly better version of the pagecounts and projectcounts files via hadoop [17:10:48] we would like to rsync these to a public host, likely dumps.wikimedia.org [17:11:02] i'm looking in puppet, and there seem to be a lot of scripts around syncing the webstatscollector generated ones [17:11:04] how much data, about the same as the current onces? [17:11:09] yes [17:11:16] a little bit more, since mobile is now included [17:11:20] but not much [17:11:20] uh huh [17:11:33] so, yeah, question: why not just rsync? [17:11:36] why all the scripting? [17:11:42] mutante: do you remember how we restarted tendril? [17:11:47] because you give them in one pile [17:11:53] and they need to go in separate directories [17:12:07] ah, we could generate them in separate directories [17:12:10] for this new dataset [17:12:25] if you do that and set it up then that worksforme [17:12:32] oh [17:12:33] I'd rather not have the whole staging area thing [17:12:33] hm [17:12:35] so [17:12:38] we want this as [17:12:43] ottomata: we already generate the needed directory structure. [17:12:49] dir/2014/2014-09/allthefiles, right? 
[17:12:58] yeah, just noticing that [17:13:04] however the regular ones are, it's something like that yep [17:13:11] ok, apergos, would you prefer push or pull? [17:13:32] let's pull so all the rsyncs can be coordinated on the dataset1001 side [17:13:36] ok cool [17:13:48] so cron on datasets, readable rsycn module on stat1002, I like it [17:13:50] that way moving them is easier too [17:13:54] yep [17:13:59] cool, agreed. [17:14:06] thanks! [17:14:20] just put that rsync staggered with the cron that's lready there [17:14:24] k [17:14:39] I take it we continue to get the regular page count files? [17:15:01] is there some thought of turning those off once it's proven that yours are more complete/better/etc? [17:15:40] eventually, maybe [17:15:58] the intention is to generate a more official pageview definition and create a whole new dataset entirely [17:16:03] this is a first step in that process [17:16:21] I see [17:16:22] (03CR) 10RobH: [C: 032] setting production dns entries for ms-fe200* [dns] - 10https://gerrit.wikimedia.org/r/164104 (owner: 10RobH) [17:16:29] (03PS1) 10Glaisher: Enable DynamicPageList extension on srwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164114 (https://bugzilla.wikimedia.org/68346) [17:16:39] andrewbogott: /etc/init.d/apache2 restart or another variety to restart apache [17:16:47] mutante: ah, sure, ok [17:16:52] all right well keep me in the loop if the definition and layout changes of the dat, that should be documented in a readme file in the top level dir I guess [17:17:04] k, apergos, should I add to the dataset module? [17:17:08] (03PS2) 10Glaisher: Enable DynamicPageList extension on srwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164114 (https://bugzilla.wikimedia.org/68346) [17:17:39] Coren: so far, neon services are handling the new ldap config nicely now. 
[17:17:39] suppose just in dataset::cron::pagecounts_all_sites (that's what we are going to call this) [17:17:45] sure, you see how things are set up in there now, each rsync that goes off is a separate cron job and manifest [17:17:55] k [17:18:19] add it to the rsyncs, give it a name, so in the roles datasets primary and secndary we can just add it to the list (for primry, ight now the secondary is gone, by bye tampa) [17:18:35] well, not to rsyncs, right? as we don'tneed an rsync module on datasets [17:18:36] you can stick me on as reviewer and nag me about it tomorrow or whenever [17:18:40] <^d> Who's the best person to ping about simple puppet patches that don't really have a right owner? RT on call? [17:18:48] ^d, yup :) [17:18:50] right [17:19:11] <^d> In that case Coren, you got a minute? [17:20:45] (03CR) 10Dzahn: [C: 031] fix cert mismatch on mail.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/154223 (https://bugzilla.wikimedia.org/44731) (owner: 10Jeremyb) [17:22:21] ^d: What can I help with? [17:22:28] * Coren reads up. [17:22:39] <^d> I want to puppetize my .gitconfig :) [17:22:47] ^d: Point me at your patch. [17:22:55] <^d> https://gerrit.wikimedia.org/r/#/c/163984/ [17:32:11] (03PS2) 10Dzahn: remove host 'silver' [dns] - 10https://gerrit.wikimedia.org/r/163992 [17:33:45] (03CR) 10RobH: [C: 031] remove host 'silver' [dns] - 10https://gerrit.wikimedia.org/r/163992 (owner: 10Dzahn) [17:35:05] (03CR) 10Dzahn: [C: 032] remove host 'silver' [dns] - 10https://gerrit.wikimedia.org/r/163992 (owner: 10Dzahn) [17:35:51] (03CR) 10Andrew Bogott: [C: 032] Switch kibana to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/163290 (owner: 10Andrew Bogott) [17:36:06] (03CR) 10Dzahn: "one less public IP as well" [dns] - 10https://gerrit.wikimedia.org/r/163992 (owner: 10Dzahn) [17:36:15] mutante: same question re: kibana? [17:37:20] andrewbogott: that's logstash, and i think bd808 did it.. 
it' [17:37:27] s more than one server [17:37:40] kibana is on logstash100[123] [17:37:54] ok, let me refresh [17:37:58] remember when we tried switching LDAP and it broke and we reverted? [17:38:28] yep, I'm doing that again right now :) [17:38:33] mutante: thanks for taking care of the decom! :) [17:38:47] yea, that was more to bd808 to say what i was referring to [17:38:54] andrewbogott: ok [17:39:03] mutante: I also sent an email about contacts.cfg and stuff, and commented on the patch you poked me about [17:39:32] JetLaggedPanda: yw. ok, cool! [17:39:40] wait, JetLagged.. where are you? [17:39:44] mutante: India [17:39:45] did you actually arrive in SF? [17:39:47] ah [17:39:47] just got back from UK [17:39:54] i see [17:40:55] "[notice] child pid 23972 exit signal Segmentation fault (11)" 25 in last 15 minutes; 69 in last hour; across multiple mw* hosts in the cluster [17:41:18] JetLaggedPanda: and thanks for cleaning up vumi [17:41:27] mutante: \o/ :) [17:41:41] !log graceful'd apache on logstash1001 logstash1002 logstash1003 [17:41:50] Logged the message, Master [17:41:52] bd808: how's kibana looking now? [17:42:09] andrewbogott: Still loading :) [17:42:14] (03CR) 10Yuvipanda: "Also, I think publishing email addresses is completely ok. Most of these people have emails in our git repos as author headers anyway. We " [puppet] - 10https://gerrit.wikimedia.org/r/158355 (owner: 10JanZerebecki) [17:42:23] s/loading/working/ [17:43:12] apergos: is there a reason the words 'pagestats' and 'pagecounts' are both used in the puppet manifests? [17:43:12] Segfaults in logstash -- https://logstash.wikimedia.org/#dashboard/temp/zZEjhoGxT--AcL6poIx8Rg [17:43:14] (03CR) 10coren: [C: 032] "Trivial file addition to home." [puppet] - 10https://gerrit.wikimedia.org/r/163984 (owner: 10Chad) [17:43:16] is there a distincation? [17:43:18] distinction? 
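The segfault report above gives counts over two rolling windows (last 15 minutes, last hour). A minimal sketch of that kind of count, assuming the log records are already available as (timestamp, message) pairs; the function name and the record shape are illustrative assumptions, not how logstash computes it.

```python
import re
from datetime import datetime, timedelta

# Matches the Apache notice quoted in the channel; the pid varies per child.
SEGFAULT = re.compile(r"child pid \d+ exit signal Segmentation fault \(11\)")


def count_recent(records, now: datetime, window: timedelta) -> int:
    """Count segfault notices with a timestamp in the interval (now - window, now]."""
    cutoff = now - window
    return sum(
        1
        for ts, msg in records
        if cutoff < ts <= now and SEGFAULT.search(msg)
    )
```

Calling it twice with `timedelta(minutes=15)` and `timedelta(hours=1)` would give the two figures in the same form as the report.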
[17:43:21] (03PS1) 10Jgreen: tweak ocg ganglia collector to stop blocking when ocg server is slow [puppet] - 10https://gerrit.wikimedia.org/r/164116 [17:44:00] (03CR) 10jenkins-bot: [V: 04-1] tweak ocg ganglia collector to stop blocking when ocg server is slow [puppet] - 10https://gerrit.wikimedia.org/r/164116 (owner: 10Jgreen) [17:44:04] Reedy: What do we do for segfaults normally? apache graceful? [17:44:11] (03PS2) 10Jgreen: tweak ocg ganglia collector to stop blocking when ocg server is slow [puppet] - 10https://gerrit.wikimedia.org/r/164116 [17:44:53] (03CR) 10jenkins-bot: [V: 04-1] tweak ocg ganglia collector to stop blocking when ocg server is slow [puppet] - 10https://gerrit.wikimedia.org/r/164116 (owner: 10Jgreen) [17:47:31] ^d: merged [17:47:39] <^d> ty! [17:49:11] andrewbogott: pdns down on ns0 while I debug. [17:49:24] (no answers is better than wrong answers anyways) [17:49:32] http://www.wikidata.org/wiki/Q183 [17:49:37] Request: GET http://www.wikidata.org/wiki/Q183, from 10.128.0.118 via cp1053 cp1053 ([10.64.32.105]:3128), Varnish XID 3532429044 [17:49:40] Coren: ok, you just switched it off? [17:49:44] I'll get out of your way then [17:50:18] (03PS1) 10Dzahn: remove dobson [dns] - 10https://gerrit.wikimedia.org/r/164119 [17:50:33] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [17:51:41] (03PS2) 10Dzahn: remove dobson [dns] - 10https://gerrit.wikimedia.org/r/164119 [17:54:23] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.144 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.155.135 [17:54:45] Aw, excrement! Heisenbug. [17:55:51] andrewbogott: So, I restarted pdns under instrumentation. Look at traces. "Hm, that looks like it works, why doesn't it work?" Test from outside. "Damn, it works again." [17:57:08] (03PS1) 10Dzahn: decom dobson [puppet] - 10https://gerrit.wikimedia.org/r/164120 [17:57:39] Coren: pdns is very sensitive to ldap hiccups. 
So it could be that I nudged ldap and didn't restart pdns properly and everything's been broken since [17:57:44] andrewbogott: It's back up now, and seems to be working much to my annoyance. I have *no* idea what went wrong, but I can tell you the EINVAL seem to be unrelated. [17:57:46] Not a very satisfying answer though [17:59:25] <^d> Hah, did we kill fenari access for non-roots? [17:59:41] ^d: fenary is gone [17:59:50] <^d> Yeah I knew it was on its way out :) [17:59:58] ^d: RIP fenari, may it rest in pieces. [18:00:14] Pretty much everything in pmtpa is off, the network service is going soon [18:00:36] https://www.youtube.com/watch?v=TRP67WHxFfw [18:01:49] <^d> ori: That's great. [18:09:14] (03PS1) 10Dzahn: decom mchenry [puppet] - 10https://gerrit.wikimedia.org/r/164123 [18:10:26] (03PS1) 10Ottomata: Rsync Hive generated webstats pagecounts_all_sites dataset [puppet] - 10https://gerrit.wikimedia.org/r/164124 [18:14:09] (03PS1) 10Alexandros Kosiaris: Sync up ganglia_new configuration with ganglia [puppet] - 10https://gerrit.wikimedia.org/r/164127 [18:15:27] (03PS1) 10Dzahn: remove tarin [dns] - 10https://gerrit.wikimedia.org/r/164128 [18:15:43] (03CR) 10Ori.livneh: [C: 031] Citoid puppetization [puppet] - 10https://gerrit.wikimedia.org/r/163068 (owner: 10Catrope) [18:17:04] (03PS1) 10Dzahn: remove 'bayle' remnant [dns] - 10https://gerrit.wikimedia.org/r/164129 [18:20:29] (03PS1) 10Dzahn: remove labsdb/labstore pmtpa entries [dns] - 10https://gerrit.wikimedia.org/r/164130 [18:21:18] Coren: https://gerrit.wikimedia.org/r/#/c/164130/1/templates/10.in-addr.arpa [18:21:51] just cleanup to see there is no Tampa stuff left [18:22:44] RECOVERY - RAID on db1020 is OK: OK: optimal, 1 logical, 2 physical [18:24:48] (03CR) 10Dzahn: [C: 032] "https://wikitech.wikimedia.org/wiki/Tampa_cluster#bayle" [dns] - 10https://gerrit.wikimedia.org/r/164129 (owner: 10Dzahn) [18:25:04] !log Jenkins jobs fir repos with git submodules broken ("git-submodule: git reset: not found") 
[18:25:12] Logged the message, Master [18:26:34] meh [18:27:59] !log disabling puppet on mw1019 to test impact of ProxyBadHeader apache directive [18:28:05] Logged the message, Master [18:30:18] (03PS1) 10Dzahn: beta - replace mchenry with polonium for smtp [puppet] - 10https://gerrit.wikimedia.org/r/164132 [18:34:27] !log (..jenkins) The command runs fine when done in that workspace from shell. Looks like a bug with Jenkins Java abstraction layer. [18:34:30] quit [18:34:33] (03CR) 10QChris: [C: 04-1] Rsync Hive generated webstats pagecounts_all_sites dataset (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/164124 (owner: 10Ottomata) [18:34:34] Logged the message, Master [18:36:38] (03CR) 10coren: [C: 031] "All of those are known dead." [dns] - 10https://gerrit.wikimedia.org/r/164130 (owner: 10Dzahn) [18:37:48] (03CR) 10Ottomata: Rsync Hive generated webstats pagecounts_all_sites dataset (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/164124 (owner: 10Ottomata) [18:38:00] (03PS2) 10Ottomata: Rsync Hive generated webstats pagecounts_all_sites dataset [puppet] - 10https://gerrit.wikimedia.org/r/164124 [18:41:12] (03CR) 10Rush: [C: 04-1] "two small notes, the dupe resource thing is the only real -1 I see." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/162873 (owner: 10Christopher Johnson (WMDE)) [18:46:48] (03CR) 10QChris: [C: 031] Rsync Hive generated webstats pagecounts_all_sites dataset (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/164124 (owner: 10Ottomata) [18:47:19] greg-g, pls fix that scheduling script, i was about to hit deploy and realized we are not on the schedule again :) [18:49:37] (03PS1) 10Dzahn: replace sanger,sfo-aaa1 with ldap1/ldap2.corp [puppet] - 10https://gerrit.wikimedia.org/r/164139 [18:50:49] (03PS2) 10Dzahn: replace sanger,sfo-aaa1 with ldap1/ldap2.corp [puppet] - 10https://gerrit.wikimedia.org/r/164139 [18:51:03] (03CR) 10Dzahn: "cajoel, also see https://rt.wikimedia.org/Ticket/Display.html?id=6163" [puppet] - 10https://gerrit.wikimedia.org/r/164139 (owner: 10Dzahn) [18:56:17] !log yurik Synchronized php-1.24wmf22/extensions/ZeroBanner/: (no message) (duration: 01m 04s) [18:56:24] Logged the message, Master [18:59:41] (03CR) 10Dzahn: [C: 032] remove labsdb/labstore pmtpa entries [dns] - 10https://gerrit.wikimedia.org/r/164130 (owner: 10Dzahn) [18:59:54] !log yurik Synchronized php-1.24wmf22/extensions/ZeroBanner/: (no message) (duration: 01m 09s) [18:59:59] Logged the message, Master [19:00:40] yurikR: revert for wmf22 please [19:00:43] flooding the logs [19:00:49] [2014-10-01 19:00:27] Fatal error: Call to private method ZeroBanner\ZeroConfig::setEnabled() from context '' at /srv/mediawiki/php-1.24wmf22/extensions/ZeroBanner/includes/ZeroConfig.php on line 341 [19:01:03] hoo, on it [19:01:49] (03CR) 10Dzahn: [C: 032] redirect noc user homedirs to people.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/163756 (owner: 10Dzahn) [19:03:36] !log yurik Synchronized php-1.24wmf22/extensions/ZeroBanner/: (no message) (duration: 01m 08s) [19:03:42] Logged the message, Master [19:03:45] duuh.. 
Invalid command 'RewriteEngine' [19:03:57] yurikR: Has stopped now, thanks [19:04:57] hoo, locally fixed and synced, patching in gerrit and getting stuff in sync. Btw, dir-sync is causing an error -- 1 apache sync error [19:05:15] which one failed? [19:05:28] 19:02:54 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'php-1.24wmf22', '--include', 'php-1.24wmf22/extensions', '--include', 'php-1.24wmf22/extensions/ZeroBanner', '--include', 'php-1.24wmf22/extensions/ZeroBanner/***', 'mw1010.eqiad.wmnet', 'mw1070.eqiad.wmnet', 'mw1161.eqiad.wmnet', 'mw1201.eqiad.wmnet'] on fenari returned [255]: Permission denied (publickey). [19:05:34] i can't figure out which one [19:05:38] oh fenari [19:05:38] duh [19:05:40] one sec [19:06:03] * bd808|LUNCH needs to fox those error messages [19:06:05] *fix [19:06:15] What does the fix say? [19:06:16] bd808: What will the fox say? :D [19:06:25] f1r5t [19:06:26] (03PS1) 10Dzahn: load mod_rewrite on terbium Apache [puppet] - 10https://gerrit.wikimedia.org/r/164144 [19:06:40] it jumps over the lazy dog [19:07:15] mutante: Ok to remove fenari from all of dsh? [19:07:23] I guess it just waits for its shutdown? [19:07:33] (03CR) 10Dzahn: "needed Change-Id: Ie82addf1f521b" [puppet] - 10https://gerrit.wikimedia.org/r/163756 (owner: 10Dzahn) [19:07:42] There's a patch waiting in gerrit [19:07:56] hoo: https://gerrit.wikimedia.org/r/#/c/163315/ [19:08:15] duh [19:08:40] mutante: And taht's blocked? Fair enough [19:08:57] (03CR) 10Dzahn: [C: 032] load mod_rewrite on terbium Apache [puppet] - 10https://gerrit.wikimedia.org/r/164144 (owner: 10Dzahn) [19:09:15] * hoo should be a little less distracted when skimming his mails [19:11:39] i'm not sure if it's blocked. 
i guess not [19:14:20] (03CR) 10Dzahn: "curl -vvv http://noc.wikimedia.org/~dzahn/" [puppet] - 10https://gerrit.wikimedia.org/r/163756 (owner: 10Dzahn) [19:17:04] (03CR) 10Dzahn: [C: 032] "ok, i'm removing it from dsh so deployers don't have to sync to it anymore. it does not appear in pybal at all. also from DHCP in the hope" [puppet] - 10https://gerrit.wikimedia.org/r/163315 (owner: 10Dzahn) [19:17:50] !log fenari - removed from dsh - rejoice deployers, should be faster now [19:17:55] Logged the message, Master [19:18:00] bd808: ^ [19:18:38] * bd808 looks for a good dancing emoticon [19:19:17] \(^▽^@)ノ [19:19:24] hehe :) [19:19:41] source -- http://japaneseemoticons.net/dancing-emoticons/ [19:20:02] ┌(★o☆)┘ [19:20:32] I like that one. Its got a 1979 Paul Stanley feel to it [19:20:43] The star child approves [19:21:00] haha, true [19:21:29] puppet$ grep -r fenari * [19:21:37] so much misc stuff that mentions fenari :p [19:22:08] well, a little [19:22:45] manifests/misc/deployment.pp: # add ssh keypair for l10nupdate user from fenari for RT-5187 [19:25:06] (03PS1) 10Dzahn: remove fenari from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/164148 [19:26:01] bd808: ahaha, the monitoring caught me doing that [19:26:22] (03PS1) 10Ori.livneh: Move php_ini() from HHVM module to wmflib [puppet] - 10https://gerrit.wikimedia.org/r/164149 [19:26:36] ACKNOWLEDGEMENT - mediawiki-installation DSH group on fenari is CRITICAL: Host fenari is not in mediawiki-installation dsh group daniel_zahn RT #6145 [19:26:37] ACKNOWLEDGEMENT - puppet last run on fenari is CRITICAL: CRITICAL: puppet fail daniel_zahn RT #6145 [19:26:55] (03PS3) 10Jgreen: tweak ocg ganglia collector to stop blocking when ocg server is slow [puppet] - 10https://gerrit.wikimedia.org/r/164116 [19:27:54] (03CR) 10Dzahn: [C: 031] "https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=fenari&service=puppet+last+run" [puppet] - 10https://gerrit.wikimedia.org/r/164148 (owner: 10Dzahn) [19:28:25] 
hoo|away, ping when back [19:28:52] (03CR) 10jenkins-bot: [V: 04-1] tweak ocg ganglia collector to stop blocking when ocg server is slow [puppet] - 10https://gerrit.wikimedia.org/r/164116 (owner: 10Jgreen) [19:28:59] (03PS4) 10Jgreen: tweak ocg ganglia collector to stop blocking when ocg server is slow [puppet] - 10https://gerrit.wikimedia.org/r/164116 [19:29:43] (03CR) 10jenkins-bot: [V: 04-1] tweak ocg ganglia collector to stop blocking when ocg server is slow [puppet] - 10https://gerrit.wikimedia.org/r/164116 (owner: 10Jgreen) [19:30:21] (03CR) 10Dzahn: "ping, who wants developers.wikimedia.org and who thinks we shouldn't ?" [apache-config] - 10https://gerrit.wikimedia.org/r/24407 (owner: 10Jeremyb) [19:31:40] (03PS1) 10Dzahn: network.pp - remove fenari [puppet] - 10https://gerrit.wikimedia.org/r/164154 [19:33:51] trying to depl again... [19:33:56] greg-g, ^ [19:34:18] (03PS1) 10Dzahn: tcpircbot - remove fenari references [puppet] - 10https://gerrit.wikimedia.org/r/164157 [19:38:03] (03PS4) 10Dzahn: mail: remove secondary MX role from sodium [puppet] - 10https://gerrit.wikimedia.org/r/143887 (owner: 10Faidon Liambotis) [19:38:20] !log yurik Synchronized php-1.24wmf22/extensions/ZeroBanner/: (no message) (duration: 01m 03s) [19:38:25] Logged the message, Master [19:39:56] (03CR) 10Dzahn: "Reedy: how's the Apache change related to this meanwhile?" 
[dns] - 10https://gerrit.wikimedia.org/r/143086 (owner: 10Reedy) [19:40:50] !log yurik Synchronized php-1.25wmf1/extensions/ZeroBanner/: (no message) (duration: 01m 05s) [19:40:55] Logged the message, Master [19:41:08] (03CR) 10Dzahn: "ah, it's here now: https://gerrit.wikimedia.org/r/#/c/147485/" [dns] - 10https://gerrit.wikimedia.org/r/143086 (owner: 10Reedy) [19:42:19] (03CR) 10Dzahn: "bump and rm self" [puppet] - 10https://gerrit.wikimedia.org/r/144640 (owner: 10ArielGlenn) [19:42:21] (03PS2) 10Chad: First of (hopefully many) Elastic tools [puppet] - 10https://gerrit.wikimedia.org/r/163945 [19:42:56] so will someone actually shutdown fenari? :) [19:43:02] (03CR) 10jenkins-bot: [V: 04-1] First of (hopefully many) Elastic tools [puppet] - 10https://gerrit.wikimedia.org/r/163945 (owner: 10Chad) [19:43:19] (03CR) 10Dzahn: "this might be all different now if sanger has been replaced by new corp/oit ldap, removing my -1" [puppet] - 10https://gerrit.wikimedia.org/r/117698 (owner: 10Matanya) [19:43:36] (03PS5) 10Jgreen: tweak ocg ganglia collector to stop blocking when ocg server is slow [puppet] - 10https://gerrit.wikimedia.org/r/164116 [19:43:39] (03PS1) 10Ori.livneh: HHVM: Get rid of HDF files [puppet] - 10https://gerrit.wikimedia.org/r/164160 [19:43:59] (03CR) 10Alexandros Kosiaris: [C: 032] Sync up ganglia_new configuration with ganglia [puppet] - 10https://gerrit.wikimedia.org/r/164127 (owner: 10Alexandros Kosiaris) [19:44:20] (03CR) 10jenkins-bot: [V: 04-1] tweak ocg ganglia collector to stop blocking when ocg server is slow [puppet] - 10https://gerrit.wikimedia.org/r/164116 (owner: 10Jgreen) [19:45:28] (03PS6) 10Jgreen: tweak ocg ganglia collector to stop blocking when ocg server is slow [puppet] - 10https://gerrit.wikimedia.org/r/164116 [19:46:51] (03CR) 10Jgreen: [C: 032 V: 031] tweak ocg ganglia collector to stop blocking when ocg server is slow [puppet] - 10https://gerrit.wikimedia.org/r/164116 (owner: 10Jgreen) [19:47:12] (03CR) 10Dzahn: "original upload 
date in 2012 -removing self" [apache-config] - 10https://gerrit.wikimedia.org/r/24407 (owner: 10Jeremyb) [19:48:12] (03PS2) 10Ori.livneh: Move php_ini() from HHVM module to wmflib [puppet] - 10https://gerrit.wikimedia.org/r/164149 [19:48:58] ok, seems to be ok this time, done for now [19:49:51] (03CR) 10Chad: "elastictool.py needs a place to live. Not sure yet." [puppet] - 10https://gerrit.wikimedia.org/r/163945 (owner: 10Chad) [19:51:08] (03PS3) 10Ori.livneh: Move php_ini() from HHVM module to wmflib [puppet] - 10https://gerrit.wikimedia.org/r/164149 [19:51:36] (03CR) 10Ori.livneh: [C: 032 V: 032] Move php_ini() from HHVM module to wmflib [puppet] - 10https://gerrit.wikimedia.org/r/164149 (owner: 10Ori.livneh) [19:52:12] (03PS2) 10Ori.livneh: HHVM: Get rid of HDF files [puppet] - 10https://gerrit.wikimedia.org/r/164160 [19:52:57] (03PS3) 10Chad: First of (hopefully many) Elastic tools [puppet] - 10https://gerrit.wikimedia.org/r/163945 [19:54:55] (03CR) 10Ori.livneh: [C: 032] "Same settings, different syntax." 
[puppet] - 10https://gerrit.wikimedia.org/r/164160 (owner: 10Ori.livneh) [19:58:35] PROBLEM - Apache HTTP on mw1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50317 bytes in 0.073 second response time [19:58:36] PROBLEM - HHVM rendering on mw1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50317 bytes in 0.011 second response time [19:59:00] that's me, fixing [19:59:06] PROBLEM - Apache HTTP on mw1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50317 bytes in 0.025 second response time [19:59:45] PROBLEM - HHVM rendering on mw1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50317 bytes in 0.010 second response time [19:59:55] PROBLEM - Apache HTTP on mw1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50317 bytes in 0.049 second response time [20:00:04] gwicke, cscott, subbu: Respected human, time to deploy Parsoid/OCG (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141001T2000). Please do the needful. [20:00:05] PROBLEM - HHVM rendering on mw1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50317 bytes in 0.011 second response time [20:01:15] PROBLEM - puppet last run on amssq58 is CRITICAL: CRITICAL: puppet fail [20:03:59] (03PS3) 10Dzahn: Stop sending apache syslogs to remote syslog [puppet] - 10https://gerrit.wikimedia.org/r/164002 (owner: 10Reedy) [20:04:16] (03CR) 10Dzahn: [C: 032] Stop sending apache syslogs to remote syslog [puppet] - 10https://gerrit.wikimedia.org/r/164002 (owner: 10Reedy) [20:04:56] mutante: you are aware that filippo set up a new server for it today? [20:05:39] we are not deploying parsoid today .. in editing QR. we will push changes on monday. [20:05:41] mark: no, ah, but this was already duplicate [20:05:53] cscott, ocg? ^ [20:05:55] mark: well, when talking to reedy it was like we already have this on fluorine [20:06:21] httpd for all servers or just application servers? 
[20:06:23] (I don't know) [20:06:46] yeah, i'm going to deploy ocg [20:07:08] (03PS1) 10Ori.livneh: Fix-up for HHVM upstart config [puppet] - 10https://gerrit.wikimedia.org/r/164192 [20:07:24] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix-up for HHVM upstart config [puppet] - 10https://gerrit.wikimedia.org/r/164192 (owner: 10Ori.livneh) [20:07:31] (03PS1) 10Andrew Bogott: Turn on some debug strings while I'm working on this [puppet] - 10https://gerrit.wikimedia.org/r/164195 [20:08:39] mark: hmm, i think just appservers, /a/mw-log/apache2.log [20:08:57] so best revert this then [20:09:01] ok [20:09:09] filippo will take care of it tomorrow [20:09:11] (03PS1) 10Dzahn: Revert "Stop sending apache syslogs to remote syslog" [puppet] - 10https://gerrit.wikimedia.org/r/164210 [20:09:15] (and then shut off nfs1) [20:09:25] fenari can go off now right? [20:09:28] mark: ok, cool, thanks for pointing that out [20:09:33] yurikR: You poked? [20:09:58] i think so, yea.. let me remove it from site.pp so it can be cleaned from icinga [20:10:07] re: fenari [20:10:32] as long as nothing prevents it from being powered back up in need, sure [20:10:42] let's do the remaining cleanup on monday, when we pull the boxes [20:10:48] then we can cleanup tampa in one go from all configs [20:10:58] (well, next week anyway) [20:11:17] ok [20:11:45] deployments are already faster because fenari is gone:) [20:11:51] (03CR) 10Dzahn: [C: 032] "the existing log on fluorine is likely just appservers, but this would have affected all apaches. 
godog setting up new location for this" [puppet] - 10https://gerrit.wikimedia.org/r/164210 (owner: 10Dzahn) [20:11:55] hehe [20:11:56] PROBLEM - Apache HTTP on mw1021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50317 bytes in 0.007 second response time [20:12:00] PROBLEM - HHVM rendering on mw1021 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50317 bytes in 0.007 second response time [20:12:16] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [20:12:19] PROBLEM - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50317 bytes in 0.043 second response time [20:12:42] ori: merging the upstart hhvm fix with mine, ok? [20:12:47] (03PS2) 10Andrew Bogott: Turn on some debug strings while I'm working on this [puppet] - 10https://gerrit.wikimedia.org/r/164195 [20:13:06] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.207 second response time [20:13:16] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [20:13:17] RECOVERY - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 67965 bytes in 0.277 second response time [20:13:39] mutante: please [20:13:42] done [20:13:46] RECOVERY - HHVM rendering on mw1020 is OK: HTTP OK: HTTP/1.1 200 OK - 67958 bytes in 0.204 second response time [20:13:56] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.059 second response time [20:14:05] Lua errors reported in -en [20:14:16] RECOVERY - HHVM rendering on mw1017 is OK: HTTP OK: HTTP/1.1 200 OK - 67958 bytes in 0.395 second response time [20:14:17] mutante: do you know anything about labs project 'planet'? 
[20:14:18] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.103 second response time [20:14:20] RECOVERY - HHVM rendering on mw1021 is OK: HTTP OK: HTTP/1.1 200 OK - 67958 bytes in 0.398 second response time [20:14:37] andrewbogott: yea, sounds like it was mine to test planet module [20:15:01] andrewbogott: does it need to move? [20:15:01] mutante: it contains one instance, 'eris'. May I delete it and erase the project? [20:15:13] puppet is broken on eris and I tired of messing with it :) [20:15:17] PROBLEM - puppet last run on mw1053 is CRITICAL: CRITICAL: Puppet last ran 63890 seconds ago, expected 14400 [20:15:42] andrewbogott: give me 5 min to look at it [20:16:08] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.076 second response time [20:16:17] RECOVERY - HHVM rendering on mw1018 is OK: HTTP OK: HTTP/1.1 200 OK - 67958 bytes in 0.321 second response time [20:16:17] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [20:17:01] (03CR) 10Alexandros Kosiaris: "The localcerts thing has introduced a timebomb in our infrastructure, one that is dependent on apparmor. Apparmor rules that are included " [puppet] - 10https://gerrit.wikimedia.org/r/163222 (owner: 10coren) [20:17:06] (03PS1) 10Andrew Bogott: Turn on pluginsync on puppet master [puppet] - 10https://gerrit.wikimedia.org/r/164213 [20:17:39] (03CR) 10Andrew Bogott: [C: 032] Turn on some debug strings while I'm working on this [puppet] - 10https://gerrit.wikimedia.org/r/164195 (owner: 10Andrew Bogott) [20:18:08] andrewbogott: hmm.. permission denied , public key [20:18:20] is that cause it's broken? 
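The PROBLEM and RECOVERY notifications above can be paired per service and host to see how long each check stayed critical. A small sketch, assuming one notification per input line in the "[HH:MM:SS] PROBLEM - <service> on <host> is ..." shape seen in this log, and assuming all timestamps fall on the same day; the function name is made up for illustration.

```python
import re
from datetime import datetime

# [HH:MM:SS] PROBLEM|RECOVERY - <service> on <host> is <state>: ...
LINE = re.compile(r"\[(\d\d:\d\d:\d\d)\] (PROBLEM|RECOVERY) - (.+?) on (\S+) is ")


def outage_durations(lines):
    """Map (service, host) to the time between its first PROBLEM and next RECOVERY."""
    open_since = {}
    durations = {}
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # ordinary chat or a Gerrit notification
        ts = datetime.strptime(m.group(1), "%H:%M:%S")
        key = (m.group(3), m.group(4))
        if m.group(2) == "PROBLEM":
            # Keep the earliest PROBLEM if the alert repeats before recovering.
            open_since.setdefault(key, ts)
        elif key in open_since:
            durations[key] = ts - open_since.pop(key)
    return durations
```

A RECOVERY without a preceding PROBLEM is ignored, which matters for a log excerpt that starts partway through an incident.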
[20:18:35] mutante: try again [20:18:45] (03PS1) 10Ori.livneh: php_ini: use unique array keys for array options [puppet] - 10https://gerrit.wikimedia.org/r/164214 [20:19:14] (03PS2) 10Andrew Bogott: Turn on pluginsync on puppet master [puppet] - 10https://gerrit.wikimedia.org/r/164213 [20:19:18] mutante: why logging in as root? [20:19:47] andrewbogott: because it doesn't work as dzahn, so i'm trying the backup solution when home dirs arent mounted [20:19:57] my key is in root auth keys [20:20:03] mutante: will you please try logging in as yourself so I can see what's happening? [20:20:23] andrewbogott: and now i'm on it [20:20:30] *shrug* ok [20:21:06] RECOVERY - puppet last run on amssq58 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [20:21:07] PROBLEM - puppet last run on ocg1002 is CRITICAL: CRITICAL: Puppet has 1 failures [20:21:32] lol, role phabricator is applied on it? in the planet project. i dont remember doing that at all [20:21:47] (03PS2) 10Ori.livneh: php_ini: use unique array keys for array options [puppet] - 10https://gerrit.wikimedia.org/r/164214 [20:21:57] (03CR) 10Ori.livneh: [C: 032 V: 032] php_ini: use unique array keys for array options [puppet] - 10https://gerrit.wikimedia.org/r/164214 (owner: 10Ori.livneh) [20:23:13] andrewbogott: "manage instances" does not list instances for me ? [20:23:31] mutante: probably you haven't used labsconsole in a week or so. Log out and in, should be fixed [20:23:59] * JetLaggedPanda hopes we switch to horizon at some point [20:27:43] !log updated OCG to version 48c495e3656f528abe636ce0cd7562270505534f [20:27:49] Logged the message, Master [20:27:56] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [20:28:10] ori: what's up with mw1053? [20:28:21] The last Puppet run was at Wed Oct 1 02:29:22 UTC 2014 (1030 minutes ago). [20:29:43] andrewbogott: instance shutdown and deleted. can you let the project itself stay though? 
there might have been other users on it, but puppet::self and phab doesn't belong in planet project:) [20:30:05] mutante: there aren't any other instances, so it seems unlikely that anyone else is using it... [20:30:12] but, it doesn't have to be deleted, just cleaning up [20:30:14] thank you! [20:30:15] no, i mean this instance [20:30:16] RECOVERY - puppet last run on ocg1002 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [20:30:35] but i might want to create a new instance here later [20:30:42] ok [20:31:10] meh [20:31:13] because i dont remember having applied phab on this one.. but it's definitely not phab-01 [20:31:29] phab has a proper project meanwhile [20:32:28] !log Ran sync-common on mw1053 to stop "Unrecognized job type 'EchoNotificationDeleteJob'." exceptions [20:32:35] Logged the message, Master [20:32:45] well, not yet finished [20:32:58] I wonder why restarting hhvm didn't do the trick [20:37:26] ah, it's because that job really doesn't exist :P [20:40:04] (03CR) 10Dzahn: [C: 032] "no point i having it in puppet anymore - and the puppet run fails, going to shut it down too, /home already unmounted" [puppet] - 10https://gerrit.wikimedia.org/r/164148 (owner: 10Dzahn) [20:42:49] (03PS1) 10RobH: settting codfw es servers mgmt [dns] - 10https://gerrit.wikimedia.org/r/164215 [20:43:23] (03PS1) 10Andrew Bogott: Replace a couple of notify that were accidentally removed in d8bbd59292f26806f44b9f0b5492b7594858f6ba [puppet] - 10https://gerrit.wikimedia.org/r/164216 [20:44:21] (03CR) 10Andrew Bogott: [C: 032] Replace a couple of notify that were accidentally removed in d8bbd59292f26806f44b9f0b5492b7594858f6ba [puppet] - 10https://gerrit.wikimedia.org/r/164216 (owner: 10Andrew Bogott) [20:44:34] (03PS2) 10Chad: Remove unused bash functions. Nothing calls this. 
[puppet] - 10https://gerrit.wikimedia.org/r/163944 [20:45:54] ACKNOWLEDGEMENT - Host mchenry is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn decom - RT #1804 [20:48:03] ACKNOWLEDGEMENT - Host dobson is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn decom RT #6157 [20:48:57] (03PS1) 10Andrew Bogott: Revert "Turn on some debug strings while I'm working on this" [puppet] - 10https://gerrit.wikimedia.org/r/164218 [20:49:51] (03CR) 10Andrew Bogott: [C: 032] Revert "Turn on some debug strings while I'm working on this" [puppet] - 10https://gerrit.wikimedia.org/r/164218 (owner: 10Andrew Bogott) [20:50:34] (03CR) 10Dzahn: "mchenry is already down - so i guess this cant work currently" [puppet] - 10https://gerrit.wikimedia.org/r/164132 (owner: 10Dzahn) [20:51:10] (03Abandoned) 10Andrew Bogott: Turn on pluginsync on puppet master [puppet] - 10https://gerrit.wikimedia.org/r/164213 (owner: 10Andrew Bogott) [20:52:17] (03CR) 10Dzahn: [C: 032] tcpircbot - remove fenari references [puppet] - 10https://gerrit.wikimedia.org/r/164157 (owner: 10Dzahn) [20:53:42] (03CR) 10Manybubbles: [C: 031] Remove unused bash functions. Nothing calls this. [puppet] - 10https://gerrit.wikimedia.org/r/163944 (owner: 10Chad) [20:55:30] (03CR) 10BryanDavis: [C: 031] beta - replace mchenry with polonium for smtp [puppet] - 10https://gerrit.wikimedia.org/r/164132 (owner: 10Dzahn) [20:56:53] (03CR) 10Dzahn: [C: 032] beta - replace mchenry with polonium for smtp [puppet] - 10https://gerrit.wikimedia.org/r/164132 (owner: 10Dzahn) [21:00:07] ACKNOWLEDGEMENT - Host 208.80.152.132 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn mchenry [21:00:33] ACKNOWLEDGEMENT - Host 208.80.152.131 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn dobson [21:03:54] ori: hallo! [21:04:09] ori: [Bug 71486] Some links turned green (200) today. 
[21:04:19] ori: I added a cookbook to page https://de.wikipedia.org/wiki/Benutzer:Boshomi/testurl#cookbook [21:04:53] !log fenari - shutdown -h now (omg) :) [21:05:01] Logged the message, Master [21:05:23] ori: so you can to the same steps then I did yesterday [21:06:37] PROBLEM - Host fenari is DOWN: CRITICAL - Plugin timed out after 15 seconds [21:06:43] GASP [21:07:42] ACKNOWLEDGEMENT - Host fenari is DOWN: CRITICAL - Plugin timed out after 15 seconds daniel_zahn yea, downtime and disabled notifications - just wanted to announce :) - #6145: shutdown fenari [21:12:06] ACKNOWLEDGEMENT - Host dataset2 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn dumps.wikimedia.org is an alias for dataset1001.wikimedia.org. [21:17:01] (03PS1) 10Dzahn: decom dataset2, replace with dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/164233 [21:17:33] (03PS2) 10Dzahn: decom dataset2, replace with dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/164233 [21:18:01] (03PS1) 10coren: Fix apparmor to allow for new cert scheme. [puppet] - 10https://gerrit.wikimedia.org/r/164234 [21:19:28] (03CR) 10Dzahn: [C: 031] "dumps.wikimedia.org is an alias for dataset1001.wikimedia.org." [puppet] - 10https://gerrit.wikimedia.org/r/164233 (owner: 10Dzahn) [21:19:47] andrewbogott: robh: Quick fix to apparmor for or new cert scheme. ^^ [21:20:10] (03CR) 10Dzahn: "https://rt.wikimedia.org/Ticket/Display.html?id=8512" [puppet] - 10https://gerrit.wikimedia.org/r/164233 (owner: 10Dzahn) [21:20:13] (03CR) 10jenkins-bot: [V: 04-1] Fix apparmor to allow for new cert scheme. [puppet] - 10https://gerrit.wikimedia.org/r/164234 (owner: 10coren) [21:20:36] fancy [21:20:47] Coren: jenkins dislikes [21:22:09] (03PS3) 10Dzahn: decom dataset2, replace with dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/164233 [21:24:13] yurikR: add it to the calendar and it'll be there, there is no script [21:24:36] greg-g, which calendar? 
I added it last time and this time [21:25:08] yurikR: I don't see it on wed: https://wikitech.wikimedia.org/wiki/Deployments#Week_of_September_29th [21:25:15] oh, there it is [21:25:27] what's the complaint, then? [21:25:38] greg-g, it wasn't auto-scheduled for today :) [21:26:03] yurikR: now it will be for next week if I don't accidentally remove it [21:26:06] again, it's a wiki [21:26:15] there's no script, it's a copy/pasta mess [21:26:27] (03CR) 10Dzahn: "citoid.svc does not appear to be in DNS yet, merging this would add an LVS check on it though, which would likely cause pages" [puppet] - 10https://gerrit.wikimedia.org/r/163068 (owner: 10Catrope) [21:26:37] greg-g, that's my point - where do you copy it from? I will update it there, so that next time it's copy-pasted, it will be included [21:26:46] * greg-g shouldn't be on IRC with this headache and nausea [21:26:51] yurikR: from the previous week [21:27:11] but it wasn't copied, so you must have copied it from somewhere else [21:27:17] nope [21:27:24] well, see the rev history :) [21:27:27] or, just maybe, I accidentally deleted [21:27:32] :D [21:27:38] no one likes zero :( [21:27:40] yurikR: look, there's no other page [21:27:49] put it there, I'll do my best, add it if I forget [21:27:53] I'm sick, now go one [21:27:54] -e [21:27:59] :) [21:28:01] get better! [21:28:10] i remember you said that there is a lua script somewhere for it?
[21:28:17] that just does the rendering [21:28:23] pro-tip: view-source [21:28:30] :P [21:28:44] !log aaron Synchronized php-1.25wmf1/maintenance/findMissingFiles.php: 832ed2ce9938dc51fdb4190423ce03e93e65c639 (duration: 00m 05s) [21:28:49] Logged the message, Master [21:30:12] (03PS1) 10Dzahn: add citoid.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/164238 [21:32:18] (03CR) 10Dzahn: "created https://gerrit.wikimedia.org/r/#/c/164238/1 for that" [puppet] - 10https://gerrit.wikimedia.org/r/163068 (owner: 10Catrope) [21:36:38] (03CR) 10Dzahn: "upload is from April, predicting this is not going to be merged unless somebody on the HAT project makes a new patch" [puppet] - 10https://gerrit.wikimedia.org/r/130296 (owner: 10ArielGlenn) [21:37:33] (03CR) 10Dzahn: "rm self, can't have stuff in queue for months or overload" [puppet] - 10https://gerrit.wikimedia.org/r/130296 (owner: 10ArielGlenn) [21:38:08] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [21:38:50] (03PS2) 10coren: Fix apparmor to allow for new cert scheme. [puppet] - 10https://gerrit.wikimedia.org/r/164234 [21:40:47] PROBLEM - CI tmpfs disk space on lanthanum is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 30 MB (5% inode=99%): [21:41:16] _joe_: I guess we can move a few more runners to hhvm soon [21:41:49] _joe_: afaik there are fixes for wikidata that were deployed and I have a patch to fix deferred uploads [21:42:24] (03PS1) 10Yuvipanda: icinga: Get rid of ganglios [puppet] - 10https://gerrit.wikimedia.org/r/164239 [21:42:27] * AaronS hurls https://gerrit.wikimedia.org/r/#/c/163774/ at ori [21:42:35] ottomata: ^ wanna merge? [21:46:43] JetLaggedPanda: tomorrow perhaps! it is late my time [21:47:50] mutante: So... I need to deploy a patch to our deployed version of a Jenkins Plugin. context: Jenkins runs on gallium. It is not puppetised. The plugins are managed by Jenkins itself on-disk as a directory of .jpi / .hpi files. 
[21:48:15] I've got a patch merged in our fork of it in git. I would like guidance on how to compile it and then replace the file. [21:48:16] hashar: [21:48:34] hashar: Should I just compile it with maven locally (on a server running the same ubuntu version as gallium) [21:48:39] and then scp the file? [21:49:21] Krinkle: you can upload it with https://integration.wikimedia.org/ci/pluginManager/advanced [21:49:30] which let you upload a .hpi file [21:49:31] i don't know that, sorry, never involved in compiling jenkins plugins [21:49:46] hashar: OK. I still need to compile it though [21:49:47] PROBLEM - CI tmpfs disk space on lanthanum is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 26 MB (5% inode=99%): [21:49:48] Krinkle: you probably want to copy / backup the current one [21:49:57] I'm on a labs instance now. just ran apt-get install maven [21:50:04] Krinkle: maven would do, but it is a mess to deal with specially now [21:50:04] https://wiki.jenkins-ci.org/display/JENKINS/Plugin+tutorial#Plugintutorial-BuildingaPlugin [21:50:12] or [21:50:28] you can create a jenkins job to poll wikimedia/jenkins-git-plugin and build your pull requests :] [21:52:19] ottomata: ah, cool. late my time too, now that I realized (3:30AM) [21:52:21] ottomata: shall poke tomorrow! [21:54:32] ok! 
[21:56:26] (03PS1) 10Dzahn: switch DHCP domain-name - pmtpa->eqiad [puppet] - 10https://gerrit.wikimedia.org/r/164241 [21:56:55] (03PS2) 10Dzahn: switch DHCP domain-name - pmtpa->eqiad [puppet] - 10https://gerrit.wikimedia.org/r/164241 [21:59:51] (03PS1) 10Dzahn: remove subnet 10.4.6.0/24 - pmtpa virtual hosts [puppet] - 10https://gerrit.wikimedia.org/r/164242 [22:14:58] (03PS1) 10Dzahn: remove pmtpa db's from coredb role config [puppet] - 10https://gerrit.wikimedia.org/r/164244 [22:19:00] (03PS1) 10Dzahn: labsnfs - replace labstore1 with labstore1001 [puppet] - 10https://gerrit.wikimedia.org/r/164247 [22:22:32] (03PS1) 10Dzahn: torrus - remove pmtpa data sources [puppet] - 10https://gerrit.wikimedia.org/r/164248 [22:25:10] (03PS1) 10Dzahn: remove Tampa db and es servers from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/164249 [22:26:02] mutante: Do we have varnish error logs around somehwer? [22:26:07] * somewhere [22:29:05] hoo: I suspect the question is more whether they're anywhere you can access them? ;) [22:29:09] Might be better asking bblack [22:29:42] well, at least on misc varnish, /var/log/varnish is empty [22:31:02] oh, right, of course [22:31:03] it seems to be send to kafka? [22:31:04] https://wikitech.wikimedia.org/wiki/Varnish#See_request_logs [22:31:13] there, you use varnishnsca for that [22:31:41] haha... I probably don't (as in can't) [22:32:10] (03PS1) 10Ori.livneh: HHVM build-deps: add condition to exec, move to contint [puppet] - 10https://gerrit.wikimedia.org/r/164250 [22:32:43] "As explained below, there are no access logs. However, you can see NCSA style log entries for current requests using: " [22:32:52] so it wouldn't have helped for the past anyways [22:33:17] I can reproduce :P [22:34:38] what are you trying to get? access logs from the past? [22:34:43] errors? [22:35:07] oh you said errors, nvm [22:35:09] mutante: Errors, yes [22:36:16] mutante: Thanks for https://gerrit.wikimedia.org/r/#/c/164238/1 :) [22:36:43] ah I see... 
purging the URL might work for us [22:36:53] (03CR) 10Catrope: [C: 031] add citoid.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/164238 (owner: 10Dzahn) [22:36:54] not sure who can/will do that [22:40:52] hoo: need to kill neg dns cache? [22:40:59] (03CR) 10Tim Landscheidt: [C: 04-1] "The whole role::labsnfs::client is deprecated outside of pmtpa (cf. manifests/role/labsnfs.pp; and IIRC autofs caused a lot of pain). So " [puppet] - 10https://gerrit.wikimedia.org/r/164247 (owner: 10Dzahn) [22:41:23] robh: We have a cached 503 or something weird like that [22:41:40] oh, thought you meant dns cache, not varnish cache [22:41:56] on who can do that, not sure, but Coren is on point of contact duty [22:42:06] so we should ping him like i did, and ask him [22:42:09] =] [22:42:32] ie: if you aren't sure who handles it, it is the responsibility of the ops duty person [22:42:35] this week marc [22:42:39] now my other URL is also failing [22:42:58] now i have a message that it's bblack [22:43:00] https://wikitech.wikimedia.org/wiki/Varnish#One-off_purges [22:43:01] so [22:43:10] bblack: ^? are you the one off purge dude? [22:43:23] it says 'don't do this, consult a specialist' [22:43:35] heh [22:43:54] hoo: so yea, in this case, varnish experts are like.. brandon, who i just pinged, mar k (where it's late there so won't be here) [22:43:56] uhh... [22:44:28] Ok :S [22:44:38] hrmm [22:44:48] i feel the need to have an actual solution. [22:45:07] RECOVERY - CI tmpfs disk space on lanthanum is OK: DISK OK [22:45:14] but touching something with 'consult an expert' on our actual docs makes me not want to touch them, heh [22:45:23] Sure [22:45:37] hoo: so is this something where i should start texting folks or ?
[22:45:46] ie: i dont want to be the jerk who just ignores you [22:46:08] no, not this urgent [22:46:14] though i suppose at this point i'd be the jerk who interacted with you, and THEN ignored you [22:46:15] i don't think so [22:46:18] which is kinda worse ;] [22:46:31] let me see how large the apache response is [22:46:32] i think it affects just one wikidata page [22:46:47] 2.9MiB [22:46:51] then i expect it'll expire before someone gets to it =P [22:46:54] oooo [22:47:03] That's what the apaches send [22:47:12] I guess varnish chokes on that (on purpose?) [22:47:44] Last time we had something like that we were sending bad headers and varnish went crazy on that [22:48:59] !log Deploy Jenkins git-client-plugin v1.4.6+wmf1 from https://github.com/wikimedia/git-client-plugin/tree/git-client-1.4.6+wmf1 (c80b05bb10985ab94c4c4217d07a0868087b5994) – https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/Jenkins_Plugin [22:49:06] Logged the message, Master [22:49:19] headers look reasonable [22:49:46] (from both hhvm appservers and zend) [22:54:38] (03PS1) 10Dzahn: remove Tampa db's from site and dsh [puppet] - 10https://gerrit.wikimedia.org/r/164253 [22:56:17] <^d> I can do swat today. [22:56:22] * hoo wtfs [22:56:30] If I poll varnish from tin, it works [22:58:50] <^d> Are we having some sort of issue? Should we hold swat? [22:59:15] aude: Does it work for you now? [22:59:18] I think it works agai [22:59:19] n [22:59:30] (using the normal URL) [22:59:37] did someone poke varnish? [23:00:04] RoanKattouw, ^d, marktraceur, MaxSem: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141001T2300). Please do the needful. [23:01:13] (03PS1) 10Dzahn: remove dataset2 [dns] - 10https://gerrit.wikimedia.org/r/164255 [23:02:28] hoo: works for me [23:02:34] Ok [23:02:52] curl -H 'Host: www.wikidata.org' cp1052/wiki/Q183 [23:02:55] maybe this fixed it? 
[23:02:56] veeeeeeeeeery slow [23:04:02] (03PS1) 10Dzahn: remove Tampa db and es servers [dns] - 10https://gerrit.wikimedia.org/r/164257 [23:04:04] different issue :P [23:04:08] not my department [23:04:10] (03CR) 10jenkins-bot: [V: 04-1] remove Tampa db and es servers [dns] - 10https://gerrit.wikimedia.org/r/164257 (owner: 10Dzahn) [23:04:10] * hoo hides [23:04:16] heh [23:05:55] (03PS2) 10Dzahn: remove Tampa db and es servers [dns] - 10https://gerrit.wikimedia.org/r/164257 [23:07:51] I'll do the SWAT [23:08:10] ^d: Oh sorry never mind, didn't see you'd claimed it [23:08:23] (03PS1) 10Dzahn: remove mchenry [dns] - 10https://gerrit.wikimedia.org/r/164259 [23:08:28] <^d> I was going to, but wasn't sure if something was up. [23:08:37] <^d> I kept seeing mumble mumble varnish [23:08:42] <^d> mumble page someone [23:08:42] I'm so used to people claiming [23:08:53] SWAT in response to jouncebot that I had completely missed that you said you'd do it [23:09:04] And I was like "no one volunteered grumble grumble I'll do it" :) [23:09:34] !log Jenkins restart finished. Patched git-client-plugin seems to work as expected (bug 71533). [23:09:40] <^d> RoanKattouw: I'll do yours first. [23:09:41] Logged the message, Master [23:09:41] <^d> Merging. [23:09:49] (03CR) 10Jforrester: [C: 031] "Good to go whenever, per Adam." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160494 (owner: 10Awight) [23:10:22] (03PS1) 10Dzahn: remove mexia [dns] - 10https://gerrit.wikimedia.org/r/164260 [23:10:40] <^d> superm401: You about for your swat patches? [23:11:57] (03PS1) 10Dzahn: remove sanger [dns] - 10https://gerrit.wikimedia.org/r/164261 [23:12:34] (03CR) 10Cscott: "Scheduled for SWAT morning of 2-Oct-2014, so that I can be online and available just in case there are issues." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164099 (owner: 10Cscott) [23:13:25] ^d, yeah. [23:13:26] <^d> RoanKattouw: still just waiting on jenkins. [23:13:31] <^d> sweetness. 
[23:13:48] <^d> superm401: I can merge all your branch patches. Can you prepare a patch for MW core's two branches when they merge? [23:13:57] ^d, sure. [23:14:25] (03PS1) 10Dzahn: smtp service -> polonium, remove imap service [dns] - 10https://gerrit.wikimedia.org/r/164262 [23:14:38] (03CR) 10Ori.livneh: [C: 031] use scap's embedded linking, remove lint script [puppet] - 10https://gerrit.wikimedia.org/r/160691 (https://bugzilla.wikimedia.org/68255) (owner: 10Filippo Giunchedi) [23:15:45] * ^d twiddles thumbs [23:19:20] !log demon Synchronized php-1.25wmf1/extensions/VisualEditor: (no message) (duration: 00m 04s) [23:19:22] <^d> RoanKattouw: ^ [23:19:27] Logged the message, Master [23:19:58] ^d, all merged. Doing the bumps now. [23:20:08] <^d> Okie dokie [23:24:06] ^d, https://gerrit.wikimedia.org/r/164265 and https://gerrit.wikimedia.org/r/164267 [23:24:39] <^d> I didn't wait for jenkins because I have no patience and I'm a rebel. [23:24:44] <^d> Living dangerously. [23:25:55] !log demon Synchronized php-1.24wmf22/extensions/GuidedTour: (no message) (duration: 00m 04s) [23:26:02] Logged the message, Master [23:26:16] ^d: Thanks man [23:26:32] !log demon Synchronized php-1.25wmf1/extensions/GuidedTour: (no message) (duration: 00m 04s) [23:26:36] <^d> superm401: And you're live. [23:26:37] Logged the message, Master [23:26:37] <^d> RoanKattouw: yw [23:27:38] Thanks, ^d, I'll test. [23:31:28] ^d, looks good. [23:31:35] <^d> Sweet. [23:31:44] <^d> I declare swat closeddddd [23:55:38] (03Abandoned) 10Yuvipanda: icinga: Remove ganglios checks [puppet] - 10https://gerrit.wikimedia.org/r/162969 (owner: 10Yuvipanda) [23:59:18] (03PS4) 10Chad: First of (hopefully many) Elastic tools [puppet] - 10https://gerrit.wikimedia.org/r/163945 [23:59:20] (03PS1) 10Chad: More elasticsearch tools [puppet] - 10https://gerrit.wikimedia.org/r/164270