[00:01:09] <grrrit-wm>	 (PS1) Platonides: Add an "authenticate" command for identifying with nickserv after connection [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/318229
[00:03:12] <wikibugs>	 Tool-Labs-tools-stewardbots: StewardBot not logged into irc - https://phabricator.wikimedia.org/T149265#2747153 (Platonides) This should help in case it happens again by allowing any privileged user to reauthenticate it.  https://gerrit.wikimedia.org/r/318229
[00:40:46] <shinken-wm>	 PROBLEM - Puppet run on tools-puppetmaster-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[00:57:05] <Krenair>	 !log tools.stewardbots Restart - it looks like this started during a NickServ outage, resulting in no authentication, resulting in IRC ops getting pinged by an anti-flood bot about this bot's behaviour
[00:57:11] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL, Master
[05:07:54] <bd808>	 !log tools.stashbot Tried to switch main bot from OGE to k8s but pod ended in CrashLoopBackOff status with no log output that I could find
[05:08:00] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL, Master
[06:32:28] <shinken-wm>	 PROBLEM - Puppet run on tools-flannel-etcd-03 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[07:11:46] <grrrit-wm>	 (CR) Ricordisamoa: [C: 2] Support GET requests in get_json() and get_json_cached() [labs/tools/ptable] - https://gerrit.wikimedia.org/r/318054 (owner: Ricordisamoa)
[07:12:01] <grrrit-wm>	 (Merged) jenkins-bot: Support GET requests in get_json() and get_json_cached() [labs/tools/ptable] - https://gerrit.wikimedia.org/r/318054 (owner: Ricordisamoa)
[07:12:26] <shinken-wm>	 RECOVERY - Puppet run on tools-flannel-etcd-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:02:40] <wikibugs>	 Labs-Kubernetes, Wikisource, Community-Tech-Sprint: Make Google OCR API on Tool Labs work under Kubernetes - https://phabricator.wikimedia.org/T146311#2747662 (Samwilson) ``` 2016-10-27 06:26:57: (mod_fastcgi.c.2569) unexpected end-of-file (perhaps the fastcgi process died): pid: 9 socket: unix:/var/...
[09:17:52] <shinken-wm>	 PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[09:34:25] <shinken-wm>	 RECOVERY - Host tools-secgroup-test-103 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms
[09:44:22] <shinken-wm>	 PROBLEM - Host tools-secgroup-test-103 is DOWN: CRITICAL - Host Unreachable (10.68.21.22)
[09:57:51] <shinken-wm>	 RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:01:17] <wikibugs>	 Labs-Kubernetes, Wikisource, Community-Tech-Sprint: Make Google OCR API on Tool Labs work under Kubernetes - https://phabricator.wikimedia.org/T146311#2747860 (Niharika) a:Samwilson
[10:13:10] <shinken-wm>	 RECOVERY - Host secgroup-lag-102 is UP: PING OK - Packet loss = 0%, RTA = 121.89 ms
[10:18:08] <shinken-wm>	 PROBLEM - Host secgroup-lag-102 is DOWN: CRITICAL - Host Unreachable (10.68.17.218)
[10:38:11] <shinken-wm>	 PROBLEM - Puppet staleness on tools-prometheus-01 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [43200.0]
[11:12:53] <shinken-wm>	 PROBLEM - Puppet staleness on tools-prometheus-02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [43200.0]
[11:38:30] <doctaxon>	 Hi! Running a script on the labs instance dwl I get the error '35 SSL connect error. The SSL handshaking failed.' about once an hour. What is the reason?
[11:48:53] <shinken-wm>	 PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[11:49:38] <shinken-wm>	 PROBLEM - Puppet run on tools-docker-registry-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[12:04:40] <shinken-wm>	 RECOVERY - Puppet run on tools-docker-registry-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:23:51] <shinken-wm>	 RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:26:40] <wikibugs>	 Wikibugs: wikibugs - throttle output, don't get kicked for flooding - https://phabricator.wikimedia.org/T112032#2748088 (Samtar) Open>Resolved
[13:33:54] <wikibugs>	 Labs, Tool-Labs, Developer-Relations, WMF-Legal: Provide an easy way for Tool Labs tools to expose their source code - https://phabricator.wikimedia.org/T102081#1355202 (Qgil) Would this task be a good topic for the #wikidev17 ? If so, the deadline to submit new proposals is next Monday, October...
[13:34:08] <wikibugs>	 Labs, Tool-Labs, Developer-Relations, WMF-Legal: Make sure tools can be taken over after they are abandoned - https://phabricator.wikimedia.org/T102066#1354813 (Qgil) Would this task be a good topic for the #wikidev17 ? If so, the deadline to submit new proposals is next Monday, October 31: https...
[13:35:21] <wikibugs>	 Labs, Community-Tech-Tool-Labs, Developer-Relations, wikitech.wikimedia.org, Epic: [EPIC] Make wikitech more friendly for the multiple audiences it supports - https://phabricator.wikimedia.org/T123425#2748274 (Qgil) Would this task be a good topic for the #wikidev17 ? If so, the deadline to s...
[13:35:35] <wikibugs>	 Labs, Tool-Labs-tools-Other, Community-Tech-Tool-Labs, Developer-Relations: Create an authoritative and well promoted catalog of Wikimedia tools - https://phabricator.wikimedia.org/T115650#2748275 (Qgil) Would this task be a good topic for the #wikidev17 ? If so, the deadline to submit new propos...
[13:35:54] <wikibugs>	 Labs, Wikimedia-Labs-General, Developer-Relations: Community-maintained projects on Labs are hard to track - https://phabricator.wikimedia.org/T64837#2748276 (Qgil) Would this task be a good topic for the #wikidev17 ? If so, the deadline to submit new proposals is next Monday, October 31: https://www...
[13:37:24] <wikibugs>	 Labs, Tool-Labs, Community-Tech-Tool-Labs, Developer-Relations, and 4 others: Set up process / criteria for taking over abandoned tools - https://phabricator.wikimedia.org/T87730#2748277 (Qgil) Would this task be a good topic for the #wikidev17 ? If so, the deadline to submit new proposals is nex...
[13:47:57] <shinken-wm>	 RECOVERY - Host tools-docker-builder-01 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms
[13:50:50] <shinken-wm>	 RECOVERY - SSH on tools-webgrid-generic-1403 is OK: SSH OK - OpenSSH_6.9p1 Ubuntu-2~trusty1 (protocol 2.0)
[13:50:56] <chasemp>	 !log tools reboot dockerbuilder-01
[13:51:00] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[13:51:04] <chasemp>	 !log tools reboot tools-webgrid-generic-1403
[13:51:08] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[13:51:55] <shinken-wm>	 PROBLEM - Puppet staleness on tools-docker-builder-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [43200.0]
[14:01:03] <shinken-wm>	 RECOVERY - Puppet staleness on tools-webgrid-generic-1403 is OK: OK: Less than 1.00% above the threshold [3600.0]
[14:01:53] <shinken-wm>	 RECOVERY - Puppet staleness on tools-docker-builder-01 is OK: OK: Less than 1.00% above the threshold [3600.0]
[14:35:35] <Zppix>	 why doesn't the grid work with my bot? it works fine from the bastion
[14:35:38] <shinken-wm>	 PROBLEM - Puppet run on tools-docker-registry-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[14:43:25] <chasemp>	 Zppix: the most common answer is you need to specify trusty on the grid as precise is still the default but the bastions are trusty as well
[14:43:45] <chasemp>	 otherwise you'll need to be more descriptive 
[14:43:55] <Zppix>	 trusty?
[14:46:13] <shinken-wm>	 PROBLEM - Host tools-exec-cyberbot is DOWN: CRITICAL - Host Unreachable (10.68.16.39)
[14:47:55] <chasemp>	 Zppix|Away: https://wiki.ubuntu.com/Releases
[14:48:29] <bd808>	 chasemp: as of midday yesterday, trusty is the default :)
[14:48:50] <chasemp>	 ah ok right, I need to reverse my narrative
[14:49:03] <bd808>	 !log tools.stashbot Shutdown OGE job, started k8s job
[14:49:07] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL, Master
[14:50:51] <chasemp>	 bd808: https://graphite-labs.wikimedia.org/render/?width=586&height=308&_salt=1477579792.187&target=sumSeries(tools.tools-services-01.sge.hosts.tools*12*.job_count)&from=-3d
[14:50:54] <chasemp>	 interesting
[14:51:14] <chasemp>	 https://graphite-labs.wikimedia.org/render/?width=586&height=308&_salt=1477579792.187&target=sumSeries(tools.tools-services-01.sge.hosts.tools*12*.job_count)&target=sumSeries(tools.tools-services-01.sge.hosts.tools*14*.job_count)&from=-3d
[14:52:40] <chasemp>	 https://graphite-labs.wikimedia.org/render/?width=586&height=308&_salt=1477579792.187&target=cactiStyle(sumSeries(tools.tools-services-01.sge.hosts.tools*12*.job_count))&target=cactiStyle(sumSeries(tools.tools-services-01.sge.hosts.tools*14*.job_count))&from=-3d
[14:55:49] <shinken-wm>	 RECOVERY - Puppet run on tools-docker-registry-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:57:04] <bd808>	 chasemp: :) changing the default is having some impact then. In a couple of weeks I'll start trying to figure out who is still using precise and sending them nice emails about switching.
[14:57:04] <bd808>	 I'd really love to get everyone off well before our drop dead date
[15:09:01] <chasemp>	 bd808: cool, that seems like a good approach
[15:09:24] <chasemp>	 I'm kind of curious on the groupings for remaining precise things tbh, the why of it and if that sticks
[15:09:36] <chasemp>	 and in theory if worst comes to worst we can surely containerize those outliers
[15:11:27] <bd808>	 Some amount of it will just be long running jobs that have not restarted. Others will be things with cautious maintainers who haven't taken the time to test yet. I think there will actually be very few that are intrinsically tied to precise.
[15:12:10] <bd808>	 but at some point we have to pull the plug on precise. Faidon already grumbled at me that my deprecation timeline ran too long. ;)
[15:16:29] <wikibugs>	 Labs, Labs-Infrastructure, DBA, Operations: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2748504 (chasemp) @Marostegui I'm of the opinion at the moment that reworking the definer could be a more nuanced bit of work itself.  Honestly, no idea...
[15:16:51] <chasemp>	 yeah, I think the long running jobs thing cuts at the majority
[15:17:36] <bd808>	 If I'd been on the ball and had the switch done before the last kernel reboot...
[15:17:51] * bd808 shakes fist at the time lords
[15:24:02] <wikibugs>	 Labs, Labs-Infrastructure, DBA, Operations: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2748518 (jcrespo) > Then the path forward I think is to keep both the VIEWMASTER and MAINTAINVIEWS users with SUPER privs and to use the MAINTAINVEWS us...
[15:25:41] <wikibugs>	 Labs, Labs-Infrastructure, DBA, Operations: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2748524 (jcrespo) Also, there is a labsdbadmin user already existing, which was tasked to do maintenance (create users), maybe that should be the one us...
[15:27:33] <wikibugs>	 Labs, Tool-Labs, Patch-For-Review, Wikimedia-Incident: Setup a simple service that pages when it is unreachable - https://phabricator.wikimedia.org/T143638#2748529 (madhuvishy) Open>Resolved
[15:31:02] <wikibugs>	 Labs, Tool-Labs, Community-Tech-Tool-Labs, Developer-Relations, and 4 others: Set up process / criteria for taking over abandoned tools - https://phabricator.wikimedia.org/T87730#2748543 (bd808) >>! In T87730#2748277, @Qgil wrote: > Would this task be a good topic for the #wikidev17 ? If so, the...
[15:36:10] <wikibugs>	 Labs, Labs-Infrastructure, DBA, Operations: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2748568 (chasemp) >>! In T148560#2748518, @jcrespo wrote: >> Then the path forward I think is to keep both the VIEWMASTER and MAINTAINVIEWS users with S...
[15:57:26] <wikibugs>	 Labs-project-Wikistats: W3C wiki updates broken - https://phabricator.wikimedia.org/T149000#2748779 (Dzahn) p:Triage>Normal
[16:28:21] <wikibugs>	 Labs-project-Wikistats: W3C wiki updates broken - https://phabricator.wikimedia.org/T149000#2739316 (hashar) The page shows they all have HTTP error 301, a redirect. Most probably because we list http and they have switched to https with the script not following redirects? :]
[17:03:40] <wikibugs>	 Tool-Labs-tools-Pageviews: Investigation: Recursive category search in Massviews - https://phabricator.wikimedia.org/T149334#2749058 (MusikAnimal)
[17:31:13] <wikibugs>	 Labs-Kubernetes, Wikisource, Community-Tech-Sprint: Make Google OCR API on Tool Labs work under Kubernetes - https://phabricator.wikimedia.org/T146311#2749203 (kaldari) Open>Resolved Nice work!  @bd808: Is there some documentation somewhere about things that won't work under Kubernetes? If so...
[17:34:05] <Amir1>	 Are the reboots in labs instances related to this? https://www.theguardian.com/technology/2016/oct/21/dirty-cow-linux-vulnerability-found-after-nine-years
[17:34:14] <wikibugs>	 Labs-Kubernetes, Wikisource, Community-Tech-Sprint: Make Google OCR API on Tool Labs work under Kubernetes - https://phabricator.wikimedia.org/T146311#2749224 (bd808) Having looked at the icecave/isolator library briefly, I'm not sure that it really works anywhere in a robust and stable manner. It do...
[17:36:55] <Zppix>	 what is trusty?
[17:39:41] <jynus>	 Zppix, are you referring to https://en.wikipedia.org/wiki/Ubuntu_version_history#Ubuntu_14.04_LTS_.28Trusty_Tahr.29 ?
[17:39:56] <Zppix>	 it has to do with the grid
[17:40:55] <jynus>	 yes, probably ubuntu precise 12.04 starting to be deprecated in favour of ubuntu trusty 14.04 ?
[17:41:52] <jynus>	 you can see here soon it will stop receiving security updates: https://en.wikipedia.org/wiki/Ubuntu_version_history#Version_timeline
[17:42:35] <Zppix>	 well, my bot doesn't seem to be able to run: it connects to irc, but once issued a command in irc it disconnects. but when running the bot via bastion it's working fine
[17:45:00] <Zppix>	 nevermind
[17:52:54] <bd808>	 Zppix: what does your bot try to do when it receives a command? Is your source published somewhere that I can look at?
[17:53:17] <Zppix>	 bd808:  you can check the error file in my tool acct (it should be a public dir)
[17:53:32] <bd808>	 We have a lot of irc connected bots so it seems likely that the problem is somewhere in the implementation
[17:53:34] <Zppix>	 its under project/zppixbot
[17:57:22] <bd808>	 the "Fatal Python error: Couldn't create autoTLSkey mapping" was from running on precise (older Ubuntu version). The bastions are trusty (newer than precise)
[17:58:45] <bd808>	 The last run in the err log shows that it started on trusty because I changed the default from precise to trusty yesterday for everyone.
[17:58:56] <bd808>	 It doesn't seem to have the crash message
[18:00:19] <Zppix>	 i just tried trusty
[18:09:46] <bd808>	 !log tools.morebots Stopping bots in #wikimedia-labs and #wikimedia-releng channels.
[18:09:49] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.morebots/SAL, Master
[18:12:41] <bd808>	 !log tools.stashbot Restarting to take over wiki logging in #wikimedia-labs and #wikimedia-releng channels.
[18:13:27] <bd808>	 !log tools.stashbot First test of wiki logging
[18:13:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL
[18:14:28] <bd808>	 !log deployment-prep Testing dual page wiki logging by stashbot.
[18:14:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[18:14:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[18:17:00] <bd808>	 hmmm... that wasn't actually supposed to show both links
[18:36:09] <bd808>	 !log deployment-prep Testing dual page wiki logging by stashbot. (second attempt)
[18:36:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[18:46:13] <bd808>	 !log deployment-prep Testing dual page wiki logging by stashbot. (check #3)
[18:46:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[18:49:10] <andrewbogott>	 !log tools rebooting  tools-webgrid-lighttpd-1401
[18:49:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:00:49] <shinken-wm>	 PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1401 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [43200.0]
[19:05:50] <shinken-wm>	 RECOVERY - Puppet staleness on tools-webgrid-lighttpd-1401 is OK: OK: Less than 1.00% above the threshold [3600.0]
[19:10:41] <wikibugs>	 Labs, Tool-Labs, Epic: Phase out precise instances from toollabs - https://phabricator.wikimedia.org/T94790#2749580 (bd808)
[19:10:43] <wikibugs>	 Labs, Tool-Labs, User-bd808: Make webservice warn when run with `-l release=precise` - https://phabricator.wikimedia.org/T143283#2749576 (bd808) Open>Resolved a:bd808 This was done quite a while ago in {rOSTW5acbb62dc9d8ff6fc4f7d001fa9740323497a111}
[19:12:38] <shinken-wm>	 PROBLEM - Puppet run on tools-checker-02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[19:15:18] <wikibugs>	 Labs, Tool-Labs, Community-Tech-Tool-Labs, Epic, User-bd808: Remove support for precise OGE exec hosts - https://phabricator.wikimedia.org/T94792#2749591 (bd808)
[19:17:10] <bd808>	 !log stashbot should tell me that I forgot to give a project/tool name
[19:17:11] <stashbot>	 Did you mean tools.stashbot instead of stashbot?
[19:17:18] <bd808>	 !log and stashbot should tell me that I forgot to give a project/tool name
[19:18:03] <bd808>	 !log invalid project
[19:37:19] <shinken-wm>	 PROBLEM - Puppet run on tools-docker-builder-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[19:47:39] <shinken-wm>	 RECOVERY - Puppet run on tools-checker-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:59:57] <wikibugs>	 Tool-Labs-tools-Pageviews: Query stats.grok.se for data older than July 2015 - https://phabricator.wikimedia.org/T149358#2749796 (MusikAnimal)
[20:00:20] <wm-bot>	 Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Gilles was created, changed by Gilles link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Gilles edit summary: Created page with "{{Tools Access Request |Justification=Trying to make ori's perflogbot work |Completed=false |User Name=Gilles }}"
[20:24:59] <wm-bot>	 Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Gilles was modified, changed by BryanDavis link https://wikitech.wikimedia.org/w/index.php?diff=933902 edit summary: 
[20:26:05] <bd808>	 !log tools.perflogbot Added Gilles as maintainer
[20:26:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.perflogbot/SAL
[20:29:03] <shinken-wm>	 PROBLEM - Puppet run on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[20:38:36] <shinken-wm>	 PROBLEM - Host tools-secgroup-test-102 is DOWN: CRITICAL - Host Unreachable (10.68.21.170)
[20:39:09] <wikibugs>	 Labs, Tool-Labs, Community-Tech-Tool-Labs, Research-and-Data, User-bd808: 2016 Tool Labs user survey - https://phabricator.wikimedia.org/T147336#2749998 (bd808)
[21:04:04] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:07:54] <shinken-wm>	 RECOVERY - Puppet staleness on tools-prometheus-02 is OK: OK: Less than 1.00% above the threshold [3600.0]
[21:09:43] <godog>	 !log tools upgrade prometheus on tools-prometheus0[12]
[21:09:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:17:59] <yuvipanda>	 godog: nice!
[21:20:17] <godog>	 yuvipanda: aye! not sure why tools-prometheus-02 didn't like the upgrade yesterday, -01 was fine
[21:21:50] <yuvipanda>	 heh
[21:21:54] <yuvipanda>	 I've never actively used -02
[21:22:02] <yuvipanda>	 godog: also, we found a fun issue yesterday
[21:22:14] <yuvipanda>	 which is that a container was dying when it hit the 1Gi limit I'd set for it
[21:22:26] <yuvipanda>	 godog: but prometheus thinks it was always at 800M
[21:22:28] <yuvipanda>	 because
[21:22:39] <yuvipanda>	 code tries to malloc a large buffer, and dies
[21:22:43] <yuvipanda>	 but this happens in between scrapes
[21:22:45] <yuvipanda>	 so is missed
[21:22:57] <yuvipanda>	 I could find out this was happening from /var/log/messages on the host
[21:23:01] <yuvipanda>	 since that keeps track of oom kills
[21:23:09] <shinken-wm>	 RECOVERY - Puppet staleness on tools-prometheus-01 is OK: OK: Less than 1.00% above the threshold [3600.0]
[21:23:37] <godog>	 yuvipanda: fun indeed! 
[21:23:47] <godog>	 no OOM/deaths exported by k8s ?
[21:26:03] <yuvipanda>	 godog: good question. I don't know if k8s reports it anywhere
[21:26:09] <yuvipanda>	 godog: but the container itself didn't die
[21:26:19] <yuvipanda>	 godog: the process (python) was killed by OOM killer
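
The failure mode discussed above (a process killed by the OOM killer between two Prometheus scrapes, visible only in the host's log) can be checked for with a short script. What follows is a minimal Python sketch, not something anyone in the channel ran. It assumes the kernel's usual "Killed process <pid> (<name>)" wording, which can differ between kernel versions, and the function name find_oom_kills is made up for illustration.

```python
import re

# Assumption: the kernel reports OOM kills with the usual
# "Killed process <pid> (<name>)" wording; the exact text varies by kernel version.
OOM_KILL_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(lines):
    """Return a list of (pid, process_name) for every OOM-kill line in `lines`."""
    kills = []
    for line in lines:
        match = OOM_KILL_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

To use it on a host, one could pass an open file handle for /var/log/messages (the path mentioned above, which usually needs elevated privileges to read) straight to find_oom_kills, since it accepts any iterable of lines.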
[21:28:34] <godog>	 yuvipanda: ah! misread what was going on, heh sort of same problem with thumbor (tracking OOMs)
[21:28:59] <yuvipanda>	 godog: is that why I heard a bit about mtail?
[21:29:44] <godog>	 it is yeah yuvipanda, finishing the code review for mtail is actually ~next on my TODO
[21:29:51] <yuvipanda>	 godog: nice
[21:30:00] <yuvipanda>	 godog: any particular reason for using it over logster?
[21:31:45] <godog>	 I just read about logster, I suppose already packaged for debian and prometheus support amongst the reasons
[21:35:42] <godog>	 "written in golang" if you want to stretch it!
[21:45:53] <ori>	 yuvipanda: I have a small IRC notif bot written in node that I am trying to launch on tool labs. It runs from the staging env, but when I submit it to run via 'jstart' node.js prints the following error message and core dumps: 
[21:45:55] <ori>	 FATAL ERROR: v8::Context::New() V8 is no longer usable
[21:45:55] <ori>	 Aborted (core dumped)
[21:45:56] <ori>	 [2016-27-10T19:05] /usr/bin/nodejs exited with code 134. Respawning...
[21:46:09] <ori>	 have you seen that before?
[21:46:14] <yuvipanda>	 hey
[21:46:21] <yuvipanda>	 ori: try -mem 4G to jstart
[21:46:46] * ori tries
[21:49:31] <chasemp-tester>	 yuvipanda: is node a mem hog in general?
[21:52:26] <ori>	 yuvipanda: it's working
[21:52:39] <yuvipanda>	 chasemp-tester: yeah
[21:52:39] <ori>	 there's no way it is using up 4 gigs of ram, tho!
[21:52:53] <yuvipanda>	 chasemp-tester: ori yeah, gridengine's memory counting is... 'weird'
[21:53:00] <yuvipanda>	 I'll admit to not entirely understanding it
[21:53:06] <chasemp-tester>	 I vaguely recall it being detached from reality yeah
[21:53:20] <yuvipanda>	 ori: yeah, 4G is just a 'sane max' of sorts
[21:53:29] <yuvipanda>	 ori: for node / java
[21:53:36] <yuvipanda>	 python's fine with 512, so is php
[21:53:44] <yuvipanda>	 node might be ok with 1G actually
[21:53:52] <yuvipanda>	 but since this doesn't really match reality too much...
[21:53:57] <yuvipanda>	 it's fine-ish
[21:54:16] <chasemp-tester>	 small blah as SGE sucks so much at hard and soft allocations (i.e. no concept) but yeah 
[21:54:18] <yuvipanda>	 ori: if you want to, you can binary search your way down to something (try 2G, 1G, etc until it crashes again)
[21:54:20] <chasemp-tester>	 one more reason (tm)
[21:54:25] <yuvipanda>	 yeah
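
The "binary search your way down" suggestion above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: runs_ok is a hypothetical callback that would resubmit the job with a given limit (for example via the -mem option to jstart) and report whether it stayed up, the job is assumed to work at high_mb, and success is assumed to be monotonic in the limit. Since gridengine's memory accounting is described above as not matching reality, the result says what the grid accepts, not what the process really uses.

```python
def smallest_working_limit(runs_ok, low_mb=256, high_mb=4096, step_mb=128):
    """Binary-search the smallest memory limit (in MB) for which runs_ok is True.

    Assumes runs_ok(high_mb) is True and that a job which survives at N MB
    also survives at any larger limit. Stops once the window is <= step_mb.
    """
    while high_mb - low_mb > step_mb:
        mid = (low_mb + high_mb) // 2
        if runs_ok(mid):
            high_mb = mid  # job survived, so try a smaller limit
        else:
            low_mb = mid   # job crashed, so it needs more memory
    return high_mb
```

With a stand-in check such as `lambda mb: mb >= 900`, the search narrows from 4096 down to a value within step_mb of the true threshold in five tries.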
[21:54:37] <yuvipanda>	 k8s has 'guarantees' (requests) and 'limits'
[21:54:41] <yuvipanda>	 former for scheduling purposes, latter for killing
[21:54:45] <yuvipanda>	 works pretty nicely
[21:55:19] <chasemp-tester>	 for fun today I did search docker hub for SGE and poked at few of them, and then I mused on that for a few minutes over lunch
[21:55:26] <yuvipanda>	 :D
[21:55:36] <yuvipanda>	 I need to do another push maybe next week
[21:55:42] <yuvipanda>	 let's see if I can sharpen the trusty image today
[21:55:44] <yuvipanda>	 *this week
[21:56:10] <yuvipanda>	 I'm going to have to step afk for a bit now, i'll brb (switching locations, this place too rainy)
[21:56:36] <ori>	 hahaha
[21:56:37] <ori>	 from the man page
[21:56:40] <ori>	 "qstat -s h is an abbreviation for qstat -s huhohshdhjha"
[21:56:49] <chasemp-tester>	 later on yuvipanda, I'm going to clean out a fish tank post-haste 
[21:58:42] <wikibugs>	 Tool-Labs-tools-Pageviews: Show smaller pie chart in addition to line/bar/radar chart - https://phabricator.wikimedia.org/T149374#2750346 (MusikAnimal)
[22:12:48] <Leloiandudu>	 hi! I have a question regarding tool labs. I have a db on tools.labsdb. Is it backed up regularly? If so, how often? if not, what is the preferred way to configure backups?
[22:34:15] <Krenair>	 yuvipanda, ^
[22:48:27] <bd808>	 Leloiandudu: there are no backups of tool databases.
[22:48:45] <bd808>	 You could cron a mysqldump or similar for your tool's db
[22:49:35] <bd808>	 in an ideal world the tool databases are just transient working data. The world is often not ideal however :/
[22:50:52] <Leloiandudu>	 bd808, I heard that about the replica server, but is it also true for tools.labsdb? where am I supposed to store the persistent data then?
[22:52:08] <bd808>	 tools.labsdb and/or your tool's $HOME on NFS are the only persistent storage options
[22:53:03] <Leloiandudu>	 ok. tools.labsdb is where I store my data. I will configure the backups manually then...
[22:53:07] <bd808>	 there has been some discussion about looking for ways to provide backups for tool databases but there is no solution at the moment
[22:53:10] <Leloiandudu>	 thanks
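
The "cron a mysqldump" suggestion above could look roughly like the following Python sketch. It is a hedged example, not an official Tool Labs recipe: it assumes the tool's database credentials live in ~/replica.my.cnf, that tools.labsdb is reachable as the database host, and that dumps are kept under a hypothetical ~/backups directory in the tool's $HOME. The function names and the example database name are made up for illustration.

```python
import datetime
import os
import subprocess


def build_dump_command(database, host="tools.labsdb", defaults_file="~/replica.my.cnf"):
    """Build the mysqldump argument list for one tool database."""
    return [
        "mysqldump",
        # --defaults-file has to be the first option on the command line.
        "--defaults-file=" + os.path.expanduser(defaults_file),
        "--host=" + host,
        "--single-transaction",
        database,
    ]


def backup(database, backup_dir="~/backups"):
    """Dump `database` into a dated .sql file under backup_dir and return its path."""
    backup_dir = os.path.expanduser(backup_dir)
    os.makedirs(backup_dir, exist_ok=True)
    stamp = datetime.date.today().isoformat()
    target = os.path.join(backup_dir, f"{database}-{stamp}.sql")
    with open(target, "wb") as out:
        # check=True raises CalledProcessError if mysqldump exits non-zero.
        subprocess.run(build_dump_command(database), stdout=out, check=True)
    return target
```

A call such as backup("s12345__example") (a placeholder database name) could then be scheduled from the tool's crontab. The dumps land on the same NFS storage as the rest of $HOME, so copying them elsewhere from time to time may still be worthwhile.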
[23:40:31] <wikibugs>	 Labs-Kubernetes, Wikisource, Community-Tech-Sprint: Make Google OCR API on Tool Labs work under Kubernetes - https://phabricator.wikimedia.org/T146311#2750641 (Samwilson) @bd808 I quite agree! It's a ridiculous library. I mean, all I wanted was a simple thing to turn errors into exceptions (and so wa...