[00:07:50] Reedy or James_F around? I'd like to be marked as a helper in #wikipedia-userscripts please.
[00:09:01] T13|mobile: I don't think I have access, sorry.
[00:09:54] James_F: -ChanServ-: James_F +AVefiorstv (Admin) [modified 6y 9w 3d ago]
[00:10:03] Chanserv says you do.
[00:10:07] Wow. Really? Ha.
[00:11:26] However, I'm certainly not active there.
[00:11:33] I'd be happy with being in the helper group, but wouldn't oppose being an admin since the channel is almost always empty.
[00:11:33] * James_F defers to Reedy.
[00:12:00] Yeah, no-one's active there, that's why I'm here. Lol
[00:13:30] !log restarted elasticsearch on logstash1001 & logstash1003; OOM
[00:13:37] Logged the message, Master
[00:13:40] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 62 threshold =0.1% breach: {status: yellow, number_of_nodes: 3, unassigned_shards: 58, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 64, initializing_shards: 4, number_of_data_nodes: 3}
[00:15:20] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 32 threshold =0.1% breach: {status: yellow, number_of_nodes: 3, unassigned_shards: 28, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 94, initializing_shards: 4, number_of_data_nodes: 3}
[00:15:49] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 32 threshold =0.1% breach: {status: yellow, number_of_nodes: 3, unassigned_shards: 28, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 94, initializing_shards: 4, number_of_data_nodes: 3}
[00:18:22] T13|mobile: is that channel different from
#mediawiki-scripts?
[00:18:59] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 124, initializing_shards: 2, number_of_data_nodes: 3
[00:19:29] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 124, initializing_shards: 2, number_of_data_nodes: 3
[00:19:49] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 124, initializing_shards: 2, number_of_data_nodes: 3
[00:27:00] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: puppet fail
[00:45:57] Wikimedia-Logstash, operations, ops-core: Upgrade RAM for logstash100[123] to 64G - https://phabricator.wikimedia.org/T87078#983418 (bd808) NEW
[00:46:19] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[00:48:54] operations, ops-core: Allocate a few servers to logstash - https://phabricator.wikimedia.org/T87031#983426 (bd808) If the RAM from these is compatible with the logstash100[123] boxes we have then maybe {T87078} can be done by just moving some sticks from one to another?
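Editor's aside: the ElasticSearch alerts above report an "inactive shards" count and a 0.1% threshold. The sketch below is a hypothetical reconstruction of that ratio from the fields shown in the CRITICAL output (`unassigned_shards`, `initializing_shards`, `active_shards`, `relocating_shards`); the real Icinga plugin's exact formula may differ.

```python
# Hypothetical reconstruction of the "inactive shards" ratio from the
# Icinga ElasticSearch check output above. Field names are taken from
# the CRITICAL message in the log; the real plugin may compute this
# slightly differently.
def inactive_fraction(status):
    """Fraction of shards that are not active (unassigned or initializing)."""
    inactive = status["unassigned_shards"] + status["initializing_shards"]
    total = inactive + status["active_shards"] + status["relocating_shards"]
    return inactive / total

# Numbers from the 00:13:40 logstash1003 alert: 58 unassigned +
# 4 initializing = 62 inactive shards, matching the alert text.
status = {"unassigned_shards": 58, "initializing_shards": 4,
          "active_shards": 64, "relocating_shards": 0}
print(round(inactive_fraction(status), 3))  # about 0.492, far over 0.1%
```

With roughly half the shards inactive, any threshold expressed as a fraction of a percent trips immediately, which is consistent with the flood of CRITICALs after the restart.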
[00:57:12] Wikimedia-Logstash, operations, ops-core: Upgrade RAM for logstash100[123] to 64G - https://phabricator.wikimedia.org/T87078#983430 (ori)
[00:58:15] greg-g: I'm changing "set 0" to "group 0" in https://wikitech.wikimedia.org/wiki/Deployments/One_week to match other pages
[01:00:09] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[01:31:59] PROBLEM - dhclient process on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:31:59] PROBLEM - salt-minion processes on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:32:10] PROBLEM - configured eth on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:32:19] PROBLEM - SSH on rhenium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:32:19] PROBLEM - RAID on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:32:29] PROBLEM - puppet last run on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:32:40] PROBLEM - Disk space on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:32:49] PROBLEM - DPKG on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:46:50] PROBLEM - NTP on rhenium is CRITICAL: NTP CRITICAL: No response from NTP server
[02:01:28] (PS1) Ori.livneh: xenon: Annotate file scope and closure scope with filename [mediawiki-config] - https://gerrit.wikimedia.org/r/185593
[02:02:16] AaronSchulz: ^
[02:03:02] (PS3) Ori.livneh: admin: update my deployment script [puppet] - https://gerrit.wikimedia.org/r/185374
[02:03:08] (CR) Ori.livneh: [C: 2 V: 2] admin: update my deployment script [puppet] - https://gerrit.wikimedia.org/r/185374 (owner: Ori.livneh)
[02:05:49] PROBLEM - Ori committing changes on the weekend on palladium is CRITICAL: CRITICAL: Ori committed a change on a weekend
[02:05:51] ori or anyone: are requests to testwiki never cached?
It seems so, response is always X-Cache: cp1054 miss (0), cp4017 miss (0), cp4017 frontend miss (0)
[02:08:15] (PS2) Ori.livneh: xenon: Annotate file scope and closure scope with filename [mediawiki-config] - https://gerrit.wikimedia.org/r/185593
[02:08:20] spagewmf: yes, by design
[02:08:42] makes debugging PHP bugs simpler.
[02:09:11] (PS3) Ori.livneh: xenon: Annotate file scope and closure scope with filename [mediawiki-config] - https://gerrit.wikimedia.org/r/185593
[02:09:41] (CR) Aaron Schulz: [C: 1] xenon: Annotate file scope and closure scope with filename [mediawiki-config] - https://gerrit.wikimedia.org/r/185593 (owner: Ori.livneh)
[02:09:51] (CR) Ori.livneh: [C: 2] xenon: Annotate file scope and closure scope with filename [mediawiki-config] - https://gerrit.wikimedia.org/r/185593 (owner: Ori.livneh)
[02:09:56] (Merged) jenkins-bot: xenon: Annotate file scope and closure scope with filename [mediawiki-config] - https://gerrit.wikimedia.org/r/185593 (owner: Ori.livneh)
[02:10:20] ori: thanks, I'm updating some wikitech pages with the X-Wikimedia-Debug: 1 trick.
[02:10:27] !log ori Synchronized wmf-config/StartProfiler.php: I4e3871d3d: xenon: Annotate file scope and closure scope with filename (duration: 00m 05s)
[02:10:40] Logged the message, Master
[02:10:44] spagewmf: ooooh! Thanks a ton. The X-Wikimedia-Debug trick actually lets you bypass the cache for other wikis, too.
[02:11:31] yup, I've been `curl -H 'X-Wikimedia-Debug: 1' --dump-header - https://it.wikiquote.org/wiki/Pagina_principale`. Reply to your fine post coming
[02:12:35] spagewmf: I also realized I forgot to explain how to install a Chrome / Chromium extension from GitHub. You need to clone it locally and then tell Chrome to 'Load unpacked extension...'
[02:12:45] I could just add it to the Chrome Web Store, that would probably make things easier.
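Editor's aside: the cache-bypass trick discussed above (sending an `X-Wikimedia-Debug: 1` request header, as in the `curl` example from the log) can also be sketched in Python. This is a minimal illustration only; no request is actually sent here, and the URL is just the one quoted in the conversation.

```python
import urllib.request

# Build (but do not send) a request carrying the X-Wikimedia-Debug
# header, which per the discussion above makes the Varnish caches pass
# the request through to the application servers instead of serving a
# cached copy.
req = urllib.request.Request(
    "https://it.wikiquote.org/wiki/Pagina_principale",
    headers={"X-Wikimedia-Debug": "1"},
)

# urllib stores header names capitalized internally, so look the header
# up the same way it is stored:
print(req.get_header("X-wikimedia-debug"))  # prints "1"
```

Sending it for real (e.g. via `urllib.request.urlopen(req)`) and inspecting the `X-Cache` response header should then show misses at every cache layer, matching the `cp1054 miss (0), ...` output quoted above.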
[02:14:17] do I need some special right beyond cluster access to be able to log in to the job queue hosts?
[02:14:32] !log krinkle Synchronized php-1.25wmf15/resources/src/mediawiki/mediawiki.content.json.css: Ic1d10393912fcefa22d (duration: 00m 06s)
[02:14:38] Logged the message, Master
[02:14:40] PROBLEM - puppet last run on pc1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:14:44] tgr: not sure. have you tried?
[02:14:52] !log krinkle Synchronized php-1.25wmf15/includes/content/JsonContent.php: Ic1d10393912fcefa22d (duration: 00m 05s)
[02:14:55] Logged the message, Master
[02:15:16] ori: yes, the normal SSH key does not seem to work
[02:15:39] PROBLEM - Apache HTTP on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:16:10] PROBLEM - HHVM rendering on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:16:12] !log krinkle Synchronized php-1.25wmf14/resources/src/mediawiki/mediawiki.content.json.css: Ic1d10393912fcefa22d (duration: 00m 06s)
[02:16:13] yeah, it seems like the new puppetization does not include access rights for wikidev. not sure if that was by design. you'd want to check with _joe_. in the interim, is there anything i can do to help you?
[02:16:16] Logged the message, Master
[02:16:20] i'll check mw1148
[02:16:23] !log krinkle Synchronized php-1.25wmf14/includes/content/JsonContent.php: Ic1d10393912fcefa22d (duration: 00m 06s)
[02:16:28] Logged the message, Master
[02:16:39] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[02:17:20] RECOVERY - HHVM rendering on mw1148 is OK: HTTP OK: HTTP/1.1 200 OK - 64978 bytes in 0.389 second response time
[02:17:25] mw1148: threads stuck in __lll_lock_wait(); restarted HHVM.
[02:17:29] errrrr
[02:17:30] ori: I'm trying to debug https://phabricator.wikimedia.org/T87040
[02:17:31] !log mw1148: threads stuck in __lll_lock_wait(); restarted HHVM.
[02:17:34] Logged the message, Master
[02:17:50] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.068 second response time
[02:17:56] could you run the query in https://wikitech.wikimedia.org/wiki/Job_queue#Examining_the_data_for_a_job for gwt?
[02:18:20] !log l10nupdate Synchronized php-1.25wmf14/cache/l10n: (no message) (duration: 00m 01s)
[02:18:23] Logged the message, Master
[02:18:24] !log LocalisationUpdate completed (1.25wmf14) at 2015-01-17 02:18:24+00:00
[02:18:28] Logged the message, Master
[02:18:51] tgr: you can do that yourself without sshing to the host; just run 'redis-cli -h rdb1001'
[02:18:53] on tin
[02:20:19] ori: thanks! https://wikitech.wikimedia.org/wiki/Redis is a bit outdated then, I'll update it
[02:20:32] cool. thank you!
[02:28:49] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[02:30:53] !log l10nupdate Synchronized php-1.25wmf15/cache/l10n: (no message) (duration: 00m 01s)
[02:30:57] !log LocalisationUpdate completed (1.25wmf15) at 2015-01-17 02:30:57+00:00
[02:30:59] Logged the message, Master
[02:31:03] Logged the message, Master
[02:32:40] RECOVERY - puppet last run on pc1003 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[02:41:57] ori: is there a way to find out the redis password without going to the redis host and looking up the config file?
[02:42:13] I thought I could do something like "hiera 'passwords::redis::main_password'" on tin but I can't get it to work
[03:04:19] RECOVERY - Ori committing changes on the weekend on palladium is OK: OK: Ori is behaving himself
[04:20:13] (PS1) Tim Landscheidt: Add IP mapping for toolserver.org [puppet] - https://gerrit.wikimedia.org/r/185604 (https://phabricator.wikimedia.org/T87086)
[04:21:00] (CR) Tim Landscheidt: "Based on assumption that toolserver.org = relic."
[puppet] - https://gerrit.wikimedia.org/r/185604 (https://phabricator.wikimedia.org/T87086) (owner: Tim Landscheidt)
[04:40:19] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333
[04:44:48] (PS1) Yuvipanda: beta: Add shinken monitoring for parsoid [puppet] - https://gerrit.wikimedia.org/r/185606 (https://phabricator.wikimedia.org/T87063)
[04:44:55] (CR) jenkins-bot: [V: -1] beta: Add shinken monitoring for parsoid [puppet] - https://gerrit.wikimedia.org/r/185606 (https://phabricator.wikimedia.org/T87063) (owner: Yuvipanda)
[04:45:29] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0
[04:45:44] (PS2) Yuvipanda: beta: Add shinken monitoring for parsoid [puppet] - https://gerrit.wikimedia.org/r/185606 (https://phabricator.wikimedia.org/T87063)
[04:47:42] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Jan 17 04:47:42 UTC 2015 (duration 47m 41s)
[04:47:52] Logged the message, Master
[05:13:24] (PS1) Yuvipanda: parsoid: Open port 8000 with ferm rule [puppet] - https://gerrit.wikimedia.org/r/185607 (https://phabricator.wikimedia.org/T86951)
[05:13:32] (CR) jenkins-bot: [V: -1] parsoid: Open port 8000 with ferm rule [puppet] - https://gerrit.wikimedia.org/r/185607 (https://phabricator.wikimedia.org/T86951) (owner: Yuvipanda)
[05:13:59] (PS2) Yuvipanda: parsoid: Open port 8000 with ferm rule [puppet] - https://gerrit.wikimedia.org/r/185607 (https://phabricator.wikimedia.org/T86951)
[05:14:06] (PS2) Yuvipanda: beta: Remove dup of /home/mwdeploy/.ssh [puppet] - https://gerrit.wikimedia.org/r/185570 (owner: BryanDavis)
[05:14:15] (CR) Yuvipanda: [C: 2] beta: Add shinken monitoring for parsoid [puppet] - https://gerrit.wikimedia.org/r/185606 (https://phabricator.wikimedia.org/T87063) (owner: Yuvipanda)
[05:14:35] (CR) Yuvipanda: [C: 2] parsoid: Open port 8000 with
ferm rule [puppet] - https://gerrit.wikimedia.org/r/185607 (https://phabricator.wikimedia.org/T86951) (owner: Yuvipanda)
[05:16:14] (PS3) Yuvipanda: beta: Remove dup of /home/mwdeploy/.ssh [puppet] - https://gerrit.wikimedia.org/r/185570 (owner: BryanDavis)
[05:17:38] (CR) Yuvipanda: [C: 2] beta: Remove dup of /home/mwdeploy/.ssh [puppet] - https://gerrit.wikimedia.org/r/185570 (owner: BryanDavis)
[05:38:00] (PS1) Yuvipanda: parsoid: Include base::firewall on parsoid hosts [puppet] - https://gerrit.wikimedia.org/r/185610
[05:38:06] (CR) jenkins-bot: [V: -1] parsoid: Include base::firewall on parsoid hosts [puppet] - https://gerrit.wikimedia.org/r/185610 (owner: Yuvipanda)
[05:38:18] (PS2) Yuvipanda: parsoid: Include base::firewall on parsoid hosts [puppet] - https://gerrit.wikimedia.org/r/185610
[05:39:30] akosiaris: ^
[05:52:09] (PS1) Yuvipanda: beta: Monitor cxserver, mathoid, citoid and apertium services [puppet] - https://gerrit.wikimedia.org/r/185613 (https://phabricator.wikimedia.org/T87087)
[05:52:58] (CR) Yuvipanda: [C: 2] beta: Monitor cxserver, mathoid, citoid and apertium services [puppet] - https://gerrit.wikimedia.org/r/185613 (https://phabricator.wikimedia.org/T87087) (owner: Yuvipanda)
[05:59:22] (PS2) Glaisher: Redirect wikibook(s).(org|com) to www.wikibooks.org [puppet] - https://gerrit.wikimedia.org/r/185474 (https://phabricator.wikimedia.org/T87039)
[06:02:20] (CR) Glaisher: "Updated now, this is a better solution, imo." [puppet] - https://gerrit.wikimedia.org/r/185474 (https://phabricator.wikimedia.org/T87039) (owner: Glaisher)
[06:05:10] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[06:05:11] (PS2) Yuvipanda: Add IP mapping for toolserver.org [puppet] - https://gerrit.wikimedia.org/r/185604 (https://phabricator.wikimedia.org/T87086) (owner: Tim Landscheidt)
[06:05:41] (CR) Yuvipanda: [C: 2] "Ideally at some point in the future we'd automate this, but until then..." [puppet] - https://gerrit.wikimedia.org/r/185604 (https://phabricator.wikimedia.org/T87086) (owner: Tim Landscheidt)
[06:07:39] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[06:28:29] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:10] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:20] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:20] PROBLEM - puppet last run on db2040 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:39] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:29:59] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:30] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:24] hmm
[06:31:34] these all seem to fail *and* /var/log/puppet.log is empy
[06:31:37] *empty
[06:36:23] !log restarted dnsmasq on labnet1001 (see https://wikitech.wikimedia.org/wiki/Labs_DNS#DHCP_and_internal_DNS for how to)
[06:36:29] Logged the message, Master
[06:44:49] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:45:10] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:45:59] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:46:09] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:46:10] RECOVERY - puppet last run on db2040 is
OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:46:29] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:46:39] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:49:44] (PS1) Yuvipanda: deployment: Open up redis port on deployment masters [puppet] - https://gerrit.wikimedia.org/r/185614
[06:49:45] YuviPanda: maybe it happens whenever logrotate rotates a log file while a puppet run is in progress
[06:49:53] ori: yeah, that’s what I was thinking as well
[06:50:07] didn’t check though, got distracted
[06:50:17] it has been an issue for months
[06:50:38] perhaps we can make logrotate abort if puppet is running atm
[06:50:51] with a pid check
[06:51:02] how many servers do we have in total?
[06:51:19] good question.
[06:51:22] I don’t actually know
[06:52:08] I guess I can find out by looking at the salt certificates list
[06:52:28] 819
[06:52:39] and puppet runs every 20 minutes, right?
[06:52:52] and stays running, for, say, a minute on average
[06:53:27] (yeah, 819)
[06:53:50] (on that note, we record puppet running time on labs, perhaps not a bad idea to do so on prod too)
[06:54:00] so at any given moment 40 hosts are running puppet, right?
[06:54:28] that doesn't match up with the number of hosts that were affected (7)
[06:54:43] but maybe the conditions are more specific
[06:54:56] or maybe it’s less than a minute now, because apt-get is outside.
[06:55:27] i expect the log files are useful (if not puppet, then dmesg)
[06:55:41] I checked them. no failures reported
[06:55:46] on puppet that is
[06:55:46] but anyways, whereabouts are you at? still in chennai?
[06:55:49] didn’t see dmesg
[06:55:59] ori: I am!
I fly out tonight, will be in SF sunday afternoon
[06:56:32] oh man, i hate long flights
[06:57:12] the worst is when you can't sleep and you watch the movie on your neighbor's display without any sound
[06:57:17] ori: I’ve been asked to stop drinking as well now, so my usual coping mechanism isn’t going to work. *And* I can’t use a computer on flight either...
[06:57:30] ori: YUP. Terrible. and all my neighbors seem to have terrible taset
[06:57:30] and you just see all the formulas and cliches
[06:57:31] *taste
[06:57:36] yeah.
[06:57:45] I have a kindle now, though
[06:57:59] and am reading a BIND book. Might add TLDP’s LDAP one to it and read it during the flight too
[06:58:44] (re: formulas, https://www.youtube.com/watch?v=0zYYCCsSjkw)
[06:59:05] * YuviPanda is in a coffee place with terrible 3G, will watch when he gets home
[06:59:28] just a silly comedic video :)
[06:59:41] ori: have you read / come across http://www.cs.virginia.edu/~robins/YouAndYourResearch.html before
[06:59:59] nope! is it worth reading?
[07:00:07] ori: yeah, I think so.
[07:00:15] nice, i'll read it.
[07:00:43] ori: there’s a video of the talk as well https://www.youtube.com/watch?v=a1zDuOPkMSw
[07:01:05] it’s interesting, and has some rather good points I think. I would definitely be interested in seeing what you think about it, ori :)
[07:01:40] I think it should be called ‘You and Your Work’. It specifically talks about research, but I think the points are generalizable to any work you care about.
[07:34:49] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: puppet fail
[07:52:49] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[08:14:14] <_joe_> good morning/evening
[08:14:38] * YuviPanda waves at _joe_
[08:15:11] <_joe_> YuviPanda: which way are you flying to the us?
[08:15:30] _joe_: frnakfurt
[08:15:32] *frankfurt
[08:15:41] <_joe_> oh, like me
[08:16:00] _joe_: yeah.
although that makes the flight 20+ hours long
[08:16:05] instead of flying through hong kong
[08:16:33] <_joe_> yeah I thought the east route maybe shorter
[08:17:43] _joe_: yeah, 19.5h (via HK) vs close to 24h now
[08:46:25] (PS2) Yuvipanda: deployment: Open up redis port on deployment masters [puppet] - https://gerrit.wikimedia.org/r/185614
[09:19:49] (PS1) Yuvipanda: mediawiki: Move admin class inclusion from role to site.pp [puppet] - https://gerrit.wikimedia.org/r/185621
[09:21:27] (PS2) Yuvipanda: mediawiki: Move admin class inclusion from role to site.pp [puppet] - https://gerrit.wikimedia.org/r/185621
[09:23:04] (PS1) Yuvipanda: mediawiki: Don't include ::admin in labs [puppet] - https://gerrit.wikimedia.org/r/185622
[09:23:34] (Abandoned) Yuvipanda: mediawiki: Move admin class inclusion from role to site.pp [puppet] - https://gerrit.wikimedia.org/r/185621 (owner: Yuvipanda)
[09:25:54] (CR) Giuseppe Lavagetto: [C: 1] "I will remove the inclusion and the realm guard as soon as possible anyways."
[puppet] - https://gerrit.wikimedia.org/r/185622 (owner: Yuvipanda)
[09:29:04] (PS2) Yuvipanda: mediawiki: Don't include ::admin in labs [puppet] - https://gerrit.wikimedia.org/r/185622
[09:29:29] (PS3) Yuvipanda: mediawiki: Don't include ::admin in labs [puppet] - https://gerrit.wikimedia.org/r/185622
[09:31:11] (CR) Yuvipanda: [C: 2] mediawiki: Don't include ::admin in labs [puppet] - https://gerrit.wikimedia.org/r/185622 (owner: Yuvipanda)
[09:33:40] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: puppet fail
[09:39:31] (PS1) Yuvipanda: ocg: Don't include admin class in labs [puppet] - https://gerrit.wikimedia.org/r/185625
[09:40:09] (PS2) Yuvipanda: ocg: Don't include admin class in labs [puppet] - https://gerrit.wikimedia.org/r/185625
[09:40:56] (CR) Yuvipanda: [C: 2] ocg: Don't include admin class in labs [puppet] - https://gerrit.wikimedia.org/r/185625 (owner: Yuvipanda)
[09:52:03] (PS3) Yuvipanda: deployment: Open up redis port on deployment masters [puppet] - https://gerrit.wikimedia.org/r/185614
[09:54:40] (CR) Yuvipanda: [C: 2] deployment: Open up redis port on deployment masters [puppet] - https://gerrit.wikimedia.org/r/185614 (owner: Yuvipanda)
[10:07:20] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0]
[10:27:50] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[17:53:33] spagewmf: thanks!
[18:40:39] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: Puppet has 1 failures
[18:58:39] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[19:20:20] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [500.0]
[19:23:49] PROBLEM - puppet last run on lvs4001 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:23:49] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 4 failures
[19:24:00] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:24:59] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:38:10] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[19:38:10] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[19:38:29] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:41:59] RECOVERY - puppet last run on lvs4001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:42:10] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[20:17:59] Hi
[20:18:26] Is this the right place to make a request for a rather specific project?
[20:19:36] Namely, would it be possible to set up an OSM-like stack on Wikimedia or Labs toolserver that I could use to make a 'free' (or reasonably free) map of the world's rail networks?
[21:04:39] maps in labs: #wikimedia-labs, but you should remember that these maps should be for Wikimedia projects and that labs performance is... imperfect for an OSM stack (I checked). maps in the main cluster: there are such plans, expect more news in the next 10-15 years
[21:04:50] Qcoder00, ^
[21:05:06] OK
[23:34:23] anybody able to diagnose the problem with the RC feed on Commons?
[23:35:06] what is the problem exactly?
[23:35:14] it appears to be only showing GWToolset Log entries
[23:35:50] and broken log entries too, ew.
[23:36:35] Krenair: https://phabricator.wikimedia.org/T87040
[23:36:50] haven't had the time to look into it yet
[23:37:13] it's a single job, can be killed if the log spam is a problem in itself
[23:38:26] yeah, that's flooding RC
[23:38:36] Special:NewFiles is still working, but I don't think any of the patrollers are able to do much on Commons right now because of the RC flooding.
[23:38:39] from two users
[23:39:18] tgr, so those are duplicate logs for the same upload event?
[23:39:29] or are they all distinct events that should be marked as bot?
[23:40:45] wow, I just opened #commons.wikimedia.
[23:41:08] should probably kill it
[23:44:51] tgr, ori: did one of you kill it?
[23:44:53] looks like no
[23:45:33] give me a sec to figure out how that's done
[23:50:18] ori, do you know?
[23:51:04] ori: do you think anything will explode if I just delete the job table entry?
[23:51:21] there doesn't seem to be any documentation around about killing jobs
[23:51:30] I doubt that would stop a job that's already executing
[23:52:56] delete the gwtoolset job list in redis then?
[23:53:29] the job is not exactly running, it is continuously rescheduling itself somehow, I think
[23:53:50] tgr, so those are duplicate logs for the same upload event?
[23:53:50] or are they all distinct events that should be marked as bot?
[23:54:06] it's a job for a single small upload being run multiple times?
[23:54:24] it's not an upload event, as far as I understand, this is GWT trying to fetch a metadata file from some GLAM server
[23:54:53] and there should be limits and throttling and whatnot to it, but clearly that's borked
[23:55:52] https://github.com/wikimedia/mediawiki-extensions-GWToolset/blob/master/includes/Jobs/UploadMetadataJob.php
[23:56:46] the actual uploading is done by UploadMediafileJob from what I remember, a bunch of instances of that job would be created by UploadMetadataJob but it's not even getting to that point
[23:57:25] ew it recreates itself?
[23:58:24] yup
[23:58:44] it's an exponential backoff type thing, I think
[23:59:02] still it's not clear why it would recreate itself, then go and perform the log anyway
[23:59:22] which theoretically should only happen when there are too many mediafile jobs, and there are 0 now
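Editor's aside: the "exponential backoff type thing" described above (a job that reschedules itself with growing delays) can be sketched generically as follows. This is an illustration of the pattern only, not the actual GWToolset UploadMetadataJob code; the base delay, cap, and attempt count are invented parameters.

```python
# Generic exponential-backoff retry schedule: each reschedule waits
# twice as long as the previous one, up to a cap. Illustrative only;
# the real GWToolset/JobQueue logic differs (and, per the log, was
# misfiring: it kept rescheduling even with zero pending mediafile jobs).
def backoff_delays(base=1, cap=3600, attempts=8):
    """Return the delay in seconds before each retry: base * 2**n, capped."""
    return [min(base * 2 ** n, cap) for n in range(attempts)]

print(backoff_delays())  # [1, 2, 4, 8, 16, 32, 64, 128]
```

A correct implementation would also stop rescheduling once its retry condition clears, which is exactly the check that appears to have been broken here.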