[00:00:04] <icinga-wm>	 PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time
[00:05:04] <icinga-wm>	 PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time
[00:10:04] <icinga-wm>	 PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time
[00:15:04] <icinga-wm>	 PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time
[00:20:04] <icinga-wm>	 RECOVERY - check_listener_gc on thulium is OK: HTTP OK: HTTP/1.1 200 OK - 249 bytes in 0.052 second response time
[00:54:54] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:56:44] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy
[01:11:14] <ebernhardson>	 :q
[01:34:34] <icinga-wm>	 PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[02:02:34] <icinga-wm>	 RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[02:34:54] <icinga-wm>	 PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:38:36] <wikibugs>	 06Operations, 10Gerrit, 06Release-Engineering-Team, 05Security, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2905700 (10Paladox) @Aklapper i haven't tried yet reproducing data loss for accounts, but i managed with patches without eve...
[02:46:34] <icinga-wm>	 PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:50:44] <wikibugs>	 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2905703 (10Peachey88) Do we actually have account_data that can be lost (and cause issues)? since all our account data is synced from a ex...
[02:54:40] <wikibugs>	 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2905707 (10Paladox) @Peachey88 even though we use ldap, gerrit stores all accounts in the index. Which is what caused T152640 that problem...
[03:02:54] <icinga-wm>	 RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[03:06:04] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:06:04] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:06:54] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[03:06:54] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[03:13:04] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:14:04] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[03:14:34] <icinga-wm>	 RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[03:16:04] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:16:54] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[04:02:34] <icinga-wm>	 PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[04:16:04] <icinga-wm>	 PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2710.20 Read Requests/Sec=2770.50 Write Requests/Sec=21.90 KBytes Read/Sec=11180.40 KBytes_Written/Sec=796.00
[04:25:04] <icinga-wm>	 RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=20.40 Read Requests/Sec=0.00 Write Requests/Sec=12.50 KBytes Read/Sec=0.00 KBytes_Written/Sec=387.60
[04:29:34] <icinga-wm>	 RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[04:54:24] <icinga-wm>	 PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:23:24] <icinga-wm>	 RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[05:39:34] <icinga-wm>	 PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 12 failures. Last run 2 minutes ago with 12 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[05:40:24] <icinga-wm>	 PROBLEM - puppet last run on elastic1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:04:34] <icinga-wm>	 PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 20 failures. Last run 2 minutes ago with 20 failures. Failed resources (up to 3 shown): Service[puppet],Service[rsyslog],Exec[ip addr add 2620:0:861:103:10:64:32:125/64 dev eth0],Service[ferm]
[06:07:34] <icinga-wm>	 RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:09:24] <icinga-wm>	 RECOVERY - puppet last run on elastic1024 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:28:24] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:30:34] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:32:24] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:32:34] <icinga-wm>	 RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:48:24] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:01:39] <_joe_>	 uhm
[07:02:40] <_joe_>	 I'll check when I'm properly awake
[07:02:48] <_joe_>	 but this seems like a logrotate thing
[07:03:27] <_joe_>	 yes, it's apache "reloading", wtf
[07:03:36] <_joe_>	 ok, I'll fix it
[07:07:03] <_joe_>	 unless this is an artifact of reloading + apache status, which seems probable
[07:14:24] <icinga-wm>	 PROBLEM - puppet last run on elastic1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:19:44] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK
[07:41:20] <_joe_>	 matanya: I think turning off video2commons for a couple of days might help stopping the pressure on the videoscalers
[07:42:25] <icinga-wm>	 RECOVERY - puppet last run on elastic1034 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[07:55:36] <wikibugs>	 (03CR) 10Tim Landscheidt: "@chasemp: This updated also modules/cdh and modules/mariadb which does not seem to have been intended.  modules/mariadb has been updated w" [puppet] - 10https://gerrit.wikimedia.org/r/316577 (owner: 10Rush)
[08:06:34] <icinga-wm>	 PROBLEM - puppet last run on mw1272 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:13:24] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK
[08:15:24] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[08:16:24] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[08:19:24] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:22:24] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:28:34] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK
[08:34:34] <icinga-wm>	 RECOVERY - puppet last run on mw1272 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[08:34:45] <wikibugs>	 (03CR) 10Tim Landscheidt: "@Joe: Just to clarify: I don't *want* to use expand_path in Labs; I'm only interested in hiera('puppetdb::password::rw') providing some du" [puppet] - 10https://gerrit.wikimedia.org/r/329226 (owner: 10Tim Landscheidt)
[08:35:44] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK
[08:54:18] <matanya>	 _joe_: i will handle it later
[08:54:28] <matanya>	 zhuyifei1999_: FYI - see above
[08:54:40] <zhuyifei1999_>	 yeah?
[08:54:43] <_joe_>	 matanya: yeah thanks, it's not like I'm mandating that, but
[08:54:56] <_joe_>	 we have 8 days of backlog of transcodes atm
[08:55:13] <matanya>	 zhuyifei1999_:  _joe_ asked to shut down v2c until the scalers catch up
[08:55:18] <Urbanecm>	 Hi all, is anybody with deploy access around? I'd like 329516 to be deployed. Greg approved it, see T154254
[08:55:18] <stashbot>	 T154254: Set celebration logo for hewiki - https://phabricator.wikimedia.org/T154254
[08:55:27] <zhuyifei1999_>	 hmm
[08:55:41] <zhuyifei1999_>	 ok
[08:56:00] <zhuyifei1999_>	 I'll set up an message saying the service is down
[08:56:25] <_joe_>	 we are currenlty processing files from dec 19th
[08:56:43] <_joe_>	 so we're 9 and a half days beyond
[08:58:02] <p858snake|L2>	 Urbanecm: Reedy will probably be on in the next few hours
[08:58:18] <matanya>	 thats what i wanted to say to Urbanecm :)
[08:58:20] <p858snake|L2>	 ((if you ask nicely and no one steps up before that)
[08:58:38] <matanya>	 Dereckson might be around too
[08:59:08] <matanya>	 jzerebecki: might have rights as well
[08:59:17] <Urbanecm>	 Thanks both of you
[08:59:31] <matanya>	 and one last name for you might be legoktm 
[09:00:26] <legoktm>	 hi
[09:00:32] <legoktm>	 https://phabricator.wikimedia.org/T154254#2906104
[09:00:37] <legoktm>	 celebration logos are done via CSS
[09:00:55] <_joe_>	 I have the rights to deploy, but what lego said
[09:00:56] <_joe_>	 :)
[09:01:08] <Urbanecm>	 legoktm, for HD displays too?
[09:01:12] <legoktm>	 Urbanecm: yes
[09:02:04] <Urbanecm>	 Then i cant see a reason why they requested this again...
[09:04:32] <legoktm>	 Urbanecm: someone needs to make the high-res versions though
[09:04:45] <legoktm>	 https://en.wikipedia.org/w/index.php?title=MediaWiki%3ACommon.css&type=revision&diff=688516426&oldid=686648742 is the CSS to override those
[09:05:07] <legoktm>	 oh I guess https://commons.wikimedia.org/wiki/File:Hewiki_200,000_articles4.svg is good enough
[09:05:22] <legoktm>	 someone just needs to scale it down to the right dimensions
[09:05:47] <legoktm>	 docs are at https://www.mediawiki.org/wiki/Manual:$wgLogoHD
[09:05:51] <matanya>	 legoktm: it was already done here : https://he.wikipedia.org/w/index.php?title=%D7%9E%D7%93%D7%99%D7%94_%D7%95%D7%99%D7%A7%D7%99%3ACommon.css&type=revision&diff=19841881&oldid=19838328
[09:06:05] <Urbanecm>	 legoktm, 329516 isn't enough?
[09:06:07] <legoktm>	 matanya: that doesn't affect people who have hdpi screens
[09:06:27] <zhuyifei1999_>	 https://tools.wmflabs.org/video2commons/ <= it's down now
[09:06:28] <legoktm>	 Urbanecm: that will get cached for a month at least
[09:06:56] <legoktm>	 Urbanecm: but those png files are probably fine (I didn't check)
[09:07:04] <Urbanecm>	 There isn't anything like purge?
[09:07:10] <matanya>	 thank you zhuyifei1999_ 
[09:07:17] <zhuyifei1999_>	 np
[09:08:35] <legoktm>	 Urbanecm: there is, but it's much easier if this is done via local CSS
[09:09:17] <_joe_>	 zhuyifei1999_: thanks a lot
[09:09:24] <zhuyifei1999_>	 np
[09:10:03] <Urbanecm>	 legoktm, As I do not understand CSS well can you advice them what line(s) they should add?
[09:10:11] <Urbanecm>	 Or add then if you have rights. 
[09:10:50] <Urbanecm>	 BTW T131605 was resolved and it is about the same thing but another wiki and another milestone. 
[09:10:50] <stashbot>	 T131605: Set celebration logo on Czech Wikipedia - https://phabricator.wikimedia.org/T131605
[09:11:15] <wikibugs>	 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 3 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2906120 (10zhuyifei1999) For the record #video2commons is off now.
[09:12:39] <legoktm>	 Urbanecm: Well that wasn't during a deploy freeze when you're trying to find a sysadmin :P but I wouldn't have done it then either. Doing a local CSS change gives the local wiki much more control
[09:13:23] <matanya>	 I am a local admin, i can do the css change, if i know what it was :)
[09:13:40] <matanya>	 *knew
[09:14:38] <legoktm>	 matanya: sorry, I'm about to sleep. the CSS is on https://www.mediawiki.org/wiki/Manual:$wgLogoHD, you just need to upload the 1.5x and 2x logos (png) somewhere
[09:14:47] <legoktm>	 I think Urbanecm already created those files
[09:14:59] <matanya>	 thanks legoktm good night
[09:15:17] <Urbanecm>	 good night legoktm
[09:17:40] <Urbanecm>	 matanya, you can download my files from https://gerrit.wikimedia.org/r/#/c/329516/ (the 1.5x and 2x). 
[09:19:22] <wikibugs>	 (03Abandoned) 10Urbanecm: Set celebration logo for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329516 (https://phabricator.wikimedia.org/T154254) (owner: 10Urbanecm)
[09:20:28] <matanya>	 Urbanecm: want me to upload them to commons ?
[09:21:02] <Urbanecm>	 matanya, please upload them locally (they should be protected). 
[09:21:10] <matanya>	 ok
[09:22:58] <zhuyifei1999_>	 I just protected it..
[09:23:14] <zhuyifei1999_>	 if you mean https://commons.wikimedia.org/wiki/File:Hewiki_200,000_articles4.svg
[09:24:55] <Urbanecm>	 zhuyifei1999_, this file doesn't need to be protected as it isn't used in interface. The SVG is needed for making PNG (because mediawiki doesn't support SVG yet). 
[09:25:20] <Urbanecm>	 Local PNG (or even on commons; it doesn't matter) should be protected. 
[09:25:43] <zhuyifei1999_>	 you can use thumb generation for that
[09:25:55] <Urbanecm>	 zhuyifei1999_, they are generated :)
[09:26:21] <Urbanecm>	 BTW I can generate a thumb whatever size I want?
[09:26:51] <zhuyifei1999_>	 yes
[09:27:23] <zhuyifei1999_>	 https://upload.wikimedia.org/wikipedia/commons/thumb/1/14/Hewiki_200%2C000_articles4.svg/1234px-Hewiki_200%2C000_articles4.svg.png
[09:27:30] <zhuyifei1999_>	 just replace that 1234px
[09:27:50] <zhuyifei1999_>	 (I just unprotected it btw)
[09:29:48] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Specify kubernetes admission controllers via hiera [puppet] - 10https://gerrit.wikimedia.org/r/329592
[09:31:33] <Urbanecm>	 matanya, http://pastebin.com/4RQpB92W should be your HD css. 
[09:32:26] <matanya>	 yes Urbanecm thanks. i was in the middle of editing that :)
[09:34:11] <matanya>	 Urbanecm: if you have HD screen, i'd like if you can verify it works well
[09:34:56] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: kubernetes::worker: Allow access to kubelet from master [puppet] - 10https://gerrit.wikimedia.org/r/329494
[09:34:58] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Specify kubernetes admission controllers via hiera [puppet] - 10https://gerrit.wikimedia.org/r/329592
[09:35:34] <Urbanecm>	 matanya, I don't have that screen. 
[09:41:49] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] kubernetes::worker: Allow access to kubelet from master [puppet] - 10https://gerrit.wikimedia.org/r/329494 (owner: 10Alexandros Kosiaris)
[09:41:55] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: kubernetes::worker: Allow access to kubelet from master [puppet] - 10https://gerrit.wikimedia.org/r/329494
[09:41:59] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] kubernetes::worker: Allow access to kubelet from master [puppet] - 10https://gerrit.wikimedia.org/r/329494 (owner: 10Alexandros Kosiaris)
[09:42:12] <_joe_>	 akosiaris: did it work at all without that?
[09:42:15] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] Specify kubernetes admission controllers via hiera [puppet] - 10https://gerrit.wikimedia.org/r/329592 (owner: 10Alexandros Kosiaris)
[09:42:23] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: Specify kubernetes admission controllers via hiera [puppet] - 10https://gerrit.wikimedia.org/r/329592
[09:42:27] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Specify kubernetes admission controllers via hiera [puppet] - 10https://gerrit.wikimedia.org/r/329592 (owner: 10Alexandros Kosiaris)
[09:45:57] <akosiaris>	 _joe_: yes. Only attach and exec to a running container did not
[09:46:13] <_joe_>	 from kubectl?
[09:46:16] <akosiaris>	 yes
[09:46:21] <akosiaris>	 they go via the master
[09:46:22] <_joe_>	 heh kinda expected, ain't it?
[09:46:28] <akosiaris>	 yup
[09:46:30] <_joe_>	 it makes sense
[09:46:35] <akosiaris>	 IIRC, these 2 are the only ones that go via the master
[09:46:41] <akosiaris>	 everything else is pull model
[09:46:45] <icinga-wm>	 PROBLEM - puppet last run on copper is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[docker-engine]
[09:46:46] <akosiaris>	 but I may be wrong
[09:46:50] <_joe_>	 which, again, makes sense
[09:47:08] <akosiaris>	 hmm docker is me
[09:47:37] <akosiaris>	 I 've upgraded docker to 1.12.5
[09:47:44] <akosiaris>	 I 'll amend puppet
[09:49:46] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: builder: Specify the newer docker-engine version [puppet] - 10https://gerrit.wikimedia.org/r/329593
[09:50:02] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] builder: Specify the newer docker-engine version [puppet] - 10https://gerrit.wikimedia.org/r/329593 (owner: 10Alexandros Kosiaris)
[09:50:08] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: builder: Specify the newer docker-engine version [puppet] - 10https://gerrit.wikimedia.org/r/329593
[09:50:11] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] builder: Specify the newer docker-engine version [puppet] - 10https://gerrit.wikimedia.org/r/329593 (owner: 10Alexandros Kosiaris)
[09:51:44] <icinga-wm>	 RECOVERY - puppet last run on copper is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[09:57:08] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] "https://puppet-compiler.wmflabs.org/5003/ says noop, merging. Let's see what breaks!" [puppet] - 10https://gerrit.wikimedia.org/r/302695 (owner: 10Alexandros Kosiaris)
[09:57:14] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: Move external_networks to network module data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/302695
[10:03:50] <wikibugs>	 (03PS1) 10Tim Landscheidt: puppetmaster: Specify $group for all repositories [puppet] - 10https://gerrit.wikimedia.org/r/329595 (https://phabricator.wikimedia.org/T152060)
[10:17:44] <icinga-wm>	 PROBLEM - puppet last run on ms-fe1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:44:38] <wikibugs>	 06Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth: Rename user TextworkerBot to VladiBot on ru.wiki - https://phabricator.wikimedia.org/T153602#2885305 (10Vladis13) I'm original requestor. I thought that for renaming should change the only one row in the database. Ok, if it is problematic then...
[10:45:44] <icinga-wm>	 RECOVERY - puppet last run on ms-fe1002 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[10:54:44] <icinga-wm>	 PROBLEM - puppet last run on ms-be1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:23:33] <wikibugs>	 (03PS1) 10MarcoAurelio: Set $wgAbuseFilterNotificationsPrivate = true; for Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329600
[11:23:44] <icinga-wm>	 RECOVERY - puppet last run on ms-be1003 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[11:40:54] <icinga-wm>	 PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:49:10] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "Various inline comments, will work on them myself today" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/323816 (owner: 10Giuseppe Lavagetto)
[12:08:54] <icinga-wm>	 RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[12:10:44] <icinga-wm>	 PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=46%)
[12:19:44] <icinga-wm>	 RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[12:19:50] <akosiaris>	 !log running sudo apt-get autoremove on labtestnet2001
[12:19:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:01] <akosiaris>	 !log running sudo apt-get autoremove on labtestnet2001. Removing various older kernels
[12:20:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:45] <icinga-wm>	 PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[13:30:54] <icinga-wm>	 PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:32:44] <icinga-wm>	 RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[13:59:54] <icinga-wm>	 RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[14:09:08] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: calico: add module/profile to use as kubernetes networking [puppet] - 10https://gerrit.wikimedia.org/r/323816 (owner: 10Giuseppe Lavagetto)
[14:09:10] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Enable the calico profile on kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/329606
[14:09:12] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Enable CNI plugin on kubernetes::node profile [puppet] - 10https://gerrit.wikimedia.org/r/329607
[14:16:44] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Enable the calico profile on kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/329606
[14:16:46] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Enable CNI plugin on kubernetes::node profile [puppet] - 10https://gerrit.wikimedia.org/r/329607
[14:19:45] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: calico: add module/profile to use as kubernetes networking [puppet] - 10https://gerrit.wikimedia.org/r/323816 (owner: 10Giuseppe Lavagetto)
[14:19:47] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: Enable the calico profile on kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/329606
[14:19:49] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: Enable CNI plugin on kubernetes::node profile [puppet] - 10https://gerrit.wikimedia.org/r/329607
[14:26:12] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: calico: add module/profile to use as kubernetes networking [puppet] - 10https://gerrit.wikimedia.org/r/323816 (owner: 10Giuseppe Lavagetto)
[14:26:14] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: Enable the calico profile on kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/329606
[14:26:17] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: Enable CNI plugin on kubernetes::node profile [puppet] - 10https://gerrit.wikimedia.org/r/329607
[14:32:02] <wikibugs>	 (03PS5) 10Alexandros Kosiaris: calico: add module/profile to use as kubernetes networking [puppet] - 10https://gerrit.wikimedia.org/r/323816 (owner: 10Giuseppe Lavagetto)
[14:32:04] <wikibugs>	 (03PS5) 10Alexandros Kosiaris: Enable the calico profile on kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/329606
[14:32:06] <wikibugs>	 (03PS5) 10Alexandros Kosiaris: Enable CNI plugin on kubernetes::node profile [puppet] - 10https://gerrit.wikimedia.org/r/329607
[14:35:41] <wikibugs>	 (03PS6) 10Alexandros Kosiaris: calico: add module/profile to use as kubernetes networking [puppet] - 10https://gerrit.wikimedia.org/r/323816 (owner: 10Giuseppe Lavagetto)
[14:35:43] <wikibugs>	 (03PS6) 10Alexandros Kosiaris: Enable the calico profile on kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/329606
[14:35:45] <wikibugs>	 (03PS6) 10Alexandros Kosiaris: Enable CNI plugin on kubernetes::node profile [puppet] - 10https://gerrit.wikimedia.org/r/329607
[14:36:45] <wikibugs>	 06Operations, 10MediaWiki-Special-pages: Fatal exception of type MWException on all the Special:GlobalUserRights pages - https://phabricator.wikimedia.org/T154185#2903588 (10abian)
[14:40:27] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] "After a large round of patches, pcc is happy at https://puppet-compiler.wmflabs.org/5009/" [puppet] - 10https://gerrit.wikimedia.org/r/323816 (owner: 10Giuseppe Lavagetto)
[14:40:48] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] Enable the calico profile on kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/329606 (owner: 10Alexandros Kosiaris)
[14:44:38] <icinga-wm>	 PROBLEM - puppet last run on kubernetes1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:44:38] <icinga-wm>	 PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:45:25] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Fix dependency to calico/node for systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/329608
[14:46:34] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] Fix dependency to calico/node for systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/329608 (owner: 10Alexandros Kosiaris)
[14:46:42] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Fix dependency to calico/node for systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/329608
[14:46:45] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Fix dependency to calico/node for systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/329608 (owner: 10Alexandros Kosiaris)
[14:52:30] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Bump calico CNI plugin version [puppet] - 10https://gerrit.wikimedia.org/r/329611
[14:52:48] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Bump calico CNI plugin version [puppet] - 10https://gerrit.wikimedia.org/r/329611
[14:52:54] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Bump calico CNI plugin version [puppet] - 10https://gerrit.wikimedia.org/r/329611 (owner: 10Alexandros Kosiaris)
[14:58:13] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Specify the registry as well for calico/node [puppet] - 10https://gerrit.wikimedia.org/r/329612
[14:58:34] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Specify the registry as well for calico/node [puppet] - 10https://gerrit.wikimedia.org/r/329612
[14:58:40] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Specify the registry as well for calico/node [puppet] - 10https://gerrit.wikimedia.org/r/329612 (owner: 10Alexandros Kosiaris)
[15:03:08] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Fix ETCD_CA_CERT_FILE for calico/node [puppet] - 10https://gerrit.wikimedia.org/r/329613
[15:04:18] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Fix ETCD_CA_CERT_FILE for calico/node [puppet] - 10https://gerrit.wikimedia.org/r/329613 (owner: 10Alexandros Kosiaris)
[15:04:22] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Fix ETCD_CA_CERT_FILE for calico/node [puppet] - 10https://gerrit.wikimedia.org/r/329613
[15:04:25] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Fix ETCD_CA_CERT_FILE for calico/node [puppet] - 10https://gerrit.wikimedia.org/r/329613 (owner: 10Alexandros Kosiaris)
[15:06:44] <Revent>	 With v2c not flooding huge files, the transcoder backlog is heading steadily downward.
[15:07:08] <Revent>	 Probably about 1500 below it’s peak.
[15:09:38] <icinga-wm>	 PROBLEM - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni]
[15:10:37] <_joe_>	 Revent: yes, still a ton of v2c white house files in the queue
[15:10:46] <Revent>	 Yeah, I know.
[15:11:00] <Revent>	 But it’s at least making headway.
[15:11:05] <_joe_>	 as fresh as from the 27th
[15:11:08] <_joe_>	 yup
[15:12:38] <icinga-wm>	 PROBLEM - puppet last run on kubernetes1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni]
[15:18:16] <akosiaris>	 /etc/rc.local: 4: .: startup.env: not found
[15:18:17] <akosiaris>	 argh
[15:18:53] <akosiaris>	 not to mention https://github.com/shazow/urllib3/issues/497 ...
[15:19:02] <akosiaris>	 thankfully the urllib3 issue is not a blocker
[15:19:06] <_joe_>	 wat?
[15:19:11] <_joe_>	 (rc.local)
[15:19:13] <akosiaris>	 but how is that startup.env populated ...
[15:19:17] <akosiaris>	 damn containers
[15:19:32] <akosiaris>	 seems botched somehow
[15:19:49] <_joe_>	 is that calico/node?
[15:19:53] <akosiaris>	 yes
[15:19:57] <_joe_>	 uhm
[15:20:21] <_joe_>	 I can swear that container worked when I tried it, meh
[15:20:34] <_joe_>	 but had no CA_CERT_FILE
[15:24:54] <Revent>	 _joe_: I counf 726 transcodes with “White_House_Press_Briefing” in the filename (queued)
[15:25:00] <Revent>	 (sigh)
[15:30:09] <akosiaris>	 _joe_: yeah it seems to be related to HTTPS
[15:30:19] <akosiaris>	 not that HTTP works.. but I get a different error
[15:31:25] <_joe_>	 Revent: I count transcludes from 470 such files
[15:31:30] <_joe_>	 in the queue
[15:31:46] <Revent>	 Yeah, I counted transcodes, not files.
[15:32:03] <Revent>	 (in the search the special page uses)
[15:37:58] <_joe_>	 transcodes of such files are ~ 23400
[15:38:04] <_joe_>	 sorry, 3400
[15:39:34] <Revent>	 How many total transcodes do you see queued?
[15:40:00] <Revent>	 The special page says 11528
[15:40:17] <_joe_>	 11700 or so, let me see
[15:40:55] <Revent>	 I might have undercounted, it was not an actual search, I just paged through and added
[15:40:58] <_joe_>	 uhm, no, it's 21759 queued
[15:41:04] <Revent>	 Hopy crap.
[15:41:07] <Revent>	 *Holy
[15:41:16] <_joe_>	 it was like 23k this morning
[15:42:11] <Revent>	 I really wonder if there are suplicate entries.
[15:42:16] <Revent>	 *duplicate
[15:42:55] <Urbanecm>	 Hi, can anybody advice me how I can get 329597 (patch for T154278) deployed? It is a throttle rule for an event that is held tomorrow around 03:00 UTC. They gave us a single day notice (ticket was created today in the UTC morning). I wrote greg-g an e-mail and asked in -releng and was directed by paladox that I should ask here too. 
[15:42:56] <stashbot>	 T154278: Temporary IP Cap Lift on mai.wiki - https://phabricator.wikimedia.org/T154278
[15:49:28] <akosiaris>	 Urbanecm: I can get that deploying
[15:49:31] <akosiaris>	 deployed*
[15:50:09] <Urbanecm>	 akosiaris, that's great! I don't know the rules for deploying during code freeze exactly so I'm asking everywhere it seems suitable for me. 
[15:51:29] <Urbanecm>	 akosiaris, Am I required to do something?
[15:51:30] <akosiaris>	 looks sensible enough to not be dangerous
[15:51:36] <akosiaris>	 no you are not
[15:51:45] <Urbanecm>	 Okay. 
[15:53:55] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] "Despite the code freeze, this is configuration and looks innocuous enough to warrant deploying." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329597 (https://phabricator.wikimedia.org/T154278) (owner: 10Urbanecm)
[15:54:31] <wikibugs>	 (03Merged) 10jenkins-bot: [throttle] Rule for maiwiki - December 30th [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329597 (https://phabricator.wikimedia.org/T154278) (owner: 10Urbanecm)
[15:54:43] <wikibugs>	 (03CR) 10jenkins-bot: [throttle] Rule for maiwiki - December 30th [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329597 (https://phabricator.wikimedia.org/T154278) (owner: 10Urbanecm)
[15:55:51] <logmsgbot>	 !log akosiaris@tin Synchronized wmf-config/throttle.php: (no message) (duration: 00m 42s)
[15:55:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:07] <akosiaris>	 !log merging https://gerrit.wikimedia.org/r/329597 for T154278 (IP throttle raise)
[15:56:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:10] <stashbot>	 T154278: Temporary IP Cap Lift on mai.wiki - https://phabricator.wikimedia.org/T154278
[15:56:18] <akosiaris>	 Urbanecm: done
[15:56:29] <Urbanecm>	 Thanks a lot akosiaris!
[15:56:36] <akosiaris>	 you 're welcome
[15:57:09] * akosiaris back to damn calico. I 'll keep an eye for errors though
[15:57:39] <icinga-wm>	 PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:57:39] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:57:48] <akosiaris>	 hmm, not good
[15:58:04] <Urbanecm>	 akosiaris, do you think this is caused by the throttle patch?
[15:58:08] <icinga-wm>	 PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:58:34] <akosiaris>	 it seems isolated
[15:58:37] <akosiaris>	 at least for now
[15:59:00] <Urbanecm>	 What does it mean? It won't work?
[15:59:20] <akosiaris>	 all it means yet is a single box's HHVM process does not like something
[15:59:35] <akosiaris>	 could very well be unrelated
[15:59:51] <Urbanecm>	 If unrelated to the patch it's good :). 
[16:03:48] <icinga-wm>	 PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[16:03:50] <_joe_>	 akosiaris: looking into mw1279
[16:05:41] <_joe_>	 !log restarted HHVM on mw1279, stuck in HPHP::Treadmill::getAgeOldestRequest
[16:05:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:46] <_joe_>	 known issue
[16:06:23] <akosiaris>	 it did cause a small 5xx spike as it seems
[16:06:38] <_joe_>	 that is misc 
[16:07:38] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.026 second response time
[16:07:38] <icinga-wm>	 RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.025 second response time
[16:07:58] <icinga-wm>	 RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 72493 bytes in 0.463 second response time
[16:10:48] <icinga-wm>	 RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:14:21] <akosiaris>	 _joe_: the calico node docker image is pure crap.. nothing works...
[16:14:39] <_joe_>	 akosiaris: might be my fault
[16:14:39] <akosiaris>	 I 've fixed like 2 bugs already and now it is trying to execute a non existant binary
[16:14:45] <_joe_>	 I rebuilt it
[16:14:51] <_joe_>	 uhm
[16:15:25] <_joe_>	 you should honestly try with their one, or I should build the latest version
[16:15:39] <akosiaris>	 I 'll try and build the latest version
[16:15:42] <akosiaris>	 seems more promising
[16:35:07] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Update calico/node version [puppet] - 10https://gerrit.wikimedia.org/r/329621
[16:36:05] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Update calico/node version [puppet] - 10https://gerrit.wikimedia.org/r/329621
[16:36:12] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Update calico/node version [puppet] - 10https://gerrit.wikimedia.org/r/329621 (owner: 10Alexandros Kosiaris)
[16:38:48] <icinga-wm>	 PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 16 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[cni]
[16:39:33] <ottomata>	 hiii akosiaris :)
[16:39:35] * ottomata waves
[16:41:38] <akosiaris>	 ottomata: o/
[16:42:09] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: calico/node needs certs bind mounted in namespace [puppet] - 10https://gerrit.wikimedia.org/r/329622
[16:42:45] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] calico/node needs certs bind mounted in namespace [puppet] - 10https://gerrit.wikimedia.org/r/329622 (owner: 10Alexandros Kosiaris)
[16:42:52] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: calico/node needs certs bind mounted in namespace [puppet] - 10https://gerrit.wikimedia.org/r/329622
[16:42:56] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] calico/node needs certs bind mounted in namespace [puppet] - 10https://gerrit.wikimedia.org/r/329622 (owner: 10Alexandros Kosiaris)
[16:45:39] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] alexandros kosiaris Known. still missing package cni for kubernetes
[16:45:39] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on kubernetes1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] alexandros kosiaris Known. still missing package cni for kubernetes
[16:45:39] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on kubernetes1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] alexandros kosiaris Known. still missing package cni for kubernetes
[16:45:39] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] alexandros kosiaris Known. still missing package cni for kubernetes
[16:51:04] <wikibugs>	 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2906496 (10Paladox) This is the commit that broke it https://gerrit-review.googlesource.com/#/c/86804/  we can revert it and apply it loca...
[17:01:38] <wikibugs>	 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2906498 (10Paladox) it will require an index after we upgrade gerrit to include ^^ just so we doint lose any commits.
[17:21:35] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Allow BGP between kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/329627
[17:24:14] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] Allow BGP between kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/329627 (owner: 10Alexandros Kosiaris)
[17:58:15] <wikibugs>	 (03PS7) 10Alexandros Kosiaris: Enable CNI plugin on kubernetes::node profile [puppet] - 10https://gerrit.wikimedia.org/r/329607
[17:58:21] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Enable CNI plugin on kubernetes::node profile [puppet] - 10https://gerrit.wikimedia.org/r/329607 (owner: 10Alexandros Kosiaris)
[18:07:29] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: calico: Expose puppet keys as well [puppet] - 10https://gerrit.wikimedia.org/r/329630
[18:07:56] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] calico: Expose puppet keys as well [puppet] - 10https://gerrit.wikimedia.org/r/329630 (owner: 10Alexandros Kosiaris)
[18:25:58] <icinga-wm>	 PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0]
[18:26:58] <icinga-wm>	 RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0]
[18:30:58] <icinga-wm>	 PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0]
[19:08:08] <greg-g>	 thanks akosiaris for that config deploy
[19:55:08] <icinga-wm>	 PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:55:28] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[19:56:28] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1001 is OK: OK ferm input default policy is set
[20:02:58] <icinga-wm>	 PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0]
[20:14:58] <icinga-wm>	 RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0]
[20:24:08] <icinga-wm>	 RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[20:33:31] <wikibugs>	 06Operations, 10ops-codfw: rack/setup/install mw2051-mw2060 - https://phabricator.wikimedia.org/T152698#2906698 (10RobH) a:05RobH>03Joe @Papaul: The systems he listed are not currently offline, as they show in monitoring.    @Joe: That sounds reasonable to me.  Will the servers they are replacing be able t...
[20:41:38] <wikibugs>	 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2906714 (10Paladox) this  https://gerrit-review.googlesource.com/#/c/86804/ also broke it where you merge a change, restart gerrit, look a...
[20:46:28] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:47:18] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[20:47:26] <ottomata1>	 we should make that not page ^
[20:47:27] <ottomata1>	 hm
[20:47:37] <ottomata>	 unless down for a while
[20:51:51] <elukey>	 it pages only us ottomata, maybe something in the analytics contact group?
[20:58:53] <elukey>	 (or maybe we could review the alarm itself..)
[20:59:00] <elukey>	 anyhow, I didn't get why it went down
[20:59:08] <ottomata>	 ahh ok
[20:59:09] <ottomata>	 that's fine then
[20:59:11] <ottomata>	 forgot that 
[20:59:13] <ottomata>	 HI! BTW!
[21:03:06] <elukey>	 HELLOO!
[21:03:13] <elukey>	 (forgot to say you are right :P)
[21:12:03] <elukey>	 so it doesn't seem to be memory related, but the node manager shut down and then restarted.. weird
[21:12:15] <ottomata>	 yeah i've noticed it does that sometimes
[21:12:15] <elukey>	 we have already seen this issue sporadically
[21:12:20] <ottomata>	 i thought you had found why with that memory bug
[21:13:35] <apergos>	 by 'only us' you must not mean ops because I sure did not get a page for that
[21:13:55] <apergos>	 which is fine if there are people who get notified
[21:17:07] <logmsgbot>	 !log otto@tin Starting deploy [eventstreams/deploy@4098bb4]: (no message)
[21:17:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:41] <elukey>	 apergos: yes sorry, us == analytics in this case :)
[21:17:48] <apergos>	 ah ok!
[21:17:59] <elukey>	 ottomata: 2016-12-29 20:47:18,473 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15: SIGTERM - ???
[21:18:10] <ottomata>	 i did restart it 
[21:18:19] <ottomata>	 but,i think after it had already come back up, but i hadn't realized
[21:18:45] <elukey>	 ah ok! 
[21:18:54] <elukey>	 didn't find anything good in the logs, super weird
[21:19:06] <logmsgbot>	 !log otto@tin Finished deploy [eventstreams/deploy@4098bb4]: (no message) (duration: 01m 59s)
[21:19:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:34] <elukey>	 ottomata: going afk again, let's keep an eye and see if it re-occurs.. 
[21:19:38] <elukey>	 o/
[21:19:42] <ottomata>	 laters!
[21:40:48] <icinga-wm>	 PROBLEM - puppet last run on elastic1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:47:13] <hoo>	 yuvipanda: do you have a moment to talk about video scaling? :P
[21:55:25] <bd808>	 hoo: I think _jo.e_ and others have been working on getting more horsepower into the pros video scalers
[21:55:38] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1004 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[21:55:40] <bd808>	 s/pros/prod/
[21:56:45] <hoo>	 bd808: Cool, but I need some videos scaled fast
[21:57:38] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1004 is OK: OK ferm input default policy is set
[21:58:00] <bd808>	 hoo: https://aws.amazon.com/ ;)
[21:59:03] <hoo>	 If I could upload my own transcodes, that would also work, I guess :P
[22:00:15] <bd808>	 I don't know if there is any magic way to jump to the front of the transcode queue.
[22:02:43] <hoo>	 bd808: Not that I'm aware of
[22:03:00] <hoo>	 I'm just hacking a tiny little script that could only process the jobs we're interested in
[22:03:06] <hoo>	 I guess it might not be to bad(tm)
[22:04:12] <bd808>	 hoo: you might poke Reedy. He's done lots of big import things in the past and may know some tricks.
[22:04:43] * hoo eyes Reedy 
[22:08:48] <icinga-wm>	 RECOVERY - puppet last run on elastic1024 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[22:13:59] <hoo>	 hm :/
[22:14:08] <hoo>	 Anyone firm with the job queue?
[22:14:13] <hoo>	 Like AaronSchulz ^?
[22:15:28] <p858snake|L2>	 that would also be Reedy
[22:18:07] <p858snake|L2>	 hoo: Revent is pretty clued up with the video stuff I believe (Queue wise), but joe and co are working on scalers atm since they are massively overloaded
[22:19:17] <Revent>	 hoo: The queue is nuts, and there is (from what I have been told) no way to jump it.
[22:20:25] <hoo>	 https://paste.fedoraproject.org/514846/30500141/
[22:20:28] <Revent>	 v2c is completely turned off right now, also, to prevent more flooding.
[22:20:36] <hoo>	 that would probably work, but I didn't test
[22:21:18] <hoo>	 v2c?
[22:21:27] <Revent>	 video2commons, on labs.
[22:21:40] <hoo>	 ah, yeah
[22:22:12] <Revent>	 Tho… is it just that you need them transcoded right away? Like, not necessarily on Commons right now?
[22:22:35] <hoo>	 Nah, that would probably be easy
[22:22:52] <hoo>	 but we need them for a tutorial page etc. very soon
[22:23:03] <hoo>	 So it needs to be wikimedia hosted
[22:23:15] <Revent>	 Yeah, it’s likely to take the scalers like a week to catch up. :/
[22:23:36] <hoo>	 webVideoTranscode: 21494 queued; 562 claimed (180 active, 382 abandoned); 0 delayed
[22:23:37] <hoo>	 yeha
[22:23:57] <Revent>	 Hundreds of those are ‘very’ large files.
[22:24:39] <Amortias>	 got another database error when creating an account if someone can check it out please - [WGWNAgpAICsAAHi8PIcAAAAK] 2016-12-29 22:24:03: Fatal exception of type "DBQueryError"
[22:25:02] <Revent>	 Frankly, it would not horrify me if they was to just completely kick everything out of the queue, start fresh, and then shovel shit back in as a backlog.
[22:25:26] <Revent>	 (it’s not like there isn’t a backlog of a third of a million old failed transcodes anyhow, lol)
[22:25:46] <hoo>	 sadness :/
[22:26:20] <Revent>	 I’m planning on working on shoving those through, once it’s working right again.
[22:26:34] <Revent>	 (most are small)
[22:27:06] <hoo>	 :)
[22:27:20] <hoo>	 Anyway, we would really like to get this resolved
[22:27:39] <hoo>	 there are a few ways, to bypass the usual queueing/ poping
[22:29:11] <Revent>	 hoo: A while back there was something like 15k ‘uninitialized’ transcodes, I shoved all those through before this drama came up, it’s ‘mindless while watching a movie’ stuff.
[22:29:38] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[22:30:38] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1003 is OK: OK ferm input default policy is set
[22:31:00] <hoo>	 cool, but not applicable right now :S
[22:33:30] <Revent>	 Yeah, I more just meant that kicking everything off (so the system is responsive ‘now’) would not be unworkable drama.
[22:47:14] <hoo>	 I'm giving up for now, posted my code snippet on https://phabricator.wikimedia.org/T154186
[22:47:25] <hoo>	 Would be nice if someone were to have a look
[22:47:36] <bd808>	 bah. Amortias left. The error was "Error: 1213 Deadlock found when trying to get lock;" while setting some echo user properties
[22:48:16] <hoo>	 Enough of my vacation spend
[22:52:08] <icinga-wm>	 PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:01:39] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[23:02:38] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1003 is OK: OK ferm input default policy is set
[23:10:38] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[23:10:57] <akosiaris>	 hmm
[23:11:38] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1001 is OK: OK ferm input default policy is set
[23:16:38] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[23:17:25] <akosiaris>	 Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
[23:17:28] <akosiaris>	 interesting
[23:17:39] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1003 is OK: OK ferm input default policy is set
[23:19:44] <akosiaris>	 !log schedule downtime for ferm checks on kubernetes nodes. Some race between kubernetes + ferm, investigating
[23:19:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:21:08] <icinga-wm>	 RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[23:23:08] <icinga-wm>	 PROBLEM - puppet last run on labvirt1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:25:08] <icinga-wm>	 PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[23:51:08] <icinga-wm>	 RECOVERY - puppet last run on labvirt1007 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[23:52:08] <icinga-wm>	 RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures