[00:00:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time
[00:05:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time
[00:10:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.011 second response time
[00:15:04] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Cannot continue. - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 187 bytes in 0.010 second response time
[00:20:04] RECOVERY - check_listener_gc on thulium is OK: HTTP OK: HTTP/1.1 200 OK - 249 bytes in 0.052 second response time
[00:54:54] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:56:44] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy
[01:11:14] :q
[01:34:34] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[02:02:34] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[02:34:54] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:38:36] 06Operations, 10Gerrit, 06Release-Engineering-Team, 05Security, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2905700 (10Paladox) @Aklapper i haven't tried yet reproducing data loss for accounts, but i managed with patches without eve...
[02:46:34] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:50:44] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2905703 (10Peachey88) Do we actually have account_data that can be lost (and cause issues)? since all our account data is synced from a ex...
[02:54:40] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2905707 (10Paladox) @Peachey88 even though we use ldap, gerrit stores all accounts in the index. Which is what caused T152640 that problem...
[03:02:54] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[03:06:04] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:06:04] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
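The check_listener_gc alerts at the top of this log come from a simple string-match HTTP probe: the monitoring host fetches the listener URL (via 127.0.0.1, asking for the payments-listener.wikimedia.org vhost) and looks for the literal string "OK" in the response body, so a 403 with any other body trips CRITICAL even though the service answered. A minimal sketch of that kind of check in Python; the path, host name and expected string are taken from the alert text, while the timeout, exit codes and use of the requests library are assumptions (this is not the actual check_http/NRPE plugin):

```python
#!/usr/bin/env python
# Minimal sketch of a string-match HTTP health probe (assumption: not the
# real production check, just the same idea in miniature).
import sys
import requests

URL = "https://127.0.0.1/globalcollect"        # local listener, per the alert
HOST = "payments-listener.wikimedia.org"       # vhost the check asks for
EXPECTED = "OK"                                # literal string the check wants

def probe():
    try:
        r = requests.get(URL, headers={"Host": HOST}, timeout=10, verify=False)
    except requests.RequestException as e:
        print("CRITICAL: request failed: %s" % e)
        return 2
    if EXPECTED in r.text:
        print("OK: HTTP/1.1 %s - %d bytes" % (r.status_code, len(r.content)))
        return 0
    # This is the case seen above: HTTP 403, a body is present, but no "OK".
    print("CRITICAL: HTTP/1.1 %s - string %s not found - %d bytes"
          % (r.status_code, EXPECTED, len(r.content)))
    return 2

if __name__ == "__main__":
    sys.exit(probe())
```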
[03:06:54] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[03:06:54] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[03:13:04] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:14:04] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[03:14:34] RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[03:16:04] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:16:54] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[04:02:34] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[04:16:04] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2710.20 Read Requests/Sec=2770.50 Write Requests/Sec=21.90 KBytes Read/Sec=11180.40 KBytes_Written/Sec=796.00
[04:25:04] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=20.40 Read Requests/Sec=0.00 Write Requests/Sec=12.50 KBytes Read/Sec=0.00 KBytes_Written/Sec=387.60
[04:29:34] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[04:54:24] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:23:24] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[05:39:34] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 12 failures. Last run 2 minutes ago with 12 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[05:40:24] PROBLEM - puppet last run on elastic1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:04:34] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 20 failures. Last run 2 minutes ago with 20 failures. Failed resources (up to 3 shown): Service[puppet],Service[rsyslog],Exec[ip addr add 2620:0:861:103:10:64:32:125/64 dev eth0],Service[ferm]
[06:07:34] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:09:24] RECOVERY - puppet last run on elastic1024 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:28:24] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:30:34] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:32:24] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:32:34] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:48:24] PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:01:39] <_joe_> uhm
[07:02:40] <_joe_> I'll check when I'm properly awake
[07:02:48] <_joe_> but this seems like a logrotate thing
[07:03:27] <_joe_> yes, it's apache "reloading", wtf
[07:03:36] <_joe_> ok, I'll fix it
[07:07:03] <_joe_> unless this is an artifact of reloading + apache status, which seems probable
[07:14:24] PROBLEM - puppet last run on elastic1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:19:44] RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK
[07:41:20] <_joe_> matanya: I think turning off video2commons for a couple of days might help stopping the pressure on the videoscalers
[07:42:25] RECOVERY - puppet last run on elastic1034 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[07:55:36] (03CR) 10Tim Landscheidt: "@chasemp: This updated also modules/cdh and modules/mariadb which does not seem to have been intended. modules/mariadb has been updated w" [puppet] - 10https://gerrit.wikimedia.org/r/316577 (owner: 10Rush)
[08:06:34] PROBLEM - puppet last run on mw1272 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
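The "Check HHVM threads for leakage" alerts above compare two numbers: how many threads HHVM has running or queued versus how many workers Apache reports as busy, and fire when the first is more than double the second. _joe_'s point is that a graceful Apache reload (e.g. triggered by logrotate) briefly resets the busy-worker count, which can trip the ratio without any real thread leak. A rough sketch of that comparison; the server-status URL and the use of /proc for the HHVM thread count are assumptions, not the production check:

```python
# Rough sketch of the "HHVM threads vs. Apache busy workers" comparison
# (assumptions: mod_status is reachable at /server-status?auto on localhost,
# and the HHVM thread count is approximated from /proc; the real Icinga
# check may gather these numbers differently).
import re
import subprocess
import requests

def apache_busy_workers():
    text = requests.get("http://127.0.0.1/server-status?auto", timeout=5).text
    m = re.search(r"^BusyWorkers:\s*(\d+)", text, re.MULTILINE)
    return int(m.group(1)) if m else 0

def hhvm_threads():
    # Count threads of the hhvm process via the "Threads:" line in /proc.
    pid = subprocess.check_output(["pidof", "hhvm"]).split()[0].decode()
    with open("/proc/%s/status" % pid) as f:
        for line in f:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    return 0

busy = apache_busy_workers()
threads = hhvm_threads()
if busy and threads > 2 * busy:
    # Right after a graceful reload BusyWorkers drops, so this can fire even
    # when HHVM is fine - the artifact _joe_ suspects above.
    print("CRITICAL: HHVM threads (%d) more than double apache busy workers (%d)"
          % (threads, busy))
else:
    print("OK: %d HHVM threads, %d busy apache workers" % (threads, busy))
```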
[08:13:24] RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK
[08:15:24] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[08:16:24] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[08:19:24] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:22:24] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:28:34] RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK
[08:34:34] RECOVERY - puppet last run on mw1272 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[08:34:45] (03CR) 10Tim Landscheidt: "@Joe: Just to clarify: I don't *want* to use expand_path in Labs; I'm only interested in hiera('puppetdb::password::rw') providing some du" [puppet] - 10https://gerrit.wikimedia.org/r/329226 (owner: 10Tim Landscheidt)
[08:35:44] RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK
[08:54:18] _joe_: i will handle it later
[08:54:28] zhuyifei1999_: FYI - see above
[08:54:40] yeah?
[08:54:43] <_joe_> matanya: yeah thanks, it's not like I'm mandating that, but
[08:54:56] <_joe_> we have 8 days of backlog of transcodes atm
[08:55:13] zhuyifei1999_: _joe_ asked to shut down v2c until the scalers catch up
[08:55:18] Hi all, is anybody with deploy access around? I'd like 329516 to be deployed. Greg approved it, see T154254
[08:55:18] T154254: Set celebration logo for hewiki - https://phabricator.wikimedia.org/T154254
[08:55:27] hmm
[08:55:41] ok
[08:56:00] I'll set up a message saying the service is down
[08:56:25] <_joe_> we are currently processing files from dec 19th
[08:56:43] <_joe_> so we're 9 and a half days behind
[08:58:02] Urbanecm: Reedy will probably be on in the next few hours
[08:58:18] thats what i wanted to say to Urbanecm :)
[08:58:20] ((if you ask nicely and no one steps up before that)
[08:58:38] Dereckson might be around too
[08:59:08] jzerebecki: might have rights as well
[08:59:17] Thanks both of you
[08:59:31] and one last name for you might be legoktm
[09:00:26] hi
[09:00:32] https://phabricator.wikimedia.org/T154254#2906104
[09:00:37] celebration logos are done via CSS
[09:00:55] <_joe_> I have the rights to deploy, but what lego said
[09:00:56] <_joe_> :)
[09:01:08] legoktm, for HD displays too?
[09:01:12] Urbanecm: yes
[09:02:04] Then i cant see a reason why they requested this again...
[09:04:32] Urbanecm: someone needs to make the high-res versions though
[09:04:45] https://en.wikipedia.org/w/index.php?title=MediaWiki%3ACommon.css&type=revision&diff=688516426&oldid=686648742 is the CSS to override those
[09:05:07] oh I guess https://commons.wikimedia.org/wiki/File:Hewiki_200,000_articles4.svg is good enough
[09:05:22] someone just needs to scale it down to the right dimensions
[09:05:47] docs are at https://www.mediawiki.org/wiki/Manual:$wgLogoHD
[09:05:51] legoktm: it was already done here : https://he.wikipedia.org/w/index.php?title=%D7%9E%D7%93%D7%99%D7%94_%D7%95%D7%99%D7%A7%D7%99%3ACommon.css&type=revision&diff=19841881&oldid=19838328
[09:06:05] legoktm, 329516 isn't enough?
[09:06:07] matanya: that doesn't affect people who have hdpi screens
[09:06:27] https://tools.wmflabs.org/video2commons/ <= it's down now
[09:06:28] Urbanecm: that will get cached for a month at least
[09:06:56] Urbanecm: but those png files are probably fine (I didn't check)
[09:07:04] There isn't anything like purge?
[09:07:10] thank you zhuyifei1999_
[09:07:17] np
[09:08:35] Urbanecm: there is, but it's much easier if this is done via local CSS
[09:09:17] <_joe_> zhuyifei1999_: thanks a lot
[09:09:24] np
[09:10:03] legoktm, As I do not understand CSS well can you advise them what line(s) they should add?
[09:10:11] Or add them if you have rights.
[09:10:50] BTW T131605 was resolved and it is about the same thing but another wiki and another milestone.
[09:10:50] T131605: Set celebration logo on Czech Wikipedia - https://phabricator.wikimedia.org/T131605
[09:11:15] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 3 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2906120 (10zhuyifei1999) For the record #video2commons is off now.
[09:12:39] Urbanecm: Well that wasn't during a deploy freeze when you're trying to find a sysadmin :P but I wouldn't have done it then either. Doing a local CSS change gives the local wiki much more control
[09:13:23] I am a local admin, i can do the css change, if i know what it was :)
[09:13:40] *knew
[09:14:38] matanya: sorry, I'm about to sleep. the CSS is on https://www.mediawiki.org/wiki/Manual:$wgLogoHD, you just need to upload the 1.5x and 2x logos (png) somewhere
[09:14:47] I think Urbanecm already created those files
[09:14:59] thanks legoktm good night
[09:15:17] good night legoktm
[09:17:40] matanya, you can download my files from https://gerrit.wikimedia.org/r/#/c/329516/ (the 1.5x and 2x).
[09:19:22] (03Abandoned) 10Urbanecm: Set celebration logo for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329516 (https://phabricator.wikimedia.org/T154254) (owner: 10Urbanecm)
[09:20:28] Urbanecm: want me to upload them to commons ?
[09:21:02] matanya, please upload them locally (they should be protected).
[09:21:10] ok
[09:22:58] I just protected it..
[09:23:14] if you mean https://commons.wikimedia.org/wiki/File:Hewiki_200,000_articles4.svg
[09:24:55] zhuyifei1999_, this file doesn't need to be protected as it isn't used in interface. The SVG is needed for making PNG (because mediawiki doesn't support SVG yet).
[09:25:20] Local PNG (or even on commons; it doesn't matter) should be protected.
[09:25:43] you can use thumb generation for that
[09:25:55] zhuyifei1999_, they are generated :)
[09:26:21] BTW I can generate a thumb whatever size I want?
[09:26:51] yes
[09:27:23] https://upload.wikimedia.org/wikipedia/commons/thumb/1/14/Hewiki_200%2C000_articles4.svg/1234px-Hewiki_200%2C000_articles4.svg.png
[09:27:30] just replace that 1234px
[09:27:50] (I just unprotected it btw)
[09:29:48] (03PS1) 10Alexandros Kosiaris: Specify kubernetes admission controllers via hiera [puppet] - 10https://gerrit.wikimedia.org/r/329592
[09:31:33] matanya, http://pastebin.com/4RQpB92W should be your HD css.
[09:32:26] yes Urbanecm thanks. i was in the middle of editing that :)
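The trick zhuyifei1999_ describes above - asking MediaWiki's thumbnailer for an arbitrary width by editing the NNNpx part of the upload.wikimedia.org thumb URL - is all that is needed to produce the 1x/1.5x/2x PNG logos that the Manual:$wgLogoHD-style Common.css override wants. A small illustrative helper, using the Commons thumb URL shown above; the hash path ("1/14") is specific to this file, and the 135px base width is an assumption (the usual Vector logo width), not something stated in the log:

```python
# Illustrative helper: build thumbnail URLs for the 1x/1.5x/2x logo PNGs
# from the Commons SVG, using the "replace that 1234px" trick shown above.
# Assumptions: the file's hash path (1/14) and a 135px-wide base logo.
THUMB_BASE = ("https://upload.wikimedia.org/wikipedia/commons/thumb/"
              "1/14/Hewiki_200%2C000_articles4.svg/")
FILE_NAME = "Hewiki_200%2C000_articles4.svg"
BASE_WIDTH = 135  # assumed standard logo width in CSS pixels

def logo_thumb(scale):
    width = int(round(BASE_WIDTH * scale))
    return "%s%dpx-%s.png" % (THUMB_BASE, width, FILE_NAME)

for scale in (1, 1.5, 2):
    # These URLs are what a local MediaWiki:Common.css override (per the
    # $wgLogoHD documentation linked above) would point at for normal and
    # HiDPI displays.
    print("%.1fx -> %s" % (scale, logo_thumb(scale)))
```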
[09:34:11] Urbanecm: if you have HD screen, i'd like if you can verify it works well
[09:34:56] (03PS2) 10Alexandros Kosiaris: kubernetes::worker: Allow access to kubelet from master [puppet] - 10https://gerrit.wikimedia.org/r/329494
[09:34:58] (03PS2) 10Alexandros Kosiaris: Specify kubernetes admission controllers via hiera [puppet] - 10https://gerrit.wikimedia.org/r/329592
[09:35:34] matanya, I don't have that screen.
[09:41:49] (03CR) 10Alexandros Kosiaris: [C: 032] kubernetes::worker: Allow access to kubelet from master [puppet] - 10https://gerrit.wikimedia.org/r/329494 (owner: 10Alexandros Kosiaris)
[09:41:55] (03PS3) 10Alexandros Kosiaris: kubernetes::worker: Allow access to kubelet from master [puppet] - 10https://gerrit.wikimedia.org/r/329494
[09:41:59] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] kubernetes::worker: Allow access to kubelet from master [puppet] - 10https://gerrit.wikimedia.org/r/329494 (owner: 10Alexandros Kosiaris)
[09:42:12] <_joe_> akosiaris: did it work at all without that?
[09:42:15] (03CR) 10Alexandros Kosiaris: [C: 032] Specify kubernetes admission controllers via hiera [puppet] - 10https://gerrit.wikimedia.org/r/329592 (owner: 10Alexandros Kosiaris)
[09:42:23] (03PS3) 10Alexandros Kosiaris: Specify kubernetes admission controllers via hiera [puppet] - 10https://gerrit.wikimedia.org/r/329592
[09:42:27] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Specify kubernetes admission controllers via hiera [puppet] - 10https://gerrit.wikimedia.org/r/329592 (owner: 10Alexandros Kosiaris)
[09:45:57] _joe_: yes. Only attach and exec to a running container did not
[09:46:13] <_joe_> from kubectl?
[09:46:16] yes
[09:46:21] they go via the master
[09:46:22] <_joe_> heh kinda expected, ain't it?
[09:46:28] yup
[09:46:30] <_joe_> it makes sense
[09:46:35] IIRC, these 2 are the only ones that go via the master
[09:46:41] everything else is pull model
[09:46:45] PROBLEM - puppet last run on copper is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[docker-engine]
[09:46:46] but I may be wrong
[09:46:50] <_joe_> which, again, makes sense
[09:47:08] hmm docker is me
[09:47:37] I 've upgraded docker to 1.12.5
[09:47:44] I 'll amend puppet
[09:49:46] (03PS1) 10Alexandros Kosiaris: builder: Specify the newer docker-engine version [puppet] - 10https://gerrit.wikimedia.org/r/329593
[09:50:02] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] builder: Specify the newer docker-engine version [puppet] - 10https://gerrit.wikimedia.org/r/329593 (owner: 10Alexandros Kosiaris)
[09:50:08] (03PS2) 10Alexandros Kosiaris: builder: Specify the newer docker-engine version [puppet] - 10https://gerrit.wikimedia.org/r/329593
[09:50:11] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] builder: Specify the newer docker-engine version [puppet] - 10https://gerrit.wikimedia.org/r/329593 (owner: 10Alexandros Kosiaris)
[09:51:44] RECOVERY - puppet last run on copper is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[09:57:08] (03CR) 10Alexandros Kosiaris: [C: 032] "https://puppet-compiler.wmflabs.org/5003/ says noop, merging. Let's see what breaks!" [puppet] - 10https://gerrit.wikimedia.org/r/302695 (owner: 10Alexandros Kosiaris)
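akosiaris's point above is that almost everything in Kubernetes is pull-based (the nodes poll the API server), and only `kubectl exec`/`attach` require the master to open a connection to the kubelet on the worker - which is why the "Allow access to kubelet from master" firewall change was needed. A quick, hedged way to confirm the master can actually reach a worker's kubelet; the port (10250, the kubelet's usual secure port) and the example worker host name are assumptions for illustration:

```python
# Quick reachability probe from the master toward a worker's kubelet API.
# Assumptions: kubelet listens on its usual secure port 10250, and the
# worker host name below is only an example.
import socket

WORKERS = ["kubernetes1001.eqiad.wmnet"]  # example worker, not exhaustive
KUBELET_PORT = 10250

for host in WORKERS:
    try:
        # If the firewall rule is missing, this is what kubectl exec/attach
        # runs into: the master simply cannot open the connection.
        socket.create_connection((host, KUBELET_PORT), timeout=3).close()
        print("%s:%d reachable - exec/attach should work" % (host, KUBELET_PORT))
    except OSError as e:
        print("%s:%d NOT reachable (%s)" % (host, KUBELET_PORT, e))
```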
[09:57:14] (03PS3) 10Alexandros Kosiaris: Move external_networks to network module data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/302695
[10:03:50] (03PS1) 10Tim Landscheidt: puppetmaster: Specify $group for all repositories [puppet] - 10https://gerrit.wikimedia.org/r/329595 (https://phabricator.wikimedia.org/T152060)
[10:17:44] PROBLEM - puppet last run on ms-fe1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:44:38] 06Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth: Rename user TextworkerBot to VladiBot on ru.wiki - https://phabricator.wikimedia.org/T153602#2885305 (10Vladis13) I'm original requestor. I thought that for renaming should change the only one row in the database. Ok, if it is problematic then...
[10:45:44] RECOVERY - puppet last run on ms-fe1002 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[10:54:44] PROBLEM - puppet last run on ms-be1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:23:33] (03PS1) 10MarcoAurelio: Set $wgAbuseFilterNotificationsPrivate = true; for Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329600
[11:23:44] RECOVERY - puppet last run on ms-be1003 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[11:40:54] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:49:10] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Various inline comments, will work on them myself today" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/323816 (owner: 10Giuseppe Lavagetto)
[12:08:54] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[12:10:44] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=46%)
[12:19:44] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[12:19:50] !log running sudo apt-get autoremove on labtestnet2001
[12:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:01] !log running sudo apt-get autoremove on labtestnet2001. Removing various older kernels
[12:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:45] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[13:30:54] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:32:44] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[13:59:54] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[14:09:08] (03PS2) 10Alexandros Kosiaris: calico: add module/profile to use as kubernetes networking [puppet] - 10https://gerrit.wikimedia.org/r/323816 (owner: 10Giuseppe Lavagetto)
[14:09:10] (03PS1) 10Alexandros Kosiaris: Enable the calico profile on kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/329606
[14:09:12] (03PS1) 10Alexandros Kosiaris: Enable CNI plugin on kubernetes::node profile [puppet] - 10https://gerrit.wikimedia.org/r/329607
[14:16:44] (03PS2) 10Alexandros Kosiaris: Enable the calico profile on kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/329606
[14:16:46] (03PS2) 10Alexandros Kosiaris: Enable CNI plugin on kubernetes::node profile [puppet] - 10https://gerrit.wikimedia.org/r/329607
[14:19:45] (03PS3) 10Alexandros Kosiaris: calico: add module/profile to use as kubernetes networking [puppet] - 10https://gerrit.wikimedia.org/r/323816 (owner: 10Giuseppe Lavagetto)
[14:19:47] (03PS3) 10Alexandros Kosiaris: Enable the calico profile on kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/329606
[14:19:49] (03PS3) 10Alexandros Kosiaris: Enable CNI plugin on kubernetes::node profile [puppet] - 10https://gerrit.wikimedia.org/r/329607
[14:26:12] (03PS4) 10Alexandros Kosiaris: calico: add module/profile to use as kubernetes networking [puppet] - 10https://gerrit.wikimedia.org/r/323816 (owner: 10Giuseppe Lavagetto)
[14:26:14] (03PS4) 10Alexandros Kosiaris: Enable the calico profile on kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/329606
[14:26:17] (03PS4) 10Alexandros Kosiaris: Enable CNI plugin on kubernetes::node profile [puppet] - 10https://gerrit.wikimedia.org/r/329607
[14:32:02] (03PS5) 10Alexandros Kosiaris: calico: add module/profile to use as kubernetes networking [puppet] - 10https://gerrit.wikimedia.org/r/323816 (owner: 10Giuseppe Lavagetto)
[14:32:04] (03PS5) 10Alexandros Kosiaris: Enable the calico profile on kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/329606
[14:32:06] (03PS5) 10Alexandros Kosiaris: Enable CNI plugin on kubernetes::node profile [puppet] - 10https://gerrit.wikimedia.org/r/329607
[14:35:41] (03PS6) 10Alexandros Kosiaris: calico: add module/profile to use as kubernetes networking [puppet] - 10https://gerrit.wikimedia.org/r/323816 (owner: 10Giuseppe Lavagetto)
[14:35:43] (03PS6) 10Alexandros Kosiaris: Enable the calico profile on kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/329606
[14:35:45] (03PS6) 10Alexandros Kosiaris: Enable CNI plugin on kubernetes::node profile [puppet] - 10https://gerrit.wikimedia.org/r/329607
[14:36:45] 06Operations, 10MediaWiki-Special-pages: Fatal exception of type MWException on all the Special:GlobalUserRights pages - https://phabricator.wikimedia.org/T154185#2903588 (10abian)
[14:40:27] (03CR) 10Alexandros Kosiaris: [C: 032] "After a large round of patches, pcc is happy at https://puppet-compiler.wmflabs.org/5009/" [puppet] - 10https://gerrit.wikimedia.org/r/323816 (owner: 10Giuseppe Lavagetto)
[14:40:48] (03CR) 10Alexandros Kosiaris: [C: 032] Enable the calico profile on kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/329606 (owner: 10Alexandros Kosiaris)
[14:44:38] PROBLEM - puppet last run on kubernetes1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:44:38] PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:45:25] (03PS1) 10Alexandros Kosiaris: Fix dependency to calico/node for systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/329608
[14:46:34] (03CR) 10Alexandros Kosiaris: [C: 032] Fix dependency to calico/node for systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/329608 (owner: 10Alexandros Kosiaris)
[14:46:42] (03PS2) 10Alexandros Kosiaris: Fix dependency to calico/node for systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/329608
[14:46:45] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Fix dependency to calico/node for systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/329608 (owner: 10Alexandros Kosiaris)
[14:52:30] (03PS1) 10Alexandros Kosiaris: Bump calico CNI plugin version [puppet] - 10https://gerrit.wikimedia.org/r/329611
[14:52:48] (03PS2) 10Alexandros Kosiaris: Bump calico CNI plugin version [puppet] - 10https://gerrit.wikimedia.org/r/329611
[14:52:54] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Bump calico CNI plugin version [puppet] - 10https://gerrit.wikimedia.org/r/329611 (owner: 10Alexandros Kosiaris)
[14:58:13] (03PS1) 10Alexandros Kosiaris: Specify the registry as well for calico/node [puppet] - 10https://gerrit.wikimedia.org/r/329612
[14:58:34] (03PS2) 10Alexandros Kosiaris: Specify the registry as well for calico/node [puppet] - 10https://gerrit.wikimedia.org/r/329612
[14:58:40] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Specify the registry as well for calico/node [puppet] - 10https://gerrit.wikimedia.org/r/329612 (owner: 10Alexandros Kosiaris)
[15:03:08] (03PS1) 10Alexandros Kosiaris: Fix ETCD_CA_CERT_FILE for calico/node [puppet] - 10https://gerrit.wikimedia.org/r/329613
[15:04:18] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Fix ETCD_CA_CERT_FILE for calico/node [puppet] - 10https://gerrit.wikimedia.org/r/329613 (owner: 10Alexandros Kosiaris)
[15:04:22] (03PS2) 10Alexandros Kosiaris: Fix ETCD_CA_CERT_FILE for calico/node [puppet] - 10https://gerrit.wikimedia.org/r/329613
[15:04:25] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Fix ETCD_CA_CERT_FILE for calico/node [puppet] - 10https://gerrit.wikimedia.org/r/329613 (owner: 10Alexandros Kosiaris)
[15:06:44] With v2c not flooding huge files, the transcoder backlog is heading steadily downward.
[15:07:08] Probably about 1500 below its peak.
[15:09:38] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni]
[15:10:37] <_joe_> Revent: yes, still a ton of v2c white house files in the queue
[15:10:46] Yeah, I know.
[15:11:00] But it's at least making headway.
[15:11:05] <_joe_> as fresh as from the 27th
[15:11:08] <_joe_> yup
[15:12:38] PROBLEM - puppet last run on kubernetes1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni]
[15:18:16] /etc/rc.local: 4: .: startup.env: not found
[15:18:17] argh
[15:18:53] not to mention https://github.com/shazow/urllib3/issues/497 ...
[15:19:02] thankfully the urllib3 issue is not a blocker
[15:19:06] <_joe_> wat?
[15:19:11] <_joe_> (rc.local)
[15:19:13] but how is that startup.env populated ...
[15:19:17] damn containers
[15:19:32] seems botched somehow
[15:19:49] <_joe_> is that calico/node?
[15:19:53] yes
[15:19:57] <_joe_> uhm
[15:20:21] <_joe_> I can swear that container worked when I tried it, meh
[15:20:34] <_joe_> but had no CA_CERT_FILE
[15:24:54] _joe_: I count 726 transcodes with "White_House_Press_Briefing" in the filename (queued)
[15:25:00] (sigh)
[15:30:09] _joe_: yeah it seems to be related to HTTPS
[15:30:19] not that HTTP works.. but I get a different error
[15:31:25] <_joe_> Revent: I count transcludes from 470 such files
[15:31:30] <_joe_> in the queue
[15:31:46] Yeah, I counted transcodes, not files.
[15:32:03] (in the search the special page uses)
[15:37:58] <_joe_> transcodes of such files are ~ 23400
[15:38:04] <_joe_> sorry, 3400
[15:39:34] How many total transcodes do you see queued?
[15:40:00] The special page says 11528
[15:40:17] <_joe_> 11700 or so, let me see
[15:40:55] I might have undercounted, it was not an actual search, I just paged through and added
[15:40:58] <_joe_> uhm, no, it's 21759 queued
[15:41:04] Hopy crap.
[15:41:07] *Holy
[15:41:16] <_joe_> it was like 23k this morning
[15:42:11] I really wonder if there are suplicate entries.
[15:42:16] *duplicate
[15:42:55] Hi, can anybody advise me how I can get 329597 (patch for T154278) deployed? It is a throttle rule for an event that is held tomorrow around 03:00 UTC. They gave us a single day notice (ticket was created today in the UTC morning). I wrote greg-g an e-mail and asked in -releng and was directed by paladox that I should ask here too.
[15:42:56] T154278: Temporary IP Cap Lift on mai.wiki - https://phabricator.wikimedia.org/T154278
[15:49:28] Urbanecm: I can get that deploying
[15:49:31] deployed*
[15:50:09] akosiaris, that's great! I don't know the rules for deploying during code freeze exactly so I'm asking everywhere it seems suitable for me.
[15:51:29] akosiaris, Am I required to do something?
[15:51:30] looks sensible enough to not be dangerous
[15:51:36] no you are not
[15:51:45] Okay.
[15:53:55] (03CR) 10Alexandros Kosiaris: [C: 032] "Despite the code freeze, this is configuration and looks innocuous enough to warrant deploying." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329597 (https://phabricator.wikimedia.org/T154278) (owner: 10Urbanecm)
[15:54:31] (03Merged) 10jenkins-bot: [throttle] Rule for maiwiki - December 30th [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329597 (https://phabricator.wikimedia.org/T154278) (owner: 10Urbanecm)
[15:54:43] (03CR) 10jenkins-bot: [throttle] Rule for maiwiki - December 30th [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329597 (https://phabricator.wikimedia.org/T154278) (owner: 10Urbanecm)
[15:55:51] !log akosiaris@tin Synchronized wmf-config/throttle.php: (no message) (duration: 00m 42s)
[15:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:07] !log merging https://gerrit.wikimedia.org/r/329597 for T154278 (IP throttle raise)
[15:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:10] T154278: Temporary IP Cap Lift on mai.wiki - https://phabricator.wikimedia.org/T154278
[15:56:18] Urbanecm: done
[15:56:29] Thanks a lot akosiaris!
[15:56:36] you 're welcome
[15:57:09] * akosiaris back to damn calico. I 'll keep an eye for errors though
[15:57:39] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:57:39] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:57:48] hmm, not good
[15:58:04] akosiaris, do you think this is caused by the throttle patch?
[15:58:08] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:58:34] it seems isolated
[15:58:37] at least for now
[15:59:00] What does it mean? It won't work?
[15:59:20] all it means yet is a single box's HHVM process does not like something
[15:59:35] could very well be unrelated
[15:59:51] If unrelated to the patch it's good :).
[16:03:48] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[16:03:50] <_joe_> akosiaris: looking into mw1279
[16:05:41] <_joe_> !log restarted HHVM on mw1279, stuck in HPHP::Treadmill::getAgeOldestRequest
[16:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:46] <_joe_> known issue
[16:06:23] it did cause a small 5xx spike as it seems
[16:06:38] <_joe_> that is misc
[16:07:38] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.026 second response time
[16:07:38] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.025 second response time
[16:07:58] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 72493 bytes in 0.463 second response time
[16:10:48] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:14:21] _joe_: the calico node docker image is pure crap.. nothing works...
[16:14:39] <_joe_> akosiaris: might be my fault
[16:14:39] I 've fixed like 2 bugs already and now it is trying to execute a non existant binary
[16:14:45] <_joe_> I rebuilt it
[16:14:51] <_joe_> uhm
[16:15:25] <_joe_> you should honestly try with their one, or I should build the latest version
[16:15:39] I 'll try and build the latest version
[16:15:42] seems more promising
[16:35:07] (03PS1) 10Alexandros Kosiaris: Update calico/node version [puppet] - 10https://gerrit.wikimedia.org/r/329621
[16:36:05] (03PS2) 10Alexandros Kosiaris: Update calico/node version [puppet] - 10https://gerrit.wikimedia.org/r/329621
[16:36:12] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Update calico/node version [puppet] - 10https://gerrit.wikimedia.org/r/329621 (owner: 10Alexandros Kosiaris)
[16:38:48] PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 16 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[cni]
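Several of the alerts in this log ("Text HTTP 5xx reqs/min", the "Misc HTTP 5xx reqs/min" one just above, "carbon-cache too many creates") are graphite-threshold checks: they fetch recent datapoints for a metric and go CRITICAL when more than a given percentage of them exceed the threshold, which is why the messages read "10.00% of data above the critical threshold [1000.0]". A hedged sketch of that calculation; the graphite host, metric name, window and thresholds below are placeholders, not the real configuration:

```python
# Hedged sketch of a graphite-threshold style check ("X% of data above the
# critical threshold"): pull the last few minutes of a metric from graphite's
# render API and compute what fraction of datapoints exceeds the threshold.
# The metric name and graphite host below are placeholders, not the real ones.
import requests

GRAPHITE = "http://graphite1001"                 # placeholder host
TARGET = "reqstats.5xx"                          # placeholder metric name
CRITICAL_THRESHOLD = 1000.0
CRITICAL_PERCENT = 20.0

resp = requests.get("%s/render" % GRAPHITE,
                    params={"target": TARGET, "from": "-10min", "format": "json"},
                    timeout=10)
series = resp.json()[0]["datapoints"]            # list of [value, timestamp]
values = [v for v, _ in series if v is not None]
above = [v for v in values if v > CRITICAL_THRESHOLD]
percent = 100.0 * len(above) / len(values) if values else 0.0

if percent >= CRITICAL_PERCENT:
    print("CRITICAL: %.2f%% of data above the critical threshold [%.1f]"
          % (percent, CRITICAL_THRESHOLD))
else:
    print("OK: Less than %.2f%% above the threshold" % CRITICAL_PERCENT)
```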
[16:39:33] hiii akosiaris :)
[16:39:35] * ottomata waves
[16:41:38] ottomata: o/
[16:42:09] (03PS1) 10Alexandros Kosiaris: calico/node needs certs bind mounted in namespace [puppet] - 10https://gerrit.wikimedia.org/r/329622
[16:42:45] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] calico/node needs certs bind mounted in namespace [puppet] - 10https://gerrit.wikimedia.org/r/329622 (owner: 10Alexandros Kosiaris)
[16:42:52] (03PS2) 10Alexandros Kosiaris: calico/node needs certs bind mounted in namespace [puppet] - 10https://gerrit.wikimedia.org/r/329622
[16:42:56] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] calico/node needs certs bind mounted in namespace [puppet] - 10https://gerrit.wikimedia.org/r/329622 (owner: 10Alexandros Kosiaris)
[16:45:39] ACKNOWLEDGEMENT - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] alexandros kosiaris Known. still missing package cni for kubernetes
[16:45:39] ACKNOWLEDGEMENT - puppet last run on kubernetes1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] alexandros kosiaris Known. still missing package cni for kubernetes
[16:45:39] ACKNOWLEDGEMENT - puppet last run on kubernetes1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] alexandros kosiaris Known. still missing package cni for kubernetes
[16:45:39] ACKNOWLEDGEMENT - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] alexandros kosiaris Known. still missing package cni for kubernetes
[16:51:04] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2906496 (10Paladox) This is the commit that broke it https://gerrit-review.googlesource.com/#/c/86804/ we can revert it and apply it loca...
[17:21:35] (03PS1) 10Alexandros Kosiaris: Allow BGP between kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/329627 [17:24:14] (03CR) 10Alexandros Kosiaris: [C: 032] Allow BGP between kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/329627 (owner: 10Alexandros Kosiaris) [17:58:15] (03PS7) 10Alexandros Kosiaris: Enable CNI plugin on kubernetes::node profile [puppet] - 10https://gerrit.wikimedia.org/r/329607 [17:58:21] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Enable CNI plugin on kubernetes::node profile [puppet] - 10https://gerrit.wikimedia.org/r/329607 (owner: 10Alexandros Kosiaris) [18:07:29] (03PS1) 10Alexandros Kosiaris: calico: Expose puppet keys as well [puppet] - 10https://gerrit.wikimedia.org/r/329630 [18:07:56] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] calico: Expose puppet keys as well [puppet] - 10https://gerrit.wikimedia.org/r/329630 (owner: 10Alexandros Kosiaris) [18:25:58] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [18:26:58] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [18:30:58] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [19:08:08] thanks akosiaris for that config deploy [19:55:08] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:55:28] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [19:56:28] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1001 is OK: OK ferm input default policy is set [20:02:58] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [20:14:58] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [20:24:08] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [20:33:31] 06Operations, 10ops-codfw: rack/setup/install mw2051-mw2060 - https://phabricator.wikimedia.org/T152698#2906698 (10RobH) a:05RobH>03Joe @Papaul: The systems he listed are not currently offline, as they show in monitoring. @Joe: That sounds reasonable to me. Will the servers they are replacing be able t... [20:41:38] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + accounts - https://phabricator.wikimedia.org/T154205#2906714 (10Paladox) this https://gerrit-review.googlesource.com/#/c/86804/ also broke it where you merge a change, restart gerrit, look a... [20:46:28] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:47:18] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [20:47:26] we should make that not page ^ [20:47:27] hm [20:47:37] unless down for a while [20:51:51] it pages only us ottomata, maybe something in the analytics contact group? [20:58:53] (or maybe we could review the alarm itself..) [20:59:00] anyhow, I didn't get why it went down [20:59:08] ahh ok [20:59:09] that's fine then [20:59:11] forgot that [20:59:13] HI! BTW! 
[21:03:06] HELLOO!
[21:03:13] (forgot to say you are right :P)
[21:12:03] so it doesn't seem to be memory related, but the node manager shut down and then restarted.. weird
[21:12:15] yeah i've noticed it does that sometimes
[21:12:15] we have already seen this issue sporadically
[21:12:20] i thought you had found why with that memory bug
[21:13:35] by 'only us' you must not mean ops because I sure did not get a page for that
[21:13:55] which is fine if there are people who get notified
[21:17:07] !log otto@tin Starting deploy [eventstreams/deploy@4098bb4]: (no message)
[21:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:41] apergos: yes sorry, us == analytics in this case :)
[21:17:48] ah ok!
[21:17:59] ottomata: 2016-12-29 20:47:18,473 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15: SIGTERM - ???
[21:18:10] i did restart it
[21:18:19] but,i think after it had already come back up, but i hadn't realized
[21:18:45] ah ok!
[21:18:54] didn't find anything good in the logs, super weird
[21:19:06] !log otto@tin Finished deploy [eventstreams/deploy@4098bb4]: (no message) (duration: 01m 59s)
[21:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:34] ottomata: going afk again, let's keep an eye and see if it re-occurs..
[21:19:38] o/
[21:19:42] laters!
[21:40:48] PROBLEM - puppet last run on elastic1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:47:13] yuvipanda: do you have a moment to talk about video scaling? :P
[21:55:25] hoo: I think _jo.e_ and others have been working on getting more horsepower into the pros video scalers
[22:21:40] ah, yeah [22:22:12] Tho… is it just that you need them transcoded right away? Like, not necessarily on Commons right now? [22:22:35] Nah, that would probably be easy [22:22:52] but we need them for a tutorial page etc. very soon [22:23:03] So it needs to be wikimedia hosted [22:23:15] Yeah, it’s likely to take the scalers like a week to catch up. :/ [22:23:36] webVideoTranscode: 21494 queued; 562 claimed (180 active, 382 abandoned); 0 delayed [22:23:37] yeha [22:23:57] Hundreds of those are ‘very’ large files. [22:24:39] got another database error when creating an account if someone can check it out please - [WGWNAgpAICsAAHi8PIcAAAAK] 2016-12-29 22:24:03: Fatal exception of type "DBQueryError" [22:25:02] Frankly, it would not horrify me if they was to just completely kick everything out of the queue, start fresh, and then shovel shit back in as a backlog. [22:25:26] (it’s not like there isn’t a backlog of a third of a million old failed transcodes anyhow, lol) [22:25:46] sadness :/ [22:26:20] I’m planning on working on shoving those through, once it’s working right again. [22:26:34] (most are small) [22:27:06] :) [22:27:20] Anyway, we would really like to get this resolved [22:27:39] there are a few ways, to bypass the usual queueing/ poping [22:29:11] hoo: A while back there was something like 15k ‘uninitialized’ transcodes, I shoved all those through before this drama came up, it’s ‘mindless while watching a movie’ stuff. [22:29:38] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [22:30:38] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1003 is OK: OK ferm input default policy is set [22:31:00] cool, but not applicable right now :S [22:33:30] Yeah, I more just meant that kicking everything off (so the system is responsive ‘now’) would not be unworkable drama. [22:47:14] I'm giving up for now, posted my code snippet on https://phabricator.wikimedia.org/T154186 [22:47:25] Would be nice if someone were to have a look [22:47:36] bah. Amortias left. The error was "Error: 1213 Deadlock found when trying to get lock;" while setting some echo user properties [22:48:16] Enough of my vacation spend [22:52:08] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:01:39] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [23:02:38] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1003 is OK: OK ferm input default policy is set [23:10:38] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [23:10:57] hmm [23:11:38] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1001 is OK: OK ferm input default policy is set [23:16:38] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [23:17:25] Another app is currently holding the xtables lock. Perhaps you want to use the -w option? 
[23:17:28] interesting
[23:17:39] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1003 is OK: OK ferm input default policy is set
[23:19:44] !log schedule downtime for ferm checks on kubernetes nodes. Some race between kubernetes + ferm, investigating
[23:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:21:08] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[23:23:08] PROBLEM - puppet last run on labvirt1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:25:08] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[23:51:08] RECOVERY - puppet last run on labvirt1007 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[23:52:08] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
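The flapping "Check whether ferm is active" alerts and the "Another app is currently holding the xtables lock" error point at the same race akosiaris logs above: ferm, Docker and the Kubernetes networking pieces all rewrite iptables on these hosts, and whoever loses the lock can momentarily leave the INPUT chain without its DROP default policy. A hedged sketch of what such a check could look like, using iptables' real -w flag to wait for the xtables lock instead of failing; the exit codes and NRPE wiring of the real plugin are not reproduced here:

```python
# Hedged sketch: verify the INPUT chain still has its DROP default policy,
# waiting for the xtables lock (-w) so we do not race Docker/Kubernetes the
# way the flapping check above seems to. Not the production NRPE plugin.
import subprocess
import sys

def input_policy():
    # "iptables -w -S INPUT" prints "-P INPUT DROP" (or ACCEPT) first;
    # -w makes iptables wait for the xtables lock instead of erroring out.
    out = subprocess.check_output(["iptables", "-w", "-S", "INPUT"]).decode()
    first = out.splitlines()[0].split()   # e.g. ['-P', 'INPUT', 'DROP']
    return first[2] if len(first) >= 3 else "UNKNOWN"

policy = input_policy()
if policy == "DROP":
    print("OK ferm input default policy is set")
    sys.exit(0)
print("ERROR ferm input drop default policy not set (policy is %s)" % policy)
sys.exit(2)
```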