[00:00:05] mutante, well it's fixed the problem, so. . . [00:00:16] what did it do? [00:00:18] wut? it did? [00:00:26] how did you check it so quick [00:00:37] I edited a page, and it showed up in the rcstream feed [00:00:49] alex@alex-laptop:~/Dropbox$ python rcstream-client.py | grep labswiki [00:00:49] {u'comment': u'', u'wiki': u'labswiki', u'server_name': u'wikitech.wikimedia.org', u'server_script_path': u'/w', u'timestamp': 1464307192, u'title': u'User:Alex Monk/sandbox', u'namespace': 2, u'server_url': u'https://wikitech.wikimedia.org', u'length': {u'new': 1, u'old': 1}, u'user': u'Alex Monk', u'type': u'edit', u'bot': False, u'id': 826945, u'minor': False, u'revision': {u'new': 579260, u'old': 567719}} [00:00:51] oh, pebcak! [00:00:58] ip6tables ! [00:01:49] ACCEPT tcp 2620:0:861:2:208:80:154:136 ::/0 tcp dpt:6379 [00:01:56] all good, thanks [00:02:30] Krenair: the CentralNotice patch looks like it's doing the (very simple) thing it's supposed to do! [00:02:47] (03CR) 10Dzahn: "before:" [puppet] - 10https://gerrit.wikimedia.org/r/291142 (https://phabricator.wikimedia.org/T136245) (owner: 10Dzahn) [00:02:50] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: rcstream not working for wikitech wiki - https://phabricator.wikimedia.org/T136245#2332584 (10Krenair) 05Open>03Resolved That fixed it, thanks @dzahn! [00:02:53] great [00:03:16] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: rcstream not working for wikitech wiki - https://phabricator.wikimedia.org/T136245#2332586 (10Dzahn) @aude ^ [00:03:57] 06Operations, 06Labs, 10Labs-Infrastructure: rcstream not working for wikitech wiki - https://phabricator.wikimedia.org/T136245#2332592 (10Dzahn) [00:04:17] Krenair: labtestwikitech too? 
[00:04:39] Would most likely still be broken due to lack of IPv6 due to puppet being disabled there [00:05:12] no, i asked andre and he renabled [00:05:16] oh, ok [00:05:19] and then i added it [00:05:22] well I'll have a go then [00:05:40] oh, heh [00:05:46] labtestwikitech is currently giving HTTP 500 [00:06:00] RECOVERY - puppet last run on mw2128 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [00:06:06] ah, well [00:06:11] I'll deal with VE and look into that later [00:06:14] it's the testbox after all [00:06:21] sure, thanks [00:06:25] yep [00:15:09] (03PS3) 10Yuvipanda: Switch to using wikimedia-jessie as base container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/290795 [00:15:11] (03PS3) 10Yuvipanda: Add a simple builder script [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/290793 [00:16:30] bd808: updated ^ with your CR, I've also given you merge rights there [00:19:18] (03CR) 10Yuvipanda: "Also probably merge without it paging for a few days and then omve it to page?" [puppet] - 10https://gerrit.wikimedia.org/r/290681 (https://phabricator.wikimedia.org/T136162) (owner: 10Rush) [00:20:40] PROBLEM - puppet last run on elastic2004 is CRITICAL: CRITICAL: puppet fail [00:28:14] !log purging pk.wikimedia.org from varnish, cache_text eqiad backends [00:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:29:10] PROBLEM - Ensure legal html en.m.wp on en.m.wikipedia.org is CRITICAL: a href=//wikimediafoundation.org/wiki/Privacy_policy title=wmf:Privacy policyPrivacy/a html not found [00:29:38] ooh, missing privacy policy? 
[00:30:11] looks like it's there to me [00:30:15] !log krenair@tin Synchronized php-1.28.0-wmf.3/extensions/WikimediaEvents/extension.json: https://gerrit.wikimedia.org/r/#/c/291143/1 (duration: 00m 28s) [00:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:33:55] ejegg: yea, odd, it seems to be there [00:34:12] but that is on mobile [00:34:14] .m. [00:34:58] oh ya [00:35:25] still seems to be there on m. [00:36:23] this is the exact URL it checks [00:36:55] errr ssl error - bad cert domain at https://en.m.wikpedia.org/ [00:36:58] can that be right? [00:37:05] * ejegg checks on phone [00:37:07] that is missing an i [00:37:09] wikipedia [00:37:11] not wikpedia [00:37:11] haha [00:37:12] thanks [00:37:16] we have wikpedia though :p [00:37:19] i think [00:37:33] it's serving ssl certs with our name... [00:37:36] yeah it's a redirect [00:37:43] on wmf servers [00:38:13] so yeah, en.m.wp.o still seems to have a privacy policy link in the footer [00:38:17] * Krenair images it would be one of the redirect domains served by the LE proposal for hosting those [00:38:20] imagines* [00:38:24] to the foundation policy [00:38:41] 313884 check_command check_legal_html!https://en.m.wikipedia.org/wiki/Main_Page!mobile [00:39:12] it says Main_Page there .. looking at check_legal_html now [00:39:39] still searching where it [00:39:42] is [00:39:53] duration 0d 1h 11m 52s [00:40:12] full status: Privacy html not found [00:40:55] modules/icinga/manifests/monitor/legal.pp: check_command => 'check_legal_html!https://en.m.wikipedia.org/wiki/Main_Page!mobile', [00:41:03] modules/icinga/templates/check_commands/check_legal_html.cfg.erb [00:41:08] 1hr? same screwy chromium as qunit? 
[00:41:23] Found the problem [00:41:28] gah I did those horrendous checks at the behest of legal [00:41:31] Actual HTML: Privacy [00:41:42] Difference is the extra class="extiw" [00:41:56] Check has become outdated [00:42:06] aha, needs laxer regex [00:42:15] don't think it's even a regex [00:42:36] oh yah, just needs regex [00:43:15] I'll fix it up tomorrow, maybe they will let me do something a bit less prone to breakage [00:43:15] modules/icinga/files/check_legal_html.py [00:43:16] thanks Krenair [00:46:00] RECOVERY - puppet last run on elastic2004 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [00:48:23] !log krenair@tin Synchronized php-1.28.0-wmf.3/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.ArticleTargetLoader.js: https://gerrit.wikimedia.org/r/#/c/291145/ (duration: 00m 23s) [00:48:23] !log purging pk.wikimedia.org from varnish, cache_text non-eqiad backends, then frontends [00:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:53:09] mutante, I ran sync-common on labtestweb2001 and it fixed labtestwikitech [00:54:06] Krenair: :) thanks! i added it to scap groups today, a little before the PST noon deploy [00:54:43] yeah we had no scap since then, but we have had smaller deployments, so no wonder it was broken [00:54:43] the pk.wm.org redirect works now [00:54:58] ah! that explains [00:57:10] Right, my VE EL schema is doing what it's supposed to [00:57:20] I'll leave it for a while now to gather data [01:06:56] mutante, krenair@labtestweb2001:~$ telnet rcs1001.eqiad.wmnet 6379 [01:06:56] Trying 2620:0:861:103:10:64:32:148... [01:06:56] Connected to rcs1001.eqiad.wmnet. 
[01:06:59] so that works [01:07:16] *nod* cool [01:08:06] https://labtestwikitech.wikimedia.org/w/index.php?title=User:Labtestalex&oldid=28389 [01:08:17] and success: alex@alex-laptop:~/Dropbox$ python rcstream-client.py | grep labtestwiki [01:08:17] {u'comment': u'Created page with "test"', u'wiki': u'labtestwiki', u'type': u'new', u'server_name': u'labtestwikitech.wikimedia.org', u'server_script_path': u'/w', u'timestamp': 1464311277, u'title': u'User:Labtestalex', u'namespace': 2, u'server_url': u'https://labtestwikitech.wikimedia.org', u'length': {u'new': 4, u'old': None}, u'user': u'Labtestalex', u'patrolled': False, u'bot': False, u'id': 25218, u'minor': False, u'revision': {u'new [01:08:17] ': 28389, u'old': None}} [01:08:22] :) [01:09:04] (I'm done deploying, btw) [01:10:43] a certain irony or makes perfect sense. disablessl3.com does not have ssl .. http://disablessl3.com/ [01:15:36] (03CR) 10BryanDavis: [C: 031] "2 trivial pep8 violations. I don't know how to actually test this." (032 comments) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/290793 (owner: 10Yuvipanda) [02:21:50] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.3) (duration: 07m 50s) [02:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:31:02] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri May 27 02:31:01 UTC 2016 (duration 9m 12s) [02:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:12:38] PROBLEM - RAID on etherpad1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:12:39] PROBLEM - DPKG on etherpad1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:12:48] PROBLEM - configured eth on etherpad1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:12:58] PROBLEM - dhclient process on etherpad1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:13:28] PROBLEM - salt-minion processes on etherpad1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:13:38] PROBLEM - etherpad_lite_process_running on etherpad1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:13:38] PROBLEM - puppet last run on etherpad1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:13:48] PROBLEM - Disk space on etherpad1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:13:48] PROBLEM - Check size of conntrack table on etherpad1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:18:21] (03CR) 10Yuvipanda: "@bd808 I've added more documentation including testing instructions at https://wikitech.wikimedia.org/wiki/Tools_Kubernetes#Docker_Images." [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/290793 (owner: 10Yuvipanda) [04:20:38] (03PS4) 10Yuvipanda: Switch to using wikimedia-jessie as base container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/290795 [04:20:40] (03PS4) 10Yuvipanda: Add a simple builder script [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/290793 [04:21:10] bd808: I fixed the pep8 violations (I'll add a tox.ini file increasing the line length soon :D) and wrote up docs on testing [04:21:34] increasing line length is evil ;) [04:22:13] So will I melt anything if I actually try to follow those instructions? 
[04:22:31] testing things that are hardwired to use the prod resources is scary to me [04:23:01] you are not so much testing as hoping you can undo breakage at that point [04:23:54] PROBLEM - SSH on etherpad1001 is CRITICAL: Server answer [04:27:45] RECOVERY - SSH on etherpad1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0) [04:42:40] bd808: hmm [04:42:46] bd808: so there are two ways around that I think [04:42:52] bd808: one is to add a '--no-push' [04:42:59] bd808: which will just build and then not do anything [04:43:03] bd808: the other is configurable prefixes [04:43:07] no reason to not do both :D [04:44:38] a --dry-run option would validate the hierarchy lookup [04:44:44] maybe that's all that's really needed? [04:44:56] bd808: what do you mean by 'validate the hierarchy lookup'? [04:45:08] bd808: I think I should add actual tests for that, since that's fairly separateable [04:45:15] the inheritance chain bits [04:45:18] bd808: the 'testing' option to me is that it tests that the images build succesfully [04:45:41] like that the dockerfiles aren't busted? [04:45:50] bd808: yeah [04:46:02] bd808: so cherry pick there, build and not push, then merge, then build and push [04:46:18] *nod* [04:46:26] bd808: and if I really wanna test the whole thing, cherry pick there, build and push to a different prefix, then try something elsewhere that uses them etc [04:47:02] yeah. 
that seems useful long term [04:47:47] bd808: yeah [05:01:15] PROBLEM - puppet last run on wtp2004 is CRITICAL: CRITICAL: puppet fail [05:05:02] (03PS5) 10Yuvipanda: Switch to using wikimedia-jessie as base container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/290795 [05:05:04] (03PS5) 10Yuvipanda: Add a simple builder script [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/290793 [05:05:28] bd808: ^ gonna test those now [05:09:15] (03PS6) 10Yuvipanda: Add a simple builder script [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/290793 [05:13:17] hmmm [05:13:27] the prefix doesn't actually change the prefixes *in* the docker files themselves [05:18:00] ARG can't be used in a FROM it looks like [05:19:14] <_joe_> morning [05:22:13] 06Operations, 07HHVM, 13Patch-For-Review, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2332912 (10Joe) Status update - only the following wikis are still being converted/waiting for conversion: - frwiki - svwiki - thwiki - ruwiki Any anomaly o... 
[05:23:00] hello _joe_ [05:28:30] RECOVERY - puppet last run on wtp2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:33:01] PROBLEM - NTP on etherpad1001 is CRITICAL: NTP CRITICAL: No response from NTP server [05:33:03] (03CR) 10Yuvipanda: "So, I made:" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/290793 (owner: 10Yuvipanda) [05:33:20] PROBLEM - SSH on etherpad1001 is CRITICAL: Server answer [05:46:19] RECOVERY - SSH on etherpad1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0) [05:50:20] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [06:08:21] flood of pep8 patches incoming [06:08:47] (03PS2) 10BryanDavis: Add pep8 environment to tox.ini for jenkins job [puppet] - 10https://gerrit.wikimedia.org/r/291138 [06:09:16] poor grrrit-wm couldn't take it [06:10:50] I think this series will make pep8 pass for all of ops/puppet -- https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:pep8,n,z [06:10:59] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:12:00] PROBLEM - puppet last run on mw1101 is CRITICAL: CRITICAL: Puppet has 1 failures [06:12:50] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [06:14:54] bd808: you weren't kidding about flood! 
[06:15:18] It stared innocently and then got a bit wild [06:15:47] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:18:21] * YuviPanda goes to bed now [06:19:09] PROBLEM - SSH on etherpad1001 is CRITICAL: Server answer [06:19:48] (03CR) 10jenkins-bot: [V: 04-1] homedirectorymanager.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291170 (owner: 10BryanDavis) [06:20:01] (03CR) 10jenkins-bot: [V: 04-1] ldapsupportlib.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291171 (owner: 10BryanDavis) [06:20:34] (03CR) 10jenkins-bot: [V: 04-1] openstack: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291172 (owner: 10BryanDavis) [06:20:46] bah. what wasn't that caught locally? [06:21:05] (03CR) 10jenkins-bot: [V: 04-1] swift: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291173 (owner: 10BryanDavis) [06:22:10] (03CR) 10jenkins-bot: [V: 04-1] librenms: Fix PEP8 vilations [puppet] - 10https://gerrit.wikimedia.org/r/291174 (owner: 10BryanDavis) [06:22:21] (03CR) 10jenkins-bot: [V: 04-1] DBUtil.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291175 (owner: 10BryanDavis) [06:22:29] there's going to be flood of these -1's :/ [06:22:55] (03CR) 10jenkins-bot: [V: 04-1] gmond_memcached.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291176 (owner: 10BryanDavis) [06:23:00] RECOVERY - SSH on etherpad1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0) [06:23:54] (03CR) 10jenkins-bot: [V: 04-1] rolematcher.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291177 (owner: 10BryanDavis) [06:25:10] (03CR) 10jenkins-bot: [V: 04-1] ganglia: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291179 (owner: 10BryanDavis) [06:25:15] (03CR) 10jenkins-bot: [V: 04-1] wmfelastic.py: Fix PEP8 violations [puppet] - 
10https://gerrit.wikimedia.org/r/291178 (owner: 10BryanDavis) [06:25:56] (03CR) 10jenkins-bot: [V: 04-1] mailman: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291180 (owner: 10BryanDavis) [06:27:12] (03CR) 10jenkins-bot: [V: 04-1] ircd_stats.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291181 (owner: 10BryanDavis) [06:28:28] (03CR) 10jenkins-bot: [V: 04-1] postgresql.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291182 (owner: 10BryanDavis) [06:29:46] (03CR) 10jenkins-bot: [V: 04-1] pybal: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291183 (owner: 10BryanDavis) [06:29:56] <_joe_> wtf? [06:30:13] <_joe_> bd808: pep8 is overvalued [06:30:32] <_joe_> and I'll -1 any wrap of lines of less than 100 chars [06:30:33] <_joe_> :P [06:30:40] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:48] I don't think I did that [06:31:03] (03CR) 10jenkins-bot: [V: 04-1] salt: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291184 (owner: 10BryanDavis) [06:31:09] there are apparently multiple competing pep8 jobs [06:31:42] <_joe_> bd808: ofc :P [06:31:44] these all passed the tox tests, but there is one busted one for another pep8 checker thus far [06:31:59] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:00] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:00] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:07] I'm waiting til the end to see if that's the only failure [06:32:29] 4 more to go [06:32:35] (03CR) 10jenkins-bot: [V: 04-1] servermon: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291185 (owner: 10BryanDavis) [06:32:49] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:05] (03PS1) 10Alexandros Kosiaris: base::firewall: WIP: Generate an all network subnets entry 
[puppet] - 10https://gerrit.wikimedia.org/r/291189 [06:33:20] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:40] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:49] (03CR) 10Alexandros Kosiaris: [C: 032] rsyslog::receiver: Increase log retention to 90 days [puppet] - 10https://gerrit.wikimedia.org/r/290935 (owner: 10Alexandros Kosiaris) [06:33:55] (03PS2) 10Alexandros Kosiaris: rsyslog::receiver: Increase log retention to 90 days [puppet] - 10https://gerrit.wikimedia.org/r/290935 [06:34:07] (03CR) 10jenkins-bot: [V: 04-1] udp2log: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291186 (owner: 10BryanDavis) [06:34:39] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 3 failures [06:35:50] (03CR) 10jenkins-bot: [V: 04-1] varnish: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291187 (owner: 10BryanDavis) [06:35:58] bd808: _joe_ I think 100 is a nice compromise, although I'm a bit more partial to 120 myself [06:36:31] I'll write all of my new code at 80 thanks [06:37:38] (03CR) 10jenkins-bot: [V: 04-1] wdqs_updater.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291188 (owner: 10BryanDavis) [06:37:45] (03CR) 10Alexandros Kosiaris: [V: 032] rsyslog::receiver: Increase log retention to 90 days [puppet] - 10https://gerrit.wikimedia.org/r/290935 (owner: 10Alexandros Kosiaris) [06:38:47] (03PS2) 10BryanDavis: wdqs_updater.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291188 [06:38:49] (03PS2) 10BryanDavis: udp2log: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291186 [06:38:51] (03PS2) 10BryanDavis: varnish: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291187 [06:38:53] (03PS2) 10BryanDavis: salt: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291184 [06:38:55] (03PS2) 10BryanDavis: servermon: Fix PEP8 violations [puppet] - 
10https://gerrit.wikimedia.org/r/291185 [06:38:57] (03PS2) 10BryanDavis: DBUtil.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291175 [06:38:59] (03PS2) 10BryanDavis: librenms: Fix PEP8 vilations [puppet] - 10https://gerrit.wikimedia.org/r/291174 [06:39:01] (03PS2) 10BryanDavis: swift: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291173 [06:39:03] (03PS2) 10BryanDavis: openstack: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291172 [06:39:05] (03PS2) 10BryanDavis: ldapsupportlib.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291171 [06:39:07] (03PS2) 10BryanDavis: homedirectorymanager.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291170 [06:39:08] the tox job lints the submodules? :/ [06:39:09] (03PS2) 10BryanDavis: pybal: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291183 [06:39:11] (03PS2) 10BryanDavis: postgresql.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291182 [06:39:13] (03PS2) 10BryanDavis: ircd_stats.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291181 [06:39:15] (03PS2) 10BryanDavis: mailman: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291180 [06:39:17] (03PS2) 10BryanDavis: ganglia: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291179 [06:39:19] (03PS2) 10BryanDavis: wmfelastic.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291178 [06:39:20] RECOVERY - puppet last run on mw1101 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:39:21] (03PS2) 10BryanDavis: rolematcher.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291177 [06:39:23] (03PS2) 10BryanDavis: gmond_memcached.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291176 [06:40:40] PROBLEM - SSH on etherpad1001 is CRITICAL: Server answer [06:41:12] (03PS2) 10Alexandros Kosiaris: base::firewall: WIP: Generate 
an all network subnets entry [puppet] - 10https://gerrit.wikimedia.org/r/291189 [06:44:39] RECOVERY - SSH on etherpad1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0) [06:44:52] (03CR) 10Alexandros Kosiaris: "I am wondering why the /etc/prometheus-nginx/ directory. Why not /etc/nginx ?" [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [06:45:19] (03PS4) 10Muehlenhoff: Provide a firejail profile for the image scalers [puppet] - 10https://gerrit.wikimedia.org/r/290696 (https://phabricator.wikimedia.org/T135111) [06:45:41] bd808: you should try the jupyter project sometimes. They don't even enforce removing trailing whitespace [06:45:50] * bd808 shudders [06:45:51] bd808: and most of the people who hate it have given up on trying to fight it [06:45:58] I sneak in cleanups now and then tho [06:46:00] it's pretty frustrating [06:46:03] oh well [06:46:16] bd808: also because it isn't documented anywhere [06:47:48] <_joe_> bd808: I'll take a look at this series of patches of yours today [06:47:52] <_joe_> pinky promise [06:48:22] is there anyone here who could spare a few minutes to talk about Icinga? 
[06:48:31] I try to log in and find 'Exception encountered, of type "Exception"' [06:48:31] In relation to https://phabricator.wikimedia.org/T134782 [06:50:14] jynus: https://phabricator.wikimedia.org/T119736 [06:50:15] "Could not find local user data for JCrespo (WMF)@tawiki" [06:50:29] PROBLEM - SSH on etherpad1001 is CRITICAL: Server answer [06:50:43] yes, but I am trying to login to enwiki [06:51:23] (03PS1) 10BryanDavis: varnishkafka_ganglia.py: Fix PEP8 violations [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/291190 [06:52:20] RECOVERY - SSH on etherpad1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0) [06:52:53] Could not find local user data for JCrespo (WMF)@tawiki on all wikis [06:53:39] apparently, only happening to me on the last 6 hours [06:56:00] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:56:19] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:56:58] (03PS1) 10BryanDavis: kafkatee_ganglia.py: Fix PEP8 violations [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/291191 [06:57:05] ok, bed time [06:57:08] I was able to login with my vandalizing account [06:57:09] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:57:29] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:57:30] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:31] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:58:01] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:17] (03PS3) 10Alexandros Kosiaris: base::firewall: Generate an all network subnets entry 
[puppet] - 10https://gerrit.wikimedia.org/r/291189 [06:58:50] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:14] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "PCC says OK https://puppet-compiler.wmflabs.org/2939/" [puppet] - 10https://gerrit.wikimedia.org/r/291189 (owner: 10Alexandros Kosiaris) [06:59:21] (03PS4) 10Alexandros Kosiaris: base::firewall: Generate an all network subnets entry [puppet] - 10https://gerrit.wikimedia.org/r/291189 [06:59:25] (03CR) 10Alexandros Kosiaris: [V: 032] base::firewall: Generate an all network subnets entry [puppet] - 10https://gerrit.wikimedia.org/r/291189 (owner: 10Alexandros Kosiaris) [07:00:52] (03CR) 10Alexandros Kosiaris: "I 've submitted a trimmed down version of this in https://gerrit.wikimedia.org/r/#/c/291189/ to solve an immediate problem while we work o" [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [07:01:23] (03PS5) 10Muehlenhoff: Provide a firejail profile for the image scalers [puppet] - 10https://gerrit.wikimedia.org/r/290696 (https://phabricator.wikimedia.org/T135111) [07:02:01] 06Operations, 10MediaWiki-Categories, 07HHVM: Broken sorting and multi-page categories for Cyrillic wikis - https://phabricator.wikimedia.org/T136281#2332967 (10Joe) The script is running on ruwiki now, I've clearly been too pessimistic last night. I'll report when it is done. @NickK is the situation any bet... 
[07:04:00] (03CR) 10Muehlenhoff: [C: 032 V: 032] Provide a firejail profile for the image scalers [puppet] - 10https://gerrit.wikimedia.org/r/290696 (https://phabricator.wikimedia.org/T135111) (owner: 10Muehlenhoff) [07:04:43] <_joe_> !log updating HHVM on the remaining hosts (mira, wasat, snapshot1*) [07:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:04:55] (03PS1) 10Alexandros Kosiaris: ferm: Restrict ganglia aggregator a bit more [puppet] - 10https://gerrit.wikimedia.org/r/291192 (https://phabricator.wikimedia.org/T115330) [07:05:10] PROBLEM - SSH on etherpad1001 is CRITICAL: Server answer [07:06:21] (03PS2) 10Alexandros Kosiaris: ferm: Restrict ganglia aggregator a bit more [puppet] - 10https://gerrit.wikimedia.org/r/291192 (https://phabricator.wikimedia.org/T115330) [07:06:27] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] ferm: Restrict ganglia aggregator a bit more [puppet] - 10https://gerrit.wikimedia.org/r/291192 (https://phabricator.wikimedia.org/T115330) (owner: 10Alexandros Kosiaris) [07:07:19] (03PS3) 10Muehlenhoff: Provide a wrapper to invoke convert using firejail [puppet] - 10https://gerrit.wikimedia.org/r/290909 [07:07:32] (03Abandoned) 10Volans: Monitoring: Install vendor specific RAID tool [puppet] - 10https://gerrit.wikimedia.org/r/290717 (https://phabricator.wikimedia.org/T97998) (owner: 10Volans) [07:13:21] (03CR) 10Jcrespo: "Please let's merge this ASAP for production testing." 
[puppet] - 10https://gerrit.wikimedia.org/r/290988 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis) [07:14:41] RECOVERY - SSH on etherpad1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0) [07:17:49] RECOVERY - dhclient process on etherpad1001 is OK: PROCS OK: 0 processes with command name dhclient [07:18:09] RECOVERY - etherpad_lite_process_running on etherpad1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js [07:18:21] RECOVERY - NTP on etherpad1001 is OK: NTP OK: Offset -0.0001047849655 secs [07:18:29] RECOVERY - salt-minion processes on etherpad1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [07:18:40] RECOVERY - DPKG on etherpad1001 is OK: All packages OK [07:18:59] RECOVERY - RAID on etherpad1001 is OK: OK: no RAID installed [07:19:10] RECOVERY - Check size of conntrack table on etherpad1001 is OK: OK: nf_conntrack is 0 % full [07:19:19] RECOVERY - configured eth on etherpad1001 is OK: OK - interfaces up [07:19:30] RECOVERY - Disk space on etherpad1001 is OK: DISK OK [07:19:50] RECOVERY - puppet last run on etherpad1001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [07:40:20] (03CR) 10Muehlenhoff: [C: 032 V: 032] Provide a wrapper to invoke convert using firejail [puppet] - 10https://gerrit.wikimedia.org/r/290909 (owner: 10Muehlenhoff) [07:42:39] (03CR) 10Volans: raid: add monitoring for HP controllers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/291014 (https://phabricator.wikimedia.org/T97998) (owner: 10Faidon Liambotis) [07:50:36] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 3 failures [07:53:20] (03PS6) 10Volans: Cleanup my.cnf by grouping options [puppet] - 10https://gerrit.wikimedia.org/r/286858 (owner: 10Jcrespo) [08:00:32] !log restarted memcached on mc1009 to collect metrics for T129963 [08:00:33] T129963: Update 
memcached package and configuration options - https://phabricator.wikimedia.org/T129963 [08:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:11:08] (03PS7) 10Volans: Cleanup my.cnf by grouping options [puppet] - 10https://gerrit.wikimedia.org/r/286858 (owner: 10Jcrespo) [08:30:11] 06Operations, 10DBA, 13Patch-For-Review: implement performance_schema for mysql monitoring - https://phabricator.wikimedia.org/T99485#2333052 (10jcrespo) The following production eqiad host are pending implementing p_s: ``` db1066.eqiad.wmnet: @@global.performance_schema 0 db1068.eqiad.wmnet: @@g... [08:30:19] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 for Pcoombe - https://phabricator.wikimedia.org/T136343#2333053 (10Pcoombe) Thanks for the info. @ellery, can you confirm which of these I need to look at Banner History and to run your fundraising reports? I think it's all in Hadoop, right? @... [08:32:36] (03PS8) 10Volans: Cleanup my.cnf by grouping options [puppet] - 10https://gerrit.wikimedia.org/r/286858 (owner: 10Jcrespo) [08:42:29] couldn't find HOME environment -- expanding `~' ... [08:42:32] seriously [08:47:51] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:57:29] !log Set sync_binlog=1 on db2011 (m2) and monitoring it. 
T133333 [08:57:30] T133333: Audit new eqiad masters configuration - https://phabricator.wikimedia.org/T133333 [08:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:07:19] RECOVERY - cassandra-c CQL 10.192.16.178:9042 on restbase2007 is OK: TCP OK - 0.033 second response time on port 9042 [09:12:36] 06Operations, 10cassandra: change graphite aggregation function for cassandra 'count' metrics - https://phabricator.wikimedia.org/T121789#2333096 (10fgiunchedi) graphite has no metric type per-se so everything is a gauge, though the `.count` naming convention to indicate counters is statsd's not graphite's [09:14:48] !log enable firejail for image scaling on mw1153 as a canary [09:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:15:32] (03CR) 10Filippo Giunchedi: [C: 031] raid: add a new "raid" fact [puppet] - 10https://gerrit.wikimedia.org/r/290988 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis) [09:18:40] (03PS2) 10Gehel: Increase time before alter for elasticsearch disk space issues [puppet] - 10https://gerrit.wikimedia.org/r/290487 [09:22:19] (03Abandoned) 10Gehel: Make r8s module use base::expose_puppet_certs [puppet] - 10https://gerrit.wikimedia.org/r/276427 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [09:29:07] (03CR) 10Gehel: [C: 031] "Looks good, simple enough" [puppet] - 10https://gerrit.wikimedia.org/r/291178 (owner: 10BryanDavis) [09:31:26] (03PS1) 10Ppchelko: Partially port RESTBaseUpdateJobs to change propagation. [puppet] - 10https://gerrit.wikimedia.org/r/291201 [09:34:13] (03CR) 10Gehel: [C: 031] "Looks good and trivial enough" [puppet] - 10https://gerrit.wikimedia.org/r/291164 (owner: 10BryanDavis) [09:34:28] akosiaris, YuviPanda: could either of you spare some time to help me set up icinga to monitor http for all the ores-web nodes? Or direct me to someone who could? [09:35:44] schana: icinga in labs ? 
it doesn't really work [09:36:16] I was under the impression that I could expand https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/icinga/manifests/monitor/ores.pp [09:36:43] but maybe ping the web nodes directly without going through the load balancer [09:37:35] you can't [09:37:54] the web nodes have private IP addresses in the labs environment [09:38:02] the code you see there is for production [09:38:15] which is why it is only monitoring the publicly available labs endpoint [09:38:15] second thought: can icinga watch the log files on the load balancer and alert when it nginx fails a node? [09:38:35] (03CR) 10Faidon Liambotis: raid: add monitoring for HP controllers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/291014 (https://phabricator.wikimedia.org/T97998) (owner: 10Faidon Liambotis) [09:38:57] (03PS1) 10Muehlenhoff: Enable firejail for image scaling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291202 (https://phabricator.wikimedia.org/T135111) [09:38:57] (does it run with an agent, or are all the checks remote?) 
[09:39:12] the latter in this specific setup [09:39:55] well, it's 2 different administrative environments hence the fact that doing what you suggested is really difficult [09:40:57] 06Operations, 10ops-codfw: ms-be2012.codfw.wmnet: slot=12 dev=sdm failed - https://phabricator.wikimedia.org/T136395#2333141 (10fgiunchedi) 03NEW [09:40:58] https://phabricator.wikimedia.org/T134782 -- we're having some issues with the web nodes failing and icinga being quiet for a while following (I think; I'm not 100% on the details) [09:41:52] (03CR) 10Volans: raid: add monitoring for HP controllers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/291014 (https://phabricator.wikimedia.org/T97998) (owner: 10Faidon Liambotis) [09:42:22] 06Operations, 10ops-codfw: ms-be2012.codfw.wmnet: slot=12 dev=sdm failed - https://phabricator.wikimedia.org/T136395#2333148 (10fgiunchedi) more errors from the failure ```lines=5 May 26 15:32:00 ms-be2012 kernel: [5973299.941810] sd 0:2:12:0: [sdm] May 26 15:32:00 ms-be2012 kernel: [5973299.941834] Result:... [09:43:34] schana: maybe it's possible to monitor the nginx status page [09:44:31] 06Operations, 10ops-codfw: ms-be2012.codfw.wmnet: slot=12 dev=sdm failed - https://phabricator.wikimedia.org/T136395#2333149 (10fgiunchedi) a:03Papaul @papaul also note that this an ssd, not a spinning disk as usual with swift failures [09:44:42] akosiaris: do you know if that is enabled for ORES? [09:46:08] schana: the nginx status page ? no, I don't, but it shouldn't be too difficult to enable [09:48:08] RECOVERY - Disk space on ms-be2012 is OK: DISK OK [09:48:55] akosiaris: would there be any security implications for having it world-readable, or should it be restricted to the icinga host? [09:49:00] (03CR) 10Volans: "jynus: compiler results available here https://puppet-compiler.wmflabs.org/2944/ , they looks sane to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/286858 (owner: 10Jcrespo) [09:50:14] (03PS9) 10Volans: Cleanup my.cnf by grouping options [puppet] - 10https://gerrit.wikimedia.org/r/286858 (owner: 10Jcrespo) [09:51:08] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [09:51:11] (03CR) 10jenkins-bot: [V: 04-1] Cleanup my.cnf by grouping options [puppet] - 10https://gerrit.wikimedia.org/r/286858 (owner: 10Jcrespo) [09:51:28] jenkins disagrees :-) [09:51:39] I just rebased... [09:52:09] rsync: change_dir "/operations-puppet/production/rake-jessie" (in caches) failed: No such file or directory [09:52:09] I think faidon mentioned a similar issue before [09:52:19] he had issues with pep8 AFAIK [09:52:28] (03CR) 10Volans: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/286858 (owner: 10Jcrespo) [09:52:34] no, that too [09:52:55] rsync: change_dir "/operations-puppet/production/rake-jessie" (in caches) failed: No such file or directory (2) [09:53:06] rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1655) [Receiver=3.1.1] [09:53:16] jynus: yep, see above :) [09:54:29] now it passed on a different CI host [09:55:13] (03CR) 10Jcrespo: [C: 031] Cleanup my.cnf by grouping options [puppet] - 10https://gerrit.wikimedia.org/r/286858 (owner: 10Jcrespo) [09:57:22] (03CR) 10Volans: [C: 032] Cleanup my.cnf by grouping options [puppet] - 10https://gerrit.wikimedia.org/r/286858 (owner: 10Jcrespo) [10:01:11] hashar: ^^^ [10:01:30] my.cnf ? 
[10:01:46] jynus: paravoid ah no the rsync thingie, can be ignored sorry :- [10:02:26] was https://phabricator.wikimedia.org/T136261 [10:03:06] we're getting sporadic -1s for no apparent reason [10:03:20] unrelated to the tox-pep8 job [10:03:53] rake-jessie failed above, for instance [10:04:16] here: https://gerrit.wikimedia.org/r/#/c/290999/ pplint-HEAD failed, after 17 minutes of running time, for no apparent reason [10:04:22] ah that one https://integration.wikimedia.org/ci/job/rake-jessie/41984/console [10:04:27] 09:50:59 Gem::RemoteFetcher::UnknownHostError: no such name (https://rubygems.org/gems/diff-lcs-1.2.5.gem) [10:04:32] looks like a DNS resolution failure [10:04:57] !log disable firejail test on mw1153, all went well, but rather revert back since it's Friday and enable this along with the other image scalers on Monday [10:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:11:02] (03CR) 10Faidon Liambotis: [C: 04-1] move/copy ubuntu-cloud.key into openstack/swift modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/290874 (owner: 10Dzahn) [10:11:10] !log restarting MySQL on db2038 to test change 286858 - T133333 [10:11:11] T133333: Audit new eqiad masters configuration - https://phabricator.wikimedia.org/T133333 [10:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:28:52] hi. is work around "Upgrading HHVM to libicu52" over? [10:29:14] <_joe_> Elitre: nope [10:29:21] <_joe_> Elitre: you can ask me directly [10:29:35] <_joe_> there are three wikis left, namely svwiki, thwiki and ruwiki [10:29:50] frwiki finished? [10:29:55] <_joe_> jynus: yes [10:30:09] (03PS1) 10Volans: MariaDB: set the sql-mode only when has a value [puppet] - 10https://gerrit.wikimedia.org/r/291206 (https://phabricator.wikimedia.org/T136398) [10:30:12] <_joe_> Elitre: what wiki were the problems reported for? 
[10:30:29] <_joe_> if there is a phab bug, please add the "Operations" tag to it [10:30:59] I haven't seen any reports: I'm learning about this work just now, and I'll go check these 3-4 wikis you just mentioned. [10:31:29] <_joe_> ruwiki has some errors in the categories, it has already been reported [10:31:55] <_joe_> ukwiki had too, but AFAICS they're now solved, see https://phabricator.wikimedia.org/T136281 [10:33:47] <_joe_> and srwiki seems now correct as well [10:35:32] (03PS2) 10Volans: MariaDB: set the sql-mode only when has a value [puppet] - 10https://gerrit.wikimedia.org/r/291206 (https://phabricator.wikimedia.org/T136398) [10:35:33] <_joe_> I am also keeping an eye on https://phabricator.wikimedia.org/tag/mediawiki-categories/, but no other ticket has appeared [10:38:17] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 39, down: 5, dormant: 0, excluded: 0, unused: 0BRxe-0/1/0: down - hold for TeliaBRae2.0: down - BRae2: down - hold for AMS-IXBRxe-0/1/1: down - hold for AMS-IX1BRxe-0/1/2: down - hold for AMS-IX2BR [10:39:11] <_joe_> paravoid: ^^ should we worry? 
[10:39:18] no [10:39:24] cr2-esams is in prod yet [10:39:26] will be today [10:39:29] (hopefully) [10:39:31] I'm working on it [10:39:38] mark is on his way to knams, then esams [10:39:45] <_joe_> ok so I remembered correctly :) [10:40:02] <_joe_> sorry for the distraction [10:40:28] no worries, my bad for not adding a no-mon there :) [10:41:13] (03CR) 10Volans: "jynus: compiler results looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/291206 (https://phabricator.wikimedia.org/T136398) (owner: 10Volans) [10:41:39] (03CR) 10Jcrespo: [C: 031] MariaDB: set the sql-mode only when has a value [puppet] - 10https://gerrit.wikimedia.org/r/291206 (https://phabricator.wikimedia.org/T136398) (owner: 10Volans) [10:42:44] (03PS3) 10Volans: MariaDB: set the sql-mode only when has a value [puppet] - 10https://gerrit.wikimedia.org/r/291206 (https://phabricator.wikimedia.org/T136398) [10:42:57] (just rebased) [10:44:37] (03CR) 10Volans: [C: 032] MariaDB: set the sql-mode only when has a value [puppet] - 10https://gerrit.wikimedia.org/r/291206 (https://phabricator.wikimedia.org/T136398) (owner: 10Volans) [10:47:18] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:53:10] 06Operations, 10DBA: Physical location SPOF because of database server distribution on a single rack (D1) - https://phabricator.wikimedia.org/T111992#2333247 (10Volans) a:05Volans>03None The above schema for the distribution of the new servers will resolve this issue. Is pending the blocking task(s). 
[10:54:29] PROBLEM - Check size of conntrack table on kafka1013 is CRITICAL: CRITICAL: nf_conntrack is 92 % full [10:55:01] this is probably the conntrack timeout (cc moritzm ) [10:55:56] what the hell [10:56:23] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=34&fullscreen [10:59:38] no net.netfilter.nf_conntrack_tcp_timeout_time_wait is 65 for all of them [11:00:39] RECOVERY - Check size of conntrack table on kafka1013 is OK: OK: nf_conntrack is 74 % full [11:00:40] from conntrack -L on kafka1013 a lot of TIME_WAITs from mw hosts [11:03:32] I'll have a look [11:06:03] so it seems that we dropped packets, but kafka1013 is acting weird from https://grafana.wikimedia.org/dashboard/db/kafka, but on the host I don't see anything exploding [11:16:24] moritzm, joal: https://grafana.wikimedia.org/dashboard/db/kafka-by-topic?from=now-1h&to=now [11:17:24] (03PS1) 10Giuseppe Lavagetto: Add PTR records for mw1261-83 [dns] - 10https://gerrit.wikimedia.org/r/291208 [11:18:00] (03CR) 10jenkins-bot: [V: 04-1] Add PTR records for mw1261-83 [dns] - 10https://gerrit.wikimedia.org/r/291208 (owner: 10Giuseppe Lavagetto) [11:19:37] (03PS2) 10Giuseppe Lavagetto: Add PTR records for mw1261-83 [dns] - 10https://gerrit.wikimedia.org/r/291208 [11:20:06] (03CR) 10jenkins-bot: [V: 04-1] Add PTR records for mw1261-83 [dns] - 10https://gerrit.wikimedia.org/r/291208 (owner: 10Giuseppe Lavagetto) [11:20:59] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 0, down: 2, shutdown: 0 [11:21:11] (03PS3) 10Giuseppe Lavagetto: Add PTR records for mw1261-83 [dns] - 10https://gerrit.wikimedia.org/r/291208 [11:21:20] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 59, down: 1, dormant: 0, excluded: 1, unused: 0BRfxp0: down - BR [11:21:53] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2333351 (10Volans) Count of the missing DBs that needs restart to get the puppet 
certs: - 30 in eqiad over 74 configured with the new certs - 33 in codfw over 65 configured with the new cer... [11:21:59] PROBLEM - Juniper alarms on cr2-knams is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms [11:25:57] (03CR) 10Giuseppe Lavagetto: [C: 032] Add PTR records for mw1261-83 [dns] - 10https://gerrit.wikimedia.org/r/291208 (owner: 10Giuseppe Lavagetto) [11:27:46] !log restarted jmxtrans on kafka10* hosts [11:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:31:27] (03PS1) 10Muehlenhoff: Bump connection table size on Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/291211 [11:31:29] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 59, down: 0, dormant: 0, excluded: 1, unused: 0 [11:32:29] (03CR) 10jenkins-bot: [V: 04-1] Bump connection table size on Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/291211 (owner: 10Muehlenhoff) [11:32:30] RECOVERY - Juniper alarms on cr2-knams is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [11:34:56] (03PS2) 10Muehlenhoff: Bump connection table size on Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/291211 [11:36:38] (03CR) 10jenkins-bot: [V: 04-1] Bump connection table size on Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/291211 (owner: 10Muehlenhoff) [11:36:50] (03CR) 10Dereckson: "Yes, it was." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288582 (https://phabricator.wikimedia.org/T135212) (owner: 10Lokal Profil) [12:13:26] (03CR) 10BBlack: "Giuseppe should double-check this one, in case there's something strange about that json import..." 
[puppet] - 10https://gerrit.wikimedia.org/r/291183 (owner: 10BryanDavis) [12:17:47] (03CR) 10BBlack: [C: 031] varnish: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291187 (owner: 10BryanDavis) [12:22:29] !log Align runtime MySQL configurations on codfw slaves with the my.cnf ones T133333 [12:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:22:45] T133333: Audit MySQL configurations - https://phabricator.wikimedia.org/T133333 [12:27:09] !log CI/Zuul deadlocked quickly due to a dependency set on a repository not known to Zuul [12:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:27:15] fixed [12:29:23] (03PS3) 10Muehlenhoff: Bump connection table size on Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/291211 [12:33:22] 06Operations, 10Analytics: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2333519 (10elukey) [12:39:43] (03CR) 10Filippo Giunchedi: move/copy ubuntu-cloud.key into openstack/swift modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/290874 (owner: 10Dzahn) [12:43:59] (03CR) 10Elukey: [C: 031] "https://puppet-compiler.wmflabs.org/2948/" [puppet] - 10https://gerrit.wikimedia.org/r/291211 (owner: 10Muehlenhoff) [12:47:20] 06Operations, 06Labs, 10Monitoring, 10Tool-Labs: Make icinga-wm report Tools homepage check at #wikimedia-labs, too - https://phabricator.wikimedia.org/T128716#2333557 (10valhallasw) p:05Low>03High [12:49:17] (03CR) 10Muehlenhoff: [C: 032 V: 032] Bump connection table size on Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/291211 (owner: 10Muehlenhoff) [12:49:28] 06Operations, 06Labs, 10Monitoring, 10Tool-Labs: Make icinga-wm report Tools homepage check at #wikimedia-labs, too - https://phabricator.wikimedia.org/T128716#2333567 (10valhallasw) p:05High>03Low [12:51:58] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet 
has 3 failures [12:52:54] 06Operations, 10Ops-Access-Requests: Requesting access to stats and scb for Ladsgroup - https://phabricator.wikimedia.org/T136406#2333571 (10Ladsgroup) [12:53:58] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [12:54:04] 06Operations: Move cp3030+ from OE14 to OE13 in racktables - https://phabricator.wikimedia.org/T136403#2333584 (10Peachey88) [12:54:09] 06Operations, 10Ops-Access-Requests: Requesting access to stats and scb for Ladsgroup - https://phabricator.wikimedia.org/T136406#2333586 (10Ladsgroup) I forgot to mention I signed the NDA ({T134651}) [12:54:13] 06Operations, 10Ops-Access-Requests: Requesting access to stats and scb for Ladsgroup - https://phabricator.wikimedia.org/T136406#2333588 (10Halfak) I'm supporting this request as WMF staff and the lead on the ORES project. Let me know if you need anything from me. [12:57:47] (03CR) 10Filippo Giunchedi: [C: 031] Create raid module to hold RAID monitoring checks [puppet] - 10https://gerrit.wikimedia.org/r/290986 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis) [13:01:48] (03CR) 10Filippo Giunchedi: [C: 04-1] "LGTM, missing install for hpssacli though?" [puppet] - 10https://gerrit.wikimedia.org/r/290999 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis) [13:02:13] godog: it wasn't there before [13:02:25] there is another patch in the series that adds hpssacli + a check [13:04:00] paravoid: I took a look at the series of related CR, they look sane to me, but only a good testing will guarantee that it works fleet wide [13:04:07] yeah [13:04:14] my plan is to merge the first two (raid module + fact) [13:04:25] then let it run for a while and collect the fact in the puppet database [13:04:28] then audit that [13:04:29] and thest the fact [13:04:34] +1 [13:04:35] via e.g. 
servermon [13:05:06] (03CR) 10Filippo Giunchedi: "re: /etc/prometheus-nginx no particular reason, /etc/nginx/prometheus or sth like that would do too" [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [13:05:07] yes, that is my suggestion too [13:05:24] the facter is relatively safe [13:05:25] paravoid: ah ok, will change [13:05:35] I'm doing networking work today [13:05:41] mark is @ esams right now [13:05:45] (03CR) 10Filippo Giunchedi: [C: 031] raid: vary package installation on the RAID installed [puppet] - 10https://gerrit.wikimedia.org/r/290999 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis) [13:06:05] lets deploy it soon, just let's not rush alert creation fleet wide [13:06:10] so I can't divert my attention too much -- if someone wants to merge/babysit the two first be my guest, otherwise I'll deal with it next week [13:06:44] (03Abandoned) 10Filippo Giunchedi: graphite: introduce local carbon-c-relay daemon [puppet] - 10https://gerrit.wikimedia.org/r/289440 (https://phabricator.wikimedia.org/T85451) (owner: 10Filippo Giunchedi) [13:06:46] I can next week [13:06:46] I will want to go soon today [13:07:08] but next week I will be focusing on non-db stuff [13:07:19] nah, I can do it too next wee [13:07:20] k [13:07:27] that's fine :) [13:07:51] BTE [13:07:53] BTW [13:08:15] there was already a fact that could have been useful, which is disk manufacturer [13:08:18] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [13:08:24] but better to have something more reliable [13:08:59] which one? [13:09:16] blockdevice_*_model? [13:09:19] volans, what was the name disk_dev_manufacturer? 
[13:09:29] that one, but vendor [13:09:32] and _vendor [13:09:34] yes blockdevice* [13:09:37] yeah that's pretty useless [13:09:48] I know [13:09:49] for this use case, at least [13:10:03] but it was better heristic than nothing [13:10:10] yours is better than that [13:10:11] (03PS1) 10Matěj Suchánek: Remove no longer used Echo configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291218 (https://phabricator.wikimedia.org/T58037) [13:10:26] we had the check-raid.py heuristics, I could have plugged the other check in there [13:10:43] but it'd still suck in terms of multiple RAID per box [13:10:59] that was another question? [13:11:02] where do we have that? [13:11:08] where do we have what? [13:11:14] multiple RAID per box? [13:11:15] multiple RAIDs [13:11:28] just to not get scared when I see it [13:11:30] labstore1001 has a Dell PERC 800 and software RAID [13:11:47] all swift backends have either a PERC or an HP controller, but also software RAID for their SSDs IIRC [13:12:12] in general, RAID is perhaps even a misnomer for this fact and module [13:12:29] these are disk controllers -- (sometimes) you can use them even without a RAID, but they still deal with health checks [13:12:55] (03PS1) 10Alexandros Kosiaris: network: Split off labs into it's own realm [puppet] - 10https://gerrit.wikimedia.org/r/291219 [13:13:36] (03PS1) 10BBlack: cache_text: cap frontend TTL at 1d [puppet] - 10https://gerrit.wikimedia.org/r/291220 (https://phabricator.wikimedia.org/T124954) [13:14:09] thank you paravoid for working on it [13:14:31] it was really needed [13:15:07] yeah [13:15:11] long overdue [13:15:18] there's more that's needed [13:15:28] better RAID checks for Dells etc. 
[13:15:34] battery checks and whatnot [13:15:40] and also a SMART check would be nice too [13:15:43] yes [13:16:12] (03CR) 10BBlack: [C: 032] cache_text: cap frontend TTL at 1d [puppet] - 10https://gerrit.wikimedia.org/r/291220 (https://phabricator.wikimedia.org/T124954) (owner: 10BBlack) [13:16:18] although I am not sure if the full thing should be alerts [13:16:36] maybe the least important ones be logs [13:17:50] with logs I mean "passive" monitoring, logs, metrics, etc [13:18:08] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [13:20:49] 06Operations, 10Ops-Access-Requests: Requesting access to stats and scb for Ladsgroup - https://phabricator.wikimedia.org/T136406#2333730 (10Ladsgroup) And also I need access to tin to deploy ORES [13:21:38] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [13:26:56] (03PS1) 10Giuseppe Lavagetto: partman: fix recipe for new appservers [puppet] - 10https://gerrit.wikimedia.org/r/291221 [13:28:10] (03PS1) 10Yuvipanda: Add a java base + web container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/291222 (https://phabricator.wikimedia.org/T124903) [13:28:30] valhallasw`cloud: ^ java base images [13:28:34] I think that's the only thing needed/ [13:29:04] and tomcat, probably? [13:32:04] valhallasw`cloud: no, not necessary. [13:32:15] valhallasw`cloud: play has its own webserver, nety [13:32:17] 06Operations, 06Labs, 06Release-Engineering-Team, 10wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech and Gerrit - https://phabricator.wikimedia.org/T133968#2333770 (10lfschenone) [13:32:30] valhallasw`cloud: and in general, I'd highly suggest people use something like jetty/netty in their applications than tomcat. [13:32:37] mm, ok. 
[13:32:48] valhallasw`cloud: towards the end I'll build the backwards compat images that'll work for people currently on tomcat [13:34:14] 06Operations, 10Analytics: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2333775 (10Ottomata) We didn't upgrade to a newer JMXtrans because of a verbose logging bug. Buuut! It looks like it has been fixed? https://github.com/jmxtrans/jmxtrans/issu... [13:34:36] 06Operations, 10Analytics: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2333779 (10Ottomata) > Why do we need to push to statsd rather than directly to graphite since jmxtrans does buffer for us? Good question! Perhaps we don't! [13:37:14] (03PS2) 10Giuseppe Lavagetto: partman: fix recipe for new appservers [puppet] - 10https://gerrit.wikimedia.org/r/291221 [13:37:31] (03CR) 10Giuseppe Lavagetto: [C: 032] partman: fix recipe for new appservers [puppet] - 10https://gerrit.wikimedia.org/r/291221 (owner: 10Giuseppe Lavagetto) [13:37:40] (03CR) 10Giuseppe Lavagetto: [V: 032] partman: fix recipe for new appservers [puppet] - 10https://gerrit.wikimedia.org/r/291221 (owner: 10Giuseppe Lavagetto) [13:40:05] (03CR) 10Ottomata: [C: 032 V: 032] s/etc/druid/middleManager/etc/druid/middlemanager/ in druid-middlemanager.dirs [debs/druid] - 10https://gerrit.wikimedia.org/r/291130 (owner: 10Ottomata) [13:40:11] (03PS1) 10Eevans: enable instance restbase2003-b.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/291225 (https://phabricator.wikimedia.org/T134016) [13:40:44] (03PS1) 10Ema: Remove unused varnishncsa-related code [puppet] - 10https://gerrit.wikimedia.org/r/291226 [13:40:53] could i get someone to merge https://gerrit.wikimedia.org/r/#/c/291225/ for me? (another Cassandra bootstrap) [13:41:26] urandom: sure, I'll merge [13:41:32] thanks! 
[13:41:39] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] enable instance restbase2003-b.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/291225 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [13:42:00] (03PS1) 10Ottomata: Bump to 0.9.0-2 with middlemananger -> middleManager fix [debs/druid] - 10https://gerrit.wikimedia.org/r/291228 (https://phabricator.wikimedia.org/T134503) [13:42:09] _joe_: I'm merging your change too [13:42:18] (03CR) 10Ottomata: [C: 032 V: 032] Bump to 0.9.0-2 with middlemananger -> middleManager fix [debs/druid] - 10https://gerrit.wikimedia.org/r/291228 (https://phabricator.wikimedia.org/T134503) (owner: 10Ottomata) [13:42:30] urandom: {{done}} [13:43:11] godog: thanks man! [13:45:13] 06Operations, 10Analytics: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2333507 (10fgiunchedi) the typical (only?) reasons for pushing to statsd is for aggregation across machines sending the metrics, or aggregation for a particular type of metric... [13:45:15] urandom: np! [13:45:38] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [13:47:06] <_joe_> godog: yeah sorry, I forgot :/ [13:49:13] 06Operations, 06Analytics-Kanban, 10Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2333849 (10elukey) Tried for the first time to query Hadoop data via Hive, so this info will need to be validated, but I run a script to find how many holes... [13:50:37] _joe_: no worries, not dangerous so I just merged [13:54:13] !log Bootstrapping restbase2003-b.codfw.wmnet : T134016 [13:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:54:22] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [13:54:54] (03CR) 10Matěj Suchánek: [C: 04-1] "Something went wrong..." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/291218 (https://phabricator.wikimedia.org/T58037) (owner: 10Matěj Suchánek) [13:55:38] PROBLEM - Apache HTTP on mw1184 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.007 second response time [13:57:37] RECOVERY - Apache HTTP on mw1184 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.039 second response time [13:58:37] (03CR) 10Muehlenhoff: Enable two-factor authentication in sshd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/282160 (owner: 10Muehlenhoff) [14:03:06] (03CR) 10BBlack: [C: 031] Remove unused varnishncsa-related code [puppet] - 10https://gerrit.wikimedia.org/r/291226 (owner: 10Ema) [14:05:07] (03PS2) 10Matěj Suchánek: Remove no longer used Echo configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291218 (https://phabricator.wikimedia.org/T58037) [14:07:30] (03PS2) 10Ema: Remove unused varnishncsa-related code [puppet] - 10https://gerrit.wikimedia.org/r/291226 [14:07:43] (03CR) 10Ema: [C: 032 V: 032] Remove unused varnishncsa-related code [puppet] - 10https://gerrit.wikimedia.org/r/291226 (owner: 10Ema) [14:07:54] 06Operations, 10Traffic, 13Patch-For-Review: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384#2333902 (10BBlack) Today cp3048's at 156G virt + 75G resident and looks pretty stable. So it was a definite improvement, but there's still a lot of waste. As suggested ab... 
[14:08:05] PROBLEM - cassandra-b CQL 10.192.32.135:9042 on restbase2003 is CRITICAL: Connection refused [14:10:12] (03PS3) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [14:10:14] (03PS1) 10Alexandros Kosiaris: network: Move into module [puppet] - 10https://gerrit.wikimedia.org/r/291234 [14:12:37] ACKNOWLEDGEMENT - cassandra-b CQL 10.192.32.135:9042 on restbase2003 is CRITICAL: Connection refused eevans Node is bootstrapping. - The acknowledgement expires at: 2016-05-29 14:12:14. [14:13:09] (03PS1) 10Dereckson: Revert "Enable RC patrol on ta.wikiquote" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291237 (https://phabricator.wikimedia.org/T132868) [14:14:18] (03PS1) 10Ema: Stop varnishncsa [puppet] - 10https://gerrit.wikimedia.org/r/291238 [14:15:40] (03PS4) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [14:15:42] (03PS2) 10Alexandros Kosiaris: network: Move into module [puppet] - 10https://gerrit.wikimedia.org/r/291234 [14:15:44] (03PS2) 10Alexandros Kosiaris: network::constants Split off labs into it's own realm [puppet] - 10https://gerrit.wikimedia.org/r/291219 [14:16:36] (03PS1) 10Yuvipanda: k8s: Decouple kubelet from kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/291239 (https://phabricator.wikimedia.org/T136413) [14:17:14] (03PS2) 10Yuvipanda: k8s: Decouple kubelet from kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/291239 (https://phabricator.wikimedia.org/T136413) [14:17:18] 06Operations, 10MediaWiki-Categories, 07HHVM: Broken sorting and multi-page categories for Cyrillic wikis - https://phabricator.wikimedia.org/T136281#2333931 (10Joe) Script has finished running, and as far as I can see, all pages reported in this ticket are now redered correctly. 
[14:17:54] (03CR) 10jenkins-bot: [V: 04-1] network::constants Split off labs into it's own realm [puppet] - 10https://gerrit.wikimedia.org/r/291219 (owner: 10Alexandros Kosiaris) [14:18:09] <_joe_> finally! [14:18:30] 06Operations, 07HHVM, 13Patch-For-Review, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2333932 (10Joe) All scripts have completed, closing this ticket (at last) [14:18:34] !lgo change-prop deploying 3747ebd [14:18:34] 06Operations, 05Continuous-Integration-Scaling, 07HHVM: Provide a HHVM package for jessie-wikimedia matching version of trusty-wikimedia - https://phabricator.wikimedia.org/T125821#2333937 (10Joe) [14:18:39] 06Operations, 07HHVM, 13Patch-For-Review, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2333933 (10Joe) 05Open>03Resolved [14:18:40] 06Operations, 07HHVM, 07Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - https://phabricator.wikimedia.org/T86081#2333940 (10Joe) [14:19:38] !log change-prop deployed 3747ebd [14:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:21:33] (03PS5) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [14:21:35] (03PS3) 10Alexandros Kosiaris: network::constants Split off labs into it's own realm [puppet] - 10https://gerrit.wikimedia.org/r/291219 [14:22:45] (03PS6) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [14:22:47] (03PS4) 10Alexandros Kosiaris: network::constants Split off labs into it's own realm [puppet] - 10https://gerrit.wikimedia.org/r/291219 [14:23:07] (03CR) 10jenkins-bot: [V: 04-1] network::constants Split off labs into it's own realm 
[puppet] - 10https://gerrit.wikimedia.org/r/291219 (owner: 10Alexandros Kosiaris) [14:23:39] (03PS2) 10Ema: Stop varnishncsa [puppet] - 10https://gerrit.wikimedia.org/r/291238 [14:24:05] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [14:24:06] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [14:24:14] hmm [14:24:17] wonder if that's me [14:24:50] (03CR) 10jenkins-bot: [V: 04-1] network::constants Split off labs into it's own realm [puppet] - 10https://gerrit.wikimedia.org/r/291219 (owner: 10Alexandros Kosiaris) [14:26:26] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 666 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6425064 keys - replication_delay is 666 [14:27:29] 06Operations, 10Ops-Access-Requests: Requesting deployment access (for deploying to scb) for Ladsgroup - https://phabricator.wikimedia.org/T136406#2333952 (10Ladsgroup) [14:28:51] 06Operations, 10Ops-Access-Requests: Requesting deployment access (for deploying to scb) for Ladsgroup - https://phabricator.wikimedia.org/T136406#2333571 (10Ladsgroup) Moving request to stats to another phab card [14:29:13] (03PS1) 10Rush: icinga: check_legal_html improve robustness of check [puppet] - 10https://gerrit.wikimedia.org/r/291242 [14:29:14] 06Operations, 05Continuous-Integration-Scaling, 07HHVM: Provide a HHVM package for jessie-wikimedia matching version of trusty-wikimedia - https://phabricator.wikimedia.org/T125821#2333972 (10hashar) 05stalled>03Resolved a:03Joe Aced by  @Joe ! The remaining bit was matching libicu on Trusty/Jessie wh... 
[14:30:25] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:30:25] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:31:07] (03CR) 10Ema: [C: 031] varnish: mv wikimedia_vcl, netmapper_upd to separate files [puppet] - 10https://gerrit.wikimedia.org/r/290875 (owner: 10Dzahn) [14:31:26] (03PS1) 10Yuvipanda: tools: Fixup k8s bastion role [puppet] - 10https://gerrit.wikimedia.org/r/291243 (https://phabricator.wikimedia.org/T136406) [14:31:48] valhallasw`cloud: ^ is all that's required I think [14:31:51] hmm [14:31:58] and I need to switch the bastions to the new puppetmaster as well [14:32:29] https://gerrit.wikimedia.org/r/#/c/291243/1/modules/role/files/toollabs/deploy-bastion.bash frightens me [14:32:40] valhallasw`cloud: there's a whole folder of 'em [14:32:58] and I don't understand where it's called [14:33:20] valhallasw`cloud: manually by hand [14:33:28] ...oh. [14:33:31] (03PS1) 10Faidon Liambotis: lvs: move BGP sessions from cr2-knams to cr2-esams [puppet] - 10https://gerrit.wikimedia.org/r/291244 [14:33:32] valhallasw`cloud: https://etherpad.wikimedia.org/p/T130972 [14:33:36] can we fpm this? :P [14:33:42] (03CR) 10BBlack: [C: 031] Stop varnishncsa [puppet] - 10https://gerrit.wikimedia.org/r/291238 (owner: 10Ema) [14:33:54] valhallasw`cloud: nope, no debian packages [14:33:57] oh look, I already wrote that in that etherpad :P [14:33:57] (03CR) 10Faidon Liambotis: [C: 032 V: 032] lvs: move BGP sessions from cr2-knams to cr2-esams [puppet] - 10https://gerrit.wikimedia.org/r/291244 (owner: 10Faidon Liambotis) [14:34:06] valhallasw`cloud: yep, and I responded too :P [14:34:40] valhallasw`cloud: also another reason is that we don't want them to be staggered, etc. also using debs in our case provides, IMO, 0 advantages. 
[14:34:52] (03PS3) 10Ema: Stop varnishncsa [puppet] - 10https://gerrit.wikimedia.org/r/291238 [14:34:59] (03CR) 10Ema: [C: 032 V: 032] Stop varnishncsa [puppet] - 10https://gerrit.wikimedia.org/r/291238 (owner: 10Ema) [14:35:12] and scap is not staggered? :P [14:36:06] !log restarting pybal on lvs3003/lvs3004 [14:36:07] valhallasw`cloud: there's no way to currently setup a scap master that doesn't involve bringing in all of the mediawiki puppet code [14:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:36:44] valhallasw`cloud: oh, it was a rhetorical question! hah. [14:36:50] any deployment process on many servers is staggered [14:36:53] valhallasw`cloud: not as much as debs + puppet. [14:37:05] valhallasw`cloud: ok, relative to ensure => latest. [14:37:16] yes, it's clear we need to drop ensure => latest [14:37:30] but having a sane deployment process is independent from choosing to use dpkg or not [14:37:35] oh I totally agree [14:37:40] I am not calling this sane at all [14:38:01] however, attempting to setup scap3 is a multi-month rabbit hole from my pov atm that I don't want to fall into [14:38:17] valhallasw`cloud: Amir1 is doing a lot of work helping the releng team clean it up for ores [14:38:26] valhallasw`cloud: in maybe 3 months I expect the situation to not suck as much as it does now [14:38:42] valhallasw`cloud: discussion about this on https://phabricator.wikimedia.org/T129311 [14:38:43] (03CR) 10jenkins-bot: [V: 04-1] network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [14:40:49] (03PS2) 10Yuvipanda: tools: Fixup k8s bastion role [puppet] - 10https://gerrit.wikimedia.org/r/291243 (https://phabricator.wikimedia.org/T136413) [14:41:03] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting deployment access (for deploying to scb) for Ladsgroup - 
https://phabricator.wikimedia.org/T136406#2333991 (10yuvipanda) ^ wrong task number, sorry [14:41:12] 06Operations, 10Ops-Access-Requests: Requesting deployment access (for deploying to scb) for Ladsgroup - https://phabricator.wikimedia.org/T136406#2333992 (10yuvipanda) [14:41:26] YuviPanda: well, as long as you document the manual tasks somewhere prominent, whatever works works. [14:41:47] valhallasw`cloud: yep, so that's in the etherpad right now, I'll migrate it to the Tools_Kubernetes setup [14:41:56] no, somewhere in the admin guide [14:42:00] valhallasw`cloud: chasemp has done the last few deploys, and I hope you'll find the time to do it sometime [14:42:01] 'how to make a new bastion host' [14:42:11] valhallasw`cloud: the Tools_Kubernetes setup is the admin guide too [14:42:14] valhallasw`cloud: aah, right. [14:42:19] valhallasw`cloud: yes, I'll do that too. [14:42:29] I should move it to make that clearer I guess [14:46:20] (03PS1) 10Faidon Liambotis: Move the backup gw from cr2-knams to cr2-esams [dns] - 10https://gerrit.wikimedia.org/r/291245 [14:46:55] (03CR) 10Faidon Liambotis: [C: 032] Move the backup gw from cr2-knams to cr2-esams [dns] - 10https://gerrit.wikimedia.org/r/291245 (owner: 10Faidon Liambotis) [14:50:04] 06Operations, 10ops-codfw: ms-be2012.codfw.wmnet: slot=12 dev=sdm failed - https://phabricator.wikimedia.org/T136395#2333141 (10Papaul) p:05Triage>03Normal [14:51:00] maybe I can help with Kubernetes [14:51:10] let me take a look [14:51:54] I think we should roll back the release [14:53:03] the impact of the rollback change from GET -> POST seems way higher than expected [14:53:16] 06Operations, 10Ops-Access-Requests: Requesting access to stats for Ladsgroup - https://phabricator.wikimedia.org/T136417#2334046 (10Ladsgroup) [14:53:30] we basically have 0 vandal fighting tools online atm. that's not a good idea going into the weekend. 
[14:53:37] 06Operations, 10Ops-Access-Requests: Requesting access to stats for Ladsgroup - https://phabricator.wikimedia.org/T136417#2334062 (10Ladsgroup) I signed the NDA {T134651} [14:54:29] YuviPanda: I think the bug mentioned in this patch is not correct: https://gerrit.wikimedia.org/r/#/c/291243/ :D [14:55:04] https://phabricator.wikimedia.org/T136406#2333978 [14:55:10] oh, you knew it. my bad [14:56:29] (03CR) 10jenkins-bot: [V: 04-1] tools: Fixup k8s bastion role [puppet] - 10https://gerrit.wikimedia.org/r/291243 (https://phabricator.wikimedia.org/T136413) (owner: 10Yuvipanda) [14:56:37] (03PS7) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [14:56:39] (03PS3) 10Alexandros Kosiaris: network: Move into module [puppet] - 10https://gerrit.wikimedia.org/r/291234 [14:56:41] (03PS5) 10Alexandros Kosiaris: network::constants Split off labs into it's own realm [puppet] - 10https://gerrit.wikimedia.org/r/291219 [14:56:58] deploying https://gerrit.wikimedia.org/r/291246 soon [14:57:58] Krenair: Hey, WRT https://phabricator.wikimedia.org/T134651 Can you add me to the NDA LDAP group so I get access to graphite and grafana-admin? [14:58:05] no [14:58:19] is there anything else needed? [14:58:48] (03CR) 10jenkins-bot: [V: 04-1] network::constants Split off labs into it's own realm [puppet] - 10https://gerrit.wikimedia.org/r/291219 (owner: 10Alexandros Kosiaris) [14:59:06] no [15:00:00] okay :) [15:01:38] I just can't really add people to those groups [15:01:43] I mean, technically, sure. But I shouldn't. 
[15:02:01] !log krenair@tin Synchronized php-1.28.0-wmf.3/extensions/CentralNotice/resources/subscribing: rv due to T136387 (duration: 00m 36s) [15:02:02] T136387: EcmaScript 6 features are not supported in older browsers - https://phabricator.wikimedia.org/T136387 [15:02:06] MatmaRex, ^ [15:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:02:26] Krenair: thanks. i'm updating the tasks [15:02:43] Krenair: okay, thanks [15:03:18] 06Operations, 10cassandra: change graphite aggregation function for cassandra 'count' metrics - https://phabricator.wikimedia.org/T121789#2334095 (10Eevans) >>! In T121789#2332408, @GWicke wrote: > If this is really a gauge, should the cassandra metric reporter perhaps report it as such? @fgiunchedi and I dis... [15:03:21] (03PS8) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [15:03:23] (03PS6) 10Alexandros Kosiaris: network::constants Split off labs into it's own realm [puppet] - 10https://gerrit.wikimedia.org/r/291219 [15:05:12] (03CR) 10jenkins-bot: [V: 04-1] network::constants Split off labs into it's own realm [puppet] - 10https://gerrit.wikimedia.org/r/291219 (owner: 10Alexandros Kosiaris) [15:05:58] 06Operations, 06Analytics-Kanban, 10Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2334115 (10elukey) Modified a bit the script to print host and timestamp related to the sequence number right before the hole. Here some snippets of the res... 
[15:07:37] (03PS9) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [15:07:39] (03PS7) 10Alexandros Kosiaris: network::constants Split off labs into it's own realm [puppet] - 10https://gerrit.wikimedia.org/r/291219 [15:09:02] (03CR) 10jenkins-bot: [V: 04-1] network::constants Split off labs into it's own realm [puppet] - 10https://gerrit.wikimedia.org/r/291219 (owner: 10Alexandros Kosiaris) [15:10:01] (03PS10) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [15:10:02] (03PS8) 10Alexandros Kosiaris: network::constants Split off labs into it's own realm [puppet] - 10https://gerrit.wikimedia.org/r/291219 [15:11:20] (03CR) 10jenkins-bot: [V: 04-1] network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [15:16:26] (03PS11) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [15:17:49] (03CR) 10jenkins-bot: [V: 04-1] network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [15:20:00] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 4 failures [15:20:07] (03CR) 10Andrew Bogott: [C: 031] Add pep8 environment to tox.ini for jenkins job [puppet] - 10https://gerrit.wikimedia.org/r/291138 (owner: 10BryanDavis) [15:20:47] (03CR) 10Andrew Bogott: [C: 031] flake8: Ignore 'module level import not at top of file' error [puppet] - 10https://gerrit.wikimedia.org/r/291161 (owner: 10BryanDavis) [15:21:44] (03PS12) 10Alexandros Kosiaris: network: add $production_networks [puppet] 
- 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [15:21:49] PS12? :) [15:22:12] (03CR) 10Alexandros Kosiaris: [C: 032] servermon: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291185 (owner: 10BryanDavis) [15:22:34] bd808, hashar, at some point in the past (specifically, when I set them up) the pep8 tests allowed for per-directory override of warnings. That's no longer the case, I take it? [15:22:51] still afaik [15:22:57] One jobs does that, the otehr doesn't [15:23:05] the job is supposed to use your python pep8 wrapper [15:23:06] the tox job doesn't seem to care [15:23:09] So we have two different pep8 tests we run? [15:23:26] I had a serie of patches to migrate to just running tox from the root of the repo [15:23:34] and thus get rid of all the .pep8 in sub dirs [15:23:39] (03PS1) 10Faidon Liambotis: Drain esams for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/291251 [15:23:47] but never pushed for it and eventually forgot about it [15:23:51] hashar: I don't understand… that's worse, isn't it? [15:23:53] that's the job my patches fix things for [15:23:57] Since then we have to disable tests repo-wide? 
[15:24:00] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:24:31] Or better, fix the code to match a unified standard :) [15:24:49] (03CR) 10Faidon Liambotis: [C: 032] Drain esams for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/291251 (owner: 10Faidon Liambotis) [15:25:00] https://gerrit.wikimedia.org/r/#/c/244148/ [15:25:10] bd808: ok, but, as your patch series shows… many of our python files are from upstream [15:25:11] that is the patch that added tox.ini [15:25:18] so fixing their pep8 violations introduces lots of weird diffs with upstream [15:25:29] hashar: https://gerrit.wikimedia.org/r/#/q/status:open+topic:pep8+owner:%22BryanDavis+%3Cbdavis%40wikimedia.org%3E%22,n,z [15:25:31] with a few excludes to ignore upstream stuff like ganglia plugins [15:25:40] phhh [15:25:56] andrewbogott: ah. for upstreams maybe we should just add "# flake8: noqa" [15:26:06] bd808: currently that runs pep8 1.4.6 looks like your serie would bump it further ? [15:26:07] bd808: that's what you do here, right? https://gerrit.wikimedia.org/r/#/c/291162/1,publish [15:26:13] (03CR) 10Faidon Liambotis: [V: 032] Drain esams for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/291251 (owner: 10Faidon Liambotis) [15:26:24] (grr) [15:26:34] !log draining esams for network maintenance [15:26:36] andrewbogott: yeah. that magic comment turns the flake8 test off for the file [15:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:26:40] (03CR) 10Hashar: [C: 031] acme_tiny.py: Tell flake8 to ignore file [puppet] - 10https://gerrit.wikimedia.org/r/291162 (owner: 10BryanDavis) [15:27:29] (03CR) 10jenkins-bot: [V: 04-1] network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [15:27:31] bd808: yep, I follow what that patch does. 
But what I'm trying to understand is... [15:27:41] we had a system previously that allowed us to exclude files without actually having to edit them [15:27:57] and… having to actually modify upstream files to make them comply strikes me as a step backwards [15:28:25] You and hashar seem to want to abolish the dir-specific .pep8 files but I'm wondering why [15:28:38] I just went OCD on making a red test green [15:28:41] for sake of simplicity? [15:29:01] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6398155 keys - replication_delay is 0 [15:29:04] but it's less simple [15:29:54] hashar: is there some third way to tell tox 'ignore these files' that doesn't require us to edit the files themselves? [15:30:17] well if you are editing modules/ganglia/files/someplugin.py and there is a .pep8 file in the same dir, you have to remember to run pep8 in modules/ganglia/files/ [15:30:28] if one does it from the root of the repo, the rule would not apply [15:30:50] typical example is one doing a vim modules/ganglia/files/someplugin.py and vim running pep8 in the background [15:31:43] another reason is that we are running solely pep8 and a quite old version :( [15:33:25] hashar: is there some third way to tell tox 'ignore these files' that doesn't require us to edit the files themselves? [15:34:08] (03Restored) 10Hashar: tox entry point to run pep8==1.4.6 [puppet] - 10https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) (owner: 10Hashar) [15:34:15] (03PS13) 10Hashar: tox entry point to run pep8==1.4.6 [puppet] - 10https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) [15:34:17] (03PS3) 10Gehel: Increase time before alter for elasticsearch disk space issues [puppet] - 10https://gerrit.wikimedia.org/r/290487 [15:34:30] ah! 
ok [15:34:43] (03PS13) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [15:34:52] (03CR) 10jenkins-bot: [V: 04-1] Increase time before alter for elasticsearch disk space issues [puppet] - 10https://gerrit.wikimedia.org/r/290487 (owner: 10Gehel) [15:34:59] So — bd808, I realize you may have already lost interest in this :) But I'd prefer that upstream files be excluded via that method. I'll add some notes in gerrit [15:35:32] (03CR) 10Andrew Bogott: "I'd rather we ignore upstream files via a tox rule rather than editing the files directly. For example, as in https://gerrit.wikimedia.o" [puppet] - 10https://gerrit.wikimedia.org/r/291162 (owner: 10BryanDavis) [15:35:43] andrewbogott: yeah so pep8 can read from tox.ini for settings [15:35:43] and one can exclude = */modules/ganglia/files/plugins/ [15:35:47] (03CR) 10jenkins-bot: [V: 04-1] network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [15:36:09] andrewbogott: works for me [15:36:42] * bd808 barfs at "max-line-length = 173" [15:37:00] I guess we also have competing patchsets that use flake8 vs pep8 [15:37:09] akosiaris: for rake jessie, you should be able to run rubocop locally with: bundle install ; bundle exec rubocop [15:37:11] which, I've never understood why both those things exist [15:37:19] 06Operations, 06Analytics-Kanban, 10Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2334176 (10Nuria) > The loss seems to happen around the hour, but I don't have a good idea about the why (logrotate afaik happens daily). You probably have... 
[15:37:22] flake8 is a "better" pep8 [15:37:35] it actually runs pep8 and adds more tests [15:37:36] (03PS14) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [15:37:37] yeah flake8 wraps around both pep8 and pyflakes + some other stuff [15:37:49] the standard we use right now is solely pep8 1.4.6 [15:37:56] so my idea originally was to first switch to use tox [15:38:10] then bump progressively the pep8 version and later switch to flake8 [15:38:24] well my patch chain goes to flake8 at the latest version [15:38:31] (03PS4) 10Gehel: Increase time before alter for elasticsearch disk space issues [puppet] - 10https://gerrit.wikimedia.org/r/290487 [15:38:33] great ) [15:38:47] so we just need to add in the right excludes for upstream things we don't want to mess with [15:38:48] so potentially we can land https://gerrit.wikimedia.org/r/#/c/244148/ and benefit from tox [15:38:57] yeah, so it sounds like we can mostly set aside hashar's patchsets and instead update bd808's to solve my quibble about upstream files [15:38:58] and drop the legacy script that relies on .pep8 files all over the place [15:38:59] yeah [15:39:15] bd808: do you want me to go through and make those changes or do you still have momentum? 
[15:39:20] then use the exclude to drop upstream files, which my patch seems to cover [15:39:30] (03CR) 10jenkins-bot: [V: 04-1] Increase time before alter for elasticsearch disk space issues [puppet] - 10https://gerrit.wikimedia.org/r/290487 (owner: 10Gehel) [15:39:33] and use bd808 patches to bring us to flake8 standard \^/ [15:39:38] andrewbogott: if you know what you want excluded and have time that would be great [15:39:44] ok, I'll have a look [15:39:54] bd808: https://gerrit.wikimedia.org/r/#/c/244148/13/tox.ini,cm :) [15:40:11] (03PS15) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [15:40:31] hashar: I think some of those are just being lazy ;) [15:40:38] yeah [15:40:46] my whole point was to switch to tox [15:40:47] (03PS5) 10Gehel: Increase time before alter for elasticsearch disk space issues [puppet] - 10https://gerrit.wikimedia.org/r/290487 [15:40:47] and refine later [15:40:57] ie ignore exclude or refine some excludes that are too wide [15:41:12] screw that, do it right the first time [15:42:03] that is when you have to split that in small changes that are atomic per puppet module [15:42:08] and hunt ops that can review them [15:42:29] I broke the fixes up by module, so we can jsut figure out which ones shouldn't be linted and replace them with an exclude in the tox.ini [15:43:26] <_joe_> uhgh, puppet failures on the jessie mediawiki host [15:43:33] <_joe_> I am going to call it a week then [15:43:47] <_joe_> it's been quite intense; have a good weekend everyone [15:44:49] (03PS3) 10Yuvipanda: tools: Fixup k8s bastion role [puppet] - 10https://gerrit.wikimedia.org/r/291243 (https://phabricator.wikimedia.org/T136413) [15:44:55] bd808: yup that looks nice :) [15:45:06] anyway time to get kids back home! *wave* [15:45:27] g'night _joe_ [15:45:31] _joe_ buon weekend! 
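The two ways of skipping upstream files discussed above (an `exclude` glob in tox.ini versus a `# flake8: noqa` comment inside the file) can be sketched roughly as follows. This is only an illustration, not how pep8 or flake8 actually implement either feature: the helper names are hypothetical, the matching is plain `fnmatch`, the marker check is simplified to a whole-line comparison, and the file paths in the usage note are made up around the pattern quoted in the channel.

```python
from fnmatch import fnmatch

# Assumption: a single exclude pattern, modelled on the one quoted in the channel.
EXCLUDE_PATTERNS = ["*/modules/ganglia/files/plugins/*"]

# The magic comment mentioned in the channel; real flake8 matching is more lenient.
NOQA_MARKER = "# flake8: noqa"


def is_excluded(path, patterns=EXCLUDE_PATTERNS):
    """Return True if the path matches any glob-style exclude pattern."""
    return any(fnmatch(path, pattern) for pattern in patterns)


def has_noqa_marker(source):
    """Return True if some line of the source is exactly the magic comment."""
    return any(line.strip() == NOQA_MARKER for line in source.splitlines())


def should_lint(path, source, patterns=EXCLUDE_PATTERNS):
    """Lint a file only if it is neither excluded by path nor opted out in-file."""
    return not is_excluded(path, patterns) and not has_noqa_marker(source)
```

With the exclude approach, a call such as `should_lint("./modules/ganglia/files/plugins/someplugin.py", source)` is decided from the path alone, so the upstream file stays byte-identical to upstream. The marker approach needs a one-line edit in every skipped file, which is the diff-with-upstream concern raised by andrewbogott.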
[15:47:17] (03PS1) 10Eevans: enable instance restbase1007-c.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/291252 (https://phabricator.wikimedia.org/T134016) [15:47:26] (03PS1) 10Ema: tlsproxy: trim indentation in localssl.erb [puppet] - 10https://gerrit.wikimedia.org/r/291253 [15:48:52] (03CR) 10Gehel: "puppet compiler does not diff exproted resources, so not that interesting in this case. Still: https://puppet-compiler.wmflabs.org/2963/" [puppet] - 10https://gerrit.wikimedia.org/r/290487 (owner: 10Gehel) [15:49:21] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 629 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6399474 keys - replication_delay is 629 [15:50:45] (03PS3) 10Andrew Bogott: Add pep8 environment to tox.ini for jenkins job [puppet] - 10https://gerrit.wikimedia.org/r/291138 (owner: 10BryanDavis) [15:50:47] (03PS2) 10Andrew Bogott: flake8: Ignore 'module level import not at top of file' error [puppet] - 10https://gerrit.wikimedia.org/r/291161 (owner: 10BryanDavis) [15:50:49] (03PS2) 10Andrew Bogott: acme_tiny.py: Tell flake8 to ignore file [puppet] - 10https://gerrit.wikimedia.org/r/291162 (owner: 10BryanDavis) [15:52:50] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:53:31] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:54:08] bd808: can you confirm that ps2 on https://gerrit.wikimedia.org/r/#/c/291162 does what you hoped for? [15:54:16] 06Operations, 06Analytics-Kanban, 10Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2334236 (10Ottomata) Oh ho ho, check this out. Looking at > `cp1061.eqiad.wmnet 2016-05-27T12:59:59 2514541` ``` ADD JAR /usr/lib/hive-hcatalog... 
[15:55:18] (03PS4) 10Yuvipanda: tools: Fixup k8s bastion role [puppet] - 10https://gerrit.wikimedia.org/r/291243 (https://phabricator.wikimedia.org/T136413) [15:55:24] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.48.137:9042 on restbase1014 is CRITICAL: Connection refused Filippo Giunchedi bootstrapping [15:56:12] (03PS1) 10Ladsgroup: Add ladsgroup user key and data [puppet] - 10https://gerrit.wikimedia.org/r/291255 (https://phabricator.wikimedia.org/T136417) [15:56:30] 06Operations, 10Traffic, 07Browser-Support-Firefox, 07HTTPS: Secure connection failed when attempting to send POST request - https://phabricator.wikimedia.org/T134869#2334257 (10Elvey) Tried it. Disabling http2 as @BBlack /@Thibaut120094 suggested eliminates the error for me too. It seems to occur only if... [15:58:14] (03CR) 10Ladsgroup: "UID is uid of me in labs per https://wikitech.wikimedia.org/wiki/Requesting_shell_access" [puppet] - 10https://gerrit.wikimedia.org/r/291255 (https://phabricator.wikimedia.org/T136417) (owner: 10Ladsgroup) [15:58:26] (03CR) 10Yuvipanda: [C: 032] Add pep8 environment to tox.ini for jenkins job [puppet] - 10https://gerrit.wikimedia.org/r/291138 (owner: 10BryanDavis) [15:58:49] (03CR) 10Yuvipanda: [C: 032] flake8: Ignore 'module level import not at top of file' error [puppet] - 10https://gerrit.wikimedia.org/r/291161 (owner: 10BryanDavis) [15:59:10] (03CR) 10Ottomata: [C: 032] kafkatee_ganglia.py: Fix PEP8 violations [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/291191 (owner: 10BryanDavis) [15:59:19] (03CR) 10Yuvipanda: [C: 032] acme_tiny.py: Tell flake8 to ignore file [puppet] - 10https://gerrit.wikimedia.org/r/291162 (owner: 10BryanDavis) [15:59:36] (03PS2) 10Yuvipanda: apache: PEP8 fixes [puppet] - 10https://gerrit.wikimedia.org/r/291163 (owner: 10BryanDavis) [16:00:00] YuviPanda: I want to re-work a few of those patches, so don't bulk-merge all of them yet unless you already have :) [16:00:18] andrewbogott: nope, I think these are the easy ones right now so 
far. [16:00:25] yep, agreed, thanks [16:00:34] (03CR) 10Ottomata: [C: 032] varnishkafka_ganglia.py: Fix PEP8 violations [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/291190 (owner: 10BryanDavis) [16:00:49] andrewbogott: I'll stop now because the apache2 one makes no sense but I'm afraid it does in some way [16:00:55] andrewbogott: let me know when you're done reworking :D [16:01:00] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [16:01:01] ok [16:01:06] there's not that much that I'm going to change, I don't think [16:01:42] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [16:03:03] YuviPanda: why does the apache one not make sense? Doesn't it just prune dead code? [16:04:43] andrewbogott: looks like, but I wanted to grep to see if that file is being imported by anything and some magic happens somewhere [16:04:45] probably not [16:04:53] !log shutting down ms-fe3001/ms-fe3002 [16:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:04:58] ok [16:04:59] (03PS5) 10Yuvipanda: tools: Fixup k8s bastion role [puppet] - 10https://gerrit.wikimedia.org/r/291243 (https://phabricator.wikimedia.org/T136413) [16:05:16] 06Operations, 10ops-esams, 06DC-Ops: decom amslvs1-4 (dc work) - https://phabricator.wikimedia.org/T87790#2334280 (10mark) amslvs1 and amslvs2 i'm now disconnecting as they're in the way. 
[16:06:33] (03CR) 10Yuvipanda: [C: 032] "Looks to be copypasta from something that *was* threaded" [puppet] - 10https://gerrit.wikimedia.org/r/291163 (owner: 10BryanDavis) [16:06:52] (03PS2) 10Yuvipanda: puppetalert.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291164 (owner: 10BryanDavis) [16:07:05] (03PS16) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [16:07:15] YuviPanda: In another file I found this: [16:07:18] https://www.irccloud.com/pastebin/l17P5Qny/ [16:07:22] which is maybe what should be there? [16:07:28] PROBLEM - Host ms-fe3001 is DOWN: PING CRITICAL - Packet loss = 100% [16:07:38] andrewbogott: only if that function is also being called by something [16:07:44] true [16:07:47] PROBLEM - Host ms-fe3002 is DOWN: PING CRITICAL - Packet loss = 100% [16:07:56] (03PS1) 10EBernhardson: Send wmf.4 search traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291257 (https://phabricator.wikimedia.org/T133124) [16:08:36] (03CR) 10Andrew Bogott: [C: 032] puppetalert.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291164 (owner: 10BryanDavis) [16:08:46] (03CR) 10Faidon Liambotis: [C: 04-1] "I have multiple concerns -- I'll review properly next week." 
[puppet] - 10https://gerrit.wikimedia.org/r/290487 (owner: 10Gehel) [16:09:42] (03CR) 10Andrew Bogott: [C: 032] trebuchet: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291165 (owner: 10BryanDavis) [16:09:44] (03PS17) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [16:09:50] (03PS1) 10Yuvipanda: base: Provide better error messages for service_unit [puppet] - 10https://gerrit.wikimedia.org/r/291259 [16:10:05] bblack: ^ (since I remember you did most of the service_unit work) [16:10:17] could i get someone to merge https://gerrit.wikimedia.org/r/#/c/291252 for me (it's (yet) another Cassandra bootstrap) [16:10:28] RECOVERY - cassandra-c CQL 10.64.48.137:9042 on restbase1014 is OK: TCP OK - 0.001 second response time on port 9042 [16:10:52] Hm, ms-fe3001 and ms-fe3002 down? [16:10:56] yes [16:11:01] why? [16:11:13] (03CR) 10Alexandros Kosiaris: [C: 031] Create raid module to hold RAID monitoring checks [puppet] - 10https://gerrit.wikimedia.org/r/290986 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis) [16:11:23] (03CR) 10Andrew Bogott: [C: 031] diamond: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291166 (owner: 10BryanDavis) [16:11:57] 06Operations, 07HHVM, 13Patch-For-Review, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2334310 (10matmarex) [16:11:59] 06Operations, 10MediaWiki-Categories, 07HHVM: Broken sorting and multi-page categories for Cyrillic wikis - https://phabricator.wikimedia.org/T136281#2334309 (10matmarex) [16:11:59] (03PS2) 10Yuvipanda: base: Provide better error messages for service_unit [puppet] - 10https://gerrit.wikimedia.org/r/291259 [16:12:33] (03CR) 10Andrew Bogott: [C: 031] gmond_jenkins.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291167 (owner: 10BryanDavis) [16:12:36] 
06Operations, 10MediaWiki-Categories, 07HHVM: Broken sorting and multi-page categories for Cyrillic wikis - https://phabricator.wikimedia.org/T136281#2329402 (10matmarex) 05Open>03Resolved @nickk, please reopen if you notice any remaining issues. [16:13:42] (03CR) 10Andrew Bogott: [C: 031] "So much whitespace :(" [puppet] - 10https://gerrit.wikimedia.org/r/291168 (owner: 10BryanDavis) [16:14:30] (03PS2) 10Filippo Giunchedi: enable instance restbase1007-c.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/291252 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [16:14:37] (03CR) 10Andrew Bogott: [C: 031] labs-ip-alias-dump.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291169 (owner: 10BryanDavis) [16:14:39] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] enable instance restbase1007-c.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/291252 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [16:14:50] godog: thank you sir! [16:15:00] (03PS18) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [16:15:10] urandom: np! [16:15:30] (03CR) 10Andrew Bogott: [C: 031] homedirectorymanager.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291170 (owner: 10BryanDavis) [16:15:52] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stats for Ladsgroup - https://phabricator.wikimedia.org/T136417#2334322 (10RobH) a:03Ladsgroup @Ladsgroup: Shell access requires a few things from you in addition to what you've provided: * Please read https://wikitech.wikimedi... 
[16:15:54] (03PS2) 10Andrew Bogott: trebuchet: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291165 (owner: 10BryanDavis) [16:16:07] (03CR) 10jenkins-bot: [V: 04-1] network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [16:16:09] (03PS2) 10Andrew Bogott: diamond: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291166 (owner: 10BryanDavis) [16:16:13] (03PS2) 10Andrew Bogott: gmond_jenkins.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291167 (owner: 10BryanDavis) [16:16:24] (03PS2) 10Andrew Bogott: ganglia mysql.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291168 (owner: 10BryanDavis) [16:17:01] (03PS19) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [16:19:16] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting deployment access (for deploying to scb) for Ladsgroup - https://phabricator.wikimedia.org/T136406#2334326 (10RobH) a:03Ladsgroup This appears to be very similar to T136417 (since its the same user but different groups.) Unfortunately, th... 
[16:20:07] (03CR) 10Andrew Bogott: [C: 032] diamond: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291166 (owner: 10BryanDavis) [16:20:12] (03CR) 10Andrew Bogott: [C: 032] gmond_jenkins.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291167 (owner: 10BryanDavis) [16:20:14] Amir1: hey, lemme know if you have questions about those two access reuqest next steps =] [16:20:19] (03CR) 10Andrew Bogott: [C: 032] ganglia mysql.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291168 (owner: 10BryanDavis) [16:20:32] should be pretty easy and i dont forsee any blockers at this time though [16:23:22] (03CR) 10jenkins-bot: [V: 04-1] ganglia mysql.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291168 (owner: 10BryanDavis) [16:23:26] (03CR) 10jenkins-bot: [V: 04-1] tools: Fixup k8s bastion role [puppet] - 10https://gerrit.wikimedia.org/r/291243 (https://phabricator.wikimedia.org/T136413) (owner: 10Yuvipanda) [16:24:28] PROBLEM - Host mr1-esams is DOWN: CRITICAL - Network Unreachable (91.198.174.247) [16:24:28] PROBLEM - Host mr1-esams IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ffff::1 [16:24:40] (mr1-esams & rest of esams' mgmt is expected) [16:24:47] PROBLEM - Host mr1-esams.oob is DOWN: PING CRITICAL - Packet loss = 100% [16:24:58] PROBLEM - Host csw2-esams.mgmt.esams.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [16:25:18] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/291168 (owner: 10BryanDavis) [16:25:30] (03CR) 10Alexandros Kosiaris: [C: 031] raid: add a new "raid" fact [puppet] - 10https://gerrit.wikimedia.org/r/290988 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis) [16:25:37] PROBLEM - Host asw-esams.mgmt.esams.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [16:26:07] (03PS20) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) 
(owner: 10Faidon Liambotis) [16:26:35] wtf [16:26:36] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Service unit flannel has a systemd script but nothing useful for upstart at /etc/puppet/modules/base/manifests/service_unit.pp:82 on node tools-bastion-03.tools.eqiad.wmflabs [16:26:39] Warning: Not using cache on failed catalog [16:26:41] Error: Could not retrieve catalog; skipping run [16:26:46] but I do have a flannel.upstart.erb [16:26:58] RECOVERY - Host asw-esams.mgmt.esams.wmnet is UP: PING OK - Packet loss = 0%, RTA = 83.35 ms [16:27:17] (03CR) 10jenkins-bot: [V: 04-1] network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [16:27:20] ah [16:27:22] I'm just an idiot [16:27:35] akosiaris: you should probably move the hierazation into a separate commit [16:27:57] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting deployment access (for deploying to scb) for Ladsgroup - https://phabricator.wikimedia.org/T136406#2334347 (10Ladsgroup) @RobH I signed it in "Mar 3 2016, 3:03 AM." [16:28:06] paravoid: indeed. But I am still fighting with the puppet compiler. 
I will as soon as I get something working [16:28:07] RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 83.35 ms [16:28:15] (03PS3) 10Yuvipanda: base: Provide better error messages for service_unit [puppet] - 10https://gerrit.wikimedia.org/r/291259 [16:28:17] (03PS6) 10Yuvipanda: tools: Fixup k8s bastion role [puppet] - 10https://gerrit.wikimedia.org/r/291243 (https://phabricator.wikimedia.org/T136413) [16:28:18] RECOVERY - Host csw2-esams.mgmt.esams.wmnet is UP: PING OK - Packet loss = 0%, RTA = 85.44 ms [16:28:25] (03PS2) 10Andrew Bogott: labs-ip-alias-dump.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291169 (owner: 10BryanDavis) [16:28:32] (03PS3) 10Andrew Bogott: homedirectorymanager.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291170 (owner: 10BryanDavis) [16:28:39] (03PS3) 10Andrew Bogott: ldapsupportlib.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291171 (owner: 10BryanDavis) [16:28:40] omg I did it [16:28:55] http://puppet-compiler.wmflabs.org/2969/carbon.wikimedia.org/ [16:29:08] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting deployment access (for deploying to scb) for Ladsgroup - https://phabricator.wikimedia.org/T136406#2334354 (10RobH) 05Open>03stalled a:05Ladsgroup>03RobH Indeed, and I missed it, my bad. Stealing this back and I'll list it on the op... [16:29:12] damned puppet, but it works! [16:29:38] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stats for Ladsgroup - https://phabricator.wikimedia.org/T136417#2334358 (10RobH) @ladsgroup has pointed out on a related ticket that he already signed L3 (I simply missed it.) [16:30:29] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stats for Ladsgroup - https://phabricator.wikimedia.org/T136417#2334359 (10Ladsgroup) @RobH hey, thanks for your help. I read the page, signed L3 in March 3rd, I will let @halfak and @DarTar know about these. 
and I'm requesting ac... [16:30:35] talking about a hack on a hack on a hack [16:30:38] RECOVERY - Host mr1-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 86.57 ms [16:30:46] (03CR) 10Andrew Bogott: [C: 032] labs-ip-alias-dump.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291169 (owner: 10BryanDavis) [16:30:51] (03CR) 10Andrew Bogott: [C: 032] homedirectorymanager.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291170 (owner: 10BryanDavis) [16:31:01] (03CR) 10Andrew Bogott: [C: 032] ldapsupportlib.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291171 (owner: 10BryanDavis) [16:31:07] RECOVERY - Host mr1-esams.oob is UP: PING OK - Packet loss = 0%, RTA = 82.56 ms [16:31:56] heh, barely a dent [16:32:21] (03PS2) 10EBernhardson: Send wmf.4 search and ttmserver traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291257 (https://phabricator.wikimedia.org/T133124) [16:32:23] (03PS7) 10Yuvipanda: tools: Fixup k8s bastion role [puppet] - 10https://gerrit.wikimedia.org/r/291243 (https://phabricator.wikimedia.org/T136413) [16:33:38] PROBLEM - Juniper alarms on csw2-esams.mgmt.esams.wmnet is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms [16:34:23] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stats for Ladsgroup - https://phabricator.wikimedia.org/T136417#2334382 (10RobH) * statistics-privatedata-users ** Access to stat1002 where private webrequest logs are hosted. This does NOT include hadoop data, if you need that,... 
[16:34:28] !log Align runtime MySQL max_connections on codfw masters with the my.cnf ones T133333 [16:34:29] T133333: Audit MySQL configurations - https://phabricator.wikimedia.org/T133333 [16:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:34:59] !log elasticsearch in codfw: creating jamwiki index [16:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:36:16] jynus: if you could take a moment to look at the questions in https://phabricator.wikimedia.org/T136214 , that would be great. again, no gripes about the queries that were called out in the task description, but that INSERT query blocks a major analysis in Reading, and i still don't quite understand why it was killed [16:37:18] HaeB, I am not asking you to not do queries, or not use mysql [16:37:25] just that we have to be smart about it [16:38:09] PROBLEM - cassandra-c CQL 10.64.0.232:9042 on restbase1007 is CRITICAL: Connection refused [16:38:23] e.g. instead of doing 7-day SELECTs, or 26-hour inserts, there are usually other, apparently less efficient ways to do the same [16:38:34] in smaller chunks [16:38:49] with summary tables, avoiding large filesorts [16:38:53] and more batching [16:39:03] i just explained in detail why the "smaller chunks" idea (which was apparently the rationale for killing it) does not work here [16:39:09] (03PS21) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [16:39:11] (03PS1) 10Alexandros Kosiaris: networks::constants: Hieraize all_network_subnets [puppet] - 10https://gerrit.wikimedia.org/r/291263 [16:39:39] https://phabricator.wikimedia.org/T136214#2334381 [16:39:52] what are summary tables? 
[16:40:07] I will try to take a look at the queries, but honestly, I have thousands of users and I cannot check every app, but there are probably more people who can help besides me, so I am not a bottleneck [16:40:14] (03PS8) 10Yuvipanda: tools: Fixup k8s bastion role [puppet] - 10https://gerrit.wikimedia.org/r/291243 (https://phabricator.wikimedia.org/T136413) [16:40:28] other databases have the WITH keyword [16:41:00] in MySQL you may have to build intermediate tables, and in some cases that can improve performance by avoiding unnecessary dependent subqueries [16:41:35] regarding inserts, probably you can SELECT rows to the application [16:41:45] and then insert them in a second step [16:41:53] that is usually less efficient [16:42:36] but in some cases, with long-running queries, it can avoid long periods of locking and resource usage [16:43:16] there are also tools like EXPLAIN that can help you profile query plans [16:43:28] and choose the more favorable ones [16:43:29] regarding locking, my question at https://phabricator.wikimedia.org/T136214#2332136 was whether it would lock the entire database, or just the table i'm writing to? 
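The "SELECT rows to the application, then insert them in a second step" idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for the MySQL analytics hosts, and the table names (events, staging_summary), the columns, and the helper copy_in_batches are hypothetical, not taken from the task. It also reads the whole result into memory, which a real multi-hour query would probably need to replace with paging by key range.

```python
import sqlite3


def copy_in_batches(conn, select_sql, insert_sql, batch_size=1000):
    """Run a plain SELECT first, then write the rows in short transactions.

    Hypothetical helper: a stand-in for replacing one long INSERT ... SELECT.
    """
    # Step 1: the read finishes on its own, with no insert attached to it.
    rows = conn.execute(select_sql).fetchall()

    # Step 2: write the rows back in small batches, committing after each one
    # so no single transaction stays open for the whole copy.
    copied = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        conn.executemany(insert_sql, batch)
        conn.commit()
        copied += len(batch)
    return copied


def demo():
    # In-memory database with made-up tables, only to exercise the helper.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, duration INTEGER)")
    conn.execute("CREATE TABLE staging_summary (id INTEGER, duration INTEGER)")
    conn.executemany(
        "INSERT INTO events (id, duration) VALUES (?, ?)",
        [(i, i * 10) for i in range(1, 26)],
    )
    conn.commit()

    copied = copy_in_batches(
        conn,
        "SELECT id, duration FROM events WHERE duration >= 100 ORDER BY id",
        "INSERT INTO staging_summary (id, duration) VALUES (?, ?)",
        batch_size=5,
    )
    total = conn.execute("SELECT COUNT(*) FROM staging_summary").fetchone()[0]
    return copied, total
```

As noted in the discussion, this is usually slower overall than a single statement; the possible gain is shorter transactions and less sustained locking, not speed.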
[16:43:42] it depends [16:44:00] ...because in the latter case, if it's just my own ad-hoc table specifically set up for that purpose, it would not matter [16:44:17] in some cases (I am not saying it is the case), while in theory INSERT SELECT should only block your table [16:44:47] if several table engines are mixed it could lead to blocking the original table selected too [16:45:06] this happens frequently on labs and in extreme cases we have to kill the queries [16:45:26] ok, so in that case i would very likely have been the only person using the source table (for the mobilewebsectionusage schema), too [16:45:32] (03PS9) 10Yuvipanda: tools: Fixup k8s bastion role [puppet] - 10https://gerrit.wikimedia.org/r/291243 (https://phabricator.wikimedia.org/T136413) [16:45:36] normally, I would not care much about queries [16:45:47] the resources for analytics are there to be used [16:46:13] but we had too crashes due to memory exhaustion of both analytics servers [16:46:18] *two [16:46:46] I am kindly asking you if there is something you can do to avoid such long-running queries, especially DML [16:46:57] knowing that you are one of the many users [16:47:09] I will be asking the same of other users [16:47:17] not to mention that for the whole duration of the INSERT...SELECT MySQL has to keep track of all the changes in the table from which you select [16:47:51] yes, UNDO records, but I didn't want to go into specifics [16:47:53] it's an eventlogging schema that was removed from production months ago and should be sending very few (if any) events currently [16:48:34] the table itself is not an issue, the question is, can you do something about the 7-day queries? [16:48:37] (03PS22) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [16:48:45] and the 26-hour insert? 
[16:48:56] (03CR) 10Alexandros Kosiaris: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/291255 (https://phabricator.wikimedia.org/T136417) (owner: 10Ladsgroup) [16:49:05] that is what the ticket is about [16:49:13] I do not expect you to act immediately [16:49:44] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service, 13Patch-For-Review: Create functional cluster checks for all services (and have them page!) - https://phabricator.wikimedia.org/T134551#2334443 (10GWicke) I have not gotten any service-global alerts so far, and would expect them to be ve... [16:49:45] i fully understand the resourcing constraints and want to be considerate, but as mentioned on the task it's hard for me (as user) to even find out how much i'm using with one query (i was aware of EXPLAIN though and usually have an idea of the number of rows involved) [16:50:01] but maybe check the query plan, see if you can do a more efficient query [16:50:42] here efficient means using fewer resources, which usually means shorter transactions [16:50:59] as i said on the ticket, i'm not complaining about the 7-day selects that were killed there (although i had needed similarly long queries in the past - but that's another issue) [16:51:06] we really need that INSERT though [16:51:32] you need a 26-hour insert? [16:51:51] you cannot do a 26-hour SELECT, and then insert? [16:52:08] (03PS10) 10Yuvipanda: tools: Fixup k8s bastion role [puppet] - 10https://gerrit.wikimedia.org/r/291243 (https://phabricator.wikimedia.org/T136413) [16:52:25] 06Operations, 10cassandra, 10procurement: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2331203 (10RobH) We don't have spare or older hardware close to the requested specification of single cpu (not stated, assumed) 8-core, 48-64G RAM, 2T SSDs. The large RAM and SS... 
[16:52:31] no i meant that as a shorthand for the query discussed here https://phabricator.wikimedia.org/T136214#2334294 [16:52:31] you cannot split the subquery into a separate query? [16:53:09] (which has the form "insert into... select ...") [16:53:09] RECOVERY - Juniper alarms on csw2-esams.mgmt.esams.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [16:53:12] that is my advice: run the select [16:53:20] then insert, that will make me happy [16:53:45] 06Operations, 10cassandra, 10hardware-requests: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2334450 (10RobH) [16:53:58] we can later iterate if needed [16:54:06] cool, but i'm not sure if i fully understand... what is the exact syntax for that? [16:54:49] 06Operations, 10cassandra, 10hardware-requests: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2331203 (10RobH) We have the old restbase1001-1003 left in eqiad, which exceed the cpu requirement and meet the RAM requirement. They have no disks installed, so they would... [16:54:57] also, maybe I am wrong, but your subquery doesn't look dependent [16:55:04] maybe you can run that independently? [16:55:32] which one, "ordered_data" or "events_with_readnavtime"? [16:55:40] the inner one [16:55:52] (03PS1) 10Volans: Use 0/1 instead of off/on for read_only [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/291265 (https://phabricator.wikimedia.org/T133333) [16:56:00] you mean if it could be written to an intermediate table? 
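The intermediate-table idea raised here (run the independent inner query once, store the result, then query that) might look something like the sketch below. It is an assumption-laden illustration, not the query from the task: sqlite3 stands in for MySQL, ordered_data is borrowed from the chat only as a table name, and raw_events, the columns, and the function materialize_then_query are invented for the example.

```python
import sqlite3


def materialize_then_query(conn):
    """Store an independent subquery's result, index it, then query it."""
    # Step 1: run the inner (non-dependent) query once and keep its result
    # in an intermediate table instead of nesting it inside the outer query.
    conn.execute(
        "CREATE TABLE ordered_data AS "
        "SELECT session_id, MIN(ts) AS first_ts, COUNT(*) AS n_events "
        "FROM raw_events GROUP BY session_id"
    )
    # Step 2: index the intermediate table so later lookups and joins
    # against it do not have to scan it.
    conn.execute("CREATE INDEX idx_ordered_session ON ordered_data (session_id)")
    conn.commit()

    # Step 3: the outer query now reads the small, precomputed table.
    return conn.execute(
        "SELECT session_id, n_events FROM ordered_data "
        "WHERE n_events >= 2 ORDER BY session_id"
    ).fetchall()


def demo():
    # Made-up sample data, only to show the three steps running end to end.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_events (session_id TEXT, ts INTEGER)")
    conn.executemany(
        "INSERT INTO raw_events VALUES (?, ?)",
        [("a", 1), ("a", 5), ("b", 2), ("c", 3), ("c", 4), ("c", 9)],
    )
    conn.commit()
    return materialize_then_query(conn)
```

Each step is a separate, shorter statement, so the work can be inspected (for example with EXPLAIN on the outer query) and rerun piece by piece rather than as one long nested query.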
[16:56:05] if that is true, you can run that in a second step [16:56:13] *first step [16:56:28] (again, I have not checked if it is dependent) [16:56:41] but those are easy tricks to make queries smaller [16:57:18] they may not be faster (a bit slower), but making them smaller will reduce memory usage [16:57:34] you could probably even index those intermediate tables [16:57:41] which makes the outer ones faster [16:58:00] have I given you, HaeB, something to think about, at least? [16:58:10] OK, if you mean writing the result of "ordered_data" into an ad-hoc table in staging, that should work [16:59:07] even with me not knowing what that query does, it is easy to see bad patterns, you will probably find more by yourself [16:59:16] :-) [16:59:39] of course, really appreciate your advice. [16:59:48] check if you can do something about it, and ask your peers for help, etc. [17:00:11] can you still explain what precisely you meant by doing the SELECT first and then doing the INSERT? [17:00:22] (03PS1) 10Thcipriani: Scap3 config for tilerator [puppet] - 10https://gerrit.wikimedia.org/r/291268 (https://phabricator.wikimedia.org/T129146) [17:00:32] (03CR) 10jenkins-bot: [V: 04-1] networks::constants: Hieraize all_network_subnets [puppet] - 10https://gerrit.wikimedia.org/r/291263 (owner: 10Alexandros Kosiaris) [17:00:43] INSERTs hold locks, and are in general more expensive [17:01:22] SELECTing to the application and then inserting just a list of rows is inefficient in speed, but can save (I theorize) some resources [17:01:29] this is not a general rule [17:01:42] especially not for OLTP [17:02:01] but with OLAP, as I said, we have to do some "hacks" for mysql to work well [17:02:18] re "bad patterns": well, as i already said on the ticket, i had not spent a lot of thought yet on whether it's possible to reduce the nesting level from 3 to 2. 
also, ultimately i need to be mindful of whether it's worth me spending 3 hours to shave of 8 hours of a query ;) but i appreciate learning more about these things in general [17:02:26] I have to go, check the documentation I sent you [17:02:37] selecting to the application? [17:02:47] the presentation about query optimization [17:02:52] i will [17:03:08] in the end it could help not only the server [17:03:17] but make your queries much faster [17:04:08] :-) [17:06:32] (03CR) 10jenkins-bot: [V: 04-1] network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [17:09:59] (03PS1) 10Dzahn: swift: remove precise-specific section [puppet] - 10https://gerrit.wikimedia.org/r/291272 [17:10:39] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy [17:10:45] (03CR) 10jenkins-bot: [V: 04-1] tools: Fixup k8s bastion role [puppet] - 10https://gerrit.wikimedia.org/r/291243 (https://phabricator.wikimedia.org/T136413) (owner: 10Yuvipanda) [17:10:52] !log Stop slave, stop mysql and shutdown es2017 and es2019 for hardware maintenance T130702 [17:10:53] T130702: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702 [17:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:15:28] (03CR) 10Dzahn: move/copy ubuntu-cloud.key into openstack/swift modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/290874 (owner: 10Dzahn) [17:20:20] (03PS1) 10Dzahn: nginx: remove jessie conditional for mount [puppet/nginx] - 10https://gerrit.wikimedia.org/r/291278 [17:22:08] (03PS2) 10Dzahn: nginx: remove jessie conditional for mount [puppet/nginx] - 10https://gerrit.wikimedia.org/r/291278 [17:22:58] (03CR) 10Faidon Liambotis: [C: 031] swift: remove precise-specific section [puppet] - 10https://gerrit.wikimedia.org/r/291272 (owner: 10Dzahn) [17:26:19] (03CR) 10Dzahn: [C: 032] swift: remove precise-specific 
section [puppet] - 10https://gerrit.wikimedia.org/r/291272 (owner: 10Dzahn) [17:29:46] 06Operations, 07Need-volunteer, 13Patch-For-Review, 07Tracking: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#2334592 (10Dzahn) [17:29:59] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 57, down: 0, dormant: 0, excluded: 1, unused: 0 [17:32:08] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect, AS6939/IPv6: Active [17:37:19] YuviPanda: service_unit was j.oe not me :) [17:38:39] j e n k i nsssss [17:40:24] there is a puppet compiler job running on * [17:42:19] 06Operations, 10media-storage, 07Tracking: refresh swift hardware in codfw/eqiad (tracking) - https://phabricator.wikimedia.org/T130012#2334629 (10Danny_B) [17:44:08] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 341, down: 10, shutdown: 0 [17:47:57] 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2334681 (10Krenair) [17:51:39] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2334710 (10Volans) es2017 and es2019 were restarted after @Papaul replaced the memory. hardware logs are cleared, the time will tell us if it's fixed. [17:52:01] 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and set up 16 db's db1079-1094 - https://phabricator.wikimedia.org/T135253#2334711 (10Cmjohnson) [17:54:48] 06Operations, 06Analytics-Kanban, 10Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2334727 (10Ottomata) Ah, I was incorrect in my previous comment. The dt is the request timestamp, and the sequence number is not generated until the respon... [17:56:14] AaronSchulz: hiii, yt? 
[18:01:12] (03PS2) 10Rush: icinga: check_legal_html improve robustness of check [puppet] - 10https://gerrit.wikimedia.org/r/291242 [18:01:19] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication services (tracking) - https://phabricator.wikimedia.org/T50930#2334763 (10Danny_B) [18:05:18] (03CR) 10Rush: [C: 032] icinga: check_legal_html improve robustness of check [puppet] - 10https://gerrit.wikimedia.org/r/291242 (owner: 10Rush) [18:05:29] (03CR) 10Rush: [V: 032] icinga: check_legal_html improve robustness of check [puppet] - 10https://gerrit.wikimedia.org/r/291242 (owner: 10Rush) [18:11:17] (03PS2) 10Rush: tools.checker continually watch for webservices [puppet] - 10https://gerrit.wikimedia.org/r/290681 (https://phabricator.wikimedia.org/T136162) [18:15:26] !log restarted zuul due to deadlock issue [18:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:19:38] (03PS3) 10Rush: tools.checker continually watch for webservices [puppet] - 10https://gerrit.wikimedia.org/r/290681 (https://phabricator.wikimedia.org/T136162) [18:22:27] (03CR) 10Rush: [C: 032] tools.checker continually watch for webservices [puppet] - 10https://gerrit.wikimedia.org/r/290681 (https://phabricator.wikimedia.org/T136162) (owner: 10Rush) [18:32:27] (03CR) 10Volans: [C: 032] "Compiler result looks good: https://puppet-compiler.wmflabs.org/2972/silver.wikimedia.org/" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/291265 (https://phabricator.wikimedia.org/T133333) (owner: 10Volans) [18:34:15] (03PS1) 10Volans: MariaDB: use 0/1 instead of off/on for read_only [puppet] - 10https://gerrit.wikimedia.org/r/291299 (https://phabricator.wikimedia.org/T133333) [18:36:39] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.0.232:9042 on restbase1007 is CRITICAL: Connection refused eevans Node is bootstrapping - The acknowledgement expires at: 2016-05-28 18:36:16. 
[18:36:58] (03PS1) 10Cmjohnson: Adding mac addresses for newly racked db1079-db1094 [puppet] - 10https://gerrit.wikimedia.org/r/291304 [18:38:24] (03PS1) 10Faidon Liambotis: lvs: swap cr1-esams/cr2-esams sessions [puppet] - 10https://gerrit.wikimedia.org/r/291305 [18:38:44] (03PS2) 10Cmjohnson: Adding mac addresses for newly racked db1079-db1094 [puppet] - 10https://gerrit.wikimedia.org/r/291304 [18:38:47] (03CR) 10Faidon Liambotis: [C: 032 V: 032] lvs: swap cr1-esams/cr2-esams sessions [puppet] - 10https://gerrit.wikimedia.org/r/291305 (owner: 10Faidon Liambotis) [18:41:16] (03PS1) 10Faidon Liambotis: rancid: add cr2-esams to the list [puppet] - 10https://gerrit.wikimedia.org/r/291306 [18:41:34] (03CR) 10Faidon Liambotis: [C: 032 V: 032] rancid: add cr2-esams to the list [puppet] - 10https://gerrit.wikimedia.org/r/291306 (owner: 10Faidon Liambotis) [18:42:12] (03PS3) 10Cmjohnson: Adding mac addresses for newly racked db1079-db1094 [puppet] - 10https://gerrit.wikimedia.org/r/291304 [18:43:15] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2334865 (10Volans) 05Open>03Resolved Resolving for now. we can re-open it if it will happen again. [18:44:56] (03CR) 10Volans: "Compiler results: https://puppet-compiler.wmflabs.org/2974/" [puppet] - 10https://gerrit.wikimedia.org/r/291299 (https://phabricator.wikimedia.org/T133333) (owner: 10Volans) [18:45:28] (03CR) 10Cmjohnson: [C: 032] Adding mac addresses for newly racked db1079-db1094 [puppet] - 10https://gerrit.wikimedia.org/r/291304 (owner: 10Cmjohnson) [18:46:10] (03PS2) 10Dzahn: swift: remove precise-specific section [puppet] - 10https://gerrit.wikimedia.org/r/291272 [18:46:18] paravoid: I have a change mixed in with yours...feel free to merge with yours...thx [18:47:10] cmjohnson1: done [18:48:11] (03CR) 10Chad: [C: 031] "lgtm, just needs rebase + merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/291172 (owner: 10BryanDavis) [18:52:24] I updated the tests for CentralNotice and fixed the ES6 breakage, if anyone has a chance to review: [18:52:27] https://gerrit.wikimedia.org/r/291264 [18:52:54] Then I got masochistic and made it all pass jscs too: https://gerrit.wikimedia.org/r/291288 [18:53:20] 06Operations, 10ops-esams, 06DC-Ops: Remove unused fibers - https://phabricator.wikimedia.org/T94704#2334895 (10mark) 05Open>03Resolved These fibers have been removed. [18:53:26] lappy needs electrons, back in a bit [19:02:12] (03PS1) 10Faidon Liambotis: Revert "Drain esams for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/291313 [19:02:30] MaxSem: Dereckson I presume the second part of this was done, right? "May 25: Ensure Gerrit:290581 is merged, then run l10nupdate" [19:02:38] er, MatmaRex not MaxSem ^ [19:02:50] * greg-g grrs at too similar of names [19:03:24] probably [19:03:40] greg-g: i imagine by now l10n-update ran a couple times [19:03:58] PROBLEM - puppet last run on elastic1038 is CRITICAL: CRITICAL: puppet fail [19:04:25] greg-g: the messages appear correctly at https://meta.wikimedia.org/wiki/Special:ListGroupRights [19:05:29] (03CR) 10Faidon Liambotis: [C: 032] Revert "Drain esams for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/291313 (owner: 10Faidon Liambotis) [19:06:05] !log un-draining esams [19:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:06:15] MatmaRex: true true :) thanks [19:06:22] * greg-g is just cleaning up [[wikitech:Deployments]] [19:09:18] 06Operations, 06Discovery, 06Maps: Configure monitoring / alerting of Postgresql / redis / ... cluster for maps - https://phabricator.wikimedia.org/T135647#2334963 (10Gehel) a:03Gehel [19:09:45] 06Operations, 06Discovery, 06Maps, 03Discovery-Maps-Sprint: Configure monitoring / alerting of Postgresql / redis / ... 
cluster for maps - https://phabricator.wikimedia.org/T135647#2334965 (10MaxSem) [19:13:07] RECOVERY - Host ms-fe3002 is UP: PING OK - Packet loss = 0%, RTA = 84.36 ms [19:13:44] 06Operations, 06Discovery, 06Maps, 03Discovery-Maps-Sprint: Configure monitoring / alerting of Postgresql / redis / ... cluster for maps - https://phabricator.wikimedia.org/T135647#2334969 (10Yurik) [19:14:58] RECOVERY - Host ms-fe3001 is UP: PING OK - Packet loss = 0%, RTA = 84.23 ms [19:22:12] (03PS4) 10Dzahn: move ubuntu-cloud.key to openstack module [puppet] - 10https://gerrit.wikimedia.org/r/290874 [19:22:37] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6393528 keys - replication_delay is 0 [19:29:19] RECOVERY - puppet last run on elastic1038 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:32:36] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stats for Ladsgroup - https://phabricator.wikimedia.org/T136417#2335059 (10Ladsgroup) I don't think I need hadoop data. If I need it, I would make another request. 
Thanks [19:42:46] greg-g: yes, it's done [19:43:57] (03PS5) 10Dzahn: move ubuntu-cloud.key to openstack module [puppet] - 10https://gerrit.wikimedia.org/r/290874 [19:47:18] * Krinkle is deploying fix for T136375 [19:47:35] greg-g: [19:49:26] !log krinkle@tin Synchronized php-1.28.0-wmf.3/includes/actions/RollbackAction.php: Iba17ce55ff9 (duration: 00m 31s) [19:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:49:50] Krinkle: /me nods, ty [19:49:52] !log krinkle@tin Synchronized php-1.28.0-wmf.3/includes/Linker.php: Iba17ce55ff9 (duration: 00m 25s) [19:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:03:04] (03CR) 10Andrew Bogott: [C: 031] move ubuntu-cloud.key to openstack module [puppet] - 10https://gerrit.wikimedia.org/r/290874 (owner: 10Dzahn) [20:06:41] ottomata: you're asking about the fixme in setVisibility()? [20:07:17] AaronSchulz: partly, yeah [20:07:50] on the whole i'm asking if those changes sound sane or doable to you, wanted to get a MW person to at least think they were ok before we tried making them [20:08:46] but ja AaronSchulz for that one [20:08:57] it seems to me the call to updateLog won't have anything useful for oldBits and newBits [20:09:11] because those could be different for each of the ids in the $idsForLog [20:11:42] (03CR) 10Dzahn: [C: 032] move ubuntu-cloud.key to openstack module [puppet] - 10https://gerrit.wikimedia.org/r/290874 (owner: 10Dzahn) [20:15:02] 06Operations, 10Incident-20160126-WikimediaDomainRedirection, 10Monitoring, 13Patch-For-Review: add icinga and watchmouse https checks for content on commons. or other wikimedia.org sites - https://phabricator.wikimedia.org/T124812#2335159 (10Dzahn) I added the same type of check to "watchmouse" too: http... [20:17:23] ottomata: anyway, ArticleRevisionVisibilitySet could have an oldfields and newfields map of [id => flags]. The others seems fine too. 
[20:20:52] hm, AaronSchulz aye ok, so we could modify ArticleRevisionVisibilitySet to take a map of ids -> oldfields,newfields? [20:21:50] i suppose we'd also have to modify RevDelList to either pass that map to doPostCommitUpdates, or save it on the object [20:22:23] 06Operations, 10Incident-20160126-WikimediaDomainRedirection, 10Monitoring, 13Patch-For-Review: add icinga and watchmouse https checks for content on commons. or other wikimedia.org sites - https://phabricator.wikimedia.org/T124812#2335168 (10Dzahn) http://status.wikimedia.org/8777/438553/https-content---c... [20:28:06] (03PS1) 10Dzahn: icinga: make commonts content check critical (paging) [puppet] - 10https://gerrit.wikimedia.org/r/291347 (https://phabricator.wikimedia.org/T124812) [20:29:18] (03PS2) 10Dzahn: icinga: make commons content check critical (paging) [puppet] - 10https://gerrit.wikimedia.org/r/291347 (https://phabricator.wikimedia.org/T124812) [20:30:45] (03PS3) 10Dzahn: icinga: make commons content check critical (paging) [puppet] - 10https://gerrit.wikimedia.org/r/291347 (https://phabricator.wikimedia.org/T124812) [20:31:14] (03PS4) 10Dzahn: icinga: make commons content check critical (paging) [puppet] - 10https://gerrit.wikimedia.org/r/291347 (https://phabricator.wikimedia.org/T124812) [20:31:41] (03CR) 10Dzahn: [C: 032] icinga: make commons content check critical (paging) [puppet] - 10https://gerrit.wikimedia.org/r/291347 (https://phabricator.wikimedia.org/T124812) (owner: 10Dzahn) [20:39:48] ottomata: seems fine [20:40:15] ok! 
thanks AaronSchulz, appreciate the vote of confidence :) [20:43:36] (03PS1) 10Alex Monk: Make VE RB URLs domain-relative [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291349 (https://phabricator.wikimedia.org/T135171) [20:49:03] (03PS17) 10Andrew Bogott: MySQL backend for storing roles / hiera data for labs [puppet] - 10https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412) (owner: 10Yuvipanda) [20:50:02] (03PS1) 10GWicke: Use domain-relative URL for client-side RESTBase requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291350 (https://phabricator.wikimedia.org/T135171) [20:52:31] (03CR) 10Dzahn: "compiler finished with this: http://puppet-compiler.wmflabs.org/2971/" [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [20:52:36] (03CR) 10GWicke: [C: 031] Make VE RB URLs domain-relative [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291349 (https://phabricator.wikimedia.org/T135171) (owner: 10Alex Monk) [20:53:25] (03Abandoned) 10GWicke: Use domain-relative URL for client-side RESTBase requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291350 (https://phabricator.wikimedia.org/T135171) (owner: 10GWicke) [20:55:16] (03PS3) 10Andrew Bogott: Allow horizon to query the labs puppetmaster for a list of classes [puppet] - 10https://gerrit.wikimedia.org/r/284103 [20:55:46] (03CR) 10Jforrester: [C: 031] "Worth trying." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291349 (https://phabricator.wikimedia.org/T135171) (owner: 10Alex Monk) [20:58:12] 06Operations, 10Incident-20160126-WikimediaDomainRedirection, 10Monitoring, 13Patch-For-Review: add icinga and watchmouse https checks for content on commons. 
or other wikimedia.org sites - https://phabricator.wikimedia.org/T124812#2335317 (10Dzahn) 05Open>03Resolved a:03Dzahn [20:58:28] 06Operations, 10Incident-20160126-WikimediaDomainRedirection, 10Monitoring: add icinga and watchmouse https checks for content on commons. or other wikimedia.org sites - https://phabricator.wikimedia.org/T124812#1966725 (10Dzahn) [21:12:22] Weird https://no.wiktionary.org/wiki/MediaWiki:Mainpage points to Project:Forside, and yet the sidebar and logo point to "Forside" (ignoring the local override) [21:12:32] legoktm: Reedy: Any idea? [21:13:32] Krinkle: is that a regression? [21:13:38] Not sure.. [21:13:41] Same on cli [21:13:50] wfMessage('mainpage')->text() Forside [21:13:56] and inContentLanguage() [21:14:07] and inLanguage('no') [21:14:19] message cache corruption? [21:14:20] Tried editing and puring [21:14:25] purging [21:14:47] um, does {{ns:Project}} even work in that message? [21:16:14] I have to run a quick errand irl, I'll be back in 10 [21:16:51] legoktm: Tried Wiktionary: too, but same thing [21:16:55] legoktm: Yeah, it works fine. [21:16:57] (03CR) 10Andrew Bogott: [C: 031] openstack: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291172 (owner: 10BryanDavis) [21:17:12] (03PS3) 10Andrew Bogott: openstack: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291172 (owner: 10BryanDavis) [21:17:22] * Krinkle Signs off [21:19:12] (03CR) 10Andrew Bogott: [C: 031] swift: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291173 (owner: 10BryanDavis) [21:19:17] (03PS3) 10Andrew Bogott: swift: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291173 (owner: 10BryanDavis) [21:19:17] 06Operations, 10Wikimedia-Site-requests, 10Wikimedia-Video: Upload the Wikimania 2014 videos to Commons - https://phabricator.wikimedia.org/T106038#2335361 (10Dzahn) volunteer needed to split the videos into smaller chunks then... 
[21:19:28] (03CR) 10Andrew Bogott: [C: 032] openstack: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291172 (owner: 10BryanDavis) [21:19:56] 06Operations, 06Labs, 10Wikimedia-Video, 07Need-volunteer: Upload the Wikimania 2014 videos to Commons - https://phabricator.wikimedia.org/T106038#2335362 (10Dzahn) [21:20:46] 06Operations, 06Labs, 10Wikimedia-Video, 07Need-volunteer: Upload the Wikimania 2014 videos to Commons - https://phabricator.wikimedia.org/T106038#1457394 (10Dzahn) removed outdated "site-requests" tag. added Need-volunteer and Labs for attention and because the files are there. [21:20:50] (03CR) 10Andrew Bogott: [C: 031] librenms: Fix PEP8 vilations [puppet] - 10https://gerrit.wikimedia.org/r/291174 (owner: 10BryanDavis) [21:21:41] (03CR) 10Andrew Bogott: [C: 031] DBUtil.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291175 (owner: 10BryanDavis) [21:22:00] (03PS3) 10Andrew Bogott: librenms: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291174 (owner: 10BryanDavis) [21:22:24] (03CR) 10jenkins-bot: [V: 04-1] librenms: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291174 (owner: 10BryanDavis) [21:26:29] 06Operations, 06Discovery, 06Maps, 07Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2335368 (10MaxSem) [21:30:09] (03PS18) 10Andrew Bogott: MySQL backend for storing roles / hiera data for labs [puppet] - 10https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412) (owner: 10Yuvipanda) [21:36:32] legoktm: do you remember which repo this was in ? https://phabricator.wikimedia.org/T116819 [21:36:43] is that mw-core? [21:40:47] mutante: integration/config I think [21:41:19] legoktm: ah! 
ok, looking at that old ticket [21:41:25] paladox tries something [21:43:19] hmm, yea, the repo seems right but still [21:43:22] Could not fetch review information for change 154757 [21:45:25] mutante: I think people lock it down for security reasons. But I don't think it is even possible the easy way for gerrit since it is more open than closed. [21:45:34] In Phabricator you can do it the easy way. [21:46:04] mutante: hasharAway says it may be a draft patch. [21:46:13] mutante do you have access to the gerrit database? [21:46:29] probably yea..if i find it [21:46:44] Ok [21:46:45] i'll take a look [21:47:24] mutante thanks [21:47:43] mutante: Mine has started to replicate on phabricator [21:47:56] Takes forever to delete because of all refs/changes/ [21:48:11] Plus causes it to slow down for mw core due to me not having enough ram. [21:48:40] paladox: you can stop the mw core part, it's in integration/config , much smaller [21:48:54] mutante: Oh yeah, forgot. [21:49:04] mutante i can clone integration. [21:49:15] http://www.test-random-wikisaur.tk/ [21:49:35] ok! let's see .. heh. tk domain [21:50:32] Yep it's free. Plus bt seems to be really good since i can run my own hosting. [21:51:23] 55mbps up from 40mbps. [21:51:34] yea, i once used one because they were free, a really long time ago, it reminded me [21:51:50] Oh :) [21:52:29] many of us did :) [21:53:21] mutante http://www.test-random-wikisaur.tk/diffusion/9/ [21:53:30] daemons cause my pc to be really slow [21:53:52] but I'm running ubuntu and windows at the same time [21:54:11] No vagrant and no virtual [21:54:30] https://msdn.microsoft.com/en-us/commandline/wsl/about [21:55:15] wait, this is phabricator on ubuntu on windows ? :) [21:55:21] mutante yes [21:55:26] and you are importing our CI stuff now? [21:55:28] hahah, nice [21:55:31] Yep [21:55:59] mutante: There's a hack that allows you to run linux desktop apps through it [21:56:17] They need to fix nginx which currently won't start.
[21:57:23] mutante: ubuntu on windows bash is the actual image of ubuntu. [21:57:44] Microsoft did not change anything and are live translating it into windows code. [22:13:47] (03PS4) 10BryanDavis: librenms: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291174 [22:14:50] (03PS3) 10BryanDavis: wdqs_updater.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291188 [22:15:22] (03PS3) 10BryanDavis: udp2log: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291186 [22:15:51] (03PS3) 10BryanDavis: salt: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291184 [22:16:18] (03PS3) 10BryanDavis: postgresql.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291182 [22:18:06] (03Abandoned) 10Ladsgroup: grafana: give access to "wikidev" LDAP memebers [puppet] - 10https://gerrit.wikimedia.org/r/288616 (owner: 10Ladsgroup) [22:18:32] (03PS3) 10BryanDavis: ircd_stats.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291181 [22:19:09] bd808: hey, I have some flake8 patches: https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+owner:%22Ladsgroup+%253Cladsgroup%2540gmail.com%253E%22,n,z [22:19:20] (03PS3) 10BryanDavis: mailman: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291180 [22:19:26] it would be great if you take a look at them [22:20:02] I would be more than happy if you want more hands to the job [22:20:37] I think I have them all fixed. I'm just breaking up the epic chain of patches to make merges easier. -- https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:pep8,n,z [22:21:40] I'm not a root so I have to bargain for reviews just like you [22:21:50] awesome, bd808. Do you think it would be better to switch to flake8 instead of pep8? [22:22:19] flake8 can expose bugs too. 
I found some bugs in puppet because of flake8 [22:23:06] I did that at the start of the chain with https://gerrit.wikimedia.org/r/#/c/291138/ which is merged now [22:23:27] the test still says pep8 but it's running flake8 [22:23:52] (03PS3) 10BryanDavis: ganglia: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291179 [22:24:24] (03PS3) 10BryanDavis: rolematcher.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291177 [22:24:53] (03PS3) 10BryanDavis: gmond_memcached.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291176 [22:25:20] (03PS3) 10BryanDavis: wmfelastic.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291178 [22:25:30] (03PS4) 10Yuvipanda: base: Provide better error messages for service_unit [puppet] - 10https://gerrit.wikimedia.org/r/291259 [22:25:53] (03PS3) 10BryanDavis: pybal: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291183 [22:26:24] (03PS3) 10BryanDavis: varnish: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291187 [22:26:48] Krenair: this still current? 
it says we should "mwscript initSiteStats.php lrcwiki --update" https://phabricator.wikimedia.org/T109635 [22:27:01] (03PS3) 10BryanDavis: servermon: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291185 [22:27:15] I don't know if the issue is still valid [22:27:25] you can still run the command [22:28:46] (03PS3) 10BryanDavis: DBUtil.py: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291175 [22:31:49] (03CR) 10Yuvipanda: [C: 032] base: Provide better error messages for service_unit [puppet] - 10https://gerrit.wikimedia.org/r/291259 (owner: 10Yuvipanda) [22:32:05] (03PS11) 10Yuvipanda: tools: Fixup k8s bastion role [puppet] - 10https://gerrit.wikimedia.org/r/291243 (https://phabricator.wikimedia.org/T136413) [22:35:02] (03PS1) 10BryanDavis: varnishkafka: submodule bump for pep8 fix [puppet] - 10https://gerrit.wikimedia.org/r/291365 [22:37:27] (03PS1) 10BryanDavis: kafkatee: submodule bump for pep8 fix [puppet] - 10https://gerrit.wikimedia.org/r/291366 [22:49:14] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [22:49:14] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [22:50:35] !log mwscript initSiteStats.php --wiki=lrcwiki --update for T109635 [22:50:36] T109635: lrcwiki stats are wrong - https://phabricator.wikimedia.org/T109635 [22:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:01:04] The unmerged change is mine, I'll fix it shortly. [23:07:31] where is the ultimate source for the data on https://en.wikipedia.org/wiki/Special:SiteMatrix [23:08:04] operations/mediawiki-config? [23:08:17] CommonSettings probably has a bunch of it [23:09:03] legoktm, oh !
the "localname" values have bugs in them [23:10:01] so i once reported it as a bug in parsoid config, it turned out that uses "fetch-sitematrix.js" to fetch it [23:10:31] and that gets it from mw-config then [23:15:38] !log legoktm@tin Synchronized php-1.28.0-wmf.3/includes/actions/RollbackAction.php: RollbackAction: Don't return true, causes '1' to be output (duration: 00m 34s) [23:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:37:39] bd808: want to do https://gerrit.wikimedia.org/r/#/c/278315/ now? [23:37:48] re: > It would probably be a good idea to force a post-merge puppet run on at least one of the three Logstash frontend boxes (logstash100[123]) and verify that the restart is successful. [23:38:06] sure! [23:44:49] !log ran extensions/GlobalBlocking/fixGlobalBlockWhitelist.php for T56496 [23:44:50] T56496: Local Special:GlobalBlockStatus does not work at all - https://phabricator.wikimedia.org/T56496 [23:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:53:07] (03PS3) 10Ori.livneh: logstash: Make truncated MediaWiki json easier to find [puppet] - 10https://gerrit.wikimedia.org/r/278315 (owner: 10BryanDavis) [23:53:19] (03CR) 10Ori.livneh: [C: 032 V: 032] logstash: Make truncated MediaWiki json easier to find [puppet] - 10https://gerrit.wikimedia.org/r/278315 (owner: 10BryanDavis) [23:56:13] (03PS1) 10Ori.livneh: Revert "base: Provide better error messages for service_unit" [puppet] - 10https://gerrit.wikimedia.org/r/291381 [23:56:58] (03PS2) 10Ori.livneh: Revert "base: Provide better error messages for service_unit" [puppet] - 10https://gerrit.wikimedia.org/r/291381 [23:57:09] (03CR) 10Ori.livneh: [C: 032 V: 032] Revert "base: Provide better error messages for service_unit" [puppet] - 10https://gerrit.wikimedia.org/r/291381 (owner: 10Ori.livneh) [23:57:37] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: puppet fail [23:58:10] !log Forcing a Puppet run on 
logstash* [23:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:58:56] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [23:58:56] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [23:59:58] bd808: looks good to me, at least insofar as logstash restarted successfully on all boxes