[00:00:04] RoanKattouw, ^d, marktraceur, kaldari: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150113T0000).
[00:01:16] (PS11) BBlack: varnish: Route requests with 'X-Wikimedia-Debug=1' to test_wikipedia backend [puppet] - https://gerrit.wikimedia.org/r/183171 (owner: Ori.livneh)
[00:06:06] (PS1) Dzahn: remove remnants of the 'views' column [debs/wikistats] - https://gerrit.wikimedia.org/r/184517 (https://phabricator.wikimedia.org/T38293)
[00:06:07] Who will SWAT?
[00:06:27] !swat_roulette
[00:06:54] If no one jumps in, I can volunteer to do it
[00:07:25] kaldari|2 is about to...
[00:08:00] Ops-Access-Requests: EventLogging access for Tilman - https://phabricator.wikimedia.org/T86533#972518 (Eloquence) Approved. If we can skip the three day wait this would be great, since Comms is urgently trying to pull some numbers.
[00:08:28] hoo: I was just doing the mobile updates since no one else was around :P
[00:08:52] kaldari|2: You can also do my updates... or at least ping me once you're done
[00:10:21] hoo: I'll ping you. Normally I would be happy to do all of them, but we also have to do a complicated QA session and config change after my deployment, which is going to press me for time
[00:10:34] sure, that's fine
[00:13:45] ori: https://gerrit.wikimedia.org/r/#/c/184516/
[00:14:20] kaldari|2: bogus error, i'll override.
[00:14:35] ori: thanks
[00:15:10] ori: The wmf14 submodule update is https://gerrit.wikimedia.org/r/#/c/184521/
[00:19:17] hoo: ori says he's going to go ahead and do all of them, which is fine with me
[00:19:42] ok
[00:20:39] hoo: what are your changes?
[00:21:03] https://gerrit.wikimedia.org/r/#q,184514,n,z and https://gerrit.wikimedia.org/r/#q,184515,n,z
[00:21:38] and https://gerrit.wikimedia.org/r/#q,184136,n,z (plus eventually Reedy's follow-up, which si comment-only)
[00:22:37] (CR) Ori.livneh: [C: 2] Remove no longer needed hook handlers from Bug54847 [mediawiki-config] - https://gerrit.wikimedia.org/r/184136 (owner: Hoo man)
[00:22:53] (CR) Ori.livneh: [C: 2] Parameter type hints for Bug54847.php [mediawiki-config] - https://gerrit.wikimedia.org/r/184141 (owner: Reedy)
[00:31:33] (PS3) Ori.livneh: Remove no longer needed hook handlers from Bug54847 [mediawiki-config] - https://gerrit.wikimedia.org/r/184136 (owner: Hoo man)
[00:31:40] (CR) Ori.livneh: [V: 2] Remove no longer needed hook handlers from Bug54847 [mediawiki-config] - https://gerrit.wikimedia.org/r/184136 (owner: Hoo man)
[00:31:47] (PS3) Ori.livneh: Parameter type hints for Bug54847.php [mediawiki-config] - https://gerrit.wikimedia.org/r/184141 (owner: Reedy)
[00:31:53] (CR) Ori.livneh: [V: 2] Parameter type hints for Bug54847.php [mediawiki-config] - https://gerrit.wikimedia.org/r/184141 (owner: Reedy)
[00:38:30] !log ori Started scap: Updates to MobileFrontend, CentralAuth, EventLogging and WikimediaEvents
[00:38:35] ^ kaldari|2
[00:38:38] Logged the message, Master
[00:38:40] & hoo
[00:38:57] :)
[00:39:23] ori: is that for wmf13 or 14 or both?
[00:39:53] both
[00:39:57] cool
[00:44:58] !log ori Finished scap: Updates to MobileFrontend, CentralAuth, EventLogging and WikimediaEvents (duration: 06m 27s)
[00:45:03] Logged the message, Master
[00:46:56] kaldari|2, hoo ^^^^
[00:47:04] yay
[00:57:31] (CR) Dzahn: [C: 2 V: 2] remove remnants of the 'views' column [debs/wikistats] - https://gerrit.wikimedia.org/r/184517 (https://phabricator.wikimedia.org/T38293) (owner: Dzahn)
[01:03:24] (PS1) Dzahn: bump version to 2.9 [debs/wikistats] - https://gerrit.wikimedia.org/r/184535
[01:05:38] (CR) Dzahn: [C: 2 V: 2] bump version to 2.9 [debs/wikistats] - https://gerrit.wikimedia.org/r/184535 (owner: Dzahn)
[01:36:02] (PS12) Ori.livneh: varnish: Route requests with 'X-Wikimedia-Debug=1' to test_wikipedia backend [puppet] - https://gerrit.wikimedia.org/r/183171
[01:36:25] ^ bblack
[02:17:04] !log l10nupdate Synchronized php-1.25wmf13/cache/l10n: (no message) (duration: 00m 03s)
[02:17:09] !log LocalisationUpdate completed (1.25wmf13) at 2015-01-13 02:17:08+00:00
[02:17:14] Logged the message, Master
[02:17:16] Logged the message, Master
[02:29:14] !log l10nupdate Synchronized php-1.25wmf14/cache/l10n: (no message) (duration: 00m 03s)
[02:29:21] !log LocalisationUpdate completed (1.25wmf14) at 2015-01-13 02:29:21+00:00
[02:29:23] Logged the message, Master
[02:29:26] Logged the message, Master
[04:07:54] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Jan 13 04:07:54 UTC 2015 (duration 7m 52s)
[04:07:59] Logged the message, Master
[04:19:06] (CR) Glaisher: [C: -1] Create "autopatrolled", "patroller" and "rollbacker" user groups on fawiktionary (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/184370 (https://phabricator.wikimedia.org/T85381) (owner: Calak)
[04:35:54] (PS3) Andrew Bogott: Move the codfw ldap server off of labcontrol2001. [puppet] - https://gerrit.wikimedia.org/r/183005
[04:37:15] (CR) Andrew Bogott: [C: 2] Move the codfw ldap server off of labcontrol2001. [puppet] - https://gerrit.wikimedia.org/r/183005 (owner: Andrew Bogott)
[04:42:35] PROBLEM - LDAP on labcontrol2001 is CRITICAL: Connection refused
[04:42:45] PROBLEM - Certificate expiration on labcontrol2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused
[04:42:45] PROBLEM - LDAPS on labcontrol2001 is CRITICAL: Connection refused
[04:50:16] ACKNOWLEDGEMENT - Certificate expiration on labcontrol2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused andrew bogott I just moved ldap off this server.
[04:50:16] ACKNOWLEDGEMENT - LDAP on labcontrol2001 is CRITICAL: Connection refused andrew bogott I just moved ldap off this server.
[04:50:16] ACKNOWLEDGEMENT - LDAPS on labcontrol2001 is CRITICAL: Connection refused andrew bogott I just moved ldap off this server.
[04:53:52] (PS1) Springle: depool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/184545
[04:54:42] (CR) Springle: [C: 2] depool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/184545 (owner: Springle)
[04:54:52] (Merged) jenkins-bot: depool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/184545 (owner: Springle)
[04:55:36] !log springle Synchronized wmf-config/db-eqiad.php: depool db1067 (duration: 00m 05s)
[04:55:39] Logged the message, Master
[05:01:31] (PS1) Ori.livneh: Eliminate dead code from text VCL [puppet] - https://gerrit.wikimedia.org/r/184546
[05:19:22] (PS1) Ori.livneh: VCL: Add 'maybe_use_random_scheduler' subroutine [puppet] - https://gerrit.wikimedia.org/r/184547
[05:26:44] (PS1) Ori.livneh: Remove GettingStarted cookie workaround, reverting ae30ae0ba [puppet] - https://gerrit.wikimedia.org/r/184548
[05:41:37] (PS1) Ori.livneh: VCL: Get rid of hhvm.inc.vcl.erb [puppet] - https://gerrit.wikimedia.org/r/184551
[05:45:00] (PS1) Ori.livneh: Get rid of blog.inc.vcl.erb and associated probe [puppet] - https://gerrit.wikimedia.org/r/184552
[05:47:20] (PS1) Ori.livneh: VCL: Get rid of graphite.inc.vcl.erb [puppet] - https://gerrit.wikimedia.org/r/184553
[06:21:58] PROBLEM - puppet last run on amslvs4 is CRITICAL: CRITICAL: puppet fail
[06:28:48] PROBLEM - puppet last run on elastic1027 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:58] PROBLEM - puppet last run on db2040 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:07] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:08] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:27] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:48] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:27] PROBLEM - puppet last run on search1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:38] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:48] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:48] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:04] (CR) Chmarkine: [C: 1] "It's definitely good to enforce HTTPS whenever possible." [puppet] - https://gerrit.wikimedia.org/r/181949 (owner: Hoo man)
[06:39:48] RECOVERY - puppet last run on amslvs4 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:46:37] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[06:46:48] RECOVERY - puppet last run on elastic1027 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:46:57] RECOVERY - puppet last run on db2040 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:47:27] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:47:38] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:49:28] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[07:08:37] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[07:09:58] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[07:10:08] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[07:10:58] RECOVERY - puppet last run on search1007 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[07:21:59] PROBLEM - MySQL Processlist on db1056 is CRITICAL: CRIT 177 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[07:23:09] RECOVERY - MySQL Processlist on db1056 is OK: OK 30 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[07:30:35] !log ongoing schema changes T86415 externallinks, codfw first
[07:30:38] Logged the message, Master
[07:42:46] (PS1) Ori.livneh: VCL: Use header.append() in more places. [puppet] - https://gerrit.wikimedia.org/r/184567
[07:47:09] (PS1) Ori.livneh: VCL lint: 'return(xxx)' => 'return (xxx)' [puppet] - https://gerrit.wikimedia.org/r/184568
[07:49:07] (PS2) Ori.livneh: VCL: Get rid of hhvm.inc.vcl.erb [puppet] - https://gerrit.wikimedia.org/r/184551
[07:49:32] (PS3) Calak: Create "autopatrolled", "patroller" and "rollbacker" user groups on fawiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/184370 (https://phabricator.wikimedia.org/T85381)
[07:50:46] (CR) Calak: "Thank you Glaisher." [mediawiki-config] - https://gerrit.wikimedia.org/r/184370 (https://phabricator.wikimedia.org/T85381) (owner: Calak)
[08:10:54] (PS1) Ori.livneh: VCL: Standardize on '//'-style comments [puppet] - https://gerrit.wikimedia.org/r/184570
[08:13:11] ooh vcl spree
[08:13:11] nice
[08:14:21] (CR) Faidon Liambotis: "Is Analytics aware of this? Are we sure this won't break any scripts of theirs?" [puppet] - https://gerrit.wikimedia.org/r/184551 (owner: Ori.livneh)
[08:16:04] (CR) Faidon Liambotis: [C: 2] Get rid of blog.inc.vcl.erb and associated probe [puppet] - https://gerrit.wikimedia.org/r/184552 (owner: Ori.livneh)
[08:16:45] (CR) Faidon Liambotis: [C: 2] VCL: Get rid of graphite.inc.vcl.erb [puppet] - https://gerrit.wikimedia.org/r/184553 (owner: Ori.livneh)
[08:17:41] (CR) Faidon Liambotis: [C: 2] VCL: Use header.append() in more places. [puppet] - https://gerrit.wikimedia.org/r/184567 (owner: Ori.livneh)
[08:18:04] akosiaris: re: cxserver puppet failures on betalabs, parsoidcache varnish on betalabs puppet is also failing due to something cxserver related
[08:18:32] paravoid: woo, thanks. i haven't tested these on labs yet, so be careful, i could have done something dumb.
[08:19:02] the ones you merged are safe
[08:19:12] (CR) Faidon Liambotis: [C: 2] VCL lint: 'return(xxx)' => 'return (xxx)' [puppet] - https://gerrit.wikimedia.org/r/184568 (owner: Ori.livneh)
[08:19:14] didn't merge yet
[08:19:16] just +2ed :)
[08:19:32] ok, let me cherry-pick them on labs first then
[08:19:45] so we know at least that there are no nasty syntax errors
[08:20:14] manual CI
[08:20:31] MI
[08:22:29] (PS2) Ori.livneh: Get rid of blog.inc.vcl.erb and associated probe [puppet] - https://gerrit.wikimedia.org/r/184552
[08:22:58] (CR) Ori.livneh: "(rebased to avoid dependency on I75b30bef2)" [puppet] - https://gerrit.wikimedia.org/r/184552 (owner: Ori.livneh)
[08:23:00] (CR) Faidon Liambotis: "Why did you pick C++-style comments? Personally I can't say I have a preference for either so I don't care either way (but I like consiste" [puppet] - https://gerrit.wikimedia.org/r/184570 (owner: Ori.livneh)
[08:25:49] (CR) Ori.livneh: "VCL supports /* */, #, and //. I consider multi-line comments problematic because in the absence of reliable syntax highlighting they can " [puppet] - https://gerrit.wikimedia.org/r/184570 (owner: Ori.livneh)
[08:26:22] (PS1) Yuvipanda: Rename package to labsdb.auditor [software/labsdb-auditor] - https://gerrit.wikimedia.org/r/184577
[08:26:36] (PS2) Yuvipanda: Rename package to labsdb.auditor [software/labsdb-auditor] - https://gerrit.wikimedia.org/r/184577
[08:27:13] (CR) Yuvipanda: [C: 2 V: 2] Rename package to labsdb.auditor [software/labsdb-auditor] - https://gerrit.wikimedia.org/r/184577 (owner: Yuvipanda)
[08:28:28] (CR) Ori.livneh: [C: -1] "I created this commit using sed trickery, so I need to go over it carefully to make sure there are no mistakes." [puppet] - https://gerrit.wikimedia.org/r/184570 (owner: Ori.livneh)
[08:34:08] paravoid: the ones you +2'd applied cleanly in labs
[08:34:54] (CR) Ori.livneh: "Not sure. CC @Nuria, @QChris, @Milimetric." [puppet] - https://gerrit.wikimedia.org/r/184551 (owner: Ori.livneh)
[08:38:43] (PS2) Faidon Liambotis: VCL: Get rid of graphite.inc.vcl.erb [puppet] - https://gerrit.wikimedia.org/r/184553 (owner: Ori.livneh)
[08:38:52] (CR) Faidon Liambotis: [V: 2] VCL: Get rid of graphite.inc.vcl.erb [puppet] - https://gerrit.wikimedia.org/r/184553 (owner: Ori.livneh)
[08:38:59] grrr gerrit
[08:39:45] (PS2) Faidon Liambotis: VCL: Use header.append() in more places. [puppet] - https://gerrit.wikimedia.org/r/184567 (owner: Ori.livneh)
[08:39:50] (CR) Faidon Liambotis: [V: 2] VCL: Use header.append() in more places. [puppet] - https://gerrit.wikimedia.org/r/184567 (owner: Ori.livneh)
[08:39:58] (PS2) Faidon Liambotis: VCL lint: 'return(xxx)' => 'return (xxx)' [puppet] - https://gerrit.wikimedia.org/r/184568 (owner: Ori.livneh)
[08:40:04] (CR) Faidon Liambotis: [V: 2] VCL lint: 'return(xxx)' => 'return (xxx)' [puppet] - https://gerrit.wikimedia.org/r/184568 (owner: Ori.livneh)
[08:40:47] sorry, I tried to merge them without all these weird merge commits
[08:40:52] (and succeded)
[08:41:07] np, thanks for reviewing
[08:43:10] <_joe_> !log raising the net.ipv4.ip_local_port_range on mw1230
[08:43:17] Logged the message, Master
[08:44:08] ori: hey btw
[08:44:15] ?
[08:44:18] I dislike having binaries in the puppet tree
[08:44:38] the xenon stuff have a bunch of OpenSans fonts included
[08:45:13] the only way I can describe my feeling about shipping ttf/svg/etc. via puppet is ewww :)
[08:45:15] yeah, i'll move that out. thanks for the reminder. i initially referenced the fonts by hot-linking google fonts, but hoo pointed out that that was a violation of the privacy policy
[08:45:26] yeah I was about to say that too :)
[08:47:19] that apaxy theme is nice
[08:47:31] we should use it elsewhere as well :)
[08:47:45] love your attention to detail ;)
[08:51:03] thanks! any interesting hacky plans for the all-staff / dev summit?
[08:51:36] I am supposed to be working on VisualEditor, so I am looking to do anything except VisualEditor, in typical procrastinator fashion
[08:52:38] <_joe_> ori: and of course memcached connection failures ended the second I started watching them
[08:53:57] heh
[08:54:36] you fixed it!
[08:54:38] ACKNOWLEDGEMENT - puppet last run on mw1062 is CRITICAL: CRITICAL: Puppet last ran 17 hours ago Giuseppe Lavagetto this server has a failed disk
[08:58:15] <_joe_> !log raising the net.ipv4.ip_local_port_range on mw1196
[08:58:18] Logged the message, Master
[09:05:03] _joe_: you could probably force it by increasing the weight of one of the servers in pybal to 25 or 30
[09:05:46] <_joe_> ori: that was my thinking as well
[09:06:10] <_joe_> ori: but this always happens in batches
[09:06:36] interesting
[09:06:37] <_joe_> so, for now I've seen my change is not harming anything, I'll do it via puppet, as it's a good idea anyways
[09:06:46] * ori nods
[09:07:24] <_joe_> oh, did I forgot to mention we had no failure in the last 10 minutes and it's the longest hiatus this morning?
[09:07:28] <_joe_> Heisenbug!
[09:12:36] (PS2) Ori.livneh: VCL: Standardize on '//'-style comments [puppet] - https://gerrit.wikimedia.org/r/184570
[09:12:59] (CR) Ori.livneh: "Looks correct" [puppet] - https://gerrit.wikimedia.org/r/184570 (owner: Ori.livneh)
[09:13:42] good night
[09:13:48] <_joe_> good night
[09:18:38] (PS1) Giuseppe Lavagetto: mediawiki: raise local_port_range for the api pool [puppet] - https://gerrit.wikimedia.org/r/184583
[09:20:01] <_joe_> paravoid: ^^ how does it looks?
[09:20:36] even lower I'd say
[09:20:49] <_joe_> well I was being conservative
[09:20:58] <_joe_> twemproxy has a management port at 22222
[09:22:02] oh right
[09:22:42] <_joe_> Tim-away: thanks for rebasing the patch!
[09:23:29] <_joe_> paravoid: in case we need to use even more, I'll reconfigure twemproxy, but I don't think we'll need to
[09:23:36] (CR) Giuseppe Lavagetto: [C: 2] mediawiki: raise local_port_range for the api pool [puppet] - https://gerrit.wikimedia.org/r/184583 (owner: Giuseppe Lavagetto)
[09:26:38] PROBLEM - Disk space on search1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
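Editor's sketch: the port-range reasoning above (widen net.ipv4.ip_local_port_range, but keep the lower bound above twemproxy's management port 22222) can be illustrated in Python. This is not the content of the puppet change r/184583. The function names and the example ranges are assumptions made for illustration; the "low high" text format is the one Linux exposes in /proc/sys/net/ipv4/ip_local_port_range.

```python
def parse_port_range(text):
    """Parse the 'low high' format used by /proc/sys/net/ipv4/ip_local_port_range."""
    low, high = (int(part) for part in text.split())
    if not 0 < low <= high <= 65535:
        raise ValueError(f"invalid port range: {text!r}")
    return low, high


def ephemeral_port_count(low, high):
    """Number of local ports usable for outgoing connections (inclusive range)."""
    return high - low + 1


def collisions(low, high, reserved_ports):
    """Return the reserved (listening) ports that fall inside the ephemeral range."""
    return sorted(port for port in reserved_ports if low <= port <= high)


if __name__ == "__main__":
    # Example ranges are hypothetical; 22222 is the twemproxy management
    # port mentioned in the discussion.
    for text in ("32768 61000", "22223 65535", "15000 65535"):
        low, high = parse_port_range(text)
        print(text, ephemeral_port_count(low, high), collisions(low, high, {22222}))
```

A range whose lower bound sits just above 22222 gains ports without overlapping the listener; going lower, as suggested in the chat, would require moving the twemproxy port first.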
[09:29:20] (CR) Giuseppe Lavagetto: [C: -1] "See comment." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/183222 (https://phabricator.wikimedia.org/T85964) (owner: Hashar)
[09:43:33] (PS1) Alexandros Kosiaris: Fix typo in delete-ldap-group [puppet] - https://gerrit.wikimedia.org/r/184585
[09:46:43] (PS1) Yuvipanda: apertium: Unify production and beta roles [puppet] - https://gerrit.wikimedia.org/r/184586
[09:46:45] _joe_: akosiaris ^
[09:47:53] hashar: ^
[09:48:01] (since that deals with jenkins)
[09:48:39] hello
[09:48:44] YuviPanda: hola
[09:49:11] <_joe_> YuviPanda: I don't like it a lot
[09:49:23] hi kart_
[09:49:44] _joe_: I’d want to split up the jenkins_access further, but not yet (am looking through other places that is used)
[09:50:34] operations, Wikimedia-General-or-Unknown: COPYING is served as application/octet-stream - https://phabricator.wikimedia.org/T63903#973076 (fgiunchedi) indeed, I don't think there's anything operations can help with at the moment, let us know if that's not the case though!
[09:50:44] YuviPanda: look at cxserver too :)
[09:51:00] But, I have no idea why port was removed.
[09:51:08] We need it afaik.
[09:51:33] <_joe_> YuviPanda: why not now?
[09:51:34] kart_: ‘port was removed’ as in?
[09:52:02] <_joe_> YuviPanda: also, not having the $apertium_port being a class parameter was a deliberate choice from akosiaris
[09:52:15] _joe_: oh? why so?
[09:52:28] I looked around and didn’t see it being used anywhere else
[09:52:42] <_joe_> because class parameters are discouraged in roles
[09:52:49] <_joe_> but well, whatever :)
[09:53:16] <_joe_> YuviPanda: the problematic parts are probably: mediawiki and varnish
[09:53:21] _joe_: YuviPanda oh, feel free to fix it :)
[09:53:37] _joe_: what why?
[09:53:41] class parameters *without* defaults, sure
[09:54:25] _joe_: right. and lack of LVS
[09:54:37] <_joe_> mmmh looks like raising the local ports range wasn't enough after all
[09:54:44] * _joe_ sighs
[09:56:30] <_joe_> let's try with tw_reuse
[09:57:37] (PS2) Yuvipanda: apertium: Unify production and beta roles [puppet] - https://gerrit.wikimedia.org/r/184586
[09:59:31] operations, Beta-Cluster: Unify ::production / ::beta roles for *oid - https://phabricator.wikimedia.org/T86633#973092 (yuvipanda) NEW
[10:00:09] (PS3) Yuvipanda: apertium: Unify production and beta roles [puppet] - https://gerrit.wikimedia.org/r/184586 (https://phabricator.wikimedia.org/T86633)
[10:00:25] YuviPanda: I was explicitly avoid the class param in the role since it is frowned upon
[10:00:31] avoiding*
[10:00:33] akosiaris: hmm, why?
[10:01:07] class params with defaults, that is? Ones without defaults seem bad
[10:01:23] well, it is a fine line after that
[10:01:38] <_joe_> YuviPanda: this is a remainder of the pre-hiera epoch, one might say
[10:01:38] the idea is to avoid cluttering site.pp with configured explicitly role classes
[10:01:54] right, but if you have them all with defaults you just set any config in hiera
[10:02:07] <_joe_> but, one can also say 'if you need role class parameters, you have bad base classes'
[10:02:44] the idea is that role classes actually configure the module classes
[10:02:45] <_joe_> !log net.ipv4.tcp_tw_reuse = 1 on mw1223
[10:02:49] Logged the message, Master
[10:03:08] and well if you configure the configuration ... turtles upon turtles
[10:03:29] it's an arbitrary stop to that chicken and egg problem
[10:03:38] bad simili
[10:03:51] to the turtles upon turtles problem sounds better
[10:04:15] akosiaris: but hiera(‘’) is pretty much the same thing except the configuration is no longer explicit
[10:04:34] I suppose we could just make the port not configurable from puppet as such, perhaps. I don’t think it’s actually being changed at all...
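Editor's sketch: the two styles being debated (a role parameter with a default versus an inline hiera() call with a default) can be modelled in Python. This is only an analogy under stated assumptions, not Puppet code: the key name "apertium::port", the layer contents and the port numbers are all hypothetical. The one fact relied on is that a priority lookup in Hiera returns the value from the first, most specific level that defines the key.

```python
def hiera_lookup(key, hierarchy, default=None):
    """Return the value from the first (most specific) layer that defines key."""
    for layer in hierarchy:
        if key in layer:
            return layer[key]
    return default


# Hypothetical data layers, most specific first.
HOST = {}
REALM_LABS = {"apertium::port": 2738}
COMMON = {}


def role_with_param(port=2737, hierarchy=()):
    """Style A: the default is visible in the signature; a lookup may override it."""
    return hiera_lookup("apertium::port", hierarchy, default=port)


def role_with_inline_lookup(hierarchy=()):
    """Style B: no parameter; the default only appears inside the role body."""
    return hiera_lookup("apertium::port", hierarchy, default=2737)
```

Both styles resolve to the same value for the same data; the disagreement in the chat is about where a reader discovers that the value is configurable, and whether a parameter invites callers to set it outside the lookup data.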
[10:06:29] (PS4) Yuvipanda: apertium: Unify production and beta roles [puppet] - https://gerrit.wikimedia.org/r/184586 (https://phabricator.wikimedia.org/T86633)
[10:06:31] (PS1) Yuvipanda: cxserver: Unify production and beta roles [puppet] - https://gerrit.wikimedia.org/r/184590 (https://phabricator.wikimedia.org/T86633)
[10:07:15] _joe_: if you look at the cxserver patch, most of the differences for beta can be configured in the cxserver base class.
[10:07:25] $port is needed only for monitoring + firewall
[10:08:54] yeah, it should make it downstream to configuration as well
[10:09:40] operations, Beta-Cluster: Unify ::production / ::beta roles for *oid - https://phabricator.wikimedia.org/T86633#973108 (yuvipanda) Note that hiera changes for deployment-prep must be made before any of these get merged.
[10:09:54] akosiaris: and we aren’t actually passing the port downstream to the base module at all.
[10:09:55] well, role classes ain't gonna go away since they are also the grouping place for saying "module classes+monitoring+firewall+backups+etc"
[10:10:19] so hiera is only going to remove the configuration part of the role class functionality
[10:10:19] akosiaris: yeah. and so they *will* have some config, I think.
[10:10:32] yeah
[10:10:35] calling hiera() just makes that less obvious
[10:10:40] than putting it as a param with a default, IMO
[10:12:15] it does? for me it is clearer that way TBH
[10:13:00] cause you got a non really overridable place in the role class where configuration happens
[10:13:38] like, you have to read through the role to figure out where config is and how configurable it is.
[10:13:57] hmm, I suppose this predicates on if we want them to be seen as configurable or not
[10:14:11] if we want them to be not seen / written as configurable, then not having any params makes sense.
[10:14:48] yeah, but you end up knowing it is in that place alone. not having to go through site.pp to see if a variable has been passed for a specific node
[10:14:50] <_joe_> akosiaris: well once we set something with a hiera lookup, it's like having a parameter
[10:15:07] (PS1) KartikMistry: pep8: Fixed comments [puppet] - https://gerrit.wikimedia.org/r/184594
[10:15:14] YuviPanda: my editors stucks when I edit modules/admin/data/data.yaml
[10:15:15] or god forbid someone including a role class in another role class in a different manifest
[10:15:23] kart_: vim?
[10:15:24] <_joe_> and well, we're going to /forbid/ setting parameters in site.pp
[10:15:26] <_joe_> SOON
[10:15:29] YuviPanda: yes.
[10:15:35] _joe_: yes!!!
[10:15:37] kart_: it’s a bug in the default YAML role
[10:15:47] ah :)
[10:15:57] kart_: use stephpy/vim-yaml instead. works fine then :)
[10:16:02] I hope I don't need to use Emacs ;)
[10:16:11] <_joe_> akosiaris: on this theme, would you mind taking a look at https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:remove_globals,n,z
[10:16:14] okay!
[10:16:16] kart_: yeah, this is just a differnt plugin
[10:16:56] akosiaris: hmm, that’s also right. still, I’d prefer if we just not set params in site.pp / have roles include each other via CR / conventions and use params..
[10:16:57] _joe_: my point is that the role classes is the point where configuration starts. If it is not there, and not a default in classes down the stream it is not anywhere
[10:17:39] again, an arbitrary stop somewhere in the chain
[10:19:30] <_joe_> !log reimaging mw1001, mw1002
[10:19:34] Logged the message, Master
[10:19:58] _joe_: akosiaris labs also has other shittiness, like public IP of instances not being accessible from inside labs thus requiring NAT rules in places (https://phabricator.wikimedia.org/T47868)
[10:20:52] (PS2) Yuvipanda: pep8: Fixed comments [puppet] - https://gerrit.wikimedia.org/r/184594 (owner: KartikMistry)
[10:21:33] (PS3) Yuvipanda: admin: Fix comment style in linter script [puppet] - https://gerrit.wikimedia.org/r/184594 (owner: KartikMistry)
[10:21:48] (CR) Yuvipanda: [C: 2] admin: Fix comment style in linter script [puppet] - https://gerrit.wikimedia.org/r/184594 (owner: KartikMistry)
[10:22:09] (CR) Yuvipanda: [V: 2] admin: Fix comment style in linter script [puppet] - https://gerrit.wikimedia.org/r/184594 (owner: KartikMistry)
[10:22:25] kart_: thanks
[10:23:13] akosiaris: _joe_ either way, if you think we should defer that params for role conversation later, I can rejigger the patches to not use them. Even without that a lot of code duplication will be moved to hiera otherwise.
[10:23:57] <_joe_> YuviPanda: lemme take a look in a few
[10:24:03] _joe_: cool
[10:24:05] I am still pondering on the issue. First first take was a ::common class to be included by ::production and ::labs with the code that role::production has
[10:24:08] <_joe_> I'm starting reimaging
[10:24:34] your's is better IMHO, I just don't like the class param
[10:25:07] hmm, I still like the class param. Would be nice if we can enforce puppet to *not* set class params manually but only pick them up from hiera :)
[10:25:18] hiera() is that, but I dunno - it feels not-explicit-enough for me
[10:25:21] <_joe_> YuviPanda: that would be a bit foolish I think
[10:25:25] (PS4) Filippo Giunchedi: lsearchd: remove lvs configuration [puppet] - https://gerrit.wikimedia.org/r/183462 (https://phabricator.wikimedia.org/T85009)
[10:25:34] _joe_: well, only for some classes.
[10:25:40] specifically, only for role classes
[10:26:37] <_joe_> YuviPanda: well, we should use the 'role' keyword for those, from now on
[10:26:49] oh, right
[10:27:08] so that’s basically what I wanted :)
[10:27:54] operations, WMF-NDA-Requests: Grant access to Nikerabbit - https://phabricator.wikimedia.org/T86632#973123 (Qgil) Being a long-term WMF employee, I guess you have an NDA already. CCing #Operations. Someone these should give you access to WMF-NDA.
[10:29:23] akosiaris: also in the cxserver role, T47868 is mentioned as a reason for a ferm rule in the ::production role, but that affects only labs...
[10:29:46] in fact that is the case for apertium too
[10:29:50] citoid seems to have it in the right place
[10:30:00] <_joe_> YuviPanda: I think we may have ::production and ::beta roles only if they have some really env-specific rules
[10:30:09] <_joe_> like, firewall rules etc etc
[10:30:22] !log Set email for global account "Liberipedia" as per https://phabricator.wikimedia.org/T76321
[10:30:26] Logged the message, Master
[10:30:42] right, but I think we should explicitly call them out as firewall rules for an explicit reason (like, jenkins_access) than just as ::beta
[10:31:20] hashar: heya!
[10:31:27] hashar: I saw this comment in some roles:
[10:31:30] # Since the "root" user is local, we cant add the sudo policy in
[10:31:30] # OpenStack manager interface at wikitech
[10:31:38] hashar: we fixed that bug a while ago :)
[10:31:53] YuviPanda: adding in the things to be fixed list
[10:33:06] YuviPanda: oh really ! Guess some puppet manifests deserve a cleanup
[10:33:17] are you guys cleaning up the manifests?
[10:33:37] hashar: trying to :)
[10:33:48] discussing about how to do it sounds more like it
[10:35:39] hashar: I just created a rule called jenkins-deploy for deployment-prep
[10:36:15] hashar: you’ll see that it says ‘sudo as ALL’ rather than as ‘All Project members’. The ALL allows local users too
[10:37:22] (PS1) Alexandros Kosiaris: Use the director varnish backend for cxserver [puppet] - https://gerrit.wikimedia.org/r/184598
[10:38:02] YuviPanda: ^ with this cxserver's problems in beta should be fine
[10:38:50] akosiaris: nice! Can you test and merge or want me to?
[10:39:19] merging now
[10:39:36] heh cool
[10:43:29] RECOVERY - Graphite Carbon on graphite1002 is OK: OK: All defined Carbon jobs are runnning.
[10:44:42] (PS2) Yuvipanda: cxserver: Unify production and beta roles [puppet] - https://gerrit.wikimedia.org/r/184590 (https://phabricator.wikimedia.org/T86633)
[10:44:44] (PS5) Yuvipanda: apertium: Unify production and beta roles [puppet] - https://gerrit.wikimedia.org/r/184586 (https://phabricator.wikimedia.org/T86633)
[10:44:57] hashar: ^ I’ve removed the explicit sudo rule for apertium, and added it via wikitech
[10:45:53] hashar: only apertium and parsoid seem to have sudo rules?
[10:47:08] PROBLEM - Graphite Carbon on graphite1002 is CRITICAL: CRITICAL: Not all configured Carbon instances are running.
[10:49:36] (CR) Alexandros Kosiaris: [C: 2] Use the director varnish backend for cxserver [puppet] - https://gerrit.wikimedia.org/r/184598 (owner: Alexandros Kosiaris)
[10:49:49] YuviPanda: no clue I havent looked
[10:50:02] hashar: alright.
[10:58:03] YuviPanda: when will shinken in beta labs wake up and say OK about the puppet failures ?
[10:58:14] aka how often does it check :-)
[10:58:59] I hate btw that the web frontend does not display the entire hostname
[10:59:26] akosiaris: 10mins
[10:59:36] ok thanks
[10:59:44] akosiaris: yeah. I was going to backport the 2.x packages, but then migh tas well move this to jessie when we have those instances
[11:02:25] PROBLEM - RAID on mw1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:02:55] PROBLEM - configured eth on mw1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:03:05] PROBLEM - dhclient process on mw1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:03:17] akosiaris: I could also potentially use another web frontend + LiveStatus API. But the popular ones seem to be written in perl
[11:03:25] PROBLEM - nutcracker port on mw1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:03:36] PROBLEM - nutcracker process on mw1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:03:56] PROBLEM - puppet last run on mw1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:04:05] PROBLEM - salt-minion processes on mw1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:04:06] PROBLEM - puppet last run on mw1001 is CRITICAL: CRITICAL: Puppet has 4 failures
[11:04:19] perl ??? the whole of shinken is in python but the web frontend is in perl ?
[11:04:24] * akosiaris sighs
[11:04:26] PROBLEM - DPKG on mw1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:04:35] akosiaris: no no [11:04:40] akosiaris: the shinken web frontend is python [11:04:45] PROBLEM - Disk space on mw1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:05:07] akosiaris: but there are *other* perl frontends that can act as web frontends for a variety of backends (shinken, icinga, nagios, etc) via the livestatus api [11:05:11] aaah, the alternatives are in perl [11:05:13] ok [11:05:45] akosiaris: yeah. and upstream talks a lot in French :( [11:05:45] RECOVERY - DPKG on mw1002 is OK: All packages OK [11:05:46] RECOVERY - nutcracker port on mw1002 is OK: TCP OK - 0.000 second response time on port 11212 [11:05:55] RECOVERY - Disk space on mw1002 is OK: DISK OK [11:05:56] RECOVERY - RAID on mw1002 is OK: OK: no RAID installed [11:06:05] RECOVERY - nutcracker process on mw1002 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [11:06:09] (03PS1) 10Giuseppe Lavagetto: mediawiki: activate net.ipv4.tcp_tw_reuse [puppet] - 10https://gerrit.wikimedia.org/r/184602 [11:06:26] RECOVERY - configured eth on mw1002 is OK: NRPE: Unable to read output [11:06:26] RECOVERY - salt-minion processes on mw1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:06:45] RECOVERY - dhclient process on mw1002 is OK: PROCS OK: 0 processes with command name dhclient [11:06:48] <_joe_> paravoid, akosiaris: care to review? [11:06:55] sorry, just got back [11:07:04] <_joe_> paravoid: np, just committed [11:07:23] <_joe_> allowing more local ports reduced the error rate by 75% [11:07:26] sure [11:07:35] <_joe_> but it's still too high [11:08:20] expand it further too :) [11:08:33] akosiaris: yeah, puppet recovered after that patch [11:08:33] there's tw_reuse and tw_recycle, I don't remember the differences [11:08:33] :) [11:09:30] maybe it started as a hack to Nagios/Icinga web frontend? 
:D [11:09:40] bah I am lagged [11:10:22] Known to cause some issues with hoststated (load balancing and fail over) if enabled, should be used with caution. [11:10:30] about TCP_TW_RECYCLE [11:10:36] http://www.speedguide.net/articles/linux-tweaking-121 [11:10:41] hashar: hmm, re: https://phabricator.wikimedia.org/T47868 why did we need ferm rules? aren’t security groups enough? [11:10:51] but they could point out what the issues are [11:11:02] YuviPanda: we rely on prod manifests which come with ferm and ferm::rules [11:11:13] anyway, we don't do that in mediawiki appservers so I suppose ok for that too ? [11:11:15] YuviPanda: so the instances also have a local firewall (in addition to labs security rules) [11:11:53] hashar: ah, hmm. [11:12:04] so they have to be opened up in prod too via ferm rules of some sort, I suppose? [11:12:14] YuviPanda: also CI instances and some beta instances are Jenkins slaves, and the puppet manifest contint::something brings in ferm [11:12:41] <_joe_> akosiaris: reuse, not recycle [11:12:48] <_joe_> recycle is bad [11:13:03] yeah, I now see the one is an alternative to the other in reality [11:13:05] <_joe_> http://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux.html has a good explanation [11:13:08] YuviPanda: as for https://phabricator.wikimedia.org/T47868 , that was an iptables-based rewriting of packets. Got fixed in dnsmasq by bblack iirc [11:13:14] I haven't used those before tbh [11:13:25] oh right, I remember that blog post [11:13:31] I do remember issues with NAT in the past [11:13:34] <_joe_> it's very well written [11:13:38] NAT clients [11:13:41] <_joe_> nat issues are with recycle [11:13:44] vincent's blog posts usually are :) [11:13:46] YuviPanda: we needed the DNS server to yield the internal instance IP for some public dns entries (ex: en.wikipedia.beta.wmflabs.org ). Since instances can't communicate over the public IP.
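The tcp_tw_reuse and local-port discussion above is about outgoing connections piling up in TIME_WAIT and eating into the ephemeral port range. A minimal sketch of how one might gauge that on a Linux host, assuming the usual /proc/net/tcp layout (state in the fourth column, 06 meaning TIME_WAIT) and the two-number format of /proc/sys/net/ipv4/ip_local_port_range. The function names are hypothetical and this is not the check anyone in the channel actually ran.

```python
# Hypothetical helpers, not part of any Wikimedia tooling.
TIME_WAIT = "06"  # TCP state code for TIME_WAIT in /proc/net/tcp on Linux


def count_time_wait(proc_net_tcp_text):
    """Count sockets in TIME_WAIT given the text of /proc/net/tcp."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Columns: sl, local_address, rem_address, st, ...
        if len(fields) > 3 and fields[3] == TIME_WAIT:
            count += 1
    return count


def port_range_size(port_range_text):
    """Number of local ports given the text of ip_local_port_range."""
    low, high = (int(value) for value in port_range_text.split())
    return high - low + 1
```

On a real host the inputs would come from reading /proc/net/tcp and /proc/sys/net/ipv4/ip_local_port_range. A TIME_WAIT count approaching the range size towards a single destination would be consistent with the errors described, though it does not prove the cause.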
[11:13:55] but appservers are just being hit by varnish, purely internally [11:14:01] so in any case shouldn't matter [11:14:11] <_joe_> well, there are other issues with recycle [11:14:29] (03CR) 10Alexandros Kosiaris: [C: 032] mediawiki: activate net.ipv4.tcp_tw_reuse [puppet] - 10https://gerrit.wikimedia.org/r/184602 (owner: 10Giuseppe Lavagetto) [11:14:30] hashar: ah, right. [11:14:53] yeah I got the point fine now, thanks for that link _joe_ [11:15:22] hashar: so the rewriting isn’t needed anymore, but we have to open up the ports anyway [11:16:34] akosiaris: I'm re-reviewing https://gerrit.wikimedia.org/r/#/c/169691/ which you have +1ed [11:16:52] I'm worried that ferm's default DROP is going to have unintended consequences [11:17:09] for instance, role::mediawiki::logging applies to fluorine, fluorine runs apache on :80 now because of xenon [11:18:09] so my comment on https://gerrit.wikimedia.org/r/#/c/179166/ is outdated [11:18:18] right [11:18:23] * akosiaris sigh [11:18:31] YuviPanda: and the Jenkins slaves have a rule to restrict ssh from bastion / gallium [11:18:35] RECOVERY - puppet last run on mw1001 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [11:18:37] well, we just need to update it ? [11:18:37] well that's what you get from +1 instead of +2/merge :P [11:18:53] I'm thinking of adding a proto !udp ACCEPT for now [11:19:13] or even LOG maybe [11:19:30] it is the path of least resistance [11:19:36] PROBLEM - puppet last run on mw1002 is CRITICAL: CRITICAL: Puppet has 3 failures [11:19:59] feel free, tbh it needs to be done and then amended so let's do it that way [11:20:28] hashar: yeah, I added you to a couple of patches. 
refactoring some ::production ::beta roles to a role + jenkins_access [11:20:31] roles [11:20:31] !log upgrade db1067 trusty [11:20:34] Logged the message, Master [11:20:43] (03PS2) 10Giuseppe Lavagetto: mediawiki: activate net.ipv4.tcp_tw_reuse [puppet] - 10https://gerrit.wikimedia.org/r/184602 [11:20:46] RECOVERY - puppet last run on mw1002 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [11:21:14] <_joe_> springle: we should save us ~ half a day in SF so that I can properly get a braindump of yours and start helping out on dbs [11:22:46] RECOVERY - Graphite Carbon on graphite1002 is OK: OK: All defined Carbon jobs are runnning. [11:24:02] (03PS1) 10Yuvipanda: citoid: Unify production and beta roles [puppet] - 10https://gerrit.wikimedia.org/r/184605 (https://phabricator.wikimedia.org/T86633) [11:25:07] _joe_: on a related note, let me know what I can do to help out with MW in codfw [11:25:55] * springle drops analytics and labsdb on _joe_ and runs off [11:26:02] <_joe_> springle: lol [11:26:25] PROBLEM - Graphite Carbon on graphite1002 is CRITICAL: CRITICAL: Not all configured Carbon instances are running. 
[11:26:26] <_joe_> springle: I'm not a newbie man, I know how to dodge grenades [11:26:31] (03PS1) 10Springle: repool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184607 [11:26:45] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-307236) (#3658) [10Gbps wave]BR [11:26:57] <_joe_> it's funny how analytics and "dev env" are the ethernal pains for DBAs everywhere I've been [11:27:27] * hashar throws grenades [11:29:11] <_joe_> hashar: CI is part of the "dev env" definition above, obviously :P [11:29:31] (03CR) 10Springle: [C: 032] repool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184607 (owner: 10Springle) [11:29:35] (03Merged) 10jenkins-bot: repool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184607 (owner: 10Springle) [11:30:25] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 205, down: 0, dormant: 0, excluded: 0, unused: 0 [11:30:43] !log springle Synchronized wmf-config/db-eqiad.php: repool db1067, warm up (duration: 00m 05s) [11:30:48] Logged the message, Master [11:30:51] <_joe_> !log reimaging mw1003, mw1004 [11:30:54] Logged the message, Master [11:32:54] _joe_: na CI is production grade, happening to work on top of labs infra [11:33:47] _joe_: but yeah, it is rather messy :-( [11:33:50] _joe_: scap still failing on mw1062 jfyi.. what sort of depool did you intend to do yesterday? [11:35:50] 3Ops-Access-Requests: EventLogging access for Tilman - https://phabricator.wikimedia.org/T86533#973192 (10mark) Alright, approval to skip the 3 day rule. 
:) [11:36:15] <_joe_> springle: I depooled it from pybal [11:36:31] <_joe_> springle: I should just turn it of and depool it from mediawiki-installation as well [11:36:49] oh this is the two-actions-required thing again [11:36:52] :) np [11:37:02] <_joe_> yeah [11:37:16] _joe_: tl;dr CI has a bit of flows https://upload.wikimedia.org/wikipedia/commons/e/e9/Integrationwikimediaci-zuul_git_flows.svg . And I am going to make it even more complicated :-/ [11:42:09] (03CR) 10Ori.livneh: Reuse parsoid varnish for cxserver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/181613 (https://phabricator.wikimedia.org/T76200) (owner: 10Alexandros Kosiaris) [11:49:09] (03PS5) 10Filippo Giunchedi: lsearchd: remove lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/183462 (https://phabricator.wikimedia.org/T85009) [11:49:16] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] lsearchd: remove lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/183462 (https://phabricator.wikimedia.org/T85009) (owner: 10Filippo Giunchedi) [11:49:24] (03PS1) 10Ori.livneh: VCL: Standardize whitespace in parens [puppet] - 10https://gerrit.wikimedia.org/r/184608 [11:50:07] <_joe_> ori: is this your version of "counting the sheeps"? 
[11:50:37] 🐑 🐑 🐑 🐑 [11:51:00] <_joe_> btw, you should really use f.lux if you're on a mac, it helps your sleep cycle by turning down the blue light from the monitor at night time [11:51:06] <_joe_> it helped me a lot [11:51:16] but the reddish hue is so garish [11:51:41] i probably should give it a shot [11:51:48] <_joe_> (also on linux, wow) [11:52:34] <_joe_> or you could get away from the damn screen after 8 PM, but I figured that's not an option [11:52:42] <_joe_> :) [11:53:31] PROBLEM - puppet last run on search1015 is CRITICAL: CRITICAL: puppet fail [11:54:06] that'd be my last change I think, checking [11:55:31] PROBLEM - puppet last run on search1013 is CRITICAL: CRITICAL: puppet fail [12:00:11] PROBLEM - puppet last run on lvs1006 is CRITICAL: CRITICAL: puppet fail [12:01:41] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: puppet fail [12:02:00] that's search_pool, fix [12:02:02] fixing [12:03:35] (03PS1) 10Filippo Giunchedi: lsearchd: remove all lvs references [puppet] - 10https://gerrit.wikimedia.org/r/184611 (https://phabricator.wikimedia.org/T85009) [12:03:53] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] lsearchd: remove all lvs references [puppet] - 10https://gerrit.wikimedia.org/r/184611 (https://phabricator.wikimedia.org/T85009) (owner: 10Filippo Giunchedi) [12:04:01] PROBLEM - nutcracker port on mw1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:04:01] PROBLEM - nutcracker process on mw1004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:04:12] PROBLEM - nutcracker process on mw1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:04:13] PROBLEM - puppet last run on mw1004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:04:31] PROBLEM - puppet last run on mw1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[12:04:31] PROBLEM - salt-minion processes on mw1004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:04:42] PROBLEM - salt-minion processes on mw1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:04:51] PROBLEM - DPKG on mw1004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:05:02] PROBLEM - DPKG on mw1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:05:02] PROBLEM - Disk space on mw1004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:05:12] PROBLEM - Disk space on mw1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:05:31] PROBLEM - RAID on mw1004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:05:42] PROBLEM - RAID on mw1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:06:01] PROBLEM - configured eth on mw1004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:06:02] PROBLEM - dhclient process on mw1004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:06:02] PROBLEM - configured eth on mw1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:06:12] PROBLEM - dhclient process on mw1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:06:26] (03CR) 10Alexandros Kosiaris: Reuse parsoid varnish for cxserver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/181613 (https://phabricator.wikimedia.org/T76200) (owner: 10Alexandros Kosiaris) [12:06:31] PROBLEM - nutcracker port on mw1004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[12:06:51] RECOVERY - puppet last run on lvs1006 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [12:09:21] RECOVERY - RAID on mw1003 is OK: OK: no RAID installed [12:09:31] RECOVERY - salt-minion processes on mw1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:09:41] RECOVERY - configured eth on mw1003 is OK: NRPE: Unable to read output [12:09:51] RECOVERY - DPKG on mw1003 is OK: All packages OK [12:09:51] RECOVERY - dhclient process on mw1003 is OK: PROCS OK: 0 processes with command name dhclient [12:09:52] RECOVERY - nutcracker port on mw1003 is OK: TCP OK - 0.000 second response time on port 11212 [12:10:01] RECOVERY - Disk space on mw1003 is OK: DISK OK [12:10:11] RECOVERY - nutcracker process on mw1003 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [12:11:22] RECOVERY - RAID on mw1004 is OK: OK: no RAID installed [12:11:32] RECOVERY - salt-minion processes on mw1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:11:52] RECOVERY - configured eth on mw1004 is OK: NRPE: Unable to read output [12:12:01] RECOVERY - DPKG on mw1004 is OK: All packages OK [12:12:01] RECOVERY - dhclient process on mw1004 is OK: PROCS OK: 0 processes with command name dhclient [12:12:11] RECOVERY - Disk space on mw1004 is OK: DISK OK [12:12:12] RECOVERY - puppet last run on search1013 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [12:12:12] RECOVERY - puppet last run on search1015 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [12:12:12] RECOVERY - nutcracker process on mw1004 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [12:12:21] RECOVERY - nutcracker port on mw1004 is OK: TCP OK - 0.000 second response time on port 11212 [12:15:22] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 
failures [12:16:34] (03PS4) 10Filippo Giunchedi: lsearchd: remove udp2log configuration [puppet] - 10https://gerrit.wikimedia.org/r/183469 (https://phabricator.wikimedia.org/T85009) [12:16:41] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] lsearchd: remove udp2log configuration [puppet] - 10https://gerrit.wikimedia.org/r/183469 (https://phabricator.wikimedia.org/T85009) (owner: 10Filippo Giunchedi) [12:20:02] PROBLEM - puppet last run on mw1003 is CRITICAL: CRITICAL: Puppet has 2 failures [12:21:05] RECOVERY - puppet last run on mw1003 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [12:21:55] PROBLEM - puppet last run on mw1004 is CRITICAL: CRITICAL: Puppet has 8 failures [12:24:16] RECOVERY - puppet last run on mw1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:29:11] 3ops-core: monitor SSD wear levels - https://phabricator.wikimedia.org/T86556#973261 (10mark) p:5Normal>3Low [12:33:02] (03PS1) 10Yuvipanda: logstash: Move IRC Bot definition out of ::beta role [puppet] - 10https://gerrit.wikimedia.org/r/184618 (https://phabricator.wikimedia.org/T86642) [12:33:45] (03CR) 10Yuvipanda: "@bd808: Also, is this really used / required? I thought things just went to SAL. 
We should either replace SAL with a system like this, or " [puppet] - 10https://gerrit.wikimedia.org/r/184618 (https://phabricator.wikimedia.org/T86642) (owner: 10Yuvipanda) [12:33:47] bd808|BUFFER: ^ [12:34:13] (03PS1) 10Filippo Giunchedi: lsearchd: remove lucene role and class [puppet] - 10https://gerrit.wikimedia.org/r/184620 (https://phabricator.wikimedia.org/T86150) [12:37:32] (03PS2) 10Yuvipanda: logstash: Split beta role into two composable clearer ones [puppet] - 10https://gerrit.wikimedia.org/r/184618 (https://phabricator.wikimedia.org/T86642) [12:40:57] (03PS3) 10Yuvipanda: logstash: Split beta role into two composable clearer ones [puppet] - 10https://gerrit.wikimedia.org/r/184618 (https://phabricator.wikimedia.org/T86642) [12:45:46] (03PS1) 10Mjbmr: Fix project talk namespace for mznwiki T85383 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184622 [12:48:21] (03PS1) 10Yuvipanda: beta: Kill role::syslog::centralserver::beta [puppet] - 10https://gerrit.wikimedia.org/r/184623 (https://phabricator.wikimedia.org/T86645) [12:52:08] ^d: sigh, I failed at using kill-lsearchd as a topic because git-review would reset it when updating a change and I'm using multiple branches [12:54:43] (03PS2) 10Yuvipanda: beta: Kill role::syslog::centralserver::beta [puppet] - 10https://gerrit.wikimedia.org/r/184623 (https://phabricator.wikimedia.org/T86645) [12:58:08] (03PS1) 10Filippo Giunchedi: remove service endpoints for lsearchd [dns] - 10https://gerrit.wikimedia.org/r/184624 (https://phabricator.wikimedia.org/T85009) [12:59:46] (03PS1) 10Mjbmr: Set wgRestrictDisplayTitle to false for fawikinews T85380 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184625 [12:59:50] 3operations: Decomission lsearchd - https://phabricator.wikimedia.org/T85009#973328 (10fgiunchedi) see https://gerrit.wikimedia.org/r/#/c/184620/ for puppet decom and https://gerrit.wikimedia.org/r/#/c/184624/ for dns pending machine deprovisioning, @mark did we have an use for those already or 
back to the spar... [13:02:31] (03PS3) 10Yuvipanda: beta: Kill role::syslog::centralserver::beta [puppet] - 10https://gerrit.wikimedia.org/r/184623 (https://phabricator.wikimedia.org/T86645) [13:02:32] 3operations: Decomission lsearchd - https://phabricator.wikimedia.org/T85009#973336 (10mark) >>! In T85009#973328, @fgiunchedi wrote: > see https://gerrit.wikimedia.org/r/#/c/184620/ for puppet decom and https://gerrit.wikimedia.org/r/#/c/184624/ for dns > > pending machine deprovisioning, @mark did we have an... [13:02:51] 3operations: Decommission lsearchd - https://phabricator.wikimedia.org/T85009#973339 (10mark) [13:04:41] (03PS4) 10Yuvipanda: beta: Kill role::syslog::centralserver::beta [puppet] - 10https://gerrit.wikimedia.org/r/184623 (https://phabricator.wikimedia.org/T86645) [13:05:30] (03CR) 10Yuvipanda: [C: 032] beta: Kill role::syslog::centralserver::beta [puppet] - 10https://gerrit.wikimedia.org/r/184623 (https://phabricator.wikimedia.org/T86645) (owner: 10Yuvipanda) [13:18:35] 3WMF-NDA-Requests, operations: Grant Nikerabbit access to WMF-NDA group - https://phabricator.wikimedia.org/T86632#973369 (10Aklapper) p:5Triage>3Normal [13:19:58] (03PS2) 10Yuvipanda: Fix typo in delete-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/184585 (owner: 10Alexandros Kosiaris) [13:20:20] (03CR) 10Yuvipanda: [C: 032] Fix typo in delete-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/184585 (owner: 10Alexandros Kosiaris) [13:20:38] <^d> godog: It's all good :) [13:20:48] * ^d yawns, wonders why he's awake [13:20:48] It’s all goodoog? [13:30:52] (03CR) 10Yuvipanda: [C: 031] "Tested by cherry-picking on beta, seems good." 
[puppet] - 10https://gerrit.wikimedia.org/r/184618 (https://phabricator.wikimedia.org/T86642) (owner: 10Yuvipanda) [13:39:53] 3ops-core: Upgrade all HTTP frontends to Debian jessie - https://phabricator.wikimedia.org/T86648#973396 (10faidon) 3NEW a:3BBlack [13:40:36] 3ops-core: Upgrade all HTTP frontends to Debian jessie - https://phabricator.wikimedia.org/T86648#973405 (10yuvipanda) [13:41:25] eep, thanks [13:42:06] PROBLEM - Host payments1003 is DOWN: PING CRITICAL - Packet loss = 100% [13:42:53] (03PS1) 10Yuvipanda: beta: Remove defunct graphite/icinga based monitoring [puppet] - 10https://gerrit.wikimedia.org/r/184628 [13:43:03] (03PS2) 10Yuvipanda: beta: Remove defunct graphite/icinga based monitoring [puppet] - 10https://gerrit.wikimedia.org/r/184628 [13:43:08] paged about payments1003 [13:43:10] payments1003? [13:43:12] heh [13:43:46] there he is to save the day [13:43:57] which one paged? [13:44:02] Jeff_Green: payments1003 [13:44:04] payments1003 [13:44:06] HERE I AM TO SAVE THE DAY!!!!! [13:44:12] i probably caused it...looking [13:44:13] (03CR) 10Yuvipanda: [C: 032] beta: Remove defunct graphite/icinga based monitoring [puppet] - 10https://gerrit.wikimedia.org/r/184628 (owner: 10Yuvipanda) [13:44:42] yeah it was me. all better [13:44:54] i'm doing package updates and reboots, forgot to mute it [13:45:11] !log package updates and reboots on many fundraising hosts [13:45:14] 3ops-core: Add support for setting weight=0 when depooling - https://phabricator.wikimedia.org/T86650#973415 (10faidon) 3NEW a:3BBlack [13:45:19] Logged the message, Master [13:45:25] RECOVERY - Host payments1003 is UP: PING OK - Packet loss = 0%, RTA = 1.51 ms [13:47:30] <_joe_> got paged [13:47:49] yeah, my fault. sorry. [13:48:16] 3ops-core: Add support for setting weight=0 when depooling - https://phabricator.wikimedia.org/T86650#973423 (10mark) Let's make this a configurable boolean option per LVS service. 
[13:50:19] do we have an easy way to determine what package version is installed on various hosts in the entire cluster? [13:50:52] i'm looking for hosts with nodejs installed to see if we have a precise host with nodejs 1.x running [13:56:33] you can basically do a salt across all nodes running dpkg-query [13:57:27] we keep upgrade candidates in the puppet database (and servermon) [13:57:32] but not all packages/all versions [13:57:36] so what bblack said [13:57:38] ok [13:59:31] so e.g. salt '*.eqiad.wmnet' cmd.run 'dpkg -l nodejs' [13:59:40] 3ops-core: Fix LVS "sh" shortcomings - https://phabricator.wikimedia.org/T86651#973429 (10faidon) 3NEW [14:00:11] or dpkg-query -W or something [14:00:32] yeah something like that. sometimes it's helpful to pipe it into a local grep (inside cmd.run) to match the versions you care about and just echo $? after [14:00:39] easier to sort 0/1 in the local aggregate output [14:00:57] --output=raw helps a lot with grep [14:01:01] ah ok, i didn't know whether I could include a pipe in the command [14:01:02] since it's all in one line [14:01:20] I really wish salt had automatic aggregation of duplicate outputs for cases like this (like my shitty perl dsh script did!) [14:01:39] dpkg-query -W nodejs gives nice 1 line output [14:01:44] it could say "cp1010,cp1011,.....: 0\ncp2020,....: 1" [14:02:18] Jeff_Green: yes, but salt isn't, it prints $hostname\n\t$output [14:02:21] 3ops-core: Fix LVS "sh" shortcomings - https://phabricator.wikimedia.org/T86651#973435 (10mark) FWIW: An alternative sh implementation that I've written for an old kernel and fixes some of these issues (a looong time ago), lives [[ http://svn.wikimedia.org/viewvc/mediawiki/trunk/routing/lvs/net/ipv4/ipvs/ip_vs_w...
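The aggregation wished for above (hosts with identical output collapsed onto one line) can be sketched as a small post-processing step. This assumes salt's default text output is a "hostname:" line followed by indented output lines, as described in the channel. The function names are hypothetical, and neither salt nor the Perl dsh script mentioned is known to work this way.

```python
from collections import defaultdict


def parse_salt_output(text):
    """Parse 'hostname:' lines followed by indented output into {host: output}."""
    results = {}
    host = None
    for line in text.splitlines():
        if not line.strip():
            continue
        # An unindented line ending in ':' starts a new host block (assumed format).
        if not line[0].isspace() and line.rstrip().endswith(":"):
            host = line.rstrip()[:-1]
            results[host] = []
        elif host is not None:
            results[host].append(line.strip())
    return {name: "\n".join(lines) for name, lines in results.items()}


def group_by_output(results):
    """Invert {host: output} into {output: sorted list of hosts}."""
    groups = defaultdict(list)
    for host, output in results.items():
        groups[output].append(host)
    return {output: sorted(hosts) for output, hosts in groups.items()}


def format_groups(groups):
    """Render one 'host1,host2: output' line per distinct output."""
    ordered = sorted(groups.items(), key=lambda item: item[1])
    return "\n".join(f"{','.join(hosts)}: {output}" for output, hosts in ordered)
```

One way to use it would be to capture the output of the salt command suggested in the channel (for example the dpkg-query run piped into grep with echo $? after it) and pass the text through the three functions in order.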
[14:03:50] Number of permanent LOCAL HACK cherry picks on deployment-prep down to 1 now [14:03:51] \o/ [14:04:11] 3ops-core: Fix LVS "sh" shortcomings - https://phabricator.wikimedia.org/T86651#973445 (10faidon) [14:04:30] 3operations: Puppet's apache2_test_config_and_restart fails to restart apache - https://phabricator.wikimedia.org/T86652#973447 (10Joe) 3NEW a:3Joe [14:07:06] 3ops-core: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#973461 (10faidon) 3NEW a:3faidon [14:07:28] <_joe_> !log reimaging mw1005,mw1006 [14:07:32] Logged the message, Master [14:07:47] we can’t really kill subversion, can we? [14:08:14] <_joe_> ?? [14:11:41] (03CR) 10Chad: "lgtm, 2 last pieces to clean up:" [puppet] - 10https://gerrit.wikimedia.org/r/184620 (https://phabricator.wikimedia.org/T86150) (owner: 10Filippo Giunchedi) [14:12:19] I think not [14:12:22] ^d is here ;) [14:12:29] <^d> We could crawl it like we're doing with BZ [14:12:43] <^d> Do a static html dump [14:13:26] PROBLEM - Host mw1005 is DOWN: PING CRITICAL - Packet loss = 100% [14:13:46] ^d: yeah, that would be nice. [14:14:02] _joe_: paravoid ^d asking because that was given as a reason in https://phabricator.wikimedia.org/T67591 [14:14:05] am digging to understand why [14:14:08] <^d> I was also thinking of importing the old SVN repo to Phab. That'd allow us to keep the repo around (still, it has its moments), but shut down the actual standalone svn service. [14:14:16] PROBLEM - Host mw1006 is DOWN: PING CRITICAL - Packet loss = 100% [14:14:33] we could do (a) make sure all svn repos are deleted that actually have been imported over to git already (hopefully none left like that) and (b) do a git-svn conversion with full history of the rest to a tiny legacy git server somewhere [14:14:46] <^d> There's a ton of shit that was never imported to git. 
[14:14:47] and then maybe even just archive that gitserver's data if nobody's using them [14:15:29] <^d> (we only have one svn repo, it was a giant repo with lots of folders, fwiw) [14:15:32] yeah [14:15:36] oh ok [14:15:41] import to phab, kill service with redirect? [14:15:51] if we import to phab, will we import as svn or via git-svn? [14:15:56] RECOVERY - Host mw1005 is UP: PING OK - Packet loss = 0%, RTA = 2.78 ms [14:16:03] so just do one automatic svn->git conversion with full history, set that up as the "legacy" repo, add it to our current main git server [14:16:05] RECOVERY - Host mw1006 is UP: PING OK - Packet loss = 0%, RTA = 1.78 ms [14:16:19] <^d> We tried that before the git conversion you know :) [14:16:21] maybe make it readonly, so it's available for git perusal, but people need to copy code out to a fresh repo if they want to actually work with it [14:16:27] <^d> To see if we wanted to split up the repo or just keep a giant one. [14:16:37] <^d> It ends up being like 20GB and completely unusable :p [14:16:44] yes, but at this point in time everything in constant use should be converted already [14:16:57] it's just a legacy escape hatch for the remainder [14:17:06] oh is svn full of binaries or something? [14:17:25] <^d> There's a bunch, yeah. 
[14:17:39] ok, step 0: delete the offending binaries :) [14:17:47] <^d> But it's mainly that it's a 10-year-old repo with > 100k commits and hundreds and hundreds of folders of /stuff/ [14:17:48] <^d> :) [14:18:06] really, 100k commits and hundreds of folders isn't an issue once converted to git [14:18:16] 100k commits full of big binary assets is, though [14:18:27] PROBLEM - puppet last run on mw1005 is CRITICAL: Connection refused by host [14:18:27] PROBLEM - RAID on mw1006 is CRITICAL: Connection refused by host [14:18:27] 3operations: Decomission svn.wikimedia.org - https://phabricator.wikimedia.org/T86655#973496 (10yuvipanda) 3NEW [14:18:35] PROBLEM - dhclient process on mw1005 is CRITICAL: Connection refused by host [14:18:36] PROBLEM - nutcracker process on mw1006 is CRITICAL: Connection refused by host [14:18:46] PROBLEM - nutcracker port on mw1005 is CRITICAL: Connection refused by host [14:18:46] PROBLEM - configured eth on mw1006 is CRITICAL: Connection refused by host [14:18:52] 3operations, ops-core, Analytics: Deprecate HTTPS udp2log stream? 
- https://phabricator.wikimedia.org/T86656#973506 (10faidon) 3NEW a:3faidon [14:18:59] PROBLEM - RAID on mw1005 is CRITICAL: Connection refused by host [14:18:59] PROBLEM - configured eth on mw1005 is CRITICAL: Connection refused by host [14:19:00] PROBLEM - dhclient process on mw1006 is CRITICAL: Connection refused by host [14:19:00] PROBLEM - salt-minion processes on mw1005 is CRITICAL: Connection refused by host [14:19:00] PROBLEM - nutcracker port on mw1006 is CRITICAL: Connection refused by host [14:19:07] PROBLEM - DPKG on mw1005 is CRITICAL: Connection refused by host [14:19:08] PROBLEM - Disk space on mw1005 is CRITICAL: Connection refused by host [14:19:15] PROBLEM - puppet last run on mw1006 is CRITICAL: Connection refused by host [14:19:15] PROBLEM - nutcracker process on mw1005 is CRITICAL: Connection refused by host [14:19:16] PROBLEM - DPKG on mw1006 is CRITICAL: Connection refused by host [14:19:25] PROBLEM - Disk space on mw1006 is CRITICAL: Connection refused by host [14:19:28] 3operations: Decomission svn.wikimedia.org - https://phabricator.wikimedia.org/T86655#973496 (10yuvipanda) p:5Triage>3Low [14:19:45] PROBLEM - salt-minion processes on mw1006 is CRITICAL: Connection refused by host [14:20:25] <^d> YuviPanda: To answer your earlier question, no, we won't use git-svn. It's slow as fuck and creates messy history from svn. [14:20:43] <^d> We shall resurrect svn2git :) [14:21:00] ^d: phab supports svn natively :D [14:21:05] <^d> Yes I know. [14:21:09] but I’m afraid that will bring the svn loyalists out of the woodwork [14:21:10] so maybe not [14:21:26] <^d> I was thinking of just importing it as a single repo with the callsign SVN [14:21:31] <^d> And keep it r/o [14:23:47] <^d> I should probably test SVN on my VM. [14:23:54] <^d> See how it actually looks in Phab for a r/o import.
[14:32:05] (03CR) 10Manybubbles: [C: 031] lsearchd: remove lucene role and class [puppet] - 10https://gerrit.wikimedia.org/r/184620 (https://phabricator.wikimedia.org/T86150) (owner: 10Filippo Giunchedi) [14:36:16] !log ssl runtime config updated for +3DES/-RC4 ( I87616455abd58c986aa960348fc20c017f097716 ) [14:36:19] Logged the message, Master [14:41:37] 3ops-core: Expand HTTP frontend clusters with new hardware - https://phabricator.wikimedia.org/T86663#973608 (10faidon) 3NEW [14:43:04] PROBLEM - nutcracker port on mw1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:43:23] PROBLEM - nutcracker port on mw1005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:43:23] PROBLEM - nutcracker process on mw1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:43:33] PROBLEM - nutcracker process on mw1005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:43:33] PROBLEM - puppet last run on mw1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:43:43] PROBLEM - puppet last run on mw1005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:43:44] PROBLEM - salt-minion processes on mw1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:43:54] PROBLEM - salt-minion processes on mw1005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:44:03] PROBLEM - DPKG on mw1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:44:14] PROBLEM - Disk space on mw1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:44:14] PROBLEM - DPKG on mw1005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:44:33] PROBLEM - Disk space on mw1005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[14:44:41] 3ops-core: HTTPS performance & UA adoption metrics - https://phabricator.wikimedia.org/T86664#973617 (10faidon) 3NEW [14:44:43] PROBLEM - RAID on mw1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:44:54] PROBLEM - RAID on mw1005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:45:04] PROBLEM - configured eth on mw1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:45:16] PROBLEM - dhclient process on mw1006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:45:16] PROBLEM - configured eth on mw1005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:45:33] PROBLEM - dhclient process on mw1005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:47:43] RECOVERY - dhclient process on mw1006 is OK: PROCS OK: 0 processes with command name dhclient [14:47:44] RECOVERY - nutcracker port on mw1006 is OK: TCP OK - 0.000 second response time on port 11212 [14:47:53] RECOVERY - Disk space on mw1006 is OK: DISK OK [14:48:03] RECOVERY - nutcracker process on mw1006 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [14:48:14] RECOVERY - RAID on mw1006 is OK: OK: no RAID installed [14:49:25] RECOVERY - salt-minion processes on mw1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:49:25] RECOVERY - salt-minion processes on mw1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:49:25] RECOVERY - configured eth on mw1006 is OK: NRPE: Unable to read output [14:49:25] RECOVERY - DPKG on mw1006 is OK: All packages OK [14:49:25] RECOVERY - configured eth on mw1005 is OK: NRPE: Unable to read output [14:49:25] RECOVERY - DPKG on mw1005 is OK: All packages OK [14:49:25] RECOVERY - dhclient process on mw1005 is OK: PROCS OK: 0 processes with command name dhclient [14:49:25] RECOVERY - Disk space on mw1005 is OK: DISK OK [14:49:25] RECOVERY - 
nutcracker port on mw1005 is OK: TCP OK - 0.000 second response time on port 11212 [14:49:25] RECOVERY - nutcracker process on mw1005 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [14:49:25] PROBLEM - puppet last run on mw1006 is CRITICAL: CRITICAL: Puppet has 4 failures [14:49:43] RECOVERY - RAID on mw1005 is OK: OK: no RAID installed [14:50:53] PROBLEM - puppet last run on mw1005 is CRITICAL: CRITICAL: Puppet has 4 failures [14:51:22] 3ops-core: HTTPS performance tuning - https://phabricator.wikimedia.org/T86666#973645 (10faidon) 3NEW [14:51:30] 3ops-core: HTTPS performance tuning - https://phabricator.wikimedia.org/T86666#973645 (10faidon) [14:54:50] 3operations, Beta-Cluster: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#973669 (10yuvipanda) 3NEW [14:55:42] 3operations: Following Up - https://phabricator.wikimedia.org/T86670#973681 (10emailbot) [14:58:51] 3Beta-Cluster, operations: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#973691 (10yuvipanda) New users created by wikitech should also be set to /bin/bash rather than sillyshell [14:59:01] manybubbles: ping re: poolcounter deploy, happy to do it today, let me know when you are online! [14:59:01] (03PS1) 10Yuvipanda: Make wikitech default shell sillyshell [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184635 (https://phabricator.wikimedia.org/T86668) [14:59:13] godog: so online [14:59:23] hmm [14:59:27] is there SWAT today? [14:59:29] we'd have to flip a switch to actually use it [14:59:31] YuviPanda: sure [15:00:11] hmm, I’ll probably be on a bus. 
Should just wait for tomorrow's [15:00:14] RECOVERY - puppet last run on mw1006 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [15:00:24] RECOVERY - puppet last run on mw1005 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [15:01:20] 3ops-core: HTTPS RFC5077 session tickets encryption key rollovers - https://phabricator.wikimedia.org/T86671#973696 (10faidon) 3NEW [15:01:21] (03PS1) 10coren: Labs: Unmount /public/backups [puppet] - 10https://gerrit.wikimedia.org/r/184636 [15:01:28] 3ops-core: HTTPS RFC5077 session tickets encryption key rollovers - https://phabricator.wikimedia.org/T86671#973696 (10faidon) [15:01:32] YuviPanda: ^^ stfu patch for puppet [15:01:51] YuviPanda: you can get someone to validate it for you if you want [15:01:59] (03CR) 10Yuvipanda: [C: 031] Labs: Unmount /public/backups [puppet] - 10https://gerrit.wikimedia.org/r/184636 (owner: 10coren) [15:02:13] manybubbles: yeah, emailing ops@ now. Is https://gerrit.wikimedia.org/r/#/c/184635/ [15:02:21] (03CR) 10coren: [C: 032] "Quiet, you." [puppet] - 10https://gerrit.wikimedia.org/r/184636 (owner: 10coren) [15:02:57] I hadn't anticipated puppet to be whiny about it. :-( [15:03:13] 3Project-Creators, operations: HTTPS phabricator project(s) - https://phabricator.wikimedia.org/T86063#973706 (10faidon) [15:03:45] (03CR) 10Hashar: [C: 04-1] Make wikitech default shell sillyshell (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184635 (https://phabricator.wikimedia.org/T86668) (owner: 10Yuvipanda) [15:03:55] 3operations: Decomission svn.wikimedia.org - https://phabricator.wikimedia.org/T86655#973707 (10Aklapper) Would that mean that extension codebases in SVN that have not been migrated to Git would be lost and links to codebases on extension homepages on mediawiki.org would break? If so, any idea how many (ancient... 
[15:04:42] 3ops-core: Support SPDY - https://phabricator.wikimedia.org/T35890#973717 (10faidon) [15:04:52] (03CR) 10Yuvipanda: Make wikitech default shell sillyshell (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184635 (https://phabricator.wikimedia.org/T86668) (owner: 10Yuvipanda) [15:05:01] manybubbles: oh ok, I'll upload the package to precise shortly [15:05:33] 3operations: Decomission svn.wikimedia.org - https://phabricator.wikimedia.org/T86655#973721 (10yuvipanda) @Aklapper: No, they would probably be migrated to phabricator (or some other such tool) in a readonly fashion. [15:06:43] <_joe_> !log reimaging mw1007 mw1008 [15:07:31] <_joe_> !log jobrunners 100% on HHVM [15:08:12] <_joe_> oh no logmsgbot ?/quit [15:08:47] <_joe_> anyways, one step closer to HHVM conquering the world [15:08:55] (03PS1) 10Anomie: Revert "Revert of Iab860b8a5: Make puppet cronjob to run SecurePoll/cli/purgePrivateVoteData.php" [puppet] - 10https://gerrit.wikimedia.org/r/184637 [15:09:10] YuviPanda: BTW, not sure if that's on purpose, but your phab avatar makes you look like a Bond Villain and/or Sith Lord. [15:09:32] Coren: :D Fairly on purpose, I just wish it didn’t try to force it to be square. [15:09:42] 3Beta-Cluster, operations: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#973726 (10Chad) Sillyshell, for those unaware, was a dead-simple shell that wrapped svnserve & co. and we used to run it on the Subversion box. We committed using svn+ssh but didn't want to giv... [15:10:33] 3operations: Switch HAT appservers to trusty's ICU - https://phabricator.wikimedia.org/T86096#973731 (10Joe) [15:10:33] Coren: one of my few surviving pics from way back then... 
[15:10:34] 3operations: Complete the use of HHVM over Zend PHP on the Wikimedia cluster - https://phabricator.wikimedia.org/T86081#973730 (10Joe) [15:10:35] 3ops-requests, operations: Provision more job runners - stalled until migrated to HHVM - https://phabricator.wikimedia.org/T84702#973732 (10Joe) [15:10:38] 3ops-core: Convert jobrunners to HHVM - https://phabricator.wikimedia.org/T78765#973729 (10Joe) 5Open>3Resolved [15:10:51] Coren: also, I emailed ops@ about an LDAP switch of loginShell. Do chime in [15:11:53] Oy. puppetmaster just broke. That wasn't me! [15:12:08] 3ops-requests, operations: Provision more job runners - stalled until migrated to HHVM - https://phabricator.wikimedia.org/T84702#930400 (10Joe) Average utilization on the jobrunners cluster is low enough that I don't see a reason for this. Resolving as declined. [15:12:10] undefined method `[]' for nil:NilClass at /etc/puppet/manifests/site.pp:46 [15:12:12] <_joe_> Coren: wat? [15:12:32] 3ops-requests, operations: Provision more job runners - stalled until migrated to HHVM - https://phabricator.wikimedia.org/T84702#973738 (10Joe) 5stalled>3declined [15:13:41] !log upload poolcounter 1.0.3 to precise-wikimedia [15:13:48] !log restarted mysql on virt1000, wikitech failing with db errors. seems fine now [15:13:59] Coren: ^ might’ve been intermittent during the restart? [15:14:04] <_joe_> YuviPanda: logmsgbot doesn't work [15:14:25] gah [15:14:40] _joe_: let me go restart them [15:14:55] (documentation at https://wikitech.wikimedia.org/wiki/Morebots fwiw) [15:15:07] YuviPanda: It's back indeed. Horrid timing then. [15:16:08] But omg seriously!? Puppet is really really stupid. So, if the mount is unavailable and you have ensure => mounted it fails noisily. Reasonable. If you ensure => umounted, then it breaks after the second run because it's not mounted. 
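A minimal sketch of the idempotency Coren is asking for at 15:16:08: ensuring a filesystem is unmounted should succeed quietly when it is already unmounted. This is only a model. The `mounted` set is a hypothetical in-memory stand-in for the mount table, and none of this is Puppet's actual provider code.

```python
def ensure_unmounted(mounted: set, path: str) -> bool:
    """Make sure `path` is not mounted; return True only if something changed.

    `mounted` is a hypothetical in-memory stand-in for the system mount table.
    """
    if path not in mounted:
        # Already in the desired state: report "no change" instead of failing.
        return False
    mounted.remove(path)
    return True


mounts = {"/public/backups", "/home"}
first = ensure_unmounted(mounts, "/public/backups")   # performs the unmount
second = ensure_unmounted(mounts, "/public/backups")  # no-op, no error
```

Running the same call twice leaves the same end state, which is the property the second puppet run was expected to have.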
[15:16:53] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 2 failures [15:17:23] Coren: ^ seems back again (puppet failures due to mysql issues on virt1000) [15:17:27] !log test [15:17:31] Why in hell would ensure => unmounted fail with an error if the filesystem isn't mounted?! Resource definitions are supposed to be idempotent. [15:17:31] Logged the message, Master [15:17:47] !log restarted mysql on virt1000, wikitech failing with db errors. seems fine now (4 minutes ago) [15:17:50] Logged the message, Master [15:17:50] _joe_: it’s back! [15:17:53] !log upload poolcounter 1.0.3 to precise-wikimedia [15:17:56] Logged the message, Master [15:19:23] godog: looks like I'm going to head to the doctor pretty soon - have to have my wrist looked at [15:19:31] was able to get an appointment. [15:19:42] 3operations: Decomission svn.wikimedia.org - https://phabricator.wikimedia.org/T86655#973754 (10Aklapper) I think you just identified a separate task that is blocking this one? ;) [15:19:44] so going to leave in about half an hour [15:19:58] manybubbles: ouch :( sounds painful [15:20:06] manybubbles: wrist problems hi5 [15:20:06] godog: it is! [15:20:17] manybubbles: can do it right now or when you are back, whichever suits best [15:20:18] hi5 on the other hand, tenderly, that is. [15:20:21] hurts way worse than the other one [15:20:58] godog: what is the plan for the deploy? just install the package and bounce the service on all the servers running it? [15:21:00] manybubbles: hope it gets better! 
[15:21:08] yeah I had my wrists giving me trouble too, found out ~2w later I needed to adjust the chair [15:21:17] manybubbles: yep, helium and potassium [15:21:35] godog: cool - I'm ready to watch logs and let you know if things look bad [15:21:46] the rollback plan for this is just to rollback [15:21:48] 3operations: Migrate leftover old svn content to a readonly repo somewher - https://phabricator.wikimedia.org/T86674#973756 (10yuvipanda) 3NEW [15:22:44] manybubbles: yep, starting with helium [15:22:50] (03PS2) 10Yuvipanda: Make wikitech default shell /bin/bash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184635 (https://phabricator.wikimedia.org/T86668) [15:22:57] godog: ready when you are [15:23:08] !log upgrade and restart poolcounter on helium [15:23:14] Logged the message, Master [15:24:14] (03CR) 10Hashar: [C: 031] Make wikitech default shell /bin/bash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184635 (https://phabricator.wikimedia.org/T86668) (owner: 10Yuvipanda) [15:24:16] (03CR) 10Chad: [C: 031] Make wikitech default shell /bin/bash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184635 (https://phabricator.wikimedia.org/T86668) (owner: 10Yuvipanda) [15:24:25] Jan 13 15:24:10 mw1062: [log_config:warn] [pid 15595] (30)Read-only file system: [client 10.64.1.6:50867] AH00646: Error writing to /var/log/apache2/other_vhosts_access.log [15:24:37] ggrr ensure=>latest already upgraded poolcounter [15:24:38] not us- but something worth looking into [15:24:45] godog: really! [15:24:58] did it bounce it and everything? [15:25:30] it bounced it and then threw its toys out of the pram, a manual start did it [15:25:39] PROBLEM - DPKG on helium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:25:54] * Coren wants to strangle puppet. 
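A rough model of why poolcounter upgraded itself before godog ran anything by hand: a package resource with `ensure => latest` follows whatever version the repository offers, whereas `ensure => installed` only installs a missing package. The function name and the "older" placeholder are assumptions for illustration; the log does not say which version was running before 1.0.3.

```python
from typing import Optional


def next_action(ensure: str, installed: Optional[str], in_repo: str) -> str:
    """Return what one agent run would do for a package (simplified model)."""
    if installed is None:
        return "install"   # both modes install a missing package
    if ensure == "latest" and installed != in_repo:
        return "upgrade"   # 'latest' tracks the repository
    return "nothing"       # 'installed' leaves an existing version alone


# "older" is a placeholder; the previous version is not stated in the log.
with_latest = next_action("latest", installed="older", in_repo="1.0.3")
with_installed = next_action("installed", installed="older", in_repo="1.0.3")
```

Under this model, uploading a new package to the repository is enough to trigger a fleet-wide upgrade on the next agent run when `latest` is used.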
[15:26:09] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Puppet has 1 failures [15:26:10] PROBLEM - DPKG on potassium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:26:29] PROBLEM - puppet last run on potassium is CRITICAL: CRITICAL: Puppet has 1 failures [15:26:36] should recover shortly [15:26:46] godog: well nothing interesting is showing up in the logs [15:26:49] RECOVERY - DPKG on helium is OK: All packages OK [15:27:03] sad_trombone.wav [15:27:06] ah - found it [15:27:14] they look fine [15:27:20] RECOVERY - DPKG on potassium is OK: All packages OK [15:27:55] sweet, also this looks fine: echo 'STATS FULL' | nc -w1 potassium 7531 [15:28:06] cool [15:28:16] oh - that is better than my telnets [15:28:27] <^d> godog: http://anyonecanedit.org/shit.mp3 [15:29:08] ^d: hahha nice [15:29:10] looks like it went down on 15:23:08 and came back 15:24:54 [15:29:24] <^d> godog: That's also the sound I get when I get an e-mail :p [15:29:28] during that tie it loged 310756 warning while mw proceeded to fail open [15:31:23] ^d: that is the first thing I do when I get a new phone [15:31:24] turn off that fucking "you've got mail" relic [15:31:24] I just get so much mail [15:31:41] manybubbles: heh not bad considering I wasn't expecting an automatic global rollout :( [15:31:50] godog: :) [15:31:56] <^d> manybubbles: Well that's why I pick sounds I enjoy :p [15:32:01] its ok - that system doesn't really have failover [15:32:03] not properly [15:32:24] every time it fails over it loses its locks [15:33:00] mw just fails as though the pool counter replied with "go ahead and do whatever you want" [15:33:03] when pool counter is down [15:33:09] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:33:46] (03PS1) 10coren: Labs: Remove /public/backups entirely [puppet] - 10https://gerrit.wikimedia.org/r/184638 [15:34:00] YuviPanda: Workaround puppet dumb ^^ [15:34:26] (03CR) 10Yuvipanda: [C: 
031] Labs: Remove /public/backups entirely [puppet] - 10https://gerrit.wikimedia.org/r/184638 (owner: 10coren) [15:34:40] (03CR) 10coren: [C: 032] Labs: Remove /public/backups entirely [puppet] - 10https://gerrit.wikimedia.org/r/184638 (owner: 10coren) [15:35:13] $ git grep -ch 'ensure.*latest' | numsum [15:35:14] 140 [15:36:18] (03PS1) 10Filippo Giunchedi: poolcounter: ensure => installed [puppet] - 10https://gerrit.wikimedia.org/r/184640 [15:37:31] (03PS2) 10Filippo Giunchedi: poolcounter: ensure => installed [puppet] - 10https://gerrit.wikimedia.org/r/184640 [15:37:37] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] poolcounter: ensure => installed [puppet] - 10https://gerrit.wikimedia.org/r/184640 (owner: 10Filippo Giunchedi) [15:42:24] 3operations: Decomission svn.wikimedia.org - https://phabricator.wikimedia.org/T86655#973806 (10valhallasw) I still semi-regularly (every few months) use viewvc to figure out history of files for pywikibot, as not all branches were converted during the git migration -- mainly because it was too complicated to do... [15:42:39] PROBLEM - DPKG on mw1008 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:42:48] 3MediaWiki-Core-Team, operations: Deploy multi-lock PoolCounter change - https://phabricator.wikimedia.org/T85071#973808 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi resolving, changes deployed including https://gerrit.wikimedia.org/r/#/c/184640/ [15:42:58] PROBLEM - Disk space on mw1008 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:43:39] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:43:39] RECOVERY - puppet last run on potassium is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:43:43] 3operations: Decomission svn.wikimedia.org - https://phabricator.wikimedia.org/T86655#973812 (10mark) There's definitely stuff in there that hasn't been migrated yet indeed. 
I referred to it just today as well. [15:43:47] 3operations: Decomission svn.wikimedia.org - https://phabricator.wikimedia.org/T86655#973814 (10Chad) For that reason I'm inclined to import SVN repos to Phab so they can be usable from here. The history is important. [15:43:58] RECOVERY - DPKG on mw1008 is OK: All packages OK [15:44:09] RECOVERY - Disk space on mw1008 is OK: DISK OK [15:44:59] PROBLEM - puppet last run on mw1007 is CRITICAL: CRITICAL: Puppet has 5 failures [15:45:09] PROBLEM - puppetmaster https on virt1000 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:45:59] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 4 failures [15:49:48] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.063 second response time [15:50:27] <_joe_> "HTTP OK: Status line output matched 400" [15:50:34] <_joe_> this always amazes me [15:50:55] <_joe_> I'd fix this check if it wasn't so funny [15:52:33] wikitech is down, "Cannot contact the database server: Too many connections" [15:53:32] (03CR) 10BryanDavis: "Seems like a good first step to me." [puppet] - 10https://gerrit.wikimedia.org/r/184618 (https://phabricator.wikimedia.org/T86642) (owner: 10Yuvipanda) [15:53:42] 3operations: Decommission lsearchd - https://phabricator.wikimedia.org/T85009#973826 (10fgiunchedi) according to the dell website warranty expired on 2014-02-02 for all machines, I couldn't find any machine with a longer warranty curiously enough I couldn't find the asset tag for search1014 via dmidecode, rackt... 
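A small sketch of the behaviour _joe_ is amused by at 15:50:27: the check appears to be configured to expect "400" in the HTTP status line, so an error response is reported as healthy. The function below is a hypothetical model of such an expectation match, not the monitoring plugin's real code, and the reason the puppetmaster answers 400 to the probe is not stated in the log.

```python
def check_status_line(status_line: str, expected: str) -> tuple:
    """Return (state, message) the way an expectation-based HTTP check might."""
    if expected in status_line:
        return ("OK", f"HTTP OK: Status line output matched {expected}")
    return ("CRITICAL", f"HTTP CRITICAL: unexpected status line: {status_line}")


# The service answered at all, and with the expected code, so the check passes.
state, message = check_status_line("HTTP/1.1 400 Bad Request", "400")
```

The check is really testing "is something answering on this socket", which is why a 400 counts as success and a socket timeout does not.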
[15:54:29] RECOVERY - puppet last run on mw1007 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:55:29] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:55:29] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 4 failures [15:56:29] 3operations: Decommission lsearchd - https://phabricator.wikimedia.org/T85009#973833 (10mark) Then we shouldn't repurpose it for any critical roles. If anyone has any purpose for them where machine failures/lack of support isn't a problem we can keep them, but keep in mind that these won't live much longer. [15:56:40] <_joe_> godog: \o/ [15:56:59] <_joe_> manybubbles|away, ^d: \o/ [15:57:26] <_joe_> (decommission lsearchd) [15:57:44] :) [15:59:47] :D [16:00:04] manybubbles, anomie, ^d, marktraceur, Glaisher: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150113T1600). Please do the needful. [16:00:19] Jeff_Green, _joe_, anyone else: wikitech is throwing errors about too many database connections. Any idea? [16:00:33] <^d> +1 [16:00:34] Nothing from icinga-wm, right? [16:00:51] err YuviPanda wikitech is down again [16:01:02] <_joe_> anomie: seems like YuviPanda was handling it [16:01:11] <_joe_> YuviPanda: are you on it? [16:01:15] marktraceur: PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 4 failures [16:01:22] shrugs [16:01:26] Nah [16:01:34] Puppet is shit, I'm not surprised [16:01:47] There are problems all the time, doesn't mean it's related [16:01:50] virt1000 is wikitech, right? [16:01:51] _joe_: it’s mysql acting up again [16:01:58] _joe_: looking at access log, not much traffic [16:02:00] <_joe_> YuviPanda: what does this mean? 
[16:02:27] !log restart mysql on virt1000, wikitech acting up again [16:02:39] _joe_: so this happened about a month ago, and disappeared before we could investigate properly [16:02:43] <_joe_> mmmh don't do that [16:02:45] Logged the message, Master [16:03:56] I have to go now. [16:04:07] if it happens again, feel free to investigate (might also cause puppet failures) [16:04:12] virt1000 [16:04:30] _joe_: \o/ indeed! [16:05:01] <^d> mark: I dunno how much RAM we have on hand in eqiad or if there's any systems that need a boost, but those search* boxes have a fair bit that could be salvaged. [16:05:09] <^d> Even if they're otherwise very dated [16:07:24] 3ops-eqiad, ops-codfw: ship blanking panels from eqiad to codfw - https://phabricator.wikimedia.org/T86082#973850 (10Cmjohnson) Dear Christopher Johnson, To place orders, perform user management, and request status please visit the Portal. This is to inform you that order 2248746 has been processed and dispatc... [16:07:38] swatt! [16:07:49] 3 config patches [16:08:07] * ^d takes [16:08:33] <_joe_> anomie: still seeing errors? [16:08:44] _joe_: Working now [16:09:00] <_joe_> anomie: ping me in case it happens again in the next hour or so [16:09:31] <^d> Glaisher: Is zeroadmin empty? [16:09:31] marktraceur, manybubbles|away, ^d: Now that we can see the patches, who wants to SWAT? 
[16:09:35] yes [16:09:40] <^d> anomie: I've got it [16:09:42] ok [16:09:43] <^d> Glaisher: Ok thx [16:10:18] (03CR) 10Chad: [C: 032] Create 'autopatrolled' group on dawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183800 (https://phabricator.wikimedia.org/T86062) (owner: 10Glaisher) [16:10:23] (03CR) 10Chad: [C: 032] Create 'Draft' (118) namespace on hewikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184130 (https://phabricator.wikimedia.org/T86329) (owner: 10Glaisher) [16:10:29] (03CR) 10Chad: [C: 032] Remove FlaggedRevs config on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184133 (https://phabricator.wikimedia.org/T86443) (owner: 10Glaisher) [16:11:30] (03Merged) 10jenkins-bot: Create 'autopatrolled' group on dawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183800 (https://phabricator.wikimedia.org/T86062) (owner: 10Glaisher) [16:11:32] (03Merged) 10jenkins-bot: Create 'Draft' (118) namespace on hewikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184130 (https://phabricator.wikimedia.org/T86329) (owner: 10Glaisher) [16:11:35] (03Merged) 10jenkins-bot: Remove FlaggedRevs config on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184133 (https://phabricator.wikimedia.org/T86443) (owner: 10Glaisher) [16:12:14] !log demon Synchronized flaggedrevs.dblist: (no message) (duration: 00m 05s) [16:12:18] Logged the message, Master [16:12:30] !log demon Synchronized wmf-config/: (no message) (duration: 00m 05s) [16:12:33] Logged the message, Master [16:13:28] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:14:19] <^d> Glaisher: Things look ok to me [16:14:30] ^d: all working [16:14:30] yeah [16:14:30] thanks :D [16:14:37] <^d> yw [16:14:56] <^d> swat over! 
[16:14:59] * ^d throws confetti [16:16:00] 3ops-codfw: Update Racktables scs-c8 - https://phabricator.wikimedia.org/T86591#973874 (10Papaul) 5Open>3Resolved scs-c8-codfw in rack 8 is in there temporary onto we fix the serial connections problem on the pwf*; that is the reason why i did not put that in rack tables. [16:17:25] (03PS2) 10Filippo Giunchedi: lsearchd: remove lucene role and class [puppet] - 10https://gerrit.wikimedia.org/r/184620 (https://phabricator.wikimedia.org/T86150) [16:17:54] <_joe_> ^d: I can't understand how the italian word "confetti" got to mean what it means in english [16:18:07] (03CR) 10Filippo Giunchedi: "deleted passwords::lucene too, will remove from private.git" [puppet] - 10https://gerrit.wikimedia.org/r/184620 (https://phabricator.wikimedia.org/T86150) (owner: 10Filippo Giunchedi) [16:18:32] <_joe_> in italian, confetti means http://lifestyle.tiscali.it/sposi/media/11/confetti.jpg [16:18:46] <_joe_> which you can throw, of course, but what a waste :) [16:18:47] "Dragées" in french [16:19:01] we throw them at weddings [16:19:06] well confetti always means confetti in italian, it's more a question of what it means in italy :) [16:19:12] along with the paper confetti [16:19:48] <_joe_> paper confetti are "coriandoli" [16:19:54] (and yes, that's a subtle dig at how we often conflate wikipedia language-subdomains with countries) [16:20:04] _joe_: how the french wikipedia has an explanation! https://fr.wikipedia.org/wiki/Confetti#Origine_et_orthographe_du_mot originally we were throwing the candy version [16:20:07] <^d> Hehe [16:20:13] <^d> godog: What about check_lucene? [16:21:10] grocery time [16:21:27] ^d: I think that should be separate not to insta-break icinga, iirc it'd take a couple of puppet runs to fully deprovision the check [16:21:42] <^d> Ah ok, makes sense. 
[16:29:02] 3Beta-Cluster, operations: Unify ::production / ::beta roles for *oid - https://phabricator.wikimedia.org/T86633#973943 (10greg) p:5Triage>3Normal [16:30:49] ^d: cool! I'll go ahead [16:31:41] (03PS3) 10Filippo Giunchedi: lsearchd: remove lucene role and class [puppet] - 10https://gerrit.wikimedia.org/r/184620 (https://phabricator.wikimedia.org/T86150) [16:31:59] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] lsearchd: remove lucene role and class [puppet] - 10https://gerrit.wikimedia.org/r/184620 (https://phabricator.wikimedia.org/T86150) (owner: 10Filippo Giunchedi) [16:36:59] PROBLEM - Varnishkafka Delivery Errors per minute on cp3015 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [20000.0] [16:39:24] Jeff_Green: , yooo [16:39:35] ohi [16:40:30] ottomata: are you pointing out the varnishkafka thing? [16:45:13] RECOVERY - Varnishkafka Delivery Errors per minute on cp3015 is OK: OK: Less than 1.00% above the threshold [0.0] [16:48:07] ah, um, no, that is aproblem [16:48:22] qchris wanted me to sync up about kafkatee udp2log fundraising etc. [16:48:30] wait, uh, qchris_meeting, what do I need to sync up about? [16:48:39] there was something specifically [16:49:02] about the use of the fundraising tsvs on erbium. [16:49:12] Whether they are still used/needed. [16:49:18] tsvs? [16:49:23] And whether or not daily files (from Hive) would make sense. [16:49:36] bannerImpressions, etc. [16:50:01] (also bannerRandom is pretty empty these days) [16:50:05] so, qchris is working on converting most of the udp2log generated files to hadoop. [16:50:17] in our continuing effort to shutdown udp2log [16:50:50] fundraising wants/needs logs more frequently than once an hour, i assume, right? that's why kafkatee is a better fit for yall? 
just double checking [16:51:33] i guess it depends what other options there are [16:51:43] but yeah, they're going to want realtime data [16:51:50] well, there might be other options in the future, for now the options for realtime are really just kafkatee [16:52:13] they're currently tooled for 15 minute intervals [16:52:13] ok, and, also just double checking, are all of the currently configured fundraising ud2plog filters in use? [16:52:37] yeah probably [16:52:50] these in partiular [16:52:50] https://github.com/wikimedia/operations-puppet/blob/production/templates/udp2log/filters.erbium.erb [16:53:16] yeah, still in use [16:53:30] there's no data coming in for bannerRequests [16:53:42] I don't think we have campaigns up atm [16:53:46] ah, ok [16:53:51] so if there were campaigns there would be data [16:54:03] afaik yeah [16:54:07] ok [16:54:28] ok, cool. sync up complete :) [16:54:34] ha ok [16:54:39] we're going to work on turning off all the batch udp2log processes first [16:54:43] sorry, I haven't gotten to kafkatee [16:54:46] we'll sync up again when that is done and we can focus on kafkatee [16:54:47] no worries [16:54:59] it has been fundraiser time, and i thikn we wanted to wait til after to make changes anyway [16:55:08] yeah [16:55:30] although if I get time we should be able to do it as a parallel task, while the old system is running [16:55:47] (03CR) 10Nuria: "I am not sure as to the implications of this change, does it affect headers of requests?" [puppet] - 10https://gerrit.wikimedia.org/r/184551 (owner: 10Ori.livneh) [16:58:41] Jeff_Green: yeah [16:59:01] well, also, if we can turn off a udp2log instance or two before this, that will give us a free node on which to try just kafkatee for fundraising [16:59:34] to run the client side? 
i've got the hosts up already to handle it [17:00:05] but I have to get the package into frack and cook up the puppet config [17:03:32] (03CR) 10Ottomata: "if he just needs eventlogging log access, he will only need access to stat1003, which is the statistics-users group, not either of the pri" [puppet] - 10https://gerrit.wikimedia.org/r/184500 (https://phabricator.wikimedia.org/T86533) (owner: 10Dzahn) [17:13:07] (03CR) 10QChris: [C: 031] "Faidon> Is Analytics aware of this? Are we sure this won't break any scripts" [puppet] - 10https://gerrit.wikimedia.org/r/184551 (owner: 10Ori.livneh) [17:19:01] zuul is stuck..... [17:23:36] AGAIN? [17:23:37] Fuck. [17:23:42] (03CR) 10Ottomata: [C: 031] "Cool!" [puppet] - 10https://gerrit.wikimedia.org/r/184551 (owner: 10Ori.livneh) [17:24:00] oh Jeff_Green, right right [17:24:02] that goes on FR end [17:24:10] sorry, yeah ok cool, get that thing up there in running then :p :) [17:24:17] hah yeah [17:25:15] ok. coffee/food then clinic duty calls [17:33:57] (03PS1) 10Glaisher: Add missing m.{project}.org entries [dns] - 10https://gerrit.wikimedia.org/r/184690 (https://phabricator.wikimedia.org/T78421) [17:37:51] !log Restarting deadlocked Zuul , which drops ALL events. Reason is Gerrit lost connection with its database which is not handled by Zuul . See https://wikitech.wikimedia.org/wiki/Incident_documentation/20150106-Zuul [17:37:58] Logged the message, Master [17:39:38] !log Zuul back in action. Got recheck or +2 again the changes that have been discarded. [17:39:42] Logged the message, Master [17:43:43] thanks hashar [17:45:51] legoktm: poor Zuul is waiting indefinitely for a reply from Gerrit which deadlock it :-( [17:46:31] hashar: https://s-media-cache-ak0.pinimg.com/originals/7b/a7/8d/7ba78d6182b3e8f5c5b6600ee47f2e4a.jpg [17:47:06] marktraceur: should we bring rmoen in on the idea? Isn't he the api-guy now? [17:47:27] Not sure! [17:47:34] ori: relevant! ahah [17:48:01] I am out of there! 
see you tomorrow [17:48:38] T13|mobile: am i? [17:48:56] !log If Zuul status page ( https://integration.wikimedia.org/zuul/ ) shows a lot of changes with completed jobs and the number of results growing, Zuul is deadlocked waiting for Gerrit. Have to restart it on gallium.wikimedia.org with /etc/init.d/zuul restart [17:49:02] Logged the message, Master [17:50:12] (03PS3) 10Faidon Liambotis: VCL: Get rid of hhvm.inc.vcl.erb [puppet] - 10https://gerrit.wikimedia.org/r/184551 (owner: 10Ori.livneh) [17:50:20] (03CR) 10Faidon Liambotis: [C: 032] VCL: Get rid of hhvm.inc.vcl.erb [puppet] - 10https://gerrit.wikimedia.org/r/184551 (owner: 10Ori.livneh) [17:50:42] rmoen: are you? :p [17:51:03] no jenkins again? [17:51:04] thanks [17:51:08] (03CR) 10Faidon Liambotis: [V: 032] VCL: Get rid of hhvm.inc.vcl.erb [puppet] - 10https://gerrit.wikimedia.org/r/184551 (owner: 10Ori.livneh) [17:51:26] paravoid: https://gerrit.wikimedia.org/r/#/c/184608/ is a no-brainer too [17:51:32] (and i tested it) [17:51:39] oh I had already hit rebase [17:51:46] but there's a path conflict apparently [17:51:52] i'll rebase [17:51:56] so I'm... [17:51:57] ok [17:55:05] (03PS2) 10Ori.livneh: VCL: Standardize whitespace in parens [puppet] - 10https://gerrit.wikimedia.org/r/184608 [17:56:14] (03CR) 10Faidon Liambotis: [C: 032] VCL: Standardize whitespace in parens [puppet] - 10https://gerrit.wikimedia.org/r/184608 (owner: 10Ori.livneh) [18:00:04] maxsem, kaldari: Respected human, time to deploy Mobile Web (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150113T1800). Please do the needful. 
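hashar's rule of thumb from 17:48:56 (many changes with completed jobs, result count still growing) can be written down as a simple heuristic. This is a sketch under assumptions: the two counters and their names are hypothetical values someone would read off successive views of the status page, not fields of any real Zuul API.

```python
def looks_deadlocked(samples: list, min_samples: int = 3) -> bool:
    """Heuristic: results keep queuing while finished changes stay in the queue.

    Each sample is a hypothetical dict with two counters:
      'result_queue'     - results waiting to be processed
      'changes_all_done' - changes whose jobs all completed but were not reported
    """
    if len(samples) < min_samples:
        return False  # not enough observations to judge a trend
    recent = samples[-min_samples:]
    results = [s["result_queue"] for s in recent]
    growing = all(a < b for a, b in zip(results, results[1:]))
    stuck = all(s["changes_all_done"] > 0 for s in recent)
    return growing and stuck


stuck_samples = [
    {"result_queue": 5, "changes_all_done": 3},
    {"result_queue": 12, "changes_all_done": 4},
    {"result_queue": 30, "changes_all_done": 4},
]
healthy_samples = [
    {"result_queue": 5, "changes_all_done": 3},
    {"result_queue": 2, "changes_all_done": 1},
    {"result_queue": 0, "changes_all_done": 0},
]
```

A positive result only suggests that a human should look at the status page and consider the restart described in the log entry.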
[18:02:28] reviews for topic:ssh-userkey very much welcome [18:04:46] (03PS1) 10Legoktm: Fix flake8 issues [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/184691 [18:07:33] * ori reviews [18:12:50] 3operations: Rolling restart for Elasticsearch to pick up new version of wikimedia-extra plugin - https://phabricator.wikimedia.org/T86602#974069 (10Jgreen) We need to get this onto the deployment calendar, please talk to Greg and update the ticket when it's scheduled. [18:13:02] 3operations: Rolling restart for Elasticsearch to pick up new version of wikimedia-extra plugin - https://phabricator.wikimedia.org/T86602#974071 (10Jgreen) 5Open>3stalled [18:13:20] paravoid: does work without modification to sshd_config.erb? [18:13:44] i'm probably being dumb but i don't quite get how it works [18:13:56] Guice provision errors: [18:13:56] 1) Cannot open ReviewDb at com.google.gerrit.server.util.ThreadLocalRequestContext$1.provideReviewDb(ThreadLocalRequestContext.java:70) while locating com.google.gerrit.reviewdb.server.ReviewDb [18:14:00] 1 error [18:14:05] grumble [18:14:39] ori: see modules/ssh/manifests/server.pp:16 [18:14:57] /etc/ssh/userkeys is already provisioned by labs-private [18:15:15] but the directory File never belonged there in the first place [18:15:38] so I'm basically moving it [18:15:45] out of curiosity, why do the whole directory structure rather than /etc/ssh/userkeys/%u ? [18:15:49] and create a defined resource to populate entries in it [18:16:02] Jeff_Green: https://gerrit.wikimedia.org/r/#/c/183816/ [18:16:12] ha! [18:16:16] winning. [18:16:18] ;) [18:16:25] that's what I did in frack [18:18:28] (03CR) 10Ori.livneh: "@Ottomata: Tilman has just moved to the product group, and he has extensive background in Wikimedia-related research/data analysis (he edi" [puppet] - 10https://gerrit.wikimedia.org/r/184500 (https://phabricator.wikimedia.org/T86533) (owner: 10Dzahn) [18:19:30] 500 Internal Server Error [18:20:13] gerrit died? 
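Jeff_Green's `/etc/ssh/userkeys/%u` suggestion relies on sshd substituting tokens in the `AuthorizedKeysFile` pattern. A minimal sketch of that expansion, assuming only the `%u` (login name), `%h` (home directory) and `%%` (literal percent) tokens; the user name "alice" is made up for illustration.

```python
def expand_authorized_keys_path(pattern: str, user: str, home: str) -> str:
    """Expand the %u, %h and %% tokens of an AuthorizedKeysFile pattern."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%" and i + 1 < len(pattern):
            token = pattern[i + 1]
            if token == "u":
                out.append(user)
            elif token == "h":
                out.append(home)
            elif token == "%":
                out.append("%")
            else:
                raise ValueError(f"unsupported token %{token}")
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# "alice" is a hypothetical account name.
flat = expand_authorized_keys_path("/etc/ssh/userkeys/%u", "alice", "/home/alice")
```

With a flat pattern like this, each account maps to exactly one root-owned file, which is the layout being discussed around 18:15.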
[18:20:22] up for me atm [18:20:33] oh, it loaded now. thanks [18:21:54] (03CR) 10Ori.livneh: [C: 031] ssh: introduce ssh::userkey resource [puppet] - 10https://gerrit.wikimedia.org/r/183814 (owner: 10Faidon Liambotis) [18:22:16] (03CR) 10Dzahn: [C: 031] "i like it, just nitpicks about the example source. how about modules/admin/files/userkeys" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/183814 (owner: 10Faidon Liambotis) [18:22:40] (03CR) 10Ori.livneh: [C: 031] "My favorite Puppet idiom" [puppet] - 10https://gerrit.wikimedia.org/r/183815 (owner: 10Faidon Liambotis) [18:23:06] (03CR) 10Yuvipanda: [C: 032 V: 032] Fix flake8 issues [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/184691 (owner: 10Legoktm) [18:24:08] (03PS6) 10Faidon Liambotis: udp2log: replace iptables with ferm [puppet] - 10https://gerrit.wikimedia.org/r/169691 (owner: 10Matanya) [18:24:10] (03PS1) 10Faidon Liambotis: Remove now unused iptables.pp [puppet] - 10https://gerrit.wikimedia.org/r/184694 [18:24:12] (03PS1) 10Faidon Liambotis: udp2log: tighten up firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/184695 [18:24:45] (03CR) 10Dzahn: [C: 031] ssh: recurse/purge => true for /etc/ssh/userkeys [puppet] - 10https://gerrit.wikimedia.org/r/183815 (owner: 10Faidon Liambotis) [18:26:20] (03PS2) 10Faidon Liambotis: Remove now unused iptables.pp [puppet] - 10https://gerrit.wikimedia.org/r/184694 [18:26:22] (03PS2) 10Faidon Liambotis: udp2log: tighten up firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/184695 [18:28:18] (03CR) 10Dzahn: [C: 031] "yes, very cool. i checked this before and udp2log was really the last using the old iptables method, so after that is merged. yay!" 
[puppet] - 10https://gerrit.wikimedia.org/r/184694 (owner: 10Faidon Liambotis) [18:29:00] (03PS1) 10Ori.livneh: xenon: Skip frames that don't have a 'phpStack' key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184697 [18:29:07] bd808: ^ [18:30:02] (03CR) 10BryanDavis: [C: 031] xenon: Skip frames that don't have a 'phpStack' key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184697 (owner: 10Ori.livneh) [18:30:43] mutante: I have to go in a bit so I won't merge [18:30:49] but if you want to, feel free [18:30:58] otherwise I'll do it when I get back or tomorrow morning [18:31:18] my deployment plan is to merge the first two and make sure nothing breaks [18:31:32] then compare iptables -nvxL & netstat -nap [18:31:38] (03CR) 10Ori.livneh: [C: 032] xenon: Skip frames that don't have a 'phpStack' key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184697 (owner: 10Ori.livneh) [18:31:46] make sure there's ACCEPT entries for all of them [18:31:50] then merge the tighten-up changeset [18:33:02] paravoid: ok! i'll get back to it in a bit after handling an access request [18:33:03] (03PS1) 10Faidon Liambotis: Kill udpprofile::collector, unused [puppet] - 10https://gerrit.wikimedia.org/r/184698 [18:33:39] I was hoping for ottomata's +1 too but he hasn't been responsive to pings [18:33:42] (03Merged) 10jenkins-bot: xenon: Skip frames that don't have a 'phpStack' key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184697 (owner: 10Ori.livneh) [18:33:53] I have pending incomplete patchsets for cert.pp and mail.pp [18:38:29] !log ori Synchronized wmf-config/StartProfiler.php: xenon: Skip frames that don't have a 'phpStack' key (duration: 00m 06s) [18:38:35] Logged the message, Master [18:39:35] (03PS1) 10Faidon Liambotis: Kill role::labsnfs, deprecated & empty [puppet] - 10https://gerrit.wikimedia.org/r/184699 [18:40:09] !log mw1062: sync-file failed, read-only file system. Host should be removed from dsh group. 
[18:40:12] Logged the message, Master [18:44:40] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [18:46:07] 3operations: Requesting access to gallium for cmcmahon - https://phabricator.wikimedia.org/T86685#974127 (10Cmcmahon) 3NEW [18:56:21] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [19:00:05] Reedy, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150113T1900). Please do the needful. [19:00:09] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974165 (10chasemp) [19:00:39] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#806002 (10chasemp) [19:01:36] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#806002 (10chasemp) [19:02:33] (03CR) 10BBlack: [C: 031] "Rationale is sane (we ran into the block comment problem yesterday), and I double-checked visually for errors and didn't see any." 
[puppet] - 10https://gerrit.wikimedia.org/r/184570 (owner: 10Ori.livneh) [19:02:40] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974175 (10chasemp) [19:04:19] (03PS1) 10Reedy: Non Wikipedias to 1.25wmf14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184703 [19:04:36] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#806002 (10chasemp) [19:05:22] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974187 (10chasemp) Merging in T518 as we are settled on this functionality: Security-bug issues (I think meant o be named Mediawiki Related Security Bugs) will... [19:05:42] (03CR) 10Reedy: [C: 032] Non Wikipedias to 1.25wmf14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184703 (owner: 10Reedy) [19:05:46] (03Merged) 10jenkins-bot: Non Wikipedias to 1.25wmf14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184703 (owner: 10Reedy) [19:06:08] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non Wikipedias to 1.25wmf14 [19:06:14] Logged the message, Master [19:07:31] (03PS2) 10Reedy: Set wgRestrictDisplayTitle to false for fawikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184625 (https://phabricator.wikimedia.org/T85380) (owner: 10Mjbmr) [19:07:48] (03PS2) 10Reedy: Fix project talk namespace for mznwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184622 (https://phabricator.wikimedia.org/T85383) (owner: 10Mjbmr) [19:08:07] (03PS1) 10Aude: Bump cache epoch for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184706 [19:08:10] renoirb: guess what ? 
^ [19:08:11] (03PS5) 10Reedy: monolog: enable for group0 + group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181130 (https://phabricator.wikimedia.org/T76759) (owner: 10BryanDavis) [19:08:13] aaa [19:08:15] Reedy: [19:08:15] (03CR) 10MaxSem: [C: 031] "What to do with these entries is being discussed at T78421." [dns] - 10https://gerrit.wikimedia.org/r/184690 (https://phabricator.wikimedia.org/T78421) (owner: 10Glaisher) [19:08:17] (03CR) 10Reedy: [C: 032] monolog: enable for group0 + group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181130 (https://phabricator.wikimedia.org/T76759) (owner: 10BryanDavis) [19:08:19] Hahaha! [19:08:21] (03Merged) 10jenkins-bot: monolog: enable for group0 + group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181130 (https://phabricator.wikimedia.org/T76759) (owner: 10BryanDavis) [19:08:27] got pigned for no reason :P [19:08:56] sorry.... tab fail [19:09:01] lol [19:09:03] !log reedy Synchronized wmf-config/InitialiseSettings.php: monolog: enable for group0 + group1 wikis (duration: 00m 07s) [19:09:07] Logged the message, Master [19:09:07] No worries aude. While i’m here, let me make it constructive :) [19:09:14] :) [19:09:17] (03PS2) 10Reedy: Bump cache epoch for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184706 (owner: 10Aude) [19:09:26] (03CR) 10Reedy: [C: 032] Bump cache epoch for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184706 (owner: 10Aude) [19:09:39] * aude wants to automate this cache bump thing [19:10:11] sees that Parser has a version constant [19:10:15] Reedy: can you deploy https://gerrit.wikimedia.org/r/#/c/183087/ too? :) [19:11:06] I have an issue with configuring $wgSquidServers. I have a list of CIDR ranges of an external Varnish provider and I it seems I cannot make MW to detect that its been proxyed to my users. It logs the Varnish edge server instead of the visitor and prevents me to effectively ban. 
You have recommendation of somebody to talk to Reedy, aude? [19:11:56] wtf is Jenkins upto now [19:12:52] 3Code-Review, Wikimedia-Git-or-Gerrit, operations: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#974221 (10RobH) There are two issues here that I can see: 1) gerrit.wikimedia.org certificate is sha1 2) gerrit.wikimedia.org is rapidssl certificate, a... [19:13:11] Or maybe somebody else I talked to such as springle, lelegoktm, bd808, or Brion Vibber might know. [19:13:38] * aude doesn't know much about varnish beyond just to get it to "work" [19:13:46] * renoirb has the same issue aude [19:14:00] i'm not sure who is best but asking here is good [19:14:52] (03Merged) 10jenkins-bot: Bump cache epoch for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184706 (owner: 10Aude) [19:15:15] (03CR) 10Bartosz Dziewoński: "(It would time out. Nevermind, seems to work now.)" [puppet] - 10https://gerrit.wikimedia.org/r/177128 (https://phabricator.wikimedia.org/T75997) (owner: 10Krinkle) [19:15:52] I have an interresting case documented. I have a few IP CIDR ranges. When I test with an IP (`$ip`) that is part of a CIDR range. Tests with `IP::isInRange($a, $range_cidr)` works. But when I do `IP::isConfiguredProxy($ip)` [19:16:07] 3operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#974224 (10RobH) So additional corrections, the following certificates more than likely do not need re-order, for the reasons following each: bug-attachment.wikimedia.org - old service, should push behind misc web since... [19:16:07] I have a Gist to illustrate the tests, maybe I miss a config. 
[19:16:12] aude ^ [19:16:13] renoirb: if your upstream proxy is setting X-Forwarded-For headers properly and you have populated $wgSquidServersNoPurge with the known proxies then it should "just work" [19:16:21] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [19:16:35] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974225 (10chasemp) [19:16:51] what is the **difference** between what I put in $wgSquidServers and $wgSquidServersNoPurge? [19:16:56] bd808? [19:17:40] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#806002 (10chasemp) [19:17:59] renoirb: $wgSquidServers is a straight "in_array" check. $wgSquidServersNoPurge allows CIDR notation [19:18:17] See IP::isConfiguredProxy() for the gory details [19:18:49] renoirb: Actually starting from WebRequest::getIP gives the best idea of what's going on [19:18:56] bd808 I do have the XFF header set, I'm in a thread with my provider to set it up correctly as they also provide the equivalent of a CDN and it adds some complexities. But it's still unclear which variables are the ones I should use. 
[19:19:12] 3ops-core: revoke old digicert certificates - https://phabricator.wikimedia.org/T86689#974236 (10RobH) 3NEW a:3RobH [19:19:24] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974244 (10chasemp) [19:19:38] 3ops-core: revoke old digicert certificates - https://phabricator.wikimedia.org/T86689#974236 (10RobH) forgot, planet is digicert, it will move, but dont revoke it yet, only the unified cluster cert(s) [19:19:48] renoirb: $wgSquidServers for single ips, $wgSquidServersNoPurge for CIDR ranges [19:19:56] bd808 I used IP::isConfiguredProxy but only through maintenance/eval.php. How could I make such test through a browser? Is there such helper? [19:20:10] OH!!!! bd808. Now it makes sense! [19:20:25] I’ll test and adjust documentation wiki pages right away! [19:23:32] (03PS2) 10Reedy: Add Renameuser debug log group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183087 (https://phabricator.wikimedia.org/T85042) (owner: 10Legoktm) [19:23:34] (03PS1) 10BBlack: remove old unified cert ref from certs.pp [puppet] - 10https://gerrit.wikimedia.org/r/184709 [19:23:38] bd808: we should be monologging all the things now [19:23:44] (03CR) 10Reedy: [C: 032] Add Renameuser debug log group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183087 (https://phabricator.wikimedia.org/T85042) (owner: 10Legoktm) [19:23:52] The funny thing bd808 is that I went through creating a closure to create the ranges. It was creating a big array of about 13 000 individual IPs. Glad to know that there’s something specific for ranges that works. [19:23:57] sweet! 
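For readers following the $wgSquidServers / $wgSquidServersNoPurge exchange above, here is a rough Python sketch of the distinction bd808 describes: one list is matched by exact membership, the other accepts CIDR ranges, and the client IP is found by walking X-Forwarded-For back past known proxies. This is a simplified model and not MediaWiki's actual code. The names `SQUID_SERVERS`, `SQUID_SERVERS_NO_PURGE`, `is_configured_proxy` and `client_ip` are hypothetical, and the addresses are documentation-range placeholders.

```python
import ipaddress

# Hypothetical stand-ins for $wgSquidServers and $wgSquidServersNoPurge.
SQUID_SERVERS = ["203.0.113.10"]               # exact IPs only
SQUID_SERVERS_NO_PURGE = ["198.51.100.0/24"]   # CIDR ranges allowed


def is_configured_proxy(ip):
    """True if ip is listed exactly, or falls inside a listed CIDR range."""
    if ip in SQUID_SERVERS:  # plain membership test, like in_array
        return True
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(r) for r in SQUID_SERVERS_NO_PURGE)


def client_ip(remote_addr, xff_header):
    """Walk X-Forwarded-For right to left, skipping known proxies.

    Simplified assumption: a hop is trusted only while the address that
    reported it is itself a configured proxy.
    """
    hops = [h.strip() for h in xff_header.split(",") if h.strip()]
    current = remote_addr
    for candidate in reversed(hops):
        if not is_configured_proxy(current):
            break  # an untrusted host supplied the rest of the header
        current = candidate
    return current
```

Under these assumptions, a request arriving from 198.51.100.7 with `X-Forwarded-For: 192.0.2.55` resolves to 192.0.2.55, while a CIDR string placed in the exact-match list would never match a single address. That is why expanding ranges into thousands of individual IPs should not be necessary.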
[19:24:04] !log reedy Synchronized wmf-config/Wikibase.php: bump cache epoch (duration: 00m 06s) [19:24:09] Logged the message, Master [19:24:16] (03PS2) 10BBlack: remove old unified cert ref from certs.pp T86689 [puppet] - 10https://gerrit.wikimedia.org/r/184709 [19:25:28] (03Merged) 10jenkins-bot: Add Renameuser debug log group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183087 (https://phabricator.wikimedia.org/T85042) (owner: 10Legoktm) [19:25:29] Reedy: Log volume is definitely up at https://logstash.wikimedia.org/#/dashboard/elasticsearch/monolog \o/ [19:25:45] bd808 and what is $wgUsePrivateIPs about? Should I use it? Remember that the Varnish nodes from my provider are outside of my network. [19:26:43] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974277 (10chasemp) [19:27:23] (03PS1) 10BBlack: remove pubkey files for old GlobalSign certs T86689 [puppet] - 10https://gerrit.wikimedia.org/r/184711 [19:27:31] 3operations, Beta-Cluster: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#974281 (10scfc) See also T67591 (maybe duplicate). IIRC the two questions there were: # Don't accidentally unlock the Subversion server for anyone with shell access. # Don't accidentally lock... [19:28:18] (03CR) 10BBlack: [C: 032] remove old unified cert ref from certs.pp T86689 [puppet] - 10https://gerrit.wikimedia.org/r/184709 (owner: 10BBlack) [19:28:56] renoirb: It looks like wgUsePrivateIPs is for if you want to return non-public IPs from WebRequest::getIP(). It seems like that would only be useful in an intranet type install. [19:29:18] (03CR) 10BBlack: [C: 032 V: 032] remove pubkey files for old GlobalSign certs T86689 [puppet] - 10https://gerrit.wikimedia.org/r/184711 (owner: 10BBlack) [19:29:21] oh, ok! 
[19:29:27] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974310 (10chasemp) [19:29:37] ok bd808 thanks! [19:30:41] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [19:30:51] So bd808, based on the name. $wgSquidServersNoPurge doesn’t support HTTP purging to Varnish on page save? Or I´m misinterpreting the variable name? [19:31:50] (03PS8) 10QChris: Add jobs for aggregating hourly projectcount files to daily per wiki csvs [puppet] - 10https://gerrit.wikimedia.org/r/172201 (https://bugzilla.wikimedia.org/72740) [19:32:23] renoirb: correct. that list is just for XFF calculations. [19:33:12] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974325 (10chasemp) [19:33:16] Ok bd808, is there a mechanism to send purges to Varnish then. How would it be called? [19:33:30] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#806002 (10chasemp) [19:33:33] (is this recommended?) [19:34:03] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#806002 (10chasemp) [19:34:11] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [500.0] [19:35:17] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974331 (10chasemp) [19:35:29] 3operations, Beta-Cluster: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#974333 (10Chad) >>! In T86668#974281, @scfc wrote: > See also T67591 (maybe duplicate). 
IIRC the two questions there were: > > # Don't accidentally unlock the Subversion server for anyone wit... [19:36:31] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [19:38:07] renoirb: The "squid" in the variable names is sort of legacy left over. It should be "upstreamCache" or something instead. Do a `git grep SquidUpdate` to see where and how purging is configured. [19:38:16] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974344 (10chasemp) [19:38:51] !log reedy Synchronized wmf-config/InitialiseSettings.php: Add Renameuser debug log group (duration: 00m 09s) [19:38:51] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [19:38:54] Logged the message, Master [19:49:17] (03PS3) 10Dzahn: add tbayer to statistics-user/privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/184500 (https://phabricator.wikimedia.org/T86533) [19:49:21] (03PS1) 10Ori.livneh: VCL: Standardize on '//'-style comments [puppet] - 10https://gerrit.wikimedia.org/r/184722 [19:49:33] 3operations: Rolling restart for Elasticsearch to pick up new version of wikimedia-extra plugin - https://phabricator.wikimedia.org/T86602#974362 (10Manybubbles) Ping @greg: As with all other times we've done this it'll take some time - 24 hours or more isn't unlikely, especially if we don't run the restarts whi... [19:50:47] (03PS9) 10Ottomata: Add jobs for aggregating hourly projectcount files to daily per wiki csvs [puppet] - 10https://gerrit.wikimedia.org/r/172201 (https://bugzilla.wikimedia.org/72740) (owner: 10QChris) [19:50:56] bd808|LUNCH, I wonder. about a detail. If I have both single IPs and CIDR ranges. Could we use both $wgSquidServers and $wgSquidServersNoPurge variables? 
[19:51:27] Where $wgSquidServers has individual IP entries AND ranges are in the other [19:52:28] (03CR) 10Ottomata: "That's fine with me if that's what he wants/needs and has been approved for. :)" [puppet] - 10https://gerrit.wikimedia.org/r/184500 (https://phabricator.wikimedia.org/T86533) (owner: 10Dzahn) [19:52:32] (03CR) 10Dzahn: [C: 032] add tbayer to statistics-user/privatedata users [puppet] - 10https://gerrit.wikimedia.org/r/184500 (https://phabricator.wikimedia.org/T86533) (owner: 10Dzahn) [19:52:52] (03PS10) 10Ottomata: Add jobs for aggregating hourly projectcount files to daily per wiki csvs [puppet] - 10https://gerrit.wikimedia.org/r/172201 (https://bugzilla.wikimedia.org/72740) (owner: 10QChris) [19:53:01] (03CR) 10Ottomata: [C: 032 V: 032] Add jobs for aggregating hourly projectcount files to daily per wiki csvs [puppet] - 10https://gerrit.wikimedia.org/r/172201 (https://bugzilla.wikimedia.org/72740) (owner: 10QChris) [19:53:07] ori: why a new gerrit changeset for the VCL // thing? [19:53:15] bblack: needed a manual rebase [19:53:22] mutante: , merging your change [19:53:39] you can still preserve the gerrit changeid! :) [19:53:47] ugh, did i create a new change? [19:53:49] that was an accident [19:53:56] not important, just curious [19:54:03] yeah, that was a mistake :/ [19:54:15] (03CR) 10Dzahn: "statistics-user and statistics-privatedata-users , because that is what he needs, access to weblogs. 
just not the analytics-private class" [puppet] - 10https://gerrit.wikimedia.org/r/184500 (https://phabricator.wikimedia.org/T86533) (owner: 10Dzahn) [19:54:23] applied it in labs tho [19:54:29] so i was going to merge it if it's cool with you [19:54:36] ottomata: ah, yes please, was still writing the comment :) thx [19:55:15] bblack: but never fear, there are other patches to review ;) https://gerrit.wikimedia.org/r/#/c/184546/ , https://gerrit.wikimedia.org/r/#/c/184547/ , https://gerrit.wikimedia.org/r/#/c/184548/ [19:55:37] lol [19:55:56] yes, the comments thing is cool, I +1'd the original [19:56:43] (03PS3) 10Ori.livneh: VCL: Standardize on '//'-style comments [puppet] - 10https://gerrit.wikimedia.org/r/184570 [19:56:51] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: puppet fail [19:56:52] (03Abandoned) 10Ori.livneh: VCL: Standardize on '//'-style comments [puppet] - 10https://gerrit.wikimedia.org/r/184722 (owner: 10Ori.livneh) [19:56:55] (03PS1) 10Ottomata: Fix for cron parameter in new ::agregator class [puppet] - 10https://gerrit.wikimedia.org/r/184724 [19:56:58] ori: re: 184546, do you happen to know if the CentralAutoLogin hits are in fact normally GETs or POSTs? [19:57:17] because for perf reasons, we do want that block executing, and we should really fix it if it's being blocked by pass_requests [19:57:50] (03CR) 10Ottomata: [C: 032] Fix for cron parameter in new ::agregator class [puppet] - 10https://gerrit.wikimedia.org/r/184724 (owner: 10Ottomata) [19:58:19] bblack: they are GETs [19:58:47] so it is executing, because because the second operand of the || expression is true [19:58:49] the first never is [19:59:03] ok [19:59:24] (03PS4) 10Ori.livneh: VCL: Standardize on '//'-style comments [puppet] - 10https://gerrit.wikimedia.org/r/184570 [19:59:31] (03CR) 10Ori.livneh: [C: 032 V: 032] VCL: Standardize on '//'-style comments [puppet] - 10https://gerrit.wikimedia.org/r/184570 (owner: 10Ori.livneh) [20:00:55] <_joe_> ori: so... 
twemproxy package is ready but not in apt (yet), memcached errors were gone last I checked, and jobrunners are all on HHVM [20:01:04] ori: so the problem remains that if we have hard "return" statements in a block that come before backend-setting statements, we can't simply insert the debug backend override "after all backend-setters, but before any early-returners" [20:01:13] it can be worked around I'm sure, but just saying [20:01:16] _joe_: woooooo, you're a hero [20:01:56] bblack: which patch are you talking about right now? the 'eliminate dead code' one? [20:02:19] (03CR) 10BBlack: [C: 031] Eliminate dead code from text VCL [puppet] - 10https://gerrit.wikimedia.org/r/184546 (owner: 10Ori.livneh) [20:03:03] (03CR) 10BBlack: [C: 031] VCL: Add 'maybe_use_random_scheduler' subroutine [puppet] - 10https://gerrit.wikimedia.org/r/184547 (owner: 10Ori.livneh) [20:03:31] ori: no, I'm talking about where we're at on the debug patch structure after all of these others [20:03:53] (03CR) 10BBlack: [C: 031] Remove GettingStarted cookie workaround, reverting ae30ae0ba [puppet] - 10https://gerrit.wikimedia.org/r/184548 (owner: 10Ori.livneh) [20:04:15] (03CR) 10Ori.livneh: "(Needs to be updated for '//' comments)" [puppet] - 10https://gerrit.wikimedia.org/r/184547 (owner: 10Ori.livneh) [20:04:51] (03PS1) 10Ottomata: Dependency fixes for new ::aggregator class [puppet] - 10https://gerrit.wikimedia.org/r/184728 [20:05:19] arguably we should eliminate any return statements from custom-named subroutines in our VCL. I'm not sure how many there are. pass_requests being the current evil example. 
[20:05:26] those always make things hard to understand [20:06:11] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [20:06:35] (03PS2) 10Ori.livneh: VCL: Add 'maybe_use_random_scheduler' subroutine [puppet] - 10https://gerrit.wikimedia.org/r/184547 [20:06:54] (perhaps with the exception of "errorpage", that one might be appropriate) [20:07:00] (03CR) 10Ottomata: [C: 032] Dependency fixes for new ::aggregator class [puppet] - 10https://gerrit.wikimedia.org/r/184728 (owner: 10Ottomata) [20:07:34] bblack: yeah, once the debug thing is in then pass_requests is no more than a if (req.request != "GET" && req.request != "HEAD") { return (pass); }, so i think we could just inline it, which would remove the confusing indirection [20:08:54] (03PS1) 10Ottomata: Remove $data_path dependency from aggregator cron [puppet] - 10https://gerrit.wikimedia.org/r/184730 [20:09:19] (03PS2) 10Ori.livneh: Eliminate dead code from text VCL [puppet] - 10https://gerrit.wikimedia.org/r/184546 [20:09:30] (03CR) 10Ori.livneh: [C: 032 V: 032] Eliminate dead code from text VCL [puppet] - 10https://gerrit.wikimedia.org/r/184546 (owner: 10Ori.livneh) [20:09:31] ori: yeah I just did a quick audit of the puppet vcl templates, I think pass_requests is actually the only really egregious case [20:09:51] 3ops-core: revoke old digicert certificates - https://phabricator.wikimedia.org/T86689#974412 (10RobH) p:5Triage>3Normal [20:10:31] (03PS2) 10Ottomata: Remove $data_path dependency from aggregator cron [puppet] - 10https://gerrit.wikimedia.org/r/184730 [20:10:36] (03CR) 10Ottomata: [C: 032 V: 032] Remove $data_path dependency from aggregator cron [puppet] - 10https://gerrit.wikimedia.org/r/184730 (owner: 10Ottomata) [20:10:50] ori, ok to merge? [20:11:09] ottomata: yes [20:11:10] thanks [20:11:18] done [20:11:29] ori: but still, ideally we'd put that inline pass-non-GET/HEAD-block *after* the random-backend thing. 
and then put the debug stuff to set the debug backend between the two, where applicable (so that the req.backend setting for debug is always after all other req.backend-setters, but before any early-returners) [20:12:16] I don't think it breaks anything to reorder those two things (everywhere they occur together) as another pre-patch [20:12:25] (03PS1) 10Jforrester: Enable VisualEditor for the 'Article incubator' namespace on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184731 (https://phabricator.wikimedia.org/T86688) [20:13:31] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:17:18] (03PS3) 10Ori.livneh: VCL: Add 'maybe_use_random_scheduler' subroutine [puppet] - 10https://gerrit.wikimedia.org/r/184547 [20:17:29] (03PS1) 10Ottomata: Use globally qualified variable for $::cdh::hadoop::mount::mount_point in new aggregator class [puppet] - 10https://gerrit.wikimedia.org/r/184733 [20:18:04] ^d: manybubbles Any changes in content namespace requires some search stuff things? [20:18:06] https://gerrit.wikimedia.org/r/#/c/184016/ [20:18:55] (03CR) 10QChris: [C: 031] Use globally qualified variable for $::cdh::hadoop::mount::mount_point in new aggregator class [puppet] - 10https://gerrit.wikimedia.org/r/184733 (owner: 10Ottomata) [20:19:01] <^d> Reedy: Yes, requires a full reindex. [20:19:02] (03PS2) 10Reedy: Fix timeout comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183160 (owner: 10Tim Starling) [20:19:05] renoirb: Yes, using both config arrays is supported. [20:19:06] (03CR) 10Ottomata: [C: 032] Use globally qualified variable for $::cdh::hadoop::mount::mount_point in new aggregator class [puppet] - 10https://gerrit.wikimedia.org/r/184733 (owner: 10Ottomata) [20:19:19] ^d: Do you want a ticket filing or something? 
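The ordering bblack proposes for the VCL (backend-setters first, then the debug override, then any early return) can be modelled with a toy Python sketch. This is not VCL and not the actual puppet templates. The function names are hypothetical, the random-backend step is reduced to an unconditional assignment, and the header check only assumes the 'X-Wikimedia-Debug=1' to test_wikipedia routing named in the patch title.

```python
DEBUG_BACKEND = "test_wikipedia"


def recv_proposed(method, headers):
    """Proposed order: backend setters, debug override, early return last."""
    backend = "default"
    backend = "backend_random"  # stand-in for the random-backend selection
    if headers.get("X-Wikimedia-Debug") == "1":
        backend = DEBUG_BACKEND  # runs after every other backend setter
    if method not in ("GET", "HEAD"):
        return ("pass", backend)  # inlined pass_requests check
    return ("lookup", backend)


def recv_early_return(method, headers):
    """Problematic order: the pass check returns before the override runs."""
    backend = "default"
    if method not in ("GET", "HEAD"):
        return ("pass", backend)  # debug override below is never reached
    backend = "backend_random"
    if headers.get("X-Wikimedia-Debug") == "1":
        backend = DEBUG_BACKEND
    return ("lookup", backend)
```

In this model a POST carrying the debug header reaches the debug backend only with the proposed ordering. With the early return first, the same request is passed to whatever backend was set before the return.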
[20:19:26] (03CR) 10Reedy: [C: 032] Fix timeout comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183160 (owner: 10Tim Starling) [20:19:49] <^d> Reedy: Can do that. Or I can point you to the right script if you'd like :) [20:19:49] (03Merged) 10jenkins-bot: Fix timeout comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183160 (owner: 10Tim Starling) [20:20:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [20:22:20] (03CR) 10Reedy: "Are we scheduling a window for this or something specific?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170129 (https://phabricator.wikimedia.org/T51193) (owner: 10Spage) [20:22:34] bblack: ooh, labs came in handy. we both missed a serious bug in [20:23:11] text-frontend.inc.vcl.erb includes text-common *above* the declaration of the backend_random director [20:23:35] so it isn't defined when the subroutine in text-common is parsed [20:23:57] the backend declaration should go in text-common, obviously [20:24:16] ah, well, no, it can't [20:24:17] hm [20:25:00] (03PS3) 10Reedy: Fix project talk namespace for mznwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184622 (https://phabricator.wikimedia.org/T85383) (owner: 10Mjbmr) [20:25:06] (03CR) 10Reedy: [C: 032] Fix project talk namespace for mznwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184622 (https://phabricator.wikimedia.org/T85383) (owner: 10Mjbmr) [20:25:11] (03Merged) 10jenkins-bot: Fix project talk namespace for mznwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184622 (https://phabricator.wikimedia.org/T85383) (owner: 10Mjbmr) [20:26:20] :) [20:26:38] well the nice thing about big bugs like that in VCL is it'll just fail to reload and spam us in here about puppet failures [20:26:55] yeah [20:27:41] (03PS3) 10Reedy: Set wgRestrictDisplayTitle to false for fawikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184625 
(https://phabricator.wikimedia.org/T85380) (owner: 10Mjbmr) [20:27:54] (03CR) 10Reedy: [C: 032] Set wgRestrictDisplayTitle to false for fawikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184625 (https://phabricator.wikimedia.org/T85380) (owner: 10Mjbmr) [20:27:56] (03PS1) 10Ottomata: Atempt to properly resolve hdfs mount point from a variable [puppet] - 10https://gerrit.wikimedia.org/r/184735 [20:28:07] (03Merged) 10jenkins-bot: Set wgRestrictDisplayTitle to false for fawikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184625 (https://phabricator.wikimedia.org/T85380) (owner: 10Mjbmr) [20:28:31] the backend_random declaration in text-frontend and text-backend is identical, except for the fact that text-backend's is enclosed in a <% if @vcl_config.fetch("cluster_tier", "1") != "1" -%> block [20:29:31] yeah that part's kinda tricky actually [20:29:42] maybe just leave them inlined and duplicate for now to keep the risk down [20:30:01] yeah [20:30:12] (03PS2) 10Reedy: New user message extension configuration on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184402 (https://phabricator.wikimedia.org/T76716) (owner: 10Dereckson) [20:30:19] (03CR) 10Reedy: [C: 032] New user message extension configuration on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184402 (https://phabricator.wikimedia.org/T76716) (owner: 10Dereckson) [20:30:23] (03Merged) 10jenkins-bot: New user message extension configuration on fa.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184402 (https://phabricator.wikimedia.org/T76716) (owner: 10Dereckson) [20:30:48] (03CR) 10Ottomata: [C: 032] Atempt to properly resolve hdfs mount point from a variable [puppet] - 10https://gerrit.wikimedia.org/r/184735 (owner: 10Ottomata) [20:30:51] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 4 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). 
[20:31:42] <^d> Reedy: It's actually really easy (and shouldn't take more than like 15m for mw.org I think). Sync the config change, then run `mwscript extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php --wiki=mediawikiwiki --reindexAndRemoveOk --indexIdentifier=now` [20:31:46] (03PS3) 10Reedy: Update client db list setting for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184357 (owner: 10Aude) [20:31:51] <^d> That's the "update config and reindex in place" command [20:31:51] (03CR) 10Reedy: [C: 032] Update client db list setting for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184357 (owner: 10Aude) [20:31:55] (03Merged) 10jenkins-bot: Update client db list setting for test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184357 (owner: 10Aude) [20:33:40] yay, thanks :) [20:34:54] (03PS4) 10Reedy: Create "autopatrolled", "patroller" and "rollbacker" user groups on fawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184370 (https://phabricator.wikimedia.org/T85381) (owner: 10Calak) [20:35:02] (03CR) 10Reedy: [C: 032] Create "autopatrolled", "patroller" and "rollbacker" user groups on fawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184370 (https://phabricator.wikimedia.org/T85381) (owner: 10Calak) [20:35:06] (03Merged) 10jenkins-bot: Create "autopatrolled", "patroller" and "rollbacker" user groups on fawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184370 (https://phabricator.wikimedia.org/T85381) (owner: 10Calak) [20:35:10] the basic dimensional cardinality of our VCL is 3: cache-role (bits, text, etc), front-vs-back "layer", 1-vs-2 "tier". So the common wikimedia vcl file that's used by all of them is already in a 3-dimensional space before we even start thinking about multiple definitions of the same subroutine from different included files and such. 
[20:35:21] (03PS1) 10Ottomata: Hardcode $hdfs_mount_point, revert previous change [puppet] - 10https://gerrit.wikimedia.org/r/184736 [20:35:52] (03PS2) 10Reedy: Enable VisualEditor for the 'Article incubator' namespace on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184731 (https://phabricator.wikimedia.org/T86688) (owner: 10Jforrester) [20:36:11] (03CR) 10Reedy: [C: 032] Enable VisualEditor for the 'Article incubator' namespace on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184731 (https://phabricator.wikimedia.org/T86688) (owner: 10Jforrester) [20:36:15] (03Merged) 10jenkins-bot: Enable VisualEditor for the 'Article incubator' namespace on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184731 (https://phabricator.wikimedia.org/T86688) (owner: 10Jforrester) [20:36:18] Reedy: Ha. OK… [20:36:41] (03CR) 10Ottomata: [C: 032] Hardcode $hdfs_mount_point, revert previous change [puppet] - 10https://gerrit.wikimedia.org/r/184736 (owner: 10Ottomata) [20:38:22] 3Ops-Access-Requests: EventLogging access for Tilman - https://phabricator.wikimedia.org/T86533#974491 (10Dzahn) a:3Dzahn [20:39:37] _joe_: so how do we get it in apt? [20:40:30] 3Ops-Access-Requests: EventLogging access for Tilman - https://phabricator.wikimedia.org/T86533#974494 (10Dzahn) adjusted the groups after gerrit discussion: statistics-users (mysql queries against EventLogging db) statistics-privatedata-users (for web access logs) bastion (to be able to jump to stat1003) http... [20:40:56] mutante: <3 [20:41:00] thanks man [20:41:40] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. 
[20:41:44] !log reedy Synchronized database lists: wikidata dblist update (duration: 00m 06s) [20:41:47] Logged the message, Master [20:41:50] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [20:41:58] !log reedy Synchronized wmf-config/: Config updates (duration: 00m 06s) [20:42:02] Logged the message, Master [20:42:13] ori: so after looking at it again, the answer was neither , one of the private groups but not the other. we are in PM and on it to verify [20:42:29] (03PS1) 10Ori.livneh: VCL: Flip order of pass_requests / backend_random assignment [puppet] - 10https://gerrit.wikimedia.org/r/184738 [20:43:13] (03PS3) 10Reedy: mediawikiwiki: Add Api: and Skin: namespace to wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184016 (https://phabricator.wikimedia.org/T86391) (owner: 10Florianschmidtwelzow) [20:46:00] (03CR) 10Reedy: [C: 032] mediawikiwiki: Add Api: and Skin: namespace to wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184016 (https://phabricator.wikimedia.org/T86391) (owner: 10Florianschmidtwelzow) [20:47:35] (03Merged) 10jenkins-bot: mediawikiwiki: Add Api: and Skin: namespace to wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184016 (https://phabricator.wikimedia.org/T86391) (owner: 10Florianschmidtwelzow) [20:48:04] !log reedy Synchronized wmf-config/InitialiseSettings.php: mediawikiwiki content namespaces (duration: 00m 05s) [20:48:08] Logged the message, Master [20:48:20] !log running mwscript extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php --wiki=mediawikiwiki --reindexAndRemoveOk --indexIdentifier=now [20:48:23] Logged the message, Master [20:48:56] ^d: uh oh [20:49:05] Fatal error: Call to private method CirrusSearch\Maintenance\Reindexer::sendDocuments() from context '' in /srv/mediawiki-staging/php-1.25wmf14/extensions/CirrusSearch/includes/Maintenance/Reindexer.php on line 311 [20:49:25] (03PS1) 10Dzahn: 
use freshly generated SSH key for tbayer [puppet] - 10https://gerrit.wikimedia.org/r/184746 (https://phabricator.wikimedia.org/T86533) [20:49:42] <^d> Reedy: Sec. [20:50:00] (03CR) 10BBlack: [C: 031] "I suggested this restructuring, I think it's sane and doesn't break anything, even though it does technically change behavior. It helps f" [puppet] - 10https://gerrit.wikimedia.org/r/184738 (owner: 10Ori.livneh) [20:50:39] (03CR) 10Dzahn: [C: 032] "confirmed by: wiki login and IRC cloak" [puppet] - 10https://gerrit.wikimedia.org/r/184746 (https://phabricator.wikimedia.org/T86533) (owner: 10Dzahn) [20:55:41] <^d> Reedy: Livehack https://phabricator.wikimedia.org/P213. If still no, try making it public. I'll get a patch into master. [20:56:04] (03PS2) 10Ori.livneh: VCL: Flip order of pass_requests / backend_random assignment [puppet] - 10https://gerrit.wikimedia.org/r/184738 [20:56:09] (03PS13) 10Ori.livneh: varnish: Route requests with 'X-Wikimedia-Debug=1' to test_wikipedia backend [puppet] - 10https://gerrit.wikimedia.org/r/183171 [20:56:12] (03CR) 10Ori.livneh: [C: 032 V: 032] VCL: Flip order of pass_requests / backend_random assignment [puppet] - 10https://gerrit.wikimedia.org/r/184738 (owner: 10Ori.livneh) [20:56:24] (03CR) 10jenkins-bot: [V: 04-1] varnish: Route requests with 'X-Wikimedia-Debug=1' to test_wikipedia backend [puppet] - 10https://gerrit.wikimedia.org/r/183171 (owner: 10Ori.livneh) [20:57:22] bblack: fyi, puppet is disabled on cp1008 (Reason: 'reason not specified') [20:57:57] (03PS14) 10Ori.livneh: varnish: Route requests with 'X-Wikimedia-Debug=1' to test_wikipedia backend [puppet] - 10https://gerrit.wikimedia.org/r/183171 [20:58:29] ori: it's ok, that's my personal hack-things box, basically [20:58:35] nod [20:59:08] right now it's being used to hack jessie packages for everything varnish/ssl -related [21:01:01] bd808, I just finished documenting both https://www.mediawiki.org/wiki/Manual:$wgSquidServersNoPurge and 
https://www.mediawiki.org/wiki/Manual:$wgSquidServers to make it clear. Mind to take a quick look? [21:02:06] 3Ops-Access-Requests: EventLogging access for Tilman - https://phabricator.wikimedia.org/T86533#974530 (10Dzahn) 5Open>3Resolved Jan 13 20:58:57 bast1001 sshd[24705]: Accepted publickey for tbayer .. confirmed login works on bastion host, confirmed user exists on stat1003. we'll go through the needed ssh co... [21:02:21] <^d> Reedy: That work? [21:02:21] (03CR) 10Ori.livneh: [C: 04-2] "This won't work as-is, because text-common is included before the random backend is declared, meaning backend_random is undefined when tex" [puppet] - 10https://gerrit.wikimedia.org/r/184547 (owner: 10Ori.livneh) [21:02:30] protected didn't [21:02:38] <^d> public did? dammit, ok. [21:03:00] I'm just gonna try public [21:04:19] <^d> Yeah that'll obviously work. [21:04:33] <^d> Dangit, can we get terbium on trusty and hhvm? [21:04:57] yeah, we should [21:05:04] it runs a lot of short-lived scripts [21:05:11] so we should be careful to make sure hhvm doesn't degrade performance [21:05:40] <^d> We could make those crons all use the php5 binary until we're ready for them to be on hhvm [21:06:22] could do, but puppet doesn't install the requisite zend extensions if ubuntu == trusty [21:06:42] that may need to change as we migrate misc servers [21:07:00] ^d: yeah, public works [21:07:09] unless we are going to live with hhvm making some things much slower [21:07:14] <^d> Reedy: Yeah of course it will :p [21:07:17] <^d> Stupid 5.3.x [21:07:26] <^d> Can't use anything but public from a closure. [21:07:28] Well, it could be completely fucked up [21:07:38] ^d: closure problems? [21:07:49] * bd808 sees "yes" [21:07:53] the whole self-in-closure thing? [21:08:27] <^d> No, it's the $self->foo() where foo() is private. 
[21:08:34] In 5.3.x closures aren't bound to the origin object scope so they can see private/protected members and methods [21:08:36] (03PS1) 10Legoktm: Enable $wgCentralAuthEnableGlobalRenameRequest in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184757 [21:08:41] <^d> $self is annoying too :) [21:08:44] *can't see [21:09:01] bblack: OK, I think the debug patch is correct now. If you let your eyes wander and look at the context for some of these lines, you'll see unrelated warts that beg to be cleaned up......but the solution for now is not to let your eyes wander :P [21:09:06] 5.3 is teh suck once you've used 5.4+ [21:13:05] (03CR) 10Ori.livneh: "cherry-picked in labs; applied correctly" [puppet] - 10https://gerrit.wikimedia.org/r/183171 (owner: 10Ori.livneh) [21:13:10] 3ops-core: IPsec: add firewall rules - https://phabricator.wikimedia.org/T85823#974556 (10Gage) p:5High>3Low [21:14:27] (03PS2) 10Ori.livneh: Remove GettingStarted cookie workaround, reverting ae30ae0ba [puppet] - 10https://gerrit.wikimedia.org/r/184548 [21:14:34] ori: I assume the primary thing I'm supposed to avert my eyes from is the ordering of evaluate_cookie? :) [21:14:47] yep! [21:15:19] (03CR) 10Ori.livneh: [C: 032 V: 032] Remove GettingStarted cookie workaround, reverting ae30ae0ba [puppet] - 10https://gerrit.wikimedia.org/r/184548 (owner: 10Ori.livneh) [21:15:50] 3ops-core: IPsec: add firewall rules - https://phabricator.wikimedia.org/T85823#974563 (10Gage) Given that Varnish nodes have only private IPs there is no explicit need for this. Proper configuration of security associations between nodes in cache colos and main colos would ensure that traffic is encrypted; fire... [21:16:37] (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/184776 [21:16:53] bblack: oh, wait -- that's a regression i'm introducing [21:16:56] right? 
[21:16:57] (03CR) 10Keegan: [C: 031] "That wasn't quick enough, sir :P" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184757 (owner: 10Legoktm) [21:17:11] yeah, it should go above [21:17:13] eeep. [21:17:24] you mean the eval cookie part? [21:17:40] yeah kinda [21:17:51] yeah, that's a rebase error [21:18:07] (03CR) 10Hashar: "recheck" [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/184776 (owner: 10Hashar) [21:18:30] (03CR) 10Hoo man: "Has this even been tested?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184757 (owner: 10Legoktm) [21:19:36] (03PS15) 10Ori.livneh: varnish: Route requests with 'X-Wikimedia-Debug=1' to test_wikipedia backend [puppet] - 10https://gerrit.wikimedia.org/r/183171 [21:19:47] (03PS16) 10Ori.livneh: varnish: Route requests with 'X-Wikimedia-Debug=1' to test_wikipedia backend [puppet] - 10https://gerrit.wikimedia.org/r/183171 [21:20:01] there [21:20:22] 16 is a power of 2 and a perfect 4th power [21:20:47] it is the smallest number with exactly five divisors [21:20:52] must be significant [21:21:57] (03CR) 10Hashar: "recheck" [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/184776 (owner: 10Hashar) [21:23:11] it's also the value of 8+8, so it's double-lucky [21:23:11] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/184776 (owner: 10Hashar) [21:25:01] (03CR) 10Legoktm: "Yes...it's been on beta labs for a while now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184757 (owner: 10Legoktm) [21:25:28] ori: I think text-frontend cookie stuff is still a little messed up in the net [21:25:34] 3operations, Beta-Cluster: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#974590 (10scfc) >>! In T86668#974333, @Chad wrote: >>>! In T86668#974281, @scfc wrote: >> See also T67591 (maybe duplicate). IIRC the two questions there were: >> >> # Don't accidentally unlo... 
[21:26:05] <^d> Reedy: Not a livehack anymore, merged all the way through [21:26:07] ori: the ordering of the IMS check and evaluate_cookie is probably significant [21:26:10] Thanks [21:26:14] move both above your stuff? [21:26:18] bblack: ugh, you're right. sorry. [21:26:37] well above where your req.backend would theoretically be, if this were -backend [21:26:40] heh [21:26:52] so much for numerology [21:27:58] heh, i was about to complain wikibugs doesnt talk about my edits, but it's absolutely correct not leaking stuff that has is limited to NDA [21:28:26] 3operations, Beta-Cluster: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#974594 (10Chad) If we do {T86674}/{T86655} this won't be a problem [21:28:44] I'm calling PS18 as the one [21:28:47] bblack: IMS? [21:28:54] If-Modified-Since [21:29:06] it checks a cookie in that same clause [21:30:30] ori: really, I don't think that clause does anything in practice. I think evaluate_cookie currently kills the cookie before the IMS+Cookie check can see it. [21:30:40] but, best to just preserve the existing ordering/behavior there for now. [21:31:15] !next [21:31:35] that used to tell us the next swat ? [21:31:57] oh wait I was thinking backwards. in your PS16 that's what happens. in the origin the IMS+Cookie check is before Cookie is killed in evaluate_cookie [21:32:33] s/origin/original/ [21:32:38] (03PS1) 10Chad: Add my .bashrc [puppet] - 10https://gerrit.wikimedia.org/r/184783 [21:34:10] (03PS18) 10Ori.livneh: varnish: Route requests with 'X-Wikimedia-Debug=1' to test_wikipedia backend [puppet] - 10https://gerrit.wikimedia.org/r/183171 [21:35:04] lol, I missed PS17 in the split [21:35:16] oh, there was no 17 [21:35:20] you're sneaky [21:35:45] bblack: i have no idea how that happened [21:38:28] anyways, PS18 has a whitespace error on L53 of text-backend [21:38:31] ori: ^ [21:38:48] but otherwise I think this might be it. 
I'm starting over at the first file and re-reading again now [21:39:19] (03PS19) 10Ori.livneh: varnish: Route requests with 'X-Wikimedia-Debug=1' to test_wikipedia backend [puppet] - 10https://gerrit.wikimedia.org/r/183171 [21:43:15] 3Ops-Access-Requests, Continuous-Integration: Make sure relevant RelEng people have access to gallium (Chris M, Dan, Mukunda, Zeljko) - https://phabricator.wikimedia.org/T85936#974614 (10Cmcmahon) Need this merged: https://phabricator.wikimedia.org/T86685 [21:44:03] 3Ops-Access-Requests, Continuous-Integration: Make sure relevant RelEng people have access to gallium (Chris M, Dan, Mukunda, Zeljko) - https://phabricator.wikimedia.org/T85936#974615 (10Cmcmahon) [21:44:56] ori: switch the order of evaluate_cookie and the non-GET/HEAD-pass block in text-*end? so that the relative order of the two is preserved vs original [21:45:41] (and keep the IMS-check block stapled to the top of evaluate_cookie in the frontend case, move the pass block above both?) [21:46:02] meh I don't even know if we're split now or not [21:46:29] bblack: you're not, whomever you're talking to might be. [21:46:59] :) [21:49:59] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974626 (10chasemp) [21:51:49] (03CR) 10BBlack: [C: 04-1] "I think we're almost there, but needs moving the non-GET/HEAD-pass block up above the IMS+Cookie checking block + evaluate_cookie in text-" [puppet] - 10https://gerrit.wikimedia.org/r/183171 (owner: 10Ori.livneh) [21:53:06] bblack: i have a piggish ask -- any chance you could amend the patch? I'm at the point where I'm both eager (and thus liable to be careless) and have been staring at the change so often that it all looks like gibberish. 
[21:53:33] if not it's ok, i'll just amend it later [21:54:00] ori: yeah ok (btw my local git started working fine again this morning, so no idea wtf was going on last night, but it wasn't a local setup issue?) [21:55:16] no clue [21:57:02] (03PS20) 10BBlack: varnish: Route requests with 'X-Wikimedia-Debug=1' to test_wikipedia backend [puppet] - 10https://gerrit.wikimedia.org/r/183171 (owner: 10Ori.livneh) [21:59:07] (03CR) 10Dzahn: [C: 032] udp2log: replace iptables with ferm [puppet] - 10https://gerrit.wikimedia.org/r/169691 (owner: 10Matanya) [21:59:30] (03CR) 10BBlack: "(irc is so borked right now: does PS19 -> PS20 make sense to you?)" [puppet] - 10https://gerrit.wikimedia.org/r/183171 (owner: 10Ori.livneh) [22:00:02] uhm, yea, touching udp2log, removing iptables, please let me know if anything unexpected with logging [22:00:08] checking fluorine [22:00:27] (removing iptables means replacing with ferm , of course) [22:00:44] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974657 (10chasemp) [22:02:44] matanya: heh, did i make you join? [22:03:52] (03CR) 10Dzahn: "checked fluorine (iptables before vs. after), ferm has applied rules. 
looks ok so far, nothing gets dropped, checked counters" [puppet] - 10https://gerrit.wikimedia.org/r/169691 (owner: 10Matanya) [22:04:27] (03CR) 10Ori.livneh: [C: 031] varnish: Route requests with 'X-Wikimedia-Debug=1' to test_wikipedia backend [puppet] - 10https://gerrit.wikimedia.org/r/183171 (owner: 10Ori.livneh) [22:05:34] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974669 (10chasemp) [22:08:10] PROBLEM - dhclient process on gadolinium is CRITICAL: Timeout while attempting connection [22:08:12] bbiaf [22:08:20] PROBLEM - Disk space on gadolinium is CRITICAL: Timeout while attempting connection [22:08:40] PROBLEM - DPKG on gadolinium is CRITICAL: Timeout while attempting connection [22:08:51] PROBLEM - configured eth on gadolinium is CRITICAL: Timeout while attempting connection [22:08:51] PROBLEM - RAID on gadolinium is CRITICAL: Timeout while attempting connection [22:08:51] PROBLEM - salt-minion processes on gadolinium is CRITICAL: Timeout while attempting connection [22:09:10] PROBLEM - udp2log log age for nginx on gadolinium is CRITICAL: Timeout while attempting connection [22:09:13] PROBLEM - puppet last run on gadolinium is CRITICAL: Timeout while attempting connection [22:09:30] uhmm.. that would be what i merged above [22:09:46] needs hole for monitoring [22:10:48] (bblack: makes sense to me) [22:11:23] actually, come on icinga, wasnt that just temp? 
[22:11:50] PROBLEM - udp2log log age for oxygen on oxygen is CRITICAL: Timeout while attempting connection [22:11:53] PROBLEM - Disk space on oxygen is CRITICAL: Timeout while attempting connection [22:12:00] PROBLEM - DPKG on oxygen is CRITICAL: Timeout while attempting connection [22:12:21] PROBLEM - RAID on oxygen is CRITICAL: Timeout while attempting connection [22:12:31] PROBLEM - dhclient process on oxygen is CRITICAL: Timeout while attempting connection [22:12:50] PROBLEM - puppet last run on oxygen is CRITICAL: Timeout while attempting connection [22:12:50] PROBLEM - configured eth on oxygen is CRITICAL: Timeout while attempting connection [22:12:51] PROBLEM - salt-minion processes on oxygen is CRITICAL: Timeout while attempting connection [22:13:27] ori: I think we're good for merge, unless you want to keep staring and/or doing some kind of validation/testing first [22:13:39] mutante: kind of, upgraded the server OS. [22:15:45] matanya: oxygen and gadolinium.. the others seem fine .. hrmm [22:16:34] mutante: they drop ? [22:17:15] matanya: that's the thing, icinga says it cant connect but i dont even see dropped packets in the counters [22:18:10] mutante: on what port icinga is trying ? [22:18:29] matanya: should be 5666 [22:18:49] mutante: can you try by hand from neon ? 
[22:19:42] (03CR) 10BBlack: [C: 031] ":)" [puppet] - 10https://gerrit.wikimedia.org/r/183171 (owner: 10Ori.livneh) [22:21:10] PROBLEM - salt-minion processes on erbium is CRITICAL: Timeout while attempting connection [22:21:10] PROBLEM - Disk space on erbium is CRITICAL: Timeout while attempting connection [22:21:12] matanya: yea, doesn't work, needs a hole for nrpe, works with identical command on other hosts [22:21:31] PROBLEM - RAID on erbium is CRITICAL: Timeout while attempting connection [22:21:40] PROBLEM - dhclient process on erbium is CRITICAL: Timeout while attempting connection [22:21:42] weird [22:21:50] PROBLEM - DPKG on erbium is CRITICAL: Timeout while attempting connection [22:21:50] PROBLEM - puppet last run on erbium is CRITICAL: Timeout while attempting connection [22:22:00] PROBLEM - udp2log log age for erbium on erbium is CRITICAL: Timeout while attempting connection [22:22:10] PROBLEM - configured eth on erbium is CRITICAL: Timeout while attempting connection [22:22:11] mutante: regarding openmeetings, the fix it easy [22:22:22] point port 5008 to 443 [22:23:12] matanya: ah, these are the servers who don't use role::logging::mediawiki, but still have udp2log users on them [22:23:20] matanya: re: openmeetings, cool! [22:24:16] fluorine and vanadium are fine, they use the logging::mediawiki class [22:27:07] ecmabot-wm: not yet :p [22:27:07] mutante: There is no command: not yet :p [22:27:12] hah [22:27:25] ecmabot-wm: help [22:29:47] # mediawiki udp2log instance. Does not use monitoring. [22:32:18] ori: I need to leave the house for a bit and get some groceries and such. I think we're good, but you might want to wait to merge till I get back later so someone's around to share the blame [22:32:28] but, up to you :) [22:35:02] mutante: typo: 5080 is the port [22:35:53] matanya: ? for nrpe? it listens on 5666 [22:35:59] typo where [22:36:56] ooh, openmeeting, ok [22:37:14] guess what has broken? [22:37:15] Zuul! 
[22:37:48] !log Restarted Zuul, deadlocked waiting for Gerrit [22:37:51] Logged the message, Master [22:39:43] bblack: cool. i'll wait. thanks a ton! :) [22:44:07] (03PS1) 10Ori.livneh: nutcracker: specify 0666 file mode for UNIX socket [puppet] - 10https://gerrit.wikimedia.org/r/184790 [22:44:39] (03PS2) 10Ori.livneh: nutcracker: specify 0666 file mode for UNIX socket [puppet] - 10https://gerrit.wikimedia.org/r/184790 [22:44:57] (03CR) 10Ori.livneh: [C: 032 V: 032] nutcracker: specify 0666 file mode for UNIX socket [puppet] - 10https://gerrit.wikimedia.org/r/184790 (owner: 10Ori.livneh) [22:47:14] (03PS1) 10Dzahn: ud2plog: open hole for monitoring, nrpe from icinga [puppet] - 10https://gerrit.wikimedia.org/r/184791 [22:48:10] (03CR) 10jenkins-bot: [V: 04-1] ud2plog: open hole for monitoring, nrpe from icinga [puppet] - 10https://gerrit.wikimedia.org/r/184791 (owner: 10Dzahn) [22:49:04] (03PS2) 10Dzahn: ud2plog: open hole for monitoring, nrpe from icinga [puppet] - 10https://gerrit.wikimedia.org/r/184791 [22:49:37] (03PS1) 10Ori.livneh: Use UNIX domain socket for nutcracker on mw1030 & mw1031 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184792 [22:49:46] (03PS2) 10Ori.livneh: Use UNIX domain socket for nutcracker on mw1030 & mw1031 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184792 [22:49:52] (03CR) 10Ori.livneh: [C: 032] Use UNIX domain socket for nutcracker on mw1030 & mw1031 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184792 (owner: 10Ori.livneh) [22:49:56] (03CR) 10jenkins-bot: [V: 04-1] ud2plog: open hole for monitoring, nrpe from icinga [puppet] - 10https://gerrit.wikimedia.org/r/184791 (owner: 10Dzahn) [22:50:04] (03Merged) 10jenkins-bot: Use UNIX domain socket for nutcracker on mw1030 & mw1031 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184792 (owner: 10Ori.livneh) [22:50:20] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00666666666667 [22:51:19] (03PS3) 
10Dzahn: ud2plog: open hole for monitoring, nrpe from icinga [puppet] - 10https://gerrit.wikimedia.org/r/184791 [22:52:48] !log ori Synchronized wmf-config/mc.php: Use UNIX domain socket for nutcracker on mw1030 & mw1031 (duration: 00m 05s) [22:52:57] Logged the message, Master [22:53:38] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974750 (10chasemp) [22:53:53] (03CR) 10Dzahn: [C: 032] ud2plog: open hole for monitoring, nrpe from icinga [puppet] - 10https://gerrit.wikimedia.org/r/184791 (owner: 10Dzahn) [22:55:30] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [22:58:02] 3operations, Phabricator: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#974765 (10chasemp) [23:00:05] 3ops-core: revoke old digicert certificates - https://phabricator.wikimedia.org/T86689#974766 (10RobH) chatted with brandon about this earlier today (thanks for linking patchsets!) I've begun to revoke the certs, so dns-admin@wikimedia.org will be getting spammed. 
[23:04:04] even when i know im supposed to revoke the cert, and I confirm its not actually running in production, im still paranoid =P [23:08:08] (03PS1) 10Ori.livneh: nutcracker: add UNIX domain socket server pool everywhere, but don't use it [puppet] - 10https://gerrit.wikimedia.org/r/184799 [23:09:27] (03PS2) 10Ori.livneh: nutcracker: add UNIX domain socket server pool everywhere, but don't use it [puppet] - 10https://gerrit.wikimedia.org/r/184799 [23:15:47] (03PS1) 10Rush: phab update (and peripherals) for T78243 [puppet] - 10https://gerrit.wikimedia.org/r/184802 [23:16:43] (03PS1) 10Dzahn: udp2log: fix ferm error with udp2log_rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/184803 [23:17:12] (03PS2) 10Dzahn: udp2log: fix ferm error with udp2log_rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/184803 [23:17:53] (03PS3) 10Dzahn: udp2log: fix ferm error with udp2log_rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/184803 [23:17:57] 3ops-core: increase misc-web-lb cp pool from 2 to 3 systems? 
- https://phabricator.wikimedia.org/T86718#974794 (10RobH) 3NEW a:3mark [23:19:04] (03CR) 10Dzahn: [C: 032] "$hosts_allow = ['stat1002.eqiad.wmnet']" [puppet] - 10https://gerrit.wikimedia.org/r/184803 (owner: 10Dzahn) [23:19:32] (03CR) 1020after4: [C: 031] phab update (and peripherals) for T78243 [puppet] - 10https://gerrit.wikimedia.org/r/184802 (owner: 10Rush) [23:22:15] (03CR) 10Dzahn: "that fixed it - after this ferm service starts properly again on: gadolinium, erbium and oxygen and Icinga checks on these hosts started r" [puppet] - 10https://gerrit.wikimedia.org/r/184803 (owner: 10Dzahn) [23:23:01] PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Puppet has 1 failures [23:24:53] (03PS2) 1020after4: phab update (and peripherals) for T78243 [puppet] - 10https://gerrit.wikimedia.org/r/184802 (owner: 10Rush) [23:25:01] RECOVERY - Disk space on oxygen is OK: DISK OK [23:25:01] RECOVERY - DPKG on oxygen is OK: All packages OK [23:25:13] (03CR) 1020after4: [C: 031] phab update (and peripherals) for T78243 [puppet] - 10https://gerrit.wikimedia.org/r/184802 (owner: 10Rush) [23:25:31] RECOVERY - RAID on oxygen is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [23:25:31] RECOVERY - dhclient process on oxygen is OK: PROCS OK: 0 processes with command name dhclient [23:25:41] RECOVERY - puppet last run on oxygen is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [23:26:00] RECOVERY - configured eth on oxygen is OK: NRPE: Unable to read output [23:26:00] RECOVERY - salt-minion processes on oxygen is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:26:01] RECOVERY - udp2log log age for oxygen on oxygen is OK: OK: all log files active [23:33:59] (03PS3) 10Ori.livneh: nutcracker: add UNIX domain socket server pool everywhere, but don't use it [puppet] - 10https://gerrit.wikimedia.org/r/184799 [23:37:29] (03CR) 10Ori.livneh: [C: 032] nutcracker: add UNIX domain socket server pool everywhere, but 
don't use it [puppet] - 10https://gerrit.wikimedia.org/r/184799 (owner: 10Ori.livneh) [23:37:47] (03CR) 10Dzahn: [C: 032] Remove now unused iptables.pp [puppet] - 10https://gerrit.wikimedia.org/r/184694 (owner: 10Faidon Liambotis) [23:39:39] (03PS3) 10Rush: phab update (and peripherals) for T78243 [puppet] - 10https://gerrit.wikimedia.org/r/184802 [23:39:40] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [23:41:20] PROBLEM - puppet last run on mw1035 is CRITICAL: CRITICAL: puppet fail [23:43:09] mutante: Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not parse for environment production: No file(s) found for import of 'iptables.pp' at /etc/puppet/manifests/site.pp:9 on node mw1035.eqiad.wmnet [23:43:19] (mw1035) [23:43:25] re-running puppet to see if it was ephemeral. [23:43:34] arg.. but i grep'ed a whole bunch [23:43:37] ok [23:43:55] yeah, seems OK now. must have been a race condition. [23:44:00] PROBLEM - Packetloss_Average on analytics1026 is CRITICAL: packet_loss_average CRITICAL: 89.0481615591 [23:44:18] phhew, cool, i hope that one above is also not related [23:44:51] RECOVERY - puppet last run on mw1035 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [23:45:17] (03PS1) 10Kaldari: Turning on WikiGrok for anons on test and test2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184810 [23:46:10] RECOVERY - Packetloss_Average on analytics1026 is OK: packet_loss_average OKAY: 0.301448532609 [23:46:17] uhm, yet, must be, i got denied pubkey [23:46:20] and recoverd [23:46:31] (03PS1) 10Kaldari: Turning on WikiGrok for anons on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184812 [23:47:05] (03CR) 10Kaldari: [C: 04-2] "Don't merge until change I459431d is merged and tested." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/184812 (owner: 10Kaldari) [23:50:41] !log Updated nutcracker on application servers to 0.4.0+dfsg-1+wm1. [23:50:45] Logged the message, Master [23:51:02] (03PS2) 10Dzahn: Kill role::labsnfs, deprecated & empty [puppet] - 10https://gerrit.wikimedia.org/r/184699 (owner: 10Faidon Liambotis) [23:52:15] (03CR) 10Dzahn: [C: 032] "yep, this was the only leftover from I7561d5f39086fe6" [puppet] - 10https://gerrit.wikimedia.org/r/184699 (owner: 10Faidon Liambotis) [23:54:20] 3operations, Project-Creators: HTTPS phabricator project(s) - https://phabricator.wikimedia.org/T86063#974873 (10chasemp) talked to andre about this idea and he said he would sleep on it :) [23:57:50] greg-g, around? [23:57:51] Krenair: You sent me a contentless ping. This is a contentless pong. Please provide a bit of information about what you want and I will respond when I am around.