[00:00:05] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151204T0000).
[00:00:10] the queue is rapidly clearing, down to 900k from 2m
[00:01:03] (PS2) Dzahn: kafkatee: bumb submodule [puppet] - https://gerrit.wikimedia.org/r/256859
[00:01:31] (CR) Dzahn: [C: 2] kafkatee: bumb submodule [puppet] - https://gerrit.wikimedia.org/r/256859 (owner: Dzahn)
[00:02:38] (CR) Dzahn: [V: 2] kafkatee: bumb submodule [puppet] - https://gerrit.wikimedia.org/r/256859 (owner: Dzahn)
[00:02:57] * yuvipanda bumbs mutante
[00:03:37] RoanKattouw: the patches are there on the Deployments page (those are the patches on master), I'll update the CentralNotice deploy branch, sorry for the delay!
[00:03:51] No worries, I just came back from the kitchen anyway
[00:03:58] Ah K...thx!
[00:04:54] AndyRussG: Are these cherry-picked into the wmf_deploy branch already? If not, do you want me to do that?
[00:05:26] RoanKattouw: Just doing that... I think order may matter here...
[00:05:41] Right, yeah
[00:05:45] You know that better than I do
[00:06:23] There aren't many more commits between wmf_deploy and master besides yours, but there are a couple
[00:06:38] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[00:06:38] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[00:06:41] RoanKattouw: yeah I'm leaving those off
[00:07:07] Yeah both look a bit big
[00:07:16] ( https://gerrit.wikimedia.org/r/#/c/234736/ and https://gerrit.wikimedia.org/r/#/c/256054/ )
[00:10:35] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[00:10:35] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[00:10:59] RoanKattouw: ah no the second of those is going in, the first isn't tho
[00:11:46] OK
[00:12:21] Oh right, you have four, not three, my bad
[00:14:43] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/1428/" [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/256484 (owner: Dzahn)
[00:15:04] (CR) Dzahn: [V: 2] "http://puppet-compiler.wmflabs.org/1428/" [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/256484 (owner: Dzahn)
[00:16:39] (PS1) Dzahn: varnishkafka: update submodule [puppet] - https://gerrit.wikimedia.org/r/256868
[00:17:46] RoanKattouw: K wmf_deploy is all updated!
[00:17:52] Thanks
[00:17:59] RoanKattouw: likewise!!!
[00:18:03] (CR) Dzahn: [C: 2 V: 2] varnishkafka: update submodule [puppet] - https://gerrit.wikimedia.org/r/256868 (owner: Dzahn)
[00:18:38] AaronSchulz: so these incoming links count jobs are fired for every link that is added or removed from a page. Best guess we get floods of them when templates are edited. The throttle has changed over time from .75 to .25 and then up to 1. The concern seems to revolve around flooding the elasticsearch instances with too many writes. Perhaps i'm not paranoid enough, but a good number of these writes should just result in noop's and
[00:19:47] RoanKattouw: HEAD of wmf_deploy at 05544e751a1eee5fb303d6c82944027f590b1848 ...
[00:20:44] Yup
[00:20:56] Doing the wmf7 commit now
[00:22:25] (CR) Dzahn: [C: 2] minimal lint fix, indentation warning [puppet/cdh] - https://gerrit.wikimedia.org/r/256487 (owner: Dzahn)
[00:23:02] RoanKattouw: cool beans!
[00:23:40] (PS1) Dzahn: cdh: update submodule [puppet] - https://gerrit.wikimedia.org/r/256871
[00:25:51] (CR) Dzahn: [C: 2 V: 2] cdh: update submodule [puppet] - https://gerrit.wikimedia.org/r/256871 (owner: Dzahn)
[00:27:34] (CR) Dzahn: [C: 2 V: 2] "http://puppet-compiler.wmflabs.org/1429/" [puppet/nginx] - https://gerrit.wikimedia.org/r/256496 (owner: Dzahn)
[00:28:54] (PS1) Dzahn: nginx: update submodule [puppet] - https://gerrit.wikimedia.org/r/256872
[00:29:42] (CR) Dzahn: [C: 2 V: 2] nginx: update submodule [puppet] - https://gerrit.wikimedia.org/r/256872 (owner: Dzahn)
[00:29:59] Blocked-on-Operations, operations, Commons, Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1851091 (Tgr) >>! In T112421#1846029, @Tgr wrote: > there is an error where sometimes the image is badly clipped (depending on dimensions) Maybe librsvg...
[00:30:38] (CR) Dzahn: [C: -2] "http://puppet-compiler.wmflabs.org/1430/mw1033.eqiad.wmnet/change.mw1033.eqiad.wmnet.err" [puppet] - https://gerrit.wikimedia.org/r/256574 (owner: Dzahn)
[00:32:51] Argh
[00:33:01] $ git pull
[00:33:03] error: Cannot pull with rebase: You have unstaged changes.
[00:33:12] hmmm
[00:33:16] funny
[00:33:21] ori: AaronSchulz: Has either of you modified JobRunner.php on tin?
[00:33:41] yes, to fix the jobqueue snafu from earlier
[00:33:41] ori I assume
[00:33:49] go ahead and stash, i'll sort it out later
[00:33:53] OK
[00:33:57] I'll stash, pull and reapply
[00:34:03] Please do commit it to a repository somewhere later
[00:34:03] thanks
[00:34:06] yep
[00:35:06] AndyRussG: Deploying now, sorry for the delays
[00:35:19] !log catrope@tin Synchronized php-1.27.0-wmf.7/extensions/CentralNotice: SWAT (duration: 00m 32s)
[00:35:20] I was distracted for a bit and had to wait for Jenkins
[00:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:35:24] There we go
[00:35:33] AndyRussG: Should be live now
[00:35:42] thanks roan
[00:36:49] (PS3) Dzahn: mediawiki: move roles into separate files [puppet] - https://gerrit.wikimedia.org/r/256574
[00:37:25] RoanKattouw: fantastic thx! Cache should roll over in a bit...
[00:37:47] Yup
[00:38:19] AndyRussG: Trick: use the network panel of your favorite browser's inspector to look at the load.php?modules=startup request and find the Expires header. That'll tell you the timestamp of the next roll-over
[00:38:46] Ah cool
[00:39:00] (Partly because of something that we kind of consider a bug, but it's a nice debugging feature :) )
[00:39:45] (PS4) Dzahn: mediawiki: move roles into separate files [puppet] - https://gerrit.wikimedia.org/r/256574
[00:44:15] (CR) Dzahn: "yea, that's right. hmm.. not sure if it has any advantage to move it to icinga::groups::labs vs icinga::groups::misc, but that's all i can" [puppet] - https://gerrit.wikimedia.org/r/256509 (https://phabricator.wikimedia.org/T110893) (owner: Dzahn)
[00:48:05] (PS2) Dzahn: icinga/labsnfs: move monitoring groups to labsnfs [puppet] - https://gerrit.wikimedia.org/r/256509 (https://phabricator.wikimedia.org/T110893)
[00:48:22] (PS3) Dzahn: icinga/labsnfs: move monitoring groups [puppet] - https://gerrit.wikimedia.org/r/256509 (https://phabricator.wikimedia.org/T110893)
[00:49:13] (PS4) Dzahn: icinga/labsnfs: move monitoring groups to labsnfs [puppet] - https://gerrit.wikimedia.org/r/256509 (https://phabricator.wikimedia.org/T110893)
[00:51:27] RoanKattouw: looks good so far!
[00:51:33] Good
[00:54:11] (PS8) Dzahn: add IPv6 for argon (irc,mw-rc streams) [puppet] - https://gerrit.wikimedia.org/r/214434 (https://phabricator.wikimedia.org/T105422)
[00:56:31] (PS9) Dzahn: add IPv6 for argon (irc,mw-rc streams) [puppet] - https://gerrit.wikimedia.org/r/214434 (https://phabricator.wikimedia.org/T105422)
[00:58:27] (PS10) Dzahn: add IPv6 for argon (irc,mw-rc streams) [puppet] - https://gerrit.wikimedia.org/r/214434 (https://phabricator.wikimedia.org/T105422)
[00:58:46] RoanKattouw: yeah all continue to seem fine! banners working on mobile and desktop, no issues with the functionality we tweaked
[01:01:15] (CR) Dzahn: [C: 2] add IPv6 for argon (irc,mw-rc streams) [puppet] - https://gerrit.wikimedia.org/r/214434 (https://phabricator.wikimedia.org/T105422) (owner: Dzahn)
[01:05:51] RoanKattouw: yeah all looks good, I'll be around if any problems do arise... Thanks a ton, really appreciate the help!!!! \o/
[01:06:01] (CR) Dzahn: "eth0 ..." [dns] - https://gerrit.wikimedia.org/r/214506 (https://phabricator.wikimedia.org/T105422) (owner: Dzahn)
[01:06:26] greg-g: also thanks much ^ as noted, all good :)
[01:06:50] (PS5) Dzahn: add AAAA record for argon (irc,rc streams) [dns] - https://gerrit.wikimedia.org/r/214506 (https://phabricator.wikimedia.org/T105422)
[01:09:11] (Abandoned) Dzahn: add script to flush all iptables rules for emergencies [puppet] - https://gerrit.wikimedia.org/r/228137 (owner: Dzahn)
[01:10:00] * AndyRussG waves at yuvipanda ostriches mutante Krenair :)
[01:10:12] (Abandoned) Dzahn: base::firewall: remove exec for nf_conntrack [puppet] - https://gerrit.wikimedia.org/r/253056 (owner: Dzahn)
[01:12:55] (PS2) Dzahn: torrus: switch to misc-web [dns] - https://gerrit.wikimedia.org/r/255463 (https://phabricator.wikimedia.org/T119582)
[01:15:39] (PS3) Dzahn: torrus: switch to misc-web [dns] - https://gerrit.wikimedia.org/r/255463 (https://phabricator.wikimedia.org/T119582)
[01:16:18] (CR) Dzahn: [C: 2] torrus: switch to misc-web [dns] - https://gerrit.wikimedia.org/r/255463 (https://phabricator.wikimedia.org/T119582) (owner: Dzahn)
[01:20:30] AndyRussG: sweet, have a good evening
[01:22:07] greg-g: likewise! :)
[01:33:35] !log Updated scholarships.wikimedia.org to af73bf6
[01:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:41:37] operations, Fundraising-Backlog, Wikimedia-DNS: donate.wikimedia.org needs an MX record - https://phabricator.wikimedia.org/T120322#1851233 (Peachey88)
[01:49:37] (PS1) Dzahn: smokeping: switch to misc-web cluster [dns] - https://gerrit.wikimedia.org/r/256879 (https://phabricator.wikimedia.org/T120258)
[01:58:16] (CR) Dzahn: [C: 2] smokeping: switch to misc-web cluster [dns] - https://gerrit.wikimedia.org/r/256879 (https://phabricator.wikimedia.org/T120258) (owner: Dzahn)
[02:13:27] (PS1) Ori.livneh: Turn off backoff throttling of CirrusSearch jobs [mediawiki-config] - https://gerrit.wikimedia.org/r/256880
[02:13:52] (CR) Ori.livneh: [C: 2] Turn off backoff throttling of CirrusSearch jobs [mediawiki-config] - https://gerrit.wikimedia.org/r/256880 (owner: Ori.livneh)
[02:14:31] (Merged) jenkins-bot: Turn off backoff throttling of CirrusSearch jobs [mediawiki-config] - https://gerrit.wikimedia.org/r/256880 (owner: Ori.livneh)
[02:15:18] !log ori@tin Synchronized wmf-config/CirrusSearch-common.php: (no message) (duration: 00m 29s)
[02:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:15:49] !log CirrusSearch-common.php sync was for I826d000ca: Turn off backoff throttling of CirrusSearch jobs
[02:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:19:56] operations, HTTPS, Patch-For-Review: move torrus behind misc-web - https://phabricator.wikimedia.org/T119582#1851282 (Dzahn) done. but i'll also add the proto-redirect
[02:19:59] operations, HTTPS, Patch-For-Review: move smokeping behind misc-web varnish - https://phabricator.wikimedia.org/T120258#1851283 (Dzahn) done. but i'll also add the proto-redirect
[02:26:59] ori: digging through the git logs nik was aparently worried about knocking over the cluster with those jobs. I was going to ping david (our es internals expert) before making any changes
[02:27:04] ori: it will probably be fine, but just fyi
[02:27:54] ebernhardson: at the moment the job queue has no official maintainers and one unofficial maintainer, and when it gets overloaded the impact is broad
[02:28:10] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.7) (duration: 10m 19s)
[02:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:28:35] if elastic is saturated, only search is affected. that's a big "only", but it's still narrow.
[02:28:58] and elastic has better docs :)
[02:29:02] heh
[02:30:00] * ostriches sees elastc
[02:30:05] (PS1) Jforrester: Enable importupload some import sources for officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/256881
[02:30:11] ostriches: turned off throttling of the incoming links jobs
[02:30:29] ostriches: nik changed it from .25 to .75 to 1, seemed worried it would break things
[02:31:03] Eh, it's more of a problem when doing mass indexing operations.
[02:31:57] ostriches: ok good to know
[02:33:11] It's one of those things to just keep an eye on. Cirrus jobs can easily cause load (apaches, db [worst case], elastic itself). Backoff is just one of the knobs we gave ourselves to throttle that.
[02:33:39] (PS1) Yuvipanda: dynamicproxy: Increase websocket timeout [puppet] - https://gerrit.wikimedia.org/r/256882 (https://phabricator.wikimedia.org/T120335)
[02:35:53] (PS1) Dzahn: cassandra: quoting/alignment fixes [puppet] - https://gerrit.wikimedia.org/r/256883
[02:42:39] _joe_: puppet's still broken on all the things with the tools puppetmaster
[02:42:48] oh wlel
[02:42:50] *well
[02:42:57] * yuvipanda gets to fixing
[02:43:24] (PS1) Dzahn: lint fixes [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/256884
[02:43:37] (PS2) Yuvipanda: dynamicproxy: Increase websocket timeout [puppet] - https://gerrit.wikimedia.org/r/256882 (https://phabricator.wikimedia.org/T120335)
[02:43:58] (PS2) Alex Monk: Enable importupload and some import sources for officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/256881 (owner: Jforrester)
[02:53:14] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/1433/" [puppet] - https://gerrit.wikimedia.org/r/256883 (owner: Dzahn)
[03:38:20] (PS1) Yuvipanda: labs: Setup saltmaster separately from puppetmaster [puppet] - https://gerrit.wikimedia.org/r/256889
[03:39:39] (PS1) Yuvipanda: base: Allow auto puppetmaster switching tuning [puppet] - https://gerrit.wikimedia.org/r/256890 (https://phabricator.wikimedia.org/T120159)
[03:46:46] (PS2) Yuvipanda: labs: Setup saltmaster separately from puppetmaster [puppet] - https://gerrit.wikimedia.org/r/256889
[03:47:04] (CR) Yuvipanda: [C: 2 V: 2] labs: Setup saltmaster separately from puppetmaster [puppet] - https://gerrit.wikimedia.org/r/256889 (owner: Yuvipanda)
[03:47:30] (PS1) Dzahn: openstack,quarry: fix last "ensure" warnings [puppet] - https://gerrit.wikimedia.org/r/256891
[03:48:43] (PS2) Dzahn: openstack,quarry: fix last "ensure" warnings [puppet] - https://gerrit.wikimedia.org/r/256891
[03:51:18] operations, Labs, Patch-For-Review: Kill the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#1851429 (yuvipanda) Once ^ gets merged, I'll have to find list of all instances that have role::puppet::self *and* the puppetmast...
[03:51:19] (CR) Dzahn: [C: 2] openstack,quarry: fix last "ensure" warnings [puppet] - https://gerrit.wikimedia.org/r/256891 (owner: Dzahn)
[03:51:32] (PS1) Dzahn: puppet-lint: re-enable "ensure first param" [puppet] - https://gerrit.wikimedia.org/r/256892 (https://phabricator.wikimedia.org/T93645)
[03:52:20] (PS2) Dzahn: puppet-lint: re-enable "ensure first param" [puppet] - https://gerrit.wikimedia.org/r/256892 (https://phabricator.wikimedia.org/T93645)
[03:53:24] (CR) Dzahn: [C: 2] puppet-lint: re-enable "ensure first param" [puppet] - https://gerrit.wikimedia.org/r/256892 (https://phabricator.wikimedia.org/T93645) (owner: Dzahn)
[04:00:24] (CR) coren: [C: 1] "Sane." [puppet] - https://gerrit.wikimedia.org/r/256509 (https://phabricator.wikimedia.org/T110893) (owner: Dzahn)
[04:33:58] (PS2) BryanDavis: [WIP] Elasticsearch with proxy for tool labs [puppet] - https://gerrit.wikimedia.org/r/256618 (https://phabricator.wikimedia.org/T120040)
[04:41:16] operations, Fundraising-Backlog, Wikimedia-DNS: donate.wikimedia.org needs an MX record - https://phabricator.wikimedia.org/T120322#1851468 (Jgreen) a:Jgreen
[04:44:52] operations, Fundraising-Backlog, Wikimedia-DNS: donate.wikimedia.org needs an MX record - https://phabricator.wikimedia.org/T120322#1851470 (Jgreen) @BBlack the DNS record for donate.wikimedia.org uses geodns, which used to be a blocker for adding MX records. IIRC that is no longer an issue, is there...
[04:51:30] (PS3) BryanDavis: [WIP] Elasticsearch with proxy for tool labs [puppet] - https://gerrit.wikimedia.org/r/256618 (https://phabricator.wikimedia.org/T120040)
[04:57:24] (PS4) BryanDavis: [WIP] Elasticsearch with proxy for tool labs [puppet] - https://gerrit.wikimedia.org/r/256618 (https://phabricator.wikimedia.org/T120040)
[05:25:34] (CR) KartikMistry: Enable new user groups on gu.wikipedia.org (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) (owner: Mdann52)
[05:30:22] (PS5) BryanDavis: [WIP] Elasticsearch with proxy for tool labs [puppet] - https://gerrit.wikimedia.org/r/256618 (https://phabricator.wikimedia.org/T120040)
[05:31:36] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:35:02] (PS6) BryanDavis: [WIP] Elasticsearch with proxy for tool labs [puppet] - https://gerrit.wikimedia.org/r/256618 (https://phabricator.wikimedia.org/T120040)
[05:38:06] PROBLEM - cassandra service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[05:42:55] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [5000000.0]
[05:47:46] RECOVERY - cassandra service on restbase1009 is OK: OK - cassandra is active
[05:49:46] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Dec 4 05:49:46 UTC 2015 (duration 3h 21m 36s)
[05:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:52:40] (PS7) BryanDavis: [WIP] Elasticsearch with proxy for tool labs [puppet] - https://gerrit.wikimedia.org/r/256618 (https://phabricator.wikimedia.org/T120040)
[05:53:18] !log moved /var/lib/cassandra out of the way in an attempt to stop puppet restarting cassandra on decommissioned restbase1009
[05:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:55:27] PROBLEM - cassandra service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[05:58:26] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[05:58:46] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:59:47] !log ran systemctl mask cassandra on restbase1009; it is important that this node does not start up.
[05:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:06:07] PROBLEM - puppet last run on restbase1009 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:06:09] operations, RESTBase-Cassandra, Patch-For-Review: puppet should safely manage cassandra start/stop - https://phabricator.wikimedia.org/T103134#1851583 (GWicke) If I understand this right, cassandra will still start up when a node comes back after an extended outage? That would be bad if the node has be...
[06:12:04] operations, Beta-Cluster-Infrastructure, Labs, Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1851595 (Chmarkine) Let's Encrypt is in Public Beta now. Everyone can get free certificates from them now. [1] https://letsenc...
[06:29:57] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: puppet fail
[06:30:36] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:56] PROBLEM - puppet last run on mw2024 is CRITICAL: CRITICAL: puppet fail
[06:31:26] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:06] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:32:15] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:16] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:36] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:46] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:56] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:56] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:27] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:37:41] (CR) Jcrespo: [C: -1] mariadb: bump submodule [puppet] - https://gerrit.wikimedia.org/r/256857 (owner: Dzahn)
[06:39:03] (CR) Jcrespo: "I'm holding off on purpose 256657 (there is a feature freeze in place). You should abandon this." [puppet] - https://gerrit.wikimedia.org/r/256857 (owner: Dzahn)
[06:56:46] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:57:16] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:57:17] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:55] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:24:46] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:26:45] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[07:27:06] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[07:27:07] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[07:27:16] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:28:26] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:28:27] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:29:27] RECOVERY - puppet last run on mw2024 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:50:37] PROBLEM - puppet last run on mw2184 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:18:06] RECOVERY - puppet last run on mw2184 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[08:32:27] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[08:33:16] PROBLEM - Mobile HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[08:34:07] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[08:35:06] (PS3) Muehlenhoff: Add comment on server_id parameter in openldap module [puppet] - https://gerrit.wikimedia.org/r/255115
[08:35:30] (CR) Muehlenhoff: [C: 2 V: 2] Add comment on server_id parameter in openldap module [puppet] - https://gerrit.wikimedia.org/r/255115 (owner: Muehlenhoff)
[08:37:28] (CR) Muehlenhoff: [C: 1] Open openldap servers to all wikimedia hosts. [puppet] - https://gerrit.wikimedia.org/r/256844 (owner: Andrew Bogott)
[08:38:06] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:38:26] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:39:07] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:42:55] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: puppet fail
[08:52:30] !log reimage restbase1009
[08:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:53:02] (PS2) Filippo Giunchedi: cassandra: provision restbase1009 with 128 tokens [puppet] - https://gerrit.wikimedia.org/r/256690 (https://phabricator.wikimedia.org/T95253)
[08:53:12] <_joe_> uhm
[08:53:13] (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: provision restbase1009 with 128 tokens [puppet] - https://gerrit.wikimedia.org/r/256690 (https://phabricator.wikimedia.org/T95253) (owner: Filippo Giunchedi)
[09:10:30] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:14:02] (PS1) Filippo Giunchedi: cassandra: add restbase1009-a instance [puppet] - https://gerrit.wikimedia.org/r/256902
[09:34:24] (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: add restbase1009-a instance [puppet] - https://gerrit.wikimedia.org/r/256902 (owner: Filippo Giunchedi)
[09:35:53] PROBLEM - cassandra service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra is inactive
[09:39:01] PROBLEM - Restbase endpoints health on restbase1009 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[09:39:13] PROBLEM - Restbase root url on restbase1009 is CRITICAL: Connection refused
[09:39:52] PROBLEM - cassandra CQL 10.64.48.110:9042 on restbase1009 is CRITICAL: Connection refused
[09:47:26] known ^
[09:48:22] ACKNOWLEDGEMENT - cassandra CQL 10.64.48.110:9042 on restbase1009 is CRITICAL: Connection refused Filippo Giunchedi reimage
[09:48:22] ACKNOWLEDGEMENT - cassandra service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra is inactive Filippo Giunchedi reimage
[09:54:44] operations, RESTBase-Cassandra, Patch-For-Review: puppet should safely manage cassandra start/stop - https://phabricator.wikimedia.org/T103134#1851797 (akosiaris) >>! In T103134#1851583, @GWicke wrote: > If I understand this right, cassandra will still start up when a node comes back after an extended...
[09:56:08] mobrovac: FYI I've reimaged restbase1009 but restbase itself doesn't seem to come up by itself after that Dec 04 09:49:04 restbase1009 nodejs[63553]: Error: Cannot find module '/usr/lib/restbase/deploy/restbase/server.js'
[09:56:45] euh?
[09:56:55] i even updated the repo on tin yesterday
[09:57:05] hm, lemme take a look godog
[09:57:14] it is depooled in pybal anyways so no worries there
[09:57:26] kk cool
[09:58:29] hah looks like the restbase git repo isn't checked out in deploy/ ?
[09:59:07] yup, no submodule
[09:59:15] godog: lemme ansible it
[09:59:16] :)
[10:01:33] RECOVERY - Restbase endpoints health on restbase1009 is OK: All endpoints are healthy
[10:01:37] kk all good, rb is back up godog
[10:01:40] (CR) Alexandros Kosiaris: [C: 1] "Looks ok. I am wondering whether we just got a pattern. We got various hostgroups around scattered in our puppet configs, perhaps that dir" [puppet] - https://gerrit.wikimedia.org/r/256509 (https://phabricator.wikimedia.org/T110893) (owner: Dzahn)
[10:01:54] RECOVERY - Restbase root url on restbase1009 is OK: HTTP OK: HTTP/1.1 200 - 15171 bytes in 0.005 second response time
[10:02:07] mobrovac: ack, thanks, I'll followup in a ticket on how to best fix that
[10:08:43] PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: Connection refused
[10:09:49] (PS3) Alexandros Kosiaris: package_builder: clarify how to download a package [puppet] - https://gerrit.wikimedia.org/r/256125 (owner: Merlijn van Deen)
[10:10:11] (CR) Alexandros Kosiaris: [V: 2] package_builder: clarify how to download a package [puppet] - https://gerrit.wikimedia.org/r/256125 (owner: Merlijn van Deen)
[10:11:24] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: Connection refused Filippo Giunchedi reimage
[10:15:56] (PS1) Muehlenhoff: Migrate the OpenDJ ACL for the "Directory Managers" group [puppet] - https://gerrit.wikimedia.org/r/256909
[10:27:05] (CR) Faidon Liambotis: ""We" is the WMF. I would like to keep end-to-end control of domains and not host domains in WMF-operated infrastructure that the WMF does " [dns] - https://gerrit.wikimedia.org/r/252703 (https://phabricator.wikimedia.org/T118468) (owner: JanZerebecki)
[10:27:21] (CR) Alexandros Kosiaris: [C: 2] "Did some extensive testing on this one, seems to work fine now. I 'll merge" [puppet] - https://gerrit.wikimedia.org/r/256176 (owner: Merlijn van Deen)
[10:27:26] (PS6) Alexandros Kosiaris: package_builder: add option to use built packages during build [puppet] - https://gerrit.wikimedia.org/r/256176 (owner: Merlijn van Deen)
[10:27:41] (CR) Alexandros Kosiaris: [C: 2 V: 2] package_builder: add option to use built packages during build [puppet] - https://gerrit.wikimedia.org/r/256176 (owner: Merlijn van Deen)
[10:36:59] (CR) Muehlenhoff: [C: 1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/256696 (owner: Alexandros Kosiaris)
[10:54:39] (PS3) Thiemo Mättig (WMDE): Avoid breaking full phabricator URLs [puppet] - https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997)
[11:01:11] (PS2) Muehlenhoff: Specific size_limit specifically for repluser [puppet] - https://gerrit.wikimedia.org/r/256696 (owner: Alexandros Kosiaris)
[11:02:06] (CR) Muehlenhoff: [C: 2 V: 2] Specific size_limit specifically for repluser [puppet] - https://gerrit.wikimedia.org/r/256696 (owner: Alexandros Kosiaris)
[11:03:45] operations, Beta-Cluster-Infrastructure, Labs, Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1851883 (Krenair) Yeah, beta.wmflabs.org was in the private beta. Don't know if it can actually work with our setup though.
[11:13:16] (PS4) Krinkle: Fix getMWScriptWithArgs() user error message [mediawiki-config] - https://gerrit.wikimedia.org/r/249803 (owner: Aaron Schulz)
[11:13:28] (CR) Krinkle: [C: 2] Fix getMWScriptWithArgs() user error message [mediawiki-config] - https://gerrit.wikimedia.org/r/249803 (owner: Aaron Schulz)
[11:13:51] (Merged) jenkins-bot: Fix getMWScriptWithArgs() user error message [mediawiki-config] - https://gerrit.wikimedia.org/r/249803 (owner: Aaron Schulz)
[11:16:23] PROBLEM - puppet last run on ganeti2001 is CRITICAL: CRITICAL: puppet fail
[11:17:22] (PS4) Bmansurov: Enable RelatedArticles and Cards on Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/255553 (https://phabricator.wikimedia.org/T116676)
[11:32:13] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/).
[11:33:02] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/).
[11:37:50] who's that?
[11:38:39] Reedy: ?
[11:39:12] Not in mediawiki-config
[11:39:12] diff --git a/portals b/portals
[11:39:12] index fc041b8..8f5ce8f 160000
[11:39:12] --- a/portals
[11:39:13] +++ b/portals
[11:39:13] @@ -1 +1 @@
[11:39:13] -Subproject commit fc041b88d742751a9ef011de66d01d083f3ae899
[11:39:14] +Subproject commit 8f5ce8fdbc38ad64d68c39ffcde06dcdab61d9cf
[11:39:19] That's not me, sorry
[11:39:34] you were nano'ing, that's why I asked
[11:40:10] php-1.27.0-wmf.7$ nano extensions/WikimediaMaintenance/cleanBogusLanguagesFromMsgResource.php
[11:40:17] maintenance script in a subdir
[11:40:56] jgirault: ping
[11:42:31] operations: Grant tomasz access to Google Web Master Tools for top 10 languages across desktop and mobile plus wikipedia.org portal - https://phabricator.wikimedia.org/T120136#1851930 (fgiunchedi) Open>Invalid that makes sense, thanks @deskana!
[11:44:22] RECOVERY - puppet last run on ganeti2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:44:44] 6operations, 10Analytics-Cluster, 10Traffic, 5Patch-For-Review: Can't download large datasets from datasets.wikimedia.org - https://phabricator.wikimedia.org/T104004#1851937 (10fgiunchedi) confirmed, thanks @bblack! [11:46:57] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: Additional diskspace of wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T119579#1851939 (10fgiunchedi) drive-by comment: partman will likely to be adjusted too so we don't run into surprises when reprovisioning [11:57:43] 6operations, 10ops-esams: decom amslvs1-4 (dc work) - https://phabricator.wikimedia.org/T87790#1851964 (10fgiunchedi) 5Open>3stalled [12:18:09] (03PS1) 10Muehlenhoff: Also index substring searches for sudoUser [puppet] - 10https://gerrit.wikimedia.org/r/256929 [12:25:20] 6operations, 6Phabricator, 7Mail: DomainKeys Identified Mail (DKIM) for phabricator.wikimedia.org - https://phabricator.wikimedia.org/T116805#1852017 (10fgiunchedi) if I'm reading exim's configuration right, whether to use dkim or not is `remote_smtp_signed` vs `remote_smtp` ``` # Route non-local domains (i... 
[12:35:53] PROBLEM - puppet last run on mw2121 is CRITICAL: CRITICAL: puppet fail [12:40:21] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: PID not expanded in heap dumps - https://phabricator.wikimedia.org/T116814#1852049 (10fgiunchedi) 5Open>3Resolved completed, cassandra has been rolled restarted following openjdk security advisory [12:46:49] 6operations, 10ops-eqiad: mw1259 does not have hyperthreading enabled - https://phabricator.wikimedia.org/T120270#1852056 (10fgiunchedi) p:5Triage>3Normal [12:49:26] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1852062 (10fgiunchedi) restbase1009 is bootstrapping, started at 09:36 [12:51:22] (03CR) 10Alexandros Kosiaris: [C: 031] Also index substring searches for sudoUser [puppet] - 10https://gerrit.wikimedia.org/r/256929 (owner: 10Muehlenhoff) [13:02:04] (03PS5) 10Giuseppe Lavagetto: k8s: switch to using systems' CA [puppet] - 10https://gerrit.wikimedia.org/r/243662 (https://phabricator.wikimedia.org/T114638) [13:03:43] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 72 failures [13:05:53] RECOVERY - puppet last run on mw2121 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:30:28] (03PS3) 1020after4: Install arc on the jenkins slaves. [puppet] - 10https://gerrit.wikimedia.org/r/256712 (https://phabricator.wikimedia.org/T103127) [13:31:13] ^ this is pretty straightforward patch, anyone willing to merge it [13:43:37] (03PS8) 10BryanDavis: Elasticsearch with proxy for tool labs [puppet] - 10https://gerrit.wikimedia.org/r/256618 (https://phabricator.wikimedia.org/T120040) [13:48:05] (03PS2) 10Krinkle: Revert "Don't commit interwiki.cdb anymore" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250294 (owner: 10Ori.livneh) [13:48:30] twentyafterfour: we're usually using puppet swat for those, unless it is urgent? 
[13:49:06] not emergency urgent [13:49:16] godog: when is puppet swat? [13:49:27] * twentyafterfour didn't know that was a thing now [13:50:43] twentyafterfour: next thurs looks like, from Deployments [13:52:04] (03CR) 10Krinkle: [C: 032] Revert "Don't commit interwiki.cdb anymore" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250294 (owner: 10Ori.livneh) [13:52:21] ugh [13:52:26] (03Merged) 10jenkins-bot: Revert "Don't commit interwiki.cdb anymore" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250294 (owner: 10Ori.livneh) [13:52:29] * Krinkle is comitting to gerrit what is already on tin [13:52:38] cleaning up the mis-match [13:53:45] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [13:54:29] twentyafterfour: I'll JFDI since I'm also clinic duty [13:54:49] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Install arc on the jenkins slaves. [puppet] - 10https://gerrit.wikimedia.org/r/256712 (https://phabricator.wikimedia.org/T103127) (owner: 1020after4) [13:56:15] (03PS1) 10Krinkle: Commit untracked file interwiki.json from tin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256939 [13:56:47] 6operations, 10ops-codfw: rack new yubico auth system - https://phabricator.wikimedia.org/T120263#1852196 (10MoritzMuehlenhoff) Using auth1001/auth2001 seems fine with me. OS should indeed be jessie and these will use an internal IP. [13:56:49] (03CR) 10Krinkle: [C: 032] Commit untracked file interwiki.json from tin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256939 (owner: 10Krinkle) [13:57:11] (03Merged) 10jenkins-bot: Commit untracked file interwiki.json from tin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256939 (owner: 10Krinkle) [13:57:21] <_joe_> Krinkle: sigh, seriously? 
[13:57:52] PROBLEM - HHVM rendering on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:58:33] PROBLEM - Apache HTTP on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:59:53] PROBLEM - RAID on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:00:03] PROBLEM - SSH on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:00:22] PROBLEM - HHVM processes on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:00:22] PROBLEM - nutcracker port on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:00:23] PROBLEM - DPKG on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:00:23] _joe_: There is three other dirty changes on tin since Nov 1 [14:00:24] mw1119 is having a bad day it looks like [14:00:42] PROBLEM - configured eth on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:00:43] I've not once since my deploy access encountered the deployment host in a clean state. [14:00:53] <_joe_> bd808: for some reasons, the API appservers are going OOM once a week [14:00:59] <_joe_> I have zero time to debug that [14:01:02] PROBLEM - dhclient process on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:01:13] PROBLEM - nutcracker process on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:01:24] PROBLEM - Disk space on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:01:26] With the exception of this one, I do always leave it in a clean state by either nuking or committing whatever is there. [14:01:33] PROBLEM - salt-minion processes on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:01:43] PROBLEM - Check size of conntrack table on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:02:10] (03PS30) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [14:02:39] Usually debug files, sometimes hotpatches or cherry-picks (which, sensitive or not, should at least be committed locally on tin) [14:04:13] (03CR) 10Mdann52: "In reply to the above, I've followed the general convention across sites as I could see (which was to call it just "rollback"). If you wan" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) (owner: 10Mdann52) [14:07:10] (03PS9) 10BryanDavis: Elasticsearch with proxy for tool labs [puppet] - 10https://gerrit.wikimedia.org/r/256618 (https://phabricator.wikimedia.org/T120040) [14:11:18] (03CR) 10Hashar: "I have switched xvfb service to use base::service_unit with parent change https://gerrit.wikimedia.org/r/#/c/256643/ . This "just" adds sy" [puppet] - 10https://gerrit.wikimedia.org/r/256659 (https://phabricator.wikimedia.org/T95003) (owner: 10Hashar) [14:14:41] (03CR) 10Faidon Liambotis: [C: 04-1] Add a new security module with ::pam and ::access (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) (owner: 10coren) [14:16:16] (03PS1) 10Hashar: Move arcanist install to contint::packages::labs [puppet] - 10https://gerrit.wikimedia.org/r/256941 [14:17:08] (03CR) 10Hashar: "jenkins::requisites is to setup a Jenkins slave, not a CI node. The class is used on beta cluster instances among others where arcanist i" [puppet] - 10https://gerrit.wikimedia.org/r/256712 (https://phabricator.wikimedia.org/T103127) (owner: 1020after4) [14:17:37] paravoid: You know, I was pretty sure that worked but I haven't actually tested it. I can't see a way to throw the puppet compiler at it though since this is all labs. 
:-( [14:18:44] paravoid: I knew init.pp also works for tests/ and examples/, but those aren't really dubdirectories. [14:18:46] (03Abandoned) 10Hashar: Update gitblit.properties file with new configs [puppet] - 10https://gerrit.wikimedia.org/r/251836 (owner: 10Paladox) [14:18:53] RECOVERY - dhclient process on mw1119 is OK: PROCS OK: 0 processes with command name dhclient [14:19:03] RECOVERY - nutcracker process on mw1119 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [14:19:13] RECOVERY - Disk space on mw1119 is OK: DISK OK [14:19:23] RECOVERY - salt-minion processes on mw1119 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:19:24] RECOVERY - Check size of conntrack table on mw1119 is OK: OK: nf_conntrack is 0 % full [14:19:44] RECOVERY - RAID on mw1119 is OK: OK: no RAID installed [14:19:45] (03PS31) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [14:19:54] RECOVERY - SSH on mw1119 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [14:20:07] (03PS2) 10Muehlenhoff: Also index substring searches for sudoUser [puppet] - 10https://gerrit.wikimedia.org/r/256929 [14:20:13] RECOVERY - HHVM processes on mw1119 is OK: PROCS OK: 6 processes with command name hhvm [14:20:13] RECOVERY - nutcracker port on mw1119 is OK: TCP OK - 0.000 second response time on port 11212 [14:20:22] RECOVERY - DPKG on mw1119 is OK: All packages OK [14:20:29] (03CR) 10Muehlenhoff: [C: 032 V: 032] Also index substring searches for sudoUser [puppet] - 10https://gerrit.wikimedia.org/r/256929 (owner: 10Muehlenhoff) [14:20:32] RECOVERY - configured eth on mw1119 is OK: OK - interfaces up [14:20:33] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.074 second response time [14:21:42] (03PS1) 10Mark Bergsma: Make all IdleConnection TCP KeepAlive 
parameters configurable [debs/pybal] - 10https://gerrit.wikimedia.org/r/256942 [14:21:44] RECOVERY - HHVM rendering on mw1119 is OK: HTTP OK: HTTP/1.1 200 OK - 70672 bytes in 1.377 second response time [14:26:14] (03PS1) 10Faidon Liambotis: librenms: update settings for newer version [puppet] - 10https://gerrit.wikimedia.org/r/256943 [14:27:04] (03CR) 10Faidon Liambotis: [C: 032] librenms: update settings for newer version [puppet] - 10https://gerrit.wikimedia.org/r/256943 (owner: 10Faidon Liambotis) [14:27:33] (03CR) 10Krinkle: [C: 031] "Created dashboard at https://grafana-admin.wikimedia.org/dashboard/db/redis" [puppet] - 10https://gerrit.wikimedia.org/r/252396 (https://phabricator.wikimedia.org/T118331) (owner: 10Ori.livneh) [14:29:13] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:29:32] (03PS2) 10Faidon Liambotis: librenms: update settings for newer version [puppet] - 10https://gerrit.wikimedia.org/r/256943 [14:29:46] (03PS1) 10Jgreen: mx records for donate.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/256944 [14:29:48] (03CR) 10Faidon Liambotis: [C: 032 V: 032] librenms: update settings for newer version [puppet] - 10https://gerrit.wikimedia.org/r/256943 (owner: 10Faidon Liambotis) [14:31:02] (03PS2) 1020after4: Move arcanist install to contint::packages::labs [puppet] - 10https://gerrit.wikimedia.org/r/256941 (owner: 10Hashar) [14:31:05] (03PS32) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [14:31:42] (03CR) 10BBlack: [C: 031] mx records for donate.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/256944 (owner: 10Jgreen) [14:31:47] (03CR) 1020after4: [C: 031] Move arcanist install to contint::packages::labs [puppet] - 10https://gerrit.wikimedia.org/r/256941 (owner: 10Hashar) [14:36:01] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Move arcanist 
install to contint::packages::labs [puppet] - 10https://gerrit.wikimedia.org/r/256941 (owner: 10Hashar) [14:36:04] (03PS33) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [14:38:02] 7Puppet, 6Phabricator: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1852262 (10hashar) [14:45:09] 6operations, 10Wikimedia-Apache-configuration, 7HHVM, 7Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#1852274 (10hashar) [14:45:47] (03PS10) 10coren: Add a new security module with ::pam and ::access [puppet] - 10https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) [14:51:38] paravoid: ^^ moves $x/init.pp to $x.pp [14:51:45] (03PS2) 10Muehlenhoff: Open openldap servers to all wikimedia hosts. [puppet] - 10https://gerrit.wikimedia.org/r/256844 (owner: 10Andrew Bogott) [14:51:50] Which /does/ work properly. :-) [14:56:06] (03CR) 10Muehlenhoff: [C: 031] Open openldap servers to all wikimedia hosts. [puppet] - 10https://gerrit.wikimedia.org/r/256844 (owner: 10Andrew Bogott) [14:59:08] (03CR) 10Andrew Bogott: [C: 032] Open openldap servers to all wikimedia hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/256844 (owner: 10Andrew Bogott) [15:06:40] (03PS34) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [15:13:17] (03CR) 10Jgreen: [C: 032 V: 032] mx records for donate.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/256944 (owner: 10Jgreen) [15:16:56] 6operations, 7Graphite: graphite instance archiver keeps archiving the same instances - https://phabricator.wikimedia.org/T120377#1852354 (10fgiunchedi) 3NEW [15:19:27] 6operations, 10RESTBase: restbase unable to start after machine reimage - https://phabricator.wikimedia.org/T120379#1852368 (10fgiunchedi) 3NEW [15:21:47] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1852376 (10BBlack) >>! In T119038#1849717, @aaron wrote: > The list of thumbnails to purge comes from the list of thumbnails i... [15:22:14] PROBLEM - Host ms-be2019 is DOWN: PING CRITICAL - Packet loss = 100% [15:23:32] I'll take a look [15:24:13] RECOVERY - Host ms-be2019 is UP: PING OK - Packet loss = 0%, RTA = 36.57 ms [15:27:12] (03CR) 10Andrew Bogott: "It's not obviously to me how the new templates are installed by puppet. Is there a manifest that applies the templates with a wildcard?" [puppet] - 10https://gerrit.wikimedia.org/r/256909 (owner: 10Muehlenhoff) [15:27:43] sigh, looks like it just rebooted [15:31:39] paravoid: Did you have any other issues with the security module beyond the puppet silliness about init.pp? 
[15:32:56] (03PS1) 10Faidon Liambotis: librenms: update URL to logo [puppet] - 10https://gerrit.wikimedia.org/r/256948 [15:33:12] (03CR) 10Faidon Liambotis: [C: 032 V: 032] librenms: update URL to logo [puppet] - 10https://gerrit.wikimedia.org/r/256948 (owner: 10Faidon Liambotis) [15:33:13] !log ms-be2019 rebooted by itself, ilo event log shows "Uncorrectable Machine Check Exception (Board 0, Processor 2, APIC ID 0x00000038, Bank 0x00000003, Status 0xFE000040'00020135, Address 0x00000000'FEB82F63, Misc 0x00000000'00002285)" [15:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:31] (03PS2) 10Muehlenhoff: Migrate the OpenDJ ACL for the "Directory Managers" group [puppet] - 10https://gerrit.wikimedia.org/r/256909 [15:33:48] godog: No indication on what MCE that was? [15:34:14] Coren: not afaict, that is from ilo, nothing from mcelog [15:37:41] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1852422 (10BBlack) And to expound a bit earlier on the cache-layering problem (and Swift is another layer to consider!). When... [15:37:56] godog: If Status reflects the bank status register, the MCA should mean "Internal timer error" [15:38:15] I.e.: busted ram control. [15:39:42] (03CR) 10Muehlenhoff: "I've added the missing includes" [puppet] - 10https://gerrit.wikimedia.org/r/256909 (owner: 10Muehlenhoff) [15:41:51] 6operations, 7Graphite: graphite instance archiver keeps archiving the same instances - https://phabricator.wikimedia.org/T120377#1852430 (10fgiunchedi) I think the cause is metrics that are flowing into labmon1001 but don't really belong to a project, their prefix happens to have the same name as a project, n... 
[15:43:48] (03CR) 10Andrew Bogott: [C: 031] Migrate the OpenDJ ACL for the "Directory Managers" group [puppet] - 10https://gerrit.wikimedia.org/r/256909 (owner: 10Muehlenhoff) [15:52:20] !log add mx record for donate.wikimedia.org [15:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:53:47] 6operations, 10Fundraising-Backlog, 10Wikimedia-DNS: donate.wikimedia.org needs an MX record - https://phabricator.wikimedia.org/T120322#1852467 (10Jgreen) 5Open>3Resolved Done! ~> dig mx donate.wikimedia.org ; ANSWER SECTION: donate.wikimedia.org. 3600 IN MX 10 mx1001.wikimedia.org. donate.wikimedia.or... [15:59:40] 6operations, 10RESTBase: restbase unable to start after machine reimage - https://phabricator.wikimedia.org/T120379#1852494 (10GWicke) There is a puppet patch to fix this: https://gerrit.wikimedia.org/r/#/c/219253/ [16:00:54] 6operations, 6Discovery: Investigate adding memory to elastic10{01...16} to bring more parity between the two types of servers running elasticsearch in eqiad - https://phabricator.wikimedia.org/T117110#1852498 (10EBernhardson) 5Open>3declined a:3EBernhardson these servers are due for replacement next yea... [16:08:11] 6operations, 10RESTBase: restbase unable to start after machine reimage - https://phabricator.wikimedia.org/T120379#1852529 (10fgiunchedi) the problem at hand is "restbase unable to start after machine reimage", how would that patch help fix that? [16:08:41] 6operations, 10RESTBase: restbase unable to start after machine reimage - https://phabricator.wikimedia.org/T120379#1852531 (10GWicke) It would set things up for a clean deploy, without having to manually deleting the trebuchet stuff. 
[16:11:32] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: puppet fail [16:12:47] (03PS1) 10DCausse: Use snappy for mediawiki avro logs [puppet] - 10https://gerrit.wikimedia.org/r/256954 [16:35:24] 6operations, 7Graphite: labmon1001 graphite instance archiver keeps archiving the same instances - https://phabricator.wikimedia.org/T120377#1852582 (10Krinkle) [16:35:53] bblack, yeha if they dont' persist reboots that is a problem [16:36:18] it's a problem for everything though, not just your service [16:36:19] nuria: , bblack, i think it does go to syslog too by default (or maybe not by default, but i've seen logs in there), [16:36:33] and yes, it does persist via-syslog, but not via journalctl itself [16:36:33] right, but we are configuring logging for this thing and setting it up in systemd for the first time [16:36:42] with the intention of moving all of eventlogging to systemd [16:36:46] bblack: we really like to have logs for teh past weeks [16:36:47] we don't want to output all logs to syslog [16:36:50] wait what? [16:36:57] *past week [16:37:07] ottomata: I thought we were talking about application info/error logs [16:37:10] yeah [16:37:16] and http access logs [16:37:17] as in tracking the behavior/errors of the code, not the actual events [16:37:20] thats right [16:37:28] "and http access logs"? [16:37:37] bblack: we are, but we need those to persist for more than 1 week [16:37:42] yeah, well, this is specifically for the eventlogging service [16:37:44] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:37:50] but, more generally all of eventlogging [16:37:54] stderr->journalctl is not the right place for data, even if your data happens to be log-like [16:37:59] neither is syslog [16:38:00] we aren't talking about data [16:38:03] events aren't going here [16:38:05] just operational logs [16:38:15] but, there is a lot of logging happening [16:38:18] like? 
[16:38:24] ok, so just amend the logrotate config to account for the amount of days you need [16:38:33] bblack, look at /var/log/upstart/eventlogging*.log on eventlog1001 [16:38:45] akosiaris: logrotate for journalctl? [16:39:02] what does journalctl have to do with anything ? [16:39:07] ok [16:39:08] so [16:39:11] we are using systemd [16:39:18] and i can configure python logging to go to a file and use logrotate [16:39:23] but, stderr (and stdout) aren't going there [16:39:29] and thus if there are problems, like exceptsion thrown [16:39:34] and not caught [16:39:38] we won't see them in those logs [16:39:41] those will go to stderr [16:39:42] yes but systemd-journald forwards everything to syslog [16:39:45] which journalctl catches [16:39:48] as in /var/log/syslog [16:39:59] well, not everything, but maybe it should [16:39:59] yeah, but i don't want to have to tail logs in different places [16:40:09] plus, there will be a lot of daemons doing this [16:40:13] and other stuff in syslog [16:40:29] earlier, bblack recommended that we just scrap the python file logging, and make python logs go to stdout [16:40:30] which is fine [16:40:36] if they are all in the same place [16:40:38] i'm ok with journalctl [16:40:45] I respectfully disagree [16:40:51] but the logs do need to stick around through a reboot [16:40:57] because we need longer term retention [16:40:59] nuria: with what specifically ? 
[16:41:14] akosiaris: with making everything log to stdout [16:41:17] 2015-12-04 16:39:24,714 (Thread-10 ) Sending request(xid=7613): GetData(path='/kafka/eqiad/consumers/eventlogging-files-00/owners/eventlogging-valid-mixed/20-10', watcher=None) [16:41:21] 2015-12-04 16:39:24,714 (Thread-10 ) Received response(xid=7613): ('eventlog1001:73ab098a-c541-4eb5-8c40-9845d4e2de00', ZnodeStat(czxid=85918553531, mzxid=85918553531, ctime=1449177384218, mtime=1449177384218, version=0, cversion=0, aversion=0, ephemeralOwner=5571145557287961682, dataLength=49, numChildren=0, pzxid=85918553531)) [16:41:24] nuria: thank god. I am with ya [16:41:26] 2015-12-04 16:39:25,556 (Thread-13 ) Autocommitting consumer offset for consumer group eventlogging-files-00 and topic eventlogging-valid-mixed [16:41:29] seems ridiculously verbose, and there's tons of logfiles in that directory... [16:41:35] akosiaris: i have benefited from disntic logs many times when mainataning services [16:42:02] nuria: I can relate to that [16:42:16] bblack, ja some are too verbose for sure, those autocommits are annoyhing [16:42:17] but still [16:42:19] bblack: and rememeber taht our service is tier-2 thus it might take us couple days to backfill data after an outage [16:42:22] these seem like debugging logs, do we really need retention in the long term? [16:42:28] not in the long term [16:42:30] what does tier-2 mean ? [16:42:32] but long enough for us to respond [16:42:32] bblack: as in a week, yes [16:42:41] yeah, at least a week, i'd say a month is good [16:42:41] and how do debug logs aid in backfilling lost data? 
[16:42:50] oh, not that thing that someone invented that means "not critical service" [16:42:59] bblack: because they pinpoint different issues [16:43:04] shhhhhhh [16:43:08] bbblack: for example, a storm of invalid events [16:43:09] lets talk about logs [16:43:16] jaja [16:43:18] cause whoever that someone was made a bad choice [16:43:18] heh [16:43:20] ok [16:43:24] nuria: those events should be backfilled from the raw logs or kafka, not error logs [16:43:24] but [16:43:39] it is useful to look at error logs to see why they failed [16:43:41] ottomata: right, but having process logs [16:43:47] so yes, they are debug/operational logs [16:43:53] ottomata: lets you identify if you are backfilling due to an outage [16:43:59] sure [16:44:06] NEway, so summary [16:44:08] ottomata: or to someone comitting a bad schema and thus creating an storm of invalid events [16:44:09] but even then, for most things we don't keep more than a week or so [16:44:19] ottomata: this is happen many times [16:44:21] a week is fine [16:44:30] e.g. the system-level syslog only goes back 7 logrotate files [16:44:42] ottomata: mmm.. 
a week is minimum [16:45:21] - operational logs for a process should be in the same place [16:45:21] - they should persist across reboots [16:45:21] - they should stick around at least a week [16:45:26] but if you're going to be spamming out heavy debug info, we don't want to just send it all to stdout as one stream [16:45:40] bblack: agreed [16:45:49] you want to at least get that into journalctl/syslog with some level=info/crit/debug type of metadata [16:46:00] bblack: not debug, but useful info to pinpoint errors in several disjointed systems [16:46:34] there's 200M of gzipped log data in eventlog1001:/var/log/upstart/ , all for eventlogging [16:46:35] debug is just one more priority level, whether you want to log it or not is entirely up to the configuration of syslog [16:46:40] that sounds like debugging to me [16:47:05] bblack, example, yesterday, the mysql inserts got real slow for some reason [16:47:18] we had to look at the mysql consumer output logs to find that out [16:47:34] it was nice to have information about number of events inserted, and how long those inserts too, and batch sizes, all rolling in [16:47:43] took* [16:48:01] that's info level [16:48:13] ok, yes, and much of that is configured via python logging at info level [16:48:15] and btw syslog allows you to differentiate between priority levels [16:48:15] they aren't just print statments [16:48:26] log 10 days of info, 15 or warn, 20 of err etc [16:48:41] ja but then we have to figure out how to translate between python log levels and syslog log levels [16:48:42] ok maybe to put it in different words: [16:48:49] they are the same [16:49:05] i mean, we could make a syslog facility, i'm sure there's some syslog python log handler [16:49:07] thankfully, that's why python logging module supports syslog [16:49:09] but, do we have to?! 
[16:49:14] if your "log" is outputting one or more lines of log data for every tiny event coming through your code at high rate, that's debugging-level logging, not normal "just errors and important info" logging [16:49:33] its not for every tiny event [16:49:34] but ja [16:49:35] it is a lot [16:49:51] we don't want that kind of data spamming into e.g. the default /var/log/syslog [16:49:56] bblack: not for every event m, by any means, we have 200 per sec [16:49:57] uh oh, we are defining now what is debug and what not.. this is becoming bikeshedding I think [16:50:05] (which it won't, if it's set for the debug level in the metadata wherever it enters at) [16:50:15] bblack: but think that EL is not 1 system, is several daemons that do many different things [16:50:20] akosiaris: I'm just defining it in a "don't spam the central logs" way [16:50:31] yeah agree with that [16:50:41] so, apart for python stacktraces (which btw can also be in /var/log/syslog) is there anything else that should not be in it's own /var/log/eventlog/.log ? [16:50:44] which is why it is nice that each process has its own output file [16:50:52] so if you want to do this through systemd's capture of stdout/stderr... [16:50:53] bblack, akosiaris : and we need to log info about our validating process and kafka consumer and mysql insertion process [16:50:55] akosiaris: i want everything to be in a file like that [16:51:00] you can prefix each message with [16:51:25] ottomata: great! so what's stopping us ? [16:51:31] systemd! :) [16:51:32] I want the same thing btw [16:51:33] hehehe [16:51:35] why ? [16:51:40] how do you make systemd do that? [16:51:51] <5>This is an info message [16:51:59] <7>This is a debug message [16:52:03] etc... 
to stdout [16:52:27] bblack: i am not fond of that if all ends up on teh same log [16:52:29] *the [16:52:34] ottomata: http://0pointer.de/blog/projects/journal-submit.html [16:52:36] it doesn't [16:52:47] so that's how the journal works [16:52:48] the <5> or <7> tells systemd what priority level the message is [16:53:03] but really, I 'd say avoid going through that... [16:53:03] bblack: we need as akosiaris said "/var/log/eventlog/.log" and at least 1 week retention [16:53:24] ok akosiaris [16:53:39] how do you make a stderr stacktrace go to .log with systemd? [16:53:43] bblack: i am sure you guys know the best system to achieve that but it cannot be all lumped together [16:54:16] ottomata: ah, that's a valid question. so you either 1) don't which is fine cause you don't expect it to be often 2) wrap everything in the logging handling [16:54:31] and make sure the logging handling is really simple and can't mess it up [16:54:46] eitherwise you manage to DDoS your own application [16:55:06] true story, done in WMF btw [16:55:14] heh,i did some googling, and i can set sys.stderr to a logging handler [16:55:15] in python [16:55:18] but that just seemed hacky [16:56:14] well [16:56:58] bblack, can we change journalctl persistence settings? [16:57:03] in theory, (1) journald already knows the separate daemon names as metadata (2) you can output priority metadata to stdout too with <4>, and (3) rsyslog pulls data from the journal to write syslog files, and we can configure rsyslog to pick up and place those by- daemon/priority [16:57:14] but I haven't really dug into the details for all of that for any service yet [16:57:35] ottomata: the journal doesn't persist reboots, but the system logs in /var/log/, which can be sourced from the journal, do [16:57:47] bblack, i think you can configure it to persist reboots, no? [16:57:53] Storage= [16:57:54] http://www.freedesktop.org/software/systemd/man/journald.conf.html [16:57:55] ottomata: yes we can. 
moritzm already said "see /usr/share/doc/systemd/README.Debian.gz how to enable it" [16:57:55] I don't know that we want to [16:58:05] "persistent", data will be stored preferably on disk, i.e. below the /var/log/journal hierarchy [16:58:07] since we're already persisting via rsyslog [16:58:08] but we don't want to do that just because eventlogging needs it [16:58:22] ja, but i can't do rsyslog -u eventlogging-service-eventbus [16:58:24] which is pretty nice [16:58:28] akosiaris: it's the humans behing EL [16:58:31] *behind [16:58:38] not the system itself [16:58:41] i mean, we could do it not for everything [16:58:45] nuria: and the humans behind the rest of the fleet as well [16:58:51] just those who want it? [16:58:52] iunno [16:58:56] ottomata: not like that [16:59:05] no, i mean [16:59:05] akosiaris: jaja [16:59:08] just boxes that want it [16:59:18] but ja, not ideal [16:59:24] we should evaluate if it's worth doing it for sure. but this is something that should be done globally [16:59:26] i'm not excited about setting up all that syslog stuff [16:59:26] ottomata: the idea would be if you want to view via journalctl (which you don't have to) it would be "journalctl -u eventlogging-client-01" (or whatever daemon name) [16:59:31] its nice to just be able to use python logging [16:59:34] maybe i'll jsut do that ? [16:59:39] yeah [16:59:41] bblack,i like that [16:59:43] but if rsyslog is pulling to disk, you'd just tail the files there [16:59:52] for the persistent files you want [17:00:04] ottomata: python logging does support syslog, in fact it's an abstraction layer of all the possible logging methods [17:00:10] but i don't want to spam it, and i think modifying the whole app itself for systemd isn't right [17:00:24] akosiaris: ja i'm ok with that [17:00:29] syslog doesn't have to spam the /var/log/syslog, again we can configure the service/priority -> different files [17:00:31] buuut, won't I need special syslog facilities? 
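The `<N>` prefix scheme discussed above can be produced from Python logging with a small formatter. This is a sketch, assuming systemd's default handling of `<N>` priority prefixes on a unit's stdout/stderr (the sd-daemon convention); `JournalPrefixFormatter`, `journal_priority` and `render` are hypothetical names, not anything that exists in eventlogging. The numbers follow syslog: 7 debug, 6 info, 4 warning, 3 err, 2 crit.

```python
import io
import logging


def journal_priority(levelno):
    """Map a numeric Python log level to a syslog priority number."""
    if levelno >= logging.CRITICAL:
        return 2
    if levelno >= logging.ERROR:
        return 3
    if levelno >= logging.WARNING:
        return 4
    if levelno >= logging.INFO:
        return 6
    return 7


class JournalPrefixFormatter(logging.Formatter):
    """Prefix every output line with <N> so each line keeps its priority."""

    def format(self, record):
        text = super().format(record)
        prefix = "<%d>" % journal_priority(record.levelno)
        # Tracebacks span several lines; prefix each one separately.
        return "\n".join(prefix + line for line in (text.splitlines() or [""]))


def render(level, message):
    """Format one record into an in-memory stream (demonstration only)."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JournalPrefixFormatter("%(message)s"))
    logger = logging.getLogger("journal-prefix-demo")
    logger.propagate = False
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    try:
        logger.log(level, message)
    finally:
        logger.removeHandler(handler)
    return stream.getvalue()


if __name__ == "__main__":
    print(render(logging.INFO, "inserted batch"), end="")
```

In a real service the handler would wrap `sys.stderr` instead of a `StringIO`, which keeps the application on plain Python logging without a syslog dependency.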
[17:00:58] been a while since i configured a bunch of things for syslog, but aren't there only a few facilicties? local0, local1, etc.? [17:01:00] but the same is true regardless of whether you use the syslog API, or the journal API (which is just stdout and priority <4> prefixes) [17:01:00] yup, there are multiple available [17:01:18] ottomata: traditional syslog only has a few predefined, but journald+rsyslog can make up new ones [17:01:23] hm ok [17:01:27] oof [17:01:30] still not excited about it [17:01:32] but hm [17:01:56] if $msg startswith 'application1' then /var/log/application1 [17:02:03] that easily done in rsyslog [17:02:17] that's exactly the configuration used in fact [17:03:25] akosiaris: taht will happen by default if I use syslog python hanlder? [17:04:04] no, but puppet is there for ya [17:04:11] it's fully decoupled systems... [17:04:21] the 'application1' part will happen [17:04:34] what we will need is that rsyslog config stanza [17:04:44] but that's about it [17:05:18] so rsyslog also has an imjournal module, but I'm not sure if jessie has all that set up [17:06:04] OOOOOok will look into it [17:06:25] but even without that, what we have now is e.g. /etc/rsyslog.d/70-varnishkafka.conf => if $programname == 'varnishkafka' then /var/log/varnishkafka.log [17:07:34] PROBLEM - HHVM rendering on mw2100 is CRITICAL: Connection refused [17:07:44] PROBLEM - Apache HTTP on mw2100 is CRITICAL: Connection refused [17:07:44] yeah I don't think jessie has rsyslog imjournal [17:08:06] OHyeah, i remember that. hmm [17:08:07] not so bad :) [17:08:09] bblack: jessie does the systemd-journald default where the socket for /dev/log is used by systemd-journald which then pipes everything to rsyslog via another socket [17:08:22] thank you for your time bblack and akosiaris [17:08:29] trying to find out which other socket that is [17:08:40] nuria: you are very welcome. 
thanks for approaching us on this [17:09:41] akosiaris: ok [17:09:52] bblack: cat /lib/systemd/system/systemd-journald-dev-log.socket and /lib/systemd/system/systemd-journald.socket [17:09:54] ah [17:10:02] ListenDatagram=/run/systemd/journal/dev-log [17:10:02] Symlinks=/dev/log [17:10:34] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: puppet fail [17:11:41] bblack: systemd-journald will forward all received log messages to the AF_UNIXSOCK_DGRAM socket /run/systemd/journal/syslog, if it [17:11:41] exists, which may be used by Unix syslog daemons to process the data further. [17:13:33] RECOVERY - HHVM rendering on mw2100 is OK: HTTP OK: HTTP/1.1 200 OK - 70693 bytes in 0.278 second response time [17:13:39] which /lib/systemd/system/syslog.socket does provide [17:13:43] RECOVERY - Apache HTTP on mw2100 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.201 second response time [17:15:33] (mw2100 is me) [17:15:35] akosiaris: ok, the thing I'm unsure about is whether, with that simply log socket thing, journald can actually communicate metadata like the program/daemon name to rsyslog or not [17:15:59] we have pybal as a testcase though (as a named daemon that logs to journald exclusively, can play with getting rsyslog to output it to one file) [17:16:20] bblack: agreed. 
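A rough sketch of the per-daemon rsyslog stanza mentioned above (`if $programname == 'varnishkafka' then /var/log/varnishkafka.log`), generated from a program name. In practice puppet would template this file; the helper name `render_rsyslog_rule` and the default log path layout are assumptions made only for illustration.

```python
import re

# Conservative whitelist so the name can be embedded in the rule safely.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def render_rsyslog_rule(program, logfile=None):
    """Build one rsyslog filter line routing a program's messages to a file."""
    if not _SAFE_NAME.fullmatch(program):
        raise ValueError("unsafe program name: %r" % (program,))
    if logfile is None:
        # Assumed default layout: /var/log/<program>.log
        logfile = "/var/log/%s.log" % program
    return "if $programname == '%s' then %s\n" % (program, logfile)


if __name__ == "__main__":
    # Prints the same stanza quoted in the discussion for varnishkafka.
    print(render_rsyslog_rule("varnishkafka"), end="")
```

The output would go into a file such as `/etc/rsyslog.d/70-varnishkafka.conf`. Whether `$programname` is populated for messages forwarded from journald is the open question raised in the conversation, so this only covers the rule text itself.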
[17:17:42] PROBLEM - HHVM rendering on mw2100 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:54] PROBLEM - Apache HTTP on mw2100 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:33:44] RECOVERY - HHVM rendering on mw2100 is OK: HTTP OK: HTTP/1.1 200 OK - 70693 bytes in 0.271 second response time [17:33:54] RECOVERY - Apache HTTP on mw2100 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.110 second response time [17:36:43] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:38:44] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:39:24] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?from=1448645955998&to=1449250755998&var-site=All&var-cache_type=All&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5&theme=dark [17:39:41] (03PS3) 10Alexandros Kosiaris: xvfb: systemd support for Debian [puppet] - 10https://gerrit.wikimedia.org/r/256659 (https://phabricator.wikimedia.org/T95003) (owner: 10Hashar) [17:39:51] ^ are we still having jobq issues? the abornal pattern of purge traffic there (in the 1 week view, see the change on the right) is still there... [17:40:19] with some ugly spikes just a few hours ago [17:40:40] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] xvfb: systemd support for Debian [puppet] - 10https://gerrit.wikimedia.org/r/256659 (https://phabricator.wikimedia.org/T95003) (owner: 10Hashar) [17:40:43] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:40:54] akosiaris: thanks :-} [17:50:21] akosiaris: confirmed progname works. 
pybal emits logs to stdout with defaults for journaling (so its logs only show up in "journalctl -n 100 -u pybal" and such) [17:50:38] and with: [17:50:38] root@lvs1007:/etc/rsyslog.d# cat 99-pybal.conf [17:50:39] if $programname == 'pybal' then /var/log/pybal-sd.log [17:50:50] bblack: cool! that's reassuring [17:50:56] it created a /var/log/pybal-sd.log with only pybal's journal output recorded there [17:51:37] 6operations, 7Pybal: Pybal IdleConnectionMonitor with TCP KeepAlive shows random fails if more than 100 servers are involved. - https://phabricator.wikimedia.org/T119372#1852767 (10mark) I've looked at this some more. Using a pcap, I found that the affected backend hosts were in fact sending a TCP RST after ex... [17:51:40] I'd assuming programname will always match the systemd service unit name in these cases [17:52:21] so we can also get easily levels as well. it's probably quite universal [17:52:26] the <#> thing [17:54:03] 6operations, 10Traffic, 5Patch-For-Review, 7Pybal: pybal fails to detect dead servers under production lb IPs for port 80 - https://phabricator.wikimedia.org/T113151#1852771 (10mark) 5Resolved>3Open This was actually caused by AcceptFilter being default enabled in Apache nowadays - this results in conn... [17:54:05] yeah [17:54:16] there are of course journald "APIs" too, even a python module [17:54:26] bblack: ^ see the above two phab ticket updates for the pybal idleconnection problem [17:54:37] but journal-output APIs are generally just going to do <4> for you, may as well just skip the complexity and put the string in yourself [17:56:18] mark: yeah TCP_DEFER_ACCEPT makes sense, awesome [17:56:38] it will make apache's polling slightly less-efficient, but yeah for something that's not a public IP, I don't think it's going to matter a ton [17:56:43] indeed [18:28:37] (03CR) 10Alex Monk: "What about wmf-config/interwiki-labs.cdb from deployment-bastion? Why isn't there a json file for that?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/256939 (owner: 10Krinkle) [18:34:26] (03Abandoned) 10Dzahn: mariadb: bump submodule [puppet] - 10https://gerrit.wikimedia.org/r/256857 (owner: 10Dzahn) [18:35:40] (03CR) 10Dzahn: "i think we might be on to something with 2 directories: "monitor" and "groups", yea, agree" [puppet] - 10https://gerrit.wikimedia.org/r/256509 (https://phabricator.wikimedia.org/T110893) (owner: 10Dzahn) [18:37:00] (03PS5) 10Dzahn: icinga/labsnfs: move monitoring groups to labsnfs [puppet] - 10https://gerrit.wikimedia.org/r/256509 (https://phabricator.wikimedia.org/T110893) [18:37:28] (03CR) 10Dzahn: [C: 032] icinga/labsnfs: move monitoring groups to labsnfs [puppet] - 10https://gerrit.wikimedia.org/r/256509 (https://phabricator.wikimedia.org/T110893) (owner: 10Dzahn) [18:42:40] (03PS3) 10Dzahn: add 15.wikipedia.org -> misc-addrs [dns] - 10https://gerrit.wikimedia.org/r/248504 (https://phabricator.wikimedia.org/T599) [18:43:51] (03PS4) 10Dzahn: add 15.wikipedia.org -> misc-addrs [dns] - 10https://gerrit.wikimedia.org/r/248504 (https://phabricator.wikimedia.org/T599) [18:45:10] 6operations: Make services manageable by systemd (tracking) - https://phabricator.wikimedia.org/T97402#1852921 (10hashar) [18:45:15] 7Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Use systemd for xvfb service on Debian/Jessie - https://phabricator.wikimedia.org/T95003#1852917 (10hashar) 5Open>3Resolved a:3hashar And we now have xvfb on Jessie!!! [18:46:24] (03CR) 10Dzahn: "so yea, update:" [dns] - 10https://gerrit.wikimedia.org/r/248504 (https://phabricator.wikimedia.org/T599) (owner: 10Dzahn) [18:46:45] 6operations, 10RESTBase: restbase unable to start after machine reimage - https://phabricator.wikimedia.org/T120379#1852936 (10GWicke) @mobrovac & I discussed this on IRC. Notes from this conversation: - Ansible 1.9 now actually changes the git repository source during the regular deploy, so no manual trebuch... 
[18:51:15] (03CR) 10Krinkle: "Dunno, I'm just cleaning up month-old technical debt and adhering status quo. I don't operate beta." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256939 (owner: 10Krinkle) [19:17:45] (03PS1) 10Paladox: Enabled ogg opus support for TimedMediaHandler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256967 [19:21:52] !log krypton: updated Grafana to 2.6.0-beta1 for bug fix for issue 3422 [19:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:30:25] mark: awesome, re: T113151. [19:32:23] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [19:32:40] _joe_: ^ [19:32:53] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [19:37:02] (03Abandoned) 10Dzahn: remove quality.wikipedia.[org|com] redirects. [puppet] - 10https://gerrit.wikimedia.org/r/254061 (owner: 10Dzahn) [19:37:55] (03PS6) 10Dzahn: varnish: move file to module [puppet] - 10https://gerrit.wikimedia.org/r/253457 [19:38:49] <_joe_> ori: I'm not exactly in condition to investigate, if it's not an emergency [19:39:04] what do you think? it's your alert ;) [19:39:17] you tell me -- is it an emergency? [19:39:48] <_joe_> something is not right, given that now the range is more narrow [19:40:04] <_joe_> at a different time of the day, _I_ would look into it [19:40:26] k, i'll look [19:40:27] <_joe_> now it's the turn of some of you yankees [19:40:36] <_joe_> given it's localized in one dc [19:40:43] <_joe_> makes that less worrying [19:40:49] it also already ended [19:40:55] or seems to have anyways [19:40:57] <_joe_> you shouldn't be the one looking btw [19:41:48] me? 
[19:41:52] oh ori [19:42:23] <_joe_> yeah I have an epic lag [19:42:27] anyways, in the varnish status graphs, it was mostly-esams+text, from about 19:24->19:34, with an elevated 503 rate [19:42:42] <_joe_> which, keeping into account I'm now travelling at about 200 mph [19:42:54] <_joe_> is not something I can complain about [19:43:20] where "elevated" means it's normally <1/sec, and during the event it was up around 30/sec, versus a request rate ~15K/sec [19:43:47] something happened, probably minor link-quality issue with esams<->eqiad or something else networky in nature, but not a total outage by any means [19:44:03] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:44:33] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:44:46] <_joe_> btw this mid-threshold alarms have allowed us to catch things like bad varnishes multiple times [19:45:21] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/1438/" [puppet] - 10https://gerrit.wikimedia.org/r/253457 (owner: 10Dzahn) [19:47:38] (03CR) 10Chad: [C: 031] "Lez do it!" 
[puppet] - 10https://gerrit.wikimedia.org/r/256605 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [19:47:42] the criterion is not presence of signal value, but the ratio of signal to noise [19:48:16] (03PS1) 10Ori.livneh: Prevent Apache from setting TCP_DEFER_ACCEPT by default [puppet] - 10https://gerrit.wikimedia.org/r/256968 (https://phabricator.wikimedia.org/T119372) [19:48:29] bblack: ^ [19:48:33] (03CR) 10Rush: "https://media.giphy.com/media/3o85xrQqsHEDpgQgIo/giphy.gif" [puppet] - 10https://gerrit.wikimedia.org/r/256605 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [19:54:50] (03PS1) 10Dzahn: torrus: add protocol redirect [puppet] - 10https://gerrit.wikimedia.org/r/256969 (https://phabricator.wikimedia.org/T119582) [19:57:12] (03CR) 10Ori.livneh: [C: 031] torrus: add protocol redirect [puppet] - 10https://gerrit.wikimedia.org/r/256969 (https://phabricator.wikimedia.org/T119582) (owner: 10Dzahn) [20:01:38] (03CR) 10Dzahn: [C: 032] torrus: add protocol redirect [puppet] - 10https://gerrit.wikimedia.org/r/256969 (https://phabricator.wikimedia.org/T119582) (owner: 10Dzahn) [20:03:00] (03PS3) 10Rush: phab: add git-ssh IPv6 LVS [puppet] - 10https://gerrit.wikimedia.org/r/255173 (https://phabricator.wikimedia.org/T100519) [20:03:03] (03PS1) 10Rush: phab_epipe better debug options [puppet] - 10https://gerrit.wikimedia.org/r/256973 [20:04:37] (03CR) 10jenkins-bot: [V: 04-1] phab_epipe better debug options [puppet] - 10https://gerrit.wikimedia.org/r/256973 (owner: 10Rush) [20:06:23] (03PS1) 10EBernhardson: Set initial titlesuggest shard sizes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256974 [20:07:05] 6operations, 10Datasets-General-or-Unknown: Sometimes (at peak usage?), dumps.wikimedia.org becomes very slow for users (sometimes unresponsive) - https://phabricator.wikimedia.org/T45647#1853199 (10Nemo_bis) I've yet to find a file or occasion where average download speed goes over 1.5-2 MB/s... rather painfu... 
[20:08:40] (03PS2) 10Rush: phab_epipe better debug options [puppet] - 10https://gerrit.wikimedia.org/r/256973 [20:08:54] PROBLEM - torrus.wikimedia.org HTTP on netmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 301 Moved Permanently - string Torrus Top: Wikimedia not found on http://torrus.wikimedia.org:80/torrus - 605 bytes in 0.002 second response time [20:09:40] (03Abandoned) 10Rush: phab_epipe better debug options [puppet] - 10https://gerrit.wikimedia.org/r/256973 (owner: 10Rush) [20:10:10] (03PS1) 10Rush: phab_epipe better debug options [puppet] - 10https://gerrit.wikimedia.org/r/256975 [20:11:05] (03PS1) 10Ori.livneh: Keep jobqueue aggregator and queue data on separate instances [puppet] - 10https://gerrit.wikimedia.org/r/256976 [20:13:37] (03PS2) 10Ori.livneh: Keep jobqueue aggregator and queue data on separate instances [puppet] - 10https://gerrit.wikimedia.org/r/256976 [20:13:48] (03CR) 10Ori.livneh: [C: 032 V: 032] Keep jobqueue aggregator and queue data on separate instances [puppet] - 10https://gerrit.wikimedia.org/r/256976 (owner: 10Ori.livneh) [20:16:39] (03PS2) 10Rush: phab_epipe better debug options [puppet] - 10https://gerrit.wikimedia.org/r/256975 [20:18:55] (03PS3) 10Rush: phab_epipe better debug options [puppet] - 10https://gerrit.wikimedia.org/r/256975 [20:20:14] (03PS2) 10Rush: base: Allow auto puppetmaster switching tuning [puppet] - 10https://gerrit.wikimedia.org/r/256890 (https://phabricator.wikimedia.org/T120159) (owner: 10Yuvipanda) [20:20:51] (03CR) 10Rush: [C: 032] phab_epipe better debug options [puppet] - 10https://gerrit.wikimedia.org/r/256975 (owner: 10Rush) [20:21:34] 6operations, 10Phabricator-Bot-Requests, 10procurement, 5Patch-For-Review: update emailbot to allow cc: for #procurement - https://phabricator.wikimedia.org/T117113#1853226 (10chasemp) test [20:23:08] (03PS1) 10Ori.livneh: Declare that job queue aggregators are on port 6378 [puppet] - 10https://gerrit.wikimedia.org/r/256982 [20:24:05] (03PS1) 10Ori.livneh: job queue: use 
instances on port 6378 as aggregators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256984 [20:24:09] (03PS2) 10Ori.livneh: Declare that job queue aggregators are on port 6378 [puppet] - 10https://gerrit.wikimedia.org/r/256982 [20:24:22] (03PS2) 10Ori.livneh: job queue: use instances on port 6378 as aggregators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256984 [20:25:13] 6operations, 7HTTPS, 5Patch-For-Review: move torrus behind misc-web - https://phabricator.wikimedia.org/T119582#1853337 (10Dzahn) 5Open>3Resolved [20:25:23] 6operations, 7HTTPS: move torrus behind misc-web - https://phabricator.wikimedia.org/T119582#1830152 (10Dzahn) [20:25:54] (03CR) 10Ori.livneh: [C: 032] job queue: use instances on port 6378 as aggregators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256984 (owner: 10Ori.livneh) [20:26:09] (03PS1) 10Dzahn: smokeping: add protocol redirect [puppet] - 10https://gerrit.wikimedia.org/r/256986 (https://phabricator.wikimedia.org/T120258) [20:26:41] (03Merged) 10jenkins-bot: job queue: use instances on port 6378 as aggregators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256984 (owner: 10Ori.livneh) [20:26:43] (03PS2) 10Dzahn: smokeping: add protocol redirect [puppet] - 10https://gerrit.wikimedia.org/r/256986 (https://phabricator.wikimedia.org/T120258) [20:27:36] (03PS3) 10Dzahn: smokeping: add protocol redirect [puppet] - 10https://gerrit.wikimedia.org/r/256986 (https://phabricator.wikimedia.org/T120258) [20:28:05] (03PS4) 10Dzahn: smokeping: add protocol redirect [puppet] - 10https://gerrit.wikimedia.org/r/256986 (https://phabricator.wikimedia.org/T120258) [20:28:25] (03PS5) 10Dzahn: smokeping: add protocol redirect [puppet] - 10https://gerrit.wikimedia.org/r/256986 (https://phabricator.wikimedia.org/T120258) [20:28:47] (03CR) 10Dzahn: [C: 032] "just like for torrus" [puppet] - 10https://gerrit.wikimedia.org/r/256986 (https://phabricator.wikimedia.org/T120258) (owner: 10Dzahn) [20:28:50] !log ori@tin Synchronized 
wmf-config/jobqueue-eqiad.php: Idee6a1980: job queue: use instances on port 6378 as aggregators (duration: 00m 30s) [20:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:29:13] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [20:29:53] (03CR) 10Ori.livneh: [C: 032] Declare that job queue aggregators are on port 6378 [puppet] - 10https://gerrit.wikimedia.org/r/256982 (owner: 10Ori.livneh) [20:30:32] (03PS6) 10Dzahn: smokeping: add protocol redirect [puppet] - 10https://gerrit.wikimedia.org/r/256986 (https://phabricator.wikimedia.org/T120258) [20:31:13] (03CR) 10Dzahn: "..squeezing in" [puppet] - 10https://gerrit.wikimedia.org/r/256986 (https://phabricator.wikimedia.org/T120258) (owner: 10Dzahn) [20:36:38] 6operations, 7HTTPS, 5Patch-For-Review: move smokeping behind misc-web varnish - https://phabricator.wikimedia.org/T120258#1853470 (10Dzahn) 5Open>3Resolved [20:36:48] 6operations, 7HTTPS: move smokeping behind misc-web varnish - https://phabricator.wikimedia.org/T120258#1849395 (10Dzahn) [20:40:31] (03CR) 10coren: [C: 031] "Appears to be sane." [puppet] - 10https://gerrit.wikimedia.org/r/256890 (https://phabricator.wikimedia.org/T120159) (owner: 10Yuvipanda) [20:41:02] (03PS1) 10Chad: servermon/urls.py: minor pep8 fixes [puppet] - 10https://gerrit.wikimedia.org/r/256992 [20:44:28] (03PS1) 10Chad: ldap-yaml-enc.py: one trivial pep8 fix [puppet] - 10https://gerrit.wikimedia.org/r/256993 [20:45:05] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1853512 (10aaron) >>! In T119038#1852376, @BBlack wrote: >>>! In T119038#1849717, @aaron wrote: >> The list of thumbnails to p... 
[20:47:18] (03CR) 10Dzahn: [C: 032] servermon/urls.py: minor pep8 fixes [puppet] - 10https://gerrit.wikimedia.org/r/256992 (owner: 10Chad) [20:48:02] (03CR) 10Dzahn: [C: 032] ldap-yaml-enc.py: one trivial pep8 fix [puppet] - 10https://gerrit.wikimedia.org/r/256993 (owner: 10Chad) [20:48:15] (03PS2) 10Dzahn: ldap-yaml-enc.py: one trivial pep8 fix [puppet] - 10https://gerrit.wikimedia.org/r/256993 (owner: 10Chad) [20:56:27] (03CR) 10Dzahn: [C: 031] Have misc-web talk directly to etherpad-lite [puppet] - 10https://gerrit.wikimedia.org/r/255406 (owner: 10Alexandros Kosiaris) [20:57:17] (03CR) 10Dzahn: [C: 031] Uninstall ecryptfs-utils [puppet] - 10https://gerrit.wikimedia.org/r/256650 (owner: 10Muehlenhoff) [20:57:46] (03PS1) 10Rush: phabricator: direct task messaging matching [puppet] - 10https://gerrit.wikimedia.org/r/256995 (https://phabricator.wikimedia.org/T117113) [20:57:58] (03PS2) 10Rush: phabricator: direct task messaging matching [puppet] - 10https://gerrit.wikimedia.org/r/256995 (https://phabricator.wikimedia.org/T117113) [20:58:22] (03PS2) 10Dzahn: zuul: tweak git-daemon monitoring [puppet] - 10https://gerrit.wikimedia.org/r/256593 (owner: 10Hashar) [20:58:55] (03CR) 10Dzahn: [C: 032] "thank you! i noticed this one on IRC recently and wanted to suggest something similar" [puppet] - 10https://gerrit.wikimedia.org/r/256593 (owner: 10Hashar) [20:59:31] mutante: Had https://gerrit.wikimedia.org/r/#/c/256438/ too :) [21:00:05] mutante: hopefully that zuul git-daemon monitoring patch will work fine. [21:00:17] i saw, but that's waay larger :) [21:01:24] hashar: i think it will, and if not we can just change the thresholds to allow 2 as OK [21:02:37] mutante: i might have two git-daemon running on parallel on different ports one day though [21:02:43] mutante: we will see [21:05:41] ok! yep [21:10:26] (03CR) 10Dzahn: "let's also split that into one file for each class for the right layout. 
i'd also amend if you don't mind" [puppet] - 10https://gerrit.wikimedia.org/r/255082 (owner: 10Yuvipanda) [21:10:50] mutante: feel free to! [21:10:51] and htanks [21:10:54] *thanks [21:11:13] ok!:) [21:14:36] (03CR) 10Dzahn: [C: 031] "looks reasonable to me. added Rush" [puppet] - 10https://gerrit.wikimedia.org/r/254287 (owner: 10Merlijn van Deen) [21:15:43] PROBLEM - git_daemon_running on gallium is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/lib/git-core/git-daemon [21:17:51] grrr [21:19:28] (03CR) 10Dzahn: [C: 04-1] "author says he is not sure if it fixes the problem and the problem isn't really described in the commit message." [puppet] - 10https://gerrit.wikimedia.org/r/250450 (owner: 10Paladox) [21:20:53] hashar: this is already cherry-picked ? https://gerrit.wikimedia.org/r/#/c/243992/4 [21:21:27] mutante: should be yes [21:21:29] let me chek [21:21:33] RECOVERY - git_daemon_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/lib/git-core/git-daemon [21:21:43] heh, see the icinga [21:23:02] mutante: so yeah the Parsoid change https://gerrit.wikimedia.org/r/#/c/243992/4 is cherry picked. Used it to switch the repo to use for Parsoid code. The whole diff is solely on the puppet class role::parsoid::beta [21:23:07] mutante: so it must be a noop for prod [21:24:04] 6operations, 10Datasets-General-or-Unknown: dumps.wikimedia.org seems to have poor networking towards Telia - https://phabricator.wikimedia.org/T120425#1853689 (10Nemo_bis) 3NEW a:3ArielGlenn [21:24:04] hashar: right! 
just beta role,cool [21:24:36] 6operations, 10Datasets-General-or-Unknown: dumps.wikimedia.org seems to have poor networking towards Telia - https://phabricator.wikimedia.org/T120425#1853689 (10Nemo_bis) [21:24:38] (03PS5) 10Dzahn: beta: parsoid now uses modules defined in source [puppet] - 10https://gerrit.wikimedia.org/r/243992 (https://phabricator.wikimedia.org/T92871) (owner: 10Hashar) [21:24:49] (03CR) 10Dzahn: [C: 032] "role::parsoid::beta only" [puppet] - 10https://gerrit.wikimedia.org/r/243992 (https://phabricator.wikimedia.org/T92871) (owner: 10Hashar) [21:25:21] :-} [21:26:32] (03CR) 10Andrew Bogott: base: Allow auto puppetmaster switching tuning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/256890 (https://phabricator.wikimedia.org/T120159) (owner: 10Yuvipanda) [21:26:46] (03CR) 10Rush: [C: 04-1] "I think it's cool but I don't like dependencies on packaged managed directories directly persay, I would rather depend on the package" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/254287 (owner: 10Merlijn van Deen) [21:27:51] 6operations, 10Datasets-General-or-Unknown, 10netops: dumps.wikimedia.org seems to have poor networking towards Telia - https://phabricator.wikimedia.org/T120425#1853708 (10Dzahn) [21:28:12] 6operations, 10Datasets-General-or-Unknown: dumps.wikimedia.org seems to have poor networking towards Telia - https://phabricator.wikimedia.org/T120425#1853715 (10Nemo_bis) [21:28:26] 6operations, 10Datasets-General-or-Unknown, 10netops: dumps.wikimedia.org seems to have poor networking towards Telia - https://phabricator.wikimedia.org/T120425#1853689 (10Nemo_bis) [21:32:07] (03CR) 10Yuvipanda: base: Allow auto puppetmaster switching tuning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/256890 (https://phabricator.wikimedia.org/T120159) (owner: 10Yuvipanda) [21:34:02] (03PS3) 10Yuvipanda: base: Allow auto puppetmaster switching tuning [puppet] - 10https://gerrit.wikimedia.org/r/256890 
(https://phabricator.wikimedia.org/T120159) [21:34:10] andrewbogott: added comment [21:34:19] thanks :) [21:35:49] Reedy: https://github.com/EFForg/https-everywhere/pull/3364 [21:36:06] (03PS1) 10Chad: toollabs: pep8 fixes for pretty code :) [puppet] - 10https://gerrit.wikimedia.org/r/257002 [21:36:09] mutante: I'll merge when Travis is done :D [21:36:12] !log ori@tin Synchronized php-1.27.0-wmf.7/includes/Hooks.php: Iba0138a: Don't install a custom error handler for hooks (T117553) (duration: 00m 28s) [21:36:13] (03CR) 10Andrew Bogott: [C: 031] "cool" [puppet] - 10https://gerrit.wikimedia.org/r/256890 (https://phabricator.wikimedia.org/T120159) (owner: 10Yuvipanda) [21:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:36:18] Reedy: :)) [21:36:33] Reedy: did i do it right to also remove the test urls? [21:36:40] just cloned that the first time [21:37:07] I think it's right, yeah [21:39:41] mutante: merged :D [21:40:02] Reedy: thanks!:) [21:41:44] thank you for the wikitech fix, ori [21:42:58] andrewbogott: np, sorry it took a bit. 
[21:44:51] !log disabling puppet on labcontrol1002 for ldap testing [21:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:53:37] (03CR) 10RobH: [C: 031] phabricator: direct task messaging matching [puppet] - 10https://gerrit.wikimedia.org/r/256995 (https://phabricator.wikimedia.org/T117113) (owner: 10Rush) [21:54:26] (03PS3) 10Rush: phabricator: direct task messaging matching [puppet] - 10https://gerrit.wikimedia.org/r/256995 (https://phabricator.wikimedia.org/T117113) [21:57:17] (03CR) 10Rush: [C: 032] phabricator: direct task messaging matching [puppet] - 10https://gerrit.wikimedia.org/r/256995 (https://phabricator.wikimedia.org/T117113) (owner: 10Rush) [22:06:04] (03PS1) 10Ori.livneh: Remove ocg::role::test; unused [puppet] - 10https://gerrit.wikimedia.org/r/257009 [22:06:06] (03PS1) 10Ori.livneh: Migrate sentry and deployment roles to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/257010 [22:16:00] 6operations, 10Datasets-General-or-Unknown, 10netops: dumps.wikimedia.org seems to have poor networking towards Telia - https://phabricator.wikimedia.org/T120425#1854036 (10faidon) Can you additionally try: `wget -O /dev/null --no-check-certificate --header='Host: upload.wikimedia.org' https://upload-lb.eqia... 
[22:21:33] 6operations, 10ops-codfw, 10netops: Connect Apple Airport to mr1-codfw - https://phabricator.wikimedia.org/T86574#1854081 (10Legoktm) [22:24:47] (03CR) 10Ori.livneh: [C: 032] Remove ocg::role::test; unused [puppet] - 10https://gerrit.wikimedia.org/r/257009 (owner: 10Ori.livneh) [22:25:03] (03CR) 10Ori.livneh: [C: 032] Migrate sentry and deployment roles to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/257010 (owner: 10Ori.livneh) [22:32:46] PROBLEM - WDQS HTTP on wdqs1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 416 bytes in 0.006 second response time [22:33:16] PROBLEM - WDQS SPARQL on wdqs1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 416 bytes in 0.004 second response time [22:33:38] ^ SMalyshev [22:37:05] RECOVERY - WDQS SPARQL on wdqs1002 is OK: HTTP OK: HTTP/1.1 200 OK - 7098 bytes in 0.001 second response time [22:37:11] ori: thanks [22:38:16] RECOVERY - WDQS HTTP on wdqs1002 is OK: HTTP OK: HTTP/1.1 200 OK - 7098 bytes in 0.004 second response time [22:38:42] 6operations, 6Labs, 10Labs-Infrastructure: Apache on labs-ns0 - https://phabricator.wikimedia.org/T120463#1854225 (10Dzahn) [22:38:48] 6operations, 6Labs, 10Labs-Infrastructure: Apache on labs-ns0? - https://phabricator.wikimedia.org/T120463#1854229 (10Dzahn) [22:40:03] 6operations, 6Labs, 10Labs-Infrastructure: Apache on labs-ns[01]? - https://phabricator.wikimedia.org/T120463#1854236 (10Krenair) [23:12:32] (03CR) 10Awight: [C: 031] "Good idea, thank you!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/256707 (owner: 10Aude) [23:20:31] (03PS2) 10Dzahn: lint fixes [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/256884 [23:30:10] (03PS1) 10Ori.livneh: maps: migrate to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/257031 [23:32:20] (03PS4) 10Dzahn: role: Move quarry to use autolayout [puppet] - 10https://gerrit.wikimedia.org/r/255082 (owner: 10Yuvipanda) [23:36:10] (03PS1) 10Andrew Bogott: Remove default apache configs from labs puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/257034 (https://phabricator.wikimedia.org/T120449) [23:37:01] (03CR) 10jenkins-bot: [V: 04-1] Remove default apache configs from labs puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/257034 (https://phabricator.wikimedia.org/T120449) (owner: 10Andrew Bogott) [23:37:07] andrewbogott: not sure if you saw, there's a duplicate ticket i think [23:37:14] https://phabricator.wikimedia.org/T120463 [23:38:16] 6operations, 6Labs, 10Labs-Infrastructure: Apache on labs-ns[01]? - https://phabricator.wikimedia.org/T120463#1854383 (10Andrew) [23:38:19] 6operations, 6Labs, 10Labs-Infrastructure: Apache on labs-ns[01]? - https://phabricator.wikimedia.org/T120463#1854387 (10Dzahn) looks like this is a duplicate of T120449 just saw https://gerrit.wikimedia.org/r/257034 [23:38:24] :) [23:38:43] mutante: thanks [23:39:43] andrewbogott: yw. maybe httpseverywhere can remove that rule as well then [23:39:57] yep, Reedy did that a few minutes ago :) [23:40:00] could make another pull request for reedy [23:40:05] oh ! 
even better, cool [23:43:01] (03PS2) 10Andrew Bogott: Remove default apache configs from labs puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/257034 (https://phabricator.wikimedia.org/T120449) [23:45:21] (03PS1) 10Dzahn: zookeeper: move roles to module/role [puppet] - 10https://gerrit.wikimedia.org/r/257035 [23:49:35] (03PS2) 10Dzahn: zookeeper: move roles to module/role [puppet] - 10https://gerrit.wikimedia.org/r/257035 [23:56:49] (03PS1) 10Dzahn: elasticsearch: move role to module/role [puppet] - 10https://gerrit.wikimedia.org/r/257036