[00:13:52] (PS1) BryanDavis: ssh-key-ldap-lookup: handle missing users [puppet] - https://gerrit.wikimedia.org/r/481343 (https://phabricator.wikimedia.org/T204563)
[00:19:27] (CR) BryanDavis: "Tested via cherry-pick on striker-puppet01.striker.eqiad.wmflabs. Ssh as root, LDAP user (bd808), and local user with local key (deploy-se" [puppet] - https://gerrit.wikimedia.org/r/481343 (https://phabricator.wikimedia.org/T204563) (owner: BryanDavis)
[02:27:24] Operations, ops-eqiad, cloud-services-team: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T211807 (bd808)
[02:27:25] Operations, ops-eqiad, Patch-For-Review, cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (bd808)
[02:36:24] PROBLEM - puppet last run on restbase1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:02:27] RECOVERY - puppet last run on restbase1012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[03:32:51] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 929.42 seconds
[04:09:25] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 242.75 seconds
[05:58:09] (PS1) BryanDavis: toolforge: disable lighttpd service on webgrid nodes [puppet] - https://gerrit.wikimedia.org/r/481344 (https://phabricator.wikimedia.org/T105059)
[06:29:13] PROBLEM - puppet last run on ms-be1027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats]
[06:32:31] PROBLEM - puppet last run on ganeti1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ferm/conf.d/00_main]
[06:58:31] RECOVERY - puppet last run on ganeti1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:00:25] RECOVERY - puppet last run on ms-be1027 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[08:58:28] Operations, Wikimedia-Logstash: logstash stuck on its persistent queue - https://phabricator.wikimedia.org/T212640 (fgiunchedi)
[09:04:21] Operations, Wikimedia-Logstash: logstash stuck on its persistent queue - https://phabricator.wikimedia.org/T212640 (fgiunchedi) The specific bug _might_ be fixed in logstash 5.6: https://www.elastic.co/guide/en/logstash/5.6/logstash-5-6-5.html New persistent queue implementation in 5.6: https://www.elast...
[10:09:33] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 429 (expecting: 200)
[10:10:47] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy
[10:34:17] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[10:36:37] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[10:57:52] Operations, CirrusSearch, Discovery-Search (Current work): Add chi, psi and omega selector to the elasticsearch dashboards in grafana - https://phabricator.wikimedia.org/T211956 (Mathew.onipe) a: Mathew.onipe
[11:23:05] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[11:25:25] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[12:36:39] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3fullscreenorgId=1var-site=eqiadvar-cache_type=Allvar-status_type=5
[12:41:31] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3fullscreenorgId=1var-site=eqiadvar-cache_type=Allvar-status_type=5
[12:52:32] (PS1) ArielGlenn: make misc dumps logging use console or console and file [dumps] - https://gerrit.wikimedia.org/r/481483 (https://phabricator.wikimedia.org/T212349)
[13:00:06] Operations, Kubernetes: kubernetes1001 cronspam - https://phabricator.wikimedia.org/T212648 (GTirloni)
[13:00:59] (PS1) ArielGlenn: make adds-changes dump quieter [puppet] - https://gerrit.wikimedia.org/r/481484 (https://phabricator.wikimedia.org/T212349)
[15:13:34] So I'm going to do an interwiki update for T212650 (broken commonly used interwiki prefix)
[15:13:34] T212650: Please update the interwiki cache - https://phabricator.wikimedia.org/T212650
[15:16:31] (PS1) Brian Wolff: Updating interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/481488
[15:16:33] (CR) Brian Wolff: [C: +2] Updating interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/481488 (owner: Brian Wolff)
[15:17:44] (Merged) jenkins-bot: Updating interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/481488 (owner: Brian Wolff)
[15:18:50] well here goes
[15:19:44] !log bawolff@deploy1001 Synchronized wmf-config/interwiki.php: Updating interwiki cache (duration: 03m 34s)
[15:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:01] !log Updated interwiki cache https://biblio.wiki to https://wikilivres.org -> T212650
[15:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:04] T212650: Please update the interwiki cache - https://phabricator.wikimedia.org/T212650
[15:24:04] (CR) jenkins-bot: Updating interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/481488 (owner: Brian Wolff)
[15:59:39] RECOVERY - Host wtp1028 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[16:00:43] !log powercycling frdb1001 for troubleshooting
[16:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:01] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 7.435e+06 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1panelId=2fullscreen
[16:05:28] yeah 7.435e+06 ge 130 that's an artifact from yesterday's logstash problems
[16:23:05] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[16:25:50] Operations, Developer-Advocacy, Discourse: Discourse migration from wmflabs to production - https://phabricator.wikimedia.org/T184461 (Qgil) For what is worth, on my personal time and project, this month I had to migrate from one Discourse forum that broke (mea culpa) to a brand new one relying on a...
[16:26:37] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[16:36:47] Operations, Developer-Advocacy, Discourse, Epic: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853 (Qgil) > Plan agreed with Operations As an update, Mark Bergsma confirmed to me that their team have committed to bringing Discourse to production and have st...
[17:09:50] heads-up, i will be deploying RB shortly to get the fix for T212631 (which crashes master processes every once in a while)
[17:09:50] T212631: Kademlia rate limiter failing unexpectedly - https://phabricator.wikimedia.org/T212631
[17:14:13] mobrovac: ack
[17:15:54] !log mobrovac@deploy1001 Started deploy [restbase/deploy@ae7a537]: Fix rate-limiter crash - T212631 - deploy only on canary restbase1007
[17:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:58] T212631: Kademlia rate limiter failing unexpectedly - https://phabricator.wikimedia.org/T212631
[17:18:55] PROBLEM - Restbase root url on restbase1007 is CRITICAL: connect to address 10.64.0.223 and port 7231: Connection refused
[17:20:18] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@ae7a537]: Fix rate-limiter crash - T212631 - deploy only on canary restbase1007 (duration: 04m 24s)
[17:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:21] RECOVERY - Restbase root url on restbase1007 is OK: HTTP OK: HTTP/1.1 200 - 16184 bytes in 0.007 second response time
[17:25:59] godog: hm it seems something is wrong with restbase1016
[17:26:06] can't reach it at all
[17:26:55] mobrovac: indeed, that host is down ATM, https://phabricator.wikimedia.org/T212418
[17:27:12] ah ok
[17:27:37] godog: this is causing problems with rb startup as cassandra there cannot be reached, would you be ok with removing it from the config for the time being?
[17:28:46] mobrovac: for sure, LMK if I can help with a puppet merge
[17:28:53] great thnx
[17:29:03] will come up with one for you right away :P
[17:29:09] mobrovac: also please followup with a task for this if there isn't one already?
[17:29:21] this == rb starts to fail with one host down
[17:29:40] it does start, but it takes it much longer
[17:29:51] too long for scap
[17:29:55] so the deploy fails
[17:30:01] but yes, i'll create a task
[17:30:36] got it, thanks for the context!
[17:33:39] Operations, ops-eqiad, RESTBase, RESTBase-Cassandra, and 2 others: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 (mobrovac)
[17:37:27] godog: ah no, actually we can't do much, the driver picks the rb1016 instances up from the list given by other nodes, alas they are marked as DN
[17:38:06] the task seems to suggest the failure is one week old, so we'll likely need to re-bootstrap those
[17:38:07] uh
[17:39:50] godog: do you think we should remove rb1016-{a,b,c} from the ring?
[17:39:54] or keep it as-is?
[17:40:50] mobrovac: I'd say keep it as-is for now, once the hardware is back we can decide
[17:40:57] k
[17:41:11] this driver is too eager to connect to all nodes
[17:41:20] it takes now 40 minutes for one rb node to fully start-up
[17:41:32] /o\ sigh
[17:42:09] same driver eager to log when hosts are down too, leading me to file T212424
[17:42:09] T212424: restbase cassandra driver excessive logging when cassandra hosts are down - https://phabricator.wikimedia.org/T212424
[17:46:04] yup, that's the one!
[17:50:18] !log mobrovac@deploy1001 Started deploy [restbase/deploy@70c4752]: Fix rate-limiter crash - T212631
[17:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:50:22] T212631: Kademlia rate limiter failing unexpectedly - https://phabricator.wikimedia.org/T212631
[17:52:01] there will be icinga complaints about restbase root url not being up, that will all be known and will fix itself
[17:54:05] PROBLEM - Restbase root url on restbase1007 is CRITICAL: connect to address 10.64.0.223 and port 7231: Connection refused
[17:55:19] RECOVERY - Restbase root url on restbase1007 is OK: HTTP OK: HTTP/1.1 200 - 16184 bytes in 0.007 second response time
[17:55:27] mobrovac: ack
[17:56:00] also, that will happen while the hosts in question are depooled, so no harm no foul
[17:59:38] indeed
[18:01:13] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 109.1 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1panelId=2fullscreen
[18:03:17] PROBLEM - Restbase root url on restbase1018 is CRITICAL: connect to address 10.64.48.97 and port 7231: Connection refused
[18:03:28] !log mobrovac@deploy1001 deploy aborted: Fix rate-limiter crash - T212631 (duration: 13m 09s)
[18:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:03:31] T212631: Kademlia rate limiter failing unexpectedly - https://phabricator.wikimedia.org/T212631
[18:04:06] !log mobrovac@deploy1001 Started deploy [restbase/deploy@a67f38e]: Fix rate-limiter crash (with increased deploy delays) - T212631
[18:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:04:15] !log mobrovac@deploy1001 deploy aborted: Fix rate-limiter crash (with increased deploy delays) - T212631 (duration: 00m 09s)
[18:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:04:24] !log mobrovac@deploy1001 Started deploy [restbase/deploy@a67f38e]: Fix rate-limiter crash (with increased deploy delays) - T212631
[18:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:25] mobrovac: LMK when the deploy has finished just in case, I'm about to log off
[18:05:38] kk sure, will do godog
[18:05:41] thnx for sticking around!
[18:05:53] will probably be done in 45-50 mins or so
[18:05:56] sigh
[18:07:27] PROBLEM - Restbase root url on restbase1007 is CRITICAL: connect to address 10.64.0.223 and port 7231: Connection refused
[18:08:31] Operations, RESTBase-Cassandra, Services: restbase cassandra driver excessive logging when cassandra hosts are down - https://phabricator.wikimedia.org/T212424 (mobrovac) This is indeed a problem, causing issues also during deploys. I have had to [increase the delay](https://gerrit.wikimedia.org/r/#/...
[18:08:57] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was recei
[18:08:57] a.org/v1/page/mobile-sections/{title}{/revision} (Get mobile-sections for a test page on enwiki) timed out before a response was received
[18:09:19] RECOVERY - Restbase root url on restbase1018 is OK: HTTP OK: HTTP/1.1 200 - 16184 bytes in 0.008 second response time
[18:09:51] RECOVERY - Restbase root url on restbase1007 is OK: HTTP OK: HTTP/1.1 200 - 16184 bytes in 0.027 second response time
[18:10:03] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy
[18:11:27] mobrovac: uugghh ok, I might check back later
[18:12:30] :)
[18:15:44] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@a67f38e]: Fix rate-limiter crash (with increased deploy delays) - T212631 (duration: 11m 20s)
[18:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:15:47] T212631: Kademlia rate limiter failing unexpectedly - https://phabricator.wikimedia.org/T212631
[18:15:59] !log mobrovac@deploy1001 Started deploy [restbase/deploy@a67f38e]: Fix rate-limiter crash (with increased deploy delays), take #2
[18:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:26:19] PROBLEM - Restbase root url on restbase1018 is CRITICAL: connect to address 10.64.48.97 and port 7231: Connection refused
[18:31:37] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[18:32:21] RECOVERY - Restbase root url on restbase1018 is OK: HTTP OK: HTTP/1.1 200 - 16184 bytes in 0.011 second response time
[18:36:23] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[18:41:27] PROBLEM - Restbase root url on restbase1008 is CRITICAL: connect to address 10.64.32.178 and port 7231: Connection refused
[18:47:33] RECOVERY - Restbase root url on restbase1008 is OK: HTTP OK: HTTP/1.1 200 - 16184 bytes in 0.009 second response time
[18:50:34] (CR) BryanDavis: wmcs: Add a cli script for managing dynamicproxy entries (1 comment) [puppet] - https://gerrit.wikimedia.org/r/478377 (https://phabricator.wikimedia.org/T211367) (owner: BryanDavis)
[18:54:29] PROBLEM - Restbase root url on restbase1013 is CRITICAL: connect to address 10.64.32.80 and port 7231: Connection refused
[18:54:31] PROBLEM - Restbase root url on restbase1014 is CRITICAL: connect to address 10.64.48.133 and port 7231: Connection refused
[18:54:37] PROBLEM - Restbase root url on restbase1010 is CRITICAL: connect to address 10.64.0.112 and port 7231: Connection refused
[18:55:05] PROBLEM - Restbase root url on restbase1009 is CRITICAL: connect to address 10.64.48.110 and port 7231: Connection refused
[18:57:01] RECOVERY - Restbase root url on restbase1010 is OK: HTTP OK: HTTP/1.1 200 - 16184 bytes in 0.033 second response time
[19:01:11] RECOVERY - Restbase root url on restbase1009 is OK: HTTP OK: HTTP/1.1 200 - 16184 bytes in 0.094 second response time
[19:01:11] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was recei
[19:01:11] a.org/v1/page/mobile-sections/{title}{/revision} (Get mobile-sections for a test page on enwiki) timed out before a response was received
[19:01:26] mobrovac: those restbase whines, are they from your deploy?
[19:01:47] RECOVERY - Restbase root url on restbase1013 is OK: HTTP OK: HTTP/1.1 200 - 16184 bytes in 0.009 second response time
[19:01:47] RECOVERY - Restbase root url on restbase1014 is OK: HTTP OK: HTTP/1.1 200 - 16184 bytes in 0.017 second response time
[19:02:19] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy
[19:05:13] !log mobrovac@deploy1001 Started deploy [restbase/deploy@a67f38e]: Fix rate-limiter crash (with increased deploy delays), take #3
[19:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:43] PROBLEM - Restbase root url on restbase1012 is CRITICAL: connect to address 10.64.32.79 and port 7231: Connection refused
[19:09:07] PROBLEM - Restbase root url on restbase1017 is CRITICAL: connect to address 10.64.32.129 and port 7231: Connection refused
[19:09:43] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[19:09:57] PROBLEM - Restbase root url on restbase1015 is CRITICAL: connect to address 10.64.48.134 and port 7231: Connection refused
[19:10:55] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy
[19:15:11] RECOVERY - Restbase root url on restbase1017 is OK: HTTP OK: HTTP/1.1 200 - 16184 bytes in 0.007 second response time
[19:15:31] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was recei
[19:15:31] a.org/v1/page/mobile-sections/{title}{/revision} (Get mobile-sections for a test page on enwiki) timed out before a response was received
[19:15:39] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was recei
[19:15:39] a.org/v1/page/mobile-sections/{title}{/revision} (Get mobile-sections for a test page on enwiki) timed out before a response was received
[19:16:01] RECOVERY - Restbase root url on restbase1012 is OK: HTTP OK: HTTP/1.1 200 - 16184 bytes in 0.014 second response time
[19:16:01] RECOVERY - Restbase root url on restbase1015 is OK: HTTP OK: HTTP/1.1 200 - 16184 bytes in 0.012 second response time
[19:16:39] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy
[19:16:47] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy
[19:23:37] PROBLEM - Restbase root url on restbase1011 is CRITICAL: connect to address 10.64.0.113 and port 7231: Connection refused
[19:26:03] RECOVERY - Restbase root url on restbase1011 is OK: HTTP OK: HTTP/1.1 200 - 16184 bytes in 0.013 second response time
[19:29:06] "Using the Horizon interface:
[19:29:10] Set the hiera variable .."
[19:29:26] can someone tell me WHERE in horizon i find this ?
[19:29:54] thedj: under the "puppet" section on the left
[19:30:03] you have project puppet and prefix puppet
[19:30:22] or, you can also click on an instance and then you have the puppet config tab there
[19:30:48] godog: ok, rb deployed, all should be good now
[19:30:55] yeah, so i want to do it for a specific instance...
[19:31:01] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@a67f38e]: Fix rate-limiter crash (with increased deploy delays), take #3 (duration: 25m 49s)
[19:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:45] ah found it !
[19:31:57] i was looking in the dropdown menu, but you have to click the instance
[19:32:03] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received
[19:33:01] so it's json ?
[19:38:03] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[19:38:39] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received
[19:39:49] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[19:49:18] thedj: hiera settings are yaml, but often look like json in the horizon interface because json is a proper subset of yaml
[21:03:43] ugh.. i should probably learn puppet on something simpler than wikimedia...
[21:04:25] heh
[21:14:31] wikimedia puppet is very complex
[21:20:18] you don't say...
[21:20:28] i'm just trying to write and test some things.. but...
[21:21:49] i figured i'd set up a puppetmaster on an extra instance, but that's even worse. now i have MORE things i don't fully understand :)
[21:22:52] bd808: is there a puppet naming convention for stuff on wmflabs in terms of profiles/roles ?
[21:23:31] bd808: i'm running a bit into a naming clash, between maps services and maps project.
[21:23:54] and wmcs, toolforge, labs etc :D
[21:26:49] thedj: if you figure out a good resource you should share it
[21:29:03] hi Platonides
[21:29:17] hi Hauskatze
[21:31:48] thedj, I'd say don't use wmcs unless you're working on some labs infrastructure
[21:31:50] (or labs)
[21:32:14] if you're just writing roles that will happen to run inside labs/wmcs, just name them appropriately for the application and ignore the realm
[21:36:30] thedj: do you have a specific example to help me understand what you are having a hard time naming? Our "namespacing" is a bit of a mess right now with things transitioning away from "labs" but only partially.
[21:37:36] maps-tiles1 apache with mod_tile
[21:39:27] role::maps::* is already in use
[21:39:37] hmmm.. ok. I think the current role::osm::* things are only used for cloud vps pieces parts, but let me double check that
[21:40:33] yeah, thats the osm db and synching with osm upstream. amongst others
[21:41:12] but that's for production maps.wikimedia.org as well i think.. not something i'd like to mess with.
[21:43:01] role::osm::* seems to only be used for the stuff in cloud services. The only hits for it in site.pp are labsdb100[67]
[21:43:26] k. i'll namespace there then
[21:43:29] so you could probably make new role::osm::tile or something
[21:45:06] also.. what's the difference between the httpd and the apache packages... ?
[21:45:08] Operations, Developer-Advocacy, Discourse, Epic: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853 (Tgr)
[21:45:14] the perfect world solution for this is something totally different that gives each Cloud VPS project the opportunity to have an isolated puppet tree that somehow pulls in required parts from a shared repo
[21:46:24] thedj: I think the httpd module is newer and intended to replace the apache module eventually...
[21:47:43] found the start of ::httpd -- https://github.com/wikimedia/puppet/commit/3ecd45490551ffc2f2c157c0b9d5a48db90df5f8#diff-e51fc2c3bcbbcbba0473b202340f075c -- it seems to concur with my vague recollection
[21:49:07] k. i'll play some more on saturday.
[21:49:29] thx for the help people.
[21:49:51] * onimisionipe follows the conversation with curiosity :)
[21:50:12] i will say that expecting people to use puppet for volunteer projects right now is a bit ambitious :D
[21:51:15] its pretty much not done unless you are actively working to move something from Cloud-land to prod, and yeah its a ton of weird local conventions to learn
[21:51:43] there are a few exceptions, but they are mostly projects started by folks on the SRE team ;)
[21:55:11] yeah. i have some interest in learning puppet myself so that helps here. but if i run out of time i might have to drop it.
[21:55:51] i still need to test if the stretch instance actually creates working tiles
[21:56:51] and there are only a couple of free hours left until work starts full force again on the 2nd, so what isn't done before that date, won't be done :D
[22:06:50] Operations, Developer-Advocacy, Discourse, Epic: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853 (Qgil)
[22:17:24] Puppet, Cloud-Services, MediaWiki-Vagrant, Patch-For-Review: Make role::labs::mediawiki_vagrant work on Debian Jessie host systems - https://phabricator.wikimedia.org/T154340 (bd808)