[00:06:15] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [00:06:31] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [00:06:49] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [00:06:51] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [00:07:05] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api 
[00:07:15] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [00:07:19] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [00:07:27] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [00:09:25] 10Operations, 10MobileFrontend, 10Traffic: Sections on some mobile pages are not collabsable - https://phabricator.wikimedia.org/T233373 (10AntiCompositeNumber) [00:09:47] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [00:13:47] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target) is CRITICAL: Test 
article.creation.translation - normal source and target returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [00:26:26] ^ looking [00:59:14] tgr|away: o/ The above alerts are happening because action=query&list=mostviewed is currently returning zero results on enwiki. Possibly a bug in PageViewInfo? [00:59:26] (I see you were involved with that.) [00:59:57] It doesn't seem like a coincidence that the endpoint alerted right after 0:00 UTC today [01:00:34] It looks like the code tries to get pageviews for the last completed day (to ensure results are present) but maybe that's not happening? [01:00:58] Anyway, I'll roll back the recommendation-api deployment from earlier today until it's sorted out [01:02:21] I'm live in mwdebug1002. [01:03:48] !log mholloway-shell@deploy1001 Started deploy [recommendation-api/deploy@a29da76]: Rolling back deployment due to alerts beginning after 0:00 UTC [01:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:25] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:05:07] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:05:08] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.23/extensions/TimedMediaHandler/: T233360 Fix Safari 13.0 regression in video playback with audio (duration: 00m 58s) [01:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:12] T233360: Regression: popup videos (kaltura player) don't play on first attempt with Safari 13 - https://phabricator.wikimedia.org/T233360 [01:05:19] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:05:33] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:06:17] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:06:21] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:06:31] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:06:39] !log mholloway-shell@deploy1001 Finished deploy [recommendation-api/deploy@a29da76]: Rolling back deployment due to alerts beginning after 0:00 UTC (duration: 02m 51s) [01:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:51] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:06:57] OK, prod is clear. 
[01:07:21] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:07:49] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:16:33] (03CR) 10Jforrester: Variant configuration: Pre-calculate config and store it in config.git (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [01:23:41] (03PS3) 10Jforrester: MWConfigCacheGenerator: Provide getCachableMWConfig() which doesn't rely on wgConf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538078 [01:23:43] (03PS11) 10Jforrester: Variant configuration: Pre-calculate config for each wiki and store it in config.git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) [01:23:45] (03PS2) 10Jforrester: [WIP] Provide for YAML-based inherited configuration to eventually replace InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538129 [01:34:42] Filed T233381 about the unexpected API response that caused recommendation-api to alert [01:34:43] T233381: list=mostviewed returning zero results on enwiki - https://phabricator.wikimedia.org/T233381 [02:26:08] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10jayantanth) >>! In T218155#5507139, @Ankry wrote: >>>! In T218155#5506935, @jayantanth wrote: >> Could you please anyone import all the pages... 
[02:38:34] (03PS1) 10Ammarpad: New throttle rule for Wikimedia Chile editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538134 (https://phabricator.wikimedia.org/T233378) [03:19:35] PROBLEM - MediaWiki codfw exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=codfw https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops [03:21:09] RECOVERY - MediaWiki codfw exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops [04:19:13] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:31:25] (03PS1) 10Vgutierrez: ATS: Add blubberoid mapping rule [puppet] - 10https://gerrit.wikimedia.org/r/538138 (https://phabricator.wikimedia.org/T233369) [04:47:19] (03CR) 10Vgutierrez: [C: 03+2] ATS: Add blubberoid mapping rule [puppet] - 10https://gerrit.wikimedia.org/r/538138 (https://phabricator.wikimedia.org/T233369) (owner: 10Vgutierrez) [04:47:29] (03PS2) 10Vgutierrez: ATS: Add blubberoid mapping rule [puppet] - 10https://gerrit.wikimedia.org/r/538138 (https://phabricator.wikimedia.org/T233369) [04:51:10] (03PS1) 10ArielGlenn: delay start of xml dmps for one day while wikidata run finishes up [puppet] - 10https://gerrit.wikimedia.org/r/538139 [04:51:44] (03PS2) 10ArielGlenn: delay start of xml dmps for one day while wikidata run finishes up [puppet] - 10https://gerrit.wikimedia.org/r/538139 [04:51:47] (03CR) 10jerkins-bot: [V: 04-1] delay start of xml dmps for one day while wikidata run finishes up [puppet] - 
10https://gerrit.wikimedia.org/r/538139 (owner: 10ArielGlenn) [04:52:20] (03CR) 10jerkins-bot: [V: 04-1] delay start of xml dmps for one day while wikidata run finishes up [puppet] - 10https://gerrit.wikimedia.org/r/538139 (owner: 10ArielGlenn) [04:53:06] (03CR) 10DannyS712: delay start of xml dmps for one day while wikidata run finishes up (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538139 (owner: 10ArielGlenn) [04:53:10] 10Operations, 10Release-Engineering-Team-TODO, 10Traffic, 10Patch-For-Review: Blubberoid endpoint intermittently routing to MediaWiki backend - https://phabricator.wikimedia.org/T233369 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez before: `vgutierrez@cp1075:~$ curl -H 'Host: blubberoid.wikimedia.o... [04:54:30] (03PS1) 10Marostegui: index-conf.yaml: Remove temporary index timestamp on archive [puppet] - 10https://gerrit.wikimedia.org/r/538140 (https://phabricator.wikimedia.org/T219374) [04:55:19] (03PS2) 10Marostegui: index-conf.yaml: Remove temporary index timestamp on archive [puppet] - 10https://gerrit.wikimedia.org/r/538140 (https://phabricator.wikimedia.org/T219374) [04:55:27] (03PS3) 10ArielGlenn: delay start of xml dmps for one day while wikidata run finishes up [puppet] - 10https://gerrit.wikimedia.org/r/538139 (https://phabricator.wikimedia.org/T233276) [04:57:18] (03PS4) 10ArielGlenn: delay start of xml dmps for one day while wikidata run finishes up [puppet] - 10https://gerrit.wikimedia.org/r/538139 (https://phabricator.wikimedia.org/T233276) [04:58:10] (03CR) 10ArielGlenn: [C: 03+2] delay start of xml dmps for one day while wikidata run finishes up [puppet] - 10https://gerrit.wikimedia.org/r/538139 (https://phabricator.wikimedia.org/T233276) (owner: 10ArielGlenn) [04:58:24] (03PS3) 10Marostegui: index-conf.yaml: Remove temporary index timestamp on archive [puppet] - 10https://gerrit.wikimedia.org/r/538140 (https://phabricator.wikimedia.org/T219374) [05:00:40] (03CR) 10Marostegui: [C: 03+2] index-conf.yaml: Remove 
temporary index timestamp on archive [puppet] - 10https://gerrit.wikimedia.org/r/538140 (https://phabricator.wikimedia.org/T219374) (owner: 10Marostegui) [05:07:12] !log Remove temporary index on hiwikisource views /usr/local/sbin/maintain-views --databases $a --replace-all --clean [05:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:07:29] !log Remove temporary index on hiwikisource views T219374 [05:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:07:32] T219374: Prepare and check storage layer for hi.wikisource - https://phabricator.wikimedia.org/T219374 [05:17:02] (03CR) 10Vgutierrez: [C: 03+2] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/18443/" [puppet] - 10https://gerrit.wikimedia.org/r/537593 (https://phabricator.wikimedia.org/T232988) (owner: 10Vgutierrez) [05:17:12] (03PS2) 10Vgutierrez: install_server: Enable OCSP stapling and SSL monitoring [puppet] - 10https://gerrit.wikimedia.org/r/537593 (https://phabricator.wikimedia.org/T232988) [05:27:27] !log Analyze table enwiki.logging on db2102 - T223151 [05:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:31] T223151: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151 [05:29:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1089 from logpager and contributions after testing, repool back with normal weight on main traffic T223151', diff saved to https://phabricator.wikimedia.org/P9136 and previous config saved to /var/cache/conftool/dbconfig/20190920-052902-marostegui.json [05:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:26] (03PS1) 10Vgutierrez: install_server: Fix https check_command [puppet] - 10https://gerrit.wikimedia.org/r/538141 [05:32:44] :_( [05:35:29] (03CR) 10Vgutierrez: [C: 03+2] install_server: Fix https check_command [puppet] - 
10https://gerrit.wikimedia.org/r/538141 (owner: 10Vgutierrez) [06:01:11] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:43:46] (03PS2) 10Giuseppe Lavagetto: Add rakefile to run helm tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/537984 (https://phabricator.wikimedia.org/T233291) [06:45:12] (03PS3) 10Giuseppe Lavagetto: Add rakefile to run helm tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/537984 (https://phabricator.wikimedia.org/T233291) [06:46:20] (03CR) 10Jcrespo: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/18444/" [puppet] - 10https://gerrit.wikimedia.org/r/538042 (https://phabricator.wikimedia.org/T202367) (owner: 10Jcrespo) [06:49:23] 10Operations, 10User-DannyS712: 503 Backend fetch failed - https://phabricator.wikimedia.org/T233271 (10Zzuuzz) I've been getting this intermittently through checkuser since around Thursday afternoon (UTC). Request from [my ip] via cp1089 cp1089, Varnish XID 777562684 Error: 503, Backend fetch failed at Fri,... [06:52:12] (03CR) 10Muehlenhoff: "The RO servers are behind LVS, there's two LDAP servers behind each of eqiad and codfw." 
[puppet] - 10https://gerrit.wikimedia.org/r/536714 (owner: 10Dzahn) [07:00:34] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:11:06] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:13:31] 10Operations, 10ops-eqiad: Unable to power on ms-be1027 - https://phabricator.wikimedia.org/T233289 (10fgiunchedi) Reporting from IRC, it doesn't look like the hw is coming back at all, with this system being OOW and a new batch of ms-be hosts coming in, I'll start removing this host from service for its prema... [07:14:23] !log eqiad-prod: start ms-be1027 decom - T233289 [07:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:29] T233289: Unable to power on ms-be1027 - https://phabricator.wikimedia.org/T233289 [07:15:12] 10Operations, 10ops-eqiad, 10User-fgiunchedi: Unable to power on ms-be1027 - https://phabricator.wikimedia.org/T233289 (10fgiunchedi) a:05Cmjohnson→03fgiunchedi I'll take the task until the host is no longer active in swift [07:20:46] (03PS1) 10Muehlenhoff: Remove raid1-gpt.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/538152 (https://phabricator.wikimedia.org/T156955) [07:21:27] (03PS2) 10Filippo Giunchedi: prometheus: alert on overall puppet failures [puppet] - 10https://gerrit.wikimedia.org/r/536591 (https://phabricator.wikimedia.org/T232303) [07:22:36] (03PS2) 10Jcrespo: bacula: Make bacula db parameters configurable on hiera [puppet] - 10https://gerrit.wikimedia.org/r/538042 (https://phabricator.wikimedia.org/T202367) [07:23:08] (03CR) 10Jcrespo: [C: 03+2] bacula: Make bacula db parameters configurable on hiera [puppet] - 
10https://gerrit.wikimedia.org/r/538042 (https://phabricator.wikimedia.org/T202367) (owner: 10Jcrespo) [07:25:16] (03PS1) 10Muehlenhoff: Remove obsolete cassandrahosts-5ssd.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/538153 (https://phabricator.wikimedia.org/T156955) [07:27:40] (03CR) 10Jcrespo: "This was noop on heze and helium." [puppet] - 10https://gerrit.wikimedia.org/r/538042 (https://phabricator.wikimedia.org/T202367) (owner: 10Jcrespo) [07:28:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [07:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=False) [07:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:16] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2043.codfw.wmnet - https://phabricator.wikimedia.org/T230311 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2043.codfw.wmnet` - db2043.codfw.wmnet (**PASS**) - Downtimed... 
[07:29:59] (03CR) 10Jcrespo: "This is not urgent, but let me know if the database sub-sub-team is ok with this :-P" [puppet] - 10https://gerrit.wikimedia.org/r/537928 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [07:31:56] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: alert on overall puppet failures [puppet] - 10https://gerrit.wikimedia.org/r/536591 (https://phabricator.wikimedia.org/T232303) (owner: 10Filippo Giunchedi) [07:32:06] (03PS3) 10Filippo Giunchedi: prometheus: alert on overall puppet failures [puppet] - 10https://gerrit.wikimedia.org/r/536591 (https://phabricator.wikimedia.org/T232303) [07:33:20] (03PS1) 10DCausse: [cirrus] fix cirrusTest.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538156 (https://phabricator.wikimedia.org/T232691) [07:34:03] (03PS1) 10Marostegui: db2043: Remove it from puppet [puppet] - 10https://gerrit.wikimedia.org/r/538157 (https://phabricator.wikimedia.org/T230311) [07:34:50] (03CR) 10Marostegui: [C: 03+2] db2043: Remove it from puppet [puppet] - 10https://gerrit.wikimedia.org/r/538157 (https://phabricator.wikimedia.org/T230311) (owner: 10Marostegui) [07:35:30] (03PS4) 10Filippo Giunchedi: prometheus: alert on overall puppet failures [puppet] - 10https://gerrit.wikimedia.org/r/536591 (https://phabricator.wikimedia.org/T232303) [07:35:42] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] prometheus: alert on overall puppet failures [puppet] - 10https://gerrit.wikimedia.org/r/536591 (https://phabricator.wikimedia.org/T232303) (owner: 10Filippo Giunchedi) [07:36:33] (03CR) 10Hashar: "check experimental" [deployment-charts] - 10https://gerrit.wikimedia.org/r/537984 (https://phabricator.wikimedia.org/T233291) (owner: 10Giuseppe Lavagetto) [07:37:26] 10Operations, 10MediaWiki-extensions-OATHAuth, 10Patch-For-Review: Cannot enable 2FA on testwiki - https://phabricator.wikimedia.org/T233146 (10ItSpiderman) Switched to jsonSerializing objects before putting them into session. 
These changes are made on top of existing changes to avoid merge conflicts [07:38:18] (03PS1) 10Marostegui: db2043: Decom dns production entries [dns] - 10https://gerrit.wikimedia.org/r/538158 (https://phabricator.wikimedia.org/T230311) [07:39:06] (03CR) 10Marostegui: [C: 03+2] db2043: Decom dns production entries [dns] - 10https://gerrit.wikimedia.org/r/538158 (https://phabricator.wikimedia.org/T230311) (owner: 10Marostegui) [07:39:29] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2043.codfw.wmnet - https://phabricator.wikimedia.org/T230311 (10Marostegui) [07:39:31] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10MoritzMuehlenhoff) a:05RobH→03MoritzMuehlenhoff [07:39:46] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10MoritzMuehlenhoff) a:05RobH→03MoritzMuehlenhoff [07:40:14] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2043.codfw.wmnet - https://phabricator.wikimedia.org/T230311 (10Marostegui) a:05RobH→03Papaul I have tried the new cookbook to decom servers after having a chat with @Muehlenhoff - I think this is now ready for @Papaul to ful... 
[07:40:57] (03CR) 10Filippo Giunchedi: "wezen should be added next to centrallog1001, other than that LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538152 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [07:42:20] (03CR) 10Muehlenhoff: Remove raid1-gpt.cfg partman recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538152 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [07:42:33] (03CR) 10Filippo Giunchedi: [C: 03+1] kafka_shipper: try parsing syslog messages as raw json [puppet] - 10https://gerrit.wikimedia.org/r/538081 (owner: 10Herron) [07:43:12] PROBLEM - HHVM rendering on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:43:22] (03CR) 10DCausse: "sorry about this one, yet again I probably forgot to carefully check jenkins output when refactoring the cirrus config when migrating to e" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538156 (https://phabricator.wikimedia.org/T232691) (owner: 10DCausse) [07:44:35] (03CR) 10Hashar: Add rakefile to run helm tests (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/537984 (https://phabricator.wikimedia.org/T233291) (owner: 10Giuseppe Lavagetto) [07:44:36] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 80421 bytes in 0.120 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:45:20] !log jmm@cumin1001 START - Cookbook sre.hosts.decommission [07:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:06] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=False) [07:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:30] (03PS8) 10Alexandros Kosiaris: Adds git::lfs class and include it respectively [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) 
[07:46:34] !log jmm@cumin1001 START - Cookbook sre.hosts.decommission [07:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:56] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=False) [07:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:33] (03CR) 10Filippo Giunchedi: Remove raid1-gpt.cfg partman recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538152 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [07:47:40] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1001 for hosts: `neodymium.eqiad.wmnet` - neodymium.eqiad.wmnet (**PASS**) - Downtimed host on Ic... [07:47:56] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1001 for hosts: `sarin.codfw.wmnet` - sarin.codfw.wmnet (**PASS**) - Downtimed host on Icinga - Dow... 
[07:49:01] (03PS1) 10Muehlenhoff: Remove site.pp entries for neodymium/sarin [puppet] - 10https://gerrit.wikimedia.org/r/538159 (https://phabricator.wikimedia.org/T220503) [07:49:53] (03CR) 10Muehlenhoff: [C: 03+2] Remove site.pp entries for neodymium/sarin [puppet] - 10https://gerrit.wikimedia.org/r/538159 (https://phabricator.wikimedia.org/T220503) (owner: 10Muehlenhoff) [07:55:00] (03PS1) 10Muehlenhoff: Remove DNS entries for neodymium/sarin [dns] - 10https://gerrit.wikimedia.org/r/538160 (https://phabricator.wikimedia.org/T220503) [07:56:54] (03CR) 10Muehlenhoff: [C: 03+2] Remove DNS entries for neodymium/sarin [dns] - 10https://gerrit.wikimedia.org/r/538160 (https://phabricator.wikimedia.org/T220503) (owner: 10Muehlenhoff) [07:57:50] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [07:59:54] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03RobH [08:00:27] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03RobH [08:00:28] (03PS1) 10Filippo Giunchedi: facilities: add phase monitoring for single phase PDUs [puppet] - 10https://gerrit.wikimedia.org/r/538161 (https://phabricator.wikimedia.org/T229101) [08:01:41] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [08:02:33] (03CR) 10Filippo Giunchedi: "This will add phase monitoring for ulsfo PDUs now, is the breaker figure correct for ulsfo? i.e. 
using the same eqiad/codfw value (30A)" [puppet] - 10https://gerrit.wikimedia.org/r/538161 (https://phabricator.wikimedia.org/T229101) (owner: 10Filippo Giunchedi) [08:02:49] (03PS9) 10Alexandros Kosiaris: Adds git::lfs class and include it respectively [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [08:02:51] (03PS1) 10Alexandros Kosiaris: ATS: Switch blubberoid to discovery records [puppet] - 10https://gerrit.wikimedia.org/r/538162 [08:03:01] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [08:04:58] (03CR) 10jerkins-bot: [V: 04-1] Adds git::lfs class and include it respectively [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [08:06:50] interesting, the temporary rsyslog delivery failures were due to swift rebalance / ms-be1027 apparently [08:07:24] <_joe_> too many logs? [08:08:49] (03PS10) 10Alexandros Kosiaris: Adds git::lfs class and include it respectively [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [08:09:46] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/18446/" [puppet] - 10https://gerrit.wikimedia.org/r/538045 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [08:09:54] usually on swift machines it’s too high an IO load or something [08:10:06] might even be XFS holding too many kernel locks, who knows [08:10:53] is wikipedia SGI-powered? 
:D [08:11:01] (03CR) 10jerkins-bot: [V: 04-1] Adds git::lfs class and include it respectively [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [08:12:01] !log contint2001: upgrade zuul to 2.5.1-wmf10 # T203846 [08:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:05] T203846: Zuul cancels all changes when a change is manually merged - https://phabricator.wikimedia.org/T203846 [08:12:52] (03PS3) 10Jcrespo: backups: Apply no-srv-format recipe to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/537928 (https://phabricator.wikimedia.org/T229209) [08:12:54] (03PS3) 10Jcrespo: backups: Setup new director and storage daemons hw in parallel [puppet] - 10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) [08:13:00] (03CR) 10jerkins-bot: [V: 04-1] backups: Apply no-srv-format recipe to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/537928 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [08:13:03] (03CR) 10jerkins-bot: [V: 04-1] backups: Setup new director and storage daemons hw in parallel [puppet] - 10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) (owner: 10Jcrespo) [08:15:36] (03PS11) 10Alexandros Kosiaris: Adds git::lfs class and include it respectively [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [08:15:43] (03CR) 10jerkins-bot: [V: 04-1] Adds git::lfs class and include it respectively [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [08:16:17] (03PS4) 10Jcrespo: backups: Apply no-srv-format recipe to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/537928 (https://phabricator.wikimedia.org/T229209) [08:16:19] (03PS4) 10Jcrespo: backups: Setup new director and storage daemons hw in parallel [puppet] - 
10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) [08:16:25] (03CR) 10jerkins-bot: [V: 04-1] backups: Apply no-srv-format recipe to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/537928 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [08:16:27] (03CR) 10jerkins-bot: [V: 04-1] backups: Setup new director and storage daemons hw in parallel [puppet] - 10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) (owner: 10Jcrespo) [08:17:10] (03PS2) 10Jbond: admin: add phedenskog to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/537519 (https://phabricator.wikimedia.org/T232489) (owner: 10Herron) [08:17:18] (03CR) 10jerkins-bot: [V: 04-1] admin: add phedenskog to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/537519 (https://phabricator.wikimedia.org/T232489) (owner: 10Herron) [08:17:40] what's the issue with CI? [08:18:26] rebase manually on top of production [08:18:33] that usually fixes it [08:18:52] 10Operations, 10Mail, 10WMF-Communications: Updating DNS records - https://phabricator.wikimedia.org/T231387 (10MoritzMuehlenhoff) @herron : There's now two log entries in the /var/log/exim4/paniclog on mx1001.wikimedia.org related to DKIM, not sure whether we need to tweak something still or whether that's... 
[08:18:58] vgutierrez: I just did that [08:19:07] I rebased locally, there is no conflict [08:19:48] (03PS5) 10Jcrespo: backups: Apply no-srv-format recipe to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/537928 (https://phabricator.wikimedia.org/T229209) [08:19:50] (03PS5) 10Jcrespo: backups: Setup new director and storage daemons hw in parallel [puppet] - 10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) [08:19:55] (03CR) 10jerkins-bot: [V: 04-1] backups: Apply no-srv-format recipe to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/537928 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [08:19:58] (03CR) 10jerkins-bot: [V: 04-1] backups: Setup new director and storage daemons hw in parallel [puppet] - 10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) (owner: 10Jcrespo) [08:20:09] !log contint1001: upgrade zuul to 2.5.1-wmf10 # T203846 [08:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:13] T203846: Zuul cancels all changes when a change is manually merged - https://phabricator.wikimedia.org/T203846 [08:20:29] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/537928 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [08:20:32] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) (owner: 10Jcrespo) [08:20:38] (03CR) 10jerkins-bot: [V: 04-1] backups: Apply no-srv-format recipe to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/537928 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [08:20:42] (03CR) 10jerkins-bot: [V: 04-1] backups: Setup new director and storage daemons hw in parallel [puppet] - 10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) (owner: 10Jcrespo) [08:20:45] jynus: not sure why they fail [08:20:49] gotta dig in zuul logs sorry [08:21:09] just wanted to 
know it was not me doing something wrong! [08:21:11] (03CR) 10Alexandros Kosiaris: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [08:21:19] (03CR) 10jerkins-bot: [V: 04-1] Adds git::lfs class and include it respectively [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [08:21:29] i noticed that on some other changes [08:21:36] started recently - and before the upgrade I have done [08:21:38] bah [08:21:39] :-\ [08:21:54] It is ok, I can wait [08:22:07] GitCommandError: Cmd('git') failed due to: exit code(128) [08:22:07] cmdline: git fetch --tags -v origin [08:22:07] stderr: 'fatal: Could not read from remote repository. [08:23:14] (03PS2) 10Muehlenhoff: Remove raid1-gpt.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/538152 (https://phabricator.wikimedia.org/T156955) [08:23:22] (03CR) 10jerkins-bot: [V: 04-1] Remove raid1-gpt.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/538152 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [08:23:33] (03CR) 10Vgutierrez: "It's fine by me.. but this would introduce a split behaviour between varnish-be and ats-be right now" [puppet] - 10https://gerrit.wikimedia.org/r/538162 (owner: 10Alexandros Kosiaris) [08:23:36] !log CI in default since it is somehow no more able to fetch from Gerrit T233390 [08:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:39] T233390: zuul-merger fails to fetch from Gerrit - https://phabricator.wikimedia.org/T233390 [08:24:55] akosiaris: I added blubberoid like that to mimic the current varnish config.. 
it feels weird to have the varnish-be instances sending traffic only to codfw and ats-be to both DCs [08:26:13] 10Operations, 10DC-Ops, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Phase monitoring for new PDUs - https://phabricator.wikimedia.org/T229101 (10fgiunchedi) [08:26:42] vgutierrez: while that is true, funnily enough ATS config for cxserver deviates as well. it uses the discovery records over http [08:27:05] same goes for ores, wdqs and eventstreams [08:27:13] (03CR) 10Giuseppe Lavagetto: Add rakefile to run helm tests (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/537984 (https://phabricator.wikimedia.org/T233291) (owner: 10Giuseppe Lavagetto) [08:27:45] and for some of that, it might be concerning, e.g. cxserver where the service interacts daily with end-users [08:27:52] ok found it [08:28:03] (03PS4) 10Giuseppe Lavagetto: Add rakefile to run helm tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/537984 (https://phabricator.wikimedia.org/T233291) [08:28:16] !log Killed zuul-server process on contint2001 which was establishing connections to Gerrit and filling the pool of allowed ssh connections # T233390 [08:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:21] akosiaris: sadly the person who can answer that is on vacation right now [08:28:22] ok so that was related to the upgrade :-\\\\ [08:28:23] my bad [08:29:27] (03CR) 10Jcrespo: "Am I missing something?" [puppet] - 10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) (owner: 10Jcrespo) [08:29:34] but right now I'm inclined to mimic the varnish setup as much as possible [08:30:02] jynus: yeah, a CI issue, not your fault :) [08:30:18] vgutierrez: no, that question wasn't about CI [08:30:21] it was about the review [08:30:24] ahaha ok :) [08:30:41] I started the review on that comment, but that is not sent to IRC [08:30:57] jynus: not that I care much, but what is the 9 about? 
[08:31:05] in bacula9 I mean [08:31:09] akosiaris: something to differentiate it [08:31:18] 42 [08:31:21] 42 !!!! [08:31:24] ts ts ts :P [08:31:25] I believe bacula version 7 vs 9 [08:31:31] (03CR) 10Hashar: "recheck CI rejected all changes :\ T233390" [puppet] - 10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) (owner: 10Jcrespo) [08:31:32] I think 9 made more sense [08:31:37] (03CR) 10Hashar: "recheck CI rejected all changes :\ T233390" [puppet] - 10https://gerrit.wikimedia.org/r/538152 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [08:31:39] (03CR) 10Hashar: "recheck CI rejected all changes :\ T233390" [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [08:31:42] (03CR) 10Hashar: "recheck CI rejected all changes :\ T233390" [puppet] - 10https://gerrit.wikimedia.org/r/537928 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [08:31:48] e.g. in the future it could be bacula Xs [08:31:49] (03CR) 10Hashar: "recheck CI rejected all changes :\ T233390" [puppet] - 10https://gerrit.wikimedia.org/r/537519 (https://phabricator.wikimedia.org/T232489) (owner: 10Herron) [08:31:59] but akosiaris stressed it is temporary [08:32:01] (03CR) 10Marostegui: [C: 04-1] backups: Setup new director and storage daemons hw in parallel (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) (owner: 10Jcrespo) [08:32:41] marostegui: I copied and pasted, so existing error [08:32:43] :-D [08:32:47] haha [08:32:50] (03CR) 10Filippo Giunchedi: [C: 03+2] Remove obsolete cassandrahosts-5ssd.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/538153 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [08:32:59] and I am going to bet it was you :-P [08:33:04] I will fix both [08:33:07] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Aside from manuel's comment, LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) (owner: 10Jcrespo) [08:33:09] good catch [08:33:15] thanks jynus ! :) [08:34:38] (03PS6) 10Jcrespo: backups: Apply no-srv-format recipe to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/537928 (https://phabricator.wikimedia.org/T229209) [08:34:40] (03PS6) 10Jcrespo: backups: Setup new director and storage daemons hw in parallel [puppet] - 10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) [08:35:10] (03CR) 10Marostegui: [C: 03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) (owner: 10Jcrespo) [08:35:31] thanks, hashar, I think it works now btw [08:35:49] jynus: yeah fixed, and that was a side effect of the upgrade I have done of zuul :-\\\ [08:39:42] akosiaris: what about "Plan 9 from outer space"? [08:40:06] jynus: deep space 9 [08:40:26] but you are on the right track now [08:40:32] I am more a tng guy [08:42:24] (03CR) 10Jbond: [C: 03+2] admin: add phedenskog to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/537519 (https://phabricator.wikimedia.org/T232489) (owner: 10Herron) [08:44:02] <_joe_> are you really getting into the "rate the star trek series" debate? [08:44:04] 10Operations, 10Performance-Team, 10SRE-Access-Requests, 10Patch-For-Review: Request access to 'deployment' user group for phedenskog - https://phabricator.wikimedia.org/T232489 (10jbond) @Peter this has been deployed now, please allow upto 30 minutes for the change to fully propagate, if you are still see... 
[08:44:25] <_joe_> it has a good chance of being less civilized than vim vs emacs [08:52:03] (03CR) 10Giuseppe Lavagetto: "check experimental" [deployment-charts] - 10https://gerrit.wikimedia.org/r/537984 (https://phabricator.wikimedia.org/T233291) (owner: 10Giuseppe Lavagetto) [08:52:35] !log creating new database on m1 "bacula9" T229209 [08:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:39] T229209: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 [08:56:52] (03PS7) 10Jcrespo: backups: Setup new director and storage daemons hw in parallel [puppet] - 10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) [08:57:15] (03CR) 10Marostegui: "It looks good to me, but grants for the dump user must be deployed on the databases that will be backed-up" [puppet] - 10https://gerrit.wikimedia.org/r/538045 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [08:58:39] (03PS1) 10Vgutierrez: ATS: Adjust OCSP freshness check for acme-chief managed responses [puppet] - 10https://gerrit.wikimedia.org/r/538165 (https://phabricator.wikimedia.org/T232988) [09:01:41] (03CR) 10Vgutierrez: [C: 03+2] ATS: Adjust OCSP freshness check for acme-chief managed responses [puppet] - 10https://gerrit.wikimedia.org/r/538165 (https://phabricator.wikimedia.org/T232988) (owner: 10Vgutierrez) [09:01:52] (03PS2) 10Vgutierrez: ATS: Adjust OCSP freshness check for acme-chief managed responses [puppet] - 10https://gerrit.wikimedia.org/r/538165 (https://phabricator.wikimedia.org/T232988) [09:02:55] 10Operations, 10ops-codfw, 10DBA: db2127 memory issues - https://phabricator.wikimedia.org/T233184 (10Marostegui) No more errors, if it continues clean on Monday I will close this task [09:04:13] (03PS8) 10Jcrespo: backups: Setup new director and storage daemons hw in parallel [puppet] - 10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) [09:06:24] (03CR) 
10Jcrespo: [C: 03+2] backups: Setup new director and storage daemons hw in parallel [puppet] - 10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) (owner: 10Jcrespo) [09:06:27] 10Operations, 10Traffic: Tune ATS SSL session cache - https://phabricator.wikimedia.org/T231849 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [09:06:45] (03PS9) 10Jcrespo: backups: Setup new director and storage daemons hw in parallel [puppet] - 10https://gerrit.wikimedia.org/r/537929 (https://phabricator.wikimedia.org/T196478) [09:07:15] 10Operations, 10Release Pipeline, 10local-charts, 10serviceops, and 3 others: Set up CI for the deployment-charts repository - https://phabricator.wikimedia.org/T233291 (10Joe) [09:07:27] 10Operations, 10Traffic: acme-chief failing in puppet with "Cannot open input file" - https://phabricator.wikimedia.org/T229091 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [09:07:56] (03PS12) 10Alexandros Kosiaris: Adds git::lfs class and include it respectively [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [09:11:36] (03CR) 10Alexandros Kosiaris: "@hashar thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [09:12:59] (03PS5) 10Giuseppe Lavagetto: Add rakefile to run helm tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/537984 (https://phabricator.wikimedia.org/T233291) [09:13:15] (03PS13) 10Alexandros Kosiaris: Adds git::lfs class and include it respectively [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [09:13:19] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Adds git::lfs class and include it respectively [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [09:13:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add rakefile to run helm tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/537984 (https://phabricator.wikimedia.org/T233291) (owner: 10Giuseppe Lavagetto) [09:13:48] (03Merged) 10jenkins-bot: Add rakefile to run helm tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/537984 (https://phabricator.wikimedia.org/T233291) (owner: 10Giuseppe Lavagetto) [09:14:43] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/537061 (owner: 10Alexandros Kosiaris) [09:18:18] (03CR) 10Alexandros Kosiaris: "\o/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/537061 (owner: 10Alexandros Kosiaris) [09:27:51] (03CR) 10Elukey: "Added the dump user to both databases with the following grants (hope it is enough):" [puppet] - 10https://gerrit.wikimedia.org/r/538045 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [09:30:31] !log jbond@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [09:30:31] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.ipmi-password-reset (exit_code=99) [09:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:41] !log jbond@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [09:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:47] !log jbond@cumin1001 END (ERROR) - Cookbook sre.hosts.ipmi-password-reset (exit_code=97) [09:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:15] !log jbond@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [09:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:22] !log jbond@cumin1001 Updating IPMI password on 1 hosts - jbond@cumin1001 [09:31:22] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [09:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:18] (03PS1) 10Filippo Giunchedi: netops: add ripe atlas tools class [puppet] - 10https://gerrit.wikimedia.org/r/538169 (https://phabricator.wikimedia.org/T232711) [09:34:20] (03PS1) 10Filippo Giunchedi: role: add ripe atlas cli to cluster::management role [puppet] - 10https://gerrit.wikimedia.org/r/538170 (https://phabricator.wikimedia.org/T232711) [09:36:22] (03CR) 10jerkins-bot: [V: 04-1] netops: add ripe atlas tools class [puppet] - 10https://gerrit.wikimedia.org/r/538169 (https://phabricator.wikimedia.org/T232711) (owner: 10Filippo Giunchedi) [09:40:36] (03PS7) 10Jcrespo: backups: Apply no-srv-format recipe to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/537928 (https://phabricator.wikimedia.org/T229209) [09:40:38] (03PS1) 10Jcrespo: backups: Install bacula-sd instead of the sql variant on buster [puppet] - 10https://gerrit.wikimedia.org/r/538175 (https://phabricator.wikimedia.org/T229209) [09:41:32] (03CR) 10jerkins-bot: [V: 04-1] backups: Install bacula-sd instead of the sql variant on buster [puppet] - 
10https://gerrit.wikimedia.org/r/538175 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [09:43:28] (03PS8) 10Jcrespo: backups: Apply no-srv-format recipe to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/537928 (https://phabricator.wikimedia.org/T229209) [09:43:30] (03PS2) 10Jcrespo: backups: Install bacula-sd instead of the sql variant on buster [puppet] - 10https://gerrit.wikimedia.org/r/538175 (https://phabricator.wikimedia.org/T229209) [09:44:39] (03CR) 10jerkins-bot: [V: 04-1] backups: Install bacula-sd instead of the sql variant on buster [puppet] - 10https://gerrit.wikimedia.org/r/538175 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [09:53:01] (03PS3) 10Jcrespo: backups: Install bacula-sd instead of the sql variant on buster [puppet] - 10https://gerrit.wikimedia.org/r/538175 (https://phabricator.wikimedia.org/T229209) [09:53:51] (03CR) 10jerkins-bot: [V: 04-1] backups: Install bacula-sd instead of the sql variant on buster [puppet] - 10https://gerrit.wikimedia.org/r/538175 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [09:55:13] (03PS4) 10Jcrespo: backups: Install bacula-sd instead of the sql variant on buster [puppet] - 10https://gerrit.wikimedia.org/r/538175 (https://phabricator.wikimedia.org/T229209) [09:56:14] (03CR) 10jerkins-bot: [V: 04-1] backups: Install bacula-sd instead of the sql variant on buster [puppet] - 10https://gerrit.wikimedia.org/r/538175 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [10:01:38] (03PS9) 10Jcrespo: backups: Apply no-srv-format recipe to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/537928 (https://phabricator.wikimedia.org/T229209) [10:01:40] (03PS5) 10Jcrespo: backups: Install bacula-sd instead of the sql variant on buster [puppet] - 10https://gerrit.wikimedia.org/r/538175 (https://phabricator.wikimedia.org/T229209) [10:02:43] (03CR) 10jerkins-bot: [V: 04-1] backups: Install bacula-sd instead of the sql variant on buster [puppet] - 
10https://gerrit.wikimedia.org/r/538175 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [10:06:18] (03PS6) 10Jcrespo: backups: Install bacula-sd instead of the sql variant on buster [puppet] - 10https://gerrit.wikimedia.org/r/538175 (https://phabricator.wikimedia.org/T229209) [10:07:31] (03CR) 10Jcrespo: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/538045 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [10:10:33] (03PS1) 10Jbond: IPMI: add support for channel 2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/538230 [10:10:56] (03PS2) 10Jbond: IPMI: add support for channel 2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/538230 (https://phabricator.wikimedia.org/T147074) [10:12:40] (03CR) 10Alexandros Kosiaris: [C: 03+1] backups: Install bacula-sd instead of the sql variant on buster [puppet] - 10https://gerrit.wikimedia.org/r/538175 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [10:15:22] (03PS2) 10Filippo Giunchedi: netops: add ripe atlas tools class [puppet] - 10https://gerrit.wikimedia.org/r/538169 (https://phabricator.wikimedia.org/T232711) [10:15:23] (03PS2) 10Filippo Giunchedi: role: add ripe atlas cli to cluster::management role [puppet] - 10https://gerrit.wikimedia.org/r/538170 (https://phabricator.wikimedia.org/T232711) [10:15:27] (03PS1) 10Vgutierrez: ATS: Provide HTTPS check [puppet] - 10https://gerrit.wikimedia.org/r/538231 (https://phabricator.wikimedia.org/T231433) [10:15:29] (03CR) 10jerkins-bot: [V: 04-1] IPMI: add support for channel 2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/538230 (https://phabricator.wikimedia.org/T147074) (owner: 10Jbond) [10:15:42] (03CR) 10jerkins-bot: [V: 04-1] ATS: Provide HTTPS check [puppet] - 10https://gerrit.wikimedia.org/r/538231 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [10:16:28] uh... same issue as before hashar? 
[10:17:09] !log jbond@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [10:17:09] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.ipmi-password-reset (exit_code=99) [10:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:17] !log jbond@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [10:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:24] !log jbond@cumin1001 Updating IPMI password on 1 hosts - jbond@cumin1001 [10:17:24] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [10:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:36] !log jbond@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [10:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:38] hmm [10:17:41] !log jbond@cumin1001 Updating IPMI password on 1 hosts - jbond@cumin1001 [10:17:42] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [10:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:55] (03CR) 10jerkins-bot: [V: 04-1] netops: add ripe atlas tools class [puppet] - 10https://gerrit.wikimedia.org/r/538169 (https://phabricator.wikimedia.org/T232711) (owner: 10Filippo Giunchedi) [10:17:55] vgutierrez: nope :) [10:17:56] (03CR) 10Muehlenhoff: [C: 03+1] "Thanks! HostActions can probably even be shared/used by all sre.hosts.* cookbooks? (Just a thought, should not block merging this in any w" [cookbooks] - 10https://gerrit.wikimedia.org/r/538047 (owner: 10Volans) [10:17:59] well should not [10:18:38] checking... 
[10:18:59] (03PS1) 10Hashar: zuul: mask service when it is disabled [puppet] - 10https://gerrit.wikimedia.org/r/538233 (https://phabricator.wikimedia.org/T233391) [10:19:14] right... L8 issue on my side [10:19:24] (03CR) 10Jcrespo: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/18452/" [puppet] - 10https://gerrit.wikimedia.org/r/538175 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [10:20:38] vgutierrez: ERROR: content conflict in modules/profile/manifests/trafficserver/monitoring.pp [10:20:45] yup yup [10:20:47] it's fixed [10:20:48] (03PS2) 10Vgutierrez: ATS: Provide HTTPS check [puppet] - 10https://gerrit.wikimedia.org/r/538231 (https://phabricator.wikimedia.org/T231433) [10:20:49] vgutierrez: so yeah seems https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/538231/ needs a rebase [10:20:51] ah cool [10:20:56] sorry about the noise :) [10:21:30] (03CR) 10jerkins-bot: [V: 04-1] ATS: Provide HTTPS check [puppet] - 10https://gerrit.wikimedia.org/r/538231 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [10:21:46] and those are the tests screaming [10:21:57] (03CR) 10Hashar: "and while at it I have added some tests that also cover the monitoring part ;]" [puppet] - 10https://gerrit.wikimedia.org/r/538233 (https://phabricator.wikimedia.org/T233391) (owner: 10Hashar) [10:22:59] (03CR) 10Hashar: "I forgot to follow on Dzahn comment which eventually caused a brief ci outage this morning bah" [puppet] - 10https://gerrit.wikimedia.org/r/502203 (owner: 10Hashar) [10:23:23] (03CR) 10Jcrespo: [C: 03+2] backups: Install bacula-sd instead of the sql variant on buster [puppet] - 10https://gerrit.wikimedia.org/r/538175 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [10:23:35] (03PS7) 10Jcrespo: backups: Install bacula-sd instead of the sql variant on buster [puppet] - 10https://gerrit.wikimedia.org/r/538175 (https://phabricator.wikimedia.org/T229209) [10:23:42] (03PS3) 10Vgutierrez: ATS: Provide HTTPS check 
[puppet] - 10https://gerrit.wikimedia.org/r/538231 (https://phabricator.wikimedia.org/T231433) [10:25:55] (03PS3) 10Filippo Giunchedi: netops: add ripe atlas tools class [puppet] - 10https://gerrit.wikimedia.org/r/538169 (https://phabricator.wikimedia.org/T232711) [10:25:56] (03PS3) 10Filippo Giunchedi: role: add ripe atlas cli to cluster::management role [puppet] - 10https://gerrit.wikimedia.org/r/538170 (https://phabricator.wikimedia.org/T232711) [10:30:21] (03PS4) 10Filippo Giunchedi: netops: add ripe atlas tools class [puppet] - 10https://gerrit.wikimedia.org/r/538169 (https://phabricator.wikimedia.org/T232711) [10:30:23] (03PS4) 10Filippo Giunchedi: role: add ripe atlas cli to cluster::management role [puppet] - 10https://gerrit.wikimedia.org/r/538170 (https://phabricator.wikimedia.org/T232711) [10:31:01] (03CR) 10Muehlenhoff: profile::icinga: update to use lookup instead of hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538018 (owner: 10Jbond) [10:31:22] (03PS2) 10Muehlenhoff: Remove obsolete cassandrahosts-5ssd.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/538153 (https://phabricator.wikimedia.org/T156955) [10:31:24] (03PS5) 10Filippo Giunchedi: netops: add ripe atlas tools class [puppet] - 10https://gerrit.wikimedia.org/r/538169 (https://phabricator.wikimedia.org/T232711) [10:31:26] (03PS5) 10Filippo Giunchedi: role: add ripe atlas cli to cluster::management role [puppet] - 10https://gerrit.wikimedia.org/r/538170 (https://phabricator.wikimedia.org/T232711) [10:31:28] (03CR) 10Vgutierrez: "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1002/18455/" [puppet] - 10https://gerrit.wikimedia.org/r/538231 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [10:32:09] (03PS1) 10Jcrespo: backups: Fix wrong dependency on buster bacula storage servers [puppet] - 10https://gerrit.wikimedia.org/r/538236 (https://phabricator.wikimedia.org/T229209) [10:33:34] (03CR) 10Jcrespo: [C: 03+2] backups: Fix wrong 
dependency on buster bacula storage servers [puppet] - 10https://gerrit.wikimedia.org/r/538236 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [10:33:45] (03PS2) 10Jcrespo: backups: Fix wrong dependency on buster bacula storage servers [puppet] - 10https://gerrit.wikimedia.org/r/538236 (https://phabricator.wikimedia.org/T229209) [10:35:53] ERROR: puppet-merge on rhodium.eqiad.wmnet failed [10:36:08] (03PS6) 10Filippo Giunchedi: netops: add ripe atlas tools class [puppet] - 10https://gerrit.wikimedia.org/r/538169 (https://phabricator.wikimedia.org/T232711) [10:36:10] (03PS6) 10Filippo Giunchedi: role: add ripe atlas cli to cluster::management role [puppet] - 10https://gerrit.wikimedia.org/r/538170 (https://phabricator.wikimedia.org/T232711) [10:36:35] pause puppet deploys, puppet repos may be inconsistent [10:36:53] unable to update local ref [10:37:01] I will retry [10:37:20] or maybe, godog, deploy yours [10:37:29] and see if it pulls to the last change [10:37:58] jynus: ah mhh I just pushed to labs/private, is that it ? [10:38:14] I don't know, did you get errors? [10:38:31] probably not, as it is a separate repo [10:38:40] I haven't run puppet-merge, I don't have changes pending merge [10:38:49] I see one [10:39:02] "Filippo Giunchedi: passwords: add ripe atlas api key (7dc4524)" [10:39:04] (03PS3) 10Jbond: IPMI: add support for channel 2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/538230 (https://phabricator.wikimedia.org/T147074) [10:39:10] right, that's the labs/private commit [10:39:14] ok [10:39:18] merged now [10:39:45] anyone with a deployment to operations/puppet [10:39:48] ? 
[10:40:55] I think it may have merged to puppetm1001, but not to the others [10:42:00] (03PS10) 10Jcrespo: backups: Apply no-srv-format recipe to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/537928 (https://phabricator.wikimedia.org/T229209) [10:42:05] let me deploy something myself then [10:42:41] (03CR) 10Jbond: "looks good" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538169 (https://phabricator.wikimedia.org/T232711) (owner: 10Filippo Giunchedi) [10:43:13] (03CR) 10jerkins-bot: [V: 04-1] IPMI: add support for channel 2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/538230 (https://phabricator.wikimedia.org/T147074) (owner: 10Jbond) [10:43:26] (03CR) 10Jcrespo: [C: 03+2] backups: Apply no-srv-format recipe to backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/537928 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [10:43:52] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for andrew-wmde - https://phabricator.wikimedia.org/T233202 (10Tobi_WMDE_SW) >>! In T233202#5506870, @herron wrote: > > @Andrew-WMDE could you please coordinate obtaining a comment of approval here from your supervisor? > @herron @greg As... [10:44:15] ok, it is working, so probably only a conflict and no real issue [10:44:24] (03CR) 10Filippo Giunchedi: [C: 03+2] Remove raid1-gpt.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/538152 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [10:44:40] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10jhsoby) >>! In T218155#5508368, @jayantanth wrote: > > Thanks @Ankry for your comments. Presently there are 28,307 Page: and 40+ associate In... 
[10:46:25] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10jcrespo) Almost there: > ` > > > Config error: Cannot open config file "/etc/bacula/bacula-sd.conf": Permission denied > > ` [10:48:00] (03PS4) 10Jbond: IPMI: add support for channel 2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/538230 (https://phabricator.wikimedia.org/T147074) [10:53:19] (03PS7) 10Filippo Giunchedi: netops: add ripe atlas tools class [puppet] - 10https://gerrit.wikimedia.org/r/538169 (https://phabricator.wikimedia.org/T232711) [10:53:21] (03PS7) 10Filippo Giunchedi: role: add ripe atlas cli to cluster::management role [puppet] - 10https://gerrit.wikimedia.org/r/538170 (https://phabricator.wikimedia.org/T232711) [10:55:07] (03CR) 10Jbond: [C: 03+2] prometheus - bastion: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537618 (https://phabricator.wikimedia.org/T233203) (owner: 10Jbond) [10:55:17] (03PS12) 10Jbond: prometheus - bastion: use per-resource default attributes not resource defaults [puppet] - 10https://gerrit.wikimedia.org/r/537618 (https://phabricator.wikimedia.org/T233203) [10:55:55] (03PS2) 10Jbond: profile::icinga: update to use lookup instead of hiera [puppet] - 10https://gerrit.wikimedia.org/r/538018 [10:55:57] (03PS1) 10Mobrovac: RESTRouter: Clean up the config && add the wikifeeds URI [deployment-charts] - 10https://gerrit.wikimedia.org/r/538238 (https://phabricator.wikimedia.org/T223953) [10:56:14] (03PS1) 10Jcrespo: backups: Change owner of /etc/bacula/bacula-sd.conf to bacula [puppet] - 10https://gerrit.wikimedia.org/r/538239 (https://phabricator.wikimedia.org/T229209) [10:56:32] (03CR) 10jerkins-bot: [V: 04-1] netops: add ripe atlas tools class [puppet] - 10https://gerrit.wikimedia.org/r/538169 (https://phabricator.wikimedia.org/T232711) (owner: 10Filippo Giunchedi) [10:57:25] (03CR) 10jerkins-bot: 
[V: 04-1] backups: Change owner of /etc/bacula/bacula-sd.conf to bacula [puppet] - 10https://gerrit.wikimedia.org/r/538239 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [11:00:21] (03PS2) 10Mobrovac: RESTRouter: Clean up the config && add the wikifeeds URI [deployment-charts] - 10https://gerrit.wikimedia.org/r/538238 (https://phabricator.wikimedia.org/T223953) [11:00:28] (03PS8) 10Filippo Giunchedi: netops: add ripe atlas tools class [puppet] - 10https://gerrit.wikimedia.org/r/538169 (https://phabricator.wikimedia.org/T232711) [11:00:30] (03PS8) 10Filippo Giunchedi: role: add ripe atlas cli to cluster::management role [puppet] - 10https://gerrit.wikimedia.org/r/538170 (https://phabricator.wikimedia.org/T232711) [11:00:32] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one comment inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538035 (owner: 10Jbond) [11:01:46] (03CR) 10Alexandros Kosiaris: [C: 03+2] RESTRouter: Clean up the config && add the wikifeeds URI [deployment-charts] - 10https://gerrit.wikimedia.org/r/538238 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [11:02:02] (03CR) 10Jbond: profile::icinga: update to use lookup instead of hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538018 (owner: 10Jbond) [11:02:18] (03PS2) 10Jcrespo: backups: Change owner of /etc/bacula/bacula-sd.conf to bacula [puppet] - 10https://gerrit.wikimedia.org/r/538239 (https://phabricator.wikimedia.org/T229209) [11:05:03] PROBLEM - Check the last execution of git_pull_charts on deploy1001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:05:07] (03CR) 10Jcrespo: "Not sure about this." 
[puppet] - 10https://gerrit.wikimedia.org/r/538239 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [11:05:27] PROBLEM - Check systemd state on deploy1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:51] RECOVERY - Check systemd state on deploy1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:18] (03PS3) 10Jbond: profile::icinga: update to use lookup instead of hiera [puppet] - 10https://gerrit.wikimedia.org/r/538018 [11:12:20] (03PS2) 10Jbond: profile::icinga: Add apereo_cas authenticated vhost [puppet] - 10https://gerrit.wikimedia.org/r/538019 [11:12:35] (03PS1) 10Alexandros Kosiaris: restrouter: Upgrade to version v1.1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/538241 [11:13:21] (03CR) 10Alexandros Kosiaris: [C: 03+2] restrouter: Upgrade to version v1.1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/538241 (owner: 10Alexandros Kosiaris) [11:13:35] (03Merged) 10jenkins-bot: restrouter: Upgrade to version v1.1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/538241 (owner: 10Alexandros Kosiaris) [11:14:09] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Legacy (Watching / External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10dr0ptp4kt) Hello all. We're going to turn this into a client-side feature and divest of the server side re... [11:14:30] (03CR) 10Jcrespo: [C: 04-1] "It would require more work anyway:" [puppet] - 10https://gerrit.wikimedia.org/r/538239 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [11:15:36] !log @ helmfile [STAGING] Ran 'sync' command on namespace 'restrouter' for release 'staging' . 
[11:15:37] RECOVERY - Check the last execution of git_pull_charts on deploy1001 is OK: OK: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:01] !log @ helmfile [STAGING] Ran 'sync' command on namespace 'restrouter' for release 'staging' . [11:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:32] (03CR) 10Muehlenhoff: profile::icinga: update to use lookup instead of hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538018 (owner: 10Jbond) [11:18:05] (03CR) 10Elukey: "Updated to:" [puppet] - 10https://gerrit.wikimedia.org/r/538045 (https://phabricator.wikimedia.org/T231208) (owner: 10Elukey) [11:21:13] (03PS1) 10Alexandros Kosiaris: Release restrouter chart version 0.0.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/538242 (https://phabricator.wikimedia.org/T223953) [11:21:49] (03CR) 10Alexandros Kosiaris: [C: 03+2] Release restrouter chart version 0.0.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/538242 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [11:22:29] !log @ helmfile [STAGING] Ran 'sync' command on namespace 'restrouter' for release 'staging' . [11:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:36] !log @ helmfile [STAGING] Ran 'sync' command on namespace 'restrouter' for release 'staging' . [11:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:44] !log @ helmfile [STAGING] Ran 'sync' command on namespace 'restrouter' for release 'staging' . 
[11:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:46] (03PS4) 10Jbond: profile::icinga: update to use lookup instead of hiera [puppet] - 10https://gerrit.wikimedia.org/r/538018 [11:27:10] (03CR) 10Jbond: [C: 03+1] "lgtm: 2 minor nits" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/538169 (https://phabricator.wikimedia.org/T232711) (owner: 10Filippo Giunchedi) [11:27:25] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/538170 (https://phabricator.wikimedia.org/T232711) (owner: 10Filippo Giunchedi) [11:27:59] (03CR) 10Jbond: profile::icinga: update to use lookup instead of hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538018 (owner: 10Jbond) [11:29:37] !log @ helmfile [STAGING] Ran 'sync' command on namespace 'restrouter' for release 'staging' . [11:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:00] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, I also ran a PCC and it confirms NOPs for icinga[12]001:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538018 (owner: 10Jbond) [11:32:11] (03PS4) 10Jbond: apereo_cas: add icinga service [puppet] - 10https://gerrit.wikimedia.org/r/538035 [11:32:41] (03CR) 10Jbond: apereo_cas: add icinga service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538035 (owner: 10Jbond) [11:33:27] (03CR) 10Jbond: [C: 03+2] profile::icinga: update to use lookup instead of hiera [puppet] - 10https://gerrit.wikimedia.org/r/538018 (owner: 10Jbond) [11:35:40] (03PS5) 10Jbond: apereo_cas: add icinga service [puppet] - 10https://gerrit.wikimedia.org/r/538035 [11:36:08] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/538035 (owner: 10Jbond) [11:37:46] (03CR) 10Jbond: [C: 03+2] apereo_cas: add icinga service [puppet] - 10https://gerrit.wikimedia.org/r/538035 (owner: 10Jbond) [11:46:20] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10Ankry) >>! In T218155#5508932, @jhsoby wrote: >>>! In T218155#5508368, @jayantanth wrote: >> >> Thanks @Ankry for your comments. Presently th... [11:53:53] fyi there's a survey for PolyGerrit's ui if anyone wants to give feedback https://docs.google.com/forms/d/e/1FAIpQLSd7PvOdMBWOfQU_O9dm_CUSIqqBXYVNkzQAD1f6tnS3pQ9Szw/viewform (https://groups.google.com/forum/#!topic/repo-discuss/oDvOhpimkBQ) [11:57:23] (03PS3) 10Jbond: profile::icinga: Add apereo_cas authenticated vhost [puppet] - 10https://gerrit.wikimedia.org/r/538019 [11:58:02] (03CR) 10jerkins-bot: [V: 04-1] profile::icinga: Add apereo_cas authenticated vhost [puppet] - 10https://gerrit.wikimedia.org/r/538019 (owner: 10Jbond) [12:01:58] (03PS1) 10Jbond: idp: correct services parameter [puppet] - 10https://gerrit.wikimedia.org/r/538248 [12:02:25] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10jhsoby) >>! In T218155#5509071, @Ankry wrote: > Later changing the internal field names (to Hindi, as they are in the initial configuration) m... 
[12:02:42] (03CR) 10Jbond: [C: 03+2] idp: correct services parameter [puppet] - 10https://gerrit.wikimedia.org/r/538248 (owner: 10Jbond) [12:03:25] PROBLEM - dhclient process on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [12:03:35] PROBLEM - Check whether ferm is active by checking the default input chain on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:03:49] PROBLEM - configured eth on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [12:04:13] PROBLEM - Check systemd state on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:04:33] PROBLEM - Disk space on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops [12:04:37] PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [12:04:43] PROBLEM - DPKG on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:04:47] PROBLEM - Check size of conntrack table on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [12:05:59] PROBLEM - puppet last run on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:06:21] usual Spark job OOM fun probably, having a look [12:07:21] yeah, some R job OOMed [12:07:31] PROBLEM - Check the last execution of search-drop-query-clicks on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:08:07] RECOVERY - dhclient process on stat1007 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [12:08:16] bounced nagios-nrpe-server, should recover in a bit [12:08:19] RECOVERY - Check whether ferm is active by checking the default input chain on stat1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:08:33] RECOVERY - configured eth on stat1007 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [12:08:40] (03PS4) 10Jbond: profile::icinga: Add apereo_cas authenticated vhost [puppet] - 10https://gerrit.wikimedia.org/r/538019 [12:08:57] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:09:17] RECOVERY - Disk space on stat1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops [12:09:19] RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [12:09:25] RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:09:29] RECOVERY - Check size of conntrack table on stat1007 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack 
[12:11:35] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 24 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:14:16] (03PS1) 10Jbond: idp: correct service id [puppet] - 10https://gerrit.wikimedia.org/r/538250 [12:16:16] (03CR) 10Jbond: [C: 03+2] idp: correct service id [puppet] - 10https://gerrit.wikimedia.org/r/538250 (owner: 10Jbond) [12:18:05] RECOVERY - Check the last execution of search-drop-query-clicks on stat1007 is OK: OK: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:43:52] (03CR) 10Alexandros Kosiaris: "the -u and -g passed to bacula-fd are for bacula-sd to drop privileges, which it does right after reading the config file. Systemd however" [puppet] - 10https://gerrit.wikimedia.org/r/538239 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [12:51:07] 10Operations, 10Discovery-Search, 10Elasticsearch: Unassigned shards in eqiad - https://phabricator.wikimedia.org/T233403 (10Mathew.onipe) [12:58:58] (03PS3) 10Muehlenhoff: Remove raid1-gpt.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/538152 (https://phabricator.wikimedia.org/T156955) [13:01:00] (03CR) 10Muehlenhoff: [C: 03+2] Remove raid1-gpt.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/538152 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [13:01:20] (03PS1) 10Hashar: contint: pin Docker on stretch [puppet] - 10https://gerrit.wikimedia.org/r/538259 (https://phabricator.wikimedia.org/T226233) [13:02:39] (03CR) 10Filippo Giunchedi: "> Patch Set 6:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538169 (https://phabricator.wikimedia.org/T232711) (owner: 10Filippo Giunchedi) [13:05:09] (03PS1) 10Muehlenhoff: cas: Disable gauth for now [puppet] - 10https://gerrit.wikimedia.org/r/538260 [13:05:14] (03PS2) 10Herron: kafka_shipper: try parsing syslog messages 
as raw json [puppet] - 10https://gerrit.wikimedia.org/r/538081 [13:05:40] (03PS2) 10Hashar: contint: pin Docker on stretch [puppet] - 10https://gerrit.wikimedia.org/r/538259 (https://phabricator.wikimedia.org/T226233) [13:06:03] 10Operations, 10Performance-Team, 10SRE-Access-Requests: Request access to 'deployment' user group for phedenskog - https://phabricator.wikimedia.org/T232489 (10herron) 05Open→03Resolved a:03herron [13:14:35] (03PS1) 10Muehlenhoff: cas: Add service ID for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/538263 [13:28:42] (03PS3) 10Muehlenhoff: contint: pin Docker on stretch [puppet] - 10https://gerrit.wikimedia.org/r/538259 (https://phabricator.wikimedia.org/T226233) (owner: 10Hashar) [13:32:56] (03CR) 10Muehlenhoff: [C: 03+2] contint: pin Docker on stretch [puppet] - 10https://gerrit.wikimedia.org/r/538259 (https://phabricator.wikimedia.org/T226233) (owner: 10Hashar) [13:46:01] (03PS1) 10Jhedden: openstack: configure eqiad1 keystone for apache wsgi [puppet] - 10https://gerrit.wikimedia.org/r/538267 (https://phabricator.wikimedia.org/T223907) [13:46:26] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224591 (10hashar) From a quick chat with @MoritzMuehlenhoff : python2... 
[13:46:39] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10hashar) [13:47:09] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10hashar) [13:48:43] (03CR) 10Jcrespo: [C: 04-1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/538239 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [13:49:13] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10hashar) [13:50:49] (03CR) 10Jhedden: "PCC results: https://puppet-compiler.wmflabs.org/compiler1002/18461/" [puppet] - 10https://gerrit.wikimedia.org/r/538267 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [13:57:11] (03PS1) 10Muehlenhoff: Remove obsolete partman recipe lvm.cfg [puppet] - 10https://gerrit.wikimedia.org/r/538268 (https://phabricator.wikimedia.org/T156955) [13:58:01] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] envoy: sync base container with conventions used in production [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/537466 (owner: 10Giuseppe Lavagetto) [13:58:46] (03CR) 10Andrew Bogott: [C: 03+1] "looks good! Might want to save this for next week in case it behaves badly over the weekend." 
[puppet] - 10https://gerrit.wikimedia.org/r/538267 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [14:02:21] (03CR) 10Jhedden: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/538267 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [14:15:01] (03CR) 10Jhedden: [C: 04-1] "On hold until monday Sep 23 2019" [puppet] - 10https://gerrit.wikimedia.org/r/538267 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [14:19:39] (03PS1) 10Jhedden: openstack: add newton keystone apache config [puppet] - 10https://gerrit.wikimedia.org/r/538269 (https://phabricator.wikimedia.org/T223907) [14:23:31] (03PS1) 10Muehlenhoff: Remove grants for labpuppet [puppet] - 10https://gerrit.wikimedia.org/r/538270 (https://phabricator.wikimedia.org/T233281) [14:24:01] (03PS9) 10Filippo Giunchedi: netops: add ripe atlas tools class [puppet] - 10https://gerrit.wikimedia.org/r/538169 (https://phabricator.wikimedia.org/T232711) [14:24:03] (03PS9) 10Filippo Giunchedi: role: add ripe atlas cli to cluster::management role [puppet] - 10https://gerrit.wikimedia.org/r/538170 (https://phabricator.wikimedia.org/T232711) [14:24:11] 10Operations, 10DBA, 10Patch-For-Review: Check/remove unused databases following labpuppetmaster deprecation - https://phabricator.wikimedia.org/T233281 (10MoritzMuehlenhoff) >>! In T233281#5505493, @jcrespo wrote: > Reminder: Let's check grants too. Created https://gerrit.wikimedia.org/r/538270 for this [14:32:16] (03CR) 10Jcrespo: "The grant change is ok, but I think it needs further puppet changes on both private repo and public one related to password handling." 
[puppet] - 10https://gerrit.wikimedia.org/r/538270 (https://phabricator.wikimedia.org/T233281) (owner: 10Muehlenhoff) [14:37:48] (03PS10) 10Filippo Giunchedi: netops: add ripe atlas tools class [puppet] - 10https://gerrit.wikimedia.org/r/538169 (https://phabricator.wikimedia.org/T232711) [14:37:50] (03PS10) 10Filippo Giunchedi: role: add ripe atlas cli to cluster::management role [puppet] - 10https://gerrit.wikimedia.org/r/538170 (https://phabricator.wikimedia.org/T232711) [14:40:17] (03PS3) 10Herron: kafka_shipper: try parsing syslog messages as raw json [puppet] - 10https://gerrit.wikimedia.org/r/538081 [14:40:37] (03PS11) 10Filippo Giunchedi: netops: add ripe atlas tools class [puppet] - 10https://gerrit.wikimedia.org/r/538169 (https://phabricator.wikimedia.org/T232711) [14:40:39] (03PS11) 10Filippo Giunchedi: role: add ripe atlas cli to cluster::management role [puppet] - 10https://gerrit.wikimedia.org/r/538170 (https://phabricator.wikimedia.org/T232711) [14:42:58] (03PS1) 10Muehlenhoff: Convert openldap/corp to profile (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/538273 [14:43:09] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/18464/" [puppet] - 10https://gerrit.wikimedia.org/r/538169 (https://phabricator.wikimedia.org/T232711) (owner: 10Filippo Giunchedi) [14:43:25] (03CR) 10Herron: [C: 03+2] kafka_shipper: try parsing syslog messages as raw json [puppet] - 10https://gerrit.wikimedia.org/r/538081 (owner: 10Herron) [14:44:38] (03PS4) 10Herron: logstash: add gerrit json log file to kafka output table [puppet] - 10https://gerrit.wikimedia.org/r/538082 [14:49:20] (03PS1) 10Muehlenhoff: Add component for OpenJDK 8 forwardport for Buster [puppet] - 10https://gerrit.wikimedia.org/r/538274 [14:51:07] (03CR) 10Herron: [C: 03+2] logstash: add gerrit json log file to kafka output table [puppet] - 10https://gerrit.wikimedia.org/r/538082 (owner: 10Herron) [14:52:05] (03PS2) 10Muehlenhoff: Add component for OpenJDK 8 forwardport for 
Buster [puppet] - 10https://gerrit.wikimedia.org/r/538274 [14:52:55] (03PS2) 10Giuseppe Lavagetto: mediawiki-cache-warmup: Update CI tools for the past two years [puppet] - 10https://gerrit.wikimedia.org/r/537133 (owner: 10Jforrester) [14:53:39] (03Abandoned) 10Herron: admin: add srishakatux to researchers [puppet] - 10https://gerrit.wikimedia.org/r/537516 (https://phabricator.wikimedia.org/T232664) (owner: 10Herron) [14:55:33] (03PS1) 10Muehlenhoff: Add pbuilder hook to build packages with/against forward-ported JDK 8 [puppet] - 10https://gerrit.wikimedia.org/r/538277 [14:55:59] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki-cache-warmup: Update CI tools for the past two years [puppet] - 10https://gerrit.wikimedia.org/r/537133 (owner: 10Jforrester) [15:00:14] moritzm: oh, so we have jdk8 for buster (in the wikimedia apt repo)? [15:00:24] (03CR) 10Jhedden: [C: 03+2] openstack: add newton keystone apache config [puppet] - 10https://gerrit.wikimedia.org/r/538269 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [15:00:38] (03PS2) 10Jhedden: openstack: add newton keystone apache config [puppet] - 10https://gerrit.wikimedia.org/r/538269 (https://phabricator.wikimedia.org/T223907) [15:03:10] !log remove AS-PATH prepending in ams [15:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:26] (03PS2) 10Muehlenhoff: Convert openldap/corp to profile [puppet] - 10https://gerrit.wikimedia.org/r/538273 [15:04:13] paladox: almost, I have a bootstrap build, but the patches I posted are needed to turn this into a proper build [15:04:31] Ah, awesome! 
[15:07:22] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/18465/" [puppet] - 10https://gerrit.wikimedia.org/r/538273 (owner: 10Muehlenhoff) [15:17:25] !log mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=bgwiki --logwiki=metawiki 'Newrdkter' 'NRdk' (T233313) [15:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:28] T233313: Please unblock stuck global rename - https://phabricator.wikimedia.org/T233313 [15:31:14] (03CR) 10Jbond: [C: 03+1] cas: Disable gauth for now [puppet] - 10https://gerrit.wikimedia.org/r/538260 (owner: 10Muehlenhoff) [15:31:26] (03CR) 10Filippo Giunchedi: [C: 03+2] netops: add ripe atlas tools class [puppet] - 10https://gerrit.wikimedia.org/r/538169 (https://phabricator.wikimedia.org/T232711) (owner: 10Filippo Giunchedi) [15:31:42] (03PS12) 10Filippo Giunchedi: netops: add ripe atlas tools class [puppet] - 10https://gerrit.wikimedia.org/r/538169 (https://phabricator.wikimedia.org/T232711) [15:31:49] (03CR) 10Jbond: [C: 03+1] cas: Add service ID for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/538263 (owner: 10Muehlenhoff) [15:35:05] PROBLEM - Disk space on analytics1045 is CRITICAL: DISK CRITICAL - free space: / 1341 MB (2% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1045&var-datasource=eqiad+prometheus/ops [15:38:22] (03PS1) 10Jhedden: openstack: add haproxy prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/538289 [15:38:26] (03CR) 10Filippo Giunchedi: [C: 03+2] role: add ripe atlas cli to cluster::management role [puppet] - 10https://gerrit.wikimedia.org/r/538170 (https://phabricator.wikimedia.org/T232711) (owner: 10Filippo Giunchedi) [15:38:37] (03PS12) 10Filippo Giunchedi: role: add ripe atlas cli to cluster::management role [puppet] - 10https://gerrit.wikimedia.org/r/538170 (https://phabricator.wikimedia.org/T232711) 
[15:42:53] RECOVERY - Disk space on analytics1045 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1045&var-datasource=eqiad+prometheus/ops [15:43:38] (03CR) 10Jhedden: [C: 03+2] openstack: add haproxy prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/538289 (owner: 10Jhedden) [15:43:50] (03PS2) 10Jhedden: openstack: add haproxy prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/538289 [15:46:01] 10Operations, 10netops, 10observability, 10Patch-For-Review: Deploy ripe-atlas-tools for ad-hoc network tests - https://phabricator.wikimedia.org/T232711 (10fgiunchedi) [15:49:44] (03PS1) 10Odder: Add localized logos for the Zulu Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538291 (https://phabricator.wikimedia.org/T233424) [15:57:57] (03PS1) 10Odder: Add localized logos for the Zulu Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538293 [15:58:18] (03CR) 10RobH: [C: 03+1] facilities: add phase monitoring for single phase PDUs [puppet] - 10https://gerrit.wikimedia.org/r/538161 (https://phabricator.wikimedia.org/T229101) (owner: 10Filippo Giunchedi) [16:02:13] (03PS2) 10Odder: Add localized logos for the Zulu Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538293 (https://phabricator.wikimedia.org/T233424) [16:06:25] (03PS1) 10Pmiazga: Enable alternate mobile link for ar,zh,hi,it,nl and ko wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538295 (https://phabricator.wikimedia.org/T206497) [16:08:18] (03CR) 10Elukey: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/538274 (owner: 10Muehlenhoff) [16:11:01] (03PS2) 10Pmiazga: Enable alternate mobile link for ar,zh,hi wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/538295 (https://phabricator.wikimedia.org/T206497) [16:11:44] !log update esams firewall filters - T233268 [16:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:01] (03PS3) 10Pmiazga: Enable alternate mobile link for ar,zh,hi wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538295 (https://phabricator.wikimedia.org/T206497) [16:14:05] (03PS1) 10Pmiazga: Enable alternate mobile link for it, nl, ko wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538296 (https://phabricator.wikimedia.org/T206497) [16:14:13] !log update eqiad firewall filters - T233268 [16:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:47] (03PS4) 10Krinkle: Uninstall dev deps from production vendor/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538102 [16:27:02] (03CR) 10Krinkle: [C: 03+2] "One less thing to worry about over the weekend." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538102 (owner: 10Krinkle) [16:27:24] * Krinkle testing on mwdebug1002 [16:28:38] (03Merged) 10jenkins-bot: Uninstall dev deps from production vendor/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538102 (owner: 10Krinkle) [16:28:55] (03CR) 10jenkins-bot: Uninstall dev deps from production vendor/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538102 (owner: 10Krinkle) [16:35:46] !log krinkle@deploy1001 Synchronized vendor/: ead70240892e9 (duration: 00m 59s) [16:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:45] PROBLEM - Disk space on analytics1045 is CRITICAL: DISK CRITICAL - free space: / 1201 MB (2% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1045&var-datasource=eqiad+prometheus/ops [16:50:02] (03PS2) 10Hashar: zuul: mask service when it is disabled [puppet] - 
10https://gerrit.wikimedia.org/r/538233 (https://phabricator.wikimedia.org/T233391) [16:50:56] (03CR) 10Hashar: "PS2: added Hosts: header for the puppet compiler." [puppet] - 10https://gerrit.wikimedia.org/r/538233 (https://phabricator.wikimedia.org/T233391) (owner: 10Hashar) [16:51:03] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/538233 (https://phabricator.wikimedia.org/T233391) (owner: 10Hashar) [16:51:22] ottomata: 6 [16:51:24] ottomata: ^ [16:52:13] (03CR) 10jerkins-bot: [V: 04-1] zuul: mask service when it is disabled [puppet] - 10https://gerrit.wikimedia.org/r/538233 (https://phabricator.wikimedia.org/T233391) (owner: 10Hashar) [16:53:12] (03PS3) 10Hashar: zuul: mask service when it is disabled [puppet] - 10https://gerrit.wikimedia.org/r/538233 (https://phabricator.wikimedia.org/T233391) [16:54:32] (03CR) 10Hashar: [C: 03+1] "(PS3 fixes an issue in the commit message)" [puppet] - 10https://gerrit.wikimedia.org/r/538233 (https://phabricator.wikimedia.org/T233391) (owner: 10Hashar) [16:55:38] (03PS1) 10Elukey: Add hadoop overrides for analytics1045 [puppet] - 10https://gerrit.wikimedia.org/r/538303 [16:56:03] this should fix 1045 :) [16:59:23] (03CR) 10Elukey: [C: 03+2] Add hadoop overrides for analytics1045 [puppet] - 10https://gerrit.wikimedia.org/r/538303 (owner: 10Elukey) [17:02:47] RECOVERY - Disk space on analytics1045 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1045&var-datasource=eqiad+prometheus/ops [17:06:18] all right an1045 should be good [17:24:34] (03CR) 10Dzahn: "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/536714 (owner: 10Dzahn) [17:40:32] (03PS2) 10Dzahn: add gerrit-new.wikimedia.org for migration [dns] - 10https://gerrit.wikimedia.org/r/538127 (https://phabricator.wikimedia.org/T222391) [17:42:39] onimisionipe: do you know about "ElasticSearch 
unassigned shard check - 9243" [17:43:37] I think he has a task for that [17:43:39] https://phabricator.wikimedia.org/T233403 [17:47:07] thanks paladox [17:47:21] would be nice if people would link tickets to icinga [17:47:25] to save us all time [17:47:48] (03CR) 10Ottomata: "This change is ready for review." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535924 (https://phabricator.wikimedia.org/T229674) (owner: 10Mforns) [17:49:44] ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - enwiki_content_1546970425[3](2019-09-15T13:39:54.892Z), dewiki_content_1566659363[6](2019-09-15T13:39:44.466Z) daniel_zahn https://phabricator.wikimedia.org/T233403 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:51:19] mutante: I'm so sorry [17:51:25] I thought I linked it [17:51:40] I created the task to link it then I forgot [17:51:59] onimisionipe: thank you:) [17:52:58] onimisionipe: i found an unrelated one about the memory on elastic1029. i know we already both reopened that ticket. but we are not supposed to use the same ticket anymore [17:53:27] so now i am not sure because it would require going through troubleshooting steps [17:53:43] and it already broke like 4 times [17:54:17] mutante: plans are already ongoing to replace that server [17:55:42] onimisionipe: sounds good. i would expect it needs a procurement ticket [17:56:09] indeed this seems a case where collectively we spent too much time on a single box [17:57:23] Yea yea.. [17:57:47] I can create the task before I sleep [17:57:56] Following dcops rules [17:58:38] onimisionipe: thank you. you know better how urgent it is. i just noticed the previous tickets we were on [17:58:59] up to you if that needs to be done right now [17:59:40] Alright. Thanks! 
[18:01:04] (03PS4) 1020after4: Set up scap target for deploying the phatality plugin into kibana [puppet] - 10https://gerrit.wikimedia.org/r/537240 (https://phabricator.wikimedia.org/T230752) [18:01:06] (03PS1) 1020after4: Add python-pygerrit2 to scap master [puppet] - 10https://gerrit.wikimedia.org/r/538311 [18:03:19] (03CR) 10Dzahn: [C: 03+2] "this is the same we did last time we migrated gerrit to a new server" [dns] - 10https://gerrit.wikimedia.org/r/538127 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [18:03:58] (03PS2) 1020after4: Add python-pygerrit2 to scap master [puppet] - 10https://gerrit.wikimedia.org/r/538311 [18:04:34] (03PS3) 10Dzahn: add gerrit-new.wikimedia.org for migration [dns] - 10https://gerrit.wikimedia.org/r/538127 (https://phabricator.wikimedia.org/T222391) [18:05:24] (03CR) 10Dzahn: "seems to exist only in buster? https://packages.debian.org/search?keywords=python-pygerrit2&searchon=names&suite=all&section=all" [puppet] - 10https://gerrit.wikimedia.org/r/538311 (owner: 1020after4) [18:07:36] (03PS1) 10Mforns: Rsync analytics mediawiki history dumps to dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/538312 (https://phabricator.wikimedia.org/T208612) [18:08:36] (03CR) 1020after4: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/538311 (owner: 1020after4) [18:08:43] (03CR) 10Thcipriani: "> seems to exist only in buster? https://packages.debian.org/search?keywords=python-pygerrit2&searchon=names&suite=all&section=all" [puppet] - 10https://gerrit.wikimedia.org/r/538311 (owner: 1020after4) [18:12:43] (03CR) 10Dzahn: [C: 03+1] "wow, "madison" :) ok, confirmed. 
ack" [puppet] - 10https://gerrit.wikimedia.org/r/538311 (owner: 1020after4) [18:12:55] (03PS3) 10Dzahn: Add python-pygerrit2 to scap master [puppet] - 10https://gerrit.wikimedia.org/r/538311 (owner: 1020after4) [18:15:09] (03CR) 10Dzahn: [C: 03+2] Add python-pygerrit2 to scap master [puppet] - 10https://gerrit.wikimedia.org/r/538311 (owner: 1020after4) [18:17:50] (03CR) 10Dzahn: "Notice: /Stage[main]/Packages::Python_pygerrit2/Package[python-pygerrit2]/ensure: created" [puppet] - 10https://gerrit.wikimedia.org/r/538311 (owner: 1020after4) [18:20:01] (03CR) 10Dzahn: "We did the same thing last time we migrated gerrit to a new server. Temp had "gerrit-new". Otherwise the new server uses gerrit.wikimedia." [puppet] - 10https://gerrit.wikimedia.org/r/538128 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [18:22:05] (03PS2) 10Jgreen: add frav1002, frmon1001, and frmon2001 to nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/533595 [18:27:18] (03CR) 10Jgreen: [C: 03+2] add frav1002, frmon1001, and frmon2001 to nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/533595 (owner: 10Jgreen) [18:27:59] (03PS4) 10Dzahn: zuul: mask service when it is disabled [puppet] - 10https://gerrit.wikimedia.org/r/538233 (https://phabricator.wikimedia.org/T233391) (owner: 10Hashar) [18:31:24] (03CR) 10jerkins-bot: [V: 04-1] zuul: mask service when it is disabled [puppet] - 10https://gerrit.wikimedia.org/r/538233 (https://phabricator.wikimedia.org/T233391) (owner: 10Hashar) [18:32:11] (03CR) 10Dzahn: "uhm.. jenkins-bot - aborted - test and changed its mind after rebase?" 
[puppet] - 10https://gerrit.wikimedia.org/r/538233 (https://phabricator.wikimedia.org/T233391) (owner: 10Hashar) [18:32:45] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/538233 (https://phabricator.wikimedia.org/T233391) (owner: 10Hashar) [18:41:08] (03CR) 10Dzahn: [C: 03+2] zuul: mask service when it is disabled [puppet] - 10https://gerrit.wikimedia.org/r/538233 (https://phabricator.wikimedia.org/T233391) (owner: 10Hashar) [18:44:48] (03CR) 10Dzahn: [C: 03+2] "noop in prod https://puppet-compiler.wmflabs.org/compiler1001/18470/" [puppet] - 10https://gerrit.wikimedia.org/r/538128 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [18:45:45] (03CR) 10Dzahn: "on contint2001: Exec[mask_zuul.service]/returns: executed successfully on contint1001: no change" [puppet] - 10https://gerrit.wikimedia.org/r/538233 (https://phabricator.wikimedia.org/T233391) (owner: 10Hashar) [18:46:39] (03CR) 10Urbanecm: [C: 03+1] New throttle rule for Wikimedia Chile editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538134 (https://phabricator.wikimedia.org/T233378) (owner: 10Ammarpad) [18:47:38] (03PS2) 10Dzahn: gerrit: set gerrit-new as name/IP for new gerrit server [puppet] - 10https://gerrit.wikimedia.org/r/538128 (https://phabricator.wikimedia.org/T222391) [18:51:50] (03PS2) 10Jforrester: [cirrus] fix cirrusTest.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538156 (https://phabricator.wikimedia.org/T232691) (owner: 10DCausse) [18:51:52] (03PS1) 10Jforrester: CirrusSettings-labs: Move as much as possible to InitialiseSettings-Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538318 [18:51:54] (03PS1) 10Jforrester: CirrusSettings-common: Move as much as possible to VariantSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538319 [18:53:02] (03CR) 10jerkins-bot: [V: 04-1] CirrusSettings-labs: Move as much as possible to InitialiseSettings-Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538318 
(owner: 10Jforrester) [18:53:19] (03CR) 10jerkins-bot: [V: 04-1] CirrusSettings-common: Move as much as possible to VariantSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538319 (owner: 10Jforrester) [19:03:00] (03PS1) 10Herron: admin: add raja_wmde to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/538322 (https://phabricator.wikimedia.org/T231984) [19:04:20] (03PS1) 10Jforrester: CirrusSettings-common: Move timeouts to VariantSettings (full array, not just over-writes) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538323 [19:04:22] (03PS1) 10Jforrester: VariantSettings: Set wgWMESearchRelevancePages directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538324 [19:04:24] (03PS1) 10Jforrester: CirrusSettings-production: Move settings to VariantSettings where static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538325 [19:05:23] (03CR) 10jerkins-bot: [V: 04-1] CirrusSettings-common: Move timeouts to VariantSettings (full array, not just over-writes) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538323 (owner: 10Jforrester) [19:06:15] (03CR) 10jerkins-bot: [V: 04-1] CirrusSettings-production: Move settings to VariantSettings where static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538325 (owner: 10Jforrester) [19:06:19] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Reimage cobalt and gerrit2001 as buster - https://phabricator.wikimedia.org/T176774 (10Dzahn) p:05Lowest→03Normal [19:06:30] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Reimage cobalt and gerrit2001 as buster - https://phabricator.wikimedia.org/T176774 (10Dzahn) a:03Dzahn [19:06:32] (03CR) 10jerkins-bot: [V: 04-1] VariantSettings: Set wgWMESearchRelevancePages directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538324 (owner: 10Jforrester) [19:06:34] (03CR) 10Paladox: [C: 03+1] gerrit: set gerrit-new as 
name/IP for new gerrit server [puppet] - 10https://gerrit.wikimedia.org/r/538128 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [19:06:54] (03PS2) 10Herron: admin: add raja_wmde to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/538322 (https://phabricator.wikimedia.org/T231984) [19:09:58] (03CR) 10Herron: [C: 03+2] admin: add raja_wmde to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/538322 (https://phabricator.wikimedia.org/T231984) (owner: 10Herron) [19:12:09] 10Operations, 10LDAP-Access-Requests: NDA Request from WMDE employee Raja - https://phabricator.wikimedia.org/T231984 (10herron) 05Open→03Resolved a:03herron Hi @raja_wmde, you have been added to the NDA LDAP group. I'll transition this to resolved now, but please don't hesitate to re-open if any follow... [19:12:48] 10Operations, 10Discovery-Search, 10Elasticsearch: Unassigned shards in eqiad - https://phabricator.wikimedia.org/T233403 (10herron) p:05Triage→03High [19:13:15] 10Operations, 10MobileFrontend, 10Traffic: Sections on some mobile pages are not collabsable - https://phabricator.wikimedia.org/T233373 (10herron) p:05Triage→03Normal [19:26:02] 10Operations, 10LDAP-Access-Requests: NDA Request from WMDE employee Raja - https://phabricator.wikimedia.org/T231984 (10Nuria) @raja_wmde you should have access to both http://turnilo.wikimedia.org and http://superset.wikimedia.org [19:26:02] !log update eqsin firewall filters - T233268 [19:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:07] (03CR) 10Jforrester: "> Patch Set 5:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535704 (owner: 10Jforrester) [19:33:46] (03CR) 10Nuria: "Looks good, just a comment about naming, I really do not have a good suggestion but let's think about it for a bit." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538312 (https://phabricator.wikimedia.org/T208612) (owner: 10Mforns) [19:39:07] (03CR) 10Paladox: hhvm: make it possible to let puppet remove all hhvm remnants (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538108 (https://phabricator.wikimedia.org/T229792) (owner: 10Dzahn) [19:40:22] (03CR) 10Paladox: hhvm: make it possible to let puppet remove all hhvm remnants (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538108 (https://phabricator.wikimedia.org/T229792) (owner: 10Dzahn) [20:30:54] (03PS1) 10Dzahn: zuul: if zuul is disabled/masked don't use zuul::server [puppet] - 10https://gerrit.wikimedia.org/r/538330 [20:30:56] (03PS1) 10Hashar: Revert "zuul: mask service when it is disabled" [puppet] - 10https://gerrit.wikimedia.org/r/538329 (https://phabricator.wikimedia.org/T233391) [20:31:29] (03CR) 10jerkins-bot: [V: 04-1] zuul: if zuul is disabled/masked don't use zuul::server [puppet] - 10https://gerrit.wikimedia.org/r/538330 (owner: 10Dzahn) [20:31:58] (03PS2) 10Dzahn: zuul: if zuul is disabled/masked don't use zuul::server [puppet] - 10https://gerrit.wikimedia.org/r/538330 [20:34:00] (03CR) 10jerkins-bot: [V: 04-1] zuul: if zuul is disabled/masked don't use zuul::server [puppet] - 10https://gerrit.wikimedia.org/r/538330 (owner: 10Dzahn) [20:35:08] (03Abandoned) 10Hashar: Revert "zuul: mask service when it is disabled" [puppet] - 10https://gerrit.wikimedia.org/r/538329 (https://phabricator.wikimedia.org/T233391) (owner: 10Hashar) [20:43:00] (03Restored) 10Hashar: Revert "zuul: mask service when it is disabled" [puppet] - 10https://gerrit.wikimedia.org/r/538329 (https://phabricator.wikimedia.org/T233391) (owner: 10Hashar) [20:43:09] (03PS2) 10Hashar: Revert "zuul: mask service when it is disabled" [puppet] - 10https://gerrit.wikimedia.org/r/538329 (https://phabricator.wikimedia.org/T233391) [20:45:10] (03CR) 10Dzahn: [C: 03+2] Revert "zuul: mask service when it is disabled" [puppet] - 
10https://gerrit.wikimedia.org/r/538329 (https://phabricator.wikimedia.org/T233391) (owner: 10Hashar) [20:46:18] (03CR) 10Krinkle: [C: 03+1] "Test still outputs one warning, but not from cirruss:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538156 (https://phabricator.wikimedia.org/T232691) (owner: 10DCausse) [20:48:46] (03PS1) 10Hashar: zuul: mask service when not enabled [puppet] - 10https://gerrit.wikimedia.org/r/538331 (https://phabricator.wikimedia.org/T233391) [20:51:22] (03CR) 10Jforrester: [C: 03+2] "Yes, this fixed the tests (but leaves them entangled with MediaWiki). Am doing follow-ups to migrate the tests to static config where poss" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538156 (https://phabricator.wikimedia.org/T232691) (owner: 10DCausse) [20:51:29] (03PS2) 10Hashar: zuul: mask service when not enabled [puppet] - 10https://gerrit.wikimedia.org/r/538331 (https://phabricator.wikimedia.org/T233391) [20:52:19] (03Merged) 10jenkins-bot: [cirrus] fix cirrusTest.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538156 (https://phabricator.wikimedia.org/T232691) (owner: 10DCausse) [20:52:38] (03CR) 10jenkins-bot: [cirrus] fix cirrusTest.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538156 (https://phabricator.wikimedia.org/T232691) (owner: 10DCausse) [20:54:51] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/18471/" [puppet] - 10https://gerrit.wikimedia.org/r/538331 (https://phabricator.wikimedia.org/T233391) (owner: 10Hashar) [20:55:49] (03CR) 10Dzahn: [C: 03+2] zuul: mask service when not enabled [puppet] - 10https://gerrit.wikimedia.org/r/538331 (https://phabricator.wikimedia.org/T233391) (owner: 10Hashar) [21:08:28] 10Operations, 10ops-eqsin: rack/setup/install ganeti500[123].eqsin.wmnet - https://phabricator.wikimedia.org/T228099 (10RobH) All remote hands setup on T229243 are done. 
[21:08:44] 10Operations, 10ops-eqsin: remote hands setups for ganeti500[123] - https://phabricator.wikimedia.org/T229243 (10RobH) 05Open→03Resolved All the remote setup has been completed, installation to continue on T228099. [21:12:36] (03CR) 10Krinkle: "The diff is a bit slow to load, but from Gitiles, it looks like the labs realm condition got lost. Can you double check? Moving that out t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537220 (owner: 10Jforrester) [21:13:42] (03CR) 10Jforrester: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537220 (owner: 10Jforrester) [21:16:16] (03CR) 10Krinkle: "In this commit:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537220 (owner: 10Jforrester) [21:16:48] (03CR) 10Krinkle: [C: 03+1] "Ah, looks like that was already inlined in CS.php, got it. Assuming that's dead, can we remove that in a separate patch quickly first?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537220 (owner: 10Jforrester) [21:23:47] (03CR) 10Jforrester: "> Patch Set 2: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537220 (owner: 10Jforrester) [21:29:32] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.23/extensions/CheckUser: fix T233453 (duration: 00m 58s) [21:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:36] T233453: Couldn't fetch revision cu_changes table links to (cuc_this_oldid xxxx) - https://phabricator.wikimedia.org/T233453 [21:30:22] (03CR) 10Krinkle: [C: 03+1] "I thought the fix for that just landed. And it did, but it only fixed the PHP error, not the general hacky way of loading IS. 
Indeed, that" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537220 (owner: 10Jforrester) [21:30:32] (03PS2) 10Krinkle: tests: Skip the Cirrus configuration tests as they're inextricable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537756 (owner: 10Jforrester) [21:30:37] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.22/extensions/CheckUser: fix T233453 (duration: 00m 56s) [21:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:49] (03PS1) 10Krinkle: Remove labs check from InitialiseSetings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538336 [21:39:54] (03CR) 10Jforrester: "This patch could just blank the file at this point…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538336 (owner: 10Krinkle) [21:40:11] (03PS1) 10Krinkle: tests: Mock wmfRealm to let WgConfTestCase/CirrusTest work without labs IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538338 [21:40:32] James_F: There were definitely other issues with WgConfTestCase/CirrusTest but I can't repro them now [21:40:34] Let's see what happens [21:40:57] (03CR) 10Jforrester: "Oooh." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538338 (owner: 10Krinkle) [21:42:02] Oh, right, I didn't push my local hacky patch to re-write Cirrus tests to use VariantSettings only (and mostly comment things out). [21:42:49] Krinkle: Did you see https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/538129/ BTW? Starting to move things from CommonSettings.php and FlaggedRevs.php too as a demo. 
[21:57:02] (03PS3) 10Krinkle: Move VariantSettings back to InitialiseSettings now that the migration is done [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537220 (owner: 10Jforrester) [21:57:56] (03CR) 10jerkins-bot: [V: 04-1] Move VariantSettings back to InitialiseSettings now that the migration is done [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537220 (owner: 10Jforrester) [22:00:10] (03CR) 10Dzahn: [C: 04-1] hhvm: make it possible to let puppet remove all hhvm remnants (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538108 (https://phabricator.wikimedia.org/T229792) (owner: 10Dzahn) [22:01:11] (03PS2) 10Krinkle: tests: Prep WgConfTestCase/CirrusTest for simplified IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538338 [22:01:13] (03PS4) 10Krinkle: Move VariantSettings back to InitialiseSettings now that the migration is done [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537220 (owner: 10Jforrester) [22:01:59] (03CR) 10jerkins-bot: [V: 04-1] tests: Prep WgConfTestCase/CirrusTest for simplified IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538338 (owner: 10Krinkle) [22:02:12] (03CR) 10jerkins-bot: [V: 04-1] Move VariantSettings back to InitialiseSettings now that the migration is done [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537220 (owner: 10Jforrester) [22:02:40] (03CR) 10Dzahn: [C: 04-1] hhvm: make it possible to let puppet remove all hhvm remnants (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538108 (https://phabricator.wikimedia.org/T229792) (owner: 10Dzahn) [22:03:10] (03PS8) 10Dzahn: hhvm: make it possible to let puppet remove all hhvm remnants [puppet] - 10https://gerrit.wikimedia.org/r/538108 (https://phabricator.wikimedia.org/T229792) [22:04:08] (03Abandoned) 10Dzahn: zuul: if zuul is disabled/masked don't use zuul::server [puppet] - 10https://gerrit.wikimedia.org/r/538330 (owner: 10Dzahn) [22:04:36] Krinkle: The rebase of VS->IS doesn't work because of how 
wgConfTest tries to mock the world and load "real" (non-static) config code too. [22:04:44] I don't think it's feasible, sadly. [22:05:36] See what's left behind even after all my static config moves in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/538325/1 and below – inline hook definitions, array_map(!) use, etc. [22:08:36] (03PS3) 10Krinkle: tests: Prep WgConfTestCase/CirrusTest for simplified IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538338 [22:08:38] (03PS5) 10Krinkle: Move VariantSettings back to InitialiseSettings now that the migration is done [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537220 (owner: 10Jforrester) [22:08:58] (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1002/18473/" [puppet] - 10https://gerrit.wikimedia.org/r/538108 (https://phabricator.wikimedia.org/T229792) (owner: 10Dzahn) [22:11:07] James_F: yeah, it's not great, but CS hasn't changed /that/ much. It was just missing the bit we moved really. [22:11:10] Seems to pass now :) [22:11:11] (03CR) 10Jforrester: [C: 03+2] tests: Prep WgConfTestCase/CirrusTest for simplified IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538338 (owner: 10Krinkle) [22:11:32] (03Abandoned) 10Krinkle: Remove labs check from InitialiseSetings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538336 (owner: 10Krinkle) [22:12:02] (03Merged) 10jenkins-bot: tests: Prep WgConfTestCase/CirrusTest for simplified IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538338 (owner: 10Krinkle) [22:12:06] Krinkle: Yeah. [22:12:10] This whole thing needs a lot more documentation and tests, or just be less complicated :P [22:12:17] (03CR) 10jenkins-bot: tests: Prep WgConfTestCase/CirrusTest for simplified IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538338 (owner: 10Krinkle) [22:12:23] Well, less complicated and lots of tests is what I'm going for. 
[22:12:35] I'm trying to force git to diff the VS and IS files [22:12:37] But ultimately, wgConfTest just needs to die. [22:12:44] to verify we didn't change anything [22:12:46] I'll re-do the patch with git mv. [22:13:10] I did already in my rebase. But would like to see it in 'git diff' but I can't seem to convince git it's the same file [22:13:16] even though it's only changing <1% or so [22:13:18] weird [22:13:31] We could do a patch deleting IS first. [22:13:31] feel free to do again though, that's probably the easiest CR [22:13:38] That'd make it clearer for git. [22:13:54] Yeah, good point. Update WgConfTestCase to include VS temporarily [22:14:02] should be fine [22:16:22] (03PS9) 10Dzahn: hhvm: make it possible to let puppet completely remove hhvm [puppet] - 10https://gerrit.wikimedia.org/r/538108 (https://phabricator.wikimedia.org/T229792) [22:18:48] (03CR) 10jerkins-bot: [V: 04-1] hhvm: make it possible to let puppet completely remove hhvm [puppet] - 10https://gerrit.wikimedia.org/r/538108 (https://phabricator.wikimedia.org/T229792) (owner: 10Dzahn) [22:19:43] (03PS10) 10Dzahn: hhvm: make it possible to let puppet completely remove hhvm [puppet] - 10https://gerrit.wikimedia.org/r/538108 (https://phabricator.wikimedia.org/T229792) [22:23:07] (03CR) 10jerkins-bot: [V: 04-1] hhvm: make it possible to let puppet completely remove hhvm [puppet] - 10https://gerrit.wikimedia.org/r/538108 (https://phabricator.wikimedia.org/T229792) (owner: 10Dzahn) [22:24:41] (03PS6) 10Jforrester: Move VariantSettings back to InitialiseSettings now that the migration is done [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537220 [22:24:43] (03PS1) 10Jforrester: Drop InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538341 [22:25:03] Krinkle: There, now gerrit shows the diff. 
[22:26:27] (03PS11) 10Dzahn: hhvm: make it possible to let puppet completely remove hhvm [puppet] - 10https://gerrit.wikimedia.org/r/538108 (https://phabricator.wikimedia.org/T229792) [22:27:22] (03CR) 10Krinkle: [C: 03+1] Drop InitialiseSettings.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538341 (owner: 10Jforrester) [22:27:26] James_F: nice [22:28:10] Indeed. Thanks for finding the way through the thicket to the tweaks needed. [22:29:08] (03CR) 10Jforrester: Drop InitialiseSettings.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538341 (owner: 10Jforrester) [22:29:26] (03PS2) 10Jforrester: Drop InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538341 [22:29:53] (03CR) 10Dzahn: "looks better now, in this example the host mwdebug1001 gets removal, mwdebug1002 stays but adds a few 'present' and scandium is unchanged " [puppet] - 10https://gerrit.wikimedia.org/r/538108 (https://phabricator.wikimedia.org/T229792) (owner: 10Dzahn) [22:30:43] (03PS7) 10Jforrester: Move VariantSettings back to InitialiseSettings now that the migration is done [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537220 [22:31:37] (03CR) 10Krinkle: [C: 03+1] Drop InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538341 (owner: 10Jforrester) [22:31:51] Krinkle: I'm out early next week; would you be OK to deploy the move of VS back to IS then? It'd be nice for our volunteer devs to have things back to "normal" as much as possible. [22:31:58] * Krinkle scratches head [22:31:59] James_F: scap order for VS>IS? [22:32:15] sure, can do. 
[22:32:18] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/18476/mwdebug1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/538108 (https://phabricator.wikimedia.org/T229792) (owner: 10Dzahn) [22:32:55] IS (create), CS, WmfClusters, VS (delete) [22:33:09] (03CR) 10Jforrester: "Scap order: IS (create), CS, src/WmfClusters, VS (delete)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537220 (owner: 10Jforrester) [22:33:52] (03PS4) 10Jforrester: MWConfigCacheGenerator: Provide getCachableMWConfig() which doesn't rely on wgConf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538078 [22:34:10] (03PS12) 10Jforrester: Variant configuration: Pre-calculate config for each wiki and store it in config.git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) [22:35:07] (03CR) 10Dzahn: "This should let us clean hosts from all hhvm remnants by setting a parameter in hieradata hosts and later by roles." [puppet] - 10https://gerrit.wikimedia.org/r/538108 (https://phabricator.wikimedia.org/T229792) (owner: 10Dzahn) [22:35:17] (03CR) 10jerkins-bot: [V: 04-1] Variant configuration: Pre-calculate config for each wiki and store it in config.git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [22:37:32] James_F: thx [22:38:13] (03PS3) 10Jforrester: [WIP] Provide for YAML-based inherited configuration to eventually replace InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538129 [22:39:19] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Provide for YAML-based inherited configuration to eventually replace InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538129 (owner: 10Jforrester) [22:39:55] 10Operations, 10serviceops, 10HHVM, 10Patch-For-Review: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10Dzahn) I uploaded a patch to allow us to clean hosts from hhvm 
resources by setting a Hiera key to absent. The example change applies it on mwdebug1001.eqiad.wmnet. We can use... [22:40:33] 10Operations, 10serviceops, 10HHVM, 10Patch-For-Review, 10Performance-Team (Radar): Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10Krinkle) [22:51:29] (03PS13) 10Jforrester: Variant configuration: Pre-calculate config for each wiki and store it in config.git [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) [22:51:31] (03PS1) 10Jforrester: CommonSettings: Switch from getMWConfigForCacheing to getCachableMWConfig to avoid wgConf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538342 [22:51:33] (03PS1) 10Jforrester: Drop getMWConfigForCacheing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538343 [22:52:53] (03CR) 10jerkins-bot: [V: 04-1] CommonSettings: Switch from getMWConfigForCacheing to getCachableMWConfig to avoid wgConf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538342 (owner: 10Jforrester) [22:53:23] (03CR) 10jerkins-bot: [V: 04-1] Drop getMWConfigForCacheing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538343 (owner: 10Jforrester) [22:56:28] (03PS9) 10Dzahn: gerrit: get LDAP server from ldap_config, use ro server, simplify [puppet] - 10https://gerrit.wikimedia.org/r/536714 [22:58:11] (03CR) 10Dzahn: "Paladox found the second server was originally added in https://github.com/wikimedia/puppet/commit/0b5af694d80d3528a0217a738b19468aa32925a" [puppet] - 10https://gerrit.wikimedia.org/r/536714 (owner: 10Dzahn) [22:59:54] 10Operations, 10Release Pipeline, 10local-charts, 10serviceops, and 3 others: Set up CI for the deployment-charts repository - https://phabricator.wikimedia.org/T233291 (10Jdforrester-WMF) Done? 
[23:00:31] (03PS10) 10Dzahn: gerrit: get LDAP server from ldap_config, use ro server, simplify [puppet] - 10https://gerrit.wikimedia.org/r/536714 [23:01:03] 10Operations, 10Release Pipeline, 10local-charts, 10serviceops, and 3 others: Set up CI for the deployment-charts repository - https://phabricator.wikimedia.org/T233291 (10Jdforrester-WMF) [23:03:08] (03PS11) 10Dzahn: gerrit: get LDAP server from ldap_config, use ro server, simplify [puppet] - 10https://gerrit.wikimedia.org/r/536714 [23:04:43] (03PS12) 10Dzahn: gerrit: get LDAP server from ldap_config, use ro server, simplify [puppet] - 10https://gerrit.wikimedia.org/r/536714 [23:05:31] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/18478/cobalt.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/536714 (owner: 10Dzahn) [23:09:36] (03PS13) 10Dzahn: gerrit: get LDAP server from ldap_config, use ro server, simplify [puppet] - 10https://gerrit.wikimedia.org/r/536714 [23:12:35] (03CR) 10Dzahn: [C: 03+1] "i thought maybe it's best to do this after the rest of the gerrit role is running fine and we like what we are seeing. that way we can be " [puppet] - 10https://gerrit.wikimedia.org/r/535966 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [23:30:40] (03PS1) 10Dzahn: webperf: use performance.discovery instead webperf.discovery [puppet] - 10https://gerrit.wikimedia.org/r/538347 [23:42:35] (03CR) 10Dzahn: "created new certificate and name was changed in DNS in Ic9fbe9028d92d5e3" [puppet] - 10https://gerrit.wikimedia.org/r/538347 (owner: 10Dzahn) [23:43:57] (03CR) 10Dzahn: [C: 04-2] "first setup moscovium but some issue with the Ganeti VM..." 
[dns] - 10https://gerrit.wikimedia.org/r/534129 (owner: 10Dzahn) [23:55:08] (03PS1) 10Dzahn: ssl: add certificate for performance.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/538348 [23:56:48] PROBLEM - HTTPS-wmflabs on tools.wmflabs.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Temporary failure in name resolution https://phabricator.wikimedia.org/tag/toolforge/ [23:57:07] PROBLEM - Host commons.wikimedia.org is DOWN: /bin/ping -n -U -w 15 -c 5 commons.wikimedia.org [23:57:13] (03PS1) 10Dzahn: add fake ssl key for performance.discovery.wmnet, remove webperf [labs/private] - 10https://gerrit.wikimedia.org/r/538349 [23:57:30] Site issues with Varnish? [23:57:42] what's up [23:57:48] PROBLEM - Host ores.wmflabs.org is DOWN: /bin/ping -n -U -w 15 -c 5 ores.wmflabs.org [23:57:48] what .. [23:57:52] Request from 198.73.209.241 via cp1085 cp1085, Varnish XID 192118946 Error: 503, Backend fetch failed at Fri, 20 Sep 2019 23:57:35 GMT [23:58:00] RECOVERY - Host ores.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [23:58:04] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [23:58:14] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [23:58:15] (03PS3) 10Cwhite: initial commit [debs/prometheus-swagger-exporter] - 10https://gerrit.wikimedia.org/r/536376 [23:58:36] RECOVERY - Host commons.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [23:58:40] PROBLEM - HTTP availability for Varnish at eqsin on 
icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [23:58:44] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [23:58:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [23:58:46] PROBLEM - Host asw2-d-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [23:58:48] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler={proxy:fcgi://127.0.0.1:9000,proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&va [23:58:48] server&var-method=GET [23:59:00] interesting [23:59:02] Comes and goes. 
[23:59:04] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [23:59:04] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [23:59:10] PROBLEM - Nginx local proxy to apache on mw1240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:59:14] PROBLEM - Nginx local proxy to apache on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:59:14] PROBLEM - PHP7 rendering on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:59:22] PROBLEM - PHP7 rendering on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:59:22] PROBLEM - Nginx local proxy to apache on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:59:24] Not local to one cluster? [23:59:24] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [23:59:36] is this monitoring exploding maybe? 
[23:59:36] PROBLEM - PHP7 rendering on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:59:38] PROBLEM - HHVM rendering on mw1245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:59:43] PROBLEM - PHP7 rendering on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:59:44] RECOVERY - Host asw2-d-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [23:59:44] PROBLEM - Host db1093 is DOWN: PING CRITICAL - Packet loss = 100% [23:59:44] PROBLEM - Apache HTTP on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:59:44] PROBLEM - HHVM rendering on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:59:44] PROBLEM - PHP7 rendering on mw1257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:59:47] PROBLEM - Apache HTTP on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:59:47] PROBLEM - PHP7 rendering on mw1339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:59:48] hmmm [23:59:48] PROBLEM - PHP7 rendering on mw1342 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:59:48] PROBLEM - Host druid1003 is DOWN: PING CRITICAL - Packet loss = 100% [23:59:48] PROBLEM - PHP7 rendering on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:59:48] 
PROBLEM - PHP7 rendering on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:59:48] PROBLEM - Nginx local proxy to apache on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:59:49] PROBLEM - HHVM rendering on mw1222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:59:49] PROBLEM - PHP7 rendering on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:59:50] PROBLEM - PHP7 rendering on mw1222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:59:50] PROBLEM - Host an-worker1092 is DOWN: PING CRITICAL - Packet loss = 100% [23:59:51] PROBLEM - HHVM rendering on mw1258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:59:51] Maybe. But I'm getting actual issues in prod. 
[23:59:51] PROBLEM - PHP7 rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:59:52] PROBLEM - Host analytics1044 is DOWN: PING CRITICAL - Packet loss = 100% [23:59:52] PROBLEM - Host analytics1043 is DOWN: PING CRITICAL - Packet loss = 100% [23:59:53] PROBLEM - Apache HTTP on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:59:53] PROBLEM - PHP7 rendering on mw1265 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:59:54] PROBLEM - Apache HTTP on mw1242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:59:54] PROBLEM - PHP7 rendering on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:59:55] PROBLEM - PHP7 rendering on mw1251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:59:55] PROBLEM - Apache HTTP on mw1257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:59:56] PROBLEM - Nginx local proxy to apache on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:59:57] RECOVERY - Host druid1003 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [23:59:58] PROBLEM - Host labstore1007 is DOWN: PING CRITICAL - Packet loss = 100% [23:59:58] PROBLEM - PHP7 rendering on mw1258 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 937 bytes in 0.367 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:59:58] okay [23:59:59] PROBLEM - graphoid endpoints health 
on scb1003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [23:59:59] PROBLEM - PHP7 rendering on mw1240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering