[00:00:05] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Evening SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200306T0000).
[00:00:05] No GERRIT patches in the queue for this window AFAICS.
[00:00:56] (CR) CDanis: [C: +2] depool esams for cr2 router maintenance [dns] - https://gerrit.wikimedia.org/r/577363 (https://phabricator.wikimedia.org/T246338) (owner: CDanis)
[00:01:41] (PS2) CDanis: depool esams for cr2 router maintenance [dns] - https://gerrit.wikimedia.org/r/577363 (https://phabricator.wikimedia.org/T246338)
[00:01:57] Operations, netops: Netbox has incorrect email address for GTT - https://phabricator.wikimedia.org/T246564 (faidon) We have one global account, migrated from a previous system. I wasn't able to find how to create individual accounts, so that will do I guess :) I've added it to the pwstore, so @ayounsi s...
[00:02:54] !log T246338 depool esams for router maintenance
[00:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:02:59] T246338: Add graceful-restart to cr2-esams - https://phabricator.wikimedia.org/T246338
[00:11:37] (PS1) Dzahn: site: add mw2301 through mw2309 as api and appservers [puppet] - https://gerrit.wikimedia.org/r/577388 (https://phabricator.wikimedia.org/T247021)
[00:12:00] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 44.34 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:12:59] i consider this normal because esams got depooled above
[00:17:09] (CR) Dzahn: [C: +2] site: add mw2301 through mw2309 as api and appservers [puppet] - https://gerrit.wikimedia.org/r/577388 (https://phabricator.wikimedia.org/T247021) (owner: Dzahn)
[00:19:14] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[00:19:18] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[00:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:24] Operations, serviceops, Patch-For-Review: move all 86 new codfw appservers into production - https://phabricator.wikimedia.org/T247021 (ops-monitoring-bot) Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 9 host(s) and their services with reason: new_install ` mw[2301-2309].codfw.wmnet `
[00:19:37] (PS2) Dzahn: site: add mw2301 through mw2309 as api and appservers [puppet] - https://gerrit.wikimedia.org/r/577388 (https://phabricator.wikimedia.org/T247021)
[00:23:02] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:25:34] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 409, down: 4, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:32:08] Operations, netops, Wikimedia-Incident: Add graceful-restart to cr2-esams - https://phabricator.wikimedia.org/T246338 (CDanis) Committed configuration at 00:21 UTC. Took a few minutes for all BGP sessions to be recreated but eventually wound up with this: `cdanis@re0.cr2-esams> show bgp summary | m...
[00:32:34] (PS1) CDanis: Revert "depool esams for cr2 router maintenance" [dns] - https://gerrit.wikimedia.org/r/577390 (https://phabricator.wikimedia.org/T246338)
[00:33:25] (CR) CDanis: [C: +2] Revert "depool esams for cr2 router maintenance" [dns] - https://gerrit.wikimedia.org/r/577390 (https://phabricator.wikimedia.org/T246338) (owner: CDanis)
[00:33:58] !log repool esams T246338
[00:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:04] T246338: Add graceful-restart to cr2-esams - https://phabricator.wikimedia.org/T246338
[00:35:03] Operations, Epic: Migrate all of production to Buster or later - https://phabricator.wikimedia.org/T247045 (Jdforrester-WMF)
[00:35:39] Operations, Performance-Team, Traffic, Patch-For-Review, Wikimedia-Incident: 15% response start regression as of 2019-11-11 (Varnish->ATS) - https://phabricator.wikimedia.org/T238494 (Krinkle)
[00:35:45] Operations: Migrate Cumin hosts to Buster - https://phabricator.wikimedia.org/T245114 (Jdforrester-WMF)
[00:35:47] Operations, Epic: Migrate all of production to Buster or later - https://phabricator.wikimedia.org/T247045 (Jdforrester-WMF)
[00:35:49] Operations, Discovery-Search: Migrate Elasticsearch to Debian Buster - https://phabricator.wikimedia.org/T244736 (Jdforrester-WMF)
[00:35:52] Operations, Wikidata, Wikidata-Query-Service: Migrate WDQS to Debian Buster - https://phabricator.wikimedia.org/T244753 (Jdforrester-WMF)
[00:36:19] Operations, Epic: Migrate all of production metal to Buster or later - https://phabricator.wikimedia.org/T247045 (Jdforrester-WMF)
[00:39:15] Operations, Epic: Migrate all of production metal to Buster or later - https://phabricator.wikimedia.org/T247045 (Jdforrester-WMF)
[00:41:16] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:42:30] (PS1) Dzahn: admins: add whatamidoing to ldap_only_admins (wmf) [puppet] - https://gerrit.wikimedia.org/r/577395 (https://phabricator.wikimedia.org/T247016)
[00:47:34] (CR) Dzahn: "see:" [puppet] - https://gerrit.wikimedia.org/r/577395 (https://phabricator.wikimedia.org/T247016) (owner: Dzahn)
[00:48:48] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 26.95 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:49:43] (also expected)
[00:50:41] Operations, Wikimedia-Logstash, Patch-For-Review: ELK7 shards failed errors when loading saved objects, e.g. "field expansion matches too many fields, limit: 1024, got: 1726" - https://phabricator.wikimedia.org/T247014 (EBernhardson) The _all field wasn't free before, there was simply an implicit cop...
[00:51:35] thanks cdanis
[00:52:02] adding more appservers in codfw as well
[00:58:41] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[00:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:15] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[01:01:19] Operations, serviceops, Patch-For-Review: move all 86 new codfw appservers into production - https://phabricator.wikimedia.org/T247021 (ops-monitoring-bot) Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 9 host(s) and their services with reason: new_install ` mw[2301-2309].codfw.wmnet `
[01:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:15:15] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[01:15:52] Operations, ops-codfw: ganeti2003.mgmt - stopped responding on SSH - please reset DRAC/BMC? - https://phabricator.wikimedia.org/T246857 (Dzahn) For unknown reasons both checks are OK again by now.
[01:20:13] (CR) Dzahn: [C: +1] "thanks for doing this! yes, the OOM-killer always seems to pick nagios-nrpe-server among all possible victims and that causes icinga spam " [puppet] - https://gerrit.wikimedia.org/r/577320 (owner: Elukey)
[01:22:58] Operations, ops-codfw: ganeti2003.mgmt - stopped responding on SSH - please reset DRAC/BMC? - https://phabricator.wikimedia.org/T246857 (Papaul) Open→Resolved a:Papaul I tested this, SSH is working. Don't know why Icinga did alert.
[01:26:31] (CR) CRusnov: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/576986 (https://phabricator.wikimedia.org/T233183) (owner: Volans)
[01:26:33] (CR) Jforrester: [C: +1] wmf-config: Document wgConf.php load order [mediawiki-config] - https://gerrit.wikimedia.org/r/577376 (owner: Krinkle)
[01:26:50] (CR) Jforrester: [C: +1] ":-)" [mediawiki-config] - https://gerrit.wikimedia.org/r/577374 (owner: Krinkle)
[01:27:37] (CR) Jforrester: [C: +1] tests: Move MWWikiversionsTest out of dblistTest.php [mediawiki-config] - https://gerrit.wikimedia.org/r/577375 (owner: Krinkle)
[01:27:45] (CR) CRusnov: [C: +1] "Seems reasonable!" [software/netbox-extras] - https://gerrit.wikimedia.org/r/576987 (https://phabricator.wikimedia.org/T233183) (owner: Volans)
[01:34:25] !log dzahn@cumin1001 conftool action : set/weight=15; selector: name=mw230[1-9].codfw.wmnet
[01:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:46] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw230[1-9].codfw.wmnet
[01:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:28] !log added 9 more appservers to codfw pool
[01:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:38:12] !log added 9 more appservers to codfw pool split between appserver and API appservers, weight 15 (like all in codfw) T247021
[01:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:38:16] T247021: move all 86 new codfw appservers into production - https://phabricator.wikimedia.org/T247021
[01:41:08] Operations, serviceops: move all 86 new codfw appservers into production - https://phabricator.wikimedia.org/T247021 (Dzahn) mw2301 thru mw2309 pooled and set to active in netbox
[01:42:38] Operations, serviceops: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (Dzahn)
[01:51:38] Operations, ops-codfw, serviceops: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Dzahn) mw2291 through mw2324 are now pooled and status active in netbox (34 servers) mw2325 through mw2334 are not pooled but in site.pp and status staged in net...
[01:51:51] Operations, serviceops: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (Dzahn) mw2291 through mw2324 are now pooled and status active in netbox (34 servers) mw2325 through mw2334 are not pooled but in site.pp and status staged in...
[02:00:55] (PS1) Dzahn: add mw2325-mw2334 as API and appservers, codfw rack B6 [puppet] - https://gerrit.wikimedia.org/r/577408 (https://phabricator.wikimedia.org/T247021)
[02:02:48] (PS2) Dzahn: add mw2325-mw2334 as API and appservers, codfw rack B6 [puppet] - https://gerrit.wikimedia.org/r/577408 (https://phabricator.wikimedia.org/T247021)
[02:04:03] Operations, serviceops, Patch-For-Review: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (Dzahn)
[02:11:34] (PS1) Dzahn: add mw2350-2376 as API and appservers, codfw rack C6 [puppet] - https://gerrit.wikimedia.org/r/577409 (https://phabricator.wikimedia.org/T247021)
[02:14:12] (PS3) Dzahn: add mw2325-mw2334 as API and appservers, codfw rack B6 [puppet] - https://gerrit.wikimedia.org/r/577408 (https://phabricator.wikimedia.org/T247021)
[02:15:43] Operations, netops, Wikimedia-Incident: Add graceful-restart to cr2-esams - https://phabricator.wikimedia.org/T246338 (CDanis) `cdanis@re0.cr2-esams> show bgp summary | match "(Active|Idle|Connect)" Peer AS InPkt OutPkt OutQ Flaps Last Up/Dwn State|#Active/Received/...
[02:16:56] (PS2) Dzahn: add mw2350-2376 as API and appservers, codfw rack C6 [puppet] - https://gerrit.wikimedia.org/r/577409 (https://phabricator.wikimedia.org/T247021)
[02:17:52] Operations, serviceops, Patch-For-Review: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (Dzahn) p:Triage→High a:Dzahn
[02:19:07] Operations, Wikimedia-Logstash, Patch-For-Review: ELK7 shards failed errors when loading saved objects, e.g. "field expansion matches too many fields, limit: 1024, got: 1726" - https://phabricator.wikimedia.org/T247014 (herron) >>! In T247014#5946926, @EBernhardson wrote: > I hope you don't mind, i...
[02:22:08] Operations, serviceops, Patch-For-Review: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (Dzahn)
[02:36:53] Operations, DC-Ops, decommission: decommission WMF6147 (old frpig2001.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T246824 (Papaul)
[02:37:47] Operations, Performance-Team, serviceops, Performance-Team-publish: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (Dzahn) set mw1385 - mw1413 all to status Active in Netbox. mw1413 is also pooled meanwhile. mw1403 was planned -> act...
[02:40:41] Operations, LDAP-Access-Requests, WMF-Legal: Add Abban Dunne to the ldap/wmde group - https://phabricator.wikimedia.org/T246664 (Dzahn) Resolved→Open LDAP users also need an entry in the "ldap_only_admins" sections of admins/data/data.yaml (or the cross-validate-accounts script will start ma...
[02:43:56] (PS1) Dzahn: admins: add Abban Dunne to ldap_only_admins (wmde) [puppet] - https://gerrit.wikimedia.org/r/577410 (https://phabricator.wikimedia.org/T246664)
[02:44:33] (PS1) Papaul: DNS: Remove mgmt DNS for old frpig2001, payments200[1-3] [dns] - https://gerrit.wikimedia.org/r/577411
[02:45:01] (CR) jerkins-bot: [V: -1] DNS: Remove mgmt DNS for old frpig2001, payments200[1-3] [dns] - https://gerrit.wikimedia.org/r/577411 (owner: Papaul)
[02:49:30] (CR) Dzahn: "CNAME 'saiph.mgmt.frack.codfw.wmnet.' points to known same-zone NXDOMAIN 'frpig2001.mgmt.frack.codfw.wmnet.'" [dns] - https://gerrit.wikimedia.org/r/577411 (owner: Papaul)
[02:49:31] Operations, Wikimedia-Logstash, Patch-For-Review: ELK7 shards failed errors when loading saved objects, e.g. "field expansion matches too many fields, limit: 1024, got: 1726" - https://phabricator.wikimedia.org/T247014 (colewhite) > ... with 2M docs indexed it looks like the change might only be from...
[02:51:18] Operations, DC-Ops, decommission, Patch-For-Review: decommission WMF6147 (old frpig2001.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T246824 (Papaul) @jgree while trying to remove the old frpig2001 mgmt DNS, i am getting the error below ` error: CNAME 'saiph.mgmt.frack.codfw.wmnet.' po...
[03:57:43] (CR) Andrew Bogott: [C: +2] Neutron: override the default neutron init script from Queens [puppet] - https://gerrit.wikimedia.org/r/577365 (owner: Andrew Bogott)
[05:35:58] Operations, Performance-Team, serviceops, Performance-Team-publish: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (Joe) Let's see how those numbers work when we decommission the oldest servers, but this seems very encouraging indeed.
[06:03:04] (CR) Giuseppe Lavagetto: [C: +2] profile::services_proxy: use non-deprecated config format [puppet] - https://gerrit.wikimedia.org/r/577187 (owner: Giuseppe Lavagetto)
[06:38:18] (CR) Marostegui: "Great work, this gives us lots of possibilities! Just some comments and questions" (4 comments) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/577224 (https://phabricator.wikimedia.org/T244884) (owner: Jcrespo)
[06:41:32] (PS1) Marostegui: install_server: Revert to original no-srv-format line [puppet] - https://gerrit.wikimedia.org/r/577419 (https://phabricator.wikimedia.org/T246604)
[06:43:44] (CR) Marostegui: [C: +2] install_server: Revert to original no-srv-format line [puppet] - https://gerrit.wikimedia.org/r/577419 (https://phabricator.wikimedia.org/T246604) (owner: Marostegui)
[06:46:58] (PS2) Samwilson: Enable watchlist expiry on Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/577033 (https://phabricator.wikimedia.org/T246849)
[06:47:34] (PS1) Marostegui: mariadb: Set 10.4 as default for buster [puppet] - https://gerrit.wikimedia.org/r/577420 (https://phabricator.wikimedia.org/T246604)
[06:48:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Install 10.4 instead of 10.3 on db1078', diff saved to https://phabricator.wikimedia.org/P10633 and previous config saved to /var/cache/conftool/dbconfig/20200306-064800-marostegui.json
[06:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:08] (CR) Marostegui: [C: +2] mariadb: Set 10.4 as default for buster [puppet] - https://gerrit.wikimedia.org/r/577420 (https://phabricator.wikimedia.org/T246604) (owner: Marostegui)
[06:54:18] (CR) Jcrespo: "Remember this is a POC! And focus for now is ES. I have to cut features at some point to advance and focus on the goal :-D. Will apply fix" (4 comments) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/577224 (https://phabricator.wikimedia.org/T244884) (owner: Jcrespo)
[07:05:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1078 T246604', diff saved to https://phabricator.wikimedia.org/P10634 and previous config saved to /var/cache/conftool/dbconfig/20200306-070538-marostegui.json
[07:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:44] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604
[07:19:11] (PS5) Jcrespo: wmfbackups: Add new simple script to analyze dump row ids [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/577224 (https://phabricator.wikimedia.org/T244884)
[07:33:14] (PS1) Marostegui: install_server: Allow reimage of db2085 [puppet] - https://gerrit.wikimedia.org/r/577460 (https://phabricator.wikimedia.org/T246604)
[07:35:28] (PS2) Marostegui: install_server: Allow reimage of db2085 [puppet] - https://gerrit.wikimedia.org/r/577460 (https://phabricator.wikimedia.org/T246604)
[07:37:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1078 T246604', diff saved to https://phabricator.wikimedia.org/P10635 and previous config saved to /var/cache/conftool/dbconfig/20200306-073707-marostegui.json
[07:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:13] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604
[07:42:54] (CR) Jcrespo: [C: +1] "yay" [puppet] - https://gerrit.wikimedia.org/r/577460 (https://phabricator.wikimedia.org/T246604) (owner: Marostegui)
[07:43:19] (CR) Marostegui: [C: +2] install_server: Allow reimage of db2085 [puppet] - https://gerrit.wikimedia.org/r/577460 (https://phabricator.wikimedia.org/T246604) (owner: Marostegui)
[07:44:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2085:3311, db2085:3318 for reimage to buster - T246604', diff saved to https://phabricator.wikimedia.org/P10636 and previous config saved to /var/cache/conftool/dbconfig/20200306-074427-marostegui.json
[07:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:32] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604
[07:45:40] (PS1) Marostegui: db2085: Reimage to buster [puppet] - https://gerrit.wikimedia.org/r/577461 (https://phabricator.wikimedia.org/T246604)
[07:49:42] (CR) Marostegui: [C: +2] db2085: Reimage to buster [puppet] - https://gerrit.wikimedia.org/r/577461 (https://phabricator.wikimedia.org/T246604) (owner: Marostegui)
[07:50:18] !log Stop MySQL on db2085:3311, db2085:3318 for reimage to buster T246604
[07:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:22] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604
[07:52:08] (PS1) Jcrespo: mariadb-backups: Increase snapshot frequency and retention (6 days) [puppet] - https://gerrit.wikimedia.org/r/577462 (https://phabricator.wikimedia.org/T138562)
[07:55:54] (CR) Marostegui: [C: +1] Extend package list for HP package sync with ssaducli [puppet] - https://gerrit.wikimedia.org/r/577292 (owner: Muehlenhoff)
[07:56:17] (PS2) Jcrespo: mariadb-backups: Increase snapshot frequency and retention (6 days) [puppet] - https://gerrit.wikimedia.org/r/577462 (https://phabricator.wikimedia.org/T138562)
[07:56:21] (CR) Muehlenhoff: [C: +2] Extend package list for HP package sync with ssaducli [puppet] - https://gerrit.wikimedia.org/r/577292 (owner: Muehlenhoff)
[08:00:04] Deploy window NO DEPLOYS (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200306T0800)
[08:09:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[08:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:01] (CR) Elukey: [C: +2] jupyterhub: force systemd spawner to use the user.slice [puppet] - https://gerrit.wikimedia.org/r/577320 (owner: Elukey)
[08:19:03] !log installing openjpeg2 security updates
[08:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1078 T246604', diff saved to https://phabricator.wikimedia.org/P10637 and previous config saved to /var/cache/conftool/dbconfig/20200306-082549-marostegui.json
[08:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:55] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604
[08:27:13] (CR) Elukey: [C: +2] "To keep archives happy: we need to upgrade the pip package on the jupyterhub venv to v0.13 since the current version doesn't support the s" [puppet] - https://gerrit.wikimedia.org/r/577320 (owner: Elukey)
[08:28:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2085:3311, db2085:3318 after reimage to buster - T246604', diff saved to https://phabricator.wikimedia.org/P10638 and previous config saved to /var/cache/conftool/dbconfig/20200306-082858-marostegui.json
[08:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:47] (CR) Marostegui: [C: +1] mariadb-backups: Increase snapshot frequency and retention (6 days) [puppet] - https://gerrit.wikimedia.org/r/577462 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[08:36:25] (PS1) DCausse: [wdqs] purge wdqs query logs [puppet] - https://gerrit.wikimedia.org/r/577469 (https://phabricator.wikimedia.org/T247034)
[08:37:06] (CR) DCausse: [C: +1] wdqs: Initial configuration of wdqs200[78]. [puppet] - https://gerrit.wikimedia.org/r/577324 (https://phabricator.wikimedia.org/T246343) (owner: Gehel)
[08:41:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1078 T246604', diff saved to https://phabricator.wikimedia.org/P10639 and previous config saved to /var/cache/conftool/dbconfig/20200306-084141-marostegui.json
[08:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:55] (CR) DCausse: [C: +1] cirrus: initial configuration of elastic20[55-60] [puppet] - https://gerrit.wikimedia.org/r/577250 (https://phabricator.wikimedia.org/T246975) (owner: Gehel)
[08:44:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1113:3315, db1113:3316 for upgrade - T239791', diff saved to https://phabricator.wikimedia.org/P10640 and previous config saved to /var/cache/conftool/dbconfig/20200306-084439-marostegui.json
[08:44:46] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604
[08:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:25] !log Stop mysql for db1113:3315, db1113:3316 for upgrade T239791
[08:47:51] T239791: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791
[08:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:31] Operations, LDAP-Access-Requests, Patch-For-Review: LDAP access to the wmf group for WhatamIdoing - https://phabricator.wikimedia.org/T247016 (akosiaris) p:Triage→Medium
[08:52:27] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[08:53:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1113:3315, db1113:3316 after upgrade - T239791', diff saved to https://phabricator.wikimedia.org/P10641 and previous config saved to /var/cache/conftool/dbconfig/20200306-085332-marostegui.json
[08:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1074 for upgrade T239791', diff saved to https://phabricator.wikimedia.org/P10642 and previous config saved to /var/cache/conftool/dbconfig/20200306-085435-marostegui.json
[08:55:13] !log Stop MySQL on db1074 for upgrade T239791
[08:56:37] T239791: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791
[08:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:56] !log rolling restart of kartotherian/tilerator/tileratorui to pick up OpenJPEG security updates
[08:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:44] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[09:00:58] !log installing libidn security updates
[09:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:13] (PS2) DCausse: [wdqs] purge wdqs query logs [puppet] - https://gerrit.wikimedia.org/r/577469 (https://phabricator.wikimedia.org/T247034)
[09:03:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1074', diff saved to https://phabricator.wikimedia.org/P10643 and previous config saved to /var/cache/conftool/dbconfig/20200306-090328-marostegui.json
[09:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:46] !log rolling restart of mw canaries to pick up libidn security updates
[09:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:05] (PS1) Marostegui: Revert "install_server: Allow reimage of db2085" [puppet] - https://gerrit.wikimedia.org/r/577512
[09:15:23] (CR) Marostegui: [C: +2] Revert "install_server: Allow reimage of db2085" [puppet] - https://gerrit.wikimedia.org/r/577512 (owner: Marostegui)
[09:15:29] (PS1) Giuseppe Lavagetto: profile::services_proxy::envoy: refactor around listeners [puppet] - https://gerrit.wikimedia.org/r/577513
[09:19:16] (PS1) Marostegui: install_server: Allow reimage db2084 [puppet] - https://gerrit.wikimedia.org/r/577518 (https://phabricator.wikimedia.org/T246604)
[09:19:50] (CR) Marostegui: [C: +2] install_server: Allow reimage db2084 [puppet] - https://gerrit.wikimedia.org/r/577518 (https://phabricator.wikimedia.org/T246604) (owner: Marostegui)
[09:20:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1074', diff saved to https://phabricator.wikimedia.org/P10644 and previous config saved to /var/cache/conftool/dbconfig/20200306-092026-marostegui.json
[09:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:31] Operations, observability, serviceops, vm-requests: Provision grafana VM in codfw - https://phabricator.wikimedia.org/T244357 (fgiunchedi) Open→Resolved a:fgiunchedi Nothing left to do here, resolving
[09:21:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2084:3314, db2084:3315 for reimage to buster - T246604', diff saved to https://phabricator.wikimedia.org/P10645 and previous config saved to /var/cache/conftool/dbconfig/20200306-092103-marostegui.json
[09:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:08] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604
[09:21:36] !log Stop MySQL on db2084:3315, db2084:3314 for reimage T246604
[09:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:31] (PS1) Marostegui: db2084: Reimage to buster [puppet] - https://gerrit.wikimedia.org/r/577520 (https://phabricator.wikimedia.org/T246604)
[09:24:00] (CR) Marostegui: [C: +2] db2084: Reimage to buster [puppet] - https://gerrit.wikimedia.org/r/577520 (https://phabricator.wikimedia.org/T246604) (owner: Marostegui)
[09:26:02] Operations, Page Content Service, Wikimedia-Logstash, observability, and 4 others: Move mobileapps logging to new logging pipeline - https://phabricator.wikimedia.org/T219924 (fgiunchedi) >>! In T219924#5946258, @Mholloway wrote: > Hmm, is this still worth doing if mobileapps is finally moving to...
[09:28:28] (PS2) Giuseppe Lavagetto: profile::services_proxy::envoy: refactor around listeners [puppet] - https://gerrit.wikimedia.org/r/577513
[09:35:54] (PS3) Giuseppe Lavagetto: profile::services_proxy::envoy: refactor around listeners [puppet] - https://gerrit.wikimedia.org/r/577513
[09:42:20] (PS4) Giuseppe Lavagetto: profile::services_proxy::envoy: refactor around listeners [puppet] - https://gerrit.wikimedia.org/r/577513
[09:42:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[09:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:30] !log elukey@cumin1001 START - Cookbook sre.aqs.roll-restart
[09:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:45:03] Operations: Migrate Cumin hosts to Buster - https://phabricator.wikimedia.org/T245114 (Volans) @Krenair sure, we'll need to make the buster package anyway for the prod migration. If you have a date in mind for your migration let me know so that I can adjust on when doing the buster package. Should be pretty...
[09:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:45] !log elukey@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0)
[09:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:41] (CR) Cparle: [C: +1] MachineVision: Update label blacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/577335 (owner: Mholloway)
[09:51:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1074', diff saved to https://phabricator.wikimedia.org/P10646 and previous config saved to /var/cache/conftool/dbconfig/20200306-095115-marostegui.json
[09:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:43] (PS1) Marostegui: Revert "install_server: Allow reimage db2084" [puppet] - https://gerrit.wikimedia.org/r/577523
[09:52:47] !log rolling restart of slapd on LDAP replicas to pick up libidn security updates
[09:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:17] (PS5) Giuseppe Lavagetto: profile::services_proxy::envoy: refactor around listeners [puppet] - https://gerrit.wikimedia.org/r/577513
[09:53:27] Operations: Migrate Cumin hosts to Buster - https://phabricator.wikimedia.org/T245114 (Krenair) Thanks Volans. I'm planning to do it as soon as the package is available.
[09:54:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2084:3314, db2084:3315 after reimage to buster - T246604', diff saved to https://phabricator.wikimedia.org/P10647 and previous config saved to /var/cache/conftool/dbconfig/20200306-095407-marostegui.json [09:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:13] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 [09:55:47] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/21319/mw1331.eqiad.wmnet/ this change just removes the unused clusters." [puppet] - 10https://gerrit.wikimedia.org/r/577513 (owner: 10Giuseppe Lavagetto) [09:57:53] (03PS1) 10Elukey: cumin: rename presto alias to presto-analytics [puppet] - 10https://gerrit.wikimedia.org/r/577524 [10:00:24] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/577524 (owner: 10Elukey) [10:00:46] (03CR) 10Elukey: [C: 03+2] cumin: rename presto alias to presto-analytics [puppet] - 10https://gerrit.wikimedia.org/r/577524 (owner: 10Elukey) [10:00:48] (03PS1) 10Elukey: Add a cookbook to restart the Analytics Presto worker nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/577525 [10:02:14] (03PS2) 10Elukey: Add a cookbook to restart the Analytics Presto worker nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/577525 [10:03:45] !log rolling restart of labweb* to pick up libidn security updates [10:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:14] (03CR) 10Jbond: [C: 03+1] Sample all inbound traffic [homer/public] - 10https://gerrit.wikimedia.org/r/577316 (https://phabricator.wikimedia.org/T246618) (owner: 10Ayounsi) [10:05:06] (03CR) 10Elukey: [C: 03+2] "I am self-merging this since it is super straightforward, and I'd like to use it /test it today (there are some restarts planned). 
I will " [cookbooks] - 10https://gerrit.wikimedia.org/r/577525 (owner: 10Elukey) [10:06:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1074', diff saved to https://phabricator.wikimedia.org/P10648 and previous config saved to /var/cache/conftool/dbconfig/20200306-100628-marostegui.json [10:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:33] !log elukey@cumin1001 START - Cookbook sre.presto.roll-restart-workers [10:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:17] (03CR) 10Jbond: [C: 03+1] admins: add Abban Dunne to ldap_only_admins (wmde) [puppet] - 10https://gerrit.wikimedia.org/r/577410 (https://phabricator.wikimedia.org/T246664) (owner: 10Dzahn) [10:10:04] !log rolling restart of Exim on mx* to pick up libidn security updates [10:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:02] (03CR) 10Marostegui: [C: 03+2] Revert "install_server: Allow reimage db2084" [puppet] - 10https://gerrit.wikimedia.org/r/577523 (owner: 10Marostegui) [10:14:05] 10Operations: smartd not starting properly on gen9 + buster - https://phabricator.wikimedia.org/T246997 (10fgiunchedi) Interesting find! Looks like db1078 is the first system that we run Buster on and has HP raid controller (so the disks are "masked" behind a single device). This looks like a "regression" in sma... 
[10:15:03] 10Operations, 10SRE-Access-Requests: offboard tilman bayer - https://phabricator.wikimedia.org/T220565 (10elukey) [10:16:44] !log elukey@cumin1001 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) [10:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:13] \o/ [10:17:47] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Deployment strategy and hardware requirement for new Flink based WDQS updater - https://phabricator.wikimedia.org/T247058 (10Gehel) [10:18:00] (03CR) 10Alexandros Kosiaris: [C: 03+2] admins: add Abban Dunne to ldap_only_admins (wmde) [puppet] - 10https://gerrit.wikimedia.org/r/577410 (https://phabricator.wikimedia.org/T246664) (owner: 10Dzahn) [10:18:40] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10Patch-For-Review: Add Abban Dunne to the ldap/wmde group - https://phabricator.wikimedia.org/T246664 (10akosiaris) 05Open→03Resolved Thanks!. Done. Re-resolving [10:21:45] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [10:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:07] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Deployment strategy and hardware requirement for new Flink based WDQS updater - https://phabricator.wikimedia.org/T247058 (10Zbyszko) [10:30:33] 10Operations: smartd not starting properly on gen9 + buster - https://phabricator.wikimedia.org/T246997 (10Marostegui) Thanks for taking a look. From both options you suggest, I am more inclined on the first one so we can get rid of a component which is overruled by Prometheus anyways, no? The idea of having to... 
[10:32:54] (03CR) 10Hnowlan: [C: 03+1] admin: Add redis databases for changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/577239 (https://phabricator.wikimedia.org/T213193) (owner: 10Alexandros Kosiaris) [10:33:05] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Deployment strategy and hardware requirement for new Flink based WDQS updater - https://phabricator.wikimedia.org/T247058 (10Zbyszko) [10:33:19] 10Operations: smartd not starting properly on gen9 + buster - https://phabricator.wikimedia.org/T246997 (10MoritzMuehlenhoff) Let's report this upstream (or in the Debian BTS, not sure if there are possible local packaging changes which might make a difference)? [10:36:42] 10Operations, 10Traffic, 10Wikimedia-Logstash, 10observability, and 3 others: Changing Kibana filters is ridiculously slow - https://phabricator.wikimedia.org/T189333 (10fgiunchedi) >>! In T189333#5945454, @Krinkle wrote: > This is still an issue. > > Editing Kibana dashboards: > * In Safari, crashes the... [10:37:42] (03PS3) 10Volans: dns: fix sub/24 IPv4 netmasks file generation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576987 (https://phabricator.wikimedia.org/T233183) [10:37:44] (03PS1) 10Volans: dns: add support for two-phase commit [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/577528 (https://phabricator.wikimedia.org/T233183) [10:40:03] (03PS1) 10Volans: scripts: improve decommission script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/577529 [10:42:31] (03CR) 10Volans: "The only thing changed in PS3 is the removal of the related item from the TODO list in the module's docstring." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576987 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [10:42:40] (03CR) 10Volans: [C: 03+2] gitignore: add paths used for local testing [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576984 (owner: 10Volans) [10:42:46] (03CR) 10Volans: [C: 03+2] dns: convert Netbox data gathering into a class [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576985 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [10:42:55] (03CR) 10Volans: [C: 03+2] dns: convert records management in classes [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576986 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [10:43:01] (03CR) 10Volans: [C: 03+2] dns: fix sub/24 IPv4 netmasks file generation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576987 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [10:45:56] PROBLEM - Check systemd state on db1078 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:46:27] godog: going to ack this ^ [10:46:56] marostegui: ah yes, thank you [10:47:36] ACKNOWLEDGEMENT - Check systemd state on db1078 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Marostegui T246997 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:49:25] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Deployment strategy and hardware requirement for new Flink based WDQS updater - https://phabricator.wikimedia.org/T247058 (10dcausse) [10:49:31] (03PS1) 10Jcrespo: mariadb: Update package to 10.1.44 (including stretch systemd fix) [software] - 10https://gerrit.wikimedia.org/r/577532 [10:53:25] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Deployment strategy and hardware requirement for new Flink based WDQS updater - https://phabricator.wikimedia.org/T247058 (10dcausse) [10:54:09] 10Operations: smartd not starting properly on gen9 + buster - https://phabricator.wikimedia.org/T246997 (10fgiunchedi) >>! In T246997#5947487, @Marostegui wrote: > Thanks for taking a look. > From both options you suggest, I am more inclined on the first one so we can get rid of a component which is overruled by... [11:06:17] 10Operations, 10MediaWiki-ResourceLoader, 10Performance-Team, 10Traffic, 10Wikimedia-Incident: load.php?modules=startup miss rate trippled on 2020-02-05 - https://phabricator.wikimedia.org/T247020 (10akosiaris) p:05Triage→03High [11:09:06] 10Operations, 10MediaWiki-ResourceLoader, 10Performance-Team, 10Traffic, 10Wikimedia-Incident: load.php?modules=startup miss rate tripled on 2020-02-05 - https://phabricator.wikimedia.org/T247020 (10CDanis) [11:15:40] (03CR) 10MarcoAurelio: "This has conflicts now on wikiversions.json and InitialiseSettings.php. 
Given that there's no estimated time when this is going to be appl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574184 (https://phabricator.wikimedia.org/T245911) (owner: 10MarcoAurelio) [11:21:15] (03PS1) 10Aaron Schulz: Update SSH keys for myself (aaron) [puppet] - 10https://gerrit.wikimedia.org/r/577535 [11:28:39] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for WhatamIdoing - https://phabricator.wikimedia.org/T247016 (10akosiaris) Ciao, @Elitre. I think we need your approval for this. [11:30:38] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Deployment strategy and hardware requirement for new Flink based WDQS updater - https://phabricator.wikimedia.org/T247058 (10akosiaris) p:05Triage→03Medium [11:30:47] 10Operations, 10Epic: Migrate all of production metal to Buster or later - https://phabricator.wikimedia.org/T247045 (10akosiaris) p:05Triage→03Low [11:31:00] 10Operations, 10ops-codfw, 10serviceops: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (10akosiaris) p:05Triage→03High [11:31:30] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: ELK7 shards failed errors when loading saved objects, e.g. 
"field expansion matches too many fields, limit: 1024, got: 1726" - https://phabricator.wikimedia.org/T247014 (10akosiaris) p:05Triage→03Medium [11:31:41] 10Operations: smartd not starting properly on gen9 + buster - https://phabricator.wikimedia.org/T246997 (10akosiaris) p:05Triage→03Medium [11:32:46] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventstreams: Add kubernetes hosts to conftool [puppet] - 10https://gerrit.wikimedia.org/r/566771 (https://phabricator.wikimedia.org/T238658) (owner: 10Alexandros Kosiaris) [11:32:53] (03PS2) 10Alexandros Kosiaris: eventstreams: Add kubernetes hosts to conftool [puppet] - 10https://gerrit.wikimedia.org/r/566771 (https://phabricator.wikimedia.org/T238658) [11:34:19] @akosiaris: you have my permission, or you can ask Keegan re: t247016. [11:34:25] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Deployment strategy and hardware requirement for new Flink based WDQS updater - https://phabricator.wikimedia.org/T247058 (10Joe) I would like to read an assessment of why our current event processing platform, change-propagation, is not suited for this pur... [11:35:32] (03CR) 10Alexandros Kosiaris: [C: 03+2] "> Alex, this is the first step eh?" [puppet] - 10https://gerrit.wikimedia.org/r/566771 (https://phabricator.wikimedia.org/T238658) (owner: 10Alexandros Kosiaris) [11:35:49] Elitre: cool, thanks! [11:36:55] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for WhatamIdoing - https://phabricator.wikimedia.org/T247016 (10akosiaris) From IRC ` (11:34:19) Elitre: @akosiaris: you have my permission, or you can ask Keegan re: t247016. 
` [11:39:47] (03CR) 10Alexandros Kosiaris: [C: 03+2] admins: add whatamidoing to ldap_only_admins (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/577395 (https://phabricator.wikimedia.org/T247016) (owner: 10Dzahn) [11:39:55] (03PS2) 10Alexandros Kosiaris: admins: add whatamidoing to ldap_only_admins (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/577395 (https://phabricator.wikimedia.org/T247016) (owner: 10Dzahn) [11:47:56] (03PS1) 10Muehlenhoff: Change CAS actuator base web path to api/ [puppet] - 10https://gerrit.wikimedia.org/r/577539 [11:48:55] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for WhatamIdoing - https://phabricator.wikimedia.org/T247016 (10akosiaris) 05Open→03Resolved a:03akosiaris @Whatamidoing-WMF You should now have access to superset. I 'll close this as resolved, feel free to reopen. [11:49:05] 10Operations, 10netops: Configure management-instance on router with Junos > 17.3 - https://phabricator.wikimedia.org/T247073 (10ayounsi) p:05Triage→03Low [11:49:18] 10Operations, 10netops: Configure management-instance on routers with Junos > 17.3 - https://phabricator.wikimedia.org/T247073 (10ayounsi) [11:50:56] !log akosiaris@cumin1001 conftool action : set/weight=1; selector: dc=eqiad,service=eventstreams,name=kube.* [11:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:17] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,service=eventstreams,name=kubernetes1001.* [11:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:04] 10Operations, 10netops: Configure management-instance on routers with Junos > 17.3 - https://phabricator.wikimedia.org/T247073 (10ayounsi) [11:52:06] 10Operations, 10netops: Upgrade routers - https://phabricator.wikimedia.org/T243080 (10ayounsi) [11:54:12] 10Operations, 10Design-Research: Edit optoutresearch@ mailing list recipients - https://phabricator.wikimedia.org/T100860 
(10Aklapper) a:05aripstra→03None Resetting task assignee, as that user account is not active anymore. [11:54:35] 10Operations, 10Citoid: Request took down both zotero and citoid (exceeding memory) - https://phabricator.wikimedia.org/T243444 (10akosiaris) I guess we aren't gonna find the source of this specific event. @mvolz do you feel we are at least better prepared logging wise for the next time this happens? [11:55:28] (03CR) 10Jcrespo: "Let's give a deeper thought to this next week. If we move some snapshots to bacula, we may not need to have such a high retention." [puppet] - 10https://gerrit.wikimedia.org/r/577462 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [11:56:09] !log T238658. kubernetes1001 pooled for eventstreams, weight=1 which should account for 2.1% of traffic [11:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:16] T238658: Migrate EventStreams to k8s deployment pipeline - https://phabricator.wikimedia.org/T238658 [11:56:19] (03CR) 10Jcrespo: [C: 03+2] mysql: Fix mysql server configuration for the percona flavour [puppet] - 10https://gerrit.wikimedia.org/r/575496 (https://phabricator.wikimedia.org/T193224) (owner: 10Jcrespo) [11:57:43] 10Operations, 10Design-Research: Edit optoutresearch@ mailing list recipients - https://phabricator.wikimedia.org/T100860 (10Aklapper) >>! In T100860#4541426, @ArielGlenn wrote: > Ping: is the optoutresearch@ alias actually being used? Let's get a decision on this so we can move forward one way or the other.... 
[12:15:38] (03CR) 10Alexandros Kosiaris: [C: 03+2] hfst: New upstream release 3.15.1 [debs/contenttranslation/hfst] - 10https://gerrit.wikimedia.org/r/550092 (https://phabricator.wikimedia.org/T233697) (owner: 10KartikMistry) [12:17:45] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add chart for chromium-render [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) (owner: 10MSantos) [12:18:13] (03Merged) 10jenkins-bot: Add chart for chromium-render [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) (owner: 10MSantos) [12:19:36] (03PS1) 10Alexandros Kosiaris: chromium-render: Package and release 0.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/577546 (https://phabricator.wikimedia.org/T238830) [12:21:11] (03CR) 10Alexandros Kosiaris: [C: 03+2] chromium-render: Package and release 0.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/577546 (https://phabricator.wikimedia.org/T238830) (owner: 10Alexandros Kosiaris) [12:23:13] 10Operations, 10netops, 10Wikimedia-Incident: Juniper HA audit - https://phabricator.wikimedia.org/T191667 (10ayounsi) Next step here as this is something that needs to be squared away. Decide what should be configured for which type of devices, respectively: * Dual-REs routers: GRES + GR or NSR ** The GRES... [12:33:06] 10Operations, 10MediaWiki-ResourceLoader, 10Performance-Team, 10Traffic, 10Wikimedia-Incident: load.php?modules=startup miss rate tripled on 2020-02-05 - https://phabricator.wikimedia.org/T247020 (10ema) I suspect this is due to the fact that we are unsetting `Accept-Encoding` in `do_global_send_request`... 
[12:35:14] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] mariadb: Update package to 10.1.44 (including stretch systemd fix) [software] - 10https://gerrit.wikimedia.org/r/577532 (owner: 10Jcrespo) [12:45:34] (03PS7) 10Vgutierrez: ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) [12:48:17] (03PS8) 10Vgutierrez: ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) [12:50:46] (03PS1) 10Ema: ATS: unset client req Accept-Encoding on ats-be [puppet] - 10https://gerrit.wikimedia.org/r/577551 (https://phabricator.wikimedia.org/T247020) [12:50:58] (03CR) 10jerkins-bot: [V: 04-1] ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [12:51:48] (03PS9) 10Vgutierrez: ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) [12:54:09] (03CR) 10jerkins-bot: [V: 04-1] ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [12:54:17] FFS [12:55:14] vgutierrez: it's not RO Friday for jenkins apparently :P [12:55:21] *sigh* [12:56:27] (03PS1) 10Muehlenhoff: CAS: Make the actuators a profile argument [puppet] - 10https://gerrit.wikimedia.org/r/577552 [12:57:34] (03PS10) 10Vgutierrez: ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) [12:57:51] obviously that's my almighty L8 [12:58:12] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: introduce filtering for neutron BGP addresses - https://phabricator.wikimedia.org/T246887 (10aborrero) Ping. It would be nice to address this ASAP. 
[13:07:00] (03PS1) 10Jbond: offboard-user: update script so that it can traverse subgroups [puppet] - 10https://gerrit.wikimedia.org/r/577553 (https://phabricator.wikimedia.org/T245771) [13:07:52] (03CR) 10jerkins-bot: [V: 04-1] offboard-user: update script so that it can traverse subgroups [puppet] - 10https://gerrit.wikimedia.org/r/577553 (https://phabricator.wikimedia.org/T245771) (owner: 10Jbond) [13:08:32] (03PS11) 10Vgutierrez: ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) [13:08:49] (03PS2) 10Ema: ATS: unset client req Accept-Encoding on ats-be [puppet] - 10https://gerrit.wikimedia.org/r/577551 (https://phabricator.wikimedia.org/T247020) [13:11:17] (03PS2) 10Muehlenhoff: CAS: Make the actuators a profile argument [puppet] - 10https://gerrit.wikimedia.org/r/577552 [13:13:46] (03PS2) 10Jbond: offboard-user: update script so that it can traverse subgroups [puppet] - 10https://gerrit.wikimedia.org/r/577553 (https://phabricator.wikimedia.org/T245771) [13:14:23] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/577539 (owner: 10Muehlenhoff) [13:17:30] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/21321/" [puppet] - 10https://gerrit.wikimedia.org/r/577552 (owner: 10Muehlenhoff) [13:18:26] (03PS12) 10Vgutierrez: ATS: Support TLS Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/573977 (https://phabricator.wikimedia.org/T245616) [13:25:44] (03PS2) 10Muehlenhoff: Change CAS actuator base web path to api/ [puppet] - 10https://gerrit.wikimedia.org/r/577539 [13:26:42] 10Operations: Integrate Stretch 9.12 point update - https://phabricator.wikimedia.org/T244695 (10MoritzMuehlenhoff) [13:33:38] (03PS1) 10CDanis: allow overriding ssh_config path in homer's config [software/homer] - 10https://gerrit.wikimedia.org/r/577555 [13:36:26] (03CR) 10jerkins-bot: [V: 04-1] allow overriding ssh_config path in homer's config 
[software/homer] - 10https://gerrit.wikimedia.org/r/577555 (owner: 10CDanis) [13:37:57] (03PS2) 10CDanis: allow overriding ssh_config path in homer's config [software/homer] - 10https://gerrit.wikimedia.org/r/577555 [13:44:19] (03CR) 10Volans: "LGTM, thanks for the patch. One optional nit inline (probably my fault)." (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/577555 (owner: 10CDanis) [13:53:05] (03PS3) 10CDanis: allow overriding ssh_config path in homer's config [software/homer] - 10https://gerrit.wikimedia.org/r/577555 [13:55:34] (03CR) 10Muehlenhoff: [C: 03+2] Change CAS actuator base web path to api/ [puppet] - 10https://gerrit.wikimedia.org/r/577539 (owner: 10Muehlenhoff) [13:55:49] (03CR) 10jerkins-bot: [V: 04-1] allow overriding ssh_config path in homer's config [software/homer] - 10https://gerrit.wikimedia.org/r/577555 (owner: 10CDanis) [13:58:29] (03PS4) 10CDanis: allow overriding ssh_config path in homer's config [software/homer] - 10https://gerrit.wikimedia.org/r/577555 [13:59:20] (03PS1) 10Elukey: admin: add dsaez to gpu_testers [puppet] - 10https://gerrit.wikimedia.org/r/577559 [14:01:58] (03CR) 10Elukey: [C: 03+2] admin: add dsaez to gpu_testers [puppet] - 10https://gerrit.wikimedia.org/r/577559 (owner: 10Elukey) [14:02:01] (03PS5) 10Volans: allow overriding ssh_config path in homer's config [software/homer] - 10https://gerrit.wikimedia.org/r/577555 (owner: 10CDanis) [14:03:07] (03PS6) 10Volans: allow overriding ssh_config path in homer's config [software/homer] - 10https://gerrit.wikimedia.org/r/577555 (owner: 10CDanis) [14:05:23] (03PS1) 10Muehlenhoff: Actually change endpoint in the template [puppet] - 10https://gerrit.wikimedia.org/r/577561 [14:05:49] Urbanecm: need help debugging T247078? 
[14:05:50] T247078: Main pages of several Beta Cluster wikis redirect to other production wikis - https://phabricator.wikimedia.org/T247078 [14:06:19] (03CR) 10CDanis: [C: 03+2] allow overriding ssh_config path in homer's config [software/homer] - 10https://gerrit.wikimedia.org/r/577555 (owner: 10CDanis) [14:06:31] (03CR) 10Alexandros Kosiaris: [C: 03+2] lttoolbox: Update to new upstream release 3.5.1 [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/576817 (https://phabricator.wikimedia.org/T234182) (owner: 10KartikMistry) [14:08:16] (03CR) 10Volans: [C: 03+2] allow overriding ssh_config path in homer's config [software/homer] - 10https://gerrit.wikimedia.org/r/577555 (owner: 10CDanis) [14:08:31] (03CR) 10Muehlenhoff: [C: 03+2] Actually change endpoint in the template [puppet] - 10https://gerrit.wikimedia.org/r/577561 (owner: 10Muehlenhoff) [14:09:11] (03Merged) 10jenkins-bot: allow overriding ssh_config path in homer's config [software/homer] - 10https://gerrit.wikimedia.org/r/577555 (owner: 10CDanis) [14:09:21] (03PS3) 10Jbond: offboard-user: update script so that it can traverse subgroups [puppet] - 10https://gerrit.wikimedia.org/r/577553 (https://phabricator.wikimedia.org/T245771) [14:09:46] (03PS1) 10CDanis: clean up stub routing-options left behind in caf7b4f [homer/public] - 10https://gerrit.wikimedia.org/r/577563 [14:09:48] (03PS1) 10CDanis: add graceful-restart to CRs [homer/public] - 10https://gerrit.wikimedia.org/r/577564 [14:10:01] hashar: anything changed recently for which gate-and-submit might not remove the V+2 before merging? 
(03CR) 10CDanis: "Obviously can't just roll this out willy-nilly, but it should wind up here after we do a careful deployment" [homer/public] - 10https://gerrit.wikimedia.org/r/577564 (owner: 10CDanis) [14:12:24] (03CR) 10Alexandros Kosiaris: [C: 03+2] cg3: Update to new upstream release 1.3.1 [debs/contenttranslation/cg3] - 10https://gerrit.wikimedia.org/r/576833 (https://phabricator.wikimedia.org/T234182) (owner: 10KartikMistry) [14:13:09] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Implement logic to be able to perform full and incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (10jcrespo) mydumper-like tool to wrap mysqldump for ES incrementals: ` $ ./backup_es_incremental.py --help usage: backup_es_increment... [14:13:54] (03CR) 10Jbond: "LGTM a few nits but they can be added later" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/577552 (owner: 10Muehlenhoff) [14:14:12] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The setup of the patch is sound; however this change will change the apache setup of the jobrunners;" [puppet] - 10https://gerrit.wikimedia.org/r/576913 (https://phabricator.wikimedia.org/T246389) (owner: 10Hnowlan) [14:14:16] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Deployment strategy and hardware requirement for new Flink based WDQS updater - https://phabricator.wikimedia.org/T247058 (10Gehel) [14:16:16] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Implement logic to be able to perform full and incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (10Marostegui) Nice one! :-) If I can suggest (doesn't have to be for this iteration, of course), maybe you can create a `--dry-run` or... [14:20:15] (03PS2) 10C. Scott Ananian: All parsoid profiles use_php=true [puppet] - 10https://gerrit.wikimedia.org/r/577043 [14:20:17] (03PS1) 10C. 
Scott Ananian: Parsoid-testing: Add temporary symlink to old deploy repo [puppet] - 10https://gerrit.wikimedia.org/r/577568 [14:21:21] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: introduce filtering for neutron BGP addresses - https://phabricator.wikimedia.org/T246887 (10ayounsi) Will push the following to keep the previous behavior and be on a whitelist basis instead of blacklist. We can tune it later on if needed. `... [14:21:59] (03CR) 10Muehlenhoff: CAS: Make the actuators a profile argument (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/577552 (owner: 10Muehlenhoff) [14:22:06] (03PS3) 10Muehlenhoff: CAS: Make the actuators a profile argument [puppet] - 10https://gerrit.wikimedia.org/r/577552 [14:23:52] (03PS1) 10Vgutierrez: Release 8.0.6-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/577569 (https://phabricator.wikimedia.org/T245616) [14:24:13] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.6-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/577569 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [14:26:56] volans: hi, do you have any change reflecting that Zuul behavior? gate-and-submit pipeline is configured to set verified=0 [14:27:03] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Deployment strategy and hardware requirement for new Flink based WDQS updater - https://phabricator.wikimedia.org/T247058 (10dcausse) @Joe the main reason for me is that we need to do state-full computation over multiple event streams: - we want to union mu... [14:27:12] that happens when the change enters the zuul pipeline [14:27:50] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Implement logic to be able to perform full and incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (10jcrespo) Could you elaborate on what `--dry-run` would do or skip? * Read the file/original backup * Connect to the database *... 
[14:28:12] hashar: yes https://gerrit.wikimedia.org/r/c/operations/software/homer/+/577555 [14:31:15] (03PS2) 10C. Scott Ananian: Parsoid-testing: Add temporary symlink to old deploy repo [puppet] - 10https://gerrit.wikimedia.org/r/577568 [14:31:17] (03PS3) 10C. Scott Ananian: All parsoid profiles use_php=true [puppet] - 10https://gerrit.wikimedia.org/r/577043 [14:31:22] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Implement logic to be able to perform full and incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (10Marostegui) Sure thing! The scenario I have in mind is: I need to recover a specific incremental because I have to examine several... [14:31:25] (03CR) 10Ayounsi: [C: 03+1] clean up stub routing-options left behind in caf7b4f [homer/public] - 10https://gerrit.wikimedia.org/r/577563 (owner: 10CDanis) [14:34:08] (03PS1) 10Joal: Bump AQS druid snapshot to 2020_02 [puppet] - 10https://gerrit.wikimedia.org/r/577570 [14:34:11] volans: no clue :( [14:34:15] zuul has: 2020-03-06 14:06:25,356 DEBUG zuul.reporter.gerrit.Reporter: Report change , params {'verified': 0}, message: Starting gate-and-submit jobs. [14:34:22] so it did vote verified:0 [14:34:34] (03CR) 10Ayounsi: [C: 03+1] "Some context https://phabricator.wikimedia.org/T191667" [homer/public] - 10https://gerrit.wikimedia.org/r/577564 (owner: 10CDanis) [14:34:38] but gerrit didn't listen [14:37:33] elukey: if you have a minute - https://gerrit.wikimedia.org/r/577570 [14:41:08] volans: ho I got it. 
PS6 got uploaded then a CR+2 applied [14:41:27] at that time the tests had not completed yet thus the patch had no V+2 yet [14:41:40] (03CR) 10Elukey: [C: 03+2] Bump AQS druid snapshot to 2020_02 [puppet] - 10https://gerrit.wikimedia.org/r/577570 (owner: 10Joal) [14:41:42] thus when Zuul started the gate-and-submit, it did vote verified:0 to remove the label properly [14:41:43] (03PS1) 10Muehlenhoff: Enable the sso endpoint [puppet] - 10https://gerrit.wikimedia.org/r/577571 [14:41:54] but since the test pipeline had not reported a verified+2 yet, there was nothing to remove ;) [14:42:03] ahhh got it, thanks! cc cdanis ^^^ [14:42:22] maybe we should have zuul vote +1 [14:42:29] (03PS1) 10Volans: homer: notify noc@ instead of me personally [puppet] - 10https://gerrit.wikimedia.org/r/577572 [14:42:34] and only vote verified +2 when it is about to submit the change in gerrit [14:43:12] (03CR) 10Jbond: [C: 03+1] "lgtm" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/577552 (owner: 10Muehlenhoff) [14:43:52] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::services_proxy::envoy: refactor around listeners [puppet] - 10https://gerrit.wikimedia.org/r/577513 (owner: 10Giuseppe Lavagetto) [14:44:39] hah [14:44:56] !log add cloud-out4 firewall filter in codfw - T246887 [14:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:01] T246887: CloudVPS: introduce filtering for neutron BGP addresses - https://phabricator.wikimedia.org/T246887 [14:45:38] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/21323/" [puppet] - 10https://gerrit.wikimedia.org/r/577571 (owner: 10Muehlenhoff) [14:47:08] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: introduce filtering for neutron BGP addresses - https://phabricator.wikimedia.org/T246887 (10ayounsi) Confirmed that `nc -zv 208.80.153.189 22` doesn't work anymore. While ping to 208.80.153.190 does. I'll send a Homer CR to make it generic a... 
[14:47:14] (03CR) 10Subramanya Sastry: [C: 03+1] "This is what I had to do y'day to get rt testing going again." [puppet] - 10https://gerrit.wikimedia.org/r/577568 (owner: 10C. Scott Ananian) [14:50:32] !log elukey@cumin1001 START - Cookbook sre.aqs.roll-restart [14:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:50] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Implement logic to be able to perform full and incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (10jcrespo) I see, so mostly interested around examining existing backups. I think I would solve your use case this way (not exactly th... [14:53:47] !log elukey@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) [14:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:25] (03PS1) 10Ayounsi: Add cloud-out4 firewall filter [homer/public] - 10https://gerrit.wikimedia.org/r/577575 (https://phabricator.wikimedia.org/T246887) [14:55:51] Daimona: if you have any idea how the Wikidata string got into the cache 🙂 [14:57:18] (03CR) 10Ayounsi: "I feel like we're abusing noc@ these days. Should we send it to the same address we send Puppet private diff emails to instead?" 
[puppet] - 10https://gerrit.wikimedia.org/r/577572 (owner: 10Volans) [15:00:52] (03CR) 10CDanis: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/577572 (owner: 10Volans) [15:02:15] (03PS1) 10Dvorapa: Add vi to langlist-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577577 (https://phabricator.wikimedia.org/T247091) [15:03:03] (03PS2) 10Dvorapa: Add vi to langlist-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577577 (https://phabricator.wikimedia.org/T247091) [15:03:49] (03PS3) 10Reedy: Add vi to langlist-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577577 (https://phabricator.wikimedia.org/T247091) (owner: 10Dvorapa) [15:03:55] (03CR) 10Reedy: [C: 03+2] Add vi to langlist-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577577 (https://phabricator.wikimedia.org/T247091) (owner: 10Dvorapa) [15:04:14] (03PS2) 10Jbond: systemd::syslog: ensure log dir is removed if resource is absent [puppet] - 10https://gerrit.wikimedia.org/r/576364 (https://phabricator.wikimedia.org/T242910) [15:04:34] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/576364 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [15:05:02] (03Merged) 10jenkins-bot: Add vi to langlist-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577577 (https://phabricator.wikimedia.org/T247091) (owner: 10Dvorapa) [15:05:44] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Implement logic to be able to perform full and incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (10Marostegui) That sounds good to me. I don't have any strong opinion on whether `--examine` should go to `analyze_dump.py` or to `rec... 
[15:06:15] (03CR) 10Muehlenhoff: [C: 03+2] CAS: Make the actuators a profile argument [puppet] - 10https://gerrit.wikimedia.org/r/577552 (owner: 10Muehlenhoff) [15:07:04] 10Operations, 10netops: Netbox has incorrect email address for GTT - https://phabricator.wikimedia.org/T246564 (10ayounsi) LGTM. The main advantage I see is that it reduces the amount of passwords we have to rotate when someone leaves. The portal doesn't have a NOC email but phone numbers. I added the 1st and... [15:07:09] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Deployment strategy and hardware requirement for new Flink based WDQS updater - https://phabricator.wikimedia.org/T247058 (10Ottomata) Ping also @Pchelolo for comments on ^ [15:07:14] !log reedy@deploy1001 Synchronized langlist-labs: T247091 (duration: 01m 05s) [15:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:19] T247091: Add vi to langlist-labs - https://phabricator.wikimedia.org/T247091 [15:07:52] 10Operations, 10Analytics, 10Wikidata, 10Wikidata-Query-Service: Deployment strategy and hardware requirement for new Flink based WDQS updater - https://phabricator.wikimedia.org/T247058 (10Ottomata) [15:08:21] (03CR) 10jerkins-bot: [V: 04-1] systemd::syslog: ensure log dir is removed if resource is absent [puppet] - 10https://gerrit.wikimedia.org/r/576364 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [15:09:38] (03PS3) 10Jbond: systemd::syslog: ensure log dir is removed if resource is absent [puppet] - 10https://gerrit.wikimedia.org/r/576364 (https://phabricator.wikimedia.org/T242910) [15:11:26] !log installing libtimedate-perl updates from Stretch point release [15:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:58] (03CR) 10Alexandros Kosiaris: [C: 03+2] Apertium: Update to new upstream release 3.6.1 [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/576664 (https://phabricator.wikimedia.org/T234182) (owner: 
10KartikMistry) [15:14:00] (03PS1) 10Giuseppe Lavagetto: envoy: purge undeclared listeners and clusters definitions [puppet] - 10https://gerrit.wikimedia.org/r/577580 [15:14:02] (03PS1) 10Giuseppe Lavagetto: admin: add a function to my shell [puppet] - 10https://gerrit.wikimedia.org/r/577581 [15:16:50] (03CR) 10RLazarus: [C: 03+1] envoy: purge undeclared listeners and clusters definitions [puppet] - 10https://gerrit.wikimedia.org/r/577580 (owner: 10Giuseppe Lavagetto) [15:26:32] 10Operations, 10Analytics, 10User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10elukey) I checked all the `statistics-users` members and all except two are already in other groups (`analytics-privatedata-users`, `statistics-private... [15:26:42] (03CR) 10Jbond: [C: 03+1] "lgtm but see comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/577571 (owner: 10Muehlenhoff) [15:27:01] (03CR) 10Rush: [C: 03+2] offboard-user: update script so that it can traverse subgroups [puppet] - 10https://gerrit.wikimedia.org/r/577553 (https://phabricator.wikimedia.org/T245771) (owner: 10Jbond) [15:27:21] (03CR) 10Rush: "Confirmed, parent projects in phab. i.e. projects with subprojects cannot have direct members -- only the subprojects can. 
Conversely, if" [puppet] - 10https://gerrit.wikimedia.org/r/577553 (https://phabricator.wikimedia.org/T245771) (owner: 10Jbond) [15:28:39] (03CR) 10Jbond: "chase has confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/577553 (https://phabricator.wikimedia.org/T245771) (owner: 10Jbond) [15:29:23] (03CR) 10Muehlenhoff: Enable the sso endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/577571 (owner: 10Muehlenhoff) [15:34:28] (03PS2) 10Volans: homer: notify ops-private instead of me personally [puppet] - 10https://gerrit.wikimedia.org/r/577572 [15:35:40] (03CR) 10Volans: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/577572 (owner: 10Volans) [15:36:59] (03CR) 10Ayounsi: [C: 03+1] homer: notify ops-private instead of me personally [puppet] - 10https://gerrit.wikimedia.org/r/577572 (owner: 10Volans) [15:39:26] (03PS1) 10Elukey: jupyterhub: add the option to use nodejs 10 [puppet] - 10https://gerrit.wikimedia.org/r/577590 (https://phabricator.wikimedia.org/T247055) [15:40:49] (03PS2) 10Elukey: jupyterhub: add the option to use nodejs 10 [puppet] - 10https://gerrit.wikimedia.org/r/577590 (https://phabricator.wikimedia.org/T247055) [15:45:28] 10Operations, 10Analytics, 10Wikidata, 10Wikidata-Query-Service: Deployment strategy and hardware requirement for new Flink based WDQS updater - https://phabricator.wikimedia.org/T247058 (10Ottomata) A nice feature of Flink is its support for both batch and stream processing. Ideally, we'd be able to buil... [15:49:30] (03CR) 10Ottomata: [C: 03+1] "Nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/577590 (https://phabricator.wikimedia.org/T247055) (owner: 10Elukey) [15:49:38] (03CR) 10Mforns: [wdqs] purge wdqs query logs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/577469 (https://phabricator.wikimedia.org/T247034) (owner: 10DCausse) [15:50:22] 10Operations, 10netops, 10Wikimedia-Incident: Investigate Juniper storm control - https://phabricator.wikimedia.org/T245192 (10ayounsi) Looking good! So the `description storm_control_interface;` is not needed here as the individual interface descriptions have the priority. Now that we have the interface-ra... [15:55:57] 10Operations, 10Maps: disk usage increase on maps servers - https://phabricator.wikimedia.org/T194966 (10Mholloway) 05Open→03Resolved a:03Gehel This particular instance of increasing disk usage appears resolved, so I'm resolving it, but see T243609 re: current disk usage. [15:56:24] 10Operations, 10Maps: Track more detailed disk usage on maps servers - https://phabricator.wikimedia.org/T194997 (10Mholloway) [15:56:26] 10Operations, 10Maps: disk usage increase on maps servers - https://phabricator.wikimedia.org/T194966 (10Mholloway) [15:57:59] (03CR) 10Muehlenhoff: "Better use apt::package_from_component, see 565567 for an example" [puppet] - 10https://gerrit.wikimedia.org/r/577590 (https://phabricator.wikimedia.org/T247055) (owner: 10Elukey) [16:00:26] moritzm: ah yes sorry I forgot about it! [16:01:07] (03CR) 10DCausse: [wdqs] purge wdqs query logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/577469 (https://phabricator.wikimedia.org/T247034) (owner: 10DCausse) [16:01:41] (03PS3) 10DCausse: [wdqs] purge wdqs query logs [puppet] - 10https://gerrit.wikimedia.org/r/577469 (https://phabricator.wikimedia.org/T247034) [16:03:15] (03CR) 10C. Scott Ananian: "> This is what I had to do y'day to get rt testing going again." [puppet] - 10https://gerrit.wikimedia.org/r/577568 (owner: 10C. 
Scott Ananian) [16:03:29] 10Operations, 10Epic, 10Maps (Kartotherian), 10Patch-For-Review: Move Kartotherian and Tilerator to Kubernetes - https://phabricator.wikimedia.org/T216826 (10Mholloway) a:05Mathew.onipe→03None [16:03:48] 10Operations, 10Maps (Kartotherian), 10Patch-For-Review: Create helm chart for kartotherian k8s deployment - https://phabricator.wikimedia.org/T231006 (10Mholloway) a:05Mathew.onipe→03None [16:05:30] (03CR) 10Subramanya Sastry: [C: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/577568 (owner: 10C. Scott Ananian) [16:06:59] 10Operations, 10Cassandra, 10Maps: Collect metrics on maps cassandra - https://phabricator.wikimedia.org/T221055 (10Mholloway) a:05Mathew.onipe→03None [16:07:38] 10Operations, 10Cassandra, 10Maps: Collect metrics on maps cassandra - https://phabricator.wikimedia.org/T221055 (10Mholloway) a:03Mathew.onipe [16:07:55] 10Operations, 10Maps: Find a better partitioning scheme for maps - https://phabricator.wikimedia.org/T224967 (10Mholloway) a:05Mathew.onipe→03None [16:08:20] (03CR) 10CDanis: [C: 03+1] homer: notify ops-private instead of me personally [puppet] - 10https://gerrit.wikimedia.org/r/577572 (owner: 10Volans) [16:08:29] (03PS3) 10Elukey: jupyterhub: add the option to use nodejs 10 [puppet] - 10https://gerrit.wikimedia.org/r/577590 (https://phabricator.wikimedia.org/T247055) [16:09:23] (03CR) 10Volans: [C: 03+2] homer: notify ops-private instead of me personally [puppet] - 10https://gerrit.wikimedia.org/r/577572 (owner: 10Volans) [16:10:36] (03CR) 10Jbond: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/577575 (https://phabricator.wikimedia.org/T246887) (owner: 10Ayounsi) [16:11:34] 10Operations, 10Maps: Find a better partitioning scheme for maps - https://phabricator.wikimedia.org/T224967 (10Mholloway) Current task re: disk space trouble: T243609. 
[16:11:50] 10Operations, 10Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10Mholloway) [16:11:52] 10Operations, 10Maps: Find a better partitioning scheme for maps - https://phabricator.wikimedia.org/T224967 (10Mholloway) [16:14:03] 10Operations, 10Maps, 10SRE-tools, 10User-Joe, 10User-jijiki: Create cookbook to reboot Maps - https://phabricator.wikimedia.org/T224072 (10Mholloway) 05Open→03Resolved [16:14:05] 10Operations, 10SRE-tools, 10User-Joe, 10User-jijiki: Spicerack cookbooks TODO list - https://phabricator.wikimedia.org/T203943 (10Mholloway) [16:14:16] 10Operations, 10Cassandra, 10Maps: Collect metrics on maps cassandra - https://phabricator.wikimedia.org/T221055 (10Mholloway) 05Open→03Resolved [16:16:33] (03PS1) 10Andrew Bogott: keystone: Backport the Rocky version of ldap integration [puppet] - 10https://gerrit.wikimedia.org/r/577602 (https://phabricator.wikimedia.org/T247050) [16:16:55] (03PS2) 10BryanDavis: redirects: Remove redirect handling for techblog.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/577373 (https://phabricator.wikimedia.org/T246507) [16:17:17] (03PS3) 10BryanDavis: techblog.wikimedia.org: Point at upstream service provider [dns] - 10https://gerrit.wikimedia.org/r/577371 (https://phabricator.wikimedia.org/T246507) [16:17:27] (03CR) 10jerkins-bot: [V: 04-1] redirects: Remove redirect handling for techblog.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/577373 (https://phabricator.wikimedia.org/T246507) (owner: 10BryanDavis) [16:17:39] (03CR) 10jerkins-bot: [V: 04-1] keystone: Backport the Rocky version of ldap integration [puppet] - 10https://gerrit.wikimedia.org/r/577602 (https://phabricator.wikimedia.org/T247050) (owner: 10Andrew Bogott) [16:18:46] (03PS1) 10Volans: homer: actually manage the git hook directory [puppet] - 10https://gerrit.wikimedia.org/r/577604 [16:19:41] (03PS2) 10Andrew Bogott: keystone: Backport the Rocky version of ldap integration [puppet] - 
10https://gerrit.wikimedia.org/r/577602 (https://phabricator.wikimedia.org/T247050) [16:19:48] (03CR) 10Mforns: [C: 03+1] "No problem, it's kind of tricky the $$ and \\\\ thing when obtaining the checksum." [puppet] - 10https://gerrit.wikimedia.org/r/577469 (https://phabricator.wikimedia.org/T247034) (owner: 10DCausse) [16:21:09] (03CR) 10Volans: "Compiler results:" [puppet] - 10https://gerrit.wikimedia.org/r/577604 (owner: 10Volans) [16:23:35] (03CR) 10CDanis: [C: 03+1] homer: actually manage the git hook directory [puppet] - 10https://gerrit.wikimedia.org/r/577604 (owner: 10Volans) [16:23:47] cdanis: I hope it is the right way [16:23:54] volans: it looks reasonable to me [16:23:56] want to update stuff and not delete existing ones [16:24:02] i haven't used recurse=>remote but it seems correct [16:24:19] we'll find out soon enough :D [16:24:30] (03CR) 10Volans: [C: 03+2] homer: actually manage the git hook directory [puppet] - 10https://gerrit.wikimedia.org/r/577604 (owner: 10Volans) [16:25:38] (03CR) 10Muehlenhoff: jupyterhub: add the option to use nodejs 10 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/577590 (https://phabricator.wikimedia.org/T247055) (owner: 10Elukey) [16:27:04] (03PS3) 10C. Scott Ananian: Parsoid-testing: Add temporary symlink to old deploy repo [puppet] - 10https://gerrit.wikimedia.org/r/577568 [16:29:11] (03CR) 10Dzahn: [C: 03+2] Parsoid-testing: Add temporary symlink to old deploy repo [puppet] - 10https://gerrit.wikimedia.org/r/577568 (owner: 10C. 
Scott Ananian) [16:29:13] (03CR) 10Elukey: jupyterhub: add the option to use nodejs 10 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/577590 (https://phabricator.wikimedia.org/T247055) (owner: 10Elukey) [16:29:54] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:30:31] seems to have worked as expected cdanis fwiw [16:31:00] (03PS4) 10Elukey: jupyterhub: add the option to use nodejs 10 [puppet] - 10https://gerrit.wikimedia.org/r/577590 (https://phabricator.wikimedia.org/T247055) [16:31:22] cscott: the link changed on scandium just now [16:37:54] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/577590 (https://phabricator.wikimedia.org/T247055) (owner: 10Elukey) [16:39:00] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 22050 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:40:01] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [16:40:04] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:09] 10Operations, 10serviceops, 10Patch-For-Review: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (10ops-monitoring-bot) Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 5 host(s) and their services with reason: new_install ` mw[2325... 
[16:40:31] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [16:40:34] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:39] 10Operations, 10serviceops, 10Patch-For-Review: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (10ops-monitoring-bot) Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 4 host(s) and their services with reason: new_install ` mw[2331... [16:40:56] (03PS4) 10Dzahn: add mw2325-mw2334 as API and appservers, codfw rack B6 [puppet] - 10https://gerrit.wikimedia.org/r/577408 (https://phabricator.wikimedia.org/T247021) [16:42:35] (03CR) 10Dzahn: [C: 03+2] add mw2325-mw2334 as API and appservers, codfw rack B6 [puppet] - 10https://gerrit.wikimedia.org/r/577408 (https://phabricator.wikimedia.org/T247021) (owner: 10Dzahn) [16:45:53] 10Operations, 10ops-eqiad, 10User-jbond, 10cloud-services-team (Hardware): drain cloudvirt1006 for battery replacement - https://phabricator.wikimedia.org/T246908 (10Andrew) 05Open→03Resolved done [16:45:56] 10Operations, 10ops-eqiad, 10User-jbond, 10cloud-services-team (Hardware): (OoW) cloudvirt1006 - RAID battery failed - https://phabricator.wikimedia.org/T222950 (10Andrew) [16:46:11] 10Operations, 10ops-eqiad, 10User-jbond, 10cloud-services-team (Hardware): (OoW) cloudvirt1006 - RAID battery failed - https://phabricator.wikimedia.org/T222950 (10Andrew) this host is drained and ready for maintenance. 
[16:46:56] (03CR) 10Elukey: [C: 03+2] jupyterhub: add the option to use nodejs 10 [puppet] - 10https://gerrit.wikimedia.org/r/577590 (https://phabricator.wikimedia.org/T247055) (owner: 10Elukey) [16:47:58] (03PS1) 10CDanis: cdanis dotfiles: back in home TZ [puppet] - 10https://gerrit.wikimedia.org/r/577613 [16:48:10] (03CR) 10CDanis: [C: 03+2] cdanis dotfiles: back in home TZ [puppet] - 10https://gerrit.wikimedia.org/r/577613 (owner: 10CDanis) [16:54:31] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.22/extensions/WikimediaMaintenance/dumpInterwiki.php: T247097 (duration: 01m 00s) [16:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:36] T247097: dumpInterwiki is using prod lists in labs, labs lists in prod - https://phabricator.wikimedia.org/T247097 [17:02:37] Reedy: Does it create the right output now? [17:04:05] (03PS1) 10Andrew Bogott: openstack haproxy: change glance-api health check [puppet] - 10https://gerrit.wikimedia.org/r/577620 (https://phabricator.wikimedia.org/T242766) [17:07:00] (03CR) 10jerkins-bot: [V: 04-1] openstack haproxy: change glance-api health check [puppet] - 10https://gerrit.wikimedia.org/r/577620 (https://phabricator.wikimedia.org/T242766) (owner: 10Andrew Bogott) [17:07:15] (03PS1) 10Elukey: profile::swap: pass use_nodejs10 parameter to jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/577622 (https://phabricator.wikimedia.org/T247055) [17:08:53] (03PS2) 10Andrew Bogott: openstack haproxy: change glance-api health check [puppet] - 10https://gerrit.wikimedia.org/r/577620 (https://phabricator.wikimedia.org/T242766) [17:10:22] (03CR) 10Elukey: [C: 03+2] profile::swap: pass use_nodejs10 parameter to jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/577622 (https://phabricator.wikimedia.org/T247055) (owner: 10Elukey) [17:13:17] (03PS2) 10Krinkle: wmf-config: Document wgConf.php load order [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577376 [17:13:34] PROBLEM - Widespread puppet 
agent failures on icinga1001 is CRITICAL: 0.01248 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [17:14:01] ^ it's from the first puppet run on 10 servers at once [17:14:18] will recover in a moment [17:15:50] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.001255 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [17:17:01] (03PS3) 10Andrew Bogott: openstack haproxy: change glance-api health check [puppet] - 10https://gerrit.wikimedia.org/r/577620 (https://phabricator.wikimedia.org/T242766) [17:17:05] (03PS1) 10Andrew Bogott: nova-placement: update the usgi init script for Queens [puppet] - 10https://gerrit.wikimedia.org/r/577625 (https://phabricator.wikimedia.org/T242766) [17:18:10] (03PS2) 10Andrew Bogott: nova-placement: update the uwsgi init script for Queens [puppet] - 10https://gerrit.wikimedia.org/r/577625 (https://phabricator.wikimedia.org/T242766) [17:18:12] (03PS4) 10Andrew Bogott: openstack haproxy: change glance-api health check [puppet] - 10https://gerrit.wikimedia.org/r/577620 (https://phabricator.wikimedia.org/T242766) [17:19:27] (03CR) 10Andrew Bogott: [C: 03+2] nova-placement: update the uwsgi init script for Queens [puppet] - 10https://gerrit.wikimedia.org/r/577625 (https://phabricator.wikimedia.org/T242766) (owner: 10Andrew Bogott) [17:23:28] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [17:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:46] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [17:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:03] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:09] 10Operations, 10serviceops, 10Patch-For-Review: move 
all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (10ops-monitoring-bot) Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 4 host(s) and their services with reason: new_install ` mw[2331... [17:26:22] (03PS4) 10Jbond: systemd::syslog: ensure log dir is removed if resource is absent [puppet] - 10https://gerrit.wikimedia.org/r/576364 (https://phabricator.wikimedia.org/T242910) [17:28:29] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:35] 10Operations, 10serviceops, 10Patch-For-Review: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (10ops-monitoring-bot) Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 5 host(s) and their services with reason: new_install ` mw[2325... [17:29:28] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:30:10] (03CR) 10Mholloway: Add recommendation-api chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/565788 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [17:31:50] (03PS5) 10Andrew Bogott: openstack haproxy: change glance-api health check [puppet] - 10https://gerrit.wikimedia.org/r/577620 (https://phabricator.wikimedia.org/T242766) [17:31:52] (03PS1) 10Andrew Bogott: nova-placement haproxy: update the health check route [puppet] - 10https://gerrit.wikimedia.org/r/577630 (https://phabricator.wikimedia.org/T242766) [17:32:20] 10Operations, 10Page Content Service, 10Wikimedia-Logstash, 10observability, and 4 others: Move mobileapps logging to new logging pipeline - https://phabricator.wikimedia.org/T219924 (10Mholloway) @fgiunchedi This should be finished by April at the latest 
for both services. AIUI that's when SCB is planned... [17:33:40] (03CR) 10Krinkle: [C: 03+2] wmf-config: Document wgConf.php load order [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577376 (owner: 10Krinkle) [17:34:13] (03CR) 10Krinkle: "Will do this on Monday." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577366 (owner: 10Krinkle) [17:34:46] (03Merged) 10jenkins-bot: wmf-config: Document wgConf.php load order [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577376 (owner: 10Krinkle) [17:34:55] (03PS3) 10Krinkle: tests: Remove "wiki-suffix disambiguation" dblist structure test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577374 [17:35:00] (03CR) 10Krinkle: [C: 03+2] tests: Remove "wiki-suffix disambiguation" dblist structure test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577374 (owner: 10Krinkle) [17:36:00] (03Merged) 10jenkins-bot: tests: Remove "wiki-suffix disambiguation" dblist structure test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577374 (owner: 10Krinkle) [17:36:38] (03PS4) 10Krinkle: multiversion: Introduce MWMultiVersion::SUFFIXES constant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577366 [17:36:47] (03PS3) 10Krinkle: tests: Move MWWikiversionsTest out of dblistTest.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577375 [17:41:56] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.0 200 OK - 22052 bytes in 0.258 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:42:59] !log krinkle@deploy1001 Synchronized wmf-config/wgConf.php: I260bafdb8e (no-op) (duration: 01m 00s) [17:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:08] (03CR) 10Jhedden: [C: 03+1] openstack haproxy: change glance-api health check [puppet] - 10https://gerrit.wikimedia.org/r/577620 (https://phabricator.wikimedia.org/T242766) (owner: 10Andrew Bogott) [17:51:17] (03CR) 10Jhedden: 
nova-placement haproxy: update the health check route (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/577630 (https://phabricator.wikimedia.org/T242766) (owner: 10Andrew Bogott) [17:51:19] (03CR) 10Andrew Bogott: [C: 03+2] openstack haproxy: change glance-api health check [puppet] - 10https://gerrit.wikimedia.org/r/577620 (https://phabricator.wikimedia.org/T242766) (owner: 10Andrew Bogott) [17:51:32] (03CR) 10Andrew Bogott: [C: 03+2] nova-placement haproxy: update the health check route [puppet] - 10https://gerrit.wikimedia.org/r/577630 (https://phabricator.wikimedia.org/T242766) (owner: 10Andrew Bogott) [17:56:43] (03PS1) 10Andrew Bogott: haproxy nova-placement: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/577635 [17:59:22] (03CR) 10Andrew Bogott: [C: 03+2] haproxy nova-placement: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/577635 (owner: 10Andrew Bogott) [18:00:14] 10Operations, 10ops-eqiad, 10User-jbond, 10cloud-services-team (Hardware): (OoW) cloudvirt1006 - RAID battery failed - https://phabricator.wikimedia.org/T222950 (10RobH) a:05RobH→03Cmjohnson I'm not sure why this is still assigned to me, as the battery was ordered and I cannot do the actual swap. This... 
[18:01:03] (03CR) 10Jhedden: [C: 03+2] codesearch: Prevent ferm from deleting Docker iptables rules [puppet] - 10https://gerrit.wikimedia.org/r/574524 (https://phabricator.wikimedia.org/T246017) (owner: 10BryanDavis) [18:02:38] PROBLEM - mediawiki-installation DSH group on mw2330 is CRITICAL: Host mw2330 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [18:04:02] ^ ack [18:04:45] !log dzahn@cumin1001 conftool action : set/weight=15; selector: name=mw232[5-9].codfw.wmnet [18:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:59] !log dzahn@cumin1001 conftool action : set/weight=15; selector: name=mw233[0-4].codfw.wmnet [18:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:22] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw232[5-9].codfw.wmnet [18:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:35] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw233[0-4].codfw.wmnet [18:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:55] jouncebot: now [18:05:55] For the next 13 hour(s) and 54 minute(s): NO DEPLOYS (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200306T0800) [18:07:16] !log sudo -i cumin -b 15 'mw23[25-34].codfw.wmnet' 'sudo -u dzahn scap pull' [18:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:04] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:09:49] 10Operations, 10serviceops, 10Patch-For-Review: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (10Dzahn) ` {"mw2325.codfw.wmnet": {"weight": 15, "pooled": "yes"}, 
"tags": "dc=codfw,cluster=appserver,service=apache2"} {"mw2325.codfw.wmn... [18:10:48] 10Operations, 10Analytics, 10Wikidata, 10Wikidata-Query-Service: Deployment strategy and hardware requirement for new Flink based WDQS updater - https://phabricator.wikimedia.org/T247058 (10Pchelolo) Yeah, @Gehel analysis is correct - change-prop is pretty simple and doesn't support any of the advanced fea... [18:11:20] 10Operations, 10serviceops, 10Patch-For-Review: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (10Dzahn) mw2325 through mw2334 set to Active in Netbox 10 servers pooled at 18:05 UTC, March 6th. [18:13:36] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3064 is OK: HTTP OK: HTTP/1.0 200 OK - 22049 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:19:37] 10Operations, 10serviceops, 10Patch-For-Review: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (10Dzahn) [18:28:46] (03CR) 10Volans: [C: 03+1] "post-merge +1, LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/577525 (owner: 10Elukey) [18:30:15] (03CR) 10Volans: [C: 03+2] netbox: consolidate naming of resources [puppet] - 10https://gerrit.wikimedia.org/r/566882 (owner: 10Volans) [18:36:00] (03PS1) 10Dzahn: remove install1003/2003 again to recreate with public IPs [dns] - 10https://gerrit.wikimedia.org/r/577640 (https://phabricator.wikimedia.org/T224576) [18:37:53] (03PS1) 10Dzahn: site: switch install1003/2003 to public IPs [puppet] - 10https://gerrit.wikimedia.org/r/577641 (https://phabricator.wikimedia.org/T224576) [18:39:52] (03PS1) 10Ppchelko: Use Request-Timeout header to set jobrunner PHP timeouts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577642 (https://phabricator.wikimedia.org/T247114) [18:40:58] (03CR) 10jerkins-bot: [V: 04-1] Use Request-Timeout header to set 
jobrunner PHP timeouts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577642 (https://phabricator.wikimedia.org/T247114) (owner: 10Ppchelko) [18:43:00] (03PS2) 10Ppchelko: Use Request-Timeout header to set jobrunner PHP timeouts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577642 (https://phabricator.wikimedia.org/T247114) [18:44:57] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [18:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:43] (03PS1) 10Volans: dns: retrocompatibility with older pynetbox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/577644 (https://phabricator.wikimedia.org/T233183) [18:46:47] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [18:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:51] (03CR) 10Ppchelko: Use Request-Timeout header to set jobrunner PHP timeouts (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577642 (https://phabricator.wikimedia.org/T247114) (owner: 10Ppchelko) [18:46:53] 10Operations, 10Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `install1003.eqiad.wmnet` - install1003.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - Found G... 
[18:46:55] (03PS1) 10Krinkle: mediawiki: Change php-wmerrors channel from "fatal" to as "exception" [puppet] - 10https://gerrit.wikimedia.org/r/577645 (https://phabricator.wikimedia.org/T247113) [18:47:07] (03PS2) 10Krinkle: mediawiki: Change php-wmerrors channel from "fatal" to as "exception" [puppet] - 10https://gerrit.wikimedia.org/r/577645 (https://phabricator.wikimedia.org/T247113) [18:49:46] 10Operations: Integrate Stretch 9.12 point update - https://phabricator.wikimedia.org/T244695 (10MoritzMuehlenhoff) [18:52:26] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [18:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:49] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [18:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:55] 10Operations, 10Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `install2003.codfw.wmnet` - install2003.codfw.wmnet (**FAIL**) - Host steps raised exception: Empty M... [18:53:47] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [18:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:44] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [18:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:50] 10Operations, 10Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `install2003.codfw.wmnet` - install2003.codfw.wmnet (**PASS**) - Downtimed host on Icinga - Found G... [18:55:13] 10Operations: smartd not starting properly on gen9 + buster - https://phabricator.wikimedia.org/T246997 (10MoritzMuehlenhoff) >>! 
In T246997#5947571, @fgiunchedi wrote: >>>! In T246997#5947487, @Marostegui wrote: > My understanding is that previously it was working by chance/accident After digging a little furt... [18:56:32] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-lex-tools] - 10https://gerrit.wikimedia.org/r/577045 (https://phabricator.wikimedia.org/T234182) (owner: 10KartikMistry) [18:57:37] (03CR) 10Aaron Schulz: [C: 03+1] db-eqiad,db-codfw.php: Add es5 as new ES, for initial testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577185 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [18:57:39] (03PS1) 10Volans: sre.hosts.decommission: skip mgmt for virtual [cookbooks] - 10https://gerrit.wikimedia.org/r/577646 [18:58:24] (03CR) 10Dzahn: [C: 03+1] sre.hosts.decommission: skip mgmt for virtual [cookbooks] - 10https://gerrit.wikimedia.org/r/577646 (owner: 10Volans) [19:00:25] (03CR) 10CRusnov: [C: 04-1] "The solution to this problem is not __in which I believe will be ignored, but to make the arrays of statuses be python lists instead of tu" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/577644 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [19:02:19] (03CR) 10Dzahn: [C: 03+2] "VMs removed with the cookbook. already gone from ganeti and netbox." 
[dns] - 10https://gerrit.wikimedia.org/r/577640 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [19:03:58] RECOVERY - mediawiki-installation DSH group on mw2330 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:06:36] (03PS2) 10Volans: dns: retrocompatibility with older pynetbox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/577644 (https://phabricator.wikimedia.org/T233183) [19:07:10] (03CR) 10Volans: "> Patch Set 1: Code-Review-1" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/577644 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [19:10:03] (03PS1) 10Dzahn: re-add install1002/install2003 with public IPs [dns] - 10https://gerrit.wikimedia.org/r/577649 (https://phabricator.wikimedia.org/T224576) [19:12:13] (03CR) 10CRusnov: [C: 03+1] "LGTM :)" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/577644 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [19:12:31] (03PS2) 10Dzahn: re-add install1002/install2003 with public IPs [dns] - 10https://gerrit.wikimedia.org/r/577649 (https://phabricator.wikimedia.org/T224576) [19:12:52] (03CR) 10Volans: [C: 03+2] dns: retrocompatibility with older pynetbox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/577644 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [19:14:40] (03PS1) 10Cmjohnson: Finishing mgmt entries for logstash1029 [dns] - 10https://gerrit.wikimedia.org/r/577650 (https://phabricator.wikimedia.org/T240881) [19:15:05] (03PS2) 10Cmjohnson: Finishing mgmt entries for logstash1029 [dns] - 10https://gerrit.wikimedia.org/r/577650 (https://phabricator.wikimedia.org/T240881) [19:16:16] (03CR) 10Cmjohnson: [C: 03+2] Finishing mgmt entries for logstash1029 [dns] - 10https://gerrit.wikimedia.org/r/577650 (https://phabricator.wikimedia.org/T240881) (owner: 10Cmjohnson) [19:19:52] (03CR) 10Aaron Schulz: "See /home/aaron/requested_keys at bast1002@wikimedia.org" [puppet] - 
10https://gerrit.wikimedia.org/r/577535 (owner: 10Aaron Schulz) [19:21:39] 10Operations: upgrade install servers to stretch - https://phabricator.wikimedia.org/T210038 (10Dzahn) Ah, heh., this ticket has been replaced by T224576 . We are currently upgrading to buster right away. [19:21:59] 10Operations: upgrade install servers to stretch - https://phabricator.wikimedia.org/T210038 (10Dzahn) [19:22:01] 10Operations, 10Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10Dzahn) [19:22:36] (03PS1) 10Ori.livneh: php-admin: remove dead code for partial opcache invalidation [puppet] - 10https://gerrit.wikimedia.org/r/577652 [19:25:01] (03CR) 10CDanis: [C: 03+2] Update SSH keys for myself (aaron) [puppet] - 10https://gerrit.wikimedia.org/r/577535 (owner: 10Aaron Schulz) [19:25:03] (03PS2) 10CDanis: Update SSH keys for myself (aaron) [puppet] - 10https://gerrit.wikimedia.org/r/577535 (owner: 10Aaron Schulz) [19:25:25] (03CR) 10jerkins-bot: [V: 04-1] php-admin: remove dead code for partial opcache invalidation [puppet] - 10https://gerrit.wikimedia.org/r/577652 (owner: 10Ori.livneh) [19:26:06] (03CR) 10CRusnov: [C: 03+1] "LGTM assuming the addresses are correct to start with :)" [dns] - 10https://gerrit.wikimedia.org/r/577649 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [19:27:25] (03CR) 10Dzahn: [V: 03+1] Update SSH keys for myself (aaron) [puppet] - 10https://gerrit.wikimedia.org/r/577535 (owner: 10Aaron Schulz) [19:27:49] (03PS2) 10Ori.livneh: php-admin: remove dead code for partial opcache invalidation [puppet] - 10https://gerrit.wikimedia.org/r/577652 [19:29:57] greg-g: Done the survey! Thought I would as a new developer. 
[19:30:23] RhinosF1: thank you [19:30:24] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:30:59] greg-g: no problem. I would say I should have done it earlier but I only had my first deploy on Wednesday! [19:31:19] ^ icinga alert = Zayo transport [19:31:26] checking maint-announce [19:32:08] (03PS1) 10C. Scott Ananian: Fix permissions on /srv/parsoid-testing [puppet] - 10https://gerrit.wikimedia.org/r/577654 [19:36:00] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:36:16] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: ELK7 shards failed errors when loading saved objects, e.g. "field expansion matches too many fields, limit: 1024, got: 1726" - https://phabricator.wikimedia.org/T247014 (10EBernhardson) ` health status index uuid pri r... [19:36:29] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn Zayo is being mailed. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:38:21] (03CR) 10Dzahn: [C: 03+2] re-add install1002/install2003 with public IPs [dns] - 10https://gerrit.wikimedia.org/r/577649 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [19:38:30] (03PS3) 10Dzahn: re-add install1002/install2003 with public IPs [dns] - 10https://gerrit.wikimedia.org/r/577649 (https://phabricator.wikimedia.org/T224576) [19:39:12] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10Cmjohnson) These were racked out of order, fixing the physical locations and corresponding port numbers wdqs1011: A5, 19 wdqs1012: B5 ,34 wdqs1013:C5 , 39 [19:42:07] (03PS1) 10C. Scott Ananian: Check out parsoid deploy modules using git::clone, not scap [puppet] - 10https://gerrit.wikimedia.org/r/577656 [19:43:32] (03CR) 10Dzahn: "[authdns1001:~] $ host install1003" [dns] - 10https://gerrit.wikimedia.org/r/577649 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [19:43:46] (03PS1) 10C. Scott Ananian: WIP: fix logspam script [puppet] - 10https://gerrit.wikimedia.org/r/577657 [19:44:39] (03CR) 10jerkins-bot: [V: 04-1] Check out parsoid deploy modules using git::clone, not scap [puppet] - 10https://gerrit.wikimedia.org/r/577656 (owner: 10C. 
Scott Ananian) [19:46:21] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [19:46:21] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [19:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:40] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [19:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:27] (03PS2) 10Volans: sre.hosts.decommission: skip mgmt for virtual [cookbooks] - 10https://gerrit.wikimedia.org/r/577646 [19:47:53] mutante: also the makevm has been refactored heavily recently, so lmk if you encounter any issue ;) [19:47:57] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [19:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:14] volans: oh, alright! [19:48:40] !log re-creating install1003 and install2003 with same specs as before but public IP (T244390) [19:48:40] pretty much same functionality but was moved into the Ganeti module in spicerack, before was all inlined in the cookbook [19:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:45] T244390: VM requests for install_server replacements - https://phabricator.wikimedia.org/T244390 [19:49:22] ok [19:51:27] (03PS2) 10Papaul: DNS: Remove mgmt DNS for payments200[1-3] [dns] - 10https://gerrit.wikimedia.org/r/577411 [19:52:15] (03CR) 10Dzahn: [C: 03+1] DNS: Remove mgmt DNS for payments200[1-3] [dns] - 10https://gerrit.wikimedia.org/r/577411 (owner: 10Papaul) [19:53:34] 10Operations, 10vm-requests: VM requests for install_server replacements - https://phabricator.wikimedia.org/T244390 (10Dzahn) VMs with private IPs have been removed again (with cookbook) and new VMs have been created in wikimedia.org. 
[19:55:16] 10Operations, 10Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10Dzahn) install1003.eqiad.wmnet and install2003.codfw.wmnet have been removed entirely with the decom cookbook. DNS has been adjusted and then VMs with identical specs but public IPs have been cre... [19:55:43] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt DNS for payments200[1-3] [dns] - 10https://gerrit.wikimedia.org/r/577411 (owner: 10Papaul) [19:55:51] (03PS3) 10Papaul: DNS: Remove mgmt DNS for payments200[1-3] [dns] - 10https://gerrit.wikimedia.org/r/577411 [19:55:55] (03CR) 10Papaul: [V: 03+2 C: 03+2] DNS: Remove mgmt DNS for payments200[1-3] [dns] - 10https://gerrit.wikimedia.org/r/577411 (owner: 10Papaul) [19:56:25] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [19:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:36] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [19:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:59] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission WMF6141 (old payments2001.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T246697 (10Papaul) [19:58:20] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission WMF6141 (old payments2001.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T246697 (10Papaul) 05Open→03Resolved Complete [19:58:49] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission WMF6143 (old payments2002.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T246698 (10Papaul) [19:59:00] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission WMF6143 (old payments2002.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T246698 (10Papaul) 05Open→03Resolved Complete [19:59:17] (03PS2) 10C. 
Scott Ananian: Check out parsoid deploy modules using git::clone, not scap [puppet] - 10https://gerrit.wikimedia.org/r/577656 [20:00:25] (03PS1) 10Dzahn: DHCP: switch install1003/2003 to public IP, new VMs, update MACs [puppet] - 10https://gerrit.wikimedia.org/r/577663 (https://phabricator.wikimedia.org/T244390) [20:01:48] (03CR) 10Dzahn: [C: 03+2] site: switch install1003/2003 to public IPs [puppet] - 10https://gerrit.wikimedia.org/r/577641 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [20:02:13] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission WMF6142 (old payments2003.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T246699 (10Papaul) [20:02:19] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@dda3d28]: Re-deploy python3.7 upgrade [20:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:38] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission WMF6142 (old payments2003.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T246699 (10Papaul) 05Open→03Resolved Complete [20:02:53] (03PS2) 10Dzahn: DHCP: switch install1003/2003 to public IP, new VMs, update MACs [puppet] - 10https://gerrit.wikimedia.org/r/577663 (https://phabricator.wikimedia.org/T244390) [20:05:23] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10Cmjohnson) [20:05:28] (03PS3) 10Dzahn: DHCP: switch install1003/2003 to public IP, new VMs, update MACs [puppet] - 10https://gerrit.wikimedia.org/r/577663 (https://phabricator.wikimedia.org/T244390) [20:06:38] (03CR) 10Dzahn: [C: 03+2] DHCP: switch install1003/2003 to public IP, new VMs, update MACs [puppet] - 10https://gerrit.wikimedia.org/r/577663 (https://phabricator.wikimedia.org/T244390) (owner: 10Dzahn) [20:07:33] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@dda3d28]: Re-deploy python3.7 upgrade (duration: 05m 14s) 
[20:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:49] volans: i have no issues to report with makevm. now installing OS. i should also use wmf-auto-reimage-host --new for it, right? [20:10:24] mutante: no, sorry, the reimage script does not support VMs for now [20:10:27] last time i still just started the VM on ganeti [20:10:29] ACKNOWLEDGEMENT - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP CDanis Zayo TTN-0003933687 https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:10:47] volans: ok, just making sure. i will do it like i did last time [20:10:54] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Zayo TTN-0003933687 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:11:07] it should all be documented on wikitech [20:11:31] ack [20:17:59] yep, no issues with the workflow and got debian installer on both [20:23:30] !log post-deploy restart mjolnir bulk and msearch daemons across eqiad and codfw [20:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:25] 10Operations, 10netops, 10Wikimedia-Incident: Add graceful-restart to cr2-esams - https://phabricator.wikimedia.org/T246338 (10CDanis) 05Open→03Resolved Discussion of rolling out `graceful-restart` to other dual-RE routers is at T191667#5948038 [20:40:05] (03CR) 10Bstorm: [C: 03+2] toolforge: remove old k8s client material for Jessie [puppet] - 10https://gerrit.wikimedia.org/r/576995 (https://phabricator.wikimedia.org/T246689) (owner: 10Bstorm) [20:40:13] 10Operations, 10cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10JHedden) [20:40:41] (03PS3) 10Andrew Bogott: keystone: Backport the Rocky version of ldap integration [puppet] - 
10https://gerrit.wikimedia.org/r/577602 (https://phabricator.wikimedia.org/T247050) [20:40:45] (03PS1) 10Andrew Bogott: Queens keystone: add a hack for utf8 decoding in already-hacked ldap handler [puppet] - 10https://gerrit.wikimedia.org/r/577669 (https://phabricator.wikimedia.org/T247050) [20:43:17] (03CR) 10Mholloway: [C: 03+1] Recommendation API: upgrade node to version 10 [puppet] - 10https://gerrit.wikimedia.org/r/560454 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [20:54:53] (03CR) 10RLazarus: [C: 03+1] "As discussed in #wikimedia-serviceops. Not my area of expertise but looks like a plausible fix for the issue you describe, let's give it a" [puppet] - 10https://gerrit.wikimedia.org/r/577654 (owner: 10C. Scott Ananian) [21:00:21] (03CR) 10RLazarus: [C: 03+2] httpbb: Replace apache-fast-test with httpbb in deploy_apache_change. [puppet] - 10https://gerrit.wikimedia.org/r/576485 (owner: 10RLazarus) [21:10:06] (03PS1) 10RLazarus: puppetmaster: Treat 'y' like 'yes' in puppet-merge. [puppet] - 10https://gerrit.wikimedia.org/r/577670 [21:11:45] (03PS1) 10Jon Harald Søby: Add `fkv` Kven to $wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577671 (https://phabricator.wikimedia.org/T167259) [21:13:52] (03CR) 10CDanis: puppetmaster: Treat 'y' like 'yes' in puppet-merge. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/577670 (owner: 10RLazarus) [21:16:04] (03PS2) 10RLazarus: puppetmaster: Treat 'y' like 'yes' in puppet-merge. [puppet] - 10https://gerrit.wikimedia.org/r/577670 [21:17:06] (03CR) 10RLazarus: puppetmaster: Treat 'y' like 'yes' in puppet-merge. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/577670 (owner: 10RLazarus) [21:17:59] (03PS1) 10Volans: spicerack: allow to cache the Ipmi instance [software/spicerack] - 10https://gerrit.wikimedia.org/r/577672 [21:18:25] (03CR) 10CDanis: [C: 03+1] puppetmaster: Treat 'y' like 'yes' in puppet-merge. 
[puppet] - 10https://gerrit.wikimedia.org/r/577670 (owner: 10RLazarus) [21:19:40] (03CR) 10RLazarus: [C: 03+2] puppetmaster: Treat 'y' like 'yes' in puppet-merge. [puppet] - 10https://gerrit.wikimedia.org/r/577670 (owner: 10RLazarus) [21:20:16] (03CR) 10Volans: "Not a super fan of things that simplify muscle-memory for stopgap actions where people should really look at the diff." [puppet] - 10https://gerrit.wikimedia.org/r/577670 (owner: 10RLazarus) [21:21:21] volans: we can discuss if you want -- personally, my experience is that I review the diff, then type "y", then get frustrated that the tool exits for no reason :) [21:21:33] I don't think the two are connected [21:21:37] :) [21:22:08] 'es' is probably not enough anyway I guess to force looking [21:22:22] resolving this equation would do the job better :-P [21:22:26] I do like that "multiple" is a different thing you have to type, and I was careful to maintain that [21:22:44] yeah noticed that, thanks. That was one of my first contributions ;) [21:23:48] (03CR) 10Volans: [C: 03+2] "Last PS just updated a comment." [cookbooks] - 10https://gerrit.wikimedia.org/r/577646 (owner: 10Volans) [21:25:23] (03Merged) 10jenkins-bot: sre.hosts.decommission: skip mgmt for virtual [cookbooks] - 10https://gerrit.wikimedia.org/r/577646 (owner: 10Volans) [21:25:55] (03CR) 10Volans: "Deleted /srv/automation on both netbox hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/566882 (owner: 10Volans) [21:54:32] (03PS1) 10Clarakosi: cpjobqueue: Add jobrunner_host & videoscaler_host to deployment vars [puppet] - 10https://gerrit.wikimedia.org/r/577677 (https://phabricator.wikimedia.org/T246371) [22:00:13] PROBLEM - Host cloudvirt1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [22:05:39] 10Operations, 10ops-eqiad, 10User-jbond, 10cloud-services-team (Hardware): (OoW) cloudvirt1006 - RAID battery failed - https://phabricator.wikimedia.org/T222950 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr Replaced Failed battery [22:07:48] 10Operations, 10ops-eqiad, 10User-jbond, 10cloud-services-team (Hardware): (OoW) cloudvirt1006 - RAID battery failed - https://phabricator.wikimedia.org/T222950 (10Jclark-ctr) 05Open→03Resolved [22:08:01] RECOVERY - Host cloudvirt1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [22:08:59] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@18f13e4]: update to pyhton3.7, ship articletopic propagation [22:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:35] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@18f13e4]: update to pyhton3.7, ship articletopic propagation (duration: 00m 36s) [22:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:16] 10Operations, 10cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10JHedden) [22:24:57] (03CR) 10RLazarus: [C: 03+2] Fix permissions on /srv/parsoid-testing [puppet] - 10https://gerrit.wikimedia.org/r/577654 (owner: 10C. 
Scott Ananian) [22:28:52] (03PS1) 10Reedy: Update interwiki-labs.php again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577683 (https://phabricator.wikimedia.org/T247091) [22:29:38] (03CR) 10Reedy: [C: 03+2] Update interwiki-labs.php again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577683 (https://phabricator.wikimedia.org/T247091) (owner: 10Reedy) [22:30:36] (03Merged) 10jenkins-bot: Update interwiki-labs.php again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577683 (https://phabricator.wikimedia.org/T247091) (owner: 10Reedy) [22:33:00] (03CR) 10Aaron Schulz: mcrouter: add gutter pool servers in configuration (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [22:33:06] (03CR) 10Aaron Schulz: [C: 04-1] mcrouter: add gutter pool servers in configuration [puppet] - 10https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [22:33:17] !log reedy@deploy1001 Synchronized wmf-config/interwiki-labs.php: T247091 (duration: 00m 57s) [22:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:22] T247091: Add vi to langlist-labs - https://phabricator.wikimedia.org/T247091 [22:34:08] (03CR) 10Aaron Schulz: [C: 04-1] mcrouter: add gutter pool servers in configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [22:34:58] 10Operations, 10netops, 10Wikimedia-Incident: Investigate Juniper storm control - https://phabricator.wikimedia.org/T245192 (10Papaul) - Removed all interfaces xxx unit 0 family ethernet-switching of the interfaces covered by the two existing interface-range (yours and the existing disabled one) - Commit us... 
[22:40:23] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission labstore2001.codfw.wmnet and labstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T243329 (10Papaul) [22:40:28] (03PS1) 10Cmjohnson: Add dhcpd/netboot.cfg and site.pp (role::spare) logstash102[6-9] [puppet] - 10https://gerrit.wikimedia.org/r/577685 (https://phabricator.wikimedia.org/T240881) [22:43:14] (03CR) 10Ppchelko: [C: 03+1] cpjobqueue: Add jobrunner_host & videoscaler_host to deployment vars [puppet] - 10https://gerrit.wikimedia.org/r/577677 (https://phabricator.wikimedia.org/T246371) (owner: 10Clarakosi) [23:04:54] !log signing puppet certs for install1003/install2003, initial puppet runs [23:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:57] (03PS1) 10Papaul: DNS: Remove mgmt DNS for labstore200[1-4] [dns] - 10https://gerrit.wikimedia.org/r/577690 [23:08:34] (03CR) 10Dzahn: [C: 03+1] "wmf numbers matching in netbox" [dns] - 10https://gerrit.wikimedia.org/r/577690 (owner: 10Papaul) [23:09:03] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt DNS for labstore200[1-4] [dns] - 10https://gerrit.wikimedia.org/r/577690 (owner: 10Papaul) [23:10:49] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission labstore2001.codfw.wmnet and labstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T243329 (10Papaul) [23:11:17] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission labstore2001.codfw.wmnet and labstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T243329 (10Papaul) 05Open→03Resolved complete [23:12:06] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10cloud-services-team (Hardware): decommission labstore2003.codfw.wmnet and labstore2004.codfw.wmnet - https://phabricator.wikimedia.org/T243319 (10Papaul) [23:12:21] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10cloud-services-team (Hardware): decommission labstore2003.codfw.wmnet and labstore2004.codfw.wmnet - 
https://phabricator.wikimedia.org/T243319 (10Papaul) 05Open→03Resolved complete [23:13:21] (03PS1) 10Cmjohnson: Adding mgmt dns wdqs10[1-3] [dns] - 10https://gerrit.wikimedia.org/r/577692 (https://phabricator.wikimedia.org/T246352) [23:15:24] 10Operations, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Rename multimedia-team to structured-data-team - https://phabricator.wikimedia.org/T235550 (10MBinder_WMF) @jbond Thanks for volunteering to update the clinic duty rotation. Is there somewhere I can see that? I would love to know with whom I should... [23:15:41] (03CR) 10Dzahn: [C: 03+1] Adding mgmt dns wdqs10[1-3] [dns] - 10https://gerrit.wikimedia.org/r/577692 (https://phabricator.wikimedia.org/T246352) (owner: 10Cmjohnson) [23:16:05] mutante can you check the site.pp for this please https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/577685/ [23:16:56] (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns wdqs10[1-3] [dns] - 10https://gerrit.wikimedia.org/r/577692 (https://phabricator.wikimedia.org/T246352) (owner: 10Cmjohnson) [23:17:03] (03PS2) 10Cmjohnson: Adding mgmt dns wdqs10[1-3] [dns] - 10https://gerrit.wikimedia.org/r/577692 (https://phabricator.wikimedia.org/T246352) [23:17:08] (03CR) 10Cmjohnson: [V: 03+2 C: 03+2] Adding mgmt dns wdqs10[1-3] [dns] - 10https://gerrit.wikimedia.org/r/577692 (https://phabricator.wikimedia.org/T246352) (owner: 10Cmjohnson) [23:17:28] cmjohnson1: yep, one moment [23:18:55] cmjohnson1: the commit message says 102[6-9] but in netboot.cfg it is 201[6-9] [23:19:52] hah..okay..I see what I did there...anything else before I make that change [23:20:02] cmjohnson1: the site.pp part looks good [23:20:14] no, the rest looks fine to me [23:20:17] thx! 
[23:20:19] yw [23:21:36] (03PS2) 10Cmjohnson: Add dhcpd/netboot.cfg and site.pp (role::spare) logstash102[6-9] [puppet] - 10https://gerrit.wikimedia.org/r/577685 (https://phabricator.wikimedia.org/T240881) [23:22:48] (03CR) 10Dzahn: [C: 03+2] Add Prometheus exporter for Squid [puppet] - 10https://gerrit.wikimedia.org/r/575342 (https://phabricator.wikimedia.org/T245176) (owner: 10Ayounsi) [23:23:30] (03CR) 10Dzahn: [C: 03+1] Add dhcpd/netboot.cfg and site.pp (role::spare) logstash102[6-9] [puppet] - 10https://gerrit.wikimedia.org/r/577685 (https://phabricator.wikimedia.org/T240881) (owner: 10Cmjohnson) [23:26:12] (03PS2) 10Dzahn: Add Prometheus exporter for Squid [puppet] - 10https://gerrit.wikimedia.org/r/575342 (https://phabricator.wikimedia.org/T245176) (owner: 10Ayounsi) [23:26:52] (03CR) 10Cmjohnson: [C: 03+2] Add dhcpd/netboot.cfg and site.pp (role::spare) logstash102[6-9] [puppet] - 10https://gerrit.wikimedia.org/r/577685 (https://phabricator.wikimedia.org/T240881) (owner: 10Cmjohnson) [23:28:52] (03CR) 10Dzahn: [C: 03+2] Add Prometheus exporter for Squid [puppet] - 10https://gerrit.wikimedia.org/r/575342 (https://phabricator.wikimedia.org/T245176) (owner: 10Ayounsi) [23:31:46] (03CR) 10Dzahn: "the service is called simply "squid" and not "squid3". fixing that." [puppet] - 10https://gerrit.wikimedia.org/r/575342 (https://phabricator.wikimedia.org/T245176) (owner: 10Ayounsi) [23:33:38] (03CR) 10Dzahn: "File[/srv/prometheus/ops/targets/squid_eqiad.yaml was created on prometheus1003" [puppet] - 10https://gerrit.wikimedia.org/r/575342 (https://phabricator.wikimedia.org/T245176) (owner: 10Ayounsi) [23:34:31] PROBLEM - Check systemd state on install2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:35:46] (03CR) 10Dzahn: "re " # Squid3 and not Squid as it's squid3 when >= stretch" but then in buster it is Squid again and Squid3 is just a transitional package" [puppet] - 10https://gerrit.wikimedia.org/r/575342 (https://phabricator.wikimedia.org/T245176) (owner: 10Ayounsi) [23:36:35] ACKNOWLEDGEMENT - Check systemd state on install2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn install in progress https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:37:35] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2005.codfw.wmnet - https://phabricator.wikimedia.org/T246666 (10Papaul) [23:38:56] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2004.codfw.wmnet - https://phabricator.wikimedia.org/T246669 (10Papaul) [23:40:05] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2006.codfw.wmnet - https://phabricator.wikimedia.org/T246329 (10Papaul) [23:40:52] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash, 10Patch-For-Review: (Need by: 2020-03-06) rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (10Cmjohnson) [23:41:52] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash, 10Patch-For-Review: (Need by: 2020-03-06) rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (10Cmjohnson) moved these to 10G racks today, updated all the network ports and did the operations/puppet updates [23:43:24] (03PS1) 10Dzahn: prometheus::squid_exporter: service is called 'squid' again on buster [puppet] - 10https://gerrit.wikimedia.org/r/577694 (https://phabricator.wikimedia.org/T245176) [23:45:54] (03CR) 10Dzahn: [C: 03+2] prometheus::squid_exporter: service is called 'squid' again on buster [puppet] - 10https://gerrit.wikimedia.org/r/577694 
(https://phabricator.wikimedia.org/T245176) (owner: 10Dzahn) [23:46:39] PROBLEM - Check systemd state on install1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:47:15] ACKNOWLEDGEMENT - Check systemd state on install1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn fix in progress https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:50:38] !log install1003/2003 - starting DHCP servers and letting puppet stop them again to clear systemd state [23:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:19] RECOVERY - Check systemd state on install1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:51:21] RECOVERY - Check systemd state on install2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:55:09] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: rack/setup/install frpm2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242269 (10Dwisehaupt) [23:55:30] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: rack/setup/install frpm2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242269 (10Dwisehaupt) Interface bonding set up and activated.