[00:20:29] (PS2) HMonroy: Enable watchlist expiry feature on Wikidata & Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/641250 (https://phabricator.wikimedia.org/T266874)
[00:26:41] (PS1) Gergő Tisza: GrowthExperiments: Enable help panel top-posting on ruwiki, svwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/641282
[00:32:19] I want to deploy that ^ beta-only config patch, can that interfere with T264991?
[00:32:19] T264991: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991
[00:33:58] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (14) node(s) change every puppet run: ms-be2031.codfw.wmnet, scb2005.codfw.wmnet, scb2006.codfw.wmnet, peek2001.codfw.wmnet, deploy1002.eqiad.wmnet, scb1003.eqiad.wmnet, scb1001.eqiad.wmnet, scb2004.codfw.wmnet, scb1004.eqiad.wmnet, scb1002.eqiad.wmnet, scb2002.codfw.wmnet, scb2003.codfw.wmnet, scb2001.codfw
[00:33:58] eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[00:37:07] (PS3) Dzahn: wmcs::instance: remove diamond removal remnants [puppet] - https://gerrit.wikimedia.org/r/632570 (https://phabricator.wikimedia.org/T210993)
[00:45:32] (CR) Dzahn: "UID matches what is in Wikitech LDAP (not exactly the email address but makes sense), key matches what is in the ticket. groups also match" [puppet] - https://gerrit.wikimedia.org/r/641241 (https://phabricator.wikimedia.org/T267817) (owner: Herron)
[00:50:01] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to researchers, analytics-privatedata-users and wmf LDAP for fkaelin - https://phabricator.wikimedia.org/T267817 (Dzahn) >>! In T267817#6624032, @Ottomata wrote: > Fabian will also need analyitcs-privatedata-users. I edited the tas...
[01:00:14] (CR) Dzahn: poolcounter: add client configuration classes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/635992 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[01:01:07] (CR) CRusnov: "This change is ready for review." [dns] - https://gerrit.wikimedia.org/r/641284 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov)
[01:01:17] (CR) CRusnov: "This change is ready for review." [dns] - https://gerrit.wikimedia.org/r/641285 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov)
[01:03:37] (CR) CRusnov: "This is erroring (failing ci) on `kubetcd2004.codfw.wmnet' not being found as an address, but as far as I can see it is present in the exp" [dns] - https://gerrit.wikimedia.org/r/641284 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov)
[01:06:17] (PS1) Dzahn: poolcounter: do not attempt to install python3-poolcounter on jessie [puppet] - https://gerrit.wikimedia.org/r/641307
[01:06:30] (CR) Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/641307" [puppet] - https://gerrit.wikimedia.org/r/635992 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[01:07:40] (CR) jerkins-bot: [V: -1] poolcounter: do not attempt to install python3-poolcounter on jessie [puppet] - https://gerrit.wikimedia.org/r/641307 (owner: Dzahn)
[01:08:19] (PS2) Dzahn: poolcounter: do not attempt to install python3-poolcounter on jessie [puppet] - https://gerrit.wikimedia.org/r/641307
[01:10:40] Operations, observability, Patch-For-Review, User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (Dzahn)
[01:10:50] (Abandoned) Dzahn: delete the diamond module [puppet] - https://gerrit.wikimedia.org/r/632572 (https://phabricator.wikimedia.org/T210993) (owner: Dzahn)
[01:11:11] Operations, observability, Patch-For-Review, cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (Dzahn) Open→Stalled blocked by T264920
[01:13:54] Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install deploy2002 - https://phabricator.wikimedia.org/T266363 (Dzahn) Thank you:)
[01:16:15] (PS1) Dzahn: mailman: replace require_package with ensure_packages [puppet] - https://gerrit.wikimedia.org/r/641309 (https://phabricator.wikimedia.org/T266479)
[01:19:18] (PS1) Dzahn: iegreview: replace require_package with ensure_packages [puppet] - https://gerrit.wikimedia.org/r/641310 (https://phabricator.wikimedia.org/T266479)
[01:19:40] (CR) jerkins-bot: [V: -1] iegreview: replace require_package with ensure_packages [puppet] - https://gerrit.wikimedia.org/r/641310 (https://phabricator.wikimedia.org/T266479) (owner: Dzahn)
[01:20:56] (PS1) Dzahn: httpbb: replace require_package with ensure_packages [puppet] - https://gerrit.wikimedia.org/r/641311 (https://phabricator.wikimedia.org/T266479)
[01:22:49] (PS1) Dzahn: aptrepo: replace require_package with ensure_packages [puppet] - https://gerrit.wikimedia.org/r/641312 (https://phabricator.wikimedia.org/T266479)
[01:23:20] Operations, observability: smart-data-dump should fail loudly when it can't gather metrics - https://phabricator.wikimedia.org/T267135 (colewhite) 25 hosts affected `device_smart_device_count < 1`
[01:23:50] Operations, LDAP-Access-Requests: Request for LDAP Access in order to access Superset for IJethroBT-WMF - https://phabricator.wikimedia.org/T267962 (IJethroBT-WMF) Thanks @Dzahn , and apologies for assigning this to you out of process!
[01:23:57] (PS1) Dzahn: noc: replace require_package with ensure_packages [puppet] - https://gerrit.wikimedia.org/r/641313 (https://phabricator.wikimedia.org/T266479)
[01:24:46] (CR) Cwhite: [C: +1] role: add Alertmanager API profile [puppet] - https://gerrit.wikimedia.org/r/641192 (https://phabricator.wikimedia.org/T266017) (owner: Filippo Giunchedi)
[01:25:28] (CR) Cwhite: [C: +1] profile: add Alertmanager API virtual host [puppet] - https://gerrit.wikimedia.org/r/641191 (https://phabricator.wikimedia.org/T266017) (owner: Filippo Giunchedi)
[01:26:09] (PS1) Dzahn: git: replace require_package with ensure_packages [puppet] - https://gerrit.wikimedia.org/r/641314 (https://phabricator.wikimedia.org/T266479)
[01:27:32] (CR) jerkins-bot: [V: -1] git: replace require_package with ensure_packages [puppet] - https://gerrit.wikimedia.org/r/641314 (https://phabricator.wikimedia.org/T266479) (owner: Dzahn)
[01:51:50] Operations, ops-eqiad, DC-Ops, serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (Dzahn) @cmjohnson I'm back. we can continue with these second batch. Ideally in 2 batches. Let's talk what times are best.
[01:52:31] (PS1) Dzahn: aptrepo: replace cron with systemd timer [puppet] - https://gerrit.wikimedia.org/r/641315 (https://phabricator.wikimedia.org/T265138)
[01:53:11] (CR) jerkins-bot: [V: -1] aptrepo: replace cron with systemd timer [puppet] - https://gerrit.wikimedia.org/r/641315 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[01:59:13] Operations, ops-eqiad, DC-Ops, serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (Dzahn) @Cmjohnson So the servers in at the bottom of B7 (mw1313-mw1318), should we move them to B4? That has the space and no other mw servers yet. Would that work...
[02:07:14] (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.18 [core] (wmf/1.36.0-wmf.18) - https://gerrit.wikimedia.org/r/641319
[02:07:38] (PS2) DannyS712: Branch commit for wmf/1.36.0-wmf.18 [core] (wmf/1.36.0-wmf.18) - https://gerrit.wikimedia.org/r/641319 (https://phabricator.wikimedia.org/T263184) (owner: TrainBranchBot)
[02:47:04] (PS11) Ryan Kemper: Bring 3 new eqiad wdqs nodes into service [puppet] - https://gerrit.wikimedia.org/r/634381 (https://phabricator.wikimedia.org/T246345)
[03:08:18] (CR) Ryan Kemper: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26447" [puppet] - https://gerrit.wikimedia.org/r/634381 (https://phabricator.wikimedia.org/T246345) (owner: Ryan Kemper)
[03:11:36] (CR) Ryan Kemper: [V: +1] "> Patch Set 10:" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/634381 (https://phabricator.wikimedia.org/T246345) (owner: Ryan Kemper)
[03:20:35] (PS1) Krinkle: filerepo: remove repo name from getSharedCacheKey() [core] (wmf/1.36.0-wmf.18) - https://gerrit.wikimedia.org/r/641290 (https://phabricator.wikimedia.org/T267668)
[03:20:41] (CR) Krinkle: [C: +2] filerepo: remove repo name from getSharedCacheKey() [core] (wmf/1.36.0-wmf.18) - https://gerrit.wikimedia.org/r/641290 (https://phabricator.wikimedia.org/T267668) (owner: Krinkle)
[03:43:32] (Merged) jenkins-bot: filerepo: remove repo name from getSharedCacheKey() [core] (wmf/1.36.0-wmf.18) - https://gerrit.wikimedia.org/r/641290 (https://phabricator.wikimedia.org/T267668) (owner: Krinkle)
[04:04:02] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:16:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:25:08] (CR) Samwilson: [C: +1] Enable watchlist expiry feature on Wikidata & Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/641250 (https://phabricator.wikimedia.org/T266874) (owner: HMonroy)
[05:48:01] Operations, Security-Team, Performance-Team (Radar), Security, and 2 others: Burst of connections on ruwikinews (s3) - https://phabricator.wikimedia.org/T262240 (Marostegui) Sounds good to me
[05:48:53] Operations, Security-Team, Performance-Team (Radar), Security, and 2 others: Burst of connections on ruwikinews (s3) - https://phabricator.wikimedia.org/T262240 (Bawolff)
[06:21:34] (CR) Gehel: "Last minor comment inline. Feel free to merge once this is addressed." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/634381 (https://phabricator.wikimedia.org/T246345) (owner: Ryan Kemper)
[06:30:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1015 (re)pooling @ 10%: Slowly pool es1015 after cloning es1033 T261717', diff saved to https://phabricator.wikimedia.org/P13269 and previous config saved to /var/cache/conftool/dbconfig/20201117-063019-root.json
[06:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:29] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[06:30:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1019 (re)pooling @ 10%: Slowly pool es1019 after cloning es1034 T261717', diff saved to https://phabricator.wikimedia.org/P13270 and previous config saved to /var/cache/conftool/dbconfig/20201117-063043-root.json
[06:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:58] (PS1) Marostegui: es1033,es1034: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/641327 (https://phabricator.wikimedia.org/T261717)
[06:33:05] (CR) Marostegui: [C: +2] es1033,es1034: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/641327 (https://phabricator.wikimedia.org/T261717) (owner: Marostegui)
[06:35:11] (PS1) Marostegui: instances.yaml: Add es1033,es1034 [puppet] - https://gerrit.wikimedia.org/r/641328 (https://phabricator.wikimedia.org/T261717)
[06:35:40] (CR) Marostegui: [C: +2] instances.yaml: Add es1033,es1034 [puppet] - https://gerrit.wikimedia.org/r/641328 (https://phabricator.wikimedia.org/T261717) (owner: Marostegui)
[06:38:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es1033 with minimum weight on es2 T261717', diff saved to https://phabricator.wikimedia.org/P13271 and previous config saved to /var/cache/conftool/dbconfig/20201117-063805-marostegui.json
[06:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:12] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[06:39:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es1034 with minimum weight on es3 T261717', diff saved to https://phabricator.wikimedia.org/P13272 and previous config saved to /var/cache/conftool/dbconfig/20201117-063933-marostegui.json
[06:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1015 (re)pooling @ 25%: Slowly pool es1015 after cloning es1033 T261717', diff saved to https://phabricator.wikimedia.org/P13273 and previous config saved to /var/cache/conftool/dbconfig/20201117-064522-root.json
[06:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:29] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[06:45:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1019 (re)pooling @ 25%: Slowly pool es1019 after cloning es1034 T261717', diff saved to https://phabricator.wikimedia.org/P13274 and previous config saved to /var/cache/conftool/dbconfig/20201117-064546-root.json
[06:45:52] (PS1) Marostegui: install_server: Do not reimage es1033, es1034 [puppet] - https://gerrit.wikimedia.org/r/641329
[06:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 10%: Slowly pool es1033 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13275 and previous config saved to /var/cache/conftool/dbconfig/20201117-064705-root.json
[06:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 10%: Slowly pool es1034 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13276 and previous config saved to /var/cache/conftool/dbconfig/20201117-064716-root.json
[06:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:42] (CR) Marostegui: [C: +2] install_server: Do not reimage es1033, es1034 [puppet] - https://gerrit.wikimedia.org/r/641329 (owner: Marostegui)
[06:51:29] !log Upgrade db1077 and pc2010 to 10.4.17
[06:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:17] (PS4) Marostegui: wikireplicas: set up site.pp and hosts hiera for new servers [puppet] - https://gerrit.wikimedia.org/r/639815 (https://phabricator.wikimedia.org/T260843) (owner: Bstorm)
[06:58:32] (CR) Marostegui: [C: +2] wikireplicas: set up site.pp and hosts hiera for new servers [puppet] - https://gerrit.wikimedia.org/r/639815 (https://phabricator.wikimedia.org/T260843) (owner: Bstorm)
[07:00:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1015 (re)pooling @ 50%: Slowly pool es1015 after cloning es1033 T261717', diff saved to https://phabricator.wikimedia.org/P13277 and previous config saved to /var/cache/conftool/dbconfig/20201117-070025-root.json
[07:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:35] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[07:00:41] !log Stop mysql on db1124: s1 and s3, this will generate lag on enwiki and s3 on labsdb - T267090
[07:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:49] T267090: Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090
[07:00:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1019 (re)pooling @ 50%: Slowly pool es1019 after cloning es1034 T261717', diff saved to https://phabricator.wikimedia.org/P13278 and previous config saved to /var/cache/conftool/dbconfig/20201117-070050-root.json
[07:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 20%: Slowly pool es1033 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13280 and previous config saved to /var/cache/conftool/dbconfig/20201117-070209-root.json
[07:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 20%: Slowly pool es1034 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13281 and previous config saved to /var/cache/conftool/dbconfig/20201117-070220-root.json
[07:02:24] (PS5) Giuseppe Lavagetto: P:lvs::realserver: only sintall python3-poolcounter on > jessie [puppet] - https://gerrit.wikimedia.org/r/641194 (owner: Jbond)
[07:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:17] (PS6) Giuseppe Lavagetto: P:lvs::realserver: only sintall python3-poolcounter on > jessie [puppet] - https://gerrit.wikimedia.org/r/641194 (owner: Jbond)
[07:08:10] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26449" [puppet] - https://gerrit.wikimedia.org/r/641194 (owner: Jbond)
[07:10:17] (CR) Giuseppe Lavagetto: [V: +1 C: +2] P:lvs::realserver: only sintall python3-poolcounter on > jessie [puppet] - https://gerrit.wikimedia.org/r/641194 (owner: Jbond)
[07:12:19] (CR) Ayounsi: "Thanks!" (6 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/635849 (owner: Ayounsi)
[07:13:09] (PS19) Ayounsi: Add CSV import to ProvisionServerNetwork script [software/netbox-extras] - https://gerrit.wikimedia.org/r/635849
[07:13:11] (PS20) Ayounsi: ProvisionServerNetwork, cleanup and standardize logs format [software/netbox-extras] - https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339)
[07:13:13] (PS3) Ayounsi: Add python 3.8 to tox [software/netbox-extras] - https://gerrit.wikimedia.org/r/640664
[07:14:12] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01077 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[07:15:25] (PS1) Marostegui: db-eqiad.php: Depool pc1008 [mediawiki-config] - https://gerrit.wikimedia.org/r/641330 (https://phabricator.wikimedia.org/T266483)
[07:15:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1015 (re)pooling @ 75%: Slowly pool es1015 after cloning es1033 T261717', diff saved to https://phabricator.wikimedia.org/P13282 and previous config saved to /var/cache/conftool/dbconfig/20201117-071529-root.json
[07:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:36] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[07:15:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1019 (re)pooling @ 75%: Slowly pool es1019 after cloning es1034 T261717', diff saved to https://phabricator.wikimedia.org/P13283 and previous config saved to /var/cache/conftool/dbconfig/20201117-071553-root.json
[07:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 25%: Slowly pool es1033 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13284 and previous config saved to /var/cache/conftool/dbconfig/20201117-071712-root.json
[07:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 25%: Slowly pool es1034 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13285 and previous config saved to /var/cache/conftool/dbconfig/20201117-071723-root.json
[07:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:06] (CR) Ayounsi: [C: +2] Add CSV import to ProvisionServerNetwork script [software/netbox-extras] - https://gerrit.wikimedia.org/r/635849 (owner: Ayounsi)
[07:20:42] (CR) Giuseppe Lavagetto: [V: +2 C: +2] Add apache httpd base image [docker-images/production-images] - https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324) (owner: Giuseppe Lavagetto)
[07:21:19] (CR) Giuseppe Lavagetto: [V: +2 C: +2] Add an httpd-fcgi image [docker-images/production-images] - https://gerrit.wikimedia.org/r/636634 (https://phabricator.wikimedia.org/T265324) (owner: Giuseppe Lavagetto)
[07:23:05] (CR) Ayounsi: [C: +2] Add CSV import to ProvisionServerNetwork script (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/635849 (owner: Ayounsi)
[07:23:43] (CR) Giuseppe Lavagetto: [V: +2 C: +2] Add base php cli image [docker-images/production-images] - https://gerrit.wikimedia.org/r/638095 (https://phabricator.wikimedia.org/T265324) (owner: Giuseppe Lavagetto)
[07:26:41] (CR) Giuseppe Lavagetto: [V: +2 C: +2] Add a php-fpm image for php 7.2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/640386 (https://phabricator.wikimedia.org/T265324) (owner: Giuseppe Lavagetto)
[07:28:24] (PS1) Giuseppe Lavagetto: Add a mcrouter image [docker-images/production-images] - https://gerrit.wikimedia.org/r/641331 (https://phabricator.wikimedia.org/T265324)
[07:29:23] Operations, MW-on-K8s, serviceops, Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (Joe)
[07:30:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1015 (re)pooling @ 100%: Slowly pool es1015 after cloning es1033 T261717', diff saved to https://phabricator.wikimedia.org/P13286 and previous config saved to /var/cache/conftool/dbconfig/20201117-073032-root.json
[07:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:40] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[07:30:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1019 (re)pooling @ 100%: Slowly pool es1019 after cloning es1034 T261717', diff saved to https://phabricator.wikimedia.org/P13287 and previous config saved to /var/cache/conftool/dbconfig/20201117-073057-root.json
[07:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 30%: Slowly pool es1033 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13288 and previous config saved to /var/cache/conftool/dbconfig/20201117-073216-root.json
[07:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 30%: Slowly pool es1034 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13289 and previous config saved to /var/cache/conftool/dbconfig/20201117-073227-root.json
[07:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:11] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0057 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[07:39:16] Operations, Packaging, serviceops: Create Debian packages for Node.js 14 upgrade - https://phabricator.wikimedia.org/T267891 (Joe) While nothing above seems to contradict that we don't have a compelling reason to install node 14 packages for production now, I want to make a point: developers don't ju...
[07:42:02] Operations, Packaging, serviceops: Create Debian packages for Node.js 14 upgrade - https://phabricator.wikimedia.org/T267891 (Joe) Open→Declined I didn't notice the task was opened again, I will decline again as I don't see a need for a node 14 package to use in production. CI can easily use...
[07:47:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 50%: Slowly pool es1033 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13290 and previous config saved to /var/cache/conftool/dbconfig/20201117-074719-root.json
[07:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:26] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[07:47:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 50%: Slowly pool es1034 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13291 and previous config saved to /var/cache/conftool/dbconfig/20201117-074730-root.json
[07:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:09] PROBLEM - Check systemd state on ms-be2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:49:57] Operations, ops-codfw: Degraded RAID on ms-be2031 - https://phabricator.wikimedia.org/T267748 (fgiunchedi) Open→Resolved Thank you @Papaul ! SSD is rebuilding
[07:49:57] !log split codfw row D ganeti switch ports out of the interface group
[07:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:13] !log codfw row D: explicitly set access ports to "interface-mode access"
[07:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:32] Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (Marostegui)
[07:54:43] Operations, ops-eqiad, SRE-swift-storage: ms-be1022 smart storage battery failure; disk sdb possibly bad - https://phabricator.wikimedia.org/T267870 (fgiunchedi) >>! In T267870#6624981, @Cmjohnson wrote: > @fgiunchedi The server is out of warranty, I have some decom'd HP servers and most likely can...
[07:58:19] !log codfw row D: Convert LVS ranges to individual interfaces
[07:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:21] RECOVERY - MD RAID on ms-be2031 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[08:02:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 60%: Slowly pool es1033 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13292 and previous config saved to /var/cache/conftool/dbconfig/20201117-080222-root.json
[08:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:30] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[08:02:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 60%: Slowly pool es1034 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13293 and previous config saved to /var/cache/conftool/dbconfig/20201117-080234-root.json
[08:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:55] RECOVERY - Check systemd state on ms-be2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:08:48] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/641248 (https://phabricator.wikimedia.org/T267961) (owner: Herron)
[08:13:54] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/641241 (https://phabricator.wikimedia.org/T267817) (owner: Herron)
[08:15:21] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/641312 (https://phabricator.wikimedia.org/T266479) (owner: Dzahn)
[08:17:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 75%: Slowly pool es1033 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13294 and previous config saved to /var/cache/conftool/dbconfig/20201117-081726-root.json
[08:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:34] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[08:17:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 75%: Slowly pool es1034 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13295 and previous config saved to /var/cache/conftool/dbconfig/20201117-081737-root.json
[08:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:02] (PS1) Elukey: role::analytics_cluster::coordinator::replica: add meta db instance [puppet] - https://gerrit.wikimedia.org/r/641381 (https://phabricator.wikimedia.org/T257412)
[08:21:59] (PS1) Filippo Giunchedi: grafana: redirect /explore to grafana-rw [puppet] - https://gerrit.wikimedia.org/r/641382 (https://phabricator.wikimedia.org/T267972)
[08:22:28] !log restart netbox on netbox1001 to test new logging configuration
[08:22:32] (PS2) Elukey: role::analytics_cluster::coordinator::replica: add meta db instance [puppet] - https://gerrit.wikimedia.org/r/641381 (https://phabricator.wikimedia.org/T257412)
[08:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:03] (CR) Volans: [C: +2] cli: change confirmation input check [software/cumin] - https://gerrit.wikimedia.org/r/636729 (owner: Volans)
[08:24:47] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[08:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:13] (PS1) Hashar: gerrit: use multiline regex flag for Sonar report [puppet] - https://gerrit.wikimedia.org/r/641383 (https://phabricator.wikimedia.org/T267028)
[08:25:43] (CR) Hashar: gerrit: fix SonarQube report url discovery (1 comment) [puppet] - https://gerrit.wikimedia.org/r/638565 (https://phabricator.wikimedia.org/T267028) (owner: Hashar)
[08:26:39] (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/641382 (https://phabricator.wikimedia.org/T267972) (owner: Filippo Giunchedi)
[08:27:22] (Merged) jenkins-bot: cli: change confirmation input check [software/cumin] - https://gerrit.wikimedia.org/r/636729 (owner: Volans)
[08:29:15] (CR) Elukey: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26450" [puppet] - https://gerrit.wikimedia.org/r/641381 (https://phabricator.wikimedia.org/T257412) (owner: Elukey)
[08:29:19] (CR) Filippo Giunchedi: [C: +2] grafana: redirect /explore to grafana-rw [puppet] - https://gerrit.wikimedia.org/r/641382 (https://phabricator.wikimedia.org/T267972) (owner: Filippo Giunchedi)
[08:31:21] wow what is that V: +1 ?
[08:31:33] (CR) Elukey: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26451" [puppet] - https://gerrit.wikimedia.org/r/641381 (https://phabricator.wikimedia.org/T257412) (owner: Elukey)
[08:31:41] this is niceeeee
[08:31:49] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[08:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:57] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:31:58] this is not nice but unrelated
[08:32:08] lol
[08:32:23] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:32:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 80%: Slowly pool es1033 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13296 and previous config saved to /var/cache/conftool/dbconfig/20201117-083229-root.json
[08:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:37] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[08:32:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 80%: Slowly pool es1034 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13297 and previous config saved to /var/cache/conftool/dbconfig/20201117-083241-root.json
[08:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:58] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[08:34:03] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:34:03] (PS2) Filippo Giunchedi: profile: redirect to grafana-rw with referer [puppet] - https://gerrit.wikimedia.org/r/641164 (https://phabricator.wikimedia.org/T267645)
[08:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:15] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:35:57] (CR) Filippo Giunchedi: [C: +2] profile: add Alertmanager API virtual host [puppet] - https://gerrit.wikimedia.org/r/641191 (https://phabricator.wikimedia.org/T266017) (owner: Filippo Giunchedi)
[08:36:15] (CR) Filippo Giunchedi: [V: +1 C: +2] role: add Alertmanager API profile [puppet] - https://gerrit.wikimedia.org/r/641192 (https://phabricator.wikimedia.org/T266017) (owner: Filippo Giunchedi)
[08:37:56] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[08:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:10] Critical Alert for device asw-c-codfw.mgmt.codfw.wmnet - Juniper alarm active
[08:43:04] (CR) Elukey: [V: +1 C: +2] role::analytics_cluster::coordinator::replica: add meta db instance [puppet] - https://gerrit.wikimedia.org/r/641381 (https://phabricator.wikimedia.org/T257412) (owner: Elukey)
[08:43:42] !log Truncate tendril.global_status_log - T231185
[08:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:49] T231185: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185
[08:44:20] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/641164 (https://phabricator.wikimedia.org/T267645) (owner: Filippo Giunchedi)
[08:47:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 100%: Slowly pool es1033 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13298 and previous config saved to /var/cache/conftool/dbconfig/20201117-084733-root.json
[08:47:39] (PS1) Filippo Giunchedi: pcc: post comment with jenkins console link [puppet] - https://gerrit.wikimedia.org/r/641386
[08:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:40] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[08:47:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 100%: Slowly pool es1034 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13299 and previous config saved to /var/cache/conftool/dbconfig/20201117-084744-root.json
[08:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:18] (PS1) Elukey: role::analytics_cluster::coordinator::replica: change datadir for Meta [puppet] - https://gerrit.wikimedia.org/r/641387
[08:51:44] (CR) Elukey: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26452" [puppet] - https://gerrit.wikimedia.org/r/641387 (owner: Elukey)
[08:52:15] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[08:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1011 before decommissioning it and pool es1026 as new es2 master', diff saved to https://phabricator.wikimedia.org/P13300 and previous config saved to /var/cache/conftool/dbconfig/20201117-085432-marostegui.json
[08:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:05] (CR) Elukey: [V: +1 C: +2] role::analytics_cluster::coordinator::replica: change datadir for Meta [puppet] - https://gerrit.wikimedia.org/r/641387 (owner: Elukey)
[08:55:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set es1028 as new es3 master', diff saved to
https://phabricator.wikimedia.org/P13301 and previous config saved to /var/cache/conftool/dbconfig/20201117-085542-marostegui.json [08:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:15] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [08:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:35] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission [08:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:21] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [09:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:00] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission [09:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:28] I am testing some issue with riccardo, this is why I am spamming all these failures --^ [09:04:10] yeah sorry about that [09:07:35] (03CR) 10Jbond: "already fixed here https://gerrit.wikimedia.org/r/c/operations/puppet/+/641194/" [puppet] - 10https://gerrit.wikimedia.org/r/641307 (owner: 10Dzahn) [09:08:35] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [09:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:20] !log volans@cumin1001 START - Cookbook sre.hosts.decommission [09:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:21] (03PS1) 10Muehlenhoff: Flag an error if key file can't be found [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/641390 [09:14:31] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [09:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:12] !log volans@cumin1001 START - Cookbook sre.hosts.decommission [09:17:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:55] (03CR) 10Jbond: [C: 03+1] "LGTM thx" [puppet] - 10https://gerrit.wikimedia.org/r/641386 (owner: 10Filippo Giunchedi) [09:23:34] (03CR) 10QChris: [C: 03+1] gerrit: use multiline regex flag for Sonar report [puppet] - 10https://gerrit.wikimedia.org/r/641383 (https://phabricator.wikimedia.org/T267028) (owner: 10Hashar) [09:27:15] (03PS1) 10Volans: sre.hosts.decommission: workaround Netbox cache [cookbooks] - 10https://gerrit.wikimedia.org/r/641391 [09:29:00] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [09:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:24] this was an expected failure, all went good [09:32:53] 10Operations, 10Puppet, 10puppet-compiler, 10User-jbond: puppet master command will be removed in puppet 6 - https://phabricator.wikimedia.org/T236373 (10jbond) https://github.com/github/octocatalog-diff/pull/226 [09:35:08] (03PS1) 10Volans: netbox: log stacktrace on error [puppet] - 10https://gerrit.wikimedia.org/r/641393 [09:36:13] (03CR) 10Volans: "already tested" [puppet] - 10https://gerrit.wikimedia.org/r/641393 (owner: 10Volans) [09:36:22] (03CR) 10Kormat: [C: 03+1] db-eqiad.php: Depool pc1008 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641330 (https://phabricator.wikimedia.org/T266483) (owner: 10Marostegui) [09:43:54] (03CR) 10Jbond: [C: 03+1] "lgtm minor nit" (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/641390 (owner: 10Muehlenhoff) [09:46:13] (03PS2) 10Muehlenhoff: Flag an error if key file can't be found [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/641390 [09:46:48] (03CR) 10Muehlenhoff: Flag an error if key file can't be found (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/641390 (owner: 10Muehlenhoff) [09:53:34] (03CR) 10Jbond: [C: 03+1] "lgtm" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/641390 (owner: 10Muehlenhoff) [09:54:21] 
(03CR) 10Muehlenhoff: [C: 03+1] "Couple of things/remarks:" [puppet] - 10https://gerrit.wikimedia.org/r/640100 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [09:57:34] PROBLEM - analytics-meta MySQL instance on an-coord1002 is CRITICAL: NRPE: Command check_mysql_analytics-meta not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Mysql_Meta [09:58:08] PROBLEM - Check systemd state on an-coord1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:58:54] PROBLEM - MySQL disk space for analytics-meta instance on an-coord1002 is CRITICAL: NRPE: Command check_mysql_analytics-meta_disk_space not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Mysql_Meta [10:00:23] ah snap this is me, wip --^ [10:00:28] downtime expired, I was afk [10:01:32] PROBLEM - HTTPS-wmfusercontent on phab.wmfusercontent.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2020-12-17 10:00:19 +0000 (expires in 29 days) https://phabricator.wikimedia.org/tag/phabricator/ [10:01:54] PROBLEM - HTTPS-planet on en.planet.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2020-12-17 10:00:19 +0000 (expires in 29 days) https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org [10:03:16] (03CR) 10Elukey: [C: 03+1] sre.hosts.decommission: workaround Netbox cache [cookbooks] - 10https://gerrit.wikimedia.org/r/641391 (owner: 10Volans) [10:04:08] (03CR) 10Elukey: [C: 03+1] netbox: log stacktrace on error [puppet] - 10https://gerrit.wikimedia.org/r/641393 (owner: 10Volans) [10:05:54] (03CR) 10Volans: [C: 03+2] netbox: log stacktrace on error [puppet] - 10https://gerrit.wikimedia.org/r/641393 (owner: 10Volans) [10:06:39] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: workaround Netbox cache [cookbooks] - 10https://gerrit.wikimedia.org/r/641391 (owner: 10Volans) [10:06:44] \o/ [10:08:07] 
(03Merged) 10jenkins-bot: sre.hosts.decommission: workaround Netbox cache [cookbooks] - 10https://gerrit.wikimedia.org/r/641391 (owner: 10Volans) [10:08:52] (03CR) 10Filippo Giunchedi: [C: 03+2] pcc: post comment with jenkins console link [puppet] - 10https://gerrit.wikimedia.org/r/641386 (owner: 10Filippo Giunchedi) [10:10:31] (03PS1) 10Volans: scripts: provision server's network fix select [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/641401 [10:11:02] RECOVERY - MySQL disk space for analytics-meta instance on an-coord1002 is OK: DISK OK - free space: / 52344 MB (73% inode=93%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Mysql_Meta [10:11:22] RECOVERY - analytics-meta MySQL instance on an-coord1002 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Mysql_Meta [10:12:17] (03PS1) 10Kormat: orchestrator: Work with fqdns instead of hostnames. [puppet] - 10https://gerrit.wikimedia.org/r/641402 (https://phabricator.wikimedia.org/T267929) [10:13:27] (03CR) 10Volans: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/636102 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [10:14:18] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26454" [puppet] - 10https://gerrit.wikimedia.org/r/641402 (https://phabricator.wikimedia.org/T267929) (owner: 10Kormat) [10:14:59] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Flag an error if key file can't be found [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/641390 (owner: 10Muehlenhoff) [10:15:21] (03CR) 10Kormat: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/26454/" [puppet] - 10https://gerrit.wikimedia.org/r/641402 (https://phabricator.wikimedia.org/T267929) (owner: 10Kormat) [10:15:59] (03PS2) 10Kormat: orchestrator: Work with fqdns instead of hostnames. 
[puppet] - 10https://gerrit.wikimedia.org/r/641402 (https://phabricator.wikimedia.org/T267929) [10:16:53] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool pc1008 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641330 (https://phabricator.wikimedia.org/T266483) (owner: 10Marostegui) [10:17:52] (03Merged) 10jenkins-bot: db-eqiad.php: Depool pc1008 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641330 (https://phabricator.wikimedia.org/T266483) (owner: 10Marostegui) [10:19:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool pc1008 and place pc1010 instead of it T266483 (duration: 00m 57s) [10:19:12] !log Restart mysql on pc1008 T266483 [10:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:17] T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 [10:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:00] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool pc1008" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641292 [10:21:02] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Cloud NFS: remove the load alerts [puppet] - 10https://gerrit.wikimedia.org/r/641239 (owner: 10Bstorm) [10:21:25] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [10:21:28] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:48] 10Operations, 10DBA, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) [10:22:00] !running schema change against s3 in codfw T259831 [10:22:00] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [10:22:37] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool 
pc1008" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641292 (owner: 10Marostegui) [10:23:40] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool pc1008" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641292 (owner: 10Marostegui) [10:24:59] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool pc1008 in pc2 after restarting mysql T266483 (duration: 00m 56s) [10:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:11] T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 [10:25:20] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [10:26:11] ^ moritzm maybe your last commit? [10:26:21] ah no, it wasn't to puppet [10:26:51] godog: ^? [10:26:55] yeah, that was a deb [10:27:07] (03CR) 10Volans: "LGTM and is already an improvement, thanks for this! However I think we should aim to more explicitly fail when returning earlier from pro" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339) (owner: 10Ayounsi) [10:27:11] (03CR) 10Volans: [C: 03+1] ProvisionServerNetwork, cleanup and standardize logs format [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339) (owner: 10Ayounsi) [10:27:39] marostegui: doh! thank you [10:27:52] hit 'puppet-merge' and promptly forgot 'yes' [10:28:34] (03PS1) 10Filippo Giunchedi: alertmanager: vary apache access config based on type [puppet] - 10https://gerrit.wikimedia.org/r/641404 (https://phabricator.wikimedia.org/T266017) [10:28:48] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [10:31:34] (03PS2) 10Filippo Giunchedi: alertmanager: vary apache access config based on type [puppet] - 10https://gerrit.wikimedia.org/r/641404 (https://phabricator.wikimedia.org/T266017) [10:34:27] (03CR) 10Marostegui: [C: 03+1] "pc1008 was restarted already!" [puppet] - 10https://gerrit.wikimedia.org/r/641402 (https://phabricator.wikimedia.org/T267929) (owner: 10Kormat) [10:34:51] (03CR) 10Kormat: [C: 03+2] orchestrator: Work with fqdns instead of hostnames. [puppet] - 10https://gerrit.wikimedia.org/r/641402 (https://phabricator.wikimedia.org/T267929) (owner: 10Kormat) [10:40:01] (03PS1) 10Marostegui: check_private_data_report: Add clouddb1013, clouddb1017 [puppet] - 10https://gerrit.wikimedia.org/r/641405 (https://phabricator.wikimedia.org/T267090) [10:41:49] (03CR) 10Nikerabbit: [C: 03+1] JobQueue: Move LocalGlobalUserPageCacheUpdateJob to it's own queue. [deployment-charts] - 10https://gerrit.wikimedia.org/r/640446 (https://phabricator.wikimedia.org/T267520) (owner: 10Ppchelko) [10:43:20] (03CR) 10Marostegui: [C: 03+2] check_private_data_report: Add clouddb1013, clouddb1017 [puppet] - 10https://gerrit.wikimedia.org/r/641405 (https://phabricator.wikimedia.org/T267090) (owner: 10Marostegui) [10:46:32] !log Run a test on check_private_data on clouddb1013 for s1 and s3 - T267090 [10:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:39] T267090: Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 [10:54:47] (03PS3) 10Filippo Giunchedi: alertmanager: vary apache access config based on type [puppet] - 10https://gerrit.wikimedia.org/r/641404 (https://phabricator.wikimedia.org/T266017) [10:56:09] (03CR) 10jerkins-bot: [V: 04-1] alertmanager: vary apache access config based on type [puppet] - 10https://gerrit.wikimedia.org/r/641404 (https://phabricator.wikimedia.org/T266017) (owner: 10Filippo Giunchedi) [10:57:16] (03PS4) 10Filippo Giunchedi: 
alertmanager: vary apache access config based on type [puppet] - 10https://gerrit.wikimedia.org/r/641404 (https://phabricator.wikimedia.org/T266017) [10:58:59] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26458/console" [puppet] - 10https://gerrit.wikimedia.org/r/641404 (https://phabricator.wikimedia.org/T266017) (owner: 10Filippo Giunchedi) [11:04:34] (03PS1) 10Volans: scripts: interface allocation skip mgmt if present [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/641409 [11:10:50] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] alertmanager: vary apache access config based on type [puppet] - 10https://gerrit.wikimedia.org/r/641404 (https://phabricator.wikimedia.org/T266017) (owner: 10Filippo Giunchedi) [11:34:50] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: redirect to grafana-rw with referer [puppet] - 10https://gerrit.wikimedia.org/r/641164 (https://phabricator.wikimedia.org/T267645) (owner: 10Filippo Giunchedi) [11:37:48] (03PS1) 10Klausman: home/klausman: Add vim-go [puppet] - 10https://gerrit.wikimedia.org/r/641411 [11:38:44] (03CR) 10Klausman: [C: 03+2] home/klausman: Add vim-go [puppet] - 10https://gerrit.wikimedia.org/r/641411 (owner: 10Klausman) [11:39:53] (03CR) 10Ayounsi: [C: 03+1] "Thx!" 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/641401 (owner: 10Volans) [11:40:59] (03CR) 10Ayounsi: [C: 03+2] ProvisionServerNetwork, cleanup and standardize logs format [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339) (owner: 10Ayounsi) [11:43:11] (03PS1) 10Elukey: profile::analytics::database::meta: set readonly depending on status [puppet] - 10https://gerrit.wikimedia.org/r/641412 [11:44:48] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26459/console" [puppet] - 10https://gerrit.wikimedia.org/r/641412 (owner: 10Elukey) [11:46:55] (03CR) 10Ayounsi: [C: 03+1] scripts: interface allocation skip mgmt if present [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/641409 (owner: 10Volans) [11:49:08] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 143 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:50:54] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 13 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:51:37] !log codfw row C: Standardize interfaces descriptions [11:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a European mid-day backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201117T1200). [12:00:04] Lucas_WMDE: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:12] o/ [12:00:49] (03PS5) 10Lucas Werkmeister (WMDE): Remove migration settings in Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631431 (https://phabricator.wikimedia.org/T264286) (owner: 10Tobias Andersson) [12:01:44] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove migration settings in Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631431 (https://phabricator.wikimedia.org/T264286) (owner: 10Tobias Andersson) [12:02:34] (03Merged) 10jenkins-bot: Remove migration settings in Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631431 (https://phabricator.wikimedia.org/T264286) (owner: 10Tobias Andersson) [12:03:02] pulled to mwdebug1001, quickly testing [12:04:13] all seems fine, syncing [12:05:38] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:631431|Remove migration settings in Wikibase.php (T264286)]] (duration: 00m 57s) [12:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:45] T264286: Remove wb_terms migration configuration options from production config - https://phabricator.wikimedia.org/T264286 [12:06:58] (03PS3) 10Lucas Werkmeister (WMDE): Remove migration settings in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631496 (https://phabricator.wikimedia.org/T264286) (owner: 10Tobias Andersson) [12:07:20] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove migration settings in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631496 (https://phabricator.wikimedia.org/T264286) (owner: 10Tobias Andersson) [12:08:11] (03Merged) 10jenkins-bot: Remove migration settings in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631496 (https://phabricator.wikimedia.org/T264286) (owner: 10Tobias Andersson) [12:08:37] (03PS1) 10Muehlenhoff: Point idp CNAME to 
idp2001 [dns] - 10https://gerrit.wikimedia.org/r/641416 (https://phabricator.wikimedia.org/T265857) [12:09:39] (03CR) 10Jbond: [C: 03+2] Point idp CNAME to idp2001 [dns] - 10https://gerrit.wikimedia.org/r/641416 (https://phabricator.wikimedia.org/T265857) (owner: 10Muehlenhoff) [12:09:53] quickly testing this change on mwdebug1001 too [12:12:09] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:631496|Remove migration settings in InitialiseSettings.php (T264286)]], 1/2 (prod) (duration: 00m 56s) [12:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:16] T264286: Remove wb_terms migration configuration options from production config - https://phabricator.wikimedia.org/T264286 [12:13:19] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:631496|Remove migration settings in InitialiseSettings.php (T264286)]], 2/2 (labs) (duration: 00m 56s) [12:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:53] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=codfw,cluster=maps,service=kartotherian,name=maps2006.codfw.wmnet [12:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:42] looks like that’s it, nothing else in the calendar [12:15:10] !log EU backport&config window done [12:15:11] !log codfw row C: remove extra "enable" [12:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:34] (03PS1) 10Klausman: admin: Add config for vim-go [puppet] - 10https://gerrit.wikimedia.org/r/641417 [12:19:19] (03CR) 10Klausman: [C: 03+2] admin: Add config for vim-go [puppet] - 10https://gerrit.wikimedia.org/r/641417 (owner: 10Klausman) [12:23:07] !log codfw row C: move ganeti and LVS from interface-range to individual 
term [12:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:39] (03PS3) 10Jbond: puppetmaster: update webconfig to use correct file path [puppet] - 10https://gerrit.wikimedia.org/r/550459 (https://phabricator.wikimedia.org/T237994) [12:26:54] (03PS4) 10Jbond: puppetmaster: update webconfig to use correct file path [puppet] - 10https://gerrit.wikimedia.org/r/550459 (https://phabricator.wikimedia.org/T237994) [12:27:58] Lucas_WMDE: I have a production error I'd like to fix and backport now, if you're still around to do it during this window. If not it could wait until later [12:28:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26460" [puppet] - 10https://gerrit.wikimedia.org/r/550459 (https://phabricator.wikimedia.org/T237994) (owner: 10Jbond) [12:32:02] (03PS2) 10BBlack: configure digicert-2020 certificates [puppet] - 10https://gerrit.wikimedia.org/r/640213 (https://phabricator.wikimedia.org/T261419) [12:32:06] kostajh: I’m around-ish [12:32:12] can you link the gerrit change? 
[12:32:23] (03PS5) 10Jbond: puppetmaster: update webconfig to use correct file path [puppet] - 10https://gerrit.wikimedia.org/r/550459 (https://phabricator.wikimedia.org/T237994) [12:32:32] T268012 [12:32:33] T268012: Call to a member function getDifficulty() on null - https://phabricator.wikimedia.org/T268012 [12:32:40] Lucas_WMDE: https://gerrit.wikimedia.org/r/641419 [12:33:06] ah, I think I’ve seen that error in logstas [12:33:09] *logstash [12:33:12] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26461" [puppet] - 10https://gerrit.wikimedia.org/r/550459 (https://phabricator.wikimedia.org/T237994) (owner: 10Jbond) [12:33:15] !log reopen EU backport&config window [12:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:29] (03PS6) 10Jbond: puppetmaster: update webconfig to use correct file path [puppet] - 10https://gerrit.wikimedia.org/r/550459 (https://phabricator.wikimedia.org/T237994) [12:34:11] Lucas_WMDE: thank you, I'm double checking the fix now locally [12:34:25] ok [12:34:41] should I +2 the master change as well (once you’ve checked it), or do you only want to merge a backport for now and leave master open? 
[12:34:58] both, I think [12:35:01] (maybe you want more review or add unit tests for the master version or something else) [12:35:04] okay [12:35:32] I assume this will need a backport to wmf.18 as well [12:35:49] Lucas_WMDE: Ok, I've verified the patch locally [12:35:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26462" [puppet] - 10https://gerrit.wikimedia.org/r/550459 (https://phabricator.wikimedia.org/T237994) (owner: 10Jbond) [12:36:00] not sure how to do those before the train has started to roll out… last time I tried (with a Wikibase change), it ended up not being applied after the train [12:36:06] but let’s fix wmf.16 first [12:36:15] yeah [12:36:26] I'll update the deploy calendar with a link to the cherry pick [12:36:27] (03PS1) 10Lucas Werkmeister (WMDE): Suggested Edits: Guard against task type not existing [extensions/GrowthExperiments] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/641293 (https://phabricator.wikimedia.org/T268012) [12:36:38] sounds good, I just uploaded the cherry pick ^ [12:36:51] (03CR) 10Kosta Harlan: [C: 03+1] Suggested Edits: Guard against task type not existing [extensions/GrowthExperiments] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/641293 (https://phabricator.wikimedia.org/T268012) (owner: 10Lucas Werkmeister (WMDE)) [12:38:03] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Suggested Edits: Guard against task type not existing [extensions/GrowthExperiments] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/641293 (https://phabricator.wikimedia.org/T268012) (owner: 10Lucas Werkmeister (WMDE)) [12:40:54] ok, I can reproduce the error on kowiki+mwdebug now [12:41:06] so should be able to test it too [12:41:19] cool. I'm happy to as well, just let me know. 
[12:41:20] once CI goes through [12:42:10] !log IDP updated to 6.2.4 [12:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:30] (03CR) 10BBlack: [C: 03+2] configure digicert-2020 certificates [puppet] - 10https://gerrit.wikimedia.org/r/640213 (https://phabricator.wikimedia.org/T261419) (owner: 10BBlack) [12:47:51] Lucas_WMDE: There's another one that is blowing up logstash (T268008) but it's affecting job queue so maybe could wait for the next window [12:47:52] T268008: Argument 2 passed to GrowthExperiments\NewcomerTasks\TaskSuggester\CacheDecorator::suggest() must be of the type array, null given, called in /srv/mediawiki/php-1.36.0-wmf.16/extensions/GrowthExperiments/includes/NewcomerTasks/TaskSuggester/NewcomerTasksCacheRefreshJob.php on line 35 - https://phabricator.wikimedia.org/T268008 [12:48:01] ok [12:48:24] yeah I don’t think we have time for two backports [12:48:25] sorry, the combination of odd train schedules plus the 5 day weekend last week caught me off my usual logstash patrolling rhythm [12:48:28] no worries [12:49:10] hm, there’? no morning backport window today? 
[12:49:12] *there’s [12:49:20] only EU and evening apparently [12:50:04] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 45505616 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:50:39] (03Merged) 10jenkins-bot: Suggested Edits: Guard against task type not existing [extensions/GrowthExperiments] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/641293 (https://phabricator.wikimedia.org/T268012) (owner: 10Lucas Werkmeister (WMDE)) [12:51:50] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 27600 and 47 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:51:52] okay, the fix should be on mwdebug1001 kostajh [12:51:56] ACKNOWLEDGEMENT - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 231224016 and 937803 seconds Hnowlan Awaiting fix https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:52:28] Lucas_WMDE: looking [12:53:13] Lucas_WMDE: looks good [12:53:20] yup, same here [12:53:22] syncing [12:53:44] !log cpNNNN: removing old (30d+) failure reports from /var/cache/ocsp [12:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:16] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.36.0-wmf.16/extensions/GrowthExperiments/includes/HomepageModules/SuggestedEdits.php: Backport: [[gerrit:641293|Suggested Edits: Guard against task type not existing (T268012)]] (duration: 00m 58s) [12:55:22] Lucas_WMDE: thanks for your help, I appreciate you reopening the window to get it deployed [12:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:23] T268012: Call to a member function getDifficulty() on null - https://phabricator.wikimedia.org/T268012 [12:55:31] no problem, thanks for fixing the issue kostajh :) [12:55:36] I’ll leave a 
comment on the train task [12:56:04] (03PS1) 10Kosta Harlan: Suggested Edits: Guard against task type not existing [extensions/GrowthExperiments] (wmf/1.36.0-wmf.18) - 10https://gerrit.wikimedia.org/r/641294 (https://phabricator.wikimedia.org/T268012) [12:56:37] Lucas_WMDE: do you know where the cherry pick for wmf.18 should be listed on the deployment calendar? just the next available window? [12:57:08] probably, I guess [12:57:20] I’d coordinate it with the train… what are they called? [12:57:22] the train people [12:57:27] conductors, that’s the word [12:57:36] !log codfw row B: Standardize interfaces descriptions [12:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:01] alright [12:58:19] (03PS1) 10Muehlenhoff: Fix typo in .install file [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/641422 [12:58:43] commented at https://phabricator.wikimedia.org/T263184#6627149 [12:58:56] !log updating idp-test* to 6.2.4-2 [12:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:25] !log EU backport&config window done (again ☺) [12:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:24] (03PS2) 10Elukey: profile::analytics::database::meta: set readonly depending on status [puppet] - 10https://gerrit.wikimedia.org/r/641412 [13:03:46] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26463/console" [puppet] - 10https://gerrit.wikimedia.org/r/641412 (owner: 10Elukey) [13:06:44] (03PS1) 10Jbond: puppetmaster::ssl: fold puppetmaster::ca_server into puppetmaster::ssl [puppet] - 10https://gerrit.wikimedia.org/r/641423 [13:08:10] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster::ssl: fold puppetmaster::ca_server into puppetmaster::ssl [puppet] - 10https://gerrit.wikimedia.org/r/641423 (owner: 10Jbond) [13:09:06] (03PS3) 10Elukey: profile::analytics::database::meta: set readonly depending on status 
[puppet] - 10https://gerrit.wikimedia.org/r/641412 [13:09:31] (03PS2) 10Jbond: puppetmaster::ssl: fold puppetmaster::ca_server into puppetmaster::ssl [puppet] - 10https://gerrit.wikimedia.org/r/641423 [13:09:56] !log codfw row B: remove extra "enable" [13:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:12] (03PS1) 10BBlack: TLS unified public cert: switch non-us to dc-2020 [puppet] - 10https://gerrit.wikimedia.org/r/641424 (https://phabricator.wikimedia.org/T261419) [13:10:14] (03PS1) 10BBlack: TLS unified public cert: remove expiring gs-2019 [puppet] - 10https://gerrit.wikimedia.org/r/641425 (https://phabricator.wikimedia.org/T266503) [13:10:56] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster::ssl: fold puppetmaster::ca_server into puppetmaster::ssl [puppet] - 10https://gerrit.wikimedia.org/r/641423 (owner: 10Jbond) [13:11:22] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26466/console" [puppet] - 10https://gerrit.wikimedia.org/r/641412 (owner: 10Elukey) [13:17:56] 10Operations, 10observability, 10User-fgiunchedi: Wrong redirect when logging into grafana-rw from a grafana.w.o dashboard - https://phabricator.wikimedia.org/T267645 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is fixed now \o/ [13:17:59] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Enable CAS authentication for Grafana - https://phabricator.wikimedia.org/T262512 (10fgiunchedi) [13:19:42] (03PS4) 10Elukey: profile::analytics::database::meta: set readonly depending on status [puppet] - 10https://gerrit.wikimedia.org/r/641412 [13:21:16] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26467/console" [puppet] - 10https://gerrit.wikimedia.org/r/641412 (owner: 10Elukey) [13:21:30] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission [13:21:33] !log elukey@cumin1001 END (FAIL) - Cookbook 
sre.hosts.decommission (exit_code=99) [13:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:24] I am re-excuting the decom cookbook for some analytics nodes that failed previously to make sure that netbox is up to date, it will spam a little, sorry in advance [13:22:29] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission [13:22:33] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) [13:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:29] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission [13:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:04] (03PS1) 10Kosta Harlan: Suggested edits: Guard against empty topic data [extensions/GrowthExperiments] (wmf/1.36.0-wmf.18) - 10https://gerrit.wikimedia.org/r/641295 (https://phabricator.wikimedia.org/T268015) [13:27:36] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [13:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:33] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission [13:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:23] !log codfw row B: move ganeti, Cloud and LVS from interface-range to individual term [13:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:20] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [13:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:34] (03PS1) 10Filippo Giunchedi: alertmanager: add more correct types for apache require [puppet] - 10https://gerrit.wikimedia.org/r/641428 [13:39:10] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC 
SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26468/console" [puppet] - 10https://gerrit.wikimedia.org/r/641428 (owner: 10Filippo Giunchedi) [13:39:29] (03PS3) 10Jbond: puppetmaster::ssl: fold puppetmaster::ca_server into puppetmaster::ssl [puppet] - 10https://gerrit.wikimedia.org/r/641423 [13:40:43] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26469/console" [puppet] - 10https://gerrit.wikimedia.org/r/641423 (owner: 10Jbond) [13:40:52] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster::ssl: fold puppetmaster::ca_server into puppetmaster::ssl [puppet] - 10https://gerrit.wikimedia.org/r/641423 (owner: 10Jbond) [13:42:56] (03PS4) 10Jbond: puppetmaster::ssl: fold puppetmaster::ca_server into puppetmaster::ssl [puppet] - 10https://gerrit.wikimedia.org/r/641423 [13:45:12] (03PS5) 10Jbond: puppetmaster::ssl: fold puppetmaster::ca_server into puppetmaster::ssl [puppet] - 10https://gerrit.wikimedia.org/r/641423 [13:46:22] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26470/console" [puppet] - 10https://gerrit.wikimedia.org/r/641423 (owner: 10Jbond) [13:53:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26471/console" [puppet] - 10https://gerrit.wikimedia.org/r/641423 (owner: 10Jbond) [13:56:35] (03CR) 10Jbond: [C: 03+1] "lgtm minor nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/641428 (owner: 10Filippo Giunchedi) [13:57:34] (03PS6) 10Jbond: puppetmaster::ssl: fold puppetmaster::ca_server into puppetmaster::ssl [puppet] - 10https://gerrit.wikimedia.org/r/641423 [13:59:06] (03PS2) 10Filippo Giunchedi: alertmanager: add more correct types for apache require [puppet] - 10https://gerrit.wikimedia.org/r/641428 [14:00:41] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS: 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26472/console" [puppet] - 10https://gerrit.wikimedia.org/r/641428 (owner: 10Filippo Giunchedi) [14:00:44] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26473/console" [puppet] - 10https://gerrit.wikimedia.org/r/641423 (owner: 10Jbond) [14:03:40] !log codfw row A: standardize interfaces [14:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:02] (03PS7) 10Jbond: puppetmaster::ssl: fold puppetmaster::ca_server into puppetmaster::ssl [puppet] - 10https://gerrit.wikimedia.org/r/641423 [14:10:18] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26475/console" [puppet] - 10https://gerrit.wikimedia.org/r/641423 (owner: 10Jbond) [14:17:18] (03CR) 10Jbond: [V: 03+1] "PCC cloud: https://puppet-compiler.wmflabs.org/compiler1001/26476/" [puppet] - 10https://gerrit.wikimedia.org/r/641423 (owner: 10Jbond) [14:19:02] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetmaster::ssl: fold puppetmaster::ca_server into puppetmaster::ssl [puppet] - 10https://gerrit.wikimedia.org/r/641423 (owner: 10Jbond) [14:20:30] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:21:06] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:22:48] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26477/console" [puppet] - 10https://gerrit.wikimedia.org/r/641412 (owner: 10Elukey) [14:29:19] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to researchers, analytics-privatedata-users and wmf LDAP for 
fkaelin - https://phabricator.wikimedia.org/T267817 (10Ottomata) If the comments say that they are probably true. `researchers` is kinda outdated. [14:33:13] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [14:33:13] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [14:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:47] (03CR) 10Volans: [C: 03+2] scripts: provision server's network fix select [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/641401 (owner: 10Volans) [14:34:52] (03CR) 10Volans: [C: 03+2] scripts: interface allocation skip mgmt if present [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/641409 (owner: 10Volans) [14:35:54] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install mwlog2002.codfw.wmnet - https://phabricator.wikimedia.org/T267272 (10Papaul) [14:36:19] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [14:36:19] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . 
[14:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:24] (03PS5) 10Elukey: profile::analytics::database::meta: set readonly depending on status [puppet] - 10https://gerrit.wikimedia.org/r/641412 [14:36:26] (03PS1) 10Elukey: Remove DB init scripts for hive/oozie [puppet] - 10https://gerrit.wikimedia.org/r/641432 [14:36:28] 10Operations, 10serviceops, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for ICU 63 upgrade - https://phabricator.wikimedia.org/T267145 (10Trizek-WMF) p:05High→03Medium [14:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:10] !log Start of mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log in a tmux at mwmaint1002 (wiki=itwiki; T246539) [14:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:17] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [14:38:07] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26478/console" [puppet] - 10https://gerrit.wikimedia.org/r/641412 (owner: 10Elukey) [14:40:16] (03PS1) 10Volans: sre.hosts.decommission: fix netbox API call [cookbooks] - 10https://gerrit.wikimedia.org/r/641433 [14:40:51] (03PS2) 10Elukey: Remove DB init scripts for hive/oozie [puppet] - 10https://gerrit.wikimedia.org/r/641432 [14:40:53] (03PS6) 10Elukey: profile::analytics::database::meta: set readonly depending on status [puppet] - 10https://gerrit.wikimedia.org/r/641412 [14:41:11] (03CR) 10Elukey: [C: 03+1] "daje!" [cookbooks] - 10https://gerrit.wikimedia.org/r/641433 (owner: 10Volans) [14:41:23] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . 
[14:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:34] 10Operations, 10ops-eqiad: Degraded RAID on an-presto1004 - https://phabricator.wikimedia.org/T267160 (10Cmjohnson) Dell is sending a new backplane and a couple of disks with a technician. I am not sure when they will arrive. I received an email from Dell this morning that they are delayed. @elukey I will giv... [14:41:35] 10Operations, 10ops-codfw, 10netops: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (10CDanis) a:05faidon→03Papaul Papaul, could you please attach atlas-codfw to one of the SCS servers so we can take a look via serial console? Thanks! [14:42:12] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: fix netbox API call [cookbooks] - 10https://gerrit.wikimedia.org/r/641433 (owner: 10Volans) [14:42:34] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:43:25] !log codfw row A: move ganeti and LVS from interface-range to individual term [14:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:00] (03Merged) 10jenkins-bot: sre.hosts.decommission: fix netbox API call [cookbooks] - 10https://gerrit.wikimedia.org/r/641433 (owner: 10Volans) [14:44:22] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [14:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:37] (03CR) 10Vgutierrez: [C: 03+1] "everything seems ready on cp nodes on esams|eqsin to perform the switch :)" [puppet] - 10https://gerrit.wikimedia.org/r/641424 (https://phabricator.wikimedia.org/T261419) (owner: 10BBlack) [14:45:07] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26479/console" [puppet] - 10https://gerrit.wikimedia.org/r/641412 (owner: 10Elukey) [14:46:28] (03CR) 
10Filippo Giunchedi: [V: 03+1] alertmanager: add more correct types for apache require (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/641428 (owner: 10Filippo Giunchedi) [14:46:42] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] alertmanager: add more correct types for apache require [puppet] - 10https://gerrit.wikimedia.org/r/641428 (owner: 10Filippo Giunchedi) [14:47:42] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [14:47:42] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:01] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:33] (03PS9) 10Alexandros Kosiaris: Add recommendation-api helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [14:49:22] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . 
[14:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:59] jouncebot: next [14:50:59] In 2 hour(s) and 9 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201117T1700) [14:51:20] (03CR) 10jerkins-bot: [V: 04-1] Add recommendation-api helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [14:54:27] (03PS1) 10Ayounsi: Manage codfw switches interfaces with Netbox/Homer [homer/public] - 10https://gerrit.wikimedia.org/r/641436 (https://phabricator.wikimedia.org/T250429) [14:54:58] PROBLEM - Host ms-be1022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:55:05] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [14:55:05] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[14:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:22] I'm about to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/639601 [14:55:27] (03CR) 10CDanis: [C: 03+2] Special docroot for thankyou.wp.org (and donate) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639601 (https://phabricator.wikimedia.org/T259312) (owner: 10Ejegg) [14:56:20] (03Merged) 10jenkins-bot: Special docroot for thankyou.wp.org (and donate) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639601 (https://phabricator.wikimedia.org/T259312) (owner: 10Ejegg) [14:57:23] !log stutdown stat1008 for ram expansion [14:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:48] PROBLEM - puppetmaster backend https on puppetmaster2002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [14:58:06] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.02795 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:58:49] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [14:58:49] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[14:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:06] !log cdanis@deploy1001 Synchronized docroot/thankyou: Special docroot for thankyouwiki T259312 d2a20ec57 (duration: 00m 55s) [14:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:14] T259312: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 [15:00:06] <_joe_> cdanis: ok, I'll merge the other change, and disable puppet on all apaches as usual [15:00:38] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01715 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:00:41] * jbond42 loking at puppet issues [15:00:53] _joe_: thanks! [15:01:13] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [15:01:13] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . 
[15:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:57] probably caused by "HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error" for puppetmaster2002 [15:02:52] moritzm: thanks [15:04:16] some cert error in apache logs [15:04:34] the CSR retrieved from the master does not match the agent's public key [15:05:19] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Special docroot for thankyou.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/639771 (https://phabricator.wikimedia.org/T259312) (owner: 10Ejegg) [15:05:34] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:05:35] started 14:52:08 [15:07:08] moritzm: i think cause by this https://gerrit.wikimedia.org/r/c/operations/puppet/+/641423/ not sure why it onl;y affects 2002 [15:07:34] <_joe_> cdanis: you had a httpbb test for this, right? [15:07:47] _joe_: yes, just a minute [15:07:51] <_joe_> sure [15:09:09] (03PS1) 10CDanis: Enable httpbb tests for thankyouwiki Apple app association [puppet] - 10https://gerrit.wikimedia.org/r/641438 (https://phabricator.wikimedia.org/T259312) [15:09:16] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [15:09:16] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . 
[15:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:28] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.003814 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:09:33] jbond42: no idea why it's specific to 2002, maybe some restart kicked in and it would affect the others at some point as well [15:10:05] ack ill revert [15:10:14] (03CR) 10Ottomata: [C: 03+1] "Huh, interesting. Does this just mean you'll have to do it manually whenever you set up a test cluster?" [puppet] - 10https://gerrit.wikimedia.org/r/641432 (owner: 10Elukey) [15:10:19] (03PS10) 10Alexandros Kosiaris: Add recommendation-api helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [15:10:21] (03PS1) 10Alexandros Kosiaris: recommendation-api: Supply more configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/641439 (https://phabricator.wikimedia.org/T241230) [15:10:24] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [15:10:24] (03PS1) 10Jbond: Revert "puppetmaster::ssl: fold puppetmaster::ca_server into puppetmaster::ssl" [puppet] - 10https://gerrit.wikimedia.org/r/641296 [15:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:36] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "puppetmaster::ssl: fold puppetmaster::ca_server into puppetmaster::ssl" [puppet] - 10https://gerrit.wikimedia.org/r/641296 (owner: 10Jbond) [15:11:20] (03CR) 10Ottomata: [C: 03+1] profile::analytics::database::meta: set readonly depending on status [puppet] - 10https://gerrit.wikimedia.org/r/641412 (owner: 10Elukey) [15:12:08] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01271 ge 0.01 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:12:47] _joe_: https://gerrit.wikimedia.org/r/c/operations/puppet/+/641438 [15:13:01] (03CR) 10jerkins-bot: [V: 04-1] Add recommendation-api helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [15:13:09] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Enable httpbb tests for thankyouwiki Apple app association [puppet] - 10https://gerrit.wikimedia.org/r/641438 (https://phabricator.wikimedia.org/T259312) (owner: 10CDanis) [15:13:27] (03CR) 10Elukey: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/641432 (owner: 10Elukey) [15:14:08] PROBLEM - Host stat1008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:14:41] (03PS2) 10JMeybohm: Build a calico-images package [debs/calico] - 10https://gerrit.wikimedia.org/r/640095 (https://phabricator.wikimedia.org/T266893) [15:14:44] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01462 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:16:06] PROBLEM - puppetmaster backend https on puppetmaster1003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [15:16:06] 10Operations, 10serviceops, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for ICU 63 upgrade - https://phabricator.wikimedia.org/T267145 (10RLazarus) The s5 script (shwiki, srwiki) is finished, the rest are still chugging along. 
[15:16:12] PROBLEM - puppetmaster https on puppetmaster1001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [15:16:14] !log disable puppet fleet wide [15:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:21] (03CR) 10Alexandros Kosiaris: [C: 03+2] recommendation-api: Supply more configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/641439 (https://phabricator.wikimedia.org/T241230) (owner: 10Alexandros Kosiaris) [15:16:29] <_joe_> cdanis: our deployment is delayed [15:16:36] _joe_: indeed [15:16:38] jbond42: do you need help? [15:16:38] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:10] cdanis: thanks give me 5 mins to see if i can work things our [15:17:58] RECOVERY - puppetmaster https on puppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 414 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [15:19:37] (03Merged) 10jenkins-bot: recommendation-api: Supply more configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/641439 (https://phabricator.wikimedia.org/T241230) (owner: 10Alexandros Kosiaris) [15:20:18] RECOVERY - Host stat1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.17 ms [15:21:09] (03CR) 10Alexandros Kosiaris: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [15:21:14] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cloudcephmon200[12]-dev - https://phabricator.wikimedia.org/T267378 (10Papaul) [15:21:17] (03PS9) 10Ppchelko: Enable parsoid on api_appserver [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635086 (https://phabricator.wikimedia.org/T265954) [15:22:09] <_joe_> 
Pchelolo: uhm, that would not work right now [15:22:14] <_joe_> so hold your horses [15:22:30] <_joe_> it's time we fix the /w/rest.php routing embarassment though [15:22:35] (03PS3) 10JMeybohm: Build a calico-images package [debs/calico] - 10https://gerrit.wikimedia.org/r/640095 (https://phabricator.wikimedia.org/T266893) [15:22:42] _joe_: I've only rebased it, but I have had the intention to deploy it [15:23:03] _joe_: what's wrong with it? [15:23:40] <_joe_> Pchelolo: given /w/rest.php was deployed without any warning to SRE, it goes where all the url go by default - the appserver cluster [15:23:51] <_joe_> we need to add the direction to the traffic layer [15:24:58] oh. so my config patch actaully will work as designed, it will just work on a wrong server group? :) [15:25:50] <_joe_> yes your patch will work, and it's for the right server group [15:25:56] <_joe_> it will just not get that traffic :P [15:26:00] <_joe_> it should though [15:26:44] yeah, that's what I meant... so, yesterday I have merged a patch for rest api that needs this, trying to get it into this train... [15:27:18] (03CR) 10Elukey: [C: 03+2] Remove DB init scripts for hive/oozie [puppet] - 10https://gerrit.wikimedia.org/r/641432 (owner: 10Elukey) [15:27:18] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [15:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:31] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::analytics::database::meta: set readonly depending on status [puppet] - 10https://gerrit.wikimedia.org/r/641412 (owner: 10Elukey) [15:28:02] so I either should revert it again, or just enable Parsoid without it's own rest api on appserver cluster too [15:28:06] PROBLEM - puppetmaster https on puppetmaster2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [15:28:12] which one should it be _joe_? 
[15:28:23] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add recommendation-api helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [15:28:57] <_joe_> Pchelolo: no the right thing to do is you wait a few days and we get the rest traffic to go to the right place [15:28:59] jbond42: o/ the puppetmaster2001 alert is you right? [15:29:08] yes unfortunatly [15:29:31] 10Operations, 10ops-eqiad, 10SRE-swift-storage: ms-be1022 smart storage battery failure; disk sdb possibly bad - https://phabricator.wikimedia.org/T267870 (10Cmjohnson) a:03wiki_willy I swapped the bbu with one from a decom'd ms-be host. The server shutdown during the boot process. I put the old bbu back... [15:29:48] RECOVERY - puppetmaster https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 415 bytes in 0.152 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [15:30:20] PROBLEM - puppetmaster https on puppetmaster1001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [15:30:24] 10Operations, 10LDAP-Access-Requests: Add gmodena to wmf LDAP group - https://phabricator.wikimedia.org/T267913 (10hnowlan) Thanks! [15:31:02] (03Merged) 10jenkins-bot: Add recommendation-api helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [15:31:22] PROBLEM - puppetmaster backend https on puppetmaster2003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [15:31:30] ok _joe_, I'll revert it again... 
[15:31:44] PROBLEM - puppetmaster backend https on puppetmaster1002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [15:31:46] <_joe_> Pchelolo: did you deploy it? [15:31:56] (03PS1) 10Papaul: DNS: ADD production DNS for rdb2009 and rdb2010 [dns] - 10https://gerrit.wikimedia.org/r/641441 (https://phabricator.wikimedia.org/T266721) [15:32:00] <_joe_> it looks like you didn't :P [15:32:10] RECOVERY - puppetmaster https on puppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 415 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [15:32:21] <_joe_> also merge your change, but you'll have to wait before it has an effect on traffic [15:32:24] (03CR) 10jerkins-bot: [V: 04-1] DNS: ADD production DNS for rdb2009 and rdb2010 [dns] - 10https://gerrit.wikimedia.org/r/641441 (https://phabricator.wikimedia.org/T266721) (owner: 10Papaul) [15:32:36] _joe_: should I make the ticket to reroute rest traffic? [15:32:40] (03PS1) 10Jbond: puppetmaster: disable backends for now [puppet] - 10https://gerrit.wikimedia.org/r/641442 [15:32:49] <_joe_> lemme see if I didn't do it out of embarassment though [15:33:05] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppetmaster: disable backends for now [puppet] - 10https://gerrit.wikimedia.org/r/641442 (owner: 10Jbond) [15:34:13] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:55] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10Cmjohnson) 05Open→03Resolved added the new power supplies (will keep the older ones for spares). Added all the new memory sticks. resolv... 
[15:35:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10Cmjohnson) @Dzahn I would need to move them to a 1G rack, (B1,B3,B5,B6 and B8) [15:36:05] (03PS2) 10Papaul: DNS: ADD production DNS for rdb2009 and rdb2010 [dns] - 10https://gerrit.wikimedia.org/r/641441 (https://phabricator.wikimedia.org/T266721) [15:37:00] (03CR) 10Krinkle: Enable parsoid on api_appserver (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635086 (https://phabricator.wikimedia.org/T265954) (owner: 10Ppchelko) [15:39:17] (03CR) 10Papaul: [C: 03+2] DNS: ADD production DNS for rdb2009 and rdb2010 [dns] - 10https://gerrit.wikimedia.org/r/641441 (https://phabricator.wikimedia.org/T266721) (owner: 10Papaul) [15:39:25] (03PS3) 10Papaul: DNS: ADD production DNS for rdb2009 and rdb2010 [dns] - 10https://gerrit.wikimedia.org/r/641441 (https://phabricator.wikimedia.org/T266721) [15:39:29] (03CR) 10Papaul: [V: 03+2 C: 03+2] DNS: ADD production DNS for rdb2009 and rdb2010 [dns] - 10https://gerrit.wikimedia.org/r/641441 (https://phabricator.wikimedia.org/T266721) (owner: 10Papaul) [15:39:46] 10Operations, 10Wikimedia-Mailing-lists: Request for creation: Wiki Loves Africa Organizers Mailing List - https://phabricator.wikimedia.org/T267083 (10Anthere) Hello. 
Well, we'd love to see it created :) Thank you in advance [15:39:52] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (10Marostegui) [15:40:24] (03PS1) 10Jbond: puppetmaster: add puppetmaster back [puppet] - 10https://gerrit.wikimedia.org/r/641443 [15:40:51] (03CR) 10Jbond: [C: 03+2] puppetmaster: add puppetmaster back [puppet] - 10https://gerrit.wikimedia.org/r/641443 (owner: 10Jbond) [15:41:37] (03PS1) 10Ayounsi: Update eqiad OOB IPs [puppet] - 10https://gerrit.wikimedia.org/r/641444 (https://phabricator.wikimedia.org/T243855) [15:42:32] (03CR) 10Ayounsi: [C: 03+2] Update eqiad OOB IPs [puppet] - 10https://gerrit.wikimedia.org/r/641444 (https://phabricator.wikimedia.org/T243855) (owner: 10Ayounsi) [15:42:46] (03CR) 10Arturo Borrero Gonzalez: "did you test the change in codfw1dev? I can help cherry pick there." [puppet] - 10https://gerrit.wikimedia.org/r/550459 (https://phabricator.wikimedia.org/T237994) (owner: 10Jbond) [15:45:11] (03PS1) 10JMeybohm: aptrepo: add component for future calico packages [puppet] - 10https://gerrit.wikimedia.org/r/641445 (https://phabricator.wikimedia.org/T266893) [15:46:57] (03CR) 10Muehlenhoff: [C: 03+1] aptrepo: add component for future calico packages [puppet] - 10https://gerrit.wikimedia.org/r/641445 (https://phabricator.wikimedia.org/T266893) (owner: 10JMeybohm) [15:49:45] <_joe_> jbond42: do you need any help? [15:49:56] PROBLEM - puppetmaster https on puppetmaster1001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [15:50:04] yes help would be usefull now ( cdanis ) [15:50:31] so the frontends work but im getting an issue with the backends [15:50:43] <_joe_> jbond42: what changed? 
[15:50:52] https://phabricator.wikimedia.org/P13302 [15:51:16] _joe_: this is what i think broke it https://puppetboard.wikimedia.org/report/puppetmaster1003.eqiad.wmnet/360de9c7a2a63527efc01fa052464e9590be795b [15:51:17] (03CR) 10Andrew Bogott: "> Patch Set 10:" [puppet] - 10https://gerrit.wikimedia.org/r/638146 (https://phabricator.wikimedia.org/T267433) (owner: 10Ahmon Dancy) [15:51:19] <_joe_> no I mean, what changed to cause this outage? [15:51:33] 33 kormat │3.eqiad.wmnet/360de9c7a2a63527efc01fa052/puppetmaster100 [15:51:40] https://gerrit.wikimedia.org/r/c/operations/puppet/+/641423/ [15:51:42] RECOVERY - puppetmaster https on puppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 415 bytes in 0.029 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [15:51:48] (03PS1) 10Volans: sre.hosts.decommission: do not fail on missing DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/641447 [15:51:56] <_joe_> oh gosh [15:52:02] <_joe_> can we just rollback? [15:52:10] i have reverted [15:52:26] * volans here if I can help too [15:52:30] and on puppetmaster 1002 i have rm -rf the ssl dir and re-enrolled [15:52:35] and still same issue [15:52:45] <_joe_> there is no way that's enough [15:52:55] (03CR) 10Vgutierrez: [C: 03+1] TLS unified public cert: remove expiring gs-2019 [puppet] - 10https://gerrit.wikimedia.org/r/641425 (https://phabricator.wikimedia.org/T266503) (owner: 10BBlack) [15:52:55] <_joe_> do you have a backup of the ssl dir on 1001? 
[15:53:06] nothing changed on 1001 [15:53:12] or 2001 [15:53:14] <_joe_> I'm perplexed then [15:53:16] frontends are all working [15:53:30] <_joe_> ok lemme see one backend [15:53:36] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Maps (Kartotherian), 10Patch-For-Review, 10Sustainability (Incident Followup): Kartotherian/Maps outage followups, 2020-10-29 - https://phabricator.wikimedia.org/T266807 (10sdkim) a:03sdkim [15:53:40] the only thing that changed on the backends is /var/lib/puppet/server/ssl/private_keys/puppet.pem [15:53:45] <_joe_> jbond42: how did you rollback? [15:53:50] which i didn't think was referenced anywhere [15:53:53] revert [15:54:07] <_joe_> how did you apply it given puppet wasn't working? [15:54:17] and manually deleted /var/lib/puppet/server/ssl/certs/puppet.pem on backends [15:54:52] i have removed the backends from the puppetmaster config [15:54:57] <_joe_> that file is 1 byte now [15:55:02] (03CR) 10Ayounsi: [C: 03+2] Manage codfw switches interfaces with Netbox/Homer [homer/public] - 10https://gerrit.wikimedia.org/r/641436 (https://phabricator.wikimedia.org/T250429) (owner: 10Ayounsi) [15:55:03] <_joe_> ok so [15:55:03] * jbond42 well puppetmaster 1002 is back now [15:55:29] (03Merged) 10jenkins-bot: Manage codfw switches interfaces with Netbox/Homer [homer/public] - 10https://gerrit.wikimedia.org/r/641436 (https://phabricator.wikimedia.org/T250429) (owner: 10Ayounsi) [15:55:36] <_joe_> jbond42: no it's not [15:55:43] <_joe_> it's still spitting errors [15:55:51] sorry i meant 1002 is back in the backend config [15:55:59] <_joe_> ok, remove it for now [15:56:12] PROBLEM - Host ms-be1030 is DOWN: PING CRITICAL - Packet loss = 100% [15:56:25] (03PS1) 10Jbond: Revert "puppetmaster: add puppetmaster back" [puppet] - 10https://gerrit.wikimedia.org/r/641298 [15:56:33] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "puppetmaster: add puppetmaster back" [puppet] - 10https://gerrit.wikimedia.org/r/641298 (owner: 10Jbond) [15:56:44] (03CR) 
10Elukey: [C: 03+1] sre.hosts.decommission: do not fail on missing DNS (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/641447 (owner: 10Volans) [15:57:16] <_joe_> can someone look at ms-be? [15:57:25] on it [15:58:10] ok all backends are offline now [15:58:22] ...per the config on the frontends [15:58:36] <_joe_> jbond42: so advice 1 is don't trust what puppet suggests you to do [15:59:00] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install rdb20[09|10] - https://phabricator.wikimedia.org/T266721 (10Papaul) [15:59:10] <_joe_> point 2 [15:59:20] (03PS10) 10Ppchelko: Enable parsoid on api_appserver [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635086 (https://phabricator.wikimedia.org/T265954) [15:59:23] <_joe_> /var/lib/puppet/server/ssl/public_keys/puppet.pem is 1 byte there [15:59:27] (03CR) 10Ppchelko: Enable parsoid on api_appserver (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635086 (https://phabricator.wikimedia.org/T265954) (owner: 10Ppchelko) [15:59:39] godog: ms-be1030.eqiad.wmnet doesn't respond to ping or ssh, and I'm on the mgmt console but can't get a prompt, checking logs now [16:00:01] (03PS4) 10JMeybohm: Build a calico-images package [debs/calico] - 10https://gerrit.wikimedia.org/r/640095 (https://phabricator.wikimedia.org/T266893) [16:00:02] volans: thank you, must be the ms-be eqiad curse, 1022 is also unreachable [16:00:14] (03CR) 10Ahmon Dancy: "> Patch Set 10:" [puppet] - 10https://gerrit.wikimedia.org/r/638146 (https://phabricator.wikimedia.org/T267433) (owner: 10Ahmon Dancy) [16:00:20] _joe_: where is it 1 byte? 
[16:00:29] <_joe_> puppetmaster1002 [16:00:37] <_joe_> /var/lib/puppet/server/ssl/public_keys/puppet.pem is 1 byte [16:00:47] godog: nothing in show /system1/log1 [16:00:56] in the last few months [16:01:00] (03CR) 10Dave Pifke: "Puppet compiler output: https://puppet-compiler.wmflabs.org/compiler1001/26481/" [puppet] - 10https://gerrit.wikimedia.org/r/639885 (https://phabricator.wikimedia.org/T267269) (owner: 10Dave Pifke) [16:01:01] _joe_: could it have just changed [16:01:02] ls -la /var/lib/puppet/server/ssl/public_keys/puppet.pem [16:01:02] -rw-r--r-- 1 puppet puppet 800 Nov 17 15:29 /var/lib/puppet/server/ssl/public_keys/puppet.pem [16:01:46] <_joe_> sorry, yes, it's 800 bytes [16:01:54] (03CR) 10Dave Pifke: "Puppet compiler output: https://puppet-compiler.wmflabs.org/compiler1002/26482/" [puppet] - 10https://gerrit.wikimedia.org/r/640226 (https://phabricator.wikimedia.org/T267269) (owner: 10Dave Pifke) [16:02:02] volans: yeah nothing too obvious afaics from the host dashboard, except load dropping a little 1.5h ago https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=ms-be1030&var-datasource=thanos&var-cluster=swift [16:02:26] <_joe_> jbond42: but that file is different than on the frontends, and I think your changes messed up the puppetmasters big time [16:02:32] (03PS1) 10JMeybohm: Exit on whatever curl error [debs/kubernetes] (future) - 10https://gerrit.wikimedia.org/r/641450 [16:02:44] volans: from what we said so far it seems the only option is a powercycle (?) [16:02:50] joe that file dosn;t exist on the front ends my change created it. 
i thought i had allready deleted it [16:02:52] <_joe_> jbond42: also I can't load that cert with openssl [16:02:55] (03CR) 10Dave Pifke: "Puppet compiler output: https://puppet-compiler.wmflabs.org/compiler1001/26483/" [puppet] - 10https://gerrit.wikimedia.org/r/639216 (https://phabricator.wikimedia.org/T267269) (owner: 10Dave Pifke) [16:03:06] jouncebot: is a pkey not a x509 [16:03:08] godog: if you want I can do it, I'm already in the ilo [16:03:08] <_joe_> jbond42: yes, because on the frontends, it's under /certs/ [16:03:12] <_joe_> which is where it should be [16:03:14] I don't have immediate ideas [16:03:21] volans: yes please, thank you [16:03:25] <_joe_> jbond42: can you point me to your actual change? I can't understand what happened here [16:03:38] joe_the x509 is uncder certs [16:03:55] https://gerrit.wikimedia.org/r/c/operations/puppet/+/641423/7/modules/puppetmaster/manifests/ssl.pp [16:04:01] !log powercycle ms-be1030.eqiad.wmnet, unresponsive to ping/ssh, no prompt in console, nothing in hw logs [16:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:21] _joe_: I think it was https://gerrit.wikimedia.org/r/c/operations/puppet/+/641423 [16:04:36] godog: interestingly enough [16:04:37] power: server power is currently: Off [16:04:52] I bet has no prompt :-P [16:05:05] haha indeed [16:05:09] booting now [16:05:43] pcc for that change https://puppet-compiler.wmflabs.org/compiler1001/26475/ [16:06:30] <_joe_> and you reverted it [16:06:46] (03PS2) 10Volans: sre.hosts.decommission: do not fail on missing DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/641447 [16:06:46] <_joe_> which means whatever is happening wasn't rolled back by applying puppet afterwards [16:06:55] (03CR) 10Volans: "addressed comment" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/641447 (owner: 10Volans) [16:08:05] _joe_: yes [16:08:16] <_joe_> jbond42: you don't have any backend where I could see a previous state? 
[16:08:46] RECOVERY - Host ms-be1030 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [16:09:03] _joe_: no however i think we should be able to delete the /var/lib/puppet/server/ (backup) and puppet may well be able to fix it [16:09:17] <_joe_> jbond42: I think I found it [16:09:35] <_joe_> hostcert = /var/lib/puppet/server/ssl/certs/puppetmaster1002.eqiad.wmnet.pem [16:09:45] <_joe_> # ls -la /var/lib/puppet/server/ssl/certs/puppetmaster1002.eqiad.wmnet.pem [16:09:47] <_joe_> ls: cannot access '/var/lib/puppet/server/ssl/certs/puppetmaster1002.eqiad.wmnet.pem': No such file or directory [16:09:53] (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 10:" [puppet] - 10https://gerrit.wikimedia.org/r/638146 (https://phabricator.wikimedia.org/T267433) (owner: 10Ahmon Dancy) [16:10:06] in the puppet.conf? [16:10:26] godog: anything to do on the host now that is back? [16:10:27] <_joe_> yes [16:10:36] <_joe_> puppet.conf points to server [16:10:49] <_joe_> and we only have the cert under /var/lib/puppet/ssl [16:11:06] <_joe_> and it was recreated today [16:11:09] <_joe_> 🤔 [16:11:20] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:11:26] on 1002 i recreated /var/lib/puppet/ssl [16:11:35] there is a backup in /root [16:11:42] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:11:44] volans: nothing to do no, thank you for your help though! 
[16:11:49] <_joe_> /var/lib/puppet/server/ssl/c is what we need jbond42 [16:12:01] <_joe_> so lemme try one thing [16:12:24] np, anytime [16:12:29] _joe_: without server is what we need (this is what i was trying to update) [16:12:34] <_joe_> wait a sec [16:13:04] (03CR) 10JMeybohm: "@Alex: I've updated this patch according to the outcome today. I also rewrote the "How to package" docs at https://wikitech.wikimedia.org/" [debs/calico] - 10https://gerrit.wikimedia.org/r/640095 (https://phabricator.wikimedia.org/T266893) (owner: 10JMeybohm) [16:13:10] ahh wait a minute i remember [16:13:22] let me try and find something, there is an old [16:13:24] <_joe_> jbond42: so now I copied the certs back under /server [16:13:34] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime [16:13:35] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:36] https://gerrit.wikimedia.org/r/c/operations/puppet/+/386666 [16:14:37] <_joe_> but that wasn't enough [16:16:28] <_joe_> jbond42: so lemme re-try stuff on 1002 [16:16:38] <_joe_> but I fear we might need to reimage them all [16:16:53] joe i think that puppet created /var/lib/puppet/server/ssl/public_keys/puppet.pem i think we need to delete that and the other puppet.pem files under server [16:17:50] <_joe_> I think you're quite wrong, but please test your hypothesis on another backend [16:17:58] ack [16:18:52] (03CR) 10Elukey: [C: 03+1] sre.hosts.decommission: do not fail on missing DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/641447 (owner: 10Volans) [16:19:10] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: do not fail on missing DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/641447 (owner: 10Volans) [16:19:32] RECOVERY - puppetmaster backend https on puppetmaster1002 is OK: HTTP OK: Status line output matched 400 - 
414 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [16:19:48] (03CR) 10Ppchelko: [C: 03+2] JobQueue: Move LocalGlobalUserPageCacheUpdateJob to it's own queue. [deployment-charts] - 10https://gerrit.wikimedia.org/r/640446 (https://phabricator.wikimedia.org/T267520) (owner: 10Ppchelko) [16:19:54] <_joe_> jbond42: ok I think I got what happened [16:19:56] RECOVERY - puppetmaster backend https on puppetmaster1003 is OK: HTTP OK: Status line output matched 400 - 414 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [16:20:05] (03CR) 10Ppchelko: [C: 03+2] cpjobqueue: Increase cirrusSearchCheckerJob concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/641251 (https://phabricator.wikimedia.org/T266762) (owner: 10Ebernhardson) [16:20:21] <_joe_> I just changed puppet.conf to point to /var/lib/puppet/ssl and everything is working [16:20:34] (03PS1) 10Milimetric: dumps/analytics: Deprecate pagecounts-ez [puppet] - 10https://gerrit.wikimedia.org/r/641451 (https://phabricator.wikimedia.org/T267575) [16:20:35] <_joe_> because I think you recreated the host certs [16:20:37] (03Merged) 10jenkins-bot: sre.hosts.decommission: do not fail on missing DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/641447 (owner: 10Volans) [16:20:43] _joe_: sudo find /var/lib/puppet/server/ssl/ -name puppet.pem -delete and apache restart has also fixed it [16:20:51] on 1003 [16:21:06] <_joe_> uhm lemme see one thing on 1003 [16:21:17] i need to re-read https://phabricator.wikimedia.org/T179099 but i think that the absence of files causes strange behaviour [16:21:18] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Fix typo in .install file [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/641422 (owner: 10Muehlenhoff) [16:21:50] (03CR) 10Milimetric: "stopgap measure until we announce the new dataset more widely" [puppet] - 10https://gerrit.wikimedia.org/r/641451 
(https://phabricator.wikimedia.org/T267575) (owner: 10Milimetric) [16:22:03] _joe_: from you https://phabricator.wikimedia.org/T179099#3747304 [16:22:31] !log uploaded zeromq3 4.0.5+dfsg-2+deb8u2+wmf1 to jessie-wikimedia [16:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:41] <_joe_> jbond42: what did you remove from 1003? [16:22:44] (03Merged) 10jenkins-bot: JobQueue: Move LocalGlobalUserPageCacheUpdateJob to it's own queue. [deployment-charts] - 10https://gerrit.wikimedia.org/r/640446 (https://phabricator.wikimedia.org/T267520) (owner: 10Ppchelko) [16:22:50] /var/lib/puppet/server/ssl/private_keys/puppet.pem [16:22:50] /var/lib/puppet/server/ssl/certificate_requests/puppet.pem [16:22:50] /var/lib/puppet/server/ssl/public_keys/puppet.pem [16:23:23] <_joe_> so not /var/lib/puppet/server/ssl/certs/puppetmaster1003.eqiad.wmnet.pem [16:23:31] no [16:23:40] (03PS2) 10Ppchelko: cpjobqueue: Increase cirrusSearchCheckerJob concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/641251 (https://phabricator.wikimedia.org/T266762) (owner: 10Ebernhardson) [16:23:44] (03CR) 10Ppchelko: cpjobqueue: Increase cirrusSearchCheckerJob concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/641251 (https://phabricator.wikimedia.org/T266762) (owner: 10Ebernhardson) [16:23:48] (03CR) 10Ppchelko: [C: 03+2] cpjobqueue: Increase cirrusSearchCheckerJob concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/641251 (https://phabricator.wikimedia.org/T266762) (owner: 10Ebernhardson) [16:24:09] <_joe_> jbond42: those are still referenced in puppet.conf though [16:24:14] hmm yes slightly different [16:24:24] <_joe_> and /var/lib/puppet/server/ssl/private_keys/puppet.pem is still there [16:25:04] <_joe_> so on 1002 I just changed the settings in the [master] section [16:25:13] <_joe_> to point to /var/lib/puppet/ssl too [16:26:05] <_joe_> I also still see /var/lib/puppet/server/ssl/public_keys/puppet.pem on 1003 
[16:26:09] _joe_: from the timestamps my guess is that a puppet agent run recreated those files [16:26:11] <_joe_> so not sure what you removed [16:26:18] (03Merged) 10jenkins-bot: cpjobqueue: Increase cirrusSearchCheckerJob concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/641251 (https://phabricator.wikimedia.org/T266762) (owner: 10Ebernhardson) [16:26:47] but they are not in the output so it would have been something in the puppet agent preloading stuff [16:27:01] <_joe_> how can a puppet run do that? [16:27:29] <_joe_> the agent section doesn't override ssldir [16:27:33] _joe_: if you look at the agent debug log it does a bunch of stuff at the begning of the run which is not in the main output [16:27:36] <_joe_> and that's set to /var/lib/puppet/ssl [16:27:48] let me try and recreate on puppetmaster 2003 [16:27:58] <_joe_> do that with apache stopped [16:28:35] ack [16:28:44] <_joe_> I'm going to try the same on 1002 [16:29:48] !log clarakosi@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [16:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:04] <_joe_> ok so, 1002 is broken again, good [16:30:16] _joe_: is it ok that we're deploying things on k8s? 
[16:30:22] <_joe_> Pchelolo: yes [16:30:27] thank you [16:31:30] RECOVERY - puppetmaster backend https on puppetmaster2003 is OK: HTTP OK: Status line output matched 400 - 414 bytes in 0.146 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [16:31:59] _joe_: https://phabricator.wikimedia.org/P13303 [16:32:03] <_joe_> jbond42: so on 1002 I removed the whole server (I made a backup), and that didn't solve the issue [16:32:39] <_joe_> jbond42: uhm puppet-master is definitely the wrong thing to start [16:32:45] <_joe_> that's the standalone puppetmaster [16:33:06] <_joe_> oh sorry I read wrongly [16:33:06] !log clarakosi@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [16:33:07] _joe_: thats from starting apache [16:33:11] <_joe_> yeah sorry [16:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:30] PROBLEM - puppetmaster backend https on puppetmaster1002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [16:33:33] (03PS1) 10Ahmon Dancy: Install emacs-nox on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/641454 [16:33:34] <_joe_> so ok, as I thought, it's the puppetmaster process that recreates trhem [16:33:37] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10RobH) >>! In T261405#6624984, @wiki_willy wrote: > @Jclark-ctr - can you double-check the S/N for db1139. We're getting the following Netbox error: >... 
[16:34:59] _joe_: you happy for me to apply this fix to 2002 [16:35:14] <_joe_> yes [16:35:18] RECOVERY - puppetmaster backend https on puppetmaster1002 is OK: HTTP OK: Status line output matched 400 - 417 bytes in 1.726 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [16:35:27] <_joe_> the problem I found was of permissions, which is fucked up in its own way [16:35:34] <_joe_> but we can talk about that at another time [16:35:38] <_joe_> 1002 is fixed too [16:36:10] _joe_: ack thanks ill go through make sure evenything looks healthy and re-enable puppet [16:36:23] <_joe_> I reenabled puppet on 1002 [16:36:26] <_joe_> and it runs clean [16:36:34] !log clarakosi@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [16:36:36] ack thanks [16:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:46] ACKNOWLEDGEMENT - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% ayounsi Waiting for puppet to run [16:36:46] ACKNOWLEDGEMENT - Host mr1-eqiad.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2607:f6f0:205::153) ayounsi Waiting for puppet to run [16:36:54] RECOVERY - puppetmaster backend https on puppetmaster2002 is OK: HTTP OK: Status line output matched 400 - 417 bytes in 1.992 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [16:37:15] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install rdb20[09|10] - https://phabricator.wikimedia.org/T266721 (10Papaul) [16:37:25] <_joe_> jbond42: should we put the backends back into rotation, and reenable puppet? 
it will fail on the mw servers as I was in the midst of testing a change [16:37:52] yes ill do the change now just wanted to run puppet a-on all masters first [16:39:10] (03PS1) 10Jbond: puppetmasters: add backends back into config [puppet] - 10https://gerrit.wikimedia.org/r/641456 [16:39:21] cc _joe_ [16:39:44] PROBLEM - HP RAID on ms-be1030 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [16:39:45] (03CR) 10Jbond: [C: 03+2] puppetmasters: add backends back into config [puppet] - 10https://gerrit.wikimedia.org/r/641456 (owner: 10Jbond) [16:39:46] ACKNOWLEDGEMENT - HP RAID on ms-be1030 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T268036 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Inform [16:39:49] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268036 (10ops-monitoring-bot) [16:40:35] (03CR) 10Andrew Bogott: "> Patch Set 10:" [puppet] - 10https://gerrit.wikimedia.org/r/638146 (https://phabricator.wikimedia.org/T267433) (owner: 10Ahmon Dancy) [16:42:34] !log re-enable puppet fleet wide [16:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:27] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268036 (10fgiunchedi) This host went down earlier today, it is missing a hw raid firmware upgrade so I'll do that just in case. 
This looks like a BBU needing a replacement tho [16:45:08] (03CR) 10Lars Wirzenius: [C: 03+1] Install emacs-nox on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/641454 (owner: 10Ahmon Dancy) [16:45:30] jbond42: is puppet back to normal? [16:46:01] volans: should be just running on all failed nodes now [16:46:06] ok, thx [16:46:13] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [16:46:14] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:55] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [16:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:31] 10Operations, 10Puppet: Investigate why the exictence of files under the server ssl dir foobars puppet - https://phabricator.wikimedia.org/T268040 (10jbond) p:05Triage→03Medium [16:48:55] _joe_: i have created this task ^^ to investigate can you dump any thoughts there iwhen you get a sec [16:49:06] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:50:16] (03PS2) 10Dave Pifke: webperf: add fake keys for WebPageTest [labs/private] - 10https://gerrit.wikimedia.org/r/635859 (https://phabricator.wikimedia.org/T262962) [16:50:50] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10RobH) @Jclark-ctr is taking this over, as the mainboard swap did not fix the memory and CPU errors. phab won't let me upload the IML file, so emailing... 
[16:52:09] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.00445 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [16:52:11] 10Operations, 10Puppet, 10Patch-For-Review: Puppet Proposal to remove require_package - https://phabricator.wikimedia.org/T266479 (10dancy) [16:53:24] (03PS2) 10Dzahn: site: introduce mwdebug1003 as debug server on buster [puppet] - 10https://gerrit.wikimedia.org/r/638218 (https://phabricator.wikimedia.org/T245757) [16:53:26] (03CR) 10Hnowlan: [C: 03+1] site: introduce mwdebug1003 as debug server on buster [puppet] - 10https://gerrit.wikimedia.org/r/638218 (https://phabricator.wikimedia.org/T245757) (owner: 10Dzahn) [16:53:39] 10Operations, 10Puppet, 10Patch-For-Review: Puppet Proposal to remove require_package - https://phabricator.wikimedia.org/T266479 (10dancy) [16:53:45] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268036 (10fgiunchedi) @Cmjohnson @Jclark-ctr looks like BBU replacement for this host (out of warranty AFAICS), thank you! ` Degraded Performance Optimization: Disabled Inconsistency Repair Policy: Disabled Wa... [16:54:10] (03CR) 10Dzahn: [C: 03+2] site: introduce mwdebug1003 as debug server on buster [puppet] - 10https://gerrit.wikimedia.org/r/638218 (https://phabricator.wikimedia.org/T245757) (owner: 10Dzahn) [16:57:46] 10Operations, 10Android-app-Bugs, 10Fundraising-Backlog, 10Thank-You-Page, and 5 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (10Joe) the apache change has been merged, and tested to work with the renewed httpbb test suite. It will be deployed ev... [16:58:54] (03CR) 10Jbond: [C: 03+2] Install emacs-nox on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/641454 (owner: 10Ahmon Dancy) [16:59:14] <_joe_> dancy: so you're part of the elisp cult too? 
[16:59:20] <_joe_> :) [16:59:50] haha, I suppose so. [16:59:55] Although I almost never write elisp code. [16:59:58] <_joe_> I kind-of sold out lately, and switched to vscode for some stuff [17:00:04] jbond42 and cdanis: My dear minions, it's time we take the moon! Just kidding. Time for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201117T1700). [17:00:05] dpifke: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:06] * dancy gasps [17:00:24] <_joe_> dancy: same, "almost never" is what sets us away from the plebs using vim [17:00:37] <_joe_> apart, not away [17:00:43] * _joe_ needs a break [17:00:59] Here. These puppet patches require some coordination with scap deploy. [17:01:01] dpifke: sorry, missed your changes, looking at them now [17:03:23] dpifke: in https://gerrit.wikimedia.org/r/c/operations/puppet/+/639885/1/modules/arclamp/templates/initscripts/arclamp-log.systemd.erb you are missing /usr/bin/python3 in ExecStart, is that intentional? [17:03:48] Yes, the idea is to use whatever interpreter is specified by #!. [17:04:07] (Since it should have exec bit set by git.) [17:04:19] ack all looks good to me, happy to merge; what order do we need to do things in? 
[17:04:20] PROBLEM - MariaDB Replica IO: analytics-meta-replica on an-coord1002 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:04:37] (03CR) 10Jbond: [C: 03+2] webperf: convert statsv to use Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/639216 (https://phabricator.wikimedia.org/T267269) (owner: 10Dave Pifke) [17:04:39] (03CR) 10Jbond: [C: 03+2] coal: use Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/640226 (https://phabricator.wikimedia.org/T267269) (owner: 10Dave Pifke) [17:04:41] (03CR) 10Jbond: [C: 03+2] arclamp: Use Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/639885 (https://phabricator.wikimedia.org/T267269) (owner: 10Dave Pifke) [17:04:46] (03CR) 10Jbond: [C: 03+2] webperf: change navtiming to use Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/639197 (https://phabricator.wikimedia.org/T267269) (owner: 10Dave Pifke) [17:05:28] I just needed the merge. :) I have perms to run the agent, and can do the scap deploys to fix what'll break when it switches to Py3. [17:05:40] ack ones sec then [17:05:49] (03CR) 10Herron: [C: 03+2] admin: add ldap-only entry for kassiameq [puppet] - 10https://gerrit.wikimedia.org/r/641248 (https://phabricator.wikimedia.org/T267961) (owner: 10Herron) [17:06:29] herron: can i merge yours [17:06:34] yes please do [17:06:35] the an-coord1002 alert is probably a monitoring issue, the replica is up, will check after meetings :( [17:07:05] dpifke: herron: merged! [17:07:10] ty ty [17:07:12] Thanks! 
[17:07:41] 10Operations, 10Traffic, 10serviceops, 10Platform Team Workboards (Green): MW REST API should be routed to api_appserver MW cluster - https://phabricator.wikimedia.org/T268043 (10Pchelolo) [17:08:50] !log dpifke@deploy1001 Started deploy [performance/coal@5a32eb2]: (no justification provided) [17:08:55] !log dpifke@deploy1001 Finished deploy [performance/coal@5a32eb2]: (no justification provided) (duration: 00m 04s) [17:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:02] PROBLEM - MariaDB Replica SQL: analytics-meta-replica on an-coord1002 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:57] downtiming [17:11:03] 10Operations, 10Puppet: Investigate why the exictence of files under the server ssl dir foobars puppet - https://phabricator.wikimedia.org/T268040 (10jbond) Tagging https://gerrit.wikimedia.org/r/c/operations/puppet/+/386666 as although its slightly different it seems to be around the same bit of code [17:11:08] PROBLEM - Check systemd state on webperf1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:26] PROBLEM - statsv process on webperf2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args statsv https://wikitech.wikimedia.org/wiki/Graphite%23statsv [17:12:55] ^ synchronization issue, fix in progress [17:13:29] 10Operations, 10LDAP-Access-Requests: Request Superset Access (LDAP group 'wmf') for KEchavarriqueen - https://phabricator.wikimedia.org/T267961 (10herron) 05Open→03Resolved a:03herron Hi @KEchavarriqueen, the requested group access has been granted. I'll transition this to closed now, but please don't... 
[17:13:38] PROBLEM - Check systemd state on webperf2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:39] (03PS1) 10Ppchelko: Revert "Re-apply "Use parsoid directly in /page/html handler"" [core] (wmf/1.36.0-wmf.18) - 10https://gerrit.wikimedia.org/r/641300 [17:14:49] (03CR) 10Volans: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/641284 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [17:14:51] (03CR) 10Ppchelko: [C: 03+2] Revert "Re-apply "Use parsoid directly in /page/html handler"" [core] (wmf/1.36.0-wmf.18) - 10https://gerrit.wikimedia.org/r/641300 (owner: 10Ppchelko) [17:15:11] !log dpifke@deploy1001 Started deploy [performance/arc-lamp@55fccc6]: (no justification provided) [17:15:16] !log dpifke@deploy1001 Finished deploy [performance/arc-lamp@55fccc6]: (no justification provided) (duration: 00m 04s) [17:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:06] 10Operations, 10vm-requests: eqiad: 1 VM request for mwdebug - https://phabricator.wikimedia.org/T268044 (10Dzahn) [17:16:36] !log dpifke@deploy1001 Started deploy [performance/arc-lamp@55d4d41]: (no justification provided) [17:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:40] !log dpifke@deploy1001 Finished deploy [performance/arc-lamp@55d4d41]: (no justification provided) (duration: 00m 04s) [17:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:50] 10Operations, 10vm-requests: eqiad: 1 VM request for mwdebug (mwdebug1003) - https://phabricator.wikimedia.org/T268044 (10Dzahn) a:03Dzahn [17:17:45] (03PS1) 10Ayounsi: Cable report, log VC links with no ID as warning only [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/641458 [17:17:57] (03PS2) 10Herron: 
admin: add Fabian Kaelin 'fab' account, and group memberships [puppet] - 10https://gerrit.wikimedia.org/r/641241 (https://phabricator.wikimedia.org/T267817) [17:18:48] 10Operations, 10vm-requests: eqiad: 1 VM request for mwdebug (mwdebug1003) - https://phabricator.wikimedia.org/T268044 (10Dzahn) @hnowlan This is FYI, I should have mentioned this part in our meeting we just had. We are supposed to formally request the VM with the form at: https://wikitech.wikimedia.org/wiki/... [17:18:58] RECOVERY - MariaDB Replica SQL: analytics-meta-replica on an-coord1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:19:05] gkkk [17:19:07] goood [17:19:10] RECOVERY - MariaDB Replica IO: analytics-meta-replica on an-coord1002 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:19:18] 10Operations, 10vm-requests: eqiad: 1 VM request for mwdebug (mwdebug1003) - https://phabricator.wikimedia.org/T268044 (10Dzahn) [17:19:29] 10Operations, 10serviceops, 10vm-requests: eqiad: 1 VM request for mwdebug (mwdebug1003) - https://phabricator.wikimedia.org/T268044 (10Dzahn) [17:19:38] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and wmf LDAP for fkaelin - https://phabricator.wikimedia.org/T267817 (10herron) [17:19:56] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [17:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:10] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and wmf LDAP for fkaelin - https://phabricator.wikimedia.org/T267817 (10herron) >>! In T267817#6627468, @Ottomata wrote: > If the comments say that they are probably true. `researchers` is kinda outdate... 
[17:20:45] (03PS3) 10Herron: admin: add Fabian Kaelin 'fab' account, and group memberships [puppet] - 10https://gerrit.wikimedia.org/r/641241 (https://phabricator.wikimedia.org/T267817) [17:21:00] RECOVERY - Check systemd state on an-coord1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:21:04] 10Operations, 10serviceops, 10vm-requests: eqiad: 1 VM request for mwdebug (mwdebug1003) - https://phabricator.wikimedia.org/T268044 (10Dzahn) ` +mwdebug1003 1H IN A 10.64.32.9 +mwdebug1003 1H IN AAAA 2620:0:861:103:10:64:32:9 ` ` MAC address for... [17:21:20] 10Operations, 10LDAP-Access-Requests: Add STran to `wmf` LDAP group - https://phabricator.wikimedia.org/T267968 (10aezell) Approved by me, Tran's manager, as well. [17:21:39] 10Operations, 10Puppet: Investigate why the exictence of files under the server ssl dir foobars puppet - https://phabricator.wikimedia.org/T268040 (10jbond) checking the following one the backends shows that the keys are all different which points to the puppet master process generating theses keys when it fir... [17:22:14] 10Operations, 10serviceops, 10vm-requests: eqiad: 1 VM request for mwdebug (mwdebug1003) - https://phabricator.wikimedia.org/T268044 (10Dzahn) 05Open→03Resolved VM created with cookbook sre.ganeti.makevm. Added to puppet with "insetup" role. https://gerrit.wikimedia.org/r/c/operations/puppet/+/638218 [17:22:24] (03CR) 10Ahmon Dancy: [C: 03+2] Branch commit for wmf/1.36.0-wmf.18 [core] (wmf/1.36.0-wmf.18) - 10https://gerrit.wikimedia.org/r/641319 (https://phabricator.wikimedia.org/T263184) (owner: 10TrainBranchBot) [17:22:34] PROBLEM - statsv process on webperf1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args statsv https://wikitech.wikimedia.org/wiki/Graphite%23statsv [17:22:37] millimetric or ottomotto: one of you around to +2 https://gerrit.wikimedia.org/r/c/analytics/statsv/+/639223? 
[17:22:39] (03CR) 10Herron: [C: 03+2] admin: add Fabian Kaelin 'fab' account, and group memberships [puppet] - 10https://gerrit.wikimedia.org/r/641241 (https://phabricator.wikimedia.org/T267817) (owner: 10Herron) [17:23:59] Man, my use of vowels sucks today. :) Trying again: [17:24:15] milimetric or ottomata, one of you around to +2 https://gerrit.wikimedia.org/r/c/analytics/statsv/+/639223? [17:25:41] (03PS1) 10Ottomata: wgEventStreamsDefaultSettings in beta should only set eqiad as topic prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641459 (https://phabricator.wikimedia.org/T253069) [17:25:52] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005718 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [17:26:04] Thanks. Sorry, didn't realize I didn't have rights in that repo to merge at the same time the puppet patch landed. [17:26:46] (03CR) 10jerkins-bot: [V: 04-1] wgEventStreamsDefaultSettings in beta should only set eqiad as topic prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641459 (https://phabricator.wikimedia.org/T253069) (owner: 10Ottomata) [17:27:06] !log dpifke@deploy1001 Started deploy [statsv/statsv@873ea90]: (no justification provided) [17:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:11] !log dpifke@deploy1001 Finished deploy [statsv/statsv@873ea90]: (no justification provided) (duration: 00m 05s) [17:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:24] (03PS2) 10Ottomata: wgEventStreamsDefaultSettings in beta should only set eqiad as topic prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641459 (https://phabricator.wikimedia.org/T253069) [17:29:35] (03Abandoned) 10Dzahn: poolcounter: do not attempt to install python3-poolcounter on jessie [puppet] - 10https://gerrit.wikimedia.org/r/641307 (owner: 10Dzahn) [17:30:30] 
10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and wmf LDAP for fkaelin - https://phabricator.wikimedia.org/T267817 (10herron) [17:33:45] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and wmf LDAP for fkaelin - https://phabricator.wikimedia.org/T267817 (10herron) 05Open→03Resolved a:03herron The requested shell and LDAP access has been granted, and will be fully active within 30... [17:34:23] !log dpifke@deploy1001 Started deploy [statsv/statsv@249d073]: (no justification provided) [17:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:28] !log dpifke@deploy1001 Finished deploy [statsv/statsv@249d073]: (no justification provided) (duration: 00m 05s) [17:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:58] RECOVERY - statsv process on webperf2001 is OK: PROCS OK: 3 processes with command name python3, args statsv https://wikitech.wikimedia.org/wiki/Graphite%23statsv [17:36:02] RECOVERY - statsv process on webperf1001 is OK: PROCS OK: 1 process with command name python3, args statsv https://wikitech.wikimedia.org/wiki/Graphite%23statsv [17:37:34] !log dpifke@deploy1001 Started deploy [performance/coal@43b91df]: (no justification provided) [17:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:40] !log dpifke@deploy1001 Finished deploy [performance/coal@43b91df]: (no justification provided) (duration: 00m 06s) [17:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:00] (03Merged) 10jenkins-bot: Revert "Re-apply "Use parsoid directly in /page/html handler"" [core] (wmf/1.36.0-wmf.18) - 10https://gerrit.wikimedia.org/r/641300 (owner: 10Ppchelko) [17:38:06] RECOVERY - Check systemd state on webperf2001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:38:24] RECOVERY - Check systemd state on webperf1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:38:47] (03CR) 10ArielGlenn: dumps/analytics: Deprecate pagecounts-ez (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/641451 (https://phabricator.wikimedia.org/T267575) (owner: 10Milimetric) [17:38:50] 10Operations, 10ops-eqiad, 10SRE-swift-storage: ms-be1022 smart storage battery failure; disk sdb possibly bad - https://phabricator.wikimedia.org/T267870 (10fgiunchedi) Update: the host isn't coming back (both mgmt and ssh) but yes given we'll need a BBU for ms-be1030 (T268036) too I'd say let's order some... [17:39:07] (03PS4) 10Andrew Bogott: OpenStack: add initial config files for version Stein [puppet] - 10https://gerrit.wikimedia.org/r/641230 (https://phabricator.wikimedia.org/T261134) [17:39:09] (03PS4) 10Andrew Bogott: OpenStack: add server packages for version Stein [puppet] - 10https://gerrit.wikimedia.org/r/641231 (https://phabricator.wikimedia.org/T261134) [17:39:11] (03PS4) 10Andrew Bogott: OpenStack: add client packages for Stein [puppet] - 10https://gerrit.wikimedia.org/r/641233 (https://phabricator.wikimedia.org/T261134) [17:39:13] (03PS4) 10Andrew Bogott: OpenStack Designate: updates for version Stein [puppet] - 10https://gerrit.wikimedia.org/r/641232 (https://phabricator.wikimedia.org/T261134) [17:39:15] (03PS1) 10Andrew Bogott: OpenStack Nova Compute: configure libvirt_cpu_model from hiera [puppet] - 10https://gerrit.wikimedia.org/r/641461 [17:39:17] (03PS1) 10Andrew Bogott: OpenStack Nova Compute: set common cpu type to Haswell [puppet] - 10https://gerrit.wikimedia.org/r/641462 [17:41:27] (03CR) 10Volans: [C: 03+1] "SGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/641458 (owner: 10Ayounsi) [17:44:05] (03CR) 10Nskaggs: [C: 03+1] "Thanks! 
I double checked the capitalization for `Nskaggs`" [puppet] - 10https://gerrit.wikimedia.org/r/640388 (https://phabricator.wikimedia.org/T266068) (owner: 10David Caro) [17:46:52] (03PS1) 10Herron: admin: create ldap_only account for stran [puppet] - 10https://gerrit.wikimedia.org/r/641463 (https://phabricator.wikimedia.org/T267968) [17:47:00] (03Merged) 10jenkins-bot: Branch commit for wmf/1.36.0-wmf.18 [core] (wmf/1.36.0-wmf.18) - 10https://gerrit.wikimedia.org/r/641319 (https://phabricator.wikimedia.org/T263184) (owner: 10TrainBranchBot) [17:47:19] (03PS2) 10Milimetric: dumps/analytics: Deprecate pagecounts-ez [puppet] - 10https://gerrit.wikimedia.org/r/641451 (https://phabricator.wikimedia.org/T267575) [17:47:21] (03PS2) 10Andrew Bogott: OpenStack Nova Compute: configure libvirt_cpu_model from hiera [puppet] - 10https://gerrit.wikimedia.org/r/641461 [17:47:23] (03PS2) 10Andrew Bogott: OpenStack Nova Compute: set common cpu type to Haswell [puppet] - 10https://gerrit.wikimedia.org/r/641462 [17:47:25] (03PS5) 10Andrew Bogott: OpenStack: add initial config files for version Stein [puppet] - 10https://gerrit.wikimedia.org/r/641230 (https://phabricator.wikimedia.org/T261134) [17:47:27] (03PS5) 10Andrew Bogott: OpenStack: add server packages for version Stein [puppet] - 10https://gerrit.wikimedia.org/r/641231 (https://phabricator.wikimedia.org/T261134) [17:47:29] (03PS5) 10Andrew Bogott: OpenStack: add client packages for Stein [puppet] - 10https://gerrit.wikimedia.org/r/641233 (https://phabricator.wikimedia.org/T261134) [17:47:31] (03PS5) 10Andrew Bogott: OpenStack Designate: updates for version Stein [puppet] - 10https://gerrit.wikimedia.org/r/641232 (https://phabricator.wikimedia.org/T261134) [17:47:35] (03CR) 10Milimetric: dumps/analytics: Deprecate pagecounts-ez (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/641451 (https://phabricator.wikimedia.org/T267575) (owner: 10Milimetric) [17:49:34] (03CR) 10Andrew Bogott: [C: 03+2] 
"https://puppet-compiler.wmflabs.org/compiler1003/26486/" [puppet] - 10https://gerrit.wikimedia.org/r/641461 (owner: 10Andrew Bogott) [17:52:32] (03PS1) 10Herron: admin: create ldap_only entry for ijethrobt [puppet] - 10https://gerrit.wikimedia.org/r/641465 (https://phabricator.wikimedia.org/T267962) [17:55:29] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Nova Compute: set common cpu type to Haswell [puppet] - 10https://gerrit.wikimedia.org/r/641462 (owner: 10Andrew Bogott) [17:55:41] (03PS3) 10Andrew Bogott: OpenStack Nova Compute: set common cpu type to Haswell [puppet] - 10https://gerrit.wikimedia.org/r/641462 [17:56:38] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install rdb20[09|10] - https://phabricator.wikimedia.org/T266721 (10Papaul) [17:57:06] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10Jclark-ctr) @jcrespo will need downtime for host to remap dimms per HPE [17:57:29] 10Operations, 10Puppet: Investigate why the exictence of files under the server ssl dir foobars puppet - https://phabricator.wikimedia.org/T268040 (10jbond) I have done some initial testing an i think we should just drop the ssl config from the puppetmaster backend servers and let it use the default. the back... [17:58:57] !log dpifke@deploy1001 Started deploy [performance/navtiming@8eaf7db]: (no justification provided) [17:59:02] !log dpifke@deploy1001 Finished deploy [performance/navtiming@8eaf7db]: (no justification provided) (duration: 00m 05s) [17:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:46] (03CR) 10Andrew Bogott: [C: 03+1] "This seems fine now; I'm happy to do more testing if anyone has further suggestions of what to try." 
[puppet] - 10https://gerrit.wikimedia.org/r/638146 (https://phabricator.wikimedia.org/T267433) (owner: 10Ahmon Dancy) [18:00:04] chrisalbon and accraze: Time to snap out of that daydream and deploy Services – Graphoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201117T1800). [18:02:33] (03PS1) 10Herron: admin: create swagoel account, add to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/641471 (https://phabricator.wikimedia.org/T267314) [18:03:31] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Access to analytics-privatedata-users for Research volunteer Swagoel - https://phabricator.wikimedia.org/T267314 (10herron) [18:04:20] RECOVERY - ats-tls HTTPS wikiworkshop.org ECDSA on cp2037 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 258941 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2021-02-08 17:00:15 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/HTTPS [18:06:45] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10RobH) >>! In T261405#6627790, @RobH wrote: >>>! In T261405#6624984, @wiki_willy wrote: >> @Jclark-ctr - can you double-check the S/N for db1139. We're... 
[18:07:02] (03PS1) 10Jbond: puppetmaster: only configuere a separate ssl dir for CA puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/641472 (https://phabricator.wikimedia.org/T268040) [18:09:37] !log stopping db1139 for hw maintenance T261405 [18:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:43] T261405: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 [18:10:07] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10jcrespo) @Jclark-ctr I just stopped the host and downtimed it for almost a day, thank you! [18:16:52] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/26487/" [puppet] - 10https://gerrit.wikimedia.org/r/641472 (https://phabricator.wikimedia.org/T268040) (owner: 10Jbond) [18:26:28] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cloudcephmon200[12]-dev - https://phabricator.wikimedia.org/T267378 (10Papaul) We only need OS partitions for these. what OS partitions? [18:31:15] (03PS1) 10Papaul: DHCP: Add MAC address for rdb200[9][10] [puppet] - 10https://gerrit.wikimedia.org/r/641475 (https://phabricator.wikimedia.org/T266721) [18:31:58] 10Operations, 10Technical-blog-posts, 10Traffic: 2nd part of blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T266857 (10srodlund) @ema The post is really solid and just needed some mild copyediting. Take a look at the suggestions and accept or decli... [18:32:41] (03CR) 10Papaul: [C: 03+2] DHCP: Add MAC address for rdb200[9][10] [puppet] - 10https://gerrit.wikimedia.org/r/641475 (https://phabricator.wikimedia.org/T266721) (owner: 10Papaul) [18:34:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:36:14] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: clouddb1015.eqiad.wmnet, clouddb1014.eqiad.wmnet, deploy1002.eqiad.wmnet, peek2001.codfw.wmnet, wdqs1009.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [18:39:16] (03PS1) 10Papaul: Add rdb200[9][10] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/641478 (https://phabricator.wikimedia.org/T266721) [18:41:31] (03CR) 10Papaul: [C: 03+2] Add rdb200[9][10] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/641478 (https://phabricator.wikimedia.org/T266721) (owner: 10Papaul) [18:44:20] (03CR) 10Volans: [C: 04-1] "There is a wrong zonefile include, see inline. I'll do a more in depth pass later on." (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/641284 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [18:44:42] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install rdb20[09|10] - https://phabricator.wikimedia.org/T266721 (10Papaul) [18:45:17] (03CR) 10Dzahn: "there is a typo, one is "rdb" and the other just "rd"" [puppet] - 10https://gerrit.wikimedia.org/r/641478 (https://phabricator.wikimedia.org/T266721) (owner: 10Papaul) [18:45:33] (03CR) 10CRusnov: Move codfw private to Netbox automation (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/641284 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [18:46:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:48:28] PROBLEM - Host ms-be1022 is DOWN: PING CRITICAL - Packet loss = 100% [18:50:55] (03PS1) 10Papaul: FIX typo on rdb2010 [puppet] - 10https://gerrit.wikimedia.org/r/641480 (https://phabricator.wikimedia.org/T266721) [18:52:41] (03CR) 
10Volans: [C: 04-1] "Couple of things to fix, see inline." (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/641285 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [18:52:48] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install rdb20[09|10] - https://phabricator.wikimedia.org/T266721 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` rdb2009.codfw.wmnet ` The log can be found in `/var... [18:53:47] (03CR) 10Papaul: [C: 03+2] FIX typo on rdb2010 [puppet] - 10https://gerrit.wikimedia.org/r/641480 (https://phabricator.wikimedia.org/T266721) (owner: 10Papaul) [18:53:49] (03CR) 10Dzahn: [C: 03+1] FIX typo on rdb2010 [puppet] - 10https://gerrit.wikimedia.org/r/641480 (https://phabricator.wikimedia.org/T266721) (owner: 10Papaul) [18:53:58] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cloudcephmon200[12]-dev - https://phabricator.wikimedia.org/T267378 (10Andrew) >>! In T267378#6628303, @Papaul wrote: > We only need OS partitions for these. what OS partitions? Do you mean which OS or how to partition? The OS should b... 
[18:55:59] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/26488/apt2001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/641312 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [18:58:09] (03CR) 10Dzahn: [C: 03+2] mailman: replace require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/641309 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [18:58:37] (03CR) 10Dzahn: "noop on apt1001" [puppet] - 10https://gerrit.wikimedia.org/r/641312 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201117T1900) [19:02:29] (03CR) 10Dzahn: "noop in lists1001" [puppet] - 10https://gerrit.wikimedia.org/r/641309 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:04:14] (03PS2) 10Dzahn: iegreview: replace require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/641310 (https://phabricator.wikimedia.org/T266479) [19:05:36] (03CR) 10Dzahn: [C: 03+2] iegreview: replace require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/641310 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:07:27] (03CR) 10Dzahn: "noop on miscweb1002" [puppet] - 10https://gerrit.wikimedia.org/r/641310 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:07:41] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [19:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:43] (03CR) 10Ottomata: [C: 03+2] wgEventStreamsDefaultSettings in beta should only set eqiad as topic prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641459 (https://phabricator.wikimedia.org/T253069) (owner: 10Ottomata) [19:09:33] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:09:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:46] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/26489/mwmaint1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/641313 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:11:19] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10jcrespo) {F33917836} {F33917835} [19:12:19] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission [19:12:23] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: wgEventStreamsDefaultSettings in beta should only set eqiad as topic prefix - T253069 (duration: 02m 26s) [19:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:31] T253069: Set up an instance of EventStreams in beta that will allow for consuming any stream - https://phabricator.wikimedia.org/T253069 [19:15:40] (03CR) 10Dzahn: "noop on mwmaint1002" [puppet] - 10https://gerrit.wikimedia.org/r/641313 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:16:47] 10Operations, 10ops-eqiad, 10SRE-swift-storage: ms-be1022 smart storage battery failure; disk sdb possibly bad - https://phabricator.wikimedia.org/T267870 (10wiki_willy) [19:17:30] 10Operations, 10ops-eqiad, 10SRE-swift-storage: ms-be1022 smart storage battery failure; disk sdb possibly bad - https://phabricator.wikimedia.org/T267870 (10wiki_willy) a:05wiki_willy→03Cmjohnson Request for replacement BBU placed via T268061 [19:18:03] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [19:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:31] (03PS2) 10Dzahn: git: replace require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/641314 
(https://phabricator.wikimedia.org/T266479) [19:19:03] (03PS1) 10Herron: admin: add ldap_only_user entry for tillmletzko-wmde [puppet] - 10https://gerrit.wikimedia.org/r/641508 (https://phabricator.wikimedia.org/T267744) [19:19:05] (03PS1) 10Herron: admin: add ldap_only_user entry for janjaquemot [puppet] - 10https://gerrit.wikimedia.org/r/641509 (https://phabricator.wikimedia.org/T267771) [19:19:07] (03PS1) 10Herron: admin: add ldap_only_entry for tobias-schumann-wmde-ext [puppet] - 10https://gerrit.wikimedia.org/r/641510 (https://phabricator.wikimedia.org/T267917) [19:19:43] (03PS1) 10JMeybohm: Add kubernetes-addon-manager [debs/kubernetes] (future) - 10https://gerrit.wikimedia.org/r/641511 (https://phabricator.wikimedia.org/T267653) [19:20:18] (03CR) 10jerkins-bot: [V: 04-1] git: replace require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/641314 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:21:30] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission [19:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:55] (03CR) 10JMeybohm: "@jmm I feel those scripts should not be in /usr/bin, plus kube-addon.sh is not intended to be called directly. 
What would be a better plac" [debs/kubernetes] (future) - 10https://gerrit.wikimedia.org/r/641511 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [19:24:05] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [19:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:14] (03CR) 10RLazarus: [C: 03+1] httpbb: replace require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/641311 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:27:16] PROBLEM - puppetmaster backend https on puppetmaster2003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [19:27:39] jbond42: --^ [19:28:21] looking [19:28:37] thanks! [19:29:22] (03CR) 10Dzahn: "Hey John, as you see I already did some other conversions from require_package and they have all been noop and unproblematic.. up until th" [puppet] - 10https://gerrit.wikimedia.org/r/641314 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:29:36] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01017 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [19:31:50] RECOVERY - puppetmaster backend https on puppetmaster2003 is OK: HTTP OK: Status line output matched 400 - 417 bytes in 1.441 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [19:33:15] should we stop merging right now? [19:33:51] no, it should recover; one of the backends was in a broken state, it should be fixed now [19:34:12] ok, thank you!
[19:34:15] running puppet on failed nodes now [19:34:21] I was about to ask that next:) [19:34:42] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/26490/" [puppet] - 10https://gerrit.wikimedia.org/r/641311 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:36:15] works on deploy1001/cumin1001 where I was running puppet, thumbs up [19:36:44] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.04959 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [19:37:31] regarding "require_package" -> "ensure_packages" it works just fine on 9 out of 10 but it can also fail, for example in git module it gets a duplicate declaration so maybe wait before merging that "all in one" change. I will rebase it too. [19:38:34] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install rdb20[09|10] - https://phabricator.wikimedia.org/T266721 (10Papaul) @Volans another error on on the auto-reimage ` 2020-11-17 19:35:48 [ERROR] (pt1979) wmf-auto-reimage::check_uptime: Unable to determine uptime of host 'rdb2009.codfw.wm... [19:38:53] (03CR) 10Dzahn: "noop on deploy1001, cumin2001" [puppet] - 10https://gerrit.wikimedia.org/r/641311 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [19:38:56] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install rdb20[09|10] - https://phabricator.wikimedia.org/T266721 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['rdb2009.codfw.wmnet'] ` and were **ALL** successful. 
[19:40:48] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005086 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [19:41:17] mutante: can you comment on the task with examples? That is exactly what ensure_packages is designed to avoid, so if it does cause that, it's a bug. Also, from that POV the implementations are almost the same [19:42:01] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268036 (10wiki_willy) [19:42:05] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install rdb20[09|10] - https://phabricator.wikimedia.org/T266721 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` rdb2010.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/... [19:42:20] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.001907 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [19:42:39] task is T266479 [19:42:39] T266479: Puppet Proposal to remove require_package - https://phabricator.wikimedia.org/T266479 [19:43:19] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268036 (10wiki_willy) a:03Cmjohnson Order for replacement BBU submitted via T268061 [19:44:29] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install rdb20[09|10] - https://phabricator.wikimedia.org/T266721 (10Papaul) [19:44:40] jbond42: ACK, already using that ticket to upload patches. 
leaving a comment [19:45:06] thanks <3 [19:45:18] and welcome back hope you had a nice vacation :) [19:46:00] !log dancy@deploy1001 Pruned MediaWiki: 1.36.0-wmf.11 (duration: 13m 05s) [19:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:48] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/641472 (https://phabricator.wikimedia.org/T268040) (owner: 10Jbond) [19:48:28] 10Operations, 10Puppet, 10Patch-For-Review: Puppet Proposal to remove require_package - https://phabricator.wikimedia.org/T266479 (10Dzahn) I merged a couple changes you can see above (aptrepo, mailman, iegreview, noc, httpbb,..) and they were all unproblematic and noop. But then I got to the git module and... [19:48:37] !log ppchelko@deploy1001 Started deploy [restbase/deploy@8363aeb]: update to service-runner 2.8.0, canary on 2010 [19:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:48] PROBLEM - MariaDB Replica Lag: pc1 on pc2010 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 414.67 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:49:13] jbond42: thank you:). comment added. 
have a good rest of the day [19:50:40] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@8363aeb]: update to service-runner 2.8.0, canary on 2010 (duration: 02m 03s) [19:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:05] (03PS1) 10Ahmon Dancy: testwikis wikis to 1.36.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641513 [19:51:07] (03CR) 10Ahmon Dancy: [C: 03+2] testwikis wikis to 1.36.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641513 (owner: 10Ahmon Dancy) [19:51:49] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641513 (owner: 10Ahmon Dancy) [19:52:07] !log dancy@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.18 [19:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:51] (03CR) 10Dzahn: "Let's just remove these lines from puppet and once this is setup again it can be changed in the deploy repo? Any objections?" [puppet] - 10https://gerrit.wikimedia.org/r/641245 (owner: 10Dzahn) [19:55:16] (03PS2) 10Dzahn: cumin: remove code for absented check aliases cron job [puppet] - 10https://gerrit.wikimedia.org/r/641274 [19:55:43] (03CR) 10Dzahn: [C: 03+2] "just removing the code for already absented cron jobs, actual change was yesterday" [puppet] - 10https://gerrit.wikimedia.org/r/641274 (owner: 10Dzahn) [19:56:53] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [19:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:47] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:25] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cloudcephmon200[12]-dev - https://phabricator.wikimedia.org/T267378 (10Papaul) @Andrew OS partitions needs a partman recipe. 
what partman recipe do you want to use for those servers ? [19:59:58] 10Operations, 10Analytics, 10Analytics-Kanban, 10Event-Platform: Reduce cache TTL of schema.wikimedia.org - https://phabricator.wikimedia.org/T267557 (10razzi) This has been deployed with a 60-second TTL. [20:00:04] dancy and hashar: How many deployers does it take to do Mediawiki train - American+European Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201117T2000). [20:04:04] RECOVERY - Host ms-be1022 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [20:04:16] (03PS1) 10Clarakosi: Mathoid: Update mathoid to use the latest tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/641514 (https://phabricator.wikimedia.org/T148304) [20:04:28] PROBLEM - Check systemd state on ms-be1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:04:55] 10Operations, 10Wikimedia-Mailing-lists: Wikimania-com mailing list: who’s the admin ? - https://phabricator.wikimedia.org/T268031 (10Aklapper) @Anthere: Assuming this task is about #Wikimedia-Mailing-lists code project, hence adding that project tag so other people can find this task when searching via projec... [20:07:18] 10Operations, 10Wikimedia-Mailing-lists: Wikimania-com mailing list: who’s the admin ? - https://phabricator.wikimedia.org/T268031 (10Aklapper) If I interpret T206089#4643921 correctly, then that were @eyoung and @ITait in 2018 who are now both inactive. :( I guess SRE needs to take a look here... 
[20:07:42] RECOVERY - Host ms-be1022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 6.96 ms [20:08:01] ACKNOWLEDGEMENT - HP RAID on ms-be1022 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T268071 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Inform [20:08:05] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1022 - https://phabricator.wikimedia.org/T268071 (10ops-monitoring-bot) [20:09:51] 10Operations, 10Wikimedia-Mailing-lists: Wikimania-com mailing list: who’s the admin ? - https://phabricator.wikimedia.org/T268031 (10Dzahn) In the past it would also say "-owner@" but you would see the actual owner name when you hovered over the "list run by" line on the list info page. (bottom of https://lis... [20:10:38] (03PS3) 10Jbond: git: replace require_package with ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/641314 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [20:11:19] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install rdb20[09|10] - https://phabricator.wikimedia.org/T266721 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['rdb2010.codfw.wmnet'] ` and were **ALL** successful. [20:13:22] mutante: I have uploaded a fix; it was more an issue with the spec test than anything else, although it is interesting that require_package doesn't throw an error, so I will investigate further. Thanks [20:14:06] jbond42: oooh, i did not even notice the git module had the spec test. 
thanks as well [20:14:13] 10Operations, 10Puppet, 10Patch-For-Review: Investigate the existence of files under the server ssl dir foobars puppet - https://phabricator.wikimedia.org/T268040 (10Aklapper) [20:16:03] 10Operations, 10Wikimedia-Mailing-lists: Wikimania-com mailing list: who’s the admin ? - https://phabricator.wikimedia.org/T268031 (10Dzahn) The current sole admin is "**itait@"**. I could also confirm on the mailserver that this address does not exist anymore. Which address would you like to be the new admi... [20:17:52] oooh2, the puppetmaster module, not the git module, gotcha :) [20:19:47] (03CR) 10Dzahn: [C: 03+2] "thanks! Did not notice it was about spec test in puppetmaster module, just focused on code in git module" [puppet] - 10https://gerrit.wikimedia.org/r/641314 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [20:19:50] (03CR) 10Ppchelko: [C: 03+1] Mathoid: Update mathoid to use the latest tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/641514 (https://phabricator.wikimedia.org/T148304) (owner: 10Clarakosi) [20:20:21] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/641517 [20:22:47] PROBLEM - Long running screen/tmux on maps1004 is CRITICAL: CRIT: Long running SCREEN process. (user: root PID: 25959, 1738395s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [20:22:54] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install rdb20[09|10] - https://phabricator.wikimedia.org/T266721 (10Papaul) [20:23:05] 10Operations, 10Wikimedia-Mailing-lists: Wikimania-com mailing list: who’s the admin ? - https://phabricator.wikimedia.org/T268031 (10Aklapper) >>! In T268031#6628731, @Dzahn wrote: > I could also confirm on the mailserver that this address does not exist anymore. Wondering if that's worth some kind of automat... 
[20:27:41] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install rdb20[09|10] - https://phabricator.wikimedia.org/T266721 (10Papaul) 05Open→03Resolved @jijiki this is complete [20:28:53] 10Operations, 10Wikimedia-Mailing-lists: Wikimania-com mailing list: who’s the admin ? - https://phabricator.wikimedia.org/T268031 (10Dzahn) >>! In T268031#6628742, @Aklapper wrote: >>>! In T268031#6628731, @Dzahn wrote: >> I could also confirm on the mailserver that this address does not exist anymore. > Wond... [20:31:06] 10Operations, 10Wikidata, 10Wikidata Query UI, 10User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Dzahn) @Addshore Any suggestion who will take the "Deal with the favicon and the custom-config in the GUI build" check box that seems to be next here? [20:31:36] !log dancy@deploy1001 Finished scap: testwikis wikis to 1.36.0-wmf.18 (duration: 39m 37s) [20:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:35] (03CR) 10Dzahn: "noop on cumin1001" [puppet] - 10https://gerrit.wikimedia.org/r/641274 (owner: 10Dzahn) [20:33:41] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/26493/" [puppet] - 10https://gerrit.wikimedia.org/r/641314 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [20:37:51] (03CR) 10Ahmon Dancy: [C: 03+2] Suggested Edits: Guard against task type not existing [extensions/GrowthExperiments] (wmf/1.36.0-wmf.18) - 10https://gerrit.wikimedia.org/r/641294 (https://phabricator.wikimedia.org/T268012) (owner: 10Kosta Harlan) [20:39:11] (03CR) 10Dzahn: "noop on peek2001, miscweb1002, cumin1001" [puppet] - 10https://gerrit.wikimedia.org/r/641314 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [20:41:44] (03CR) 10Dzahn: "modules/arclamp/manifests/init.pp: needs merge" [puppet] - 10https://gerrit.wikimedia.org/r/640688 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [20:42:59] !log End of mwscript 
extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log in a tmux at mwmaint1002 (wiki=itwiki; T246539) [20:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:07] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [20:49:11] (03Merged) 10jenkins-bot: Suggested Edits: Guard against task type not existing [extensions/GrowthExperiments] (wmf/1.36.0-wmf.18) - 10https://gerrit.wikimedia.org/r/641294 (https://phabricator.wikimedia.org/T268012) (owner: 10Kosta Harlan) [20:51:12] (03PS4) 10Dzahn: puppet: migrate from require_package to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/640688 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [20:51:58] (03PS1) 10Daniel Kinzler: Set $wgOldRevisionParserCacheExpireTime to 0 in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641527 (https://phabricator.wikimedia.org/T268075) [20:52:02] (03PS1) 10Herron: aptrepo: add elastic710 component [puppet] - 10https://gerrit.wikimedia.org/r/641528 [20:53:09] (03PS1) 10Ahmon Dancy: group0 wikis to 1.36.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641529 [20:53:11] (03CR) 10Ahmon Dancy: [C: 03+2] group0 wikis to 1.36.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641529 (owner: 10Ahmon Dancy) [20:53:15] (03CR) 10jerkins-bot: [V: 04-1] puppet: migrate from require_package to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/640688 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [20:53:48] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cloudcephmon200[12]-dev - https://phabricator.wikimedia.org/T267378 (10Andrew) If they have hw raid then all drives in one big raid10 and partman recipe hwraid-1dev.cfg. If no hwraid then... I think raid10-4dev.cfg ? It's hard for me to... 
[20:54:45] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641529 (owner: 10Ahmon Dancy) [20:54:53] 10Operations, 10Wikimedia-Mailing-lists: Wikimania-com mailing list: who’s the admin ? - https://phabricator.wikimedia.org/T268031 (10Anthere) Well, Irène stopped working for the WMF. So it now fully makes sense that no one gets anything. Ok, let's move on this. Primary contact: fdevouard@anthere.org (mysel... [20:55:53] (03CR) 10Dzahn: [C: 03+1] "code looks good, manager and employeeType matches. it would just be nice to have a justification what it's actually needed for, which tool" [puppet] - 10https://gerrit.wikimedia.org/r/641463 (https://phabricator.wikimedia.org/T267968) (owner: 10Herron) [20:56:20] !log dancy@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.18 [20:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:26] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add STran to `wmf` LDAP group - https://phabricator.wikimedia.org/T267968 (10herron) Hi @STran, for our records could you please give a high level description of what the requested access will be used for? Thanks in advance! 
[21:03:42] (03CR) 10Ppchelko: [C: 03+1] Set $wgOldRevisionParserCacheExpireTime to 0 in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/641527 (https://phabricator.wikimedia.org/T268075) (owner: 10Daniel Kinzler) [21:06:01] PROBLEM - ensure kvm processes are running on cloudvirt1012 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:07:24] (03PS1) 10Ladsgroup: ores: Drop all precaching puppet roles for labs [puppet] - 10https://gerrit.wikimedia.org/r/641533 [21:14:07] RECOVERY - ensure kvm processes are running on cloudvirt1012 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:17:24] (03PS1) 10RLazarus: decorators: Log chained exception messages in @retry [software/spicerack] - 10https://gerrit.wikimedia.org/r/641534 [21:20:35] (03CR) 10jerkins-bot: [V: 04-1] decorators: Log chained exception messages in @retry [software/spicerack] - 10https://gerrit.wikimedia.org/r/641534 (owner: 10RLazarus) [21:24:51] !log ppchelko@deploy1001 Started deploy [restbase/deploy@8363aeb]: update to service-runner 2.8.0, codfw [21:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:57] (03PS6) 10Andrew Bogott: OpenStack: add initial config files for version Stein [puppet] - 10https://gerrit.wikimedia.org/r/641230 (https://phabricator.wikimedia.org/T261134) [21:25:59] (03PS6) 10Andrew Bogott: OpenStack: add server packages for version Stein [puppet] - 10https://gerrit.wikimedia.org/r/641231 (https://phabricator.wikimedia.org/T261134) [21:26:01] (03PS6) 10Andrew Bogott: OpenStack: add client packages for Stein [puppet] - 10https://gerrit.wikimedia.org/r/641233 (https://phabricator.wikimedia.org/T261134) [21:26:03] (03PS6) 10Andrew Bogott: OpenStack Designate: updates for version Stein [puppet] - 10https://gerrit.wikimedia.org/r/641232 
(https://phabricator.wikimedia.org/T261134) [21:26:05] (03PS1) 10Andrew Bogott: OpenStack Nova Compute: change the common CPU model to Haswell-noTSX-IBRS [puppet] - 10https://gerrit.wikimedia.org/r/641535 [21:27:28] (03CR) 10Gergő Tisza: "Thanks for the quick fix!" [extensions/GrowthExperiments] (wmf/1.36.0-wmf.18) - 10https://gerrit.wikimedia.org/r/641294 (https://phabricator.wikimedia.org/T268012) (owner: 10Kosta Harlan) [21:28:31] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Nova Compute: change the common CPU model to Haswell-noTSX-IBRS [puppet] - 10https://gerrit.wikimedia.org/r/641535 (owner: 10Andrew Bogott) [21:30:11] 10Operations, 10Wikimedia-Mailing-lists: Wikimania-com mailing list: who’s the admin ? - https://phabricator.wikimedia.org/T268031 (10Dzahn) Hello @Anthere I added you and Joel in the admin field and then ran a shell command that resets the password to something random and mails it to the owner address. So... [21:32:02] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@8363aeb]: update to service-runner 2.8.0, codfw (duration: 07m 11s) [21:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:14] (03PS5) 10Dzahn: puppet: migrate from require_package to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/640688 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [21:40:22] (03PS2) 10RLazarus: decorators: Log chained exception messages in @retry [software/spicerack] - 10https://gerrit.wikimedia.org/r/641534 [21:40:48] 10Operations, 10Wikimedia-Mailing-lists: Wikimania-com mailing list: who’s the admin ? - https://phabricator.wikimedia.org/T268031 (10Anthere) Got it. Thank you ! 
[21:43:46] (03CR) 10jerkins-bot: [V: 04-1] decorators: Log chained exception messages in @retry [software/spicerack] - 10https://gerrit.wikimedia.org/r/641534 (owner: 10RLazarus) [21:49:15] 10Operations, 10ops-eqiad, 10Data-Services, 10Epic, 10cloud-services-team (Hardware): Move labstore1004 and labstore1005 to 10G Ethernet - https://phabricator.wikimedia.org/T266198 (10Andrew) to validate the move, check 'drbd-overview' output before and after [21:52:11] (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/641465 (https://phabricator.wikimedia.org/T267962) (owner: 10Herron) [21:52:39] 10Operations, 10ops-eqiad, 10Data-Services, 10Epic, 10cloud-services-team (Hardware): Move labstore1004 and labstore1005 to 10G Ethernet - https://phabricator.wikimedia.org/T266198 (10Andrew) a:05Bstorm→03Andrew [21:54:39] (03PS3) 10RLazarus: decorators: Log chained exception messages in @retry [software/spicerack] - 10https://gerrit.wikimedia.org/r/641534 [21:55:22] (03CR) 10Dzahn: "manually rebased on a couple changes that happened meanwhile and is a bit smaller now" [puppet] - 10https://gerrit.wikimedia.org/r/640688 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [21:57:16] !log ppchelko@deploy1001 Started deploy [restbase/deploy@8363aeb]: update to service-runner 2.8.0, everywhere [21:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:40] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/26494/otrs1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/637038 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [22:06:57] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:07:15] !log otrs1001 - removing otrs-cache-cleanup 
cron from otrs's crontab - adding same command as systemd timer. gerrit:637038 T265138 [22:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:22] T265138: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 [22:08:23] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@8363aeb]: update to service-runner 2.8.0, everywhere (duration: 11m 07s) [22:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:53] (03CR) 10Dzahn: "systemctl status otrs-cache-cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/637038 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [22:09:36] (03PS4) 10RLazarus: decorators: Log chained exception messages in @retry [software/spicerack] - 10https://gerrit.wikimedia.org/r/641534 [22:10:07] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:10:11] !log otrs1001 - systemctl start otrs-cache-cleanup [22:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:17] (03CR) 10Clarakosi: [C: 03+2] Mathoid: Update mathoid to use the latest tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/641514 (https://phabricator.wikimedia.org/T148304) (owner: 10Clarakosi) [22:24:27] (03Merged) 10jenkins-bot: Mathoid: Update mathoid to use the latest tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/641514 (https://phabricator.wikimedia.org/T148304) (owner: 10Clarakosi) [22:29:45] !log clarakosi@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' . 
[22:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:20] (03PS1) 10Dzahn: planet: adjust update timer to be based OnUnitActiveSec [puppet] - 10https://gerrit.wikimedia.org/r/641554 [22:33:35] (03CR) 10Dzahn: [C: 03+2] gerrit: use multiline regex flag for Sonar report [puppet] - 10https://gerrit.wikimedia.org/r/641383 (https://phabricator.wikimedia.org/T267028) (owner: 10Hashar) [22:38:57] (03CR) 10Dzahn: "This needs to be handled on the ticket and directly with the traffic team. Adding Valentin though." [dns] - 10https://gerrit.wikimedia.org/r/634928 (https://phabricator.wikimedia.org/T257536) (owner: 10Ladsgroup) [22:39:16] !log clarakosi@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mathoid' for release 'production' . [22:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:10] !log mforns@deploy1001 Started deploy [analytics/refinery@f19d20c]: Regular analytics weekly train [analytics/refinery@f19d20c21ada05df230d00c6e0022a7d5c356c13] [22:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:15] !log clarakosi@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mathoid' for release 'production' . 
[22:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:01] !log mforns@deploy1001 Finished deploy [analytics/refinery@f19d20c]: Regular analytics weekly train [analytics/refinery@f19d20c21ada05df230d00c6e0022a7d5c356c13] (duration: 12m 51s) [22:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:34] !log mforns@deploy1001 Started deploy [analytics/refinery@f19d20c] (thin): Regular analytics weekly train THIN [analytics/refinery@f19d20c21ada05df230d00c6e0022a7d5c356c13] [22:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:41] !log mforns@deploy1001 Finished deploy [analytics/refinery@f19d20c] (thin): Regular analytics weekly train THIN [analytics/refinery@f19d20c21ada05df230d00c6e0022a7d5c356c13] (duration: 00m 07s) [22:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:53] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10Jclark-ctr) @jcrespo. replacement dimms should arrive Thursday. Unsure what time they will arrive we can shoot for Thursday. If they arrive late it... [22:57:43] (03CR) 10Dzahn: "We are actually trying to remove all of these as part of https://phabricator.wikimedia.org/T218900 so let's keep it temporary. 
We are goin" [puppet] - 10https://gerrit.wikimedia.org/r/637790 (https://phabricator.wikimedia.org/T265912) (owner: 10Reedy) [23:04:12] (03CR) 10Reedy: [C: 03+1] peek: don't change permissions within a git repo [puppet] - 10https://gerrit.wikimedia.org/r/641245 (owner: 10Dzahn) [23:04:44] 10Operations, 10SRE-Access-Requests, 10Security-Team: Access to peek2001.codfw.wmnet - https://phabricator.wikimedia.org/T265922 (10Dzahn) 05Resolved→03Open [23:04:51] (03CR) 10Dzahn: [C: 03+2] peek: don't change permissions within a git repo [puppet] - 10https://gerrit.wikimedia.org/r/641245 (owner: 10Dzahn) [23:06:49] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:07:13] 10Operations, 10SRE-Access-Requests, 10Security-Team: Access to peek2001.codfw.wmnet - https://phabricator.wikimedia.org/T265922 (10Dzahn) More access than just unprivileged shell access is needed to take over maintenance of peek. We need to add sudo privileges. Suggesting we make a group "peek-roots" for... [23:08:53] 10Operations, 10SRE-Access-Requests, 10Security-Team: Access to peek2001.codfw.wmnet - https://phabricator.wikimedia.org/T265922 (10Dzahn) Reedy and Scott should be peek-roots and later we can make peek-admins with a few standard maintenance commands and give that to the entire secteam. [23:10:14] (03CR) 10Dzahn: "now handled in by https://gerrit.wikimedia.org/r/c/wikimedia/security/tooling/peek/+/641562" [puppet] - 10https://gerrit.wikimedia.org/r/641245 (owner: 10Dzahn) [23:10:28] (03PS1) 10RLazarus: Remove the legacy assert_headers regex format, which is unused. 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/641567 [23:23:54] (03CR) 10Dzahn: "confirmed puppet run has no issues, git cloned and cloning fixed the permissions as set in software repo" [puppet] - 10https://gerrit.wikimedia.org/r/641245 (owner: 10Dzahn) [23:34:09] (03PS1) 10Dzahn: admin: create peek-roots group and apply on peek role [puppet] - 10https://gerrit.wikimedia.org/r/641570 (https://phabricator.wikimedia.org/T265922) [23:43:53] (03CR) 10Dzahn: [C: 03+2] planet: adjust update timer to be based OnUnitActiveSec [puppet] - 10https://gerrit.wikimedia.org/r/641554 (owner: 10Dzahn) [23:59:14] 10Operations, 10Anti-Harassment, 10Trust-and-Safety: Grant checkuser rights to DannyS712 on testwiki - https://phabricator.wikimedia.org/T268090 (10Niharika)