[00:01:44] 10Operations, 10ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T263484 (10Papaul) 05Open→03Declined Duplicate of T262182 [00:04:26] RECOVERY - Check systemd state on ms-be2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:08:59] (03PS1) 10Dzahn: swift::proxy: convert role to profile, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/628970 [00:09:57] (03CR) 10jerkins-bot: [V: 04-1] swift::proxy: convert role to profile, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/628970 (owner: 10Dzahn) [00:13:01] (03PS2) 10Dzahn: swift::proxy: convert role to profile, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/628970 [00:13:42] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/25251/ms-fe2005.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/628970 (owner: 10Dzahn) [00:19:02] 10Operations, 10ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T262182 (10Papaul) [00:23:27] (03PS1) 10Dzahn: mail::smarthost::wmcs: convert role to profile, fix lint issues [puppet] - 10https://gerrit.wikimedia.org/r/628972 [00:29:20] (03PS1) 10Dzahn: syslog::centralserver: convert role to profile, fix lint issues [puppet] - 10https://gerrit.wikimedia.org/r/628973 [00:30:41] (03CR) 10Cwhite: [C: 03+1] role::bastionhost::pop: remove prometheus instances [puppet] - 10https://gerrit.wikimedia.org/r/628940 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [00:33:24] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/25252/centrallog1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/628973 (owner: 10Dzahn) [00:33:53] (03CR) 10Dzahn: [V: 03+1] "17:30:14 Resolved violations:" [puppet] - 10https://gerrit.wikimedia.org/r/628973 (owner: 10Dzahn) [00:34:24] (03CR) 10Dzahn: [V: 03+1] "17:13:23 Resolved 
violations:" [puppet] - 10https://gerrit.wikimedia.org/r/628970 (owner: 10Dzahn) [00:35:12] (03CR) 10Dzahn: "16:54:49 Resolved violations:" [puppet] - 10https://gerrit.wikimedia.org/r/628969 (owner: 10Dzahn) [00:35:48] (03CR) 10Dzahn: "16:49:00 Resolved violations:" [puppet] - 10https://gerrit.wikimedia.org/r/627967 (owner: 10Dzahn) [00:42:13] (03CR) 10Dzahn: "wow, thanks for amending and merging 😊" [puppet] - 10https://gerrit.wikimedia.org/r/624357 (owner: 10Dzahn) [00:45:58] (03Abandoned) 10Jeena Huneidi: This is a test [deployment-charts] - 10https://gerrit.wikimedia.org/r/628968 (owner: 10Jeena Huneidi) [00:53:02] 10Operations, 10Phatality, 10observability, 10Developer Productivity: Deploying "Phatality" plugin for Kibana invokes oom-killer on logstash::collector nodes - https://phabricator.wikimedia.org/T237706 (10Krinkle) [00:55:31] 10Operations, 10Phatality, 10observability, 10Developer Productivity: Deploying "Phatality" plugin for Kibana invokes oom-killer on logstash::collector nodes - https://phabricator.wikimedia.org/T237706 (10Krinkle) Having a plugin deployed that we can't change is actively harmful as it encourages people to... 
[01:17:36] !log Going to test patch to stick envoy in front of `cloudelastic`, see https://gerrit.wikimedia.org/r/c/operations/puppet/+/628243 [01:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:17:56] !log Disabling puppet on affected nodes via `sudo cumin C:profile::services_proxy::envoy 'disable-puppet "adding cloudelastic to the service proxy --rkemper"'` [01:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:19:25] (03PS2) 10Ryan Kemper: cloudelastic: use envoy to mitigate tls latency [puppet] - 10https://gerrit.wikimedia.org/r/628243 (https://phabricator.wikimedia.org/T263073) [01:20:11] (03CR) 10Ryan Kemper: [C: 03+2] cloudelastic: use envoy to mitigate tls latency [puppet] - 10https://gerrit.wikimedia.org/r/628243 (https://phabricator.wikimedia.org/T263073) (owner: 10Ryan Kemper) [01:28:29] !log `sudo puppet-merge` done, now will run puppet on a single eqiad appserver and verify we can curl `localhost:610{5,6,7}` [01:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:32:57] !log `sudo run-puppet-agent -e "adding cloudelastic to the service proxy --rkemper"` on `mwdebug1002.eqiad.wmnet` [01:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:35:27] !log woot! `curl -X GET -s 'http://localhost:6105/_cluster/health'` gives a response as expected. (As do 6106 and 6107). Re-enabling puppet across the fleet... 
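For reference, the verification step logged above (curling `localhost:6105`, `6106` and `6107` for `_cluster/health`) could be scripted roughly as follows. This is only a sketch under stated assumptions: the helper names `health_urls`, `is_healthy` and `check_all` are hypothetical, and treating a `green` or `yellow` status as acceptable is an assumption, not something stated in the log.

```python
import json

# Ports taken from the log message above; everything else here is illustrative.
CLOUDELASTIC_PORTS = (6105, 6106, 6107)


def health_urls(ports=CLOUDELASTIC_PORTS, host="localhost"):
    """Build the _cluster/health URL for each local envoy listener port."""
    return [f"http://{host}:{port}/_cluster/health" for port in ports]


def is_healthy(body, acceptable=("green", "yellow")):
    """Return True if a _cluster/health JSON body reports an acceptable status."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Not JSON at all (e.g. an error page from the proxy).
        return False
    return isinstance(payload, dict) and payload.get("status") in acceptable


def check_all(fetch, ports=CLOUDELASTIC_PORTS):
    """Check every port; `fetch` is any callable mapping a URL to a response body."""
    results = {}
    for url in health_urls(ports):
        try:
            results[url] = is_healthy(fetch(url))
        except OSError:
            # Connection refused, timeout, and similar network errors.
            results[url] = False
    return results
```

To run it against real listeners, `fetch` could be something like `lambda url: urllib.request.urlopen(url, timeout=5).read()`; keeping the fetcher injectable lets the parsing logic be exercised without network access.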
[01:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:35:45] !log `sudo cumin C:profile::services_proxy::envoy 'enable-puppet "adding cloudelastic to the service proxy --rkemper"'` done [01:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:42:10] (03PS1) 10Herron: prometheus: point prometheus.svc.esams to prometheus3001 [dns] - 10https://gerrit.wikimedia.org/r/628977 (https://phabricator.wikimedia.org/T243057) [01:45:39] (03CR) 10Herron: [C: 03+2] prometheus: point prometheus.svc.esams to prometheus3001 [dns] - 10https://gerrit.wikimedia.org/r/628977 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [01:47:42] RECOVERY - Thanos sidecar cannot connect to Prometheus on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar [01:53:52] PROBLEM - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [01:54:06] PROBLEM - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [01:54:54] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [01:56:24] (03CR) 10Herron: "seeing the below error when 
attempting to run authdns-update now" [dns] - 10https://gerrit.wikimedia.org/r/623465 (owner: 10Dzahn) [02:01:25] (03PS1) 10Ryan Kemper: cloudelastic: envoy sits in front now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628978 (https://phabricator.wikimedia.org/T263073) [02:02:24] (03PS1) 10Herron: Revert "prometheus: point prometheus.svc.esams to prometheus3001" [dns] - 10https://gerrit.wikimedia.org/r/628994 [02:04:23] (03CR) 10Herron: "the change this is reverting was never fully deployed due to the issue in the commit message. still, reverting it and will try again later" [dns] - 10https://gerrit.wikimedia.org/r/628994 (owner: 10Herron) [02:04:27] (03CR) 10Herron: [C: 03+2] Revert "prometheus: point prometheus.svc.esams to prometheus3001" [dns] - 10https://gerrit.wikimedia.org/r/628994 (owner: 10Herron) [02:07:12] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.36.0-wmf.10 [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/628979 [02:08:50] (03PS2) 10DannyS712: Branch commit for wmf/1.36.0-wmf.10 [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/628979 (https://phabricator.wikimedia.org/T257978) (owner: 10TrainBranchBot) [02:10:12] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [02:14:56] RECOVERY - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [02:15:10] RECOVERY - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [02:25:13] (03PS1) 10Ryan Kemper: cirrussearch: increase up commonswiki_file shards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628980 (https://phabricator.wikimedia.org/T260083) [02:26:06] (03CR) 10jerkins-bot: [V: 04-1] cirrussearch: increase up commonswiki_file shards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628980 (https://phabricator.wikimedia.org/T260083) (owner: 10Ryan Kemper) [02:29:34] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [03:05:23] (03PS2) 10Ryan Kemper: cloudelastic: envoy sits in front now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628978 (https://phabricator.wikimedia.org/T263073) [03:14:08] (03CR) 10Ebernhardson: [C: 03+1] "Not sure what diffConfig is complaining about, but this looks reasonable to deploy in one of tomorrows backport windows." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/628978 (https://phabricator.wikimedia.org/T263073) (owner: 10Ryan Kemper) [03:15:34] (03PS2) 10Ryan Kemper: cirrussearch: increase up commonswiki_file shards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628980 (https://phabricator.wikimedia.org/T260083) [03:25:26] (03PS1) 10Jeena Huneidi: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/628985 [03:26:03] (03CR) 10Jeena Huneidi: [C: 04-2] blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/628985 (owner: 10Jeena Huneidi) [03:26:57] (03Abandoned) 10Jeena Huneidi: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/628985 (owner: 10Jeena Huneidi) [03:27:51] (03PS2) 10KartikMistry: Remove test2wiki, as ContentTranslation is not enabled there [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628780 [03:28:57] (03PS2) 10KartikMistry: ContentTranslation: Move testwiki off extension1 cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628790 (https://phabricator.wikimedia.org/T263417) [03:29:39] (03PS3) 10KartikMistry: ContentTranslation: Remove testwiki from extension1 cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628790 (https://phabricator.wikimedia.org/T263417) [03:32:38] (03PS3) 10KartikMistry: ContentTranslation: Remove test2wiki from wgContentTranslationAsBetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628780 [04:04:42] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [04:20:24] PROBLEM - Check systemd state on ms-be2038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:37:51] 10Operations, 10ops-codfw, 10DBA: db2127 memory errors - https://phabricator.wikimedia.org/T262247 (10Marostegui) 05Open→03Resolved Host was repooled Thank you Papaul! [04:51:20] RECOVERY - Check systemd state on ms-be2038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:59:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2013 (re)pooling @ 25%: Slowly repool after recloning es2032 T261717 ', diff saved to https://phabricator.wikimedia.org/P12697 and previous config saved to /var/cache/conftool/dbconfig/20200922-045919-root.json [04:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:59:26] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [04:59:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2016 (re)pooling @ 25%: Slowly repool after recloning es2032 T261717 ', diff saved to https://phabricator.wikimedia.org/P12698 and previous config saved to /var/cache/conftool/dbconfig/20200922-045928-root.json [04:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:59:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2019 (re)pooling @ 25%: Slowly repool after recloning es2034 T261717 ', diff saved to https://phabricator.wikimedia.org/P12699 and previous config saved to /var/cache/conftool/dbconfig/20200922-045944-root.json [04:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:00:04] !log Add es2032 es2033 and es2034 to tendril and zarcillo T261717 [05:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2013 (re)pooling @ 50%: Slowly repool after recloning es2032 T261717 ', diff saved to https://phabricator.wikimedia.org/P12700 and previous config saved to 
/var/cache/conftool/dbconfig/20200922-051423-root.json [05:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:29] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [05:14:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2016 (re)pooling @ 50%: Slowly repool after recloning es2032 T261717 ', diff saved to https://phabricator.wikimedia.org/P12701 and previous config saved to /var/cache/conftool/dbconfig/20200922-051431-root.json [05:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2019 (re)pooling @ 50%: Slowly repool after recloning es2034 T261717 ', diff saved to https://phabricator.wikimedia.org/P12702 and previous config saved to /var/cache/conftool/dbconfig/20200922-051448-root.json [05:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:17:53] (03PS1) 10Marostegui: es2032,es2033,es2034: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/629011 (https://phabricator.wikimedia.org/T261717) [05:26:32] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [05:26:55] (03CR) 10Marostegui: [C: 03+2] es2032,es2033,es2034: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/629011 (https://phabricator.wikimedia.org/T261717) (owner: 10Marostegui) [05:28:30] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [05:28:52] (03PS1) 10Marostegui: instances.yaml: Add es203{2,3,4} to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/629012 (https://phabricator.wikimedia.org/T261717) [05:29:22] (03CR) 10Marostegui: [C: 03+2] 
instances.yaml: Add es203{2,3,4} to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/629012 (https://phabricator.wikimedia.org/T261717) (owner: 10Marostegui) [05:29:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2013 (re)pooling @ 75%: Slowly repool after recloning es2032 T261717 ', diff saved to https://phabricator.wikimedia.org/P12703 and previous config saved to /var/cache/conftool/dbconfig/20200922-052926-root.json [05:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:33] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [05:29:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2016 (re)pooling @ 75%: Slowly repool after recloning es2032 T261717 ', diff saved to https://phabricator.wikimedia.org/P12704 and previous config saved to /var/cache/conftool/dbconfig/20200922-052935-root.json [05:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2019 (re)pooling @ 75%: Slowly repool after recloning es2034 T261717 ', diff saved to https://phabricator.wikimedia.org/P12705 and previous config saved to /var/cache/conftool/dbconfig/20200922-052951-root.json [05:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:33:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add es2032, es2033 and es2034 into dbctl T261717', diff saved to https://phabricator.wikimedia.org/P12706 and previous config saved to /var/cache/conftool/dbconfig/20200922-053346-marostegui.json [05:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:24] mutante: no, CNAME/DYNA records will not be managed by Netbox at all (re gerrit dns/+/623465) [05:36:36] and thx for testing the decom cookbook [05:39:43] !log Deploy MCR schema change on s3 eqiad, this will generate lag on s3 on labsdb T238966 [05:39:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:47] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 [05:40:56] !log Log remove triggers on revision table on db1124:3313 T238966 [05:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:07] (03PS1) 10Volans: sre.hosts.decommission: improve commit message [cookbooks] - 10https://gerrit.wikimedia.org/r/629013 [05:44:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2013 (re)pooling @ 100%: Slowly repool after recloning es2032 T261717 ', diff saved to https://phabricator.wikimedia.org/P12707 and previous config saved to /var/cache/conftool/dbconfig/20200922-054430-root.json [05:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:44:36] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [05:44:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2016 (re)pooling @ 100%: Slowly repool after recloning es2032 T261717 ', diff saved to https://phabricator.wikimedia.org/P12708 and previous config saved to /var/cache/conftool/dbconfig/20200922-054438-root.json [05:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:44:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2019 (re)pooling @ 100%: Slowly repool after recloning es2034 T261717 ', diff saved to https://phabricator.wikimedia.org/P12709 and previous config saved to /var/cache/conftool/dbconfig/20200922-054455-root.json [05:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:45:12] (03CR) 10Volans: [C: 03+2] "Trivial message change, self-merging." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/629013 (owner: 10Volans) [05:46:10] (03Merged) 10jenkins-bot: sre.hosts.decommission: improve commit message [cookbooks] - 10https://gerrit.wikimedia.org/r/629013 (owner: 10Volans) [06:05:34] RECOVERY - Thanos query has high gRPC client errors on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [06:07:39] (03CR) 10Volans: [C: 04-2] "As is this will break debmonitor. There are 2 endpoints that respond to debmonitor.discovery.wmnet:" [puppet] - 10https://gerrit.wikimedia.org/r/628966 (https://phabricator.wikimedia.org/T263506) (owner: 10Dzahn) [06:08:02] (03CR) 10Volans: [C: 04-2] "As is this will break debmonitor. See related comment on I4be02810c1386699065b6503312aaffdb1894a6f" [dns] - 10https://gerrit.wikimedia.org/r/628965 (https://phabricator.wikimedia.org/T263506) (owner: 10Dzahn) [06:18:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2014 for decommissioning T262889', diff saved to https://phabricator.wikimedia.org/P12710 and previous config saved to /var/cache/conftool/dbconfig/20200922-061815-marostegui.json [06:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:22] T262889: decommission es2014.codfw.wmnet - https://phabricator.wikimedia.org/T262889 [06:19:37] (03PS1) 10Marostegui: es2014: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/629015 (https://phabricator.wikimedia.org/T262889) [06:20:21] (03CR) 10Marostegui: [C: 03+2] es2014: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/629015 (https://phabricator.wikimedia.org/T262889) (owner: 10Marostegui) [06:29:13] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [06:47:09] 
(03PS2) 10Muehlenhoff: urldownloader: convert A record to CNAME [dns] - 10https://gerrit.wikimedia.org/r/628763 (https://phabricator.wikimedia.org/T244153) [06:54:03] 10Operations, 10DNS, 10Traffic: dns repository left in a broken state - https://phabricator.wikimedia.org/T263518 (10Volans) p:05Triage→03Unbreak! [06:54:51] (03CR) 10Muehlenhoff: [C: 04-1] role::bastionhost::pop: remove prometheus instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/628940 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [06:57:55] 10Operations, 10DNS, 10Traffic: dns repository left in a broken state - https://phabricator.wikimedia.org/T263518 (10Volans) The errors seem to be caused by the lack of the entry related to `releases` in the `discovery-states` file in gdnsd configuration. This in turn seems to be related to the fact that in... [06:58:27] (03PS1) 10Ayounsi: Re-enable tftp ALG [homer/public] - 10https://gerrit.wikimedia.org/r/629049 [06:59:07] (03PS1) 10Volans: Revert "add dns-disc for releases servers" [dns] - 10https://gerrit.wikimedia.org/r/628995 [06:59:44] (03CR) 10Ayounsi: [C: 03+2] Re-enable tftp ALG [homer/public] - 10https://gerrit.wikimedia.org/r/629049 (owner: 10Ayounsi) [06:59:47] (03PS2) 10Volans: Revert "add dns-disc for releases servers" [dns] - 10https://gerrit.wikimedia.org/r/628995 [07:00:09] (03Merged) 10jenkins-bot: Re-enable tftp ALG [homer/public] - 10https://gerrit.wikimedia.org/r/629049 (owner: 10Ayounsi) [07:01:11] (03CR) 10Volans: [C: 03+2] "Merging to put back DNS in a proper state" [dns] - 10https://gerrit.wikimedia.org/r/628995 (owner: 10Volans) [07:03:27] 10Operations, 10DNS, 10Traffic: dns repository left in a broken state - https://phabricator.wikimedia.org/T263518 (10Volans) 05Open→03Resolved I've merged https://gerrit.wikimedia.org/r/c/operations/dns/+/628995 and now `authdns-update` runs without errors and the DNS is unblocked. 
[07:03:29] (03CR) 10Volans: "> Patch Set 4:" [dns] - 10https://gerrit.wikimedia.org/r/623465 (owner: 10Dzahn) [07:04:30] (03CR) 10Volans: "See https://phabricator.wikimedia.org/T263518" [dns] - 10https://gerrit.wikimedia.org/r/628994 (owner: 10Herron) [07:05:18] (03CR) 10Volans: [C: 03+1] "LGTM, thanks!" [dns] - 10https://gerrit.wikimedia.org/r/628763 (https://phabricator.wikimedia.org/T244153) (owner: 10Muehlenhoff) [07:08:15] (03PS3) 10Muehlenhoff: urldownloader: convert A record to CNAME [dns] - 10https://gerrit.wikimedia.org/r/628763 (https://phabricator.wikimedia.org/T244153) [07:08:49] (03CR) 10Elukey: [C: 03+2] presto: use puppet host TLS certificates by default [puppet] - 10https://gerrit.wikimedia.org/r/628850 (https://phabricator.wikimedia.org/T253957) (owner: 10Elukey) [07:11:28] !log cr1-codfw# run clear bfd session address fe80::f27c:c7ff:fe11:2c1b [07:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:36] (03PS1) 10Marostegui: instances.yaml: Remove es2014 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/629054 (https://phabricator.wikimedia.org/T262889) [07:13:08] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove es2014 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/629054 (https://phabricator.wikimedia.org/T262889) (owner: 10Marostegui) [07:13:23] (03CR) 10Muehlenhoff: [C: 03+2] urldownloader: convert A record to CNAME [dns] - 10https://gerrit.wikimedia.org/r/628763 (https://phabricator.wikimedia.org/T244153) (owner: 10Muehlenhoff) [07:14:25] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:14:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove es2014 from dbctl T262889', diff saved to https://phabricator.wikimedia.org/P12711 and previous config saved to /var/cache/conftool/dbconfig/20200922-071435-marostegui.json [07:14:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:40] T262889: decommission es2014.codfw.wmnet - https://phabricator.wikimedia.org/T262889 [07:19:26] <_joe_> marostegui: we have some etcdconfig alerts [07:19:35] <_joe_> lmk if you still see traffic and it's unexpected [07:19:49] _joe_: about uncommitted dbctl changes? [07:19:55] (my icinga doesn't show anything yet) [07:20:06] <_joe_> no about mediawiki etcdconfig being out of date [07:20:36] ah, that should be ok from my side, it will clear soon I think. I sometimes see them after pushing [07:20:50] <_joe_> yeah it's kinda-expected [07:21:12] (03PS1) 10Volans: depool ulsfo to migrate its DNS records to Netbox [dns] - 10https://gerrit.wikimedia.org/r/629055 (https://phabricator.wikimedia.org/T258729) [07:24:56] !log Stop MySQL on es2014 - host will be decommissioned T262889 [07:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:01] T262889: decommission es2014.codfw.wmnet - https://phabricator.wikimedia.org/T262889 [07:25:37] (03PS2) 10Giuseppe Lavagetto: services: use TLS to connect to ORES [puppet] - 10https://gerrit.wikimedia.org/r/628801 (https://phabricator.wikimedia.org/T244843) [07:26:22] (03PS4) 10JMeybohm: lvs: Remove termbox non-TLS endpoint from LVS 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/627300 (https://phabricator.wikimedia.org/T254581) [07:26:42] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [07:29:14] (03PS2) 10JMeybohm: lvs: Remove cxserver non-TLS endpoint from LVS 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/627434 (https://phabricator.wikimedia.org/T255879) [07:29:15] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts 
https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [07:30:40] (03CR) 10Jcrespo: [C: 03+1] test: TestOnlineSchemaChanger missed analyze config [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/624622 (https://phabricator.wikimedia.org/T261098) (owner: 10Hashar) [07:30:58] (03CR) 10JMeybohm: [C: 03+2] "Rebased" [puppet] - 10https://gerrit.wikimedia.org/r/627300 (https://phabricator.wikimedia.org/T254581) (owner: 10JMeybohm) [07:32:01] (03CR) 10Ema: [C: 03+1] depool ulsfo to migrate its DNS records to Netbox [dns] - 10https://gerrit.wikimedia.org/r/629055 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [07:32:26] (03PS2) 10Volans: depool ulsfo to migrate its DNS records to Netbox [dns] - 10https://gerrit.wikimedia.org/r/629055 (https://phabricator.wikimedia.org/T258729) [07:32:28] (03CR) 10Muehlenhoff: "Agreed, while it would be possible to split off the GUI from the internal endpoint, which receives the image/host data, but that would add" [puppet] - 10https://gerrit.wikimedia.org/r/628966 (https://phabricator.wikimedia.org/T263506) (owner: 10Dzahn) [07:33:03] (03PS3) 10JMeybohm: lvs: Remove cxserver non-TLS endpoint from LVS 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/627434 (https://phabricator.wikimedia.org/T255879) [07:33:50] (03CR) 10JMeybohm: [C: 03+2] lvs: Remove cxserver non-TLS endpoint from LVS 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/627434 (https://phabricator.wikimedia.org/T255879) (owner: 10JMeybohm) [07:34:02] !log depooling ulsfo to merge DNS migration to Netbox zonefiles - T258729 [07:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:07] T258729: netbox DNS Automation Workflow checklist for Commissioning and Decommissioning 2020Q1 - https://phabricator.wikimedia.org/T258729 [07:34:08] (03CR) 10Volans: [C: 03+2] depool ulsfo to migrate its DNS records to Netbox [dns] - 10https://gerrit.wikimedia.org/r/629055 (https://phabricator.wikimedia.org/T258729) 
(owner: 10Volans) [07:36:18] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [07:37:24] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:37:38] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:39:19] !log running puppet on lvs servers - T255879 T254581 [07:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:25] T255879: Move cxserver to use TLS only - https://phabricator.wikimedia.org/T255879 [07:39:25] T254581: Move termbox to use TLS only - https://phabricator.wikimedia.org/T254581 [07:39:39] some LVS alerts incoming [07:40:39] XioNoX: ^^^ link down again? [07:40:53] different link [07:41:02] ulsfo-eqord [07:41:08] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [07:41:10] ulsfo depool has been merged, lmk if we're still good [07:42:22] hm [07:42:28] !log restarting pybal on lvs1016.eqiad.wmnet,lvs2010.codfw.wmnet - T255879 T254581 [07:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:55] Zayo codfw-ulsfo link maintenance is starting, and ulsfo-eqord alerted, that's no good [07:43:01] volans: thx for the depool :) [07:43:18] lol [07:43:25] was meant for another thing... 
but sure :D [07:43:34] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.18:8080, 10.2.2.46:3030]) https://wikitech.wikimedia.org/wiki/PyBal [07:43:38] we have another tunnel on top of that so we're safe-ish [07:43:38] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.46:3030, 10.2.1.18:8080]) https://wikitech.wikimedia.org/wiki/PyBal [07:45:37] volans: Telia is back up [07:45:50] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:46:02] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.46:3030, 10.2.1.18:8080]) https://wikitech.wikimedia.org/wiki/PyBal [07:46:08] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:46:17] volans: anyway don't let those links stop you [07:46:29] !log restarting pybal on lvs1015.eqiad.wmnet,lvs2009.codfw.wmnet - T255879 T254581 [07:46:30] XioNoX: ack, I'm just waiting the reqs to go down [07:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:36] T255879: Move cxserver to use TLS only - https://phabricator.wikimedia.org/T255879 [07:46:36] T254581: Move termbox to use TLS only - https://phabricator.wikimedia.org/T254581 [07:46:47] will merge in few minutes [07:47:29] (03CR) 10Ema: [V: 03+2 C: 03+2] Add Origin and Description headers for every debian patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621014 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [07:48:07] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: 
Services in IPVS but unknown to PyBal: set([10.2.2.18:8080, 10.2.2.46:3030]) https://wikitech.wikimedia.org/wiki/PyBal [07:48:16] (03CR) 10Ema: [V: 03+2 C: 03+2] Remove unused patches [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621015 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [07:48:21] (03PS1) 10Ayounsi: Add policy-options for primary IXPs [homer/public] - 10https://gerrit.wikimedia.org/r/629056 (https://phabricator.wikimedia.org/T262517) [07:48:24] (03CR) 10Ema: [V: 03+2 C: 03+2] Remove unnecessary patches for Varnish 6 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621265 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [07:48:37] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [07:48:39] (03CR) 10Ema: [V: 03+2 C: 03+2] "This change is ready for review." [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621284 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [07:49:09] (03CR) 10Ema: [V: 03+2 C: 03+2] "This change is ready for review." 
[debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621532 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [07:49:26] (03CR) 10Ema: [V: 03+2 C: 03+2] Add 0006-bump-api-soname [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/622964 (https://phabricator.wikimedia.org/T261487) (owner: 10Vgutierrez) [07:49:48] !log running ipvsadm -D -t 10.2.2.18:8080; ipvsadm -D -t 10.2.2.46:3030 on lvs1016.eqiad.wmnet,lvs1015.eqiad.wmnet - T255879 T254581 [07:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:00] (03CR) 10Ema: [V: 03+2 C: 03+2] Update debian/control [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621693 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [07:50:32] (03CR) 10Ema: [V: 03+2 C: 03+2] Bump libvarnishapi SONAME [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/622965 (https://phabricator.wikimedia.org/T261487) (owner: 10Vgutierrez) [07:50:48] (03CR) 10Ema: [V: 03+2 C: 03+2] Use libvarnishapi2 instead of libvarnishapi1 in override [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/622973 (https://phabricator.wikimedia.org/T261487) (owner: 10Ema) [07:51:03] (03CR) 10Ema: [V: 03+2 C: 03+2] Package vcstool.py [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/622967 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [07:51:18] !log running ipvsadm -D -t 10.2.1.18:8080; ipvsadm -D -t 10.2.1.46:3030 on lvs2010.codfw.wmnet,lvs2009.codfw.wmnet - T255879 T254581 [07:51:22] (03CR) 10Ema: [V: 03+2 C: 03+2] Work around a breaking change in GNU make 4.3 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/622975 (https://phabricator.wikimedia.org/T260702) (owner: 10Ema) [07:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:55] (03CR) 10Ema: [V: 03+2 C: 03+2] Set debhelper compatibility level to 12 [debs/varnish4] (debian-wmf) - 
10https://gerrit.wikimedia.org/r/622978 (owner: 10Ema) [07:52:41] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [07:52:41] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [07:52:41] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [07:53:42] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Move cxserver to use TLS only - https://phabricator.wikimedia.org/T255879 (10JMeybohm) [07:54:30] !log kormat@cumin1001 dbctl commit (dc=all): 'db2076 depooling: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12712 and previous config saved to /var/cache/conftool/dbconfig/20200922-075429-kormat.json [07:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:35] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [07:55:18] (03CR) 10Ema: [V: 03+2 C: 03+2] Drop 0003-vsm-perms.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/623524 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [07:55:42] (03CR) 10Ema: [V: 03+2 C: 03+2] Release 6.0.6-1wm1 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/621694 (https://phabricator.wikimedia.org/T260702) (owner: 10Vgutierrez) [07:55:44] (03PS3) 10Volans: Migrate ulsfo private records to automated DNS [dns] - 10https://gerrit.wikimedia.org/r/627605 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [07:57:20] !log migrating ulsfo private DNS records to the Netbox-generated ones - T258729 [07:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:25] T258729: netbox DNS Automation Workflow checklist for Commissioning and Decommissioning 2020Q1 
- https://phabricator.wikimedia.org/T258729 [07:57:55] (03CR) 10Volans: [C: 03+2] "Last PS was just to resolve a conflict for the migration of prometheus from bastion to a dedicated host." [dns] - 10https://gerrit.wikimedia.org/r/627605 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [07:58:20] (03PS3) 10Volans: Migrate ulsfo public records to automated DNS [dns] - 10https://gerrit.wikimedia.org/r/628046 (https://phabricator.wikimedia.org/T258729) [07:58:24] 10Operations, 10Analytics-Radar, 10Traffic, 10Patch-For-Review: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (10ema) [07:58:53] 10Operations, 10Traffic, 10Patch-For-Review: Analyze custom varnish 5.1 patches considering the migration to varnish 6 - https://phabricator.wikimedia.org/T260702 (10ema) 05Open→03Resolved Most patches dropped! We're left with: - 0005-stats-shortlived.patch - 0006-bump-api-soname.patch Closing. [07:59:38] (03PS3) 10Ema: varnish: make Accept-Language lowercase [puppet] - 10https://gerrit.wikimedia.org/r/627295 (https://phabricator.wikimedia.org/T262428) [08:01:08] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [08:02:25] (03PS4) 10Jcrespo: mariadb: Stop using puppet to deploy wmfbackups and use debian packages [puppet] - 10https://gerrit.wikimedia.org/r/628163 (https://phabricator.wikimedia.org/T138562) [08:02:40] (03PS1) 10Elukey: Revert "presto: use puppet host TLS certificates by default" [puppet] - 10https://gerrit.wikimedia.org/r/628996 [08:03:41] (03CR) 10jerkins-bot: [V: 04-1] Revert "presto: use puppet host TLS certificates by default" [puppet] - 10https://gerrit.wikimedia.org/r/628996 (owner: 10Elukey) [08:04:23] it complains for the usage of hiera(), I am going to override jenkins for the revert [08:04:29] (03CR) 10Elukey: [V: 03+2 C: 03+2] Revert "presto: use puppet 
host TLS certificates by default" [puppet] - 10https://gerrit.wikimedia.org/r/628996 (owner: 10Elukey) [08:07:48] (03PS2) 10Giuseppe Lavagetto: services: add TLS encrypted endpoint for ores (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/628799 (https://phabricator.wikimedia.org/T244843) [08:07:50] (03PS2) 10Giuseppe Lavagetto: services: add TLS encrypted endpoint for ores (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/628800 (https://phabricator.wikimedia.org/T244843) [08:07:52] (03PS3) 10Giuseppe Lavagetto: services: use TLS to connect to ORES [puppet] - 10https://gerrit.wikimedia.org/r/628801 (https://phabricator.wikimedia.org/T244843) [08:07:54] (03PS2) 10Giuseppe Lavagetto: services: retire the ORES http endpoint (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/628802 (https://phabricator.wikimedia.org/T244843) [08:07:58] (03PS2) 10Giuseppe Lavagetto: services: retire the ORES http endpoint (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/628803 (https://phabricator.wikimedia.org/T244843) [08:09:56] (03PS3) 10Giuseppe Lavagetto: services: add TLS encrypted endpoint for ores (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/628799 (https://phabricator.wikimedia.org/T244843) [08:09:58] (03PS3) 10Giuseppe Lavagetto: services: add TLS encrypted endpoint for ores (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/628800 (https://phabricator.wikimedia.org/T244843) [08:10:00] (03PS4) 10Giuseppe Lavagetto: services: use TLS to connect to ORES [puppet] - 10https://gerrit.wikimedia.org/r/628801 (https://phabricator.wikimedia.org/T244843) [08:10:02] (03PS3) 10Giuseppe Lavagetto: services: retire the ORES http endpoint (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/628802 (https://phabricator.wikimedia.org/T244843) [08:10:04] (03PS3) 10Giuseppe Lavagetto: services: retire the ORES http endpoint (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/628803 (https://phabricator.wikimedia.org/T244843) [08:11:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool 
es2032, es2033 and es2034 for the first time with minimal weight T261717', diff saved to https://phabricator.wikimedia.org/P12713 and previous config saved to /var/cache/conftool/dbconfig/20200922-081154-marostegui.json [08:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:59] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [08:12:59] (03CR) 10Nikerabbit: [C: 03+1] "I don't believe wikipedia.dblist contains any private wikis, so filtering them out is probably harmless no-op." [puppet] - 10https://gerrit.wikimedia.org/r/628758 (https://phabricator.wikimedia.org/T263417) (owner: 10KartikMistry) [08:13:02] (03CR) 10Muehlenhoff: [C: 03+1] "Ah yes, indeed." [puppet] - 10https://gerrit.wikimedia.org/r/628768 (owner: 10Jbond) [08:13:51] !log uploaded wmfmariadbpy v0.5 to apt. deploying now to fleet [08:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:44] (03CR) 10JMeybohm: [C: 04-1] services: add TLS encrypted endpoint for ores (1/2) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/628799 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [08:15:19] (03PS4) 10Ema: varnish: make Accept-Language lowercase [puppet] - 10https://gerrit.wikimedia.org/r/627295 (https://phabricator.wikimedia.org/T262428) [08:16:49] (03CR) 10Jcrespo: "I am going to break down this patch in smaller ones for slow and safer deployment." 
[puppet] - 10https://gerrit.wikimedia.org/r/628163 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [08:16:57] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/628827 (owner: 10Jbond) [08:17:00] (03CR) 10JMeybohm: [C: 04-1] services: add TLS encrypted endpoint for ores (1/2) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/628799 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [08:17:13] (03CR) 10KartikMistry: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/628758 (https://phabricator.wikimedia.org/T263417) (owner: 10KartikMistry) [08:18:33] (03CR) 10Elukey: [C: 03+1] sslcert::x509_to_pkcs12: check for a valid p12 instead of just a file [puppet] - 10https://gerrit.wikimedia.org/r/628827 (owner: 10Jbond) [08:19:45] (03CR) 10JMeybohm: [C: 03+1] services: use TLS to connect to ORES [puppet] - 10https://gerrit.wikimedia.org/r/628801 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [08:20:10] (03CR) 10JMeybohm: [C: 03+1] services: retire the ORES http endpoint (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/628802 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [08:20:11] !log kormat@cumin1001 dbctl commit (dc=all): 'db2076 (re)pooling @ 25%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12714 and previous config saved to /var/cache/conftool/dbconfig/20200922-082010-kormat.json [08:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:16] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [08:20:40] (03CR) 10JMeybohm: [C: 03+1] services: retire the ORES http endpoint (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/628803 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [08:20:59] (03CR) 10Giuseppe Lavagetto: services: add TLS encrypted endpoint for ores (1/2) (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/628799 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [08:22:24] !log migrating ulsfo public DNS records to the Netbox-generated ones - T258729 [08:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:28] (03CR) 10Volans: [C: 03+2] Migrate ulsfo public records to automated DNS [dns] - 10https://gerrit.wikimedia.org/r/628046 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [08:22:28] T258729: netbox DNS Automation Workflow checklist for Commissioning and Decommissioning 2020Q1 - https://phabricator.wikimedia.org/T258729 [08:22:41] RECOVERY - Thanos query has high gRPC client errors on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [08:23:05] (03PS1) 10Ema: aptrepo: add component/varnish6 [puppet] - 10https://gerrit.wikimedia.org/r/629061 (https://phabricator.wikimedia.org/T261632) [08:23:43] (03CR) 10JMeybohm: [C: 03+1] services: add TLS encrypted endpoint for ores (1/2) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/628799 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [08:23:49] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/629061 (https://phabricator.wikimedia.org/T261632) (owner: 10Ema) [08:24:55] (03CR) 10JMeybohm: [C: 03+1] services: add TLS encrypted endpoint for ores (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/628800 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [08:30:17] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [08:33:05] (03PS1) 10Volans: multiple: mark ulsfo as migrated to Netbox [cookbooks] - 
10https://gerrit.wikimedia.org/r/629063 (https://phabricator.wikimedia.org/T258729) [08:33:48] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/25255/" [puppet] - 10https://gerrit.wikimedia.org/r/628799 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [08:35:14] !log kormat@cumin1001 dbctl commit (dc=all): 'db2076 (re)pooling @ 50%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12715 and previous config saved to /var/cache/conftool/dbconfig/20200922-083514-kormat.json [08:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:20] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [08:36:36] <_joe_> !log restarting pybal low-traffic in eqiad to pick up lvs changes [08:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:48] (03PS1) 10Volans: scripts: mark ulsfo as migrated to Netbox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/629064 (https://phabricator.wikimedia.org/T258729) [08:38:09] (03CR) 10Volans: [C: 03+2] multiple: mark ulsfo as migrated to Netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/629063 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [08:39:13] (03Merged) 10jenkins-bot: multiple: mark ulsfo as migrated to Netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/629063 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [08:39:37] (03CR) 10Volans: [C: 03+2] scripts: mark ulsfo as migrated to Netbox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/629064 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [08:39:59] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.10:443]) https://wikitech.wikimedia.org/wiki/PyBal [08:40:02] (03CR) 10Alexandros Kosiaris: [C: 03+2] Exclude testwikis and private wikis from CX 
draft purge script run [puppet] - 10https://gerrit.wikimedia.org/r/628758 (https://phabricator.wikimedia.org/T263417) (owner: 10KartikMistry) [08:40:50] (03PS1) 10Volans: Revert "depool ulsfo to migrate its DNS records to Netbox" [dns] - 10https://gerrit.wikimedia.org/r/628997 [08:40:51] (03PS2) 10Volans: Revert "depool ulsfo to migrate its DNS records to Netbox" [dns] - 10https://gerrit.wikimedia.org/r/628997 [08:41:53] (03PS1) 10Giuseppe Lavagetto: services_proxy: hotfix error from previous commit [puppet] - 10https://gerrit.wikimedia.org/r/629066 [08:42:19] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] services_proxy: hotfix error from previous commit [puppet] - 10https://gerrit.wikimedia.org/r/629066 (owner: 10Giuseppe Lavagetto) [08:43:49] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 101 connections established with conf1004.eqiad.wmnet:4001 (min=102) https://wikitech.wikimedia.org/wiki/PyBal [08:44:39] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.10:443]) https://wikitech.wikimedia.org/wiki/PyBal [08:45:20] <_joe_> this is me ^^ sorry fixing another error [08:50:17] !log kormat@cumin1001 dbctl commit (dc=all): 'db2076 (re)pooling @ 75%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12717 and previous config saved to /var/cache/conftool/dbconfig/20200922-085017-kormat.json [08:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:23] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [08:51:30] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:53:44] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.10:443]) https://wikitech.wikimedia.org/wiki/PyBal [08:54:00] RECOVERY - 
PyBal connections to etcd on lvs1016 is OK: OK: 102 connections established with conf1004.eqiad.wmnet:4001 (min=102) https://wikitech.wikimedia.org/wiki/PyBal [08:54:05] <_joe_> !log restarted pybal on lvs1015 [08:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:02] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [08:56:33] <_joe_> !log restarting pybal on lvs2010 [08:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:18] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: reduce prometheus.svc.eqsin TTL to 5M [dns] - 10https://gerrit.wikimedia.org/r/628853 (owner: 10Herron) [08:58:45] <_joe_> !log restart pybal on lvs2009 [08:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:55] (03CR) 10Jbond: "LGTM couple of nits" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/628970 (owner: 10Dzahn) [08:59:04] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [08:59:49] 10Operations, 10DBA, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) [09:00:06] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [09:00:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/628969 (owner: 10Dzahn) [09:01:04] (03CR) 10Jbond: [C: 03+2] admin: remove razzi from analytics_admins_members as they are in ops [puppet] - 10https://gerrit.wikimedia.org/r/628768 (owner: 10Jbond) [09:01:20] 10Operations, 10DBA, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) [09:03:52] (03CR) 10Jbond: [C: 
03+2] sslcert::x509_to_pkcs12: check for a valid p12 instead of just a file [puppet] - 10https://gerrit.wikimedia.org/r/628827 (owner: 10Jbond) [09:05:21] !log kormat@cumin1001 dbctl commit (dc=all): 'db2076 (re)pooling @ 100%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12718 and previous config saved to /var/cache/conftool/dbconfig/20200922-090520-kormat.json [09:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:26] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [09:06:37] (03CR) 10Giuseppe Lavagetto: [C: 03+2] services: add TLS encrypted endpoint for ores (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/628800 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [09:06:45] (03PS4) 10Giuseppe Lavagetto: services: add TLS encrypted endpoint for ores (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/628800 (https://phabricator.wikimedia.org/T244843) [09:11:09] !log update snmp string on ps1-a8-codfw [09:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:34] RECOVERY - ps1-a8-codfw-infeed-load-tower-A-phase-Z on ps1-a8-codfw is OK: SNMP OK - ps1-a8-codfw-infeed-load-tower-A-phase-Z 425 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:12:54] RECOVERY - ps1-a8-codfw-infeed-load-tower-A-phase-X on ps1-a8-codfw is OK: SNMP OK - ps1-a8-codfw-infeed-load-tower-A-phase-X 600 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:12:54] RECOVERY - ps1-a8-codfw-infeed-load-tower-B-phase-Y on ps1-a8-codfw is OK: SNMP OK - ps1-a8-codfw-infeed-load-tower-B-phase-Y 275 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:12:54] RECOVERY - ps1-a8-codfw-infeed-load-tower-A-phase-Y on ps1-a8-codfw is OK: SNMP OK - ps1-a8-codfw-infeed-load-tower-A-phase-Y 350 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:12:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 10%: Slowly es2032 T261717 ', diff saved to https://phabricator.wikimedia.org/P12719 and previous config saved to /var/cache/conftool/dbconfig/20200922-091255-root.json [09:12:59] (03PS1) 10Kormat: dbtools: Add utility scripts for {de,re}pooling [software] - 10https://gerrit.wikimedia.org/r/629067 [09:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:00] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [09:13:10] RECOVERY - ps1-a8-codfw-infeed-load-tower-B-phase-X on ps1-a8-codfw is OK: SNMP OK - ps1-a8-codfw-infeed-load-tower-B-phase-X 613 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:13:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 10%: Slowly repool es2033 T261717 ', diff saved to https://phabricator.wikimedia.org/P12720 and previous config saved to /var/cache/conftool/dbconfig/20200922-091310-root.json [09:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:22] RECOVERY - ps1-a8-codfw-infeed-load-tower-B-phase-Z on ps1-a8-codfw is OK: SNMP OK - ps1-a8-codfw-infeed-load-tower-B-phase-Z 550 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:13:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 10%: Slowly repool es2034 T261717 ', diff saved to https://phabricator.wikimedia.org/P12721 and previous config saved to /var/cache/conftool/dbconfig/20200922-091329-root.json [09:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:52] 10Operations, 10ops-codfw: ps1-a8-codfw WebUI unresponsive - https://phabricator.wikimedia.org/T263001 (10jbond) >>! 
In T263001#6480977, @Papaul wrote: > Firmware upgrade to version 7.1e fix the issue. Just to confirm this has fixed the issue and the snmp RO string has been updated [09:17:04] (03CR) 10Marostegui: [C: 03+1] "+1000000 thanks for this." [software] - 10https://gerrit.wikimedia.org/r/629067 (owner: 10Kormat) [09:17:08] (03CR) 10Filippo Giunchedi: am: use status.cgi JSON as source for problems (032 comments) [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/628090 (owner: 10Filippo Giunchedi) [09:17:12] (03CR) 10JMeybohm: [C: 03+1] wikifeeds: use the service proxy for reaching the MediaWiki api [deployment-charts] - 10https://gerrit.wikimedia.org/r/628756 (https://phabricator.wikimedia.org/T255878) (owner: 10Giuseppe Lavagetto) [09:18:08] 04Critical Alert for device ps1-a8-codfw.mgmt.codfw.wmnet - Device rebooted [09:19:34] (03PS1) 10Kormat: wmfmariadbpy: Add 'library' role value [puppet] - 10https://gerrit.wikimedia.org/r/629068 [09:19:41] (03CR) 10Kormat: [C: 03+2] dbtools: Add utility scripts for {de,re}pooling [software] - 10https://gerrit.wikimedia.org/r/629067 (owner: 10Kormat) [09:19:50] (03CR) 10jerkins-bot: [V: 04-1] dbtools: Add utility scripts for {de,re}pooling [software] - 10https://gerrit.wikimedia.org/r/629067 (owner: 10Kormat) [09:20:34] (03CR) 10Jcrespo: [C: 03+1] "I think this was character by character what my patch would be, including the name of the role!" 
[puppet] - 10https://gerrit.wikimedia.org/r/629068 (owner: 10Kormat) [09:21:45] (03CR) 10Kormat: [C: 03+2] wmfmariadbpy: Add 'library' role value [puppet] - 10https://gerrit.wikimedia.org/r/629068 (owner: 10Kormat) [09:22:22] !log add BGP_IXP_PRIMARY_in to cr3-ulsfo - T262517 [09:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:27] T262517: Prioritize underdog IXP - https://phabricator.wikimedia.org/T262517 [09:23:08] 04̶C̶r̶i̶t̶i̶c̶a̶l Device ps1-a8-codfw.mgmt.codfw.wmnet recovered from Device rebooted [09:23:16] (03PS1) 10Joal: Add page_props and user_properties to analytics sqoop [puppet] - 10https://gerrit.wikimedia.org/r/629070 (https://phabricator.wikimedia.org/T258047) [09:24:27] !log replace BGP_IXP_in with BGP_IXP_PRIMARY_in on cr3-ulsfo IX BGP group - T262517 [09:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:50] (03PS2) 10JMeybohm: lvs: Remove eventgate-main non-TLS endpoint from LVS 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/627537 (https://phabricator.wikimedia.org/T255873) [09:26:14] (03PS2) 10JMeybohm: lvs: Remove eventgate-analytics non-TLS endpoint from LVS 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/627529 (https://phabricator.wikimedia.org/T255870) [09:26:28] !log jbond@cumin1001 START - Cookbook sre.pdus.uptime [09:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:37] (03CR) 10JMeybohm: [C: 03+2] lvs: Remove eventgate-main non-TLS endpoint from LVS 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/627537 (https://phabricator.wikimedia.org/T255873) (owner: 10JMeybohm) [09:27:05] (03PS3) 10JMeybohm: lvs: Remove eventgate-analytics non-TLS endpoint from LVS 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/627529 (https://phabricator.wikimedia.org/T255870) [09:27:33] (03CR) 10JMeybohm: [C: 03+2] lvs: Remove eventgate-analytics non-TLS endpoint from LVS 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/627529 
(https://phabricator.wikimedia.org/T255870) (owner: 10JMeybohm) [09:27:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 25%: Slowly es2032 T261717 ', diff saved to https://phabricator.wikimedia.org/P12722 and previous config saved to /var/cache/conftool/dbconfig/20200922-092758-root.json [09:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:03] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [09:28:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 25%: Slowly repool es2033 T261717 ', diff saved to https://phabricator.wikimedia.org/P12723 and previous config saved to /var/cache/conftool/dbconfig/20200922-092814-root.json [09:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 25%: Slowly repool es2034 T261717 ', diff saved to https://phabricator.wikimedia.org/P12724 and previous config saved to /var/cache/conftool/dbconfig/20200922-092832-root.json [09:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:13] 10Operations, 10ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T262182 (10fgiunchedi) Thanks @Papaul ! Just checked now and it seems the controller didn't like the new BBU (feel free to power down the host if you need to) ` root@ms-be2019:~# hpssacli 'controller slot=3 show' | grep... 
[09:30:10] !log jbond@cumin1001 END (PASS) - Cookbook sre.pdus.uptime (exit_code=0) [09:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:52] !log repooling ulsfo after merging DNS migration to Netbox zonefiles - T258729 [09:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:58] T258729: netbox DNS Automation Workflow checklist for Commissioning and Decommissioning 2020Q1 - https://phabricator.wikimedia.org/T258729 [09:31:09] (03CR) 10Volans: [C: 03+2] Revert "depool ulsfo to migrate its DNS records to Netbox" [dns] - 10https://gerrit.wikimedia.org/r/628997 (owner: 10Volans) [09:33:37] (03PS1) 10Arturo Borrero Gonzalez: labtestvirt2003: hiera: fix addressing [puppet] - 10https://gerrit.wikimedia.org/r/629072 (https://phabricator.wikimedia.org/T261724) [09:34:21] (03CR) 10Hashar: "recheck" [software] - 10https://gerrit.wikimedia.org/r/629067 (owner: 10Kormat) [09:34:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestvirt2003: hiera: fix addressing [puppet] - 10https://gerrit.wikimedia.org/r/629072 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [09:39:55] (03CR) 10Kormat: [C: 03+2] dbtools: Add utility scripts for {de,re}pooling [software] - 10https://gerrit.wikimedia.org/r/629067 (owner: 10Kormat) [09:40:25] (03Merged) 10jenkins-bot: dbtools: Add utility scripts for {de,re}pooling [software] - 10https://gerrit.wikimedia.org/r/629067 (owner: 10Kormat) [09:40:40] (03PS2) 10JMeybohm: lvs: Remove eventgate-main non-TLS endpoint from LVS 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/627538 (https://phabricator.wikimedia.org/T255873) [09:42:33] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [09:42:48] * volans brb [09:43:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 
(re)pooling @ 50%: Slowly es2032 T261717 ', diff saved to https://phabricator.wikimedia.org/P12725 and previous config saved to /var/cache/conftool/dbconfig/20200922-094302-root.json [09:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:07] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [09:43:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 50%: Slowly repool es2033 T261717 ', diff saved to https://phabricator.wikimedia.org/P12726 and previous config saved to /var/cache/conftool/dbconfig/20200922-094317-root.json [09:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 50%: Slowly repool es2034 T261717 ', diff saved to https://phabricator.wikimedia.org/P12727 and previous config saved to /var/cache/conftool/dbconfig/20200922-094336-root.json [09:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:34] (03CR) 10Ayounsi: "Pushed on cr3-ulsfo." 
[homer/public] - 10https://gerrit.wikimedia.org/r/629056 (https://phabricator.wikimedia.org/T262517) (owner: 10Ayounsi) [09:46:39] !log jbond@cumin1001 START - Cookbook sre.pdus.rotate-password [09:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:56] !log jbond@cumin1001 END (FAIL) - Cookbook sre.pdus.rotate-password (exit_code=99) [09:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:07] (03CR) 10JMeybohm: [C: 03+2] lvs: Remove eventgate-main non-TLS endpoint from LVS 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/627538 (https://phabricator.wikimedia.org/T255873) (owner: 10JMeybohm) [09:48:43] (03PS2) 10JMeybohm: lvs: Remove eventgate-analytics non-TLS endpoint from LVS 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/627530 (https://phabricator.wikimedia.org/T255870) [09:49:31] (03CR) 10JMeybohm: [C: 03+2] lvs: Remove eventgate-analytics non-TLS endpoint from LVS 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/627530 (https://phabricator.wikimedia.org/T255870) (owner: 10JMeybohm) [09:50:23] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [09:51:12] !log running puppet on lvs servers - T255873 T255870 [09:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:19] T255870: Move eventgate-analytics to use TLS only - https://phabricator.wikimedia.org/T255870 [09:51:19] T255873: Move eventgate-main to use TLS only - https://phabricator.wikimedia.org/T255873 [09:54:01] I'm taking a look at the thanos errors [09:55:19] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.42:35192, 10.2.1.45:34192]) https://wikitech.wikimedia.org/wiki/PyBal [09:55:22] !log restarting pybal on lvs1016.eqiad.wmnet,lvs2010.codfw.wmnet - 
T255873 T255870 [09:55:27] pybal is me again [09:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:23] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.45:34192, 10.2.2.42:35192]) https://wikitech.wikimedia.org/wiki/PyBal [09:57:17] !log restarting pybal on lvs1015.eqiad.wmnet,lvs2009.codfw.wmnet - T255873 T255870 [09:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:22] T255870: Move eventgate-analytics to use TLS only - https://phabricator.wikimedia.org/T255870 [09:57:23] T255873: Move eventgate-main to use TLS only - https://phabricator.wikimedia.org/T255873 [09:57:58] (03PS1) 10Jbond: sre.pdus.rotate-password: fix TypeError: 'tuple' object does not support item assignment [cookbooks] - 10https://gerrit.wikimedia.org/r/629074 [09:58:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 75%: Slowly es2032 T261717 ', diff saved to https://phabricator.wikimedia.org/P12728 and previous config saved to /var/cache/conftool/dbconfig/20200922-095805-root.json [09:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:11] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [09:58:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 75%: Slowly repool es2033 T261717 ', diff saved to https://phabricator.wikimedia.org/P12729 and previous config saved to /var/cache/conftool/dbconfig/20200922-095821-root.json [09:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:31] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.42:35192, 10.2.2.45:34192]) https://wikitech.wikimedia.org/wiki/PyBal [09:58:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 75%: Slowly repool es2034 T261717 ', diff saved 
to https://phabricator.wikimedia.org/P12730 and previous config saved to /var/cache/conftool/dbconfig/20200922-095839-root.json [09:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:03] !log running ipvsadm -D -t 10.2.2.45:34192; ipvsadm -D -t 10.2.2.42:35192 on lvs1016.eqiad.wmnet,lvs1015.eqiad.wmnet - T255873 T255870 [09:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:33] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.42:35192, 10.2.1.45:34192]) https://wikitech.wikimedia.org/wiki/PyBal [09:59:44] !log running ipvsadm -D -t 10.2.1.45:34192; ipvsadm -D -t 10.2.1.42:35192 on lvs2010.codfw.wmnet,lvs2009.codfw.wmnet - T255873 T255870 [09:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:57] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [10:00:18] !log deploying schema change to s2 in eqiad. labsdb will have s2 lag until this finishes. 
T259831 [10:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:22] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [10:01:27] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:01:27] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:01:27] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:01:27] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:02:19] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Move eventgate-main to use TLS only - https://phabricator.wikimedia.org/T255873 (10JMeybohm) [10:02:31] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Move eventgate-analytics to use TLS only - https://phabricator.wikimedia.org/T255870 (10JMeybohm) [10:04:04] (03PS1) 10Jcrespo: mariadb-backups: Start using the package instead of the script on puppet [puppet] - 10https://gerrit.wikimedia.org/r/629076 (https://phabricator.wikimedia.org/T138562) [10:04:55] 10Operations, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10JMeybohm) [10:05:07] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Start using the package instead of the script on puppet [puppet] - 10https://gerrit.wikimedia.org/r/629076 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [10:05:22] (03CR) 10Filippo Giunchedi: "Just as a notification: once this is merged and esams bastion doesn't have prometheus, the 
thanos-query sporadic errors will also be resol" [puppet] - 10https://gerrit.wikimedia.org/r/628940 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [10:07:01] (03PS2) 10Jcrespo: mariadb-backups: Start using the package instead of the script on puppet [puppet] - 10https://gerrit.wikimedia.org/r/629076 (https://phabricator.wikimedia.org/T138562) [10:08:14] (03PS4) 10JMeybohm: citoid: remove unencrypted LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/625603 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [10:08:28] (03CR) 10Jcrespo: "So this package is self-sufficient at the moment, but it will likely end up depending on WMFMariaDB. section and ports methods, so adding " [puppet] - 10https://gerrit.wikimedia.org/r/629076 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [10:11:28] (03PS5) 10JMeybohm: citoid: remove unencrypted LVS endpoint 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/625603 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [10:11:30] (03PS1) 10JMeybohm: citoid: remove unencrypted LVS endpoint 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/629077 (https://phabricator.wikimedia.org/T255868) [10:13:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2032 (re)pooling @ 100%: Slowly es2032 T261717 ', diff saved to https://phabricator.wikimedia.org/P12731 and previous config saved to /var/cache/conftool/dbconfig/20200922-101308-root.json [10:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:15] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [10:13:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2033 (re)pooling @ 100%: Slowly repool es2033 T261717 ', diff saved to https://phabricator.wikimedia.org/P12732 and previous config saved to /var/cache/conftool/dbconfig/20200922-101324-root.json [10:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:13:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2034 (re)pooling @ 100%: Slowly repool es2034 T261717 ', diff saved to https://phabricator.wikimedia.org/P12733 and previous config saved to /var/cache/conftool/dbconfig/20200922-101342-root.json [10:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:04] 10Operations, 10ops-eqiad, 10netops: eqiad row D switch fabric recabling - https://phabricator.wikimedia.org/T256112 (10fgiunchedi) Checking the full list of servers, we have 10 ms-be hosts in there. Since we're deploying Swift with row-availability in mind I'm ok not to depool Swift out of eqiad for this. I... [10:15:43] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: add templated network interfaces file [puppet] - 10https://gerrit.wikimedia.org/r/629079 (https://phabricator.wikimedia.org/T261724) [10:16:23] (03PS1) 10Jcrespo: wmfmariadbpy: Update possible installation roles: add 'library' [puppet] - 10https://gerrit.wikimedia.org/r/629080 [10:18:01] (03PS2) 10Jcrespo: wmfmariadbpy: Update possible installation roles: add 'library' [puppet] - 10https://gerrit.wikimedia.org/r/629080 [10:18:20] (03PS3) 10Jcrespo: mariadb-backups: Start using the package instead of the script on puppet [puppet] - 10https://gerrit.wikimedia.org/r/629076 (https://phabricator.wikimedia.org/T138562) [10:18:53] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [10:19:30] (03CR) 10Kormat: [C: 03+1] wmfmariadbpy: Update possible installation roles: add 'library' [puppet] - 10https://gerrit.wikimedia.org/r/629080 (owner: 10Jcrespo) [10:20:03] (03CR) 10Jcrespo: [C: 03+2] wmfmariadbpy: Update possible installation roles: add 'library' [puppet] - 10https://gerrit.wikimedia.org/r/629080 (owner: 10Jcrespo) [10:21:12] (03PS1) 10Jbond: .gitignore: add spec file to git ignore 
[puppet] - 10https://gerrit.wikimedia.org/r/629081 [10:23:29] (03PS2) 10Jbond: .gitignore: add spec file to git ignore [puppet] - 10https://gerrit.wikimedia.org/r/629081 [10:25:15] (03CR) 10Jbond: [C: 03+2] .gitignore: add spec file to git ignore [puppet] - 10https://gerrit.wikimedia.org/r/629081 (owner: 10Jbond) [10:25:27] (03PS1) 10Filippo Giunchedi: hieradata: bump Swift object replicator concurrency [puppet] - 10https://gerrit.wikimedia.org/r/629082 (https://phabricator.wikimedia.org/T261633) [10:26:43] (03PS4) 10Jcrespo: mariadb-backups: Start using the package instead of the script on puppet [puppet] - 10https://gerrit.wikimedia.org/r/629076 (https://phabricator.wikimedia.org/T138562) [10:29:49] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler1003/25258/icinga1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/629076 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [10:29:56] (03CR) 10JMeybohm: [C: 03+1] service/planet: turn planet into an active-active service using discovery [puppet] - 10https://gerrit.wikimedia.org/r/628963 (https://phabricator.wikimedia.org/T263506) (owner: 10Dzahn) [10:33:20] !log installing remaining libx11 security updates [10:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:58] (03Abandoned) 10Elukey: Add basic debian packaging [software/pywmflib] - 10https://gerrit.wikimedia.org/r/626380 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [10:38:16] (03PS5) 10Jcrespo: mariadb-backups: Start using the package instead of the script on puppet [puppet] - 10https://gerrit.wikimedia.org/r/629076 (https://phabricator.wikimedia.org/T138562) [10:43:02] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Start using the package instead of the script on puppet [puppet] - 10https://gerrit.wikimedia.org/r/629076 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [10:44:12] !log installing bacula security updates on stretch [10:44:15] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:13] (03Abandoned) 10Filippo Giunchedi: logstash: output webrequest 5xx metrics [puppet] - 10https://gerrit.wikimedia.org/r/480943 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [10:48:31] !log roll-restarting sessionstore for java security updates [10:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:35] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart [10:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:49] (03PS1) 10Jbond: profile::pki::Server: update cfssl::signer to a define [puppet] - 10https://gerrit.wikimedia.org/r/629085 (https://phabricator.wikimedia.org/T259117) [10:50:21] 10Operations, 10Acme-chief, 10Traffic, 10Patch-For-Review: Let's Encrypt transitioning to ISRG's Root - https://phabricator.wikimedia.org/T263006 (10BBlack) For the record, the renewal of unified on the 18th did successfully fetch two different chains, but they're currently set up with the default chain to... [10:50:53] (03Abandoned) 10Filippo Giunchedi: [WIP] move varnish logging to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/492676 (https://phabricator.wikimedia.org/T213899) (owner: 10Filippo Giunchedi) [10:50:55] (03CR) 10Ayounsi: [C: 03+2] Add policy-options for primary IXPs [homer/public] - 10https://gerrit.wikimedia.org/r/629056 (https://phabricator.wikimedia.org/T262517) (owner: 10Ayounsi) [10:51:54] !log Add policy-options for primary IXPs to all routers - T262517 [10:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:59] T262517: Prioritize underdog IXP - https://phabricator.wikimedia.org/T262517 [10:57:50] 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: May I have your attention please! 
European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200922T1100) [11:00:05] Evrifaessa: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:14] !log installing intel-microcode 3.20200616.1 on Stretch baremetal servers (compared to the currently installed packages, this reverts microcode changes for some Skylake CPUs we don't use) [11:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:50] hello [11:02:29] Amir1, Lucas_WMDE, awight, and Urbanecm: anyone here? [11:02:38] (03PS1) 10Jcrespo: Fix dependencies and directory issues for wmfbackup-check package [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/629088 (https://phabricator.wikimedia.org/T257551) [11:02:41] At a meeting rn, sorry [11:02:45] same :/ [11:02:55] @seen Urbanecm [11:02:55] Evrifaessa: Urbanecm is in here, right now [11:03:04] (03PS2) 10Jcrespo: Fix dependencies and directory issues for wmfbackups-check package [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/629088 (https://phabricator.wikimedia.org/T257551) [11:07:31] (03PS2) 10Jbond: profile::pki::Server: update cfssl::signer to a define [puppet] - 10https://gerrit.wikimedia.org/r/629085 (https://phabricator.wikimedia.org/T259117) [11:08:31] Evrifaessa: if no one else jumps in can you ping me again in 30 minutes or so? 
I might have time then [11:08:38] sure [11:08:43] and the changes look like we should still be able to get through them then [11:11:43] (03CR) 10Ema: [C: 03+2] aptrepo: add component/varnish6 [puppet] - 10https://gerrit.wikimedia.org/r/629061 (https://phabricator.wikimedia.org/T261632) (owner: 10Ema) [11:13:44] !log installing intel-microcode 3.20200616.1 on Buster baremetal servers (compared to the currently installed packages, this reverts microcode changes for some Skylake CPUs we don't use) [11:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:59] 10Operations, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Joe) [11:16:58] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10Services, 10Service-deployment-requests: New Service Request: Wikimedia push notification service - https://phabricator.wikimedia.org/T250452 (10jijiki) [11:17:07] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki) 05Open→03Resolved [11:18:43] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki) @Mholloway I am marking this as resolved unless we believe there is a reason not to. 
[11:20:30] (03CR) 10Jbond: [C: 03+2] profile::pki::Server: update cfssl::signer to a define [puppet] - 10https://gerrit.wikimedia.org/r/629085 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [11:20:43] (03PS3) 10Jcrespo: Fix dependencies and directory issues for wmfbackups-check package [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/629088 (https://phabricator.wikimedia.org/T257551) [11:20:45] (03PS1) 10Jcrespo: Make package depend on wmfmariadbpy>=0.5 [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/629089 [11:21:35] (03CR) 10BBlack: [C: 03+2] geo-resources: create text-next for NEL [dns] - 10https://gerrit.wikimedia.org/r/626656 (https://phabricator.wikimedia.org/T261340) (owner: 10BBlack) [11:21:40] (03PS4) 10BBlack: geo-resources: create text-next for NEL [dns] - 10https://gerrit.wikimedia.org/r/626656 (https://phabricator.wikimedia.org/T261340) [11:24:03] (03CR) 10Ema: [C: 03+2] varnish: make Accept-Language lowercase [puppet] - 10https://gerrit.wikimedia.org/r/627295 (https://phabricator.wikimedia.org/T262428) (owner: 10Ema) [11:24:15] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: add templated network interfaces file [puppet] - 10https://gerrit.wikimedia.org/r/629079 (https://phabricator.wikimedia.org/T261724) [11:24:17] (03CR) 10BBlack: [C: 03+1] point intake-logging.wikimedia.org to text-next (second-best DC) [dns] - 10https://gerrit.wikimedia.org/r/628935 (https://phabricator.wikimedia.org/T261340) (owner: 10CDanis) [11:24:26] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) [11:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:45] !log upload varnish 6.0.6-1wm1 to buster-wikimedia component/varnish6 T261632 [11:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:51] T261632: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 [11:25:51] (03PS2) 10Jcrespo: Make package depend on wmfmariadbpy>=0.5 
[software/wmfbackups] - 10https://gerrit.wikimedia.org/r/629089 [11:26:33] Evrifaessa: ok I think I can deploy the changes now [11:26:45] \o/ [11:26:46] (while eating but we’re not in a video call so who cares :P) [11:26:52] :) [11:28:51] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628515 (https://phabricator.wikimedia.org/T263123) (owner: 10Evrifaessa) [11:28:57] (03PS4) 10Lucas Werkmeister (WMDE): Set timezone for wikis of the CWIRP to Europe/Rome [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628515 (https://phabricator.wikimedia.org/T263123) (owner: 10Evrifaessa) [11:29:02] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Set timezone for wikis of the CWIRP to Europe/Rome [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628515 (https://phabricator.wikimedia.org/T263123) (owner: 10Evrifaessa) [11:30:00] (03Merged) 10jenkins-bot: Set timezone for wikis of the CWIRP to Europe/Rome [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628515 (https://phabricator.wikimedia.org/T263123) (owner: 10Evrifaessa) [11:31:33] Evrifaessa: the change is on mwdebug2001, can you test it? [11:31:37] (03PS3) 10Arturo Borrero Gonzalez: cloudgw: add templated network interfaces file [puppet] - 10https://gerrit.wikimedia.org/r/629079 (https://phabricator.wikimedia.org/T261724) [11:31:39] sure [11:31:45] (maybe one wiki is enough) [11:32:39] (03CR) 10jerkins-bot: [V: 04-1] cloudgw: add templated network interfaces file [puppet] - 10https://gerrit.wikimedia.org/r/629079 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [11:32:49] 10Operations: Integrate Buster 10.5 point release - https://phabricator.wikimedia.org/T259519 (10MoritzMuehlenhoff) [11:32:52] hmm wait.. 
[11:33:05] Special:Preferences shows that the server time is 11.32 [11:33:10] it supposed to be 13.32 ig [11:33:20] but it says that it's using Europe/Rome [11:33:44] I also see 11:33 on dewiki FWIW [11:34:07] 10Operations: Integrate Buster 10.5 point release - https://phabricator.wikimedia.org/T259519 (10MoritzMuehlenhoff) [11:34:17] 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) [11:34:24] so it says that the wiki's tz is europe/roe [11:34:26] rome* [11:34:37] but shows 11.34 as time [11:35:16] well, dewiki at least has the same behavior as far as I can tell [11:35:20] so I guess that’s how it’s supposed to be? [11:35:26] !log roll-restarting restbase eqiad for java updates [11:35:28] it supposed to be 13.32 [11:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:29] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart [11:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:37] wait, let me use my signature on the wiki [11:35:57] 13:35, 22 Set 2020 (CEST) [11:36:00] cool, this works.. 
[11:36:03] on Special:RecentChanges, the last change goes from 10:41 to 12:41 [11:36:11] (on lijwiki) [11:36:20] see here : https://eml.wikipedia.org/wiki/Discussioni_utente:Evrifaessa [11:36:25] it works [11:36:34] ok then I’ll snyc [11:36:36] *sync [11:36:43] sure [11:37:44] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:628515|Set timezone for wikis of the CWIRP to Europe/Rome (T263123)]] (duration: 00m 59s) [11:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:49] T263123: Change the time for some wikis of the CWIRP - https://phabricator.wikimedia.org/T263123 [11:38:51] (03PS3) 10Lucas Werkmeister (WMDE): Removing Wikipedia store link from enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628521 (https://phabricator.wikimedia.org/T262329) (owner: 10Evrifaessa) [11:39:01] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Removing Wikipedia store link from enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628521 (https://phabricator.wikimedia.org/T262329) (owner: 10Evrifaessa) [11:39:09] (03PS1) 10Jcrespo: mariadb-backups: Use the wmfbackups-remote package and not the puppet version [puppet] - 10https://gerrit.wikimedia.org/r/629092 (https://phabricator.wikimedia.org/T138562) [11:39:52] (03Merged) 10jenkins-bot: Removing Wikipedia store link from enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628521 (https://phabricator.wikimedia.org/T262329) (owner: 10Evrifaessa) [11:40:12] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Use the wmfbackups-remote package and not the puppet version [puppet] - 10https://gerrit.wikimedia.org/r/629092 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [11:40:53] store link change is on mwdebug2001 [11:41:13] bit hard to test if they’ve already hidden the link with CSS, I s’pose [11:41:19] lemme see [11:41:38] (03PS2) 10Jcrespo: mariadb-backups: Use the wmfbackups-remote package and not the puppet 
version [puppet] - 10https://gerrit.wikimedia.org/r/629092 (https://phabricator.wikimedia.org/T138562) [11:41:39] doesn't work [11:41:45] they did hide it in CSS [11:41:55] but even after I removed the code that hides it [11:42:06] it shows up [11:42:07] hm, but with mwdebug2001 I still see the <li> in the browser dev tools [11:42:25] yeah, I can see the shop link [11:42:32] after removing display: none from css [11:42:41] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Use the wmfbackups-remote package and not the puppet version [puppet] - 10https://gerrit.wikimedia.org/r/629092 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [11:42:44] it supposed not to show up, I gues [11:42:48] guess* [11:43:20] if I disable the `display: none` in dev tools, then the link shows up [11:43:23] even on mwdebug2001 [11:43:25] yeah [11:43:27] that’s weird [11:43:29] (03CR) 10Jcrespo: "This deploy touches the cumin hosts." [puppet] - 10https://gerrit.wikimedia.org/r/629092 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [11:43:42] so um [11:43:48] what are we going to do? [11:44:07] I’ll see if I can find the relevant code really quickly [11:44:16] the task says [11:44:18] "According to @Pppery at T255381#6225398 , this can be accomplished by setting $wmgUseWikimediaShopLink to false." 
[11:44:19] T255381: English Wikipedia sidebar changes - https://phabricator.wikimedia.org/T255381 [11:44:28] 10Operations: Integrate Buster 10.5 point release - https://phabricator.wikimedia.org/T259519 (10MoritzMuehlenhoff) [11:44:57] (03PS3) 10Jcrespo: mariadb-backups: Use the wmfbackups-remote package and not the puppet version [puppet] - 10https://gerrit.wikimedia.org/r/629092 (https://phabricator.wikimedia.org/T138562) [11:45:12] (03PS1) 10Ema: varnish: install packages specifying archive component [puppet] - 10https://gerrit.wikimedia.org/r/629094 (https://phabricator.wikimedia.org/T261632) [11:45:13] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 128.1 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [11:45:16] my best guess is that the sidebar is cached somewhere [11:45:31] I think this change is relatively safe to deploy even if it doesn’t quite do what it’s supposed to [11:45:36] so I’d say, sync it, and then we see what happens [11:45:41] okay [11:45:55] if the link still doesn’t vanish from the sidebar, the task needs to stay open, but I don’t see a reason not to sync the change [11:46:01] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Use the wmfbackups-remote package and not the puppet version [puppet] - 10https://gerrit.wikimedia.org/r/629092 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [11:46:11] (03PS1) 10BBlack: mediawiki-cache-warmup: use HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/629095 [11:46:35] (03PS4) 10Jcrespo: mariadb-backups: Use the wmfbackups-remote package and not the puppet version [puppet] - 10https://gerrit.wikimedia.org/r/629092 (https://phabricator.wikimedia.org/T138562) [11:47:03] (03PS1) 10BBlack: 
check_etcd_mw_config_lastindex: use HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/629096 [11:47:17] (btw I checked shell.php on mwdebug2001, and $wmgUseWikimediaShopLink is false on enwiki) [11:47:31] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:628521|Removing Wikipedia store link from enwiki (T262329)]] (duration: 00m 57s) [11:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:36] T262329: Remove Wikipedia store link from enwiki - https://phabricator.wikimedia.org/T262329 [11:47:44] (03PS4) 10Arturo Borrero Gonzalez: cloudgw: add templated network interfaces file [puppet] - 10https://gerrit.wikimedia.org/r/629079 (https://phabricator.wikimedia.org/T261724) [11:48:13] hm, still seems to be there [11:48:24] yup [11:48:43] (03CR) 10jerkins-bot: [V: 04-1] cloudgw: add templated network interfaces file [puppet] - 10https://gerrit.wikimedia.org/r/629079 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [11:48:50] left a comment on the task [11:49:47] okay [11:50:03] let’s continue with the last change [11:50:09] if you’re okay with that [11:50:29] okay [11:50:38] (03PS1) 10Jcrespo: Call "backup-mariadb" on prepare, not "python3 backup_mariadb.py" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/629097 [11:50:47] (03PS3) 10Lucas Werkmeister (WMDE): Create Portal and Portal_talk namespaces on trwikisource, and fix an incorrect alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628598 (https://phabricator.wikimedia.org/T263358) (owner: 10Evrifaessa) [11:51:33] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Create Portal and Portal_talk namespaces on trwikisource, and fix an incorrect alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628598 (https://phabricator.wikimedia.org/T263358) (owner: 10Evrifaessa) [11:51:44] (03PS2) 10Ema: varnish: install packages specifying archive component [puppet] - 
10https://gerrit.wikimedia.org/r/629094 (https://phabricator.wikimedia.org/T261632) [11:52:22] (03Merged) 10jenkins-bot: Create Portal and Portal_talk namespaces on trwikisource, and fix an incorrect alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628598 (https://phabricator.wikimedia.org/T263358) (owner: 10Evrifaessa) [11:52:29] (03CR) 10Jcrespo: [C: 04-2] "We need to upgrade wmfbackups package on dbprov* hosts first, as it is a dependency there. Then this, as it will fail to find the older sc" [puppet] - 10https://gerrit.wikimedia.org/r/629092 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [11:52:43] (03PS1) 10Jbond: cfssl: update config paths to be some what standard [puppet] - 10https://gerrit.wikimedia.org/r/629098 [11:52:50] change is on mwdebug2001 [11:53:27] seems to work fine at https://tr.wikisource.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces|namespacealiases&formatversion=2 [11:53:37] seems to be working [11:53:39] ok [11:53:45] https://tr.wikisource.org/wiki/%C3%96zel:%C3%96nekDizini?prefix=test&namespace=201 [11:54:41] but wait [11:54:49] there are existing pages starting with Porta: [11:54:51] Portal:* [11:54:53] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:628598|Create Portal and Portal_talk namespaces on trwikisource, and fix an incorrect alias (T263358)]] (duration: 00m 57s) [11:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:58] yes, I’ll run a maintenance script to fix them [11:54:59] T263358: Create Portal and Portal_talk namespaces on trwikisource - https://phabricator.wikimedia.org/T263358 [11:55:56] !log lucaswerkmeister-wmde@mwmaint2001:~$ mwscript namespaceDupes.php trwikisource | tee T263358.dryrun # 1350 to fix, 1350 resolvable, 0 deleted [11:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:24] !log lucaswerkmeister-wmde@mwmaint2001:~$ mwscript 
namespaceDupes.php trwikisource --fix | tee T263358.fix # 1350 to fix, 1350 resolvable, 0 deleted [11:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:51] (03CR) 10Jcrespo: [C: 04-2] "Will be split in smaller patches." [puppet] - 10https://gerrit.wikimedia.org/r/628163 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [11:57:31] (03CR) 10Ema: [V: 03+2 C: 03+2] Update versioned dependency on libvarnishapi-dev [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/622969 (https://phabricator.wikimedia.org/T261487) (owner: 10Ema) [11:58:03] (03PS2) 10Jbond: cfssl: update config paths to be some what standard [puppet] - 10https://gerrit.wikimedia.org/r/629098 [11:58:30] (03CR) 10Muehlenhoff: varnish: install packages specifying archive component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629094 (https://phabricator.wikimedia.org/T261632) (owner: 10Ema) [11:58:46] !log EU backport&config window done [11:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:57] yay [11:59:04] \o/ [11:59:15] well, except for the sidebar link I guess [11:59:20] yeah.. [11:59:35] (03PS1) 10Ayounsi: Introduce primary_ixp variable [homer/public] - 10https://gerrit.wikimedia.org/r/629100 (https://phabricator.wikimedia.org/T262517) [11:59:39] byeee o/ [11:59:48] see you 👋 [11:59:53] (03CR) 10Tobias Andersson: [C: 03+1] Remove $wgExtraLanguageNames from Wikidata and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620050 (https://phabricator.wikimedia.org/T260118) (owner: 10Guergana Tzatchkova) [11:59:53] !log Reset password for SUL User:Freibo [11:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:18] Lucas_WMDE: perhaps we should also deploy the patch Tobias just +1'ed? 
🙂 [12:00:32] I’m in another meeting now, sorry [12:00:39] go ahead if you want to do it :) [12:00:43] 10Operations, 10DBA, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) [12:02:06] (03CR) 10Jbond: [C: 03+2] cfssl: update config paths to be some what standard [puppet] - 10https://gerrit.wikimedia.org/r/629098 (owner: 10Jbond) [12:03:58] (03Abandoned) 10Jcrespo: mariadb: Stop using puppet to deploy wmfbackups and use debian packages [puppet] - 10https://gerrit.wikimedia.org/r/628163 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [12:04:00] (03CR) 10Kormat: mariadb-backups: Start using the package instead of the script on puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629076 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [12:04:06] (03PS5) 10Jcrespo: mariadb-backups: Use the wmfbackups-remote package and not the puppet version [puppet] - 10https://gerrit.wikimedia.org/r/629092 (https://phabricator.wikimedia.org/T138562) [12:04:08] (03PS1) 10Jcrespo: mariadb: Stop using puppet to deploy wmfbackups and use debian packages [puppet] - 10https://gerrit.wikimedia.org/r/629101 (https://phabricator.wikimedia.org/T138562) [12:09:17] !log installing bundler updates on buster [12:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:04] 10Operations: Integrate Buster 10.5 point release - https://phabricator.wikimedia.org/T259519 (10MoritzMuehlenhoff) [12:11:00] !log installing python3.7 security updates on Buster [12:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:02] (03CR) 10Jcrespo: mariadb-backups: Start using the package instead of the script on puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629076 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [12:13:59] (03CR) 10Ayounsi: [C: 03+2] Introduce primary_ixp variable 
[homer/public] - 10https://gerrit.wikimedia.org/r/629100 (https://phabricator.wikimedia.org/T262517) (owner: 10Ayounsi) [12:14:37] (03Merged) 10jenkins-bot: Introduce primary_ixp variable [homer/public] - 10https://gerrit.wikimedia.org/r/629100 (https://phabricator.wikimedia.org/T262517) (owner: 10Ayounsi) [12:15:29] (03PS1) 10Ayounsi: Only create vlans with "active" status [homer/public] - 10https://gerrit.wikimedia.org/r/629102 [12:19:14] (03CR) 10Volans: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/629102 (owner: 10Ayounsi) [12:20:05] (03CR) 10Ayounsi: [C: 03+2] Only create vlans with "active" status [homer/public] - 10https://gerrit.wikimedia.org/r/629102 (owner: 10Ayounsi) [12:20:31] (03Merged) 10jenkins-bot: Only create vlans with "active" status [homer/public] - 10https://gerrit.wikimedia.org/r/629102 (owner: 10Ayounsi) [12:25:31] 10Operations, 10netops, 10Patch-For-Review: Prioritize underdog IXP - https://phabricator.wikimedia.org/T262517 (10ayounsi) 05Open→03Resolved All done in ulsfo, other sites to follow when they are turned on. 
[12:26:09] (03PS1) 10Ema: 1.1.0-1: Varnish 6 support [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/629103 (https://phabricator.wikimedia.org/T261632) [12:26:19] (03CR) 10jerkins-bot: [V: 04-1] 1.1.0-1: Varnish 6 support [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/629103 (https://phabricator.wikimedia.org/T261632) (owner: 10Ema) [12:28:10] 10Operations, 10Traffic, 10observability: Aggregated metrics for ats-tls <-> clients ttfb percentiles - https://phabricator.wikimedia.org/T263536 (10fgiunchedi) [12:30:06] (03PS2) 10Jcrespo: mariadb: Stop using puppet to deploy wmfbackups and use debian packages [puppet] - 10https://gerrit.wikimedia.org/r/629101 (https://phabricator.wikimedia.org/T138562) [12:32:23] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler1003/25264/dbprov1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/629101 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [12:32:42] (03CR) 10Volans: "I think that this will fail certificate validation given that we call it for each host using their IP address:" [puppet] - 10https://gerrit.wikimedia.org/r/629096 (owner: 10BBlack) [12:33:00] (03PS2) 10Ema: 1.1.0-1: Varnish 6 support [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/629103 (https://phabricator.wikimedia.org/T261632) [12:33:30] (03CR) 10jerkins-bot: [V: 04-1] 1.1.0-1: Varnish 6 support [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/629103 (https://phabricator.wikimedia.org/T261632) (owner: 10Ema) [12:33:32] (03CR) 10Jcrespo: [C: 03+2] mariadb: Stop using puppet to deploy wmfbackups and use debian packages [puppet] - 10https://gerrit.wikimedia.org/r/629101 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [12:34:55] PROBLEM - Check no envoy runtime configuration is left persistent on mwdebug1001 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [12:38:45] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is OK: (C)100 gt (W)80 gt 71.19 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [12:42:48] (03Abandoned) 10Jcrespo: backup_mariadb: Use path to find backup_mariadb.py [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/623754 (https://phabricator.wikimedia.org/T165358) (owner: 10Jcrespo) [12:44:37] (03PS1) 10Ema: package_builder: add hook D04varnish6 [puppet] - 10https://gerrit.wikimedia.org/r/629104 (https://phabricator.wikimedia.org/T261632) [12:45:13] (03PS4) 10Jcrespo: Fix dependencies and directory issues for wmfbackups-check package [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/629088 (https://phabricator.wikimedia.org/T257551) [12:45:45] (03CR) 10Jcrespo: [C: 03+2] Fix dependencies and directory issues for wmfbackups-check package [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/629088 (https://phabricator.wikimedia.org/T257551) (owner: 10Jcrespo) [12:45:48] (03PS1) 10Jbond: cfssl::csr: update csr resource to support multiple signer configs [puppet] - 10https://gerrit.wikimedia.org/r/629105 (https://phabricator.wikimedia.org/T259117) [12:46:27] (03CR) 10Jcrespo: [C: 03+2] Make package depend on wmfmariadbpy>=0.5 [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/629089 (owner: 10Jcrespo) [12:46:36] (03PS3) 10Jcrespo: Make package depend on wmfmariadbpy>=0.5 [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/629089 [12:46:50] (03CR) 10jerkins-bot: [V: 04-1] cfssl::csr: update csr resource to support multiple signer configs [puppet] - 10https://gerrit.wikimedia.org/r/629105 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) 
[12:47:26] (03CR) 10Jcrespo: [C: 03+2] Call "backup-mariadb" on prepare, not "python3 backup_mariadb.py" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/629097 (owner: 10Jcrespo) [12:47:45] (03PS2) 10Jbond: cfssl::csr: update csr resource to support multiple signer configs [puppet] - 10https://gerrit.wikimedia.org/r/629105 (https://phabricator.wikimedia.org/T259117) [12:47:57] (03PS2) 10Jcrespo: Call "backup-mariadb" on prepare, not "python3 backup_mariadb.py" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/629097 [12:48:47] (03CR) 10jerkins-bot: [V: 04-1] cfssl::csr: update csr resource to support multiple signer configs [puppet] - 10https://gerrit.wikimedia.org/r/629105 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [12:50:26] (03CR) 10Ema: [V: 03+2 C: 03+2] 1.1.0-1: Varnish 6 support [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/629103 (https://phabricator.wikimedia.org/T261632) (owner: 10Ema) [12:54:02] (03PS3) 10Jbond: cfssl::csr: update csr resource to support multiple signer configs [puppet] - 10https://gerrit.wikimedia.org/r/629105 (https://phabricator.wikimedia.org/T259117) [12:54:05] !log upload varnishkafka 1.1.0-1 to buster-wikimedia component/varnish6 T261632 [12:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:12] T261632: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 [12:55:04] (03CR) 10jerkins-bot: [V: 04-1] cfssl::csr: update csr resource to support multiple signer configs [puppet] - 10https://gerrit.wikimedia.org/r/629105 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [12:58:03] (03PS4) 10Jbond: cfssl::csr: update csr resource to support multiple signer configs [puppet] - 10https://gerrit.wikimedia.org/r/629105 (https://phabricator.wikimedia.org/T259117) [12:58:27] (03PS1) 10Jcrespo: debian/control: Correct syntax for package min version requirement [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/629107 
[12:59:03] (03CR) 10jerkins-bot: [V: 04-1] cfssl::csr: update csr resource to support multiple signer configs [puppet] - 10https://gerrit.wikimedia.org/r/629105 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [12:59:13] (03CR) 10Jcrespo: [C: 03+2] debian/control: Correct syntax for package min version requirement [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/629107 (owner: 10Jcrespo) [13:00:10] (03PS7) 10Cwhite: profile: install and configure statsd_exporter and retarget statsv [puppet] - 10https://gerrit.wikimedia.org/r/615269 (https://phabricator.wikimedia.org/T180105) [13:00:50] (03PS5) 10Jbond: cfssl::csr: update csr resource to support multiple signer configs [puppet] - 10https://gerrit.wikimedia.org/r/629105 (https://phabricator.wikimedia.org/T259117) [13:00:58] (03PS3) 10Ema: varnish: install packages specifying archive component [puppet] - 10https://gerrit.wikimedia.org/r/629094 (https://phabricator.wikimedia.org/T261632) [13:03:24] (03CR) 10Jbond: [C: 03+2] cfssl::csr: update csr resource to support multiple signer configs [puppet] - 10https://gerrit.wikimedia.org/r/629105 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [13:05:58] (03PS6) 10Jcrespo: mariadb-backups: Use the wmfbackups-remote package and not the puppet version [puppet] - 10https://gerrit.wikimedia.org/r/629092 (https://phabricator.wikimedia.org/T138562) [13:11:07] (03PS9) 10Cwhite: profile: add prometheus instance for external metrics [puppet] - 10https://gerrit.wikimedia.org/r/615288 (https://phabricator.wikimedia.org/T180105) [13:12:57] (03PS1) 10Jbond: cfssl::csr: fix white space in exec and update output location [puppet] - 10https://gerrit.wikimedia.org/r/629110 (https://phabricator.wikimedia.org/T259117) [13:13:14] (03CR) 10Jcrespo: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/25270/cumin1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/629092 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) 
[13:15:07] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [13:15:08] (03CR) 10Jbond: [C: 03+2] cfssl::csr: fix white space in exec and update output location [puppet] - 10https://gerrit.wikimedia.org/r/629110 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [13:16:13] (03PS5) 10Giuseppe Lavagetto: services: use TLS to connect to ORES [puppet] - 10https://gerrit.wikimedia.org/r/628801 (https://phabricator.wikimedia.org/T244843) [13:17:22] (03PS3) 10CDanis: point intake-logging.wikimedia.org to text-next (second-best DC) [dns] - 10https://gerrit.wikimedia.org/r/628935 (https://phabricator.wikimedia.org/T261340) [13:18:53] (03CR) 10CDanis: [C: 03+2] point intake-logging.wikimedia.org to text-next (second-best DC) [dns] - 10https://gerrit.wikimedia.org/r/628935 (https://phabricator.wikimedia.org/T261340) (owner: 10CDanis) [13:18:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] services: use TLS to connect to ORES [puppet] - 10https://gerrit.wikimedia.org/r/628801 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [13:20:39] 10Operations, 10Product-Infrastructure-Data, 10Epic, 10Goal, 10Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10CDanis) [13:20:55] (03CR) 10Cwhite: [C: 03+2] "PCC checks out: https://puppet-compiler.wmflabs.org/compiler1001/25272/" [puppet] - 10https://gerrit.wikimedia.org/r/615288 (https://phabricator.wikimedia.org/T180105) (owner: 10Cwhite) [13:21:04] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review: 'skip_first' feature flag for gdnsd GeoIP plugin - 
https://phabricator.wikimedia.org/T261340 (10CDanis) 05Open→03Resolved a:03CDanis Deployed at 13:20 UTC. The original TTL of the intake-logging CNAME was 1 day, so it will take that long for all cl... [13:22:06] (03PS4) 10Ema: varnish: install packages specifying archive component [puppet] - 10https://gerrit.wikimedia.org/r/629094 (https://phabricator.wikimedia.org/T261632) [13:23:01] (03CR) 10Ema: varnish: install packages specifying archive component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629094 (https://phabricator.wikimedia.org/T261632) (owner: 10Ema) [13:26:17] (03CR) 10Kormat: mariadb-backups: Start using the package instead of the script on puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629076 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [13:29:45] (03CR) 10Ema: [V: 03+2 C: 03+2] 1.8-1: Rebuild against Varnish 6 [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/625659 (https://phabricator.wikimedia.org/T261632) (owner: 10Vgutierrez) [13:30:15] (03PS1) 10Jbond: cfssl::csr: Ensure we manage the produced files [puppet] - 10https://gerrit.wikimedia.org/r/629113 (https://phabricator.wikimedia.org/T259117) [13:31:22] (03CR) 10Jbond: [C: 03+2] cfssl::csr: Ensure we manage the produced files [puppet] - 10https://gerrit.wikimedia.org/r/629113 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [13:32:01] (03CR) 10Jcrespo: mariadb-backups: Start using the package instead of the script on puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629076 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [13:32:37] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (10Papaul) a:03Papaul [13:33:49] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/629094 (https://phabricator.wikimedia.org/T261632) (owner: 10Ema) [13:33:53] 
10Operations, 10netops: Upgrade Fastnetmon to 1.1.7 - https://phabricator.wikimedia.org/T257035 (10CDanis) +1 for leaving the default `speed_calculation_delay = 1` +1 for setting `average_calculation_time_for_subnets` to the same as our overall `average_calculation_time` Just to make sure I understand, you w... [13:34:30] (03PS1) 10Jcrespo: test [puppet] - 10https://gerrit.wikimedia.org/r/629114 [13:34:56] (03CR) 10Jcrespo: "Is this what you mean?" [puppet] - 10https://gerrit.wikimedia.org/r/629114 (owner: 10Jcrespo) [13:35:45] !log upload libvmod-netmapper 1.8-1 to buster-wikimedia component/varnish6 T261632 [13:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:52] T261632: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 [13:35:55] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (10Papaul) ` [edit interfaces interface-range vlan-private1-b-codfw] - member ge-3/0/2; [edit interfaces interface-range disabled] member ge-4/0/39 {... 
[13:36:05] (03CR) 10Giuseppe Lavagetto: [C: 03+2] citoid: remove unencrypted LVS endpoint 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/629077 (https://phabricator.wikimedia.org/T255868) (owner: 10JMeybohm) [13:36:43] RECOVERY - Check no envoy runtime configuration is left persistent on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [13:36:51] (03CR) 10Giuseppe Lavagetto: [C: 03+2] proton-http: stop monitoring the endpoint [puppet] - 10https://gerrit.wikimedia.org/r/627857 (https://phabricator.wikimedia.org/T255877) (owner: 10Giuseppe Lavagetto) [13:37:00] (03PS2) 10Giuseppe Lavagetto: proton-http: stop monitoring the endpoint [puppet] - 10https://gerrit.wikimedia.org/r/627857 (https://phabricator.wikimedia.org/T255877) [13:37:04] (03CR) 10Jcrespo: "Because if it is only this, this may be just an accidental omission" [puppet] - 10https://gerrit.wikimedia.org/r/629114 (owner: 10Jcrespo) [13:37:49] (03PS3) 10Giuseppe Lavagetto: proton-http: stop monitoring the endpoint [puppet] - 10https://gerrit.wikimedia.org/r/627857 (https://phabricator.wikimedia.org/T255877) [13:38:11] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.2 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/629115 [13:38:54] (03CR) 10Elukey: [C: 03+1] CHANGELOG: add changelogs for release v0.0.2 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/629115 (owner: 10Volans) [13:39:35] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.2 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/629115 (owner: 10Volans) [13:39:42] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [13:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:31] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.2 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/629115 (owner: 10Volans) [13:41:10] (03PS1) 10Elukey: Add 
basic debian packaging [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/629116 [13:42:19] (03CR) 10jerkins-bot: [V: 04-1] Add basic debian packaging [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/629116 (owner: 10Elukey) [13:43:35] (03CR) 10JMeybohm: [C: 03+1] citoid: remove unencrypted LVS endpoint 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/625603 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [13:44:00] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:09] (03CR) 10Giuseppe Lavagetto: [C: 03+1] proton: remove non-https endpoint [puppet] - 10https://gerrit.wikimedia.org/r/627858 (https://phabricator.wikimedia.org/T255877) (owner: 10Giuseppe Lavagetto) [13:49:19] !log Deploy MCR change on db2098:3313 - T238966 [13:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:24] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 [13:50:12] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (10Papaul) [x] - disable switch port / set to asset tag if host isn't being unracked / remove from switch if being unracked. [] - system disks removed (by on... [13:50:43] (03PS1) 10Kormat: wmfbackups_poc [puppet] - 10https://gerrit.wikimedia.org/r/629118 [13:51:01] (03PS1) 10Ayounsi: FNM: add new default speed_calculation_delay [puppet] - 10https://gerrit.wikimedia.org/r/629119 (https://phabricator.wikimedia.org/T257035) [13:51:47] (03CR) 10CDanis: [C: 03+1] hieradata: bump Swift object replicator concurrency [puppet] - 10https://gerrit.wikimedia.org/r/629082 (https://phabricator.wikimedia.org/T261633) (owner: 10Filippo Giunchedi) [13:51:50] (03CR) 10Kormat: "A sketch of what i'm thinking of. 
(Not tested, or guaranteed to work as-is :)" [puppet] - 10https://gerrit.wikimedia.org/r/629118 (owner: 10Kormat) [13:52:11] 10Operations, 10Analytics-Radar, 10Traffic, 10Patch-For-Review: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (10ema) [13:53:10] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service: [EPIC] Deploy push-notifications service to production - https://phabricator.wikimedia.org/T256237 (10Jgiannelos) [13:55:08] !log upload varnish-modules 0.15.0-1+wmf1 to buster-wikimedia component/varnish6 T261632 [13:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:14] T261632: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 [13:55:42] (03PS1) 10Jbond: base::expose_puppet_certs: add option to enable/disable p12 encypted keys [puppet] - 10https://gerrit.wikimedia.org/r/629120 [13:55:53] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) [13:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:02] (03Abandoned) 10JMeybohm: lvs: Remove proton non-TLS endpoint from LVS 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/627541 (https://phabricator.wikimedia.org/T255877) (owner: 10JMeybohm) [13:56:52] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/629104 (https://phabricator.wikimedia.org/T261632) (owner: 10Ema) [13:57:07] (03CR) 10Kormat: [C: 04-2] "Not to be merged." 
[puppet] - 10https://gerrit.wikimedia.org/r/629118 (owner: 10Kormat) [13:57:45] (03CR) 10Cwhite: [C: 03+2] "PCC checks out: https://puppet-compiler.wmflabs.org/compiler1003/25271/" [puppet] - 10https://gerrit.wikimedia.org/r/615269 (https://phabricator.wikimedia.org/T180105) (owner: 10Cwhite) [13:57:53] kormat: don't be so hard on yourself in code reviews [13:57:59] (03PS6) 10JMeybohm: citoid: remove unencrypted LVS endpoint 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/625603 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [13:58:03] (03CR) 10Ema: [V: 03+2 C: 03+2] package_builder: add hook D04varnish6 [puppet] - 10https://gerrit.wikimedia.org/r/629104 (https://phabricator.wikimedia.org/T261632) (owner: 10Ema) [13:58:23] elukey: i'm just trying to get there before someone else does ;) [13:58:45] (03CR) 10JMeybohm: [C: 03+2] citoid: remove unencrypted LVS endpoint 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/625603 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [13:58:50] (03PS2) 10Elukey: Add basic debian packaging [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/629116 [13:59:02] kormat: you are a wise person, I knew it [13:59:15] !log add fastnetmon_1.1.7 to buster-wikimedia repo - T257035 [13:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:20] T257035: Upgrade Fastnetmon to 1.1.7 - https://phabricator.wikimedia.org/T257035 [13:59:20] ema: I got your package builder hook. Ready to deploy?
[13:59:27] shdubsh: yes please [13:59:30] (03PS2) 10Kormat: wmfbackups_poc [puppet] - 10https://gerrit.wikimedia.org/r/629118 [13:59:32] (03PS1) 10Cparle: Generation of json dumps for wikimedia commons [puppet] - 10https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) [13:59:34] (03PS2) 10JMeybohm: proton: remove non-https endpoint [puppet] - 10https://gerrit.wikimedia.org/r/627858 (https://phabricator.wikimedia.org/T255877) (owner: 10Giuseppe Lavagetto) [14:00:16] (03CR) 10Jbond: "@Elukey You will need to delete the current p12 files and let puppet regenerate them once this is merged" [puppet] - 10https://gerrit.wikimedia.org/r/629120 (owner: 10Jbond) [14:01:19] (03PS1) 10Muehlenhoff: Grafana config changes for CAS-enabled grafana-rw.w.o vhost [puppet] - 10https://gerrit.wikimedia.org/r/629122 (https://phabricator.wikimedia.org/T262512) [14:01:22] (03CR) 10Elukey: [C: 03+1] base::expose_puppet_certs: add option to enable/disable p12 encypted keys [puppet] - 10https://gerrit.wikimedia.org/r/629120 (owner: 10Jbond) [14:01:28] (03CR) 10JMeybohm: [C: 03+2] proton: remove non-https endpoint [puppet] - 10https://gerrit.wikimedia.org/r/627858 (https://phabricator.wikimedia.org/T255877) (owner: 10Giuseppe Lavagetto) [14:01:50] (03CR) 10jerkins-bot: [V: 04-1] Grafana config changes for CAS-enabled grafana-rw.w.o vhost [puppet] - 10https://gerrit.wikimedia.org/r/629122 (https://phabricator.wikimedia.org/T262512) (owner: 10Muehlenhoff) [14:01:54] 10Operations, 10netops, 10Patch-For-Review: Upgrade Fastnetmon to 1.1.7 - https://phabricator.wikimedia.org/T257035 (10ayounsi) > Just to make sure I understand, you were thinking we don't enable_connection_tracking? I agree with that, it rather expensive on CPU. Correct. > Surprisingly though one new CLI... 
[14:02:09] (03CR) 10Jbond: [C: 03+2] base::expose_puppet_certs: add option to enable/disable p12 encypted keys [puppet] - 10https://gerrit.wikimedia.org/r/629120 (owner: 10Jbond) [14:02:47] !log roll-restarting restbase codfw for java updates [14:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:12] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart [14:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:15] (03CR) 10LSobanski: [C: 04-1] "Testing my magical code review powers. Please ignore." [puppet] - 10https://gerrit.wikimedia.org/r/629118 (owner: 10Kormat) [14:03:17] jbond42: I caught some changes of yours in puppet merge. Okay to merge? [14:03:35] (03PS2) 10Muehlenhoff: Grafana config changes for CAS-enabled grafana-rw.w.o vhost [puppet] - 10https://gerrit.wikimedia.org/r/629122 (https://phabricator.wikimedia.org/T262512) [14:03:37] yes please [14:03:40] jayme: ^ [14:04:08] elukey: ^ this is the p12 change [14:04:12] jbond42: okay, done [14:04:17] thx [14:05:57] !log running puppet on lvs servers - T255868 T255877 [14:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:03] T255877: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 [14:06:03] T255868: Move citoid to use TLS only - https://phabricator.wikimedia.org/T255868 [14:06:37] !log upgrade FNM on netflow3001 - T257035 [14:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:41] T257035: Upgrade Fastnetmon to 1.1.7 - https://phabricator.wikimedia.org/T257035 [14:07:02] thanks jbond42! [14:07:08] (03CR) 10Volans: [C: 03+1] "LGTM!"
(031 comment) [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/629116 (owner: 10Elukey) [14:07:10] will test on hadoop [14:07:40] (03PS3) 10Elukey: Add basic debian packaging [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/629116 [14:07:52] (03CR) 10Elukey: Add basic debian packaging (031 comment) [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/629116 (owner: 10Elukey) [14:08:00] (03PS1) 10Jcrespo: Make wmfbackups depend on the python library. [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/629123 [14:09:02] !log upgrade FNM on netflow1001 - T257035 [14:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:10] (03CR) 10Elukey: [C: 03+2] Add basic debian packaging [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/629116 (owner: 10Elukey) [14:09:11] !log restart statsv on webperf[1-2]001 to route metrics through statsd-exporter [14:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:15] !log restarting pybal on lvs1016.eqiad.wmnet,lvs2010.codfw.wmnet - T255868 T255877 [14:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:33] (03PS3) 10Kormat: wmfbackups_poc [puppet] - 10https://gerrit.wikimedia.org/r/629118 [14:09:36] (03CR) 10Ema: [V: 03+2 C: 03+2] Release 1.5.3-1 [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/625629 (https://phabricator.wikimedia.org/T261632) (owner: 10Vgutierrez) [14:10:10] incoming pybal alerts is me [14:10:50] !log upgrade FNM on netflow5001 - T257035 [14:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:21] !log restarting pybal on lvs1015.eqiad.wmnet,lvs2009.codfw.wmnet - T255868 T255877 [14:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:26] T255877: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 [14:11:27] T255868: Move citoid to 
use TLS only - https://phabricator.wikimedia.org/T255868 [14:11:41] (03PS1) 10Cwhite: profile: point statsv at proper statsd-exporter intake port [puppet] - 10https://gerrit.wikimedia.org/r/629124 (https://phabricator.wikimedia.org/T180105) [14:12:01] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.21:24766, 10.2.2.19:1970]) https://wikitech.wikimedia.org/wiki/PyBal [14:12:01] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.21:24766, 10.2.1.19:1970]) https://wikitech.wikimedia.org/wiki/PyBal [14:12:01] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.21:24766, 10.2.1.19:1970]) https://wikitech.wikimedia.org/wiki/PyBal [14:12:01] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.21:24766, 10.2.2.19:1970]) https://wikitech.wikimedia.org/wiki/PyBal [14:12:01] !log running ipvsadm -D -t 10.2.2.19:1970; ipvsadm -D -t 10.2.2.21:24766 on lvs1016.eqiad.wmnet,lvs1015.eqiad.wmnet - T255868 T255877 [14:12:02] (03CR) 10Cwhite: [C: 03+2] profile: point statsv at proper statsd-exporter intake port [puppet] - 10https://gerrit.wikimedia.org/r/629124 (https://phabricator.wikimedia.org/T180105) (owner: 10Cwhite) [14:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:40] !log running ipvsadm -D -t 10.2.1.19:1970; ipvsadm -D -t 10.2.1.21:24766 on lvs2010.codfw.wmnet,lvs2009.codfw.wmnet - T255868 T255877 [14:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:13] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:14:13] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal 
https://wikitech.wikimedia.org/wiki/PyBal [14:14:13] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:14:13] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:14:24] 10Operations, 10Citoid, 10Prod-Kubernetes, 10serviceops, and 2 others: Move citoid to use TLS only - https://phabricator.wikimedia.org/T255868 (10JMeybohm) a:03JMeybohm [14:14:27] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (10JMeybohm) [14:14:32] (03PS5) 10Arturo Borrero Gonzalez: cloudgw: add templated network interfaces file [puppet] - 10https://gerrit.wikimedia.org/r/629079 (https://phabricator.wikimedia.org/T261724) [14:14:49] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/25278/" [puppet] - 10https://gerrit.wikimedia.org/r/629122 (https://phabricator.wikimedia.org/T262512) (owner: 10Muehlenhoff) [14:15:10] !log upgrade FNM on netflow2001 - T257035 [14:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:15] T257035: Upgrade Fastnetmon to 1.1.7 - https://phabricator.wikimedia.org/T257035 [14:15:18] 10Operations, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10JMeybohm) [14:15:32] (03CR) 10jerkins-bot: [V: 04-1] cloudgw: add templated network interfaces file [puppet] - 10https://gerrit.wikimedia.org/r/629079 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [14:15:54] (03PS1) 10Lucas Werkmeister (WMDE): Configure entityDataCachePaths for Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629133 [14:16:39] (03PS6) 10Arturo Borrero 
Gonzalez: cloudgw: add templated network interfaces file [puppet] - 10https://gerrit.wikimedia.org/r/629079 (https://phabricator.wikimedia.org/T261724) [14:17:27] (03CR) 10Ayounsi: [C: 03+2] FNM: add new default speed_calculation_delay [puppet] - 10https://gerrit.wikimedia.org/r/629119 (https://phabricator.wikimedia.org/T257035) (owner: 10Ayounsi) [14:18:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: add templated network interfaces file [puppet] - 10https://gerrit.wikimedia.org/r/629079 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [14:19:39] (03Abandoned) 10JMeybohm: lvs: Remove proton non-TLS endpoint from LVS 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/627542 (https://phabricator.wikimedia.org/T255877) (owner: 10JMeybohm) [14:19:46] 10Operations, 10netops, 10Patch-For-Review: Upgrade Fastnetmon to 1.1.7 - https://phabricator.wikimedia.org/T257035 (10ayounsi) 05Open→03Resolved a:03ayounsi All done! Thanks a lot and we can revisit when we need the API client. [14:20:57] (03CR) 10Abijeet Patro: [C: 03+1] Enable Special:TranslationStats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627744 (https://phabricator.wikimedia.org/T263004) (owner: 10Nikerabbit) [14:23:24] 10Operations, 10ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T262182 (10Papaul) @fgiunchedi yes that is the reason i left the task open but forgot to comment. 
Thanks [14:24:22] !log rebooting ms-be2019 [14:24:25] !log upload libvmod-re2 1.5.3-1 to buster-wikimedia component/varnish6 T261632 [14:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:30] T261632: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 [14:25:47] (03PS1) 10Jcrespo: Add missing dependency on wmfmariadbpy on mariadb::backup::transfer [puppet] - 10https://gerrit.wikimedia.org/r/629136 [14:26:00] (03CR) 10Nuria: [C: 04-1] [WIP] Drop /wmf/data/raw/mediawiki_job and /wmf/data/raw/netflow after 90 days (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/628895 (https://phabricator.wikimedia.org/T231339) (owner: 10Ottomata) [14:26:01] PROBLEM - Host ms-be2019 is DOWN: PING CRITICAL - Packet loss = 100% [14:29:10] (03PS1) 10Jbond: cfssl::csr: add better validation command [puppet] - 10https://gerrit.wikimedia.org/r/629137 (https://phabricator.wikimedia.org/T259117) [14:30:08] (03PS2) 10Jcrespo: Add missing dependency on wmfbackups-check to mariadb::backup::check [puppet] - 10https://gerrit.wikimedia.org/r/629114 [14:30:24] (03PS3) 10Jcrespo: Add missing dependency on wmfmariadbpy to mariadb::backup::check [puppet] - 10https://gerrit.wikimedia.org/r/629114 [14:30:46] (03PS2) 10Jbond: cfssl::csr: add better validation command [puppet] - 10https://gerrit.wikimedia.org/r/629137 (https://phabricator.wikimedia.org/T259117) [14:32:42] (03PS3) 10Jbond: cfssl::csr: add better validation command [puppet] - 10https://gerrit.wikimedia.org/r/629137 (https://phabricator.wikimedia.org/T259117) [14:33:09] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:49] !log restart apache on prometheus nodes to pick up new ext endpoint [14:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:21] !log installing nginx security updates on buster [14:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:54] 10Operations, 10Analytics-Radar, 10Traffic, 10Patch-For-Review: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (10ema) [14:36:08] 10Operations, 10MW-on-K8s, 10Platform Engineering, 10TechCom-RFC, 10serviceops: Decide on logging in k8s for ShellBox - https://phabricator.wikimedia.org/T263545 (10Pchelolo) [14:36:13] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move mathoid to use TLS only - https://phabricator.wikimedia.org/T255875 (10JMeybohm) a:03JMeybohm [14:36:16] (03CR) 10Jcrespo: [C: 03+2] Add missing dependency on wmfmariadbpy to mariadb::backup::check [puppet] - 10https://gerrit.wikimedia.org/r/629114 (owner: 10Jcrespo) [14:36:34] (03CR) 10Jcrespo: [C: 03+2] Add missing dependency on wmfmariadbpy on mariadb::backup::transfer [puppet] - 10https://gerrit.wikimedia.org/r/629136 (owner: 10Jcrespo) [14:36:41] (03PS2) 10Jcrespo: Add missing dependency on wmfmariadbpy on mariadb::backup::transfer [puppet] - 10https://gerrit.wikimedia.org/r/629136 [14:37:53] (03PS4) 10Jbond: cfssl::csr: add better validation command [puppet] - 10https://gerrit.wikimedia.org/r/629137 (https://phabricator.wikimedia.org/T259117) [14:38:44] (03CR) 10Jbond: [C: 03+2] cfssl::csr: add better validation command [puppet] - 10https://gerrit.wikimedia.org/r/629137 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [14:40:00] (03PS1) 10Giuseppe Lavagetto: service::catalog: format yaml consistently [puppet] - 10https://gerrit.wikimedia.org/r/629140 [14:42:37] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is 
CRITICAL: 105.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [14:44:38] 10Operations: Integrate Buster 10.5 point release - https://phabricator.wikimedia.org/T259519 (10MoritzMuehlenhoff) [14:44:39] (03CR) 10Jcrespo: "I will add the missing requires." [puppet] - 10https://gerrit.wikimedia.org/r/629118 (owner: 10Kormat) [14:44:50] 10Operations: Integrate Buster 10.5 point release - https://phabricator.wikimedia.org/T259519 (10MoritzMuehlenhoff) [14:46:40] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10Trizek-WMF) Concerning the message, I received some feedback about it on wiki talk pages. Some people said that the message was still visi... [14:48:38] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: rt_table: fix newline at the end of file [puppet] - 10https://gerrit.wikimedia.org/r/629143 (https://phabricator.wikimedia.org/T261724) [14:49:54] (03PS1) 10Volans: debian: fix lintian reported issues [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/629144 [14:50:30] (03CR) 10Elukey: [C: 03+1] debian: fix lintian reported issues [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/629144 (owner: 10Volans) [14:50:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: rt_table: fix newline at the end of file [puppet] - 10https://gerrit.wikimedia.org/r/629143 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [14:51:20] (03PS2) 10Volans: debian: fix lintian reported issues [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/629144 [14:51:31] (03Abandoned) 10Kormat: wmfbackups_poc [puppet] - 10https://gerrit.wikimedia.org/r/629118 (owner: 10Kormat) [14:52:11] (03CR) 
10Muehlenhoff: debian: fix lintian reported issues (031 comment) [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/629144 (owner: 10Volans) [14:53:56] (03CR) 10Volans: debian: fix lintian reported issues (031 comment) [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/629144 (owner: 10Volans) [14:54:09] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "Noop in PCC for all potentially affected systems." [puppet] - 10https://gerrit.wikimedia.org/r/629140 (owner: 10Giuseppe Lavagetto) [14:56:26] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: interfaces: avoid augeas loop [puppet] - 10https://gerrit.wikimedia.org/r/629146 (https://phabricator.wikimedia.org/T261724) [14:58:46] (03CR) 10Jcrespo: [C: 03+2] Make wmfbackups depend on the python library. [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/629123 (owner: 10Jcrespo) [15:00:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: interfaces: avoid augeas loop [puppet] - 10https://gerrit.wikimedia.org/r/629146 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [15:00:44] (03PS1) 10Jcrespo: Create /var/log/mariadb-backups, not /var/lib/mariadb-backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/629147 [15:02:04] (03CR) 10Jcrespo: [C: 03+2] Create /var/log/mariadb-backups, not /var/lib/mariadb-backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/629147 (owner: 10Jcrespo) [15:02:50] (03CR) 10Volans: [C: 03+2] debian: fix lintian reported issues [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/629144 (owner: 10Volans) [15:03:48] (03Merged) 10jenkins-bot: debian: fix lintian reported issues [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/629144 (owner: 10Volans) [15:10:11] (03PS1) 10Jbond: cfssl::csr: update to compare strings not numbers [puppet] - 10https://gerrit.wikimedia.org/r/629149 (https://phabricator.wikimedia.org/T259117) [15:10:26] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: interfaces: more template 
fixes [puppet] - 10https://gerrit.wikimedia.org/r/629150 (https://phabricator.wikimedia.org/T261724) [15:10:53] (03PS1) 10Jbond: Revert "base::expose_puppet_certs: add option to enable/disable p12 encypted keys" [puppet] - 10https://gerrit.wikimedia.org/r/629002 [15:11:18] (03CR) 10Jbond: [C: 03+2] cfssl::csr: update to compare strings not numbers [puppet] - 10https://gerrit.wikimedia.org/r/629149 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [15:12:05] (03CR) 10Jbond: [C: 03+2] Revert "base::expose_puppet_certs: add option to enable/disable p12 encypted keys" [puppet] - 10https://gerrit.wikimedia.org/r/629002 (owner: 10Jbond) [15:15:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: interfaces: more template fixes [puppet] - 10https://gerrit.wikimedia.org/r/629150 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [15:19:16] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: interfaces: template fixes [puppet] - 10https://gerrit.wikimedia.org/r/629151 (https://phabricator.wikimedia.org/T261724) [15:20:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1002/25284/labtestvirt2003.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/629151 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [15:21:47] (03PS1) 10Filippo Giunchedi: prometheus: allow icinga-am to run status.cgi [puppet] - 10https://gerrit.wikimedia.org/r/629152 (https://phabricator.wikimedia.org/T258948) [15:21:49] (03PS1) 10Filippo Giunchedi: WIP am alerts [puppet] - 10https://gerrit.wikimedia.org/r/629153 (https://phabricator.wikimedia.org/T258948) [15:22:48] (03PS1) 10Giuseppe Lavagetto: services: add TLS encrypted endpoint for wikifeeds (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/629154 (https://phabricator.wikimedia.org/T255878) [15:22:50] (03PS1) 10Giuseppe Lavagetto: services: add TLS encrypted endpoint for wikifeeds (2/2) [puppet] - 
10https://gerrit.wikimedia.org/r/629155 (https://phabricator.wikimedia.org/T255878) [15:23:04] (03PS1) 10BBlack: VCL: varnishlog for XFP fakers [puppet] - 10https://gerrit.wikimedia.org/r/629156 [15:28:10] 10Operations, 10Traffic, 10Patch-For-Review: Varnish 6.0 needs a SONAME version bump - https://phabricator.wikimedia.org/T261487 (10ema) 05Open→03Resolved a:03ema This is now done, our Varnish package version 6.0.6-1wm1 includes 0006-bump-api-soname.patch taking care of the SONAME bump. All dependencie... [15:28:15] 10Operations, 10Traffic, 10Patch-For-Review: Analyze custom varnish 5.1 patches considering the migration to varnish 6 - https://phabricator.wikimedia.org/T260702 (10ema) [15:28:20] (03PS4) 10Filippo Giunchedi: am: use status.cgi JSON as source for problems [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/628090 [15:28:24] 10Operations, 10Analytics-Radar, 10Traffic, 10Patch-For-Review: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (10ema) 05Open→03Resolved a:03ema All packages ready for prime time! [15:31:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:34:43] (03PS1) 10Jbond: cfsssl: update service description [puppet] - 10https://gerrit.wikimedia.org/r/629158 (https://phabricator.wikimedia.org/T259117) [15:37:10] 10Operations, 10MW-on-K8s, 10Platform Engineering, 10TechCom-RFC, 10serviceops: Decide on logging in k8s for ShellBox - https://phabricator.wikimedia.org/T263545 (10TK-999) This also relies on FPM's [[ https://www.php.net/manual/en/install.fpm.configuration.php#catch-workers-output | `catch_workers_outpu... 
[15:37:56] (03PS2) 10Jbond: cfsssl: update service description [puppet] - 10https://gerrit.wikimedia.org/r/629158 (https://phabricator.wikimedia.org/T259117) [15:38:46] 10Operations, 10Traffic: Upgrade a production cache node to Varnish 6 - https://phabricator.wikimedia.org/T263557 (10ema) [15:39:00] 10Operations, 10Traffic: Upgrade a production cache node to Varnish 6 - https://phabricator.wikimedia.org/T263557 (10ema) p:05Triage→03Medium [15:41:24] (03PS3) 10Jbond: cfsssl: update service description [puppet] - 10https://gerrit.wikimedia.org/r/629158 (https://phabricator.wikimedia.org/T259117) [15:42:22] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1101.eqiad.wmnet'] ` The l... [15:42:54] (03CR) 10Jbond: [C: 03+2] cfsssl: update service description [puppet] - 10https://gerrit.wikimedia.org/r/629158 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [15:44:03] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is OK: (C)100 gt (W)80 gt 67.12 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [15:44:21] (03PS1) 10Herron: prometheus: point prometheus.svc.esams to prometheus3001 [dns] - 10https://gerrit.wikimedia.org/r/629003 [15:44:37] RECOVERY - Host ms-be2019 is UP: PING OK - Packet loss = 0%, RTA = 31.72 ms [15:45:25] (03PS2) 10Herron: prometheus: point prometheus.svc.esams to prometheus3001 [dns] - 10https://gerrit.wikimedia.org/r/629003 [15:47:19] 10Operations, 10ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T262182 (10Papaul) 
05Open→03Resolved @fgiunchedi al good now upgrade the ILO firmware and reboot the server ` Cache Board Present: True Cache Status: OK Cache Ratio: 10% Read / 90% Write Drive Write Cache:... [15:48:14] (03PS3) 10Herron: prometheus: point prometheus.svc.esams to prometheus3001 [dns] - 10https://gerrit.wikimedia.org/r/629003 [15:48:58] (03CR) 10Herron: [C: 03+2] prometheus: point prometheus.svc.esams to prometheus3001 [dns] - 10https://gerrit.wikimedia.org/r/629003 (owner: 10Herron) [15:52:03] (03PS1) 10Jbond: profile::pki::server: Add new signer config [puppet] - 10https://gerrit.wikimedia.org/r/629161 (https://phabricator.wikimedia.org/T259117) [15:52:55] RECOVERY - Thanos query has high gRPC client errors on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [15:53:33] RECOVERY - HP RAID on ms-be2019 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:54:26] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` maps2006.codfw.wmnet ` The log can be found in `/v... 
[15:54:31] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps2006.codfw.wmnet'] ` Of which those **FAILED**: ` ['maps2006.codfw.wmnet'] ` [15:55:52] (03CR) 10Cwhite: [C: 03+1] prometheus: allow icinga-am to run status.cgi [puppet] - 10https://gerrit.wikimedia.org/r/629152 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [15:56:27] (03PS2) 10Herron: role::bastionhost::pop: remove prometheus instances [puppet] - 10https://gerrit.wikimedia.org/r/628940 (https://phabricator.wikimedia.org/T243057) [15:56:48] 10Operations, 10ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T262182 (10fgiunchedi) Neat, thank you ! [15:57:49] (03PS2) 10Jbond: profile::pki::server: Add new signer config [puppet] - 10https://gerrit.wikimedia.org/r/629161 (https://phabricator.wikimedia.org/T259117) [15:58:45] (03CR) 10Herron: "> Patch Set 1:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/628940 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [16:00:04] !log running dell epsa test on down host mw1360 per T262151 [16:00:04] jbond42 and cdanis: Your horoscope predicts another unfortunate Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200922T1600). 
[16:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:11] T262151: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 [16:01:03] (03PS3) 10Jbond: profile::pki::server: Add new signer config [puppet] - 10https://gerrit.wikimedia.org/r/629161 (https://phabricator.wikimedia.org/T259117) [16:02:08] (03CR) 10Jbond: [C: 03+2] profile::pki::server: Add new signer config [puppet] - 10https://gerrit.wikimedia.org/r/629161 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [16:03:09] (03CR) 10CRusnov: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/628898 (owner: 10Volans) [16:04:02] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10BBlack) a:05BBlack→03RobH @RobH please do when able [16:04:37] 10Operations, 10ops-eqiad, 10serviceops: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (10RobH) Ok, running the troubleshooting steps as follows: Copy over the SEL, erase it so it won't throw errors in testing. At this time, there are no errors for the NIC on the SEL, but full testing... [16:05:33] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` maps2006.codfw.wmnet ` The log can be found in `/v... 
[16:06:18] (03PS1) 10Jbond: profile::pki::server: enable cfssl server for sre intermediate [puppet] - 10https://gerrit.wikimedia.org/r/629163 (https://phabricator.wikimedia.org/T259117) [16:07:55] (03CR) 10Jbond: [C: 03+2] profile::pki::server: enable cfssl server for sre intermediate [puppet] - 10https://gerrit.wikimedia.org/r/629163 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [16:09:05] (03PS2) 10Filippo Giunchedi: profile: add alertmanager::alerts [puppet] - 10https://gerrit.wikimedia.org/r/629153 (https://phabricator.wikimedia.org/T258948) [16:09:07] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: allow icinga-am to run status.cgi [puppet] - 10https://gerrit.wikimedia.org/r/629152 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [16:09:13] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/628940 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [16:10:19] (03CR) 10Herron: [C: 03+2] role::bastionhost::pop: remove prometheus instances [puppet] - 10https://gerrit.wikimedia.org/r/628940 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [16:10:45] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: hardcode mapped IPv6 interface [puppet] - 10https://gerrit.wikimedia.org/r/629164 (https://phabricator.wikimedia.org/T261724) [16:12:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: hardcode mapped IPv6 interface [puppet] - 10https://gerrit.wikimedia.org/r/629164 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [16:12:18] 10Operations, 10Traffic, 10serviceops-radar: Increase in esams/eqsin cache_text network traffic since 2020-03-10 11:42 UTC - https://phabricator.wikimedia.org/T247583 (10BBlack) 05Open→03Declined I don't think anyone's had any ideas in these months, and the operational context and grafana data is startin... 
[16:13:03] 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): Evaluate the possibility to add Juniper images to Openstack - https://phabricator.wikimedia.org/T180179 (10BBlack) [16:13:32] PROBLEM - puppet last run on bast5001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:14:05] hrmm [16:14:10] no open tickets for bast5001 [16:14:31] The last Puppet run was at Tue Sep 22 16:11:24 UTC 2020 (2 minutes ago). [16:14:37] so the icinga check is outdated [16:14:44] (or someone just kicked it) [16:15:10] robh: was just re-enabling now actually [16:15:17] ahh cool [16:15:22] disregarding =] [16:15:33] kk, sorry for the noise [16:15:49] no worries, i only noticed cuz it falls into caching center and thus my brain pays attention [16:15:57] 'oh that shit is your problem if its busted rob' [16:16:00] heh [16:16:10] (03PS1) 10Elukey: profile::hadoop::common: add datanode mountpoints override [puppet] - 10https://gerrit.wikimedia.org/r/629165 [16:16:36] 10Operations, 10netops, 10IPv6, 10Patch-For-Review, 10User-jbond: Fix IPv6 autoconf issues once and for all, across the fleet. 
- https://phabricator.wikimedia.org/T102099 (10BBlack) a:03jbond [16:17:04] haha [16:18:29] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [16:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:32] RECOVERY - puppet last run on bast5001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:20:39] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:20:40] (03CR) 10Bstorm: "I've found that after cleanup, it's only 113 MB on an installed exec_node with all its packages (with non-functional grid...it was a testl" [puppet] - 10https://gerrit.wikimedia.org/r/628879 (https://phabricator.wikimedia.org/T263339) (owner: 10Bstorm) [16:20:42] (03PS1) 10Jbond: cfssl: add ability to configure CA bundle for signers [puppet] - 10https://gerrit.wikimedia.org/r/629166 (https://phabricator.wikimedia.org/T259117) [16:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:07] (03PS2) 10Elukey: profile::hadoop::common: add datanode mountpoints override [puppet] - 10https://gerrit.wikimedia.org/r/629165 [16:21:36] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10RLazarus) October 21 looks good, tentatively. Let me confirm with folks. Would you like to plan the switchback here, or start a fresh task? [16:22:03] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10Trizek-WMF) Thank you. A fresh task would be nice. 
:) [16:22:34] (03CR) 10Jbond: [C: 03+2] cfssl: add ability to configure CA bundle for signers [puppet] - 10https://gerrit.wikimedia.org/r/629166 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [16:24:37] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps2006.codfw.wmnet'] ` and were **ALL** successful. [16:25:10] (03PS1) 10Jbond: cfssl::signer service definition add continuation line [puppet] - 10https://gerrit.wikimedia.org/r/629167 [16:25:42] (03CR) 10Jbond: [C: 03+2] cfssl::signer service definition add continuation line [puppet] - 10https://gerrit.wikimedia.org/r/629167 (owner: 10Jbond) [16:26:17] 10Operations, 10ops-eqiad: an-worker1115 lost PSU redundancy - https://phabricator.wikimedia.org/T263569 (10RobH) p:05Triage→03Medium [16:28:08] (03PS3) 10Elukey: WIP: profile::hadoop::common: add datanode mountpoints override [puppet] - 10https://gerrit.wikimedia.org/r/629165 [16:28:22] 10Operations, 10ops-eqiad: Check jumbo1008.eqiad.wmnet PSU redundancy reported as critical - https://phabricator.wikimedia.org/T263262 (10RobH) So the hostname is actually kafka-jumbo1008, updating task with the relevant details. 
[16:29:56] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [16:30:16] 10Operations, 10ops-eqiad: kakfa-jumbo1008 psu redundacy fail - https://phabricator.wikimedia.org/T263262 (10RobH) [16:31:00] 10Operations, 10ops-eqiad, 10serviceops: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (10RobH) quick tests complete no errors, full testing continuing, eta on screen 3 hours. [16:31:29] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10Papaul) [16:32:12] ACKNOWLEDGEMENT - IPMI Sensor Status on kafka-jumbo1008 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] rhalsell https://phabricator.wikimedia.org/T263262 - The acknowledgement expires at: 2020-10-07 16:31:40. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:36:38] PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 66%, RTA = 4000.89 ms [16:37:24] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` maps2007.codfw.wmnet ` The log can be found in `/v... 
[16:38:43] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) [16:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:58] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:41:36] PROBLEM - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [16:42:06] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 66%, RTA = 5153.34 ms [16:44:18] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:45:16] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1101.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1101.eqiad.wmn... [16:45:24] (03CR) 10Hnowlan: [C: 03+2] api-gateway: document values a bit better. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/628804 (https://phabricator.wikimedia.org/T254916) (owner: 10Hnowlan) [16:46:04] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1033 psu redundancy alert - https://phabricator.wikimedia.org/T263145 (10RobH) p:05Triage→03Medium [16:46:25] 10Operations, 10Traffic, 10serviceops-radar: Increase in esams/eqsin cache_text network traffic since 2020-03-10 11:42 UTC - https://phabricator.wikimedia.org/T247583 (10CDanis) Wow, I had completely forgotten about this ticket -- but, I'm plugging T263277 here as something that would've helped diagnose, if... [16:47:29] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:47:34] (03Merged) 10jenkins-bot: api-gateway: document values a bit better. [deployment-charts] - 10https://gerrit.wikimedia.org/r/628804 (https://phabricator.wikimedia.org/T254916) (owner: 10Hnowlan) [16:48:34] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1033 psu redundancy alert - https://phabricator.wikimedia.org/T263145 (10RobH) [16:49:49] (03PS2) 10Herron: prometheus: reduce prometheus.svc.eqsin TTL to 5M [dns] - 10https://gerrit.wikimedia.org/r/628853 [16:50:21] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [16:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:32] (03PS1) 10Elukey: install_server: remove extra ] in netboot's config [puppet] - 10https://gerrit.wikimedia.org/r/629171 [16:50:37] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:51:26] (03CR) 10Elukey: [C: 03+2] install_server: remove extra ] in netboot's config [puppet] - 
10https://gerrit.wikimedia.org/r/629171 (owner: 10Elukey) [16:52:26] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:03] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps2007.codfw.wmnet'] ` Of which those **FAILED**: ` ['maps2007.codfw.wmnet'] ` [16:54:13] PROBLEM - Host elastic2037 is DOWN: PING CRITICAL - Packet loss = 100% [16:54:17] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` maps2007.codfw.wmnet ` The log can be found in `/v... [16:56:23] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:58:17] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps2007.codfw.wmnet'] ` and were **ALL** successful. [17:00:04] chrisalbon and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200922T1700). 
[17:02:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1101.eqiad.wmnet'] ` The l... [17:09:51] (03PS1) 10Jason Linehan: clientError: enable on ja,es,de,ru,it,zh,pt wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629174 (https://phabricator.wikimedia.org/T255585) [17:10:21] RECOVERY - Juniper alarms on mr1-eqsin is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [17:11:19] (03PS2) 10Jason Linehan: clientError: enable on ja,es,de,ru,it,zh,pt wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629174 (https://phabricator.wikimedia.org/T255585) [17:11:43] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10Papaul) [17:12:57] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:13:18] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (10Papaul) 05Open→03Resolved Complete [17:15:41] PROBLEM - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [17:17:50] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:19:41] (03PS4) 10Elukey: WIP: 
profile::hadoop::common: add datanode mountpoints override [puppet] - 10https://gerrit.wikimedia.org/r/629165 [17:24:00] (03PS5) 10Elukey: WIP: profile::hadoop::common: add datanode mountpoints override [puppet] - 10https://gerrit.wikimedia.org/r/629165 [17:24:45] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [17:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:10] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:26:55] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:03] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: Telia IC-361191 patch - https://phabricator.wikimedia.org/T261791 (10RobH) 05Open→03Resolved ` robh@re0.cr2-eqiad> show interfaces diagnostics optics xe-3/3/7 Physical interface: xe-3/3/7 Laser bias current : 38.708 mA Lase... 
[17:28:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: Telia IC-361191 patch - https://phabricator.wikimedia.org/T261791 (10RobH) 05Resolved→03Open [17:29:08] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [17:29:38] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:31:46] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:32:44] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 38, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:33:14] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: Telia IC-361191 patch - https://phabricator.wikimedia.org/T261791 (10RobH) 05Open→03Resolved a:05Cmjohnson→03None [17:36:39] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 227.21 ms [17:36:52] RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 227.22 ms [17:38:53] (03PS6) 10Elukey: WIP: profile::hadoop::common: add datanode mountpoints override [puppet] - 10https://gerrit.wikimedia.org/r/629165 [17:46:05] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1033 psu redundancy alert - https://phabricator.wikimedia.org/T263145 (10Andrew) [17:49:10] RECOVERY - Juniper alarms on mr1-eqsin is OK: 
JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [17:52:17] (03CR) 10Ahmon Dancy: [C: 03+2] Branch commit for wmf/1.36.0-wmf.10 [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/628979 (https://phabricator.wikimedia.org/T257978) (owner: 10TrainBranchBot) [17:53:52] (03CR) 10Krinkle: [C: 03+1] "Good to go!" [puppet] - 10https://gerrit.wikimedia.org/r/626224 (https://phabricator.wikimedia.org/T260826) (owner: 10Brennen Bearnes) [17:59:04] (03CR) 10Ahmon Dancy: [C: 03+2] Updated some cross references in comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621589 (owner: 10Ahmon Dancy) [18:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200922T1800) [18:05:49] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1101.eqiad.wmnet'] ` and were **ALL** successful. 
[18:12:53] (03CR) 10jerkins-bot: [V: 04-1] Branch commit for wmf/1.36.0-wmf.10 [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/628979 (https://phabricator.wikimedia.org/T257978) (owner: 10TrainBranchBot) [18:16:08] (03PS1) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/629175 [18:17:38] (03Abandoned) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/629175 (owner: 10Dduvall) [18:19:47] (03PS3) 10Ahmon Dancy: Updated some cross references in comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621589 [18:22:17] (03PS7) 10Elukey: WIP: profile::hadoop::common: add datanode mountpoints override [puppet] - 10https://gerrit.wikimedia.org/r/629165 [18:24:38] (03CR) 1020after4: [C: 03+2] "recheck" [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/628979 (https://phabricator.wikimedia.org/T257978) (owner: 10TrainBranchBot) [18:44:35] 10Operations, 10Wikimedia-Mailing-lists: Wikimedia-RU mailing list page has wrong encoding - https://phabricator.wikimedia.org/T135226 (10crusnov) p:05Triage→03Medium [18:46:56] (03Merged) 10jenkins-bot: Branch commit for wmf/1.36.0-wmf.10 [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/628979 (https://phabricator.wikimedia.org/T257978) (owner: 10TrainBranchBot) [18:49:05] 10Puppet: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10Volans) p:05Triage→03Medium [18:55:21] (03PS1) 10Volans: wmf-auto-reimage: temp workaround for PuppetDB [puppet] - 10https://gerrit.wikimedia.org/r/629181 (https://phabricator.wikimedia.org/T263578) [18:58:13] (03CR) 10CRusnov: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/629181 (https://phabricator.wikimedia.org/T263578) (owner: 10Volans) [18:58:17] 10Operations, 10Discovery, 10Traffic, 10WMDE-Analytics-Engineering, and 4 others: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - 
https://phabricator.wikimedia.org/T176875 (10Gehel) p:05Medium→03Low [18:58:56] (03PS1) 10Jbond: cfssl::csr: add auto renew function [puppet] - 10https://gerrit.wikimedia.org/r/629182 (https://phabricator.wikimedia.org/T259117) [19:00:04] dancy and twentyafterfour: Time to snap out of that daydream and deploy Mediawiki train - American Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200922T1900). [19:00:32] Demanding bot. [19:02:12] (03PS2) 10Jbond: cfssl::csr: add auto renew function [puppet] - 10https://gerrit.wikimedia.org/r/629182 (https://phabricator.wikimedia.org/T259117) [19:04:34] (03CR) 10Jbond: [C: 03+2] cfssl::csr: add auto renew function [puppet] - 10https://gerrit.wikimedia.org/r/629182 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [19:07:24] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` maps2008.codfw.wmnet ` The log can be found in `/v... [19:18:46] (03PS1) 10Jbond: cfssl::csr: renam to cfssl::cert [puppet] - 10https://gerrit.wikimedia.org/r/629206 (https://phabricator.wikimedia.org/T259117) [19:20:17] (03CR) 10Jbond: [C: 03+2] cfssl::csr: renam to cfssl::cert [puppet] - 10https://gerrit.wikimedia.org/r/629206 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [19:26:58] 10Operations, 10Discovery, 10Traffic, 10WMDE-Analytics-Engineering, and 4 others: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875 (10Gehel) 05Open→03Resolved a:03Gehel Looks like everything is done, please re-open if I've missed something. 
[19:29:38] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [19:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:41] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:50] (03PS1) 10Jbond: (WIP) P:pki::client: add ability to create certs [puppet] - 10https://gerrit.wikimedia.org/r/629207 [19:33:08] (03CR) 10Dzahn: [C: 03+2] logspam-watch: display seconds and refresh each cycle [puppet] - 10https://gerrit.wikimedia.org/r/626224 (https://phabricator.wikimedia.org/T260826) (owner: 10Brennen Bearnes) [19:34:23] (03PS2) 10Jbond: (WIP) P:pki::client: add ability to create certs [puppet] - 10https://gerrit.wikimedia.org/r/629207 (https://phabricator.wikimedia.org/T259117) [19:38:35] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps2008.codfw.wmnet'] ` and were **ALL** successful. [19:39:54] (03PS3) 10Jbond: (WIP) P:pki::client: add ability to create certs [puppet] - 10https://gerrit.wikimedia.org/r/629207 (https://phabricator.wikimedia.org/T259117) [19:44:00] !log dancy@deploy1001 Pruned MediaWiki: 1.36.0-wmf.5 (duration: 17m 59s) [19:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:06] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10observability, 10serviceops: illegal_argument_exception - https://phabricator.wikimedia.org/T262429 (10Mholloway) Is this resolved? It looks like the last occurrences of this error happened on 9/9, and logs seem to be co... 
[19:47:52] (03PS1) 10Ahmon Dancy: testwikis wikis to 1.36.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629210 [19:47:54] (03CR) 10Ahmon Dancy: [C: 03+2] testwikis wikis to 1.36.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629210 (owner: 10Ahmon Dancy) [19:49:04] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629210 (owner: 10Ahmon Dancy) [19:49:19] !log dancy@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.10 [19:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:45] (03Abandoned) 10Dzahn: service/debmonitor: turn debmonitor into an active-active service [puppet] - 10https://gerrit.wikimedia.org/r/628966 (https://phabricator.wikimedia.org/T263506) (owner: 10Dzahn) [19:50:02] (03Abandoned) 10Dzahn: switch debmonitor to discovery records [dns] - 10https://gerrit.wikimedia.org/r/628965 (https://phabricator.wikimedia.org/T263506) (owner: 10Dzahn) [19:57:37] ACKNOWLEDGEMENT - Host elastic2037 is DOWN: PING CRITICAL - Packet loss = 100% Ryan Kemper Instance is unreachable via ssh, looking into this [20:02:23] (03PS8) 10Elukey: WIP: profile::hadoop::common: add datanode mountpoints override [puppet] - 10https://gerrit.wikimedia.org/r/629165 [20:09:03] RECOVERY - Host elastic2037 is UP: PING OK - Packet loss = 0%, RTA = 31.72 ms [20:10:45] (03PS9) 10Elukey: WIP: profile::hadoop::common: add datanode mountpoints override [puppet] - 10https://gerrit.wikimedia.org/r/629165 [20:12:42] 10Operations, 10MW-on-K8s, 10TechCom-RFC, 10serviceops: Decide on logging in k8s for ShellBox - https://phabricator.wikimedia.org/T263545 (10holger.knust) [20:21:15] 10Operations, 10ops-codfw, 10DC-Ops: elastic2037 hardware errors - https://phabricator.wikimedia.org/T263588 (10wiki_willy) a:03Papaul [20:23:16] 10Operations, 10Analytics: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 
(10Ottomata) > The long-term answer (which might be stream processing stuff?) is stream processing stuff > In the very short term, I don't think it would be too hard to have... [20:23:49] (03PS10) 10Elukey: WIP: profile::hadoop::common: add datanode mountpoints override [puppet] - 10https://gerrit.wikimedia.org/r/629165 [20:24:51] (03CR) 10jerkins-bot: [V: 04-1] WIP: profile::hadoop::common: add datanode mountpoints override [puppet] - 10https://gerrit.wikimedia.org/r/629165 (owner: 10Elukey) [20:25:34] (03PS1) 10Jgreen: add A and PTR records for frmx2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/629212 (https://phabricator.wikimedia.org/T245557) [20:26:10] 10Operations, 10ops-codfw, 10DC-Ops: elastic2037 hardware errors - https://phabricator.wikimedia.org/T263588 (10RKemper) Just got shown https://phabricator.wikimedia.org/maniphest/task/edit/form/55/ so working on getting the ticket up to spec, one sec [20:26:20] (03CR) 10Jgreen: [C: 03+2] add A and PTR records for frmx2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/629212 (https://phabricator.wikimedia.org/T245557) (owner: 10Jgreen) [20:29:12] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` maps2009.codfw.wmnet ` The log can be found in `/v... 
[20:29:36] (03PS11) 10Elukey: WIP: profile::hadoop::common: add datanode mountpoints override [puppet] - 10https://gerrit.wikimedia.org/r/629165 [20:30:14] !log gerrit2001 (gerrit-replica) restarting gerrit service [20:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:37] (03CR) 10jerkins-bot: [V: 04-1] WIP: profile::hadoop::common: add datanode mountpoints override [puppet] - 10https://gerrit.wikimedia.org/r/629165 (owner: 10Elukey) [20:31:35] !log dancy@deploy1001 Finished scap: testwikis wikis to 1.36.0-wmf.10 (duration: 42m 21s) [20:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:15] 10Operations, 10ops-codfw, 10DC-Ops: elastic2037 hardware errors - https://phabricator.wikimedia.org/T263588 (10RKemper) [20:34:34] (03PS12) 10Elukey: WIP: profile::hadoop::common: add datanode mountpoints override [puppet] - 10https://gerrit.wikimedia.org/r/629165 [20:35:39] (03CR) 10jerkins-bot: [V: 04-1] WIP: profile::hadoop::common: add datanode mountpoints override [puppet] - 10https://gerrit.wikimedia.org/r/629165 (owner: 10Elukey) [20:38:09] (03PS13) 10Elukey: WIP: profile::hadoop::common: add datanode mountpoints override [puppet] - 10https://gerrit.wikimedia.org/r/629165 [20:41:27] (03PS1) 10Ahmon Dancy: group0 wikis to 1.36.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629215 [20:41:29] (03CR) 10Ahmon Dancy: [C: 03+2] group0 wikis to 1.36.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629215 (owner: 10Ahmon Dancy) [20:42:12] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [20:42:15] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629215 (owner: 10Ahmon Dancy) [20:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:55] !log dancy@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.10 [20:43:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:22] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:51] !log T259539 enabled adaptive replica selection on elasticsearch at search.svc.eqiad.wmnet:9[246]43 [20:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:56] T259539: Enable adaptive replica selection on CirrusSearch Elasticsearch clusters - https://phabricator.wikimedia.org/T259539 [20:48:10] 10Operations, 10DBA, 10Performance-Team, 10Platform Engineering, 10User-Kormat: Remove sections from db configs - https://phabricator.wikimedia.org/T263127 (10daniel) Contributions queries are somewhat special, we may want to keep them separate in case we want special indexes or sharding. [20:50:32] 10Operations, 10Analytics: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10Nuria) I cannot think of case of any piece of data we hold that includes an IP where we would not want the geo localization. So even doing that by default at all times mak... [20:52:13] (03PS14) 10Elukey: WIP: profile::hadoop::common: add datanode mountpoints override [puppet] - 10https://gerrit.wikimedia.org/r/629165 [20:52:32] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps2009.codfw.wmnet'] ` and were **ALL** successful. 
[20:54:53] (03CR) 10Ottomata: WIP: profile::hadoop::common: add datanode mountpoints override (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629165 (owner: 10Elukey) [20:55:30] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10Papaul) [20:57:55] PROBLEM - Host elastic2037 is DOWN: PING CRITICAL - Packet loss = 100% [20:58:23] 10Operations, 10Analytics: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10Ottomata) Nuria, except in {T262626}, this is exactly what Timo wants. He thinks that we should remove IPs from the data, and use the GeoIP country header that varnish se... [20:59:59] (03CR) 10Elukey: WIP: profile::hadoop::common: add datanode mountpoints override (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629165 (owner: 10Elukey) [21:04:23] (03PS1) 10Reedy: Disable deprecated warning in Language::commafy() for non numeric string [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/629188 (https://phabricator.wikimedia.org/T263592) [21:05:20] (03CR) 1020after4: [C: 03+2] "Train blocker." [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/629188 (https://phabricator.wikimedia.org/T263592) (owner: 10Reedy) [21:07:03] (03CR) 10Ottomata: "Naw go for it, however you can make it work will be great I'm sure :)" [puppet] - 10https://gerrit.wikimedia.org/r/629165 (owner: 10Elukey) [21:16:15] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` maps2010.codfw.wmnet ` The log can be found in `/v... 
[21:20:38] (03PS1) 10Legoktm: codesearch: Create use git protocol v2 [puppet] - 10https://gerrit.wikimedia.org/r/629222 (https://phabricator.wikimedia.org/T263591) [21:21:21] (03PS2) 10Legoktm: codesearch: Configure to use git protocol v2 [puppet] - 10https://gerrit.wikimedia.org/r/629222 (https://phabricator.wikimedia.org/T263591) [21:22:07] (03PS1) 10Dzahn: phabricator: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/629223 [21:26:26] 10Operations, 10ops-codfw, 10DC-Ops: elastic2037 Bad memroy (DIMM SLOT B1) Uncorrectabe memory Error - https://phabricator.wikimedia.org/T263588 (10Papaul) p:05Triage→03Medium [21:27:21] (03Merged) 10jenkins-bot: Disable deprecated warning in Language::commafy() for non numeric string [core] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/629188 (https://phabricator.wikimedia.org/T263592) (owner: 10Reedy) [21:31:15] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps2010.codfw.wmnet'] ` Of which those **FAILED**: ` ['maps2010.codfw.wmnet'] ` [21:31:43] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` maps2010.codfw.wmnet ` The log can be found in `/v... 
[21:32:26] (03CR) 10Legoktm: "I'm having trouble getting PCC to work: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/25319/console (I asked" [puppet] - 10https://gerrit.wikimedia.org/r/629222 (https://phabricator.wikimedia.org/T263591) (owner: 10Legoktm) [21:33:37] 10Operations, 10ops-codfw, 10DC-Ops: elastic2037 Bad memory (DIMM SLOT B1) Uncorrectabe memory Error - https://phabricator.wikimedia.org/T263588 (10Papaul) [21:34:31] ACKNOWLEDGEMENT - Host elastic2037 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T263588 [21:35:27] RECOVERY - Host elastic2037 is UP: PING WARNING - Packet loss = 75%, RTA = 922.27 ms [21:35:46] (03CR) 10Jdlrobson: [C: 03+1] clientError: enable on ja,es,de,ru,it,zh,pt wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629174 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [21:42:19] (03CR) 10Dzahn: [V: 03+1] "Hmm.. I was trying to stay away from renaming things in addition to hiera->lookup, data types, converting role to profile etc.. 
it already" [puppet] - 10https://gerrit.wikimedia.org/r/628970 (owner: 10Dzahn) [21:47:45] (03PS1) 10Urbanecm: Enable wgCheckUserLogLogins at all wikis but few large wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629227 (https://phabricator.wikimedia.org/T253802) [21:48:21] PROBLEM - Host elastic2037 is DOWN: PING CRITICAL - Packet loss = 100% [21:48:47] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [21:48:47] ^ me working on it [21:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:56] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [21:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:07] RECOVERY - Host elastic2037 is UP: PING OK - Packet loss = 0%, RTA = 31.72 ms [21:56:50] thanks papaul [21:57:48] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps2010.codfw.wmnet'] ` and were **ALL** successful. 
[22:08:43] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10Papaul) [22:09:03] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10Papaul) 05Open→03Resolved This is complete [22:34:40] 10Operations, 10ops-codfw, 10serviceops: decommission mw2135-mw2147, mw2187-mw2214 - physical / datacenter part - https://phabricator.wikimedia.org/T261524 (10Papaul) [22:35:33] 10Operations, 10serviceops, 10Patch-For-Review: Decommission mw2135-mw2147, mw2187-mw2214 (all PowerEdge R420) - https://phabricator.wikimedia.org/T260654 (10Papaul) [22:35:41] 10Operations, 10ops-codfw, 10serviceops: decommission mw2135-mw2147, mw2187-mw2214 - physical / datacenter part - https://phabricator.wikimedia.org/T261524 (10Papaul) 05Open→03Resolved This is complete [22:37:27] (03CR) 10Legoktm: "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1002/25324/codesearch6.codesearch.eqiad.wmflabs/index.html (thanks to Andrew f" [puppet] - 10https://gerrit.wikimedia.org/r/629222 (https://phabricator.wikimedia.org/T263591) (owner: 10Legoktm) [22:45:13] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Papaul) Dear Papaul Tshibamba, This e-mail is to update you on the status of your Dell Service Request. Current Status: The Dell replacemen... 
[22:50:54] (03CR) 10Dzahn: [C: 03+2] codesearch: Configure to use git protocol v2 [puppet] - 10https://gerrit.wikimedia.org/r/629222 (https://phabricator.wikimedia.org/T263591) (owner: 10Legoktm) [22:52:52] mutante: thanks :) [22:56:06] mutante: Error: /Stage[main]/Codesearch/File[/etc/hound-gitconfig]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/codesearch/hound-gitconfig [22:57:26] oh, I see [22:58:19] (03PS1) 10Legoktm: codesearch: Fix name of hound-gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/629236 [22:58:42] my bad [22:59:06] mutante: ^^ if you have time :) [23:00:05] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200922T2300). Please do the needful. [23:00:05] dmaza and hip: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:14] (03CR) 10Dzahn: [C: 03+2] codesearch: Fix name of hound-gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/629236 (owner: 10Legoktm) [23:00:37] 10Operations, 10ops-codfw, 10DC-Ops: elastic2037 Bad memory (DIMM SLOT B1) Uncorrectabe memory Error - https://phabricator.wikimedia.org/T263588 (10Papaul) 05Open→03Resolved Indeed we had a memory problem on DIMM B1 and A2 showing in the HW log. maybe just a temporally HW errors since it did clean after... [23:00:44] [..] you will be rewarded with a sticker << I thought it was a t-shirt :( [23:06:27] is someone available for deployment? 
[23:06:40] mutante: hmm, I don't see what's wrong now: Error: Could not set 'file' on ensure: Error 404 on SERVER: {"message":"Not Found: Could not find file_content modules/codesearch/hound-gitconfig","issue_kind":"RESOURCE_NOT_FOUND"} (file: /etc/puppet/modules/codesearch/manifests/init.pp, line: 92) [23:08:15] RoanKattouw Niharika Urbanecm: who shall deploy these poor patches two [23:09:06] I can deploy them I suppose [23:09:24] thank you @legoktm [23:09:35] thanks, works for me [23:10:17] (03PS3) 10Legoktm: Enable watchlist expiry feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628927 (https://phabricator.wikimedia.org/T261249) (owner: 10Dmaza) [23:10:25] (03CR) 10Legoktm: [C: 03+2] Enable watchlist expiry feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628927 (https://phabricator.wikimedia.org/T261249) (owner: 10Dmaza) [23:11:10] (03Merged) 10jenkins-bot: Enable watchlist expiry feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628927 (https://phabricator.wikimedia.org/T261249) (owner: 10Dmaza) [23:12:19] dmaza: it's live on mwdebug1002 [23:12:31] thanks.. looking [23:12:40] (03CR) 10Legoktm: [C: 03+2] clientError: enable on ja,es,de,ru,it,zh,pt wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629174 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [23:15:42] @legoktm I'm getting an error saying the wiki is on read-only mode [23:15:49] legoktm: are you using a local puppetmaster? running puppet on codesearch6 keeps showing different results. first run.. was not even your change yet, second run.. shows error,.. .third run.. error is gone but now stuck at refreshing hound-extensions [23:16:05] mutante: no, it should be the main cloud puppetmaster [23:16:16] and it shouldn't be possible that the result is different from compiler :/ [23:16:37] refreshing hound-extensions and the others will take a while [23:16:44] then .. it's working now [23:16:55] dmaza: looking.. 
[23:17:32] well it's read-only for me too [23:18:03] oh [23:18:11] maybe I should be syncing to codfw? [23:18:25] 10Operations, 10observability: librenms page didn't auto-resolve in VO - https://phabricator.wikimedia.org/T263423 (10crusnov) p:05Triage→03Medium [23:19:00] I can't really say [23:19:03] dmaza: give mwdebug2002 a try [23:19:11] ok [23:19:45] (03PS3) 10Legoktm: clientError: enable on ja,es,de,ru,it,zh,pt wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629174 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [23:19:48] perfect.. works now. Give me a sec to run a couple more tests [23:19:54] sweet [23:20:11] (03CR) 10Legoktm: [C: 03+2] "..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629174 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [23:20:55] (03Merged) 10jenkins-bot: clientError: enable on ja,es,de,ru,it,zh,pt wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629174 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [23:22:23] @legoktm everything looks good [23:23:21] syncing everywhere [23:23:33] hip: yours will be ready for testing in a minute [23:23:53] legoktm: no sweat! [23:23:59] !log legoktm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable watchlist expiry feature (T261249) (duration: 01m 06s) [23:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:06] T261249: Watchlist Expiry: Enable feature for Group 0 pilot wikis [Start: Sept 22] - https://phabricator.wikimedia.org/T261249 [23:24:07] dmaza: ^^ [23:24:17] thank you very much [23:24:30] np [23:24:38] hip: ok, it's on mwdebug2002 [23:24:46] legoktm: thanks, checking [23:25:41] all green! 
[23:26:03] awesome [23:26:33] syncing [23:27:37] !log legoktm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: clientError: enable on ja,es,de,ru,it,zh,pt wikipedias (T255585) (duration: 01m 04s) [23:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:41] T255585: Extend client-side error logging coverage - https://phabricator.wikimedia.org/T255585 [23:27:44] hip: ^^ [23:27:55] legoktm: many thanks! [23:29:10] legoktm: Notice: Applied catalog in 900.70 seconds [23:29:20] that is a long time for a puppet run, but it is done [23:29:36] the issue earlier must have been in the sync between prod and cloud puppetmaster..shrug [23:30:06] it worked now. logging out of codesearch6 again and bbl [23:30:21] mutante: looks good to me, thanks for all your help! [23:30:32] no problem, cy later [23:58:48] (03CR) 10Huji: [C: 04-1] Enable wgCheckUserLogLogins at all wikis but few large wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629227 (https://phabricator.wikimedia.org/T253802) (owner: 10Urbanecm)