[00:42:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:43:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:45:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:57] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 55 probes of 649 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:02:07] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 7 probes of 649 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:17:43] (CR) HitomiAkane: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/629738 (https://phabricator.wikimedia.org/T262218) (owner: HitomiAkane)
[05:10:35] Operations, ops-codfw, DBA: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (Marostegui) Thank you Papaul
[05:19:40] (PS1) Marostegui: dbproxy1018: Decrease labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/630392
[05:21:03] Operations, OTRS, serviceops, Patch-For-Review, User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (Marostegui) >>!
In T187984#6494499, @jcrespo wrote: > db1077 should now be available to be put back on test-* section, I don't think it is...
[05:21:31] (CR) Marostegui: [C: +2] dbproxy1018: Decrease labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/630392 (owner: Marostegui)
[05:22:16] !log Decrease labsdb1011 weight
[05:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:07] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:26:45] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:33:13] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:33:49] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:33:59] (CR) Marostegui: [C: +1] wmnet: Update s8-master alias [dns] - https://gerrit.wikimedia.org/r/629716 (https://phabricator.wikimedia.org/T239238) (owner: Kormat)
[05:34:29] (CR) Marostegui: [C: +1] mariadb: Promote db1104 to s8 master [puppet] - https://gerrit.wikimedia.org/r/629707 (https://phabricator.wikimedia.org/T239238) (owner: Kormat)
[05:46:15] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[05:46:43] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:47:43] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:48:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2013 T263740', diff saved to https://phabricator.wikimedia.org/P12804 and previous config saved to /var/cache/conftool/dbconfig/20200928-054846-marostegui.json
[05:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:54] T263740: decommission es2013.codfw.wmnet - https://phabricator.wikimedia.org/T263740
[05:49:27] (PS1) Marostegui: es2013: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/630393 (https://phabricator.wikimedia.org/T263740)
[05:49:29] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[05:50:30] (CR) Marostegui: [C: +2] es2013: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/630393 (https://phabricator.wikimedia.org/T263740) (owner: Marostegui)
[05:52:22] (PS1) Marostegui: instances.yaml: Remove es2013 from dbctl [puppet] - https://gerrit.wikimedia.org/r/630394 (https://phabricator.wikimedia.org/T263740)
[05:53:03] (CR) Marostegui: [C: +2] instances.yaml: Remove es2013 from dbctl [puppet] - https://gerrit.wikimedia.org/r/630394 (https://phabricator.wikimedia.org/T263740) (owner: Marostegui)
[05:54:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove es2013 from dbctl T263740', diff saved to https://phabricator.wikimedia.org/P12805 and previous config saved to /var/cache/conftool/dbconfig/20200928-055410-marostegui.json
[05:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:18] T263740:
decommission es2013.codfw.wmnet - https://phabricator.wikimedia.org/T263740
[05:55:41] !log Stop MySQL on es2013 before decommissioning it T263740
[05:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:34] !log Set innodb_change_buffering = inserts; on db2089 (s5), db2106 (s4), db2108 (s2), db2085 (s1), db2085 (s8), db2087 (s7), db2087 (s6), db2109 (s3) T263443
[06:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:43] T263443: Evaluate the impact of changing innodb_change_buffering to inserts - https://phabricator.wikimedia.org/T263443
[06:35:56] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:36:12] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:40:40] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:41:00] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:54:45] (PS2) Giuseppe Lavagetto: wikifeeds: use the service proxy for reaching the MediaWiki api [deployment-charts] - https://gerrit.wikimedia.org/r/628756 (https://phabricator.wikimedia.org/T255878)
[06:59:36] Operations, MediaWiki-General, Platform Engineering: Allow easier ICU transitions in MediaWiki - https://phabricator.wikimedia.org/T263437 (Joe) p:Medium→High
[06:59:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2028 as es1 master in codfw T261717', diff saved to https://phabricator.wikimedia.org/P12806 and previous config saved to
/var/cache/conftool/dbconfig/20200928-065938-marostegui.json
[06:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:47] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[07:06:43] (PS2) Giuseppe Lavagetto: services: add TLS encrypted endpoint for wikifeeds (1/2) [puppet] - https://gerrit.wikimedia.org/r/629154 (https://phabricator.wikimedia.org/T255878)
[07:09:12] !log elukey@cumin1001 START - Cookbook sre.presto.roll-restart-workers
[07:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:18] (CR) Muehlenhoff: role:mx: add script to generate otrs aliases (2 comments) [puppet] - https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T244792) (owner: Jbond)
[07:12:45] Operations, Continuous-Integration-Infrastructure, Jenkins, Release-Engineering-Team (CI & Testing services), and 2 others: Review process to fetch Jenkins Debian package from upstream - https://phabricator.wikimedia.org/T260282 (MoritzMuehlenhoff) >>! In T260282#6494946, @hashar wrote: > So that...
[07:13:27] (CR) Giuseppe Lavagetto: [C: +2] services: add TLS encrypted endpoint for wikifeeds (1/2) [puppet] - https://gerrit.wikimedia.org/r/629154 (https://phabricator.wikimedia.org/T255878) (owner: Giuseppe Lavagetto)
[07:16:04] (PS1) Marostegui: dbproxy1018: Depool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/630531
[07:16:06] (PS1) Volans: Use @abstractmethod instead of @abstractproperty [software/cumin] - https://gerrit.wikimedia.org/r/630532
[07:16:09] (PS1) Volans: tox: add mypy environment [software/cumin] - https://gerrit.wikimedia.org/r/630533
[07:16:31] (CR) Marostegui: [C: +2] dbproxy1018: Depool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/630531 (owner: Marostegui)
[07:17:22] !log elukey@cumin1001 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0)
[07:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:14] <_joe_> !log restarting pybal on the backup LVS in eqiad, codfw to pick up the new wikifeeds endpoint
[07:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:06] (CR) Volans: [C: +2] "Trivial, self-merging" [software/cumin] - https://gerrit.wikimedia.org/r/630532 (owner: Volans)
[07:20:06] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:20:41] <_joe_> uhmmm
[07:20:56] <_joe_> the session is already established btw
[07:21:21] (Merged) jenkins-bot: Use @abstractmethod instead of @abstractproperty [software/cumin] - https://gerrit.wikimedia.org/r/630532 (owner: Volans)
[07:22:03] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 68 connections established with conf1004.eqiad.wmnet:4001 (min=69) https://wikitech.wikimedia.org/wiki/PyBal
[07:24:32] !log T263970: forcing allocation of enwiki_general_1587198756 (chi@eqiad)
[07:24:37] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:39] T263970: ElasticSearch unassigned shard check apifeatureusage-2020.06.30@codfw and enwiki_general_1587198756@codfw - https://phabricator.wikimedia.org/T263970
[07:29:03] <_joe_> !log restarting pybal on the LVS primaries
[07:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:43] ACKNOWLEDGEMENT - Long running screen/tmux on mwdebug1001 is CRITICAL: CRIT: Long running SCREEN process. (user: jiji PID: 8843, 1761775s 1728000s). Effie Mouzeli that is me https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[07:32:25] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 69 connections established with conf1004.eqiad.wmnet:4001 (min=69) https://wikitech.wikimedia.org/wiki/PyBal
[07:39:57] RECOVERY - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:41:31] (PS2) Giuseppe Lavagetto: services: add TLS encrypted endpoint for wikifeeds (2/2) [puppet] - https://gerrit.wikimedia.org/r/629155 (https://phabricator.wikimedia.org/T255878)
[07:42:43] (Abandoned) Giuseppe Lavagetto: mediawiki::common: use envoy for tls termination too in nodes using it [puppet] - https://gerrit.wikimedia.org/r/574988 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto)
[07:43:06] (CR) Giuseppe Lavagetto: "> Patch Set 1: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/629155 (https://phabricator.wikimedia.org/T255878) (owner: Giuseppe Lavagetto)
[07:43:14] !log kormat@cumin1001 dbctl commit (dc=all): 'db2125 (re)pooling @ 25%: mobo replaced T260670', diff saved to https://phabricator.wikimedia.org/P12809 and previous config saved to /var/cache/conftool/dbconfig/20200928-074313-kormat.json
[07:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:21] T260670: db2125 crashed - mgmt iface also not
available - https://phabricator.wikimedia.org/T260670
[07:43:54] (CR) JMeybohm: [C: +1] services: add TLS encrypted endpoint for wikifeeds (2/2) [puppet] - https://gerrit.wikimedia.org/r/629155 (https://phabricator.wikimedia.org/T255878) (owner: Giuseppe Lavagetto)
[07:44:45] (PS1) Marostegui: db2125: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/630535 (https://phabricator.wikimedia.org/T260670)
[07:44:48] Operations, Machine Learning Platform, ORES, serviceops, Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (awight) Possibly related to {T181632}. In the past, Redis was a single point of failure and if Celery could not conne...
[07:45:22] (CR) Marostegui: [C: +2] db2125: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/630535 (https://phabricator.wikimedia.org/T260670) (owner: Marostegui)
[07:46:29] RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 308, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:47:31] (PS1) Giuseppe Lavagetto: changeprop: use https to connect to ORES, restbase [deployment-charts] - https://gerrit.wikimedia.org/r/630537 (https://phabricator.wikimedia.org/T244843)
[07:47:39] (CR) jerkins-bot: [V: -1] changeprop: use https to connect to ORES, restbase [deployment-charts] - https://gerrit.wikimedia.org/r/630537 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto)
[07:49:48] (PS2) JMeybohm: Temporarily remove conf1006 from client SRV records [dns] - https://gerrit.wikimedia.org/r/626113 (https://phabricator.wikimedia.org/T196487)
[07:52:39] (CR) JMeybohm: [C: +2] pybal: Move from conf1006 to conf1005 as config_host in esams [puppet] - https://gerrit.wikimedia.org/r/626111 (https://phabricator.wikimedia.org/T196487) (owner: JMeybohm)
[07:54:17] !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster
[07:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:45] test cluster, prep for decom --^
[07:58:18] !log kormat@cumin1001 dbctl commit (dc=all): 'db2125 (re)pooling @ 50%: mobo replaced T260670', diff saved to https://phabricator.wikimedia.org/P12810 and previous config saved to /var/cache/conftool/dbconfig/20200928-075817-kormat.json
[07:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:25] T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670
[08:02:22] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0)
[08:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:28] !log restarting pybal on lvs3007 for switching to conf1005 - T196487
[08:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:37] T196487: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487
[08:03:24] PROBLEM - PyBal connections to etcd on lvs3006 is CRITICAL: CRITICAL: 0 connections established with conf1005.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[08:03:44] this is probably me
[08:04:12] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 427, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:06:09] !log restarting pybal on lvs3006 for switching to conf1005 - T196487
[08:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:42] RECOVERY - PyBal connections to etcd on lvs3006 is OK: OK: 4 connections established with conf1005.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[08:07:02] !log restarting pybal on lvs3005 for switching to conf1005 - T196487
[08:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:56] PROBLEM - mcrouter process on mwdebug1001 is CRITICAL: PROCS CRITICAL: 0
processes with UID = 114 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter
[08:08:57] (CR) JMeybohm: [C: +2] "recheck" [dns] - https://gerrit.wikimedia.org/r/626113 (https://phabricator.wikimedia.org/T196487) (owner: JMeybohm)
[08:13:22] !log kormat@cumin1001 dbctl commit (dc=all): 'db2125 (re)pooling @ 75%: mobo replaced T260670', diff saved to https://phabricator.wikimedia.org/P12811 and previous config saved to /var/cache/conftool/dbconfig/20200928-081321-kormat.json
[08:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:29] T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670
[08:13:49] Operations: Integrate Buster 10.6 point update - https://phabricator.wikimedia.org/T263974 (MoritzMuehlenhoff)
[08:21:15] !log kormat@cumin1001 dbctl commit (dc=all): 'Remove db2113 from contributions/logpager/recentchanges*/watchlist T263842', diff saved to https://phabricator.wikimedia.org/P12812 and previous config saved to /var/cache/conftool/dbconfig/20200928-082114-kormat.json
[08:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:23] T263842: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842
[08:21:31] !log upload@eqiad: rolling varnish upgrade to 6.0.6-1wm1 T263557
[08:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:37] T263557: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557
[08:23:09] (PS1) Ema: cache: upgrade Varnish to v6 in eqiad [puppet] - https://gerrit.wikimedia.org/r/630540 (https://phabricator.wikimedia.org/T263557)
[08:23:25] (CR) Ema: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/630540 (https://phabricator.wikimedia.org/T263557) (owner: Ema)
[08:23:32] (CR) jerkins-bot: [V: -1] cache: upgrade Varnish to v6 in eqiad
[puppet] - https://gerrit.wikimedia.org/r/630540 (https://phabricator.wikimedia.org/T263557) (owner: Ema)
[08:24:06] (PS2) Ema: cache: upgrade Varnish to v6 in eqiad [puppet] - https://gerrit.wikimedia.org/r/630540 (https://phabricator.wikimedia.org/T263557)
[08:24:35] Operations, ops-eqiad, decommission-hardware, netops: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (ayounsi) I'm seeing interfaces down on asw2-c-eqiad, and I'm not able to ssh to asw-c-eqiad, so I guess some of those steps have been done? As they are now alerting I'm del...
[08:24:54] (CR) Ema: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/630540 (https://phabricator.wikimedia.org/T263557) (owner: Ema)
[08:26:27] (CR) Ema: [C: +2] cache: upgrade Varnish to v6 in eqiad [puppet] - https://gerrit.wikimedia.org/r/630540 (https://phabricator.wikimedia.org/T263557) (owner: Ema)
[08:26:36] (PS1) JMeybohm: Revert "Temporarily remove conf1006 from client SRV records" [dns] - https://gerrit.wikimedia.org/r/630407
[08:26:52] Operations, ops-eqiad, decommission-hardware, netops: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (ayounsi)
[08:26:59] (PS1) JMeybohm: Revert "pybal: Move from conf1006 to conf1005 as config_host in esams" [puppet] - https://gerrit.wikimedia.org/r/630408
[08:28:25] !log kormat@cumin1001 dbctl commit (dc=all): 'db2125 (re)pooling @ 100%: mobo replaced T260670', diff saved to https://phabricator.wikimedia.org/P12813 and previous config saved to /var/cache/conftool/dbconfig/20200928-082825-kormat.json
[08:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:32] (PS1) Elukey: Decommission Hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/630541 (https://phabricator.wikimedia.org/T227485)
[08:28:34] T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670
[08:30:37] (CR)
Elukey: [C: +2] Decommission Hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/630541 (https://phabricator.wikimedia.org/T227485) (owner: Elukey)
[08:32:00] !log text@eqiad: rolling varnish upgrade to 6.0.6-1wm1 T263557
[08:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:07] T263557: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557
[08:34:45] Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Kormat) Open→Resolved Alright, the host is fully back in service now, so resolving this again :)
[08:34:48] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[08:34:48] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[08:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:16] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[08:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:19] !log elukey@cumin1001 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97)
[08:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:35] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[08:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:34] !log decommission the hadoop test cluster (analytics1028->41)
[08:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:59] Operations, Data-Services, SRE-Access-Requests, cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (ArielGlenn) Should this task remain open until the feature mentioned by
faidon (non-root cumin) is...
[08:40:17] Operations, ops-eqiad, Analytics-Clusters, decommission-hardware, Patch-For-Review: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (elukey)
[08:40:20] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Djellel Difallah - https://phabricator.wikimedia.org/T263692 (ArielGlenn)
[08:40:54] (CR) Muehlenhoff: [C: +2] Switch CI to profile::java [puppet] - https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: Muehlenhoff)
[08:42:10] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Djellel Difallah - https://phabricator.wikimedia.org/T263692 (ArielGlenn)
[08:42:12] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[08:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:18] Operations, ops-eqiad, Analytics-Clusters, decommission-hardware, Patch-For-Review: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: `analytics[1028-1029].eqiad.wmnet...
[08:43:03] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[08:43:03] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[08:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:08] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[08:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:32] Operations, Traffic, netops, Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459 (ayounsi) Stalled→Declined Forgot about that old task!
Not needed anymore as we're not using multicast anymore.
[08:46:12] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Djellel Difallah - https://phabricator.wikimedia.org/T263692 (ArielGlenn)
[08:46:15] !log swift codfw-prod: bump object weight for ms-be2057 - T261633
[08:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:22] T261633: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633
[08:46:53] Operations, ops-eqiad, netops: eqiad row D switch fabric recabling - https://phabricator.wikimedia.org/T256112 (Marostegui)
[08:46:57] Operations, Traffic, netops, Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459 (Marostegui)
[08:50:31] Operations, ops-eqiad, DBA, netops, and 3 others: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (ayounsi) @Cmjohnson the console port is still not responding, could you please have a look before today's maintenance? As we still need to configure the switch (and m...
[08:51:04] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Djellel Difallah - https://phabricator.wikimedia.org/T263692 (ArielGlenn) @DED Please have a look at https://wikitech.wikimedia.org/wiki/Analytics/Data_access#User_responsibilities if you have not already. Adding @Nuri...
[08:53:30] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Djellel Difallah - https://phabricator.wikimedia.org/T263692 (ArielGlenn) p:Triage→Medium
[08:53:32] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[08:53:33] Operations, SRE-Access-Requests, cloud-services-team (Kanban): wikitech-static access for Sam Reed - https://phabricator.wikimedia.org/T262468 (ArielGlenn) Hey @Reedy... is ths all set? Can we resolve the task?
(If it's done, please add a on line summary of how it was handled, so we have a record for...
[08:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:40] Operations, ops-eqiad, Analytics-Clusters, decommission-hardware, Patch-For-Review: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: `analytics[1030-1031,1033-1039].e...
[08:55:02] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[08:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:10] !log T263970: recovering lost apifeature indices (copying eqiad indices -> codfw)
[08:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:16] T263970: ElasticSearch unassigned shard check apifeatureusage-2020.06.30@codfw and enwiki_general_1587198756@eqiad - https://phabricator.wikimedia.org/T263970
[08:56:33] Operations, Scap, Wikidata, Wikidata-Query-Service, and 2 others: Scap configuration for WDQS should get server groups from a known source or truth - https://phabricator.wikimedia.org/T252124 (Zbyszko) @thcipriani - your proposal sounds reasonable (we don't really care if we're deploying public s...
[08:58:22] Operations, Data-Services, SRE-Access-Requests, cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (MoritzMuehlenhoff) >>! In T261145#6497678, @ArielGlenn wrote: > Should this task remain open until...
[09:00:26] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime
[09:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:46] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[09:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:52] Operations, ops-eqiad, Analytics-Clusters, decommission-hardware, Patch-For-Review: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: `analytics[1040-1041].eqiad.wmnet...
[09:01:16] Operations, LDAP-Access-Requests: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (ArielGlenn) @MNovotny_WMF Once you have provided the expiration date and contact information, as mentioned above, we can add it to our system and resolve this task. If you are the...
[09:02:27] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:15] !log restart bird on centrallog2001 - T262372
[09:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:22] T262372: Configure BGP route damping on Anycast sessions - https://phabricator.wikimedia.org/T262372
[09:06:42] (PS1) Giuseppe Lavagetto: restbase: add restbase-async to the TLS SANs [puppet] - https://gerrit.wikimedia.org/r/630544
[09:08:08] (CR) Arturo Borrero Gonzalez: OpenStack: add initial manifests for OpenStack Barbican, a secrets API (1 comment) [puppet] - https://gerrit.wikimedia.org/r/629472 (https://phabricator.wikimedia.org/T263680) (owner: Andrew Bogott)
[09:08:53] (CR) Volans: "Nice work! Some generic comments inline."
(5 comments) [puppet] - https://gerrit.wikimedia.org/r/627841 (owner: Kormat)
[09:09:42] (PS1) Gehel: logstash: keep at least 2 copies of each shard for apifeatureusage [puppet] - https://gerrit.wikimedia.org/r/630546 (https://phabricator.wikimedia.org/T263970)
[09:11:26] Operations, Gerrit, LDAP-Access-Requests: Add hashar to `archiva-deployers` LDAP group - https://phabricator.wikimedia.org/T263721 (ArielGlenn) Me. This is done. Let me know that access is working as expected and I'll resolve this task.
[09:11:38] (CR) Giuseppe Lavagetto: [C: +2] restbase: add restbase-async to the TLS SANs [puppet] - https://gerrit.wikimedia.org/r/630544 (owner: Giuseppe Lavagetto)
[09:11:55] (CR) Gehel: "PCC looks happy: https://puppet-compiler.wmflabs.org/compiler1001/25450/logstash1007.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/630546 (https://phabricator.wikimedia.org/T263970) (owner: Gehel)
[09:12:53] (CR) Gehel: [C: +1] "\o/" [software/cumin] - https://gerrit.wikimedia.org/r/630533 (owner: Volans)
[09:13:44] (PS1) Hnowlan: changeprop: use restbase-async discovery name [deployment-charts] - https://gerrit.wikimedia.org/r/630547
[09:14:08] Operations, Gerrit, LDAP-Access-Requests: Add hashar to `archiva-deployers` LDAP group - https://phabricator.wikimedia.org/T263721 (hashar) Open→Resolved a:ArielGlenn I am now listed at https://ldap.toolforge.org/group/archiva-deployers and I have managed to upload some artifacts. Thank...
[09:15:06] (CR) Muehlenhoff: [C: +2] Remove obsolete nda_audit script [puppet] - https://gerrit.wikimedia.org/r/630024 (https://phabricator.wikimedia.org/T247364) (owner: Muehlenhoff)
[09:15:51] !log restart db1077 for upgrade and cleanup T187984
[09:15:52] (PS1) Hashar: Add rename-project plugin stable-3.2-0-g7f89635 [software/gerrit] (deploy/wmf/stable-3.2) - https://gerrit.wikimedia.org/r/630548 (https://phabricator.wikimedia.org/T201953)
[09:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:04] T187984: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984
[09:17:09] !log restart bird on dns2001 - T262372
[09:17:11] (CR) Jbond: [C: +1] "LGTM optional nit" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/630179 (https://phabricator.wikimedia.org/T158562) (owner: Muehlenhoff)
[09:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:15] T262372: Configure BGP route damping on Anycast sessions - https://phabricator.wikimedia.org/T262372
[09:17:50] <_joe_> !log changing the restbase public TLS certs to include restbase-async.discovery.wmnet
[09:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:04] Operations, ops-eqiad, Analytics-Clusters, decommission-hardware: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (elukey) Stalled→Open a:elukey→Cmjohnson
[09:20:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:20:57] Puppet, Patch-For-Review: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (Volans) >>!
In T263578#6492417, @jbond wrote: > however i wonder if its worth mounting a tmpfs dir here? the risk is that we may loose a submission but as it likely receives a lot of IO is w... [09:21:25] (03CR) 10Jbond: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/630099 (owner: 10Elukey) [09:22:09] (03CR) 10Jbond: profile::hadoop::common: get the datanode mountpoints from facter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629647 (owner: 10Elukey) [09:23:56] (03Abandoned) 10Giuseppe Lavagetto: changeprop: use https to connect to ORES, restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/630537 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [09:24:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:25:23] (03PS1) 10Ema: cache: upgrade Varnish to v6 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/630550 (https://phabricator.wikimedia.org/T263557) [09:25:46] (03CR) 10Muehlenhoff: Have the puppetised sources.list depend on the wikimedia repository (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630179 (https://phabricator.wikimedia.org/T158562) (owner: 10Muehlenhoff) [09:26:05] (03PS2) 10Muehlenhoff: Have the puppetised sources.list depend on the wikimedia repository [puppet] - 10https://gerrit.wikimedia.org/r/630179 (https://phabricator.wikimedia.org/T158562) [09:26:25] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/630550 (https://phabricator.wikimedia.org/T263557) (owner: 10Ema) [09:26:44] (03CR) 10Jbond: [C: 04-1] "See inline" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/629443 (owner: 10Dzahn) [09:26:48] (03CR) 10DCausse: [C: 03+1] logstash: keep at least 2 copies of each shard for apifeatureusage [puppet] - 10https://gerrit.wikimedia.org/r/630546 
(https://phabricator.wikimedia.org/T263970) (owner: 10Gehel) [09:29:10] (03CR) 10Ema: [C: 03+2] cache: upgrade Varnish to v6 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/630550 (https://phabricator.wikimedia.org/T263557) (owner: 10Ema) [09:29:37] (03CR) 10Jbond: [C: 04-1] "See inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/629439 (owner: 10Dzahn) [09:29:52] !log text@codfw: rolling varnish upgrade to 6.0.6-1wm1 T263557 [09:29:56] 10Operations, 10LDAP-Access-Requests: Access to Superset - https://phabricator.wikimedia.org/T263868 (10ArielGlenn) I have verified that the email address in wikitech was authenticated and that it is jrabah. This will require adding you to the wmf LDAP group. Pinging @JVargas for signoff. [09:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:58] T263557: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557 [09:30:30] (03CR) 10Giuseppe Lavagetto: [C: 03+2] services: add TLS encrypted endpoint for wikifeeds (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/629155 (https://phabricator.wikimedia.org/T255878) (owner: 10Giuseppe Lavagetto) [09:30:52] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/629424 (owner: 10Dzahn) [09:32:04] (03PS1) 10Hashar: Upgrade javamelody from 1.83.0 to 1.85.0 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/630551 (https://phabricator.wikimedia.org/T232678) [09:32:20] 10Operations, 10LDAP-Access-Requests: Access to Superset - https://phabricator.wikimedia.org/T263868 (10ArielGlenn) p:05Triage→03Medium [09:32:57] 10Operations, 10Parsing-Team, 10TechCom, 10serviceops, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10daniel) With {T263583} coming up, perhaps we should use a special ParserCache instance for old revisions,... 
[09:33:28] 10Operations, 10netops: Configure BGP route damping on Anycast sessions - https://phabricator.wikimedia.org/T262372 (10ayounsi) 05Resolved→03Open When bird restarts on the centrallog servers it causes bird to bounce a few times: ` Sep 28 09:06:18 centrallog2001 bird: Shutting down Sep 28 09:06:18 centrallo... [09:33:41] 10Operations, 10serviceops, 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): Separate mediawiki latency metrics by endpoint - https://phabricator.wikimedia.org/T263727 (10ArielGlenn) p:05Triage→03Medium [09:33:43] (03CR) 10Hashar: "Not much to mention based on the changelog at https://github.com/javamelody/javamelody/wiki/ReleaseNotes . I felt we should just closely " [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/630551 (https://phabricator.wikimedia.org/T232678) (owner: 10Hashar) [09:34:33] 10Operations, 10netops: Configure BGP route damping on Anycast sessions - https://phabricator.wikimedia.org/T262372 (10ayounsi) p:05Medium→03High [09:35:12] 10Operations, 10Traffic, 10serviceops: puppetmaster[12]001: add TLS termination - https://phabricator.wikimedia.org/T263831 (10ArielGlenn) p:05Triage→03Medium [09:35:47] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic: contint.wikimedia.org: add TLS termination - https://phabricator.wikimedia.org/T263830 (10ArielGlenn) p:05Triage→03Medium [09:37:00] 10Operations: Integrate Buster 10.6 point update - https://phabricator.wikimedia.org/T263974 (10ArielGlenn) p:05Triage→03Medium [09:37:47] 10Operations, 10Editing-team, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Samwalton9) Looks like this happened again yesterday with the Signpost (https://en.wikipedia.org/wiki/Use... 
[09:39:11] 10Operations, 10Packaging: Update php-xdebug to 2.7.2 in apt.wikimedia.org - https://phabricator.wikimedia.org/T263933 (10ArielGlenn) p:05Triage→03High [09:39:22] PROBLEM - SSH on ms-be2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:39:59] 10Operations, 10Analytics-Radar, 10Domains, 10Traffic, and 2 others: Blocking all third-party storage access requests - https://phabricator.wikimedia.org/T262996 (10ArielGlenn) p:05Triage→03Medium [09:42:12] RECOVERY - SSH on ms-be2017 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:44:49] (03CR) 10Hnowlan: [C: 03+2] changeprop: use restbase-async discovery name [deployment-charts] - 10https://gerrit.wikimedia.org/r/630547 (owner: 10Hnowlan) [09:46:23] 10Operations, 10serviceops, 10User-jijiki: Test onhost memcached performance and functionality - https://phabricator.wikimedia.org/T263958 (10ArielGlenn) p:05Triage→03Medium [09:47:18] (03Merged) 10jenkins-bot: changeprop: use restbase-async discovery name [deployment-charts] - 10https://gerrit.wikimedia.org/r/630547 (owner: 10Hnowlan) [09:47:54] 10Operations, 10Traffic: backport ipvsadm>=1.30 to buster-wikimedia or buster-backports - https://phabricator.wikimedia.org/T263788 (10ArielGlenn) p:05Triage→03Medium [09:48:04] 10Operations, 10Traffic: Switch to Maglev hashing ('mh') on LVS hosts - https://phabricator.wikimedia.org/T263797 (10ArielGlenn) p:05Triage→03Medium [09:48:30] !log upload@codfw: rolling varnish upgrade to 6.0.6-1wm1 T263557 [09:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:38] T263557: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557 [09:48:49] 10Operations, 10MediaWiki-REST-API, 10Traffic: Route requests to the REST MediaWiki API to the api cluster - https://phabricator.wikimedia.org/T263729 (10ArielGlenn) p:05Triage→03Medium [09:48:54] 
!log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'production' . [09:48:54] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [09:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:04] 10Operations, 10Machine Learning Platform, 10ORES, 10serviceops, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10ArielGlenn) p:05Triage→03High [09:53:10] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: aggregation rules for ats-tls client TTFB [puppet] - 10https://gerrit.wikimedia.org/r/629430 (https://phabricator.wikimedia.org/T263536) (owner: 10Filippo Giunchedi) [09:54:10] 10Operations, 10observability, 10serviceops, 10Platform Team Initiatives (API Gateway): mtail 3.0.0-rc35 doesn't support the histogram type in -oneshot mode. 
- https://phabricator.wikimedia.org/T263728 (10ArielGlenn) p:05Triage→03High [09:54:12] 10Operations, 10serviceops, 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): Separate mediawiki latency metrics by endpoint - https://phabricator.wikimedia.org/T263727 (10ArielGlenn) p:05Medium→03High [09:54:19] 10Operations, 10Puppet, 10observability: Notification spam from "last puppet run" upon re-enabling puppet - https://phabricator.wikimedia.org/T263720 (10ArielGlenn) p:05Triage→03Medium [09:54:49] 10Operations, 10Analytics, 10Traffic, 10netops: Turnilo: per-second rates for wmf_netflow bytes + packets - https://phabricator.wikimedia.org/T263290 (10ArielGlenn) p:05Triage→03Medium [09:55:45] 10Operations, 10Traffic, 10netops: experiment with reënabling compression between applayer's TLS terminators and edge caches - https://phabricator.wikimedia.org/T263288 (10ArielGlenn) p:05Triage→03Medium [09:55:56] (03PS3) 10Muehlenhoff: Have the puppetised sources.list depend on the wikimedia repository [puppet] - 10https://gerrit.wikimedia.org/r/630179 (https://phabricator.wikimedia.org/T158562) [09:59:05] 10Operations, 10Traffic, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Aggregated metrics for ats-tls <-> clients ttfb percentiles - https://phabricator.wikimedia.org/T263536 (10fgiunchedi) [09:59:32] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] am: use status.cgi JSON as source for problems [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/628090 (owner: 10Filippo Giunchedi) [10:00:36] 10Operations, 10Data-Services, 10SRE-Access-Requests, 10cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (10ArielGlenn) @Bstorm, please let us know that all is working as you expect and I'll close this. Tha... 
[10:04:38] 10Operations, 10OTRS, 10serviceops, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10jcrespo) db1077 is back into test-s4 role, although without any data. [10:05:15] (03PS1) 10Filippo Giunchedi: am: tweak alert labels/annotations [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/630554 [10:05:27] 10Operations: Provide failover capacity for package installations from main mirror - https://phabricator.wikimedia.org/T262647 (10ArielGlenn) p:05Triage→03Medium [10:05:57] 10Operations: Provide failover capacity for package installations from main mirror - https://phabricator.wikimedia.org/T262647 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [10:08:59] (03PS1) 10Hnowlan: changeprop: open access to ores on 443 [deployment-charts] - 10https://gerrit.wikimedia.org/r/630555 [10:09:39] 10Operations, 10Analytics-Radar, 10Domains, 10Traffic, 10Wikimedia-General-or-Unknown: WMF third-party cookies rejected - https://phabricator.wikimedia.org/T262882 (10ArielGlenn) p:05Triage→03Medium [10:10:17] (03CR) 10Volans: [C: 03+2] tox: add mypy environment [software/cumin] - 10https://gerrit.wikimedia.org/r/630533 (owner: 10Volans) [10:10:54] (03PS1) 10Filippo Giunchedi: alertmanager: group alerts and add severity: page [puppet] - 10https://gerrit.wikimedia.org/r/630556 (https://phabricator.wikimedia.org/T258948) [10:12:28] (03Merged) 10jenkins-bot: tox: add mypy environment [software/cumin] - 10https://gerrit.wikimedia.org/r/630533 (owner: 10Volans) [10:12:54] (03PS2) 10Volans: swift: remove old unused service records [dns] - 10https://gerrit.wikimedia.org/r/628086 (https://phabricator.wikimedia.org/T244153) [10:14:23] (03CR) 10Hnowlan: [C: 03+2] changeprop: open access to ores on 443 [deployment-charts] - 10https://gerrit.wikimedia.org/r/630555 (owner: 10Hnowlan) [10:16:30] (03Merged) 10jenkins-bot: changeprop: open access to ores on 443 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/630555 (owner: 10Hnowlan) [10:18:00] (03PS1) 10Effie Mouzeli: mwdebug1001: remove opcache tuning [puppet] - 10https://gerrit.wikimedia.org/r/630558 [10:19:25] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Move wikifeeds to use TLS only - https://phabricator.wikimedia.org/T255878 (10Joe) [10:19:39] (03PS6) 10Volans: Extracting obvious reporting code to a Reporter class. [software/cumin] - 10https://gerrit.wikimedia.org/r/626660 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [10:19:52] 10Operations, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Joe) [10:21:46] (03CR) 10jerkins-bot: [V: 04-1] Extracting obvious reporting code to a Reporter class. [software/cumin] - 10https://gerrit.wikimedia.org/r/626660 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [10:21:57] 10Operations, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Joe) [10:23:20] 10Operations, 10Analytics-Radar, 10Patch-For-Review: Move Hue to a Buster VM - https://phabricator.wikimedia.org/T258768 (10elukey) Currently two outstanding UI issues: https://github.com/cloudera/hue/issues/1273 https://github.com/cloudera/hue/issues/1272 In theory those are not blocking the migration of... [10:23:46] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [10:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:02] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[10:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:16] (03PS1) 10JMeybohm: Enable envoy telemetry for zortero [deployment-charts] - 10https://gerrit.wikimedia.org/r/630560 [10:25:51] PROBLEM - mailman_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring [10:25:58] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [10:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:27] RECOVERY - mailman_queue_size on lists1001 is OK: OK: mailman queues are below the limits. https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring [10:29:54] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [10:29:54] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'production' . [10:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:04] jan_drewniak: (Dis)respected human, time to deploy Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200928T1030). Please do the needful. [10:31:51] (03CR) 10Alexandros Kosiaris: [C: 03+1] Enable envoy telemetry for zortero [deployment-charts] - 10https://gerrit.wikimedia.org/r/630560 (owner: 10JMeybohm) [10:32:15] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [10:32:15] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'production' . 
[10:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:52] (03CR) 10JMeybohm: [C: 03+2] Enable envoy telemetry for zortero [deployment-charts] - 10https://gerrit.wikimedia.org/r/630560 (owner: 10JMeybohm) [10:32:58] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [10:32:58] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'production' . [10:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:52] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630561 (https://phabricator.wikimedia.org/T128546) [10:35:37] (03Merged) 10jenkins-bot: Enable envoy telemetry for zortero [deployment-charts] - 10https://gerrit.wikimedia.org/r/630560 (owner: 10JMeybohm) [10:35:44] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [10:35:44] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'production' . [10:35:45] (03PS1) 10Giuseppe Lavagetto: service::configuration: connect to restbase via TLS [puppet] - 10https://gerrit.wikimedia.org/r/630562 (https://phabricator.wikimedia.org/T244843) [10:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:22] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' . 
[10:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:37] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630561 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:39:16] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630561 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:44:47] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:630561| Bumping portals to master (T128546)]] (duration: 00m 58s) [10:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:55] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:45:45] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:630561| Bumping portals to master (T128546)]] (duration: 00m 57s) [10:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:40] 10Operations, 10Machine Learning Platform, 10ORES, 10serviceops, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Ladsgroup) >>! In T263910#6497445, @awight wrote: > Possibly related to {T181632}. In the past, Redis was a single po... [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200928T1100). [11:00:04] kart_: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:18] I can deploy toda [11:00:20] y [11:00:36] kart_: ready for second try? 
:-) [11:02:24] (03CR) 10Urbanecm: "> Patch Set 4: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629738 (https://phabricator.wikimedia.org/T262218) (owner: 10HitomiAkane) [11:02:39] 10Operations, 10DBA, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10Marostegui) I have checked previous' week backups (22nd Sept) to see if there was anything existing for any of the involved data, at least on the PK (... [11:03:05] Urbanecm: sure [11:03:14] (03PS4) 10Urbanecm: ContentTranslation: Do not use wikishared DB for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629371 (https://phabricator.wikimedia.org/T263417) (owner: 10KartikMistry) [11:03:40] (03CR) 10Urbanecm: [C: 03+2] ContentTranslation: Do not use wikishared DB for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629371 (https://phabricator.wikimedia.org/T263417) (owner: 10KartikMistry) [11:03:49] let's see then :) [11:04:22] (03Merged) 10jenkins-bot: ContentTranslation: Do not use wikishared DB for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629371 (https://phabricator.wikimedia.org/T263417) (owner: 10KartikMistry) [11:06:04] kart_: pulled onto mwdebug2001, and...noticed an error [11:06:18] in IS.php, you add wmgContentTranslationDatabase, not wgContentTranslationDatabase [11:06:54] pushing a fix [11:07:33] (03PS1) 10Urbanecm: Follow-up for 483beb2: wmgContentTranslationDatabase => wgContentTranslationDatabase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630565 (https://phabricator.wikimedia.org/T263417) [11:07:42] (03CR) 10Urbanecm: [C: 03+2] Follow-up for 483beb2: wmgContentTranslationDatabase => wgContentTranslationDatabase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630565 (https://phabricator.wikimedia.org/T263417) (owner: 10Urbanecm) [11:07:50] Urbanecm: ahaaa. 
[11:08:30] (03Merged) 10jenkins-bot: Follow-up for 483beb2: wmgContentTranslationDatabase => wgContentTranslationDatabase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630565 (https://phabricator.wikimedia.org/T263417) (owner: 10Urbanecm) [11:09:19] kart_: pulled onto mwdebug2001, ready for your testing [11:09:43] variables should be correct AFAICS kart_ https://usercontent.irccloud-cdn.com/file/BRXTBOri/image.png [11:11:10] hmm. Looks good. [11:11:17] Urbanecm: wait a sec. [11:11:22] yes? [11:11:34] Urbanecm: need recheck. [11:11:34] (03CR) 10Volans: "Some first comment inline, I'll do some practical testing later on" (034 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [11:11:42] kart_: sure, take your time :) [11:13:01] Urbanecm: yeah. Don't want to break CX anywhere else than testwiki ;) [11:13:30] good point, we should test somewhere else too :) [11:16:13] Urbanecm: testwiki and wikishare looks separate. Last test on 'other WPs' on. Few more minutes.. [11:17:35] Urbanecm: OK. Looks good. CX on other Wiki is saving content without issue. I published article on testwiki also. [11:18:04] Urbanecm: Please go ahead. [11:18:27] thanks, syncing [11:20:20] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 483beb2452caead8c44dfb8e608812778033fba0: ContentTranslation: Do not use wikishared DB for testwiki (T263417; follow-up af09303a4a155681b198ac70468494c2155868df also included in this sync) (duration: 00m 57s) [11:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:28] T263417: Exclude testwikis and private wikis from old unpublished CX draft purge script run - https://phabricator.wikimedia.org/T263417 [11:20:30] kart_: should be live :) [11:20:43] Urbanecm: great! 
[11:26:25] (03PS1) 10Ema: cache: upgrade Varnish to v6 in esams [puppet] - 10https://gerrit.wikimedia.org/r/630566 (https://phabricator.wikimedia.org/T263557) [11:27:06] (03PS1) 10Hnowlan: api-gateway: use TLS for restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/630567 [11:27:33] (03PS2) 10Ema: cache: upgrade Varnish to v6 in esams [puppet] - 10https://gerrit.wikimedia.org/r/630566 (https://phabricator.wikimedia.org/T263557) [11:28:02] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/630566 (https://phabricator.wikimedia.org/T263557) (owner: 10Ema) [11:28:43] (03PS1) 10Jbond: swift: convert roles to profiles, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956) [11:29:22] (03CR) 10jerkins-bot: [V: 04-1] api-gateway: use TLS for restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/630567 (owner: 10Hnowlan) [11:29:24] (03PS5) 10Urbanecm: Creation of patroller group on arz.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629738 (https://phabricator.wikimedia.org/T262218) (owner: 10HitomiAkane) [11:29:26] (03CR) 10Urbanecm: [C: 03+2] Creation of patroller group on arz.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629738 (https://phabricator.wikimedia.org/T262218) (owner: 10HitomiAkane) [11:30:13] (03Merged) 10jenkins-bot: Creation of patroller group on arz.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629738 (https://phabricator.wikimedia.org/T262218) (owner: 10HitomiAkane) [11:34:28] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 61eac95ef62aef682039761e0f02188437cb15fb: Creation of patroller group on arz.wikipedia (T262218) (duration: 00m 57s) [11:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:36] T262218: Creation of patroller group on arz.wikipedia - https://phabricator.wikimedia.org/T262218 [11:34:49] !log EU B&C window done 
[11:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:18] (03PS2) 10Jbond: swift: convert roles to profiles, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956) [11:37:37] (03PS2) 10Hnowlan: api-gateway: use TLS for restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/630567 [11:41:26] (03PS7) 10Muehlenhoff: reboot-groups [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 [11:41:40] Urbanecm: around? [11:41:44] yes [11:41:48] what's up kart_ ? [11:41:59] Urbanecm: did you sync all files? Seems same DB issue is appearing :/ [11:42:14] Urbanecm: it worked earlier or am I missing something.. [11:42:14] damn it, I forgot to sync CS.php [11:42:16] mea culpa [11:42:41] fixing [11:42:51] ah. NP. It is testwiki :D [11:42:58] (03PS8) 10Muehlenhoff: reboot-groups [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 [11:43:09] hehe [11:43:17] (03CR) 10Muehlenhoff: reboot-groups (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 (owner: 10Muehlenhoff) [11:43:36] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: 483beb2452caead8c44dfb8e608812778033fba0: ContentTranslation: Do not use wikishared DB for testwiki (T263417; follow-up af09303a4a155681b198ac70468494c2155868df also included in this sync) (duration: 00m 56s) [11:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:42] kart_: can you check now, please? [11:43:43] T263417: Exclude testwikis and private wikis from CX draft purge script and separate CX database on testwiki - https://phabricator.wikimedia.org/T263417 [11:44:41] Urbanecm: sure [11:45:13] Urbanecm: yep. Works well now! [11:45:24] Glad to hear that kart_ :) [11:45:50] Urbanecm: also my apologies. That debug extension little icon need different color :) [11:54:10] (03PS8) 10Kormat: bsection: Script for binary-searching log files. 
[puppet] - 10https://gerrit.wikimedia.org/r/627841 [11:54:31] (03CR) 10Kormat: bsection: Script for binary-searching log files. (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/627841 (owner: 10Kormat) [11:57:14] jouncebot: next [11:57:15] In 0 hour(s) and 2 minute(s): Create a new wiki (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200928T1200) [11:57:53] (03CR) 10ArielGlenn: [C: 03+1] "This looks ok to me. I do hate the code for reversing present and absent, but making that nicer by way of better variable names or somethi" [puppet] - 10https://gerrit.wikimedia.org/r/462019 (https://phabricator.wikimedia.org/T125976) (owner: 10Thcipriani) [11:59:05] !log kormat@cumin1001 dbctl commit (dc=all): 'db1114 depooling: prep for rack switch upgrade T196487', diff saved to https://phabricator.wikimedia.org/P12815 and previous config saved to /var/cache/conftool/dbconfig/20200928-115904-kormat.json [11:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:12] T196487: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 [11:59:56] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [11:59:57] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] Urbanecm and Amir1: I, the Bot under the Fountain, allow thee, The Deployer, to do Create a new wiki deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200928T1200). [12:00:14] \o/ [12:02:53] (03CR) 10ArielGlenn: "If this has been cherry-picked for so long on beta, maybe it can just be merged?" 
[puppet] - 10https://gerrit.wikimedia.org/r/462020 (https://phabricator.wikimedia.org/T125976) (owner: 10Thcipriani) [12:04:56] 10Operations, 10Traffic, 10conftool, 10serviceops: confd's watch functionality appears to be partially broken when interacting with etcd 3.x - https://phabricator.wikimedia.org/T260889 (10ArielGlenn) [12:06:49] (03PS4) 10Urbanecm: Initial configuration for arbcom_ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627278 (https://phabricator.wikimedia.org/T262812) (owner: 10Majavah) [12:06:59] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for arbcom_ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627278 (https://phabricator.wikimedia.org/T262812) (owner: 10Majavah) [12:09:20] (03PS5) 10Urbanecm: Initial configuration for arbcom_ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627278 (https://phabricator.wikimedia.org/T262812) (owner: 10Majavah) [12:10:06] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for arbcom_ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627278 (https://phabricator.wikimedia.org/T262812) (owner: 10Majavah) [12:10:31] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10serviceops: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10ArielGlenn) [12:10:50] (03Merged) 10jenkins-bot: Initial configuration for arbcom_ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627278 (https://phabricator.wikimedia.org/T262812) (owner: 10Majavah) [12:11:05] (03PS1) 10Urbanecm: Revert "Initial configuration for arbcom_ruwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630414 [12:11:10] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Revert "Initial configuration for arbcom_ruwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630414 (owner: 10Urbanecm) [12:13:04] (03PS1) 10Urbanecm: Initial configuration for arbcom_ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630573 
(https://phabricator.wikimedia.org/T262812) [12:13:36] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for arbcom_ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630573 (https://phabricator.wikimedia.org/T262812) (owner: 10Urbanecm) [12:14:19] (03Merged) 10jenkins-bot: Initial configuration for arbcom_ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630573 (https://phabricator.wikimedia.org/T262812) (owner: 10Urbanecm) [12:16:03] 10Operations, 10Traffic: cache_upload varnish-fe exhausting transient memory - https://phabricator.wikimedia.org/T249809 (10ArielGlenn) [12:17:56] !log urbanecm@deploy1001 Synchronized wmf-config/db-eqiad.php: Creating arbcom_ruwiki (T262812) (duration: 00m 56s) [12:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:04] T262812: Create private arbcom-ru wiki - https://phabricator.wikimedia.org/T262812 [12:19:06] !log urbanecm@deploy1001 Synchronized wmf-config/db-codfw.php: Creating arbcom_ruwiki (T262812) (duration: 00m 57s) [12:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:05] !log urbanecm@deploy1001 Synchronized dblists: Creating arbcom_ruwiki (T262812) (duration: 00m 57s) [12:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:45] !log urbanecm@deploy1001 rebuilt and synchronized wikiversions files: Creating arbcom_ruwiki (T262812) [12:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:55] 10Operations, 10SRE-Access-Requests: Add new SSH key for Sam Smith - https://phabricator.wikimedia.org/T263992 (10phuedx) [12:22:47] (03PS1) 10Jbond: tools: puppetdb reduce postgres memory usage [puppet] - 10https://gerrit.wikimedia.org/r/630574 [12:22:58] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: Creating arbcom_ruwiki (T262812) (duration: 00m 56s) [12:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[12:23:05] T262812: Create private arbcom-ru wiki - https://phabricator.wikimedia.org/T262812 [12:24:17] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/25453/" [puppet] - 10https://gerrit.wikimedia.org/r/630574 (owner: 10Jbond) [12:24:18] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Creating arbcom_ruwiki (T262812) (duration: 00m 56s) [12:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:32] (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630576 [12:24:34] (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630576 (owner: 10Urbanecm) [12:25:14] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630576 (owner: 10Urbanecm) [12:26:03] (03CR) 10Muehlenhoff: "Or maybe for all of cloud VPS, given that this also affects deployment-puppetdb03?" [puppet] - 10https://gerrit.wikimedia.org/r/630574 (owner: 10Jbond) [12:26:13] !log urbanecm@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 01m 48s) [12:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:31] !log arbcom_ruwiki is created (T262812) [12:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:05] 10Operations, 10netops: Configure BGP route damping on Anycast sessions - https://phabricator.wikimedia.org/T262372 (10ayounsi) The issue is that `ss -lun | fgrep -q :10514` often take more than 2s to complete and we don't let it retry. As it happen regularly, it sometimes happen right after the bird restart,... 
[12:28:27] (03PS3) 10Jbond: swift: convert roles to profiles, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956) [12:28:40] !log [urbanecm@mwmaint2001 ~]$ mwscript createAndPromote.php --wiki=arbcom_ruwiki --bureaucrat --sysop 'Adamant.pwn' # T262812 [12:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:48] T262812: Create private arbcom-ru wiki - https://phabricator.wikimedia.org/T262812 [12:28:51] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: bump Swift object replicator concurrency [puppet] - 10https://gerrit.wikimedia.org/r/629082 (https://phabricator.wikimedia.org/T261633) (owner: 10Filippo Giunchedi) [12:29:24] !log [urbanecm@mwmaint2001 ~]$ mwscript resetUserEmail.php --wiki=arbcom_ruwiki 'Adamant.pwn' 'adamant.pwn@hotmail.com' # T262812 [12:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:01] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' . [12:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:57] 10Operations, 10Traffic, 10Wikimedia-Incident: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 (10ArielGlenn) [12:35:39] 10Operations, 10Traffic, 10serviceops, 10Performance-Team (Radar), 10Sustainability: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 (10ArielGlenn) [12:37:22] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'zotero' for release 'production' . 
[12:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:42:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:44:10] 10Operations, 10Citoid, 10Wikimedia-Logstash, 10observability, and 2 others: Move citoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219919 (10Mvolz) >>! In T219919#6478632, @fgiunchedi wrote: > It looks like citoid is now on k8s but still using gelf for logging, possibly the easie... [12:44:20] (03PS4) 10Jbond: swift: convert roles to profiles, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956) [12:44:50] 10Operations, 10DBA, 10Sustainability (Incident Followup), 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10akosiaris) A preliminary incident report is at https://wikitech.wikimedia.org/wiki/Incident_documentation/2020... 
[12:47:04] (03CR) 10Muehlenhoff: [C: 03+2] Have the puppetised sources.list depend on the wikimedia repository [puppet] - 10https://gerrit.wikimedia.org/r/630179 (https://phabricator.wikimedia.org/T158562) (owner: 10Muehlenhoff) [12:49:47] (03PS1) 10Elukey: profile::analytics::cluster::packages::statistics: remove stretch bits [puppet] - 10https://gerrit.wikimedia.org/r/630578 (https://phabricator.wikimedia.org/T255028) [12:51:53] (03PS5) 10Jbond: swift: convert roles to profiles, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956) [12:52:13] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/25457/stat1007.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/630578 (https://phabricator.wikimedia.org/T255028) (owner: 10Elukey) [12:54:42] 10Operations, 10OTRS, 10vm-requests: Decommission mendelevium - https://phabricator.wikimedia.org/T263993 (10akosiaris) [12:54:50] (03CR) 10Muehlenhoff: "Looks good to me, it's worth pointing out that despite being recently reimaged to Buster per debmonitor stat1004-1007 _do_ have python3-go" [puppet] - 10https://gerrit.wikimedia.org/r/630578 (https://phabricator.wikimedia.org/T255028) (owner: 10Elukey) [12:55:58] 10Operations, 10OTRS: Migrate mendelevium/OTRS host to Buster - https://phabricator.wikimedia.org/T224590 (10akosiaris) 05Open→03Resolved a:03akosiaris mendelevium is powered off and decomissioning is tracked at T263993. I 'll resolve this task. 
[12:56:01] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10akosiaris) [12:57:44] !log akosiaris@cumin1001 START - Cookbook sre.hosts.decommission [12:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:40] (03PS1) 10Alexandros Kosiaris: Remove mendelevium [puppet] - 10https://gerrit.wikimedia.org/r/630581 (https://phabricator.wikimedia.org/T263993) [13:00:16] (03PS6) 10Jbond: swift: convert roles to profiles, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956) [13:03:31] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [13:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:39] 10Operations, 10OTRS, 10vm-requests, 10Patch-For-Review: Decommission mendelevium - https://phabricator.wikimedia.org/T263993 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by akosiaris@cumin1001 for hosts: `mendelevium.eqiad.wmnet` - mendelevium.eqiad.wmnet (**WARN**) - **Failed downti... [13:04:30] akosiaris: what failed in the dns part of the decom cookbook? [13:04:49] Failed to run the sre.dns.netbox cookbook [13:04:55] Generating the DNS records from Netbox data. It will take a couple of minutes. [13:05:03] want the stacktrace? [13:05:08] yeah [13:05:42] (03CR) 10Volans: [C: 03+1] "LGTMm thanks for the fixes" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/627841 (owner: 10Kormat) [13:06:02] volans: https://phabricator.wikimedia.org/P12817 [13:06:08] thanks! 
[13:06:53] (03PS7) 10Jbond: swift: convert roles to profiles, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956) [13:07:37] (03CR) 10Jbond: "PCC full diff shows no real changes" [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [13:10:09] (03CR) 10Jbond: "This has got rather big and possibly needs breaking into smaller chunks. The change should be a no-op it mainly moves parameters that are" [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [13:12:07] 10Operations, 10ops-eqiad, 10DBA, 10netops, and 2 others: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (10Cmjohnson) @ayounsi I am not able to get the console to work on the new switch, it's plugged in, I verfied it worked by connecting to the current asw in d4 and get th... [13:12:08] (03CR) 10Jbond: "Thanks for going tot he effort of trying to untangle the swift automatic parameter lookups however there are a quite a few other places th" [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [13:12:41] akosiaris: interesting, can't repro it... [13:12:48] I'll run the sre.dns.netbox cookbook for you [13:12:57] (03CR) 10Jbond: "Thanks for going tot he effort of trying to untangle the swift automatic parameter lookups however there are a quite a few other places th" [puppet] - 10https://gerrit.wikimedia.org/r/628970 (owner: 10Dzahn) [13:13:25] (03CR) 10Jbond: "The comment above should have been made on the original change:" [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [13:14:33] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/627967 (owner: 10Dzahn) [13:15:28] volans: thanks. Any idea what caused it? 
that exit_code=2 didn't help much [13:15:40] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/627970 (owner: 10Dzahn) [13:16:23] (03CR) 10Alexandros Kosiaris: service::configuration: connect to restbase via TLS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630562 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [13:16:26] unfortunately not as the log of the remote script was to stdout via ssh that gets discarded (temporarily?) [13:17:07] as we can now re-enable it but might be noisy with some/many cookbooks? but very soon we will be able to decide on a per command run basis if we want it or not from the cookbooks [13:18:28] (03PS5) 10Jbond: base/monitoring: move monitor_screens to proper profile parameter [puppet] - 10https://gerrit.wikimedia.org/r/628152 (owner: 10Dzahn) [13:18:53] (03CR) 10Jbond: "> Patch Set 4:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/628152 (owner: 10Dzahn) [13:19:15] !log reimaging sretest1001 to validate puppetised sources.list with a new installation T158562 [13:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:23] T158562: Manage apt sources via puppet - https://phabricator.wikimedia.org/T158562 [13:20:01] !log volans@cumin1001 START - Cookbook sre.dns.netbox [13:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:23] (03CR) 10Jbond: [C: 03+2] "LGTM merging" [puppet] - 10https://gerrit.wikimedia.org/r/628152 (owner: 10Dzahn) [13:25:25] akosiaris: and to be clear, for now you still need the manual dns patch anyway [13:25:46] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:53] yeah, I was about to. Thanks! 
[13:26:55] (03CR) 10Ema: [C: 03+1] geoip VCL: add a 'which' param to get_geo_xcip (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630315 (https://phabricator.wikimedia.org/T263496) (owner: 10CDanis) [13:27:01] (03CR) 10Ema: [C: 03+1] geoip VCL: init/free functions are now reusable [puppet] - 10https://gerrit.wikimedia.org/r/630314 (https://phabricator.wikimedia.org/T263496) (owner: 10CDanis) [13:29:37] (03CR) 10Andrew Bogott: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/627967 (owner: 10Dzahn) [13:32:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove mendelevium [puppet] - 10https://gerrit.wikimedia.org/r/630581 (https://phabricator.wikimedia.org/T263993) (owner: 10Alexandros Kosiaris) [13:32:55] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:11] (03PS1) 10Alexandros Kosiaris: Remove mendelevium [dns] - 10https://gerrit.wikimedia.org/r/630588 (https://phabricator.wikimedia.org/T263993) [13:34:45] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove mendelevium [dns] - 10https://gerrit.wikimedia.org/r/630588 (https://phabricator.wikimedia.org/T263993) (owner: 10Alexandros Kosiaris) [13:35:00] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:25] (03PS1) 10Jbond: labstore::nfs_mount: drop support for empty string share_path [puppet] - 10https://gerrit.wikimedia.org/r/630589 [13:38:10] !log roll restart object-replicator on ms-be2* for higher concurrency - T261633 [13:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:17] T261633: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 [13:40:10] (03PS1) 10Volans: homer: fix live config check [puppet] - 10https://gerrit.wikimedia.org/r/630590 [13:41:11] (03CR) 10jerkins-bot: [V: 04-1] homer: fix live config check 
[puppet] - 10https://gerrit.wikimedia.org/r/630590 (owner: 10Volans) [13:41:31] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [13:42:47] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [13:42:52] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [13:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:54] (03CR) 10Kormat: [C: 03+2] bsection: Script for binary-searching log files. [puppet] - 10https://gerrit.wikimedia.org/r/627841 (owner: 10Kormat) [13:45:03] (03PS2) 10Volans: homer: fix live config check [puppet] - 10https://gerrit.wikimedia.org/r/630590 [13:45:09] !log downtiming all eqiad row D hosts - T196487 [13:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:16] T196487: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 [13:45:24] only as I can't just downtime a rack worth of hosts [13:46:43] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=aqs1006.eqiad.wmnet [13:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:31] 10Operations, 10Wikispeech-Jobrunner, 10Wikispeech-Text-to-Speech, 10Wikispeech-WMSE: TTS server deployment strategy - https://phabricator.wikimedia.org/T193072 (10Lokal_Profil) 05Stalled→03Invalid The Speechoid service has been changed to use Blubber. A separate task will be set up to track deployment... 
[13:47:48] (03CR) 10Ayounsi: [C: 03+1] homer: fix live config check [puppet] - 10https://gerrit.wikimedia.org/r/630590 (owner: 10Volans) [13:48:07] (03CR) 10Volans: [C: 03+2] homer: fix live config check [puppet] - 10https://gerrit.wikimedia.org/r/630590 (owner: 10Volans) [13:51:59] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:56] (03PS1) 10Kormat: WIP bsection: pull out binary search to separate function. [puppet] - 10https://gerrit.wikimedia.org/r/630596 [13:58:45] !log asw2-d-eqiad# run request system power-off member 4 [13:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:47] (03PS1) 10CDanis: eventgate-logging-external-tls-proxy: bump CPU up [deployment-charts] - 10https://gerrit.wikimedia.org/r/630597 (https://phabricator.wikimedia.org/T257527) [14:00:49] (03PS1) 10Elukey: install_server: set Debian buster for an-worker1101 [puppet] - 10https://gerrit.wikimedia.org/r/630598 [14:01:28] (03Abandoned) 10Elukey: install_server: set Debian buster for an-worker1101 [puppet] - 10https://gerrit.wikimedia.org/r/630598 (owner: 10Elukey) [14:02:08] 10Operations, 10Product-Infrastructure-Data, 10Epic, 10Goal, 10Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10CDanis) [14:02:54] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [14:02:54] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: 
/{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [14:03:20] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Moni [14:03:36] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) timed out before a response was received: /analytics.wikimedia.org/v1/unique-devices/{project}/{access-site}/{granularity}/{start}/{end} (Get unique devices) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top/{project}/{ [14:03:36] onth}/{day} (Get top page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:03:41] (03PS2) 10Jbond: labstore::nfs_mount: drop support for empty string share_path [puppet] - 10https://gerrit.wikimedia.org/r/630589 [14:03:46] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:03:52] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/unique-devices/{project}/{access-site}/{granularity}/{start}/{end} (Get unique 
devices) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{ [14:03:52] site}/{granularity}/{start}/{end} (Get pagecounts) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:03:56] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:04:38] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:04:42] PROBLEM - Juniper virtual chassis ports on asw2-d-eqiad is CRITICAL: CRIT: Down: 2 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [14:05:28] !log uploaded libdbi-perl 1.631-3+wmf1 for jessie-wikimedia T259102 [14:05:30] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [14:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:46] XioNoX: it is only d4's tor down right? 
[14:05:54] (03CR) 10Volans: [C: 03+1] "LGTM, I agree that not removing the break logic too doesn't improve that much.And to do that raising an exception would require to indent " [puppet] - 10https://gerrit.wikimedia.org/r/630596 (owner: 10Kormat) [14:06:23] elukey: yep [14:06:24] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:06:26] (03Abandoned) 10Kormat: WIP bsection: pull out binary search to separate function. [puppet] - 10https://gerrit.wikimedia.org/r/630596 (owner: 10Kormat) [14:06:34] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:06:44] PROBLEM - Aggregate IPsec Tunnel Status codfw on alert1001 is CRITICAL: instance={mc2033,mc2034,mc2035,mc2036} site=codfw tunnel={mc1033_v4,mc1034_v4,mc1035_v4,mc1036_v4} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [14:06:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={pdu_sentry4,swagger_check_citoid_cluster_codfw,swagger_check_wikifeeds_codfw} site={codfw,eqsin} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:07:42] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:07:52] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:08:21] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [14:09:50] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [14:10:33] so aqs1006 is in d4 (two 
cassandra instances) and it caused some read timeouts for the aqs service, that in turn caused timeouts for wikifeeds [14:11:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:11:34] (03CR) 10Hnowlan: [C: 03+2] api-gateway: add routing for static and other components [deployment-charts] - 10https://gerrit.wikimedia.org/r/628408 (https://phabricator.wikimedia.org/T263045) (owner: 10Hnowlan) [14:12:14] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me [14:12:46] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [14:13:38] this --^ was probably due to the elastic node in d4 now unreachable [14:13:54] (03Merged) 10jenkins-bot: api-gateway: add routing for static and other components [deployment-charts] - 10https://gerrit.wikimedia.org/r/628408 (https://phabricator.wikimedia.org/T263045) (owner: 10Hnowlan) [14:14:05] elukey: will elastic.. snap back? 
[14:14:52] * elukey answers to kormat using /dev/urandom [14:15:22] * Reedy squints [14:16:28] RECOVERY - Juniper virtual chassis ports on asw2-d-eqiad is OK: OK: UP: 24 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [14:20:14] RECOVERY - Aggregate IPsec Tunnel Status codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [14:21:56] 10Operations, 10observability, 10serviceops: Strongswan Icinga check: do not report issues about depooled hosts - https://phabricator.wikimedia.org/T148976 (10ema) >>! In T148976#6488108, @BBlack wrote: > This was mostly about cache nodes back when those had ipsec, I think. The remaining case that uses ipse... [14:22:52] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:23:30] 10Operations, 10DBA, 10Sustainability (Incident Followup), 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10Marostegui) The sequence of events within the transaction that failed is interesting and it definitely didn't... [14:23:45] 10Operations, 10ops-eqiad, 10DBA, 10netops, and 2 others: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (10ayounsi) [14:24:16] 10Operations, 10Patch-For-Review, 10User-jbond: Manage apt sources via puppet - https://phabricator.wikimedia.org/T158562 (10MoritzMuehlenhoff) >>! 
In T158562#6494038, @MoritzMuehlenhoff wrote: > I did a test installation with the new setting as I had a hunch there would be issues in early install and turns... [14:25:21] (03CR) 10Ppchelko: [C: 03+1] api-gateway: use TLS for restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/630567 (owner: 10Hnowlan) [14:27:17] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [14:27:18] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [14:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:43] ACKNOWLEDGEMENT - mcrouter process on mwdebug1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (mcrouter), command name mcrouter Effie Mouzeli testing https://wikitech.wikimedia.org/wiki/Mcrouter [14:29:58] (03CR) 10Mholloway: [C: 03+1] wikifeeds: use the service proxy for reaching the MediaWiki api [deployment-charts] - 10https://gerrit.wikimedia.org/r/628756 (https://phabricator.wikimedia.org/T255878) (owner: 10Giuseppe Lavagetto) [14:32:16] (03PS2) 10Ayounsi: Revert "Depool eqiad for row D recabling" [dns] - 10https://gerrit.wikimedia.org/r/629519 [14:33:36] !log repool eqiad [14:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1099.eqiad.wmnet', 'an-wor... 
[14:40:24] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=aqs1006.eqiad.wmnet [14:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:31] (03CR) 10JMeybohm: [C: 03+1] eventgate-logging-external-tls-proxy: bump CPU up [deployment-charts] - 10https://gerrit.wikimedia.org/r/630597 (https://phabricator.wikimedia.org/T257527) (owner: 10CDanis) [14:42:31] 10Operations, 10ops-eqiad, 10decommission-hardware: Decommission old-asw-d4-eqiad (ex4300) - https://phabricator.wikimedia.org/T264001 (10Cmjohnson) [14:43:01] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): WDQS Categories reload is failing on thankyouwiki - https://phabricator.wikimedia.org/T261097 (10Gehel) 05Open→03Resolved [14:43:15] (03CR) 10CDanis: [C: 03+2] eventgate-logging-external-tls-proxy: bump CPU up [deployment-charts] - 10https://gerrit.wikimedia.org/r/630597 (https://phabricator.wikimedia.org/T257527) (owner: 10CDanis) [14:43:48] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Papaul) @Kormat tell @Marostegui to not break the host again :) [14:44:10] lol [14:44:39] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [14:44:39] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [14:44:41] 10Operations, 10LDAP-Access-Requests: Access to Superset - https://phabricator.wikimedia.org/T263868 (10JVargas) This is approved on my end, @ArielGlenn. Let me know if you need anything else from me. Thanks! 
[14:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:44] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2014.codfw.wmnet - https://phabricator.wikimedia.org/T262889 (10Papaul) [14:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:03] (03PS1) 10Alexandros Kosiaris: Cleanup scap::sources from some old objects [puppet] - 10https://gerrit.wikimedia.org/r/630604 [14:45:38] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on alert1001 is CRITICAL: 49.13 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [14:45:52] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2014.codfw.wmnet - https://phabricator.wikimedia.org/T262889 (10Papaul) 05Open→03Resolved Complete [14:45:54] 10Operations: Integrate Buster 10.5 point release - https://phabricator.wikimedia.org/T259519 (10MoritzMuehlenhoff) [14:47:02] (03CR) 10Alexandros Kosiaris: [C: 03+2] Cleanup scap::sources from some old objects [puppet] - 10https://gerrit.wikimedia.org/r/630604 (owner: 10Alexandros Kosiaris) [14:48:46] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) >>! In T260670#6498765, @Papaul wrote: > @Kormat tell @Marostegui to not break the host again :) hahah - reminder: you are the on... 
[14:49:43] !log installing glib-networking security updates [14:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:56] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:50:05] (03CR) 10Alexandros Kosiaris: "PCC at https://puppet-compiler.wmflabs.org/compiler1001/25460/" [puppet] - 10https://gerrit.wikimedia.org/r/630562 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [14:51:13] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:57:14] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Proton: proton experienced a period of high CPU usage, busy queue, lockups - https://phabricator.wikimedia.org/T214975 (10akosiaris) 05Open→03Invalid This is close to 15months old, and the service has been moved to kubernetes in the meantime, so most... [14:58:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:59:18] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [14:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:00:32] XioNoX: maybe coincidence, but db1114 (in row D) just lost net connectivity [15:00:47] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [15:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:19] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:41] XioNoX: and it just came back, 4mins later [15:02:07] Operations, DBA, Sustainability (Incident Followup), Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (jcrespo) These are the logs of the blocks (2 inserts and 1 update?) the timestamps would be close to (but not... [15:02:40] cmjohnson1: can you sync up with kormat to replace the SFP-T on ge-4/0/34 ? [15:03:02] kormat: yeah I see it in the logs.. [15:03:16] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:07] kormat ...okay to replace now? [15:07:21] cmjohnson1: yep, go for it [15:08:03] kormat done [15:08:30] cmjohnson1: great, thanks! [15:08:46] Operations, observability, serviceops, Platform Team Initiatives (API Gateway): mtail 3.0.0-rc35 doesn't support the histogram type in -oneshot mode. - https://phabricator.wikimedia.org/T263728 (colewhite) a:colewhite [15:08:47] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [15:08:47] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[15:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:30] Operations, ops-eqiad, decommission-hardware: Decommission old-asw-d4-eqiad (ex4300) - https://phabricator.wikimedia.org/T264001 (Cmjohnson) Open→Resolved wiped, removed from rack updated netbox [15:10:26] (CR) Jbond: [C: +1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/623666 (owner: Dzahn) [15:11:12] Operations, DBA, Sustainability (Incident Followup), Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (Marostegui) Yes, they are those, this is the order of events on the binlog for the ipblock table on that IP th... [15:11:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:12:43] !log cdanis@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [15:12:43] !log cdanis@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [15:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:13:23] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[15:13:23] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [15:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:36] Operations, Data-Services, SRE-Access-Requests, cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (nskaggs) I'm trying as a non-global root wmcs admin; here's what I get: ` $ secure-cookbook -d w... [15:14:37] Operations, ops-codfw, DC-Ops, decommission-hardware: decommission es2012.codfw.wmnet - https://phabricator.wikimedia.org/T263613 (Papaul) [15:14:51] Operations, ops-codfw, DC-Ops, decommission-hardware: decommission es2012.codfw.wmnet - https://phabricator.wikimedia.org/T263613 (Papaul) Open→Resolved Complete [15:15:08] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:15:22] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [15:15:23] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[15:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:17] Operations: Integrate Buster 10.5 point release - https://phabricator.wikimedia.org/T259519 (MoritzMuehlenhoff) [15:20:50] (PS2) Jbond: tools: puppetdb reduce postgres memory usage [puppet] - https://gerrit.wikimedia.org/r/630574 [15:21:24] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:22:16] Operations, ops-codfw: elastic2037 DIMM errors logged in racadm getsel - https://phabricator.wikimedia.org/T263714 (Papaul) @elukey just got confirmation the part arrived. If the host is not depool yet please depool and power off, time for me to go and pick up the part. Thaanks [15:22:36] (Abandoned) Hnowlan: api-gateway: Fall through to the appservers if a route isn't matched [deployment-charts] - https://gerrit.wikimedia.org/r/628772 (https://phabricator.wikimedia.org/T263045) (owner: Hnowlan) [15:22:52] (CR) Jbond: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/630574 (owner: Jbond) [15:23:12] Operations, ops-eqiad, Analytics-Radar, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (Cmjohnson) [15:24:02] Operations, ops-eqiad, Analytics-Radar, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (Cmjohnson) a:Cmjohnson→RobH @RobH the new ssds have been installed to these servers, I appreciate you fixing the raid and... 
[15:25:41] !log Restarting CI Jenkins for plugins uninstallation T260565 [15:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:26:36] !log kormat@cumin1001 dbctl commit (dc=all): 'Repool db1114 T196487', diff saved to https://phabricator.wikimedia.org/P12818 and previous config saved to /var/cache/conftool/dbconfig/20200928-152635-kormat.json [15:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:41] T196487: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 [15:27:15] (CR) Effie Mouzeli: [C: +1] push-notifications: change version tag to -production [deployment-charts] - https://gerrit.wikimedia.org/r/628340 (https://phabricator.wikimedia.org/T256973) (owner: MSantos) [15:27:19] Operations, Data-Services, SRE-Access-Requests, cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (jbond) >>! In T261145#6498898, @nskaggs wrote: > I'm trying as a non-global root wmcs admin; here'... 
[15:27:57] (PS2) Jdlrobson: Enable search in header A/B test for logged in users [mediawiki-config] - https://gerrit.wikimedia.org/r/630206 (https://phabricator.wikimedia.org/T263032) [15:30:38] (PS1) Kormat: bsection: Change exit code semantics [puppet] - https://gerrit.wikimedia.org/r/630622 [15:31:53] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/630622 (owner: Kormat) [15:33:21] (CR) Kormat: [C: +2] bsection: Change exit code semantics [puppet] - https://gerrit.wikimedia.org/r/630622 (owner: Kormat) [15:35:59] Operations, ops-eqiad, DC-Ops, Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1099.eqiad.wmnet', 'an-worker1100.eqiad.wmnet'] ` and were **ALL** successf... [15:39:57] (CR) Hnowlan: [C: +2] restbase: add restbase102[89]/restbase1030 [puppet] - https://gerrit.wikimedia.org/r/630106 (https://phabricator.wikimedia.org/T261512) (owner: Hnowlan) [15:41:01] PROBLEM - Host labweb1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:41:12] PROBLEM - Host puppetmaster1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:41:22] got paged, [15:41:24] that paged [15:41:30] huh [15:41:42] is labweb1002 downtime expected? [15:41:48] puppetmaster1002 is in D4, anyway related to current WIP? [15:41:49] D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4 [15:42:00] same for labweb1002 [15:42:05] oh, is the switch swap today? that would be it [15:42:08] <_joe_> I thought we were done with maintenance? 
[15:42:11] Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (Cmjohnson) [15:42:20] there is a swap of a switch in eqiad just not sure what day was on meeting notes [15:42:21] checking [15:42:22] I did not get paged because sometime between yesterday evening and now my phone powered off but is fully charged, how nice :-/ [15:42:34] wtf [15:42:59] ok, I think some or many of the SFP-Ts are faulty [15:43:05] it's not just the db host [15:43:07] mhhh both hosts are down since 45min, I guess expired downtimes [15:43:10] cmjohnson1: you're around? [15:43:27] yes [15:43:29] <_joe_> we need to at least depool the puppetmaster? [15:43:36] row d recable is on thursday so yeah, just closing that loop that its not that ;D [15:43:39] <_joe_> jbond42: ^^ not sure if that was done [15:43:48] Operations, ops-codfw, DBA: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (Papaul) a:Papaul→Marostegui just after 1 month we received this server, we have already a bad disk. Disk replaced. [15:44:05] cmjohnson1: can you check/replace the SFP-T on ge-4/0/10 ? 
[15:44:09] yes [15:44:12] cmjohnson1: and ge-4/0/26 [15:44:19] ok [15:44:22] thx [15:45:26] Operations, ops-codfw, decommission-hardware: decommission es2018.codfw.wmnet - https://phabricator.wikimedia.org/T263615 (Papaul) [15:45:30] Operations, ops-codfw, decommission-hardware: decommission es2018.codfw.wmnet - https://phabricator.wikimedia.org/T263615 (Papaul) Open→Resolved complete [15:45:57] (CR) Jbond: "Taken another pass" (9 comments) [cookbooks] - https://gerrit.wikimedia.org/r/625597 (owner: Muehlenhoff) [15:46:56] jouncebot: looking now [15:47:00] RECOVERY - Host puppetmaster1002 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [15:47:01] Operations, Data-Services, SRE-Access-Requests, cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (nskaggs) Ahh, thanks jbond. Trying with sudo, I don't have passwordless sudo on that machine. [15:47:03] ge-4/0/10 up up puppetmaster1002 [15:47:05] yep [15:47:12] ge-4/0/26 up down labweb1002 [15:47:12] is next [15:47:23] RECOVERY - Host labweb1002 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [15:47:25] <_joe_> jbond42: it just came back [15:47:28] ok, back up [15:47:30] PROBLEM - Check systemd state on labweb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:39] done [15:47:42] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: /var/lib/mysql 2362 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-coord1001&var-datasource=eqiad+prometheus/ops [15:47:46] ack thanks for the record it didn't get depooled [15:48:48] Operations, SRE-Access-Requests, cloud-services-team (Kanban): wikitech-static access for Sam Reed - https://phabricator.wikimedia.org/T262468 (Reedy) >>! In T262468#6497715, @ArielGlenn wrote: > Hey @Reedy... is ths all set? Can we resolve the task? (If it's done, please add a on line summary of how... [15:49:06] RECOVERY - Check systemd state on labweb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:46] I'll monitor the logs and the switch port for more failures [15:50:08] Operations, ops-codfw, DBA: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (Marostegui) Thanks @Papaul is the disk blinking there? I still don't see it on the OS. [15:51:55] !log poweroff elastic2037 for DIMM replacing [15:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:50] marostegui: yes the disk is blinking if you want i can remove it and put it back again [15:53:01] Operations, ops-codfw, DBA: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (Marostegui) ` Time: Mon Sep 28 15:39:48 2020 Event Description: PD 02(e0x20/s2) Path 500056b34b011fc2 reset (Type 03) Time: Mon Sep 28 15:39:48 2020 Event Description: Removed: PD 02(e0x20/s2) Time: Mon... [15:53:09] papaul: haha, I just wrote that on phab. 
Great minds think alike [15:53:19] (PS1) Cmjohnson: Adding db1150 to site.pp insetup role and dhcpd file [puppet] - https://gerrit.wikimedia.org/r/630631 (https://phabricator.wikimedia.org/T260817) [15:53:49] Operations, Data-Services, SRE-Access-Requests, cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (jbond) @nskaggs which server are you testing on? things look good on cumin1001 ` lang=shell sud... [15:55:05] (CR) Cmjohnson: [C: +2] Adding db1150 to site.pp insetup role and dhcpd file [puppet] - https://gerrit.wikimedia.org/r/630631 (https://phabricator.wikimedia.org/T260817) (owner: Cmjohnson) [15:55:21] marostegui: done [15:55:29] papaul: checking [15:55:51] Operations, ops-eqiad, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (Cmjohnson) [15:57:15] papaul: same error: Enclosure PD 20(c None/p1) phy bad for slot 2 maybe the disk is bad? [15:57:19] let me check the HW logs [15:58:40] PROBLEM - Host elastic2037.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:59:17] Operations, Puppet: unbound variable error when calling puppet-merge script with an explicit treeish - https://phabricator.wikimedia.org/T264014 (CDanis) [15:59:22] Operations, ops-eqiad, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` db1150.eqiad.wm... 
[16:01:16] Operations, ops-codfw, DC-Ops, fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx2001.frack.codfw.wmnet, frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T260183 (Jgreen) [16:03:11] Operations, ops-codfw, DBA: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (Marostegui) @Papaul after putting the disk back in, I am seeing the same errors on the controller: ` [1764225.764609] megaraid_sas 0000:af:00.0: 1103 (654623787s/0x0004/CRIT) - Enclosure PD 20(c None/p1)... [16:03:54] RECOVERY - Host elastic2037.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.98 ms [16:04:46] Operations, ops-codfw, DC-Ops, fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx2001.frack.codfw.wmnet, frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T260183 (Jgreen) [16:07:48] (PS14) Jbond: puppet-merge: add Repository class [puppet] - https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249) [16:08:02] !log push pfw policies - T264013 [16:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:42] RECOVERY - Host elastic2037 is UP: PING OK - Packet loss = 0%, RTA = 31.83 ms [16:08:56] !log reimaging new restbase hosts - restbase1028, restbase1029, restbase1030 [16:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:16] (CR) Jbond: "Rebased, ready for review" [puppet] - https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249) (owner: Jbond) [16:10:49] (PS1) Jcrespo: mariadb: Set up db1150 as the new buster source backups host [puppet] - https://gerrit.wikimedia.org/r/630634 (https://phabricator.wikimedia.org/T257551) [16:11:24] Operations, ops-codfw: elastic2037 DIMM errors logged in racadm getsel - https://phabricator.wikimedia.org/T263714 (Papaul) Open→Resolved DiMM B1 replaced. All good now. 
[16:12:23] (PS2) Jcrespo: mariadb: Set up db1150 as the new buster source backups host [puppet] - https://gerrit.wikimedia.org/r/630634 (https://phabricator.wikimedia.org/T257551) [16:13:56] PROBLEM - Check systemd state on ms-be2056 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:14:22] (PS3) Jcrespo: mariadb: Set up db1150 as the new buster source backups host [puppet] - https://gerrit.wikimedia.org/r/630634 (https://phabricator.wikimedia.org/T257551) [16:17:03] (CR) Jcrespo: "If this can be merged this week, after installed by DC ops (ongoing), there will be no work left for next week." [puppet] - https://gerrit.wikimedia.org/r/630634 (https://phabricator.wikimedia.org/T257551) (owner: Jcrespo) [16:17:25] (PS1) Elukey: admin: add journactl perms to airflow-search-admins [puppet] - https://gerrit.wikimedia.org/r/630635 [16:19:22] Operations, SRE-Access-Requests, cloud-services-team (Kanban): wikitech-static access for Sam Reed - https://phabricator.wikimedia.org/T262468 (Reedy) a:Andrew Andrew passed on the password via telegram (to my phone number). [16:20:00] !log cdanis@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [16:20:00] !log cdanis@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . 
[16:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:34] !log nskaggs@cumin1001 START - Cookbook wmcs.wikireplicas.add_wiki [16:20:36] !log nskaggs@cumin1001 END (FAIL) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=99) [16:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:26] Operations, OTRS, vm-requests: Decommission mendelevium - https://phabricator.wikimedia.org/T263993 (akosiaris) Open→Resolved [16:21:30] Operations, Puppet: unbound variable error when calling puppet-merge script with an explicit treeish - https://phabricator.wikimedia.org/T264014 (jbond) I took a quick look and it seems the issue is caused when FETCH_HEAD_OR_EMPTY is empty which causes puppet-merge.py to get called with two positional ar... [16:22:31] (PS2) Alexandros Kosiaris: proton: remove conftool-data [puppet] - https://gerrit.wikimedia.org/r/627859 (https://phabricator.wikimedia.org/T255877) (owner: Giuseppe Lavagetto) [16:22:54] Operations, SRE-Access-Requests, cloud-services-team (Kanban): wikitech-static access for Sam Reed - https://phabricator.wikimedia.org/T262468 (Reedy) Open→Resolved [16:23:26] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [16:23:27] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [16:23:29] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:23:30] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [16:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:41] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:56] !log cdanis@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [16:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:11] (PS1) Cmjohnson: Adding prodcution dns manually to dns file [dns] - https://gerrit.wikimedia.org/r/630636 (https://phabricator.wikimedia.org/T260817) [16:25:26] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:04] (CR) Cmjohnson: [C: +2] Adding prodcution dns manually to dns file [dns] - https://gerrit.wikimedia.org/r/630636 (https://phabricator.wikimedia.org/T260817) (owner: Cmjohnson) [16:27:20] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:49] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/630635 (owner: Elukey) [16:33:43] RECOVERY - Check systemd state on ms-be2056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:34:29] !log cdanis@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . 
[16:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:51] PROBLEM - Stale file for node-exporter textfile in codfw on alert1001 is CRITICAL: cluster=elasticsearch file=intel_microcode.prom instance=elastic2037 job=node site=codfw https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [16:37:24] Operations, ops-eqiad, netops: (Need by: 2019-09-30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (Cmjohnson) Open→Resolved Both have been updated [16:42:15] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:44:16] Operations, ops-eqiad, DC-Ops: (Need By:TBD) rack/setup/install rows C and D new PDUs - https://phabricator.wikimedia.org/T253694 (Cmjohnson) [16:44:30] Operations, ops-eqiad, DC-Ops: Mon, Sept 14th - PDU Upgrade Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (Cmjohnson) Open→Resolved these were waiting on the temperature leads to be connected. finished and resolving the task [16:45:48] Operations, fundraising-tech-ops, netops: Automate diff and commit of frack ACL - https://phabricator.wikimedia.org/T260655 (Jgreen) a:Jgreen [16:49:58] Operations, ops-eqiad, Analytics-Radar, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1102.eqiad.wmnet... [16:55:54] RECOVERY - Stale file for node-exporter textfile in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [16:56:34] Operations, ops-eqiad, Analytics-Radar: an-presto1004 down - https://phabricator.wikimedia.org/T253438 (Cmjohnson) a:Cmjohnson→RobH @RobH I did not see any signs of burning inside the chassis [16:57:07] (PS1) Ahmon Dancy: InitialiseSettings-labs.php: updated a comment [mediawiki-config] - https://gerrit.wikimedia.org/r/630640 [16:57:41] Operations, ops-eqiad, cloud-services-team (Kanban): cloudvirt1033 psu redundancy alert - https://phabricator.wikimedia.org/T263145 (Cmjohnson) Open→Resolved The issue seems to have been resolved. [16:58:53] Operations, Analytics, Traffic: ~1 request/minute to intake-logging.wikimedia.org times out at the traffic/service interface - https://phabricator.wikimedia.org/T264021 (CDanis) [16:59:01] Operations, ops-eqiad, netops, Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506 (Cmjohnson) [16:59:19] Operations, Analytics, Traffic: ~1 request/minute to intake-logging.wikimedia.org times out at the traffic/service interface - https://phabricator.wikimedia.org/T264021 (CDanis) p:Triage→Low Clients will retry automatically so this isn't a huge deal, but it does merit investigation at some po... [16:59:21] Operations, ops-eqiad: Decommisson and store old row D network gear. - https://phabricator.wikimedia.org/T170474 (Cmjohnson) Open→Resolved all of old row D's old network was removed awhile ago. resolving this task [16:59:35] Operations, ops-eqiad, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1150.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1150.eqia... 
[17:00:05] ryankemper: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikidata Query Service weekly deploy . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200928T1700). [17:00:30] (PS1) Hnowlan: restbase: set role for new nodes [puppet] - https://gerrit.wikimedia.org/r/630641 (https://phabricator.wikimedia.org/T261512) [17:02:12] Operations, ops-eqiad, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` db1150.eqiad.wm... [17:03:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:04:32] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/630574 (owner: Jbond) [17:05:54] (CR) CRusnov: "This change is ready for review." [dns] - https://gerrit.wikimedia.org/r/630644 (https://phabricator.wikimedia.org/T258729) (owner: CRusnov) [17:06:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:06:56] Puppet, Patch-For-Review: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (jbond) >>! In T263578#6497784, @Volans wrote: > > Does it need to be retained on reboots? If not seems a good idea to me. As far as i can see the directory is just used as a queue so if a re... 
[17:07:30] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 124.1 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [17:08:19] Operations, ops-eqiad, Analytics-Radar, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1102.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1... [17:09:38] (PS1) Phuedx: SearchBox: Fix data-search-loc attribute [skins/Vector] (wmf/1.36.0-wmf.10) - https://gerrit.wikimedia.org/r/630418 (https://phabricator.wikimedia.org/T256100) [17:13:07] Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1150.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1150.eqiad.wmnet'] ` [17:13:32] Operations, ops-eqiad, Analytics-Radar, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (RobH) Ok, I setup an-worker1102 with raid1 on the two SSDS, and each HDD as its own raid0. Now it gets an LVM label in use error... 
[17:15:02] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:15:02] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [17:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:52] Operations, ops-eqiad, DC-Ops: (Need By: TBD) install memory upgrades in ores100[1-9] - https://phabricator.wikimedia.org/T259909 (Cmjohnson) @akosiaris Is scheduling this for this coming Wednesday too soon? 1400UTC? If not let's try Wednesday of next week same time. [17:18:14] Operations, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (Cmjohnson) @elukey Can you do this Monday 5 October 1400UTC? [17:19:36] Operations, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (elukey) Definitely yes! [17:20:00] Operations, Data-Services, SRE-Access-Requests, cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (nskaggs) Ahh got it. No arguments are allowed for secure-cookbook. I confirmed I was able to run t... [17:20:08] Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` db1150.eqiad.wmnet ` The log can be f... 
[17:20:50] 10Operations, 10Data-Services, 10SRE-Access-Requests, 10cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (10nskaggs) 05Open→03Resolved a:03nskaggs [17:21:30] 10Operations, 10Data-Services, 10SRE-Access-Requests, 10cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (10nskaggs) 05Resolved→03Open a:05nskaggs→03None [17:21:56] 10Operations, 10ops-eqiad, 10Analytics-Clusters: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10Cmjohnson) @elukey Same thing with these...can we do them all Monday or will you need multiple days? [17:23:20] 10Operations, 10ops-eqiad, 10Analytics-Clusters: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10elukey) All on Monday is fine! [17:25:54] 10Operations, 10ops-eqiad, 10Analytics-Clusters: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10Cmjohnson) Okay, great! [17:27:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:28:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:31:33] 10Operations, 10Analytics, 10Traffic: ~1 request/minute to intake-logging.wikimedia.org times out at the traffic/service interface - https://phabricator.wikimedia.org/T264021 (10JAllemandou) Idea: Could missing-revisions (T215001) be related to this? 
[17:32:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:59] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:25] (03CR) 10CRusnov: "This change is ready for review." [dns] - 10https://gerrit.wikimedia.org/r/630647 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [17:39:28] (03CR) 10jerkins-bot: [V: 04-1] Migrate ESAMS to Netbox Automation [dns] - 10https://gerrit.wikimedia.org/r/630647 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [17:39:34] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10Cmjohnson) [17:39:48] (03PS2) 10Catrope: Enable and configure GrowthExperiments on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627395 (https://phabricator.wikimedia.org/T257220) [17:41:46] 10Puppet, 10Patch-For-Review: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10Volans) @jbond I think we can just try the tmpfs first as you said and check the impact. [17:41:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:42:57] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1150.eqiad.wmnet'] ` and were **ALL** successful. [17:43:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:43:36] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10Cmjohnson) 05Open→03Resolved @Marostegui @jcrespo All yours [17:45:56] (03PS2) 10CRusnov: Migrate ESAMS to Netbox Automation [dns] - 10https://gerrit.wikimedia.org/r/630647 (https://phabricator.wikimedia.org/T258729) [17:49:56] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10jcrespo) Thank you very much for you help, Cmjohnson!!! [17:50:45] (03CR) 10Jcrespo: [C: 03+2] mariadb: Set up db1150 as the new buster source backups host [puppet] - 10https://gerrit.wikimedia.org/r/630634 (https://phabricator.wikimedia.org/T257551) (owner: 10Jcrespo) [17:53:07] (03PS2) 10CRusnov: Migrate EQSIN to Netbox Automation [dns] - 10https://gerrit.wikimedia.org/r/630644 (https://phabricator.wikimedia.org/T258729) [17:53:48] (03PS3) 10CRusnov: Migrate ESAMS to Netbox Automation [dns] - 10https://gerrit.wikimedia.org/r/630647 (https://phabricator.wikimedia.org/T258729) [17:56:05] 10Operations, 10LDAP-Access-Requests: Add Bereket teshome to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T262921 (10KFrancis) @RLazarus -Confirming Bereket's NDA is fully executed. Thanks! [17:56:29] (03Abandoned) 10Jdlrobson: SearchBox: Fix data-search-loc attribute [skins/Vector] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/630418 (https://phabricator.wikimedia.org/T256100) (owner: 10Phuedx) [18:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200928T1800). 
[18:00:04] jdlrobson: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:17] I can deploy today [18:00:51] o/ [18:01:25] (03CR) 10Urbanecm: [C: 03+2] Enable search in header A/B test for logged in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630206 (https://phabricator.wikimedia.org/T263032) (owner: 10Jdlrobson) [18:02:12] (03Merged) 10jenkins-bot: Enable search in header A/B test for logged in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630206 (https://phabricator.wikimedia.org/T263032) (owner: 10Jdlrobson) [18:03:06] Jdlrobson: pulled onto mwdebug2001, can you test, please? [18:03:07] 10Operations, 10LDAP-Access-Requests: Add Bereket teshome to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T262921 (10RLazarus) a:05KFrancis→03ArielGlenn Thanks @KFrancis! Passing this along to @ArielGlenn as the current SRE Clinic Duty person. [18:03:16] 10Operations, 10LDAP-Access-Requests: Add Bereket teshome to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T262921 (10RLazarus) [18:03:25] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1102.eqiad.wmnet... [18:05:08] Urbanecm: almost done [18:05:16] ack, take your time :) [18:06:13] actually you might be able to help me [18:06:22] yes? [18:06:24] all the accounts I have are bucketted in the A group [18:06:28] I need to check the B group [18:06:38] the search bar should be next to the logo in the B group [18:06:44] could you login and see if that's true for your account? 
[18:06:52] if i can find one example I know it's working correctly [18:07:10] would using my main account work, or do i need to create a new one? [18:07:28] URL https://en.wikipedia.org/wiki/Speedway_(soundtrack)?useskinversion=2 [18:07:30] any account [18:09:05] Jdlrobson: works for me at euwiki, at least if this is what you expect https://usercontent.irccloud-cdn.com/file/2kkinYxm/image.png [18:09:17] I had to empty browser cache for the search bar to move [18:09:31] (normal refresh, ie. Ctrl+R, didn't change anything) [18:09:56] Urbanecm: okay it works [18:10:04] yep that's great and i confirmed as well for another account [18:10:08] sync away and thank you! [18:10:10] cool, I'll sync it then :) [18:10:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:10:46] Jdlrobson: is the "Contributions and Log out buttons moved away" issue known? [18:11:08] yep not relating [18:11:10] (03CR) 10Dzahn: [C: 03+2] openstack: replace remaining hiera() that had default values [puppet] - 10https://gerrit.wikimedia.org/r/627967 (owner: 10Dzahn) [18:11:12] that's fine [18:11:20] (03CR) 10Dmaza: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630653 (https://phabricator.wikimedia.org/T260461) (owner: 10Dmaza) [18:11:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:11:36] (03PS2) 10Dmaza: Enable watchlist expiry feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630653 (https://phabricator.wikimedia.org/T260461) [18:11:38] Jdlrobson: ack [18:12:08] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: c7e08bc2bbff6aead186350726d5c1c137cca052: Enable search in header A/B test for logged in users (T263032) (duration: 00m 58s) [18:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:13] T263032: Deploy the new location of the search bar to new vector and begin A/B test on test wikis - https://phabricator.wikimedia.org/T263032 [18:12:14] Jdlrobson: here you go :) [18:12:20] yeehaaa [18:12:25] thanks Urbanecm [18:12:29] no problem [18:13:55] 10Operations, 10Product-Infrastructure-Data, 10Epic, 10Goal, 10Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10CDanis) With the current 5% sampling, we're getting about 30 reports/second at peak... [18:15:04] !log Morning B&C done [18:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:37] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [18:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:39] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:54] 10Operations, 10DBA, 10Performance-Team, 10Platform Engineering, 10User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (10daniel) @Krinkle I think there are two parts to this. In my mind, the groups used in code are basically hints to the DB layer that a given cluster m... 
[18:22:19] (03CR) 10CRusnov: [C: 04-2] "This needs to be merged after EQSIN patch." [dns] - 10https://gerrit.wikimedia.org/r/630647 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [18:23:44] (03CR) 10Dzahn: oozie: hiera->lookup, add data types (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/629443 (owner: 10Dzahn) [18:23:47] (03PS3) 10Dzahn: oozie: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/629443 [18:26:28] (03PS18) 10Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/623666 [18:27:56] (03CR) 10Ebernhardson: [C: 03+1] logstash: keep at least 2 copies of each shard for apifeatureusage [puppet] - 10https://gerrit.wikimedia.org/r/630546 (https://phabricator.wikimedia.org/T263970) (owner: 10Gehel) [18:30:37] (03CR) 10Ahmon Dancy: [C: 04-2] "Nevermind this. Will be doing something more extensive in the train-dev branch." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630640 (owner: 10Ahmon Dancy) [18:30:51] (03Abandoned) 10Ahmon Dancy: InitialiseSettings-labs.php: updated a comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630640 (owner: 10Ahmon Dancy) [18:32:14] (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1003/25467/deneb.codfw.wmnet/change.deneb.codfw.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/627970 (owner: 10Dzahn) [18:32:58] 10Operations, 10DBA, 10Performance-Team, 10Platform Engineering, 10User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (10daniel) A quick inventory of DB groups used in core, based on some ad-hoc grep runs: {P12819} [18:35:05] (03PS2) 10Dzahn: docker: replace hiera with lookup, add data types for builder and registry [puppet] - 10https://gerrit.wikimedia.org/r/627970 [18:35:26] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - 
https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1102.eqiad.wmnet'] ` and were **ALL** successful. [18:35:42] 10Operations, 10Machine Learning Platform, 10ORES, 10serviceops, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Halfak) I'm not familiar with this problem. Anything change with the deployment recently? Did any overload errors ha... [18:37:24] (03PS1) 10Andrew Bogott: Move cloudvirt1012,13 and 14 to ceph and Buster [puppet] - 10https://gerrit.wikimedia.org/r/630656 (https://phabricator.wikimedia.org/T259399) [18:39:26] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/25468/deneb.codfw.wmnet/index.html and docker::registry is not used anywhere?" [puppet] - 10https://gerrit.wikimedia.org/r/627970 (owner: 10Dzahn) [18:39:34] I will do some hacking on mwdebug1002 (for T264029) [18:39:35] T264029: Special:Homepage runs out of memory - https://phabricator.wikimedia.org/T264029 [18:41:56] 10Operations, 10Security-Team: Establish secteam production norms - https://phabricator.wikimedia.org/T224886 (10chasemp) [18:41:58] 10Operations, 10Security-Team, 10Patch-For-Review: (2019-09) Create secteam groups in admin.yaml and define permissions - https://phabricator.wikimedia.org/T223463 (10chasemp) 05Stalled→03Declined [18:42:16] btw if anyone has an idea why a query would OOM on hu, hy, uk but not a bunch of other wikis, I'd welcome it [18:43:07] (03Abandoned) 10Rush: admin: add secteam and secteam-admin for T223463 [puppet] - 10https://gerrit.wikimedia.org/r/510753 (https://phabricator.wikimedia.org/T223463) (owner: 10Rush) [18:43:12] (03PS2) 10Gehel: logstash: keep at least 2 copies of each shard for apifeatureusage [puppet] - 10https://gerrit.wikimedia.org/r/630546 (https://phabricator.wikimedia.org/T263970) [18:43:19] (03Abandoned) 10Rush: admin: new group add secteam-admin [puppet] - 
10https://gerrit.wikimedia.org/r/521484 (https://phabricator.wikimedia.org/T223463) (owner: 10Jbond) [18:43:31] 10Operations, 10Security-Team: Establish secteam production norms - https://phabricator.wikimedia.org/T224886 (10chasemp) [18:43:33] 10Operations, 10Security-Team, 10Patch-For-Review: (2019-09) Create secteam groups in admin.yaml and define permissions - https://phabricator.wikimedia.org/T223463 (10chasemp) 05Declined→03Resolved [18:47:43] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH) [18:48:18] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH) Ok, updates: * an-worker1102 is now staged and ready for service owners to take it over. * I am working through the other hosts, rebuilding all... [18:48:24] (03PS1) 10Ppchelko: Expose /page/descrtion API [deployment-charts] - 10https://gerrit.wikimedia.org/r/630657 (https://phabricator.wikimedia.org/T262498) [18:48:46] (03PS2) 10Ppchelko: Expose /page/descrtion API [deployment-charts] - 10https://gerrit.wikimedia.org/r/630657 (https://phabricator.wikimedia.org/T262498) [18:50:01] RECOVERY - Disk space on an-coord1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-coord1001&var-datasource=eqiad+prometheus/ops [18:51:18] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 381 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:52:20] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 5 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:52:32] (03CR) 10Gehel: [C: 03+2] logstash: keep at least 2 copies of each shard for apifeatureusage [puppet] - 10https://gerrit.wikimedia.org/r/630546 (https://phabricator.wikimedia.org/T263970) (owner: 10Gehel) [18:57:52] (03CR) 10Dzahn: [V: 03+1 C: 03+2] docker: replace hiera with lookup, add data types for builder and registry [puppet] - 10https://gerrit.wikimedia.org/r/627970 (owner: 10Dzahn) [18:58:06] (03PS1) 10Dzahn: docker: add data types [puppet] - 10https://gerrit.wikimedia.org/r/630661 [18:59:05] (03CR) 10jerkins-bot: [V: 04-1] docker: add data types [puppet] - 10https://gerrit.wikimedia.org/r/630661 (owner: 10Dzahn) [19:00:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:01:56] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 443 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:01:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:02:21] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1103.eqiad.wmnet', 'an-worker1104.eqi... 
[19:03:08] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 9 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:03:26] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/25470/" [puppet] - 10https://gerrit.wikimedia.org/r/629424 (owner: 10Dzahn) [19:04:45] (03PS9) 10Dzahn: cache::base: replace hiera() with lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/623662 [19:05:45] (03PS1) 10Ahmon Dancy: Add support for dev realm [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630664 [19:05:48] (03PS1) 10Ahmon Dancy: Don't load CirrusSearch extension for dev realm. [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630665 [19:05:50] (03PS1) 10Ahmon Dancy: Support InitialiseSettings-.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630666 [19:05:52] (03PS1) 10Ahmon Dancy: Add wmf-config/InitialiseSettings-dev.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630667 [19:06:34] (03CR) 10jerkins-bot: [V: 04-1] Support InitialiseSettings-.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630666 (owner: 10Ahmon Dancy) [19:06:37] (03CR) 10jerkins-bot: [V: 04-1] Add wmf-config/InitialiseSettings-dev.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630667 (owner: 10Ahmon Dancy) [19:07:49] (03CR) 10Dzahn: "confirmed noop in prod -acmechief1001" [puppet] - 10https://gerrit.wikimedia.org/r/629424 (owner: 10Dzahn) [19:14:47] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [19:14:48] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [19:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:46] !log robh@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:32] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:23] (03CR) 10Ahmon Dancy: [C: 03+2] Add support for dev realm [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630664 (owner: 10Ahmon Dancy) [19:21:34] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1105.eqiad.wmnet ` The log can be found... [19:22:09] (03Merged) 10jenkins-bot: Add support for dev realm [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630664 (owner: 10Ahmon Dancy) [19:22:41] (03CR) 10Ahmon Dancy: [C: 03+2] Don't load CirrusSearch extension for dev realm. [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630665 (owner: 10Ahmon Dancy) [19:23:21] (03Merged) 10jenkins-bot: Don't load CirrusSearch extension for dev realm. 
[mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630665 (owner: 10Ahmon Dancy) [19:24:12] (03PS2) 10Ahmon Dancy: Support InitialiseSettings-.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630666 [19:24:14] (03PS2) 10Ahmon Dancy: Add wmf-config/InitialiseSettings-dev.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630667 [19:24:58] (03CR) 10jerkins-bot: [V: 04-1] Add wmf-config/InitialiseSettings-dev.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630667 (owner: 10Ahmon Dancy) [19:25:10] (03CR) 10jerkins-bot: [V: 04-1] Support InitialiseSettings-.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630666 (owner: 10Ahmon Dancy) [19:35:44] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1103.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1104.eqiad.wmnet'] ` [19:36:33] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1105.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1105.eqiad.wmnet'] ` [19:38:21] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH) [19:39:28] (03PS10) 10Dzahn: cache::base/varnish: replace hiera() with lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/623662 [19:40:41] (03PS7) 10Gehel: Extracting obvious reporting code to a Reporter class. 
[software/cumin] - 10https://gerrit.wikimedia.org/r/626660 (https://phabricator.wikimedia.org/T212783) [19:42:44] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/25473/cp1082.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/623662 (owner: 10Dzahn) [19:45:28] (03PS1) 10Mholloway: Update mobileapps to 2020-09-28-145812-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/630671 (https://phabricator.wikimedia.org/T259624) [19:47:02] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/25474/cp1082.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/623662 (owner: 10Dzahn) [19:51:31] (03CR) 10Dzahn: "double checked it's NOOP in prod like in compiler: cp4032, cp2036, cp1082" [puppet] - 10https://gerrit.wikimedia.org/r/623662 (owner: 10Dzahn) [19:52:19] (03PS2) 10Dzahn: tor: use Stdlib::Host to match FQDN or IP [puppet] - 10https://gerrit.wikimedia.org/r/630310 [19:53:12] (03CR) 10Dzahn: [C: 03+2] tor: use Stdlib::Host to match FQDN or IP [puppet] - 10https://gerrit.wikimedia.org/r/630310 (owner: 10Dzahn) [19:56:10] (03CR) 10Andrew Bogott: [C: 03+2] Move cloudvirt1012,13 and 14 to ceph and Buster [puppet] - 10https://gerrit.wikimedia.org/r/630656 (https://phabricator.wikimedia.org/T259399) (owner: 10Andrew Bogott) [19:56:36] (03PS2) 10Dzahn: phabricator: replace Stdlib::Ip_address with IP::Address [puppet] - 10https://gerrit.wikimedia.org/r/630309 [19:57:07] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/25476/phab1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/630309 (owner: 10Dzahn) [19:59:47] 10Operations, 10LDAP-Access-Requests: Access to Superset - https://phabricator.wikimedia.org/T263868 (10JRabah) Thanks @JVargas and @ArielGlenn. Please let me know if you have any questions for me. [20:00:04] chrisalbon and accraze: Time to snap out of that daydream and deploy Services – Graphoid / ORES. Get on with it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200928T2000). [20:05:09] (03CR) 10Mholloway: [C: 03+2] Update mobileapps to 2020-09-28-145812-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/630671 (https://phabricator.wikimedia.org/T259624) (owner: 10Mholloway) [20:05:53] (03PS3) 10Ahmon Dancy: Support InitialiseSettings-.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630666 [20:05:55] (03PS3) 10Ahmon Dancy: Add wmf-config/InitialiseSettings-dev.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630667 [20:07:26] (03Merged) 10jenkins-bot: Update mobileapps to 2020-09-28-145812-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/630671 (https://phabricator.wikimedia.org/T259624) (owner: 10Mholloway) [20:08:24] (03CR) 10Ahmon Dancy: [C: 03+2] Support InitialiseSettings-.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630666 (owner: 10Ahmon Dancy) [20:09:06] (03Merged) 10jenkins-bot: Support InitialiseSettings-.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630666 (owner: 10Ahmon Dancy) [20:09:22] (03CR) 10Ahmon Dancy: [C: 03+2] Add wmf-config/InitialiseSettings-dev.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630667 (owner: 10Ahmon Dancy) [20:09:59] (03Merged) 10jenkins-bot: Add wmf-config/InitialiseSettings-dev.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630667 (owner: 10Ahmon Dancy) [20:10:27] !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . 
[20:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:32] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [20:12:44] that server :S it wants to be special [20:13:12] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [20:13:12] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [20:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:47] (03PS2) 10Dzahn: facilities: replace Stdlib::Ip_address with IP::Address [puppet] - 10https://gerrit.wikimedia.org/r/630308 [20:14:29] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/25477/" [puppet] - 10https://gerrit.wikimedia.org/r/630308 (owner: 10Dzahn) [20:15:29] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [20:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:21] (03CR) 10Dzahn: "hmm.. a duplicate declaration on prometheus1004, but only there? https://puppet-compiler.wmflabs.org/compiler1002/25466/prometheus1004.eq" [puppet] - 10https://gerrit.wikimedia.org/r/623666 (owner: 10Dzahn) [20:17:28] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:53] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[20:17:54] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [20:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:59] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:20:59] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:22:01] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:22:01] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:24:27] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:27:17] PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [2000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=39&fullscreen&orgId=1&var-cluster=codfw&var-smoothing=1 [20:27:53] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:31:41] (03PS1) 10Andrew Bogott: Try to re-image cloudvirt1012-14 without reformating the VM partition [puppet] - 10https://gerrit.wikimedia.org/r/630675 [20:32:27] (03CR) 10Andrew Bogott: [C: 03+2] Try to re-image cloudvirt1012-14 without reformating the VM partition [puppet] - 10https://gerrit.wikimedia.org/r/630675 (owner: 10Andrew Bogott) [20:32:49] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [20:34:14] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1105.eqiad.wmnet', 'an-worker1106.eqi... [20:40:07] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1105.eqiad.wmnet ` The log can be found... 
[20:40:57] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [20:45:59] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [20:46:01] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [20:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:01] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:54] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:29] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [20:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:49] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [20:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:26] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:16] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:13] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring 
https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [20:57:34] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1106.eqiad.wmnet', 'an-worker1107.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-... [21:00:05] Reedy and sbassett: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200928T2100). [21:00:46] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH) [21:05:53] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1105.eqiad.wmnet'] ` and were **ALL** successful. [21:08:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:09:45] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1108.eqiad.wmnet', 'an-worker1109.eqi... [21:10:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:12:33] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [21:17:21] (03CR) 10Dzahn: maps: hiera()->lookup(), add data types (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/629439 (owner: 10Dzahn) [21:17:36] (03PS3) 10Dzahn: maps: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/629439 [21:18:22] 10Operations, 10Platform Engineering, 10SRE-Access-Requests, 10Platform Team Workboards (Green): Request for membership of acl*procurement-review group for Platform Engineering staff - https://phabricator.wikimedia.org/T264054 (10WDoranWMF) [21:18:41] (03CR) 10jerkins-bot: [V: 04-1] maps: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/629439 (owner: 10Dzahn) [21:18:53] 10Operations, 10Platform Engineering, 10SRE-Access-Requests, 10Platform Team Workboards (Green): Request for membership of acl*procurement-review group for Platform Engineering staff - https://phabricator.wikimedia.org/T264054 (10WDoranWMF) p:05Triage→03High [21:18:59] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [21:20:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:20:19] (03CR) 10Dzahn: [C: 03+2] DHCP: add testvm5001 MAC address [puppet] - 10https://gerrit.wikimedia.org/r/630320 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn) [21:20:24] (03PS2) 10Dzahn: DHCP: add testvm5001 MAC address [puppet] - 10https://gerrit.wikimedia.org/r/630320 (https://phabricator.wikimedia.org/T252526) [21:21:27] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:21:30] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [21:21:31] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [21:21:33] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [21:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:17] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [21:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:30] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:24] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:37] RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is OK: OK: Less than 20.00% above the threshold [1200.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring 
https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=39&fullscreen&orgId=1&var-cluster=codfw&var-smoothing=1 [21:33:11] (03PS4) 10Dzahn: maps: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/629439 [21:34:55] (03PS2) 10Dzahn: docker: add data types [puppet] - 10https://gerrit.wikimedia.org/r/630661 [21:35:54] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1109.eqiad.wmnet', 'an-worker1108.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-... [21:36:05] (03CR) 10jerkins-bot: [V: 04-1] docker: add data types [puppet] - 10https://gerrit.wikimedia.org/r/630661 (owner: 10Dzahn) [21:43:40] (03PS3) 10Dzahn: docker: add data types [puppet] - 10https://gerrit.wikimedia.org/r/630661 [21:45:21] (03PS19) 10Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/623666 [21:47:02] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1111.eqiad.wmnet', 'an-worker1112.eqi... 
[21:47:40] (03PS2) 10Dzahn: tlsproxy::instance: switch from hiera() to lookup(), lint fix [puppet] - 10https://gerrit.wikimedia.org/r/623079 [21:50:22] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/24930/" [puppet] - 10https://gerrit.wikimedia.org/r/623079 (owner: 10Dzahn) [21:52:01] (03CR) 10Dzahn: "confirmed NOOP in prod: thorium, maps2001, elastic1036" [puppet] - 10https://gerrit.wikimedia.org/r/623079 (owner: 10Dzahn) [21:52:51] (03PS2) 10CDanis: geoip VCL: add a 'which' param to get_geo_xcip [puppet] - 10https://gerrit.wikimedia.org/r/630315 (https://phabricator.wikimedia.org/T263496) [21:52:53] (03PS2) 10CDanis: VCL: Attach a variety of GeoIP info as bereq headers; test GeoIP [puppet] - 10https://gerrit.wikimedia.org/r/630316 (https://phabricator.wikimedia.org/T263496) [21:53:06] (03CR) 10CDanis: geoip VCL: add a 'which' param to get_geo_xcip (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630315 (https://phabricator.wikimedia.org/T263496) (owner: 10CDanis) [21:53:33] (03CR) 10jerkins-bot: [V: 04-1] VCL: Attach a variety of GeoIP info as bereq headers; test GeoIP [puppet] - 10https://gerrit.wikimedia.org/r/630316 (https://phabricator.wikimedia.org/T263496) (owner: 10CDanis) [21:53:57] (03PS20) 10Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/623666 [21:55:15] 10Operations, 10Platform Engineering, 10SRE-Access-Requests, 10Platform Team Workboards (Green): Request for membership of acl*procurement-review group for Platform Engineering staff - https://phabricator.wikimedia.org/T264054 (10RobH) 05Open→03Resolved p:05High→03Medium a:05RobH→03None Added.... 
[21:56:01] (03PS3) 10CDanis: VCL: Attach a variety of GeoIP info as bereq headers; test GeoIP [puppet] - 10https://gerrit.wikimedia.org/r/630316 (https://phabricator.wikimedia.org/T263496) [21:57:00] (03CR) 10Dzahn: [C: 04-1] "this compiles fine on every single host _except_ on the eqiad prometheus hosts" [puppet] - 10https://gerrit.wikimedia.org/r/623666 (owner: 10Dzahn) [21:58:48] !log robh@cumin1001 START - Cookbook sre.hosts.downtime [21:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:19] (03CR) 10CDanis: "As discussed, now with tests! Which, by the way, I'm happy to move out of 02-frontend-headers.vtc into a new VTC file if you'd prefer. (" [puppet] - 10https://gerrit.wikimedia.org/r/630316 (https://phabricator.wikimedia.org/T263496) (owner: 10CDanis) [21:59:54] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH) [22:00:51] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:33] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [22:02:37] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-me [22:02:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:03:35] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [22:03:39] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [22:03:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:04:33] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [22:06:22] (03CR) 10Dzahn: [C: 04-1] "the problem here is somehow in the k8s class along the calico-felix.yaml using ${::site} and $targets_path. 
it only fails in eqiad in the " [puppet] - 10https://gerrit.wikimedia.org/r/623666 (owner: 10Dzahn) [22:06:39] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [22:10:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:11:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:11:49] (03PS21) 10Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/623666 [22:17:13] (03PS1) 10Mholloway: Update wikifeeds to 2020-09-28-221030-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/630688 [22:20:03] (03CR) 10Mholloway: [C: 03+2] Update wikifeeds to 2020-09-28-221030-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/630688 (owner: 10Mholloway) [22:22:09] (03Merged) 10jenkins-bot: Update wikifeeds to 2020-09-28-221030-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/630688 (owner: 10Mholloway) [22:24:21] !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [22:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:34] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [22:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:50] (03CR) 10Dzahn: [V: 03+1] "hah, it's always worth compiling on everything for these. 
the reason was https://gerrit.wikimedia.org/r/c/operations/puppet/+/623666/20..2" [puppet] - 10https://gerrit.wikimedia.org/r/623666 (owner: 10Dzahn) [22:27:14] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [22:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:33] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/25481/ https://puppet-compiler.wmflabs.org/compiler1002/25466/" [puppet] - 10https://gerrit.wikimedia.org/r/623666 (owner: 10Dzahn) [22:29:52] (03CR) 10Dzahn: "noop on prometheus1003,2004" [puppet] - 10https://gerrit.wikimedia.org/r/623666 (owner: 10Dzahn) [22:32:59] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [22:34:11] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [22:34:50] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1112.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1111.eqiad.wmnet', 'an-... 
[22:37:07] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH) [22:38:48] (03CR) 10Dzahn: [V: 04-1] "https://puppet-compiler.wmflabs.org/compiler1002/25483/db1133.eqiad.wmnet/change.db1133.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/630317 (owner: 10Dzahn) [22:39:15] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=G [22:40:01] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH) [22:41:17] (03PS5) 10Dzahn: mariadb::core_test: convert role to profile, add data types [puppet] - 10https://gerrit.wikimedia.org/r/630317 [22:41:57] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [22:42:43] PROBLEM - PHP7 rendering on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:43:27] PROBLEM - Apache HTTP on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [22:44:27] (03PS6) 10Dzahn: mariadb::core_test: convert role to profile, add data types [puppet] - 10https://gerrit.wikimedia.org/r/630317 [22:45:34] (03PS1) 10Gergő Tisza: Properly handle namespaces in tasktype template configuration [extensions/GrowthExperiments] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/630420 (https://phabricator.wikimedia.org/T264029) [22:45:58] (03PS7) 10Dzahn: mariadb::core_test: convert role to profile, add data types [puppet] - 10https://gerrit.wikimedia.org/r/630317 [22:46:56] (03PS8) 10Dzahn: mariadb::core_test: convert role to profile, add data types [puppet] - 10https://gerrit.wikimedia.org/r/630317 [22:48:38] (03PS3) 10CRusnov: base/check_systemd_state.py: Switch header to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/624733 (https://phabricator.wikimedia.org/T247364) [22:49:18] I have to go afk for a few minutes, will be back in time to do the B&C patch [22:50:01] (03CR) 10Dzahn: [V: 04-1] "parameter 'postgres_replicas' expects an Array value, got Struct" [puppet] - 10https://gerrit.wikimedia.org/r/629439 (owner: 10Dzahn) [22:50:50] (03CR) 10CRusnov: [C: 03+2] base/check_systemd_state.py: Switch header to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/624733 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [22:52:08] (03PS4) 10CRusnov: modules/service/files/logstash_checker.py: Move to Python3 
[puppet] - 10https://gerrit.wikimedia.org/r/624116 (https://phabricator.wikimedia.org/T247364) [22:52:37] PROBLEM - ensure kvm processes are running on cloudvirt1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:52:44] (03CR) 10CRusnov: [C: 03+2] modules/service/files/logstash_checker.py: Move to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/624116 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [22:53:13] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:53:16] (03CR) 10CRusnov: [C: 03+2] modules/service/files/logstash_checker.py: Move to Python3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/624116 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [22:55:00] (03PS6) 10Dzahn: swift::proxy: convert role to profile, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/628970 [22:55:17] (03CR) 10Dzahn: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/628970 (owner: 10Dzahn) [22:56:09] (03CR) 10Dzahn: "no difference (except comments) between PS2 and PS6" [puppet] - 10https://gerrit.wikimedia.org/r/628970 (owner: 10Dzahn) [22:56:56] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1111.eqiad.wmnet ` The log can be found... [22:58:06] (03CR) 10Dzahn: "getting closer, removing -1" [puppet] - 10https://gerrit.wikimedia.org/r/621368 (owner: 10Dzahn) [23:00:04] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening backport window deploy. Dont look at me like that. You signed up for it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200928T2300). [23:00:04] tgr: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:01:06] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/630690 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [23:02:04] (03Abandoned) 10CRusnov: modules/admin/data/nda_audit.py: Port to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/624112 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [23:03:48] (03PS1) 10Dzahn: quarry: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/630691 [23:04:18] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1113.eqiad.wmnet ` The log can be found... [23:05:49] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:11:53] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/630693 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [23:15:17] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:16:15] (03PS1) 10Dzahn: thumbor: role->profile, data types, lint (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/630694 [23:17:03] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [23:18:00] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1111.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1111.eqiad.wmnet'] ` [23:18:11] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1113.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1113.eqiad.wmnet'] ` [23:19:59] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:24:27] (03CR) 10Gergő Tisza: [C: 03+2] Properly handle namespaces in tasktype template configuration [extensions/GrowthExperiments] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/630420 (https://phabricator.wikimedia.org/T264029) (owner: 10Gergő Tisza) [23:24:29] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:26:51] RECOVERY - Apache HTTP on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.801 second response time https://wikitech.wikimedia.org/wiki/Application_servers [23:27:02] (03PS1) 10Dzahn: install_server: let testvm5001 use install5001 as TFTP server [puppet] - 10https://gerrit.wikimedia.org/r/630695 (https://phabricator.wikimedia.org/T252526) [23:27:29] RECOVERY - PHP7 rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:27:32] (03CR) 10Dzahn: [C: 03+2] install_server: let testvm5001 use install5001 as TFTP server [puppet] - 10https://gerrit.wikimedia.org/r/630695 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn) [23:27:39] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:29:58] (03PS1) 10Dzahn: DHCP: switch TFTP server for eqsin from bast5001 to install5001 [puppet] - 10https://gerrit.wikimedia.org/r/630696 (https://phabricator.wikimedia.org/T252526) [23:32:18] (03Merged) 10jenkins-bot: Properly handle namespaces in tasktype template configuration [extensions/GrowthExperiments] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/630420 (https://phabricator.wikimedia.org/T264029) (owner: 10Gergő Tisza) [23:33:21] PROBLEM - PHP7 rendering on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [23:34:15] PROBLEM - Apache HTTP on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:34:21] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver 
code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=G [23:35:55] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [23:35:57] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/630697 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [23:40:27] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:42:12] (03PS1) 10Dzahn: enable tftp service on install5001 [puppet] - 10https://gerrit.wikimedia.org/r/630699 (https://phabricator.wikimedia.org/T252526) [23:42:42] (03CR) 10Dzahn: [C: 03+2] enable tftp service on install5001 [puppet] - 10https://gerrit.wikimedia.org/r/630699 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn) [23:42:56] !log tgr@deploy1001 Synchronized php-1.36.0-wmf.10/extensions/GrowthExperiments/includes/NewcomerTasks/ConfigurationLoader/PageConfigurationLoader.php: Backport: [[gerrit:630420|Properly handle namespaces in tasktype template configuration (T264029)]] (duration: 01m 03s) [23:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:02] T264029: Special:Homepage runs out of memory - https://phabricator.wikimedia.org/T264029 [23:45:07] tgr_: all done? 
I'm going to add something to backport [23:45:52] ebernhardson: I just realized I need one more fix. It will take a few minutes though so done for now. Are you deploying for yourself? [23:46:25] tgr_: yea i'll deploy, no rush it can go in 20 minutes or whatever [23:46:47] cool, go ahead. [23:46:53] ok [23:47:26] (03PS1) 10Ebernhardson: Remove commonswiki from sister search sidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630700 (https://phabricator.wikimedia.org/T264053) [23:48:15] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:49:40] (03PS2) 10Ebernhardson: Remove commonswiki from sister search sidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630700 (https://phabricator.wikimedia.org/T264053) [23:49:47] (03CR) 10Ebernhardson: [C: 03+2] "backport window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630700 (https://phabricator.wikimedia.org/T264053) (owner: 10Ebernhardson) [23:50:36] (03CR) 10Dzahn: [C: 03+2] DHCP: switch TFTP server for eqsin from bast5001 to install5001 [puppet] - 10https://gerrit.wikimedia.org/r/630696 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn) [23:50:41] (03Merged) 10jenkins-bot: Remove commonswiki from sister search sidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630700 (https://phabricator.wikimedia.org/T264053) (owner: 10Ebernhardson) [23:51:01] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:53:54] (03PS1) 10Dzahn: stop tftp service on bast5001 [puppet] - 10https://gerrit.wikimedia.org/r/630702 (https://phabricator.wikimedia.org/T252526) [23:54:13] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:54:54] (03CR) 10Dzahn: [C: 03+2] stop tftp service on bast5001 [puppet] - 10https://gerrit.wikimedia.org/r/630702 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn) [23:56:57] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T264053: Remove commonswiki from sidebar search (duration: 01m 09s) [23:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:01] T264053: Unsustainable increases in Elasticsearch cluster disk IO - https://phabricator.wikimedia.org/T264053 [23:58:12] tgr_: all done [23:58:20] thx [23:58:27] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:58:53] PROBLEM - TFTP service on bast5001 is CRITICAL: NRPE: Command check_atftpd not defined https://wikitech.wikimedia.org/wiki/Monitoring/atftpd