[00:42:11] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:43:47] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:45:09] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:57] <icinga-wm>	 PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 55 probes of 649 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:02:07] <icinga-wm>	 RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 7 probes of 649 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:17:43] <wikibugs>	 (CR) HitomiAkane: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/629738 (https://phabricator.wikimedia.org/T262218) (owner: HitomiAkane)
[05:10:35] <wikibugs>	 Operations, ops-codfw, DBA: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (Marostegui) Thank you Papaul
[05:19:40] <wikibugs>	 (PS1) Marostegui: dbproxy1018: Decrease labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/630392
[05:21:03] <wikibugs>	 Operations, OTRS, serviceops, Patch-For-Review, User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (Marostegui) >>! In T187984#6494499, @jcrespo wrote: > db1077 should now be available to be put back on test-* section, I don't think it is...
[05:21:31] <wikibugs>	 (CR) Marostegui: [C: +2] dbproxy1018: Decrease labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/630392 (owner: Marostegui)
[05:22:16] <marostegui>	 !log Decrease labsdb1011 weight
[05:22:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:07] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:26:45] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:33:13] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:33:49] <icinga-wm>	 PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:33:59] <wikibugs>	 (CR) Marostegui: [C: +1] wmnet: Update s8-master alias [dns] - https://gerrit.wikimedia.org/r/629716 (https://phabricator.wikimedia.org/T239238) (owner: Kormat)
[05:34:29] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: Promote db1104 to s8 master [puppet] - https://gerrit.wikimedia.org/r/629707 (https://phabricator.wikimedia.org/T239238) (owner: Kormat)
[05:46:15] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[05:46:43] <icinga-wm>	 RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:47:43] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:48:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2013 T263740', diff saved to https://phabricator.wikimedia.org/P12804 and previous config saved to /var/cache/conftool/dbconfig/20200928-054846-marostegui.json
[05:48:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:54] <stashbot>	 T263740: decommission es2013.codfw.wmnet - https://phabricator.wikimedia.org/T263740
[05:49:27] <wikibugs>	 (PS1) Marostegui: es2013: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/630393 (https://phabricator.wikimedia.org/T263740)
[05:49:29] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[05:50:30] <wikibugs>	 (CR) Marostegui: [C: +2] es2013: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/630393 (https://phabricator.wikimedia.org/T263740) (owner: Marostegui)
[05:52:22] <wikibugs>	 (PS1) Marostegui: instances.yaml: Remove es2013 from dbctl [puppet] - https://gerrit.wikimedia.org/r/630394 (https://phabricator.wikimedia.org/T263740)
[05:53:03] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Remove es2013 from dbctl [puppet] - https://gerrit.wikimedia.org/r/630394 (https://phabricator.wikimedia.org/T263740) (owner: Marostegui)
[05:54:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove es2013 from dbctl T263740', diff saved to https://phabricator.wikimedia.org/P12805 and previous config saved to /var/cache/conftool/dbconfig/20200928-055410-marostegui.json
[05:54:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:18] <stashbot>	 T263740: decommission es2013.codfw.wmnet - https://phabricator.wikimedia.org/T263740
[05:55:41] <marostegui>	 !log Stop MySQL on es2013 before decommissioning it T263740
[05:55:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:34] <marostegui>	 !log Set innodb_change_buffering = inserts; on db2089 (s5), db2106 (s4), db2108 (s2), db2085 (s1), db2085 (s8), db2087 (s7), db2087 (s6), db2109 (s3) T263443
[06:15:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:43] <stashbot>	 T263443: Evaluate the impact of changing innodb_change_buffering to inserts  - https://phabricator.wikimedia.org/T263443
[06:35:56] <icinga-wm>	 PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:36:12] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:40:40] <icinga-wm>	 RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:41:00] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:54:45] <wikibugs>	 (PS2) Giuseppe Lavagetto: wikifeeds: use the service proxy for reaching the MediaWiki api [deployment-charts] - https://gerrit.wikimedia.org/r/628756 (https://phabricator.wikimedia.org/T255878)
[06:59:36] <wikibugs>	 Operations, MediaWiki-General, Platform Engineering: Allow easier ICU transitions in MediaWiki - https://phabricator.wikimedia.org/T263437 (Joe) p:Medium→High
[06:59:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2028 as es1 master in codfw T261717', diff saved to https://phabricator.wikimedia.org/P12806 and previous config saved to /var/cache/conftool/dbconfig/20200928-065938-marostegui.json
[06:59:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:47] <stashbot>	 T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[07:06:43] <wikibugs>	 (PS2) Giuseppe Lavagetto: services: add TLS encrypted endpoint for wikifeeds (1/2) [puppet] - https://gerrit.wikimedia.org/r/629154 (https://phabricator.wikimedia.org/T255878)
[07:09:12] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.presto.roll-restart-workers
[07:09:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:18] <wikibugs>	 (CR) Muehlenhoff: role:mx: add script to generate otrs aliases (2 comments) [puppet] - https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T244792) (owner: Jbond)
[07:12:45] <wikibugs>	 Operations, Continuous-Integration-Infrastructure, Jenkins, Release-Engineering-Team (CI & Testing services), and 2 others: Review process to fetch Jenkins Debian package from upstream - https://phabricator.wikimedia.org/T260282 (MoritzMuehlenhoff) >>! In T260282#6494946, @hashar wrote: > So that...
[07:13:27] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] services: add TLS encrypted endpoint for wikifeeds (1/2) [puppet] - https://gerrit.wikimedia.org/r/629154 (https://phabricator.wikimedia.org/T255878) (owner: Giuseppe Lavagetto)
[07:16:04] <wikibugs>	 (PS1) Marostegui: dbproxy1018: Depool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/630531
[07:16:06] <wikibugs>	 (PS1) Volans: Use @abstractmethod instead of @abstractproperty [software/cumin] - https://gerrit.wikimedia.org/r/630532
[07:16:09] <wikibugs>	 (PS1) Volans: tox: add mypy environment [software/cumin] - https://gerrit.wikimedia.org/r/630533
[07:16:31] <wikibugs>	 (CR) Marostegui: [C: +2] dbproxy1018: Depool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/630531 (owner: Marostegui)
[07:17:22] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0)
[07:17:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:14] <_joe_>	 !log restarting pybal on the backup LVS in eqiad, codfw to pick up the new wikifeeds endpoint
[07:18:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:06] <wikibugs>	 (CR) Volans: [C: +2] "Trivial, self-merging" [software/cumin] - https://gerrit.wikimedia.org/r/630532 (owner: Volans)
[07:20:06] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:20:41] <_joe_>	 uhmmm
[07:20:56] <_joe_>	 the session is already established btw
[07:21:21] <wikibugs>	 (Merged) jenkins-bot: Use @abstractmethod instead of @abstractproperty [software/cumin] - https://gerrit.wikimedia.org/r/630532 (owner: Volans)
[07:22:03] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 68 connections established with conf1004.eqiad.wmnet:4001 (min=69) https://wikitech.wikimedia.org/wiki/PyBal
[07:24:32] <dcausse>	 !log T263970: forcing allocation of enwiki_general_1587198756 (chi@eqiad)
[07:24:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:39] <stashbot>	 T263970: ElasticSearch unassigned shard check apifeatureusage-2020.06.30@codfw and enwiki_general_1587198756@codfw - https://phabricator.wikimedia.org/T263970
[07:29:03] <_joe_>	 !log restarting pybal on the LVS primaries
[07:29:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:43] <icinga-wm>	 ACKNOWLEDGEMENT - Long running screen/tmux on mwdebug1001 is CRITICAL: CRIT: Long running SCREEN process. (user: jiji PID: 8843, 1761775s 1728000s). Effie Mouzeli that is me https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[07:32:25] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 69 connections established with conf1004.eqiad.wmnet:4001 (min=69) https://wikitech.wikimedia.org/wiki/PyBal
[07:39:57] <icinga-wm>	 RECOVERY - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:41:31] <wikibugs>	 (PS2) Giuseppe Lavagetto: services: add TLS encrypted endpoint for wikifeeds (2/2) [puppet] - https://gerrit.wikimedia.org/r/629155 (https://phabricator.wikimedia.org/T255878)
[07:42:43] <wikibugs>	 (Abandoned) Giuseppe Lavagetto: mediawiki::common: use envoy for tls termination too in nodes using it [puppet] - https://gerrit.wikimedia.org/r/574988 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto)
[07:43:06] <wikibugs>	 (CR) Giuseppe Lavagetto: "> Patch Set 1: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/629155 (https://phabricator.wikimedia.org/T255878) (owner: Giuseppe Lavagetto)
[07:43:14] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db2125 (re)pooling @ 25%: mobo replaced T260670', diff saved to https://phabricator.wikimedia.org/P12809 and previous config saved to /var/cache/conftool/dbconfig/20200928-074313-kormat.json
[07:43:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:21] <stashbot>	 T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670
[07:43:54] <wikibugs>	 (CR) JMeybohm: [C: +1] services: add TLS encrypted endpoint for wikifeeds (2/2) [puppet] - https://gerrit.wikimedia.org/r/629155 (https://phabricator.wikimedia.org/T255878) (owner: Giuseppe Lavagetto)
[07:44:45] <wikibugs>	 (PS1) Marostegui: db2125: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/630535 (https://phabricator.wikimedia.org/T260670)
[07:44:48] <wikibugs>	 Operations, Machine Learning Platform, ORES, serviceops, Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (awight) Possibly related to {T181632}.  In the past, Redis was a single point of failure and if Celery could not conne...
[07:45:22] <wikibugs>	 (CR) Marostegui: [C: +2] db2125: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/630535 (https://phabricator.wikimedia.org/T260670) (owner: Marostegui)
[07:46:29] <icinga-wm>	 RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 308, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:47:31] <wikibugs>	 (PS1) Giuseppe Lavagetto: changeprop: use https to connect to ORES, restbase [deployment-charts] - https://gerrit.wikimedia.org/r/630537 (https://phabricator.wikimedia.org/T244843)
[07:47:39] <wikibugs>	 (CR) jerkins-bot: [V: -1] changeprop: use https to connect to ORES, restbase [deployment-charts] - https://gerrit.wikimedia.org/r/630537 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto)
[07:49:48] <wikibugs>	 (PS2) JMeybohm: Temporarily remove conf1006 from client SRV records [dns] - https://gerrit.wikimedia.org/r/626113 (https://phabricator.wikimedia.org/T196487)
[07:52:39] <wikibugs>	 (CR) JMeybohm: [C: +2] pybal: Move from conf1006 to conf1005 as config_host in esams [puppet] - https://gerrit.wikimedia.org/r/626111 (https://phabricator.wikimedia.org/T196487) (owner: JMeybohm)
[07:54:17] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster
[07:54:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:45] <elukey>	 test cluster, prep for decom --^
[07:58:18] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db2125 (re)pooling @ 50%: mobo replaced T260670', diff saved to https://phabricator.wikimedia.org/P12810 and previous config saved to /var/cache/conftool/dbconfig/20200928-075817-kormat.json
[07:58:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:25] <stashbot>	 T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670
[08:02:22] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0)
[08:02:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:28] <jayme>	 !log restarting pybal on lvs3007 for switching to conf1005 - T196487
[08:02:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:37] <stashbot>	 T196487: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487
[08:03:24] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs3006 is CRITICAL: CRITICAL: 0 connections established with conf1005.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[08:03:44] <jayme>	 this is probably me
[08:04:12] <icinga-wm>	 RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 427, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:06:09] <jayme>	 !log restarting pybal on lvs3006 for switching to conf1005 - T196487
[08:06:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:42] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs3006 is OK: OK: 4 connections established with conf1005.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[08:07:02] <jayme>	 !log restarting pybal on lvs3005 for switching to conf1005 - T196487
[08:07:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:56] <icinga-wm>	 PROBLEM - mcrouter process on mwdebug1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter
[08:08:57] <wikibugs>	 (CR) JMeybohm: [C: +2] "recheck" [dns] - https://gerrit.wikimedia.org/r/626113 (https://phabricator.wikimedia.org/T196487) (owner: JMeybohm)
[08:13:22] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db2125 (re)pooling @ 75%: mobo replaced T260670', diff saved to https://phabricator.wikimedia.org/P12811 and previous config saved to /var/cache/conftool/dbconfig/20200928-081321-kormat.json
[08:13:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:29] <stashbot>	 T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670
[08:13:49] <wikibugs>	 Operations: Integrate Buster 10.6 point update - https://phabricator.wikimedia.org/T263974 (MoritzMuehlenhoff)
[08:21:15] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Remove db2113 from contributions/logpager/recentchanges*/watchlist T263842', diff saved to https://phabricator.wikimedia.org/P12812 and previous config saved to /var/cache/conftool/dbconfig/20200928-082114-kormat.json
[08:21:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:23] <stashbot>	 T263842: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842
[08:21:31] <ema>	 !log upload@eqiad: rolling varnish upgrade to 6.0.6-1wm1 T263557
[08:21:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:37] <stashbot>	 T263557: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557
[08:23:09] <wikibugs>	 (PS1) Ema: cache: upgrade Varnish to v6 in eqiad [puppet] - https://gerrit.wikimedia.org/r/630540 (https://phabricator.wikimedia.org/T263557)
[08:23:25] <wikibugs>	 (CR) Ema: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/630540 (https://phabricator.wikimedia.org/T263557) (owner: Ema)
[08:23:32] <wikibugs>	 (CR) jerkins-bot: [V: -1] cache: upgrade Varnish to v6 in eqiad [puppet] - https://gerrit.wikimedia.org/r/630540 (https://phabricator.wikimedia.org/T263557) (owner: Ema)
[08:24:06] <wikibugs>	 (PS2) Ema: cache: upgrade Varnish to v6 in eqiad [puppet] - https://gerrit.wikimedia.org/r/630540 (https://phabricator.wikimedia.org/T263557)
[08:24:35] <wikibugs>	 Operations, ops-eqiad, decommission-hardware, netops: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (ayounsi) I'm seeing interfaces down on asw2-c-eqiad, and I'm not able to ssh to asw-c-eqiad, so I guess some of those steps have been done?  As they are now alerting I'm del...
[08:24:54] <wikibugs>	 (CR) Ema: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/630540 (https://phabricator.wikimedia.org/T263557) (owner: Ema)
[08:26:27] <wikibugs>	 (CR) Ema: [C: +2] cache: upgrade Varnish to v6 in eqiad [puppet] - https://gerrit.wikimedia.org/r/630540 (https://phabricator.wikimedia.org/T263557) (owner: Ema)
[08:26:36] <wikibugs>	 (PS1) JMeybohm: Revert "Temporarily remove conf1006 from client SRV records" [dns] - https://gerrit.wikimedia.org/r/630407
[08:26:52] <wikibugs>	 Operations, ops-eqiad, decommission-hardware, netops: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (ayounsi)
[08:26:59] <wikibugs>	 (PS1) JMeybohm: Revert "pybal: Move from conf1006 to conf1005 as config_host in esams" [puppet] - https://gerrit.wikimedia.org/r/630408
[08:28:25] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db2125 (re)pooling @ 100%: mobo replaced T260670', diff saved to https://phabricator.wikimedia.org/P12813 and previous config saved to /var/cache/conftool/dbconfig/20200928-082825-kormat.json
[08:28:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:32] <wikibugs>	 (PS1) Elukey: Decommission Hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/630541 (https://phabricator.wikimedia.org/T227485)
[08:28:34] <stashbot>	 T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670
[08:30:37] <wikibugs>	 (CR) Elukey: [C: +2] Decommission Hadoop test cluster [puppet] - https://gerrit.wikimedia.org/r/630541 (https://phabricator.wikimedia.org/T227485) (owner: Elukey)
[08:32:00] <ema>	 !log text@eqiad: rolling varnish upgrade to 6.0.6-1wm1 T263557
[08:32:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:07] <stashbot>	 T263557: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557
[08:34:45] <wikibugs>	 Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Kormat) Open→Resolved Alright, the host is fully back in service now, so resolving this again :)
[08:34:48] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[08:34:48] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[08:34:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:16] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[08:35:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:19] <logmsgbot>	 !log elukey@cumin1001 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97)
[08:36:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:35] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[08:36:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:34] <elukey>	 !log decommission the hadoop test cluster (analytics1028->41)
[08:37:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:59] <wikibugs>	 Operations, Data-Services, SRE-Access-Requests, cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (ArielGlenn) Should this task remain open until the feature mentioned by faidon (non-root cumin) is...
[08:40:17] <wikibugs>	 Operations, ops-eqiad, Analytics-Clusters, decommission-hardware, Patch-For-Review: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (elukey)
[08:40:20] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Djellel Difallah - https://phabricator.wikimedia.org/T263692 (ArielGlenn)
[08:40:54] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch CI to profile::java [puppet] - https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: Muehlenhoff)
[08:42:10] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Djellel Difallah - https://phabricator.wikimedia.org/T263692 (ArielGlenn)
[08:42:12] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[08:42:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:18] <wikibugs>	 Operations, ops-eqiad, Analytics-Clusters, decommission-hardware, Patch-For-Review: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: `analytics[1028-1029].eqiad.wmnet...
[08:43:03] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[08:43:03] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[08:43:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:08] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[08:43:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:32] <wikibugs>	 Operations, Traffic, netops, Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459 (ayounsi) Stalled→Declined Forgot about that old task! Not needed anymore as we're not using  multicast anymore.
[08:46:12] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Djellel Difallah - https://phabricator.wikimedia.org/T263692 (ArielGlenn)
[08:46:15] <godog>	 !log swift codfw-prod: bump object weight for ms-be2057 - T261633
[08:46:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:22] <stashbot>	 T261633: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633
[08:46:53] <wikibugs>	 Operations, ops-eqiad, netops: eqiad row D switch fabric recabling - https://phabricator.wikimedia.org/T256112 (Marostegui)
[08:46:57] <wikibugs>	 Operations, Traffic, netops, Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459 (Marostegui)
[08:50:31] <wikibugs>	 Operations, ops-eqiad, DBA, netops, and 3 others: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (ayounsi) @Cmjohnson the console port is still not responding, could you please have a look before today's maintenance? As we still need to configure the switch (and m...
[08:51:04] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Djellel Difallah - https://phabricator.wikimedia.org/T263692 (ArielGlenn) @DED Please have a look at https://wikitech.wikimedia.org/wiki/Analytics/Data_access#User_responsibilities if you have not already.  Adding @Nuri...
[08:53:30] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Djellel Difallah - https://phabricator.wikimedia.org/T263692 (ArielGlenn) p:Triage→Medium
[08:53:32] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[08:53:33] <wikibugs>	 Operations, SRE-Access-Requests, cloud-services-team (Kanban): wikitech-static access for Sam Reed - https://phabricator.wikimedia.org/T262468 (ArielGlenn) Hey @Reedy... is ths all set? Can we resolve the task? (If it's done, please add a on line summary of how it was handled, so we have a record for...
[08:53:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:40] <wikibugs>	 Operations, ops-eqiad, Analytics-Clusters, decommission-hardware, Patch-For-Review: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: `analytics[1030-1031,1033-1039].e...
[08:55:02] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[08:55:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:10] <dcausse>	 !log T263970: recovering lost apifeature indices (copying eqiad indices -> codfw)
[08:56:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:16] <stashbot>	 T263970: ElasticSearch unassigned shard check apifeatureusage-2020.06.30@codfw and enwiki_general_1587198756@eqiad - https://phabricator.wikimedia.org/T263970
[08:56:33] <wikibugs>	 Operations, Scap, Wikidata, Wikidata-Query-Service, and 2 others: Scap configuration for WDQS should get server groups from a known source or truth - https://phabricator.wikimedia.org/T252124 (Zbyszko) @thcipriani - your proposal sounds reasonable (we don't really care if we're deploying public s...
[08:58:22] <wikibugs>	 Operations, Data-Services, SRE-Access-Requests, cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (MoritzMuehlenhoff) >>! In T261145#6497678, @ArielGlenn wrote: > Should this task remain open until...
[09:00:26] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.downtime
[09:00:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:46] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[09:00:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:52] <wikibugs>	 Operations, ops-eqiad, Analytics-Clusters, decommission-hardware, Patch-For-Review: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: `analytics[1040-1041].eqiad.wmnet...
[09:01:16] <wikibugs>	 Operations, LDAP-Access-Requests: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (ArielGlenn) @MNovotny_WMF Once you have provided the expiration date and contact information, as mentioned above, we can add it to our system and resolve this task. If you are the...
[09:02:27] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:02:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:15] <XioNoX>	 !log restart bird on centrallog2001 - T262372
[09:06:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:22] <stashbot>	 T262372: Configure BGP route damping on Anycast sessions - https://phabricator.wikimedia.org/T262372
[09:06:42] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: restbase: add restbase-async to the TLS SANs [puppet] - 10https://gerrit.wikimedia.org/r/630544
[09:08:08] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: OpenStack: add initial manifests for OpenStack Barbican, a secrets API (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629472 (https://phabricator.wikimedia.org/T263680) (owner: 10Andrew Bogott)
[09:08:53] <wikibugs>	 (03CR) 10Volans: "Nice work! Some generic comments inline." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/627841 (owner: 10Kormat)
[09:09:42] <wikibugs>	 (03PS1) 10Gehel: logstash: keep at least 2 copies of each shard for apifeatureusage [puppet] - 10https://gerrit.wikimedia.org/r/630546 (https://phabricator.wikimedia.org/T263970)
[09:11:26] <wikibugs>	 10Operations, 10Gerrit, 10LDAP-Access-Requests: Add hashar to `archiva-deployers` LDAP group - https://phabricator.wikimedia.org/T263721 (10ArielGlenn) Me. This is done. Let me know that access is working as expected and I'll resolve this task.
[09:11:38] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] restbase: add restbase-async to the TLS SANs [puppet] - 10https://gerrit.wikimedia.org/r/630544 (owner: 10Giuseppe Lavagetto)
[09:11:55] <wikibugs>	 (03CR) 10Gehel: "PCC looks happy: https://puppet-compiler.wmflabs.org/compiler1001/25450/logstash1007.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/630546 (https://phabricator.wikimedia.org/T263970) (owner: 10Gehel)
[09:12:53] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "\o/" [software/cumin] - 10https://gerrit.wikimedia.org/r/630533 (owner: 10Volans)
[09:13:44] <wikibugs>	 (03PS1) 10Hnowlan: changeprop: use restbase-async discovery name [deployment-charts] - 10https://gerrit.wikimedia.org/r/630547
[09:14:08] <wikibugs>	 10Operations, 10Gerrit, 10LDAP-Access-Requests: Add hashar to `archiva-deployers` LDAP group - https://phabricator.wikimedia.org/T263721 (10hashar) 05Open→03Resolved a:03ArielGlenn I am now listed at https://ldap.toolforge.org/group/archiva-deployers  and I have managed to upload some artifacts. Thank...
[09:15:06] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete nda_audit script [puppet] - 10https://gerrit.wikimedia.org/r/630024 (https://phabricator.wikimedia.org/T247364) (owner: 10Muehlenhoff)
[09:15:51] <jynus>	 !log restart db1077 for upgrade and cleanup T187984
[09:15:52] <wikibugs>	 (03PS1) 10Hashar: Add rename-project plugin stable-3.2-0-g7f89635 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/630548 (https://phabricator.wikimedia.org/T201953)
[09:16:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:04] <stashbot>	 T187984: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984
[09:17:09] <XioNoX>	 !log restart bird on dns2001 - T262372
[09:17:11] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM optional nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630179 (https://phabricator.wikimedia.org/T158562) (owner: 10Muehlenhoff)
[09:17:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:15] <stashbot>	 T262372: Configure BGP route damping on Anycast sessions - https://phabricator.wikimedia.org/T262372
[09:17:50] <_joe_>	 !log changing the restbase public TLS certs to include restbase-async.discovery.wmnet
[09:17:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:04] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10decommission-hardware: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (10elukey) 05Stalled→03Open a:05elukey→03Cmjohnson
[09:20:06] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:20:57] <wikibugs>	 10Puppet, 10Patch-For-Review: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10Volans) >>! In T263578#6492417, @jbond wrote: >  however I wonder if it's worth mounting a tmpfs dir here?  the risk is that we may lose a submission but as it likely receives a lot of IO is w...
[09:21:25] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/630099 (owner: 10Elukey)
[09:22:09] <wikibugs>	 (03CR) 10Jbond: profile::hadoop::common: get the datanode mountpoints from facter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629647 (owner: 10Elukey)
[09:23:56] <wikibugs>	 (03Abandoned) 10Giuseppe Lavagetto: changeprop: use https to connect to ORES, restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/630537 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto)
[09:24:16] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:25:23] <wikibugs>	 (03PS1) 10Ema: cache: upgrade Varnish to v6 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/630550 (https://phabricator.wikimedia.org/T263557)
[09:25:46] <wikibugs>	 (03CR) 10Muehlenhoff: Have the puppetised sources.list depend on the wikimedia repository (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630179 (https://phabricator.wikimedia.org/T158562) (owner: 10Muehlenhoff)
[09:26:05] <wikibugs>	 (03PS2) 10Muehlenhoff: Have the puppetised sources.list depend on the wikimedia repository [puppet] - 10https://gerrit.wikimedia.org/r/630179 (https://phabricator.wikimedia.org/T158562)
[09:26:25] <wikibugs>	 (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/630550 (https://phabricator.wikimedia.org/T263557) (owner: 10Ema)
[09:26:44] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "See inline" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/629443 (owner: 10Dzahn)
[09:26:48] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] logstash: keep at least 2 copies of each shard for apifeatureusage [puppet] - 10https://gerrit.wikimedia.org/r/630546 (https://phabricator.wikimedia.org/T263970) (owner: 10Gehel)
[09:29:10] <wikibugs>	 (03CR) 10Ema: [C: 03+2] cache: upgrade Varnish to v6 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/630550 (https://phabricator.wikimedia.org/T263557) (owner: 10Ema)
[09:29:37] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "See inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/629439 (owner: 10Dzahn)
[09:29:52] <ema>	 !log text@codfw: rolling varnish upgrade to 6.0.6-1wm1 T263557
[09:29:56] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Access to Superset - https://phabricator.wikimedia.org/T263868 (10ArielGlenn) I have verified that the email address in wikitech was authenticated and that it is jrabah.  This will require adding you to the wmf LDAP group. Pinging @JVargas for signoff.
[09:29:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:58] <stashbot>	 T263557: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557
[09:30:30] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] services: add TLS encrypted endpoint for wikifeeds (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/629155 (https://phabricator.wikimedia.org/T255878) (owner: 10Giuseppe Lavagetto)
[09:30:52] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/629424 (owner: 10Dzahn)
[09:32:04] <wikibugs>	 (03PS1) 10Hashar: Upgrade javamelody from 1.83.0 to 1.85.0 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/630551 (https://phabricator.wikimedia.org/T232678)
[09:32:20] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Access to Superset - https://phabricator.wikimedia.org/T263868 (10ArielGlenn) p:05Triage→03Medium
[09:32:57] <wikibugs>	 10Operations, 10Parsing-Team, 10TechCom, 10serviceops, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10daniel) With {T263583} coming up, perhaps we should use a special ParserCache instance for old revisions,...
[09:33:28] <wikibugs>	 10Operations, 10netops: Configure BGP route damping on Anycast sessions - https://phabricator.wikimedia.org/T262372 (10ayounsi) 05Resolved→03Open When bird restarts on the centrallog servers it causes bird to bounce a few times: ` Sep 28 09:06:18 centrallog2001 bird: Shutting down Sep 28 09:06:18 centrallo...
[09:33:41] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): Separate mediawiki latency metrics by endpoint - https://phabricator.wikimedia.org/T263727 (10ArielGlenn) p:05Triage→03Medium
[09:33:43] <wikibugs>	 (03CR) 10Hashar: "Not much to mention based on the changelog at https://github.com/javamelody/javamelody/wiki/ReleaseNotes .  I felt we should just closely " [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/630551 (https://phabricator.wikimedia.org/T232678) (owner: 10Hashar)
[09:34:33] <wikibugs>	 10Operations, 10netops: Configure BGP route damping on Anycast sessions - https://phabricator.wikimedia.org/T262372 (10ayounsi) p:05Medium→03High
[09:35:12] <wikibugs>	 10Operations, 10Traffic, 10serviceops: puppetmaster[12]001: add TLS termination - https://phabricator.wikimedia.org/T263831 (10ArielGlenn) p:05Triage→03Medium
[09:35:47] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic: contint.wikimedia.org: add TLS termination - https://phabricator.wikimedia.org/T263830 (10ArielGlenn) p:05Triage→03Medium
[09:37:00] <wikibugs>	 10Operations: Integrate Buster 10.6 point update - https://phabricator.wikimedia.org/T263974 (10ArielGlenn) p:05Triage→03Medium
[09:37:47] <wikibugs>	 10Operations, 10Editing-team, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Samwalton9) Looks like this happened again yesterday with the Signpost (https://en.wikipedia.org/wiki/Use...
[09:39:11] <wikibugs>	 10Operations, 10Packaging: Update php-xdebug to 2.7.2 in apt.wikimedia.org - https://phabricator.wikimedia.org/T263933 (10ArielGlenn) p:05Triage→03High
[09:39:22] <icinga-wm>	 PROBLEM - SSH on ms-be2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:39:59] <wikibugs>	 10Operations, 10Analytics-Radar, 10Domains, 10Traffic, and 2 others: Blocking all third-party storage access requests - https://phabricator.wikimedia.org/T262996 (10ArielGlenn) p:05Triage→03Medium
[09:42:12] <icinga-wm>	 RECOVERY - SSH on ms-be2017 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:44:49] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] changeprop: use restbase-async discovery name [deployment-charts] - 10https://gerrit.wikimedia.org/r/630547 (owner: 10Hnowlan)
[09:46:23] <wikibugs>	 10Operations, 10serviceops, 10User-jijiki: Test onhost memcached performance and functionality - https://phabricator.wikimedia.org/T263958 (10ArielGlenn) p:05Triage→03Medium
[09:47:18] <wikibugs>	 (03Merged) 10jenkins-bot: changeprop: use restbase-async discovery name [deployment-charts] - 10https://gerrit.wikimedia.org/r/630547 (owner: 10Hnowlan)
[09:47:54] <wikibugs>	 10Operations, 10Traffic: backport ipvsadm>=1.30 to buster-wikimedia or buster-backports - https://phabricator.wikimedia.org/T263788 (10ArielGlenn) p:05Triage→03Medium
[09:48:04] <wikibugs>	 10Operations, 10Traffic: Switch to Maglev hashing ('mh') on LVS hosts - https://phabricator.wikimedia.org/T263797 (10ArielGlenn) p:05Triage→03Medium
[09:48:30] <ema>	 !log upload@codfw: rolling varnish upgrade to 6.0.6-1wm1 T263557
[09:48:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:38] <stashbot>	 T263557: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557
[09:48:49] <wikibugs>	 10Operations, 10MediaWiki-REST-API, 10Traffic: Route requests to the REST MediaWiki API to the api cluster - https://phabricator.wikimedia.org/T263729 (10ArielGlenn) p:05Triage→03Medium
[09:48:54] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'production' .
[09:48:54] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
[09:48:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:04] <wikibugs>	 10Operations, 10Machine Learning Platform, 10ORES, 10serviceops, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10ArielGlenn) p:05Triage→03High
[09:53:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: aggregation rules for ats-tls client TTFB [puppet] - 10https://gerrit.wikimedia.org/r/629430 (https://phabricator.wikimedia.org/T263536) (owner: 10Filippo Giunchedi)
[09:54:10] <wikibugs>	 10Operations, 10observability, 10serviceops, 10Platform Team Initiatives (API Gateway): mtail 3.0.0-rc35 doesn't support the histogram type in -oneshot mode. - https://phabricator.wikimedia.org/T263728 (10ArielGlenn) p:05Triage→03High
[09:54:12] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review, 10Platform Team Initiatives (API Gateway): Separate mediawiki latency metrics by endpoint - https://phabricator.wikimedia.org/T263727 (10ArielGlenn) p:05Medium→03High
[09:54:19] <wikibugs>	 10Operations, 10Puppet, 10observability: Notification spam from "last puppet run" upon re-enabling puppet - https://phabricator.wikimedia.org/T263720 (10ArielGlenn) p:05Triage→03Medium
[09:54:49] <wikibugs>	 10Operations, 10Analytics, 10Traffic, 10netops: Turnilo: per-second rates for wmf_netflow bytes + packets - https://phabricator.wikimedia.org/T263290 (10ArielGlenn) p:05Triage→03Medium
[09:55:45] <wikibugs>	 10Operations, 10Traffic, 10netops: experiment with reënabling compression between applayer's TLS terminators and edge caches - https://phabricator.wikimedia.org/T263288 (10ArielGlenn) p:05Triage→03Medium
[09:55:56] <wikibugs>	 (03PS3) 10Muehlenhoff: Have the puppetised sources.list depend on the wikimedia repository [puppet] - 10https://gerrit.wikimedia.org/r/630179 (https://phabricator.wikimedia.org/T158562)
[09:59:05] <wikibugs>	 10Operations, 10Traffic, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Aggregated metrics for ats-tls <-> clients ttfb percentiles - https://phabricator.wikimedia.org/T263536 (10fgiunchedi)
[09:59:32] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] am: use status.cgi JSON as source for problems [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/628090 (owner: 10Filippo Giunchedi)
[10:00:36] <wikibugs>	 10Operations, 10Data-Services, 10SRE-Access-Requests, 10cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (10ArielGlenn) @Bstorm, please let us know that all is working as you expect and I'll close this. Tha...
[10:04:38] <wikibugs>	 10Operations, 10OTRS, 10serviceops, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10jcrespo) db1077 is back into test-s4 role, although without any data.
[10:05:15] <wikibugs>	 (03PS1) 10Filippo Giunchedi: am: tweak alert labels/annotations [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/630554
[10:05:27] <wikibugs>	 10Operations: Provide failover capacity for package installations from main mirror - https://phabricator.wikimedia.org/T262647 (10ArielGlenn) p:05Triage→03Medium
[10:05:57] <wikibugs>	 10Operations: Provide failover capacity for package installations from main mirror - https://phabricator.wikimedia.org/T262647 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff
[10:08:59] <wikibugs>	 (03PS1) 10Hnowlan: changeprop: open access to ores on 443 [deployment-charts] - 10https://gerrit.wikimedia.org/r/630555
[10:09:39] <wikibugs>	 10Operations, 10Analytics-Radar, 10Domains, 10Traffic, 10Wikimedia-General-or-Unknown: WMF third-party cookies rejected - https://phabricator.wikimedia.org/T262882 (10ArielGlenn) p:05Triage→03Medium
[10:10:17] <wikibugs>	 (03CR) 10Volans: [C: 03+2] tox: add mypy environment [software/cumin] - 10https://gerrit.wikimedia.org/r/630533 (owner: 10Volans)
[10:10:54] <wikibugs>	 (03PS1) 10Filippo Giunchedi: alertmanager: group alerts and add severity: page [puppet] - 10https://gerrit.wikimedia.org/r/630556 (https://phabricator.wikimedia.org/T258948)
[10:12:28] <wikibugs>	 (03Merged) 10jenkins-bot: tox: add mypy environment [software/cumin] - 10https://gerrit.wikimedia.org/r/630533 (owner: 10Volans)
[10:12:54] <wikibugs>	 (03PS2) 10Volans: swift: remove old unused service records [dns] - 10https://gerrit.wikimedia.org/r/628086 (https://phabricator.wikimedia.org/T244153)
[10:14:23] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] changeprop: open access to ores on 443 [deployment-charts] - 10https://gerrit.wikimedia.org/r/630555 (owner: 10Hnowlan)
[10:16:30] <wikibugs>	 (03Merged) 10jenkins-bot: changeprop: open access to ores on 443 [deployment-charts] - 10https://gerrit.wikimedia.org/r/630555 (owner: 10Hnowlan)
[10:18:00] <wikibugs>	 (03PS1) 10Effie Mouzeli: mwdebug1001: remove opcache tuning [puppet] - 10https://gerrit.wikimedia.org/r/630558
[10:19:25] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Move wikifeeds to use TLS only - https://phabricator.wikimedia.org/T255878 (10Joe)
[10:19:39] <wikibugs>	 (03PS6) 10Volans: Extracting obvious reporting code to a Reporter class. [software/cumin] - 10https://gerrit.wikimedia.org/r/626660 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel)
[10:19:52] <wikibugs>	 10Operations, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Joe)
[10:21:46] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Extracting obvious reporting code to a Reporter class. [software/cumin] - 10https://gerrit.wikimedia.org/r/626660 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel)
[10:21:57] <wikibugs>	 10Operations, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Joe)
[10:23:20] <wikibugs>	 10Operations, 10Analytics-Radar, 10Patch-For-Review: Move Hue to a Buster VM - https://phabricator.wikimedia.org/T258768 (10elukey) Currently two outstanding UI issues:  https://github.com/cloudera/hue/issues/1273 https://github.com/cloudera/hue/issues/1272  In theory those are not blocking the migration of...
[10:23:46] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
[10:23:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:02] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
[10:25:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:16] <wikibugs>	 (03PS1) 10JMeybohm: Enable envoy telemetry for zortero [deployment-charts] - 10https://gerrit.wikimedia.org/r/630560
[10:25:51] <icinga-wm>	 PROBLEM - mailman_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring
[10:25:58] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
[10:26:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:27] <icinga-wm>	 RECOVERY - mailman_queue_size on lists1001 is OK: OK: mailman queues are below the limits. https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring
[10:29:54] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
[10:29:54] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'production' .
[10:29:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:04] <jouncebot>	 jan_drewniak: (Dis)respected human, time to deploy Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200928T1030). Please do the needful.
[10:31:51] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Enable envoy telemetry for zortero [deployment-charts] - 10https://gerrit.wikimedia.org/r/630560 (owner: 10JMeybohm)
[10:32:15] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
[10:32:15] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'production' .
[10:32:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:52] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Enable envoy telemetry for zortero [deployment-charts] - 10https://gerrit.wikimedia.org/r/630560 (owner: 10JMeybohm)
[10:32:58] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
[10:32:58] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'production' .
[10:33:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:52] <wikibugs>	 (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630561 (https://phabricator.wikimedia.org/T128546)
[10:35:37] <wikibugs>	 (03Merged) 10jenkins-bot: Enable envoy telemetry for zortero [deployment-charts] - 10https://gerrit.wikimedia.org/r/630560 (owner: 10JMeybohm)
[10:35:44] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
[10:35:44] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'production' .
[10:35:45] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: service::configuration: connect to restbase via TLS [puppet] - 10https://gerrit.wikimedia.org/r/630562 (https://phabricator.wikimedia.org/T244843)
[10:35:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:22] <logmsgbot>	 !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' .
[10:37:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:37] <wikibugs>	 (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630561 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[10:39:16] <wikibugs>	 (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630561 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[10:44:47] <logmsgbot>	 !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:630561| Bumping portals to master (T128546)]] (duration: 00m 58s)
[10:44:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:55] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[10:45:45] <logmsgbot>	 !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:630561| Bumping portals to master (T128546)]] (duration: 00m 57s)
[10:45:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:40] <wikibugs>	 10Operations, 10Machine Learning Platform, 10ORES, 10serviceops, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Ladsgroup) >>! In T263910#6497445, @awight wrote: > Possibly related to {T181632}.  In the past, Redis was a single po...
[11:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200928T1100).
[11:00:04] <jouncebot>	 kart_: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:18] <Urbanecm>	 I can deploy today
[11:00:36] <Urbanecm>	 kart_: ready for second try? :-)
[11:02:24] <wikibugs>	 (03CR) 10Urbanecm: "> Patch Set 4: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629738 (https://phabricator.wikimedia.org/T262218) (owner: 10HitomiAkane)
[11:02:39] <wikibugs>	 10Operations, 10DBA, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10Marostegui) I have checked the previous week's backups (22nd Sept) to see if there was anything existing for any of the involved data, at least on the PK (...
[11:03:05] <kart_>	 Urbanecm: sure
[11:03:14] <wikibugs>	 (03PS4) 10Urbanecm: ContentTranslation: Do not use wikishared DB for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629371 (https://phabricator.wikimedia.org/T263417) (owner: 10KartikMistry)
[11:03:40] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] ContentTranslation: Do not use wikishared DB for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629371 (https://phabricator.wikimedia.org/T263417) (owner: 10KartikMistry)
[11:03:49] <Urbanecm>	 let's see then :)
[11:04:22] <wikibugs>	 (03Merged) 10jenkins-bot: ContentTranslation: Do not use wikishared DB for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629371 (https://phabricator.wikimedia.org/T263417) (owner: 10KartikMistry)
[11:06:04] <Urbanecm>	 kart_: pulled onto mwdebug2001, and...noticed an error
[11:06:18] <Urbanecm>	 in IS.php, you add wmgContentTranslationDatabase, not wgContentTranslationDatabase 
[11:06:54] <Urbanecm>	 pushing a fix
[11:07:33] <wikibugs>	 (03PS1) 10Urbanecm: Follow-up for 483beb2: wmgContentTranslationDatabase => wgContentTranslationDatabase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630565 (https://phabricator.wikimedia.org/T263417)
[11:07:42] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Follow-up for 483beb2: wmgContentTranslationDatabase => wgContentTranslationDatabase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630565 (https://phabricator.wikimedia.org/T263417) (owner: 10Urbanecm)
[11:07:50] <kart_>	 Urbanecm: ahaaa.
[11:08:30] <wikibugs>	 (03Merged) 10jenkins-bot: Follow-up for 483beb2: wmgContentTranslationDatabase => wgContentTranslationDatabase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630565 (https://phabricator.wikimedia.org/T263417) (owner: 10Urbanecm)
[11:09:19] <Urbanecm>	 kart_: pulled onto mwdebug2001, ready for your testing
[11:09:43] <Urbanecm>	 variables should be correct AFAICS kart_ https://usercontent.irccloud-cdn.com/file/BRXTBOri/image.png
[11:11:10] <kart_>	 hmm. Looks good.
[11:11:17] <kart_>	 Urbanecm: wait a sec.
[11:11:22] <Urbanecm>	 yes?
[11:11:34] <kart_>	 Urbanecm: need recheck.
[11:11:34] <wikibugs>	 (03CR) 10Volans: "Some first comment inline, I'll do some practical testing later on" (034 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel)
[11:11:42] <Urbanecm>	 kart_: sure, take your time :)
[11:13:01] <kart_>	 Urbanecm: yeah. Don't want to break CX anywhere else than testwiki ;)
[11:13:30] <Urbanecm>	 good point, we should test somewhere else too :)
[11:16:13] <kart_>	 Urbanecm: testwiki and wikishared look separate. Last test, on 'other WPs', is in progress. A few more minutes..
[11:17:35] <kart_>	 Urbanecm: OK. Looks good. CX on other wikis is saving content without issue. I also published an article on testwiki.
[11:18:04] <kart_>	 Urbanecm: Please go ahead.
[11:18:27] <Urbanecm>	 thanks, syncing
[11:20:20] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 483beb2452caead8c44dfb8e608812778033fba0: ContentTranslation: Do not use wikishared DB for testwiki (T263417; follow-up af09303a4a155681b198ac70468494c2155868df also included in this sync) (duration: 00m 57s)
[11:20:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:28] <stashbot>	 T263417: Exclude testwikis and private wikis from old unpublished CX draft purge script run - https://phabricator.wikimedia.org/T263417
[11:20:30] <Urbanecm>	 kart_: should be live :)
[11:20:43] <kart_>	 Urbanecm: great!
[11:26:25] <wikibugs>	 (03PS1) 10Ema: cache: upgrade Varnish to v6 in esams [puppet] - 10https://gerrit.wikimedia.org/r/630566 (https://phabricator.wikimedia.org/T263557)
[11:27:06] <wikibugs>	 (03PS1) 10Hnowlan: api-gateway: use TLS for restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/630567
[11:27:33] <wikibugs>	 (03PS2) 10Ema: cache: upgrade Varnish to v6 in esams [puppet] - 10https://gerrit.wikimedia.org/r/630566 (https://phabricator.wikimedia.org/T263557)
[11:28:02] <wikibugs>	 (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/630566 (https://phabricator.wikimedia.org/T263557) (owner: 10Ema)
[11:28:43] <wikibugs>	 (03PS1) 10Jbond: swift: convert roles to profiles, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956)
[11:29:22] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] api-gateway: use TLS for restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/630567 (owner: 10Hnowlan)
[11:29:24] <wikibugs>	 (03PS5) 10Urbanecm: Creation of patroller group on arz.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629738 (https://phabricator.wikimedia.org/T262218) (owner: 10HitomiAkane)
[11:29:26] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Creation of patroller group on arz.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629738 (https://phabricator.wikimedia.org/T262218) (owner: 10HitomiAkane)
[11:30:13] <wikibugs>	 (03Merged) 10jenkins-bot: Creation of patroller group on arz.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629738 (https://phabricator.wikimedia.org/T262218) (owner: 10HitomiAkane)
[11:34:28] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 61eac95ef62aef682039761e0f02188437cb15fb: Creation of patroller group on arz.wikipedia (T262218) (duration: 00m 57s)
[11:34:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:36] <stashbot>	 T262218: Creation of patroller group on arz.wikipedia - https://phabricator.wikimedia.org/T262218
[11:34:49] <Urbanecm>	 !log EU B&C window done
[11:34:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:18] <wikibugs>	 (03PS2) 10Jbond: swift: convert roles to profiles, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956)
[11:37:37] <wikibugs>	 (03PS2) 10Hnowlan: api-gateway: use TLS for restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/630567
[11:41:26] <wikibugs>	 (03PS7) 10Muehlenhoff: reboot-groups [cookbooks] - 10https://gerrit.wikimedia.org/r/625597
[11:41:40] <kart_>	 Urbanecm: around?
[11:41:44] <Urbanecm>	 yes
[11:41:48] <Urbanecm>	 what's up kart_ ?
[11:41:59] <kart_>	 Urbanecm: did you sync all files? Seems same DB issue is appearing :/
[11:42:14] <kart_>	 Urbanecm: it worked earlier or am I missing something..
[11:42:14] <Urbanecm>	 damn it, I forgot to sync CS.php
[11:42:16] <Urbanecm>	 mea culpa
[11:42:41] <Urbanecm>	 fixing
[11:42:51] <kart_>	 ah. NP. It is testwiki :D
[11:42:58] <wikibugs>	 (03PS8) 10Muehlenhoff: reboot-groups [cookbooks] - 10https://gerrit.wikimedia.org/r/625597
[11:43:09] <Urbanecm>	 hehe
[11:43:17] <wikibugs>	 (03CR) 10Muehlenhoff: reboot-groups (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 (owner: 10Muehlenhoff)
[11:43:36] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: 483beb2452caead8c44dfb8e608812778033fba0: ContentTranslation: Do not use wikishared DB for testwiki (T263417; follow-up af09303a4a155681b198ac70468494c2155868df also included in this sync) (duration: 00m 56s)
[11:43:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:42] <Urbanecm>	 kart_: can you check now, please?
[11:43:43] <stashbot>	 T263417: Exclude testwikis and private wikis from CX draft purge script and separate CX database on testwiki - https://phabricator.wikimedia.org/T263417
[11:44:41] <kart_>	 Urbanecm: sure
[11:45:13] <kart_>	 Urbanecm: yep. Works well now!
[11:45:24] <Urbanecm>	 Glad to hear that kart_ :)
[11:45:50] <kart_>	 Urbanecm: also my apologies. That debug extension's little icon needs a different color :)
[11:54:10] <wikibugs>	 (03PS8) 10Kormat: bsection: Script for binary-searching log files. [puppet] - 10https://gerrit.wikimedia.org/r/627841
[11:54:31] <wikibugs>	 (03CR) 10Kormat: bsection: Script for binary-searching log files. (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/627841 (owner: 10Kormat)
[11:57:14] <Urbanecm>	 jouncebot: next
[11:57:15] <jouncebot>	 In 0 hour(s) and 2 minute(s): Create a new wiki (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200928T1200)
[11:57:53] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+1] "This looks ok to me. I do hate the code for reversing present and absent, but making that nicer by way of better variable names or somethi" [puppet] - 10https://gerrit.wikimedia.org/r/462019 (https://phabricator.wikimedia.org/T125976) (owner: 10Thcipriani)
[11:59:05] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db1114 depooling: prep for rack switch upgrade T196487', diff saved to https://phabricator.wikimedia.org/P12815 and previous config saved to /var/cache/conftool/dbconfig/20200928-115904-kormat.json
[11:59:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:59:12] <stashbot>	 T196487: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487
[11:59:56] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[11:59:57] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[12:00:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:04] <jouncebot>	 Urbanecm and Amir1: I, the Bot under the Fountain, allow thee, The Deployer, to do Create a new wiki deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200928T1200).
[12:00:14] <Urbanecm>	 \o/
[12:02:53] <wikibugs>	 (03CR) 10ArielGlenn: "If this has been cherry-picked for so long on beta, maybe it can just be merged?" [puppet] - 10https://gerrit.wikimedia.org/r/462020 (https://phabricator.wikimedia.org/T125976) (owner: 10Thcipriani)
[12:04:56] <wikibugs>	 10Operations, 10Traffic, 10conftool, 10serviceops: confd's watch functionality appears to be partially broken when interacting with etcd 3.x - https://phabricator.wikimedia.org/T260889 (10ArielGlenn)
[12:06:49] <wikibugs>	 (03PS4) 10Urbanecm: Initial configuration for arbcom_ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627278 (https://phabricator.wikimedia.org/T262812) (owner: 10Majavah)
[12:06:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for arbcom_ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627278 (https://phabricator.wikimedia.org/T262812) (owner: 10Majavah)
[12:09:20] <wikibugs>	 (03PS5) 10Urbanecm: Initial configuration for arbcom_ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627278 (https://phabricator.wikimedia.org/T262812) (owner: 10Majavah)
[12:10:06] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Initial configuration for arbcom_ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627278 (https://phabricator.wikimedia.org/T262812) (owner: 10Majavah)
[12:10:31] <wikibugs>	 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10serviceops: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10ArielGlenn)
[12:10:50] <wikibugs>	 (03Merged) 10jenkins-bot: Initial configuration for arbcom_ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627278 (https://phabricator.wikimedia.org/T262812) (owner: 10Majavah)
[12:11:05] <wikibugs>	 (03PS1) 10Urbanecm: Revert "Initial configuration for arbcom_ruwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630414
[12:11:10] <wikibugs>	 (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Revert "Initial configuration for arbcom_ruwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630414 (owner: 10Urbanecm)
[12:13:04] <wikibugs>	 (03PS1) 10Urbanecm: Initial configuration for arbcom_ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630573 (https://phabricator.wikimedia.org/T262812)
[12:13:36] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Initial configuration for arbcom_ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630573 (https://phabricator.wikimedia.org/T262812) (owner: 10Urbanecm)
[12:14:19] <wikibugs>	 (03Merged) 10jenkins-bot: Initial configuration for arbcom_ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630573 (https://phabricator.wikimedia.org/T262812) (owner: 10Urbanecm)
[12:16:03] <wikibugs>	 10Operations, 10Traffic: cache_upload varnish-fe exhausting transient memory - https://phabricator.wikimedia.org/T249809 (10ArielGlenn)
[12:17:56] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/db-eqiad.php: Creating arbcom_ruwiki (T262812) (duration: 00m 56s)
[12:18:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:04] <stashbot>	 T262812: Create private arbcom-ru wiki - https://phabricator.wikimedia.org/T262812
[12:19:06] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/db-codfw.php: Creating arbcom_ruwiki (T262812) (duration: 00m 57s)
[12:19:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:05] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized dblists: Creating arbcom_ruwiki (T262812) (duration: 00m 57s)
[12:20:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:45] <logmsgbot>	 !log urbanecm@deploy1001 rebuilt and synchronized wikiversions files: Creating arbcom_ruwiki (T262812)
[12:21:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:55] <wikibugs>	 10Operations, 10SRE-Access-Requests: Add new SSH key for Sam Smith - https://phabricator.wikimedia.org/T263992 (10phuedx)
[12:22:47] <wikibugs>	 (03PS1) 10Jbond: tools: puppetdb reduce postgres memory usage [puppet] - 10https://gerrit.wikimedia.org/r/630574
[12:22:58] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized static/images/project-logos/: Creating arbcom_ruwiki (T262812) (duration: 00m 56s)
[12:23:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:05] <stashbot>	 T262812: Create private arbcom-ru wiki - https://phabricator.wikimedia.org/T262812
[12:24:17] <wikibugs>	 (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/25453/" [puppet] - 10https://gerrit.wikimedia.org/r/630574 (owner: 10Jbond)
[12:24:18] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Creating arbcom_ruwiki (T262812) (duration: 00m 56s)
[12:24:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:32] <wikibugs>	 (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630576
[12:24:34] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630576 (owner: 10Urbanecm)
[12:25:14] <wikibugs>	 (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630576 (owner: 10Urbanecm)
[12:26:03] <wikibugs>	 (03CR) 10Muehlenhoff: "Or maybe for all of cloud VPS, given that this also affects deployment-puppetdb03?" [puppet] - 10https://gerrit.wikimedia.org/r/630574 (owner: 10Jbond)
[12:26:13] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 01m 48s)
[12:26:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:31] <Urbanecm>	 !log arbcom_ruwiki is created (T262812)
[12:26:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:05] <wikibugs>	 10Operations, 10netops: Configure BGP route damping on Anycast sessions - https://phabricator.wikimedia.org/T262372 (10ayounsi) The issue is that `ss -lun | fgrep -q :10514` often take more than 2s to complete and we don't let it retry. As it happen regularly, it sometimes happen right after the bird restart,...
[12:28:27] <wikibugs>	 (03PS3) 10Jbond: swift: convert roles to profiles, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956)
[12:28:40] <Urbanecm>	 !log [urbanecm@mwmaint2001 ~]$ mwscript createAndPromote.php --wiki=arbcom_ruwiki --bureaucrat --sysop 'Adamant.pwn' <PASSWORD REDACTED> # T262812
[12:28:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:48] <stashbot>	 T262812: Create private arbcom-ru wiki - https://phabricator.wikimedia.org/T262812
[12:28:51] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: bump Swift object replicator concurrency [puppet] - 10https://gerrit.wikimedia.org/r/629082 (https://phabricator.wikimedia.org/T261633) (owner: 10Filippo Giunchedi)
[12:29:24] <Urbanecm>	 !log [urbanecm@mwmaint2001 ~]$ mwscript resetUserEmail.php --wiki=arbcom_ruwiki 'Adamant.pwn' 'adamant.pwn@hotmail.com' # T262812
[12:29:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:01] <logmsgbot>	 !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' .
[12:31:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:57] <wikibugs>	 10Operations, 10Traffic, 10Wikimedia-Incident: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 (10ArielGlenn)
[12:35:39] <wikibugs>	 10Operations, 10Traffic, 10serviceops, 10Performance-Team (Radar), 10Sustainability: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 (10ArielGlenn)
[12:37:22] <logmsgbot>	 !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'zotero' for release 'production' .
[12:37:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:21] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:42:55] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:44:10] <wikibugs>	 10Operations, 10Citoid, 10Wikimedia-Logstash, 10observability, and 2 others: Move citoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219919 (10Mvolz) >>! In T219919#6478632, @fgiunchedi wrote: > It looks like citoid is now on k8s but still using gelf for logging, possibly the easie...
[12:44:20] <wikibugs>	 (03PS4) 10Jbond: swift: convert roles to profiles, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956)
[12:44:50] <wikibugs>	 10Operations, 10DBA, 10Sustainability (Incident Followup), 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10akosiaris) A preliminary incident report is at https://wikitech.wikimedia.org/wiki/Incident_documentation/2020...
[12:47:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Have the puppetised sources.list depend on the wikimedia repository [puppet] - 10https://gerrit.wikimedia.org/r/630179 (https://phabricator.wikimedia.org/T158562) (owner: 10Muehlenhoff)
[12:49:47] <wikibugs>	 (03PS1) 10Elukey: profile::analytics::cluster::packages::statistics: remove stretch bits [puppet] - 10https://gerrit.wikimedia.org/r/630578 (https://phabricator.wikimedia.org/T255028)
[12:51:53] <wikibugs>	 (03PS5) 10Jbond: swift: convert roles to profiles, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956)
[12:52:13] <wikibugs>	 (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/25457/stat1007.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/630578 (https://phabricator.wikimedia.org/T255028) (owner: 10Elukey)
[12:54:42] <wikibugs>	 10Operations, 10OTRS, 10vm-requests: Decommission mendelevium - https://phabricator.wikimedia.org/T263993 (10akosiaris)
[12:54:50] <wikibugs>	 (03CR) 10Muehlenhoff: "Looks good to me, it's worth pointing out that despite being recently reimaged to Buster per debmonitor stat1004-1007 _do_ have python3-go" [puppet] - 10https://gerrit.wikimedia.org/r/630578 (https://phabricator.wikimedia.org/T255028) (owner: 10Elukey)
[12:55:58] <wikibugs>	 10Operations, 10OTRS: Migrate mendelevium/OTRS host to Buster - https://phabricator.wikimedia.org/T224590 (10akosiaris) 05Open→03Resolved a:03akosiaris mendelevium is powered off and decomissioning is tracked at T263993. I 'll resolve this task.
[12:56:01] <wikibugs>	 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10akosiaris)
[12:57:44] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.hosts.decommission
[12:57:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:59:40] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Remove mendelevium [puppet] - 10https://gerrit.wikimedia.org/r/630581 (https://phabricator.wikimedia.org/T263993)
[13:00:16] <wikibugs>	 (03PS6) 10Jbond: swift: convert roles to profiles, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956)
[13:03:31] <logmsgbot>	 !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[13:03:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:39] <wikibugs>	 10Operations, 10OTRS, 10vm-requests, 10Patch-For-Review: Decommission mendelevium - https://phabricator.wikimedia.org/T263993 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by akosiaris@cumin1001 for hosts: `mendelevium.eqiad.wmnet` - mendelevium.eqiad.wmnet (**WARN**)   - **Failed downti...
[13:04:30] <volans>	 akosiaris: what failed in the dns part of the decom cookbook?
[13:04:49] <akosiaris>	 Failed to run the sre.dns.netbox cookbook
[13:04:55] <akosiaris>	 Generating the DNS records from Netbox data. It will take a couple of minutes.
[13:05:03] <akosiaris>	 want the stacktrace?
[13:05:08] <volans>	 yeah
[13:05:42] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTMm thanks for the fixes" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/627841 (owner: 10Kormat)
[13:06:02] <akosiaris>	 volans: https://phabricator.wikimedia.org/P12817
[13:06:08] <volans>	 thanks!
[13:06:53] <wikibugs>	 (03PS7) 10Jbond: swift: convert roles to profiles, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956)
[13:07:37] <wikibugs>	 (03CR) 10Jbond: "PCC full diff shows no real changes" [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond)
[13:10:09] <wikibugs>	 (03CR) 10Jbond: "This has got rather big and possibly needs breaking into smaller chunks.  The change should be a no-op it mainly moves parameters that are" [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond)
[13:12:07] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10netops, and 2 others: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (10Cmjohnson) @ayounsi I am not able to get the console to work on the new switch, it's plugged in, I verfied it worked by connecting to the current asw in d4 and get th...
[13:12:08] <wikibugs>	 (03CR) 10Jbond: "Thanks for going tot he effort of trying to untangle the swift automatic parameter lookups however there are a quite a few other places th" [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond)
[13:12:41] <volans>	 akosiaris: interesting, can't repro it...
[13:12:48] <volans>	 I'll run the sre.dns.netbox cookbook for you
[13:12:57] <wikibugs>	 (03CR) 10Jbond: "Thanks for going tot he effort of trying to untangle the swift automatic parameter lookups however there are a quite a few other places th" [puppet] - 10https://gerrit.wikimedia.org/r/628970 (owner: 10Dzahn)
[13:13:25] <wikibugs>	 (03CR) 10Jbond: "The comment above should have been made on the original change:" [puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond)
[13:14:33] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/627967 (owner: 10Dzahn)
[13:15:28] <akosiaris>	 volans: thanks. Any idea what caused it? that exit_code=2 didn't help much
[13:15:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/627970 (owner: 10Dzahn)
[13:16:23] <wikibugs>	 (03CR) 10Alexandros Kosiaris: service::configuration: connect to restbase via TLS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630562 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto)
[13:16:26] <volans>	 unfortunately not as the log of the remote script was to stdout via ssh that gets discarded (temporarily?)
[13:17:07] <volans>	 as we can now re-enable it, but it might be noisy with some/many cookbooks? But very soon we will be able to decide, on a per-command-run basis, whether we want it or not from the cookbooks
[13:18:28] <wikibugs>	 (03PS5) 10Jbond: base/monitoring: move monitor_screens to proper profile parameter [puppet] - 10https://gerrit.wikimedia.org/r/628152 (owner: 10Dzahn)
[13:18:53] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 4:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/628152 (owner: 10Dzahn)
[13:19:15] <moritzm>	 !log reimaging sretest1001 to validate puppetised sources.list with a new installation T158562
[13:19:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:23] <stashbot>	 T158562: Manage apt sources via puppet - https://phabricator.wikimedia.org/T158562
[13:20:01] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[13:20:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:23] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "LGTM merging" [puppet] - 10https://gerrit.wikimedia.org/r/628152 (owner: 10Dzahn)
[13:25:25] <volans>	 akosiaris: and to be clear, for now you still need the manual dns patch anyway
[13:25:46] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:25:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:53] <akosiaris>	 yeah, I was about to. Thanks!
[13:26:55] <wikibugs>	 (03CR) 10Ema: [C: 03+1] geoip VCL: add a 'which' param to get_geo_xcip (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630315 (https://phabricator.wikimedia.org/T263496) (owner: 10CDanis)
[13:27:01] <wikibugs>	 (03CR) 10Ema: [C: 03+1] geoip VCL: init/free functions are now reusable [puppet] - 10https://gerrit.wikimedia.org/r/630314 (https://phabricator.wikimedia.org/T263496) (owner: 10CDanis)
[13:29:37] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/627967 (owner: 10Dzahn)
[13:32:15] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove mendelevium [puppet] - 10https://gerrit.wikimedia.org/r/630581 (https://phabricator.wikimedia.org/T263993) (owner: 10Alexandros Kosiaris)
[13:32:55] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[13:32:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:11] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Remove mendelevium [dns] - 10https://gerrit.wikimedia.org/r/630588 (https://phabricator.wikimedia.org/T263993)
[13:34:45] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove mendelevium [dns] - 10https://gerrit.wikimedia.org/r/630588 (https://phabricator.wikimedia.org/T263993) (owner: 10Alexandros Kosiaris)
[13:35:00] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[13:35:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:25] <wikibugs>	 (03PS1) 10Jbond: labstore::nfs_mount: drop support for empty string share_path [puppet] - 10https://gerrit.wikimedia.org/r/630589
[13:38:10] <godog>	 !log roll restart object-replicator on ms-be2* for higher concurrency - T261633
[13:38:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:17] <stashbot>	 T261633: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633
[13:40:10] <wikibugs>	 (03PS1) 10Volans: homer: fix live config check [puppet] - 10https://gerrit.wikimedia.org/r/630590
[13:41:11] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] homer: fix live config check [puppet] - 10https://gerrit.wikimedia.org/r/630590 (owner: 10Volans)
[13:41:31] <wikibugs>	 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff)
[13:42:47] <wikibugs>	 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff)
[13:42:52] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[13:42:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:54] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] bsection: Script for binary-searching log files. [puppet] - 10https://gerrit.wikimedia.org/r/627841 (owner: 10Kormat)
[13:45:03] <wikibugs>	 (03PS2) 10Volans: homer: fix live config check [puppet] - 10https://gerrit.wikimedia.org/r/630590
[13:45:09] <XioNoX>	 !log downtiming all eqiad row D hosts - T196487
[13:45:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:16] <stashbot>	 T196487: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487
[13:45:24] <XioNoX>	 only as I can't just downtime a rack worth of hosts
[13:46:43] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=aqs1006.eqiad.wmnet
[13:46:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:31] <wikibugs>	 10Operations, 10Wikispeech-Jobrunner, 10Wikispeech-Text-to-Speech, 10Wikispeech-WMSE: TTS server deployment strategy - https://phabricator.wikimedia.org/T193072 (10Lokal_Profil) 05Stalled→03Invalid The Speechoid service has been changed to use Blubber. A separate task will be set up to track deployment...
[13:47:48] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] homer: fix live config check [puppet] - 10https://gerrit.wikimedia.org/r/630590 (owner: 10Volans)
[13:48:07] <wikibugs>	 (03CR) 10Volans: [C: 03+2] homer: fix live config check [puppet] - 10https://gerrit.wikimedia.org/r/630590 (owner: 10Volans)
[13:51:59] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:52:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:56] <wikibugs>	 (03PS1) 10Kormat: WIP bsection: pull out binary search to separate function. [puppet] - 10https://gerrit.wikimedia.org/r/630596
[13:58:45] <XioNoX>	 !log asw2-d-eqiad# run request system power-off member 4
[13:58:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:47] <wikibugs>	 (03PS1) 10CDanis: eventgate-logging-external-tls-proxy: bump CPU up [deployment-charts] - 10https://gerrit.wikimedia.org/r/630597 (https://phabricator.wikimedia.org/T257527)
[14:00:49] <wikibugs>	 (03PS1) 10Elukey: install_server: set Debian buster for an-worker1101 [puppet] - 10https://gerrit.wikimedia.org/r/630598
[14:01:28] <wikibugs>	 (03Abandoned) 10Elukey: install_server: set Debian buster for an-worker1101 [puppet] - 10https://gerrit.wikimedia.org/r/630598 (owner: 10Elukey)
[14:02:08] <wikibugs>	 10Operations, 10Product-Infrastructure-Data, 10Epic, 10Goal, 10Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10CDanis)
[14:02:54] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[14:02:54] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[14:03:20] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Moni
[14:03:36] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) timed out before a response was received: /analytics.wikimedia.org/v1/unique-devices/{project}/{access-site}/{granularity}/{start}/{end} (Get unique devices) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top/{project}/{
[14:03:36] <icinga-wm>	 onth}/{day} (Get top page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:03:41] <wikibugs>	 (03PS2) 10Jbond: labstore::nfs_mount: drop support for empty string share_path [puppet] - 10https://gerrit.wikimedia.org/r/630589
[14:03:46] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:03:52] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/unique-devices/{project}/{access-site}/{granularity}/{start}/{end} (Get unique devices) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{
[14:03:52] <icinga-wm>	 site}/{granularity}/{start}/{end} (Get pagecounts) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:03:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:04:38] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:04:42] <icinga-wm>	 PROBLEM - Juniper virtual chassis ports on asw2-d-eqiad is CRITICAL: CRIT: Down: 2 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[14:05:28] <moritzm>	 !log uploaded libdbi-perl 1.631-3+wmf1 for jessie-wikimedia T259102
[14:05:30] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[14:05:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:46] <elukey>	 XioNoX: it is only d4's tor down right?
[14:05:54] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, I agree that not removing the break logic too doesn't improve that much.And to do that raising an exception would require to indent " [puppet] - 10https://gerrit.wikimedia.org/r/630596 (owner: 10Kormat)
[14:06:23] <XioNoX>	 elukey: yep
[14:06:24] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:06:26] <wikibugs>	 (03Abandoned) 10Kormat: WIP bsection: pull out binary search to separate function. [puppet] - 10https://gerrit.wikimedia.org/r/630596 (owner: 10Kormat)
[14:06:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:06:44] <icinga-wm>	 PROBLEM - Aggregate IPsec Tunnel Status codfw on alert1001 is CRITICAL: instance={mc2033,mc2034,mc2035,mc2036} site=codfw tunnel={mc1033_v4,mc1034_v4,mc1035_v4,mc1036_v4} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[14:06:50] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={pdu_sentry4,swagger_check_citoid_cluster_codfw,swagger_check_wikifeeds_codfw} site={codfw,eqsin} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:07:42] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:07:52] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:08:21] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[14:09:50] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[14:10:33] <elukey>	 so aqs1006 is in d4 (two cassandra instances) and it caused some read timeouts for the aqs service, that in turn caused timeouts for wikifeeds
[14:11:14] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:11:34] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] api-gateway: add routing for static and other components [deployment-charts] - 10https://gerrit.wikimedia.org/r/628408 (https://phabricator.wikimedia.org/T263045) (owner: 10Hnowlan)
[14:12:14] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me
[14:12:46] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[14:13:38] <elukey>	 this --^ was probably due to the elastic node in d4 now unreachable
[14:13:54] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: add routing for static and other components [deployment-charts] - 10https://gerrit.wikimedia.org/r/628408 (https://phabricator.wikimedia.org/T263045) (owner: 10Hnowlan)
[14:14:05] <kormat>	 elukey: will elastic.. snap back?
[14:14:52] * elukey answers to kormat using /dev/urandom
[14:15:22] * Reedy squints
[14:16:28] <icinga-wm>	 RECOVERY - Juniper virtual chassis ports on asw2-d-eqiad is OK: OK: UP: 24 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[14:20:14] <icinga-wm>	 RECOVERY - Aggregate IPsec Tunnel Status codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[14:21:56] <wikibugs>	 10Operations, 10observability, 10serviceops: Strongswan Icinga check: do not report issues about depooled hosts - https://phabricator.wikimedia.org/T148976 (10ema) >>! In T148976#6488108, @BBlack wrote: > This was mostly about cache nodes back when those had ipsec, I think.  The remaining case that uses ipse...
[14:22:52] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:23:30] <wikibugs>	 10Operations, 10DBA, 10Sustainability (Incident Followup), 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10Marostegui) The sequence of events within the transaction that failed is interesting and it definitely didn't...
[14:23:45] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10netops, and 2 others: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (10ayounsi)
[14:24:16] <wikibugs>	 10Operations, 10Patch-For-Review, 10User-jbond: Manage apt sources via puppet - https://phabricator.wikimedia.org/T158562 (10MoritzMuehlenhoff) >>! In T158562#6494038, @MoritzMuehlenhoff wrote: > I did a test installation with the new setting as I had a hunch there would be issues in early install and turns...
[14:25:21] <wikibugs>	 (03CR) 10Ppchelko: [C: 03+1] api-gateway: use TLS for restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/630567 (owner: 10Hnowlan)
[14:27:17] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[14:27:18] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[14:27:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:43] <icinga-wm>	 ACKNOWLEDGEMENT - mcrouter process on mwdebug1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (mcrouter), command name mcrouter Effie Mouzeli testing https://wikitech.wikimedia.org/wiki/Mcrouter
[14:29:58] <wikibugs>	 (03CR) 10Mholloway: [C: 03+1] wikifeeds: use the service proxy for reaching the MediaWiki api [deployment-charts] - 10https://gerrit.wikimedia.org/r/628756 (https://phabricator.wikimedia.org/T255878) (owner: 10Giuseppe Lavagetto)
[14:32:16] <wikibugs>	 (03PS2) 10Ayounsi: Revert "Depool eqiad for row D recabling" [dns] - 10https://gerrit.wikimedia.org/r/629519
[14:33:36] <XioNoX>	 !log repool eqiad
[14:33:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:48] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1099.eqiad.wmnet', 'an-wor...
[14:40:24] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=aqs1006.eqiad.wmnet
[14:40:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:31] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] eventgate-logging-external-tls-proxy: bump CPU up [deployment-charts] - 10https://gerrit.wikimedia.org/r/630597 (https://phabricator.wikimedia.org/T257527) (owner: 10CDanis)
[14:42:31] <wikibugs>	 10Operations, 10ops-eqiad, 10decommission-hardware: Decommission old-asw-d4-eqiad (ex4300) - https://phabricator.wikimedia.org/T264001 (10Cmjohnson)
[14:43:01] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): WDQS Categories reload is failing on thankyouwiki - https://phabricator.wikimedia.org/T261097 (10Gehel) 05Open→03Resolved
[14:43:15] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] eventgate-logging-external-tls-proxy: bump CPU up [deployment-charts] - 10https://gerrit.wikimedia.org/r/630597 (https://phabricator.wikimedia.org/T257527) (owner: 10CDanis)
[14:43:48] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Papaul) @Kormat tell @Marostegui to not break the host again :)
[14:44:10] <kormat>	 lol
[14:44:39] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[14:44:39] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[14:44:41] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Access to Superset - https://phabricator.wikimedia.org/T263868 (10JVargas) This is approved on my end, @ArielGlenn. Let me know if you need anything else from me. Thanks!
[14:44:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:44] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2014.codfw.wmnet - https://phabricator.wikimedia.org/T262889 (10Papaul)
[14:44:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:03] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Cleanup scap::sources from some old objects [puppet] - 10https://gerrit.wikimedia.org/r/630604
[14:45:38] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at codfw on alert1001 is CRITICAL: 49.13 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[14:45:52] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2014.codfw.wmnet - https://phabricator.wikimedia.org/T262889 (10Papaul) 05Open→03Resolved Complete
[14:45:54] <wikibugs>	 10Operations: Integrate Buster 10.5 point release - https://phabricator.wikimedia.org/T259519 (10MoritzMuehlenhoff)
[14:47:02] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Cleanup scap::sources from some old objects [puppet] - 10https://gerrit.wikimedia.org/r/630604 (owner: 10Alexandros Kosiaris)
[14:48:46] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) >>! In T260670#6498765, @Papaul wrote: > @Kormat tell @Marostegui to not break the host again :)  hahah - reminder: you are the on...
[14:49:43] <moritzm>	 !log installing glib-networking security updates
[14:49:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:50:05] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "PCC at https://puppet-compiler.wmflabs.org/compiler1001/25460/" [puppet] - 10https://gerrit.wikimedia.org/r/630562 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto)
[14:51:13] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:57:14] <wikibugs>	 10Operations, 10Product-Infrastructure-Team-Backlog, 10Proton: proton experienced a period of high CPU usage, busy queue, lockups - https://phabricator.wikimedia.org/T214975 (10akosiaris) 05Open→03Invalid This is close to 15months old, and the service has been moved to kubernetes in the meantime, so most...
[14:58:23] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:59:18] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime
[14:59:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:25] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:00:32] <kormat>	 XioNoX: maybe coincidence, but db1114 (in row D) just lost net connectivity
[15:00:47] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime
[15:00:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:19] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:01:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:41] <kormat>	 XioNoX: and it just came back, 4mins later
[15:02:07] <wikibugs>	 10Operations, 10DBA, 10Sustainability (Incident Followup), 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10jcrespo) These are the logs of the blocks (2 inserts and 1 update?) the timestamps would be close to (but not...
[15:02:40] <XioNoX>	 cmjohnson1: can you sync up with kormat to replace the SFP-T on ge-4/0/34 ?
[15:03:02] <XioNoX>	 kormat: yeah I see it in the logs..
[15:03:16] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:03:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:07] <cmjohnson1>	 kormat ...okay to replace now?
[15:07:21] <kormat>	 cmjohnson1: yep, go for it
[15:08:03] <cmjohnson1>	 kormat done
[15:08:30] <kormat>	 cmjohnson1: great, thanks!
[15:08:46] <wikibugs>	 10Operations, 10observability, 10serviceops, 10Platform Team Initiatives (API Gateway): mtail 3.0.0-rc35 doesn't support the histogram type in -oneshot mode. - https://phabricator.wikimedia.org/T263728 (10colewhite) a:03colewhite
[15:08:47] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[15:08:47] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[15:08:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:30] <wikibugs>	 10Operations, 10ops-eqiad, 10decommission-hardware: Decommission old-asw-d4-eqiad (ex4300) - https://phabricator.wikimedia.org/T264001 (10Cmjohnson) 05Open→03Resolved wiped, removed from rack updated netbox
[15:10:26] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/623666 (owner: 10Dzahn)
[15:11:12] <wikibugs>	 10Operations, 10DBA, 10Sustainability (Incident Followup), 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10Marostegui) Yes, they are those, this is the order of events on the binlog for the ipblock table on that IP th...
[15:11:44] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:12:43] <logmsgbot>	 !log cdanis@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
[15:12:43] <logmsgbot>	 !log cdanis@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
[15:12:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:00] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:13:23] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[15:13:23] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[15:13:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:36] <wikibugs>	 10Operations, 10Data-Services, 10SRE-Access-Requests, 10cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (10nskaggs) I'm trying as a non-global root wmcs admin; here's what I get:   ` $ secure-cookbook -d w...
[15:14:37] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2012.codfw.wmnet - https://phabricator.wikimedia.org/T263613 (10Papaul)
[15:14:51] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2012.codfw.wmnet - https://phabricator.wikimedia.org/T263613 (10Papaul) 05Open→03Resolved Complete
[15:15:08] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[15:15:22] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[15:15:23] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[15:15:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:17] <wikibugs>	 10Operations: Integrate Buster 10.5 point release - https://phabricator.wikimedia.org/T259519 (10MoritzMuehlenhoff)
[15:20:50] <wikibugs>	 (03PS2) 10Jbond: tools: puppetdb reduce postgres memory usage [puppet] - 10https://gerrit.wikimedia.org/r/630574
[15:21:24] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:22:16] <wikibugs>	 10Operations, 10ops-codfw: elastic2037 DIMM errors logged in racadm getsel - https://phabricator.wikimedia.org/T263714 (10Papaul) @elukey just got confirmation the part arrived. If the host is not depool yet please depool and power off, time for me to go and pick up the part.  Thaanks
[15:22:36] <wikibugs>	 (03Abandoned) 10Hnowlan: api-gateway: Fall through to the appservers if a route isn't matched [deployment-charts] - 10https://gerrit.wikimedia.org/r/628772 (https://phabricator.wikimedia.org/T263045) (owner: 10Hnowlan)
[15:22:52] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/630574 (owner: 10Jbond)
[15:23:12] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10Cmjohnson)
[15:24:02] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10Cmjohnson) a:05Cmjohnson→03RobH @RobH the new ssds have been installed to these servers, I appreciate you fixing the raid and...
[15:25:41] <hashar>	 !log Restarting CI Jenkins for plugins uninstallation T260565
[15:25:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:00] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:26:36] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Repool db1114 T196487', diff saved to https://phabricator.wikimedia.org/P12818 and previous config saved to /var/cache/conftool/dbconfig/20200928-152635-kormat.json
[15:26:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:41] <stashbot>	 T196487: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487
[15:27:15] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] push-notifications: change version tag to -production [deployment-charts] - 10https://gerrit.wikimedia.org/r/628340 (https://phabricator.wikimedia.org/T256973) (owner: 10MSantos)
[15:27:19] <wikibugs>	 10Operations, 10Data-Services, 10SRE-Access-Requests, 10cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (10jbond) >>! In T261145#6498898, @nskaggs wrote: > I'm trying as a non-global root wmcs admin; here'...
[15:27:57] <wikibugs>	 (03PS2) 10Jdlrobson: Enable search in header A/B test for logged in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630206 (https://phabricator.wikimedia.org/T263032)
[15:30:38] <wikibugs>	 (03PS1) 10Kormat: bsection: Change exit code semantics [puppet] - 10https://gerrit.wikimedia.org/r/630622
[15:31:53] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/630622 (owner: 10Kormat)
[15:33:21] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] bsection: Change exit code semantics [puppet] - 10https://gerrit.wikimedia.org/r/630622 (owner: 10Kormat)
[15:35:59] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1099.eqiad.wmnet', 'an-worker1100.eqiad.wmnet'] `  and were **ALL** successf...
[15:39:57] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] restbase: add restbase102[89]/restbase1030 [puppet] - 10https://gerrit.wikimedia.org/r/630106 (https://phabricator.wikimedia.org/T261512) (owner: 10Hnowlan)
[15:41:01] <icinga-wm>	 PROBLEM - Host labweb1002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:41:12] <icinga-wm>	 PROBLEM - Host puppetmaster1002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:41:22] <volans>	 got paged,
[15:41:24] <robh>	 that paged
[15:41:30] <apergos>	 huh
[15:41:42] <robh>	 is labweb1002 downtime expected?
[15:41:48] <volans>	 puppetmaster1002 is in D4, anyway related to current WIP?
[15:41:49] <stashbot>	 D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4
[15:42:00] <volans>	 same for labweb1002
[15:42:05] <robh>	 oh, is the switch swap today?  that would be it
[15:42:08] <_joe_>	 I thought we were done with maintenance?
[15:42:11] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10Cmjohnson)
[15:42:20] <robh>	 there is a swap of a switch in eqiad just not sure what day was on meeting notes
[15:42:21] <robh>	 checking
[15:42:22] <apergos>	 I did not get paged because sometime between yesterday evening and now my phone powered off but is fully charged, how nice :-/
[15:42:34] <XioNoX>	 wtf
[15:42:59] <XioNoX>	 ok, I think some or many of the SFP-Ts are faulty
[15:43:05] <XioNoX>	 it's not just the db host
[15:43:07] <godog>	 mhhh both hosts are down since 45min, I guess expired downtimes
[15:43:10] <XioNoX>	 cmjohnson1: you're around?
[15:43:27] <cmjohnson1>	 yes
[15:43:29] <_joe_>	 we need to at least depool the puppetmaster?
[15:43:36] <robh>	 row d recable is on thursday so yeah, just closing that loop that its not that ;D
[15:43:39] <_joe_>	 jbond42: ^^ not sure if that was done
[15:43:48] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Papaul) a:05Papaul→03Marostegui just after 1 month we received this server, we have already a bad disk.  Disk replaced.
[15:44:05] <XioNoX>	 cmjohnson1: can you check/replace the SFP-T on ge-4/0/10 ?
[15:44:09] <cmjohnson1>	 yes
[15:44:12] <XioNoX>	 cmjohnson1: and ge-4/0/26
[15:44:19] <cmjohnson1>	 ok
[15:44:22] <XioNoX>	 thx
[15:45:26] <wikibugs>	 10Operations, 10ops-codfw, 10decommission-hardware: decommission es2018.codfw.wmnet - https://phabricator.wikimedia.org/T263615 (10Papaul)
[15:45:30] <wikibugs>	 10Operations, 10ops-codfw, 10decommission-hardware: decommission es2018.codfw.wmnet - https://phabricator.wikimedia.org/T263615 (10Papaul) 05Open→03Resolved complete
[15:45:57] <wikibugs>	 (03CR) 10Jbond: "Taken another pass" (039 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 (owner: 10Muehlenhoff)
[15:46:56] <jbond42>	 jouncebot:  looking now
[15:47:00] <icinga-wm>	 RECOVERY - Host puppetmaster1002 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[15:47:01] <wikibugs>	 10Operations, 10Data-Services, 10SRE-Access-Requests, 10cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (10nskaggs) Ahh, thanks jbond. Trying with sudo, I don't have passwordless sudo on that machine.
[15:47:03] <XioNoX>	 ge-4/0/10       up    up   puppetmaster1002
[15:47:05] <XioNoX>	 yep
[15:47:12] <XioNoX>	 ge-4/0/26       up    down labweb1002
[15:47:12] <XioNoX>	  is next
[15:47:23] <icinga-wm>	 RECOVERY - Host labweb1002 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[15:47:25] <_joe_>	 jbond42: it just came back
[15:47:28] <XioNoX>	 ok, back up
[15:47:30] <icinga-wm>	 PROBLEM - Check systemd state on labweb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:47:39] <cmjohnson1>	 done
[15:47:42] <icinga-wm>	 PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: /var/lib/mysql 2362 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-coord1001&var-datasource=eqiad+prometheus/ops
[15:47:46] <jbond42>	 ack, thanks. For the record, it didn't get depooled
[15:48:48] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10cloud-services-team (Kanban): wikitech-static access for Sam Reed - https://phabricator.wikimedia.org/T262468 (10Reedy) >>! In T262468#6497715, @ArielGlenn wrote: > Hey @Reedy... is ths all set? Can we resolve the task? (If it's done, please add a on line summary of how...
[15:49:06] <icinga-wm>	 RECOVERY - Check systemd state on labweb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:49:46] <XioNoX>	 I'll monitor the logs and the switch port for more failures
[15:50:08] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Marostegui) Thanks @Papaul is the disk blinking there? I still don't see it on the OS.
[15:51:55] <papaul>	 !log poweroff elastic2037 for DIMM replacing 
[15:52:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:50] <papaul>	 marostegui: yes the disk is blinking if you want i can remove it and put it back again
[15:53:01] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Marostegui) ` Time: Mon Sep 28 15:39:48 2020 Event Description: PD 02(e0x20/s2) Path 500056b34b011fc2  reset (Type 03) Time: Mon Sep 28 15:39:48 2020 Event Description: Removed: PD 02(e0x20/s2) Time: Mon...
[15:53:09] <marostegui>	 papaul: haha, I just wrote that on phab. Great minds think alike
[15:53:19] <wikibugs>	 (03PS1) 10Cmjohnson: Adding db1150 to site.pp insetup role and dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/630631 (https://phabricator.wikimedia.org/T260817)
[15:53:49] <wikibugs>	 10Operations, 10Data-Services, 10SRE-Access-Requests, 10cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (10jbond) @nskaggs which server are you testing on?  things look good on cumin1001   ` lang=shell sud...
[15:55:05] <wikibugs>	 (03CR) 10Cmjohnson: [C: 03+2] Adding db1150 to site.pp insetup role and dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/630631 (https://phabricator.wikimedia.org/T260817) (owner: 10Cmjohnson)
[15:55:21] <papaul>	 marostegui: done
[15:55:29] <marostegui>	 papaul: checking
[15:55:51] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10Cmjohnson)
[15:57:15] <marostegui>	 papaul: same error: Enclosure PD 20(c None/p1) phy bad for slot 2. Maybe the disk is bad?
[15:57:19] <marostegui>	 let me check the HW logs
[15:58:40] <icinga-wm>	 PROBLEM - Host elastic2037.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:59:17] <wikibugs>	 10Operations, 10Puppet: unbound variable error when calling puppet-merge script with an explicit treeish - https://phabricator.wikimedia.org/T264014 (10CDanis)
[15:59:22] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` db1150.eqiad.wm...
[16:01:16] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By:  2020-09-30) rack/setup/install frmx2001.frack.codfw.wmnet, frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T260183 (10Jgreen)
[16:03:11] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Marostegui) @Papaul after putting the disk back in, I am seeing the same errors on the controller: ` [1764225.764609] megaraid_sas 0000:af:00.0: 1103 (654623787s/0x0004/CRIT) - Enclosure PD 20(c None/p1)...
[16:03:54] <icinga-wm>	 RECOVERY - Host elastic2037.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.98 ms
[16:04:46] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By:  2020-09-30) rack/setup/install frmx2001.frack.codfw.wmnet, frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T260183 (10Jgreen)
[16:07:48] <wikibugs>	 (03PS14) 10Jbond: puppet-merge: add Repository class [puppet] - 10https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249)
[16:08:02] <XioNoX>	 !log push pfw policies - T264013
[16:08:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:42] <icinga-wm>	 RECOVERY - Host elastic2037 is UP: PING OK - Packet loss = 0%, RTA = 31.83 ms
[16:08:56] <hnowlan>	 !log reimaging new restbase hosts - restbase1028, restbase1029, restbase1030
[16:08:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:16] <wikibugs>	 (03CR) 10Jbond: "Rebased, ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249) (owner: 10Jbond)
[16:10:49] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Set up db1150 as the new buster source backups host [puppet] - 10https://gerrit.wikimedia.org/r/630634 (https://phabricator.wikimedia.org/T257551)
[16:11:24] <wikibugs>	 10Operations, 10ops-codfw: elastic2037 DIMM errors logged in racadm getsel - https://phabricator.wikimedia.org/T263714 (10Papaul) 05Open→03Resolved DiMM B1 replaced. All good now.
[16:12:23] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Set up db1150 as the new buster source backups host [puppet] - 10https://gerrit.wikimedia.org/r/630634 (https://phabricator.wikimedia.org/T257551)
[16:13:56] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2056 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:14:22] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Set up db1150 as the new buster source backups host [puppet] - 10https://gerrit.wikimedia.org/r/630634 (https://phabricator.wikimedia.org/T257551)
[16:17:03] <wikibugs>	 (03CR) 10Jcrespo: "If this can be merged this week, after installed by DC ops (ongoing), there will be no work left for next week." [puppet] - 10https://gerrit.wikimedia.org/r/630634 (https://phabricator.wikimedia.org/T257551) (owner: 10Jcrespo)
[16:17:25] <wikibugs>	 (03PS1) 10Elukey: admin: add journactl perms to airflow-search-admins [puppet] - 10https://gerrit.wikimedia.org/r/630635
[16:19:22] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10cloud-services-team (Kanban): wikitech-static access for Sam Reed - https://phabricator.wikimedia.org/T262468 (10Reedy) a:03Andrew Andrew passed on the password via telegram (to my phone number).
[16:20:00] <logmsgbot>	 !log cdanis@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
[16:20:00] <logmsgbot>	 !log cdanis@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
[16:20:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:34] <logmsgbot>	 !log nskaggs@cumin1001 START - Cookbook wmcs.wikireplicas.add_wiki
[16:20:36] <logmsgbot>	 !log nskaggs@cumin1001 END (FAIL) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=99)
[16:20:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:26] <wikibugs>	 10Operations, 10OTRS, 10vm-requests: Decommission mendelevium - https://phabricator.wikimedia.org/T263993 (10akosiaris) 05Open→03Resolved
[16:21:30] <wikibugs>	 10Operations, 10Puppet: unbound variable error when calling puppet-merge script with an explicit treeish - https://phabricator.wikimedia.org/T264014 (10jbond) I took a quick look and it seems the issue is caused when FETCH_HEAD_OR_EMPTY is empty which causes puppet-merge.py to get called with two positional ar...
[16:22:31] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: proton: remove conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/627859 (https://phabricator.wikimedia.org/T255877) (owner: 10Giuseppe Lavagetto)
[16:22:54] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10cloud-services-team (Kanban): wikitech-static access for Sam Reed - https://phabricator.wikimedia.org/T262468 (10Reedy) 05Open→03Resolved
[16:23:26] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime
[16:23:27] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime
[16:23:29] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[16:23:30] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime
[16:23:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:56] <logmsgbot>	 !log cdanis@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
[16:24:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:11] <wikibugs>	 (03PS1) 10Cmjohnson: Adding prodcution dns manually to dns file [dns] - 10https://gerrit.wikimedia.org/r/630636 (https://phabricator.wikimedia.org/T260817)
[16:25:26] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:25:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:04] <wikibugs>	 (03CR) 10Cmjohnson: [C: 03+2] Adding prodcution dns manually to dns file [dns] - 10https://gerrit.wikimedia.org/r/630636 (https://phabricator.wikimedia.org/T260817) (owner: 10Cmjohnson)
[16:27:20] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:27:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:49] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/630635 (owner: 10Elukey)
[16:33:43] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:34:29] <logmsgbot>	 !log cdanis@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
[16:34:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:51] <icinga-wm>	 PROBLEM - Stale file for node-exporter textfile in codfw on alert1001 is CRITICAL: cluster=elasticsearch file=intel_microcode.prom instance=elastic2037 job=node site=codfw https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[16:37:24] <wikibugs>	 10Operations, 10ops-eqiad, 10netops: (Need by: 2019-09-30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10Cmjohnson) 05Open→03Resolved Both have been updated
[16:42:15] <icinga-wm>	 PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:44:16] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:TBD) rack/setup/install rows C and D new PDUs - https://phabricator.wikimedia.org/T253694 (10Cmjohnson)
[16:44:30] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Mon, Sept 14th - PDU Upgrade Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (10Cmjohnson) 05Open→03Resolved these were waiting on the temperature leads to be connected.  finished and resolving the task
[16:45:48] <wikibugs>	 10Operations, 10fundraising-tech-ops, 10netops: Automate diff and commit of frack ACL - https://phabricator.wikimedia.org/T260655 (10Jgreen) a:03Jgreen
[16:49:58] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1102.eqiad.wmnet...
[16:55:54] <icinga-wm>	 RECOVERY - Stale file for node-exporter textfile in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[16:56:34] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar: an-presto1004 down - https://phabricator.wikimedia.org/T253438 (10Cmjohnson) a:05Cmjohnson→03RobH @RobH I did not see any signs of burning inside the chassis
[16:57:07] <wikibugs>	 (03PS1) 10Ahmon Dancy: InitialiseSettings-labs.php: updated a comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630640
[16:57:41] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1033 psu redundancy alert - https://phabricator.wikimedia.org/T263145 (10Cmjohnson) 05Open→03Resolved The issue seems to have been resolved.
[16:58:53] <wikibugs>	 10Operations, 10Analytics, 10Traffic: ~1 request/minute to intake-logging.wikimedia.org times out at the traffic/service interface - https://phabricator.wikimedia.org/T264021 (10CDanis)
[16:59:01] <wikibugs>	 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506 (10Cmjohnson)
[16:59:19] <wikibugs>	 10Operations, 10Analytics, 10Traffic: ~1 request/minute to intake-logging.wikimedia.org times out at the traffic/service interface - https://phabricator.wikimedia.org/T264021 (10CDanis) p:05Triage→03Low Clients will retry automatically so this isn't a huge deal, but it does merit investigation at some po...
[16:59:21] <wikibugs>	 10Operations, 10ops-eqiad: Decommisson and store old row D network gear. - https://phabricator.wikimedia.org/T170474 (10Cmjohnson) 05Open→03Resolved all of old row D's old network was removed awhile ago. resolving this task
[16:59:35] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1150.eqiad.wmnet'] `  Of which those **FAILED**: ` ['db1150.eqia...
[17:00:05] <jouncebot>	 ryankemper: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikidata Query Service weekly deploy . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200928T1700).
[17:00:30] <wikibugs>	 (03PS1) 10Hnowlan: restbase: set role for new nodes [puppet] - 10https://gerrit.wikimedia.org/r/630641 (https://phabricator.wikimedia.org/T261512)
[17:02:12] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` db1150.eqiad.wm...
[17:03:44] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:04:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/630574 (owner: 10Jbond)
[17:05:54] <wikibugs>	 (03CR) 10CRusnov: "This change is ready for review." [dns] - 10https://gerrit.wikimedia.org/r/630644 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov)
[17:06:14] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:06:56] <wikibugs>	 10Puppet, 10Patch-For-Review: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) >>! In T263578#6497784, @Volans wrote: >  > Does it need to be retained on reboots? If not seems a good idea to me.  As far as i can see the directory is just used as a queue so if a re...
[17:07:30] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 124.1 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37
[17:08:19] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1102.eqiad.wmnet'] `  Of which those **FAILED**: ` ['an-worker1...
[17:09:38] <wikibugs>	 (03PS1) 10Phuedx: SearchBox: Fix data-search-loc attribute [skins/Vector] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/630418 (https://phabricator.wikimedia.org/T256100)
[17:13:07] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1150.eqiad.wmnet'] `  Of which those **FAILED**: ` ['db1150.eqiad.wmnet'] `
[17:13:32] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH) Ok, I setup an-worker1102 with raid1 on the two SSDS, and each HDD as its own raid0.  Now it gets an LVM label in use error...
[17:15:02] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime
[17:15:02] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[17:15:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:52] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) install memory upgrades in ores100[1-9] - https://phabricator.wikimedia.org/T259909 (10Cmjohnson) @akosiaris Is scheduling this for this coming Wednesday too soon?  1400UTC?  If not let's try Wednesday of next week same time.
[17:18:14] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10Cmjohnson) @elukey Can you do this Monday 5 October 1400UTC?
[17:19:36] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10elukey) Definitely yes!
[17:20:00] <wikibugs>	 10Operations, 10Data-Services, 10SRE-Access-Requests, 10cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (10nskaggs) Ahh got it. No arguments are allowed for secure-cookbook. I confirmed I was able to run t...
[17:20:08] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` db1150.eqiad.wmnet ` The log can be f...
[17:20:50] <wikibugs>	 10Operations, 10Data-Services, 10SRE-Access-Requests, 10cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (10nskaggs) 05Open→03Resolved a:03nskaggs
[17:21:30] <wikibugs>	 10Operations, 10Data-Services, 10SRE-Access-Requests, 10cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (10nskaggs) 05Resolved→03Open a:05nskaggs→03None
[17:21:56] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Clusters: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10Cmjohnson) @elukey Same thing with these...can we do them all Monday or will you need multiple days?
[17:23:20] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Clusters: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10elukey) All on Monday is fine!
[17:25:54] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Clusters: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10Cmjohnson) Okay, great!
[17:27:42] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:28:46] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:31:33] <wikibugs>	 10Operations, 10Analytics, 10Traffic: ~1 request/minute to intake-logging.wikimedia.org times out at the traffic/service interface - https://phabricator.wikimedia.org/T264021 (10JAllemandou) Idea: Could missing-revisions (T215001) be related to this?
[17:32:58] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime
[17:33:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:59] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[17:35:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:39:25] <wikibugs>	 (03CR) 10CRusnov: "This change is ready for review." [dns] - 10https://gerrit.wikimedia.org/r/630647 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov)
[17:39:28] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Migrate ESAMS to Netbox Automation [dns] - 10https://gerrit.wikimedia.org/r/630647 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov)
[17:39:34] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10Cmjohnson)
[17:39:48] <wikibugs>	 (03PS2) 10Catrope: Enable and configure GrowthExperiments on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627395 (https://phabricator.wikimedia.org/T257220)
[17:41:46] <wikibugs>	 10Puppet, 10Patch-For-Review: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10Volans) @jbond I think we can just try the tmpfs first as you said and check the impact.
[17:41:52] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:42:57] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1150.eqiad.wmnet'] `  and were **ALL** successful.
[17:43:00] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:43:36] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10Cmjohnson) 05Open→03Resolved @Marostegui @jcrespo All yours
[17:45:56] <wikibugs>	 (03PS2) 10CRusnov: Migrate ESAMS to Netbox Automation [dns] - 10https://gerrit.wikimedia.org/r/630647 (https://phabricator.wikimedia.org/T258729)
[17:49:56] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10jcrespo) Thank you very much for your help, Cmjohnson!!!
[17:50:45] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mariadb: Set up db1150 as the new buster source backups host [puppet] - 10https://gerrit.wikimedia.org/r/630634 (https://phabricator.wikimedia.org/T257551) (owner: 10Jcrespo)
[17:53:07] <wikibugs>	 (03PS2) 10CRusnov: Migrate EQSIN to Netbox Automation [dns] - 10https://gerrit.wikimedia.org/r/630644 (https://phabricator.wikimedia.org/T258729)
[17:53:48] <wikibugs>	 (03PS3) 10CRusnov: Migrate ESAMS to Netbox Automation [dns] - 10https://gerrit.wikimedia.org/r/630647 (https://phabricator.wikimedia.org/T258729)
[17:56:05] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Add Bereket teshome to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T262921 (10KFrancis) @RLazarus -Confirming Bereket's NDA is fully executed.  Thanks!
[17:56:29] <wikibugs>	 (03Abandoned) 10Jdlrobson: SearchBox: Fix data-search-loc attribute [skins/Vector] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/630418 (https://phabricator.wikimedia.org/T256100) (owner: 10Phuedx)
[18:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200928T1800).
[18:00:04] <jouncebot>	 jdlrobson: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:17] <Urbanecm>	 I can deploy today
[18:00:51] <Jdlrobson>	 o/
[18:01:25] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Enable search in header A/B test for logged in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630206 (https://phabricator.wikimedia.org/T263032) (owner: 10Jdlrobson)
[18:02:12] <wikibugs>	 (03Merged) 10jenkins-bot: Enable search in header A/B test for logged in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630206 (https://phabricator.wikimedia.org/T263032) (owner: 10Jdlrobson)
[18:03:06] <Urbanecm>	 Jdlrobson: pulled onto mwdebug2001, can you test, please?
[18:03:07] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Add Bereket teshome to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T262921 (10RLazarus) a:05KFrancis→03ArielGlenn Thanks @KFrancis! Passing this along to @ArielGlenn as the current SRE Clinic Duty person.
[18:03:16] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Add Bereket teshome to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T262921 (10RLazarus)
[18:03:25] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1102.eqiad.wmnet...
[18:05:08] <Jdlrobson>	 Urbanecm: almost done
[18:05:16] <Urbanecm>	 ack, take your time :)
[18:06:13] <Jdlrobson>	 actually you might be able to help me
[18:06:22] <Urbanecm>	 yes?
[18:06:24] <Jdlrobson>	 all the accounts I have are bucketed in the A group
[18:06:28] <Jdlrobson>	 I need to check the B group
[18:06:38] <Jdlrobson>	 the search bar should be next to the logo in the B group
[18:06:44] <Jdlrobson>	 could you log in and see if that's true for your account?
[18:06:52] <Jdlrobson>	 if i can find one example I know it's working correctly
[18:07:10] <Urbanecm>	 would using my main account work, or do i need to create a new one?
[18:07:28] <Jdlrobson>	 URL https://en.wikipedia.org/wiki/Speedway_(soundtrack)?useskinversion=2
[18:07:30] <Jdlrobson>	 any account
[18:09:05] <Urbanecm>	 Jdlrobson: works for me at euwiki, at least if this is what you expect https://usercontent.irccloud-cdn.com/file/2kkinYxm/image.png
[18:09:17] <Urbanecm>	 I had to empty browser cache for the search bar to move
[18:09:31] <Urbanecm>	 (normal refresh, ie. Ctrl+R, didn't change anything)
[18:09:56] <Jdlrobson>	 Urbanecm: okay it works
[18:10:04] <Jdlrobson>	 yep that's great and i confirmed as well for another account
[18:10:08] <Jdlrobson>	 sync away and thank you!
[18:10:10] <Urbanecm>	 cool, I'll sync it then :)
[18:10:16] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:10:46] <Urbanecm>	 Jdlrobson: is the "Contributions and Log out buttons moved away" issue known?
[18:11:08] <Jdlrobson>	 yep, not related
[18:11:10] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] openstack: replace remaining hiera() that had default values [puppet] - 10https://gerrit.wikimedia.org/r/627967 (owner: 10Dzahn)
[18:11:12] <Jdlrobson>	 that's fine
[18:11:20] <wikibugs>	 (03CR) 10Dmaza: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630653 (https://phabricator.wikimedia.org/T260461) (owner: 10Dmaza)
[18:11:30] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:11:36] <wikibugs>	 (03PS2) 10Dmaza: Enable watchlist expiry feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630653 (https://phabricator.wikimedia.org/T260461)
[18:11:38] <Urbanecm>	 Jdlrobson: ack
[18:12:08] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: c7e08bc2bbff6aead186350726d5c1c137cca052: Enable search in header A/B test for logged in users (T263032) (duration: 00m 58s)
[18:12:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:12:13] <stashbot>	 T263032: Deploy the new location of the search bar to new vector and begin A/B test on test wikis - https://phabricator.wikimedia.org/T263032
[18:12:14] <Urbanecm>	 Jdlrobson: here you go :)
[18:12:20] <Jdlrobson>	 yeehaaa
[18:12:25] <Jdlrobson>	 thanks Urbanecm 
[18:12:29] <Urbanecm>	 no problem
[18:13:55] <wikibugs>	 10Operations, 10Product-Infrastructure-Data, 10Epic, 10Goal, 10Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10CDanis) With the current 5% sampling, we're getting about 30 reports/second at peak...
[18:15:04] <Urbanecm>	 !log Morning B&C done
[18:15:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:15:37] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[18:15:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:39] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[18:17:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:54] <wikibugs>	 10Operations, 10DBA, 10Performance-Team, 10Platform Engineering, 10User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (10daniel) @Krinkle I think there are two parts to this. In my mind, the groups used in code are basically hints to the DB layer that a given cluster m...
[18:22:19] <wikibugs>	 (03CR) 10CRusnov: [C: 04-2] "This needs to be merged after EQSIN patch." [dns] - 10https://gerrit.wikimedia.org/r/630647 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov)
[18:23:44] <wikibugs>	 (03CR) 10Dzahn: oozie: hiera->lookup, add data types (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/629443 (owner: 10Dzahn)
[18:23:47] <wikibugs>	 (03PS3) 10Dzahn: oozie: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/629443
[18:26:28] <wikibugs>	 (03PS18) 10Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/623666
[18:27:56] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+1] logstash: keep at least 2 copies of each shard for apifeatureusage [puppet] - 10https://gerrit.wikimedia.org/r/630546 (https://phabricator.wikimedia.org/T263970) (owner: 10Gehel)
[18:30:37] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 04-2] "Nevermind this.  Will be doing something more extensive in the train-dev branch." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630640 (owner: 10Ahmon Dancy)
[18:30:51] <wikibugs>	 (03Abandoned) 10Ahmon Dancy: InitialiseSettings-labs.php: updated a comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630640 (owner: 10Ahmon Dancy)
[18:32:14] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1003/25467/deneb.codfw.wmnet/change.deneb.codfw.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/627970 (owner: 10Dzahn)
[18:32:58] <wikibugs>	 10Operations, 10DBA, 10Performance-Team, 10Platform Engineering, 10User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (10daniel) A quick inventory of DB groups used in core, based on some ad-hoc grep runs: {P12819}
[18:35:05] <wikibugs>	 (03PS2) 10Dzahn: docker: replace hiera with lookup, add data types for builder and registry [puppet] - 10https://gerrit.wikimedia.org/r/627970
[18:35:26] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1102.eqiad.wmnet'] `  and were **ALL** successful.
[18:35:42] <wikibugs>	 10Operations, 10Machine Learning Platform, 10ORES, 10serviceops, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Halfak) I'm not familiar with this problem.  Anything change with the deployment recently?  Did any overload errors ha...
[18:37:24] <wikibugs>	 (03PS1) 10Andrew Bogott: Move cloudvirt1012,13 and 14 to ceph and Buster [puppet] - 10https://gerrit.wikimedia.org/r/630656 (https://phabricator.wikimedia.org/T259399)
[18:39:26] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/25468/deneb.codfw.wmnet/index.html and docker::registry is not used anywhere?" [puppet] - 10https://gerrit.wikimedia.org/r/627970 (owner: 10Dzahn)
[18:39:34] <tgr_>	 I will do some hacking on mwdebug1002 (for T264029)
[18:39:35] <stashbot>	 T264029: Special:Homepage runs out of memory - https://phabricator.wikimedia.org/T264029
[18:41:56] <wikibugs>	 10Operations, 10Security-Team: Establish secteam production norms - https://phabricator.wikimedia.org/T224886 (10chasemp)
[18:41:58] <wikibugs>	 Operations, Security-Team, Patch-For-Review: (2019-09) Create secteam groups in admin.yaml and define permissions - https://phabricator.wikimedia.org/T223463 (chasemp) Stalled→Declined
[18:42:16] <tgr_>	 btw if anyone has an idea why a query would OOM on hu, hy, uk but not a bunch of other wikis, I'd welcome it
[18:43:07] <wikibugs>	 (03Abandoned) 10Rush: admin: add secteam and secteam-admin for T223463 [puppet] - 10https://gerrit.wikimedia.org/r/510753 (https://phabricator.wikimedia.org/T223463) (owner: 10Rush)
[18:43:12] <wikibugs>	 (03PS2) 10Gehel: logstash: keep at least 2 copies of each shard for apifeatureusage [puppet] - 10https://gerrit.wikimedia.org/r/630546 (https://phabricator.wikimedia.org/T263970)
[18:43:19] <wikibugs>	 (03Abandoned) 10Rush: admin: new group add secteam-admin [puppet] - 10https://gerrit.wikimedia.org/r/521484 (https://phabricator.wikimedia.org/T223463) (owner: 10Jbond)
[18:43:31] <wikibugs>	 10Operations, 10Security-Team: Establish secteam production norms - https://phabricator.wikimedia.org/T224886 (10chasemp)
[18:43:33] <wikibugs>	 Operations, Security-Team, Patch-For-Review: (2019-09) Create secteam groups in admin.yaml and define permissions - https://phabricator.wikimedia.org/T223463 (chasemp) Declined→Resolved
[18:47:43] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH)
[18:48:18] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH) Ok, updates:  * an-worker1102 is now staged and ready for service owners to take it over. * I am working through the other hosts, rebuilding all...
[18:48:24] <wikibugs>	 (03PS1) 10Ppchelko: Expose /page/descrtion API [deployment-charts] - 10https://gerrit.wikimedia.org/r/630657 (https://phabricator.wikimedia.org/T262498)
[18:48:46] <wikibugs>	 (03PS2) 10Ppchelko: Expose /page/descrtion API [deployment-charts] - 10https://gerrit.wikimedia.org/r/630657 (https://phabricator.wikimedia.org/T262498)
[18:50:01] <icinga-wm>	 RECOVERY - Disk space on an-coord1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-coord1001&var-datasource=eqiad+prometheus/ops
[18:51:18] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 381 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:52:20] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 5 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:52:32] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] logstash: keep at least 2 copies of each shard for apifeatureusage [puppet] - 10https://gerrit.wikimedia.org/r/630546 (https://phabricator.wikimedia.org/T263970) (owner: 10Gehel)
[18:57:52] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] docker: replace hiera with lookup, add data types for builder and registry [puppet] - 10https://gerrit.wikimedia.org/r/627970 (owner: 10Dzahn)
[18:58:06] <wikibugs>	 (03PS1) 10Dzahn: docker: add data types [puppet] - 10https://gerrit.wikimedia.org/r/630661
[18:59:05] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] docker: add data types [puppet] - 10https://gerrit.wikimedia.org/r/630661 (owner: 10Dzahn)
[19:00:46] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:01:56] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 443 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:01:56] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:02:21] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1103.eqiad.wmnet', 'an-worker1104.eqi...
[19:03:08] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 9 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:03:26] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/25470/" [puppet] - 10https://gerrit.wikimedia.org/r/629424 (owner: 10Dzahn)
[19:04:45] <wikibugs>	 (03PS9) 10Dzahn: cache::base: replace hiera() with lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/623662
[19:05:45] <wikibugs>	 (03PS1) 10Ahmon Dancy: Add support for dev realm [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630664
[19:05:48] <wikibugs>	 (03PS1) 10Ahmon Dancy: Don't load CirrusSearch extension for dev realm. [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630665
[19:05:50] <wikibugs>	 (03PS1) 10Ahmon Dancy: Support InitialiseSettings-<REALM>.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630666
[19:05:52] <wikibugs>	 (03PS1) 10Ahmon Dancy: Add wmf-config/InitialiseSettings-dev.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630667
[19:06:34] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Support InitialiseSettings-<REALM>.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630666 (owner: 10Ahmon Dancy)
[19:06:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add wmf-config/InitialiseSettings-dev.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630667 (owner: 10Ahmon Dancy)
[19:07:49] <wikibugs>	 (03CR) 10Dzahn: "confirmed noop in prod -acmechief1001" [puppet] - 10https://gerrit.wikimedia.org/r/629424 (owner: 10Dzahn)
[19:14:47] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[19:14:48] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[19:14:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:46] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[19:16:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:32] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[19:18:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:21:23] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+2] Add support for dev realm [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630664 (owner: 10Ahmon Dancy)
[19:21:34] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1105.eqiad.wmnet ` The log can be found...
[19:22:09] <wikibugs>	 (03Merged) 10jenkins-bot: Add support for dev realm [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630664 (owner: 10Ahmon Dancy)
[19:22:41] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+2] Don't load CirrusSearch extension for dev realm. [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630665 (owner: 10Ahmon Dancy)
[19:23:21] <wikibugs>	 (03Merged) 10jenkins-bot: Don't load CirrusSearch extension for dev realm. [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630665 (owner: 10Ahmon Dancy)
[19:24:12] <wikibugs>	 (03PS2) 10Ahmon Dancy: Support InitialiseSettings-<REALM>.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630666
[19:24:14] <wikibugs>	 (03PS2) 10Ahmon Dancy: Add wmf-config/InitialiseSettings-dev.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630667
[19:24:58] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add wmf-config/InitialiseSettings-dev.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630667 (owner: 10Ahmon Dancy)
[19:25:10] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Support InitialiseSettings-<REALM>.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630666 (owner: 10Ahmon Dancy)
[19:35:44] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1103.eqiad.wmnet'] `  Of which those **FAILED**: ` ['an-worker1104.eqiad.wmnet'] `
[19:36:33] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1105.eqiad.wmnet'] `  Of which those **FAILED**: ` ['an-worker1105.eqiad.wmnet'] `
[19:38:21] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH)
[19:39:28] <wikibugs>	 (03PS10) 10Dzahn: cache::base/varnish: replace hiera() with lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/623662
[19:40:41] <wikibugs>	 (03PS7) 10Gehel: Extracting obvious reporting code to a Reporter class. [software/cumin] - 10https://gerrit.wikimedia.org/r/626660 (https://phabricator.wikimedia.org/T212783)
[19:42:44] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/25473/cp1082.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/623662 (owner: 10Dzahn)
[19:45:28] <wikibugs>	 (03PS1) 10Mholloway: Update mobileapps to 2020-09-28-145812-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/630671 (https://phabricator.wikimedia.org/T259624)
[19:47:02] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/25474/cp1082.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/623662 (owner: 10Dzahn)
[19:51:31] <wikibugs>	 (03CR) 10Dzahn: "double checked it's NOOP in prod like in compiler: cp4032, cp2036, cp1082" [puppet] - 10https://gerrit.wikimedia.org/r/623662 (owner: 10Dzahn)
[19:52:19] <wikibugs>	 (03PS2) 10Dzahn: tor: use Stdlib::Host to match FQDN or IP [puppet] - 10https://gerrit.wikimedia.org/r/630310
[19:53:12] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] tor: use Stdlib::Host to match FQDN or IP [puppet] - 10https://gerrit.wikimedia.org/r/630310 (owner: 10Dzahn)
[19:56:10] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Move cloudvirt1012,13 and 14 to ceph and Buster [puppet] - 10https://gerrit.wikimedia.org/r/630656 (https://phabricator.wikimedia.org/T259399) (owner: 10Andrew Bogott)
[19:56:36] <wikibugs>	 (03PS2) 10Dzahn: phabricator: replace Stdlib::Ip_address with IP::Address [puppet] - 10https://gerrit.wikimedia.org/r/630309
[19:57:07] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/25476/phab1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/630309 (owner: 10Dzahn)
[19:59:47] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Access to Superset - https://phabricator.wikimedia.org/T263868 (10JRabah) Thanks @JVargas and @ArielGlenn. Please let me know if you have any questions for me.
[20:00:04] <jouncebot>	 chrisalbon and accraze: Time to snap out of that daydream and deploy Services – Graphoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200928T2000).
[20:05:09] <wikibugs>	 (03CR) 10Mholloway: [C: 03+2] Update mobileapps to 2020-09-28-145812-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/630671 (https://phabricator.wikimedia.org/T259624) (owner: 10Mholloway)
[20:05:53] <wikibugs>	 (03PS3) 10Ahmon Dancy: Support InitialiseSettings-<REALM>.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630666
[20:05:55] <wikibugs>	 (03PS3) 10Ahmon Dancy: Add wmf-config/InitialiseSettings-dev.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630667
[20:07:26] <wikibugs>	 (03Merged) 10jenkins-bot: Update mobileapps to 2020-09-28-145812-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/630671 (https://phabricator.wikimedia.org/T259624) (owner: 10Mholloway)
[20:08:24] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+2] Support InitialiseSettings-<REALM>.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630666 (owner: 10Ahmon Dancy)
[20:09:06] <wikibugs>	 (03Merged) 10jenkins-bot: Support InitialiseSettings-<REALM>.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630666 (owner: 10Ahmon Dancy)
[20:09:22] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+2] Add wmf-config/InitialiseSettings-dev.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630667 (owner: 10Ahmon Dancy)
[20:09:59] <wikibugs>	 (03Merged) 10jenkins-bot: Add wmf-config/InitialiseSettings-dev.php [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630667 (owner: 10Ahmon Dancy)
[20:10:27] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[20:10:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:12:32] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37
[20:12:44] <ebernhardson>	 that server :S it wants to be special
[20:13:12] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[20:13:12] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
[20:13:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:47] <wikibugs>	 (03PS2) 10Dzahn: facilities: replace Stdlib::Ip_address with IP::Address [puppet] - 10https://gerrit.wikimedia.org/r/630308
[20:14:29] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/25477/" [puppet] - 10https://gerrit.wikimedia.org/r/630308 (owner: 10Dzahn)
[20:15:29] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[20:15:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:21] <wikibugs>	 (03CR) 10Dzahn: "hmm.. a duplicate declaration on prometheus1004, but only there?  https://puppet-compiler.wmflabs.org/compiler1002/25466/prometheus1004.eq" [puppet] - 10https://gerrit.wikimedia.org/r/623666 (owner: 10Dzahn)
[20:17:28] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[20:17:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:53] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[20:17:54] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
[20:17:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:59] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[20:20:59] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[20:22:01] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[20:22:01] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[20:24:27] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[20:27:17] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [2000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=39&fullscreen&orgId=1&var-cluster=codfw&var-smoothing=1
[20:27:53] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[20:31:41] <wikibugs>	 (03PS1) 10Andrew Bogott: Try to re-image cloudvirt1012-14 without reformating the VM partition [puppet] - 10https://gerrit.wikimedia.org/r/630675
[20:32:27] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Try to re-image cloudvirt1012-14 without reformating the VM partition [puppet] - 10https://gerrit.wikimedia.org/r/630675 (owner: 10Andrew Bogott)
[20:32:49] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37
[20:34:14] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1105.eqiad.wmnet', 'an-worker1106.eqi...
[20:40:07] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1105.eqiad.wmnet ` The log can be found...
[20:40:57] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37
[20:45:59] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[20:46:01] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[20:46:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:01] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[20:48:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:54] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[20:49:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:50:29] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[20:50:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:51:49] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[20:51:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:26] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[20:52:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:16] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[20:54:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:13] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[20:57:34] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1106.eqiad.wmnet', 'an-worker1107.eqiad.wmnet'] `  Of which those **FAILED**: ` ['an-...
[21:00:05] <jouncebot>	 Reedy and sbassett: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200928T2100).
[21:00:46] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH)
[21:05:53] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1105.eqiad.wmnet'] `  and were **ALL** successful.
[21:08:47] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:09:45] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1108.eqiad.wmnet', 'an-worker1109.eqi...
[21:10:05] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:12:33] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[21:17:21] <wikibugs>	 (CR) Dzahn: maps: hiera()->lookup(), add data types (2 comments) [puppet] - https://gerrit.wikimedia.org/r/629439 (owner: Dzahn)
[21:17:36] <wikibugs>	 (03PS3) 10Dzahn: maps: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/629439
[21:18:22] <wikibugs>	 10Operations, 10Platform Engineering, 10SRE-Access-Requests, 10Platform Team Workboards (Green): Request for membership of acl*procurement-review group for Platform Engineering staff - https://phabricator.wikimedia.org/T264054 (10WDoranWMF)
[21:18:41] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] maps: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/629439 (owner: 10Dzahn)
[21:18:53] <wikibugs>	 Operations, Platform Engineering, SRE-Access-Requests, Platform Team Workboards (Green): Request for membership of acl*procurement-review group for Platform Engineering staff - https://phabricator.wikimedia.org/T264054 (WDoranWMF) p:Triage→High
[21:18:59] <icinga-wm>	 RECOVERY - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37
[21:20:09] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:20:19] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] DHCP: add testvm5001 MAC address [puppet] - 10https://gerrit.wikimedia.org/r/630320 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn)
[21:20:24] <wikibugs>	 (03PS2) 10Dzahn: DHCP: add testvm5001 MAC address [puppet] - 10https://gerrit.wikimedia.org/r/630320 (https://phabricator.wikimedia.org/T252526)
[21:21:27] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:21:30] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[21:21:31] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[21:21:33] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[21:21:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:21:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:21:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:17] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[21:22:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:30] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[21:23:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:24] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[21:25:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:37] <icinga-wm>	 RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is OK: OK: Less than 20.00% above the threshold [1200.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=39&fullscreen&orgId=1&var-cluster=codfw&var-smoothing=1
[21:33:11] <wikibugs>	 (03PS4) 10Dzahn: maps: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/629439
[21:34:55] <wikibugs>	 (03PS2) 10Dzahn: docker: add data types [puppet] - 10https://gerrit.wikimedia.org/r/630661
[21:35:54] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1109.eqiad.wmnet', 'an-worker1108.eqiad.wmnet'] `  Of which those **FAILED**: ` ['an-...
[21:36:05] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] docker: add data types [puppet] - 10https://gerrit.wikimedia.org/r/630661 (owner: 10Dzahn)
[21:43:40] <wikibugs>	 (03PS3) 10Dzahn: docker: add data types [puppet] - 10https://gerrit.wikimedia.org/r/630661
[21:45:21] <wikibugs>	 (03PS19) 10Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/623666
[21:47:02] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1111.eqiad.wmnet', 'an-worker1112.eqi...
[21:47:40] <wikibugs>	 (03PS2) 10Dzahn: tlsproxy::instance: switch from hiera() to lookup(), lint fix [puppet] - 10https://gerrit.wikimedia.org/r/623079
[21:50:22] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/24930/" [puppet] - 10https://gerrit.wikimedia.org/r/623079 (owner: 10Dzahn)
[21:52:01] <wikibugs>	 (03CR) 10Dzahn: "confirmed NOOP in prod: thorium, maps2001, elastic1036" [puppet] - 10https://gerrit.wikimedia.org/r/623079 (owner: 10Dzahn)
[21:52:51] <wikibugs>	 (03PS2) 10CDanis: geoip VCL: add a 'which' param to get_geo_xcip [puppet] - 10https://gerrit.wikimedia.org/r/630315 (https://phabricator.wikimedia.org/T263496)
[21:52:53] <wikibugs>	 (03PS2) 10CDanis: VCL: Attach a variety of GeoIP info as bereq headers; test GeoIP [puppet] - 10https://gerrit.wikimedia.org/r/630316 (https://phabricator.wikimedia.org/T263496)
[21:53:06] <wikibugs>	 (CR) CDanis: geoip VCL: add a 'which' param to get_geo_xcip (1 comment) [puppet] - https://gerrit.wikimedia.org/r/630315 (https://phabricator.wikimedia.org/T263496) (owner: CDanis)
[21:53:33] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] VCL: Attach a variety of GeoIP info as bereq headers; test GeoIP [puppet] - 10https://gerrit.wikimedia.org/r/630316 (https://phabricator.wikimedia.org/T263496) (owner: 10CDanis)
[21:53:57] <wikibugs>	 (03PS20) 10Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/623666
[21:55:15] <wikibugs>	 Operations, Platform Engineering, SRE-Access-Requests, Platform Team Workboards (Green): Request for membership of acl*procurement-review group for Platform Engineering staff - https://phabricator.wikimedia.org/T264054 (RobH) Open→Resolved p:High→Medium a:RobH→None Added....
[21:56:01] <wikibugs>	 (03PS3) 10CDanis: VCL: Attach a variety of GeoIP info as bereq headers; test GeoIP [puppet] - 10https://gerrit.wikimedia.org/r/630316 (https://phabricator.wikimedia.org/T263496)
[21:57:00] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "this compiles fine on every single host _except_ on the eqiad prometheus hosts" [puppet] - 10https://gerrit.wikimedia.org/r/623666 (owner: 10Dzahn)
[21:58:48] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime
[21:58:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:19] <wikibugs>	 (03CR) 10CDanis: "As discussed, now with tests!  Which, by the way, I'm happy to move out of 02-frontend-headers.vtc into a new VTC file if you'd prefer.  (" [puppet] - 10https://gerrit.wikimedia.org/r/630316 (https://phabricator.wikimedia.org/T263496) (owner: 10CDanis)
[21:59:54] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH)
[22:00:51] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[22:00:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:33] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[22:02:37] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-me
[22:02:45] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:03:35] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[22:03:39] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[22:03:47] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:04:33] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[22:06:22] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "the problem here is somehow in the k8s class along the calico-felix.yaml using ${::site} and $targets_path. it only fails in eqiad in the " [puppet] - 10https://gerrit.wikimedia.org/r/623666 (owner: 10Dzahn)
[22:06:39] <icinga-wm>	 RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[22:10:13] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:11:21] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:11:49] <wikibugs>	 (03PS21) 10Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/623666
[22:17:13] <wikibugs>	 (03PS1) 10Mholloway: Update wikifeeds to 2020-09-28-221030-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/630688
[22:20:03] <wikibugs>	 (03CR) 10Mholloway: [C: 03+2] Update wikifeeds to 2020-09-28-221030-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/630688 (owner: 10Mholloway)
[22:22:09] <wikibugs>	 (03Merged) 10jenkins-bot: Update wikifeeds to 2020-09-28-221030-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/630688 (owner: 10Mholloway)
[22:24:21] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
[22:24:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:25:34] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
[22:25:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:50] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "hah, it's always worth compiling on everything for these. the reason was https://gerrit.wikimedia.org/r/c/operations/puppet/+/623666/20..2" [puppet] - 10https://gerrit.wikimedia.org/r/623666 (owner: 10Dzahn)
[22:27:14] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
[22:27:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:27:33] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/25481/  https://puppet-compiler.wmflabs.org/compiler1002/25466/" [puppet] - 10https://gerrit.wikimedia.org/r/623666 (owner: 10Dzahn)
[22:29:52] <wikibugs>	 (03CR) 10Dzahn: "noop on prometheus1003,2004" [puppet] - 10https://gerrit.wikimedia.org/r/623666 (owner: 10Dzahn)
[22:32:59] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[22:34:11] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[22:34:50] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1112.eqiad.wmnet'] `  Of which those **FAILED**: ` ['an-worker1111.eqiad.wmnet', 'an-...
[22:37:07] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH)
[22:38:48] <wikibugs>	 (03CR) 10Dzahn: [V: 04-1] "https://puppet-compiler.wmflabs.org/compiler1002/25483/db1133.eqiad.wmnet/change.db1133.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/630317 (owner: 10Dzahn)
[22:39:15] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=G
[22:40:01] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10RobH)
[22:41:17] <wikibugs>	 (03PS5) 10Dzahn: mariadb::core_test: convert role to profile, add data types [puppet] - 10https://gerrit.wikimedia.org/r/630317
[22:41:57] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[22:42:43] <icinga-wm>	 PROBLEM - PHP7 rendering on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[22:43:27] <icinga-wm>	 PROBLEM - Apache HTTP on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[22:44:27] <wikibugs>	 (03PS6) 10Dzahn: mariadb::core_test: convert role to profile, add data types [puppet] - 10https://gerrit.wikimedia.org/r/630317
[22:45:34] <wikibugs>	 (03PS1) 10Gergő Tisza: Properly handle namespaces in tasktype template configuration [extensions/GrowthExperiments] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/630420 (https://phabricator.wikimedia.org/T264029)
[22:45:58] <wikibugs>	 (03PS7) 10Dzahn: mariadb::core_test: convert role to profile, add data types [puppet] - 10https://gerrit.wikimedia.org/r/630317
[22:46:56] <wikibugs>	 (03PS8) 10Dzahn: mariadb::core_test: convert role to profile, add data types [puppet] - 10https://gerrit.wikimedia.org/r/630317
[22:48:38] <wikibugs>	 (03PS3) 10CRusnov: base/check_systemd_state.py: Switch header to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/624733 (https://phabricator.wikimedia.org/T247364)
[22:49:18] <tgr_>	 I have to go afk for a few minutes, will be back in time to do the B&C patch
[22:50:01] <wikibugs>	 (03CR) 10Dzahn: [V: 04-1] "parameter 'postgres_replicas' expects an Array value, got Struct" [puppet] - 10https://gerrit.wikimedia.org/r/629439 (owner: 10Dzahn)
[22:50:50] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] base/check_systemd_state.py: Switch header to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/624733 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[22:52:08] <wikibugs>	 (03PS4) 10CRusnov: modules/service/files/logstash_checker.py: Move to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/624116 (https://phabricator.wikimedia.org/T247364)
[22:52:37] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[22:52:44] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] modules/service/files/logstash_checker.py: Move to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/624116 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[22:53:13] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:53:16] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] modules/service/files/logstash_checker.py: Move to Python3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/624116 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[22:55:00] <wikibugs>	 (03PS6) 10Dzahn: swift::proxy: convert role to profile, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/628970
[22:55:17] <wikibugs>	 (03CR) 10Dzahn: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/628970 (owner: 10Dzahn)
[22:56:09] <wikibugs>	 (03CR) 10Dzahn: "no difference (except comments) between PS2 and PS6" [puppet] - 10https://gerrit.wikimedia.org/r/628970 (owner: 10Dzahn)
[22:56:56] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1111.eqiad.wmnet ` The log can be found...
[22:58:06] <wikibugs>	 (03CR) 10Dzahn: "getting closer, removing -1" [puppet] - 10https://gerrit.wikimedia.org/r/621368 (owner: 10Dzahn)
[23:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200928T2300).
[23:00:04] <jouncebot>	 tgr: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:01:06] <wikibugs>	 (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/630690 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[23:02:04] <wikibugs>	 (03Abandoned) 10CRusnov: modules/admin/data/nda_audit.py: Port to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/624112 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[23:03:48] <wikibugs>	 (03PS1) 10Dzahn: quarry: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/630691
[23:04:18] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1113.eqiad.wmnet ` The log can be found...
[23:05:49] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:11:53] <wikibugs>	 (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/630693 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[23:15:17] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:16:15] <wikibugs>	 (03PS1) 10Dzahn: thumbor: role->profile, data types, lint (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/630694
[23:17:03] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[23:18:00] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1111.eqiad.wmnet'] `  Of which those **FAILED**: ` ['an-worker1111.eqiad.wmnet'] `
[23:18:11] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1113.eqiad.wmnet'] `  Of which those **FAILED**: ` ['an-worker1113.eqiad.wmnet'] `
[23:19:59] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:24:27] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] Properly handle namespaces in tasktype template configuration [extensions/GrowthExperiments] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/630420 (https://phabricator.wikimedia.org/T264029) (owner: 10Gergő Tisza)
[23:24:29] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:26:51] <icinga-wm>	 RECOVERY - Apache HTTP on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.801 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[23:27:02] <wikibugs>	 (03PS1) 10Dzahn: install_server: let testvm5001 use install5001 as TFTP server [puppet] - 10https://gerrit.wikimedia.org/r/630695 (https://phabricator.wikimedia.org/T252526)
[23:27:29] <icinga-wm>	 RECOVERY - PHP7 rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[23:27:32] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] install_server: let testvm5001 use install5001 as TFTP server [puppet] - 10https://gerrit.wikimedia.org/r/630695 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn)
[23:27:39] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:29:58] <wikibugs>	 (03PS1) 10Dzahn: DHCP: switch TFTP server for eqsin from bast5001 to install5001 [puppet] - 10https://gerrit.wikimedia.org/r/630696 (https://phabricator.wikimedia.org/T252526)
[23:32:18] <wikibugs>	 (03Merged) 10jenkins-bot: Properly handle namespaces in tasktype template configuration [extensions/GrowthExperiments] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/630420 (https://phabricator.wikimedia.org/T264029) (owner: 10Gergő Tisza)
[23:33:21] <icinga-wm>	 PROBLEM - PHP7 rendering on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[23:34:15] <icinga-wm>	 PROBLEM - Apache HTTP on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[23:34:21] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=G
[23:35:55] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[23:35:57] <wikibugs>	 (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/630697 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[23:40:27] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:42:12] <wikibugs>	 (03PS1) 10Dzahn: enable tftp service on install5001 [puppet] - 10https://gerrit.wikimedia.org/r/630699 (https://phabricator.wikimedia.org/T252526)
[23:42:42] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] enable tftp service on install5001 [puppet] - 10https://gerrit.wikimedia.org/r/630699 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn)
[23:42:56] <logmsgbot>	 !log tgr@deploy1001 Synchronized php-1.36.0-wmf.10/extensions/GrowthExperiments/includes/NewcomerTasks/ConfigurationLoader/PageConfigurationLoader.php: Backport: [[gerrit:630420|Properly handle namespaces in tasktype template configuration (T264029)]] (duration: 01m 03s)
[23:43:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:43:02] <stashbot>	 T264029: Special:Homepage runs out of memory - https://phabricator.wikimedia.org/T264029
[23:45:07] <ebernhardson>	 tgr_: all done? I'm going to add something to backport
[23:45:52] <tgr_>	 ebernhardson: I just realized I need one more fix. It will take a few minutes though so done for now. Are you deploying for yourself?
[23:46:25] <ebernhardson>	 tgr_: yea i'll deploy, no rush it can go in 20 minutes or whatever
[23:46:47] <tgr_>	 cool, go ahead.
[23:46:53] <ebernhardson>	 ok
[23:47:26] <wikibugs>	 (03PS1) 10Ebernhardson: Remove commonswiki from sister search sidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630700 (https://phabricator.wikimedia.org/T264053)
[23:48:15] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:49:40] <wikibugs>	 (03PS2) 10Ebernhardson: Remove commonswiki from sister search sidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630700 (https://phabricator.wikimedia.org/T264053)
[23:49:47] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+2] "backport window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630700 (https://phabricator.wikimedia.org/T264053) (owner: 10Ebernhardson)
[23:50:36] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] DHCP: switch TFTP server for eqsin from bast5001 to install5001 [puppet] - 10https://gerrit.wikimedia.org/r/630696 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn)
[23:50:41] <wikibugs>	 (03Merged) 10jenkins-bot: Remove commonswiki from sister search sidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630700 (https://phabricator.wikimedia.org/T264053) (owner: 10Ebernhardson)
[23:51:01] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:53:54] <wikibugs>	 (03PS1) 10Dzahn: stop tftp service on bast5001 [puppet] - 10https://gerrit.wikimedia.org/r/630702 (https://phabricator.wikimedia.org/T252526)
[23:54:13] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:54:54] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] stop tftp service on bast5001 [puppet] - 10https://gerrit.wikimedia.org/r/630702 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn)
[23:56:57] <logmsgbot>	 !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T264053: Remove commonswiki from sidebar search (duration: 01m 09s)
[23:57:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:01] <stashbot>	 T264053: Unsustainable increases in Elasticsearch cluster disk IO - https://phabricator.wikimedia.org/T264053
[23:58:12] <ebernhardson>	 tgr_: all done
[23:58:20] <tgr_>	 thx
[23:58:27] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:58:53] <icinga-wm>	 PROBLEM - TFTP service on bast5001 is CRITICAL: NRPE: Command check_atftpd not defined https://wikitech.wikimedia.org/wiki/Monitoring/atftpd