[00:00:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:11:39] PROBLEM - Disk space on an-master1002 is CRITICAL: DISK CRITICAL - free space: /srv 1518 MB (1% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-master1002&var-datasource=eqiad+prometheus/ops [00:18:09] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10bd808) [00:41:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:59:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:01:43] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:38:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:40:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:46:32] (03CR) 10KartikMistry: [C: 03+1] "LGTM." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/628065 (https://phabricator.wikimedia.org/T263093) (owner: 10Santhosh) [03:46:43] (03PS3) 10KartikMistry: wgSkipSkins: Exclude contenttranslation skin from skin options for users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628065 (https://phabricator.wikimedia.org/T263093) (owner: 10Santhosh) [05:06:32] (03PS1) 10Marostegui: mariadb: es2017 disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/631997 (https://phabricator.wikimedia.org/T264386) [05:06:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2017 T264386 ', diff saved to https://phabricator.wikimedia.org/P12916 and previous config saved to /var/cache/conftool/dbconfig/20201005-050636-marostegui.json [05:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:06:41] T264386: decommission es2017.codfw.wmnet - https://phabricator.wikimedia.org/T264386 [05:07:32] (03CR) 10Marostegui: [C: 03+2] mariadb: es2017 disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/631997 (https://phabricator.wikimedia.org/T264386) (owner: 10Marostegui) [05:31:23] (03PS1) 10Legoktm: Don't install apt-transport-https for buster [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/631998 [05:47:49] (03PS1) 10Marostegui: dbproxy1018: Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/632000 [05:49:34] (03CR) 10Marostegui: [C: 03+2] dbproxy1018: Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/632000 (owner: 10Marostegui) [06:26:33] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:28:07] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:30:10] 10Operations, 10DNS, 10Traffic, 10User-DannyS712: DNS_PROBE_FINISHED_NXDOMAIN for mobile version of internal.wikimedia.org - https://phabricator.wikimedia.org/T264565 (10DannyS712) [06:33:16] 10Operations, 10DNS, 10Traffic, 10User-DannyS712: DNS_PROBE_FINISHED_NXDOMAIN for mobile version of internal.wikimedia.org - https://phabricator.wikimedia.org/T264565 (10DannyS712) The same occurs for the mobile view links at https://collab.wikimedia.org/wiki/Main_Page and https://board.wikimedia.org/wiki/... [06:33:21] !log reboot stat1005 to resolve weird GPU state (scheduled last week) [06:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:00] https://github.com/apache/spark/pull/22485 looks really nice [06:40:04] (03CR) 10Volans: [C: 03+1] "LGTM, minor documentation nit inline, no need for re-review once fixed." (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/631792 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [06:40:19] PROBLEM - Host stat1005 is DOWN: PING CRITICAL - Packet loss = 100% [06:42:35] RECOVERY - Host stat1005 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [06:43:23] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:44:59] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:47:46] (03CR) 10Volans: [C: 04-1] "One missing dep, looks good otherwise. See also a couple of suggestions inline." 
(035 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/631909 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [06:51:25] ah ok this is more realistic :D [06:52:44] !log add static NAT to pfw3-eqiad - T264356 [06:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:10] (03CR) 10Volans: [C: 04-1] "clarified comment" (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/631909 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [06:57:31] (03CR) 10Volans: [C: 03+1] "LGTM, 2 nits inline." (032 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/631910 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [06:59:53] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin aliases for Thanos [puppet] - 10https://gerrit.wikimedia.org/r/631773 (owner: 10Muehlenhoff) [07:30:58] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [07:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:37] !log Stop mysql on es2017 T264386 [07:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:44] T264386: decommission es2017.codfw.wmnet - https://phabricator.wikimedia.org/T264386 [07:51:56] (03PS1) 10Muehlenhoff: Add DNS entry ldap-replica2003.wikimedia.org based on what the makevm cookbook assigned [dns] - 10https://gerrit.wikimedia.org/r/632176 (https://phabricator.wikimedia.org/T264390) [07:54:16] (03CR) 10Elukey: "> Patch Set 2: Code-Review+1" (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/631792 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [07:54:51] (03CR) 10Muehlenhoff: [C: 03+2] Add DNS entry ldap-replica2003.wikimedia.org based on what the makevm cookbook assigned [dns] - 10https://gerrit.wikimedia.org/r/632176 (https://phabricator.wikimedia.org/T264390) (owner: 10Muehlenhoff) [07:55:32] (03CR) 10Volans: "Just did a quick first pass" (034 comments) [puppet] - 
https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249) (owner: Jbond)
[07:56:31] (PS1) Joal: Bump AQS druid datasource snapshot to 2020-09 [puppet] - https://gerrit.wikimedia.org/r/632178
[07:56:38] Operations, Performance-Team, Traffic: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (Gilles) @BBlack we don't see elevated latency that lasts for days like that on train rollouts and rollbacks. Train rollbacks are a frequent event. We're now at 6 days of consistent...
[07:56:51] elukey: --^
[08:00:16] (CR) Elukey: [C: +2] Bump AQS druid datasource snapshot to 2020-09 [puppet] - https://gerrit.wikimedia.org/r/632178 (owner: Joal)
[08:06:32] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[08:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:25] Operations, Performance-Team, Traffic: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (Gilles) As for per-host on esams, it is very clear on every host. Even clearer if you zoom out and switch to a 1day rolling average: {F32373885} {F32373886} {F32373887} {F32373...
[08:10:37] (03PS1) 10Muehlenhoff: Add ldap-replica2003 to DHCP and the new instances to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/632180 [08:16:13] (03PS2) 10Muehlenhoff: Add ldap-replica2003 to DHCP and the new instances to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/632180 [08:19:48] (03PS1) 10Filippo Giunchedi: prometheus: add 50 percentile for ats-tls TTFB [puppet] - 10https://gerrit.wikimedia.org/r/632190 (https://phabricator.wikimedia.org/T263536) [08:21:14] (03CR) 10Volans: "Feel free to ping me to chat about it offline" (033 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [08:23:33] !log prometheus codfw/ops, add 100G to the LV [08:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:31] (03PS6) 10Filippo Giunchedi: profile: add alertmanager::alerts [puppet] - 10https://gerrit.wikimedia.org/r/629153 (https://phabricator.wikimedia.org/T258948) [08:25:33] (03PS1) 10Filippo Giunchedi: prometheus: remove unused IcingaServiceProblem [puppet] - 10https://gerrit.wikimedia.org/r/632191 (https://phabricator.wikimedia.org/T258948) [08:26:06] (03PS2) 10Filippo Giunchedi: prometheus: remove unused IcingaServiceProblem [puppet] - 10https://gerrit.wikimedia.org/r/632191 (https://phabricator.wikimedia.org/T258948) [08:26:08] (03PS7) 10Filippo Giunchedi: profile: add alertmanager::alerts [puppet] - 10https://gerrit.wikimedia.org/r/629153 (https://phabricator.wikimedia.org/T258948) [08:28:51] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: remove unused IcingaServiceProblem [puppet] - 10https://gerrit.wikimedia.org/r/632191 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [08:31:26] (03CR) 10Muehlenhoff: [C: 03+2] Add ldap-replica2003 to DHCP and the new instances to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/632180 (owner: 10Muehlenhoff) [08:31:29] (03PS1) 10Ema: vcl: include response status in cacheable 
cookie logging [puppet] - 10https://gerrit.wikimedia.org/r/632192 (https://phabricator.wikimedia.org/T264378) [08:32:52] (03CR) 10Gehel: "replied to comments inline, fixes will follow when I have time." (033 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [08:38:23] !log kormat@cumin1001 dbctl commit (dc=all): 'Add db2119 to s4 dump/vslow temporarily T259831', diff saved to https://phabricator.wikimedia.org/P12917 and previous config saved to /var/cache/conftool/dbconfig/20201005-083822-kormat.json [08:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:29] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [08:39:47] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [08:39:48] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:22] !log kormat@cumin1001 dbctl commit (dc=all): 'db2073 depooling: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12918 and previous config saved to /var/cache/conftool/dbconfig/20201005-084022-kormat.json [08:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:09] (03CR) 10Vgutierrez: [C: 03+1] vcl: include response status in cacheable cookie logging [puppet] - 10https://gerrit.wikimedia.org/r/632192 (https://phabricator.wikimedia.org/T264378) (owner: 10Ema) [08:45:49] (03PS1) 10Elukey: sre.aqs.roll-restart: add canary testing [cookbooks] - 10https://gerrit.wikimedia.org/r/632196 [08:47:39] (03PS2) 10Elukey: sre.aqs.roll-restart: add canary testing [cookbooks] - 10https://gerrit.wikimedia.org/r/632196 [08:57:54] !log installing ldap-replica2004 T264390 [08:57:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:00] T264390: Site: 4 VM request for LDAP replicas - https://phabricator.wikimedia.org/T264390 [08:59:05] (03PS3) 10Elukey: sre.aqs.roll-restart: add canary testing [cookbooks] - 10https://gerrit.wikimedia.org/r/632196 [09:01:55] (03PS4) 10Elukey: sre.aqs.roll-restart: add canary testing [cookbooks] - 10https://gerrit.wikimedia.org/r/632196 [09:02:39] !log bootstrapping restbase1030-b [09:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:21] RECOVERY - cassandra-b service on restbase1030 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:03:53] RECOVERY - cassandra-b SSL 10.64.48.235:7001 on restbase1030 is OK: SSL OK - Certificate restbase1030-b valid until 2022-09-29 10:16:56 +0000 (expires in 724 days) https://phabricator.wikimedia.org/T120662 [09:07:05] (03CR) 10Ema: [C: 03+2] vcl: include response status in cacheable cookie logging [puppet] - 10https://gerrit.wikimedia.org/r/632192 (https://phabricator.wikimedia.org/T264378) (owner: 10Ema) [09:10:47] (03CR) 10Volans: sre.aqs.roll-restart: add canary testing (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/632196 (owner: 10Elukey) [09:14:08] (03CR) 10Elukey: sre.aqs.roll-restart: add canary testing (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/632196 (owner: 10Elukey) [09:14:35] (03PS5) 10Elukey: sre.aqs.roll-restart: add canary testing [cookbooks] - 10https://gerrit.wikimedia.org/r/632196 [09:22:07] (03CR) 10Elukey: [C: 03+2] sre.aqs.roll-restart: add canary testing [cookbooks] - 10https://gerrit.wikimedia.org/r/632196 (owner: 10Elukey) [09:22:29] !log installing ldap-replica2003 T264390 [09:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:37] T264390: Site: 4 VM request for LDAP replicas - https://phabricator.wikimedia.org/T264390 [09:29:22] 10Operations, 10SRE-swift-storage, 10User-fgiunchedi: 
Report swift-object server per-method latencies - https://phabricator.wikimedia.org/T264588 (10fgiunchedi) [09:30:07] (03CR) 10Alexandros Kosiaris: [C: 03+2] "lol, +2" [puppet] - 10https://gerrit.wikimedia.org/r/631901 (owner: 10Dzahn) [09:31:42] 10Operations, 10Traffic: ATS-BE Lua mitigations for cacheable responses w/ Set-Cookie seemingly not working - https://phabricator.wikimedia.org/T264378 (10ema) p:05Triage→03Medium [09:31:51] 10Operations, 10Traffic: ATS-BE Lua mitigations for cacheable responses w/ Set-Cookie seemingly not working - https://phabricator.wikimedia.org/T264378 (10ema) I've broadened the search to the past 2 months, and there are a total of 10 matching log entries, all of which are from #parsoid. All are related to ed... [09:31:53] (03PS3) 10Klausman: aptrepo: Add rocm 3.8 packages to reprepro [puppet] - 10https://gerrit.wikimedia.org/r/631725 (https://phabricator.wikimedia.org/T264408) [09:38:08] (03PS4) 10Klausman: aptrepo: Add rocm 3.8 packages to reprepro [puppet] - 10https://gerrit.wikimedia.org/r/631725 (https://phabricator.wikimedia.org/T264408) [09:38:59] (03PS5) 10Klausman: aptrepo: Add rocm 3.8 packages to reprepro [puppet] - 10https://gerrit.wikimedia.org/r/631725 (https://phabricator.wikimedia.org/T264408) [09:39:28] (03PS6) 10Klausman: aptrepo: Add rocm 3.8 packages to reprepro [puppet] - 10https://gerrit.wikimedia.org/r/631725 (https://phabricator.wikimedia.org/T264408) [09:40:52] RECOVERY - Disk space on puppetdb2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=puppetdb2002&var-datasource=codfw+prometheus/ops [09:52:32] !log installing ldap-replica1001 T264390 [09:52:35] (03PS1) 10Jbond: puppetdb: manage mnode on vardir [puppet] - 10https://gerrit.wikimedia.org/r/632198 [09:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:38] T264390: Site: 4 VM request for LDAP replicas - 
https://phabricator.wikimedia.org/T264390 [09:53:18] (03CR) 10Jbond: [C: 03+2] puppetdb: manage mnode on vardir [puppet] - 10https://gerrit.wikimedia.org/r/632198 (owner: 10Jbond) [09:53:35] (03PS2) 10Jbond: puppetdb: manage mode on vardir [puppet] - 10https://gerrit.wikimedia.org/r/632198 [09:53:37] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/632198 (owner: 10Jbond) [09:54:47] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1003/25654/" [puppet] - 10https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584) (owner: 10Dave Pifke) [10:00:50] (03CR) 10Filippo Giunchedi: "Merged! Next steps from my POV:" [puppet] - 10https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584) (owner: 10Dave Pifke) [10:02:03] 10Operations, 10LDAP-Access-Requests: Add lilients_WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T264590 (10lilients_WMDE) [10:04:58] 10Operations, 10Analytics-Radar, 10Discovery, 10Recommendation-API, 10Patch-For-Review: Run swift-object-expirer as part of the swift cluster - https://phabricator.wikimedia.org/T229584 (10fgiunchedi) Merged the patch above, apologies for the late action on this. These are the steps I think are left to... [10:08:42] !log installing ldap-replica1002 T264390 [10:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:48] T264390: Site: 4 VM request for LDAP replicas - https://phabricator.wikimedia.org/T264390 [10:10:11] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10jbond) > The opaqueness of this issue leads me to believe that there may be benefit in enabling quoted-strings in the yamllint config. Yes this is an issue with the underli... 
[10:11:04] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10jbond) It is also worth mentioning that the private git repo on the puppetmaster does use yaml lint so we could possibly start by updating the config there [10:12:14] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/630889 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [10:17:04] 10Operations, 10LDAP-Access-Requests: Add lilients_WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T264590 (10WMDE-leszek) [10:17:33] (03PS1) 10Ema: cache: downgrade Varnish on cp3052 to 5.1.3-1wm15 [puppet] - 10https://gerrit.wikimedia.org/r/632201 (https://phabricator.wikimedia.org/T264398) [10:18:39] (03PS2) 10Ema: cache: downgrade Varnish on cp3052 to 5.1.3-1wm15 [puppet] - 10https://gerrit.wikimedia.org/r/632201 (https://phabricator.wikimedia.org/T264398) [10:20:53] (03PS1) 10Elukey: Set an-worker111[02] as Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/632202 (https://phabricator.wikimedia.org/T255140) [10:21:05] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/631758 (owner: 10Filippo Giunchedi) [10:21:38] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/632201 (https://phabricator.wikimedia.org/T264398) (owner: 10Ema) [10:24:28] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/631430 (owner: 10Muehlenhoff) [10:25:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:25:50] (03CR) 10Ema: [C: 03+2] cache: downgrade Varnish on cp3052 to 5.1.3-1wm15 [puppet] - 10https://gerrit.wikimedia.org/r/632201 (https://phabricator.wikimedia.org/T264398) (owner: 10Ema) [10:26:01] 
RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:27:08] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/631858 (owner: 10Dzahn) [10:28:27] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632204 (https://phabricator.wikimedia.org/T128546) [10:28:50] (03CR) 10Jbond: [C: 03+1] trafficserver: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/631291 (owner: 10Dzahn) [10:28:54] !log cp3052: depool and downgrade varnish to 5.1.3-1wm15 T264398 [10:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:01] T264398: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 [10:29:48] (03PS1) 10Filippo Giunchedi: hieradata: expand swift object server statsd mappings [puppet] - 10https://gerrit.wikimedia.org/r/632205 (https://phabricator.wikimedia.org/T264588) [10:30:04] jan_drewniak: Time to snap out of that daydream and deploy Wikimedia Portals Update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201005T1030). 
[10:30:12] (03CR) 10Elukey: [C: 03+2] Set an-worker111[02] as Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/632202 (https://phabricator.wikimedia.org/T255140) (owner: 10Elukey) [10:32:13] !log cp3052: pool with varnish 5.1.3-1wm15 T264398 [10:32:14] 10Operations, 10vm-requests: Site: 4 VM request for LDAP replicas - https://phabricator.wikimedia.org/T264390 (10MoritzMuehlenhoff) 05Open→03Resolved VMs have been created [10:32:16] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632204 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:29] PROBLEM - Check systemd state on cp3052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:33:05] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632204 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:33:39] RECOVERY - Check systemd state on cp3052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:34:26] !log elukey@cumin1001 START - Cookbook sre.aqs.roll-restart [10:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:38] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:632204| Bumping portals to master (T128546)]] (duration: 01m 00s) [10:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:43] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:37:37] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:632204| Bumping portals to master (T128546)]] (duration: 00m 
58s) [10:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:57] !log elukey@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) [10:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:09] PROBLEM - MariaDB Replica Lag: s4 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3289.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:41:06] ^ that's me [10:42:24] ACKNOWLEDGEMENT - MariaDB Replica Lag: s4 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3109.58 seconds Kormat schema change https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:47:31] (03CR) 10Elukey: [C: 03+1] aptrepo: Add rocm 3.8 packages to reprepro [puppet] - 10https://gerrit.wikimedia.org/r/631725 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [10:47:33] (03CR) 10Muehlenhoff: [C: 03+2] Remove profile::ipmi::mgmt from role::bastionhost::pop [puppet] - 10https://gerrit.wikimedia.org/r/631430 (owner: 10Muehlenhoff) [10:50:04] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/631725 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [10:50:29] (03PS4) 10Muehlenhoff: consolidate bastionhost roles, remove module [puppet] - 10https://gerrit.wikimedia.org/r/631858 (owner: 10Dzahn) [10:50:39] 10Operations, 10Puppet: Stop introducing new code expanded from erb templates - https://phabricator.wikimedia.org/T200984 (10fgiunchedi) 05Open→03Invalid >>! In T200984#6514252, @Dzahn wrote: > This seems somewhat but not exactly a duplicate of T254480. The other ticket would be resolved once we have remo... 
[10:51:49] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/631758 (owner: 10Filippo Giunchedi) [10:52:10] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/631758 (owner: 10Filippo Giunchedi) [10:53:00] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (10ema) Varnish downgraded on cp3052. I've made a new dashboard comparing response time on cp3052 (v5) vs cp3054 (v6): https://grafana.wikimedia.org/d/EiAVq3FGz... [10:55:04] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: look up rsyslog queue_size in scope [puppet] - 10https://gerrit.wikimedia.org/r/631446 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [10:55:26] (03PS1) 10Hoo man: Revert "Remove $wgExtraLanguageNames from Wikidata and Commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632209 (https://phabricator.wikimedia.org/T264295) [10:56:00] (03PS1) 10Muehlenhoff: Remove obsolete Hiera settings for edge bastions [puppet] - 10https://gerrit.wikimedia.org/r/632211 [10:56:16] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: enable rsyslog kafka queues in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/631439 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [10:57:58] (03PS2) 10Hoo man: Revert "Remove $wgExtraLanguageNames from Wikidata and Commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632209 (https://phabricator.wikimedia.org/T264295) [10:58:25] 10Operations, 10SRE-tools, 10serviceops, 10User-jijiki: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (10jijiki) [10:59:57] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632212 (https://phabricator.wikimedia.org/T128546) [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:How do functions 
break up? A:They stop calling each other. Rise for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201005T1100).
[11:00:05] kart_ and HitomiAkane: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:23] * kart_ is here.
[11:00:32] hey here, I noticed one small bug in the portals patch that I'd like to re-deploy, can we pause the EU swat for 5 minutes?
[11:00:32] I can deploy today
[11:00:37] Operations, SRE-tools, serviceops, User-jijiki: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (Volans)
[11:00:38] jan_drewniak: certainly
[11:00:39] Operations, SRE-tools, User-Joe, User-jijiki: Spicerack cookbooks TODO list - https://phabricator.wikimedia.org/T203943 (Volans)
[11:01:15] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/632212 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[11:01:41] (PS1) KartikMistry: Set CXMTThresholdForPublish to 95% for Vietnamese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/632214 (https://phabricator.wikimedia.org/T264161)
[11:02:18] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/632212 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[11:04:05] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:632212| Bumping portals to master (T128546)]] (duration: 00m 58s)
[11:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:11] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[11:04:12] RECOVERY - cassandra-b CQL
10.64.48.235:9042 on restbase1030 is OK: TCP OK - 0.000 second response time on 10.64.48.235 port 9042 https://phabricator.wikimedia.org/T93886
[11:04:40] (CR) Lucas Werkmeister (WMDE): [C: +1] Revert "Remove $wgExtraLanguageNames from Wikidata and Commons" [mediawiki-config] - https://gerrit.wikimedia.org/r/632209 (https://phabricator.wikimedia.org/T264295) (owner: Hoo man)
[11:05:03] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:632212| Bumping portals to master (T128546)]] (duration: 00m 58s)
[11:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:00] Urbanecm: Ok all better (I had some unsightly HTML escaping happening) EU swat can proceed, thanks for waiting!
[11:06:08] jan_drewniak: thank you!
[11:07:24] (CR) Urbanecm: [C: +2] wgSkipSkins: Exclude contenttranslation skin from skin options for users [mediawiki-config] - https://gerrit.wikimedia.org/r/628065 (https://phabricator.wikimedia.org/T263093) (owner: Santhosh)
[11:07:34] kart_: I'll ping you once it's ready to be tested
[11:07:38] RECOVERY - cassandra-c service on restbase1030 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:07:55] Urbanecm: sure.
[11:08:08] (PS4) Jbond: cassandra: add data types, remove validation code [puppet] - https://gerrit.wikimedia.org/r/630312 (owner: Dzahn)
[11:08:12] (Merged) jenkins-bot: wgSkipSkins: Exclude contenttranslation skin from skin options for users [mediawiki-config] - https://gerrit.wikimedia.org/r/628065 (https://phabricator.wikimedia.org/T263093) (owner: Santhosh)
[11:08:22] RECOVERY - cassandra-c SSL 10.64.48.236:7001 on restbase1030 is OK: SSL OK - Certificate restbase1030-c valid until 2022-09-29 10:16:58 +0000 (expires in 723 days) https://phabricator.wikimedia.org/T120662
[11:08:33] (CR) jerkins-bot: [V: -1] cassandra: add data types, remove validation code [puppet] - https://gerrit.wikimedia.org/r/630312 (owner: Dzahn)
[11:08:44] RECOVERY - MariaDB Replica Lag: s4 on db2095 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:08:54] kart_: your patch is at mwdebug2001 now
[11:09:19] Urbanecm: testing..
[11:11:12] Urbanecm: working as expected. Please deploy.
[11:11:16] syncing, thank you
[11:11:39] Operations, MediaWiki-General, Performance-Team, serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Marostegui)
[11:11:51] HitomiAkane: hello
[11:12:33] hi Urbanecm
[11:12:41] HitomiAkane: I'll ping you once your patch is ready
[11:12:43] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: cd30b626e23b48146b970c72731f8f7bb1eee9e1: wgSkipSkins: Exclude contenttranslation skin from skin options for users (T263093) (duration: 00m 59s)
[11:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:49] T263093: Create a custom skin for Content Translation special pages - https://phabricator.wikimedia.org/T263093
[11:12:57] kart_: done :)
[11:13:02] Urbanecm: thanks!
[11:13:06] no problem [11:13:08] Thx [11:14:11] (03PS4) 10HitomiAkane: Move changetags right from users to sysop [trwiki] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631956 (https://phabricator.wikimedia.org/T264508) [11:14:13] (03CR) 10Urbanecm: [C: 03+2] Move changetags right from users to sysop [trwiki] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631956 (https://phabricator.wikimedia.org/T264508) (owner: 10HitomiAkane) [11:14:24] (03Merged) 10jenkins-bot: Move changetags right from users to sysop [trwiki] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631956 (https://phabricator.wikimedia.org/T264508) (owner: 10HitomiAkane) [11:14:33] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, I've merged the dependent patch." [puppet] - 10https://gerrit.wikimedia.org/r/631858 (owner: 10Dzahn) [11:14:36] (03PS5) 10Jbond: cassandra: add data types, remove validation code [puppet] - 10https://gerrit.wikimedia.org/r/630312 (owner: 10Dzahn) [11:14:40] HitomiAkane: your patch is at mwdebug2001, can you test, please? [11:14:50] (03CR) 10jerkins-bot: [V: 04-1] cassandra: add data types, remove validation code [puppet] - 10https://gerrit.wikimedia.org/r/630312 (owner: 10Dzahn) [11:15:57] (03CR) 10Jbond: "> Patch Set 3:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630312 (owner: 10Dzahn) [11:16:34] the patch is merged now [11:17:00] HitomiAkane: yes, and it is fetched at mwdebug2001, can you test, please? [11:18:06] sorry but i'm still new to this how exactly i'll test it :P [11:18:32] HitomiAkane: sure. You need to enable X-Wikimedia-Debug browser extension first, see https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_extensions [11:18:56] then, you need to make sure the change works as expected (ie. the right change shows up in Special:UserGroupRights) [11:18:58] does it make sense, HitomiAkane ? 
[11:19:32] yes [11:20:27] HitomiAkane: let me know how it goes :) [11:20:53] Lucas_WMDE: hoo: Once I'm done, would you like to self-service https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/632209, or should I deploy it too? [11:21:14] I will self-service, thanks :) [11:21:16] (03PS6) 10Jbond: cassandra: add data types, remove validation code [puppet] - 10https://gerrit.wikimedia.org/r/630312 (owner: 10Dzahn) [11:21:24] hoo: okay, I'll ping you once it's ready :) [11:21:46] (03CR) 10jerkins-bot: [V: 04-1] cassandra: add data types, remove validation code [puppet] - 10https://gerrit.wikimedia.org/r/630312 (owner: 10Dzahn) [11:21:50] (03PS7) 10Jbond: cassandra: add data types, remove validation code [puppet] - 10https://gerrit.wikimedia.org/r/630312 (owner: 10Dzahn) [11:23:25] (03CR) 10Jbond: "Compare PS6" [puppet] - 10https://gerrit.wikimedia.org/r/630312 (owner: 10Dzahn) [11:23:45] HitomiAkane: how is it going? [11:26:52] HitomiAkane: ping? [11:27:39] syncing anyway, as I tested it myself... 
[11:28:24] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: be73f155001e9095697c3c21a208c63e7bf5d2d1: Move changetags right from users to sysop [trwiki] (T264508) (duration: 00m 59s) [11:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:30] hoo: the floor is yours [11:28:32] T264508: Remove changetags right from user group & grant to sysops on Turkish Wikipedia - https://phabricator.wikimedia.org/T264508 [11:29:36] (03PS3) 10Hoo man: Revert "Remove $wgExtraLanguageNames from Wikidata and Commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632209 (https://phabricator.wikimedia.org/T264295) [11:30:08] (03CR) 10Hoo man: [C: 03+2] Revert "Remove $wgExtraLanguageNames from Wikidata and Commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632209 (https://phabricator.wikimedia.org/T264295) (owner: 10Hoo man) [11:32:05] (03Merged) 10jenkins-bot: Revert "Remove $wgExtraLanguageNames from Wikidata and Commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632209 (https://phabricator.wikimedia.org/T264295) (owner: 10Hoo man) [11:34:26] !log hoo@deploy1001 Synchronized wmf-config/: Revert "Remove $wgExtraLanguageNames from Wikidata and Commons" (T264295) (duration: 00m 59s) [11:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:33] T264295: Reinstate $wgExtraLanguageCodes in production - https://phabricator.wikimedia.org/T264295 [11:39:02] (03CR) 10Jbond: cassandra: add data types, remove validation code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630312 (owner: 10Dzahn) [11:45:58] (03PS1) 10Ssingh: dnsdist: do not set the TLS library explicitly (use dnsdist's default) [puppet] - 10https://gerrit.wikimedia.org/r/632217 (https://phabricator.wikimedia.org/T263789) [11:47:49] (03CR) 10Jbond: [C: 03+1] base: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/631307 (owner: 10Dzahn) [11:48:02] (03CR) 10Ssingh: 
"https://puppet-compiler.wmflabs.org/compiler1001/25655/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/632217 (https://phabricator.wikimedia.org/T263789) (owner: 10Ssingh) [11:49:55] (03CR) 10Jbond: [C: 03+1] wmcs::postgres: hiera->lookup and add data types [puppet] - 10https://gerrit.wikimedia.org/r/628459 (owner: 10Dzahn) [11:51:23] (03CR) 10Ssingh: [C: 03+2] dnsdist: do not set the TLS library explicitly (use dnsdist's default) [puppet] - 10https://gerrit.wikimedia.org/r/632217 (https://phabricator.wikimedia.org/T263789) (owner: 10Ssingh) [11:51:34] (03PS3) 10Hnowlan: cassandra: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) [11:52:59] (03CR) 10Muehlenhoff: [C: 03+2] Add profile::java for cergen [puppet] - 10https://gerrit.wikimedia.org/r/631197 (https://phabricator.wikimedia.org/T264177) (owner: 10Muehlenhoff) [11:53:40] (03CR) 10Klausman: [C: 03+2] aptrepo: Add rocm 3.8 packages to reprepro [puppet] - 10https://gerrit.wikimedia.org/r/631725 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [11:53:48] (03PS7) 10Klausman: aptrepo: Add rocm 3.8 packages to reprepro [puppet] - 10https://gerrit.wikimedia.org/r/631725 (https://phabricator.wikimedia.org/T264408) [11:53:51] (03CR) 10Klausman: [V: 03+2 C: 03+2] aptrepo: Add rocm 3.8 packages to reprepro [puppet] - 10https://gerrit.wikimedia.org/r/631725 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [11:54:37] 10Operations, 10Traffic, 10Patch-For-Review: Wikidough: Upgrade to dnsdist 1.5.0 - https://phabricator.wikimedia.org/T263789 (10ssingh) [11:56:41] 10Operations, 10Patch-For-Review: Switch cergen to profile::java - https://phabricator.wikimedia.org/T264177 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff cergen is now using profile::java. 
[11:56:44] 10Operations: Migrate remaining services using Java to profile::java - https://phabricator.wikimedia.org/T264174 (10MoritzMuehlenhoff) [11:56:57] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/628460 (owner: 10Dzahn) [12:00:23] 10Operations, 10DNS, 10Traffic, 10User-DannyS712: DNS_PROBE_FINISHED_NXDOMAIN for mobile version of internal.wikimedia.org - https://phabricator.wikimedia.org/T264565 (10Peachey88) [12:00:37] 10Operations, 10DNS, 10Traffic, 10Mobile, 10Patch-For-Review: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882 (10Peachey88) [12:00:55] 10Operations, 10DNS, 10Traffic: DNS_PROBE_FINISHED_NXDOMAIN for mobile version of internal.wikimedia.org - https://phabricator.wikimedia.org/T264565 (10DannyS712) [12:04:38] can i get someone with access to update the clinic victim in the topic to me? [12:05:45] instead, i've given you access [12:05:58] you know, self-service infrastructure ;) [12:07:24] cheers :) [12:09:31] (03PS4) 10Muehlenhoff: profile::java: Add support to deploy debug packages [puppet] - 10https://gerrit.wikimedia.org/r/617079 [12:10:31] (03CR) 10jerkins-bot: [V: 04-1] profile::java: Add support to deploy debug packages [puppet] - 10https://gerrit.wikimedia.org/r/617079 (owner: 10Muehlenhoff) [12:11:21] 10Operations, 10Performance-Team, 10serviceops, 10User-jijiki: MediaWiki to route spefic keys to /*/mw-with-onhost-tier/ - https://phabricator.wikimedia.org/T264604 (10jijiki) [12:12:16] 10Operations, 10Performance-Team, 10serviceops, 10User-jijiki: MediaWiki to route spefic keys to /*/mw-with-onhost-tier/ - https://phabricator.wikimedia.org/T264604 (10jijiki) [12:12:33] 10Operations, 10Performance-Team, 10serviceops, 10User-jijiki: MediaWiki to route spefic keys to /*/mw-with-onhost-tier/ - https://phabricator.wikimedia.org/T264604 (10jijiki) [12:12:37] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review, and 2 others: Reduce read pressure on mc* servers by 
adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 (10jijiki) [12:13:46] (03PS5) 10Muehlenhoff: profile::java: Add support to deploy debug packages [puppet] - 10https://gerrit.wikimedia.org/r/617079 [12:30:45] (03CR) 10Alexandros Kosiaris: "> Patch Set 4:" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/631397 (owner: 10Alexandros Kosiaris) [12:30:47] 10Puppet, 10Patch-For-Review: Bashisms in various /bin/sh scripts - https://phabricator.wikimedia.org/T95064 (10jbond) Adding a not here that ci was added to the puppet repo for any files ending in .sh in https://gerrit.wikimedia.org/r/c/operations/puppet/+/602693. however files in `modules/admin/files/home`... [12:30:59] (03PS7) 10Alexandros Kosiaris: Add pytest and a simple test for decommission [cookbooks] - 10https://gerrit.wikimedia.org/r/631448 (https://phabricator.wikimedia.org/T257297) [12:31:01] (03PS5) 10Alexandros Kosiaris: decommission: Avoid matching some IPs in regexp [cookbooks] - 10https://gerrit.wikimedia.org/r/631397 [12:32:08] akosiaris: I'll review them later today, between meetings ;) [12:32:09] thx [12:32:31] volans: you 've already +1ed the first one [12:32:46] just saying :-) [12:37:07] (03PS6) 10Muehlenhoff: profile::java: Add support to deploy debug packages [puppet] - 10https://gerrit.wikimedia.org/r/617079 [12:37:54] (03PS1) 10Klausman: Drop mivisionx from the RE of wanted packages [puppet] - 10https://gerrit.wikimedia.org/r/632219 (https://phabricator.wikimedia.org/T264408) [12:39:41] !log installing curl security updates on remaining hosts [12:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:30] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/632219 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [12:42:55] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/617079 (owner: 10Muehlenhoff) [12:43:21] (03CR) 
10Klausman: [C: 03+2] Drop mivisionx from the RE of wanted packages [puppet] - 10https://gerrit.wikimedia.org/r/632219 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [12:47:36] (03PS1) 10Jgreen: add frmx1001 to nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/632220 (https://phabricator.wikimedia.org/T260181) [12:48:42] 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: (Need By: 2020-09-30) rack/setup/install frmx1001 & frdata1002 - https://phabricator.wikimedia.org/T260181 (10Jgreen) a:05Jgreen→03None [12:52:58] (03CR) 10Jgreen: [C: 03+2] add frmx1001 to nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/632220 (https://phabricator.wikimedia.org/T260181) (owner: 10Jgreen) [12:53:24] RECOVERY - Disk space on an-master1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-master1002&var-datasource=eqiad+prometheus/ops [12:53:36] 10Operations, 10CAS-SSO, 10User-jbond: Apereo CAS expose CASCookieSameSite cia profile::idp::client::http - https://phabricator.wikimedia.org/T264605 (10jbond) [12:54:36] 10Operations, 10CAS-SSO, 10User-jbond: Apereo CAS expose CASCookieSameSite via profile::idp::client::http - https://phabricator.wikimedia.org/T264605 (10jbond) [12:56:09] 10Operations, 10CAS-SSO, 10User-jbond: Apereo CAS expose CASCookieSameSite via profile::idp::client::http - https://phabricator.wikimedia.org/T264605 (10MoritzMuehlenhoff) The patch to support the setting is not yet in the released or packaged versions of libapache2-mod-auth-cas, but if it works for us, I ca... 
[12:56:43] (03CR) 10Elukey: Import the config module from Spicerack (032 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/631909 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [12:57:23] (03CR) 10Jbond: "Im a little confused by this change specifically the linked task uses the following to run checkbashisms" [puppet] - 10https://gerrit.wikimedia.org/r/631895 (https://phabricator.wikimedia.org/T95064) (owner: 10Dzahn) [13:00:48] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/631304 (owner: 10Dzahn) [13:05:39] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/617079 (owner: 10Muehlenhoff) [13:13:21] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/25657/" [puppet] - 10https://gerrit.wikimedia.org/r/617079 (owner: 10Muehlenhoff) [13:13:55] (03CR) 10Muehlenhoff: [C: 03+2] profile::java: Add support to deploy debug packages [puppet] - 10https://gerrit.wikimedia.org/r/617079 (owner: 10Muehlenhoff) [13:19:48] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/630694 (owner: 10Dzahn) [13:23:39] (03PS6) 10Jbond: stdlib: update to version 5.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/631900 (owner: 10Dzahn) [13:24:21] (03CR) 10jerkins-bot: [V: 04-1] stdlib: update to version 5.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/631900 (owner: 10Dzahn) [13:24:59] 10Operations, 10Gerrit, 10Release-Engineering-Team (Development services), 10Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)): Migrate Gerrit to profile::java - https://phabricator.wikimedia.org/T264182 (10MoritzMuehlenhoff) >>! In T264182#6511478, @hashar wrote: > We need the `dbg` package in... 
[13:25:34] (03PS7) 10Jbond: stdlib: update to version 5.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/631900 (owner: 10Dzahn) [13:27:04] (03CR) 10Filippo Giunchedi: "It looks like this role is only used in modules/profile/manifests/mariadb/monitor/prometheus.pp, maybe we can keep the profile and get rid" [puppet] - 10https://gerrit.wikimedia.org/r/631288 (https://phabricator.wikimedia.org/T159412) (owner: 10Dzahn) [13:31:58] !log shutdown an-master1002 for ram expansion (64 -> 128G) [13:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:15] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Jbond already started a fleet wide PCC at https://puppet-compiler.wmflabs.org/compiler1001/25659/ so I guess we 'll find out soon enough, " [puppet] - 10https://gerrit.wikimedia.org/r/631900 (owner: 10Dzahn) [13:33:31] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Bump the version in Chart.yaml as well, otherwise +1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/631437 (https://phabricator.wikimedia.org/T219919) (owner: 10Filippo Giunchedi) [13:36:58] (03PS2) 10Filippo Giunchedi: citoid: stop using gelf for logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/631437 (https://phabricator.wikimedia.org/T219919) [13:38:32] (03PS1) 10Muehlenhoff: Switch gerrit to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/632224 (https://phabricator.wikimedia.org/T264182) [13:49:39] (03PS15) 10Jbond: puppet-merge: add Repository class [puppet] - 10https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249) [13:50:01] (03CR) 10Filippo Giunchedi: [C: 03+2] citoid: stop using gelf for logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/631437 (https://phabricator.wikimedia.org/T219919) (owner: 10Filippo Giunchedi) [13:51:06] (03CR) 10Jbond: "Thanks for the review, I have updated the minor issues, ill do more testing around `capture_output=True` to see if i can jog my memory for" (035 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249) (owner: 10Jbond) [13:53:01] RECOVERY - cassandra-c CQL 10.64.48.236:9042 on restbase1030 is OK: TCP OK - 0.001 second response time on 10.64.48.236 port 9042 https://phabricator.wikimedia.org/T93886 [13:54:01] PROBLEM - Host stat1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:54:54] !log shutdown stat1005 for ram upgrade [13:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:30] !log filippo@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . [13:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:12] (03CR) 10Jbond: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/631288 (https://phabricator.wikimedia.org/T159412) (owner: 10Dzahn) [13:58:33] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/25660/" [puppet] - 10https://gerrit.wikimedia.org/r/632224 (https://phabricator.wikimedia.org/T264182) (owner: 10Muehlenhoff) [13:58:41] !log filippo@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . [13:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:17] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:36] !log filippo@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . [14:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:37] 10Operations, 10SRE-tools, 10Continuous-Integration-Config, 10Patch-For-Review: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494 (10jbond) I think this has been added via https://gerrit.wikimedia.org/r/c/operations/puppet/+/602693 can anyone confirm if we are still missing anyth... 
[14:12:49] PROBLEM - Host stat1005 is DOWN: PING CRITICAL - Packet loss = 100% [14:13:25] 10Operations, 10Wikimedia-Logstash, 10observability, 10service-runner, 10Patch-For-Review: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10fgiunchedi) [14:13:44] !log ppchelko@deploy1001 Started deploy [restbase/deploy@366a543]: T263133 T264035 [14:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:50] T263133: Move feed assembly from RESTBase to Wikifeeds - https://phabricator.wikimedia.org/T263133 [14:13:50] 10Operations, 10Citoid, 10Wikimedia-Logstash, 10observability, and 2 others: Move citoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219919 (10fgiunchedi) 05Open→03Resolved Citoid is using only k8s-native logging now, resolving. [14:13:50] T264035: [Bug] The feed/featured endpoint does not handle language variant correctly in zhwiki - https://phabricator.wikimedia.org/T264035 [14:15:46] (03PS1) 10Filippo Giunchedi: wikifeeds: use k8s stdout logging only [deployment-charts] - 10https://gerrit.wikimedia.org/r/632243 (https://phabricator.wikimedia.org/T245604) [14:16:07] (03PS2) 10Ppchelko: Force local short descriptions for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631775 (https://phabricator.wikimedia.org/T263493) [14:20:49] RECOVERY - Host stat1005 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [14:20:51] (03PS4) 10Hnowlan: cassandra: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) [14:23:13] RECOVERY - Host stat1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [14:24:36] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10Cmjohnson) @robh @wiki_willy I am looking at the packing slip and what I have in the data center and it appears we're 4 DIMM short. 
The pac... [14:25:18] (03CR) 10Kormat: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/631288 (https://phabricator.wikimedia.org/T159412) (owner: 10Dzahn) [14:25:22] !log shutdown an-master1001 for ram expansion [14:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:47] PROBLEM - Host an-master1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:29:01] cmjohnson1: I'll reopen the memory task [14:29:05] (03PS1) 10Andrew Bogott: cloud-vps instance backups: only save 4 days of backups rather than 7 [puppet] - 10https://gerrit.wikimedia.org/r/632246 [14:29:08] but you wanna reopen the procurement in future not comment on install task [14:29:12] for the memory install [14:29:24] ie: you are lucky i spotted in my alerts ;D [14:30:10] 10Operations, 10SRE-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (10Kormat) Hi Leila, I'm the SRE clinic duty peon this week :) When looking at this task, i noticed a few odd things: - The wikitech user `leila` doe... 
[14:30:45] 10Operations, 10ops-eqiad, 10Analytics-Clusters: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10Cmjohnson) [14:31:10] 10Operations, 10ops-eqiad, 10Analytics-Clusters: (Need By: TBD) upgrade ram in an-master100[12] - https://phabricator.wikimedia.org/T259162 (10Cmjohnson) 05Open→03Resolved Task complete [14:31:27] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10Cmjohnson) [14:31:43] (03PS1) 10Klausman: modules: Add functionality to allow use of 3.8 rocm packages [puppet] - 10https://gerrit.wikimedia.org/r/632248 (https://phabricator.wikimedia.org/T264408) [14:34:09] (03PS2) 10Fdans: dumps::web::fetches::stat_dumps: add rsync job for pageview complete [puppet] - 10https://gerrit.wikimedia.org/r/629409 (https://phabricator.wikimedia.org/T251777) [14:34:29] RECOVERY - Host an-master1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [14:36:07] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@366a543]: T263133 T264035 (duration: 22m 23s) [14:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:14] T263133: Move feed assembly from RESTBase to Wikifeeds - https://phabricator.wikimedia.org/T263133 [14:36:14] T264035: [Bug] The feed/featured endpoint does not handle language variant correctly in zhwiki - https://phabricator.wikimedia.org/T264035 [14:37:50] (03CR) 10Ppchelko: [C: 03+2] changeprop: Add x-request-id header to jobqueue requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/631794 (owner: 10Clarakosi) [14:38:03] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (10Reedy) [14:39:00] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10serviceops: [Bug] The feed/featured 
endpoint is broken - https://phabricator.wikimedia.org/T263043 (10Pchelolo) [14:39:27] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10serviceops, 10Platform Team Workboards (Clinic Duty Team): Move feed assembly from RESTBase to Wikifeeds - https://phabricator.wikimedia.org/T263133 (10Pchelolo) 05Open→03Resolved [14:39:31] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10RobH) [14:40:13] (03Merged) 10jenkins-bot: changeprop: Add x-request-id header to jobqueue requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/631794 (owner: 10Clarakosi) [14:41:00] !log shutdown stat1005 and stat1008 for ram expansion (1005 again) [14:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:15] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps instance backups: only save 4 days of backups rather than 7 [puppet] - 10https://gerrit.wikimedia.org/r/632246 (owner: 10Andrew Bogott) [14:42:14] (03CR) 10Alexandros Kosiaris: [C: 03+1] wikifeeds: use k8s stdout logging only [deployment-charts] - 10https://gerrit.wikimedia.org/r/632243 (https://phabricator.wikimedia.org/T245604) (owner: 10Filippo Giunchedi) [14:43:18] 10Operations, 10SRE-tools, 10User-Joe, 10User-jijiki: Spicerack cookbooks TODO list - https://phabricator.wikimedia.org/T203943 (10JMeybohm) [14:43:21] 10Operations, 10SRE-tools, 10serviceops: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (10JMeybohm) [14:44:05] PROBLEM - Host stat1008 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:27] PROBLEM - Host stat1005 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:48] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10RobH) >>! 
In T260448#6517598, @Cmjohnson wrote: > @robh @wiki_willy I am looking at the packing slip and what I have in the data center and... [14:47:47] (03PS1) 10Jbond: hadoop::common: add additional info to fail message [puppet] - 10https://gerrit.wikimedia.org/r/632250 [14:48:03] PROBLEM - Host stat1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:50:06] (03CR) 10Hnowlan: "pcc for aqs, maps, restbase, sessionstore: https://puppet-compiler.wmflabs.org/compiler1002/25665/" [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) (owner: 10Hnowlan) [14:52:33] (03PS1) 10Andrew Bogott: cloud-vps backups: exclude a few more things from backups [puppet] - 10https://gerrit.wikimedia.org/r/632253 [14:53:45] RECOVERY - Host stat1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.47 ms [14:55:00] (03PS1) 10Herron: admin: update user sbailey ssh key [puppet] - 10https://gerrit.wikimedia.org/r/632254 (https://phabricator.wikimedia.org/T264127) [14:55:27] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [14:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:48] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Please replace Shannon Baileys SSH key - https://phabricator.wikimedia.org/T264127 (10herron) New key has been confirmed via google chat and email [14:56:30] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . 
[14:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:21] RECOVERY - Host stat1005 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [15:07:23] (03CR) 10Paladox: Switch gerrit to profile::java (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/632224 (https://phabricator.wikimedia.org/T264182) (owner: 10Muehlenhoff) [15:10:31] PROBLEM - Host stat1008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:47] this is maintenance --^ [15:13:13] RECOVERY - Disk space on an-coord1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-coord1001&var-datasource=eqiad+prometheus/ops [15:15:16] !log ppchelko@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [15:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:31] (03CR) 10ArielGlenn: [C: 03+1] "Seems ok to me but Brooke should really give the final thumbs up." [puppet] - 10https://gerrit.wikimedia.org/r/629409 (https://phabricator.wikimedia.org/T251777) (owner: 10Fdans) [15:21:57] RECOVERY - Host stat1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.52 ms [15:22:02] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10Cmjohnson) [15:22:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:22:13] 10Operations, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T264630 (10CGlenn) [15:23:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:28:04] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Migrate WDQS to profile::java - https://phabricator.wikimedia.org/T264181 (10CBogen) p:05Medium→03High [15:29:16] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Migrate WDQS to profile::java - https://phabricator.wikimedia.org/T264181 (10CBogen) p:05High→03Medium [15:29:20] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Migrate WDQS to profile::java - https://phabricator.wikimedia.org/T264181 (10CBogen) p:05Medium→03High [15:31:24] (03CR) 10MSantos: [C: 03+2] wikifeeds: use k8s stdout logging only [deployment-charts] - 10https://gerrit.wikimedia.org/r/632243 (https://phabricator.wikimedia.org/T245604) (owner: 10Filippo Giunchedi) [15:32:27] (03CR) 10Jbond: [C: 03+1] "LGTM the 3 failures from PCC can be safley ignored:" [puppet] - 10https://gerrit.wikimedia.org/r/631900 (owner: 10Dzahn) [15:34:04] (03PS1) 10Jgreen: flip payments.wm.o to codfw to test mw 1.35 upgrade [dns] - 10https://gerrit.wikimedia.org/r/632262 (https://phabricator.wikimedia.org/T254298) [15:34:36] (03CR) 10Jgreen: [C: 03+2] flip payments.wm.o to codfw to test mw 1.35 upgrade [dns] - 10https://gerrit.wikimedia.org/r/632262 (https://phabricator.wikimedia.org/T254298) (owner: 10Jgreen) [15:34:38] (03Merged) 10jenkins-bot: wikifeeds: use k8s stdout logging only [deployment-charts] - 10https://gerrit.wikimedia.org/r/632243 (https://phabricator.wikimedia.org/T245604) (owner: 10Filippo Giunchedi) [15:34:42] (03PS2) 10Jbond: hadoop::common: add additional info to fail message [puppet] - 10https://gerrit.wikimedia.org/r/632250 [15:35:18] (03CR) 10Muehlenhoff: [C: 03+1] "Nice work!" 
[puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) (owner: 10Hnowlan) [15:36:18] (03CR) 10Jbond: "Added to debug a separate issues but seems generally useful" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/632250 (owner: 10Jbond) [15:38:14] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/632254 (https://phabricator.wikimedia.org/T264127) (owner: 10Herron) [15:40:43] PROBLEM - Host stat1008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:40:54] this is still maintenance --^ [15:42:00] (03PS2) 10Andrew Bogott: cloud-vps backups: exclude a few more things from backups [puppet] - 10https://gerrit.wikimedia.org/r/632253 [15:42:15] (03CR) 10Elukey: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/632250 (owner: 10Jbond) [15:43:04] (03CR) 10Jbond: [C: 03+2] hadoop::common: add additional info to fail message [puppet] - 10https://gerrit.wikimedia.org/r/632250 (owner: 10Jbond) [15:46:31] 10Operations, 10ops-eqiad, 10Analytics-Radar: an-presto1004 down - https://phabricator.wikimedia.org/T253438 (10Cmjohnson) The dell tech is back today with new power supplies, he took the system down to the bare minimum and slowly started adding things back, and once he connected the backplane there was smo... [15:47:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:48:19] (03CR) 10Thcipriani: cloud-vps backups: exclude a few more things from backups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/632253 (owner: 10Andrew Bogott) [15:49:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:52:37] (03PS3) 10Andrew Bogott: cloud-vps backups: exclude a few more things from backups [puppet] - 10https://gerrit.wikimedia.org/r/632253 [15:55:48] (03PS4) 10Andrew Bogott: cloud-vps backups: exclude a few more things from backups [puppet] - 10https://gerrit.wikimedia.org/r/632253 [15:55:52] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps backups: exclude a few more things from backups [puppet] - 10https://gerrit.wikimedia.org/r/632253 (owner: 10Andrew Bogott) [15:56:31] (03PS1) 10Muehlenhoff: Add component/icu63 for stretch-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/632264 [15:56:44] (03CR) 10Herron: [C: 03+2] admin: update user sbailey ssh key [puppet] - 10https://gerrit.wikimedia.org/r/632254 (https://phabricator.wikimedia.org/T264127) (owner: 10Herron) [16:03:51] RECOVERY - Host stat1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.71 ms [16:06:42] 10Operations, 10SRE-tools, 10serviceops, 10User-jijiki: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (10JMeybohm) @MoritzMuehlenhoff wrote some generic code to do rolling reboots for groups of hosts that could probably be utilized here: h... [16:17:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:18:51] PROBLEM - Host stat1008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:20:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:23:30] 10Operations, 10Analytics, 10Traffic: ~1 request/minute to intake-logging.wikimedia.org times out at the traffic/service interface - https://phabricator.wikimedia.org/T264021 (10fdans) Just pinging @Ottomata for when he's back from vacation. [16:26:29] 10Operations, 10Analytics-Radar, 10Traffic, 10Wikimedia-General-or-Unknown: Cookie “WMF-Last-Access-Global” has been rejected for invalid domain. - https://phabricator.wikimedia.org/T261803 (10fdans) [16:26:51] (03PS1) 10Jgreen: flip payments back to eqiad, test is complete [dns] - 10https://gerrit.wikimedia.org/r/632288 (https://phabricator.wikimedia.org/T254298) [16:28:04] (03CR) 10Jgreen: [C: 03+2] flip payments back to eqiad, test is complete [dns] - 10https://gerrit.wikimedia.org/r/632288 (https://phabricator.wikimedia.org/T254298) (owner: 10Jgreen) [16:28:06] 10Operations, 10Analytics, 10Traffic, 10netops: Turnilo: per-second rates for wmf_netflow bytes + packets - https://phabricator.wikimedia.org/T263290 (10fdans) [16:28:09] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10fdans) [16:28:51] RECOVERY - Host stat1008 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [16:30:27] RECOVERY - Host stat1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [16:32:14] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10Cmjohnson) stat1008, I added all the DIMM and the server would not boot, I received the following error UEFI0060: Power required by the syst... 
[16:34:31] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (10BBlack) With the dupe merger, maybe we owe a status update here: We're pretty sure this is a bug in Apache Tra... [16:36:04] 10Operations, 10Analytics-Clusters: Switch Zookeeper to profile::java - https://phabricator.wikimedia.org/T264176 (10fdans) [16:36:21] (03CR) 10Volans: "reply inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249) (owner: 10Jbond) [16:37:32] 10Operations, 10ops-eqiad, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps10[05-10].eqiad.wmnet - https://phabricator.wikimedia.org/T260269 (10Cmjohnson) chatted with Ryan in IRC, Raid10 is needed. I will get that set up and ready for the initial install/puppet [16:37:39] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 108.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [16:40:22] (03PS1) 10Hnowlan: changeprop-jobqueue: Turn down loglevel [deployment-charts] - 10https://gerrit.wikimedia.org/r/632289 (https://phabricator.wikimedia.org/T264195) [16:41:23] (03CR) 10Volans: [C: 04-1] "replies inline (just keeping the old vote)" (032 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/631909 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [16:41:55] 10Operations, 10Analytics-Clusters, 10Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (10fdans) a:03klausman [16:42:19] (03CR) 10Ppchelko: [C: 03+1] changeprop-jobqueue: Turn down loglevel [deployment-charts] - 
10https://gerrit.wikimedia.org/r/632289 (https://phabricator.wikimedia.org/T264195) (owner: 10Hnowlan) [16:43:04] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/631397 (owner: 10Alexandros Kosiaris) [16:43:43] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: Turn down loglevel [deployment-charts] - 10https://gerrit.wikimedia.org/r/632289 (https://phabricator.wikimedia.org/T264195) (owner: 10Hnowlan) [16:45:52] (03Merged) 10jenkins-bot: changeprop-jobqueue: Turn down loglevel [deployment-charts] - 10https://gerrit.wikimedia.org/r/632289 (https://phabricator.wikimedia.org/T264195) (owner: 10Hnowlan) [16:50:26] (03CR) 10Dzahn: [C: 03+2] admins/bd808: add bash shebang to .bash scripts [puppet] - 10https://gerrit.wikimedia.org/r/631893 (https://phabricator.wikimedia.org/T95064) (owner: 10Dzahn) [16:51:13] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [16:51:13] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [16:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:00] (03CR) 10Volans: "replies inline" (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/631448 (https://phabricator.wikimedia.org/T257297) (owner: 10Alexandros Kosiaris) [16:56:29] 10Operations, 10Traffic, 10Platform Team Initiatives (API Gateway), 10Story: Client Developer has a cookie-free API call - https://phabricator.wikimedia.org/T258748 (10eprodromou) [16:59:09] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [16:59:09] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . 
[16:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:55] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10RobH) [17:00:04] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [17:00:04] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [17:00:04] ryankemper: #bothumor I � Unicode. All rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201005T1700). [17:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:56] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10Cmjohnson) physically moved an-worker1111 from C8 to C2, updated network switch and netbox. vlan and IP stay the same physically moved an-worker1113/1... [17:09:09] (03PS1) 10Elukey: Set an-worker111[5-7] as Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/632294 (https://phabricator.wikimedia.org/T255140) [17:09:41] (03CR) 10Elukey: [C: 03+2] Set an-worker111[5-7] as Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/632294 (https://phabricator.wikimedia.org/T255140) (owner: 10Elukey) [17:14:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1111.eqiad.wmnet'] ` The l... 
[17:19:15] (03PS1) 10Andrew Bogott: Cloudvirt1021 and 1022 to Buster and Ceph [puppet] - 10https://gerrit.wikimedia.org/r/632296 (https://phabricator.wikimedia.org/T259399) [17:20:17] 10Operations, 10SRE-tools, 10Continuous-Integration-Config, 10Patch-For-Review: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494 (10hashar) >>! In T148494#6517546, @jbond wrote: > I think this has been added via https://gerrit.wikimedia.org/r/c/operations/puppet/+/602693 can any... [17:21:30] PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:21:34] PROBLEM - Check systemd state on an-worker1117 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:21:38] 10Operations, 10Traffic, 10Patch-For-Review: Wikidough: Upgrade to dnsdist 1.5.0 - https://phabricator.wikimedia.org/T263789 (10ssingh) Another important change in 1.5.0 is https://github.com/PowerDNS/pdns/pull/7138 [dnsdist/rec: Drop remaining capabilities after startup]. For our dnsdist instance, this is h... 
[17:22:37] the two an-worker are new [17:23:32] RECOVERY - Check systemd state on an-worker1116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:23:34] RECOVERY - Check systemd state on an-worker1117 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:23:56] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:24:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:25:32] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [17:25:32] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . 
[17:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:09] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [17:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:06] (03CR) 10Hashar: Switch gerrit to profile::java (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/632224 (https://phabricator.wikimedia.org/T264182) (owner: 10Muehlenhoff) [17:29:06] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1111.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1111.eqiad.wmn... [17:39:25] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1113.eqiad.wmnet', 'an-wor... 
[17:42:45] (03PS2) 10Dzahn: base: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/631307 [17:43:07] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10RKemper) [17:44:58] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [17:49:00] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack [17:51:49] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [17:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:45] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:03] 10Operations, 10serviceops, 10Patch-For-Review: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10JMeybohm) 05Open→03Resolved This was some kind of connection issue to a apt repo while installing debmonitor-cli: ` Sep 28 14:51:47 deneb docker-report-releng[2358... [18:00:04] RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Morning backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201005T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. 
[18:05:07] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1113.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1114.eqiad.wmn... [18:07:11] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/25673/" [puppet] - 10https://gerrit.wikimedia.org/r/631307 (owner: 10Dzahn) [18:07:48] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Please replace Shannon Baileys SSH key - https://phabricator.wikimedia.org/T264127 (10herron) 05Open→03Resolved a:03herron Hi @Sbailey, the updated SSH key has been deployed to servers by now. Please re-open if any follow-up is needed. Thanks! [18:09:14] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Please replace Shannon Baileys SSH key - https://phabricator.wikimedia.org/T264127 (10Sbailey) Thank you. I tested in and can now access scandium as needed. 
[18:09:48] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 3 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack [18:10:08] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers [18:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:23] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "and another one for cloud https://puppet-compiler.wmflabs.org/compiler1002/25675/" [puppet] - 10https://gerrit.wikimedia.org/r/631307 (owner: 10Dzahn) [18:11:59] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) [18:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:42] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers [18:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:46] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) [18:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:18] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers [18:17:19] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) [18:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:23] (03CR) 10Dzahn: [C: 03+1] hieradata: move deployment-prep swift settings off Horizon [puppet] - 10https://gerrit.wikimedia.org/r/631758 (owner: 10Filippo Giunchedi) [18:20:52] (03CR) 10Dzahn: [C: 03+2] Remove obsolete Hiera settings for edge bastions [puppet] - 10https://gerrit.wikimedia.org/r/632211 (owner: 10Muehlenhoff) [18:22:33] (03CR) 10Dzahn: "This will add screen monitoring back for bast4002 - but that's alright." 
[puppet] - 10https://gerrit.wikimedia.org/r/632211 (owner: 10Muehlenhoff) [18:23:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:24:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:24:50] (03CR) 10Dzahn: "bast5001: noop, bast3004: noop, bast4002: added icinga check and script" [puppet] - 10https://gerrit.wikimedia.org/r/632211 (owner: 10Muehlenhoff) [18:27:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:28:40] (03CR) 10Dzahn: [C: 03+1] "seems worth trying, though also see the "use only if..." part in the official docs: https://httpd.apache.org/docs/2.4/mod/core.html" [puppet] - 10https://gerrit.wikimedia.org/r/631952 (https://phabricator.wikimedia.org/T261031) (owner: 10Ladsgroup) [18:28:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:29:26] (03CR) 10Dzahn: [C: 03+1] "failed to link correctly, just pasting it directly: "AddDefaultCharset should only be used when all of the text resources to which it app" [puppet] - 10https://gerrit.wikimedia.org/r/631952 (https://phabricator.wikimedia.org/T261031) (owner: 10Ladsgroup) [18:33:57] legoktm: you'll want to mark this with {{used}} and ~~~~: https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2020_Purge#codereview [18:34:30] mutante: codereview isn't me [18:35:10] https://openstack-browser.toolforge.org/project/codereview Luke081515 apparently [18:36:14] legoktm: oh, nevermind. i somehow read it as codesearch [18:36:31] :) I marked that one already [18:36:39] 10Operations, 10Performance-Team, 10serviceops, 10User-jijiki: MediaWiki to route spefic keys to /*/mw-with-onhost-tier/ - https://phabricator.wikimedia.org/T264604 (10Gilles) a:03aaron [18:36:54] ack, thanks [18:36:59] 10Operations, 10Performance-Team, 10serviceops, 10User-jijiki: MediaWiki to route spefic keys to /*/mw-with-onhost-tier/ - https://phabricator.wikimedia.org/T264604 (10aaron) p:05Triage→03Medium [18:37:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:38:19] 10Operations, 10Traffic, 10Performance-Team (Radar): Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (10Gilles) [18:38:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:42:27] (03PS2) 10Ebernhardson: envoy: Set appropriate service names for three level wikimedia.org domains [puppet] - 10https://gerrit.wikimedia.org/r/631503 (https://phabricator.wikimedia.org/T263073) [18:42:29] (03CR) 10Ebernhardson: envoy: Set appropriate service names for three level wikimedia.org domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/631503 (https://phabricator.wikimedia.org/T263073) (owner: 10Ebernhardson) [18:46:31] !log mforns@deploy1001 Started deploy [analytics/refinery@2c6c335]: Special deployment to unblock deletion jobs [analytics/refinery@2c6c335e61cecd0321ec6f066a153feaf2dbbc27] [18:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:17] (03CR) 10Ebernhardson: "updated pcc, same output as before: https://puppet-compiler.wmflabs.org/compiler1002/25676/" [puppet] - 10https://gerrit.wikimedia.org/r/631503 (https://phabricator.wikimedia.org/T263073) (owner: 10Ebernhardson) [18:56:43] (03PS5) 10Dzahn: consolidate bastionhost roles, remove module [puppet] - 10https://gerrit.wikimedia.org/r/631858 [18:58:39] !log mforns@deploy1001 Finished deploy [analytics/refinery@2c6c335]: Special deployment to unblock deletion jobs [analytics/refinery@2c6c335e61cecd0321ec6f066a153feaf2dbbc27] (duration: 12m 08s) [18:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:01] !log mforns@deploy1001 Started deploy [analytics/refinery@2c6c335] (thin): [THIN] Special deployment to unblock deletion jobs [analytics/refinery@2c6c335e61cecd0321ec6f066a153feaf2dbbc27] [18:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:10] !log mforns@deploy1001 Finished deploy [analytics/refinery@2c6c335] (thin): [THIN] Special deployment to unblock deletion jobs 
[analytics/refinery@2c6c335e61cecd0321ec6f066a153feaf2dbbc27] (duration: 00m 08s) [18:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:37] (03CR) 10Dzahn: [V: 03+1] "also: openstack-browser was repaired and confirmed the "opsonly" role is NOT used in cloud. they all use role::labs::bastion including the" [puppet] - 10https://gerrit.wikimedia.org/r/631858 (owner: 10Dzahn) [19:00:48] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/25677/" [puppet] - 10https://gerrit.wikimedia.org/r/631858 (owner: 10Dzahn) [19:10:45] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:14:57] chaomodus: ^^^^ [19:16:27] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [19:20:20] volans: taking a look what it is [19:21:18] (03PS2) 10Andrew Bogott: Cloudvirt1021 and 1022 to Buster and Ceph [puppet] - 10https://gerrit.wikimedia.org/r/632296 (https://phabricator.wikimedia.org/T259399) [19:23:25] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): labspuppetbackend service can fail to intialize when DNS blip happens at the wrong time - https://phabricator.wikimedia.org/T264658 (10bd808) [19:23:29] elukey: should an-worker1113 be added to DNS? 
it would need an sre.dns.netbox run to be committed [19:23:41] (03PS3) 10Andrew Bogott: Cloudvirt1021 and 1022 to Buster and Ceph [puppet] - 10https://gerrit.wikimedia.org/r/632296 (https://phabricator.wikimedia.org/T259399) [19:24:03] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): labspuppetbackend service can fail to intialize when DNS blip happens at the wrong time - https://phabricator.wikimedia.org/T264658 (10bd808) `lang=irc [19:00] bd808: hm, uwsgi caught the error, would be nice if it retried [19:01] < b... [19:24:54] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): labspuppetbackend service can fail to intialize when DNS blip happens at the wrong time - https://phabricator.wikimedia.org/T264658 (10Andrew) a:03Andrew [19:27:39] (03PS2) 10Mforns: [WIP] Drop /wmf/data/raw/mediawiki_job and /wmf/data/raw/netflow after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/628895 (https://phabricator.wikimedia.org/T231339) (owner: 10Ottomata) [19:27:50] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [19:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:35] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [19:30:54] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:01] (03PS4) 10Andrew Bogott: Cloudvirt1021 and 1022 to Buster and Ceph [puppet] - 10https://gerrit.wikimedia.org/r/632296 (https://phabricator.wikimedia.org/T259399) [19:31:39] !log ran sre.dns.netbox to push addition of an-worker1113 which was commited in prod repo but not in netbox data [19:31:42] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:02] (03CR) 10Mforns: "I modified a bit the regexes: '[^/]+' instead of '.+', to allow for path traversal pruning using partial matches!" [puppet] - 10https://gerrit.wikimedia.org/r/628895 (https://phabricator.wikimedia.org/T231339) (owner: 10Ottomata) [19:33:42] (03CR) 10Mforns: [WIP] Drop /wmf/data/raw/mediawiki_job and /wmf/data/raw/netflow after 90 days (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/628895 (https://phabricator.wikimedia.org/T231339) (owner: 10Ottomata) [19:35:24] mutante: thanks, I think that host had some trouble on reimage [19:35:27] elukey: committed the an-worker1113 addition in netbox data [19:35:49] I'll check if anything failed in the morning [19:35:52] volans: yea, so now i am wondering why it's still not recovered in icinga even though i committed and told icinga to re-check [19:36:08] takes a bit [19:36:08] running it a second time to confirm no more diff [19:36:28] yea, but usually i am impatient and click "reschedule next service check" to force it [19:36:31] because the check takes a bit it's done via a timer [19:36:37] that storea in a file [19:36:41] gotcha [19:36:45] and nrpe checks only the file [19:36:55] makes sense [19:37:03] 10Operations, 10Performance-Team, 10serviceops, 10User-jijiki: MediaWiki to route specific keys to /*/mw-with-onhost-tier/ - https://phabricator.wikimedia.org/T264604 (10Aklapper) [19:37:03] because it's slower than global nrpe timeout [19:37:24] I should add it to the wikitexh page [19:37:38] should recover in few minutes [19:38:28] I can confirm sre.dns.netbox "test" now shows no more diff. 
yep [19:38:37] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:38:45] haha,ok [19:38:49] :) [19:39:00] thx [19:40:15] np, going for lunch [19:49:28] (03PS3) 10Razzi: oozie: use admin groups to determine admin access [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) [19:53:12] PROBLEM - LVS zotero codfw port 4969/tcp - Zotero- zotero.svc.eqiad.wmnet IPv4 #page on zotero.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:53:24] 👀 [19:54:00] * volans on mobile bit get to the laptop if needed [19:54:05] *but [19:54:30] * akosiaris around [19:54:48] RECOVERY - LVS zotero codfw port 4969/tcp - Zotero- zotero.svc.eqiad.wmnet IPv4 #page on zotero.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 196 bytes in 1.210 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:54:50] * jbond42 here but not famliure with zotero (checking wiki's) [19:55:07] spike of 500s visible on https://grafana.wikimedia.org/d/NJkCVermz/citoid [19:55:10] * akosiaris guesses some weird citation [19:55:16] <_joe_> a cpu usage spike [19:55:22] peaking at ~0.01 rps [19:55:23] <_joe_> on zotero itself [19:55:34] * godog here too [19:55:44] _joe_: huh, I couldn't find it at https://grafana.wikimedia.org/d/2oPtfvXWk/zotero [19:55:55] <_joe_> rzl: codfw [19:56:00] dammit [19:56:03] yeah of course [19:56:03] https://grafana.wikimedia.org/d/2oPtfvXWk/zotero?orgId=1&refresh=1m&var-dc=codfw%20prometheus%2Fk8s&var-service=zotero [19:56:04] <_joe_> https://grafana.wikimedia.org/d/2oPtfvXWk/zotero?viewPanel=52&orgId=1&refresh=1m&var-dc=codfw%20prometheus%2Fk8s&var-service=zotero [19:56:05] ouch [19:56:15] !log restart elasticsearch_6@production-search-codfw on elastic2050 to take reduced (128kB) readahead settings 
T264053 [19:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:20] T264053: Unsustainable increases in Elasticsearch cluster disk IO - https://phabricator.wikimedia.org/T264053 [19:56:28] <_joe_> so zotero is using a lot of cpu [19:56:37] interestingly not memory issues this time [19:56:41] <_joe_> yeah [19:56:42] and disregard my "0.01 rps" obviously, that was eqiad too [19:57:02] <_joe_> it's all the pods or just one? [19:59:03] <_joe_> heh it completely recovered [19:59:40] https://w.wiki/fHA [19:59:45] it looks like it was four of them initially and now just one [19:59:50] four pods that is [20:00:04] chrisalbon and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201005T2000). [20:00:09] (explore link, log in to grafana if you aren't) [20:00:21] yeah, I only see like 5 as well [20:00:48] 4 of them around the same time and a fifth one a little bit lagging behind [20:00:59] <_joe_> so I guess akosiaris was right [20:00:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:01:01] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [20:01:09] <_joe_> uh oh [20:01:21] <_joe_> already recovered [20:01:22] hi, I see I'm late for the fun [20:02:45] RECOVERY - High average GET latency for mw requests on appserver in codfw on 
alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [20:02:47] I don't see a need to do something right now [20:02:54] <_joe_> nope [20:03:07] <_joe_> go have the rest of your evening :) [20:03:26] * akosiaris back to bed [20:04:18] * jbond42 back to tv [20:04:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:09:24] (03CR) 10Andrew Bogott: [C: 03+2] Cloudvirt1021 and 1022 to Buster and Ceph [puppet] - 10https://gerrit.wikimedia.org/r/632296 (https://phabricator.wikimedia.org/T259399) (owner: 10Andrew Bogott) [20:13:27] !log restart elasticsearch_6@production-search-codfw on elastic2051 to take reduced (64 sector, 32kB) readahead settings T264053 [20:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:33] T264053: Unsustainable increases in Elasticsearch cluster disk IO - https://phabricator.wikimedia.org/T264053 [20:16:08] (03CR) 10Razzi: "Thanks for the review Elukey; updated." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi) [20:19:31] (03CR) 10Razzi: "Puppet catalog diff: https://puppet-compiler.wmflabs.org/compiler1003/25679/an-coord1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi) [20:26:13] !log restart elasticsearch_6@production-search-codfw on elastic2051 to take reduced (32 sector, 16kB) readahead settings T264053 [20:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:18] T264053: Unsustainable increases in Elasticsearch cluster disk IO - https://phabricator.wikimedia.org/T264053 [20:28:26] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [20:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:23] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:27] (03PS1) 10Ebernhardson: Lower elasticsearch readahead from 128kB to 16kB [puppet] - 10https://gerrit.wikimedia.org/r/632319 (https://phabricator.wikimedia.org/T264053) [20:54:11] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [20:59:16] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [20:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:05] Reedy and sbassett: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201005T2100). [21:01:15] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:14] (03CR) 10Dzahn: "wow, this structure is really special. in profile::mariadb::monitor::prometheus it then instantiates another role class..." [puppet] - 10https://gerrit.wikimedia.org/r/631288 (https://phabricator.wikimedia.org/T159412) (owner: 10Dzahn) [21:11:53] Jdlrobson: hi, thank you for the fix in T264665. We still need to add one or two lines of code to ptwiki's Common.js; where can we see that chart to check that the code has no errors? [21:11:53] T264665: Edit to pt:MediaWiki:Common.js led to huge client side error spike - https://phabricator.wikimedia.org/T264665 [21:13:27] (03PS5) 10Dzahn: prometheus: convert mysqld_exporter from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631288 (https://phabricator.wikimedia.org/T159412) [21:25:57] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [21:27:16] (03PS1) 10Cicalese: Add API Portal to $wgCentralAuthAutoLoginWikis - beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632322 (https://phabricator.wikimedia.org/T264637) [21:29:35] (03PS1) 10Cicalese: Add API Portal to $wgCentralAuthAutoLoginWikis - prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632323 (https://phabricator.wikimedia.org/T264637) [21:31:50] (03CR) 10Ppchelko: [C: 03+2] Add API Portal to $wgCentralAuthAutoLoginWikis - beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632322 
(https://phabricator.wikimedia.org/T264637) (owner: 10Cicalese) [21:32:22] (03CR) 10Ppchelko: [C: 03+1] "Please self-merge whenever you're ready for deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632323 (https://phabricator.wikimedia.org/T264637) (owner: 10Cicalese) [21:33:05] (03Merged) 10jenkins-bot: Add API Portal to $wgCentralAuthAutoLoginWikis - beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632322 (https://phabricator.wikimedia.org/T264637) (owner: 10Cicalese) [21:34:29] (03CR) 10Bstorm: [C: 03+2] Revert "toolforge: Temp handling for tools.wmflabs.org/wpcleaner" [puppet] - 10https://gerrit.wikimedia.org/r/628148 (https://phabricator.wikimedia.org/T258813) (owner: 10Nskaggs) [21:47:51] (03CR) 10Dzahn: "> * move all the code from role::prometheus::mysqld_exporter to profile::mariadb::monitor::prometheus" [puppet] - 10https://gerrit.wikimedia.org/r/631288 (https://phabricator.wikimedia.org/T159412) (owner: 10Dzahn) [21:48:06] (03PS6) 10Dzahn: prometheus: convert mysqld_exporter from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631288 (https://phabricator.wikimedia.org/T159412) [22:01:24] !log restore wikidatawiki_content enwiki_content enwiki_general and commonswiki_file to default index.merge.policy.deletes_pct_allowed on eqiad cirrus cluster T264053 [22:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:30] T264053: Unsustainable increases in Elasticsearch cluster disk IO - https://phabricator.wikimedia.org/T264053 [22:05:03] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:08:15] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:14:05] 
(03PS4) 10Dzahn: remove shinken module, profile, role [puppet] - 10https://gerrit.wikimedia.org/r/629464 (https://phabricator.wikimedia.org/T236547) [22:14:19] (03CR) 10Dzahn: [C: 03+2] remove shinken module, profile, role [puppet] - 10https://gerrit.wikimedia.org/r/629464 (https://phabricator.wikimedia.org/T236547) (owner: 10Dzahn) [22:15:04] mutante: you're good with puppet. Does puppet clean up the old location if you change the path it clones a repo to? [22:17:33] Spookreeeno: no, you would have to set "ensure => absent" with git::clone first to make it clean up [22:17:45] or alternatively delete it manually [22:18:04] mutante: I think delete manually is going to be my sanest way of doing things [22:18:43] Spookreeeno: i would say it depends how many nodes are using the puppet class. is it just 1 host.. do it manually.. is it many hosts.. make an extra patch [22:19:11] mutante: just the one host [22:19:57] Spookreeeno: ack, the easiest is to just rm -rf the path and then run puppet once [22:20:28] mutante: agreed [22:20:29] well. here's one catch. the parent path must already exist in puppet [22:21:23] So based on my move I'll have to do rm -rf /path/to/dev and then mkdir /path/to/dev [22:21:40] And then let it recreate /path/to/dev/core [22:21:44] rm -rf yes, but don't do mkdir [22:21:53] puppet should create all the dirs needed [22:22:31] if /path or /path/to is missing, it should be added to puppet code, and "dev" will be created by git::clone [22:22:50] or it will see it already exists when trying to clone [22:23:40] mutante: /path/to/dev will be missing and it's going in /path/to/dev/core [22:23:53] /path/to does exist [22:24:38] Spookreeeno: then you should find the puppet code that created "to" (/srv will be created by the Linux distro) and add the "dev" part to it [22:24:56] file { '/path/to': [22:24:57] Okay I get ya [22:25:04] ensure => 'directory' [22:25:12] Yep! 
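A minimal Puppet sketch of the outcome of the exchange above, not the actual patch: the missing parent directory is declared as a file resource so that the clone into /path/to/dev/core has an existing parent. The paths are the placeholders used in the conversation, the repository title 'example/repo' is hypothetical, and the `directory` and `ensure` parameters of git::clone are assumed from the discussion rather than checked against the module.

```puppet
# Sketch only. Paths are the placeholders from the chat;
# 'example/repo' is a hypothetical repository title.

# Parent directory that the conversation found to be missing.
# /path/to is said to exist already, so only 'dev' is declared here.
file { '/path/to/dev':
    ensure => 'directory',
}

# Clone into the new location once the parent exists.
# Parameter names are assumptions based on the discussion.
git::clone { 'example/repo':
    ensure    => 'present',
    directory => '/path/to/dev/core',
    require   => File['/path/to/dev'],
}

# Puppet does not remove the clone at the old path. On a single host
# the chat suggests deleting it by hand (rm -rf) and running puppet once;
# on many hosts, a separate patch setting "ensure => absent" first.
```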
[22:25:17] I will add that [22:27:59] !log removing shinken puppet module and role [22:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:57] (03CR) 10Dzahn: "hmm.. there is still this diff about the socket location: https://puppet-compiler.wmflabs.org/compiler1002/25682/db1115.eqiad.wmnet/index." [puppet] - 10https://gerrit.wikimedia.org/r/631288 (https://phabricator.wikimedia.org/T159412) (owner: 10Dzahn) [22:39:30] 10Operations, 10observability: Better abstractions for puppet & icinga/nagios/shinken - https://phabricator.wikimedia.org/T85624 (10Dzahn) The shinken module has been deleted in https://gerrit.wikimedia.org/r/c/operations/puppet/+/629464 [22:42:34] (03CR) 10Cwhite: hieradata: expand swift object server statsd mappings (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/632205 (https://phabricator.wikimedia.org/T264588) (owner: 10Filippo Giunchedi) [22:43:35] (03CR) 10Cwhite: [C: 03+1] profile: add alertmanager::alerts [puppet] - 10https://gerrit.wikimedia.org/r/629153 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [22:50:38] 10Operations, 10observability, 10cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (10Dzahn) asked Mukunda about it and he says it is probably obsolete. asked -traffic because i do see @ema still logged in there not too long ago [22:55:30] (03PS7) 10Dzahn: prometheus: convert mysqld_exporter from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631288 (https://phabricator.wikimedia.org/T159412) [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201005T2300). [23:00:04] AaronSchulz: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:04:56] 10Operations, 10Parsoid, 10observability, 10serviceops, 10User-jijiki: Create per cluster error rate alerts on Mediawiki servers - https://phabricator.wikimedia.org/T262078 (10colewhite) As of T256418, we have removed StatsD outputs from Logstash. Prometheus-ES-Exporter accepts an Elasticsearch query an... [23:05:35] Krinkle: could wait a bit, sure (e.g. on the core patch) [23:06:17] (03PS8) 10Dzahn: prometheus: convert mysqld_exporter from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/631288 (https://phabricator.wikimedia.org/T159412) [23:07:31] (03CR) 10Ladsgroup: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/631952 (https://phabricator.wikimedia.org/T261031) (owner: 10Ladsgroup) [23:12:00] (03PS1) 10Dzahn: delete role::beta::availability_collector, diamond varnishstatus.py [puppet] - 10https://gerrit.wikimedia.org/r/632351 (https://phabricator.wikimedia.org/T210993) [23:17:43] (03CR) 10Dzahn: "just confirming "collect beta / staging cluster availability metrics" isn't actually a thing - this would be nice because it should be the" [puppet] - 10https://gerrit.wikimedia.org/r/632351 (https://phabricator.wikimedia.org/T210993) (owner: 10Dzahn) [23:19:44] (03CR) 10Dzahn: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/631952 (https://phabricator.wikimedia.org/T261031) (owner: 10Ladsgroup) [23:21:52] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "> I would go further and say get rid of the intermediate roles and profiles all together and:" [puppet] - 10https://gerrit.wikimedia.org/r/631288 (https://phabricator.wikimedia.org/T159412) (owner: 10Dzahn) [23:25:09] (03CR) 10Dzahn: [C: 03+2] admins/rush: add shebangs to shell scripts [puppet] - 10https://gerrit.wikimedia.org/r/631897 (https://phabricator.wikimedia.org/T95064) (owner: 10Dzahn) [23:32:19] (03CR) 10Dzahn: [V: 03+2 C: 03+2] "ah, this is exempt from jenkins-bot checks, right" [puppet] - 
10https://gerrit.wikimedia.org/r/631897 (https://phabricator.wikimedia.org/T95064) (owner: 10Dzahn) [23:45:07] (03CR) 10Dzahn: "> Further CI runs shellcheck and explicitly excludes the modules/admin/files/home directory[1] and the > reason i did that is because i fi" [puppet] - 10https://gerrit.wikimedia.org/r/631895 (https://phabricator.wikimedia.org/T95064) (owner: 10Dzahn) [23:52:55] (03CR) 10Dzahn: [C: 03+2] "Well.. Only doing this now because it's such a simple change that removed a bunch of noise and it's already uploaded. I will not be worryi" [puppet] - 10https://gerrit.wikimedia.org/r/631895 (https://phabricator.wikimedia.org/T95064) (owner: 10Dzahn) [23:54:48] (03CR) 10Dzahn: [C: 03+2] turn a couple scripts without bashisms into sh scripts [puppet] - 10https://gerrit.wikimedia.org/r/631890 (https://phabricator.wikimedia.org/T95064) (owner: 10Dzahn) [23:59:43] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server