[00:00:21] (Merged) jenkins-bot: Reparse deploy page before announcing an event (v2) [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/688503 (https://phabricator.wikimedia.org/T243394) (owner: BryanDavis)
[00:01:59] jouncebot: now
[00:01:59] No deployments scheduled for the next 10 hour(s) and 58 minute(s)
[00:02:04] jouncebot: next
[00:02:04] In 10 hour(s) and 57 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210511T1100)
[00:02:21] * bd808 crosses fingers and toes
[00:02:22] that sounds about right
[00:03:59] there were 2 bugs, the one in the lua, and one in the reparse magic that I added last Friday. The first caused the wrong duration in the announce, the second made it announce the N+1 deploy window instead of the expected N. I think I've fixed both.
[00:04:36] i'll check tomorrow and complain if not :P
[00:04:48] thanks for fixing it :)
[00:05:27] yes please to complaints if it's still busted. T243394 is the new feature that it is supposed to be gaining
[00:05:27] T243394: Automatically refresh jouncebot just before a deployment window starts - https://phabricator.wikimedia.org/T243394
[00:09:40] (PS1) Reedy: Add CoC link to non tech wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/688505 (https://phabricator.wikimedia.org/T280886)
[00:10:06] (PS2) Reedy: Add CoC link to non tech wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/688505 (https://phabricator.wikimedia.org/T280886)
[00:11:01] (CR) Reedy: [C: -2] "Not yet." [mediawiki-config] - https://gerrit.wikimedia.org/r/688505 (https://phabricator.wikimedia.org/T280886) (owner: Reedy)
[00:13:48] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 132692024 and 34 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:14:27] (PS1) Reedy: CommonSettings.php: Minor code style tweak inside $wmgUseFooterContactLink [mediawiki-config] - https://gerrit.wikimedia.org/r/688506
[00:21:22] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 376560 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:24:48] (CR) BryanDavis: [C: +1] wikireplicas-dns: condense repeated nodes for better failover [puppet] - https://gerrit.wikimedia.org/r/688501 (https://phabricator.wikimedia.org/T260389) (owner: Bstorm)
[00:32:14] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:34:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:37:48] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - No response from remote host 91.198.174.244 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:00:21] (CR) Ladsgroup: [C: +1] backup: Exclude /var/lib/mailman3/queue [puppet] - https://gerrit.wikimedia.org/r/688383 (owner: Legoktm)
[01:22:34] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 112136392 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:25:06] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 521208 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:25:50] SRE, vm-requests: : of VMs requested for - https://phabricator.wikimedia.org/T282508 (Ramtin2021)
[01:44:41] SRE, vm-requests: : of VMs requested for - https://phabricator.wikimedia.org/T282508 (bd808) Open→Invalid
[01:49:14] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:51:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:07:47] (PS1) TrainBranchBot: Branch commit for wmf/1.37.0-wmf.5 [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/688628
[02:07:49] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.37.0-wmf.5 [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/688628 (owner: TrainBranchBot)
[02:33:02] (Merged) jenkins-bot: Branch commit for wmf/1.37.0-wmf.5 [core] (wmf/1.37.0-wmf.5) - https://gerrit.wikimedia.org/r/688628 (owner: TrainBranchBot)
[03:07:22] PROBLEM - dump of m5 in eqiad on alert1001 is CRITICAL: Last dump for m5 at eqiad (db1117.eqiad.wmnet:3325) taken on 2021-05-11 02:12:37 is 25 GB, but previous one was 17 GB, a change of 50.4% https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[03:49:04] SRE, Thumbor, Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (Arthur2e5) At least the 2.51.1 result makes more sense and works in non-extreme scales...
[04:18:19] (PS7) Majavah: toolforge: Allow passing host port for k8s ingress [puppet] - https://gerrit.wikimedia.org/r/688361 (https://phabricator.wikimedia.org/T264221)
[04:18:43] (PS8) Majavah: toolforge: Allow passing host port for k8s ingress [puppet] - https://gerrit.wikimedia.org/r/688361 (https://phabricator.wikimedia.org/T264221)
[04:20:59] (CR) Majavah: toolforge: Allow passing host port for k8s ingress (1 comment) [puppet] - https://gerrit.wikimedia.org/r/688361 (https://phabricator.wikimedia.org/T264221) (owner: Majavah)
[04:28:18] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (cloudmetrics1002), No backups: 2 (backup1002, ...), Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:29:21] (PS8) Majavah: toolforge: Add ingress-nginx Helm files [puppet] - https://gerrit.wikimedia.org/r/685715 (https://phabricator.wikimedia.org/T264221)
[04:34:19] (CR) Majavah: "> Patch Set 7:" [puppet] - https://gerrit.wikimedia.org/r/685715 (https://phabricator.wikimedia.org/T264221) (owner: Majavah)
[04:45:16] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:57:12] (CR) Marostegui: [C: +2] wikireplicas: disable notifications on the old replica cluster [puppet] - https://gerrit.wikimedia.org/r/688443 (https://phabricator.wikimedia.org/T260389) (owner: Bstorm)
[04:58:25] (CR) Marostegui: [C: +1] wikireplicas: remove the old wikireplicas profile from the proxy [puppet] - https://gerrit.wikimedia.org/r/688368 (https://phabricator.wikimedia.org/T260389) (owner: Bstorm)
[05:01:28] SRE, DBA, Wikimedia-Mailing-lists: Delete lists-next.wikimedia.org - https://phabricator.wikimedia.org/T281548 (Marostegui) Sure, that works. Just let me know when we can proceed. I assume we'd need to delete the following databases: ` testmailman3 testmailman3web ` And the following users: ` | test...
[05:05:14] SRE, DBA, Wikimedia-Mailing-lists: Delete lists-next.wikimedia.org - https://phabricator.wikimedia.org/T281548 (Ladsgroup) I confirm that :)
[05:08:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1182', diff saved to https://phabricator.wikimedia.org/P15894 and previous config saved to /var/cache/conftool/dbconfig/20210511-050816-marostegui.json
[05:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:10:55] (PS1) Marostegui: db1121: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/688742 (https://phabricator.wikimedia.org/T280492)
[05:11:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1121 - going to be reimaged to buster T280492', diff saved to https://phabricator.wikimedia.org/P15895 and previous config saved to /var/cache/conftool/dbconfig/20210511-051102-marostegui.json
[05:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:06] T280492: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492
[05:11:41] !log Reimage db1121 to buster, this will generate lag on s4 (commonswiki) on wikireplicas T280492
[05:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:12:06] (CR) Marostegui: [C: +2] db1121: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/688742 (https://phabricator.wikimedia.org/T280492) (owner: Marostegui)
[05:14:10] PROBLEM - dump of m5 in codfw on alert1001 is CRITICAL: Last dump for m5 at codfw (db2078.codfw.wmnet:3325) taken on 2021-05-11 04:18:31 is 25 GB, but previous one was 17 GB, a change of 52.2% https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[05:25:29] (PS2) Amire80: Make the Malaysian talk namespaces names consistent [mediawiki-config] - https://gerrit.wikimedia.org/r/687064
[05:25:44] (PS3) Amire80: Make the Malaysian talk namespaces names consistent [mediawiki-config] - https://gerrit.wikimedia.org/r/687064
[05:30:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1121.eqiad.wmnet with reason: REIMAGE
[05:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:32:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1121.eqiad.wmnet with reason: REIMAGE
[05:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:03] (PS1) Marostegui: mariadb: Decommission db1082 [puppet] - https://gerrit.wikimedia.org/r/688752 (https://phabricator.wikimedia.org/T281794)
[05:36:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1082.eqiad.wmnet
[05:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:24] (CR) Marostegui: [C: +2] mariadb: Decommission db1082 [puppet] - https://gerrit.wikimedia.org/r/688752 (https://phabricator.wikimedia.org/T281794) (owner: Marostegui)
[05:44:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1082.eqiad.wmnet
[05:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:28] SRE, ops-eqiad, DC-Ops, decommission-hardware: decommission db1082.eqiad.wmnet - https://phabricator.wikimedia.org/T281794 (Marostegui) This is ready for #dc-ops
[05:45:30] SRE, ops-eqiad, DC-Ops, decommission-hardware: decommission db1082.eqiad.wmnet - https://phabricator.wikimedia.org/T281794 (Marostegui)
[05:46:01] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[05:50:16] (PS7) Giuseppe Lavagetto: safe-service-restart: only verify pooled services [puppet] - https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100)
[05:51:40] (CR) jerkins-bot: [V: -1] safe-service-restart: only verify pooled services [puppet] - https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100) (owner: Giuseppe Lavagetto)
[06:22:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 25%: Repool db1182', diff saved to https://phabricator.wikimedia.org/P15896 and previous config saved to /var/cache/conftool/dbconfig/20210511-062231-root.json
[06:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:29] RECOVERY - Check systemd state on analytics1075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:37:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 50%: Repool db1182', diff saved to https://phabricator.wikimedia.org/P15897 and previous config saved to /var/cache/conftool/dbconfig/20210511-063734-root.json
[06:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:21] (PS1) Elukey: profile::hadoop::master: add alert to bump the heap size when needed [puppet] - https://gerrit.wikimedia.org/r/688778
[06:50:14] (CR) Joal: [C: +1] "Thanks a lot Luca :)" [puppet] - https://gerrit.wikimedia.org/r/688778 (owner: Elukey)
[06:50:50] !log Stop replication on db2094:3318 T282514
[06:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:54] T282514: Re-import some tables on db2094:3318 - https://phabricator.wikimedia.org/T282514
[06:52:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 75%: Repool db1182', diff saved to https://phabricator.wikimedia.org/P15898 and previous config saved to /var/cache/conftool/dbconfig/20210511-065238-root.json
[06:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:50] (CR) Elukey: "shall we merge?" [puppet] - https://gerrit.wikimedia.org/r/686351 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[07:06:47] (Traffic bill over quota) firing: (3) Traffic bill over quota - https://alerts.wikimedia.org
[07:07:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 100%: Repool db1182', diff saved to https://phabricator.wikimedia.org/P15899 and previous config saved to /var/cache/conftool/dbconfig/20210511-070742-root.json
[07:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:35] (PS9) Muehlenhoff: Add a cookbook to delete hosts from debmonitor [cookbooks] - https://gerrit.wikimedia.org/r/684819
[07:11:47] (Traffic bill over quota) resolved: (3) Traffic bill over quota - https://alerts.wikimedia.org
[07:13:30] (CR) Muehlenhoff: [C: +2] Add a cookbook to delete hosts from debmonitor [cookbooks] - https://gerrit.wikimedia.org/r/684819 (owner: Muehlenhoff)
[07:15:06] (CR) Elukey: WIP - Add istio base images build support (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[07:21:57] ACKNOWLEDGEMENT - dump of m5 in codfw on alert1001 is CRITICAL: Last dump for m5 at codfw (db2078.codfw.wmnet:3325) taken on 2021-05-11 04:18:31 is 25 GB, but previous one was 17 GB, a change of 52.2% Jcrespo mailman import ongoing - The acknowledgement expires at: 2021-05-18 07:21:23. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[07:21:57] ACKNOWLEDGEMENT - dump of m5 in eqiad on alert1001 is CRITICAL: Last dump for m5 at eqiad (db1117.eqiad.wmnet:3325) taken on 2021-05-11 02:12:37 is 25 GB, but previous one was 17 GB, a change of 50.4% Jcrespo mailman import ongoing - The acknowledgement expires at: 2021-05-18 07:21:23. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[07:23:31] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install copernicium - https://phabricator.wikimedia.org/T282272 (MoritzMuehlenhoff)
[07:31:24] (PS8) Giuseppe Lavagetto: safe-service-restart: only verify pooled services [puppet] - https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100)
[07:34:54] (CR) JMeybohm: [C: +2] calico: Remove CPU limit for calico-node, bump for typha and kube-controllers [deployment-charts] - https://gerrit.wikimedia.org/r/688332 (https://phabricator.wikimedia.org/T277877) (owner: JMeybohm)
[07:35:10] (CR) JMeybohm: [C: +2] prometheus: Clean up absent file resource [puppet] - https://gerrit.wikimedia.org/r/684801 (https://phabricator.wikimedia.org/T271573) (owner: JMeybohm)
[07:35:20] (CR) JMeybohm: [C: +2] kube-apiserver: Update admission controller config [puppet] - https://gerrit.wikimedia.org/r/677922 (https://phabricator.wikimedia.org/T270063) (owner: JMeybohm)
[07:36:12] (Merged) jenkins-bot: calico: Remove CPU limit for calico-node, bump for typha and kube-controllers [deployment-charts] - https://gerrit.wikimedia.org/r/688332 (https://phabricator.wikimedia.org/T277877) (owner: JMeybohm)
[07:36:30] (CR) Giuseppe Lavagetto: [C: -1] "LGTM but I feel some more elaborate comments would improve this." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/685734 (owner: Filippo Giunchedi)
[07:37:15] (CR) JMeybohm: "> Patch Set 6: Code-Review-1" [deployment-charts] - https://gerrit.wikimedia.org/r/685748 (https://phabricator.wikimedia.org/T282148) (owner: Effie Mouzeli)
[07:37:47] (PS1) Muehlenhoff: Support removing debmonitor entries for hosts already garbage-collected from puppetdb [cookbooks] - https://gerrit.wikimedia.org/r/688891
[07:39:21] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[07:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:31] (CR) jerkins-bot: [V: -1] Support removing debmonitor entries for hosts already garbage-collected from puppetdb [cookbooks] - https://gerrit.wikimedia.org/r/688891 (owner: Muehlenhoff)
[07:40:51] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[07:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:36] (CR) Jcrespo: [C: +1] backup: Exclude /var/lib/mailman3/queue [puppet] - https://gerrit.wikimedia.org/r/688383 (owner: Legoktm)
[07:44:31] (PS1) JMeybohm: calico: Remove limits.cpu instead of requests.cpu for calico-node [deployment-charts] - https://gerrit.wikimedia.org/r/688895 (https://phabricator.wikimedia.org/T277877)
[07:47:46] (CR) JMeybohm: [C: +2] calico: Remove limits.cpu instead of requests.cpu for calico-node [deployment-charts] - https://gerrit.wikimedia.org/r/688895 (https://phabricator.wikimedia.org/T277877) (owner: JMeybohm)
[07:49:05] (Merged) jenkins-bot: calico: Remove limits.cpu instead of requests.cpu for calico-node [deployment-charts] - https://gerrit.wikimedia.org/r/688895 (https://phabricator.wikimedia.org/T277877) (owner: JMeybohm)
[07:51:45] (PS2) Muehlenhoff: Support removing debmonitor entries for hosts already garbage-collected from puppetdb [cookbooks] - https://gerrit.wikimedia.org/r/688891
[07:54:00] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[07:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:05] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[07:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:43] (CR) Alexandros Kosiaris: "> Patch Set 1: Code-Review-1" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/685734 (owner: Filippo Giunchedi)
[08:01:49] (CR) Muehlenhoff: [C: +1] "Looks good." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/686633 (https://phabricator.wikimedia.org/T232343) (owner: Herron)
[08:06:30] (CR) Muehlenhoff: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/688333 (https://phabricator.wikimedia.org/T232343) (owner: Jbond)
[08:07:11] (CR) Filippo Giunchedi: [C: +2] prometheus: Remove absented crons [puppet] - https://gerrit.wikimedia.org/r/686351 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[08:12:09] (PS2) Elukey: prometheus: Migrate cron in node_amd_rocm to systemd timer [puppet] - https://gerrit.wikimedia.org/r/686352 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[08:15:56] (CR) Elukey: [C: +2] prometheus: Migrate cron in node_amd_rocm to systemd timer [puppet] - https://gerrit.wikimedia.org/r/686352 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[08:17:48] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[08:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:08] (CR) Volans: [C: +1] "It's totally ok for me to merge this as is (with or without the nit inline). FYI it's already possible to achieve the same with the curren" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/688891 (owner: Muehlenhoff)
[08:19:08] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[08:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1170:3312', diff saved to https://phabricator.wikimedia.org/P15901 and previous config saved to /var/cache/conftool/dbconfig/20210511-082038-marostegui.json
[08:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:00] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[08:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:01] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[08:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:33] (CR) Muehlenhoff: "> Patch Set 2: Code-Review+1" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/688891 (owner: Muehlenhoff)
[08:31:53] (CR) Muehlenhoff: Support removing debmonitor entries for hosts already garbage-collected from puppetdb (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/688891 (owner: Muehlenhoff)
[08:32:36] !log installing hivex security updates
[08:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:13] !log jayme@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[08:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:29] (CR) Filippo Giunchedi: "> Patch Set 1:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/685734 (owner: Filippo Giunchedi)
[08:35:34] (CR) Filippo Giunchedi: "> Patch Set 1: Code-Review-1" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/685734 (owner: Filippo Giunchedi)
[08:35:35] !log jayme@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[08:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:01] !log jayme@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[08:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:24] (PS1) Cathal Mooney: Replaced incorrect ssh pub key. [puppet] - https://gerrit.wikimedia.org/r/688916
[08:37:08] !log jayme@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[08:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:40:58] (CR) Cathal Mooney: "Hi this should be the correct one if you can give it a quick look over. Thanks." [puppet] - https://gerrit.wikimedia.org/r/688916 (owner: Cathal Mooney)
[08:42:22] (CR) Muehlenhoff: Replaced incorrect ssh pub key. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/688916 (owner: Cathal Mooney)
[08:43:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:45:26] (PS3) Muehlenhoff: Support removing debmonitor entries for hosts already garbage-collected from puppetdb [cookbooks] - https://gerrit.wikimedia.org/r/688891
[08:47:15] (CR) Giuseppe Lavagetto: [C: +1] Add vrt-wiki.wikimedia.org to mediawiki.yaml [puppet] - https://gerrit.wikimedia.org/r/683000 (https://phabricator.wikimedia.org/T280400) (owner: Urbanecm)
[08:48:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:48:22] (CR) jerkins-bot: [V: -1] Support removing debmonitor entries for hosts already garbage-collected from puppetdb [cookbooks] - https://gerrit.wikimedia.org/r/688891 (owner: Muehlenhoff)
[08:49:44] (PS4) Elukey: WIP - Add istio base images build support [docker-images/production-images] - https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192)
[08:50:24] (CR) Elukey: WIP - Add istio base images build support (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[08:52:54] (CR) Ladsgroup: [C: -1] Add P2671 and P4839 to deprecated properties list (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/688178 (https://phabricator.wikimedia.org/T280779) (owner: Itamar Givon)
[08:53:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:03:29] (PS2) Filippo Giunchedi: wmflib: add role/public_endpoint to wmflib::service [puppet] - https://gerrit.wikimedia.org/r/685734
[09:03:31] (PS2) Filippo Giunchedi: pontoon: introduce public_certs [puppet] - https://gerrit.wikimedia.org/r/685737
[09:03:33] (PS2) Filippo Giunchedi: pontoon: add public LB class [puppet] - https://gerrit.wikimedia.org/r/685738
[09:03:35] (PS2) Filippo Giunchedi: role: add pontoon::frontend role/profile [puppet] - https://gerrit.wikimedia.org/r/685739
[09:05:23] (PS4) Muehlenhoff: Support removing debmonitor entries for hosts already garbage-collected from puppetdb [cookbooks] - https://gerrit.wikimedia.org/r/688891
[09:23:48] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw2002-dev.codfw.wmnet
[09:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:06] (Abandoned) Cathal Mooney: Replaced incorrect ssh pub key. [puppet] - https://gerrit.wikimedia.org/r/688916 (owner: Cathal Mooney)
[09:37:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 25%: Repool db1170:3312', diff saved to https://phabricator.wikimedia.org/P15903 and previous config saved to /var/cache/conftool/dbconfig/20210511-093701-root.json
[09:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:40] (PS1) Marostegui: install_server: Delete labsdb10[09|10|11] [puppet] - https://gerrit.wikimedia.org/r/688933 (https://phabricator.wikimedia.org/T282522)
[09:41:23] (CR) Marostegui: [C: +2] install_server: Delete labsdb10[09|10|11] [puppet] - https://gerrit.wikimedia.org/r/688933 (https://phabricator.wikimedia.org/T282522) (owner: Marostegui)
[09:47:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:49:58] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T282521 (Aklapper)
[09:52:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 50%: Repool db1170:3312', diff saved to https://phabricator.wikimedia.org/P15904 and previous config saved to /var/cache/conftool/dbconfig/20210511-095204-root.json
[09:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:54:51] !log aborrero@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cloudgw2002-dev.codfw.wmnet
[09:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:15] (PS5) Arturo Borrero Gonzalez: cloudgw: don't use concatenation with CIDR [puppet] - https://gerrit.wikimedia.org/r/688367 (https://phabricator.wikimedia.org/T270704)
[10:05:37] (CR) Arturo Borrero Gonzalez: [C: +2] cloudgw: don't use concatenation with CIDR [puppet] - https://gerrit.wikimedia.org/r/688367 (https://phabricator.wikimedia.org/T270704) (owner: Arturo Borrero Gonzalez)
[10:06:35] (PS5) Elukey: WIP - Add istio base images build support [docker-images/production-images] - https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192)
[10:06:55] (CR) Itamar Givon: Add P2671 and P4839 to deprecated properties list (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/688178 (https://phabricator.wikimedia.org/T280779) (owner: Itamar Givon)
[10:07:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 75%: Repool db1170:3312', diff saved to https://phabricator.wikimedia.org/P15907 and previous config saved to /var/cache/conftool/dbconfig/20210511-100708-root.json
[10:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:55] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[10:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:59] PROBLEM - Long running screen/tmux on mw2280 is CRITICAL: CRIT: Long running tmux process. (user: root PID: 25179, 1734079s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[10:10:27] (PS2) Itamar Givon: Add P2671 and P4839 to deprecated properties list [mediawiki-config] - https://gerrit.wikimedia.org/r/688178 (https://phabricator.wikimedia.org/T280779)
[10:12:26] (PS6) Elukey: WIP - Add istio base images build support [docker-images/production-images] - https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192)
[10:13:03] PROBLEM - Disk space on backup2002 is CRITICAL: DISK CRITICAL - free space: /srv 2945826 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup2002&var-datasource=codfw+prometheus/ops
[10:13:47] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[10:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:56] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1002.eqiad.wmnet
[10:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:14] (PS3) Itamar Givon: Add P2671 and P4839 to deprecated properties list [mediawiki-config] - https://gerrit.wikimedia.org/r/688178 (https://phabricator.wikimedia.org/T280779)
[10:16:17] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T282519 (Aklapper)
[10:16:19] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T282521 (Aklapper)
[10:16:45] SRE, Traffic: dbtree.wm.o stopped working after enforcing Puppet CA issued certs for ATS backend origin servers - https://phabricator.wikimedia.org/T282531 (Vgutierrez)
[10:17:11] (PS1) Volans: hosts: log stacktrace on update error [software/debmonitor] - https://gerrit.wikimedia.org/r/688949 (https://phabricator.wikimedia.org/T282529)
[10:18:06] SRE, Traffic: dbtree.wm.o stopped working after enforcing Puppet CA issued certs for ATS backend origin servers - https://phabricator.wikimedia.org/T282531 (Vgutierrez) p:Triage→Medium dbtree.wikimedia.org is hosted on the same boxes as tendril and it's using the same certificate, tendril.wikimed...
[10:18:14] SRE, DBA, Traffic: dbtree.wm.o stopped working after enforcing Puppet CA issued certs for ATS backend origin servers - https://phabricator.wikimedia.org/T282531 (Marostegui)
[10:22:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 100%: Repool db1170:3312', diff saved to https://phabricator.wikimedia.org/P15908 and previous config saved to /var/cache/conftool/dbconfig/20210511-102212-root.json
[10:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:57] (CR) Ladsgroup: [C: +1] Add P2671 and P4839 to deprecated properties list [mediawiki-config] - https://gerrit.wikimedia.org/r/688178 (https://phabricator.wikimedia.org/T280779) (owner: Itamar Givon)
[10:23:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1162', diff saved to https://phabricator.wikimedia.org/P15909 and previous config saved to /var/cache/conftool/dbconfig/20210511-102303-marostegui.json
[10:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:58] SRE, GitLab (Initialization), Release-Engineering-Team (Doing), User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (jbond) >>! In T274461#7075774, @Sergey.Trofimovsky.SF wrote: >>>! In T274461#7073817, @jbond wrote: >>> - Group membership sync/import from...
[10:27:14] (03Abandoned) 10Muehlenhoff: Update puppetised java.security file from 11.9 [puppet] - 10https://gerrit.wikimedia.org/r/637464 (https://phabricator.wikimedia.org/T266782) (owner: 10Muehlenhoff) [10:39:40] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/688891 (owner: 10Muehlenhoff) [10:46:42] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [10:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:56] (03PS1) 10Jbond: P:pki: make provide bundle the default [puppet] - 10https://gerrit.wikimedia.org/r/688952 [10:49:58] (03PS1) 10Jbond: P:tendril::webserver: update dbtree to use new pki for tls certs [puppet] - 10https://gerrit.wikimedia.org/r/688953 (https://phabricator.wikimedia.org/T282531) [10:50:38] (03CR) 10jerkins-bot: [V: 04-1] P:tendril::webserver: update dbtree to use new pki for tls certs [puppet] - 10https://gerrit.wikimedia.org/r/688953 (https://phabricator.wikimedia.org/T282531) (owner: 10Jbond) [10:52:17] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [10:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:37] (03CR) 10Jbond: [C: 03+2] P:pki: make provide bundle the default [puppet] - 10https://gerrit.wikimedia.org/r/688952 (owner: 10Jbond) [10:52:43] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29491/console" [puppet] - 10https://gerrit.wikimedia.org/r/688953 (https://phabricator.wikimedia.org/T282531) (owner: 10Jbond) [10:55:49] (03PS1) 10Jbond: P:pki: function: always return chained path [puppet] - 10https://gerrit.wikimedia.org/r/688961 [10:56:17] legoktm / Amir1 hi. For some reason I am subscribed twice to listadmins-announce? 
[10:56:34] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29492/console" [puppet] - 10https://gerrit.wikimedia.org/r/688952 (owner: 10Jbond) [10:56:35] Maybe under two username? [10:56:38] *two email [10:56:39] (03PS2) 10Jbond: P:tendril::webserver: update dbtree to use new pki for tls certs [puppet] - 10https://gerrit.wikimedia.org/r/688953 (https://phabricator.wikimedia.org/T282531) [10:56:47] (03PS3) 10Jbond: P:tendril::webserver: update dbtree to use new pki for tls certs [puppet] - 10https://gerrit.wikimedia.org/r/688953 (https://phabricator.wikimedia.org/T282531) [10:56:52] Amir1: nope, same address [10:57:16] (03CR) 10jerkins-bot: [V: 04-1] P:pki: function: always return chained path [puppet] - 10https://gerrit.wikimedia.org/r/688961 (owner: 10Jbond) [10:57:22] (03CR) 10jerkins-bot: [V: 04-1] P:tendril::webserver: update dbtree to use new pki for tls certs [puppet] - 10https://gerrit.wikimedia.org/r/688953 (https://phabricator.wikimedia.org/T282531) (owner: 10Jbond) [10:57:55] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29493/console" [puppet] - 10https://gerrit.wikimedia.org/r/688961 (owner: 10Jbond) [10:59:19] I'll see if I can do something myself [10:59:52] "Se ha producido un error al procesar su petición." --> there was an error processing your request [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European mid-day backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210511T1100). [11:00:04] phuedx: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:11] o/ [11:00:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 25%: Repool db1162', diff saved to https://phabricator.wikimedia.org/P15910 and previous config saved to /var/cache/conftool/dbconfig/20210511-110029-root.json [11:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:40] phuedx: do you want to self serve, or should I deploy? [11:01:11] (03PS2) 10Jbond: P:pki: function: always return chained path [puppet] - 10https://gerrit.wikimedia.org/r/688961 [11:01:25] Urbanecm: A process question: It's a Beta Cluster only change. Should it be synced like any other patch? [11:01:38] Urbanecm: Also, I'll deploy it :) [11:02:14] If it only changes labs-only files, you only need to +2 it _and_ fetch on deploy1002 [11:02:36] (03CR) 10jerkins-bot: [V: 04-1] P:pki: function: always return chained path [puppet] - 10https://gerrit.wikimedia.org/r/688961 (owner: 10Jbond) [11:02:39] (so it is in the staging dir and doesn't surprise the next deployer) [11:02:53] (03PS2) 10Phuedx: Drop unused configuration on labs instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/686756 (https://phabricator.wikimedia.org/T277955) (owner: 10Jdlrobson) [11:03:27] Beta magic should deploy it automatically (usually takes up to 30 minutes) [11:03:36] Urbanecm: Thanks. I'll do that now. ^ The patch only changes a -labs.php file [11:04:05] Yup, confirmed. So just +2 and fetch when it merges. [11:04:24] (03PS3) 10Jbond: P:pki: function: always return chained path [puppet] - 10https://gerrit.wikimedia.org/r/688961 [11:04:34] I have some deployments, ping me once you're done [11:04:34] (03CR) 10Phuedx: [C: 03+2] Drop unused configuration on labs instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/686756 (https://phabricator.wikimedia.org/T277955) (owner: 10Jdlrobson) [11:04:43] Amir1: Will do.
[11:04:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "assuming the interval spec is right (I'm not very familiar with that syntax) then LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/686353 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [11:05:00] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] prometheus: Migrate node_ssh_open_sessions cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/686353 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [11:05:42] (03Merged) 10jenkins-bot: Drop unused configuration on labs instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/686756 (https://phabricator.wikimedia.org/T277955) (owner: 10Jdlrobson) [11:07:11] (03PS9) 10Jbond: P:trafficserver::backend: update the source of the ATS trusted ca bundle [puppet] - 10https://gerrit.wikimedia.org/r/685497 (https://phabricator.wikimedia.org/T281673) [11:07:31] (03CR) 10Jbond: [C: 03+2] P:pki: function: always return chained path [puppet] - 10https://gerrit.wikimedia.org/r/688961 (owner: 10Jbond) [11:07:55] (03PS3) 10Jbond: O:debmonitor::server: Switch debmonitor.wikimedia.org ssl to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/685576 (https://phabricator.wikimedia.org/T281673) [11:08:07] (03PS4) 10Jbond: P:tendril::webserver: update dbtree to use new pki for tls certs [puppet] - 10https://gerrit.wikimedia.org/r/688953 (https://phabricator.wikimedia.org/T282531) [11:08:10] Amir1: Done [11:09:30] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29494/console" [puppet] - 10https://gerrit.wikimedia.org/r/688953 (https://phabricator.wikimedia.org/T282531) (owner: 10Jbond) [11:09:31] cool [11:09:33] Thanks [11:11:36] (03PS1) 10Marostegui: production-m5.sql: Remove testmailman users [puppet] - 10https://gerrit.wikimedia.org/r/688975 (https://phabricator.wikimedia.org/T281548) [11:11:56] (03CR) 10Ladsgroup: [C: 03+2] Add P2671 and P4839 to deprecated 
properties list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688178 (https://phabricator.wikimedia.org/T280779) (owner: 10Itamar Givon) [11:12:35] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Delete lists-next.wikimedia.org - https://phabricator.wikimedia.org/T281548 (10Marostegui) Ready to merge this whenever you review it: https://gerrit.wikimedia.org/r/688975 Once merged, this requires removing the users manually too in the DB. [11:13:39] (03Merged) 10jenkins-bot: Add P2671 and P4839 to deprecated properties list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/688178 (https://phabricator.wikimedia.org/T280779) (owner: 10Itamar Givon) [11:15:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 50%: Repool db1162', diff saved to https://phabricator.wikimedia.org/P15911 and previous config saved to /var/cache/conftool/dbconfig/20210511-111532-root.json [11:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:08] Amir1: I think I fixed it by removing myself via -leave [11:16:36] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:688178|Add P2671 and P4839 to deprecated properties list (T280779)]] (duration: 00m 58s) [11:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:44] T280779: Property suggester should not include Google Knowledge Graph ID (P2671) and Wolfram Language entity code (P4839) - https://phabricator.wikimedia.org/T280779 [11:16:55] Yup, only one sub now \o/ [11:17:08] Cool [11:17:15] Sorry, I'm deploying things atm [11:17:24] I was planning to look at it afterwards [11:18:55] np [11:19:29] I think it has to do with mm thinking that I have two addresses: the main one in the mm3 system and the other imported from mm2 [11:19:38] even if they're the same [11:23:10] (03PS1) 10Arturo Borrero Gonzalez: cr/firewall.conf: add cumin term to labs-in6 [homer/public] - 
10https://gerrit.wikimedia.org/r/688978 [11:24:53] (03CR) 10Filippo Giunchedi: "Prior to merge this I'll add the paging transports to "port utilization over %80" alerts" [puppet] - 10https://gerrit.wikimedia.org/r/685779 (https://phabricator.wikimedia.org/T281095) (owner: 10Filippo Giunchedi) [11:25:13] (03CR) 10Jbond: [C: 03+1] "lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/688978 (owner: 10Arturo Borrero Gonzalez) [11:25:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cr/firewall.conf: add cumin term to labs-in6 [homer/public] - 10https://gerrit.wikimedia.org/r/688978 (owner: 10Arturo Borrero Gonzalez) [11:29:29] (03PS1) 10Arturo Borrero Gonzalez: cr/firewall.conf: fix syntax in labs-in6 cumin term [homer/public] - 10https://gerrit.wikimedia.org/r/688979 [11:30:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 75%: Repool db1162', diff saved to https://phabricator.wikimedia.org/P15912 and previous config saved to /var/cache/conftool/dbconfig/20210511-113036-root.json [11:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:23] (03CR) 10Jbond: [C: 03+1] "lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/688979 (owner: 10Arturo Borrero Gonzalez) [11:31:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cr/firewall.conf: fix syntax in labs-in6 cumin term [homer/public] - 10https://gerrit.wikimedia.org/r/688979 (owner: 10Arturo Borrero Gonzalez) [11:35:22] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1002.eqiad.wmnet [11:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] prometheus: Migrate node_ssh_open_sessions cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/686353 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [11:36:12] (03PS1) 10Marostegui: install_server: Reimage db2108 [puppet] - 
10https://gerrit.wikimedia.org/r/688986 (https://phabricator.wikimedia.org/T282535) [11:37:00] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db2108 [puppet] - 10https://gerrit.wikimedia.org/r/688986 (https://phabricator.wikimedia.org/T282535) (owner: 10Marostegui) [11:42:16] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team (Doing), 10User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (10MoritzMuehlenhoff) >>! In T274461#7075868, @Sergey.Trofimovsky.SF wrote: > Requesting some more information on the current LDAP schema, which... [11:45:05] (03PS5) 10Muehlenhoff: Support removing debmonitor entries for hosts already garbage-collected from puppetdb [cookbooks] - 10https://gerrit.wikimedia.org/r/688891 [11:45:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 100%: Repool db1162', diff saved to https://phabricator.wikimedia.org/P15913 and previous config saved to /var/cache/conftool/dbconfig/20210511-114540-root.json [11:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:16] (03PS1) 10Jbond: gitlab: minor variable value updates [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/688998 [11:54:57] (03PS1) 10Filippo Giunchedi: pontoon: set required librenms variables [puppet] - 10https://gerrit.wikimedia.org/r/688999 [11:54:59] (03PS1) 10Filippo Giunchedi: pontoon: use 7d retention for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/689000 [11:56:25] (03CR) 10Jbond: gitlab: minor variable value updates (031 comment) [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/688998 (owner: 10Jbond) [11:57:16] 10SRE, 10GitLab (Initialization), 10Patch-For-Review, 10Release-Engineering-Team (Doing), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10jbond) >>! In T276148#7075459, @Sergey.Trofimovsky.SF wrote: > These were all addressed (please review). Please note, t... 
[11:57:30] (03CR) 10Hnowlan: "Some initial notes, mostly nits but the templating of the configuration is an outstanding component afaict" (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/688358 (https://phabricator.wikimedia.org/T281257) (owner: 10Nikki Nikkhoui) [12:04:07] (03CR) 10Klausman: WIP - Add istio base images build support (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [12:06:34] (03CR) 10Vgutierrez: [C: 03+1] "LGTM but update the commit message as the file is going to be stored in /etc/ssl/certs" [puppet] - 10https://gerrit.wikimedia.org/r/685495 (https://phabricator.wikimedia.org/T281673) (owner: 10Jbond) [12:09:26] (03CR) 10Muehlenhoff: [C: 03+2] Support removing debmonitor entries for hosts already garbage-collected from puppetdb [cookbooks] - 10https://gerrit.wikimedia.org/r/688891 (owner: 10Muehlenhoff) [12:11:34] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: mc1027.eqiad.wmnet [12:11:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: mc1027.eqiad.wmnet [12:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:13] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/29495/" [puppet] - 10https://gerrit.wikimedia.org/r/688327 (https://phabricator.wikimedia.org/T281345) (owner: 10Muehlenhoff) [12:25:57] (03PS1) 10Jcrespo: Fix bug where backup returns an exception if there are errors on log [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/689015 [12:27:21] 10SRE, 10User-MoritzMuehlenhoff: Saner updates of java.security properties - https://phabricator.wikimedia.org/T282545 (10MoritzMuehlenhoff) [12:27:26] (03CR) 10Jcrespo: [C: 03+2] Fix bug where backup returns an exception if there are errors on log 
[software/wmfbackups] - 10https://gerrit.wikimedia.org/r/689015 (owner: 10Jcrespo) [12:30:32] (03CR) 10Muehlenhoff: [C: 03+2] Update java.security file for 11.0.11 [puppet] - 10https://gerrit.wikimedia.org/r/688327 (https://phabricator.wikimedia.org/T281345) (owner: 10Muehlenhoff) [12:43:08] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Looks pretty ready to me, couple of minor inline comments" (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/667047 (owner: 10Legoktm) [12:58:15] (03PS8) 10Jbond: P:trafficserver::backend: Use a new trusted CA file [puppet] - 10https://gerrit.wikimedia.org/r/685495 (https://phabricator.wikimedia.org/T281673) [13:05:02] (03CR) 10Jbond: [C: 03+2] P:trafficserver::backend: Use a new trusted CA file [puppet] - 10https://gerrit.wikimedia.org/r/685495 (https://phabricator.wikimedia.org/T281673) (owner: 10Jbond) [13:06:34] (03PS1) 10Muehlenhoff: Failover the IDP to idp1001 [dns] - 10https://gerrit.wikimedia.org/r/689030 [13:08:58] (03PS2) 10Muehlenhoff: Failover the IDP to idp1001 [dns] - 10https://gerrit.wikimedia.org/r/689030 [13:14:02] (03CR) 10Muehlenhoff: [C: 03+2] Failover the IDP to idp1001 [dns] - 10https://gerrit.wikimedia.org/r/689030 (owner: 10Muehlenhoff) [13:18:34] (03PS1) 10Vgutierrez: cumin: Remove outdated cp-ats and cp-ats-ulsfo aliases [puppet] - 10https://gerrit.wikimedia.org/r/689066 [13:19:39] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for JStephenson1980 - https://phabricator.wikimedia.org/T282521 (10Reedy) [13:20:22] 10SRE, 10User-MoritzMuehlenhoff: Sensible updates of java.security properties - https://phabricator.wikimedia.org/T282545 (10Reedy) [13:21:24] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/689066 (owner: 10Vgutierrez) [13:21:59] (03CR) 10Vgutierrez: [C: 03+2] cumin: Remove outdated cp-ats and cp-ats-ulsfo aliases [puppet] - 10https://gerrit.wikimedia.org/r/689066 (owner: 10Vgutierrez) [13:32:30] (03PS1) 10Vgutierrez: acme_chief: Improve 
OCSPResponse error handling [software/acme-chief] - 10https://gerrit.wikimedia.org/r/689068 (https://phabricator.wikimedia.org/T282490) [13:36:02] (03PS7) 10Elukey: WIP - Add istio base images build support [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) [13:36:21] (03CR) 10Elukey: WIP - Add istio base images build support (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [13:36:28] 10SRE, 10Cloud-VPS, 10Traffic, 10HTTPS: certificate for Cloud VPS has expired - https://phabricator.wikimedia.org/T282102 (10Sascha) If it helps, feel free to adopt https://certmon.toolforge.org/ which was quickly thrown together in an attempt to help Wikimedia to improve its monitoring. See [[ https://git... [13:39:25] !log rolling restart of ats-backend [13:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:07] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/688249 (https://phabricator.wikimedia.org/T233941) (owner: 10Jbond) [13:41:48] (03PS1) 10Urbanecm: enwiki: Growth features: Change help panel links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689069 (https://phabricator.wikimedia.org/T281896) [13:42:44] jouncebot: now [13:42:44] No deployments scheduled for the next 2 hour(s) and 17 minute(s) [13:42:52] jouncebot: next [13:42:52] In 2 hour(s) and 17 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210511T1600) [13:43:10] (03CR) 10Urbanecm: [C: 03+2] enwiki: Growth features: Change help panel links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689069 (https://phabricator.wikimedia.org/T281896) (owner: 10Urbanecm) [13:44:47] (03Merged) 10jenkins-bot: enwiki: Growth features: Change help panel links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689069 
(https://phabricator.wikimedia.org/T281896) (owner: 10Urbanecm) [13:47:36] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 1d4d00798bb24daa4e5b81b6c2ecda6143a6c6f0: enwiki: Growth features: Change help panel links (T281896) (duration: 01m 02s) [13:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:40] T281896: Deploy Growth features on English Wikipedia - https://phabricator.wikimedia.org/T281896 [13:47:55] * Urbanecm done [13:51:28] (03CR) 10Muehlenhoff: [C: 03+1] "Sounds good." [puppet] - 10https://gerrit.wikimedia.org/r/688277 (owner: 10Jbond) [13:57:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: IPMI sensor critical for elastic1042.eqiad.wmnet - https://phabricator.wikimedia.org/T278185 (10Cmjohnson) 05Open→03Resolved it appears the power cable was loose and not properly seated, pushed it back in and the LED l... [13:59:30] 10SRE, 10ops-eqiad, 10DC-Ops: Add eqiad airport express to Netbox - https://phabricator.wikimedia.org/T278934 (10Cmjohnson) @wiki_willy I can add to netbox but how do I classify it? I could add as an access switch but Apple is not a manufacturer listed in our devices. Please advise. [14:01:52] !log Restarted releases Jenkins for plugin upgrade # T282433 [14:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:57] 10SRE, 10ops-eqiad, 10DC-Ops: Add eqiad airport express to Netbox - https://phabricator.wikimedia.org/T278934 (10ayounsi) I created a "wireless access-point" role, as well as an "apple" manufacturer. [14:03:33] PROBLEM - wikimedia-client-errors-alerts grafana alert on alert1001 is CRITICAL: CRITICAL: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is alerting: Client error alert. 
https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [14:03:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1082.eqiad.wmnet - https://phabricator.wikimedia.org/T281794 (10Cmjohnson) [14:03:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1082.eqiad.wmnet - https://phabricator.wikimedia.org/T281794 (10Cmjohnson) 05Stalled→03Resolved [14:03:51] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Cmjohnson) [14:04:15] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:idp::client::httpd: add support for CASSSOEnabled [puppet] - 10https://gerrit.wikimedia.org/r/688249 (https://phabricator.wikimedia.org/T233941) (owner: 10Jbond) [14:04:19] (03CR) 10Jbond: [C: 03+2] hiera: enable SSOut for peopleweb and puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/688277 (owner: 10Jbond) [14:04:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1080.eqiad.mnet - https://phabricator.wikimedia.org/T280121 (10Cmjohnson) [14:05:53] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [14:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1080.eqiad.mnet - https://phabricator.wikimedia.org/T280121 (10Cmjohnson) 05Open→03Resolved [14:08:41] RECOVERY - IPMI Sensor Status on elastic1042 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:10:23] !log Restarted CI Jenkins for plugin upgrade # T282433 [14:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:35] (03PS1) 10Volans: client: get source package from correct version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/689077 
(https://phabricator.wikimedia.org/T282529) [14:14:02] !log Restarted CI Jenkins with a snapshot of the Gearman Jenkins plugin # T281737 [14:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:06] T281737: Zuul can't stop jobs or set the build description - https://phabricator.wikimedia.org/T281737 [14:14:31] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:38] (03CR) 10Volans: "For more context to review the patch see https://phabricator.wikimedia.org/T282529#7078358" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/689077 (https://phabricator.wikimedia.org/T282529) (owner: 10Volans) [14:17:43] (03PS1) 10Filippo Giunchedi: profile: don't hardcode LDAP group/base names in icinga/am [puppet] - 10https://gerrit.wikimedia.org/r/689078 [14:18:21] (03CR) 10CDanis: icinga: switch to LibreNMS AlertManager paging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685779 (https://phabricator.wikimedia.org/T281095) (owner: 10Filippo Giunchedi) [14:19:40] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29496/console" [puppet] - 10https://gerrit.wikimedia.org/r/689078 (owner: 10Filippo Giunchedi) [14:24:16] (03CR) 10Muehlenhoff: profile: don't hardcode LDAP group/base names in icinga/am (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/689078 (owner: 10Filippo Giunchedi) [14:24:40] (03CR) 10Herron: [C: 03+2] logstash101[012]: prep for reimaging [puppet] - 10https://gerrit.wikimedia.org/r/685090 (https://phabricator.wikimedia.org/T281266) (owner: 10Herron) [14:24:45] (03PS3) 10Herron: logstash101[012]: prep for reimaging [puppet] - 10https://gerrit.wikimedia.org/r/685090 (https://phabricator.wikimedia.org/T281266) [14:26:23] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [14:26:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:44] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/689077 (https://phabricator.wikimedia.org/T282529) (owner: 10Volans) [14:27:07] !log installing cgal security updates [14:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:45] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:51] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [14:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:07] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 200422336 and 14 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:33:15] (03PS2) 10Filippo Giunchedi: profile: don't hardcode LDAP group/base names in icinga/am [puppet] - 10https://gerrit.wikimedia.org/r/689078 [14:33:32] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29497/console" [puppet] - 10https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100) (owner: 10Giuseppe Lavagetto) [14:33:50] (03CR) 10Filippo Giunchedi: profile: don't hardcode LDAP group/base names in icinga/am (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/689078 (owner: 10Filippo Giunchedi) [14:34:39] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 376424 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:35:12] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29498/console" [puppet] - 10https://gerrit.wikimedia.org/r/689078 (owner: 10Filippo 
Giunchedi) [14:35:42] (03CR) 10Filippo Giunchedi: [V: 03+1] "LGTM now: https://puppet-compiler.wmflabs.org/compiler1003/29498/alert1001.wikimedia.org/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/689078 (owner: 10Filippo Giunchedi) [14:37:49] 10SRE, 10observability, 10Patch-For-Review: Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts: ` logstash1010.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reim... [14:38:20] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I like the approach. The diff in https://puppet-compiler.wmflabs.org/compiler1002/29497/mw1332.eqiad.wmnet/fulldiff.html also makes a lot " (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100) (owner: 10Giuseppe Lavagetto) [14:41:15] (03CR) 10Filippo Giunchedi: icinga: switch to LibreNMS AlertManager paging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685779 (https://phabricator.wikimedia.org/T281095) (owner: 10Filippo Giunchedi) [14:44:59] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1007 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [14:45:25] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect - Telia, AS1299/IPv6: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:45:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1077.eqiad.wmnet - https://phabricator.wikimedia.org/T281075 (10Cmjohnson) [14:45:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1077.eqiad.wmnet - https://phabricator.wikimedia.org/T281075 (10Cmjohnson) 05Open→03Resolved [14:45:45] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [14:45:47] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Cmjohnson) [14:45:59] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1008 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [14:46:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1086.eqiad.wmnet - https://phabricator.wikimedia.org/T278229 (10Cmjohnson) [14:46:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1086.eqiad.wmnet - https://phabricator.wikimedia.org/T278229 (10Cmjohnson) 05Open→03Resolved [14:46:25] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Cmjohnson) [14:46:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission db1076.eqiad.wmnet - https://phabricator.wikimedia.org/T274752 (10Cmjohnson) [14:46:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission db1076.eqiad.wmnet - https://phabricator.wikimedia.org/T274752 (10Cmjohnson) 05Open→03Resolved [14:46:45] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Cmjohnson) [14:46:59] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission bast1002.wikimedia.org - https://phabricator.wikimedia.org/T280110 (10Cmjohnson) [14:47:23] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission bast1002.wikimedia.org - https://phabricator.wikimedia.org/T280110 (10Cmjohnson) 05Open→03Resolved [14:48:34] 10SRE, 10ops-eqiad, 10serviceops: decommission scb100[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T275759 (10Cmjohnson) 05Open→03Resolved all four are removed from rack and decom'd on netbox. 
[14:49:39] jouncebot: now [14:49:40] No deployments scheduled for the next 1 hour(s) and 10 minute(s) [14:49:43] jouncebot: next [14:49:43] In 1 hour(s) and 10 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210511T1600) [14:49:52] !log installing busybox security updates [14:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:58] argh. that's not correct [14:50:22] oh wait. I can't tell time :) [14:53:45] (03CR) 10Muehlenhoff: [C: 03+2] package_builder: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/671093 (owner: 10Muehlenhoff) [14:54:22] 10SRE, 10ops-eqiad, 10DC-Ops: Add eqiad airport express to Netbox - https://phabricator.wikimedia.org/T278934 (10Cmjohnson) thanks @ayounsi could you add a device type to the list, that is a netbox requirement for me to save. [14:54:52] (03PS1) 10Jbond: O:sslcert: fix comments [puppet] - 10https://gerrit.wikimedia.org/r/689089 [14:55:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29499/console" [puppet] - 10https://gerrit.wikimedia.org/r/689089 (owner: 10Jbond) [14:56:33] (03CR) 10Jbond: [V: 03+1 C: 03+2] O:sslcert: fix comments [puppet] - 10https://gerrit.wikimedia.org/r/689089 (owner: 10Jbond) [14:56:52] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/689078 (owner: 10Filippo Giunchedi) [14:57:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10Cmjohnson) Dell is supposed to be here today to replace several more parts. 
We will see how it goes [14:57:07] moritzm: also merging yours :) [14:57:49] !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1010.eqiad.wmnet with reason: REIMAGE [14:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:04] jbond42: thx [14:59:21] (03PS1) 10Volans: fileio: uniform quotes [software/pywmflib] - 10https://gerrit.wikimedia.org/r/689090 [14:59:55] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1010.eqiad.wmnet with reason: REIMAGE [14:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:54] (03CR) 10Elukey: [C: 03+1] "This needs an immediate release! :P" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/689090 (owner: 10Volans) [15:01:02] lol [15:01:16] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:47] (03PS1) 10Andrew Bogott: profile:mariadb:core: Hack in access from labwebs to s6 [puppet] - 10https://gerrit.wikimedia.org/r/689092 (https://phabricator.wikimedia.org/T282209) [15:03:55] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:05] (03CR) 10Elukey: "I had a very interesting chat with Joe about this patch, and some suggestions came up:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) (owner: 10Elukey) [15:05:23] (03CR) 10Arturo Borrero Gonzalez: profile:mariadb:core: Hack in access from labwebs to s6 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/689092 (https://phabricator.wikimedia.org/T282209) (owner: 10Andrew Bogott) [15:06:14] (03CR) 10Andrew Bogott: profile:mariadb:core: Hack in access from labwebs to s6 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/689092 
(https://phabricator.wikimedia.org/T282209) (owner: 10Andrew Bogott) [15:06:23] (03PS2) 10Andrew Bogott: profile:mariadb:core: Hack in access from labwebs to s6 [puppet] - 10https://gerrit.wikimedia.org/r/689092 (https://phabricator.wikimedia.org/T282209) [15:07:36] (03CR) 10Volans: [C: 03+2] fileio: uniform quotes [software/pywmflib] - 10https://gerrit.wikimedia.org/r/689090 (owner: 10Volans) [15:07:50] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/688949 (https://phabricator.wikimedia.org/T282529) (owner: 10Volans) [15:10:57] 10SRE, 10decommission-hardware, 10observability: decomission mwlog1001 - https://phabricator.wikimedia.org/T282575 (10herron) p:05Triage→03Medium [15:11:06] 10SRE, 10decommission-hardware, 10observability: decomission mwlog2001 - https://phabricator.wikimedia.org/T282576 (10herron) p:05Triage→03Medium [15:11:08] (03Merged) 10jenkins-bot: fileio: uniform quotes [software/pywmflib] - 10https://gerrit.wikimedia.org/r/689090 (owner: 10Volans) [15:11:57] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team (Doing), 10User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (10jbond) >>! In T274461#7077413, @jbond wrote: >> - No group membership (turns out that was never a requirement) > @brennen, @thcipriani, @wkand... [15:12:45] (03PS1) 10Ahmon Dancy: testwikis wikis to 1.37.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689093 [15:12:49] (03CR) 10Ahmon Dancy: [C: 03+2] testwikis wikis to 1.37.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689093 (owner: 10Ahmon Dancy) [15:12:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Please collect +1 from the DBA team before merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/689092 (https://phabricator.wikimedia.org/T282209) (owner: 10Andrew Bogott) [15:13:16] 10SRE, 10observability: Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['logstash1010.eqiad.wmnet'] ` and were **ALL** successful. [15:13:52] (03Merged) 10jenkins-bot: testwikis wikis to 1.37.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689093 (owner: 10Ahmon Dancy) [15:13:58] 10SRE, 10ops-eqiad, 10DC-Ops: Add eqiad airport express to Netbox - https://phabricator.wikimedia.org/T278934 (10ayounsi) You need to create it so it matches the device model. [15:13:58] !log dancy@deploy1002 Started scap: testwikis wikis to 1.37.0-wmf.5 [15:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:20] (03CR) 10Volans: [C: 03+2] hosts: log stacktrace on update error [software/debmonitor] - 10https://gerrit.wikimedia.org/r/688949 (https://phabricator.wikimedia.org/T282529) (owner: 10Volans) [15:14:36] 10SRE, 10decommission-hardware, 10observability: decomission mwlog1001 - https://phabricator.wikimedia.org/T282575 (10herron) [15:14:40] 10SRE, 10observability, 10Patch-For-Review: Migrate mwlog/udp2log servers to Buster - https://phabricator.wikimedia.org/T224565 (10herron) [15:14:50] 10SRE, 10decommission-hardware, 10observability: decomission mwlog2001 - https://phabricator.wikimedia.org/T282576 (10herron) [15:14:52] 10SRE, 10observability, 10Patch-For-Review: Migrate mwlog/udp2log servers to Buster - https://phabricator.wikimedia.org/T224565 (10herron) [15:14:59] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] profile: don't hardcode LDAP group/base names in icinga/am [puppet] - 10https://gerrit.wikimedia.org/r/689078 (owner: 10Filippo Giunchedi) [15:16:30] (03PS1) 10Herron: remove all references to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/689094 (https://phabricator.wikimedia.org/T282575) 
[15:16:32] (03PS1) 10Herron: remove all references to mwlog2001 [puppet] - 10https://gerrit.wikimedia.org/r/689095 (https://phabricator.wikimedia.org/T282576) [15:16:43] (03Merged) 10jenkins-bot: hosts: log stacktrace on update error [software/debmonitor] - 10https://gerrit.wikimedia.org/r/688949 (https://phabricator.wikimedia.org/T282529) (owner: 10Volans) [15:16:59] (03PS4) 10Jforrester: labs: Enable TheWikipediaLibrary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685520 (https://phabricator.wikimedia.org/T282143) (owner: 10Jsn.sherman) [15:17:05] (03PS5) 10Jforrester: labs: Enable TheWikipediaLibrary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685520 (https://phabricator.wikimedia.org/T282143) (owner: 10Jsn.sherman) [15:18:05] (03CR) 10jerkins-bot: [V: 04-1] remove all references to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/689094 (https://phabricator.wikimedia.org/T282575) (owner: 10Herron) [15:18:44] (03CR) 10Cwhite: [C: 03+1] remove all references to mwlog2001 [puppet] - 10https://gerrit.wikimedia.org/r/689095 (https://phabricator.wikimedia.org/T282576) (owner: 10Herron) [15:19:03] (03CR) 10Cwhite: [C: 03+1] pontoon: use 7d retention for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/689000 (owner: 10Filippo Giunchedi) [15:19:16] dancy: OK if I merge a beta-only config change? No need to do a prod sync. [15:19:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] client: get source package from correct version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/689077 (https://phabricator.wikimedia.org/T282529) (owner: 10Volans) [15:19:21] (03CR) 10Cwhite: [C: 03+1] pontoon: set required librenms variables [puppet] - 10https://gerrit.wikimedia.org/r/688999 (owner: 10Filippo Giunchedi) [15:19:32] James_F: Go for it. 
[15:19:56] (03PS2) 10Herron: remove all references to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/689094 (https://phabricator.wikimedia.org/T282575) [15:20:36] (03PS6) 10Jforrester: [BETA CLUSTER] Enable TheWikipediaLibrary on all beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685520 (https://phabricator.wikimedia.org/T282143) (owner: 10Jsn.sherman) [15:20:40] (03CR) 10Herron: [C: 03+2] remove all references to mwlog2001 [puppet] - 10https://gerrit.wikimedia.org/r/689095 (https://phabricator.wikimedia.org/T282576) (owner: 10Herron) [15:20:40] (03CR) 10Jforrester: [C: 03+2] [BETA CLUSTER] Enable TheWikipediaLibrary on all beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685520 (https://phabricator.wikimedia.org/T282143) (owner: 10Jsn.sherman) [15:20:43] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: use 7d retention for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/689000 (owner: 10Filippo Giunchedi) [15:20:45] (03PS2) 10Herron: remove all references to mwlog2001 [puppet] - 10https://gerrit.wikimedia.org/r/689095 (https://phabricator.wikimedia.org/T282576) [15:20:52] (03PS2) 10Filippo Giunchedi: pontoon: use 7d retention for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/689000 [15:21:36] (03Merged) 10jenkins-bot: [BETA CLUSTER] Enable TheWikipediaLibrary on all beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685520 (https://phabricator.wikimedia.org/T282143) (owner: 10Jsn.sherman) [15:21:51] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: set required librenms variables [puppet] - 10https://gerrit.wikimedia.org/r/688999 (owner: 10Filippo Giunchedi) [15:22:39] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/685738 (owner: 10Filippo Giunchedi) [15:23:37] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: add public LB class [puppet] - 10https://gerrit.wikimedia.org/r/685738 (owner: 10Filippo Giunchedi) [15:23:41] (03PS9) 10Ahmon Dancy: WIP: Test emailing 
notification of security patch failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679015 [15:23:43] (03PS3) 10Filippo Giunchedi: pontoon: add public LB class [puppet] - 10https://gerrit.wikimedia.org/r/685738 [15:23:50] 10SRE, 10decommission-hardware, 10observability, 10Patch-For-Review: decommission mwlog2001 - https://phabricator.wikimedia.org/T282576 (10herron) [15:24:01] 10SRE, 10decommission-hardware, 10observability, 10Patch-For-Review: decommission mwlog1001 - https://phabricator.wikimedia.org/T282575 (10herron) [15:24:21] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:40] (03CR) 10Cwhite: [C: 03+1] pontoon: introduce public_certs [puppet] - 10https://gerrit.wikimedia.org/r/685737 (owner: 10Filippo Giunchedi) [15:25:25] (All done.) [15:25:49] 👍🏾 [15:25:55] (03CR) 10Cwhite: [C: 03+1] remove all references to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/689094 (https://phabricator.wikimedia.org/T282575) (owner: 10Herron) [15:26:31] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: introduce public_certs [puppet] - 10https://gerrit.wikimedia.org/r/685737 (owner: 10Filippo Giunchedi) [15:26:38] (03PS3) 10Filippo Giunchedi: pontoon: introduce public_certs [puppet] - 10https://gerrit.wikimedia.org/r/685737 [15:27:08] 10SRE, 10decommission-hardware, 10observability, 10Patch-For-Review: decommission mwlog2001 - https://phabricator.wikimedia.org/T282576 (10herron) [15:27:09] !log herron@cumin1001 START - Cookbook sre.hosts.decommission for hosts mwlog2001.codfw.wmnet [15:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:37] (03CR) 10Jbond: [C: 04-1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/688333 (https://phabricator.wikimedia.org/T232343) (owner: 10Jbond) [15:28:33] (03PS7) 10Effie Mouzeli: Add canary support in scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/685748 
(https://phabricator.wikimedia.org/T282148) [15:29:14] (03CR) 10jerkins-bot: [V: 04-1] Add canary support in scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/685748 (https://phabricator.wikimedia.org/T282148) (owner: 10Effie Mouzeli) [15:30:57] RECOVERY - Stale file for node-exporter textfile in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [15:31:18] (03CR) 10Volans: [C: 03+2] client: get source package from correct version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/689077 (https://phabricator.wikimedia.org/T282529) (owner: 10Volans) [15:31:34] !log dancy@deploy1002 scap failed: average error rate on 9/9 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/83629bcb5560d11e61d3085c89dd9ed6 for details) [15:31:35] !log dancy@deploy1002 scap failed: RuntimeError scap failed: average error rate on 9/9 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/83629bcb5560d11e61d3085c89dd9ed6 for details) (duration: 17m 36s) [15:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:26] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be100[12] - https://phabricator.wikimedia.org/T276637 (10Cmjohnson) @Jclark-ctr moss-be1001 cables are wrong, the ports you have them connected to are already labeled for cloudcephosd1016 but I see that the server is not connected to the s... 
[15:33:09] (03CR) 10Muehlenhoff: "Looks good, but we also need to remove hieradata/hosts/mwlog1001.yaml as well" [puppet] - 10https://gerrit.wikimedia.org/r/689094 (https://phabricator.wikimedia.org/T282575) (owner: 10Herron) [15:33:18] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:27] (03PS1) 10Ahmon Dancy: testwikis wikis to 1.37.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689103 [15:33:30] (03CR) 10Ahmon Dancy: [C: 03+2] testwikis wikis to 1.37.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689103 (owner: 10Ahmon Dancy) [15:33:32] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be100[12] - https://phabricator.wikimedia.org/T276637 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr [15:34:09] (03Merged) 10jenkins-bot: testwikis wikis to 1.37.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689103 (owner: 10Ahmon Dancy) [15:34:14] !log dancy@deploy1002 Started scap: testwikis wikis to 1.37.0-wmf.4 [15:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:19] (03Merged) 10jenkins-bot: client: get source package from correct version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/689077 (https://phabricator.wikimedia.org/T282529) (owner: 10Volans) [15:34:30] (03PS3) 10Herron: remove all references to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/689094 (https://phabricator.wikimedia.org/T282575) [15:35:28] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/689094 (https://phabricator.wikimedia.org/T282575) (owner: 10Herron) [15:35:48] (03PS4) 10Herron: remove all references to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/689094 (https://phabricator.wikimedia.org/T282575) [15:36:19] !log dancy@deploy1002 sync-world aborted: testwikis wikis to 1.37.0-wmf.4 (duration: 02m 04s) [15:36:20] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:21] (03CR) 10Jbond: "minor comments, nits inline will leave +1 to DBA's" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/689092 (https://phabricator.wikimedia.org/T282209) (owner: 10Andrew Bogott) [15:36:53] (03CR) 10Bstorm: [C: 03+2] wikireplicas: remove the old wikireplicas profile from the proxy [puppet] - 10https://gerrit.wikimedia.org/r/688368 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm) [15:37:04] !log dancy@deploy1002 scap failed: average error rate on 9/9 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/83629bcb5560d11e61d3085c89dd9ed6 for details) [15:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:30] (03CR) 10Herron: [C: 03+2] remove all references to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/689094 (https://phabricator.wikimedia.org/T282575) (owner: 10Herron) [15:38:03] (03PS8) 10Elukey: WIP - Add istio base images build support [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) [15:38:30] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mwlog2001.codfw.wmnet [15:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:36] 10SRE, 10decommission-hardware, 10observability, 10Patch-For-Review: decommission mwlog2001 - https://phabricator.wikimedia.org/T282576 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: `mwlog2001.codfw.wmnet` - mwlog2001.codfw.wmnet (**PASS**) - Downtimed ho... 
[15:39:10] (03PS8) 10Effie Mouzeli: Add canary support in scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/685748 (https://phabricator.wikimedia.org/T282148) [15:39:25] (03PS9) 10Effie Mouzeli: Add canary support in scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/685748 (https://phabricator.wikimedia.org/T282148) [15:39:46] (03PS3) 10Cwhite: Build on the production builder host [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/688418 [15:39:48] (03CR) 10jerkins-bot: [V: 04-1] Add canary support in scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/685748 (https://phabricator.wikimedia.org/T282148) (owner: 10Effie Mouzeli) [15:40:10] (03CR) 10Jbond: [C: 03+1] mail: move default mail relay config out of standard module [puppet] - 10https://gerrit.wikimedia.org/r/686633 (https://phabricator.wikimedia.org/T232343) (owner: 10Herron) [15:41:24] 10SRE, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T280668 (10Cmjohnson) 05Open→03Resolved [15:43:00] (03PS9) 10Elukey: WIP - Add istio base images build support [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) [15:46:49] (03CR) 10Jbond: "see comments inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/688391 (https://phabricator.wikimedia.org/T232343) (owner: 10Herron) [15:47:14] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 64, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:47:38] 10SRE, 10decommission-hardware, 10observability, 10Patch-For-Review: decommission mwlog1001 - https://phabricator.wikimedia.org/T282575 (10herron) [15:47:55] !log herron@cumin1001 START - Cookbook sre.hosts.decommission for hosts mwlog1001.eqiad.wmnet [15:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:01] 10SRE, 10ops-eqiad, 10DC-Ops: Add eqiad airport express to Netbox - 
https://phabricator.wikimedia.org/T278934 (10Cmjohnson) 05Open→03Resolved Added Airport Express and connected to mr1 in netbox [15:53:15] (03PS5) 10Jbond: P:tendril::webserver: update dbtree to use new pki for tls certs [puppet] - 10https://gerrit.wikimedia.org/r/688953 (https://phabricator.wikimedia.org/T282531) [15:55:03] !log restart haproxy on dbproxy1018/9 to remove old config [15:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:33] (03PS1) 10Ssingh: test_dns: add a test to check for DoH response headers [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/689114 [15:56:45] (03PS1) 10Legoktm: lists: Redirect /admindb/ URLs to Postorius too [puppet] - 10https://gerrit.wikimedia.org/r/689120 (https://phabricator.wikimedia.org/T282581) [15:58:05] (03CR) 10JMeybohm: [C: 04-1] Helm chart to run MediaWiki (0313 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) (owner: 10Giuseppe Lavagetto) [15:58:48] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mwlog1001.eqiad.wmnet [15:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:56] 10SRE, 10decommission-hardware, 10observability, 10Patch-For-Review: decommission mwlog1001 - https://phabricator.wikimedia.org/T282575 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: `mwlog1001.eqiad.wmnet` - mwlog1001.eqiad.wmnet (**PASS**) - Downtimed ho... 
[15:59:08] 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10Krinkle) [15:59:14] (03CR) 10Jbond: [C: 03+2] P:tendril::webserver: update dbtree to use new pki for tls certs [puppet] - 10https://gerrit.wikimedia.org/r/688953 (https://phabricator.wikimedia.org/T282531) (owner: 10Jbond) [15:59:18] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: (no justification provided) [15:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] jbond42 and cdanis: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210511T1600). [16:01:03] (03CR) 10Ssingh: [C: 03+2] test_dns: add a test to check for DoH response headers [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/689114 (owner: 10Ssingh) [16:02:10] (03CR) 10Jbond: [C: 03+2] hiera - cp1077: test CA bundle with pki and puppet ROOT ca certs [puppet] - 10https://gerrit.wikimedia.org/r/685496 (https://phabricator.wikimedia.org/T281673) (owner: 10Jbond) [16:03:43] (03PS4) 10Cwhite: Build on the production builder host [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/688418 [16:04:45] (03PS9) 10Giuseppe Lavagetto: safe-service-restart: only verify pooled services [puppet] - 10https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100) [16:04:56] (03CR) 10Giuseppe Lavagetto: safe-service-restart: only verify pooled services (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100) (owner: 10Giuseppe Lavagetto) [16:06:14] 10SRE, 10Wikimedia-Incident: Uncached wiki requests partially unavailable due to excessive request rates from a bot - https://phabricator.wikimedia.org/T280232 (10Krinkle) [16:07:15] (03PS1) 10Jbond: Revert "hiera - cp1077: test CA bundle with pki and puppet ROOT ca certs" [puppet] - 
10https://gerrit.wikimedia.org/r/689050 [16:07:29] (03Abandoned) 10Jbond: Revert "hiera - cp1077: test CA bundle with pki and puppet ROOT ca certs" [puppet] - 10https://gerrit.wikimedia.org/r/689050 (owner: 10Jbond) [16:09:34] (03PS1) 10Jbond: hiera - cp1079: test new CA on cp1079 [puppet] - 10https://gerrit.wikimedia.org/r/689133 [16:10:29] (03CR) 10Jbond: [C: 03+2] hiera - cp1079: test new CA on cp1079 [puppet] - 10https://gerrit.wikimedia.org/r/689133 (owner: 10Jbond) [16:12:55] !log restarting bacula-dir on backup1001, stuck process [16:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:36] (03CR) 10BryanDavis: [C: 03+1] lists: Redirect /admindb/ URLs to Postorius too [puppet] - 10https://gerrit.wikimedia.org/r/689120 (https://phabricator.wikimedia.org/T282581) (owner: 10Legoktm) [16:14:39] (03Abandoned) 10Effie Mouzeli: (WIP) conftool: improve safe-service-restart multiple cluster support [puppet] - 10https://gerrit.wikimedia.org/r/681676 (https://phabricator.wikimedia.org/T279100) (owner: 10Effie Mouzeli) [16:15:25] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10aborrero) p:05Triage→03Medium [16:16:39] (03PS16) 10Effie Mouzeli: Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) (owner: 10Giuseppe Lavagetto) [16:17:00] (03CR) 10jerkins-bot: [V: 04-1] Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) (owner: 10Giuseppe Lavagetto) [16:17:08] 10SRE, 10Wikimedia-Incident: Uncached wiki requests partially unavailable due to excessive request rates from a bot - https://phabricator.wikimedia.org/T280232 (10Krinkle) [16:17:32] PROBLEM - bacula director process on backup1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (bacula), command name bacula-dir 
https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [16:18:26] that's me, fixing [16:18:33] (I logged it) [16:18:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={bacula,netbox_device_statistics} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:19:42] (03PS1) 10Jcrespo: bacula: Fix new job defaults for read-write Es database backups [puppet] - 10https://gerrit.wikimedia.org/r/689140 (https://phabricator.wikimedia.org/T282249) [16:19:50] (03PS1) 10Jbond: CP1079: revert combined CA [puppet] - 10https://gerrit.wikimedia.org/r/689141 [16:21:15] (03CR) 10Jcrespo: [C: 03+2] bacula: Fix new job defaults for read-write Es database backups [puppet] - 10https://gerrit.wikimedia.org/r/689140 (https://phabricator.wikimedia.org/T282249) (owner: 10Jcrespo) [16:21:40] (03PS10) 10Jbond: P:trafficserver::backend: update the source of the ATS trusted ca bundle [puppet] - 10https://gerrit.wikimedia.org/r/685497 (https://phabricator.wikimedia.org/T281673) [16:25:03] (03PS11) 10Jbond: P:trafficserver::backend: update the source of the ATS trusted ca bundle [puppet] - 10https://gerrit.wikimedia.org/r/685497 (https://phabricator.wikimedia.org/T281673) [16:27:56] (03PS1) 10Jcrespo: bacula: Fix job defaults for read-write Es database backups (2) [puppet] - 10https://gerrit.wikimedia.org/r/689142 (https://phabricator.wikimedia.org/T282249) [16:29:19] 10SRE, 10decommission-hardware, 10observability: decommission mwlog2001 - https://phabricator.wikimedia.org/T282576 (10herron) [16:29:25] (03CR) 10Jcrespo: [C: 03+2] bacula: Fix job defaults for read-write Es database backups (2) [puppet] - 10https://gerrit.wikimedia.org/r/689142 (https://phabricator.wikimedia.org/T282249) (owner: 10Jcrespo) [16:29:31] 10SRE, 10decommission-hardware, 10observability: decommission mwlog1001 - https://phabricator.wikimedia.org/T282575 (10herron) [16:30:46] 10SRE, 
10decommission-hardware, 10observability: decommission mwlog1001 - https://phabricator.wikimedia.org/T282575 (10herron) Will give this a 1w grace period before handing off for disk removal/wipe [16:31:00] RECOVERY - bacula director process on backup1001 is OK: PROCS OK: 1 process with UID = 112 (bacula), command name bacula-dir https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [16:31:52] things should be ok now [16:32:13] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): Puppet labs/private.git data loss incident affecting some projects - https://phabricator.wikimedia.org/T254491 (10aborrero) 05Open→03Resolved a:03aborrero [16:32:16] 10SRE, 10decommission-hardware, 10observability: decommission mwlog2001 - https://phabricator.wikimedia.org/T282576 (10herron) a:05herron→03Papaul [16:32:19] 10SRE, 10decommission-hardware, 10observability: decommission mwlog2001 - https://phabricator.wikimedia.org/T282576 (10herron) [16:37:33] (03PS1) 10Cwhite: rsyslog: enable ecs_170 template and transition prometheus [puppet] - 10https://gerrit.wikimedia.org/r/689160 (https://phabricator.wikimedia.org/T234565) [16:38:31] (03PS1) 10Ssingh: test_dns: improve identifier for DoH response [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/689162 [16:39:18] (03CR) 10Ssingh: [C: 03+2] test_dns: improve identifier for DoH response [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/689162 (owner: 10Ssingh) [16:40:02] waiting to see metric collection coming back [16:40:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:45:41] 10SRE, 10LDAP-Access-Requests: Please grant CRS access to Superset/Turnilo - https://phabricator.wikimedia.org/T282589 (10Elitre) I will start: my Wikitech account is Elitre (same handle as here on Phab).
[16:49:10] 10SRE, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021): Please grant CRS access to Superset/Turnilo - https://phabricator.wikimedia.org/T282589 (10Elitre) [16:54:00] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team (Doing), 10User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (10thcipriani) >>! In T274461#7078632, @jbond wrote: >>>! In T274461#7077413, @jbond wrote: >>> - No group membership (turns out that was never a... [16:55:49] 10SRE, 10Okapi [Wikimedia Enterprise], 10Traffic: "wikimedia.com" DNS transfer to Wikimedia Enterprise's AWS infra - https://phabricator.wikimedia.org/T281428 (10BBlack) A lot of what's in that zonefile of course will change for the new DNS setup, or is irrelevant to any smooth transition, etc. The key part... [17:00:04] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210511T1700). Please do the needful. [17:01:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:02:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:04:26] (03PS17) 10Effie Mouzeli: Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) (owner: 10Giuseppe Lavagetto) [17:04:27] (03PS1) 10Herron: logstash101[012]: use default OS installer version [puppet] - 10https://gerrit.wikimedia.org/r/689166 (https://phabricator.wikimedia.org/T281266) [17:04:45] (03CR) 10jerkins-bot: [V: 04-1] Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) (owner: 10Giuseppe Lavagetto) [17:05:16] (03CR) 10Herron: [C: 03+2] logstash101[012]: use default OS installer version [puppet] - 10https://gerrit.wikimedia.org/r/689166 (https://phabricator.wikimedia.org/T281266) (owner: 10Herron) [17:07:31] !log dancy@deploy1002 scap failed: average error rate on 9/9 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/83629bcb5560d11e61d3085c89dd9ed6 for details) [17:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:49] (03CR) 10Andrew Bogott: profile:mariadb:core: Hack in access from labwebs to s6 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/689092 (https://phabricator.wikimedia.org/T282209) (owner: 10Andrew Bogott) [17:07:58] (03PS18) 10Effie Mouzeli: Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) (owner: 10Giuseppe Lavagetto) [17:08:09] (03PS3) 10Andrew Bogott: profile:mariadb:core: Hack in access from labwebs to s6 [puppet] - 10https://gerrit.wikimedia.org/r/689092 (https://phabricator.wikimedia.org/T282209) [17:08:15] 10SRE, 10observability, 10Patch-For-Review: Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 
(10ops-monitoring-bot) Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts: ` logstash1010.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reim... [17:08:19] (03CR) 10jerkins-bot: [V: 04-1] Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) (owner: 10Giuseppe Lavagetto) [17:09:08] (03PS2) 10Dzahn: switch peopleweb backend to people1003 [dns] - 10https://gerrit.wikimedia.org/r/682770 [17:10:41] PROBLEM - ensure kvm processes are running on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:10:53] (03CR) 10Dzahn: [C: 03+2] switch peopleweb backend to people1003 [dns] - 10https://gerrit.wikimedia.org/r/682770 (owner: 10Dzahn) [17:13:17] PROBLEM - nova-compute proc minimum on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:17:30] (03PS1) 10Legoktm: lists: Make pipermail redirects 302s, some are broken [puppet] - 10https://gerrit.wikimedia.org/r/689171 [17:18:43] !log disabled pipermail redirects on lists.wikimedia.org [17:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:55] (03CR) 10Legoktm: [C: 03+2] lists: Redirect /admindb/ URLs to Postorius too [puppet] - 10https://gerrit.wikimedia.org/r/689120 (https://phabricator.wikimedia.org/T282581) (owner: 10Legoktm) [17:20:01] !log the backend for people.wikimedia.org switched from people1002 to people1003, the people.wikimedia.org CNAME has been updated. MOTD is about to be updated to inform users. 
[17:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:05] (03CR) 10Legoktm: [C: 03+2] lists: Make pipermail redirects 302s, some are broken [puppet] - 10https://gerrit.wikimedia.org/r/689171 (owner: 10Legoktm) [17:20:14] (03PS2) 10Legoktm: lists: Make pipermail redirects 302s, some are broken [puppet] - 10https://gerrit.wikimedia.org/r/689171 [17:20:17] (03CR) 10Legoktm: [V: 03+2 C: 03+2] lists: Make pipermail redirects 302s, some are broken [puppet] - 10https://gerrit.wikimedia.org/r/689171 (owner: 10Legoktm) [17:23:32] (03PS1) 10Dzahn: peopleweb: switch rsync direction to people1003->people2001 [puppet] - 10https://gerrit.wikimedia.org/r/689178 [17:23:33] (03PS1) 10Dzahn: peopleweb: set a different MOTD if not on the active backend [puppet] - 10https://gerrit.wikimedia.org/r/689179 [17:25:20] (03CR) 10jerkins-bot: [V: 04-1] peopleweb: set a different MOTD if not on the active backend [puppet] - 10https://gerrit.wikimedia.org/r/689179 (owner: 10Dzahn) [17:29:10] !log andrew@deploy1002 Started deploy [horizon/deploy@2604d7b]: testing default policy deployment in codfw1dev [17:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:55] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: mailman3: redirect https://lists.wikimedia.org/mailman/admindb/... urls - https://phabricator.wikimedia.org/T282581 (10Legoktm) 05Open→03Resolved a:03Legoktm https://lists.wikimedia.org/mailman/admindb/cloud-announce redirects now. 
[17:31:09] !log andrew@deploy1002 Finished deploy [horizon/deploy@2604d7b]: testing default policy deployment in codfw1dev (duration: 01m 59s) [17:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:22] (03PS2) 10Dzahn: peopleweb: set a different MOTD if not on the active backend [puppet] - 10https://gerrit.wikimedia.org/r/689179 [17:32:25] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Implement static redirects from pipermail archives to hyperkitty archives - https://phabricator.wikimedia.org/T280731 (10Legoktm) The script misredirected some old wikien-l links: * https://lists.wikimedia.org/pipermail/wikien-l/2004-December/017659.html... [17:32:29] !log andrew@deploy1002 Started deploy [horizon/deploy@acc3c68]: testing default policy deployment in codfw1dev (again) [17:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:22] !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1010.eqiad.wmnet with reason: REIMAGE [17:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:49] (03CR) 10jerkins-bot: [V: 04-1] peopleweb: set a different MOTD if not on the active backend [puppet] - 10https://gerrit.wikimedia.org/r/689179 (owner: 10Dzahn) [17:33:56] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team (Doing), 10User-brennen: Define auth strategy for GitLab - https://phabricator.wikimedia.org/T274461 (10jbond) > that we have a gerrit group called wmde that is populated with the ldap group wmde is...probably an anti pattern :) > We should not r... 
[17:34:56] !log andrew@deploy1002 Finished deploy [horizon/deploy@acc3c68]: testing default policy deployment in codfw1dev (again) (duration: 02m 27s) [17:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:01] !log andrew@deploy1002 Started deploy [horizon/deploy@acc3c68]: testing default policy deployment in codfw1dev (again) [17:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:32] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1010.eqiad.wmnet with reason: REIMAGE [17:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:21] 10SRE, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021): Please grant CRS access to Superset/Turnilo - https://phabricator.wikimedia.org/T282589 (10sgrabarczuk) [17:36:24] 10SRE, 10LDAP-Access-Requests: Grant access to LDAP/WMF for SGrabarczuk - https://phabricator.wikimedia.org/T282475 (10sgrabarczuk) [17:36:27] !log andrew@deploy1002 Finished deploy [horizon/deploy@acc3c68]: testing default policy deployment in codfw1dev (again) (duration: 01m 25s) [17:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:16] (03CR) 10Jbond: [C: 03+1] "lgtm thx" [puppet] - 10https://gerrit.wikimedia.org/r/689092 (https://phabricator.wikimedia.org/T282209) (owner: 10Andrew Bogott) [17:39:50] (03PS1) 10Jsn.sherman: [BETA CLUSTER] set TwlRegistrationDays to 1 for TheWikipediaLibrary on all beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689186 [17:41:57] (03PS3) 10Dzahn: peopleweb: set a different MOTD if not on the active backend [puppet] - 10https://gerrit.wikimedia.org/r/689179 [17:46:22] (03CR) 10Scardenasmolinar: [C: 03+1] [BETA CLUSTER] set TwlRegistrationDays to 1 for TheWikipediaLibrary on all beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689186 (owner: 10Jsn.sherman) [17:47:38] (03PS1) 10Herron: 
scap: switch logstash check host to logstash1023 [puppet] - 10https://gerrit.wikimedia.org/r/689189 (https://phabricator.wikimedia.org/T281266) [17:47:45] 10SRE, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021): Please grant CRS access to Superset/Turnilo - https://phabricator.wikimedia.org/T282589 (10Trizek-WMF) trizek [17:48:18] (03PS1) 10Andrew Bogott: Deploy OpenStack Trove in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/689190 (https://phabricator.wikimedia.org/T212595) [17:48:19] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Implement static redirects from pipermail archives to hyperkitty archives - https://phabricator.wikimedia.org/T280731 (10Legoktm) >>! In T280731#7079229, @Legoktm wrote: > * https://lists.wikimedia.org/pipermail/wikien-l/2004-December/017659.html The Mes... [17:48:56] (03CR) 10Ahmon Dancy: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/689189 (https://phabricator.wikimedia.org/T281266) (owner: 10Herron) [17:49:13] 10SRE, 10observability, 10Patch-For-Review: Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['logstash1010.eqiad.wmnet'] ` and were **ALL** successful. 
[17:49:18] (03CR) 10Herron: [C: 03+2] scap: switch logstash check host to logstash1023 [puppet] - 10https://gerrit.wikimedia.org/r/689189 (https://phabricator.wikimedia.org/T281266) (owner: 10Herron) [17:50:38] 10SRE, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021): Please grant CRS access to Superset/Turnilo - https://phabricator.wikimedia.org/T282589 (10Johan) Johan [17:51:42] (03PS4) 10Dzahn: peopleweb: set a different MOTD if not on the active backend [puppet] - 10https://gerrit.wikimedia.org/r/689179 [17:52:11] (03PS1) 10Ahmon Dancy: logstash_checker.py: Provide more info on error [puppet] - 10https://gerrit.wikimedia.org/r/689192 [17:53:02] (03CR) 10jerkins-bot: [V: 04-1] logstash_checker.py: Provide more info on error [puppet] - 10https://gerrit.wikimedia.org/r/689192 (owner: 10Ahmon Dancy) [17:55:20] (03PS2) 10Ahmon Dancy: logstash_checker.py: Provide more info on error [puppet] - 10https://gerrit.wikimedia.org/r/689192 [18:00:04] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210511T1800) [18:00:39] (03PS3) 10Ahmon Dancy: logstash_checker.py: Provide more info on error [puppet] - 10https://gerrit.wikimedia.org/r/689192 [18:01:18] (03CR) 10jerkins-bot: [V: 04-1] logstash_checker.py: Provide more info on error [puppet] - 10https://gerrit.wikimedia.org/r/689192 (owner: 10Ahmon Dancy) [18:02:26] (03PS4) 10Ahmon Dancy: logstash_checker.py: Provide more info on error [puppet] - 10https://gerrit.wikimedia.org/r/689192 [18:03:58] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/29503/people1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/689179 (owner: 10Dzahn) [18:04:00] (03CR) 10Dzahn: [C: 03+2] peopleweb: set a different MOTD if not on the active backend [puppet] - 10https://gerrit.wikimedia.org/r/689179 (owner: 10Dzahn) [18:05:38] (03CR) 10Dzahn: [C: 03+2] peopleweb: switch rsync direction to people1003->people2001 [puppet] - 
10https://gerrit.wikimedia.org/r/689178 (owner: 10Dzahn) [18:05:45] (03PS2) 10Dzahn: peopleweb: switch rsync direction to people1003->people2001 [puppet] - 10https://gerrit.wikimedia.org/r/689178 [18:06:39] (03CR) 10Ahmon Dancy: [C: 03+1] logstash_checker.py: Provide more info on error [puppet] - 10https://gerrit.wikimedia.org/r/689192 (owner: 10Ahmon Dancy) [18:08:27] (03CR) 10Ahmon Dancy: [C: 03+2] .pipeline/wmf-publish/build: Use --skip-message-purge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685952 (owner: 10Ahmon Dancy) [18:09:11] (03Merged) 10jenkins-bot: .pipeline/wmf-publish/build: Use --skip-message-purge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685952 (owner: 10Ahmon Dancy) [18:09:30] (03PS1) 10Ahmon Dancy: testwikis wikis to 1.37.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689194 [18:09:32] (03CR) 10Ahmon Dancy: [C: 03+2] testwikis wikis to 1.37.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689194 (owner: 10Ahmon Dancy) [18:10:43] (03Merged) 10jenkins-bot: testwikis wikis to 1.37.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689194 (owner: 10Ahmon Dancy) [18:10:47] !log dancy@deploy1002 Started scap: testwikis wikis to 1.37.0-wmf.5 [18:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:39] (03PS8) 10Herron: mail: move default mail relay config out of standard module [puppet] - 10https://gerrit.wikimedia.org/r/686633 (https://phabricator.wikimedia.org/T232343) [18:12:18] (03CR) 10Herron: mail: move default mail relay config out of standard module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/686633 (https://phabricator.wikimedia.org/T232343) (owner: 10Herron) [18:16:59] (03PS1) 10Dzahn: peopleweb: fix logic for changing MOTD based on rsync source [puppet] - 10https://gerrit.wikimedia.org/r/689197 [18:17:14] (03CR) 10jerkins-bot: [V: 04-1] peopleweb: fix logic for changing MOTD based on rsync source [puppet] - 
10https://gerrit.wikimedia.org/r/689197 (owner: 10Dzahn) [18:20:25] 10SRE, 10ops-eqiad, 10Data-Services, 10Epic, 10cloud-services-team (Hardware): Move labstore1004 and labstore1005 to 10G Ethernet - https://phabricator.wikimedia.org/T266198 (10Bstorm) 05Resolved→03Open Reviewing old tickets, this includes the 10G link for DRBD on it, and that is not yet done (though... [18:20:31] !log dancy@deploy1002 Finished scap: testwikis wikis to 1.37.0-wmf.5 (duration: 09m 43s) [18:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:59] (03PS2) 10Dzahn: peopleweb: fix logic for changing MOTD based on rsync source [puppet] - 10https://gerrit.wikimedia.org/r/689197 [18:27:42] 10SRE, 10observability: Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts: ` logstash1011.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202105111827_herro... 
[18:30:17] jouncebot: now [18:30:17] For the next 0 hour(s) and 29 minute(s): Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210511T1800) [18:30:18] 10SRE, 10Patch-For-Review: try planet/people on bullseye / upgrade people.wikimedia.org backends to bullseye - https://phabricator.wikimedia.org/T280989 (10Dzahn) [18:30:20] 10SRE, 10Patch-For-Review: try planet/people on bullseye / upgrade people.wikimedia.org backends to bullseye - https://phabricator.wikimedia.org/T280989 (10Dzahn) [18:31:13] (03PS3) 10Dzahn: peopleweb: fix logic for changing MOTD based on rsync source [puppet] - 10https://gerrit.wikimedia.org/r/689197 (https://phabricator.wikimedia.org/T280989) [18:31:21] (03CR) 10Jforrester: [C: 04-1] [BETA CLUSTER] set TwlRegistrationDays to 1 for TheWikipediaLibrary on all beta wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689186 (owner: 10Jsn.sherman) [18:31:33] (03CR) 10Andrew Bogott: [C: 03+2] Deploy OpenStack Trove in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/689190 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [18:33:02] (03PS2) 10Jsn.sherman: [BETA CLUSTER] set TwlRegistrationDays to 1 for TheWikipediaLibrary on all beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689186 [18:33:05] (03PS3) 10Jforrester: [BETA CLUSTER] Set wgTwlRegistrationDays to 1 on all beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689186 (https://phabricator.wikimedia.org/T282592) (owner: 10Jsn.sherman) [18:33:32] (03CR) 10Jforrester: [C: 03+2] [BETA CLUSTER] Set wgTwlRegistrationDays to 1 on all beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689186 (https://phabricator.wikimedia.org/T282592) (owner: 10Jsn.sherman) [18:34:30] (03Merged) 10jenkins-bot: [BETA CLUSTER] Set wgTwlRegistrationDays to 1 on all beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689186 (https://phabricator.wikimedia.org/T282592) (owner: 
10Jsn.sherman) [18:42:29] 10SRE, 10LDAP-Access-Requests, 10CommRel-Specialists-Support (Apr-Jun-2021): Grant access to LDAP/WMF for Sannita - https://phabricator.wikimedia.org/T282600 (10Sannita) [18:43:13] !log mforns@deploy1002 Started deploy [analytics/refinery@7e0598d]: Regular analytics weekly train [analytics/refinery@7e0598d3f0805bf3dda4e01b637d95c16a6a668b] [18:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:57] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti202[56] - https://phabricator.wikimedia.org/T282603 (10RobH) [18:45:05] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti202[56] - https://phabricator.wikimedia.org/T282603 (10RobH) [18:46:20] (03PS2) 10Ottomata: Migrate VirtualPageView to EventPlatform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685928 (https://phabricator.wikimedia.org/T238138) (owner: 10Mforns) [18:46:44] mforns: yt? shall I^ ? [18:46:51] ottomata: here [18:46:55] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti202[56] - https://phabricator.wikimedia.org/T282603 (10RobH) @MoritzMuehlenhoff, You approved the quote/spec for this, but we didn't get updated racking details on the procurement request T279174, so we'll need to confirm them here. I... 
[18:47:00] yes, let's go ahead [18:47:06] k [18:48:08] (03CR) 10Dzahn: [C: 03+2] peopleweb: fix logic for changing MOTD based on rsync source [puppet] - 10https://gerrit.wikimedia.org/r/689197 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [18:49:22] (03CR) 10Ottomata: [C: 03+2] Migrate VirtualPageView to EventPlatform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685928 (https://phabricator.wikimedia.org/T238138) (owner: 10Mforns) [18:50:10] (03PS1) 10Andrew Bogott: Trove: puppetize api-paste.ini [puppet] - 10https://gerrit.wikimedia.org/r/689200 (https://phabricator.wikimedia.org/T212595) [18:50:13] (03Merged) 10jenkins-bot: Migrate VirtualPageView to EventPlatform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685928 (https://phabricator.wikimedia.org/T238138) (owner: 10Mforns) [18:51:40] (03CR) 10Andrew Bogott: [C: 03+2] Trove: puppetize api-paste.ini [puppet] - 10https://gerrit.wikimedia.org/r/689200 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [18:52:54] !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1011.eqiad.wmnet with reason: REIMAGE [18:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:14] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Migrate VirtualPageView to EventPlatform on testwiki - T238138 (duration: 01m 09s) [18:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:19] T238138: VirtualPageView Event Platform Migration - https://phabricator.wikimedia.org/T238138 [18:53:25] mforns: done [18:54:58] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1011.eqiad.wmnet with reason: REIMAGE [18:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:13] (03PS1) 10Andrew Bogott: Trove: set up haproxy for trove-api in eqiad1 [puppet] - 
10https://gerrit.wikimedia.org/r/689201 (https://phabricator.wikimedia.org/T212595) [18:57:40] (03PS2) 10Andrew Bogott: Trove: set up haproxy for trove-api in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/689201 (https://phabricator.wikimedia.org/T212595) [18:57:42] (03PS1) 10Andrew Bogott: Trove api-paste.ini: remove trailing space [puppet] - 10https://gerrit.wikimedia.org/r/689203 [18:58:49] (03CR) 10Andrew Bogott: [C: 03+2] Trove api-paste.ini: remove trailing space [puppet] - 10https://gerrit.wikimedia.org/r/689203 (owner: 10Andrew Bogott) [19:00:04] dancy and brennen: Time to snap out of that daydream and deploy MediaWiki train - American Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210511T1900). [19:00:25] I will start group0 in about 30 minutes. [19:00:34] (03CR) 10Andrew Bogott: [C: 03+2] Trove: set up haproxy for trove-api in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/689201 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [19:01:12] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash7-eqiad instance=kafkamon1002 job=burrow partition={2,4} prometheus=ops site=eqiad topic={rsyslog-notice,udp_localhost-info} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasourc [19:01:12] ter=logging-eqiad&var-topic=All&var-consumer_group=All [19:03:32] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:03:32] (03PS1) 10Mforns: Migrate VirtualPageView to EventPlatform on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689205 (https://phabricator.wikimedia.org/T238138) [19:03:52] PROBLEM - Router 
interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:08:03] 10SRE, 10observability: Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['logstash1011.eqiad.wmnet'] ` and were **ALL** successful. [19:08:25] 10SRE, 10Patch-For-Review: try planet/people on bullseye / upgrade people.wikimedia.org backends to bullseye - https://phabricator.wikimedia.org/T280989 (10Dzahn) [19:11:02] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [19:13:22] 10SRE, 10Wikimedia-Mailing-lists: Make sure bounce_score_threshold isn't set too low - https://phabricator.wikimedia.org/T282501 (10Legoktm) 05Open→03Resolved Lists fixed and announced at https://lists.wikimedia.org/hyperkitty/list/listadmins-announce@lists.wikimedia.org/thread/6UUMAW52VVDF24M5OQ7NO54KY6OY... 
[19:28:58] !log mforns@deploy1002 Finished deploy [analytics/refinery@7e0598d]: Regular analytics weekly train [analytics/refinery@7e0598d3f0805bf3dda4e01b637d95c16a6a668b] (duration: 45m 45s) [19:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:23] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash7-eqiad instance=kafkamon1002 job=burrow partition={0,2,3,5,9} prometheus=ops site=eqiad topic={eqiad.w3c.reportingapi.network_error,logback-info,rsyslog-info,rsyslog-notice,rsyslog-warning,udp_localhost-err,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logsta [19:29:23] er_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [19:29:52] !log mforns@deploy1002 Started deploy [analytics/refinery@7e0598d] (thin): Regular analytics weekly train THIN [analytics/refinery@7e0598d3f0805bf3dda4e01b637d95c16a6a668b] [19:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:58] !log mforns@deploy1002 Finished deploy [analytics/refinery@7e0598d] (thin): Regular analytics weekly train THIN [analytics/refinery@7e0598d3f0805bf3dda4e01b637d95c16a6a668b] (duration: 00m 07s) [19:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:24] (03PS1) 10Ahmon Dancy: group0 wikis to 1.37.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689227 [19:31:26] (03CR) 10Ahmon Dancy: [C: 03+2] group0 wikis to 1.37.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689227 (owner: 10Ahmon Dancy) [19:32:10] (03Merged) 10jenkins-bot: group0 wikis to 1.37.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689227 (owner: 10Ahmon Dancy) [19:33:54] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 
1.37.0-wmf.5 [19:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:07] !log mforns@deploy1002 Started deploy [analytics/refinery@7e0598d] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@7e0598d3f0805bf3dda4e01b637d95c16a6a668b] [19:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:49] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [19:37:54] (03PS1) 10Andrew Bogott: nova.conf: exclude some more api config from compute nodes [puppet] - 10https://gerrit.wikimedia.org/r/689237 [19:38:38] (03CR) 10Andrew Bogott: [C: 03+2] nova.conf: exclude some more api config from compute nodes [puppet] - 10https://gerrit.wikimedia.org/r/689237 (owner: 10Andrew Bogott) [19:46:52] !log mforns@deploy1002 Finished deploy [analytics/refinery@7e0598d] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@7e0598d3f0805bf3dda4e01b637d95c16a6a668b] (duration: 09m 45s) [19:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:25] (03PS1) 10Dzahn: httpbb: fix path to test suite for peopleweb [puppet] - 10https://gerrit.wikimedia.org/r/689252 [19:54:48] (03CR) 10Dzahn: [C: 03+2] httpbb: fix path to test suite for peopleweb [puppet] - 10https://gerrit.wikimedia.org/r/689252 (owner: 10Dzahn) [19:55:28] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host people2002.codfw.wmnet [19:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:13] !log mforns@deploy1002 Started deploy [analytics/refinery@270c753]: Regular analytics weekly train [analytics/refinery@270c753fc746b979cf90e1537f9a67ede6372795] 
[20:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:26] (03PS1) 10Dzahn: backups: add people2002 to ignore file to avoid false positive monitoring alert [puppet] - 10https://gerrit.wikimedia.org/r/689259 [20:13:35] (03PS1) 10Cwhite: logstash: add openstack transition config and tests [puppet] - 10https://gerrit.wikimedia.org/r/689262 (https://phabricator.wikimedia.org/T234565) [20:17:14] !log mforns@deploy1002 Finished deploy [analytics/refinery@270c753]: Regular analytics weekly train [analytics/refinery@270c753fc746b979cf90e1537f9a67ede6372795] (duration: 17m 01s) [20:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:28] !log mforns@deploy1002 Started deploy [analytics/refinery@270c753] (thin): Regular analytics weekly train THIN [analytics/refinery@270c753fc746b979cf90e1537f9a67ede6372795] [20:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:32] !log mforns@deploy1002 Finished deploy [analytics/refinery@270c753] (thin): Regular analytics weekly train THIN [analytics/refinery@270c753fc746b979cf90e1537f9a67ede6372795] (duration: 00m 05s) [20:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:44] !log mforns@deploy1002 Started deploy [analytics/refinery@270c753] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@270c753fc746b979cf90e1537f9a67ede6372795] [20:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:04] (03PS1) 10Andrew Bogott: Trove: allow labs instances to talk to rabbitmq [puppet] - 10https://gerrit.wikimedia.org/r/689266 (https://phabricator.wikimedia.org/T212595) [20:24:42] !log mforns@deploy1002 Finished deploy [analytics/refinery@270c753] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@270c753fc746b979cf90e1537f9a67ede6372795] (duration: 06m 57s) [20:24:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:03] (03CR) 10Andrew Bogott: [C: 03+2] Trove: allow labs instances to talk to rabbitmq [puppet] - 10https://gerrit.wikimedia.org/r/689266 (https://phabricator.wikimedia.org/T212595) (owner: 10Andrew Bogott) [20:37:52] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host people2002.codfw.wmnet [20:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:46] (03CR) 10Thcipriani: [C: 03+1] "Looks like it would have made troubleshooting easier today." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/689192 (owner: 10Ahmon Dancy) [20:49:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:50:09] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 bounce runner is running very slowly - https://phabricator.wikimedia.org/T282348 (10Legoktm) >>! In T282348#7076312, @Legoktm wrote: > * I upgraded flufl.bounce which I thought might be better at parsing bounce messages, hopefully causing less crashes and then it unsu... [20:52:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:52:26] !log upgraded mailman3 on lists1001 [20:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:29] RECOVERY - mailman3_runners on lists1001 is OK: PROCS OK: 14 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:57:06] (03CR) 10Herron: "could you expand a bit on what this is meant to do in the commit msg and comments, specifically output_kafka?" 
[puppet] - 10https://gerrit.wikimedia.org/r/689160 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [21:02:38] (03CR) 10Herron: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/688502 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [21:09:42] 10SRE, 10observability: Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts: ` logstash1012.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202105112109_herro... [21:11:37] (03PS5) 10Ahmon Dancy: logstash_checker.py: Provide more info on error [puppet] - 10https://gerrit.wikimedia.org/r/689192 [21:12:57] (03CR) 10Ahmon Dancy: logstash_checker.py: Provide more info on error (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/689192 (owner: 10Ahmon Dancy) [21:15:57] RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [21:17:15] yay [21:34:57] !log herron@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1012.eqiad.wmnet with reason: REIMAGE [21:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:08] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1012.eqiad.wmnet with reason: REIMAGE [21:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:15] !log Start server-side upload for 3 video files (T282566, T282565, T282559) [21:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:20] T282566: Server side upload for Butko - https://phabricator.wikimedia.org/T282566 [21:37:21] T282559: Server side upload for Butko - https://phabricator.wikimedia.org/T282559 [21:37:21] T282565: Server side upload for Butko - 
https://phabricator.wikimedia.org/T282565 [21:43:57] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:45:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:48:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:50:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:51:22] 10SRE, 10observability: Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['logstash1012.eqiad.wmnet'] ` and were **ALL** successful. 
[21:56:19] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Schema-change: Mailman3 schema change: change utf8 columns to utf8mb4 - https://phabricator.wikimedia.org/T282621 (10Legoktm) [22:05:29] !log legoktm@cumin1001 START - Cookbook sre.hosts.decommission for hosts lists1002.wikimedia.org [22:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:48] (03PS1) 10Legoktm: decom lists1002/lists-next [puppet] - 10https://gerrit.wikimedia.org/r/689313 (https://phabricator.wikimedia.org/T281548) [22:13:52] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29508/console" [puppet] - 10https://gerrit.wikimedia.org/r/689313 (https://phabricator.wikimedia.org/T281548) (owner: 10Legoktm) [22:14:26] !log legoktm@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts lists1002.wikimedia.org [22:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:33] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Delete lists-next.wikimedia.org - https://phabricator.wikimedia.org/T281548 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by legoktm@cumin1001 for hosts: `lists1002.wikimedia.org` - lists1002.wikimedia.org (**PASS**) - Downti... [22:16:29] (03PS1) 10Legoktm: Remove lists-next [dns] - 10https://gerrit.wikimedia.org/r/689315 (https://phabricator.wikimedia.org/T281548) [22:19:21] (03CR) 10Legoktm: [C: 03+2] Remove lists-next [dns] - 10https://gerrit.wikimedia.org/r/689315 (https://phabricator.wikimedia.org/T281548) (owner: 10Legoktm) [22:22:14] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Delete lists-next.wikimedia.org - https://phabricator.wikimedia.org/T281548 (10Legoktm) >>! In T281548#7080171, @ops-monitoring-bot wrote: > - COMMON_STEPS (**FAIL**) > - **Failed to run the sre.dns.netbox cookbook**: Cumin execution failed (ex... 
[22:22:41] (03PS1) 10Zabe: Limit IA granting/revoking to stewards only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689321 (https://phabricator.wikimedia.org/T282624) [22:25:37] 10SRE, 10Wikimedia-Mailing-lists, 10translatewiki.net: Add mailman-templates to translatewiki.net - https://phabricator.wikimedia.org/T282022 (10Nemo_bis) [22:28:35] (03CR) 10Bstorm: [C: 03+2] "That looks right to me (based on the horrible source-code-generated documentation for this https://pkg.go.dev/k8s.io/kubernetes@v1.21.0/cm" [puppet] - 10https://gerrit.wikimedia.org/r/687003 (owner: 10Majavah) [22:31:03] (03CR) 10Urbanecm: [C: 04-2] "not yet. needs a final OK from T&S to make sure comms happen at right time" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689321 (https://phabricator.wikimedia.org/T282624) (owner: 10Zabe) [22:35:39] (03PS1) 10Legoktm: Remove lists1002/lists-next [labs/private] - 10https://gerrit.wikimedia.org/r/689327 (https://phabricator.wikimedia.org/T281548) [22:36:32] (03CR) 10Legoktm: [V: 03+2 C: 03+2] Remove lists1002/lists-next [labs/private] - 10https://gerrit.wikimedia.org/r/689327 (https://phabricator.wikimedia.org/T281548) (owner: 10Legoktm) [22:38:49] (03PS1) 10Tks4Fish: Adding square logo and wordmark for ptwiki 20 years celebration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689329 (https://phabricator.wikimedia.org/T281925) [22:39:44] RECOVERY - wikimedia-client-errors-alerts grafana alert on alert1001 is OK: OK: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is not alerting. 
https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [22:40:12] 10SRE, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Kanban): Move maps and scratch on cloudstore1008/9 to a DRBD failover similar to labstore1004/5 - https://phabricator.wikimedia.org/T224747 (10Bstorm) a:05Bstorm→03None [22:40:21] (03CR) 10Legoktm: [C: 03+1] production-m5.sql: Remove testmailman users [puppet] - 10https://gerrit.wikimedia.org/r/688975 (https://phabricator.wikimedia.org/T281548) (owner: 10Marostegui) [22:40:29] (03CR) 10Nemo bis: "This can be ok as a start but it's missing documentation for all the variables. In some cases it's pretty important, for instance "For: $m" (032 comments) [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/685533 (owner: 10Legoktm) [22:41:06] 10SRE, 10Wikimedia-Mailing-lists, 10translatewiki.net: Add mailman-templates to translatewiki.net - https://phabricator.wikimedia.org/T282022 (10Nemo_bis) [22:42:20] 10SRE, 10DBA, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Delete lists-next.wikimedia.org - https://phabricator.wikimedia.org/T281548 (10Legoktm) I deleted the VM, DNS, private puppet, labs/private. @Marostegui all ready for you to do database deletion! [22:44:57] 10SRE, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Kanban): Move maps and scratch on cloudstore1008/9 to a DRBD failover similar to labstore1004/5 - https://phabricator.wikimedia.org/T224747 (10Bstorm) a:03wiki_willy Assigning to @wiki_willy to see if it can be prioritized/moved however it shoul... 
[22:50:20] (03PS1) 10Zabe: Change namespace names and aliases on tiwiki and tiwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689334 (https://phabricator.wikimedia.org/T263840) [22:50:28] (03PS1) 10Tks4Fish: Adding logos for ptwiki's 20 year celebration on new vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689335 (https://phabricator.wikimedia.org/T281925) [22:50:37] 10SRE, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Kanban): Move maps and scratch on cloudstore1008/9 to a DRBD failover similar to labstore1004/5 - https://phabricator.wikimedia.org/T224747 (10wiki_willy) Hi @Bstorm - @Jclark-ctr and @Andrew typically work together once a week (usually Tuesdays)... [22:52:09] (03PS2) 10Urbanecm: Adding square logo and wordmark for ptwiki 20 years celebration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689329 (https://phabricator.wikimedia.org/T281925) (owner: 10Tks4Fish) [22:52:18] jouncebot: next [22:52:18] In 0 hour(s) and 7 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210511T2300) [22:53:08] (03CR) 10jerkins-bot: [V: 04-1] Adding logos for ptwiki's 20 year celebration on new vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689335 (https://phabricator.wikimedia.org/T281925) (owner: 10Tks4Fish) [22:53:22] (03CR) 10Urbanecm: [C: 03+2] Adding square logo and wordmark for ptwiki 20 years celebration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689329 (https://phabricator.wikimedia.org/T281925) (owner: 10Tks4Fish) [22:53:24] (03PS2) 10DLynch: Make DT's source mode toolbar available as beta on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684404 (https://phabricator.wikimedia.org/T279124) (owner: 10Esanders) [22:53:33] (03CR) 10Legoktm: [C: 04-1] mailman3: Improve the logging directory permissions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/682760 (owner: 10Ladsgroup) [22:54:33] (03PS2) 10Urbanecm: Adding logos for 
ptwiki's 20 year celebration on new vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689335 (https://phabricator.wikimedia.org/T281925) (owner: 10Tks4Fish) [22:55:11] (03PS2) 10Zabe: Change namespace names and aliases on tiwiki and tiwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689334 (https://phabricator.wikimedia.org/T263840) [22:55:25] (03PS3) 10Urbanecm: ptwiki: Use celebration logos in new vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689335 (https://phabricator.wikimedia.org/T281925) (owner: 10Tks4Fish) [22:56:42] (03CR) 10Urbanecm: [C: 03+2] Adding square logo and wordmark for ptwiki 20 years celebration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689329 (https://phabricator.wikimedia.org/T281925) (owner: 10Tks4Fish) [22:57:25] (03Merged) 10jenkins-bot: Adding square logo and wordmark for ptwiki 20 years celebration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689329 (https://phabricator.wikimedia.org/T281925) (owner: 10Tks4Fish) [23:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210511T2300). [23:00:05] kemayo and Zabe: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:09] i can deploy today [23:00:15] 👋🏻 [23:00:16] o/ [23:00:40] The two competing styles, clearly. 
[23:02:16] !log urbanecm@deploy1002 Synchronized static/images/mobile/copyright/: e35199baf01f423015905b8fec9e419ed3529787: Adding square logo and wordmark for ptwiki 20 years celebration (T281925) (duration: 01m 50s) [23:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:20] T281925: Change ptwiki logo temporarily (celebration of 20 years) - https://phabricator.wikimedia.org/T281925 [23:03:15] (03PS1) 10Urbanecm: ptwiki: Add wikipedia-pt-20.png [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689342 (https://phabricator.wikimedia.org/T281925) [23:03:27] (03CR) 10Urbanecm: [C: 03+2] Make DT's source mode toolbar available as beta on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684404 (https://phabricator.wikimedia.org/T279124) (owner: 10Esanders) [23:04:13] (03Merged) 10jenkins-bot: Make DT's source mode toolbar available as beta on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684404 (https://phabricator.wikimedia.org/T279124) (owner: 10Esanders) [23:04:18] (03CR) 10Urbanecm: [C: 03+2] ptwiki: Add wikipedia-pt-20.png [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689342 (https://phabricator.wikimedia.org/T281925) (owner: 10Urbanecm) [23:04:43] Kemayo: your patch is pulled to mwdebug1001. Can you check, please? [23:04:57] Sure, just one second. [23:05:01] thank you [23:05:08] (03Merged) 10jenkins-bot: ptwiki: Add wikipedia-pt-20.png [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689342 (https://phabricator.wikimedia.org/T281925) (owner: 10Urbanecm) [23:06:34] Urbanecm: Okay, seems to be good. 
[23:06:38] thanks, syncing [23:06:51] !log urbanecm@deploy1002 Synchronized static/images/mobile/copyright/wikipedia-pt-20.png: 60e6e4e960ee6cb31df9ce08fdeaedb647ce3afb: ptwiki: Add wikipedia-pt-20.png (T281925) (duration: 01m 08s) [23:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:53] (03PS4) 10Urbanecm: ptwiki: Use celebration logos in new vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689335 (https://phabricator.wikimedia.org/T281925) (owner: 10Tks4Fish) [23:08:47] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: eac843a69574feebbf962959e5eb9811a2a83bc4: Make DT source mode toolbar available as beta on all wikis (T279124) (duration: 01m 12s) [23:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:51] T279124: Make config change to expose source mode with tools functionality and setting - https://phabricator.wikimedia.org/T279124 [23:09:09] Kemayo: should be live! [23:10:13] Urbanecm: Yup, looks good live as well. [23:10:20] Thanks! 
[23:10:20] excellent :) [23:10:23] any time [23:10:27] (03CR) 10Urbanecm: [C: 03+2] ptwiki: Use celebration logos in new vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689335 (https://phabricator.wikimedia.org/T281925) (owner: 10Tks4Fish) [23:11:22] (03Merged) 10jenkins-bot: ptwiki: Use celebration logos in new vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689335 (https://phabricator.wikimedia.org/T281925) (owner: 10Tks4Fish) [23:12:58] (03PS3) 10Urbanecm: Change namespace names and aliases on tiwiki and tiwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689334 (https://phabricator.wikimedia.org/T263840) (owner: 10Zabe) [23:13:03] (03CR) 10Urbanecm: [C: 03+2] Change namespace names and aliases on tiwiki and tiwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689334 (https://phabricator.wikimedia.org/T263840) (owner: 10Zabe) [23:13:50] (03Merged) 10jenkins-bot: Change namespace names and aliases on tiwiki and tiwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/689334 (https://phabricator.wikimedia.org/T263840) (owner: 10Zabe) [23:14:37] Zabe: pulled to mwdebug1001, can you check? [23:14:49] yes [23:16:58] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 5bc40acfa6514f7940af0a6cef9974140680f4b9: ptwiki: Use celebration logos in new vector (T281925) (duration: 01m 06s) [23:16:59] 10SRE, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Kanban): Move maps and scratch on cloudstore1008/9 to a DRBD failover similar to labstore1004/5 - https://phabricator.wikimedia.org/T224747 (10Bstorm) I was hoping it would be discussed this week, but I somehow missed that? I commented on IRC duri... 
[23:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:02] T281925: Change ptwiki logo temporarily (celebration of 20 years) - https://phabricator.wikimedia.org/T281925 [23:17:33] Urbanecm: works the supposed way [23:17:43] thanks [23:17:45] (03PS4) 10Dave Pifke: Sketch of Performance team alerts [alerts] - 10https://gerrit.wikimedia.org/r/670230 (https://phabricator.wikimedia.org/T281358) (owner: 10Filippo Giunchedi) [23:18:31] syncing [23:19:30] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: ec37795eba5faa0c1a1dddb29504941205e155b4: Change namespace names and aliases on tiwiki and tiwiktionary (T263840) (duration: 01m 07s) [23:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:35] T263840: Update sitename/namespaces for ti.wikis - https://phabricator.wikimedia.org/T263840 [23:19:37] Zabe: live :) [23:19:51] (03CR) 10jerkins-bot: [V: 04-1] Sketch of Performance team alerts [alerts] - 10https://gerrit.wikimedia.org/r/670230 (https://phabricator.wikimedia.org/T281358) (owner: 10Filippo Giunchedi) [23:19:56] thanks :) [23:20:13] np [23:20:37] (03CR) 10Dave Pifke: "I wrote descriptions and made some minor updates to the tests from Puppet, and added some from ArcLamp. I'll add tests next." [alerts] - 10https://gerrit.wikimedia.org/r/670230 (https://phabricator.wikimedia.org/T281358) (owner: 10Filippo Giunchedi) [23:28:31] (03PS2) 10Cwhite: logstash: add openstack ECS transition config and tests [puppet] - 10https://gerrit.wikimedia.org/r/689262 (https://phabricator.wikimedia.org/T234565) [23:29:10] 10SRE, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Kanban): Move maps and scratch on cloudstore1008/9 to a DRBD failover similar to labstore1004/5 - https://phabricator.wikimedia.org/T224747 (10wiki_willy) Hey @Bstorm - both Chris and John are off for the day (EST), but I'll check with them tomorr... 
[23:30:05] 10SRE, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Kanban): Move maps and scratch on cloudstore1008/9 to a DRBD failover similar to labstore1004/5 - https://phabricator.wikimedia.org/T224747 (10Bstorm) Thanks! [23:36:46] (03PS1) 10Tim Starling: Fix changes list "hide myself" feature [core] (wmf/1.37.0-wmf.4) - 10https://gerrit.wikimedia.org/r/689059 (https://phabricator.wikimedia.org/T282183) [23:37:08] (03PS1) 10Tim Starling: Fix changes list "hide myself" feature [core] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/689060 (https://phabricator.wikimedia.org/T282183) [23:40:33] (03PS2) 10Cwhite: rsyslog: enable ecs_170 template and transition prometheus [puppet] - 10https://gerrit.wikimedia.org/r/689160 (https://phabricator.wikimedia.org/T234565) [23:44:01] (03CR) 10Cwhite: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/689160 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [23:49:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:52:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:59:54] (03CR) 10jerkins-bot: [V: 04-1] Fix changes list "hide myself" feature [core] (wmf/1.37.0-wmf.5) - 10https://gerrit.wikimedia.org/r/689060 (https://phabricator.wikimedia.org/T282183) (owner: 10Tim Starling)