[00:18:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:20:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:25:48] RECOVERY - Logstash Elasticsearch indexing errors #o11y on alert1001 is OK: (C)8 ge (W)1 ge 0.9958 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[01:07:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:09:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:06:38] (CR) DannyS712: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/672117 (owner: DannyS712)
[03:19:30] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:28:40] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.086 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:50:14] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:52:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:02:06] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[05:02:10] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[06:06:02] SRE, ops-eqiad, DBA: db1162 crashed - https://phabricator.wikimedia.org/T275309 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1162.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202103150605_marostegui_10267.log`.
[06:27:21] (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/672238
[06:27:28] (CR) Legoktm: "> Patch Set 1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/670913 (https://phabricator.wikimedia.org/T224579) (owner: Legoktm)
[06:32:34] SRE, ops-eqiad, DBA: db1162 crashed - https://phabricator.wikimedia.org/T275309 (Marostegui) It looks like the host isn't rebooting via PXE - trying to force it manually
[06:37:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:39:58] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:41:48] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:41:53] (CR) Legoktm: Support having multiple IRC feed servers (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/670913 (https://phabricator.wikimedia.org/T224579) (owner: Legoktm)
[06:42:08] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[06:42:31] SRE, ops-eqiad, DBA: db1162 crashed - https://phabricator.wikimedia.org/T275309 (Marostegui) Resolved→Open @Cmjohnson I am not able to PXE boot the host. Neither via the normal reimage process nor forcing PXE manually with: ` [06:37:22] marostegui@cumin1001:~$ sudo ipmitool -I lanplus -H db1...
[06:55:56] SRE, DBA, Wikimedia-Incident: 14 March 2021 Wikimedia API Outage - https://phabricator.wikimedia.org/T277417 (Legoktm) >>! In T277417#6912139, @RhinosF1 wrote: >>>! In T277417#6912136, @Legoktm wrote: >>> This also brought down any third party wiki using Instant Commons. >> >> The wikis actually wen...
[07:03:07] SRE, DBA, Wikimedia-Incident: 14 March 2021 Wikimedia API Outage - https://phabricator.wikimedia.org/T277417 (RhinosF1) See subtask @Legoktm
[07:07:08] SRE, ops-eqiad, DBA: db1162 crashed - https://phabricator.wikimedia.org/T275309 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1162.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1162.eqiad.wmnet'] `
[07:15:22] PROBLEM - SSH on mw2227.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:22:38] !log powercycle ms-be1038 - no ssh, no tty available in mgmt serial console, irrecoverable error saved in ilo's system logs
[07:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:16] RECOVERY - Host ms-be1038 is UP: PING OK - Packet loss = 0%, RTA = 0.15 ms
[07:25:18] PROBLEM - puppet last run on ms-be1038 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:26:12] RECOVERY - SSH on ms-be1038 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:26:48] Cc: godog: --^
[07:31:34] RECOVERY - puppet last run on ms-be1038 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:31:36] PROBLEM - very high load average likely xfs on ms-be1038 is CRITICAL: CRITICAL - load average: 138.83, 103.64, 48.96 https://wikitech.wikimedia.org/wiki/Swift
[07:32:13] (CR) Elukey: [C: +1] "Left a note to clear the Jenkins -1, but the rest LGTM!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/671132 (https://phabricator.wikimedia.org/T257572) (owner: Hnowlan)
[07:41:29] (CR) Muehlenhoff: [C: +2] Add approval for swift-roots [puppet] - https://gerrit.wikimedia.org/r/671180 (https://phabricator.wikimedia.org/T276465) (owner: Muehlenhoff)
[07:42:02] (PS2) Muehlenhoff: Add approval for snapshot admins [puppet] - https://gerrit.wikimedia.org/r/671178 (https://phabricator.wikimedia.org/T276465)
[07:47:19] (CR) Muehlenhoff: [C: +2] Add approval for snapshot admins [puppet] - https://gerrit.wikimedia.org/r/671178 (https://phabricator.wikimedia.org/T276465) (owner: Muehlenhoff)
[07:49:52] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:52:16] elukey: nice, thank you! hopefully it stays up, unlike ms-be1037
[08:09:16] (PS1) Elukey: hadoop: raise the bw usage of the hdfs-balancer [puppet] - https://gerrit.wikimedia.org/r/672333
[08:14:16] (CR) Elukey: [C: +2] hadoop: raise the bw usage of the hdfs-balancer [puppet] - https://gerrit.wikimedia.org/r/672333 (owner: Elukey)
[08:17:16] (CR) Filippo Giunchedi: pontoon: add hiera settings for o11y-grafana (1 comment) [puppet] - https://gerrit.wikimedia.org/r/671187 (owner: Herron)
[08:23:40] (CR) Volans: [C: +1] "LGTM" (1 comment) [software/homer/deploy] - https://gerrit.wikimedia.org/r/671176 (https://phabricator.wikimedia.org/T277006) (owner: Ayounsi)
[08:26:50] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (ayounsi) The original plan was to retrofit two existing 1G racks to 10G as a quick fix for the existing contention. The downside is that the migration from 1G to 10G means a rac...
[08:31:20] (PS1) Marostegui: Revert "wikireplicas: depool labsdb1009 due to instability" [puppet] - https://gerrit.wikimedia.org/r/672122
[08:32:13] (CR) Marostegui: [C: +2] Revert "wikireplicas: depool labsdb1009 due to instability" [puppet] - https://gerrit.wikimedia.org/r/672122 (owner: Marostegui)
[08:32:38] (CR) DCausse: create helmfile.d structure (6 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: Mstyles)
[08:33:33] !log Repool labsdb1009 T276980
[08:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:40] T276980: mariadb crashed on labsdb1009 - https://phabricator.wikimedia.org/T276980
[08:33:42] !log swift eqiad-prod remove decom hosts from account/container rings - T272836 T276193
[08:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:49] T276193: Decom ms-be1034 from swift - https://phabricator.wikimedia.org/T276193
[08:33:50] T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836
[08:34:37] (CR) Filippo Giunchedi: [C: +1] logstash: add normalize level filter to scap migration [puppet] - https://gerrit.wikimedia.org/r/671290 (owner: Cwhite)
[08:34:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:35:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1146:3312', diff saved to https://phabricator.wikimedia.org/P14828 and previous config saved to /var/cache/conftool/dbconfig/20210315-083555-marostegui.json
[08:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:05] (CR) Filippo Giunchedi: "recheck" [alerts] - https://gerrit.wikimedia.org/r/670231 (https://phabricator.wikimedia.org/T272977) (owner: Filippo Giunchedi)
[08:36:29] (CR) jerkins-bot: [V: -1] Run tests for alerts [alerts] - https://gerrit.wikimedia.org/r/670231 (https://phabricator.wikimedia.org/T272977) (owner: Filippo Giunchedi)
[08:37:04] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:38:10] (PS1) Marostegui: db1136: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/672337 (https://phabricator.wikimedia.org/T277007)
[08:38:46] (CR) Muehlenhoff: [C: +2] idp: Make the memcached transcoder configurable [puppet] - https://gerrit.wikimedia.org/r/671140 (https://phabricator.wikimedia.org/T271684) (owner: Muehlenhoff)
[08:43:14] RECOVERY - very high load average likely xfs on ms-be1038 is OK: OK - load average: 54.64, 61.53, 77.40 https://wikitech.wikimedia.org/wiki/Swift
[08:43:39] SRE, Kubernetes: helm test fails in ci namespace - https://phabricator.wikimedia.org/T277252 (JMeybohm) Open→Resolved a:JMeybohm Thanks for the note, I'll resolve this.
[08:44:20] hashar: thank you for https://gerrit.wikimedia.org/r/c/integration/config/+/670782 ! I forgot I need 'promtool' available, followed up with https://gerrit.wikimedia.org/r/c/integration/config/+/672340
[08:47:54] (CR) Marostegui: [C: +2] db1136: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/672337 (https://phabricator.wikimedia.org/T277007) (owner: Marostegui)
[08:48:12] moritzm: ok to merge your change?
[08:50:32] (CR) Volans: "Nice to see a new cookbook! Few comments inline, mostly optional/possible improvement." (5 comments) [cookbooks] - https://gerrit.wikimedia.org/r/670966 (https://phabricator.wikimedia.org/T273278) (owner: Razzi)
[08:52:49] marostegui: sorry, yes
[08:52:54] forgot to hit Enter
[08:52:58] moritzm: merging!
[08:53:02] thx
[08:54:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1136 T277007', diff saved to https://phabricator.wikimedia.org/P14829 and previous config saved to /var/cache/conftool/dbconfig/20210315-085409-marostegui.json
[08:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:17] T277007: Upgrade firmware on db1136 - https://phabricator.wikimedia.org/T277007
[08:54:46] elukey: about this weekend https://phabricator.wikimedia.org/T277438
[08:54:49] !log Stop MySQL on db1136 T277007
[08:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:12] (PS1) Filippo Giunchedi: pontoon: use 'puppet agent' to kick off puppet [puppet] - https://gerrit.wikimedia.org/r/672341
[08:58:10] SRE, ops-eqiad, DBA, DC-Ops, Patch-For-Review: Upgrade firmware on db1136 - https://phabricator.wikimedia.org/T277007 (Marostegui) @Cmjohnson db1136 is now off, you can proceed as needed
[08:59:04] XioNoX: nice thanks! One follow up qs - maybe we could review the mr1-* alerts to avoid including "pa*ge" in their description? (if the don't go through VO)
[08:59:13] (if possible)
[09:00:20] elukey: 302 cdanis :)
[09:02:03] (PS2) Ayounsi: Remove servers interface names from switches interfaces descriptions [software/homer/deploy] - https://gerrit.wikimedia.org/r/671176 (https://phabricator.wikimedia.org/T277006)
[09:03:47] (CR) Elukey: [C: -1] "As it is, this cookbook is very dangerous, since we co-locate Zookeeper on most clusters listed in the args with other important systems." [cookbooks] - https://gerrit.wikimedia.org/r/670966 (https://phabricator.wikimedia.org/T273278) (owner: Razzi)
[09:04:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 25%: Repool db1146:3312', diff saved to https://phabricator.wikimedia.org/P14830 and previous config saved to /var/cache/conftool/dbconfig/20210315-090409-root.json
[09:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:40] (CR) Kormat: [C: +1] pontoon: use 'puppet agent' to kick off puppet [puppet] - https://gerrit.wikimedia.org/r/672341 (owner: Filippo Giunchedi)
[09:04:57] (CR) Filippo Giunchedi: [C: +2] pontoon: use 'puppet agent' to kick off puppet [puppet] - https://gerrit.wikimedia.org/r/672341 (owner: Filippo Giunchedi)
[09:06:57] (CR) Volans: [C: +1] "LGTM" [software/homer/deploy] - https://gerrit.wikimedia.org/r/671176 (https://phabricator.wikimedia.org/T277006) (owner: Ayounsi)
[09:10:34] (PS1) Filippo Giunchedi: icinga: remove Grafana alerts for Performance [puppet] - https://gerrit.wikimedia.org/r/672346 (https://phabricator.wikimedia.org/T272979)
[09:13:15] (CR) Ayounsi: [V: +2 C: +2] Remove servers interface names from switches interfaces descriptions [software/homer/deploy] - https://gerrit.wikimedia.org/r/671176 (https://phabricator.wikimedia.org/T277006) (owner: Ayounsi)
[09:14:02] (CR) Ayounsi: [V: +2 C: +2] Remove servers interface names from switches interfaces descriptions (1 comment) [software/homer/deploy] - https://gerrit.wikimedia.org/r/671176 (https://phabricator.wikimedia.org/T277006) (owner: Ayounsi)
[09:15:28] (PS9) Elukey: profile::hadoop::common: move defaults from hiera to the profile [puppet] - https://gerrit.wikimedia.org/r/671172
[09:17:35] (CR) Elukey: [C: +2] profile::hadoop::common: move defaults from hiera to the profile [puppet] - https://gerrit.wikimedia.org/r/671172 (owner: Elukey)
[09:18:59] (PS4) Muehlenhoff: Assign mw_rc_irc role to irc2001 [puppet] - https://gerrit.wikimedia.org/r/670829 (https://phabricator.wikimedia.org/T224579)
[09:19:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 50%: Repool db1146:3312', diff saved to https://phabricator.wikimedia.org/P14831 and previous config saved to /var/cache/conftool/dbconfig/20210315-091912-root.json
[09:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:54] (PS1) Kormat: cumin: Update aliases for wikireplicas. [puppet] - https://gerrit.wikimedia.org/r/672348
[09:20:16] (PS2) Kormat: cumin: Update aliases for wikireplicas. [puppet] - https://gerrit.wikimedia.org/r/672348
[09:20:35] (CR) Vgutierrez: [C: +2] lvs: Set depool_threshold to .8 for upload & text [puppet] - https://gerrit.wikimedia.org/r/671124 (https://phabricator.wikimedia.org/T274888) (owner: Vgutierrez)
[09:21:48] (CR) Kormat: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28552/console" [puppet] - https://gerrit.wikimedia.org/r/672348 (owner: Kormat)
[09:23:10] !log rolling restart of LVS cluster to bump depool_threshold to 0.8 on text & upload clusters - T274888
[09:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:17] T274888: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888
[09:24:32] (CR) Kormat: "Tested on cumin1001:" [puppet] - https://gerrit.wikimedia.org/r/672348 (owner: Kormat)
[09:29:20] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:30:56] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 226767248 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[09:33:22] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 517160 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[09:34:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 75%: Repool db1146:3312', diff saved to https://phabricator.wikimedia.org/P14832 and previous config saved to /var/cache/conftool/dbconfig/20210315-093416-root.json
[09:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:52] PROBLEM - PyBal connections to etcd on lvs4007 is CRITICAL: CRITICAL: 15 connections established with conf2003.codfw.wmnet:2379 (min=16) https://wikitech.wikimedia.org/wiki/PyBal
[09:40:25] vgutierrez: what's better than starting the week with a pybal roll restart? :D
[09:41:00] indeed
[09:41:31] that error isn't expected though
[09:41:33] * vgutierrez checking
[09:43:55] lvs5003 BGP seems to be happy
[09:44:13] Mar 15 09:29:05 lvs5003 pybal[8612]: [bgp.FSM@0x7f7a1c1d8990 peer 103.102.166.130:179] INFO: State is now: ESTABLISHED
[09:44:13] Mar 15 09:29:05 lvs5003 pybal[8612]: [bgp.FSM@0x7f7a1c1d8590 peer 103.102.166.131:179] INFO: State is now: ESTABLISHED
[09:44:31] re lvs4007... we didn't add any services AFAIK
[09:45:35] (CR) Jcrespo: "Thank you for your contribution, please check comments by Ready and me below, more on the ticket." (2 comments) [software/wmfbackups] - https://gerrit.wikimedia.org/r/672150 (owner: H.krishna123)
[09:49:17] (CR) Kormat: [C: +2] "None of the wikireplica hosts (labsdb + clouddb) have any grants for 'wikiadmin' at a specific IP, so nothing needs to be done there." [puppet] - https://gerrit.wikimedia.org/r/668785 (https://phabricator.wikimedia.org/T275831) (owner: Dzahn)
[09:49:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 100%: Repool db1146:3312', diff saved to https://phabricator.wikimedia.org/P14833 and previous config saved to /var/cache/conftool/dbconfig/20210315-094920-root.json
[09:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:32] RECOVERY - PyBal connections to etcd on lvs4007 is OK: OK: 16 connections established with conf2003.codfw.wmnet:2379 (min=16) https://wikitech.wikimedia.org/wiki/PyBal
[09:49:35] <_joe_> vgutierrez: why don't you have a cookbook for restarting pybals?
[09:52:30] (CR) Muehlenhoff: [C: +2] Assign mw_rc_irc role to irc2001 [puppet] - https://gerrit.wikimedia.org/r/670829 (https://phabricator.wikimedia.org/T224579) (owner: Muehlenhoff)
[09:56:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1076', diff saved to https://phabricator.wikimedia.org/P14834 and previous config saved to /var/cache/conftool/dbconfig/20210315-095607-marostegui.json
[09:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:39] (CR) DCausse: create helmfile.d structure (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: Mstyles)
[09:57:17] (CR) Phedenskog: "Ooops, I missed converting perceptionsurvey-alerts, let me fix that first," [puppet] - https://gerrit.wikimedia.org/r/672346 (https://phabricator.wikimedia.org/T272979) (owner: Filippo Giunchedi)
[09:59:40] (PS1) Jbond: idp: fix sso port [puppet] - https://gerrit.wikimedia.org/r/672354
[10:00:22] (CR) Jbond: [C: +2] idp: fix sso port [puppet] - https://gerrit.wikimedia.org/r/672354 (owner: Jbond)
[10:00:41] yes
[10:02:45] (CR) Phedenskog: [C: +1] "Looks good, fixed the miss on my side." [puppet] - https://gerrit.wikimedia.org/r/672346 (https://phabricator.wikimedia.org/T272979) (owner: Filippo Giunchedi)
[10:02:55] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1114.eqiad.wmnet with reason: schema change T267767
[10:02:55] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1114.eqiad.wmnet with reason: schema change T267767
[10:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:02] T267767: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767
[10:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:37] !log kormat@cumin1001 dbctl commit (dc=all): 'db1114 depooling: schema change T267767', diff saved to https://phabricator.wikimedia.org/P14835 and previous config saved to /var/cache/conftool/dbconfig/20210315-100337-kormat.json
[10:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:22] (PS1) Jbond: cloud - pko: update memcached encoder [puppet] - https://gerrit.wikimedia.org/r/672356
[10:04:46] (CR) Jbond: [C: +2] cloud - pko: update memcached encoder [puppet] - https://gerrit.wikimedia.org/r/672356 (owner: Jbond)
[10:05:57] (PS7) Hnowlan: aqs: add aqs1011 to cassandra 3.11 test cluster, add aqs_next role [puppet] - https://gerrit.wikimedia.org/r/671132 (https://phabricator.wikimedia.org/T257572)
[10:08:05] hnowlan: \o/
[10:08:46] (CR) Hnowlan: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28553/console" [puppet] - https://gerrit.wikimedia.org/r/671132 (https://phabricator.wikimedia.org/T257572) (owner: Hnowlan)
[10:11:45] !log kormat@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 50%: schema change T267767', diff saved to https://phabricator.wikimedia.org/P14836 and previous config saved to /var/cache/conftool/dbconfig/20210315-101143-kormat.json
[10:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:52] T267767: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767
[10:13:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1076 (re)pooling @ 25%: Repool db1076', diff saved to https://phabricator.wikimedia.org/P14837 and previous config saved to /var/cache/conftool/dbconfig/20210315-101309-root.json
[10:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:46] RECOVERY - SSH on mw2227.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:19:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:19:46] (PS1) Arturo Borrero Gonzalez: cloudgw: separate dataplane network configuration into a different file [puppet] - https://gerrit.wikimedia.org/r/672360 (https://phabricator.wikimedia.org/T277287)
[10:22:25] (CR) Muehlenhoff: [C: +2] Switch idp-test to serial transcoder [puppet] - https://gerrit.wikimedia.org/r/671166 (https://phabricator.wikimedia.org/T271684) (owner: Muehlenhoff)
[10:23:49] (CR) Arturo Borrero Gonzalez: [C: +2] "PCC https://puppet-compiler.wmflabs.org/compiler1001/28554/" [puppet] - https://gerrit.wikimedia.org/r/672360 (https://phabricator.wikimedia.org/T277287) (owner: Arturo Borrero Gonzalez)
[10:26:49] !log kormat@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 100%: schema change T267767', diff saved to https://phabricator.wikimedia.org/P14838 and previous config saved to /var/cache/conftool/dbconfig/20210315-102648-kormat.json
[10:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:57] T267767: Drop default of revactor_timestamp - https://phabricator.wikimedia.org/T267767
[10:28:10] RECOVERY - puppet last run on idp-test1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[10:28:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1076 (re)pooling @ 50%: Repool db1076', diff saved to https://phabricator.wikimedia.org/P14839 and previous config saved to /var/cache/conftool/dbconfig/20210315-102813-root.json
[10:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:04] jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210315T1030).
[10:31:58] hm, my IRC client claims there are only two people in this channel o_O
[10:32:04] (PS1) Arturo Borrero Gonzalez: cloudgw: add standard mapped IPv6 addresses to the primary interfaces [puppet] - https://gerrit.wikimedia.org/r/672364 (https://phabricator.wikimedia.org/T277287)
[10:32:20] ok, my IRC client is just wrong :shrug:
[10:32:34] either that or most of us are bots ;)
[10:33:42] !log upgraded spicerack on cumin2001 to 0.0.49-1+deb10u1
[10:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:31] (PS2) Arturo Borrero Gonzalez: cloudgw: add standard mapped IPv6 addresses to the primary interfaces [puppet] - https://gerrit.wikimedia.org/r/672364 (https://phabricator.wikimedia.org/T277287)
[10:38:56] (CR) Hnowlan: [V: +1] aqs: add aqs1011 to cassandra 3.11 test cluster, add aqs_next role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/671132 (https://phabricator.wikimedia.org/T257572) (owner: Hnowlan)
[10:40:55] (CR) Arturo Borrero Gonzalez: [C: +2] "PCC: https://puppet-compiler.wmflabs.org/compiler1001/28555/" [puppet] - https://gerrit.wikimedia.org/r/672364 (https://phabricator.wikimedia.org/T277287) (owner: Arturo Borrero Gonzalez)
[10:42:15] !log installing pygments security updates on buster
[10:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:42:48] (PS6) Filippo Giunchedi: Run tests for alerts [alerts] - https://gerrit.wikimedia.org/r/670231 (https://phabricator.wikimedia.org/T272977)
[10:43:05] (CR) jerkins-bot: [V: -1] Run tests for alerts [alerts] - https://gerrit.wikimedia.org/r/670231 (https://phabricator.wikimedia.org/T272977) (owner: Filippo Giunchedi)
[10:43:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1076 (re)pooling @ 75%: Repool db1076', diff saved to https://phabricator.wikimedia.org/P14840 and previous config saved to /var/cache/conftool/dbconfig/20210315-104316-root.json
[10:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:45] (PS1) Filippo Giunchedi: Add Blubber and Pipeline [alerts] - https://gerrit.wikimedia.org/r/672365 (https://phabricator.wikimedia.org/T272977)
[10:45:06] (CR) Elukey: aqs: add aqs1011 to cassandra 3.11 test cluster, add aqs_next role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/671132 (https://phabricator.wikimedia.org/T257572) (owner: Hnowlan)
[10:51:20] SRE, CAS-SSO, Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (jbond) See below for a table showing the various value size for each encoding tested on 6.2.7, theses are vary rough measures simply len(value) on the raw buffer received from memcache however...
[10:53:21] (PS1) Hnowlan: aqs: move import of ::passwords::aqs [puppet] - https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572)
[10:54:40] (CR) Hnowlan: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28556/console" [puppet] - https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) (owner: Hnowlan)
[10:55:41] SRE, CAS-SSO, Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (jbond) I just did a quick test and the value size on idp-test1001 with SERIAL is 6208 and the size on idp1001 using KYRO is 1919
[10:55:55] !log volans@cumin2001 START - Cookbook sre.hosts.downtime for 0:05:00 on cumin2001.codfw.wmnet with reason: test
[10:55:56] !log volans@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on cumin2001.codfw.wmnet with reason: test
[10:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1076 (re)pooling @ 100%: Repool db1076', diff saved to https://phabricator.wikimedia.org/P14841 and previous config saved to /var/cache/conftool/dbconfig/20210315-105820-root.json
[10:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1129', diff saved to https://phabricator.wikimedia.org/P14842 and previous config saved to /var/cache/conftool/dbconfig/20210315-105855-marostegui.json
[10:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:26] (PS1) Volans: sre.hosts.downtime: fix example usage [cookbooks] - https://gerrit.wikimedia.org/r/672368
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210315T1100).
[11:00:04] No GERRIT patches in the queue for this window AFAICS.
[11:00:20] oh right, DST hijinks
[11:00:34] two weeks where the EU window is earlier than I’m used to ^^
[11:00:45] !log upgraded spicerack on cumin1001 to 0.0.49-1+deb10u1
[11:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:56] (CR) Filippo Giunchedi: [C: +2] icinga: remove Grafana alerts for Performance [puppet] - https://gerrit.wikimedia.org/r/672346 (https://phabricator.wikimedia.org/T272979) (owner: Filippo Giunchedi)
[11:09:47] (PS2) Filippo Giunchedi: Add Blubber and Pipeline [alerts] - https://gerrit.wikimedia.org/r/672365 (https://phabricator.wikimedia.org/T272977)
[11:10:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:12:45] (CR) Jbond: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/672368 (owner: Volans)
[11:12:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:13:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 25%: Repool db1129', diff saved to https://phabricator.wikimedia.org/P14843 and previous config saved to /var/cache/conftool/dbconfig/20210315-111315-root.json
[11:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:41] (PS2) Hnowlan: aqs: move import of ::passwords::aqs [puppet] - https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572)
[11:14:03] (CR) Elukey: [C: +1] sre.hosts.downtime: fix example usage [cookbooks] - https://gerrit.wikimedia.org/r/672368 (owner: Volans)
[11:14:30] if only CI was not stuck setting up the venv from pypi... :)
[11:15:40] (CR) Hnowlan: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28557/console" [puppet] - https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) (owner: Hnowlan)
[11:17:16] !log installing golang-1.7 security updates
[11:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:19:55] (CR) jerkins-bot: [V: -1] sre.hosts.downtime: fix example usage [cookbooks] - https://gerrit.wikimedia.org/r/672368 (owner: Volans)
[11:22:26] !log installing tiff security updates
[11:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 50%: Repool db1129', diff saved to https://phabricator.wikimedia.org/P14844 and previous config saved to /var/cache/conftool/dbconfig/20210315-112819-root.json
[11:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:53] !log aborrero@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: REIMAGE
[11:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:30:10] (CR) Volans: "recheck" [cookbooks] - https://gerrit.wikimedia.org/r/672368 (owner: Volans)
[11:30:59] !log aborrero@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: REIMAGE
[11:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:14] !log restarting FPM on mw canaries to pick up new libtiff
[11:34:17] (PS4) Volans: netbox: add NetboxServer class [software/spicerack] - https://gerrit.wikimedia.org/r/670235 (https://phabricator.wikimedia.org/T205885)
[11:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:40] (CR) Volans: "replies inline" (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/670235 (https://phabricator.wikimedia.org/T205885) (owner: Volans)
[11:36:57] oh no daylight saving tricked me again. Im an hour late for my portals deploy window. I see there is nothing happening in the backport window though, so I'd like to do that deploy now if that's ok.
[11:39:09] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672371 (https://phabricator.wikimedia.org/T128546) [11:41:52] jan_drewniak: go ahead :) [11:43:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 75%: Repool db1129', diff saved to https://phabricator.wikimedia.org/P14846 and previous config saved to /var/cache/conftool/dbconfig/20210315-114323-root.json [11:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:48] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672371 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:44:31] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672371 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:45:40] (03PS1) 10Hnowlan: passwords::cassandra: add aqs password entry [labs/private] - 10https://gerrit.wikimedia.org/r/672372 (https://phabricator.wikimedia.org/T257572) [11:45:57] (03PS1) 10Elukey: hadoop: add a profile to deploy the capacity scheduler's settings [puppet] - 10https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) [11:46:08] (03CR) 10Elukey: [C: 03+1] passwords::cassandra: add aqs password entry [labs/private] - 10https://gerrit.wikimedia.org/r/672372 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [11:46:23] (03CR) 10Reedy: [C: 04-1] [T277160] Added execution time for recover-dump, added unit tests (032 comments) [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/672150 (owner: 10H.krishna123) [11:49:51] Daimona: o/ yt? Should I schedule https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/671150 for the upcoming backport window? 
[11:50:32] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:672371| Bumping portals to master (T128546)]] (duration: 00m 59s) [11:50:33] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] passwords::cassandra: add aqs password entry [labs/private] - 10https://gerrit.wikimedia.org/r/672372 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [11:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:39] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [11:50:56] phuedx: hey, yes, that would be a nice thing to do. [11:51:21] (03PS3) 10Hnowlan: aqs: move import of ::passwords::aqs [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) [11:51:31] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:672371| Bumping portals to master (T128546)]] (duration: 00m 58s) [11:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:07] jouncebot: next [11:52:07] In 0 hour(s) and 7 minute(s): Run maintenance script deleting ca. 
1.1 million rows on wikidatawiki (T270249) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210315T1200) [11:52:07] T270249: Run maintenance script to remove deleted items from term store on production - https://phabricator.wikimedia.org/T270249 [11:53:57] (03PS4) 10Hnowlan: aqs: move import of ::passwords::aqs [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) [11:54:00] (03CR) 10Phuedx: [C: 03+2] Revert "Rewite MoveLeadParagraphTransform based on mobile apps approach" [extensions/MobileFrontend] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/671150 (https://phabricator.wikimedia.org/T277302) (owner: 10Daimona Eaytoy) [11:54:26] (03CR) 10Phuedx: [C: 04-2] Revert "Rewite MoveLeadParagraphTransform based on mobile apps approach" [extensions/MobileFrontend] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/671150 (https://phabricator.wikimedia.org/T277302) (owner: 10Daimona Eaytoy) [11:54:37] (03PS1) 10Urbanecm: Initial configuration for taywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672376 (https://phabricator.wikimedia.org/T275803) [11:54:42] I hit +2 on the wrong one [11:54:48] (03CR) 10jerkins-bot: [V: 04-1] netbox: add NetboxServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/670235 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [11:54:51] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28559/console" [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [11:55:02] Right. 
The revert on the master branch is now C+2 [11:56:41] (03CR) 10Elukey: aqs: move import of ::passwords::aqs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [11:57:02] (03CR) 10Elukey: aqs: move import of ::passwords::aqs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [11:57:35] Daimona: Scheduled [11:58:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 100%: Repool db1129', diff saved to https://phabricator.wikimedia.org/P14847 and previous config saved to /var/cache/conftool/dbconfig/20210315-115826-root.json [11:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:21] (03CR) 10Phuedx: [C: 04-2] "Recheck." [extensions/MobileFrontend] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/671150 (https://phabricator.wikimedia.org/T277302) (owner: 10Daimona Eaytoy) [11:59:39] Thank you! [12:00:05] Lucas_WMDE: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Run maintenance script deleting ca. 1.1 million rows on wikidatawiki (T270249) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210315T1200). 
[12:00:05] T270249: Run maintenance script to remove deleted items from term store on production - https://phabricator.wikimedia.org/T270249 [12:00:13] o/ [12:00:53] (03PS2) 10Elukey: hadoop: add a profile to deploy the capacity scheduler's settings [puppet] - 10https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) [12:01:00] Daimona: I think that the patch for https://phabricator.wikimedia.org/T277367 will improve the situation but I think that the approach you detailed in https://phabricator.wikimedia.org/T277302#6910747 is the correct one [12:01:10] (03PS1) 10Urbanecm: Initial configuration for trvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672378 (https://phabricator.wikimedia.org/T276246) [12:02:15] (03PS2) 10Urbanecm: Initial configuration for taywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672376 (https://phabricator.wikimedia.org/T275803) [12:02:29] (03PS2) 10Urbanecm: Initial configuration for trvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672378 (https://phabricator.wikimedia.org/T276246) [12:02:32] (03PS1) 10Arturo Borrero Gonzalez: interface: add_ip6_mapped: ignore errors setting IPv6 token [puppet] - 10https://gerrit.wikimedia.org/r/672379 (https://phabricator.wikimedia.org/T277287) [12:04:50] That patch will indeed help, but I'm fairly confident that there are more unexpected situations on some wikis. In general, the only thing that we really need is time. [12:05:07] ^ [12:05:15] (03CR) 10Bartosz Dziewoński: [C: 03+1] Enable visualeditor on enwikibooks by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669966 (https://phabricator.wikimedia.org/T276851) (owner: 10Zabe) [12:06:26] Lucas_WMDE, Urbanecm: o/ I scheduled https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/671150 super-late in the Euro mid-day deployment window. Any chance it can be deployed? 
The associated task is marked as a train blocker FWIW [12:08:27] phuedx: would you be able to test it? [12:09:33] (03CR) 10Bartosz Dziewoński: [C: 03+1] Enable DiscussionsTools for enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669960 (https://phabricator.wikimedia.org/T276851) (owner: 10Zabe) [12:09:43] Lucas_WMDE: I have a handful of test pages from Daimona for itwiki and enwiki should be unaffected by the change. Daimona would you also be available to test? [12:10:04] !log lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/RemoveDeletedItemsFromTermStore.php wikidatawiki --itemIds "$(sed -n 5,54p T270249.ids | tr '\n' ',' | sed 's/,$//')" # T270249, 50 items [12:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:11] T270249: Run maintenance script to remove deleted items from term store on production - https://phabricator.wikimedia.org/T270249 [12:10:15] !log finished in 5.1s [12:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:21] !log lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/RemoveDeletedItemsFromTermStore.php wikidatawiki --itemIds "$(sed -n 55,554p T270249.ids | tr '\n' ',' | sed 's/,$//')" # T270249, 500 items [12:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:07] phuedx: i can sync it for you, unless Lucas_WMDE wants to? 
[12:12:19] !log finished in 43s [12:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:30] Urbanecm: feel free to go ahead, I’m doing that maintenance script in the meantime [12:12:41] Urbanecm: Thanks [12:12:42] (03CR) 10Urbanecm: [C: 03+2] Revert "Rewite MoveLeadParagraphTransform based on mobile apps approach" [extensions/MobileFrontend] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/671150 (https://phabricator.wikimedia.org/T277302) (owner: 10Daimona Eaytoy) [12:12:44] np [12:12:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:13:50] I am available to test modulo lunch [12:13:52] !log lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/RemoveDeletedItemsFromTermStore.php wikidatawiki --itemIds "$(sed -n 555,5554p T270249.ids | tr '\n' ',' | sed 's/,$//')" # T270249, 5000 items [12:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:43] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28560/console" [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [12:14:56] is it a known issue that the MySQL Grafana board (https://grafana.wikimedia.org/d/000000273/mysql?orgId=1) shows “no data” for replication lag? [12:16:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:16:53] nevermind, it works if I select a replica (according to dbtree) :) [12:17:40] (03Merged) 10jenkins-bot: Revert "Rewite MoveLeadParagraphTransform based on mobile apps approach" [extensions/MobileFrontend] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/671150 (https://phabricator.wikimedia.org/T277302) (owner: 10Daimona Eaytoy) [12:18:18] Lucas_WMDE: sounds like it's the replag of that specific replica [12:18:38] yup, https://grafana.wikimedia.org/d/000000303/mysql-replication-lag?orgId=1 looks like it’s the board I want [12:19:26] <_joe_> !log depooled mw1347 for testing [12:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:45] (03PS6) 10Kosta Harlan: linkrecommendation: Use Envoy for requests to MediaWiki API [deployment-charts] - 10https://gerrit.wikimedia.org/r/667868 (https://phabricator.wikimedia.org/T276217) [12:22:28] (03PS1) 10Urbanecm: Manual submodule update of GrowthExperiments repository [core] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/672381 (https://phabricator.wikimedia.org/T276966) [12:22:33] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: adjust sysctl parameters that are only meant for dataplane [puppet] - 10https://gerrit.wikimedia.org/r/672382 (https://phabricator.wikimedia.org/T277287) [12:22:41] !log RemoveDeletedItemsFromTermStore.php finished in 8min [12:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:28] (03Abandoned) 10Arturo Borrero Gonzalez: interface: add_ip6_mapped: ignore errors setting IPv6 token [puppet] - 10https://gerrit.wikimedia.org/r/672379 (https://phabricator.wikimedia.org/T277287) (owner: 10Arturo Borrero Gonzalez) [12:23:31] !log lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/RemoveDeletedItemsFromTermStore.php wikidatawiki --itemIds "$(sed -n 
5555,9593p T270249.ids | tr '\n' ',' | sed 's/,$//')" # T270249, remaining 4039 items [12:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:37] T270249: Run maintenance script to remove deleted items from term store on production - https://phabricator.wikimedia.org/T270249 [12:23:39] phuedx: pulled to mwdebug1002, can you test? [12:23:46] Urbanecm: On it [12:24:13] (03CR) 10jerkins-bot: [V: 04-1] cloudgw: adjust sysctl parameters that are only meant for dataplane [puppet] - 10https://gerrit.wikimedia.org/r/672382 (https://phabricator.wikimedia.org/T277287) (owner: 10Arturo Borrero Gonzalez) [12:24:24] Me too [12:25:13] (03CR) 10Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/28561/" [puppet] - 10https://gerrit.wikimedia.org/r/672382 (https://phabricator.wikimedia.org/T277287) (owner: 10Arturo Borrero Gonzalez) [12:25:15] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Manual submodule update of GrowthExperiments repository [core] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/672381 (https://phabricator.wikimedia.org/T276966) (owner: 10Urbanecm) [12:25:23] thanks Daimona [12:25:50] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: adjust sysctl parameters that are only meant for dataplane [puppet] - 10https://gerrit.wikimedia.org/r/672382 (https://phabricator.wikimedia.org/T277287) [12:26:34] Uhm I cannot see any difference. Was it really applied? [12:26:43] ^ [12:26:46] let me check [12:27:18] FTR, I've already ruled out caching stuff by null-editing and purging the page [12:27:32] apparently not, sorry [12:27:41] let me try again [12:28:25] Daimona: phuedx: can you try again? [12:28:53] Working now! [12:29:23] excellent [12:29:32] syncing my own manual submodule update and then I'll sync yours [12:29:36] !log RemoveDeletedItemsFromTermStore.php finished in 5m39s [12:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:01] Thanks Urbanecm, Daimona. 
After my own lunch, I'll write a comment on the associated task :) [12:30:02] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/667868 (https://phabricator.wikimedia.org/T276217) (owner: 10Kosta Harlan) [12:30:11] cool :) [12:30:19] !log urbanecm@deploy1002 Synchronized php-1.36.0-wmf.34/extensions/GrowthExperiments/: fa2abfab23c7030402336f8908d0988f37d8133b: Manual submodule update of GrowthExperiments repository (T276966) (duration: 00m 59s) [12:30:20] Amazing, thank you [12:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:26] T276966: GrowthExperiments help error : TypeError: title is undefined, Uncaught TypeError: Cannot read property 'replace' of undefined - https://phabricator.wikimedia.org/T276966 [12:31:02] (03CR) 10David Caro: netbox: add NetboxServer class (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/670235 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [12:31:07] !log maintenance scripts for T270249 completed successfully, no more terms for deleted items found on stat1007 [12:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:14] T270249: Run maintenance script to remove deleted items from term store on production - https://phabricator.wikimedia.org/T270249 [12:32:09] (03PS1) 10Jbond: C:query_service: Add paramter to control if we manage services [puppet] - 10https://gerrit.wikimedia.org/r/672383 (https://phabricator.wikimedia.org/T267927) [12:32:11] (03PS1) 10Jbond: P:query_service: add ability to disable managing services [puppet] - 10https://gerrit.wikimedia.org/r/672384 (https://phabricator.wikimedia.org/T267927) [12:32:34] !log urbanecm@deploy1002 Synchronized php-1.36.0-wmf.34/extensions/MobileFrontend/: 41a2aaac8c7b6ee5ec05af6d051d541614eaba30: Revert "Rewite MoveLeadParagraphTransform based on mobile apps approach" (T277302) (duration: 00m 58s) [12:32:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:41] T277302: Hatnote and ambox recognition is poor and essentially only works for enwiki - https://phabricator.wikimedia.org/T277302 [12:32:47] phuedx: Daimona: should be live, can you check? [12:33:14] (03PS5) 10Hnowlan: aqs: move import of ::passwords::aqs [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) [12:33:31] (03CR) 10jerkins-bot: [V: 04-1] P:query_service: add ability to disable managing services [puppet] - 10https://gerrit.wikimedia.org/r/672384 (https://phabricator.wikimedia.org/T267927) (owner: 10Jbond) [12:33:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: adjust sysctl parameters that are only meant for dataplane [puppet] - 10https://gerrit.wikimedia.org/r/672382 (https://phabricator.wikimedia.org/T277287) (owner: 10Arturo Borrero Gonzalez) [12:34:19] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28563/console" [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [12:34:34] (03PS2) 10Jbond: P:query_service: add ability to disable managing services [puppet] - 10https://gerrit.wikimedia.org/r/672384 (https://phabricator.wikimedia.org/T267927) [12:35:38] Urbanecm: confirming [12:35:50] (03CR) 10jerkins-bot: [V: 04-1] P:query_service: add ability to disable managing services [puppet] - 10https://gerrit.wikimedia.org/r/672384 (https://phabricator.wikimedia.org/T267927) (owner: 10Jbond) [12:36:33] Daimona: cool. 
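Editorial note: the !log entries between 12:10 and 12:23 above feed RemoveDeletedItemsFromTermStore.php in growing batches by slicing T270249.ids with `sed -n START,ENDp`, joining the lines with `tr '\n' ','`, and trimming the trailing comma with `sed 's/,$//'`. The sketch below mimics that slicing in Python to show how the line ranges map to the batch sizes quoted in the log. The function name `item_ids_argument`, the `BATCHES` constant, and the generated `Q…` IDs are hypothetical stand-ins for illustration; the real contents of T270249.ids are not shown in the log.

```python
from typing import List, Tuple

# 1-indexed, inclusive line ranges, copied from the `sed -n START,ENDp` calls in the log.
BATCHES: List[Tuple[int, int]] = [(5, 54), (55, 554), (555, 5554), (5555, 9593)]


def item_ids_argument(lines: List[str], start: int, end: int) -> str:
    # Equivalent of: sed -n START,ENDp FILE | tr '\n' ',' | sed 's/,$//'
    # Assumes one ID per line and no blank lines in the selected range.
    selected = lines[start - 1:end]  # convert the 1-indexed inclusive range to a slice
    return ",".join(selected)        # join never leaves a trailing comma


# Hypothetical stand-in for the ID file: 9593 made-up IDs, one per line.
ids = [f"Q{n}" for n in range(1, 9594)]

for start, end in BATCHES:
    arg = item_ids_argument(ids, start, end)
    print(f"lines {start}-{end}: {arg.count(',') + 1} items")
```

With the ranges from the log this prints 50, 500, 5000 and 4039 items, matching the batch sizes in the !log comments. Starting small and growing the batch is presumably a way to watch runtime and replication lag before committing to larger deletions, though the log does not state the reasoning.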
[12:37:20] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28565/console" [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [12:37:38] (03PS3) 10Muehlenhoff: Update email address for DannyS712 [puppet] - 10https://gerrit.wikimedia.org/r/672117 (owner: 10DannyS712) [12:37:57] (03PS2) 10Jbond: C:query_service: Add paramter to control if we manage services [puppet] - 10https://gerrit.wikimedia.org/r/672383 (https://phabricator.wikimedia.org/T267927) [12:39:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1105:3312', diff saved to https://phabricator.wikimedia.org/P14848 and previous config saved to /var/cache/conftool/dbconfig/20210315-123930-marostegui.json [12:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:00] Urbanecm: It LGTM [12:40:13] (03CR) 10Muehlenhoff: [C: 03+2] Update email address for DannyS712 [puppet] - 10https://gerrit.wikimedia.org/r/672117 (owner: 10DannyS712) [12:41:13] (03PS3) 10Jbond: P:query_service: add ability to disable managing services [puppet] - 10https://gerrit.wikimedia.org/r/672384 (https://phabricator.wikimedia.org/T267927) [12:42:30] (03CR) 10jerkins-bot: [V: 04-1] P:query_service: add ability to disable managing services [puppet] - 10https://gerrit.wikimedia.org/r/672384 (https://phabricator.wikimedia.org/T267927) (owner: 10Jbond) [12:45:57] (03PS4) 10Jbond: P:query_service: add ability to disable managing services [puppet] - 10https://gerrit.wikimedia.org/r/672384 (https://phabricator.wikimedia.org/T267927) [12:46:02] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28567/console" [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [12:47:08] (03PS3) 10Jbond: C:query_service: Add paramter to control if 
we manage services [puppet] - 10https://gerrit.wikimedia.org/r/672383 (https://phabricator.wikimedia.org/T267927) [12:47:24] (03PS5) 10Jbond: P:query_service: add ability to disable managing services [puppet] - 10https://gerrit.wikimedia.org/r/672384 (https://phabricator.wikimedia.org/T267927) [12:48:50] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28569/console" [puppet] - 10https://gerrit.wikimedia.org/r/672384 (https://phabricator.wikimedia.org/T267927) (owner: 10Jbond) [12:49:49] (03PS1) 10KartikMistry: WIP: Update cxserver metrics port [deployment-charts] - 10https://gerrit.wikimedia.org/r/672386 [12:50:06] Daimona: are you still around by any chance? [12:52:40] (03Abandoned) 10Urbanecm: Deploy Growth features to newcomers on da.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667570 (https://phabricator.wikimedia.org/T256126) (owner: 10Urbanecm) [12:56:47] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [12:57:18] ^ here but I have an interview in three minutes, can't look unless I'm really needed [12:57:28] Urbanecm: yep! [12:57:36] it looks like it's one of those "it says hashtag page but doesn't actually page"alerts though [12:57:50] Daimona: cool, can we do the sec patch please? 
[12:57:59] Sure, moving to pm [12:58:17] cool :) [13:00:18] rzl: yep [13:01:47] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [13:02:14] looks like monitoring glitch, 3.8G out on a 1G interface [13:02:52] (03PS1) 10KartikMistry: Enable ContentTranslation as a default tool in Amharic, Maltese and Uzbek Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672391 (https://phabricator.wikimedia.org/T276765) [13:04:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:08:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:09:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 25%: Repool db1105:3312', diff saved to https://phabricator.wikimedia.org/P14849 and previous config saved to /var/cache/conftool/dbconfig/20210315-130914-root.json [13:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:12] (03CR) 10Kormat: [C: 03+2] cumin: Update aliases for wikireplicas. 
[puppet] - 10https://gerrit.wikimedia.org/r/672348 (owner: 10Kormat) [13:17:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:17:08] 10SRE, 10Domains, 10Okapi, 10Traffic: Subdomain Request - OKAPI - https://phabricator.wikimedia.org/T276585 (10RBrounley_WMF) [13:17:40] (03CR) 10Ottomata: [C: 03+1] hadoop: raise the bw usage of the hdfs-balancer [puppet] - 10https://gerrit.wikimedia.org/r/672333 (owner: 10Elukey) [13:17:50] (03PS1) 10BBlack: Add enterprise CNAME [dns] - 10https://gerrit.wikimedia.org/r/672394 (https://phabricator.wikimedia.org/T276585) [13:18:40] (03CR) 10BBlack: [C: 03+2] Add enterprise CNAME [dns] - 10https://gerrit.wikimedia.org/r/672394 (https://phabricator.wikimedia.org/T276585) (owner: 10BBlack) [13:21:24] PROBLEM - SSH on mw2227.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:24:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 50%: Repool db1105:3312', diff saved to https://phabricator.wikimedia.org/P14851 and previous config saved to /var/cache/conftool/dbconfig/20210315-132418-root.json [13:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:37] !log Deploy security patch for T152394 [13:25:38] (03PS1) 10BBlack: Add validation record for enterprise TLS [dns] - 10https://gerrit.wikimedia.org/r/672395 (https://phabricator.wikimedia.org/T276585) [13:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:45] (03CR) 10Filippo Giunchedi: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/670231 (https://phabricator.wikimedia.org/T272977) (owner: 10Filippo Giunchedi) [13:26:56] (03PS7) 10Filippo Giunchedi: Run tests for alerts [alerts] - 
10https://gerrit.wikimedia.org/r/670231 (https://phabricator.wikimedia.org/T272977) [13:27:03] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Run tests for alerts [alerts] - 10https://gerrit.wikimedia.org/r/670231 (https://phabricator.wikimedia.org/T272977) (owner: 10Filippo Giunchedi) [13:27:05] (03CR) 10BBlack: [C: 03+2] Add validation record for enterprise TLS [dns] - 10https://gerrit.wikimedia.org/r/672395 (https://phabricator.wikimedia.org/T276585) (owner: 10BBlack) [13:27:11] (03CR) 10Filippo Giunchedi: [C: 03+2] Add Blubber and Pipeline [alerts] - 10https://gerrit.wikimedia.org/r/672365 (https://phabricator.wikimedia.org/T272977) (owner: 10Filippo Giunchedi) [13:27:16] (03PS3) 10Filippo Giunchedi: Add Blubber and Pipeline [alerts] - 10https://gerrit.wikimedia.org/r/672365 (https://phabricator.wikimedia.org/T272977) [13:27:26] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Add Blubber and Pipeline [alerts] - 10https://gerrit.wikimedia.org/r/672365 (https://phabricator.wikimedia.org/T272977) (owner: 10Filippo Giunchedi) [13:39:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 75%: Repool db1105:3312', diff saved to https://phabricator.wikimedia.org/P14852 and previous config saved to /var/cache/conftool/dbconfig/20210315-133921-root.json [13:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:34] 10SRE, 10observability, 10CAS-SSO: thanos u/i gives errors if left idle for a few hours - https://phabricator.wikimedia.org/T268233 (10fgiunchedi) I've been running into this issue on Grafana as well, specifically on SSO session refresh the XHR issued by the browser start failing. Can we experiment with exte... 
[13:41:17] 10SRE, 10SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10nshahquinn-wmf) [13:42:45] 10SRE, 10SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10nshahquinn-wmf) @Samwalton9: if your hiring manager is @DannyH, he just needs to comment on this task saying that he approves you to access confidential data. [13:42:47] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/672398 [13:43:35] 10SRE, 10Analytics: Upgrade to Kafka MirrorMaker 2 - https://phabricator.wikimedia.org/T277467 (10Ottomata) [13:43:42] 10SRE, 10Analytics: Upgrade to Kafka MirrorMaker 2 - https://phabricator.wikimedia.org/T277467 (10Ottomata) p:05Triage→03Low [13:44:13] (03CR) 10Ottomata: [C: 03+2] Add eventstreams clusters to monitor_services.pp [puppet] - 10https://gerrit.wikimedia.org/r/670960 (https://phabricator.wikimedia.org/T276305) (owner: 10Ottomata) [13:49:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:50:20] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 72 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:51:03] (03PS1) 10BBlack: Revert "Add validation record for enterprise TLS" [dns] - 10https://gerrit.wikimedia.org/r/672399 [13:51:53] (03PS2) 10BBlack: Revert "Add validation record for enterprise TLS" [dns] - 10https://gerrit.wikimedia.org/r/672399 (https://phabricator.wikimedia.org/T276585) [13:52:31] (03CR) 10BBlack: [C: 03+2] Revert "Add validation record for enterprise TLS" [dns] - 10https://gerrit.wikimedia.org/r/672399 (https://phabricator.wikimedia.org/T276585) (owner: 10BBlack) [13:54:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 100%: Repool db1105:3312', diff saved to https://phabricator.wikimedia.org/P14853 and previous config saved to /var/cache/conftool/dbconfig/20210315-135426-root.json [13:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:51] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 46 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:59:44] (03PS1) 10BBlack: Authorize cloudfront cert for enterprise [dns] - 10https://gerrit.wikimedia.org/r/672400 (https://phabricator.wikimedia.org/T276585) [14:00:21] (03CR) 10jerkins-bot: [V: 04-1] Authorize cloudfront cert for enterprise [dns] - 10https://gerrit.wikimedia.org/r/672400 (https://phabricator.wikimedia.org/T276585) (owner: 10BBlack) [14:02:12] (03PS2) 10BBlack: Authorize cloudfront cert for enterprise [dns] 
- 10https://gerrit.wikimedia.org/r/672400 (https://phabricator.wikimedia.org/T276585) [14:02:58] (03CR) 10BBlack: [C: 03+2] Authorize cloudfront cert for enterprise [dns] - 10https://gerrit.wikimedia.org/r/672400 (https://phabricator.wikimedia.org/T276585) (owner: 10BBlack) [14:04:03] (03PS2) 10KartikMistry: WIP: cxserver: Use new metrics interface of servicerunner [deployment-charts] - 10https://gerrit.wikimedia.org/r/672386 (https://phabricator.wikimedia.org/T271711) [14:04:38] !log re-pooling wdqs1005 [14:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1074', diff saved to https://phabricator.wikimedia.org/P14854 and previous config saved to /var/cache/conftool/dbconfig/20210315-140809-marostegui.json [14:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:40] (03PS1) 10Klausman: hiera/modules: Add role for ML k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/672402 (https://phabricator.wikimedia.org/T272918) [14:12:39] 10SRE, 10Analytics-Radar, 10Machine-Learning-Team: Kubeflow on stat machines - https://phabricator.wikimedia.org/T275551 (10fkaelin) I created a separate [[ https://docs.google.com/document/d/1Nffi3jUojC3BGNHkm2TyG7k5x30_7nzuPqgZ_tBeWNM/edit# | document ]] to discuss some of the bigger questions around orche... 
[14:20:30] (03PS6) 10Hnowlan: aqs: move import of ::passwords::aqs [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) [14:21:37] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28571/console" [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [14:23:23] (03CR) 10Gergő Tisza: [C: 03+2] linkrecommendation: Use Envoy for requests to MediaWiki API [deployment-charts] - 10https://gerrit.wikimedia.org/r/667868 (https://phabricator.wikimedia.org/T276217) (owner: 10Kosta Harlan) [14:24:13] 10SRE, 10Domains, 10Okapi, 10Traffic: Subdomain Request - OKAPI - https://phabricator.wikimedia.org/T276585 (10BBlack) 05Open→03Resolved [14:25:29] (03Merged) 10jenkins-bot: linkrecommendation: Use Envoy for requests to MediaWiki API [deployment-charts] - 10https://gerrit.wikimedia.org/r/667868 (https://phabricator.wikimedia.org/T276217) (owner: 10Kosta Harlan) [14:28:53] !log kharlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [14:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:26] 10SRE, 10Domains, 10Okapi, 10Traffic: Subdomain Request - OKAPI - https://phabricator.wikimedia.org/T276585 (10LWyatt) Thanks for your efficient work @bblack! [14:31:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 25%: Repool db1074', diff saved to https://phabricator.wikimedia.org/P14855 and previous config saved to /var/cache/conftool/dbconfig/20210315-143137-root.json [14:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:06] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . 
[14:32:06] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [14:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:11] 10SRE, 10netops, 10Patch-For-Review: Auhoritative ports list - https://phabricator.wikimedia.org/T277146 (10Kormat) Just in case it's relevant, we use a range of ports for mariadb. Most (but not all) of them are [[ https://github.com/wikimedia/puppet/blob/da54cc6f29debe9703448f60c6a3cf3d8f5c9345/hieradata/co... [14:36:49] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [14:36:49] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [14:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:48] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] ratelimit: Switch to nobody, update build and base image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/670836 (https://phabricator.wikimedia.org/T274852) (owner: 10JMeybohm) [14:40:26] (03PS1) 10Cmjohnson: updating mac address for db1162 to reflect motherboard change [puppet] - 10https://gerrit.wikimedia.org/r/672426 (https://phabricator.wikimedia.org/T275309) [14:41:32] (03CR) 10Cmjohnson: [C: 03+2] updating mac address for db1162 to reflect motherboard change [puppet] - 10https://gerrit.wikimedia.org/r/672426 (https://phabricator.wikimedia.org/T275309) (owner: 10Cmjohnson) [14:41:50] (03PS2) 10Klausman: hiera/modules: Add role for ML k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/672402 (https://phabricator.wikimedia.org/T272918) [14:43:11] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1162 crashed - https://phabricator.wikimedia.org/T275309 
(10Cmjohnson) @marostegui The mac address for the nic changed, just merged the change. The install should work now. Can you try again and resolve this task when it works please. [14:43:48] 10SRE, 10DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (10Cmjohnson) [14:43:51] 10SRE, 10DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Cmjohnson) [14:43:54] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1162 crashed - https://phabricator.wikimedia.org/T275309 (10Marostegui) Thanks @Cmjohnson I will try today or tomorrow morning and will close when done. [14:44:17] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Upgrade firmware on db1136 - https://phabricator.wikimedia.org/T277007 (10Cmjohnson) 05Open→03Resolved @marostegui updated the BIOS firmware [14:45:13] (03PS3) 10Klausman: hiera/modules: Add role for ML k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/672402 (https://phabricator.wikimedia.org/T272918) [14:45:54] (03PS7) 10Hnowlan: aqs: move import of ::passwords::aqs [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) [14:46:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 50%: Repool db1074', diff saved to https://phabricator.wikimedia.org/P14856 and previous config saved to /var/cache/conftool/dbconfig/20210315-144641-root.json [14:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:47:49] (03PS8) 10Hnowlan: aqs: move import of ::passwords::aqs [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) [14:48:37] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28577/console" [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [14:48:52] (03PS4) 10Klausman: hiera/modules: Add role for ML k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/672402 (https://phabricator.wikimedia.org/T272918) [14:49:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:49:44] (03PS5) 10Klausman: hiera/modules: Add role for ML k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/672402 (https://phabricator.wikimedia.org/T272918) [14:51:10] 10SRE, 10DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (10Marostegui) [14:51:15] 10SRE, 10DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Marostegui) [14:51:41] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Upgrade firmware on db1136 - https://phabricator.wikimedia.org/T277007 (10Marostegui) 05Resolved→03Open This host booted from PXE boot, and attempted to reimage itself. Luckily the partman recipe we have didn't delete its data. Did the BIOS upgrade change the defaul... [14:51:44] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Upgrade firmware on db1136 - https://phabricator.wikimedia.org/T277007 (10Marostegui) @Cmjohnson can you take a look to see if that was the case? 
[14:52:19] 10SRE, 10DBA, 10Wikimedia-Incident: 14 March 2021 Wikimedia API Outage - https://phabricator.wikimedia.org/T277417 (10RhinosF1) [14:52:44] 10SRE, 10DBA, 10Wikimedia-Incident: 14 March 2021 Wikimedia API Outage - https://phabricator.wikimedia.org/T277417 (10RhinosF1) [14:53:53] (03CR) 10Elukey: hiera/modules: Add role for ML k8s workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/672402 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [14:54:26] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1162 crashed - https://phabricator.wikimedia.org/T275309 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1162.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202103151454_... [14:54:28] legoktm: i found the cause [14:55:18] (03PS6) 10Klausman: hiera/modules: Add role for ML k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/672402 (https://phabricator.wikimedia.org/T272918) [14:55:20] (03CR) 10Klausman: hiera/modules: Add role for ML k8s workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/672402 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [14:58:46] (03CR) 10David Caro: "Got some questions, all the 'nit:' can be ignored." 
(039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/671286 (https://phabricator.wikimedia.org/T276615) (owner: 10Bstorm) [14:59:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:01:15] (03PS3) 10Andrew Bogott: prepare_cinder_volume.py: Add optional arg for mount options [puppet] - 10https://gerrit.wikimedia.org/r/671208 (https://phabricator.wikimedia.org/T272114) [15:01:17] (03PS4) 10Andrew Bogott: prepare_cinder_volume.py: Add optional mount mode [puppet] - 10https://gerrit.wikimedia.org/r/671209 (https://phabricator.wikimedia.org/T272114) [15:01:19] (03PS4) 10Andrew Bogott: cinderutils::ensure: support specifying mount options and file mode [puppet] - 10https://gerrit.wikimedia.org/r/671210 (https://phabricator.wikimedia.org/T272114) [15:01:21] (03PS1) 10Andrew Bogott: Refactor cindervolumes fact again [puppet] - 10https://gerrit.wikimedia.org/r/672437 (https://phabricator.wikimedia.org/T272114) [15:01:23] (03PS1) 10Andrew Bogott: Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) [15:01:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:01:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 75%: Repool db1074', diff saved to https://phabricator.wikimedia.org/P14857 and previous config saved to /var/cache/conftool/dbconfig/20210315-150144-root.json [15:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:01] (03PS9) 10Hnowlan: aqs: move import of ::passwords::aqs [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) [15:02:23] (03CR) 10jerkins-bot: [V: 04-1] Refactor cindervolumes fact again [puppet] - 10https://gerrit.wikimedia.org/r/672437 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [15:02:50] (03CR) 10jerkins-bot: [V: 04-1] Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [15:03:15] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28580/console" [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [15:04:41] (03PS10) 10Hnowlan: aqs: move import of ::passwords::aqs [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) [15:06:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1162.eqiad.wmnet with reason: REIMAGE [15:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:43] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28582/console" [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [15:08:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is 
CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:08:17] (03PS2) 10Andrew Bogott: Refactor cindervolumes fact again [puppet] - 10https://gerrit.wikimedia.org/r/672437 (https://phabricator.wikimedia.org/T272114) [15:08:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1162.eqiad.wmnet with reason: REIMAGE [15:08:19] (03PS4) 10Andrew Bogott: prepare_cinder_volume.py: Add optional arg for mount options [puppet] - 10https://gerrit.wikimedia.org/r/671208 (https://phabricator.wikimedia.org/T272114) [15:08:21] (03PS5) 10Andrew Bogott: prepare_cinder_volume.py: Add optional mount mode [puppet] - 10https://gerrit.wikimedia.org/r/671209 (https://phabricator.wikimedia.org/T272114) [15:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:23] (03PS5) 10Andrew Bogott: cinderutils::ensure: support specifying mount options and file mode [puppet] - 10https://gerrit.wikimedia.org/r/671210 (https://phabricator.wikimedia.org/T272114) [15:08:25] (03PS2) 10Andrew Bogott: Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) [15:09:16] (03PS11) 10Hnowlan: aqs: move import of ::passwords::aqs [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) [15:09:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:10:28] (03CR) 10jerkins-bot: [V: 04-1] Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [15:10:45] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28583/console" [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [15:14:03] !log aborrero@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: REIMAGE [15:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:37] (03CR) 10Jbond: "see comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [15:16:05] !log aborrero@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: REIMAGE [15:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:44] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1162 crashed - https://phabricator.wikimedia.org/T275309 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1162.eqiad.wmnet'] ` and were **ALL** successful. 
[15:16:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 100%: Repool db1074', diff saved to https://phabricator.wikimedia.org/P14858 and previous config saved to /var/cache/conftool/dbconfig/20210315-151648-root.json [15:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:08] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [15:17:54] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1162 crashed - https://phabricator.wikimedia.org/T275309 (10Marostegui) 05Open→03Resolved db1162 was reimaged nicely Thank you Chris I will clone and repool this host tomorrow. [15:18:55] (03CR) 10Herron: "LGTM for configuring the dead letter queue and ingesting into logstash, although I'm not following where this creates dlq-* index or sets " [puppet] - 10https://gerrit.wikimedia.org/r/670576 (https://phabricator.wikimedia.org/T277080) (owner: 10Cwhite) [15:21:37] (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/672368 (owner: 10Volans) [15:22:25] RECOVERY - SSH on mw2227.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:24:56] !log aborrero@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2001-dev.codfw.wmnet with reason: REIMAGE [15:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:01] !log aborrero@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2001-dev.codfw.wmnet with reason: REIMAGE [15:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:03] (03PS3) 10Andrew Bogott: Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) [15:31:53] (03CR) 10jerkins-bot: [V: 04-1] Add cinderutils::swap 
[puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [15:32:55] (03PS1) 10Hnowlan: aqs: move password to hieradata rather than password module [labs/private] - 10https://gerrit.wikimedia.org/r/672441 (https://phabricator.wikimedia.org/T257572) [15:33:34] !log draining ganeti2007 [15:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:42] (03PS4) 10Andrew Bogott: Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) [15:34:53] (03CR) 10Jbond: [C: 03+2] "lgtm will merge" [puppet] - 10https://gerrit.wikimedia.org/r/670568 (owner: 10Ahmon Dancy) [15:37:17] (03CR) 10Herron: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/670893 (https://phabricator.wikimedia.org/T277080) (owner: 10Cwhite) [15:41:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:41:59] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti2007.codfw.wmnet [15:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:18] (03PS1) 10Herron: grafana: make domainrw optional [puppet] - 10https://gerrit.wikimedia.org/r/672445 [15:43:51] PROBLEM - Host kubetcd2004 is DOWN: PING CRITICAL - Packet loss = 100% [15:45:29] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1003/28586/" [puppet] - 10https://gerrit.wikimedia.org/r/672445 (owner: 10Herron) [15:45:31] RECOVERY - Host kubetcd2004 is UP: PING OK - Packet loss = 0%, RTA = 33.27 ms [15:46:16] ^ kubetcd2004 is expected, those use "plain" disks in Ganeti [15:46:27] k [15:47:28] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2007.codfw.wmnet [15:47:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:49] !log draining ganeti2009 [15:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:18] (03CR) 10Filippo Giunchedi: [C: 03+1] grafana: make domainrw optional [puppet] - 10https://gerrit.wikimedia.org/r/672445 (owner: 10Herron) [15:52:41] (03PS6) 10Cwhite: logstash: add dead letter queue support [puppet] - 10https://gerrit.wikimedia.org/r/670576 (https://phabricator.wikimedia.org/T277080) [15:53:10] (03CR) 10Cwhite: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/670576 (https://phabricator.wikimedia.org/T277080) (owner: 10Cwhite) [15:53:39] (03PS2) 10Cwhite: logstash: add dead letter queue ingest support [puppet] - 10https://gerrit.wikimedia.org/r/670893 (https://phabricator.wikimedia.org/T277080) [15:55:31] 10SRE, 10netops, 10cloud-services-team (Kanban): cloudgw eqiad1: review & allocate subnets and VLANs - https://phabricator.wikimedia.org/T277020 (10aborrero) 05Open→03Resolved thanks! [15:58:14] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 157435344 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:58:27] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti2009.codfw.wmnet [15:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:34] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 607352 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:01:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:02:30] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/672449 [16:03:25] (03PS10) 10Legoktm: k8s: Add docker-registry credentials to pull restricted images [puppet] - 10https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521) [16:03:27] (03PS1) 10Legoktm: docker: tabs to spaces [puppet] - 10https://gerrit.wikimedia.org/r/672450 [16:03:59] (03CR) 10Legoktm: [V: 03+1] k8s: Add docker-registry credentials to pull restricted images (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [16:04:16] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [16:04:48] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2009.codfw.wmnet [16:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:33] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28587/console" [puppet] - 10https://gerrit.wikimedia.org/r/672450 (owner: 10Legoktm) [16:05:34] !log draining ganeti2010 [16:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:21] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt-wdqs1001.eqiad.wmnet [16:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:46] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti2010.codfw.wmnet [16:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:55] (03PS1) 10Dave Pifke: Add dummy tokens for XHGui-on-k8s [labs/private] - 
10https://gerrit.wikimedia.org/r/672451 [16:09:03] (03CR) 10Legoktm: [C: 03+2] k8s: Add docker-registry credentials to pull restricted images [puppet] - 10https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [16:11:50] PROBLEM - Host kubetcd2005 is DOWN: PING CRITICAL - Packet loss = 100% [16:14:12] (03Abandoned) 10SBassett: Add rate limiter to Special:ConfirmEmail - config change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) (owner: 10SBassett) [16:14:45] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2010.codfw.wmnet [16:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:28] (03PS12) 10Hnowlan: aqs: move import of ::passwords::aqs [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) [16:15:30] RECOVERY - Host kubetcd2005 is UP: PING OK - Packet loss = 0%, RTA = 33.43 ms [16:18:41] (03CR) 10Bstorm: paws: block using the Jupyterhub from Tor (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/671286 (https://phabricator.wikimedia.org/T276615) (owner: 10Bstorm) [16:18:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:19:36] (03PS13) 10Hnowlan: aqs: move import of ::passwords::aqs [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) [16:20:21] (03CR) 10Andrew Bogott: [C: 03+2] Refactor cindervolumes fact again [puppet] - 10https://gerrit.wikimedia.org/r/672437 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [16:21:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:21:39] (03CR) 10Andrew Bogott: [C: 03+2] prepare_cinder_volume.py: Add optional arg for mount options [puppet] - 10https://gerrit.wikimedia.org/r/671208 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [16:21:54] (03CR) 10Andrew Bogott: [C: 03+2] prepare_cinder_volume.py: Add optional mount mode [puppet] - 10https://gerrit.wikimedia.org/r/671209 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [16:22:10] (03PS6) 10Andrew Bogott: prepare_cinder_volume.py: Add optional mount mode [puppet] - 10https://gerrit.wikimedia.org/r/671209 (https://phabricator.wikimedia.org/T272114) [16:23:29] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt-wdqs1002.eqiad.wmnet [16:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:51] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt-wdqs1003.eqiad.wmnet [16:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:01] (03CR) 10Andrew Bogott: [C: 03+2] cinderutils::ensure: support specifying mount options and file mode [puppet] - 10https://gerrit.wikimedia.org/r/671210 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [16:24:06] (03PS6) 10Andrew Bogott: cinderutils::ensure: support specifying mount options and file mode [puppet] - 10https://gerrit.wikimedia.org/r/671210 (https://phabricator.wikimedia.org/T272114) [16:24:39] (03CR) 10David Caro: [C: 03+1] "+💯" [cookbooks] - 10https://gerrit.wikimedia.org/r/672368 (owner: 10Volans) [16:25:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={netbox_device_statistics,routinator} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets 
[16:27:36] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:27:56] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt-wdqs1001.eqiad.wmnet [16:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:56] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt-wdqs1003.eqiad.wmnet [16:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:11] (03PS30) 10Andrew Bogott: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - 10https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) [16:29:35] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt-wdqs1002.eqiad.wmnet [16:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:09] 10SRE, 10Services, 10Service-deployment-requests: New Service Request: xhgui - https://phabricator.wikimedia.org/T277483 (10dpifke) [16:31:48] (03PS7) 10Klausman: hiera/modules: Add role for ML k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/672402 (https://phabricator.wikimedia.org/T272918) [16:32:42] (03PS5) 10Andrew Bogott: Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) [16:33:12] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 86283624 and 21 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:33:13] (03CR) 10jerkins-bot: [V: 04-1] Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [16:34:25] (03PS2) 
10Dave Pifke: Add dummy tokens for XHGui-on-k8s [labs/private] - 10https://gerrit.wikimedia.org/r/672451 (https://phabricator.wikimedia.org/T277483) [16:34:33] (03CR) 10Jbond: aqs: move import of ::passwords::aqs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [16:35:07] (03PS1) 10Klausman: hiera: add dummy secrets for ML k8s workers [labs/private] - 10https://gerrit.wikimedia.org/r/672455 (https://phabricator.wikimedia.org/T272918) [16:35:22] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 679464 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:35:36] (03PS2) 10Klausman: hiera: add dummy secrets for ML k8s workers [labs/private] - 10https://gerrit.wikimedia.org/r/672455 (https://phabricator.wikimedia.org/T272918) [16:35:50] (03CR) 10Klausman: [C: 03+2] hiera: add dummy secrets for ML k8s workers [labs/private] - 10https://gerrit.wikimedia.org/r/672455 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [16:35:54] (03CR) 10Klausman: [V: 03+2 C: 03+2] hiera: add dummy secrets for ML k8s workers [labs/private] - 10https://gerrit.wikimedia.org/r/672455 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [16:36:14] (03PS14) 10Hnowlan: aqs: move import of ::passwords::aqs [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) [16:36:27] (03CR) 10Hnowlan: aqs: move import of ::passwords::aqs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [16:38:18] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2231.codfw.wmnet [16:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:32] (03PS4) 10Dave Pifke: xhgui: enable database access for admins [puppet] - 10https://gerrit.wikimedia.org/r/621100 
(https://phabricator.wikimedia.org/T260640) [16:38:39] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2232.codfw.wmnet [16:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:49] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2233.codfw.wmnet [16:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:53] (03PS1) 10Andrew Bogott: Support building a grid-exec node with cinder volumes or flavor-defined ephemeral storage [puppet] - 10https://gerrit.wikimedia.org/r/672456 (https://phabricator.wikimedia.org/T272114) [16:41:00] (03CR) 10jerkins-bot: [V: 04-1] Support building a grid-exec node with cinder volumes or flavor-defined ephemeral storage [puppet] - 10https://gerrit.wikimedia.org/r/672456 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [16:41:35] klausman: fyi i merged your change to puppet private [16:42:05] (03PS6) 10Andrew Bogott: Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) [16:42:07] (03PS2) 10Andrew Bogott: Support building a grid-exec node with cinder flavor-defined storage [puppet] - 10https://gerrit.wikimedia.org/r/672456 (https://phabricator.wikimedia.org/T272114) [16:42:09] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28593/console" [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [16:42:14] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2220.codfw.wmnet [16:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:31] (03CR) 10jerkins-bot: [V: 04-1] Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [16:43:06] (03PS1) 10Klausman: hiera: move ML k8s worker 
secrets into the correct location [labs/private] - 10https://gerrit.wikimedia.org/r/672457 (https://phabricator.wikimedia.org/T272918) [16:43:23] (03CR) 10Klausman: [C: 03+2] hiera: move ML k8s worker secrets into the correct location [labs/private] - 10https://gerrit.wikimedia.org/r/672457 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [16:43:25] (03CR) 10Klausman: [V: 03+2 C: 03+2] hiera: move ML k8s worker secrets into the correct location [labs/private] - 10https://gerrit.wikimedia.org/r/672457 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [16:44:13] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28594/console" [puppet] - 10https://gerrit.wikimedia.org/r/672402 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [16:44:43] (03PS7) 10Andrew Bogott: Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) [16:44:45] (03PS3) 10Andrew Bogott: Support building a grid-exec node with cinder flavor-defined storage [puppet] - 10https://gerrit.wikimedia.org/r/672456 (https://phabricator.wikimedia.org/T272114) [16:44:48] klausman: and again :) [16:45:28] (03PS8) 10Klausman: hiera/modules: Add role for ML k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/672402 (https://phabricator.wikimedia.org/T272918) [16:45:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28595/console" [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [16:46:11] (03CR) 10Jbond: [V: 03+1 C: 03+1] "LGTM i also added the fake secret in the the labs private repo and PCC attached" [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [16:46:26] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28596/console" [puppet] - 10https://gerrit.wikimedia.org/r/672402 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [16:46:28] (03PS4) 10Andrew Bogott: Support building a grid-exec node with cinder or flavor-defined storage [puppet] - 10https://gerrit.wikimedia.org/r/672456 (https://phabricator.wikimedia.org/T272114) [16:47:07] jbond42: thanks! I keep forgetting p-m for private repo stuff :-/ [16:47:17] no probs :) [16:48:33] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2224.codfw.wmnet [16:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:28] (03PS5) 10Andrew Bogott: Support building a grid-exec node with cinder or flavor-defined storage [puppet] - 10https://gerrit.wikimedia.org/r/672456 (https://phabricator.wikimedia.org/T272114) [16:51:38] (03PS1) 10Dave Pifke: xhgui: add dummy admin password [labs/private] - 10https://gerrit.wikimedia.org/r/672461 [16:53:23] (03PS9) 10Klausman: hiera/modules: Add role for ML k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/672402 (https://phabricator.wikimedia.org/T272918) [16:53:49] 10SRE, 10DBA, 10Wikimedia-Incident: 14 March 2021 Wikimedia API Outage - https://phabricator.wikimedia.org/T277417 (10CDanis) [16:53:53] (03PS2) 10Dave Pifke: xhgui: add dummy admin password [labs/private] - 10https://gerrit.wikimedia.org/r/672461 (https://phabricator.wikimedia.org/T260640) [16:55:25] (03CR) 10Dave Pifke: "I've rebased it." 
[puppet] - 10https://gerrit.wikimedia.org/r/621100 (https://phabricator.wikimedia.org/T260640) (owner: 10Dave Pifke) [16:55:36] (03CR) 10Klausman: "Janis, Alex: only one of you needs to review, feel free to remove the other from the Reviewers list" [puppet] - 10https://gerrit.wikimedia.org/r/672402 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [16:55:46] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mw2224.codfw.wmnet [16:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:38] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2220.codfw.wmnet [16:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:49] (03CR) 10Muehlenhoff: [C: 04-1] "-1, needs to include additional Hadoop roles in the Hiera settings" [puppet] - 10https://gerrit.wikimedia.org/r/671130 (https://phabricator.wikimedia.org/T257412) (owner: 10Muehlenhoff) [16:58:21] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2220.codfw.wmnet [16:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:32] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2224.codfw.wmnet [16:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:27] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2221.codfw.wmnet [16:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:05] ryankemper: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210315T1700). 
[17:00:48] PROBLEM - cassandra-a service on aqs1010 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:01:10] PROBLEM - cassandra-a CQL 10.64.0.88:9042 on aqs1010 is CRITICAL: connect to address 10.64.0.88 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [17:01:26] ^ host not in use, ignore [17:02:14] PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:07] 10SRE, 10Analytics-Radar, 10Machine-Learning-Team: Kubeflow on stat machines - https://phabricator.wikimedia.org/T275551 (10elukey) >>! In T275551#6913820, @fkaelin wrote: > I created a separate [[ https://docs.google.com/document/d/1Nffi3jUojC3BGNHkm2TyG7k5x30_7nzuPqgZ_tBeWNM/edit# | document ]] to discuss... [17:03:22] ACKNOWLEDGEMENT - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Hnowlan Host not in use yet https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:22] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.0.88:9042 on aqs1010 is CRITICAL: connect to address 10.64.0.88 and port 9042: Connection refused Hnowlan Host not in use yet https://phabricator.wikimedia.org/T93886 [17:03:22] ACKNOWLEDGEMENT - cassandra-a service on aqs1010 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed Hnowlan Host not in use yet https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:03:26] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on aqs1010.eqiad.wmnet with reason: New buster host [17:03:27] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on aqs1010.eqiad.wmnet with reason: New buster host [17:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:09:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:16:15] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10wiki_willy) Hi @ayounsi - let me check with Chris and John to confirm, but I'm thinking we should target the racks with the most amount of rack space. (so we can phase out any l... 
[17:21:16] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28600/console" [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [17:25:05] (03Abandoned) 10Jeena Huneidi: Rsync private mediawiki files to releases server [puppet] - 10https://gerrit.wikimedia.org/r/667747 (https://phabricator.wikimedia.org/T276145) (owner: 10Jeena Huneidi) [17:25:18] (03CR) 10Elukey: hiera/modules: Add role for ML k8s workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/672402 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [17:25:46] 10SRE, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10Legoktm) [17:25:47] PROBLEM - Thanos query-frontend has high latency for queries on alert1001 is CRITICAL: job=thanos-query-frontend https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend [17:28:38] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2221.codfw.wmnet [17:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:08] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2221.codfw.wmnet [17:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:21] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2222.codfw.wmnet [17:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:32] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2223.codfw.wmnet [17:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:06] (03CR) 10Elukey: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [17:30:14] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2222.codfw.wmnet [17:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:23] !log disabling puppet on aqs100[4-9].eqiad.wmnet to test change to password logic in puppet [17:30:27] RECOVERY - Thanos query-frontend has high latency for queries on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend [17:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:55] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] aqs: move import of ::passwords::aqs [puppet] - 10https://gerrit.wikimedia.org/r/672366 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [17:42:57] (03PS8) 10Andrew Bogott: Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) [17:42:59] (03PS6) 10Andrew Bogott: Support building a grid-exec node with cinder or flavor-defined storage [puppet] - 10https://gerrit.wikimedia.org/r/672456 (https://phabricator.wikimedia.org/T272114) [17:43:01] (03PS1) 10Andrew Bogott: cinderutils: exclude sda and vda from any mounting [puppet] - 10https://gerrit.wikimedia.org/r/672472 [17:43:33] PROBLEM - mediawiki-installation DSH group on mw2223 is CRITICAL: Host mw2223 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:43:39] PROBLEM - Thanos store has high latency for series gate requests on alert1001 is CRITICAL: job=thanos-store https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store [17:44:23] (03CR) 10jerkins-bot: [V: 04-1] Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 
(https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [17:44:26] (03CR) 10jerkins-bot: [V: 04-1] cinderutils: exclude sda and vda from any mounting [puppet] - 10https://gerrit.wikimedia.org/r/672472 (owner: 10Andrew Bogott) [17:44:29] PROBLEM - Thanos query-frontend has high latency for queries on alert1001 is CRITICAL: job=thanos-query-frontend https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend [17:45:31] RECOVERY - Thanos store has high latency for series gate requests on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store [17:46:19] RECOVERY - Thanos query-frontend has high latency for queries on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend [17:53:42] (03PS8) 10Hnowlan: aqs: add aqs1011 to cassandra 3.11 test cluster, add aqs_next role [puppet] - 10https://gerrit.wikimedia.org/r/671132 (https://phabricator.wikimedia.org/T257572) [17:54:17] (03PS1) 10Ahmon Dancy: logspam.pl: Ignore messages from mwmaint* hosts [puppet] - 10https://gerrit.wikimedia.org/r/672483 [17:55:11] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2222.codfw.wmnet [17:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:00] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2223.codfw.wmnet [17:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:19] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28601/console" [puppet] - 10https://gerrit.wikimedia.org/r/671132 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [17:58:14] (03PS2) 10Dzahn: 
site/conftool: decom mw2220 through mw2223 [puppet] - 10https://gerrit.wikimedia.org/r/671259 (https://phabricator.wikimedia.org/T277119) [17:58:52] (03PS9) 10Andrew Bogott: Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) [17:58:54] (03PS2) 10Andrew Bogott: cinderutils: exclude sda and vda from any mounting [puppet] - 10https://gerrit.wikimedia.org/r/672472 [17:58:56] (03PS7) 10Andrew Bogott: Support building a grid-exec node with cinder or flavor-defined storage [puppet] - 10https://gerrit.wikimedia.org/r/672456 (https://phabricator.wikimedia.org/T272114) [17:58:58] (03PS1) 10Andrew Bogott: prepare_cinder_volume: allow --force to override some mountpoint validation [puppet] - 10https://gerrit.wikimedia.org/r/672485 [17:59:00] (03CR) 10Dzahn: [C: 03+2] site/conftool: decom mw2220 through mw2223 [puppet] - 10https://gerrit.wikimedia.org/r/671259 (https://phabricator.wikimedia.org/T277119) (owner: 10Dzahn) [18:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210315T1800). [18:00:04] Jdlrobson, Huji, Zabe, and marxarelli: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:17] hey! i can deploy today [18:00:29] assuming people are not confused by DST :) [18:00:55] Jdlrobson, marxarelli: hi, around? [18:01:00] o/ [18:01:23] (03CR) 10Andrew Bogott: [C: 03+2] prepare_cinder_volume: allow --force to override some mountpoint validation [puppet] - 10https://gerrit.wikimedia.org/r/672485 (owner: 10Andrew Bogott) [18:01:25] marxarelli: wanna do your patches yourself, or should i? 
[18:01:26] present [18:01:41] (03CR) 10Andrew Bogott: [C: 03+2] cinderutils: exclude sda and vda from any mounting [puppet] - 10https://gerrit.wikimedia.org/r/672472 (owner: 10Andrew Bogott) [18:01:44] i need to be in a meeting at 11.30 [18:01:45] Urbanecm: i can do them at the end if that works for you [18:01:50] they're both noop [18:01:54] marxarelli: okay [18:02:09] Jdlrobson: can you describe what does https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/671202 do? [18:02:13] I don't see a patch it cherrypicks [18:02:55] (03PS3) 10Andrew Bogott: cinderutils: exclude sda and vda from any mounting [puppet] - 10https://gerrit.wikimedia.org/r/672472 [18:03:09] Urbanecm: the wmf34 branch is in a weird state [18:03:18] as changes backported to wmf34 were never merged to master [18:03:43] this patch basically resets the JS to a known stable state (the current code has bugs). [18:03:59] so you copied whatever is in master to this patch, right? [18:04:08] this is essentially a merge of origin/master with manual conflict resolution [18:04:08] (03PS2) 10Andrew Bogott: prepare_cinder_volume: allow --force to override some mountpoint validation [puppet] - 10https://gerrit.wikimedia.org/r/672485 [18:04:12] Urbanecm: exactly [18:04:17] thanks [18:04:27] (03CR) 10Urbanecm: [C: 03+2] Use master version of clientError.js [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/671202 (owner: 10Jdlrobson) [18:04:32] looks good, merging [18:05:30] ok im all setup my side to test [18:05:43] Jdlrobson: ack, now it's all with CI :) [18:06:27] @Urbanecm hi, I'm ready too [18:06:35] hi huji [18:06:59] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2234.codfw.wmnet [18:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:06] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2235.codfw.wmnet [18:07:06] (03CR) 10Urbanecm: [C: 03+2] Add deleterevision right to 
botadmin group on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/671402 (https://phabricator.wikimedia.org/T277358) (owner: 10Huji) [18:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:45] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2223.codfw.wmnet [18:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:23] (03Merged) 10jenkins-bot: Add deleterevision right to botadmin group on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/671402 (https://phabricator.wikimedia.org/T277358) (owner: 10Huji) [18:10:51] (03Merged) 10jenkins-bot: Use master version of clientError.js [extensions/WikimediaEvents] (wmf/1.36.0-wmf.34) - 10https://gerrit.wikimedia.org/r/671202 (owner: 10Jdlrobson) [18:11:16] Urbanem: Sorry for being late [18:11:33] Urbanecm: ^ [18:11:53] thanks Zabe, I did see the merges, just needed some time to fetch it there [18:12:04] Jdlrobson: huji: your patches are on mwdebug1001 (both of them) [18:12:19] thx [18:12:28] I just refreshed the page on mwdebug1001 and it did work as expected [18:12:37] great, syncing [18:13:10] Urbanecm: testing mine.. 
[18:13:57] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: a8234a9435a3acf669d44705fbcb19bf4dd5658e: Add deleterevision right to botadmin group on fawiki (T277358) (duration: 00m 59s) [18:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:05] T277358: Add deleterevision right to botadmin group on fawiki - https://phabricator.wikimedia.org/T277358 [18:14:06] huji: should be live [18:14:53] Urbanecm: confirming that it is live (per Special:UserGroupRights) [18:14:58] thank you very much [18:15:00] great [18:15:03] any time :) [18:15:15] (03PS2) 10Urbanecm: Configure default search namespaces for thwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672041 (https://phabricator.wikimedia.org/T275280) (owner: 10Zabe) [18:15:19] (03CR) 10Urbanecm: [C: 03+2] Configure default search namespaces for thwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672041 (https://phabricator.wikimedia.org/T275280) (owner: 10Zabe) [18:16:14] Urbanecm: LGTM. please sync! [18:16:17] syncing [18:17:50] !log urbanecm@deploy1002 Synchronized php-1.36.0-wmf.34/extensions/WikimediaEvents/modules/ext.wikimediaEvents/clientError.js: a7eb550498fd038fbc5d96d8a82a64c2ee5eb57a: Use master version of clientError.js (duration: 00m 58s) [18:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:57] Jdlrobson: should be live. anything else? [18:18:00] !log Updated the Wikidata property suggester with data from the 2021-03-08 JSON dump (with pre-applied T132839 workarounds) [18:18:03] 10SRE, 10Performance-Team, 10serviceops, 10Patch-For-Review, 10User-jijiki: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 (10Krinkle) Hm.. so the above poses a bit of a paradox. Implementing the `onhostRoutingPrefix` option is dependent on tom... 
[18:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:07] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [18:19:45] (03CR) 10Urbanecm: [C: 03+2] Enable visualeditor on enwikibooks by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669966 (https://phabricator.wikimedia.org/T276851) (owner: 10Zabe) [18:19:51] (03CR) 10Urbanecm: Enable DiscussionsTools for enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669960 (https://phabricator.wikimedia.org/T276851) (owner: 10Zabe) [18:19:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:20:20] (03Merged) 10jenkins-bot: Configure default search namespaces for thwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672041 (https://phabricator.wikimedia.org/T275280) (owner: 10Zabe) [18:20:44] Zabe: please go to mwdebug1001 and test [18:21:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:21:33] thanks Urbanecm, am watching the logs [18:21:36] will let you know if any problems [18:21:39] cool [18:21:41] thanks for your help today! [18:22:42] np [18:22:51] (03Merged) 10jenkins-bot: Enable visualeditor on enwikibooks by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669966 (https://phabricator.wikimedia.org/T276851) (owner: 10Zabe) [18:22:55] Urbanecm: all done? [18:23:01] not yet [18:23:04] k! [18:23:15] Zabe: how's your testing going? [18:26:17] Zabe: ping? [18:27:26] Zabe: hey? 
[18:27:59] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10Cmjohnson) [18:28:53] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: analytics1066's BBU might need to be replaced - https://phabricator.wikimedia.org/T277005 (10Cmjohnson) My morning got away from me and this is rescheduled for tomorrow 1400UTC (1000EST) [18:29:58] Zabe: hello? [18:30:47] it’s working, sorry for needing that long [18:30:53] (03PS3) 10Bstorm: wikireplicas: create actual paws database accounts [puppet] - 10https://gerrit.wikimedia.org/r/670626 (https://phabricator.wikimedia.org/T276284) [18:31:18] Zabe: which of the two patches did you test? [18:32:03] and in the future, please do speak when you need more time or run into any issues. Giving a few more minutes is fine, but I need to know you are actually looking into it and that you did not just go away :) [18:32:05] The thwikisource one [18:32:09] thanks [18:32:20] can you test the enwikibooks too Zabe ? 
[18:33:25] Yeah, it works too [18:33:42] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b70a75c7530f4bc71fbb88b859329edb6dadf2a0: Configure default search namespaces for thwikisource (T275280) (duration: 00m 59s) [18:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:49] T275280: Add some namespaces to search results for thwikisource - https://phabricator.wikimedia.org/T275280 [18:34:08] cool, thanks [18:34:24] (03PS10) 10Andrew Bogott: Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) [18:34:26] (03PS8) 10Andrew Bogott: Support building a grid-exec node with cinder or flavor-defined storage [puppet] - 10https://gerrit.wikimedia.org/r/672456 (https://phabricator.wikimedia.org/T272114) [18:34:28] (03PS1) 10Andrew Bogott: blockdevices fact: use push() instead of append() [puppet] - 10https://gerrit.wikimedia.org/r/672504 [18:34:53] (03PS4) 10Urbanecm: Enable DiscussionsTools for enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669960 (https://phabricator.wikimedia.org/T276851) (owner: 10Zabe) [18:35:28] !log urbanecm@deploy1002 Synchronized dblists/visualeditor-nondefault.dblist: b6a8df04701f9a83643c93342183b448705477bd: Enable visualeditor on enwikibooks by default (T276851; 1/2) (duration: 00m 58s) [18:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:36] T276851: DiscussionsTools and VisualEditor for Wikibooks - https://phabricator.wikimedia.org/T276851 [18:36:21] (03CR) 10Andrew Bogott: [C: 03+2] blockdevices fact: use push() instead of append() [puppet] - 10https://gerrit.wikimedia.org/r/672504 (owner: 10Andrew Bogott) [18:37:12] !log removing 1 file from eowiki, for legal compliance [18:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:15] !log urbanecm@deploy1002 Synchronized wmf-config/config/enwikibooks.yaml: 
b6a8df04701f9a83643c93342183b448705477bd: Enable visualeditor on enwikibooks by default (T276851; 2/2) (duration: 01m 00s) [18:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:31] Zabe: should be live [18:39:00] (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionsTools for enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669960 (https://phabricator.wikimedia.org/T276851) (owner: 10Zabe) [18:39:57] Thx [18:41:15] !log puppet disabled on kubestage1001 for debugging docker-registry credentials [18:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:31] (03Merged) 10jenkins-bot: Enable DiscussionsTools for enwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669960 (https://phabricator.wikimedia.org/T276851) (owner: 10Zabe) [18:42:08] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech-focus, and 2 others: [Epic] Scaling strategy for Wikidata Query Service - https://phabricator.wikimedia.org/T221938 (10MPhamWMF) [18:42:36] Zabe: can you check at mwdebug1001, please? [18:44:29] It works [18:44:40] thanks, syncing [18:46:01] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: e5a7284956e707ace94120e8224b262d5ef56c99: Enable DiscussionsTools for enwikibooks (T276851) (duration: 00m 59s) [18:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:09] T276851: DiscussionsTools and VisualEditor for Wikibooks - https://phabricator.wikimedia.org/T276851 [18:46:12] Zabe: should be live. anything else? [18:46:34] no, thx [18:46:51] cool [18:46:54] marxarelli: floor is yours [18:47:01] (03CR) 10Brennen Bearnes: [C: 03+1] "We discussed this change at length in the past, seems like the correct thing for our purposes. Tested, works, code LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/672483 (owner: 10Ahmon Dancy) [18:47:49] Urbanecm: thank you! 
[18:47:53] np [18:48:13] (03CR) 10Dduvall: [C: 03+2] pipeline: Initial multiversion pipeline configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666492 (https://phabricator.wikimedia.org/T274182) (owner: 10Dduvall) [18:48:18] (03CR) 10Dduvall: [C: 03+2] pipeline: add building the webserver image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669807 (owner: 10Giuseppe Lavagetto) [18:48:51] merging both patches and then i'll sync multiversion/ first, .pipeline/ second [18:48:55] longma: ^ [18:49:19] (03Merged) 10jenkins-bot: pipeline: Initial multiversion pipeline configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666492 (https://phabricator.wikimedia.org/T274182) (owner: 10Dduvall) [18:49:22] (03Merged) 10jenkins-bot: pipeline: add building the webserver image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/669807 (owner: 10Giuseppe Lavagetto) [18:49:31] 👍 [18:55:03] !log dduvall@deploy1002 Synchronized multiversion/: config: [[gerrit:666492|Initial multiversion pipeline configuration]] [[gerrit:669807|pipeline: add building the webserver image]] (T274182) (duration: 00m 59s) [18:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:13] T274182: Multi-version MediaWiki image is built and published - https://phabricator.wikimedia.org/T274182 [18:56:16] !log dduvall@deploy1002 Synchronized .pipeline: config: [[gerrit:666492|Initial multiversion pipeline configuration]] [[gerrit:669807|pipeline: add building the webserver image]] (T274182) (duration: 00m 59s) [18:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:24] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools' beta features on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672507 (https://phabricator.wikimedia.org/T273146) [18:56:26] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools' beta features on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672508 
(https://phabricator.wikimedia.org/T276493) [18:56:28] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools' beta features on almost all remaining projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672509 (https://phabricator.wikimedia.org/T276498) [18:56:30] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools' tools for everyone on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672510 (https://phabricator.wikimedia.org/T277103) [18:57:50] Urbanecm: all done. thanks again [18:58:01] np :) [18:59:48] (03CR) 10Eevans: [C: 03+1] "Good to go now." [deployment-charts] - 10https://gerrit.wikimedia.org/r/671252 (https://phabricator.wikimedia.org/T274262) (owner: 10Eevans) [18:59:50] (03CR) 10Daimona Eaytoy: [C: 03+1] "The time is ripe." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657696 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester) [19:01:02] (03PS2) 10Eevans: Update sessionstore prod to Kask 2021-03-12-195445-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/671252 (https://phabricator.wikimedia.org/T274262) [19:01:10] (03PS11) 10Andrew Bogott: Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) [19:01:12] (03PS9) 10Andrew Bogott: Support building a grid-exec node with cinder or flavor-defined storage [puppet] - 10https://gerrit.wikimedia.org/r/672456 (https://phabricator.wikimedia.org/T272114) [19:03:42] (03PS12) 10Andrew Bogott: Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) [19:03:44] (03PS10) 10Andrew Bogott: Support building a grid-exec node with cinder or flavor-defined storage [puppet] - 10https://gerrit.wikimedia.org/r/672456 (https://phabricator.wikimedia.org/T272114) [19:05:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:06:42] (03PS13) 10Andrew Bogott: Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) [19:06:44] (03PS11) 10Andrew Bogott: Support building a grid-exec node with cinder or flavor-defined storage [puppet] - 10https://gerrit.wikimedia.org/r/672456 (https://phabricator.wikimedia.org/T272114) [19:07:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:08:06] (03PS14) 10Andrew Bogott: Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) [19:08:08] (03PS12) 10Andrew Bogott: Support building a grid-exec node with cinder or flavor-defined storage [puppet] - 10https://gerrit.wikimedia.org/r/672456 (https://phabricator.wikimedia.org/T272114) [19:10:09] (03PS15) 10Andrew Bogott: Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) [19:10:11] (03PS13) 10Andrew Bogott: Support building a grid-exec node with cinder or flavor-defined storage [puppet] - 10https://gerrit.wikimedia.org/r/672456 (https://phabricator.wikimedia.org/T272114) [19:12:23] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 165028280 and 13 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:14:43] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 497616 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:18:30] (03PS1) 10Eevans: Update echostore staging to Kask 2021-03-12-195445-production 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/672517 (https://phabricator.wikimedia.org/T274262) [19:20:14] (03CR) 10Eevans: [C: 03+2] Update echostore staging to Kask 2021-03-12-195445-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/672517 (https://phabricator.wikimedia.org/T274262) (owner: 10Eevans) [19:21:51] (03Merged) 10jenkins-bot: Update echostore staging to Kask 2021-03-12-195445-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/672517 (https://phabricator.wikimedia.org/T274262) (owner: 10Eevans) [19:27:42] !log eevans@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'echostore' for release 'staging' . [19:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:33] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [19:32:34] (03PS1) 10Eevans: Update echostore to Kask 2021-03-12-195445-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/672522 (https://phabricator.wikimedia.org/T274262) [19:32:37] (03CR) 10Cwhite: [C: 03+2] logstash: add normalize level filter to scap migration [puppet] - 10https://gerrit.wikimedia.org/r/671290 (owner: 10Cwhite) [19:33:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:33:41] (03CR) 10Eevans: [C: 03+2] Update echostore to Kask 2021-03-12-195445-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/672522 (https://phabricator.wikimedia.org/T274262) (owner: 10Eevans) [19:37:20] !log eevans@deploy1002 helmfile 
[codfw] Ran 'sync' command on namespace 'echostore' for release 'production' . [19:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:11] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [19:42:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:43:22] !log eevans@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'echostore' for release 'production' . [19:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:42] (03PS14) 10Andrew Bogott: Support building a grid-exec node with cinder or flavor-defined storage [puppet] - 10https://gerrit.wikimedia.org/r/672456 (https://phabricator.wikimedia.org/T272114) [19:47:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:50:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:52:50] (03PS16) 10Andrew Bogott: Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) [19:52:51] (03PS15) 10Andrew Bogott: Support building a grid-exec node with cinder or flavor-defined storage [puppet] - 10https://gerrit.wikimedia.org/r/672456 (https://phabricator.wikimedia.org/T272114) [19:52:54] (03PS1) 10Andrew Bogott: cinderutils::manifests::ensure: refine behavior for existing lvm volumes [puppet] - 10https://gerrit.wikimedia.org/r/672526 [19:53:20] 10SRE, 10Platform Engineering, 10serviceops, 10Performance-Team (Radar): Get rid of nutcracker for connecting to redis - https://phabricator.wikimedia.org/T277183 (10Gilles) [19:53:38] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2224.codfw.wmnet [19:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:46] (03CR) 10jerkins-bot: [V: 04-1] cinderutils::manifests::ensure: refine behavior for existing lvm volumes [puppet] - 10https://gerrit.wikimedia.org/r/672526 (owner: 10Andrew Bogott) [19:53:48] !log dzahn@cumin1001 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97) for hosts mw2224.codfw.wmnet [19:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2225.codfw.wmnet [19:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:18] (03PS2) 10Andrew Bogott: cinderutils::manifests::ensure: refine behavior for existing lvm volumes [puppet] - 10https://gerrit.wikimedia.org/r/672526 [19:54:20] (03PS17) 10Andrew Bogott: Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) [19:54:22] (03PS16) 10Andrew Bogott: Support building a grid-exec node 
with cinder or flavor-defined storage [puppet] - 10https://gerrit.wikimedia.org/r/672456 (https://phabricator.wikimedia.org/T272114) [19:55:55] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@82e0654]: prepare_mw_rev_score: Correct scores_export to bulk_ingest [19:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:14] (03CR) 10Cwhite: [C: 03+2] logstash: add dead letter queue ingest support [puppet] - 10https://gerrit.wikimedia.org/r/670893 (https://phabricator.wikimedia.org/T277080) (owner: 10Cwhite) [19:57:45] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@82e0654]: prepare_mw_rev_score: Correct scores_export to bulk_ingest (duration: 01m 49s) [19:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:42] (03CR) 10Herron: [C: 03+1] "ah! makes sense and LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/670576 (https://phabricator.wikimedia.org/T277080) (owner: 10Cwhite) [19:59:58] (03CR) 10BryanDavis: wikireplicas: create actual paws database accounts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/670626 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [20:00:04] chrisalbon and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210315T2000). 
[20:05:00] (03PS3) 10Andrew Bogott: cinderutils::manifests::ensure: refine behavior for existing lvm volumes [puppet] - 10https://gerrit.wikimedia.org/r/672526 [20:05:02] (03PS18) 10Andrew Bogott: Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) [20:05:04] (03PS17) 10Andrew Bogott: Support building a grid-exec node with cinder or flavor-defined storage [puppet] - 10https://gerrit.wikimedia.org/r/672456 (https://phabricator.wikimedia.org/T272114) [20:05:32] (03CR) 10jerkins-bot: [V: 04-1] cinderutils::manifests::ensure: refine behavior for existing lvm volumes [puppet] - 10https://gerrit.wikimedia.org/r/672526 (owner: 10Andrew Bogott) [20:06:27] (03PS4) 10Andrew Bogott: cinderutils::manifests::ensure: refine behavior for existing lvm volumes [puppet] - 10https://gerrit.wikimedia.org/r/672526 [20:06:29] (03PS19) 10Andrew Bogott: Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) [20:06:31] (03PS18) 10Andrew Bogott: Support building a grid-exec node with cinder or flavor-defined storage [puppet] - 10https://gerrit.wikimedia.org/r/672456 (https://phabricator.wikimedia.org/T272114) [20:07:04] (03CR) 10Bstorm: wikireplicas: create actual paws database accounts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/670626 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [20:09:15] (03CR) 10Bstorm: wikireplicas: create actual paws database accounts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670626 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [20:09:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:14:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:18:55] 10SRE, 10SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Volans) p:05Triage→03Medium [20:22:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:23:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:23:40] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2225.codfw.wmnet [20:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:31:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:31:42] (03CR) 10BryanDavis: wikireplicas: create actual paws database accounts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670626 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [20:34:04] (03PS1) 10Cwhite: logstash: bugfix: use type field [puppet] - 10https://gerrit.wikimedia.org/r/672530 [20:34:11] (03PS4) 10Bstorm: wikireplicas: create actual paws database accounts [puppet] - 10https://gerrit.wikimedia.org/r/670626 (https://phabricator.wikimedia.org/T276284) [20:34:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:34:46] 10SRE, 10SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Volans) @Ottomata: is this access request actually needed? According to https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hue a manual sync of the user is enough to access Hue as the LD... [20:35:09] (03CR) 10Bstorm: wikireplicas: create actual paws database accounts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670626 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [20:35:43] (03CR) 10Cwhite: [C: 03+2] logstash: bugfix: use type field [puppet] - 10https://gerrit.wikimedia.org/r/672530 (owner: 10Cwhite) [20:36:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:40:01] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (10Volans) 05Open→03Resolved a:03Volans My understanding is that all the steps have been completed. Resolving. Feel free to re-open in case of... [20:42:22] (03PS1) 10Cwhite: logstash: bugfix: remove unnecessary character escaping [puppet] - 10https://gerrit.wikimedia.org/r/672532 [20:43:40] (03CR) 10Cwhite: [C: 03+2] logstash: bugfix: remove unnecessary character escaping [puppet] - 10https://gerrit.wikimedia.org/r/672532 (owner: 10Cwhite) [20:44:21] (03CR) 10Dzahn: [C: 03+2] mcrouter: replace proxy for codfw A3, mw2235->mw2299 [puppet] - 10https://gerrit.wikimedia.org/r/670949 (https://phabricator.wikimedia.org/T277119) (owner: 10Dzahn) [20:44:32] 10SRE, 10SRE-Access-Requests: Requesting access to stat boxes for mlitn - https://phabricator.wikimedia.org/T274749 (10Volans) a:03MarkTraceur Assigning it to @MarkTraceur for approval. [20:44:36] jouncebot: now [20:44:36] For the next 0 hour(s) and 15 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210315T2000) [20:45:01] jouncebot: next [20:45:01] In 0 hour(s) and 14 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210315T2100) [20:45:10] (03CR) 10BryanDavis: [C: 03+1] "Not tested, but the logical changes look correct." 
[puppet] - 10https://gerrit.wikimedia.org/r/670626 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [20:49:07] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:51:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:55:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:55:44] !log re-enabled puppet on kubestage2001, uncordoned kubestage2002 [20:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:43] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@4300929]: convert_to_esbulk: Accept partial hour timestamps [20:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:04] Reedy and sbassett: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210315T2100). [21:00:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:01:06] (03PS7) 10Cwhite: logstash: add dead letter queue support [puppet] - 10https://gerrit.wikimedia.org/r/670576 (https://phabricator.wikimedia.org/T277080) [21:02:29] (03PS8) 10Cwhite: logstash: add dead letter queue support [puppet] - 10https://gerrit.wikimedia.org/r/670576 (https://phabricator.wikimedia.org/T277080) [21:02:45] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@4300929]: convert_to_esbulk: Accept partial hour timestamps (duration: 03m 02s) [21:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:58] (03CR) 10Andrew Bogott: [C: 03+2] cinderutils::manifests::ensure: refine behavior for existing lvm volumes [puppet] - 10https://gerrit.wikimedia.org/r/672526 (owner: 10Andrew Bogott) [21:05:27] (03CR) 10Andrew Bogott: [C: 03+2] Add cinderutils::swap [puppet] - 10https://gerrit.wikimedia.org/r/672438 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [21:06:03] (03CR) 10Cwhite: [C: 03+2] logstash: add dead letter queue support [puppet] - 10https://gerrit.wikimedia.org/r/670576 (https://phabricator.wikimedia.org/T277080) (owner: 10Cwhite) [21:08:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:13:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:16:19] (03PS1) 10Legoktm: [WIP] docker_registry_ha: Allow k8s node IPs [puppet] - 10https://gerrit.wikimedia.org/r/672537 [21:18:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:21:58] (03CR) 10Bstorm: [C: 03+2] wikireplicas: create actual paws database accounts [puppet] - 10https://gerrit.wikimedia.org/r/670626 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [21:22:13] (03PS2) 10Legoktm: [WIP] docker_registry_ha: Allow k8s node IPs [puppet] - 10https://gerrit.wikimedia.org/r/672537 [21:22:17] (03CR) 10DannyS712: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672419 (https://phabricator.wikimedia.org/T276195) (owner: 10DannyS712) [21:22:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:23:02] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28616/console" [puppet] - 10https://gerrit.wikimedia.org/r/672537 (owner: 10Legoktm) [21:24:22] 10SRE, 10netbox, 10Patch-For-Review: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10Volans) @crusnov @jbond I did some quick testing on netbox-next and I have some questions: # Where are the logging messages saved? # I removed the `wmf` group, then I tried the following: -... 
[21:25:30] PROBLEM - SSH on mw2227.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:26:51] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28617/console" [puppet] - 10https://gerrit.wikimedia.org/r/672537 (owner: 10Legoktm) [21:28:05] 10SRE, 10netbox, 10Patch-For-Review: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10crusnov) >>! In T244849#6915715, @Volans wrote: > @crusnov @jbond > I did some quick testing on netbox-next and I have some questions: > > # Where are the logging messages saved? It is using the... [21:28:09] 10SRE, 10Wikimedia-Mailing-lists: Enable CSP for mailman3 - https://phabricator.wikimedia.org/T277263 (10Legoktm) p:05Triage→03Medium a:03Legoktm [21:28:18] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:29:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:29:44] (03PS3) 10Legoktm: [WIP] docker_registry_ha: Allow k8s node IPs [puppet] - 10https://gerrit.wikimedia.org/r/672537 [21:30:30] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28618/console" [puppet] - 10https://gerrit.wikimedia.org/r/672537 (owner: 10Legoktm) [21:31:45] 10SRE, 10netbox, 10Patch-For-Review: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10Volans) IIRC the LDAP auth re-syncs the groups all the time because needs to be sure the user reflects their current groups, to ensure that their access is consistent with their real groups and doesn... [21:34:55] 10SRE, 10Packaging: Disable man-db in pbuilder in package_builder on deneb - https://phabricator.wikimedia.org/T276632 (10Legoktm) p:05Triage→03Low [21:35:06] 10SRE, 10Wikimedia-Mailing-lists, 10vm-requests: Requesting a test VM in production for mailman3 - https://phabricator.wikimedia.org/T276686 (10Legoktm) p:05Triage→03Medium [21:35:10] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:35:58] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2236.codfw.wmnet [21:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:04] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2237.codfw.wmnet [21:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:15] (03PS1) 10Bstorm: Revert "wikireplicas: create actual paws database accounts" [puppet] - 10https://gerrit.wikimedia.org/r/672420 [21:36:20] !log dzahn@cumin1001 conftool action : set/pooled=no; 
selector: name=mw2238.codfw.wmnet [21:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:26] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2239.codfw.wmnet [21:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:38:12] (03CR) 10Bstorm: [C: 03+2] Revert "wikireplicas: create actual paws database accounts" [puppet] - 10https://gerrit.wikimedia.org/r/672420 (owner: 10Bstorm) [21:39:27] (03PS4) 10Legoktm: docker_registry_ha: Require authentication from k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/672537 (https://phabricator.wikimedia.org/T273521) [21:40:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:40:25] 10SRE, 10Wikimedia-Mailing-lists: Request for Community Resources mailman Mailing List - https://phabricator.wikimedia.org/T277468 (10Volans) p:05Triage→03Medium [21:40:34] PROBLEM - Confd template for /srv/config-master/pybal/codfw/appservers-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/appservers-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [21:40:34] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/apaches on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/apaches is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [21:40:36] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/apaches on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/apaches is broken 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [21:40:38] PROBLEM - Confd template for /srv/config-master/pybal/codfw/apaches on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/apaches is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [21:41:26] PROBLEM - Confd template for /srv/config-master/pybal/codfw/apaches on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/apaches is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [21:41:42] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/appservers-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/appservers-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [21:42:50] (03PS1) 10Andrew Bogott: Nova vendordata: rework initial partitioning [puppet] - 10https://gerrit.wikimedia.org/r/672538 (https://phabricator.wikimedia.org/T272114) [21:43:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:45:55] (03PS1) 10Bstorm: wikireplicas: create actual paws database accounts [puppet] - 10https://gerrit.wikimedia.org/r/672540 (https://phabricator.wikimedia.org/T276284) [21:46:26] um, what's wrong with confd? [21:46:40] 10SRE, 10netbox, 10Patch-For-Review: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10crusnov) >>! In T244849#6915745, @Volans wrote: > IIRC the LDAP auth re-syncs the groups all the time because needs to be sure the user reflects their current groups, to ensure that their access is c... 
[21:47:27] (03CR) 10Bstorm: "This should correct a type error in https://gerrit.wikimedia.org/r/c/operations/puppet/+/670626" [puppet] - 10https://gerrit.wikimedia.org/r/672540 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [21:47:28] { 'host': 'mw2225.codfw.wmnet', 'weight':25, 'enabled': False } [Errno -2] Name or service not known [21:47:58] mutante: ^^ [21:48:19] it's complaining that the host is still in conftool [21:48:22] (03CR) 10Zabe: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/671084 (owner: 10Gerrit Patch Uploader) [21:49:05] (03CR) 10Bstorm: wikireplicas: create actual paws database accounts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/672540 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [21:49:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:50:06] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/appservers-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/appservers-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [21:50:06] PROBLEM - Confd template for /srv/config-master/pybal/codfw/appservers-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/appservers-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [21:52:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:52:51] (03PS1) 10Jeena Huneidi: rdf-streaming-updater: fix networkpolicy selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/672544 [21:52:55] 10SRE, 10Wikimedia-Mailing-lists: Request for Community Resources mailman Mailing List - https://phabricator.wikimedia.org/T277468 (10Volans) @IJethroBT-WMF List created: * List info page: https://lists.wikimedia.org/mailman/listinfo/communityresources-l * List admin page: https://lists.wikimedia.org/mailman/... [21:55:28] (03PS2) 10Bstorm: wikireplicas: create actual paws database accounts [puppet] - 10https://gerrit.wikimedia.org/r/672540 (https://phabricator.wikimedia.org/T276284) [21:56:34] (03CR) 10Bstorm: wikireplicas: create actual paws database accounts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/672540 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [21:56:56] 10SRE, 10Wikimedia-Mailing-lists: Request for Community Resources mailman Mailing List - https://phabricator.wikimedia.org/T277468 (10IJethroBT-WMF) @Volans42 Thank you! Appreciate the help and quick response. : ) [21:58:20] (03PS1) 10QChris: Add .gitreview [debs/grizzly] - 10https://gerrit.wikimedia.org/r/672545 [21:58:22] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [debs/grizzly] - 10https://gerrit.wikimedia.org/r/672545 (owner: 10QChris) [22:00:29] 10SRE, 10netbox, 10Patch-For-Review: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10Volans) >>! In T244849#6915810, @crusnov wrote: > After reviewing the LDAP authentication library I can say that you're correct, it does do this in its backend. If this is a required functionality it... [22:01:24] 10SRE, 10Wikimedia-Mailing-lists: Request for Community Resources mailman Mailing List - https://phabricator.wikimedia.org/T277468 (10Volans) 05Open→03Resolved a:03Volans No problem. 
Resolving for now, feel free to re-open it if needed. [22:04:46] (03PS1) 10Cwhite: bugfix: insert pipeline id value into dlq input template [puppet] - 10https://gerrit.wikimedia.org/r/672547 [22:05:15] (03CR) 10jerkins-bot: [V: 04-1] bugfix: insert pipeline id value into dlq input template [puppet] - 10https://gerrit.wikimedia.org/r/672547 (owner: 10Cwhite) [22:05:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:06:54] (03CR) 10Bstorm: wikireplicas: create actual paws database accounts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/672540 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [22:07:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:12:43] (03CR) 10CRusnov: "This change is ready for review." [software/netbox] - 10https://gerrit.wikimedia.org/r/672548 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [22:13:15] (03PS2) 10CRusnov: Group Sensitive Remote: Sync groups at auth time, not creation time [software/netbox] - 10https://gerrit.wikimedia.org/r/672548 (https://phabricator.wikimedia.org/T244849) [22:13:56] 10SRE, 10netbox, 10Patch-For-Review: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10crusnov) >>! In T244849#6915834, @Volans wrote: >>>! In T244849#6915810, @crusnov wrote: >> After reviewing the LDAP authentication library I can say that you're correct, it does do this in its backe... 
[22:20:30] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10Jclark-ctr) db1176 A1 u6 p14 id1751 db1177 A3 u38 p29 id 1931 db1178 B1 u25 p16 id4020 db1179 B5 u35 p38 id3356 db1180 C3 u20 p6 id2956 db1181 C5 u16 p17 id1846 db1182 D1 u38 p2... [22:21:30] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10Jclark-ctr) a:03Cmjohnson [22:21:38] (03PS2) 10Cwhite: logstash: bugfix: insert pipeline id value into dlq input template [puppet] - 10https://gerrit.wikimedia.org/r/672547 [22:23:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:23:53] (03CR) 10Cwhite: [C: 03+2] logstash: bugfix: insert pipeline id value into dlq input template [puppet] - 10https://gerrit.wikimedia.org/r/672547 (owner: 10Cwhite) [22:24:52] (03PS2) 10Bartosz Dziewoński: Make DiscussionTools' replytool available for everyone on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672510 (https://phabricator.wikimedia.org/T277103) [22:26:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:26:50] RECOVERY - SSH on mw2227.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:30:48] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [22:32:40] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [22:35:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:37:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:40:45] (03PS3) 10Bstorm: wikireplicas: create actual paws database accounts [puppet] - 10https://gerrit.wikimedia.org/r/672540 (https://phabricator.wikimedia.org/T276284) [22:44:02] (03CR) 10Mstyles: "Thanks for noticing and fixing this! 
The task manager only needs ingress on port 6125 and that doesn't need to be publicly available, that" [deployment-charts] - 10https://gerrit.wikimedia.org/r/672544 (owner: 10Jeena Huneidi) [22:50:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:53:22] (03CR) 10CRusnov: "I'm debugging this to try to see why it isn't working at all anymore with this change." [software/netbox] - 10https://gerrit.wikimedia.org/r/672548 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [22:55:36] (03CR) 10Jeena Huneidi: "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/672544 (owner: 10Jeena Huneidi) [22:58:54] PROBLEM - IPMI Sensor Status on elastic1042 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [23:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210315T2300). Please do the needful. [23:00:04] legoktm and DannyS712: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:27] legoktm: will you do B&C, or should i? [23:00:37] I can do it [23:00:37] I'm here [23:01:08] thanks legoktm [23:02:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:02:31] (03CR) 10Legoktm: [C: 03+2] GlobalWatchlist: allow watching up to 50 sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672419 (https://phabricator.wikimedia.org/T276195) (owner: 10DannyS712) [23:03:30] (03Merged) 10jenkins-bot: GlobalWatchlist: allow watching up to 50 sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672419 (https://phabricator.wikimedia.org/T276195) (owner: 10DannyS712) [23:04:54] legoktm do you want me to try and test on the debug host once its ready? Or just merge [23:05:07] DannyS712: please test on mwdebug1002 [23:05:21] (just took me a minute to sync) [23:06:24] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:07:07] hmm, the chrome extension for that is giving me javascript exceptions... 
looking [23:08:45] (03PS3) 10Mstyles: create helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) [23:08:53] legoktm appears to work, was able to save settings with more sites [23:08:57] ok [23:09:17] (03CR) 10Mstyles: create helmfile.d structure (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [23:09:47] (03PS1) 10Cwhite: logstash: short-circuit dead letter recursion [puppet] - 10https://gerrit.wikimedia.org/r/672556 (https://phabricator.wikimedia.org/T277080) [23:10:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [23:10:21] syncing [23:10:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Jclark-ctr) Handing over to Chris to finish the 2 we have received mc1037 r,A7 u41 P5 ID5352 mc1038 r,A7 u42 P19 ID5353 [23:10:40] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:11:22] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: GlobalWatchlist: allow watching up to 50 sites (T276195) (duration: 01m 04s) [23:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:34] T276195: Allow more than five wikis as projects in Global Watchlist - https://phabricator.wikimedia.org/T276195 [23:11:41] thanks legoktm [23:11:44] np [23:12:57] mutante: I don't think mw2225.codfw.wmnet was fully decomissioned [23:13:41] !log legoktm@deploy1002 conftool action : set/pooled=inactive; selector: name=mw2225.codfw.wmnet [23:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:15:37] (03PS2) 10Legoktm: Support having multiple IRC feed servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670913 (https://phabricator.wikimedia.org/T224579) [23:15:41] (03PS3) 10Legoktm: Define IRC feed servers as an array in {Production,Labs}Services.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670914 (https://phabricator.wikimedia.org/T224579) [23:15:43] (03PS3) 10Legoktm: Remove back-compat from when IRC feed servers was a string [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670915 (https://phabricator.wikimedia.org/T224579) [23:16:33] (03CR) 10Legoktm: [C: 03+2] Support having multiple IRC feed servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670913 (https://phabricator.wikimedia.org/T224579) (owner: 10Legoktm) [23:16:38] (03CR) 10Legoktm: [C: 03+2] Define IRC feed servers as an array in {Production,Labs}Services.php [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/670914 (https://phabricator.wikimedia.org/T224579) (owner: 10Legoktm) [23:17:29] (03Merged) 10jenkins-bot: Support having multiple IRC feed servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670913 (https://phabricator.wikimedia.org/T224579) (owner: 10Legoktm) [23:17:33] (03Merged) 10jenkins-bot: Define IRC feed servers as an array in {Production,Labs}Services.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670914 (https://phabricator.wikimedia.org/T224579) (owner: 10Legoktm) [23:19:42] (03PS1) 10Cwhite: prometheus: generate metrics on dead letter events [puppet] - 10https://gerrit.wikimedia.org/r/672558 (https://phabricator.wikimedia.org/T277080) [23:21:36] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:22:21] I made edits on mwdebug1002 and watched them show up on irc.wm.o [23:23:04] !log legoktm@deploy1002 Synchronized wmf-config/CommonSettings.php: Support having multiple IRC feed servers (T224579) (duration: 00m 58s) [23:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:11] T224579: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 [23:24:41] !log legoktm@deploy1002 Synchronized wmf-config/: Define IRC feed servers as an array in {Production,Labs}Services.php (T224579) (duration: 00m 59s) [23:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:46] (03CR) 10Legoktm: [C: 03+2] Remove back-compat from when IRC feed servers was a string [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670915 (https://phabricator.wikimedia.org/T224579) (owner: 10Legoktm) [23:26:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:26:58] (03Merged) 10jenkins-bot: Remove back-compat from when IRC feed servers was a string [mediawiki-config] - 10https://gerrit.wikimedia.org/r/670915 (https://phabricator.wikimedia.org/T224579) (owner: 10Legoktm) [23:28:09] (03PS2) 10Jeena Huneidi: rdf-streaming-updater: fix networkpolicy selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/672544 [23:31:26] (03PS1) 10Cwhite: logstash: disable the dlq [puppet] - 10https://gerrit.wikimedia.org/r/672559 [23:31:27] !log legoktm@deploy1002 Synchronized wmf-config/CommonSettings.php: Remove back-compat from when IRC feed servers was a string (T224579) (duration: 00m 59s) [23:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:34] T224579: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 [23:31:52] all done [23:32:16] 10SRE, 10Patch-For-Review: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (10Legoktm) Should be all set on the MW side now. [23:32:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:35:08] (03CR) 10H.krishna123: "Thanks for the code review :) Just added my acknowledgements, I've marked some of them as resolved - wasn't sure if I should tick things a" (036 comments) [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/672150 (owner: 10H.krishna123) [23:36:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:39:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:45:00] (03PS3) 10Jeena Huneidi: rdf-streaming-updater: fix networkpolicy selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/672544 [23:45:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:45:35] (03CR) 10Jeena Huneidi: "I was wondering about port 6122, which is specified in the networkpolicy, as well as in values.yaml as tls.config.public_port. 
It's expose" [deployment-charts] - 10https://gerrit.wikimedia.org/r/672544 (owner: 10Jeena Huneidi) [23:46:09] (03CR) 10jenkins-bot: [V: 04-1] rdf-streaming-updater: fix networkpolicy selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/672544 (owner: 10Jeena Huneidi) [23:48:27] (03PS1) 10Ebernhardson: Add Cirrus testing profile for glent m1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672565 (https://phabricator.wikimedia.org/T262612) [23:52:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:56:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:57:09] (03PS4) 10Jeena Huneidi: rdf-streaming-updater: fix networkpolicy selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/672544