[00:03:50] (CR) Dzahn: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/606824 (owner: Ladsgroup)
[00:07:41] RECOVERY - Stale file for node-exporter textfile in ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[00:07:47] (CR) Dzahn: "for now you can manage permissions manually as before to unbreak it. puppet won't fight over them until we let it manage them properly" [puppet] - https://gerrit.wikimedia.org/r/606824 (owner: Ladsgroup)
[00:12:36] (PS1) Dzahn: meet::accountmanager: add some fake private secrets (example) [labs/private] - https://gerrit.wikimedia.org/r/607153
[00:15:28] (CR) Dzahn: "once created this would exist in the exact same location but with real secrets in the prod private repo." [labs/private] - https://gerrit.wikimedia.org/r/607153 (owner: Dzahn)
[00:23:56] (CR) Ladsgroup: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/606824 (owner: Ladsgroup)
[00:25:46] (CR) Ladsgroup: meet::accountmanager: add some fake private secrets (example) (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/607153 (owner: Dzahn)
[00:26:05] RECOVERY - Stale file for node-exporter textfile in esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[00:34:35] RECOVERY - Stale file for node-exporter textfile in eqsin on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[01:30:20] (PS1) Aaron Schulz: [DNM] Enable "coalesceKeys" for global keys for WANCache (II) [mediawiki-config] - https://gerrit.wikimedia.org/r/607155 (https://phabricator.wikimedia.org/T252564)
[02:05:22] (PS1) TrainBranchBot: Branch commit for wmf/1.35.0-wmf.38 [core] (wmf/1.35.0-wmf.38) - https://gerrit.wikimedia.org/r/607158
[02:06:35] (PS2) DannyS712: Branch commit for wmf/1.35.0-wmf.38 [core] (wmf/1.35.0-wmf.38) - https://gerrit.wikimedia.org/r/607158 (https://phabricator.wikimedia.org/T254175) (owner: TrainBranchBot)
[02:11:57] Operations, Wikimedia-Mailing-lists, Upstream: "From" at start of line becomes ">From" in pipermail - https://phabricator.wikimedia.org/T115329 (MZMcBride) >>! In T115329#5232892, @fsero wrote: > This task has been inactive for 3 years, so I'm closing it please reopen if this still needed. Please ne...
[03:13:40] (PS1) CDanis: MW PHP-FPM worker saturation: make it page [puppet] - https://gerrit.wikimedia.org/r/607163 (https://phabricator.wikimedia.org/T252605)
[03:48:41] (CR) Aaron Schulz: [C: +1] Set expiry headers on thumbnails [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: Gilles)
[04:38:30] Operations, Wikimedia-Mailing-lists, Upstream: "From" at start of line becomes ">From" in pipermail - https://phabricator.wikimedia.org/T115329 (Aklapper) Seems current behavior to escape body lines starting with `From` was introduced in https://bugs.launchpad.net/mailman/+bug/266068/comments/11 Mail...
[04:49:26] (PS1) Marostegui: db1118: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/607165 (https://phabricator.wikimedia.org/T254462)
[04:54:58] (CR) Marostegui: [C: +2] db1118: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/607165 (https://phabricator.wikimedia.org/T254462) (owner: Marostegui)
[05:03:15] !log marostegui@cumin2001 dbctl commit (dc=all): 'Slowly repool db1118', diff saved to https://phabricator.wikimedia.org/P11633 and previous config saved to /var/cache/conftool/dbconfig/20200623-050314-marostegui.json
[05:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:37] (PS1) Marostegui: db1080: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/607167 (https://phabricator.wikimedia.org/T254462)
[05:12:00] !log marostegui@cumin2001 dbctl commit (dc=all): 'Slowly repool db1118', diff saved to https://phabricator.wikimedia.org/P11634 and previous config saved to /var/cache/conftool/dbconfig/20200623-051159-marostegui.json
[05:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:12:50] (CR) Marostegui: [C: +2] db1080: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/607167 (https://phabricator.wikimedia.org/T254462) (owner: Marostegui)
[05:22:55] !log marostegui@cumin2001 dbctl commit (dc=all): 'Slowly repool db1118', diff saved to https://phabricator.wikimedia.org/P11635 and previous config saved to /var/cache/conftool/dbconfig/20200623-052254-marostegui.json
[05:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:51] !log marostegui@cumin2001 dbctl commit (dc=all): 'Depool db1080 for InnoDB compression', diff saved to https://phabricator.wikimedia.org/P11636 and previous config saved to /var/cache/conftool/dbconfig/20200623-052350-marostegui.json
[05:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:20] !log Compress InnoDB
on db1080 T254462
[05:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:23] T254462: Compress enwiki InnoDB tables - https://phabricator.wikimedia.org/T254462
[05:40:36] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 55 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:46:18] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:03:40] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:06:38] !log disable peering BGP sessions on AMS-IX - T253970
[06:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:43] T253970: No LACP info for cr2-esams:ae2 - https://phabricator.wikimedia.org/T253970
[06:13:16] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:13:42] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 91, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:18:00] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 53 probes of 564 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:21:14]
PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:23:48] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 48 probes of 564 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:57:32] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:59:18] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:01:21] !log marostegui@cumin2001 dbctl commit (dc=all): 'Fully repool db1118', diff saved to https://phabricator.wikimedia.org/P11637 and previous config saved to /var/cache/conftool/dbconfig/20200623-070120-marostegui.json
[07:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:06] (PS1) Aklapper: phabricator weekly changes email: Include URLs for listed projects [puppet] - https://gerrit.wikimedia.org/r/607218
[07:12:07] (CR) Marostegui: WIP - Introduce profile::mariadb::misc::analytics (2 comments) [puppet] - https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) (owner: Elukey)
[07:15:13] (PS1) Marostegui: eventlogging: Change basedir depending on the OS [puppet] - https://gerrit.wikimedia.org/r/607219 (https://phabricator.wikimedia.org/T234826)
[07:18:52] (PS1) Aklapper: phabricator weekly changes email: List projects' color/icon violations [puppet] -
https://gerrit.wikimedia.org/r/607222 (https://phabricator.wikimedia.org/T249806)
[07:21:15] (CR) Elukey: WIP - Introduce profile::mariadb::misc::analytics (2 comments) [puppet] - https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) (owner: Elukey)
[07:21:56] (PS1) Marostegui: db2133: Reimage to Buster [puppet] - https://gerrit.wikimedia.org/r/607223 (https://phabricator.wikimedia.org/T250666)
[07:24:44] (CR) Marostegui: WIP - Introduce profile::mariadb::misc::analytics (2 comments) [puppet] - https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) (owner: Elukey)
[07:26:12] Operations, Analytics, Analytics-Cluster: Segfault for systemd-sysusers.service on stat1007 - https://phabricator.wikimedia.org/T256098 (elukey)
[07:28:33] Operations, Analytics, Analytics-Cluster: Segfault for systemd-sysusers.service on stat1007 - https://phabricator.wikimedia.org/T256098 (elukey)
[07:28:51] (CR) Marostegui: [C: +2] db2133: Reimage to Buster [puppet] - https://gerrit.wikimedia.org/r/607223 (https://phabricator.wikimedia.org/T250666) (owner: Marostegui)
[07:30:42] !log Reimage db2133 (m2 codfw master) to Buster (this will trigger haproxy IRC alert) T250666
[07:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:47] T250666: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666
[07:33:22] Operations, Analytics, Analytics-Cluster: Segfault for systemd-sysusers.service on stat1007 - https://phabricator.wikimedia.org/T256098 (elukey) One note is: ` elukey@stat1007:~$ apt-cache policy libsystemd0 libsystemd0: Installed: 241-5~bpo9+1 Candidate: 241-5~bpo9+1 Version table: *** 241-5...
[07:36:54] PROBLEM - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[07:37:10] Operations, Analytics, Analytics-Cluster: Segfault for systemd-sysusers.service on stat1007 - https://phabricator.wikimedia.org/T256098 (elukey)
[07:37:31] haproxy alerts is expected ^
[07:42:33] !log Deploy schema change on db1088
[07:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:03] Operations, SRE-swift-storage, serviceops: Access to the thanos-swift cluster for ChartMuseum - https://phabricator.wikimedia.org/T256020 (fgiunchedi) Generally LGTM as a use case. Is there PII/private data in the charts or you expect to? I'm pointing this out because while connections to the swift...
[07:48:26] Operations, Analytics, Analytics-Cluster: Segfault for systemd-sysusers.service on stat1007 - https://phabricator.wikimedia.org/T256098 (MoritzMuehlenhoff) We could try to track this down whether a specific user definitions makes it crash, by first removing individual files from /usr/lib/sysusers.d (...
[07:48:31] (CR) Hashar: "We can drop the contint::firewall::labs class entirely, it is not used.
I am amending with the detailed explanation ;]" [puppet] - https://gerrit.wikimedia.org/r/606737 (https://phabricator.wikimedia.org/T114209) (owner: Dzahn)
[07:48:50] (PS2) Hashar: contint: remove obsolete firewall rules from labs [puppet] - https://gerrit.wikimedia.org/r/606737 (https://phabricator.wikimedia.org/T114209) (owner: Dzahn)
[07:49:05] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/606739 (https://phabricator.wikimedia.org/T114209) (owner: Dzahn)
[07:49:10] !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime
[07:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:22] Operations, ops-eqiad, DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat) Data consistency check passed.
[07:49:51] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/606730 (https://phabricator.wikimedia.org/T114209) (owner: Dzahn)
[07:50:10] Operations, ops-eqiad, DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Marostegui) @Kormat I am going to apply the MCR schema change and once I am done, maybe we can reimage this to Buster and 10.4 while DCOPs look for a BBU?
[07:50:22] (CR) Jcrespo: [C: +1] install_server: Do not allow db1140 reimage [puppet] - https://gerrit.wikimedia.org/r/606967 (owner: Marostegui)
[07:50:24] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/606735 (https://phabricator.wikimedia.org/T114209) (owner: Dzahn)
[07:50:55] (CR) Marostegui: [C: +2] install_server: Do not allow db1140 reimage [puppet] - https://gerrit.wikimedia.org/r/606967 (owner: Marostegui)
[07:51:10] (PS3) Hashar: contint: remove obsolete firewall rules from labs [puppet] - https://gerrit.wikimedia.org/r/606737 (https://phabricator.wikimedia.org/T114209) (owner: Dzahn)
[07:51:40] (CR) Hashar: [C: +1] "I forgot to sign-off and mention that a wmf-style issue is resolved:" [puppet] - https://gerrit.wikimedia.org/r/606737 (https://phabricator.wikimedia.org/T114209) (owner: Dzahn)
[07:51:44] !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:00] !log restart scs-a8-eqiad - T256101
[07:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:04] T256101: scs-a8-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T256101
[07:58:43] (PS1) Jcrespo: mariadb-backups: Reimage db2101 (x1 backup source) to buster [puppet] - https://gerrit.wikimedia.org/r/607228 (https://phabricator.wikimedia.org/T254871)
[07:59:40] (CR) Jcrespo: "@Kormat could you confirm that if a regular db recipe, this is all needed to reimage without deleting /srv ?" [puppet] - https://gerrit.wikimedia.org/r/607228 (https://phabricator.wikimedia.org/T254871) (owner: Jcrespo)
[08:01:57] Operations, SRE-swift-storage, serviceops: Access to the thanos-swift cluster for ChartMuseum - https://phabricator.wikimedia.org/T256020 (JMeybohm) We don't expect private data in the charts at all.
In addition, they are already publicly accessible via https://releases.wikimedia.org/charts/ and http...
[08:02:42] RECOVERY - haproxy failover on dbproxy2002 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:03:42] !log draining ganeti1007 for eventual reboot
[08:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:15] (CR) Kormat: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/607228 (https://phabricator.wikimedia.org/T254871) (owner: Jcrespo)
[08:04:42] (CR) Kormat: [C: +1] mariadb-backups: Reimage db2101 (x1 backup source) to buster [puppet] - https://gerrit.wikimedia.org/r/607228 (https://phabricator.wikimedia.org/T254871) (owner: Jcrespo)
[08:05:13] (CR) Marostegui: "You might want to disable notifications too" [puppet] - https://gerrit.wikimedia.org/r/607228 (https://phabricator.wikimedia.org/T254871) (owner: Jcrespo)
[08:06:41] (CR) Elukey: [C: +1] "the class will be decommed very soon but LGTM!" [puppet] - https://gerrit.wikimedia.org/r/607219 (https://phabricator.wikimedia.org/T234826) (owner: Marostegui)
[08:06:58] (CR) ArielGlenn: "The ticket says to move ferm rules into the roles; is that what we want to do, rather than the profile?" [puppet] - https://gerrit.wikimedia.org/r/606739 (https://phabricator.wikimedia.org/T114209) (owner: Dzahn)
[08:07:02] (CR) Marostegui: [C: +2] eventlogging: Change basedir depending on the OS [puppet] - https://gerrit.wikimedia.org/r/607219 (https://phabricator.wikimedia.org/T234826) (owner: Marostegui)
[08:07:04] (CR) JMeybohm: "Sorry for bugging you with the unclean state of this Alex.
It wasn't meant to be ready for review :-)" [puppet] - https://gerrit.wikimedia.org/r/606956 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[08:08:33] (CR) Jcrespo: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/607228 (https://phabricator.wikimedia.org/T254871) (owner: Jcrespo)
[08:10:29] (PS2) Jcrespo: mariadb-backups: Reimage db2101 (x1 backup source) to buster [puppet] - https://gerrit.wikimedia.org/r/607228 (https://phabricator.wikimedia.org/T254871)
[08:11:06] (CR) Marostegui: [C: +1] mariadb-backups: Reimage db2101 (x1 backup source) to buster [puppet] - https://gerrit.wikimedia.org/r/607228 (https://phabricator.wikimedia.org/T254871) (owner: Jcrespo)
[08:12:59] (CR) Jcrespo: [C: +2] mariadb-backups: Reimage db2101 (x1 backup source) to buster [puppet] - https://gerrit.wikimedia.org/r/607228 (https://phabricator.wikimedia.org/T254871) (owner: Jcrespo)
[08:19:59] (PS3) JMeybohm: WIP: chartmuseum: Add initial module, profile and role [puppet] - https://gerrit.wikimedia.org/r/606956 (https://phabricator.wikimedia.org/T253843)
[08:20:46] (CR) JMeybohm: WIP: chartmuseum: Add initial module, profile and role (7 comments) [puppet] - https://gerrit.wikimedia.org/r/606956 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[08:34:30] Operations, serviceops, Patch-For-Review, User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (elukey) Getting back to https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/595810/ - one thing that it would be useful before merging is to dump all the ke...
[08:35:24] (CR) Jcrespo: "See the below mistake, otherwise this can be merged."
(1 comment) [software/transferpy] - https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) (owner: Privacybatm)
[08:36:28] I'm re-enabling AMS-IX
[08:37:02] (PS1) Marostegui: db2133: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/607233
[08:37:32] (CR) Marostegui: [C: +2] db2133: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/607233 (owner: Marostegui)
[08:38:20] !log re-enable peering BGP sessions on AMS-IX - T253970
[08:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:24] T253970: No LACP info for cr2-esams:ae2 - https://phabricator.wikimedia.org/T253970
[08:39:00] ACKNOWLEDGEMENT - LibreNMS has a critical alert #page on icinga1001 is CRITICAL: Primary outbound port utilisation over 80% #page (cr3-esams.wikimedia.org) Ayounsi re-enabling AMS-IX https://wikitech.wikimedia.org/wiki/Network_monitoring%23LibreNMS_alerts
[08:39:17] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 93, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:39:24] acking the page before it pages
[08:39:35] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 233, down: 6, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:40:19] (PS1) DCausse: Set proper language code for simple english wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/607235 (https://phabricator.wikimedia.org/T250810)
[08:45:58] (PS9) Jbond: wmflib: Migrate ini, ordered_yaml and php_ini to modern puppet custom functions [puppet] - https://gerrit.wikimedia.org/r/492518 (owner: Paladox)
[08:46:12] !log mwmaint1002: add uid=abban,ou=people,dc=wikimedia,dc=org to group 'nda' T255775
[08:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:16] T255775: Add Abban to the ldap/nda group -
https://phabricator.wikimedia.org/T255775
[08:46:55] (PS1) Marostegui: mariadb: Promote es1024 to es5 master [puppet] - https://gerrit.wikimedia.org/r/607236 (https://phabricator.wikimedia.org/T255755)
[08:47:13] (CR) jerkins-bot: [V: -1] wmflib: Migrate ini, ordered_yaml and php_ini to modern puppet custom functions [puppet] - https://gerrit.wikimedia.org/r/492518 (owner: Paladox)
[08:47:16] Operations, LDAP-Access-Requests: Add Abban to the ldap/nda group - https://phabricator.wikimedia.org/T255775 (ema) Open→Resolved a:ema Done.
[08:48:04] (CR) Marostegui: [C: -2] "Wait for the failover day" [puppet] - https://gerrit.wikimedia.org/r/607236 (https://phabricator.wikimedia.org/T255755) (owner: Marostegui)
[08:48:35] Operations, netops: No LACP info for cr2-esams:ae2 - https://phabricator.wikimedia.org/T253970 (ayounsi) I had to re-enable the link before AMS-IX figured out how to turn LACP on on their side. Will re-schedule a window another early morning, hopefully they will have figured out what delayed them until t...
[08:50:16] (PS7) Elukey: WIP - Introduce profile::mariadb::misc::analytics [puppet] - https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826)
[08:50:43] (CR) Jcrespo: "The other warnings, aside from the .py extension (which we will ignore) is the lack of a man page. We can do that at a later patch." (2 comments) [software/transferpy] - https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) (owner: Privacybatm)
[08:55:35] Operations, SRE-Access-Requests, Patch-For-Review: Allow AKlapper to disable other people's personal Herald rules in Phabricator - https://phabricator.wikimedia.org/T255914 (ema) p:Triage→Medium
[08:56:44] (CR) Muehlenhoff: [C: +1] "That ticket is so old that is predated our use of profiles :-) It applies to role/profiles alike."
[puppet] - https://gerrit.wikimedia.org/r/606739 (https://phabricator.wikimedia.org/T114209) (owner: Dzahn)
[08:57:25] (PS1) Muehlenhoff: Switch puppetboard to only use CAS for authentication [puppet] - https://gerrit.wikimedia.org/r/607239
[08:59:34] (CR) ArielGlenn: [C: +1] "I see :-D IN that case, here's my +1" [puppet] - https://gerrit.wikimedia.org/r/606739 (https://phabricator.wikimedia.org/T114209) (owner: Dzahn)
[08:59:47] (CR) Kormat: [C: +1] mariadb: Promote es1024 to es5 master [puppet] - https://gerrit.wikimedia.org/r/607236 (https://phabricator.wikimedia.org/T255755) (owner: Marostegui)
[09:00:44] (CR) Ema: [C: +2] "> Note that I have no idea if this is sufficient when it comes to" [puppet] - https://gerrit.wikimedia.org/r/602951 (https://phabricator.wikimedia.org/T255914) (owner: Aklapper)
[09:00:52] (PS4) Ema: Phab: Allow disabling Herald rules [puppet] - https://gerrit.wikimedia.org/r/602951 (https://phabricator.wikimedia.org/T255914) (owner: Aklapper)
[09:01:43] (CR) Jcrespo: "See my comments below, when you are happy with it, I will give it a try." (2 comments) [software/transferpy] - https://gerrit.wikimedia.org/r/607028 (https://phabricator.wikimedia.org/T255999) (owner: Privacybatm)
[09:04:16] (PS8) Elukey: Introduce profile::mariadb::misc::analytics [puppet] - https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826)
[09:07:45] (CR) Filippo Giunchedi: [C: +1] "I'm +1 on the general idea, although I'm a little conflicted on changing the label name to a label that doesn't exists. Would leaving 'ins" [puppet] - https://gerrit.wikimedia.org/r/607134 (owner: CDanis)
[09:08:04] Operations, SRE-Access-Requests, Patch-For-Review: Allow AKlapper to disable other people's personal Herald rules in Phabricator - https://phabricator.wikimedia.org/T255914 (ema) @Aklapper: patch merged, please try and see if the command works as expected.
[09:10:38] Operations, SRE-Access-Requests: Requesting access to centralauth database for Jennifer Wang - https://phabricator.wikimedia.org/T255836 (ema) p:Triage→Medium
[09:12:38] (CR) Filippo Giunchedi: [C: +2] prometheus: require class node_exporter for node textfile scripts [puppet] - https://gerrit.wikimedia.org/r/606977 (owner: Filippo Giunchedi)
[09:14:52] (PS2) Muehlenhoff: Switch puppetboard to only use CAS for authentication [puppet] - https://gerrit.wikimedia.org/r/607239
[09:14:55] Operations, SRE-Access-Requests: Requesting access to centralauth database for Jennifer Wang - https://phabricator.wikimedia.org/T255836 (ema)
[09:17:13] (CR) Jcrespo: "Looks good, only have some questions:" [puppet] - https://gerrit.wikimedia.org/r/606953 (https://phabricator.wikimedia.org/T254556) (owner: Marostegui)
[09:19:45] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/23383/" [puppet] - https://gerrit.wikimedia.org/r/607239 (owner: Muehlenhoff)
[09:21:28] Operations, SRE-Access-Requests: Requesting access to centralauth database for Jennifer Wang - https://phabricator.wikimedia.org/T255836 (ema) Hi @jwang, to carry on with your access request we need some additional information: - as per point 3 of the checklist, what sort of commands and/or tasks do you...
[09:24:18] (CR) Marostegui: [C: -2] "> Looks good, only have some questions:" [puppet] - https://gerrit.wikimedia.org/r/606953 (https://phabricator.wikimedia.org/T254556) (owner: Marostegui)
[09:24:33] (CR) Filippo Giunchedi: [C: +2] prometheus: import availability aggregation rules from Prometheus global [puppet] - https://gerrit.wikimedia.org/r/607031 (https://phabricator.wikimedia.org/T233956) (owner: Filippo Giunchedi)
[09:25:49] (CR) Jcrespo: "Cool, thanks!"
[puppet] - https://gerrit.wikimedia.org/r/606953 (https://phabricator.wikimedia.org/T254556) (owner: Marostegui)
[09:25:51] (PS1) Awight: Fix broken copy link in JS mode [extensions/TwoColConflict] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/607248 (https://phabricator.wikimedia.org/T253724)
[09:25:59] (PS2) Marostegui: mariadb: Promote db1097 to m1 master [puppet] - https://gerrit.wikimedia.org/r/606953 (https://phabricator.wikimedia.org/T254556)
[09:26:36] (CR) Jcrespo: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/606953 (https://phabricator.wikimedia.org/T254556) (owner: Marostegui)
[09:28:12] (CR) Jcrespo: [C: +1] mariadb: Promote db1097 to m1 master [puppet] - https://gerrit.wikimedia.org/r/606953 (https://phabricator.wikimedia.org/T254556) (owner: Marostegui)
[09:28:39] (PS1) Marostegui: dbproxy1012,1014: Place db1097 as standby host. [puppet] - https://gerrit.wikimedia.org/r/607249 (https://phabricator.wikimedia.org/T254556)
[09:29:08] (CR) Marostegui: "jcrespo let's test this for 24h as agreed on the previous patch" [puppet] - https://gerrit.wikimedia.org/r/607249 (https://phabricator.wikimedia.org/T254556) (owner: Marostegui)
[09:29:12] (CR) Jcrespo: [C: +1] mariadb: Promote db1097 to m1 master (1 comment) [puppet] - https://gerrit.wikimedia.org/r/606953 (https://phabricator.wikimedia.org/T254556) (owner: Marostegui)
[09:29:14] (PS1) Filippo Giunchedi: prometheus: fix traffic aggregation rules [puppet] - https://gerrit.wikimedia.org/r/607250
[09:30:29] (CR) Jcrespo: [C: +1] "If you tested already I trust you. It wouldn't hurt anyway." [puppet] - https://gerrit.wikimedia.org/r/607249 (https://phabricator.wikimedia.org/T254556) (owner: Marostegui)
[09:30:37] (CR) Marostegui: [C: +2] dbproxy1012,1014: Place db1097 as standby host.
[puppet] - https://gerrit.wikimedia.org/r/607249 (https://phabricator.wikimedia.org/T254556) (owner: Marostegui)
[09:32:17] !log Reload haproxy on dbproxy1012 and dbproxy1014 to test db1097 as secondary for 24h T254556
[09:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:22] T254556: Upgrade m1 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254556
[09:33:07] (PS1) Arturo Borrero Gonzalez: toolforge: mailrelay: migrate TLS cert to acme-chief [puppet] - https://gerrit.wikimedia.org/r/607251 (https://phabricator.wikimedia.org/T120225)
[09:34:26] (CR) Filippo Giunchedi: "I'm not super familiar with the code, so this might be a silly question: how will renames/rotation be affected ?" [debs/mtail] (cross_dist_build) - https://gerrit.wikimedia.org/r/607144 (https://phabricator.wikimedia.org/T255776) (owner: Cwhite)
[09:35:04] (CR) Filippo Giunchedi: [C: +2] prometheus: fix traffic aggregation rules [puppet] - https://gerrit.wikimedia.org/r/607250 (owner: Filippo Giunchedi)
[09:37:37] (CR) Jbond: "thanks see inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: Jbond)
[09:38:02] (PS2) Arturo Borrero Gonzalez: toolforge: mailrelay: migrate TLS cert to acme-chief [puppet] - https://gerrit.wikimedia.org/r/607251 (https://phabricator.wikimedia.org/T120225)
[09:38:42] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/607139 (owner: Dzahn)
[09:40:31] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/607239 (owner: Muehlenhoff)
[09:42:18] (CR) Jcrespo: [C: +1] "I cannot remember why those where there. Maybe it got reformatted on refactoring. +1 but let's add alex."
[puppet] - https://gerrit.wikimedia.org/r/607139 (owner: Dzahn)
[09:46:39] !log stopping and reimaging db2101 into buster T254871
[09:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:43] T254871: Upgrade x1 databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254871
[09:47:43] (CR) Jbond: [C: +2] taskgen: add new CI test to ensure defaults in prod are also in cloud [puppet] - https://gerrit.wikimedia.org/r/606444 (owner: Jbond)
[09:48:08] !log add new CI check for cloud yaml data https://gerrit.wikimedia.org/r/c/operations/puppet/+/606444/
[09:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:42] (Abandoned) Jbond: taskgen: test new CI check [puppet] - https://gerrit.wikimedia.org/r/606445 (owner: Jbond)
[09:48:51] (Abandoned) Jbond: taskgen: fix CI issues [puppet] - https://gerrit.wikimedia.org/r/606446 (owner: Jbond)
[09:51:12] (PS6) Jbond: cookbook sre.pdus: add reboot script [cookbooks] - https://gerrit.wikimedia.org/r/604678 (https://phabricator.wikimedia.org/T246890)
[09:52:01] (Abandoned) Jbond: build.gradle: add memcached support to cas blob [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/592659 (https://phabricator.wikimedia.org/T233931) (owner: Jbond)
[09:56:39] (CR) Thiemo Kreuz (WMDE): [C: +1] Fix broken copy link in JS mode [extensions/TwoColConflict] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/607248 (https://phabricator.wikimedia.org/T253724) (owner: Awight)
[09:58:05] (PS1) Filippo Giunchedi: prometheus: move global availability rules to new names [puppet] - https://gerrit.wikimedia.org/r/607256 (https://phabricator.wikimedia.org/T233956)
[09:58:21] (CR) jerkins-bot: [V: -1] prometheus: move global availability rules to new names [puppet] - https://gerrit.wikimedia.org/r/607256 (https://phabricator.wikimedia.org/T233956) (owner: Filippo Giunchedi)
[09:59:20] (03PS2) 10Filippo Giunchedi: prometheus: move global availability rules to new names [puppet] - 10https://gerrit.wikimedia.org/r/607256 (https://phabricator.wikimedia.org/T233956) [10:01:14] !log jynus@cumin2001 START - Cookbook sre.hosts.downtime [10:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:53] !log jynus@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:02] 10Operations, 10SRE-Access-Requests: Allow AKlapper to disable other people's personal Herald rules in Phabricator - https://phabricator.wikimedia.org/T255914 (10Aklapper) 05Open→03Resolved Works! Thanks a lot! [10:07:11] 10Operations, 10ops-eqiad, 10DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (10Marostegui) @Kormat MCR schema change applied, you can proceed with the reimage anytime Thank you! [10:07:50] (03PS1) 10Jbond: profile::idp::httpd::client: add CASScope [puppet] - 10https://gerrit.wikimedia.org/r/607257 [10:08:47] (03PS3) 10Arturo Borrero Gonzalez: toolforge: mailrelay: migrate TLS cert to acme-chief [puppet] - 10https://gerrit.wikimedia.org/r/607251 (https://phabricator.wikimedia.org/T120225) [10:08:58] (03CR) 10jerkins-bot: [V: 04-1] profile::idp::httpd::client: add CASScope [puppet] - 10https://gerrit.wikimedia.org/r/607257 (owner: 10Jbond) [10:10:58] (03PS1) 10Jcrespo: cumin: backup all of /srv where a lot of deployment state may live [puppet] - 10https://gerrit.wikimedia.org/r/607258 [10:15:13] (03PS2) 10Jbond: profile::idp::httpd::client: add CASScope [puppet] - 10https://gerrit.wikimedia.org/r/607257 [10:16:55] (03PS1) 10Kormat: install_server: Switch db1088 to buster. 
[puppet] - 10https://gerrit.wikimedia.org/r/607260 (https://phabricator.wikimedia.org/T250666) [10:19:18] (03CR) 10Marostegui: [C: 03+1] "Make sure to umount /srv before starting the install, as it doesn't have a BBU let's make sure everything is sync'ed to disk" [puppet] - 10https://gerrit.wikimedia.org/r/607260 (https://phabricator.wikimedia.org/T250666) (owner: 10Kormat) [10:19:43] (03CR) 10Kormat: [C: 03+2] install_server: Switch db1088 to buster. [puppet] - 10https://gerrit.wikimedia.org/r/607260 (https://phabricator.wikimedia.org/T250666) (owner: 10Kormat) [10:19:59] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/607257 (owner: 10Jbond) [10:21:23] (03PS1) 10Elukey: admin: add yubikey ssh-rsa entry for elukey [puppet] - 10https://gerrit.wikimedia.org/r/607263 [10:21:53] (03PS2) 10Elukey: admin: add yubikey ssh-rsa entry for elukey [puppet] - 10https://gerrit.wikimedia.org/r/607263 [10:22:46] !log reimaging db1088 to buster T250666 [10:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:50] T250666: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 [10:27:59] 10Operations, 10ops-eqiad, 10DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (10Kormat) Reimaging in progress. 
[10:29:45] (03CR) 10Arturo Borrero Gonzalez: "I tested this in toolsbeta, should be ready to merge" [puppet] - 10https://gerrit.wikimedia.org/r/607251 (https://phabricator.wikimedia.org/T120225) (owner: 10Arturo Borrero Gonzalez) [10:35:21] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [10:35:22] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:10] 10Operations, 10ops-eqiad, 10netops: eqiad row D switch fabric recabling - https://phabricator.wikimedia.org/T256112 (10ayounsi) p:05Triage→03Medium [10:38:17] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [10:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:29] !log temporarily shutdown xhgui1001/releases1002 to reshuffle Ganeti instances for reboots [10:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:53] 10Operations, 10ops-eqiad, 10netops: eqiad row D switch fabric recabling - https://phabricator.wikimedia.org/T256112 (10ayounsi) [10:39:47] 10Operations, 10ops-eqiad, 10netops: eqiad row D switch fabric recabling - https://phabricator.wikimedia.org/T256112 (10Marostegui) @ayounsi do you have a timeframe for this? The expected downtime is 30 minutes per for the whole row? 
[10:40:09] 10Operations, 10ops-eqiad, 10netops: eqiad row D switch fabric recabling - https://phabricator.wikimedia.org/T256112 (10ayounsi) [10:40:11] 10Operations, 10ops-eqiad, 10netops: asw2-d1-eqiad:VCP failure - https://phabricator.wikimedia.org/T252797 (10ayounsi) [10:40:14] 10Operations, 10ops-eqiad, 10netops: Three ports on asw2-d-eqiad are not working as expected - https://phabricator.wikimedia.org/T247881 (10ayounsi) [10:40:16] 10Operations, 10netops, 10Sustainability (Incident Prevention): D1<->D8 VC link failure - https://phabricator.wikimedia.org/T251663 (10ayounsi) [10:40:19] 10Operations, 10ops-eqiad, 10netops: upgrade row d to have 3 10G switches - https://phabricator.wikimedia.org/T196487 (10ayounsi) [10:40:56] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:10] 10Operations, 10CAS-SSO, 10User-jbond: CAS Store U2f tokens in a database - https://phabricator.wikimedia.org/T256113 (10jbond) [10:42:23] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [10:42:24] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:25] 10Operations, 10SRE-swift-storage, 10serviceops: Access to the thanos-swift cluster for ChartMuseum - https://phabricator.wikimedia.org/T256020 (10fgiunchedi) >>! In T256020#6247552, @JMeybohm wrote: > We don't expect private data in the charts at all. > In addition, they are already publicly accessible via... 
[10:55:02] (03PS1) 10Jcrespo: mariadb-backups: Reenable db2101 with snapshots to backup2002 [puppet] - 10https://gerrit.wikimedia.org/r/607264 (https://phabricator.wikimedia.org/T254871) [10:55:35] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [10:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:09] (03CR) 10Jbond: [C: 03+1] "lgtm" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/606961 (owner: 10Filippo Giunchedi) [10:57:40] PROBLEM - Host etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [10:58:15] (03PS2) 10Matthias Mullie: Revert "Revert "test commons: Use the database name in the Wikibase entity source config"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605645 (owner: 10Addshore) [10:58:59] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:46] (03CR) 10Matthias Mullie: [C: 03+1] "Can't reproduce original issue on local setup. Will deploy to investigate further." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605645 (owner: 10Addshore) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a European mid-day backport window(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200623T1100). [11:00:04] matthiasmullie and awight: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:13] o/ [11:00:13] (03CR) 10Alexandros Kosiaris: [C: 03+1] "I guess puppet-lint became smarter in the 5 years since Ic156a8659607eb69b298d9280a18e18ca9c0fb5f ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/607139 (owner: 10Dzahn) [11:00:34] RECOVERY - Host etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [11:00:46] I'll go ahead and deploy my config change [11:01:04] (03CR) 10Matthias Mullie: [C: 03+2] Revert "Revert "test commons: Use the database name in the Wikibase entity source config"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605645 (owner: 10Addshore) [11:01:13] 10Operations, 10ops-eqiad, 10DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (10Kormat) Reimage done, and host has caught up with replication. [11:01:15] Cool. I can deploy my patch whenever you're done. [11:01:42] 10Operations, 10ops-eqiad, 10netops: eqiad row D switch fabric recabling - https://phabricator.wikimedia.org/T256112 (10ayounsi) >>! In T256112#6247946, @Marostegui wrote: > @ayounsi do you have a timeframe for this? The expected downtime is 30 minutes per for the whole row? There is no expected downtime, b... [11:01:57] (03Merged) 10jenkins-bot: Revert "Revert "test commons: Use the database name in the Wikibase entity source config"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605645 (owner: 10Addshore) [11:03:54] (03CR) 10Jcrespo: [C: 03+2] bacula: remove unneeded lint-ignores [puppet] - 10https://gerrit.wikimedia.org/r/607139 (owner: 10Dzahn) [11:04:04] (03PS2) 10Jcrespo: bacula: remove unneeded lint-ignores [puppet] - 10https://gerrit.wikimedia.org/r/607139 (owner: 10Dzahn) [11:04:35] !log draining ganeti1008 for eventual reboot [11:04:37] (03CR) 10Elukey: [C: 03+2] admin: add yubikey ssh-rsa entry for elukey [puppet] - 10https://gerrit.wikimedia.org/r/607263 (owner: 10Elukey) [11:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:12] 10Operations, 10netops: No LACP info for cr2-esams:ae2 - https://phabricator.wikimedia.org/T253970 (10ayounsi) > I can confirm that I see traffic again and my apologies for the delay in responding, we encountered a provisioning
issue which has now been fixed. Re-scheduled for tomorrow 6am UTC. [11:07:31] !log mlitn@deploy1001 Synchronized wmf-config/InitialiseSettings.php: test commons: Use the database name in the Wikibase entity source config (duration: 00m 59s) [11:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:57] awight: I'm done, you're good to go! [11:09:08] (03CR) 10Jcrespo: [C: 03+2] "This should allow a full x1 conversion to 10.4 on codfw." [puppet] - 10https://gerrit.wikimedia.org/r/607264 (https://phabricator.wikimedia.org/T254871) (owner: 10Jcrespo) [11:10:19] matthiasmullie: ty :-) [11:10:34] (03CR) 10Awight: [C: 03+2] "BACON" [extensions/TwoColConflict] (wmf/1.35.0-wmf.37) - 10https://gerrit.wikimedia.org/r/607248 (https://phabricator.wikimedia.org/T253724) (owner: 10Awight) [11:11:32] 10Operations, 10CAS-SSO, 10User-jbond: CAS Store U2f tokens in a database - https://phabricator.wikimedia.org/T256113 (10jbond) p:05Triage→03Medium [11:14:44] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/607258 (owner: 10Jcrespo) [11:16:06] (03CR) 10Jcrespo: "I would like at least the ok from Riccardo, as he will know if the logical division of profiles is the right one, or a different one woul" [puppet] - 10https://gerrit.wikimedia.org/r/607258 (owner: 10Jcrespo) [11:24:39] (03PS1) 10Jcrespo: install_server: Reimage and wipe db1145 into stretch [puppet] - 10https://gerrit.wikimedia.org/r/607267 (https://phabricator.wikimedia.org/T254871) [11:27:28] (03Merged) 10jenkins-bot: Fix broken copy link in JS mode [extensions/TwoColConflict] (wmf/1.35.0-wmf.37) - 10https://gerrit.wikimedia.org/r/607248 (https://phabricator.wikimedia.org/T253724) (owner: 10Awight) [11:34:51] !log awight@deploy1001 Synchronized php-1.35.0-wmf.37/extensions/TwoColConflict/: BACON: [[gerrit:607248|Fix broken copy link in JS mode (T253724)]] (duration: 00m 57s) [11:34:54] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:55] T253724: Add copy to clipboard function for text of user's revision (JS) - https://phabricator.wikimedia.org/T253724 [11:35:35] !log EU BACON cooked [11:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:12] 10Operations, 10DBA, 10CAS-SSO, 10User-jbond: Request new database or idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10jbond) [11:37:00] 10Operations, 10ops-eqiad, 10netops: asw2-d1-eqiad:VCP failure - https://phabricator.wikimedia.org/T252797 (10ayounsi) 05Open→03Stalled Larger discussion in T256112. Feel free to un-assign it from you until we figure out the overall plan. [11:40:04] 10Operations, 10ops-eqiad: Degraded RAID on db1088 - https://phabricator.wikimedia.org/T256121 (10ops-monitoring-bot) [11:40:28] kormat: let's close that as invalid ^ (as it is something known) [11:45:32] 10Operations, 10DBA, 10CAS-SSO, 10User-jbond: Request new database or idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10Marostegui) Thanks for the detailed ticket. A few comments. 1) Let's not use `-`, if you really want that, we can go for `can_test`. 2) Can probably place this into m... [11:45:34] marostegui: duplicate of T255928? (which was merged into crash ticket) [11:45:34] T255928: Degraded RAID on db1088 - https://phabricator.wikimedia.org/T255928 [11:45:52] which would suggest that closing that one would create a new one [11:45:52] Majavah: yes [11:46:33] Majavah: No, it triggered again cause the host was reimaged and hence deleted from the monitoring system [11:54:04] (03PS2) 10DCausse: Set proper language code for some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607235 (https://phabricator.wikimedia.org/T250810) [11:54:23] (03CR) 10Marostegui: "This looks good, let's do a final PCC to confirm?" 
[puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [11:56:27] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [11:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:21] PROBLEM - Host etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [11:58:53] PROBLEM - Host kubetcd1005 is DOWN: PING CRITICAL - Packet loss = 100% [12:00:33] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:39] RECOVERY - Host kubetcd1005 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [12:01:01] RECOVERY - Host etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [12:02:13] 10Operations, 10Analytics-Radar, 10SRE-Access-Requests, 10Patch-For-Review, 10Product-Analytics (Kanban): Creation of a new POSIX group and system user for the Product Analytics team - https://phabricator.wikimedia.org/T255039 (10elukey) @mpopov I had a chat with Moritz, I'll take care of amending the co... [12:02:28] 10Operations, 10Analytics-Radar, 10SRE-Access-Requests, 10Patch-For-Review, and 2 others: Creation of a new POSIX group and system user for the Product Analytics team - https://phabricator.wikimedia.org/T255039 (10elukey) a:05mpopov→03elukey [12:04:28] ACKNOWLEDGEMENT - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Telia Carrier Reference: 01177245 - The acknowledgement expires at: 2020-06-23 15:03:55. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:04:28] ACKNOWLEDGEMENT - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Telia Carrier Reference: 01177245 - The acknowledgement expires at: 2020-06-23 15:03:55. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:04:47] 10Operations, 10ops-eqiad: Degraded RAID on db1088 - https://phabricator.wikimedia.org/T256121 (10Kormat) 05Open→03Invalid Already known: {T255927} [12:08:19] (03CR) 10Marostegui: [C: 03+1] install_server: Reimage and wipe db1145 into stretch [puppet] - 10https://gerrit.wikimedia.org/r/607267 (https://phabricator.wikimedia.org/T254871) (owner: 10Jcrespo) [12:11:06] (03PS9) 10Elukey: Introduce profile::mariadb::misc::analytics [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) [12:13:20] (03CR) 10Elukey: "Sure: https://puppet-compiler.wmflabs.org/compiler1002/23387/db1108.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [12:21:17] (03CR) 10Arturo Borrero Gonzalez: "I plan to revert this in https://gerrit.wikimedia.org/r/c/operations/puppet/+/607251" [puppet] - 10https://gerrit.wikimedia.org/r/597257 (https://phabricator.wikimedia.org/T120225) (owner: 10Arturo Borrero Gonzalez) [12:21:41] (03CR) 10Arturo Borrero Gonzalez: "BTW this is basically a revert of https://gerrit.wikimedia.org/r/c/operations/puppet/+/597257 plus adding the acme-chief bits" [puppet] - 10https://gerrit.wikimedia.org/r/607251 (https://phabricator.wikimedia.org/T120225) (owner: 10Arturo Borrero Gonzalez) [12:23:19] (03PS6) 10Kormat: mariadb: Add 2 profiles to allow finer-grained cumin selection [puppet] - 10https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) [12:24:36] (03PS4) 10Alexandros Kosiaris: kask: added support egress rules kask: added support dst_ports rules without adding cidr field. Added missing rule [deployment-charts] - 10https://gerrit.wikimedia.org/r/597776 (owner: 10Apakhomov) [12:25:56] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks, this works just fine!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/597776 (owner: 10Apakhomov) [12:26:25] (03Merged) 10jenkins-bot: kask: added support egress rules kask: added support dst_ports rules without adding cidr field. Added missing rule [deployment-charts] - 10https://gerrit.wikimedia.org/r/597776 (owner: 10Apakhomov) [12:27:54] (03CR) 10Marostegui: "Thanks - I just realised a few things:" [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [12:31:19] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [12:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:40] (03PS1) 10Alexandros Kosiaris: recommendation-api: Add egress support [deployment-charts] - 10https://gerrit.wikimedia.org/r/607277 (https://phabricator.wikimedia.org/T249927) [12:33:57] PROBLEM - Host kubetcd1004 is DOWN: PING CRITICAL - Packet loss = 100% [12:36:06] (03CR) 10Jcrespo: [C: 03+2] install_server: Reimage and wipe db1145 into stretch [puppet] - 10https://gerrit.wikimedia.org/r/607267 (https://phabricator.wikimedia.org/T254871) (owner: 10Jcrespo) [12:36:14] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:14] (03PS3) 10DCausse: Set proper language code for some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607235 (https://phabricator.wikimedia.org/T250810) [12:39:51] (03PS4) 10Filippo Giunchedi: pontoon: first skeleton [puppet] - 10https://gerrit.wikimedia.org/r/606961 [12:40:45] RECOVERY - Host kubetcd1004 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [12:41:00] (03CR) 10jerkins-bot: [V: 04-1] pontoon: first skeleton [puppet] - 10https://gerrit.wikimedia.org/r/606961 (owner: 10Filippo Giunchedi) [12:43:31] (03CR) 10Elukey: "> Thanks - I just realised a few things:" [puppet] - 10https://gerrit.wikimedia.org/r/553742 
(https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [12:45:08] !log Deploy schema change on s6 codfw master (lag will appear on codfw) - T253276 [12:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:12] T253276: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 [12:45:26] !log draining ganeti1011 for eventual reboot [12:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:41] (03PS4) 10Arturo Borrero Gonzalez: toolforge: mailrelay: migrate TLS cert to acme-chief [puppet] - 10https://gerrit.wikimedia.org/r/607251 (https://phabricator.wikimedia.org/T120225) [12:45:43] (03PS1) 10Arturo Borrero Gonzalez: toolforge: mailrelay: introduce SRS to correctly envelope forwarded emails [puppet] - 10https://gerrit.wikimedia.org/r/607279 (https://phabricator.wikimedia.org/T120225) [12:50:32] (03CR) 10Alexandros Kosiaris: [C: 03+2] recommendation-api: Add egress support [deployment-charts] - 10https://gerrit.wikimedia.org/r/607277 (https://phabricator.wikimedia.org/T249927) (owner: 10Alexandros Kosiaris) [12:51:03] (03Merged) 10jenkins-bot: recommendation-api: Add egress support [deployment-charts] - 10https://gerrit.wikimedia.org/r/607277 (https://phabricator.wikimedia.org/T249927) (owner: 10Alexandros Kosiaris) [12:54:00] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime [12:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:17] (03PS7) 10Kormat: mariadb: Add 2 profiles to allow finer-grained cumin selection [puppet] - 10https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) [12:54:38] (03CR) 10QChris: [C: 04-1] "> It seems the inline comments from Hashar are still valid." [puppet] - 10https://gerrit.wikimedia.org/r/507072 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [12:55:16] (03CR) 10Muehlenhoff: [C: 03+1] "Looks great!" 
[puppet] - 10https://gerrit.wikimedia.org/r/607257 (owner: 10Jbond) [12:56:05] (03PS1) 10Filippo Giunchedi: puppetmaster: fix spec tests for geoip [puppet] - 10https://gerrit.wikimedia.org/r/607280 [12:56:36] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:56] (03PS2) 10Filippo Giunchedi: puppetmaster: fix spec tests for geoip [puppet] - 10https://gerrit.wikimedia.org/r/607280 [12:56:58] (03PS5) 10Filippo Giunchedi: pontoon: first skeleton [puppet] - 10https://gerrit.wikimedia.org/r/606961 [12:58:46] 10Operations, 10serviceops, 10Kubernetes, 10Patch-For-Review, and 2 others: Support kubernetes Egress networkpolicies in our helm charts - https://phabricator.wikimedia.org/T249927 (10akosiaris) 05Open→03Resolved With the last 2 changes merged, the work on this has been completed. Many thanks @apakhomov ! [12:58:50] 10Operations, 10serviceops, 10Kubernetes, 10User-fsero, 10User-jijiki: Upgrade calico in production to version 2.4+ - https://phabricator.wikimedia.org/T207804 (10akosiaris) [13:00:27] (03CR) 10Kormat: "Puppet compiler run against a slew of hosts, and the diffs look good." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat) [13:00:47] (03CR) 10Kormat: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat) [13:01:46] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/607280 (owner: 10Filippo Giunchedi) [13:02:05] (03CR) 10QChris: "I'd prefer to not include this in the upcoming Gerrit upgrade, as" [puppet] - 10https://gerrit.wikimedia.org/r/509542 (owner: 10Paladox) [13:02:19] (03CR) 10Filippo Giunchedi: [C: 03+2] puppetmaster: fix spec tests for geoip [puppet] - 10https://gerrit.wikimedia.org/r/607280 (owner: 10Filippo Giunchedi) [13:02:27] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: first skeleton (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/606961 (owner: 10Filippo Giunchedi) [13:02:51] (03PS2) 10Kormat: mariadb: Add monitoring for lag spikes (v2) [puppet] - 10https://gerrit.wikimedia.org/r/607039 (https://phabricator.wikimedia.org/T253120) [13:02:54] (03CR) 10Alexandros Kosiaris: [C: 04-1] WIP: chartmuseum: Add initial module, profile and role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606956 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [13:03:14] (03CR) 10QChris: "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/556265 (https://phabricator.wikimedia.org/T240266) (owner: 10Paladox) [13:05:16] (03PS1) 10Muehlenhoff: Add pwstore (dummy) profile [puppet] - 10https://gerrit.wikimedia.org/r/607281 [13:05:33] (03CR) 10Marostegui: "> > Thanks - I just realised a few things:" [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [13:06:23] (03CR) 10jerkins-bot: [V: 04-1] Add pwstore (dummy) profile [puppet] - 10https://gerrit.wikimedia.org/r/607281 (owner: 10Muehlenhoff) [13:08:20] (03PS2) 10Muehlenhoff: Add pwstore (dummy) profile [puppet] - 
10https://gerrit.wikimedia.org/r/607281 [13:08:38] (03CR) 10QChris: "LGTM. I'd probably wait with landing it after the upgrade and err on the" [puppet] - 10https://gerrit.wikimedia.org/r/606434 (owner: 10Paladox) [13:10:21] 10Operations, 10DBA, 10CAS-SSO, 10User-jbond: Request new database or idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10jbond) >>! In T256120#6248144, @Marostegui wrote: > Thanks for the detailed ticket. > A few comments. > > 1) Let's not use `-`, if you really want that, we can go for... [13:11:19] (03CR) 10Jbond: [C: 03+2] profile::idp::httpd::client: add CASScope [puppet] - 10https://gerrit.wikimedia.org/r/607257 (owner: 10Jbond) [13:13:03] 10Operations, 10DBA, 10CAS-SSO, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10Kormat) [13:13:52] (03PS3) 10Kormat: mariadb: Add monitoring for lag spikes (v2) [puppet] - 10https://gerrit.wikimedia.org/r/607039 (https://phabricator.wikimedia.org/T253120) [13:14:16] (03PS8) 10Kormat: mariadb: Add 2 profiles to allow finer-grained cumin selection [puppet] - 10https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) [13:28:23] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:39] (03CR) 10CDanis: [C: 03+1] prometheus: move global availability rules to new names (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607256 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [13:32:48] (03CR) 10CDanis: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/607134 (owner: 10CDanis) [13:34:06] (03PS2) 10CDanis: check_prometheus: rewrite 'instance' output to omit port numbers [puppet] - 10https://gerrit.wikimedia.org/r/607134 [13:34:30] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:34:32] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:57] !log draining ganeti1012 for eventual reboot [13:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:21] (03PS8) 10Hashar: scap: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507072 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [13:35:53] (03CR) 10Paladox: scap: Stop cloning over /p/ (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507072 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [13:36:01] 10Operations, 10ops-eqiad, 10DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (10Jclark-ctr) @marostegui we have spare bbu. i happen to be on site today can. Are you available to assist? [13:39:18] 10Operations, 10ops-eqiad, 10DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (10Kormat) @Jclark-ctr - i'm available. I can power the host off now for you to do the replacement. Just let me know when it's back. Cheers. [13:39:25] (03CR) 10QChris: [C: 04-1] gerrit: Redirect /r/projects/(.+),dashboards/(.+) to /r/p/$1/+/dashboard/$2 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/606432 (owner: 10Paladox) [13:39:42] (03PS9) 10Hashar: scap: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507072 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [13:40:10] (03CR) 10jerkins-bot: [V: 04-1] scap: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507072 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [13:40:11] 10Operations, 10ops-eqiad, 10DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (10Kormat) Host is now off. 
[13:42:00] (03PS8) 10Paladox: Redirect Gerrit v2.15 dashboard links to those of v3.2 [puppet] - 10https://gerrit.wikimedia.org/r/606432 [13:42:03] (03CR) 10Paladox: Redirect Gerrit v2.15 dashboard links to those of v3.2 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/606432 (owner: 10Paladox) [13:42:11] (03PS9) 10Paladox: Redirect Gerrit v2.15 dashboard links to those of v3.2 [puppet] - 10https://gerrit.wikimedia.org/r/606432 [13:42:22] (03PS4) 10Paladox: gerrit: drop old redirect as workaround for broken browser detection [puppet] - 10https://gerrit.wikimedia.org/r/606434 [13:45:18] ACKNOWLEDGEMENT - Host db1088.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Kormat Host down for BBU replacement. [13:46:04] 10Operations, 10ops-eqiad, 10DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (10Jclark-ctr) @Kormat bbu replaced powering up right now [13:53:01] (03PS1) 10Andrew Bogott: Galera backups: use default policies [puppet] - 10https://gerrit.wikimedia.org/r/607286 [13:54:16] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:54:16] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [13:54:18] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:20] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:26] 10Operations, 10ops-eqiad, 10DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (10Kormat) @Jclark-ctr : great, thanks :) The diagnostics are happy with the new bbu: ` root@db1088:~# hpssacli ctrl all show status Smart Array P840 in Slot 1 Controller Status: OK Cache Status: Not Configur... 
[13:56:04] 10Operations, 10ops-eqiad, 10DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (10Kormat) a:05Jclark-ctr→03Kormat [13:56:48] (03CR) 10Andrew Bogott: [C: 03+2] Galera backups: use default policies [puppet] - 10https://gerrit.wikimedia.org/r/607286 (owner: 10Andrew Bogott) [13:58:30] (03PS10) 10Hashar: scap: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507072 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [13:58:59] (03CR) 10jerkins-bot: [V: 04-1] scap: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507072 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [14:00:18] 10Operations, 10ops-eqiad, 10DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (10Jclark-ctr) 05Open→03Resolved [14:01:28] (03CR) 10Filippo Giunchedi: prometheus: move global availability rules to new names (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607256 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [14:01:33] 10Operations, 10DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (10Kormat) [14:01:49] (03PS11) 10Hashar: scap: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507072 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [14:02:05] 10Operations, 10DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (10Kormat) 05Resolved→03Open Re-opening for us to keep track of re-adding the host back into service. 
[14:02:08] (03PS1) 10Andrew Bogott: Fixed a meaningless typo [puppet] - 10https://gerrit.wikimedia.org/r/607287 [14:02:27] (03PS3) 10Filippo Giunchedi: prometheus: move global availability rules to new names [puppet] - 10https://gerrit.wikimedia.org/r/607256 (https://phabricator.wikimedia.org/T233956) [14:03:31] (03CR) 10Filippo Giunchedi: [C: 03+1] check_prometheus: rewrite 'instance' output to omit port numbers [puppet] - 10https://gerrit.wikimedia.org/r/607134 (owner: 10CDanis) [14:04:11] (03CR) 10Andrew Bogott: [C: 03+2] Fixed a meaningless typo [puppet] - 10https://gerrit.wikimedia.org/r/607287 (owner: 10Andrew Bogott) [14:04:38] (03CR) 10QChris: Redirect Gerrit v2.15 dashboard links to those of v3.2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606432 (owner: 10Paladox) [14:05:10] !log mholloway-shell@deploy1001 Started deploy [recommendation-api/deploy@db7fd80]: Update recommendation-api to 7e00177 [14:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:49] (03CR) 10CDanis: [C: 03+2] check_prometheus: rewrite 'instance' output to omit port numbers [puppet] - 10https://gerrit.wikimedia.org/r/607134 (owner: 10CDanis) [14:06:29] (03CR) 10Hashar: "Sorry that was spammy." 
[puppet] - 10https://gerrit.wikimedia.org/r/507072 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [14:06:38] (03CR) 10Paladox: Redirect Gerrit v2.15 dashboard links to those of v3.2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606432 (owner: 10Paladox) [14:06:45] (03PS10) 10Paladox: Redirect Gerrit v2.15 dashboard links to those of v3.2 [puppet] - 10https://gerrit.wikimedia.org/r/606432 [14:07:20] (03PS1) 10Jcrespo: mariadb: Setup db1145 as a mariadb backup source for s4, s5 [puppet] - 10https://gerrit.wikimedia.org/r/607288 (https://phabricator.wikimedia.org/T254871) [14:08:23] !log mholloway-shell@deploy1001 Finished deploy [recommendation-api/deploy@db7fd80]: Update recommendation-api to 7e00177 (duration: 03m 13s) [14:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:18] (03PS11) 10Paladox: Redirect Gerrit v2.15 dashboard links to those of v3.2 [puppet] - 10https://gerrit.wikimedia.org/r/606432 [14:11:03] (03CR) 10Jcrespo: [C: 03+2] mariadb: Setup db1145 as a mariadb backup source for s4, s5 [puppet] - 10https://gerrit.wikimedia.org/r/607288 (https://phabricator.wikimedia.org/T254871) (owner: 10Jcrespo) [14:12:55] (03CR) 10QChris: [C: 03+1] "Whoa. That looks great." 
[puppet] - 10https://gerrit.wikimedia.org/r/507072 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [14:15:51] (03CR) 10Hashar: [C: 03+1] "It is probably good enough yes ;]" [puppet] - 10https://gerrit.wikimedia.org/r/507072 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [14:26:31] (03PS1) 10QChris: gerrit: Drop javamelody-deps library for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/607291 [14:32:06] (03CR) 10QChris: [C: 03+1] Redirect Gerrit v2.15 dashboard links to those of v3.2 [puppet] - 10https://gerrit.wikimedia.org/r/606432 (owner: 10Paladox) [14:32:36] (03CR) 10Paladox: [C: 03+1] gerrit: Drop javamelody-deps library for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/607291 (owner: 10QChris) [14:33:11] (03PS5) 10Paladox: gerrit: drop old redirect as workaround for broken browser detection [puppet] - 10https://gerrit.wikimedia.org/r/606434 [14:36:20] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10akosiaris) @Jclark-ctr: OK, how about 1 host per week? no need for specific timeframes. I 'll have the host depooled, emptied, powered off and downtimed in icinga and ready for the memory upgr... [14:43:22] (03PS9) 10Kormat: Add native mysql spicerack module. [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409) [14:45:19] * Urbanecm messing with mwdebug1001 [14:48:10] !log urbanecm@deploy1001 Synchronized private/PrivateSettings.php: T250887 (duration: 00m 58s) [14:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:02] * Urbanecm done [14:49:26] (03CR) 10jerkins-bot: [V: 04-1] Add native mysql spicerack module. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat) [14:50:27] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:01] PROBLEM - Host kubestagetcd1006 is DOWN: PING CRITICAL - Packet loss = 100% [14:54:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops, 10cloud-services-team (Hardware): (Need By: 2020-06-12) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (10Jclark-ctr) sfp`s arrived today will have fibers finished today [14:55:22] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:29] RECOVERY - Host kubestagetcd1006 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [14:56:07] (03PS1) 10BryanDavis: wncs: Set default prometheus_nodes value [puppet] - 10https://gerrit.wikimedia.org/r/607297 (https://phabricator.wikimedia.org/T256134) [14:56:38] !log failover ganeti master in eqiad to ganeti1011 [14:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:40] (03PS1) 10Ppchelko: EventBus: Emit kafka purges for everything [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607298 (https://phabricator.wikimedia.org/T250781) [15:00:38] (03PS9) 10Privacybatm: transferpy: Package transferpy [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) [15:01:11] (03PS2) 10Ppchelko: EventBus: Emit kafka purges for everything [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607298 (https://phabricator.wikimedia.org/T250781) [15:01:45] (03CR) 10Privacybatm: "Thank you for the comments, I have resolved them." 
(033 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm) [15:04:15] (03PS1) 10Muehlenhoff: Reduce TTL for IDP CNAMEs to 5 minutes [dns] - 10https://gerrit.wikimedia.org/r/607299 [15:09:53] (03Abandoned) 10Hashar: Bump CI puppet Gem version to 5.5.10 [puppet] - 10https://gerrit.wikimedia.org/r/526104 (https://phabricator.wikimedia.org/T228657) (owner: 10Alexandros Kosiaris) [15:10:52] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [15:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:10] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [15:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:12] 10Operations, 10vm-requests: Site: eqiad/codfw 2 VM request for releases - https://phabricator.wikimedia.org/T255590 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `releases1002.eqiad.wmnet` - releases1002.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - F... 
[15:12:55] !log removing ganeti VM releases1002 in eqiad row_A - will recreate in another row to re-balance (T255590) [15:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:59] T255590: Site: eqiad/codfw 2 VM request for releases - https://phabricator.wikimedia.org/T255590 [15:13:35] (03CR) 10Cwhite: "> Patch Set 1:" [debs/mtail] (cross_dist_build) - 10https://gerrit.wikimedia.org/r/607144 (https://phabricator.wikimedia.org/T255776) (owner: 10Cwhite) [15:13:38] 10Operations, 10ops-codfw, 10DC-Ops: Decomission oresrdb2002.codfw.wmnet - https://phabricator.wikimedia.org/T254240 (10Papaul) [15:13:55] 10Operations, 10ops-codfw, 10DC-Ops: Decomission oresrdb2002.codfw.wmnet - https://phabricator.wikimedia.org/T254240 (10Papaul) 05Open→03Resolved Complete [15:14:34] 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Replicated ticket registry - https://phabricator.wikimedia.org/T233933 (10MoritzMuehlenhoff) Did some tests on idp-test* and it's working nicely; sessions persisted across a Tomcat restart when explicitly addressing the failover IDP, they were also... [15:16:30] (03PS10) 10Kormat: Add native mysql spicerack module. [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409) [15:18:34] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [15:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:30] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on pc2007 - https://phabricator.wikimedia.org/T255904 (10Papaul) ` Create Dispatch: Success You have successfully submitted request SR1028000240. [15:22:54] (03CR) 10Jbond: [C: 03+1] Reduce TTL for IDP CNAMEs to 5 minutes [dns] - 10https://gerrit.wikimedia.org/r/607299 (owner: 10Muehlenhoff) [15:23:07] (03CR) 10jerkins-bot: [V: 04-1] Add native mysql spicerack module. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat) [15:23:15] (03PS1) 10Ssingh: prometheus: add wikidough statistics [puppet] - 10https://gerrit.wikimedia.org/r/607301 (https://phabricator.wikimedia.org/T252132) [15:24:03] (03CR) 10Arturo Borrero Gonzalez: wncs: Set default prometheus_nodes value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607297 (https://phabricator.wikimedia.org/T256134) (owner: 10BryanDavis) [15:24:25] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add wikidough statistics [puppet] - 10https://gerrit.wikimedia.org/r/607301 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:25:16] (03PS2) 10BryanDavis: wmcs: Set default prometheus_nodes value [puppet] - 10https://gerrit.wikimedia.org/r/607297 (https://phabricator.wikimedia.org/T256134) [15:26:19] (03CR) 10Bstorm: "I wonder how much storage this will use." [puppet] - 10https://gerrit.wikimedia.org/r/607297 (https://phabricator.wikimedia.org/T256134) (owner: 10BryanDavis) [15:26:45] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [15:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:49] 10Operations, 10vm-requests, 10Performance-Team (Radar): vm request for xhgui - https://phabricator.wikimedia.org/T238098 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `xhgui1001.eqiad.wmnet` - xhgui1001.eqiad.wmnet (**PASS**) - Downtimed host on Icinga -... 
[15:27:57] !log removing ganeti VM xhgui1001 from eqiad row_A, will recreate in another row for rebalancing VMs between rows (T180761 T238098) [15:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:07] T180761: Move XHGui from tungsten to xhgui-001 - https://phabricator.wikimedia.org/T180761 [15:28:07] T238098: vm request for xhgui - https://phabricator.wikimedia.org/T238098 [15:28:14] !log installing libvpx security updates [15:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:14] (03PS11) 10Kormat: Add native mysql spicerack module. [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409) [15:31:41] (03PS12) 10Kormat: Add native mysql spicerack module. [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409) [15:32:22] 10Operations: Create ssh keypair for integration/docroot deployment with scap - https://phabricator.wikimedia.org/T256138 (10hashar) [15:32:37] 10Operations: Create ssh keypair for integration/docroot deployment with scap - https://phabricator.wikimedia.org/T256138 (10hashar) [15:32:43] (03CR) 10Kormat: "It turns out that trying to document how this works has forced me to rethink some things :)" (037 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat) [15:34:02] 10Operations, 10serviceops, 10CPT Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10MSantos) [15:36:56] (03CR) 10Bstorm: "For numbers now: https://prometheus.wmflabs.org/cloud/targets" [puppet] - 10https://gerrit.wikimedia.org/r/607297 (https://phabricator.wikimedia.org/T256134) (owner: 10BryanDavis) [15:37:42] !log prune nginx packages on mw1380-mw1412 T255565 [15:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:37:47] T255565: Remaining nginx packages on some mw servers - https://phabricator.wikimedia.org/T255565 [15:40:36] (03CR) 10Bstorm: "Wait, is this just to get a firewall setting in one VM (and by default all VMs)? https://prometheus.wmflabs.org/cloud/targets#job-analytic" [puppet] - 10https://gerrit.wikimedia.org/r/607297 (https://phabricator.wikimedia.org/T256134) (owner: 10BryanDavis) [15:41:23] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thanks for the answers" [debs/mtail] (cross_dist_build) - 10https://gerrit.wikimedia.org/r/607144 (https://phabricator.wikimedia.org/T255776) (owner: 10Cwhite) [15:42:26] (03PS1) 10Dzahn: move xhgui1001 from row A to row D to rebalance VMs [dns] - 10https://gerrit.wikimedia.org/r/607304 [15:42:40] 10Operations, 10observability: Icinga refresh hardware selection (2020) - https://phabricator.wikimedia.org/T251644 (10Papaul) [15:42:48] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: move global availability rules to new names [puppet] - 10https://gerrit.wikimedia.org/r/607256 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [15:43:53] (03PS2) 10Dzahn: move xhgui1001 from row A to row D to rebalance VMs [dns] - 10https://gerrit.wikimedia.org/r/607304 (https://phabricator.wikimedia.org/T238098) [15:43:56] 10Operations, 10ops-codfw, 10decommission-hardware: decommission ganeti200[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T255554 (10Papaul) [15:44:50] (03CR) 10BryanDavis: "> Wait, is this just to get a firewall setting in one VM (and by" [puppet] - 10https://gerrit.wikimedia.org/r/607297 (https://phabricator.wikimedia.org/T256134) (owner: 10BryanDavis) [15:45:42] (03CR) 10Muehlenhoff: [C: 03+1] move xhgui1001 from row A to row D to rebalance VMs [dns] - 10https://gerrit.wikimedia.org/r/607304 (https://phabricator.wikimedia.org/T238098) (owner: 10Dzahn) [15:46:49] (03CR) 10Bstorm: "Seems legit to me. I'll merge it." 
[puppet] - 10https://gerrit.wikimedia.org/r/607297 (https://phabricator.wikimedia.org/T256134) (owner: 10BryanDavis) [15:46:55] (03CR) 10Bstorm: [C: 03+2] wmcs: Set default prometheus_nodes value [puppet] - 10https://gerrit.wikimedia.org/r/607297 (https://phabricator.wikimedia.org/T256134) (owner: 10BryanDavis) [15:47:17] (03CR) 10Dzahn: [C: 03+2] move xhgui1001 from row A to row D to rebalance VMs [dns] - 10https://gerrit.wikimedia.org/r/607304 (https://phabricator.wikimedia.org/T238098) (owner: 10Dzahn) [15:47:21] 10Operations, 10ops-codfw, 10netops: (Need by: End of July-2020 ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (10Papaul) @Jgreen I added you to this task so you can keep in track of the replacement of the mgmt switch fmsw-c8. [15:47:37] !log prune nginx packages on mwdebug hosts T255565 [15:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:42] T255565: Remaining nginx packages on some mw servers - https://phabricator.wikimedia.org/T255565 [15:51:02] 10Operations, 10serviceops: Remaining nginx packages on some mw servers - https://phabricator.wikimedia.org/T255565 (10MoritzMuehlenhoff) A few gotchas on this one, noting on the task as it can happen to other roles currently using Nginx when moving to Envoy: * The nginx init scripts get marked as a conffile,... 
[15:51:55] (03PS2) 10Ssingh: prometheus: add wikidough statistics [puppet] - 10https://gerrit.wikimedia.org/r/607301 (https://phabricator.wikimedia.org/T252132) [15:52:52] (03PS1) 10Dzahn: move releases1002 from row A to row to rebalance VMs [dns] - 10https://gerrit.wikimedia.org/r/607305 (https://phabricator.wikimedia.org/T255590) [15:54:33] (03CR) 10Dzahn: [C: 03+2] move releases1002 from row A to row to rebalance VMs [dns] - 10https://gerrit.wikimedia.org/r/607305 (https://phabricator.wikimedia.org/T255590) (owner: 10Dzahn) [15:54:38] (03PS2) 10Dzahn: move releases1002 from row A to row to rebalance VMs [dns] - 10https://gerrit.wikimedia.org/r/607305 (https://phabricator.wikimedia.org/T255590) [15:54:54] 10Operations, 10ops-codfw, 10netops: (Need by: End of July-2020 ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (10Papaul) [15:57:19] andrewbogott: fyi. this shows up currently in Icinga: "No backups: 3 (cloudcontrol2001-dev, ...)". It's probably normal until the first one is done. But also a good way to see it is monitored and to keep an eye on it turning green soonish. [15:57:47] (03CR) 10Bstorm: [C: 03+1] toolforge: mailrelay: migrate TLS cert to acme-chief [puppet] - 10https://gerrit.wikimedia.org/r/607251 (https://phabricator.wikimedia.org/T120225) (owner: 10Arturo Borrero Gonzalez) [15:58:55] (03CR) 10Bstorm: [C: 03+2] cloud nfs: clean up some of the secondary cluster materials [puppet] - 10https://gerrit.wikimedia.org/r/607142 (https://phabricator.wikimedia.org/T224747) (owner: 10Bstorm) [15:59:36] chaomodus: ^ same as above. netbox-dev hosts are showing up in Icinga alerts about backups since they don't backup anything. maybe we should do some puppet code "if dev host then skip backups"? https://phabricator.wikimedia.org/T253140#6247568 [15:59:51] ah [15:59:55] thatks :) [15:59:59] yw [16:00:04] godog and _joe_: #bothumor Q:Why did functions stop calling each other? A:They had arguments. 
Rise for Puppet request window(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200623T1600). [16:00:04] Amir1: A patch you scheduled for Puppet request window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:54] 10Operations, 10Wikimedia-Logstash, 10observability, 10User-fgiunchedi: VM requests for additional Logstash capacity - https://phabricator.wikimedia.org/T256139 (10fgiunchedi) [16:01:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: mailrelay: migrate TLS cert to acme-chief [puppet] - 10https://gerrit.wikimedia.org/r/607251 (https://phabricator.wikimedia.org/T120225) (owner: 10Arturo Borrero Gonzalez) [16:01:22] Amir1: here? taking a look at your patches [16:01:35] !log 1.35.0-wmf.38 was branched at a35f7318 for https://phabricator.wikimedia.org/T254175 [16:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:48] Amir1: ok comments only, I'll merge [16:02:06] (03CR) 10Filippo Giunchedi: [C: 03+2] wmcs: Remove "slave" from comment [puppet] - 10https://gerrit.wikimedia.org/r/605383 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [16:02:13] godog: if you get some spare time after I could use a ssh key pair to be generated for keyholder :-] Or i will ask e.ma tomorrow ( https://phabricator.wikimedia.org/T256138 ) [16:02:25] (03CR) 10Filippo Giunchedi: [C: 03+2] rabbitmq: Rename "slave" to "replica" in comment [puppet] - 10https://gerrit.wikimedia.org/r/605382 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [16:02:33] (03CR) 10Brennen Bearnes: [C: 03+2] Branch commit for wmf/1.35.0-wmf.38 [core] (wmf/1.35.0-wmf.38) - 10https://gerrit.wikimedia.org/r/607158 (https://phabricator.wikimedia.org/T254175) (owner: 10TrainBranchBot) [16:03:31] hashar: yeah sorry I can't ATM :| [16:04:42] godog: sorry, was caught up on a phone call [16:04:51] yes, 
it's just comment [16:04:56] Thanks! [16:08:53] 10Operations, 10ops-codfw, 10netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (10Papaul) @ayounsi i see you changed the CyrusOne OOB configuration on the new mr1. I had it configured on ge-0/0/5 and you set it on ge-0/0/7. The last port on the SRX300 is ge-0/0/5 and the... [16:15:50] (03PS2) 10Dzahn: dumps: move ferm rules for xmldumps from module to profile [puppet] - 10https://gerrit.wikimedia.org/r/606739 (https://phabricator.wikimedia.org/T114209) [16:17:35] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:20:20] godog: still merging? [16:20:39] mutante: oops! yes sorry [16:20:43] mutante: merging now [16:20:55] thanks! [16:21:07] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:21:28] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/23392/labstore1006.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/606739 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [16:22:13] mutante: np, {{done}} [16:24:36] (03Merged) 10jenkins-bot: Branch commit for wmf/1.35.0-wmf.38 [core] (wmf/1.35.0-wmf.38) - 10https://gerrit.wikimedia.org/r/607158 (https://phabricator.wikimedia.org/T254175) (owner: 10TrainBranchBot) [16:26:10] (03PS1) 10CRusnov: profile::netbox::postgres: Parameterize backups [puppet] - 10https://gerrit.wikimedia.org/r/607310 (https://phabricator.wikimedia.org/T253140) [16:31:32] (03PS1) 10BBlack: Add Elastic IP to wikimedia_nets for PAPI [puppet] - 10https://gerrit.wikimedia.org/r/607313 (https://phabricator.wikimedia.org/T255524) [16:33:59] 10Operations, 10serviceops, 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10RLazarus) Side note: This question is also interesting from a DC switchover perspective (T243316) since that will also effectively be a Redis flush. In previou... 
[16:36:11] (03CR) 10CDanis: [C: 03+1] Add Elastic IP to wikimedia_nets for PAPI [puppet] - 10https://gerrit.wikimedia.org/r/607313 (https://phabricator.wikimedia.org/T255524) (owner: 10BBlack) [16:37:18] (03PS1) 10Brennen Bearnes: testwikis wikis to 1.35.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607314 [16:37:20] (03CR) 10Brennen Bearnes: [C: 03+2] testwikis wikis to 1.35.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607314 (owner: 10Brennen Bearnes) [16:38:09] (03Merged) 10jenkins-bot: testwikis wikis to 1.35.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607314 (owner: 10Brennen Bearnes) [16:39:03] !log brennen@deploy1001 Started scap: testwikis wikis to 1.35.0-wmf.38 [16:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:28] (03PS1) 10Papaul: DNS: Remove mgmt asset tag for ganeti200[1-6] [dns] - 10https://gerrit.wikimedia.org/r/607315 [16:41:04] (03PS3) 10Ssingh: prometheus: add wikidough statistics [puppet] - 10https://gerrit.wikimedia.org/r/607301 (https://phabricator.wikimedia.org/T252132) [16:44:17] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt asset tag for ganeti200[1-6] [dns] - 10https://gerrit.wikimedia.org/r/607315 (owner: 10Papaul) [16:47:41] 10Operations, 10ops-codfw, 10decommission-hardware, 10Patch-For-Review: decommission ganeti200[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T255554 (10Papaul) [16:50:27] (03PS1) 10Andrew Bogott: wmcs: install galera on eqiad1 cloudcontrol nodes [puppet] - 10https://gerrit.wikimedia.org/r/607318 (https://phabricator.wikimedia.org/T242455) [16:56:40] 10Operations, 10ops-codfw, 10netops: codfw: Decommission old mr1 - https://phabricator.wikimedia.org/T256143 (10Papaul) [16:57:08] 10Operations, 10ops-codfw, 10netops: codfw: Decommission old mr1 - https://phabricator.wikimedia.org/T256143 (10Papaul) p:05Triage→03Low [16:58:52] (03PS2) 10Andrew Bogott: wmcs: install galera on eqiad1 cloudcontrol nodes [puppet] - 
10https://gerrit.wikimedia.org/r/607318 (https://phabricator.wikimedia.org/T242455) [16:59:58] (03CR) 10jerkins-bot: [V: 04-1] wmcs: install galera on eqiad1 cloudcontrol nodes [puppet] - 10https://gerrit.wikimedia.org/r/607318 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [17:00:04] halfak and accraze: Your horoscope predicts another unfortunate Services – Graphoid / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200623T1700). [17:02:22] 10Operations, 10netops, 10observability, 10Patch-For-Review: Determine & implement near-term method for escalating network alerts - https://phabricator.wikimedia.org/T237587 (10CDanis) 05Open→03Resolved I think we can call this closed? LibreNMS and Fastnetmon both send pages (via Icinga) quite well now. [17:04:32] (03PS3) 10Andrew Bogott: wmcs: install galera on eqiad1 cloudcontrol nodes [puppet] - 10https://gerrit.wikimedia.org/r/607318 (https://phabricator.wikimedia.org/T242455) [17:05:54] (03CR) 10jerkins-bot: [V: 04-1] wmcs: install galera on eqiad1 cloudcontrol nodes [puppet] - 10https://gerrit.wikimedia.org/r/607318 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [17:06:39] 10Operations: Script to point SRE local machine traffic to another LB - https://phabricator.wikimedia.org/T244761 (10CDanis) a:03CDanis [17:07:55] (03PS4) 10Andrew Bogott: wmcs: install galera on eqiad1 cloudcontrol nodes [puppet] - 10https://gerrit.wikimedia.org/r/607318 (https://phabricator.wikimedia.org/T242455) [17:11:21] (03CR) 10VolkerE: [C: 04-1] Enable click tracking in Vector on beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607136 (https://phabricator.wikimedia.org/T250282) (owner: 10Jdlrobson) [17:14:28] (03CR) 10Andrew Bogott: [C: 03+2] wmcs: install galera on eqiad1 cloudcontrol nodes [puppet] - 10https://gerrit.wikimedia.org/r/607318 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew 
Bogott) [17:17:22] (03PS2) 10Jdlrobson: Enable click tracking in Vector on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607136 (https://phabricator.wikimedia.org/T250282) [17:20:22] (03PS1) 10Arturo Borrero Gonzalez: toolforge: mailrelay: enforce ratelimiting [puppet] - 10https://gerrit.wikimedia.org/r/607320 (https://phabricator.wikimedia.org/T175964) [17:21:14] (03PS1) 10Andrew Bogott: eqiad1: enable galera [puppet] - 10https://gerrit.wikimedia.org/r/607321 (https://phabricator.wikimedia.org/T242455) [17:21:57] (03CR) 10Andrew Bogott: [C: 03+2] eqiad1: enable galera [puppet] - 10https://gerrit.wikimedia.org/r/607321 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [17:28:59] 10Operations, 10DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (10wkandek) Should a BBU failure cause a reboot? [17:32:35] (03PS2) 10CRusnov: profile::netbox::postgres: Parameterize backups [puppet] - 10https://gerrit.wikimedia.org/r/607310 (https://phabricator.wikimedia.org/T253140) [17:33:24] I'm having trouble connecting to toolforge - its still stuck on `Last login: ...` and hasn't shown me a `$` to start. Where should I ask for help? [17:34:26] DannyS712: #wikimedia-cloud [17:34:30] thanks [17:34:47] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Pipeline): Design pipeline image versioning scheme - https://phabricator.wikimedia.org/T209088 (10thcipriani) 05Open→03Resolved a:03thcipriani We've been using the schema outlined on this task for a while n... 
[17:39:28] (03CR) 10CRusnov: "PCC output: https://puppet-compiler.wmflabs.org/compiler1001/23401/" [puppet] - 10https://gerrit.wikimedia.org/r/607310 (https://phabricator.wikimedia.org/T253140) (owner: 10CRusnov) [17:39:42] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops, 10cloud-services-team (Hardware): (Need By: 2020-06-12) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (10Jclark-ctr) all fibers ran console connected. @RobH serial cables need to be ran and what pinout should use? [17:48:31] (03PS3) 10CRusnov: profile::netbox: Parameterize backups [puppet] - 10https://gerrit.wikimedia.org/r/607310 (https://phabricator.wikimedia.org/T253140) [17:48:40] (03CR) 10Bstorm: [C: 03+1] "Based entirely on https://www.exim.org/exim-html-current/doc/html/spec_html/ch-access_control_lists.html#SECID200 and the helpful comments" [puppet] - 10https://gerrit.wikimedia.org/r/607320 (https://phabricator.wikimedia.org/T175964) (owner: 10Arturo Borrero Gonzalez) [17:51:25] (03CR) 10Dzahn: [C: 03+2] contint: remove obsolete firewall rules from labs [puppet] - 10https://gerrit.wikimedia.org/r/606737 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [17:52:10] 10Operations, 10SRE-Access-Requests: Requesting access to centralauth database for Jennifer Wang - https://phabricator.wikimedia.org/T255836 (10jwang) @ema, thanks for your review. Here are the additional info. Feel free to let me know if you need anything from me. - as per point 3 of the checklist, what so... 
[17:55:14] (03PS3) 10Dzahn: gerrit: Allow to use request tracing for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606795 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [17:56:01] (03CR) 10Dzahn: [C: 03+2] gerrit: Allow to use request tracing for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606795 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [17:56:44] (03PS3) 10Dzahn: gerrit: Do not enable the ability to move changes for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606796 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [17:57:44] jouncebot: next [17:57:44] In 0 hour(s) and 2 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200623T1800) [17:59:11] (03CR) 10Dzahn: [C: 03+2] gerrit: Do not enable the ability to move changes for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606796 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [17:59:18] (03CR) 10Cwhite: "Ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/605688 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200623T1800) [18:00:28] (03PS1) 10Arturo Borrero Gonzalez: toolforge: mailrelay: collect exim metrics using prometheus [puppet] - 10https://gerrit.wikimedia.org/r/607324 (https://phabricator.wikimedia.org/T175964) [18:01:16] (03PS4) 10JMeybohm: WIP: chartmuseum: Add initial module, profile and role [puppet] - 10https://gerrit.wikimedia.org/r/606956 (https://phabricator.wikimedia.org/T253843) [18:02:01] (03CR) 10Dzahn: [C: 03+2] gerrit: Drop javamelody-deps library for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/607291 (owner: 10QChris) [18:02:14] (03PS2) 10Dzahn: gerrit: Drop javamelody-deps library for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/607291 (owner: 10QChris) [18:02:24] (03CR) 10jerkins-bot: [V: 04-1] WIP: chartmuseum: Add initial module, profile and role [puppet] - 10https://gerrit.wikimedia.org/r/606956 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [18:02:25] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:02:59] (03CR) 10Cwhite: "This check appears to be stable since 5/27. Ready to revert." [puppet] - 10https://gerrit.wikimedia.org/r/593815 (https://phabricator.wikimedia.org/T251294) (owner: 10Cwhite) [18:04:11] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:04:43] jouncebot: now [18:04:44] For the next 0 hour(s) and 55 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200623T1800) [18:04:46] jouncebot: next [18:04:47] In 0 hour(s) and 55 minute(s): Mediawiki train - American+European Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200623T1900) [18:04:57] !log brennen@deploy1001 Finished scap: testwikis wikis to 1.35.0-wmf.38 (duration: 85m 53s) [18:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:13] heh [18:05:26] eh.. it's sanity break [18:06:33] we specifically wanted to pick a window when there are no deployments [18:06:42] (03PS5) 10JMeybohm: WIP: chartmuseum: Add initial module, profile and role [puppet] - 10https://gerrit.wikimedia.org/r/606956 (https://phabricator.wikimedia.org/T253843) [18:08:13] (03CR) 10JMeybohm: WIP: chartmuseum: Add initial module, profile and role (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/606956 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [18:11:09] (03CR) 10Herron: "The exim mtail bit LGTM. But will also need to tell the labs prometheus instance(s?) to scrape this exporter on relevant hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/607324 (https://phabricator.wikimedia.org/T175964) (owner: 10Arturo Borrero Gonzalez) [18:14:57] (03CR) 10Privacybatm: "> Patch Set 2:" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/607028 (https://phabricator.wikimedia.org/T255999) (owner: 10Privacybatm) [18:15:05] (03PS12) 10Dzahn: gerrit: Redirect v2.15 dashboard links to those of v3.2 [puppet] - 10https://gerrit.wikimedia.org/r/606432 (https://phabricator.wikimedia.org/T254158) (owner: 10Paladox) [18:16:34] (03PS2) 10Reedy: Enable BotPasswords on officewiki and otrs_wikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606454 (https://phabricator.wikimedia.org/T254925) [18:16:41] (03PS3) 10Reedy: Enable BotPasswords on officewiki and otrs_wikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606454 (https://phabricator.wikimedia.org/T254925) [18:17:51] (03CR) 10Dzahn: [C: 03+2] "amended so that it's a noop on old version (prod): https://puppet-compiler.wmflabs.org/compiler1001/23404/" [puppet] - 10https://gerrit.wikimedia.org/r/606432 (https://phabricator.wikimedia.org/T254158) (owner: 10Paladox) [18:20:59] (03PS1) 10Cicalese: Enable MediaModeration on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607327 (https://phabricator.wikimedia.org/T247943) [18:21:24] (03CR) 10Reedy: [C: 03+2] Enable BotPasswords on officewiki and otrs_wikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606454 (https://phabricator.wikimedia.org/T254925) (owner: 10Reedy) [18:22:18] (03CR) 10Ppchelko: [C: 03+1] Enable MediaModeration on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607327 (https://phabricator.wikimedia.org/T247943) (owner: 10Cicalese) [18:22:24] (03Merged) 10jenkins-bot: Enable BotPasswords on officewiki and otrs_wikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606454 (https://phabricator.wikimedia.org/T254925) (owner: 10Reedy) [18:22:58] (03PS2) 10Cicalese: Enable MediaModeration on 
group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607327 (https://phabricator.wikimedia.org/T247943) [18:23:28] 10Operations, 10ops-eqiad, 10DC-Ops: scs-a8-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T256101 (10wiki_willy) 05Open→03Resolved [18:24:42] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T254925 T246489 (duration: 01m 06s) [18:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:48] T254925: Bot passwords for officewiki - https://phabricator.wikimedia.org/T254925 [18:24:48] T246489: enable bot passwords otrs-wiki.wikimedia.org - https://phabricator.wikimedia.org/T246489 [18:26:32] jouncebot: now [18:26:32] For the next 0 hour(s) and 33 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200623T1800) [18:26:37] heh [18:29:17] (03CR) 10Herron: [C: 03+1] Revert "profile: temporarily disable alerts on cloud_dev_pdns* for maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/593815 (https://phabricator.wikimedia.org/T251294) (owner: 10Cwhite) [18:30:52] 10Operations, 10Traffic: Collect client network errors, deprecation, intervention and crash reports - https://phabricator.wikimedia.org/T207860 (10Krinkle) [18:39:47] PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [18:40:32] (03PS1) 10Ottomata: Migrate TemplateWizard from EventLogging to EventGate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607333 (https://phabricator.wikimedia.org/T238230) [18:41:27] (03PS6) 10Dzahn: gerrit: drop old redirect as workaround for broken browser detection [puppet] - 10https://gerrit.wikimedia.org/r/606434 (owner: 10Paladox) [18:41:37] RECOVERY - Unmerged changes on repository puppet on labtestpuppetmaster2001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [18:42:37] (03PS7) 10Dzahn: gerrit: drop old redirect as workaround for broken browser detection [puppet] - 10https://gerrit.wikimedia.org/r/606434 (owner: 10Paladox) [18:43:44] (03CR) 10Dzahn: "also added a guard around this to only apply on new_version servers" [puppet] - 10https://gerrit.wikimedia.org/r/606434 (owner: 10Paladox) [18:44:28] (03CR) 10Paladox: "@Dzahn not really needed the guard, what it redirects to, no longer appears to exist." 
[puppet] - 10https://gerrit.wikimedia.org/r/606434 (owner: 10Paladox) [18:45:58] (03PS8) 10Dzahn: gerrit: drop old redirect as workaround for broken browser detection [puppet] - 10https://gerrit.wikimedia.org/r/606434 (owner: 10Paladox) [18:46:49] (03PS2) 10Ottomata: Migrate TemplateWizard from EventLogging to EventGate on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607333 (https://phabricator.wikimedia.org/T238230) [18:48:09] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/23406/" [puppet] - 10https://gerrit.wikimedia.org/r/606434 (owner: 10Paladox) [18:49:44] (03CR) 10Ottomata: [C: 03+2] Migrate TemplateWizard from EventLogging to EventGate on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607333 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [18:53:36] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Migrate TemplateWizard from EventLogging to EventGate on group0 - T238230 (duration: 01m 06s) [18:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:40] T238230: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230 [18:55:14] !log gerrit1001 (prod) - restarting gerrit service to verify config changes [18:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:15] (03PS1) 10Ottomata: Migrate TemplateWizard from EventLogging to EventGate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607346 (https://phabricator.wikimedia.org/T238230) [18:58:39] 10Operations, 10Traffic, 10observability: Collect client network errors, deprecation, intervention and crash reports - https://phabricator.wikimedia.org/T207860 (10CDanis) [19:00:04] brennen and hashar: Your horoscope predicts another unfortunate Mediawiki train - American+European Version deploy. May Zuul be (nice) with you. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200623T1900). [19:00:28] jouncebot: next [19:00:28] In 3 hour(s) and 59 minute(s): Evening backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200623T2300) [19:00:40] deployment window is free [19:00:45] gerrit maintenance done for now [19:01:07] thanks mutante. [19:01:40] (03PS4) 10Ssingh: prometheus: add wikidough statistics [puppet] - 10https://gerrit.wikimedia.org/r/607301 (https://phabricator.wikimedia.org/T252132) [19:02:41] (03CR) 10Herron: role::exim: update config to drop ldap validation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [19:03:51] (03PS1) 10Brennen Bearnes: group0 wikis to 1.35.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607347 [19:03:53] (03CR) 10Brennen Bearnes: [C: 03+2] group0 wikis to 1.35.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607347 (owner: 10Brennen Bearnes) [19:04:37] (03Merged) 10jenkins-bot: group0 wikis to 1.35.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607347 (owner: 10Brennen Bearnes) [19:06:55] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.38 [19:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:14] brennen: lemme know when you are done, i have a config change i'd like to push [19:12:07] ottomata: on group0 and things seem stable. go ahead. [19:13:08] ok, thanks! 
[19:13:40] (03PS2) 10Ottomata: Migrate TemplateWizard from EventLogging to EventGate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607346 (https://phabricator.wikimedia.org/T238230) [19:14:53] (03CR) 10Ottomata: [C: 03+2] Migrate TemplateWizard from EventLogging to EventGate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607346 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [19:16:32] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Migrate TemplateWizard from EventLogging to EventGate on all wikis - T238230 (duration: 01m 05s) [19:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:37] T238230: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230 [19:29:00] (03CR) 10Cwhite: [C: 03+2] Revert "profile: temporarily disable alerts on cloud_dev_pdns* for maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/593815 (https://phabricator.wikimedia.org/T251294) (owner: 10Cwhite) [19:29:08] (03PS3) 10Cwhite: Revert "profile: temporarily disable alerts on cloud_dev_pdns* for maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/593815 (https://phabricator.wikimedia.org/T251294) [19:41:16] (03CR) 10Dzahn: [C: 03+2] "yea, i think it's "puppet-lint became smarter". we are not excluding the arrow alignment check and it passes it nowadays. 
thanks all for c" [puppet] - 10https://gerrit.wikimedia.org/r/607139 (owner: 10Dzahn) [19:47:28] (03PS2) 10Thcipriani: blubberoid: Update to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/606501 (https://phabricator.wikimedia.org/T248927) (owner: 10Jeena Huneidi) [19:48:24] (03CR) 10Thcipriani: [C: 03+2] blubberoid: Update to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/606501 (https://phabricator.wikimedia.org/T248927) (owner: 10Jeena Huneidi) [19:48:56] (03Merged) 10jenkins-bot: blubberoid: Update to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/606501 (https://phabricator.wikimedia.org/T248927) (owner: 10Jeena Huneidi) [19:50:46] (03PS3) 10Cicalese: Enable MediaModeration on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607327 (https://phabricator.wikimedia.org/T247943) [19:51:31] 10Operations, 10Core Platform Team, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 5 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10WDoranWMF) [19:52:14] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10Jclark-ctr) @akosiaris that sounds great ping me on irc if anything comes up [19:53:15] (03PS1) 10Ssingh: copy fake wikidough data from hieradata to passwords [labs/private] - 10https://gerrit.wikimedia.org/r/607355 [19:53:29] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10Jclark-ctr) [19:54:38] (03CR) 10Ssingh: [V: 03+2 C: 03+2] copy fake wikidough data from hieradata to passwords [labs/private] - 10https://gerrit.wikimedia.org/r/607355 (owner: 10Ssingh) [20:02:48] (03PS5) 10Ssingh: prometheus: add wikidough statistics [puppet] - 10https://gerrit.wikimedia.org/r/607301 (https://phabricator.wikimedia.org/T252132) [20:11:57] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 
10cloud-services-team (Kanban): relocate/reimage cloudvirt1013 with 10G interfaces - https://phabricator.wikimedia.org/T243414 (10Andrew) @jclark, typically we need to drain the workload from a host before we can swap it. It was empty when I opened this task... [20:16:29] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1002/23408/prometheus2003.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/607301 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [20:30:46] (03CR) 10CRusnov: [C: 03+2] profile::netbox: Parameterize backups [puppet] - 10https://gerrit.wikimedia.org/r/607310 (https://phabricator.wikimedia.org/T253140) (owner: 10CRusnov) [20:31:22] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Migrate TemplateWizard from EventLogging to EventGate on all wikis - take 2 - T238230 (duration: 01m 06s) [20:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:27] T238230: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230 [20:32:41] (03CR) 10Volans: "My 2 cents inline on the paths to backup and profiles involved." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607258 (owner: 10Jcrespo) [20:36:35] (03CR) 10Volans: "Just a thought, but we could use the same approach of the homer private repo and automatically replicate it between the two cumin hosts wh" [puppet] - 10https://gerrit.wikimedia.org/r/607281 (owner: 10Muehlenhoff) [20:55:47] (03PS2) 10Dzahn: codesearch: move ferm rules from module to profile [puppet] - 10https://gerrit.wikimedia.org/r/606735 (https://phabricator.wikimedia.org/T114209) [21:14:50] !log wkandek@cumin1001 START - Cookbook sre.hosts.decommission [21:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:33] !log wkandek@cumin1001 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97) [21:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:36] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [21:22:36] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [21:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:47] !log wkandek@cumin1001 START - Cookbook sre.ganeti.makevm [21:22:47] !log wkandek@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [21:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:02] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [21:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:43] !log dzahn@cumin1001 END (ERROR) - Cookbook sre.ganeti.makevm (exit_code=97) [21:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:47] !log wkandek@cumin1001 START - Cookbook sre.ganeti.makevm [21:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:34] ^ we are 
sharing knowledge about ganeti [21:36:03] (03PS1) 10Wolfgang Kandek: install_server: update MAC address for releases1002 [puppet] - 10https://gerrit.wikimedia.org/r/607362 (https://phabricator.wikimedia.org/T255590) [21:36:07] (03PS1) 10Dzahn: DHCP: update MAC address for releases1002 [puppet] - 10https://gerrit.wikimedia.org/r/607363 (https://phabricator.wikimedia.org/T255590) [21:36:29] (03CR) 10jerkins-bot: [V: 04-1] install_server: update MAC address for releases1002 [puppet] - 10https://gerrit.wikimedia.org/r/607362 (https://phabricator.wikimedia.org/T255590) (owner: 10Wolfgang Kandek) [21:38:50] 10Operations, 10Core Platform Team, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 5 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10Eevans) >>! In T224041#6229281, @jeena wrote: >>>! In T224041#6227592, @akosiaris wrote: >... [21:44:45] (03PS2) 10Wolfgang Kandek: install_server: update MAC address for releases1002 [puppet] - 10https://gerrit.wikimedia.org/r/607362 (https://phabricator.wikimedia.org/T255590) [21:45:34] 10Operations, 10Traffic: Block hotlinking - https://phabricator.wikimedia.org/T152091 (10Nemo_bis) It's a long-standing Wikimedia policy that we allow hotlinking, or even encourage it. See for instance https://commons.wikimedia.org/wiki/Commons:HOTLINK , which reflects consensus (and Wikimedia Foundation guida... 
[21:48:20] (03CR) 10Dzahn: [C: 04-1] "we are still waiting for the real VM to be created" [puppet] - 10https://gerrit.wikimedia.org/r/607362 (https://phabricator.wikimedia.org/T255590) (owner: 10Wolfgang Kandek) [21:52:13] (03CR) 10Dzahn: [C: 03+2] codesearch: move ferm rules from module to profile [puppet] - 10https://gerrit.wikimedia.org/r/606735 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [21:56:30] (03Abandoned) 10Dzahn: DHCP: update MAC address for releases1002 [puppet] - 10https://gerrit.wikimedia.org/r/607363 (https://phabricator.wikimedia.org/T255590) (owner: 10Dzahn) [21:57:22] (03PS1) 10Ssingh: wikidough: update dnsdist web server listen address [puppet] - 10https://gerrit.wikimedia.org/r/607368 (https://phabricator.wikimedia.org/T252132) [22:01:33] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/23409/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/607368 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [22:02:26] (03CR) 10Dzahn: wikidough: update dnsdist web server listen address (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607368 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [22:04:47] (03PS1) 10Dave Pifke: [WIP] arclamp: Deploy from scap [puppet] - 10https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) [22:06:27] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [22:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:43] 10Operations, 10DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (10wkandek) HP says, the server should not reboot due to battery failure: https://support.hpe.com/hpesc/public/docDisplay?docId=mmr_kc-0126260#:~:text=POST%20Error%3A%20313%20%2D%20HPE%20Smart,other%20reasons%20for%20a%20reboot. "... 
[22:11:46] (03PS1) 10Dzahn: DHCP: update MAC address for xhgui1001 [puppet] - 10https://gerrit.wikimedia.org/r/607371 (https://phabricator.wikimedia.org/T238098) [22:24:08] !log wkandek@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [22:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:44] (03PS3) 10Wolfgang Kandek: install_server: update MAC address for releases1002 [puppet] - 10https://gerrit.wikimedia.org/r/607362 (https://phabricator.wikimedia.org/T255590) [22:30:04] (03CR) 10Dzahn: [C: 03+1] install_server: update MAC address for releases1002 [puppet] - 10https://gerrit.wikimedia.org/r/607362 (https://phabricator.wikimedia.org/T255590) (owner: 10Wolfgang Kandek) [22:30:08] (03PS1) 10Andrew Bogott: eqiad1: move glance db to galera [puppet] - 10https://gerrit.wikimedia.org/r/607374 (https://phabricator.wikimedia.org/T242455) [22:30:53] (03CR) 10Wolfgang Kandek: [C: 03+2] install_server: update MAC address for releases1002 [puppet] - 10https://gerrit.wikimedia.org/r/607362 (https://phabricator.wikimedia.org/T255590) (owner: 10Wolfgang Kandek) [22:32:35] (03CR) 10Andrew Bogott: [C: 03+2] eqiad1: move glance db to galera [puppet] - 10https://gerrit.wikimedia.org/r/607374 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [22:38:49] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 52 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:44:37] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:52:20] (03PS1) 10Andrew Bogott: Revert "eqiad1: move glance db to galera" [puppet] 
- 10https://gerrit.wikimedia.org/r/607378 [22:53:04] (03CR) 10Andrew Bogott: [C: 03+2] Revert "eqiad1: move glance db to galera" [puppet] - 10https://gerrit.wikimedia.org/r/607378 (owner: 10Andrew Bogott) [22:56:14] (03PS1) 10Krinkle: Define NS_LQT in Lqt.namespaces.php [extensions/LiquidThreads] (wmf/1.35.0-wmf.37) - 10https://gerrit.wikimedia.org/r/607379 (https://phabricator.wikimedia.org/T256151) [22:56:27] (03PS1) 10Krinkle: Define NS_LQT in Lqt.namespaces.php [extensions/LiquidThreads] (wmf/1.35.0-wmf.38) - 10https://gerrit.wikimedia.org/r/607380 (https://phabricator.wikimedia.org/T256151) [23:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Evening backport window(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200623T2300). [23:02:00] 10Operations, 10LDAP-Access-Requests: NDA for superset access request from WMDE employee danshick - https://phabricator.wikimedia.org/T254442 (10KFrancis) @Dzahn Update - this request has been sent for signatures. Once it's complete, I'll notify you and you may move forward. 
[23:11:03] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [23:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:30] (03PS1) 10Andrew Bogott: Revert "Revert "eqiad1: move glance db to galera"" [puppet] - 10https://gerrit.wikimedia.org/r/607382 [23:13:19] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Revert "eqiad1: move glance db to galera"" [puppet] - 10https://gerrit.wikimedia.org/r/607382 (owner: 10Andrew Bogott) [23:16:47] !log releases1002 is back after being moved to row D (T255590) [23:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:51] T255590: Site: eqiad/codfw 2 VM request for releases - https://phabricator.wikimedia.org/T255590 [23:17:17] (03CR) 10Krinkle: [WIP] arclamp: Deploy from scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) (owner: 10Dave Pifke) [23:18:58] (03PS9) 10RLazarus: mcrouter_wancache: Add mcrouter support for a machine-local memcached instance [puppet] - 10https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) [23:19:02] (03PS2) 10Dzahn: DHCP: update MAC address for xhgui1001 [puppet] - 10https://gerrit.wikimedia.org/r/607371 (https://phabricator.wikimedia.org/T238098) [23:20:06] (03CR) 10jerkins-bot: [V: 04-1] mcrouter_wancache: Add mcrouter support for a machine-local memcached instance [puppet] - 10https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) (owner: 10RLazarus) [23:23:20] (03CR) 10Wolfgang Kandek: [C: 03+1] DHCP: update MAC address for xhgui1001 [puppet] - 10https://gerrit.wikimedia.org/r/607371 (https://phabricator.wikimedia.org/T238098) (owner: 10Dzahn) [23:23:45] (03CR) 10Dzahn: [C: 03+2] DHCP: update MAC address for xhgui1001 [puppet] - 10https://gerrit.wikimedia.org/r/607371 (https://phabricator.wikimedia.org/T238098) (owner: 10Dzahn) [23:25:13] (03PS3) 10Dzahn: DHCP: update MAC address for xhgui1001 
[puppet] - 10https://gerrit.wikimedia.org/r/607371 (https://phabricator.wikimedia.org/T238098) [23:28:23] (03PS10) 10RLazarus: mcrouter_wancache: Add mcrouter support for a machine-local memcached instance [puppet] - 10https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) [23:30:19] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 64 probes of 564 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:35:43] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 46 probes of 564 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:38:28] (03CR) 10RLazarus: "Krinkle, Aaron, Elukey: Adding you per our meeting last week. This is a no-op for the mcrouter config until the hiera flag is flipped on, " [puppet] - 10https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) (owner: 10RLazarus) [23:40:19] (03CR) 10Dave Pifke: [WIP] arclamp: Deploy from scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) (owner: 10Dave Pifke) [23:47:53] (03CR) 10Krinkle: [WIP] arclamp: Deploy from scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) (owner: 10Dave Pifke)