[00:44:49] PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [00:45:21] PROBLEM - DPKG on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [00:45:29] PROBLEM - configured eth on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [00:45:35] PROBLEM - Check size of conntrack table on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [00:45:35] PROBLEM - dhclient process on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [00:45:37] PROBLEM - Disk space on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops [00:45:47] PROBLEM - Check the last execution of search-drop-query-clicks on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [00:46:17] PROBLEM - Check whether ferm is active by checking the default input chain on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [00:46:17] PROBLEM - Check systemd state on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:46:55] RECOVERY - DPKG on stat1007 is OK: All packages OK 
https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [00:47:05] RECOVERY - configured eth on stat1007 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [00:47:09] RECOVERY - Check size of conntrack table on stat1007 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [00:47:11] RECOVERY - dhclient process on stat1007 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [00:47:11] RECOVERY - Disk space on stat1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops [00:47:15] (03PS1) 10DannyS712: Fix typos in code [puppet] - 10https://gerrit.wikimedia.org/r/530989 (https://phabricator.wikimedia.org/T201491) [00:47:49] RECOVERY - Check whether ferm is active by checking the default input chain on stat1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [00:47:51] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:47:57] RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [00:52:08] (03PS2) 10DannyS712: Fix typos in code [puppet] - 10https://gerrit.wikimedia.org/r/530989 (https://phabricator.wikimedia.org/T201491) [00:56:19] RECOVERY - Check the last execution of search-drop-query-clicks on stat1007 is OK: OK: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [02:10:31] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 951.88 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [03:48:55] PROBLEM - snapshot of s7 in codfw on db1115 is CRITICAL: snapshot for s7 at codfw taken more than 4 days ago: Most recent backup 2019-08-16 03:40:14 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [03:56:19] (03PS1) 10Vgutierrez: ATS: Disable config status check for TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/531018 (https://phabricator.wikimedia.org/T221594) [03:56:45] (03CR) 10jerkins-bot: [V: 04-1] ATS: Disable config status check for TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/531018 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [03:58:02] (03PS2) 10Vgutierrez: ATS: Disable config status check for TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/531018 (https://phabricator.wikimedia.org/T221594) [03:58:29] (03CR) 10jerkins-bot: [V: 04-1] ATS: Disable config status check for TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/531018 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [04:05:21] (03PS3) 10Vgutierrez: ATS: Disable config status check for TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/531018 (https://phabricator.wikimedia.org/T221594) [04:06:17] (03CR) 10jerkins-bot: [V: 04-1] ATS: Disable config status check for TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/531018 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [04:06:32] *sigh* [04:08:11] (03PS4) 10Vgutierrez: ATS: Disable config status check for TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/531018 (https://phabricator.wikimedia.org/T221594) [04:24:07] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [04:27:19] (03CR) 10Vgutierrez: "pcc at https://puppet-compiler.wmflabs.org/compiler1002/17954/ shows the expected changes:" [puppet] - 
10https://gerrit.wikimedia.org/r/531018 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [05:05:17] 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (10Marostegui) >>! In T227539#5422811, @wiki_willy wrote: > @Marostegui - I'll defer to Faidon or Mark for their opinion, but my suggestion is to go ahead and fail out in advance if... [05:05:51] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10Marostegui) I will get them scheduled, planned etc. Thanks [05:11:47] (03PS2) 10ArielGlenn: ukwiki and viwiki are now configured as 'big wikis' for xml/sql dumps [puppet] - 10https://gerrit.wikimedia.org/r/524748 (https://phabricator.wikimedia.org/T228558) [05:13:13] (03CR) 10ArielGlenn: [C: 03+2] ukwiki and viwiki are now configured as 'big wikis' for xml/sql dumps [puppet] - 10https://gerrit.wikimedia.org/r/524748 (https://phabricator.wikimedia.org/T228558) (owner: 10ArielGlenn) [05:17:21] (03PS1) 10Marostegui: dbproxy2002: Promote db2067 to m2 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/531025 (https://phabricator.wikimedia.org/T230705) [05:18:33] !log Switchover m2 codfw master, db2044 -> db2067 T230705 [05:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:43] T230705: Replace db2044 (m2 codfw master) with db2067 - https://phabricator.wikimedia.org/T230705 [05:20:29] (03PS2) 10Marostegui: dbproxy2002: Promote db2067 to m2 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/531025 (https://phabricator.wikimedia.org/T230705) [05:21:15] (03CR) 10Marostegui: [C: 03+2] dbproxy2002: Promote db2067 to m2 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/531025 (https://phabricator.wikimedia.org/T230705) (owner: 10Marostegui) [05:24:50] !log Reload haproxy on dbproxy2002 T230705 [05:24:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:24:58] T230705: Replace db2044 (m2 codfw master) with db2067 - https://phabricator.wikimedia.org/T230705 [05:26:14] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [05:26:59] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [05:30:17] (03PS1) 10Marostegui: mariadb: Decommission db2049 [puppet] - 10https://gerrit.wikimedia.org/r/531026 (https://phabricator.wikimedia.org/T230721) [05:31:30] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2049 [puppet] - 10https://gerrit.wikimedia.org/r/531026 (https://phabricator.wikimedia.org/T230721) (owner: 10Marostegui) [05:35:00] !log Stop MySQL on db2049 for decommissioning - T230721 [05:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:08] T230721: Decommission db2049.codfw.wmnet - https://phabricator.wikimedia.org/T230721 [05:36:59] !log Remove db2049 from tendril and zarcillo T230721 [05:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:44] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2049.codfw.wmnet - https://phabricator.wikimedia.org/T230721 (10Marostegui) a:05Marostegui→03RobH [05:38:54] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2049.codfw.wmnet - https://phabricator.wikimedia.org/T230721 (10Marostegui) This host is ready for #dc-ops to decommission [05:39:05] (03PS1) 10Vgutierrez: ATS: Enable TCP Fast Open for the TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/531027 (https://phabricator.wikimedia.org/T221594) [05:39:16] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [05:41:22] (03PS3) 10Muehlenhoff: Extend netboot.cfg for failoid* [puppet] - 10https://gerrit.wikimedia.org/r/530830 
(https://phabricator.wikimedia.org/T229903) [05:44:08] (03PS2) 10Vgutierrez: ATS: Enable TCP Fast Open for the TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/531027 (https://phabricator.wikimedia.org/T221594) [05:44:13] (03CR) 10Muehlenhoff: [C: 03+2] Extend netboot.cfg for failoid* [puppet] - 10https://gerrit.wikimedia.org/r/530830 (https://phabricator.wikimedia.org/T229903) (owner: 10Muehlenhoff) [05:47:29] 10Operations, 10DBA: Decommission db2044.codfw.wmnet - https://phabricator.wikimedia.org/T230761 (10Marostegui) [05:48:56] (03PS2) 10Muehlenhoff: Add PXE config for failoid1001/failoid2001 [puppet] - 10https://gerrit.wikimedia.org/r/530838 [05:50:15] (03CR) 10Muehlenhoff: [C: 03+2] Add PXE config for failoid1001/failoid2001 [puppet] - 10https://gerrit.wikimedia.org/r/530838 (owner: 10Muehlenhoff) [05:51:21] (03PS1) 10Marostegui: mariadb: Decommission db2044 [puppet] - 10https://gerrit.wikimedia.org/r/531029 (https://phabricator.wikimedia.org/T230761) [05:53:47] (03PS2) 10Marostegui: mariadb: Decommission db2044 [puppet] - 10https://gerrit.wikimedia.org/r/531029 (https://phabricator.wikimedia.org/T230761) [05:55:12] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2044 [puppet] - 10https://gerrit.wikimedia.org/r/531029 (https://phabricator.wikimedia.org/T230761) (owner: 10Marostegui) [05:55:43] (03PS3) 10Vgutierrez: ATS: Enable TCP Fast Open for the TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/531027 (https://phabricator.wikimedia.org/T221594) [05:55:50] !log Stop MySQL on db2044 for decommissioning - T221594 [05:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:58] T221594: Puppetize ATS TLS configuration for incoming traffic - https://phabricator.wikimedia.org/T221594 [05:56:30] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2044.codfw.wmnet - https://phabricator.wikimedia.org/T230761 (10Marostegui) p:05Triage→03Normal a:05Marostegui→03RobH [05:56:44] 10Operations, 
10ops-codfw, 10DC-Ops, 10decommission: Decommission db2044.codfw.wmnet - https://phabricator.wikimedia.org/T230761 (10Marostegui) This host is ready for #dc-ops to decommission [05:56:56] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [05:59:13] !log Stop MySQL and shutdown db1114 for on-site maintenance - T229452 [05:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:21] T229452: db1114 crashed due to memory issues (server under warranty) - https://phabricator.wikimedia.org/T229452 [06:00:56] 10Operations, 10ops-eqiad, 10DBA: db1114 crashed due to memory issues (server under warranty) - https://phabricator.wikimedia.org/T229452 (10Marostegui) @Cmjohnson this host is now OFF, so you can act on it whenever you get to the DC. Thanks! [06:05:35] RECOVERY - Memory correctable errors -EDAC- on thumbor1004 is OK: (C)4 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [06:07:54] 10Operations, 10DBA: Switchover s8 (wikidata) primary database master db1104 -> db1109 - https://phabricator.wikimedia.org/T230762 (10Marostegui) [06:08:14] 10Operations, 10DBA: Switchover s8 (wikidata) primary database master db1104 -> db1109 - https://phabricator.wikimedia.org/T230762 (10Marostegui) p:05Triage→03Normal [06:19:49] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 27766 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops [06:23:11] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:24:16] hello cr2-eqiad [06:25:13] (03PS1) 10Muehlenhoff: Change mail address for fsero [puppet] - 10https://gerrit.wikimedia.org/r/531044 [06:25:14] Level3 link down to esams, but we have maintenance scheduled [06:29:00] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review, 10Performance: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 (10MoritzMuehlenhoff) 05Open→03Resolved Yeah, I think we can close this one. [06:29:22] (03PS2) 10Elukey: profile::analytics::refinery::job::druid_load: add more dim to netflow [puppet] - 10https://gerrit.wikimedia.org/r/530893 (https://phabricator.wikimedia.org/T229682) [06:32:28] (03CR) 10Elukey: [C: 03+2] profile::analytics::refinery::job::druid_load: add more dim to netflow [puppet] - 10https://gerrit.wikimedia.org/r/530893 (https://phabricator.wikimedia.org/T229682) (owner: 10Elukey) [06:33:15] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: site=eqsin https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [06:34:29] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:40:05] RECOVERY - Disk space on elastic1018 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops [06:55:25] (03CR) 10Filippo Giunchedi: prometheus: add prometheus ipsec exporter service & config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/530616 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [06:56:41] (03CR) 10Filippo Giunchedi: prometheus-ipsec-exporter: initial commit of 
version 0.3.1 (031 comment) [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/530203 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [06:58:37] (03PS1) 10Elukey: Add more tunables to Eventlogging to Druid [puppet] - 10https://gerrit.wikimedia.org/r/531046 [06:59:26] !log installing failoid1001/2001 T229903 [06:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:42] T229903: eqiad/codfw: One VM for Failoid - https://phabricator.wikimedia.org/T229903 [07:03:31] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [07:09:53] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [07:36:23] Serious problem with OAuth. I can't reactivate it now due to an Internal Error [07:36:57] https://pastebin.com/SLCTvmVR [07:38:15] _joe_: ^ [07:40:37] <_joe_> Cyberpower678: I don't think this is exactly SRE territory, but rather bug material? [07:40:47] <_joe_> on OAuth specifically [07:41:10] * Cyberpower678 is unfortunately very pre-occupied with other matters. [07:42:17] Sorry I meant 2FA [07:46:03] (03PS1) 10Marostegui: db-codfw.php: Re-organize s5 codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531049 (https://phabricator.wikimedia.org/T230106) [07:47:31] (03CR) 10Marostegui: [C: 03+2] db-codfw.php: Re-organize s5 codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531049 (https://phabricator.wikimedia.org/T230106) (owner: 10Marostegui) [07:47:33] 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (10wiki_willy) Thanks @Marostegui , I appreciate it. 
[07:48:30] (03Merged) 10jenkins-bot: db-codfw.php: Re-organize s5 codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531049 (https://phabricator.wikimedia.org/T230106) (owner: 10Marostegui) [07:48:47] (03CR) 10jenkins-bot: db-codfw.php: Re-organize s5 codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531049 (https://phabricator.wikimedia.org/T230106) (owner: 10Marostegui) [07:49:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2059 and db2066, those two will be decommissioned T228258', diff saved to https://phabricator.wikimedia.org/P8934 and previous config saved to /var/cache/conftool/dbconfig/20190820-074900-marostegui.json [07:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:09] T228258: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 [07:49:46] (03PS4) 10Giuseppe Lavagetto: Update parsoid-rt-client.config.js.erb to fetch test ids from a function [puppet] - 10https://gerrit.wikimedia.org/r/529391 (https://phabricator.wikimedia.org/T230166) (owner: 10Subramanya Sastry) [07:50:12] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Reorganize s5 codfw weights T230106 (duration: 00m 47s) [07:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:21] T230106: Switchover codfw primary database masters to new hosts - https://phabricator.wikimedia.org/T230106 [07:52:13] PROBLEM - HHVM rendering on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:52:53] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Update parsoid-rt-client.config.js.erb to fetch test ids from a function [puppet] - 10https://gerrit.wikimedia.org/r/529391 (https://phabricator.wikimedia.org/T230166) (owner: 10Subramanya Sastry) [07:53:23] (03CR) 10DCausse: [C: 03+1] Remove PrimateTmp=true from elasticsearch_6@ systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/530969 (owner: 10EBernhardson) 
[07:53:46] (03PS1) 10Marostegui: mariadb: Promote db2123 to s5 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/531050 (https://phabricator.wikimedia.org/T230106) [07:54:09] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [07:54:19] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 76346 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:55:57] (03PS1) 10Marostegui: db-codfw.php: Promote db2123 to s5 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531051 (https://phabricator.wikimedia.org/T230106) [07:56:59] n/win 7 [07:57:24] (03PS2) 10Marostegui: mariadb: Promote db2123 to s5 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/531050 (https://phabricator.wikimedia.org/T230106) [07:57:49] PROBLEM - MariaDB Slave Lag: s8 on db2079 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.62 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [07:58:09] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 37 probes of 449 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [07:58:15] PROBLEM - MariaDB Slave Lag: s8 on db2082 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.38 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [07:58:42] Uh? 
Lag in s8 [07:58:44] checking [07:58:49] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 43 probes of 449 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [07:58:51] PROBLEM - MariaDB Slave Lag: s8 on db2081 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 324.65 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [07:59:14] Ah, there is a user rename in progress [07:59:29] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 233.73 ms [08:00:41] PROBLEM - MariaDB Slave Lag: s8 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 355.89 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [08:03:25] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 28 probes of 449 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [08:03:59] RECOVERY - MariaDB Slave Lag: s8 on db2081 is OK: OK slave_sql_lag Replication lag: 0.40 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [08:04:08] _joe_: I need 2FA reset on InternetArchiveBot [08:04:09] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 22 probes of 449 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [08:04:15] RECOVERY - MariaDB Slave Lag: s8 on db2079 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [08:04:41] RECOVERY - MariaDB Slave Lag: s8 on db2086 is OK: OK slave_sql_lag Replication lag: 0.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [08:04:42] Who do I reach out to for that? 
[08:04:45] RECOVERY - MariaDB Slave Lag: s8 on db2082 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [08:05:00] <_joe_> Cyberpower678: it's a wikitech account? [08:05:11] No. It's Wikipedia account [08:05:22] <_joe_> then I wouldn't know :/ [08:05:30] I set up 2FA, but the idiot I am, I accidentally deleted the keys. [08:05:38] <_joe_> it happens :) [08:05:52] But now I'm locked out [08:05:52] !log Switchover s5 codfw master db2052 -> db2123 T230106 [08:05:52] <_joe_> but I don't know who you could contact to reset your 2FA [08:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:00] T230106: Switchover codfw primary database masters to new hosts - https://phabricator.wikimedia.org/T230106 [08:07:17] <_joe_> Cyberpower678: did you write down the scratch codes? [08:07:45] (03CR) 10Muehlenhoff: "One comment inline, looks good to me (to the extent the kludgy kprop mechanism can look good :-)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/529733 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [08:07:57] I did, and I managed to use one before the keychain decided to nuke them. [08:08:04] Fuck this. [08:08:13] So I'm logged in, but I can't disable 2FA. [08:08:15] <_joe_> Cyberpower678: you need to open a phab ticket at the very least [08:08:25] <_joe_> per https://en.wikipedia.org/wiki/Help:Two-factor_authentication#Scratch_codes [08:08:39] (03PS2) 10Gehel: Remove PrimateTmp=true from elasticsearch_6@ systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/530969 (owner: 10EBernhardson) [08:08:45] (03CR) 10Hashar: "recheck" [software/homer] - 10https://gerrit.wikimedia.org/r/530860 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [08:08:58] hashar: thanks! [08:09:13] <_joe_> gehel: please hold a sec [08:09:15] RECOVERY - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [08:09:19] sure [08:10:36] (03CR) 10Giuseppe Lavagetto: "having a private tmp directory protects against an entire class of exploits, it's not something that's not needed. I would at least wait f" [puppet] - 10https://gerrit.wikimedia.org/r/530969 (owner: 10EBernhardson) [08:11:05] <_joe_> gehel: privatetmp is a pretty serious security improvement [08:11:18] <_joe_> so I'd like moritzm's opinion on that patch [08:11:41] I had that discussion with him on a similar patch for WDQS, but I'll get his review on that one as well [08:11:42] <_joe_> I think we should work on glue to make jstack/etc usable with private tmps [08:12:07] <_joe_> gehel: esp true for elasticsearch where we run multiple instances on the same servers [08:12:10] (03CR) 10Elukey: profile::kerberos::kadminserver: add support for replication (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/529733 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [08:14:38] _joe_: what kind of glue are you thinking? Detect the private temp dir and set it to the same value for jstack? 
[08:15:23] <_joe_> gehel: yes, I don't remember (thankfully) much about jstack, but it should be possible yes [08:17:12] it honors $TEMP, so we can probably just reset it [08:17:33] Cyberpower678: https://wikitech.wikimedia.org/wiki/Password_reset#For_users / https://wikitech.wikimedia.org/wiki/Password_reset#Reset_two_factor_authentication [08:18:13] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2123 to s5 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/531050 (https://phabricator.wikimedia.org/T230106) (owner: 10Marostegui) [08:18:15] (03CR) 10Marostegui: [C: 03+2] db-codfw.php: Promote db2123 to s5 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531051 (https://phabricator.wikimedia.org/T230106) (owner: 10Marostegui) [08:18:20] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 104.1 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [08:18:49] (03Merged) 10jenkins-bot: db-codfw.php: Promote db2123 to s5 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531051 (https://phabricator.wikimedia.org/T230106) (owner: 10Marostegui) [08:19:10] tgr: I'm not sure what to do with that? [08:19:16] (03CR) 10jenkins-bot: db-codfw.php: Promote db2123 to s5 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531051 (https://phabricator.wikimedia.org/T230106) (owner: 10Marostegui) [08:19:47] follow the 2FA reset request process described there [08:19:57] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Promote db2123 to s5 codfw master T230106 (duration: 00m 48s) [08:19:58] I filed a phab already [08:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:05] T230106: Switchover codfw primary database masters to new hosts - https://phabricator.wikimedia.org/T230106 [08:20:06] did you file a bug report about the OATH error? 
[08:20:07] tgr: https://phabricator.wikimedia.org/T230773 [08:20:27] Not yet. [08:21:11] I can do the reset but you need to prove your identity somehow, see the second link [08:21:13] (03CR) 10Muehlenhoff: "We have additional Linux kernel hardening which prevents symlink attacks in /tmp, so if it helps debugging it's an acceptable tradeoff." [puppet] - 10https://gerrit.wikimedia.org/r/530969 (owner: 10EBernhardson) [08:21:17] (03CR) 10Muehlenhoff: Remove PrimateTmp=true from elasticsearch_6@ systemd unit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/530969 (owner: 10EBernhardson) [08:21:20] Fortunately, my Phab account and my Wikipedia account are linked, and that IABot's creation was carried out by me according to the logs. [08:22:31] tgr: and how will that prove my identity on Wikipedia? [08:23:21] tgr: do you want me to post a confirmation of the request on my talk page and from InternetArchiveBot as well? I'm still logged in. [08:24:08] the point of the check is to ensure that it's not coming from someone who stole your password [08:24:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Re-organize s5 codfw weights - T230106', diff saved to https://phabricator.wikimedia.org/P8935 and previous config saved to /var/cache/conftool/dbconfig/20190820-082411-marostegui.json [08:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:27] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10tramm) The campaign for getting approval for a simple DNS record change from legal department is starting to resemble EU copyright campaign which allocated mos... 
[08:25:16] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki: add cluster latency alerts [puppet] - 10https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [08:25:19] ideally video chat with a WMF staff member or well-known volunteer who knows you personally [08:27:15] So me posting from both accounts won't work? Different passwords and both 2FA enforced. [08:27:27] 10Operations, 10Discovery-Search (Current work): Run jstack / jmap / etc... with PrivateTmp=true - https://phabricator.wikimedia.org/T230774 (10Gehel) [08:28:01] hm, I guess if it's 2FA protected, that should be sufficient [08:28:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2123 as codfw s5 master - T230106', diff saved to https://phabricator.wikimedia.org/P8936 and previous config saved to /var/cache/conftool/dbconfig/20190820-082802-marostegui.json [08:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:11] T230106: Switchover codfw primary database masters to new hosts - https://phabricator.wikimedia.org/T230106 [08:28:15] Or making a request from Phabricator which is also 2FA enabled. [08:28:16] (03CR) 10Gehel: [C: 04-1] "As suggested by Joe, it would be better to wrap jstack / jmap / etc... and give them the appropriate access. Tracked inhttps://phabricator" [puppet] - 10https://gerrit.wikimedia.org/r/530969 (owner: 10EBernhardson) [08:29:23] I can't check who has 2FA access on Phab, so that would have to involve a Phab admin [08:29:38] ...who has 2FA enabled... [08:29:45] tgr: I'm happy to prove it, but the best way on top of that would be to get in a call with you and screen-share, showing you that I can access my accounts with 2FA. :-) [08:30:05] Except IABot. [08:30:37] Or maybe show you my Wikimania badge as well. :-) [08:31:32] tgr: Doc James and I have met there, he knows me. 
[08:32:01] But I don't have a means to make a video call with him as I don't have contact info, beyond an email. [08:32:46] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Config: Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 (10Joe) >>! In T224033#5402709, @CDanis wrote: > Here is my two lepta: I did miss this comment last week. Thanks for reviving this.... [08:37:02] (03PS3) 10Gehel: Mjolnir bulk daemon to read from jumbo-eqiad cluster [puppet] - 10https://gerrit.wikimedia.org/r/530966 (owner: 10EBernhardson) [08:39:02] (03CR) 10Gehel: [C: 03+2] Mjolnir bulk daemon to read from jumbo-eqiad cluster [puppet] - 10https://gerrit.wikimedia.org/r/530966 (owner: 10EBernhardson) [08:43:05] Cyberpower678: Commons has an image of you so a video call should suffice [08:43:19] I think we met in passing at one of the hackathons, too [08:43:33] (03PS1) 10Volans: Set permissions for ldap/ops [software/homer] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/531123 [08:43:42] Cool. But the 2FA has already been reset. :-) [08:44:32] (03CR) 10Volans: [V: 03+2 C: 03+2] "Fix permissions" [software/homer] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/531123 (owner: 10Volans) [08:45:14] cool! please don't forget to file a task about the error. 
[08:45:19] (03CR) 10Gehel: [C: 03+2] [elastic] log slow index ops to its own log file [puppet] - 10https://gerrit.wikimedia.org/r/530950 (owner: 10DCausse) [08:45:30] (03PS2) 10Gehel: [elastic] log slow index ops to its own log file [puppet] - 10https://gerrit.wikimedia.org/r/530950 (owner: 10DCausse) [08:45:33] (03CR) 10Volans: [C: 03+2] Initial structure of the project [software/homer] - 10https://gerrit.wikimedia.org/r/530860 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [08:45:48] (03CR) 10Volans: "recheck" [software/homer] - 10https://gerrit.wikimedia.org/r/530861 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [08:45:54] (03CR) 10Volans: "recheck" [software/homer] - 10https://gerrit.wikimedia.org/r/530862 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [08:47:58] (03Merged) 10jenkins-bot: Initial structure of the project [software/homer] - 10https://gerrit.wikimedia.org/r/530860 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [08:48:37] It happened on test.wikipedia.org, so maybe not so important? [08:48:45] (03CR) 10jenkins-bot: Initial structure of the project [software/homer] - 10https://gerrit.wikimedia.org/r/530860 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [08:48:57] (03CR) 10Volans: [C: 03+2] Initial draft of the CLI [software/homer] - 10https://gerrit.wikimedia.org/r/530861 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [08:49:23] Man. I feel so relieved now. When I saw what happened, I was like "Oh fuck no!!" 
[08:50:51] (03Merged) 10jenkins-bot: Initial draft of the CLI [software/homer] - 10https://gerrit.wikimedia.org/r/530861 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [08:51:42] (03CR) 10jenkins-bot: Initial draft of the CLI [software/homer] - 10https://gerrit.wikimedia.org/r/530861 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [08:52:27] (03CR) 10Volans: [C: 03+2] Initial draft of devices configuration parsing [software/homer] - 10https://gerrit.wikimedia.org/r/530862 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [08:53:40] (03PS3) 10Gehel: Remove PrivateTmp=true from elasticsearch_6@ systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/530969 (owner: 10EBernhardson) [08:53:59] (03CR) 10Gehel: Remove PrivateTmp=true from elasticsearch_6@ systemd unit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/530969 (owner: 10EBernhardson) [08:54:15] 10Operations, 10Discovery-Search (Current work): Run jstack / jmap / etc... with PrivateTmp=true - https://phabricator.wikimedia.org/T230774 (10MoritzMuehlenhoff) If you just want to run a command manually that's already possible using nsenter, e.g. on my Buster workstation I have systemd-timesynd running with... [08:54:23] (03Merged) 10jenkins-bot: Initial draft of devices configuration parsing [software/homer] - 10https://gerrit.wikimedia.org/r/530862 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [08:55:07] (03CR) 10Muehlenhoff: [C: 04-1] "Looking into this more closely, we can simply keep PrivateTmp and just use nsenter from util-linux, I've left some comments at" [puppet] - 10https://gerrit.wikimedia.org/r/530969 (owner: 10EBernhardson) [08:56:05] (03CR) 10jenkins-bot: Initial draft of devices configuration parsing [software/homer] - 10https://gerrit.wikimedia.org/r/530862 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [08:59:14] 10Operations, 10Discovery-Search (Current work): Run jstack / jmap / etc... 
with PrivateTmp=true - https://phabricator.wikimedia.org/T230774 (10Gehel) Nice! I did not know that one. So as an example, jstack can be called with: ` gehel@elastic2050:~$ sudo systemctl status elasticsearch_6@production-search-cod... [09:00:18] !log Starting 2nd smoketest of termbox service on eqiad: T229907 [09:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:26] T229907: Synthetic Load Test - https://phabricator.wikimedia.org/T229907 [09:14:20] (03CR) 10Filippo Giunchedi: [C: 03+2] mediawiki: add cluster latency alerts [puppet] - 10https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [09:14:29] (03PS6) 10Filippo Giunchedi: mediawiki: add cluster latency alerts [puppet] - 10https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) [09:15:02] (03PS3) 10DannyS712: Fix typos in code [puppet] - 10https://gerrit.wikimedia.org/r/530989 (https://phabricator.wikimedia.org/T201491) [09:19:04] 10Operations, 10DBA: Decommission db2051.codfw.wmnet - https://phabricator.wikimedia.org/T230777 (10Marostegui) [09:19:37] 10Operations, 10DBA: Decommission db2051.codfw.wmnet - https://phabricator.wikimedia.org/T230778 (10Marostegui) [09:19:45] (03PS4) 10DannyS712: Fix typos in code [puppet] - 10https://gerrit.wikimedia.org/r/530989 (https://phabricator.wikimedia.org/T201491) [09:20:15] 10Operations, 10DBA: Decommission db2056.codfw.wmnet - https://phabricator.wikimedia.org/T230777 (10Marostegui) p:05Triage→03Normal [09:21:03] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [09:21:27] 10Operations, 10DBA: Decommission db2051.codfw.wmnet - https://phabricator.wikimedia.org/T230778 (10Marostegui) p:05Triage→03Normal [09:25:06] 10Operations, 10Discovery-Search (Current work): Run jstack / jmap / etc... 
with PrivateTmp=true - https://phabricator.wikimedia.org/T230774 (10Joe) You can automate the process with a simple alias: `lang=bash systemctl-jstack() { if [ -z $1 ] { echo "Please provide a service name." echo "Usage: systemc... [09:26:31] (03CR) 10Gilles: [C: 03+1] varnishlog: request/response headers to send to logstash [puppet] - 10https://gerrit.wikimedia.org/r/520425 (https://phabricator.wikimedia.org/T189333) (owner: 10Ema) [09:29:30] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db2051,db2056 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531133 (https://phabricator.wikimedia.org/T230777) [09:32:59] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Remove db2051,db2056 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531133 (https://phabricator.wikimedia.org/T230777) (owner: 10Marostegui) [09:34:03] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2051,db2056 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531133 (https://phabricator.wikimedia.org/T230777) (owner: 10Marostegui) [09:34:19] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2051,db2056 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531133 (https://phabricator.wikimedia.org/T230777) (owner: 10Marostegui) [09:35:38] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2051 and db2056 from config T230777 T230778 (duration: 00m 48s) [09:35:45] 10Operations, 10Discovery-Search (Current work): Run jstack / jmap / etc... with PrivateTmp=true - https://phabricator.wikimedia.org/T230774 (10MoritzMuehlenhoff) BTW, nsenter can also be used for other namespace types (e.g. network namespaces), one can print all namespaces currently in use via "sudo lsns" (or... 
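The nsenter approach discussed on T230774 above can be sketched as a small shell helper. This is a minimal sketch and not the alias posted on the task, which is truncated in this log. The function name `nsenter_jvm_cmd` and its three arguments are hypothetical, and the helper only prints the command instead of running it. It assumes the JVM tool has to run inside the service's mount namespace so that it sees the same private /tmp as the JVM.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the nsenter command that would run a JVM tool
# (jstack, jmap, ...) inside the mount namespace of a service's main process.
# Nothing is executed here; the caller decides whether to run the output.
nsenter_jvm_cmd() {
    local tool="$1" pid="$2" user="$3"
    if [ -z "$tool" ] || [ -z "$pid" ] || [ -z "$user" ]; then
        echo "Usage: nsenter_jvm_cmd <tool> <pid> <user>" >&2
        return 1
    fi
    # -t selects the target process, -m enters its mount namespace,
    # which is the one holding the PrivateTmp /tmp.
    echo "sudo nsenter -t ${pid} -m sudo -u ${user} ${tool} ${pid}"
}
```

On a systemd version that supports `--value`, the PID could be obtained with `systemctl show -p MainPID --value "$unit"`, where `$unit` is the full service name. The user the JVM runs as is an assumption to check per host.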
[09:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:47] T230778: Decommission db2051.codfw.wmnet - https://phabricator.wikimedia.org/T230778 [09:35:48] T230777: Decommission db2056.codfw.wmnet - https://phabricator.wikimedia.org/T230777 [09:40:03] 10Operations, 10DBA, 10Patch-For-Review: Decommission db2056.codfw.wmnet - https://phabricator.wikimedia.org/T230777 (10Marostegui) [09:40:07] (03PS1) 10Marostegui: mariadb: Decommission db2051,db2056 [puppet] - 10https://gerrit.wikimedia.org/r/531136 (https://phabricator.wikimedia.org/T230777) [09:40:11] 10Operations, 10DBA, 10Patch-For-Review: Decommission db2051.codfw.wmnet - https://phabricator.wikimedia.org/T230778 (10Marostegui) [09:41:54] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me (to the extent the kludgy kprop mechanism can look good :-)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/529733 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [09:43:18] 10Operations, 10DBA, 10Patch-For-Review: Decommission db2051.codfw.wmnet - https://phabricator.wikimedia.org/T230778 (10Marostegui) [09:44:47] 10Operations, 10DBA, 10Patch-For-Review: Decommission db2056.codfw.wmnet - https://phabricator.wikimedia.org/T230777 (10Marostegui) [09:44:48] (03PS1) 10Filippo Giunchedi: mediawiki: remove dashboard_links urlencoding [puppet] - 10https://gerrit.wikimedia.org/r/531137 (https://phabricator.wikimedia.org/T230396) [09:45:35] (03CR) 10Filippo Giunchedi: [C: 03+2] mediawiki: remove dashboard_links urlencoding [puppet] - 10https://gerrit.wikimedia.org/r/531137 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [09:46:54] (03PS5) 10Elukey: profile::kerberos::kadminserver: add support for replication [puppet] - 10https://gerrit.wikimedia.org/r/529733 (https://phabricator.wikimedia.org/T226089) [09:47:45] (03CR) 10Ema: [C: 03+1] ATS: Disable config status check for TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/531018 
(https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [09:47:53] (03CR) 10Ema: [C: 03+1] ATS: Enable TCP Fast Open for the TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/531027 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [09:49:25] (03CR) 10Elukey: [C: 03+2] profile::kerberos::kadminserver: add support for replication [puppet] - 10https://gerrit.wikimedia.org/r/529733 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [09:49:36] (03PS6) 10Elukey: profile::kerberos::kadminserver: add support for replication [puppet] - 10https://gerrit.wikimedia.org/r/529733 (https://phabricator.wikimedia.org/T226089) [09:50:53] (03PS2) 10Marostegui: mariadb: Decommission db2051,db2056 [puppet] - 10https://gerrit.wikimedia.org/r/531136 (https://phabricator.wikimedia.org/T230777) [09:51:56] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2051,db2056 [puppet] - 10https://gerrit.wikimedia.org/r/531136 (https://phabricator.wikimedia.org/T230777) (owner: 10Marostegui) [09:52:33] !log Remove db2051 and db2056 from tendril and zarcillo - T230777 T230778 [09:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:42] T230778: Decommission db2051.codfw.wmnet - https://phabricator.wikimedia.org/T230778 [09:52:42] T230777: Decommission db2056.codfw.wmnet - https://phabricator.wikimedia.org/T230777 [09:53:09] (03PS1) 10Muehlenhoff: Add failoid1001/2001 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/531141 (https://phabricator.wikimedia.org/T224559) [09:53:12] 10Operations, 10DBA, 10Patch-For-Review: Decommission db2051.codfw.wmnet - https://phabricator.wikimedia.org/T230778 (10Marostegui) [09:53:16] 10Operations, 10vm-requests: eqiad/codfw: One VM for Failoid - https://phabricator.wikimedia.org/T229903 (10MoritzMuehlenhoff) 05Open→03Resolved VMs have been created. 
[09:53:23] 10Operations, 10DBA, 10Patch-For-Review: Decommission db2056.codfw.wmnet - https://phabricator.wikimedia.org/T230777 (10Marostegui) [09:54:13] (03PS1) 10Filippo Giunchedi: mediawiki: remove per-host high CPU alerts [puppet] - 10https://gerrit.wikimedia.org/r/531142 (https://phabricator.wikimedia.org/T230396) [09:54:46] (03CR) 10Filippo Giunchedi: [C: 04-1] "To be merged once latency alerts have had time to bake" [puppet] - 10https://gerrit.wikimedia.org/r/531142 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [09:55:03] (03CR) 10Volans: [C: 03+1] "\o/ LGTM, have you tried a compiler JIC?" [puppet] - 10https://gerrit.wikimedia.org/r/531141 (https://phabricator.wikimedia.org/T224559) (owner: 10Muehlenhoff) [09:58:23] (03PS1) 10Giuseppe Lavagetto: profile::tlsproxy::envoy: use full systemd unit on jessie [puppet] - 10https://gerrit.wikimedia.org/r/531143 [09:58:27] <_joe_> ema: ^^ [09:59:04] (03PS3) 10Muehlenhoff: Add DNS entries for Buster puppetdb instances [dns] - 10https://gerrit.wikimedia.org/r/530835 (https://phabricator.wikimedia.org/T230609) [10:04:01] (03PS1) 10Ema: Revert "ATS: leave AE removal to Lua" [puppet] - 10https://gerrit.wikimedia.org/r/531144 (https://phabricator.wikimedia.org/T227432) [10:04:05] (03PS7) 10Elukey: profile::kerberos::kadminserver: add support for replication [puppet] - 10https://gerrit.wikimedia.org/r/529733 (https://phabricator.wikimedia.org/T226089) [10:04:07] (03PS1) 10Ema: Revert "ATS: unset Accept-Encoding" [puppet] - 10https://gerrit.wikimedia.org/r/531145 (https://phabricator.wikimedia.org/T227432) [10:04:30] 10Operations, 10DBA: Switchover s8 (wikidata) primary database master db1104 -> db1109 - https://phabricator.wikimedia.org/T230762 (10Marostegui) [10:05:30] (03CR) 10Ema: [C: 03+1] profile::tlsproxy::envoy: use full systemd unit on jessie [puppet] - 10https://gerrit.wikimedia.org/r/531143 (owner: 10Giuseppe Lavagetto) [10:07:11] _joe_: looks good but a single tear was shed for jessie 
packages w/o systemd unit [10:07:30] <_joe_> ema: don't want to see how that sausage is made [10:07:54] (03PS2) 10Giuseppe Lavagetto: profile::tlsproxy::envoy: use full systemd unit on jessie [puppet] - 10https://gerrit.wikimedia.org/r/531143 [10:08:18] <_joe_> I suspect it's a pure binary package built with some automatic tool [10:11:36] 10Operations, 10DBA: Switchover s8 (wikidata) primary database master db1104 -> db1109 - 10th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230762 (10Marostegui) [10:11:55] !log termbox 2nd smoketests finished [10:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:52] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::tlsproxy::envoy: use full systemd unit on jessie [puppet] - 10https://gerrit.wikimedia.org/r/531143 (owner: 10Giuseppe Lavagetto) [10:13:59] 10Operations, 10DBA: Switchover s3 primary database master db1075 -> db1078 - 24th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230783 (10Marostegui) [10:14:02] (03PS16) 10Mathew.onipe: Add maps reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) [10:14:12] 10Operations, 10DBA: Switchover s3 primary database master db1075 -> db1078 - 24th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230783 (10Marostegui) p:05Triage→03Normal [10:15:57] 10Operations, 10DBA: Switchover s4 (commonswiki) primary database master db1081 -> db1138 - 26th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230784 (10Marostegui) [10:16:08] 10Operations, 10DBA: Switchover s4 (commonswiki) primary database master db1081 -> db1138 - 26th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230784 (10Marostegui) p:05Triage→03Normal [10:17:02] (03PS1) 10Urbanecm: Assign all rights assigned to suppress group to oversight group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531147 (https://phabricator.wikimedia.org/T230601) [10:17:21] (03PS2) 10Ema: Revert "ATS: leave AE removal to Lua" 
[puppet] - 10https://gerrit.wikimedia.org/r/531144 (https://phabricator.wikimedia.org/T227432) [10:18:06] (03CR) 10Ema: [C: 03+2] Revert "ATS: leave AE removal to Lua" [puppet] - 10https://gerrit.wikimedia.org/r/531144 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [10:18:28] (03PS2) 10Ema: Revert "ATS: unset Accept-Encoding" [puppet] - 10https://gerrit.wikimedia.org/r/531145 (https://phabricator.wikimedia.org/T227432) [10:20:21] (03CR) 10Ema: [C: 03+2] Revert "ATS: unset Accept-Encoding" [puppet] - 10https://gerrit.wikimedia.org/r/531145 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [10:21:22] 10Operations, 10DBA: Switchover s2 primary database master db1066 -> db1122 - 17th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230785 (10Marostegui) [10:21:34] 10Operations, 10DBA: Switchover s2 primary database master db1066 -> db1122 - 17th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230785 (10Marostegui) p:05Triage→03Normal [10:21:56] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [10:21:59] 10Operations, 10DBA: Switchover s2 primary database master db1066 -> db1122 - 17th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230785 (10Marostegui) [10:22:37] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10mmodell) Ok I've created https://phabricator.wikimedia.org/phame/blog/view/15/ but I woul... 
[10:26:35] (03CR) 10Jbond: "> Patch Set 4: Code-Review+1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528521 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [10:28:17] (03CR) 10Jbond: [C: 03+2] "LGTM will merge" [puppet] - 10https://gerrit.wikimedia.org/r/531044 (owner: 10Muehlenhoff) [10:28:25] (03PS2) 10Jbond: Change mail address for fsero [puppet] - 10https://gerrit.wikimedia.org/r/531044 (owner: 10Muehlenhoff) [10:29:09] (03CR) 10Muehlenhoff: "He didn't sign it yet, though" [puppet] - 10https://gerrit.wikimedia.org/r/531044 (owner: 10Muehlenhoff) [10:29:57] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/531044 (owner: 10Muehlenhoff) [10:30:15] !log cp5002: restart trafficserver for compress.so config change [10:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:12] (03CR) 10Daimona Eaytoy: [C: 04-1] "This should be meant to work for any extension assigning rights to 'suppress' - actually, you could also remove the comment. It probably h" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531147 (https://phabricator.wikimedia.org/T230601) (owner: 10Urbanecm) [10:33:17] (03CR) 10Jbond: [C: 03+2] Change mail address for fsero [puppet] - 10https://gerrit.wikimedia.org/r/531044 (owner: 10Muehlenhoff) [10:36:15] Urbanecm hey! 
[10:36:20] 10Operations, 10DBA: Switchover s2 primary database master db1066 -> db1122 - 17th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230785 (10Marostegui) [10:36:23] 10Operations, 10DBA: Switchover s4 (commonswiki) primary database master db1081 -> db1138 - 26th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230784 (10Marostegui) [10:36:29] 10Operations, 10DBA: Switchover s3 primary database master db1075 -> db1078 - 24th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230783 (10Marostegui) [10:36:32] 10Operations, 10DBA: Switchover s8 (wikidata) primary database master db1104 -> db1109 - 10th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230762 (10Marostegui) [10:36:49] I'm skimming through CommonSettings, but I don't have much time now... We can test after lunchtime if it's OK for you [10:39:09] Daimona: Hi! [10:39:17] Well i'm in metro now, so... [10:39:27] ...not much time :: [10:39:57] Oh well :D [10:43:36] (03PS5) 10Jbond: puppetmaster::frontend: update web conf to use RewriteRules instead of proxypass [puppet] - 10https://gerrit.wikimedia.org/r/528521 (https://phabricator.wikimedia.org/T228657) [10:45:29] (03PS2) 10Muehlenhoff: Add failoid1001/2001 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/531141 (https://phabricator.wikimedia.org/T224559) [10:51:04] !log Stop MySQL on db2051 and db2056 for decommission T230777 T230778 [10:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:14] T230778: Decommission db2051.codfw.wmnet - https://phabricator.wikimedia.org/T230778 [10:51:14] T230777: Decommission db2056.codfw.wmnet - https://phabricator.wikimedia.org/T230777 [10:51:51] 10Operations, 10DBA: Decommission db2056.codfw.wmnet - https://phabricator.wikimedia.org/T230777 (10Marostegui) [10:51:57] 10Operations, 10DBA: Decommission db2051.codfw.wmnet - https://phabricator.wikimedia.org/T230778 (10Marostegui) [10:52:33] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission 
db2051.codfw.wmnet - https://phabricator.wikimedia.org/T230778 (10Marostegui) a:05Marostegui→03RobH This host is ready for #DC-ops to decommission [10:53:49] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2056.codfw.wmnet - https://phabricator.wikimedia.org/T230777 (10Marostegui) a:05Marostegui→03RobH This host is ready for #DC-ops to decommission [10:54:26] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [10:54:57] (03CR) 10Muehlenhoff: [C: 03+2] Add failoid1001/2001 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/531141 (https://phabricator.wikimedia.org/T224559) (owner: 10Muehlenhoff) [10:55:05] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [10:55:09] (03PS1) 10Jbond: IPv6: add static IPv6 addresses to spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/531149 [10:56:12] (03CR) 10jerkins-bot: [V: 04-1] IPv6: add static IPv6 addresses to spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/531149 (owner: 10Jbond) [10:58:44] marostegui: I'm gonna try again today switching to new store on clients, after we backport the fixes https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190820T1100 [10:58:53] alaa_wmde: ok [10:58:55] will let you know once it is live if you wanna look at some stuff [10:59:02] thanks, will do [10:59:07] thank you :) [10:59:58] (03PS1) 10Jbond: ipv6: add mapped ipv6 to yubiauth server [puppet] - 10https://gerrit.wikimedia.org/r/531151 [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190820T1100). [11:00:04] alaa_wmde: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:30] (03CR) 10jerkins-bot: [V: 04-1] ipv6: add mapped ipv6 to yubiauth server [puppet] - 10https://gerrit.wikimedia.org/r/531151 (owner: 10Jbond) [11:01:18] (03PS2) 10Jbond: ipv6: add mapped ipv6 to yubiauth server [puppet] - 10https://gerrit.wikimedia.org/r/531151 (https://phabricator.wikimedia.org/T102099) [11:01:46] (03CR) 10jerkins-bot: [V: 04-1] ipv6: add mapped ipv6 to yubiauth server [puppet] - 10https://gerrit.wikimedia.org/r/531151 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [11:02:13] hello [11:02:38] anyone doing swat today? [11:04:17] (03PS3) 10Jbond: ipv6: add mapped ipv6 to yubiauth server [puppet] - 10https://gerrit.wikimedia.org/r/531151 (https://phabricator.wikimedia.org/T102099) [11:14:02] (03PS2) 10Jbond: IPv6: add static IPv6 addresses to spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/531149 [11:14:58] (03CR) 10jerkins-bot: [V: 04-1] IPv6: add static IPv6 addresses to spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/531149 (owner: 10Jbond) [11:16:18] (03Abandoned) 10Jbond: IPv6: add static IPv6 addresses to spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/531149 (owner: 10Jbond) [11:17:27] alaa_wmde: still no luck? [11:18:39] Urbanecm: nope asking around in wikidata team .. but can be that no one is free atm [11:18:51] Urbanecm: nope asking around in wikidata team .. but can be that no one is free atm [11:19:25] Well I'm currently at a hospital visit with a friend, so I can't swat :/ [11:19:38] (03PS1) 10Jbond: syslog::centralserver: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531153 [11:19:39] If you're available during Morning SWAT, I can do it [11:19:41] Otherwise, tomorrow [11:20:12] (03CR) 10Muehlenhoff: [C: 03+1] ipv6: add mapped ipv6 to yubiauth server [puppet] - 10https://gerrit.wikimedia.org/r/531151 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [11:20:16] oh no worries .. 
thanks for asking! and best wishes to your friend. [11:20:56] (03PS1) 10Elukey: Swap analytics-tool1002 with an-tool1007 in caching config [puppet] - 10https://gerrit.wikimedia.org/r/531154 (https://phabricator.wikimedia.org/T230709) [11:21:09] (03PS1) 10Ema: ATS: enable systemd hardening on all hosts [puppet] - 10https://gerrit.wikimedia.org/r/531155 [11:21:30] (03CR) 10Muehlenhoff: [C: 03+2] Add DNS entries for Buster puppetdb instances [dns] - 10https://gerrit.wikimedia.org/r/530835 (https://phabricator.wikimedia.org/T230609) (owner: 10Muehlenhoff) [11:22:33] (03PS2) 10Jbond: syslog::centralserver: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531153 (https://phabricator.wikimedia.org/T102099) [11:27:34] alaa_wmde: Better late than never? I'm happy to deploy things. [11:28:20] :) that'd be great yes [11:28:28] awesome, here goes! [11:28:56] (03PS1) 10Jbond: builder: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531156 (https://phabricator.wikimedia.org/T102099) [11:31:59] alaa_wmde: CI is going to take a while for the first change. Just to confirm, the Wikibase patch must be deployed before the config change, right? [11:32:14] yes that's right [11:32:19] (03PS1) 10Jbond: spare::system: add ipv6 mapped addres [puppet] - 10https://gerrit.wikimedia.org/r/531157 (https://phabricator.wikimedia.org/T102099) [11:32:34] I'm keeping an eye on CI .. 
can ping you once its done ;) [11:32:57] (03CR) 10Ema: [C: 03+2] ATS: enable systemd hardening on all hosts [puppet] - 10https://gerrit.wikimedia.org/r/531155 (owner: 10Ema) [11:33:27] (03CR) 10jerkins-bot: [V: 04-1] spare::system: add ipv6 mapped addres [puppet] - 10https://gerrit.wikimedia.org/r/531157 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [11:33:31] (03PS5) 10DannyS712: Fix typos in code [puppet] - 10https://gerrit.wikimedia.org/r/530989 (https://phabricator.wikimedia.org/T201491) [11:38:21] jouncebot: next [11:38:22] In 0 hour(s) and 21 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190820T1200) [11:38:29] jouncebot: now [11:38:29] For the next 0 hour(s) and 21 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190820T1100) [11:38:29] alaa_wmde: I see this is a second attempt--do you think there's some risk involved and we there's a chance we might have to roll back today as well? Just wondering what to expect. [11:38:48] Urbanecm: any SWAT happening now? [11:39:01] hauskatze: [11:39:05] hauskatze: I'm deploying two SWAT changes, but lmk if there's a reason to abort, it's not too late. [11:39:08] hauskatze: yes we are on a patch [11:39:41] awight: ack, no reason to abort. [11:40:04] I wanted to deploy one but maybe it's too late && not urgent [11:40:07] awight: after the backported fixes I do not expect any need for a rollback .. I'll be watching DB traffic as well, but I don't expect problems there either [11:40:09] hauskatze: okay thanks, I'll ping you when I finish. [11:40:34] hauskatze: There might be a few minutes left over... [11:40:46] awight: don't worry, I'm going AFK for horrible family lunch. 
[11:41:06] hehe that sounds fun [11:41:29] Yeah, fireworks :P [11:42:42] (03PS1) 10Elukey: role::analytics_test_cluster::client: include SWAP/notebooks [puppet] - 10https://gerrit.wikimedia.org/r/531158 (https://phabricator.wikimedia.org/T226698) [11:42:53] alaa_wmde: Is it true that the Wikibase patch will be untestable in debug / noop without the config? [11:43:59] yeap without the config it is untestable (or at least I can't think of a way to test it) [11:44:29] okay, I'll skip mwdebug1002 in that case. [11:44:46] 👍 [11:48:18] (03CR) 10Elukey: [V: 03+2 C: 03+2] role::analytics_test_cluster::client: include SWAP/notebooks [puppet] - 10https://gerrit.wikimedia.org/r/531158 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [11:49:01] (03PS1) 10Jbond: mariadb::core: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531161 (https://phabricator.wikimedia.org/T102099) [11:49:48] jouncebot: next [11:49:48] In 0 hour(s) and 10 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190820T1200) [11:49:50] tfw everything seems to be wrapping up, then PHPUnit is suddenly like "12/4667" [11:49:59] jouncebot: now [11:49:59] For the next 0 hour(s) and 10 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190820T1100) [11:52:44] wow oh wow so it is at 30% PHPUnit and Jenkins things it is almost done .. tells already how long the whole build is [11:53:32] alaa_wmde: I don't understand what happened here, but do we need to revert the config change? https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/529807/ [11:54:09] hmm there should another Revert of that (which we would then revert) [11:54:38] ah maybe the wrong patch was linked on the deployments schedule? [11:54:57] it might be .. maybe I need to create a new one to revert the first revert [11:55:11] can we reuse the same one? 
[11:55:30] ah no ignore me that doesn't make sense [11:55:31] just a sec [11:55:37] +1 no worries [11:56:25] (03PS1) 10Alaa Sarhan: Revert "Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on client wikis"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531162 [11:56:31] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/531162 [11:56:37] ty [11:56:54] Revert^4 is hella real-life [11:57:00] :) will fix the schedule too for history [11:57:13] haha it is indeed .. I hope we are not breaking a record though [11:57:50] I think I saw Revert^6 recently [11:58:02] Didn't feel like it was polite to make a scene though ;-) [12:00:01] yeah .. we might benefit from making that a metric to optimize :D lowering revert-chain-length (will lead to more/longer/better testing pipeline) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190820T1200) [12:01:08] we are overlapping with this sanity break .. is that fine? [12:05:53] Oops I was spamming releng [12:05:55] !log awight@deploy1001 Synchronized php-1.34.0-wmf.17/extensions/Wikibase: SWAT: [[gerrit:530845|Initialize DatabaseTermIdsResolver and DatabaseTypeIdsStore with repo database name in client. 
(T230119, T225053)]] (duration: 00m 52s) [12:05:55] (03PS1) 10Jbond: mariadb::core_multiinstance: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531164 (https://phabricator.wikimedia.org/T102099) [12:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:05] T225053: Switch `tmpPropertyTermsMigrationStage` to MIGRATION_WRITE_NEW - https://phabricator.wikimedia.org/T225053 [12:06:05] T230119: New term store connects to the wrong host in clients - https://phabricator.wikimedia.org/T230119 [12:06:27] (03CR) 10Awight: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531162 (owner: 10Alaa Sarhan) [12:07:38] Deployers: I'm still finishing a SWAT deploy, apologies for the overlap with the sanity break. [12:10:08] alaa_wmde: Can the config side of this wait for another window? CI is waiting on some long jobs, my instinct is to cancel the config merge. [12:10:47] yes sure .. we can do it tomorrow as well .. now that we got the backport it will be much faster next time [12:11:35] alaa_wmde: on a second look, our jobs just kicked in. I'll go ahead with pushing to mwdebug1002 [12:11:50] okay nice [12:12:02] > Change 531162 - Merge Conflict [12:12:21] (03CR) 10Awight: Revert "Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on client wikis"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531162 (owner: 10Alaa Sarhan) [12:12:35] I've canceled. [12:12:45] What a roller coaster! [12:12:50] (03PS2) 10Alaa Sarhan: Revert "Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on client wikis"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531162 [12:13:54] that repo is weird .. although there's no real conflict with this change https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/531133/ it still shows merge conflict [12:14:00] I'm going to leave it like this, thanks for the help and responses. 
I think the train doesn't need any extra challenges, it already includes an extra week of changes due to Wikimania. [12:14:05] okay let's schedule the config for tomorrow [12:14:15] alaa_wmde: Yeah I think there's some extra-cautious gerrit config for that repo. [12:14:55] thanks a lot awight for your help :) [12:15:03] (03CR) 10Marostegui: "Yeah, I would feel more comfortable if we split them." [puppet] - 10https://gerrit.wikimedia.org/r/531161 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [12:15:29] (03CR) 10Marostegui: "As I mentioned on the other patch, let's do codfw only first." [puppet] - 10https://gerrit.wikimedia.org/r/531164 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [12:15:44] Of course, and sorry for the late start! [12:15:53] (03PS1) 10Muehlenhoff: Switch Failoid in codfw to failoid2001 [puppet] - 10https://gerrit.wikimedia.org/r/531165 (https://phabricator.wikimedia.org/T224449) [12:17:31] Deployers: please note that I found possibly undeployed changes in CheckUser, MobileFrontend, and TimedMediaHandler. 
[12:21:26] (03PS1) 10Jbond: role::mariadb::misc - codfw: add ipv6 mapped [puppet] - 10https://gerrit.wikimedia.org/r/531166 (https://phabricator.wikimedia.org/T102099) [12:21:30] (03PS1) 10Jbond: role::mariadb::misc - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531167 (https://phabricator.wikimedia.org/T102099) [12:24:27] (03Restored) 10Jbond: IPv6: add static IPv6 addresses to spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/531149 (owner: 10Jbond) [12:26:07] (03PS3) 10Jbond: IPv6: add static IPv6 addresses to spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/531149 [12:26:15] (03PS1) 10Muehlenhoff: Change DNS entry for puppetdb1002 [dns] - 10https://gerrit.wikimedia.org/r/531168 (https://phabricator.wikimedia.org/T230609) [12:26:23] (03CR) 10DannyS712: "I ~think~ you have the wrong bug tagged in the commit message" [puppet] - 10https://gerrit.wikimedia.org/r/531165 (https://phabricator.wikimedia.org/T224449) (owner: 10Muehlenhoff) [12:26:29] (03CR) 10jerkins-bot: [V: 04-1] Change DNS entry for puppetdb1002 [dns] - 10https://gerrit.wikimedia.org/r/531168 (https://phabricator.wikimedia.org/T230609) (owner: 10Muehlenhoff) [12:28:11] (03PS2) 10Muehlenhoff: Change DNS entry for puppetdb1002 [dns] - 10https://gerrit.wikimedia.org/r/531168 (https://phabricator.wikimedia.org/T230609) [12:28:47] (03CR) 10Muehlenhoff: "@DannyS712, sorry. Fixing." 
[puppet] - 10https://gerrit.wikimedia.org/r/531165 (https://phabricator.wikimedia.org/T224449) (owner: 10Muehlenhoff) [12:29:24] (03PS2) 10Muehlenhoff: Switch Failoid in codfw to failoid2001 [puppet] - 10https://gerrit.wikimedia.org/r/531165 (https://phabricator.wikimedia.org/T224559) [12:32:16] (03CR) 10Jbond: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/531168 (https://phabricator.wikimedia.org/T230609) (owner: 10Muehlenhoff) [12:35:23] (03PS1) 10MarcoAurelio: Restrict account creation on es.wikiquote to 1 day/IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531169 (https://phabricator.wikimedia.org/T230796) [12:37:52] (03PS1) 10MarcoAurelio: Enable DNS blacklist for es.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531171 (https://phabricator.wikimedia.org/T230796) [12:37:54] (03PS1) 10DannyS712: Clean up `wgNamespacesToBeSearchedDefault` to remove unneeded entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531172 (https://phabricator.wikimedia.org/T230797) [12:40:03] We have an spike of errors [12:40:06] alaa_wmde: ^ ? [12:40:24] checking [12:40:35] * awight looks [12:40:46] (03PS2) 10DannyS712: Clean up `wgNamespacesToBeSearchedDefault` to remove unneeded entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531172 (https://phabricator.wikimedia.org/T230797) [12:40:53] from what I can see it is mostly on nlwiktionary? [12:41:27] * addshore reads up [12:41:46] and on host: snapshot1005 [12:41:51] yep [12:42:04] Coming from dumpBackup.php ? [12:42:24] #2 /srv/mediawiki/php-1.34.0-wmf.17/includes/Storage/SqlBlobStore.php(438): MediaWiki\Storage\SqlBlobStore->decompressData(string, array) [12:42:29] awight: looks so, yes [12:42:53] gonna guess it hit a revision in external storage that doesn't gzip decode properly? 
[12:42:54] I don't know anything about the job, but it sounds like a corrupt .gz file is responsible [12:43:03] (03PS2) 10Jbond: mariadb::core_multiinstance - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531164 (https://phabricator.wikimedia.org/T102099) [12:43:05] (03PS1) 10Jbond: mariadb::core_multiinstance - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531173 (https://phabricator.wikimedia.org/T102099) [12:43:26] cdanis's guess makes more sense [12:43:35] unfortunately there's no details in the logstash record as to which page [12:44:00] PHP Warning: MediaWiki\Storage\SqlBlobStore::fetchBlob: Bad data in text row 2350. [Called from MediaWiki\Storage\SqlBlobStore::fetchBlob in /srv/mediawiki/php-1.34.0-wmf.17/includes/Storage/SqlBlobStore.php at line 361] [12:44:06] I don't know what 'text row 2350' means [12:44:35] oh, there are actually these errors for many distinct text rows [12:45:01] hola marostegui [12:45:02] Looks similar to: https://phabricator.wikimedia.org/T203075 [12:45:17] https://phabricator.wikimedia.org/T203075#5405161 [12:45:17] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 27308 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops [12:45:19] hauskatze: o/ [12:46:07] cdanis awight I think I am going to update that task letting apergos knows this happened again today [12:46:15] marostegui: sounds good [12:46:34] I don't see any notes in the task on how to translate 'text row #' to something actually meaningful in the DB [12:46:43] so I've no idea how to investigate further [12:46:54] should Wiktionary dumpBackup at any level interact with wikidata as a client? 
[12:47:03] cdanis: I believe that row is the old_id id [12:47:19] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, acl wise we should be fine I think since the hosts won't have AAAA records yet" [puppet] - 10https://gerrit.wikimedia.org/r/531153 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [12:48:04] confirmed [12:48:04] (03CR) 10Muehlenhoff: [C: 03+2] Change DNS entry for puppetdb1002 [dns] - 10https://gerrit.wikimedia.org/r/531168 (https://phabricator.wikimedia.org/T230609) (owner: 10Muehlenhoff) [12:48:13] that 2350 doesn't have a reference on external storage [12:48:39] I will update the task [12:49:13] marostegui: so it is unrelated to the backpor we did in swat today right? [12:49:18] I believe so [12:49:51] marostegui: here's a logstash permalink to include in the task: https://logstash.wikimedia.org/goto/ddb6a35051e297d5f6fed409a7d65fba [12:50:08] cdanis: thanks <3 [12:50:36] ACKNOWLEDGEMENT - Device not healthy -SMART- on cloudelastic1002 is CRITICAL: cluster=cloudelastic device=sdb instance=cloudelastic1002:9100 job=node site=eqiad Muehlenhoff T230088 https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudelastic1002&var-datasource=eqiad+prometheus/ops [12:50:44] * alaa_wmde okay then I'll continue my day, thanks for the ping! [12:50:52] (03PS2) 10Jbond: mariadb::core - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531161 (https://phabricator.wikimedia.org/T102099) [12:50:54] (03PS1) 10Jbond: mariadb::core - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531174 (https://phabricator.wikimedia.org/T102099) [12:51:24] 10Operations, 10Puppet: offboard-user.py: do not hardcode Phabricator project names, use PHID instead - https://phabricator.wikimedia.org/T230516 (10MarcoAurelio) Thanks for merging 530230 @Muehlenhoff. 
May I suggest that we re-run the script again for some of the past offboarded users to see if there's any le... [12:51:29] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 25560 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops [12:52:00] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/531161 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [12:52:44] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/531164 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [12:53:57] (03PS5) 10CDanis: noc: read dbctl JSON from local disk mirror of etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528938 (https://phabricator.wikimedia.org/T229631) [12:54:57] (03CR) 10CDanis: "I believe this is ready to go now. Thanks to Reedy for the help and cleanups" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528938 (https://phabricator.wikimedia.org/T229631) (owner: 10CDanis) [12:56:11] ACKNOWLEDGEMENT - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 25079 MB (5% inode=99%): Gehel shards already leaving, will recover soon https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops [12:57:43] (03PS1) 10Jbond: mariadb::core_test: add mapped ipv6 address [puppet] - 10https://gerrit.wikimedia.org/r/531176 (https://phabricator.wikimedia.org/T102099) [12:57:45] (03PS1) 10Filippo Giunchedi: mediawiki: filter out NaN values for latency alerts [puppet] - 10https://gerrit.wikimedia.org/r/531175 (https://phabricator.wikimedia.org/T230396) [12:58:25] jouncebot: next [12:58:25] In 0 hour(s) and 1 minute(s): MediaWiki train - European version 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190820T1300) [12:58:39] (03CR) 10Filippo Giunchedi: [C: 03+2] mediawiki: filter out NaN values for latency alerts [puppet] - 10https://gerrit.wikimedia.org/r/531175 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [13:00:04] zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - European version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190820T1300). [13:01:47] zeljkof: are you running the train now or is it blocked? [13:02:03] (03CR) 10Marostegui: [C: 03+1] "Let's try this patch in codfw and to this small set of databases there before going for full codfw :)" [puppet] - 10https://gerrit.wikimedia.org/r/531164 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [13:03:00] (03PS2) 10Jbond: role::mariadb::misc - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531167 (https://phabricator.wikimedia.org/T102099) [13:03:19] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [13:03:24] mhhh gerrit slow/unhappy for me, anyone else ? [13:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:55] godog: slow for me as well [13:03:56] godog: yes slow for me as well... 
[13:04:01] same here [13:04:02] had a git review invocation take about a minute [13:04:14] same [13:04:29] oh yeah I see gc times going up on https://grafana.wikimedia.org/d/Bw2mQ3iWz/gerrit-javamelody?orgId=1&from=now-30m&to=now [13:04:38] loading the javamelody monitoring page from gerrit itself is also taking over 15 seconds now [13:04:57] if the deployers haven't started the train yet perhaps it's time for a restart [13:04:58] probably in a gc death spiral [13:05:09] https://grafana.wikimedia.org/d/Bw2mQ3iWz/gerrit-javamelody?panelId=16&fullscreen&orgId=1 [13:05:12] a puppet-merge was slow but completed btw [13:05:29] RECOVERY - Disk space on elastic1018 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops [13:05:30] zeljkof: are you running the train now? [13:06:36] yeah GC times keep increasing, it is dead jim [13:07:26] I'm assuming we're not running the train, I'll bounce gerrit if there are no objections [13:07:38] !log ✔️ cdanis@cobalt.wikimedia.org ~ 🕘 sudo systemctl restart gerrit.service [13:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:52] nice, thanks cdanis [13:08:02] PROBLEM - High average GET latency for mediawiki requests on cluster appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=500 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appser [13:08:02] T [13:08:09] godog: zeljkof has two very idle sessions on deploy1001 so figured no train run [13:08:18] ah thanks, I was wondering why all gerrit actions took 5-10 seconds to complete [13:08:34] gerrit down for me now [13:08:36] PROBLEM - High average POST latency for 
mediawiki requests on cluster appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,302} handler={proxy:fcgi://127.0.0.1:9000,proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+pro [13:08:36] luster=appserver&var-method=POST [13:09:01] the latency alerts is me btw [13:09:27] haha icinga-wm is not quite rendering those links all that well [13:09:34] should I restart gerrit? [13:09:39] looks like it has a misunderstanding of how long lines can be? [13:09:39] marostegui: already being restarted [13:09:45] ah good [13:09:46] thanks [13:09:58] [2019-08-20 13:09:46,591] [main] INFO com.google.gerrit.pgm.Daemon : Gerrit Code Review 2.15.14-16-g855b179b5f ready [13:10:00] cdanis: sorry, just saw this, yes, cutting the branch [13:10:00] it's back [13:10:06] PROBLEM - High average GET latency for mediawiki requests on cluster appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=500 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appser [13:10:06] T [13:10:12] jijiki: running the train [13:10:16] zeljkof: ah okay. 
gerrit was quite slow, so we restarted it, but should be back now [13:10:40] (03CR) 10Marostegui: [C: 03+1] "Let's try to do these tests hosts yep" [puppet] - 10https://gerrit.wikimedia.org/r/531176 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [13:10:42] cdanis: yeah icinga-wm is getting quite hermetic, 13:10 -icinga-wm:#wikimedia-operations- T [13:11:01] zeljkof: ok [13:11:40] (03PS2) 10Jbond: role::mariadb::misc - codfw: add ipv6 mapped [puppet] - 10https://gerrit.wikimedia.org/r/531166 (https://phabricator.wikimedia.org/T102099) [13:11:50] ah, branch cut was stopped at Kartographer [13:12:15] By the gerrit downage? [13:12:23] (03PS2) 10Jbond: mariadb::core_test: add mapped ipv6 address [puppet] - 10https://gerrit.wikimedia.org/r/531176 (https://phabricator.wikimedia.org/T102099) [13:13:15] by gerrit reboot, I think [13:14:22] PROBLEM - High average GET latency for mediawiki requests on cluster appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=500 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appser [13:14:22] T [13:14:22] (03PS2) 10Elukey: Add sre.hadoop.reboot-workers.py [cookbooks] - 10https://gerrit.wikimedia.org/r/530096 (https://phabricator.wikimedia.org/T225297) [13:17:07] (03CR) 10Jbond: [C: 03+2] mariadb::core_test: add mapped ipv6 address [puppet] - 10https://gerrit.wikimedia.org/r/531176 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [13:17:52] PROBLEM - High average GET latency for mediawiki requests on cluster appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=500 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appser [13:17:52] T [13:18:32] I'm looking into the alerts btw, looks like 500s take a whole lot of time [13:18:43] haha I guess that is unsurprising [13:18:56] a lot of 500s are just timeouts [13:19:10] I am checking kibana [13:19:36] indeed, probably ok to alert on latency of success codes [13:19:41] +1 [13:21:14] +1, nothing standing out on kibana btw [13:21:25] 10Operations, 10Analytics: Access to HUE for Mayakpwiki - https://phabricator.wikimedia.org/T229143 (10Nuria) a:03JAllemandou [13:23:08] (03PS1) 10Filippo Giunchedi: mediawiki: alert on latency for non-error status [puppet] - 10https://gerrit.wikimedia.org/r/531188 (https://phabricator.wikimedia.org/T230396) [13:23:19] thanks, jijiki cdanis ^ [13:24:01] (03CR) 10CDanis: [C: 03+1] mediawiki: alert on latency for non-error status [puppet] - 10https://gerrit.wikimedia.org/r/531188 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [13:24:14] PROBLEM - High average GET latency for mediawiki requests on cluster appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=500 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appser [13:24:14] T [13:24:45] (03CR) 10Filippo Giunchedi: [C: 03+2] mediawiki: alert on latency for non-error status [puppet] - 10https://gerrit.wikimedia.org/r/531188 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [13:24:54] (03PS2) 10Filippo Giunchedi: mediawiki: alert on latency for non-error status [puppet] - 10https://gerrit.wikimedia.org/r/531188 (https://phabricator.wikimedia.org/T230396) [13:25:14] (03CR) 10Filippo Giunchedi: 
[V: 03+2 C: 03+2] mediawiki: alert on latency for non-error status [puppet] - 10https://gerrit.wikimedia.org/r/531188 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [13:27:00] PROBLEM - High average GET latency for mediawiki requests on cluster appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=500 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&va [13:29:33] 10Operations, 10Icinga, 10observability: icinga-wm bot truncating long messages - https://phabricator.wikimedia.org/T230799 (10CDanis) [13:30:53] (03PS1) 10Marostegui: mariadb: Promote db1109 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/531189 (https://phabricator.wikimedia.org/T230762) [13:31:11] 10Operations, 10Icinga, 10observability: icinga-wm bot truncating long messages - https://phabricator.wikimedia.org/T230799 (10fgiunchedi) For reference, the lines above appear as below in `/var/log/icinga/irc.log`: ` PROBLEM - High average GET latency for mediawiki requests on cluster appserver in eqiad on... 
[13:31:24] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/531189 (https://phabricator.wikimedia.org/T230762) (owner: 10Marostegui) [13:31:34] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/17960/ LGTM, will merge later today" [puppet] - 10https://gerrit.wikimedia.org/r/530014 (https://phabricator.wikimedia.org/T224857) (owner: 10Thcipriani) [13:32:05] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [13:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:16] 10Operations, 10Icinga, 10observability: icinga-wm bot truncating long messages - https://phabricator.wikimedia.org/T230799 (10CDanis) Ahhh, thanks! Those lines ending in `GET` explains the `T`; that makes it seem very likely that icinga-wm is splitting the lines internally, but has a different/wrong idea o... [13:39:03] (03CR) 10Gehel: [C: 04-1] Add maps reboot cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe) [13:39:20] (03PS1) 10Muehlenhoff: Setup partman config for puppetdb hosts [puppet] - 10https://gerrit.wikimedia.org/r/531190 [13:41:22] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,302} handler={proxy:fcgi://127.0.0.1:9000,proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var [13:41:22] r&var-method=POST [13:43:08] ah yeah of course, only one retry [13:52:03] (03PS1) 10Filippo Giunchedi: mediawiki: bump POST latency threshold and retries [puppet] - 10https://gerrit.wikimedia.org/r/531191 (https://phabricator.wikimedia.org/T230396) [13:54:51] cdanis jijiki ^ if you 
spare some minutes [13:55:18] (03CR) 10Huji: [C: 03+1] Fix typos in code [puppet] - 10https://gerrit.wikimedia.org/r/530989 (https://phabricator.wikimedia.org/T201491) (owner: 10DannyS712) [13:55:20] (03CR) 10CDanis: [C: 03+1] mediawiki: bump POST latency threshold and retries [puppet] - 10https://gerrit.wikimedia.org/r/531191 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [13:55:29] (03PS6) 10Huji: Fix typos in code [puppet] - 10https://gerrit.wikimedia.org/r/530989 (https://phabricator.wikimedia.org/T201491) (owner: 10DannyS712) [13:55:58] we need to write a proposal to the Unicode Consortium for a RUBBER STAMP emoji [13:57:04] godog: sure [13:58:02] !log jmm@cumin1001 START - Cookbook sre.ganeti.makevm [13:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:40] cdanis: where do I stamp? [13:58:42] cdanis: you can use ship-it ;) 🚢 [13:59:03] godog: https://unicode.org/emoji/proposals.html [13:59:13] if you haven't read some of the submitted proposals, I recommend taking a look [13:59:14] some are very very good [13:59:36] is there a proposal to get back to ascii? :-P [13:59:42] cdanis: that's gold, thank you [14:00:17] (03CR) 10Krinkle: [C: 04-1] "Task declined." 
[puppet] - 10https://gerrit.wikimedia.org/r/520172 (https://phabricator.wikimedia.org/T216243) (owner: 10Aaron Schulz) [14:00:26] volans: 👴 [14:01:07] (03PS1) 10Zfilipin: Group0 to 1.34.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531193 [14:02:09] (03CR) 10Herron: prometheus-ipsec-exporter: initial commit of version 0.3.1 (031 comment) [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/530203 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [14:02:42] (03CR) 10Effie Mouzeli: [C: 03+1] mediawiki: bump POST latency threshold and retries [puppet] - 10https://gerrit.wikimedia.org/r/531191 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [14:02:49] (03CR) 10Filippo Giunchedi: [C: 03+2] mediawiki: bump POST latency threshold and retries [puppet] - 10https://gerrit.wikimedia.org/r/531191 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [14:02:57] (03PS2) 10Filippo Giunchedi: mediawiki: bump POST latency threshold and retries [puppet] - 10https://gerrit.wikimedia.org/r/531191 (https://phabricator.wikimedia.org/T230396) [14:03:12] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus-ipsec-exporter: initial commit of version 0.3.1 [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/530203 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [14:03:30] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] mediawiki: bump POST latency threshold and retries [puppet] - 10https://gerrit.wikimedia.org/r/531191 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [14:03:36] (03CR) 10Jbond: [C: 03+2] builder: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531156 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [14:03:43] (03PS2) 10Jbond: builder: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531156 (https://phabricator.wikimedia.org/T102099) [14:05:08] (03CR) 10Nuria: [C: 03+1] "Should we merge these 
changes then?" [puppet] - 10https://gerrit.wikimedia.org/r/519181 (https://phabricator.wikimedia.org/T226035) (owner: 10Elukey) [14:07:18] (03PS1) 10Jbond: mariadb::misc::phabricator - codfw: add ipv6 address [puppet] - 10https://gerrit.wikimedia.org/r/531195 (https://phabricator.wikimedia.org/T102099) [14:07:20] (03PS1) 10Jbond: mariadb::misc::phabricator - eqiad: add ipv6 address [puppet] - 10https://gerrit.wikimedia.org/r/531196 (https://phabricator.wikimedia.org/T102099) [14:12:40] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [14:12:44] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [14:13:48] !log zfilipin@deploy1001 Pruned MediaWiki: 1.34.0-wmf.14 (duration: 06m 44s) [14:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:07] (03PS1) 10Jbond: mariadb::misc::eventlogging: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531199 (https://phabricator.wikimedia.org/T102099) [14:15:58] (03PS4) 10Herron: prometheus: add prometheus ipsec exporter service & config [puppet] - 10https://gerrit.wikimedia.org/r/530616 (https://phabricator.wikimedia.org/T230236) [14:16:56] (03PS17) 10Mathew.onipe: Add maps reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) [14:17:03] (03CR) 10Herron: prometheus: add prometheus ipsec exporter service & config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/530616 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [14:17:30] (03CR) 10Mathew.onipe: Add maps reboot cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe) [14:17:34] _joe_: can you help with this? https://phabricator.wikimedia.org/T230802#5424782 [14:17:55] (not urgent, but would be nice to fix it soon) [14:20:11] <_joe_> zeljkof: what's the issue there? 
sure I'll take care of it [14:20:40] thcipriani said "This looks like @Urbanecm 's umask maybe set wrong" [14:20:56] and "You need to find a root to chown that" [14:20:58] (03PS1) 10Jbond: mariadb::misc::multiinstance: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531200 (https://phabricator.wikimedia.org/T102099) [14:21:06] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:21:18] ^ that is ok [14:21:18] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/531149 (owner: 10Jbond) [14:21:21] !log zfilipin@deploy1001 Pruned MediaWiki: 1.34.0-wmf.15 [keeping static files] (duration: 01m 43s) [14:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:12] (03PS4) 10Jbond: IPv6: add static IPv6 addresses to spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/531149 [14:22:36] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [14:22:39] _joe_: I've replied above, not sure if you saw it :) [14:23:15] ah yes, this has happened before zeljkof, I think also with Urbanecm [14:23:28] zeljkof: ftr, my umask should be hopefully fixed now [14:23:32] Just a relict from the past [14:23:39] I can probably chmod g+w it [14:23:43] (03PS1) 10Jbond: mariadb::sanitarium_multiinstance: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531201 (https://phabricator.wikimedia.org/T102099) [14:24:01] On mobile now, so if someone can handle that, it'd be great 
[14:24:27] !log zfilipin@deploy1001 Started scap: testwiki to php-1.34.0-wmf.19 and rebuild l10n cache [14:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:55] zeljkof: Urbanecm: should be fixed [14:26:10] thanks Urbanecm for taking care of this, hopefully no more trouble in the future :) [14:26:12] RECOVERY - PHP opcache health on mwdebug2001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:26:13] cdanis: thanks! I'll check [14:26:30] (03PS1) 10Jbond: mariadb::misc::tendril: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531203 (https://phabricator.wikimedia.org/T102099) [14:26:34] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:26:36] <_joe_> cdanis: thanks [14:26:52] RECOVERY - PHP opcache health on mwdebug2002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:26:57] <_joe_> zeljkof: yes I saw, I was still concentrated on "not urgent" " :P [14:28:19] (03PS1) 10Gilles: Upgrade to 2.6 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/531204 (https://phabricator.wikimedia.org/T226707) [14:28:21] _joe_: I could continue with train things without that being resolved, so there was no rush, but still, any train problem makes me nervous :) [14:29:02] !log jmm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [14:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:35] (03PS2) 10Gilles: Only apply expiry logic to "thumb" zone [puppet] - 10https://gerrit.wikimedia.org/r/519374 (https://phabricator.wikimedia.org/T211661) [14:34:25] (03PS1) 10Jbond: mariadb::dbstore_multiinstance - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531205 
(https://phabricator.wikimedia.org/T102099) [14:34:27] (03PS1) 10Jbond: mariadb::dbstore_multiinstance - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531206 (https://phabricator.wikimedia.org/T102099) [14:35:20] (03CR) 10Jbond: [C: 03+2] IPv6: add static IPv6 addresses to spare hosts [puppet] - 10https://gerrit.wikimedia.org/r/531149 (owner: 10Jbond) [14:35:24] (03CR) 10Thcipriani: [V: 03+2 C: 03+2] blubberoid: update base chart for "helm test" [deployment-charts] - 10https://gerrit.wikimedia.org/r/530881 (owner: 10Thcipriani) [14:37:00] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 28484 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops [14:37:18] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [14:37:20] (03PS2) 10Jbond: mariadb::dbstore_multiinstance - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531206 (https://phabricator.wikimedia.org/T102099) [14:37:27] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T230575 (10Cmjohnson) Another ticket has been placed with Dell [14:37:56] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Cmjohnson) A ticket has been placed with Dell [14:38:26] (03PS4) 10BBlack: anycast recdns: config for 41 eqiad canaries [puppet] - 10https://gerrit.wikimedia.org/r/528524 
(https://phabricator.wikimedia.org/T228190) [14:38:28] (03PS5) 10BBlack: anycast recdns: enable for codfw clients [puppet] - 10https://gerrit.wikimedia.org/r/526788 (https://phabricator.wikimedia.org/T228190) [14:38:32] 10Operations, 10Core Platform Team, 10MediaWiki-extensions-CentralAuth, 10TimedMediaHandler, and 5 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (10Jdforrester-WMF) >>! In T226840#5422273, @BBlack wrote: >... [14:38:33] (03PS4) 10BBlack: anycast recdns: enable globally [puppet] - 10https://gerrit.wikimedia.org/r/528525 (https://phabricator.wikimedia.org/T228190) [14:38:37] (03CR) 10Elukey: [C: 03+1] mariadb::misc::eventlogging: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531199 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [14:38:44] (03PS3) 10Jbond: mariadb::dbstore_multiinstance - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531206 (https://phabricator.wikimedia.org/T102099) [14:40:20] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [14:40:44] (03CR) 10Mforns: [C: 03+1] "Awesome! 
LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/531046 (owner: 10Elukey) [14:41:15] (03PS1) 10Jbond: mariadb::backups - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531207 (https://phabricator.wikimedia.org/T102099) [14:41:20] (03PS1) 10Jbond: mariadb::backups - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531208 (https://phabricator.wikimedia.org/T102099) [14:42:05] (03PS5) 10BBlack: anycast recdns: config for 41 eqiad canaries [puppet] - 10https://gerrit.wikimedia.org/r/528524 (https://phabricator.wikimedia.org/T228190) [14:42:07] (03PS6) 10BBlack: anycast recdns: enable for codfw clients [puppet] - 10https://gerrit.wikimedia.org/r/526788 (https://phabricator.wikimedia.org/T228190) [14:42:09] (03PS5) 10BBlack: anycast recdns: enable globally [puppet] - 10https://gerrit.wikimedia.org/r/528525 (https://phabricator.wikimedia.org/T228190) [14:45:31] (03PS1) 10Jbond: mariadb::proxy - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531209 (https://phabricator.wikimedia.org/T102099) [14:46:21] 10Operations, 10ops-eqiad, 10DBA: db1114 crashed due to memory issues (server under warranty) - https://phabricator.wikimedia.org/T229452 (10Cmjohnson) Swapped the DIMM B3 with A3 and B7 with A7. Powered on and cleared log. 
Let's see if the errors return or change, [14:46:38] (03PS2) 10Jbond: mariadb::proxy - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531209 (https://phabricator.wikimedia.org/T102099) [14:46:40] (03PS1) 10Jbond: mariadb::proxy - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531210 (https://phabricator.wikimedia.org/T102099) [14:47:42] (03PS2) 10Jbond: mariadb::proxy - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531210 (https://phabricator.wikimedia.org/T102099) [14:48:06] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [14:48:12] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/530616 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [14:48:26] ^ I am checking the GET latency alert [14:49:04] (03PS1) 10Jbond: debmonitor: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531211 (https://phabricator.wikimedia.org/T102099) [14:49:19] jijiki: thanks! 
yeah I think we'll need to tune the thresholds [14:49:29] (03Abandoned) 10Aaron Schulz: Raise HHVM mysql query time threshold to effectively not trigger [puppet] - 10https://gerrit.wikimedia.org/r/520172 (https://phabricator.wikimedia.org/T216243) (owner: 10Aaron Schulz) [14:50:27] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1063 - https://phabricator.wikimedia.org/T230682 (10Cmjohnson) @Marostegui I had a used disk on-site and replace it....it's currently in rebuild Device Firmware Level: ES66 Firmware state: Rebuild [14:50:50] RECOVERY - Disk space on elastic1018 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops [14:51:11] (03PS1) 10Jbond: dumps::generation::server: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531212 (https://phabricator.wikimedia.org/T102099) [14:51:30] PROBLEM - High CPU load on API appserver on mw1288 is CRITICAL: CRITICAL - load average: 62.70, 33.91, 20.87 https://wikitech.wikimedia.org/wiki/Application_servers [14:51:45] (03PS1) 10Elukey: profile::swap: add push_published_datasets tunable [puppet] - 10https://gerrit.wikimedia.org/r/531213 [14:51:54] PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 66.46, 27.89, 18.21 https://wikitech.wikimedia.org/wiki/Application_servers [14:52:53] godog: this might be due to deploymentg [14:52:57] so I will wait a bit [14:53:04] RECOVERY - High CPU load on API appserver on mw1288 is OK: OK - load average: 22.81, 28.23, 20.12 https://wikitech.wikimedia.org/wiki/Application_servers [14:53:10] and we can figure out a way to make it better [14:53:28] RECOVERY - High CPU load on API appserver on mw1290 is OK: OK - load average: 27.45, 26.31, 18.68 https://wikitech.wikimedia.org/wiki/Application_servers [14:54:21] jijiki: sounds great -- thanks [14:54:53] (03PS1) 10Muehlenhoff: Add DHCP config for 
puppetdb1002/puppetdb2002 [puppet] - 10https://gerrit.wikimedia.org/r/531214 [14:54:57] (03PS2) 10Elukey: profile::swap: add push_published_datasets tunable [puppet] - 10https://gerrit.wikimedia.org/r/531213 [14:54:58] !log zfilipin@deploy1001 Finished scap: testwiki to php-1.34.0-wmf.19 and rebuild l10n cache (duration: 30m 31s) [14:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:09] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/17962/" [puppet] - 10https://gerrit.wikimedia.org/r/531213 (owner: 10Elukey) [14:57:41] (03PS1) 10Jbond: elasticsearch::cirrus - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531215 (https://phabricator.wikimedia.org/T102099) [14:57:43] (03PS1) 10Jbond: elasticsearch::cirrus - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531216 (https://phabricator.wikimedia.org/T102099) [14:58:50] (03CR) 10Zfilipin: [C: 03+2] Group0 to 1.34.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531193 (owner: 10Zfilipin) [14:59:16] I'm a little late with train today, just deploying to grou0 [14:59:19] group 0 [14:59:30] I'll need to extend the train window for 5-10 minutes at least [15:00:38] and CI seems to be really busy, it might take a while for the patch to merge... 
[15:00:55] yeah :( [15:02:36] (03PS1) 10Jbond: mariadb::temporary_storage: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531217 (https://phabricator.wikimedia.org/T102099) [15:04:45] (03PS1) 10Jbond: failoid: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531218 (https://phabricator.wikimedia.org/T102099) [15:06:18] (03PS1) 10Jbond: analytics_cluster::hadoop::client: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531219 (https://phabricator.wikimedia.org/T102099) [15:06:21] zeljkof: it has been like that for quite a while [15:06:51] (03CR) 10Elukey: [C: 03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/531213 (owner: 10Elukey) [15:07:40] (03CR) 10Elukey: [C: 03+1] analytics_cluster::hadoop::client: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531219 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [15:08:50] (03Merged) 10jenkins-bot: Group0 to 1.34.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531193 (owner: 10Zfilipin) [15:09:11] (03CR) 10jenkins-bot: Group0 to 1.34.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531193 (owner: 10Zfilipin) [15:09:37] (03CR) 10Muehlenhoff: [C: 03+2] Add DHCP config for puppetdb1002/puppetdb2002 [puppet] - 10https://gerrit.wikimedia.org/r/531214 (owner: 10Muehlenhoff) [15:10:36] (03PS3) 10Elukey: profile::swap: add push_published_datasets tunable [puppet] - 10https://gerrit.wikimedia.org/r/531213 [15:10:39] (03CR) 10Elukey: [V: 03+2 C: 03+2] profile::swap: add push_published_datasets tunable [puppet] - 10https://gerrit.wikimedia.org/r/531213 (owner: 10Elukey) [15:11:46] (03CR) 10Volans: [C: 03+1] "LGTM but better to double check with @bblack" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/531165 (https://phabricator.wikimedia.org/T224559) (owner: 10Muehlenhoff) [15:13:15] jijiki: I'll ask my team if anybody knows what's wrong [15:13:26] 10Operations, 10Analytics: Access to HUE for 
Mayakpwiki - https://phabricator.wikimedia.org/T229143 (10Nuria) Assigning to @joal who has ops duty this week https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Access#Admin_Instructions_to_sync_a_Hue_account [15:13:37] 10Operations, 10ops-eqiad: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10Cmjohnson) [15:13:50] (03PS1) 10Jbond: kerberos::kdc: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531220 (https://phabricator.wikimedia.org/T102099) [15:14:03] 10Operations, 10ops-eqiad: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10Cmjohnson) @elukey the site specific portion is complete if you want to take over from here [15:14:30] (03CR) 10Volans: "One comment inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/531218 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [15:15:07] (03CR) 10Elukey: [C: 03+1] kerberos::kdc: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531220 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [15:16:09] (03CR) 10BBlack: [C: 03+1] Switch Failoid in codfw to failoid2001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/531165 (https://phabricator.wikimedia.org/T224559) (owner: 10Muehlenhoff) [15:16:17] 10Operations, 10ops-eqiad, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10elukey) a:05Cmjohnson→03elukey [15:16:25] (03PS1) 10Jbond: etcd::kubernetes: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531222 (https://phabricator.wikimedia.org/T102099) [15:17:45] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.34.0-wmf.19 [15:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:31] (03CR) 10Muehlenhoff: failoid: add ipv6 mapped address (032 comments) [puppet] 
- 10https://gerrit.wikimedia.org/r/531218 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [15:19:02] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [15:19:23] (03PS1) 10Jbond: etcd::networking: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531224 (https://phabricator.wikimedia.org/T102099) [15:19:41] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/531220 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [15:20:59] ok, done with group 0, nothing obvious is broken, I'll continue monitoring logs [15:21:34] (03PS1) 10Jbond: etherpad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531225 (https://phabricator.wikimedia.org/T102099) [15:23:15] (03PS1) 10Jbond: dumps::web::htmldumps: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531226 (https://phabricator.wikimedia.org/T102099) [15:25:06] (03PS1) 10Jbond: ganeti: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531227 (https://phabricator.wikimedia.org/T102099) [15:25:42] cdanis, _joe_: I've tried `scap clean --delete 1.34.0-wmf.13` again and got the same error message: `14:03:08 clean failed: [Errno 13] Permission denied: '/srv/mediawiki-staging/php-1.34.0-wmf.13/.git/modules/extensions/Gadgets/refs/remotes/origin/wmf/1.34.0-wmf.11'` [15:26:12] argh [15:26:19] I gave the whole subtree group write [15:26:36] <_joe_> cdanis: uhm ok lemme look if I can see what's wrong [15:26:52] zeljkof: train window done? Would like to roll out a patch to help me investigate a prod issue. 
[15:27:02] (03PS1) 10Jbond: grafana: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531230 (https://phabricator.wikimedia.org/T102099) [15:27:02] ls: cannot access '/srv/mediawiki-staging/php-1.34.0-wmf.13/.git/modules/extensions/Gadgets/refs/remotes/origin/wmf/1.34.0-wmf.11': No such file or directory [15:27:17] Krinkle: done, but I need a break :/ [15:27:23] zeljkof: OK np [15:28:58] 10Operations, 10ops-eqiad, 10DBA: db1114 crashed due to memory issues (server under warranty) - https://phabricator.wikimedia.org/T229452 (10Marostegui) Thank you Chris! I have started MySQL, let's wait a few days before closing this, and if it happens again we can re-open! [15:29:16] (03PS1) 10Jbond: debug_proxy: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531231 (https://phabricator.wikimedia.org/T102099) [15:29:59] <_joe_> cdanis: yeah that whole refs/ dir doesn't exist [15:30:17] 10Operations, 10WMF-Legal, 10serviceops: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10BBlack) 05Open→03Stalled Just stalling this so that anyone following it doesn't try to pick this up or move with it yet. There's an ongoing ema... 
[15:30:24] zeljkof: not sure what's up, that subtree doesn't exist, neither do a few above it, and all the ones that do exist are wikidev-group-writable [15:30:44] (03CR) 10SBassett: [C: 03+1] Restrict account creation on es.wikiquote to 1 day/IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531169 (https://phabricator.wikimedia.org/T230796) (owner: 10MarcoAurelio) [15:30:55] <_joe_> cdanis: I'm gonna run scap clean myself [15:31:00] (03CR) 10SBassett: [C: 03+1] Enable DNS blacklist for es.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531171 (https://phabricator.wikimedia.org/T230796) (owner: 10MarcoAurelio) [15:31:14] _joe_: sgtm [15:31:33] (03PS1) 10Jbond: backup::ofsite: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531233 (https://phabricator.wikimedia.org/T102099) [15:31:58] <_joe_> cdanis: the treeish is [15:32:01] <_joe_> /srv/mediawiki-staging/php-1.34.0-wmf.13/.git/modules/extensions/Gadgets/logs/refs/remotes/origin/wmf/1.34.0-wmf.11 [15:32:09] <_joe_> so logs was missing [15:34:36] (03PS1) 10Jbond: wmcs::openstack::codfw1dev::net: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531235 (https://phabricator.wikimedia.org/T102099) [15:36:24] (03PS1) 10Jbond: graphite: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531236 (https://phabricator.wikimedia.org/T102099) [15:37:29] Daimona: around? [15:37:36] Urbanecm: yes [15:37:47] Do you have a while to discuss/test T230601? 
[15:37:47] T230601: Groups 'oversight'/'suppress' should be reconciled - https://phabricator.wikimedia.org/T230601 [15:38:03] First of all, I've wrapped it into wgExtensionFunctions (both in af.php and cs.php) [15:38:07] which _should_ make it work [15:38:17] maybe some cache got into the play [15:38:18] idk [15:39:12] (03PS1) 10Jbond: installserver: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531237 (https://phabricator.wikimedia.org/T102099) [15:40:05] Daimona: ^^ [15:40:47] I'll probably have time in 15 minutes or so [15:41:30] Cool! [15:41:57] Using $wgExtensionFunctions looks promising though [15:42:06] After all, we're trying to handle extension use cases [15:42:12] (03PS1) 10Jbond: kafka::main: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531238 (https://phabricator.wikimedia.org/T102099) [15:42:51] * Urbanecm is going to do a security-related deployment for T230796 [15:43:32] Oh lord, again [15:44:11] (03CR) 10Urbanecm: [C: 03+2] Restrict account creation on es.wikiquote to 1 day/IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531169 (https://phabricator.wikimedia.org/T230796) (owner: 10MarcoAurelio) [15:44:14] (03CR) 10Urbanecm: [C: 03+2] Enable DNS blacklist for es.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531171 (https://phabricator.wikimedia.org/T230796) (owner: 10MarcoAurelio) [15:44:16] (03CR) 10DCausse: [C: 03+1] Add -XX:NewRatio=3 for cloudelastic-chi [puppet] - 10https://gerrit.wikimedia.org/r/530889 (owner: 10EBernhardson) [15:44:35] Daimona: Me again doing sec stuff, or spambots again? [15:44:43] The latter [15:44:45] :D [15:44:53] Such a PITA [15:45:16] (03Merged) 10jenkins-bot: Restrict account creation on es.wikiquote to 1 day/IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531169 (https://phabricator.wikimedia.org/T230796) (owner: 10MarcoAurelio) [15:45:20] https://en.wikipedia.org/wiki/Pita ? 
[15:45:23] (03PS1) 10Jbond: webserver_misc_apps: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531239 (https://phabricator.wikimedia.org/T102099) [15:46:49] (03CR) 10jenkins-bot: Restrict account creation on es.wikiquote to 1 day/IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531169 (https://phabricator.wikimedia.org/T230796) (owner: 10MarcoAurelio) [15:47:52] (03PS4) 10Gehel: Add -XX:NewRatio=3 for cloudelastic-chi [puppet] - 10https://gerrit.wikimedia.org/r/530889 (owner: 10EBernhardson) [15:48:15] (03PS2) 10Jbond: wmcs::openstack::codfw1dev: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531235 (https://phabricator.wikimedia.org/T102099) [15:48:25] Urbanecm: sadly not that delicious: https://en.wiktionary.org/wiki/pain_in_the_ass [15:48:36] !log Run sudo -u mwdeploy chmod g+w /srv/mediawiki-stagging/wmf-config on deploy1001 [15:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:03] https://www.urbandictionary.com/define.php?term=pita [15:49:12] thought it wouldn't be right :D [15:49:22] > sadly not that delicious <-- indeed :) [15:49:22] (03CR) 10Gehel: [C: 03+2] Add -XX:NewRatio=3 for cloudelastic-chi [puppet] - 10https://gerrit.wikimedia.org/r/530889 (owner: 10EBernhardson) [15:49:35] !log creating Parsoid/PHP storage schema in restbase-dev -- T230792 [15:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:44] T230792: Create Parsoid/PHP tables in Cassandra - https://phabricator.wikimedia.org/T230792 [15:50:02] there's something weird going on... [15:50:09] 15:49:34 sync-file failed: [Errno 13] Permission denied: u'/srv/mediawiki-staging/php-1.34.0-wmf.17/cache/gitinfo/info.json' [15:50:22] (03PS1) 10Jbond: labs::db: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531240 (https://phabricator.wikimedia.org/T102099) [15:51:12] Urbanecm: context? [15:51:22] can you past the full trace of the error? 
[15:51:25] *paste [15:51:29] tried to sync wmf-config/initialiseSettings.php thcipriani [15:51:48] sure [15:52:08] https://www.irccloud.com/pastebin/1LrcJpqC/ [15:52:12] * thcipriani looks [15:52:13] here you are thcipriani ^^ [15:52:25] (03PS1) 10Jbond: wmcs::nfs: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531241 (https://phabricator.wikimedia.org/T102099) [15:52:48] ftr, I got similar error when trying git rebase, but the directory was owned by mwdeploy, so I fixed it myself [15:52:52] see my earlier !log statement [15:53:22] <_joe_> uhm [15:53:34] <_joe_> I might have made a mess earlier when trying to fix permissions? [15:53:44] looks like the git info cache isn't writable [15:53:52] <_joe_> ok lemme fix that [15:53:53] for wmf.17 specifically anyway [15:53:54] might be _joe_ [15:54:01] <_joe_> that's pretty strange though [15:54:03] thanks both [15:54:11] (03PS1) 10Jbond: openldap: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531242 (https://phabricator.wikimedia.org/T102099) [15:54:12] should /srv/mediawiki-staging be group writable? [15:54:14] <_joe_> thcipriani: I have some probelms with permissions on wmf.13 too [15:54:24] cdanis: I guess so [15:54:30] otherwise, deployers wouldn't be able to sync etc [15:54:47] <_joe_> ok I might have messed that up cdanis [15:54:48] <_joe_> sith [15:54:50] <_joe_> *sigh [15:54:59] cdanis: yeah, for many things: l10n, git info cache, and git operations. 
[15:55:13] * Krinkle staging on mwdebug1002 for https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/531229/ [15:55:16] IIRC the setguid bit is set [15:55:31] Krinkle: I'm going to sync in a while [15:55:32] drwxr-sr-x 24 mwdeploy wikidev 4096 Aug 20 15:16 /srv/mediawiki-staging/ [15:55:32] <_joe_> cdanis: I fixed it now [15:55:39] (03PS1) 10Jbond: logstash: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531243 (https://phabricator.wikimedia.org/T102099) [15:55:42] (config, so shouldn't be affected) [15:55:46] thanks _joe_ [15:55:49] <_joe_> cdanis: that was my error, yes [15:55:50] Urbanecm: should be done before the whole hour, swat right? [15:56:09] <_joe_> thcipriani: do you see other permission problems? [15:56:15] Krinkle: no, sec-related emergency deployment [15:56:26] _joe_: let me try to do the sync [15:56:59] https://www.irccloud.com/pastebin/OZBUd5Nn/ [15:57:03] <_joe_> so that file is still not writable yes [15:57:10] Urbanecm: OK. I'll do mine after. [15:57:11] <_joe_> and that has definitely nothing to do with me [15:57:18] staging and debug are clear now [15:57:23] thanks Krinkle [15:57:34] _joe_: could you fix that as well, please? [15:57:38] RECOVERY - MegaRAID on db1063 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:57:47] (03PS1) 10Jbond: lvs::balancer: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531244 (https://phabricator.wikimedia.org/T102099) [15:58:31] <_joe_> Urbanecm: done [15:58:38] <_joe_> and sorry for the trouble [15:58:43] thank you very much _joe_ [15:59:01] <_joe_> not sure everything's fixed [15:59:10] we'll see as time goes :) [15:59:12] is there an argument here for scap managing permissions when it does a branch cut? [15:59:27] Would that have also been why wmf.13 had scap clean issues earlier for zeljkof? 
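The `ls -ld` output quoted above (`drwxr-sr-x`) shows a directory with the setgid bit set but without group write. A minimal Python sketch for decoding such a string follows; the helper name `parse_ls_mode` is hypothetical and is not part of scap or any tool mentioned in the channel.

```python
import stat

# Permission bits in the order they appear after the file-type character.
_PLAIN_BITS = [
    stat.S_IRUSR, stat.S_IWUSR, stat.S_IXUSR,
    stat.S_IRGRP, stat.S_IWGRP, stat.S_IXGRP,
    stat.S_IROTH, stat.S_IWOTH, stat.S_IXOTH,
]

# Positions that can carry setuid / setgid / sticky instead of a plain "x".
_SPECIAL_BITS = {
    2: (stat.S_ISUID, "s", "S"),
    5: (stat.S_ISGID, "s", "S"),
    8: (stat.S_ISVTX, "t", "T"),
}


def parse_ls_mode(text):
    """Convert a 10-character ls-style string (e.g. 'drwxr-sr-x') to permission bits."""
    if len(text) != 10:
        raise ValueError("expected 10 characters, got %d" % len(text))
    mode = 0
    for index, char in enumerate(text[1:]):
        if char == "-":
            continue
        if index in _SPECIAL_BITS:
            special, lower, upper = _SPECIAL_BITS[index]
            if char == lower:      # special bit plus execute
                mode |= special | _PLAIN_BITS[index]
                continue
            if char == upper:      # special bit without execute
                mode |= special
                continue
        if char != "rwx"[index % 3]:
            raise ValueError("unexpected %r at position %d" % (char, index + 1))
        mode |= _PLAIN_BITS[index]
    return mode


mode = parse_ls_mode("drwxr-sr-x")
print(oct(mode))                      # numeric form of the quoted mode
print(bool(mode & stat.S_ISGID))      # is setgid set?
print(bool(mode & stat.S_IWGRP))      # is group write set?
```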
[15:59:45] <_joe_> James_F: that was probably different [15:59:45] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 5ab38dc: Restrict account creation on es.wikiquote to 1 day/IP (T230796) (duration: 01m 00s) [15:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:54] T230796: Deploy countermeasures to stop ongoing spambot attack at es.wikiquote 2019-08-20 [public task] - https://phabricator.wikimedia.org/T230796 [15:59:56] <_joe_> James_F: lemme try now [16:00:03] (03PS1) 10Jbond: maps - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531245 (https://phabricator.wikimedia.org/T102099) [16:00:04] godog and _joe_: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190820T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:07] (03PS1) 10Jbond: maps - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531246 (https://phabricator.wikimedia.org/T102099) [16:00:17] okay, first sync seems to work [16:00:19] <_joe_> James_F: still cannot delete that damn git dir [16:00:34] (03PS2) 10Urbanecm: Enable DNS blacklist for es.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531171 (https://phabricator.wikimedia.org/T230796) (owner: 10MarcoAurelio) [16:00:43] Fun. 
[16:00:54] <_joe_> ok done [16:00:56] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [16:00:59] <_joe_> Urbanecm: I'm running scap clean [16:01:01] (03CR) 10Urbanecm: [C: 03+2] Enable DNS blacklist for es.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531171 (https://phabricator.wikimedia.org/T230796) (owner: 10MarcoAurelio) [16:01:13] <_joe_> jijiki: I think we need a higher threshold on codfw [16:01:25] thanks for the info _joe_ [16:01:29] * Urbanecm is waiting on CI [16:01:31] <_joe_> James_F: it worked this time [16:01:41] _joe_: looks like it [16:01:43] * James_F shrugs. [16:02:09] btw, why isn't operations/mediawiki-config in gate-and-submit-swat? [16:02:14] so it has higher priority? [16:02:14] <_joe_> jijiki: well, on the inactive dc [16:02:22] the plot thickens [16:02:33] (03PS1) 10Jbond: mediawiki::memcached - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531247 (https://phabricator.wikimedia.org/T102099) [16:02:35] (03PS1) 10Jbond: mediawiki::memcached - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531248 (https://phabricator.wikimedia.org/T102099) [16:02:56] _joe_: we can do it for codfw for these days, and create a task to properly fix it [16:03:18] <_joe_> jijiki: I don't think the logic is much more complex [16:03:35] maybe not, but I will have to dig :p [16:04:04] > find /srv/mediawiki-staging/php-*/.git ! -perm -g=w => shows a bunch of git objects, but that's fine since they aren't going to change, I think everything should be sane on deploy1001 now [16:04:19] thanks thcipriani ! 
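The `find ... ! -perm -g=w` check quoted above can be approximated in Python when the result needs further processing. This is only a sketch: the function name `not_group_writable` is hypothetical, and like `find` without `-L` it inspects symlinks themselves rather than their targets.

```python
import os
import stat


def not_group_writable(root):
    """Return sorted paths under root (root included) that lack the group-write bit."""
    candidates = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            candidates.append(os.path.join(dirpath, name))
    missing = []
    for path in candidates:
        # lstat: do not follow symlinks, mirroring find's default behaviour.
        if not os.lstat(path).st_mode & stat.S_IWGRP:
            missing.append(path)
    return sorted(missing)
```

Calling `not_group_writable("/srv/mediawiki-staging/php-1.34.0-wmf.17/.git")` (path taken from the discussion, to be adjusted as needed) would list the entries a deployer in the group could not modify.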
[16:04:22] <_joe_> thcipriani: scap clean is very slow :D [16:04:57] !log oblivian@deploy1001 Pruned MediaWiki: 1.34.0-wmf.13 (duration: 04m 09s) [16:04:58] (03PS1) 10Jbond: otrs: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531250 (https://phabricator.wikimedia.org/T102099) [16:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:05] Urbanecm: -swat isn't a higher priority than gate-and-submit, it's just a different one. [16:05:09] _joe_: jijiki: isn't there that puppet struct with all the mediawiki config data like primary DC [16:05:12] (03Merged) 10jenkins-bot: Enable DNS blacklist for es.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531171 (https://phabricator.wikimedia.org/T230796) (owner: 10MarcoAurelio) [16:05:17] so why we separate it then James_F ? [16:05:38] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [16:05:43] _joe_: yeah, the design still does a lot of rsyncing, we could probably move towards just doing the rm-ing directly and it'd speed up quite a bit, fewer stats that way [16:06:56] (03CR) 10jenkins-bot: Enable DNS blacklist for es.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531171 (https://phabricator.wikimedia.org/T230796) (owner: 10MarcoAurelio) [16:07:11] Urbanecm: It's faster for dependent things in the mw channel. 
[16:07:23] Ok [16:07:41] doesn't make much sense, but that's probably because I lack some knowledge of how it works :-) [16:07:47] (03PS1) 10Jbond: swift - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531251 (https://phabricator.wikimedia.org/T102099) [16:07:49] (03PS1) 10Jbond: swift - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531252 (https://phabricator.wikimedia.org/T102099) [16:07:57] So a commit to mediawiki/core master would block a commit to mediawiki/core wmf.17 "wrongly", whereas now it goes into -swat. [16:08:05] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: fa903b7: Enable DNS blacklist for es.wikiquote (T230796) (duration: 00m 55s) [16:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:13] T230796: Deploy countermeasures to stop ongoing spambot attack at es.wikiquote 2019-08-20 [public task] - https://phabricator.wikimedia.org/T230796 [16:08:28] Krinkle: I'm done [16:08:32] thanks James_F [16:08:50] cdanis: re: scap enforcing permissions/umask in scap. There's a relevant ticket https://phabricator.wikimedia.org/T200690 with a similar suggestion. It seems to me though that permission errors are mostly caused by git operations, scap just surfaces the problem :( [16:09:38] thcipriani: sure, and git operations + unexpected umasks will cause problems ofc... but scap could maybe fix them? or we could standardize what the permissions of the whole tree could be, and have a setuid command to fix them up? [16:09:58] s/could/should/ [16:10:50] +1 to the setuid command [16:11:10] (or maybe even allow deployers to sudo as root for chmod/chown inside the tree?) 
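The "standardize the permissions of the whole tree" idea above can be sketched as a pure function that computes the mode an entry should have. The policy used here (group write everywhere, setgid on directories) is an assumption drawn from this discussion, not a confirmed scap or puppet rule, and `normalized_mode` is a hypothetical name.

```python
import stat


def normalized_mode(mode):
    """Return the permission bits an entry should have under the assumed policy.

    Assumed policy: every entry is group-writable, and directories also carry
    the setgid bit so new children inherit the directory's group.
    """
    perms = stat.S_IMODE(mode) | stat.S_IWGRP
    if stat.S_ISDIR(mode):
        perms |= stat.S_ISGID
    return perms
```

A fix-up command could compare `normalized_mode(os.lstat(path).st_mode)` with the current bits and call `os.chmod` only where they differ, which keeps a dry-run mode trivial to add.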
[16:12:00] * Krinkle staging on mwdebug1002 for https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/531229/ [16:15:00] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.19/includes/resourceloader/ResourceLoaderWikiModule.php: T229433 - 44607c984016b (debugging) (duration: 00m 55s) [16:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:08] T229433: Investigate magically missing array key in ResourceLoaderWikiModule - https://phabricator.wikimedia.org/T229433 [16:15:23] is it correct that /srv/mediawiki-staging and all descendents should be group:mwdeploy and g+w? [16:15:26] always? [16:15:44] (03PS1) 10Jbond: mw servers - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531255 (https://phabricator.wikimedia.org/T102099) [16:15:46] (03PS1) 10Jbond: MW servers - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531256 (https://phabricator.wikimedia.org/T102099) [16:16:04] cdanis: group wikidev, not? [16:16:12] sorry, yes, that's what I meant [16:16:30] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [16:16:55] James_F: refresh https://integration.wikimedia.org/zuul/ and search for e.g. 
'swat' [16:17:05] now hides other pipelines (instead of very big empty placeholders) [16:18:10] (03PS1) 10Jbond: logging::mediawiki::udp2log: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531257 (https://phabricator.wikimedia.org/T102099) [16:20:19] (03PS1) 10Jbond: swap: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531258 (https://phabricator.wikimedia.org/T102099) [16:22:13] (03PS1) 10Jbond: ores: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531260 (https://phabricator.wikimedia.org/T102099) [16:22:40] (03CR) 10Jbond: "moritz could you help with a service owner for this?" [puppet] - 10https://gerrit.wikimedia.org/r/531260 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [16:24:58] !log Updated the Wikidata property suggester with data from the 2019-08-12 JSON dump and applied the T132839 workarounds [16:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:07] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [16:26:05] jouncebot: next [16:26:06] In 0 hour(s) and 33 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190820T1700) [16:26:11] (03PS1) 10Jbond: ping_offload: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531261 (https://phabricator.wikimedia.org/T102099) [16:28:53] (03PS1) 10Jbond: mariadb::parsercache - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531262 (https://phabricator.wikimedia.org/T102099) [16:28:54] (03PS1) 10Jbond: mariadb::parsercache - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531263 (https://phabricator.wikimedia.org/T102099) [16:31:26] (03PS1) 10Jbond: prometheus: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531264 (https://phabricator.wikimedia.org/T102099) [16:33:01] 
(03PS1) 10Jbond: poolcounter: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531265 (https://phabricator.wikimedia.org/T102099) [16:34:00] Urbanecm: still around? [16:34:05] (03CR) 10Jforrester: [C: 03+1] Send 33.3% of anonymous users to PHP7.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529924 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [16:34:07] (03PS1) 10Jbond: puppetboard/puppetdb: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531266 (https://phabricator.wikimedia.org/T102099) [16:35:28] (03CR) 10Ladsgroup: "PCC looks happy" [puppet] - 10https://gerrit.wikimedia.org/r/511078 (https://phabricator.wikimedia.org/T113114) (owner: 10Ladsgroup) [16:36:18] (03PS1) 10Jbond: redis - codfw: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531267 (https://phabricator.wikimedia.org/T102099) [16:36:20] (03PS1) 10Jbond: redis - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531268 (https://phabricator.wikimedia.org/T102099) [16:36:27] cdanis: sorry, had to jump into a meeting. I should verify my understanding, but I believe it is correct to say that everything should be g+w under /srv/mediawiki-staging; however, the l10nupdate cache is owned by l10nupdate:l10nupdate and I'm unsure if that is historical or still needed at this point. [16:36:31] Daimona: not much [16:36:36] Np [16:36:43] If you wait 20 mins, I should be back [16:37:10] thcipriani: hm, interesting, thanks [16:37:55] (03PS1) 10Jbond: docker_registry: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531269 (https://phabricator.wikimedia.org/T102099) [16:39:16] Sure [16:39:19] Please ping me [16:39:55] hi Daimona [16:40:01] Hey! 
[16:40:01] * Krinkle staging on mwdebug1002 for https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/531228/ [16:40:17] (03PS1) 10Jbond: elasticsearch::relforge: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531271 (https://phabricator.wikimedia.org/T102099) [16:42:23] (03PS1) 10Jbond: restbase: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531272 (https://phabricator.wikimedia.org/T102099) [16:42:28] (03PS1) 10EBernhardson: Allow glent indices to auto-create in cirrus clusters [puppet] - 10https://gerrit.wikimedia.org/r/531273 [16:42:42] !log php-1.34.0-wmf.17/extensions/TimedMediaHandler is dirty. A merged patch was not deployed - https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/TimedMediaHandler/+/530558/ [16:42:45] brion: ^ [16:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:56] error: unable to unlink old 'includes/resourceloader/ResourceLoaderWikiModule.php': Permission denied [16:44:04] Unable to checkout patch [16:44:09] (03PS1) 10Jbond: eventschemas: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531274 (https://phabricator.wikimedia.org/T102099) [16:44:17] in /srv/mediawiki-staging/php-1.34.0-wmf.17 [16:44:22] (03CR) 10DCausse: [C: 03+1] Allow glent indices to auto-create in cirrus clusters [puppet] - 10https://gerrit.wikimedia.org/r/531273 (owner: 10EBernhardson) [16:44:29] git reset --hard origin/master ? [16:44:30] cdanis: ^ [16:44:43] hauskatze: eh, never in prod :) [16:44:59] but 'git checkout includes/resourceloader/ResourceLoaderWikiModule.php' should also overwrite it [16:45:00] it's a chmod issue [16:45:02] not a git issue [16:45:06] aha [16:45:22] though if an out-of-gerrit deployment happened, like a security patch [16:45:33] sometimes people forget to put them on gerrit [16:45:43] Krinkle: Re. zuul status board, yeah, saw the patch. Nice. 
:-) [16:45:46] (03PS1) 10Jbond: sessionstore: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531275 (https://phabricator.wikimedia.org/T102099) [16:45:48] (03CR) 10EBernhardson: "no cluster restart required, I've applied the same config to eqiad/codfw 9243/9443/9643 via transient cluster settings api" [puppet] - 10https://gerrit.wikimedia.org/r/531273 (owner: 10EBernhardson) [16:45:53] Yeah, a pull would rebase that away and/or show a conflict (if different) [16:46:08] (03CR) 10Jbond: "not sure of the owner for this?" [puppet] - 10https://gerrit.wikimedia.org/r/531275 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [16:46:12] in this case everything is fine in git, but it is unable to make the files on disk match the internal git state due to a chmod issue [16:46:20] permissions issue* [16:46:52] Krinkle: what's the full path? [16:46:59] $ ls -halF includes/resourceloader/ResourceLoaderWikiModule.php [16:47:00] -rw-r--r-- 1 *** wikidev 18K Aug 6 17:29 includes/resourceloader/ResourceLoaderWikiModule.php [16:47:06] in /srv/mediawiki-staging/php-1.34.0-wmf.17 [16:47:30] Krinkle: ok try now [16:47:35] (03PS1) 10Jbond: dumps: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531276 (https://phabricator.wikimedia.org/T102099) [16:47:36] presumably all nearby files have the same issue. maybe a typo in the previous recursive one? 
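The `ls` output above shows the file as `-rw-r--r--` with group wikidev, i.e. without the group-write bit that the earlier messages say everything under /srv/mediawiki-staging should have. A minimal Python sketch of how one might list such files before fixing them, assuming the only thing being checked is the g+w bit. The function name `files_missing_group_write` and the command-line interface are hypothetical and are not part of scap or any other tool mentioned in this log:

```python
import os
import stat
import sys
from pathlib import Path


def files_missing_group_write(root):
    """Return the paths below `root` whose mode lacks the group-write bit.

    Hypothetical helper for illustration only. It inspects files and
    directories below `root` (not `root` itself) and changes nothing.
    """
    missing = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            mode = path.lstat().st_mode
            if stat.S_ISLNK(mode):
                # Symlink modes are not meaningful on Linux; skip them.
                continue
            if not mode & stat.S_IWGRP:
                missing.append(path)
    return sorted(missing)


if __name__ == "__main__":
    # Usage: python check_group_write.py <directory>
    for entry in files_missing_group_write(sys.argv[1]):
        print(entry)
```

The sketch only reports; whether to follow up with a recursive `chmod g+w`, and who is allowed to run it, is the question being discussed in the channel.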
[16:47:43] (03PS1) 10Elukey: Add base configuration for an-conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/531277 (https://phabricator.wikimedia.org/T227025) [16:47:48] cdanis: works :) [16:48:01] thanks [16:48:38] (03CR) 10Elukey: [C: 04-1] "TODO: add proper mac addresses" [puppet] - 10https://gerrit.wikimedia.org/r/531277 (https://phabricator.wikimedia.org/T227025) (owner: 10Elukey) [16:48:59] (03PS1) 10Jbond: thumbor: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531278 (https://phabricator.wikimedia.org/T102099) [16:49:31] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.17/includes/resourceloader/ResourceLoaderWikiModule.php: T229433 - f84a4abb418de8 (debugging) (duration: 00m 56s) [16:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:40] T229433: Investigate magically missing array key in ResourceLoaderWikiModule - https://phabricator.wikimedia.org/T229433 [16:49:53] I haven't looked at any details but there have been weird permissions errors all of today :/ [16:50:24] Krinkle: that permissions stuff happened to me before [16:50:31] _joe_ fixed ut [16:50:39] *it [16:51:17] (03PS1) 10Jbond: xhgui::app: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531279 (https://phabricator.wikimedia.org/T102099) [16:52:52] (03PS1) 10Jbond: wqds: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531280 (https://phabricator.wikimedia.org/T102099) [16:53:15] <_joe_> cdanis: shall we just g+w the whole tree? [16:53:28] <_joe_> thcipriani: ^^ [16:53:30] _joe_: I think so, also, do you know anything about l10ncache? 
[16:53:48] I'd vote for that _joe_ [16:53:56] I just did so [16:53:57] <_joe_> not enough to answer questions cdanis [16:54:03] (03Abandoned) 10EBernhardson: Remove PrivateTmp=true from elasticsearch_6@ systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/530969 (owner: 10EBernhardson) [16:54:06] !log ✔️ cdanis@deploy1001.eqiad.wmnet /srv/mediawiki-staging 🕐☕ sudo chmod g+w -R /srv/mediawiki-staging/ [16:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:30] (03PS1) 10Jbond: parsoid: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531281 (https://phabricator.wikimedia.org/T102099) [17:00:04] cscott, arlolra, subbu, halfak, and accraze: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190820T1700). [17:03:53] (03PS1) 10EBernhardson: Set REQUESTS_CA_BUNDLE for mjolnir daemons [puppet] - 10https://gerrit.wikimedia.org/r/531285 [17:07:08] (03CR) 10Jbond: [C: 03+2] ipv6: add mapped ipv6 to yubiauth server [puppet] - 10https://gerrit.wikimedia.org/r/531151 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [17:07:17] (03PS4) 10Jbond: ipv6: add mapped ipv6 to yubiauth server [puppet] - 10https://gerrit.wikimedia.org/r/531151 (https://phabricator.wikimedia.org/T102099) [17:09:10] (03CR) 10Ayounsi: [C: 03+1] ping_offload: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531261 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [17:10:33] (03PS4) 10Jdlrobson: Remove unused remnant from old menu click tracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527615 (https://phabricator.wikimedia.org/T228681) [17:14:50] (03PS3) 10Jbond: syslog::centralserver: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531153 (https://phabricator.wikimedia.org/T102099) [17:17:09] (03CR) 10Jbond: [C: 03+2] 
syslog::centralserver: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531153 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [17:20:27] (03PS2) 10Jbond: mariadb::misc::eventlogging: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531199 (https://phabricator.wikimedia.org/T102099) [17:20:42] 10Operations, 10Readers-Web-Backlog, 10Traffic: [Bug] iPadOS 13 shows the desktop version of Safari with a broken layout - https://phabricator.wikimedia.org/T229875 (10phuedx) >>! In T229875#5396278, @Jdlrobson wrote: > Varnish determines when to show or not show the mobile website and that's purely based on... [17:22:41] (03CR) 10Jbond: [C: 03+2] mariadb::misc::eventlogging: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531199 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [17:25:58] (03PS2) 10EBernhardson: Set REQUESTS_CA_BUNDLE for mjolnir daemons [puppet] - 10https://gerrit.wikimedia.org/r/531285 (https://phabricator.wikimedia.org/T227364) [17:26:21] (03PS2) 10Jbond: analytics_cluster::hadoop::client: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531219 (https://phabricator.wikimedia.org/T102099) [17:26:41] (03PS2) 10EBernhardson: Allow glent indices to auto-create in cirrus clusters [puppet] - 10https://gerrit.wikimedia.org/r/531273 (https://phabricator.wikimedia.org/T227364) [17:26:43] (03PS3) 10EBernhardson: Set REQUESTS_CA_BUNDLE for mjolnir daemons [puppet] - 10https://gerrit.wikimedia.org/r/531285 (https://phabricator.wikimedia.org/T227364) [17:27:12] (03CR) 10Herron: [V: 03+2 C: 03+2] prometheus-ipsec-exporter: initial commit of version 0.3.1 [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/530203 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [17:27:45] 10Operations, 10Discovery-Search (Current work): Run jstack / jmap / etc... 
with PrivateTmp=true - https://phabricator.wikimedia.org/T230774 (10Gehel) a:03Gehel [17:28:56] (03CR) 10Jbond: [C: 03+2] analytics_cluster::hadoop::client: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531219 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [17:29:21] Daimona: do we have a task to remove 'zero' stuff from AbuseFilter? [17:29:33] Yes [17:29:43] T227843 [17:29:44] T227843: Deprecate AbuseFilter's support for Zero - https://phabricator.wikimedia.org/T227843 [17:29:52] awesome :D [17:31:40] 10Operations, 10DNS, 10Domains, 10Traffic: Could not reach wikipedia from domain wikipedia.fi - https://phabricator.wikimedia.org/T230470 (10Zache) https have error NET::ERR_CERT_COMMON_NAME_INVALID - https://wikipedia.fi/ [17:33:17] (03PS4) 10EBernhardson: Set REQUESTS_CA_BUNDLE for mjolnir daemons [puppet] - 10https://gerrit.wikimedia.org/r/531285 (https://phabricator.wikimedia.org/T227364) [17:33:36] (03PS2) 10Jbond: kerberos::kdc: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531220 (https://phabricator.wikimedia.org/T102099) [17:42:16] (03CR) 10Jbond: [C: 03+2] kerberos::kdc: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531220 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [17:45:00] (03PS1) 10Jdlrobson: Disable Wikimedia ReadingDepth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531290 (https://phabricator.wikimedia.org/T208594) [17:45:48] (03PS2) 10Jdlrobson: Disable Wikimedia ReadingDepth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531290 (https://phabricator.wikimedia.org/T229042) [17:48:16] (03PS2) 10Jbond: ping_offload: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531261 (https://phabricator.wikimedia.org/T102099) [17:49:16] (03CR) 10Jbond: [C: 03+2] ping_offload: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531261 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [17:50:12] (03PS1) 
10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/531291 [17:55:44] (03PS2) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/531291 [17:56:28] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/531190 (owner: 10Muehlenhoff) [17:56:35] (03PS6) 10BBlack: anycast recdns: config for 41 eqiad canaries [puppet] - 10https://gerrit.wikimedia.org/r/528524 (https://phabricator.wikimedia.org/T228190) [17:56:37] (03PS7) 10BBlack: anycast recdns: enable for codfw clients [puppet] - 10https://gerrit.wikimedia.org/r/526788 (https://phabricator.wikimedia.org/T228190) [17:56:39] (03PS6) 10BBlack: anycast recdns: enable globally [puppet] - 10https://gerrit.wikimedia.org/r/528525 (https://phabricator.wikimedia.org/T228190) [17:57:45] (03CR) 10BBlack: [C: 03+2] anycast recdns: config for 41 eqiad canaries [puppet] - 10https://gerrit.wikimedia.org/r/528524 (https://phabricator.wikimedia.org/T228190) (owner: 10BBlack) [17:57:55] !log deploying anycast recdns settings to resolv.conf on 41 live hosts in eqiad - https://gerrit.wikimedia.org/r/528524 - T228190 [17:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:04] T228190: Roll out Anycast RecDNS to more servers - https://phabricator.wikimedia.org/T228190 [18:11:52] (03PS3) 10CDanis: deployment: add fix-staging-perms command & sudo for it [puppet] - 10https://gerrit.wikimedia.org/r/531291 [18:20:05] (03CR) 10CDanis: "https://puppet-compiler.wmflabs.org/compiler1001/17966/" [puppet] - 10https://gerrit.wikimedia.org/r/531291 (owner: 10CDanis) [18:33:02] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (10Jclark-ctr) a:05Jclark-ctr→03RobH [18:33:40] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (10Jclark-ctr) asset tagged and added to 
Netbox [18:43:12] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10decommission: Decommission iron - https://phabricator.wikimedia.org/T220505 (10Cmjohnson) [18:43:26] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10decommission: Decommission iron - https://phabricator.wikimedia.org/T220505 (10Cmjohnson) p:05Normal→03High [18:44:53] 10Operations, 10ops-eqiad, 10DC-Ops, 10Wikimedia-Logstash, and 3 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (10Cmjohnson) @jclark-ctr has this ben done? We need the space in rack B2 so please make this a priority item. Thanks! [18:46:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10Wikimedia-Logstash, and 3 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (10Jclark-ctr) Finished wiping i will be removing from rack shortly [18:46:20] 10Operations, 10Icinga, 10observability: icinga-wm bot truncating long messages - https://phabricator.wikimedia.org/T230799 (10CDanis) p:05Triage→03Normal [18:47:12] (03PS1) 10Cmjohnson: Removing mgmt dns entries for logstash1000[4-6] [dns] - 10https://gerrit.wikimedia.org/r/531293 (https://phabricator.wikimedia.org/T217556) [18:47:32] 10Operations, 10ops-eqiad: Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10wiki_willy) Confirmed by Chris that the drive arrived on August 8 [18:50:48] jouncebot: next [18:50:48] In 4 hour(s) and 9 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190820T2300) [18:51:36] (03PS6) 10CDanis: noc: read dbctl JSON from local disk mirror of etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528938 (https://phabricator.wikimedia.org/T229631) [18:56:58] (03CR) 10CDanis: [C: 03+2] noc: read dbctl JSON from local disk mirror of etcd (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528938 (https://phabricator.wikimedia.org/T229631) (owner: 10CDanis) 
[18:58:09] (03Merged) 10jenkins-bot: noc: read dbctl JSON from local disk mirror of etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528938 (https://phabricator.wikimedia.org/T229631) (owner: 10CDanis) [18:58:29] (03CR) 10jenkins-bot: noc: read dbctl JSON from local disk mirror of etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528938 (https://phabricator.wikimedia.org/T229631) (owner: 10CDanis) [19:00:29] !log cdanis@deploy1001 Synchronized docroot/noc/db.php: 80a6743dd noc: read dbctl JSON T229631 (duration: 00m 58s) [19:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:39] T229631: noc.wm.o/db.php: remove hosts information, or fetch it from etcd somehow - https://phabricator.wikimedia.org/T229631 [19:01:25] 10Operations, 10Analytics: Access to HUE for Mayakpwiki - https://phabricator.wikimedia.org/T229143 (10JAllemandou) Action has been taken that should have granted access to shell username `Mayakpwiki`. @Mayakp.wiki can you test please? :) [19:02:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom californium - https://phabricator.wikimedia.org/T189921 (10Cmjohnson) p:05Normal→03High a:05Cmjohnson→03Jclark-ctr John, Can you wipe this server and remove from the rack as soon as you can. Need the space. Thanks! [19:03:28] 10Operations, 10MediaWiki-Configuration, 10conftool, 10Patch-For-Review, 10Performance-Team (Radar): noc.wm.o/db.php: remove hosts information, or fetch it from etcd somehow - https://phabricator.wikimedia.org/T229631 (10CDanis) p:05Triage→03Low https://noc.wikimedia.org/db.php will now stay up-to-da... 
[19:03:45] (03PS1) 10Cmjohnson: Removing mgmt dns for californium [dns] - 10https://gerrit.wikimedia.org/r/531295 (https://phabricator.wikimedia.org/T189921) [19:04:58] !log push BGP_Wikimedia_pops to cr3-ulsfo - T227808 [19:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:06] T227808: Standardize cross confederation BGP policies - https://phabricator.wikimedia.org/T227808 [19:11:32] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (10Jclark-ctr) a:05RobH→03Jclark-ctr [19:12:57] (03PS5) 10Gehel: Set REQUESTS_CA_BUNDLE for mjolnir daemons [puppet] - 10https://gerrit.wikimedia.org/r/531285 (https://phabricator.wikimedia.org/T227364) (owner: 10EBernhardson) [19:13:24] 10Operations, 10ops-eqiad, 10decommission: Decommission labnet1001 & labnet1002 - https://phabricator.wikimedia.org/T221818 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr @Jclark-ctr Please wipe and remove these servers from the rack and update the task -- assign it back to me once done please. 
[19:13:36] 10Operations, 10ops-eqiad, 10decommission: Decommission labnet1001 & labnet1002 - https://phabricator.wikimedia.org/T221818 (10Cmjohnson) p:05Normal→03High [19:14:08] (03CR) 10Gehel: [C: 03+2] Set REQUESTS_CA_BUNDLE for mjolnir daemons [puppet] - 10https://gerrit.wikimedia.org/r/531285 (https://phabricator.wikimedia.org/T227364) (owner: 10EBernhardson) [19:16:02] (03PS1) 10Urbanecm: Log dnsblacklist entries at info level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531299 (https://phabricator.wikimedia.org/T230822) [19:20:40] jouncebot: next [19:20:41] In 3 hour(s) and 39 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190820T2300) [19:23:16] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=303 handler=proxy:fcgi://127.0.0.1:9000 method=POST https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-metho [19:23:21] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10Jclark-ctr) added asset tags updated Netbox [19:23:26] 10Operations, 10netops: Standardize cross confederation BGP policies - https://phabricator.wikimedia.org/T227808 (10ayounsi) Updated the change to no not include the local aggregates to keep the level of changes to a minimum. I do think that's something we would want to add down the road though. Confirmed tha... [19:24:50] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [19:25:14] !log push BGP_Wikimedia_pops to cr4-ulsfo - T227808 [19:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:23] T227808: Standardize cross confederation BGP policies - https://phabricator.wikimedia.org/T227808 [19:25:35] !log cleanup old (pre 1.34.0-wmf.14) wmf/* branches for core and extensions on gerrit [19:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:45] !log push BGP_Wikimedia_pops to eqsin - T227808 [19:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:57] 10Operations, 10Analytics: Access to HUE for Mayakpwiki - https://phabricator.wikimedia.org/T229143 (10Mayakp.wiki) Checked connection and ran queries against mediawiki history. Access is working as expected. Thanks @JAllemandou and @Nuria for your help ! [19:38:17] 10Operations, 10Analytics: Access to HUE for Mayakpwiki - https://phabricator.wikimedia.org/T229143 (10JAllemandou) 05Open→03Resolved [19:48:26] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (10Jclark-ctr) a:05Jclark-ctr→03RobH [19:56:17] (03PS1) 10Nray: Add config to turn AMC Outreach on for Beta servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531303 (https://phabricator.wikimedia.org/T226068) [20:01:37] @MaxSem @Niharika @RoanKattouw is one of you around to to the 4pm SWAT window today? 
https://wikitech.wikimedia.org/wiki/Deployments [20:08:14] PROBLEM - HHVM rendering on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:09:46] RECOVERY - HHVM rendering on mw1286 is OK: HTTP OK: HTTP/1.1 200 OK - 77116 bytes in 5.832 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:17:12] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 46 probes of 449 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [20:22:44] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 22 probes of 449 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [20:38:13] 10Operations, 10netops: csw2-esams's VCP link flapped - https://phabricator.wikimedia.org/T229755 (10ayounsi) JTAC found a core dump on fpc4 and fpc5, sent to Juniper for analysis. 
[20:49:12] !log push BGP_Wikimedia_pops to ams - T227808 [20:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:20] T227808: Standardize cross confederation BGP policies - https://phabricator.wikimedia.org/T227808 [20:50:21] (03PS1) 10Urbanecm: Enable RelatedArticles on all skins on eswikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531311 (https://phabricator.wikimedia.org/T230660) [20:51:56] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531172 (https://phabricator.wikimedia.org/T230797) (owner: 10DannyS712) [20:55:46] jouncebot: next [20:55:46] In 2 hour(s) and 4 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190820T2300) [20:56:38] 10Operations, 10netops: Standardize cross confederation BGP policies - https://phabricator.wikimedia.org/T227808 (10ayounsi) 05Open→03Resolved [20:56:41] 10Operations, 10netops: Cleanup confed BGP peerings and policies - https://phabricator.wikimedia.org/T167841 (10ayounsi) [21:14:28] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@9b40607]: bulk_daemon: Increase max_poll_interval_ms to 15 minutes [21:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:01] 10Operations: Investigate janitor, maintenance emails parser - https://phabricator.wikimedia.org/T230835 (10ayounsi) p:05Triage→03Low [21:20:50] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@9b40607]: bulk_daemon: Increase max_poll_interval_ms to 15 minutes (duration: 06m 22s) [21:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:29] (03CR) 10Jdlrobson: [C: 03+1] Enable RelatedArticles on all skins on eswikinews (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531311 (https://phabricator.wikimedia.org/T230660) (owner: 10Urbanecm) [21:42:09] (03CR) 10Thcipriani: [C: 03+1] "Looks like this could 
save some flail in many cases." [puppet] - 10https://gerrit.wikimedia.org/r/531291 (owner: 10CDanis) [22:18:13] (03CR) 10Urbanecm: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/531291 (owner: 10CDanis) [23:00:04] MaxSem, RoanKattouw, and Niharika: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190820T2300). [23:00:04] Jdlrobson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:24] o/ [23:00:35] I can SWAT today! [23:00:39] HURRAH [23:00:49] ? [23:01:10] (03CR) 10Urbanecm: [C: 03+2] Add config to turn AMC Outreach on for Beta servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531303 (https://phabricator.wikimedia.org/T226068) (owner: 10Nray) [23:01:16] doing the beta only change now [23:01:17] (03PS1) 10Smalyshev: Setup RDF configuration for Commons Beta with correct prefixes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531317 (https://phabricator.wikimedia.org/T230840) [23:01:19] should require just +2 [23:02:12] (03PS5) 10Urbanecm: Update wgSkipSkins to experiment with not showing skins to users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511321 (https://phabricator.wikimedia.org/T223824) (owner: 10Jdlrobson) [23:02:16] (03Merged) 10jenkins-bot: Add config to turn AMC Outreach on for Beta servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531303 (https://phabricator.wikimedia.org/T226068) (owner: 10Nray) [23:02:20] (03CR) 10jerkins-bot: [V: 04-1] Setup RDF configuration for Commons Beta with correct prefixes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531317 (https://phabricator.wikimedia.org/T230840) (owner: 10Smalyshev) [23:02:36] (03CR) 10Urbanecm: [C: 03+2] Update wgSkipSkins to experiment with not showing skins to users 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/511321 (https://phabricator.wikimedia.org/T223824) (owner: 10Jdlrobson) [23:02:46] (03CR) 10jenkins-bot: Add config to turn AMC Outreach on for Beta servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531303 (https://phabricator.wikimedia.org/T226068) (owner: 10Nray) [23:02:50] jdlrobson: +2'ed 511321, will ping you once it's on mwdebug [23:02:57] roger that! [23:03:13] (03PS2) 10Smalyshev: Setup RDF configuration for Commons Beta with correct prefixes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531317 (https://phabricator.wikimedia.org/T230840) [23:03:19] just excited there's a SWATter today - last SWAT window I went to nobody showed up :) [23:03:37] (03Merged) 10jenkins-bot: Update wgSkipSkins to experiment with not showing skins to users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511321 (https://phabricator.wikimedia.org/T223824) (owner: 10Jdlrobson) [23:04:26] jdlrobson: 511321 is ready to be tested on mwdebug1002 [23:04:33] on it [23:04:57] thanks [23:05:04] (03CR) 10jenkins-bot: Update wgSkipSkins to experiment with not showing skins to users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511321 (https://phabricator.wikimedia.org/T223824) (owner: 10Jdlrobson) [23:05:42] Urbanecm: https://gerrit.wikimedia.org/r/511321 can be synced [23:06:03] jdlrobson: thanks. Just to ensure, can https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.08.20/mediawiki?id=AWyxRq5aZKA7RpirHjPg&_g=h@44136fa be caused by your patch? [23:07:40] looking [23:08:00] Nope. 
Looks like Echo related issue [23:09:07] ok jdlrobson [23:09:09] syncing [23:09:38] (03CR) 10Urbanecm: [C: 03+2] Remove unused remnant from old menu click tracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527615 (https://phabricator.wikimedia.org/T228681) (owner: 10Jdlrobson) [23:09:52] (03PS5) 10Urbanecm: Remove unused remnant from old menu click tracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527615 (https://phabricator.wikimedia.org/T228681) (owner: 10Jdlrobson) [23:10:00] (03CR) 10Urbanecm: [C: 03+2] Remove unused remnant from old menu click tracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527615 (https://phabricator.wikimedia.org/T228681) (owner: 10Jdlrobson) [23:10:34] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: b94b647: Update wgSkipSkins to experiment with not showing skins to users (T223824) (duration: 00m 58s) [23:10:39] jdlrobson: synced [23:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:45] T223824: Hide Cologne Blue skin in preferences - https://phabricator.wikimedia.org/T223824 [23:11:10] (03Merged) 10jenkins-bot: Remove unused remnant from old menu click tracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527615 (https://phabricator.wikimedia.org/T228681) (owner: 10Jdlrobson) [23:11:25] (03CR) 10jenkins-bot: Remove unused remnant from old menu click tracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527615 (https://phabricator.wikimedia.org/T228681) (owner: 10Jdlrobson) [23:12:02] jdlrobson: not sure if https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/527615 can be tested on mwdebug1002, but if so, it's ready for testing [23:12:56] (03CR) 10Urbanecm: [C: 03+2] Disable Wikimedia ReadingDepth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531290 (https://phabricator.wikimedia.org/T229042) (owner: 10Jdlrobson) [23:13:48] Urbanecm: testing it now [23:13:55] thanks jdlrobson [23:14:52] Urbanecm: 
that can be synced [23:15:06] jdlrobson: ack, syncing [23:15:23] (03PS3) 10Urbanecm: Disable Wikimedia ReadingDepth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531290 (https://phabricator.wikimedia.org/T229042) (owner: 10Jdlrobson) [23:15:36] (03CR) 10Urbanecm: [C: 03+2] Disable Wikimedia ReadingDepth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531290 (https://phabricator.wikimedia.org/T229042) (owner: 10Jdlrobson) [23:16:16] (03CR) 10Urbanecm: [C: 03+2] Grant skipcaptcha to everyone coming from whitelisted IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523005 (https://phabricator.wikimedia.org/T227487) (owner: 10Urbanecm) [23:16:34] (03Merged) 10jenkins-bot: Disable Wikimedia ReadingDepth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531290 (https://phabricator.wikimedia.org/T229042) (owner: 10Jdlrobson) [23:16:39] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 0c08257: Remove unused remnant from old menu click tracking (T228681) (duration: 00m 55s) [23:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:48] T228681: Decommision MobileWebMainMenuClickTracking - https://phabricator.wikimedia.org/T228681 [23:16:50] synced! 
[23:17:05] (03CR) 10jenkins-bot: Disable Wikimedia ReadingDepth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531290 (https://phabricator.wikimedia.org/T229042) (owner: 10Jdlrobson) [23:17:15] jdlrobson: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/531290 is ready to be tested now [23:17:16] (03Merged) 10jenkins-bot: Grant skipcaptcha to everyone coming from whitelisted IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523005 (https://phabricator.wikimedia.org/T227487) (owner: 10Urbanecm) [23:17:39] that one can be synced to @Urbanecm [23:17:55] syncing [23:19:12] (03CR) 10jenkins-bot: Grant skipcaptcha to everyone coming from whitelisted IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523005 (https://phabricator.wikimedia.org/T227487) (owner: 10Urbanecm) [23:19:18] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 13be059: Disable Wikimedia ReadingDepth (T229042) (duration: 00m 56s) [23:19:24] synced! [23:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:27] T229042: Reading_depth: remove eventlogging instrumentation - https://phabricator.wikimedia.org/T229042 [23:20:47] jdlrobson: should be all done, right? [23:20:55] yep Urbanecm just monitoring some graphs [23:21:00] thank you so much for your help today! [23:21:04] happy to help! 
[23:22:31] !log urbanecm@deploy1001 Synchronized wmf-config/throttle-analyze.php: SWAT: a3927a7: Grant skipcaptcha to everyone coming from whitelisted IP (T227487) (duration: 00m 54s) [23:22:48] (03PS2) 10Urbanecm: Enable RelatedArticles on all skins on eswikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531311 (https://phabricator.wikimedia.org/T230660) [23:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:50] T227487: Add skipcaptcha to all users from throttle-exempted IP addresses - https://phabricator.wikimedia.org/T227487 [23:22:53] (03CR) 10Urbanecm: [C: 03+2] Enable RelatedArticles on all skins on eswikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531311 (https://phabricator.wikimedia.org/T230660) (owner: 10Urbanecm) [23:23:56] (03Merged) 10jenkins-bot: Enable RelatedArticles on all skins on eswikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531311 (https://phabricator.wikimedia.org/T230660) (owner: 10Urbanecm) [23:23:58] (03PS3) 10Urbanecm: Clean up `wgNamespacesToBeSearchedDefault` to remove unneeded entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531172 (https://phabricator.wikimedia.org/T230797) (owner: 10DannyS712) [23:24:05] (03CR) 10Urbanecm: [C: 03+2] Clean up `wgNamespacesToBeSearchedDefault` to remove unneeded entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531172 (https://phabricator.wikimedia.org/T230797) (owner: 10DannyS712) [23:24:06] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001:9501 job=burrow partition={0,1,2} site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/o [23:24:06] gging-eqiad&var-topic=All&var-consumer_group=All [23:24:28] (03CR) 
10jenkins-bot: Enable RelatedArticles on all skins on eswikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531311 (https://phabricator.wikimedia.org/T230660) (owner: 10Urbanecm) [23:25:12] (03Merged) 10jenkins-bot: Clean up `wgNamespacesToBeSearchedDefault` to remove unneeded entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531172 (https://phabricator.wikimedia.org/T230797) (owner: 10DannyS712) [23:26:35] (03CR) 10jenkins-bot: Clean up `wgNamespacesToBeSearchedDefault` to remove unneeded entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531172 (https://phabricator.wikimedia.org/T230797) (owner: 10DannyS712) [23:27:24] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: fd2cece: Enable RelatedArticles on all skins on eswikinews (T230660) (duration: 00m 52s) [23:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:32] T230660: Add RelatedArticles to es.wikinews - https://phabricator.wikimedia.org/T230660 [23:29:24] 10Operations, 10Wikimedia-Logstash: Logstash gets significantly lower number of messages from mediawiki - https://phabricator.wikimedia.org/T230847 (10Urbanecm) [23:29:30] 10Operations, 10Wikimedia-Logstash: Logstash gets significantly lower number of messages from mediawiki - https://phabricator.wikimedia.org/T230847 (10Urbanecm) p:05Triage→03Unbreak! [23:29:53] ^^ created an UBN, Logstash seems to be getting very low number of messages ^^ [23:30:22] Right now, I'm in the middle of deploying https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/531172 [23:30:33] can either revert the patch or deploy anyway, any advices? 
[23:32:26] Urbanecm: with prod always rollback if you aren't sure [23:32:31] usually safest bet :) [23:32:42] I didn't touch anything [23:32:46] I have just a patch merged :) [23:33:04] (well, I have deployed several patches before, meant this part only) [23:33:23] Urbanecm: right, but i mean if you aren't sure if you should procede or revert, the answer is almost always revert [23:33:37] * ebernhardson can't spell either... [23:34:33] (03PS1) 10Urbanecm: Revert "Clean up `wgNamespacesToBeSearchedDefault` to remove unneeded entries" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531321 (https://phabricator.wikimedia.org/T230797) [23:34:39] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Revert "Clean up `wgNamespacesToBeSearchedDefault` to remove unneeded entries" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531321 (https://phabricator.wikimedia.org/T230797) (owner: 10Urbanecm) [23:34:49] fair enough ebernhardson [23:35:03] patch is gone, and I hope someone is awake to look into Logstash... [23:35:41] Urbanecm: i hope so too, i took a quick look but nothing seems obvious to me [23:35:51] 10Operations, 10Wikimedia-Logstash: Logstash gets significantly lower number of messages from mediawiki - https://phabricator.wikimedia.org/T230847 (10Urbanecm) [23:35:59] ebernhardson: in case you didn't see, T228089 is the same thing [23:35:59] T228089: Logstash down for MediaWiki - https://phabricator.wikimedia.org/T228089 [23:36:03] (occurred before) [23:36:37] (03CR) 10jenkins-bot: Revert "Clean up `wgNamespacesToBeSearchedDefault` to remove unneeded entries" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531321 (https://phabricator.wikimedia.org/T230797) (owner: 10Urbanecm) [23:38:11] 10Operations, 10Wikimedia-Logstash, 10Wikimedia-Incident: Logstash gets significantly lower number of messages from mediawiki - https://phabricator.wikimedia.org/T230847 (10Urbanecm) [23:38:41] Urbanecm: just poking around, it seems most of the decrease is for channel:diff. 
It increased from 5k/10min to 22k/10min over ~8 hours, and then dropped back to ~7k/10min [23:39:44] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [23:40:16] seems to be back up? [23:42:54] !log Evening SWAT aborted due to no logs logged for some period of time (T230847), no patches were reverted [23:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:02] T230847: Logstash gets significantly lower number of messages from mediawiki - https://phabricator.wikimedia.org/T230847 [23:43:33] anyway, it's too scary for me to continue, let's leave the rest for later :) [23:49:48] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@a7bf6cf]: bulk_daemon: Increase bulk request_timeout [23:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:28] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@a7bf6cf]: bulk_daemon: Increase bulk request_timeout (duration: 03m 40s) [23:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log