[00:02:45] !log installing and setting up netbox instances T223291 [00:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:02:47] T223291: Netbox: move it to dedicated Ganeti VMs - https://phabricator.wikimedia.org/T223291 [00:08:28] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1021 with 10G interfaces - https://phabricator.wikimedia.org/T229873 (10Andrew) I can confirm that this seems to have 10g networking set up already. So all that's left dc-wise is the raid setup. I don... [00:08:36] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1022 with 10G interfaces - https://phabricator.wikimedia.org/T229872 (10Andrew) I can confirm that this seems to have 10g networking set up already. So all that's left dc-wise is the raid setup. I don... [00:09:01] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1021 with 10G interfaces - https://phabricator.wikimedia.org/T229873 (10Andrew) [00:09:04] (03PS1) 10Cmjohnson: Adding mgmt dns for cloudcephmon100[1-3] [dns] - 10https://gerrit.wikimedia.org/r/534255 (https://phabricator.wikimedia.org/T228102) [00:09:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1022 with 10G interfaces - https://phabricator.wikimedia.org/T229872 (10Andrew) a:05Jclark-ctr→03Cmjohnson [00:09:35] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1021 with 10G interfaces - https://phabricator.wikimedia.org/T229873 (10Andrew) a:05Jclark-ctr→03Cmjohnson [00:10:37] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (10Cmjohnson) @Jclark-ctr Please set up the idrac and add the mgmt dns. 
Let me know if you have any issues or questions. I... [00:11:58] (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns for cloudcephmon100[1-3] [dns] - 10https://gerrit.wikimedia.org/r/534255 (https://phabricator.wikimedia.org/T228102) (owner: 10Cmjohnson) [00:19:24] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:34:48] PROBLEM - OSPF status on cr2-knams is CRITICAL: OSPFv2: 3/5 UP : OSPFv3: 3/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:38:46] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:40:48] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:41:52] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:42:16] RECOVERY - OSPF status on cr2-knams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:42:24] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:42:32] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:44:16] (03PS1) 10CRusnov: postgres: Fix deps for Buster [puppet] - 10https://gerrit.wikimedia.org/r/534259 [00:49:38] (03CR) 10Ayounsi: [C: 03+1] postgres: Fix deps for Buster [puppet] - 10https://gerrit.wikimedia.org/r/534259 (owner: 10CRusnov) [00:49:50] (03CR) 10CRusnov: [C: 03+2] postgres: Fix deps for Buster [puppet] - 10https://gerrit.wikimedia.org/r/534259 (owner: 10CRusnov) [00:54:59] (03PS8) 10Krinkle: CommonSettings: Store mtime inside wmf-config cache file [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/528447 (https://phabricator.wikimedia.org/T217830) [00:55:02] (03CR) 10Krinkle: [C: 03+2] CommonSettings: Store mtime inside wmf-config cache file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528447 (https://phabricator.wikimedia.org/T217830) (owner: 10Krinkle) [00:55:11] * Krinkle staging for deploy on mwdebug1002 [00:56:01] (03Merged) 10jenkins-bot: CommonSettings: Store mtime inside wmf-config cache file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528447 (https://phabricator.wikimedia.org/T217830) (owner: 10Krinkle) [00:56:32] (03CR) 10jenkins-bot: CommonSettings: Store mtime inside wmf-config cache file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528447 (https://phabricator.wikimedia.org/T217830) (owner: 10Krinkle) [01:02:06] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: ed5297c10 / T217830 (duration: 00m 59s) [01:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:02:23] T217830: Changing dblist files requires mtime touch of InitialiseSettings.php - https://phabricator.wikimedia.org/T217830 [01:16:02] (03PS1) 10CRusnov: netbox::postgres: Update ferm rules for FEs [puppet] - 10https://gerrit.wikimedia.org/r/534262 [01:16:31] (03PS1) 10Ayounsi: Fix dependencies [debs/pynetbox] - 10https://gerrit.wikimedia.org/r/534263 [01:18:26] (03CR) 10jerkins-bot: [V: 04-1] netbox::postgres: Update ferm rules for FEs [puppet] - 10https://gerrit.wikimedia.org/r/534262 (owner: 10CRusnov) [01:25:13] (03PS2) 10CRusnov: netbox::postgres: Update ferm rules for FEs [puppet] - 10https://gerrit.wikimedia.org/r/534262 [01:27:11] (03CR) 10jerkins-bot: [V: 04-1] netbox::postgres: Update ferm rules for FEs [puppet] - 10https://gerrit.wikimedia.org/r/534262 (owner: 10CRusnov) [01:30:25] (03PS3) 10CRusnov: netbox::postgres: Update ferm rules for FEs [puppet] - 10https://gerrit.wikimedia.org/r/534262 [01:33:37] (03PS4) 10CRusnov: netbox::postgres: Update ferm rules for FEs 
[puppet] - 10https://gerrit.wikimedia.org/r/534262 [01:47:04] (03PS5) 10CRusnov: netbox::postgres: Update ferm rules for FEs [puppet] - 10https://gerrit.wikimedia.org/r/534262 [01:49:05] (03CR) 10jerkins-bot: [V: 04-1] netbox::postgres: Update ferm rules for FEs [puppet] - 10https://gerrit.wikimedia.org/r/534262 (owner: 10CRusnov) [01:51:16] (03PS6) 10CRusnov: netbox::postgres: Update ferm rules for FEs [puppet] - 10https://gerrit.wikimedia.org/r/534262 [01:53:20] (03CR) 10jerkins-bot: [V: 04-1] netbox::postgres: Update ferm rules for FEs [puppet] - 10https://gerrit.wikimedia.org/r/534262 (owner: 10CRusnov) [01:55:58] (03PS7) 10CRusnov: netbox::postgres: Update ferm rules for FEs [puppet] - 10https://gerrit.wikimedia.org/r/534262 [01:59:15] (03CR) 10CRusnov: [C: 03+2] netbox::postgres: Update ferm rules for FEs [puppet] - 10https://gerrit.wikimedia.org/r/534262 (owner: 10CRusnov) [02:07:20] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [02:08:46] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [02:21:01] !log extending downtime on netmon1002 and netmon2001, netbox1001, netbox2001, netboxdb1001 and netbox2001 should be stable but are still being debugged [02:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:42:34] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.21/resources/src/mediawiki.base/mediawiki.base.js: 8a1b13026 (duration: 00m 56s) [02:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:46:35] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.21/resources/src/startup/mediawiki.js: 8a1b13026 (duration: 00m 55s) [02:46:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:47:21] (03CR) 10Dzahn: [C: 03+2] Remove role::ci::slave::webperformance [puppet] - 10https://gerrit.wikimedia.org/r/531420 (https://phabricator.wikimedia.org/T225416) (owner: 10Hashar) [03:47:33] (03PS3) 10Dzahn: Remove role::ci::slave::webperformance [puppet] - 10https://gerrit.wikimedia.org/r/531420 (https://phabricator.wikimedia.org/T225416) (owner: 10Hashar) [04:10:15] (03PS1) 10Dzahn: tlsproxy/envoy: create user/group if on jessie [puppet] - 10https://gerrit.wikimedia.org/r/534267 [04:13:22] (03CR) 10Dzahn: [C: 03+2] tlsproxy/envoy: create user/group if on jessie [puppet] - 10https://gerrit.wikimedia.org/r/534267 (owner: 10Dzahn) [04:17:51] (03CR) 10Dzahn: "puppet runs now without issues on ununpentium (jessie with envoy class)" [puppet] - 10https://gerrit.wikimedia.org/r/534267 (owner: 10Dzahn) [04:26:34] (03CR) 10Dzahn: [C: 03+2] "[mwmaint1002:~] $ file /usr/bin/php" [puppet] - 10https://gerrit.wikimedia.org/r/534012 (https://phabricator.wikimedia.org/T230110) (owner: 10Dzahn) [04:27:04] (03PS2) 10Dzahn: switch RUNNER in foreachwikiindblist back to just 'php' [puppet] - 10https://gerrit.wikimedia.org/r/534012 (https://phabricator.wikimedia.org/T230110) [04:30:46] 10Operations, 10MediaWiki-Maintenance-scripts, 10serviceops, 10Patch-For-Review: Stop forcing RUNNER=php for foreachwiki/foreachwikiindblist - https://phabricator.wikimedia.org/T230110 (10Dzahn) 05Open→03Resolved a:03Dzahn [04:30:54] 10Operations, 10MediaWiki-extensions-Mailgun, 10cloud-services-team, 10serviceops, and 5 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10Dzahn) [04:31:19] (03PS3) 10Dzahn: remove wikiba.se microsite puppetization [puppet] - 10https://gerrit.wikimedia.org/r/532972 (https://phabricator.wikimedia.org/T99531) [04:33:11] (03CR) 10Dzahn: [C: 03+2] "ATS backend has already been removed" [puppet] - 10https://gerrit.wikimedia.org/r/532972 
(https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn) [04:46:09] (03Abandoned) 10Dzahn: labs.yaml: remove wikibase in cloud vps [puppet] - 10https://gerrit.wikimedia.org/r/532975 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn) [04:47:56] (03PS2) 10Dzahn: wikistats (cloud): remove php5 support [puppet] - 10https://gerrit.wikimedia.org/r/533161 [05:06:05] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [05:07:51] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [05:08:08] (03PS1) 10Marostegui: db1073: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/534269 (https://phabricator.wikimedia.org/T231892) [05:08:59] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [05:09:09] (03CR) 10Marostegui: [C: 03+2] db1073: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/534269 (https://phabricator.wikimedia.org/T231892) (owner: 10Marostegui) [05:09:35] 10Operations, 10DBA, 10Patch-For-Review: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (10Marostegui) [05:10:01] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Development services), and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (10mmodell) [05:12:52] (03CR) 10Dzahn: [C: 03+2] wikistats (cloud): remove php5 support [puppet] - 10https://gerrit.wikimedia.org/r/533161 (owner: 10Dzahn) [05:13:02] (03PS3) 10Dzahn: wikistats (cloud): remove php5 support [puppet] - 10https://gerrit.wikimedia.org/r/533161 [05:14:10] (03PS1) 10Marostegui: mariadb: Provision dbproxy1017 to 
replace dbproxy1005 [puppet] - 10https://gerrit.wikimedia.org/r/534270 (https://phabricator.wikimedia.org/T202367) [05:15:41] (03Abandoned) 10Dzahn: Gerrit: Switch 'mirror' back on for the GitHub remote [puppet] - 10https://gerrit.wikimedia.org/r/528433 (owner: 10Paladox) [05:16:11] (03CR) 10Dzahn: "@Paladox when should this be merged? at which version?" [puppet] - 10https://gerrit.wikimedia.org/r/532391 (owner: 10Paladox) [05:17:36] (03CR) 10Dzahn: "ok with you to call it a duplicate of https://gerrit.wikimedia.org/r/c/operations/puppet/+/511614?" [puppet] - 10https://gerrit.wikimedia.org/r/494811 (owner: 10Paladox) [05:22:25] (03PS2) 10Marostegui: mariadb: Provision dbproxy1017 to replace dbproxy1005 [puppet] - 10https://gerrit.wikimedia.org/r/534270 (https://phabricator.wikimedia.org/T202367) [05:24:28] (03PS3) 10Marostegui: mariadb: Provision dbproxy1017 to replace dbproxy1005 [puppet] - 10https://gerrit.wikimedia.org/r/534270 (https://phabricator.wikimedia.org/T202367) [05:29:03] (03PS4) 10Marostegui: mariadb: Provision dbproxy1017 to replace dbproxy1005 [puppet] - 10https://gerrit.wikimedia.org/r/534270 (https://phabricator.wikimedia.org/T202367) [05:31:38] (03CR) 10Marostegui: [C: 03+2] mariadb: Provision dbproxy1017 to replace dbproxy1005 [puppet] - 10https://gerrit.wikimedia.org/r/534270 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [05:32:45] PROBLEM - Disk space on notebook1004 is CRITICAL: DISK CRITICAL - free space: /srv 2967 MB (2% inode=81%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops [05:34:05] (03PS1) 10Dzahn: ATS/varnish/parsoid-testing: remove directors for parsoid-vd/parsoid-rt [puppet] - 10https://gerrit.wikimedia.org/r/534271 (https://phabricator.wikimedia.org/T229356) [05:36:32] (03PS1) 10Marostegui: dbproxy1017: Allow reimage [puppet] - 10https://gerrit.wikimedia.org/r/534272 
(https://phabricator.wikimedia.org/T202367) [05:37:31] (03CR) 10Marostegui: [C: 03+2] dbproxy1017: Allow reimage [puppet] - 10https://gerrit.wikimedia.org/r/534272 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [05:51:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [05:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [05:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:32] 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (10Marostegui) [05:55:37] (03CR) 10Dzahn: "@paladox you still need this? would it work for you? the idea is you put the entire ldapconfig snippet in Hiera" [puppet] - 10https://gerrit.wikimedia.org/r/511614 (owner: 10Dzahn) [05:58:53] (03PS2) 10Vgutierrez: Release 8.0.5-1wm5 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/534118 (https://phabricator.wikimedia.org/T231859) [06:10:34] (03PS5) 10Dzahn: gerrit: allow customizing LDAP config in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/511614 [06:14:25] (03PS3) 10Vgutierrez: Release 8.0.5-1wm5 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/534118 (https://phabricator.wikimedia.org/T231859) [06:15:17] (03PS1) 10Dzahn: re-add my (Dzahn) root key for cloud VPS [labs/private] - 10https://gerrit.wikimedia.org/r/534275 [06:25:54] (03PS2) 10Dzahn: re-add my (Dzahn) root key for cloud VPS [labs/private] - 10https://gerrit.wikimedia.org/r/534275 [06:26:49] (03CR) 10Dzahn: [C: 03+1] "no diff on prod https://puppet-compiler.wmflabs.org/compiler1001/18165/" [puppet] - 10https://gerrit.wikimedia.org/r/511614 (owner: 10Dzahn) [06:34:10] 10Operations, 10Traffic, 10Wikidata, 10serviceops, and 4 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10Dzahn) >>! 
In T99531#5414154, @BBlack wrote: > track down various revert patches first before we close it up (revert the DNS repo stuff and w... [06:35:17] (03PS1) 10Dzahn: delete wikiba.se [dns] - 10https://gerrit.wikimedia.org/r/534276 (https://phabricator.wikimedia.org/T99531) [06:35:48] (03CR) 10jerkins-bot: [V: 04-1] delete wikiba.se [dns] - 10https://gerrit.wikimedia.org/r/534276 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn) [06:40:06] (03CR) 10Hashar: "Thank you very much!" [puppet] - 10https://gerrit.wikimedia.org/r/531420 (https://phabricator.wikimedia.org/T225416) (owner: 10Hashar) [06:41:20] (03CR) 10Dzahn: "You're welcome!" [puppet] - 10https://gerrit.wikimedia.org/r/531420 (https://phabricator.wikimedia.org/T225416) (owner: 10Hashar) [06:43:17] (03PS2) 10Dzahn: delete wikiba.se [dns] - 10https://gerrit.wikimedia.org/r/534276 (https://phabricator.wikimedia.org/T99531) [06:45:36] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 59.73 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [06:47:48] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 73.39 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [06:53:57] (03PS2) 10Giuseppe Lavagetto: profile::lvs::realserver: remove absenting of old restart script [puppet] - 10https://gerrit.wikimedia.org/r/532672 [06:54:32] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki: remove per-host high CPU alerts [puppet] - 10https://gerrit.wikimedia.org/r/531142 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [07:05:50] (03PS1) 10Dzahn: requesttracker: do not load httpd ssl module [puppet] - 10https://gerrit.wikimedia.org/r/534278 [07:07:36] (03CR) 10Dzahn: [C: 
03+2] requesttracker: do not load httpd ssl module [puppet] - 10https://gerrit.wikimedia.org/r/534278 (owner: 10Dzahn) [07:20:51] (03CR) 10Muehlenhoff: [C: 03+1] "Seems fine" [puppet] - 10https://gerrit.wikimedia.org/r/531227 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [07:21:46] !log ununpentium - a2dismod ssl - systemctl restart apache2 [07:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:33] (03PS3) 10Giuseppe Lavagetto: profile::lvs::realserver: remove absenting of old restart script [puppet] - 10https://gerrit.wikimedia.org/r/532672 [07:25:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::lvs::realserver: remove absenting of old restart script [puppet] - 10https://gerrit.wikimedia.org/r/532672 (owner: 10Giuseppe Lavagetto) [07:25:49] <_joe_> is jenkins still slow? [07:26:05] <_joe_> nope [07:26:38] (03PS1) 10Marostegui: dbproxy1017: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/534292 (https://phabricator.wikimedia.org/T202367) [07:27:15] (03PS2) 10Marostegui: dbproxy1017: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/534292 (https://phabricator.wikimedia.org/T202367) [07:28:13] (03CR) 10Marostegui: [C: 03+2] dbproxy1017: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/534292 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [07:32:52] 10Operations, 10DBA: Migrate MySQLs to use ROW-based replication - https://phabricator.wikimedia.org/T109179 (10jcrespo) @TK-999 Please note that this is an infrastructure limitation, which means it is mostly related to Wikimedia servers, not mediawiki. As I see it, our main limitations are: * Compatibility f... [07:36:40] (03PS6) 10Ema: envoyproxy: allow overriding TLS port [puppet] - 10https://gerrit.wikimedia.org/r/534184 [07:38:55] PROBLEM - Check systemd state on db2103 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:38:57] PROBLEM - Check whether ferm is active by checking the default input chain on db2103 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:40:10] 10Operations, 10Traffic, 10docker-pkg, 10serviceops: Getting registry metadata from a public client fails on our registry - https://phabricator.wikimedia.org/T220085 (10ema) It seems that CL is returned properly now: ` $ curl -v --http1.1 https://docker-registry.wikimedia.org/v2/python3/manifests/latest 2... [07:40:32] (03CR) 10Ema: [C: 03+2] envoyproxy: allow overriding TLS port [puppet] - 10https://gerrit.wikimedia.org/r/534184 (owner: 10Ema) [07:43:29] RECOVERY - Check systemd state on db2103 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:43:33] RECOVERY - Check whether ferm is active by checking the default input chain on db2103 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:45:35] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db2047 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534294 (https://phabricator.wikimedia.org/T231852) [07:46:39] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Remove db2047 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534294 (https://phabricator.wikimedia.org/T231852) (owner: 10Marostegui) [07:47:42] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2047 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534294 (https://phabricator.wikimedia.org/T231852) (owner: 10Marostegui) [07:48:45] (03PS3) 10Giuseppe Lavagetto: Add the mediawiki.restart-appservers cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/527487 [07:49:02] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove 
db2047 from config T231852 (duration: 00m 57s) [07:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:05] T231852: Decommission db2047.codfw.wmnet - https://phabricator.wikimedia.org/T231852 [07:50:02] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2047 from config T231852 (duration: 00m 54s) [07:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:16] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2047 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534294 (https://phabricator.wikimedia.org/T231852) (owner: 10Marostegui) [07:50:40] (03CR) 10Giuseppe Lavagetto: Add the mediawiki.restart-appservers cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/527487 (owner: 10Giuseppe Lavagetto) [07:53:01] (03PS3) 10Ema: restbase: TLS termination with envoy on port 444 [puppet] - 10https://gerrit.wikimedia.org/r/533028 (https://phabricator.wikimedia.org/T210411) [07:53:39] (03PS1) 10Marostegui: dbproxy1017: Clarify that it belongs to m5 [puppet] - 10https://gerrit.wikimedia.org/r/534297 (https://phabricator.wikimedia.org/T202367) [07:54:53] (03CR) 10Marostegui: [C: 03+2] dbproxy1017: Clarify that it belongs to m5 [puppet] - 10https://gerrit.wikimedia.org/r/534297 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [07:56:23] (03CR) 10Ema: "pcc looks reasonable https://puppet-compiler.wmflabs.org/compiler1001/18167/" [puppet] - 10https://gerrit.wikimedia.org/r/533028 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [07:58:42] 10Operations, 10DBA: Decommission dbproxy1005.eqiad.wmnet - https://phabricator.wikimedia.org/T231967 (10Marostegui) [07:59:33] 10Operations, 10DBA: Decommission dbproxy1005.eqiad.wmnet - https://phabricator.wikimedia.org/T231967 (10Marostegui) p:05Triage→03Normal Nothing uses dbproxy1005, but I am going to stop haproxy and leave it stopped for some hours before fully decommissioning this 
host just in case. [07:59:49] 10Operations, 10DBA: Decommission dbproxy1005.eqiad.wmnet - https://phabricator.wikimedia.org/T231967 (10Marostegui) [08:01:40] (03PS1) 10Marostegui: dbproxy1005: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/534298 (https://phabricator.wikimedia.org/T231967) [08:01:42] (03CR) 10Muehlenhoff: Add the mediawiki.restart-appservers cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/527487 (owner: 10Giuseppe Lavagetto) [08:05:32] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 57.15 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:06:27] (03CR) 10Marostegui: [C: 03+2] dbproxy1005: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/534298 (https://phabricator.wikimedia.org/T231967) (owner: 10Marostegui) [08:07:34] (03PS4) 10Ema: restbase: TLS termination with envoy on port 7443 [puppet] - 10https://gerrit.wikimedia.org/r/533028 (https://phabricator.wikimedia.org/T210411) [08:09:56] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 70.71 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:12:22] these seem like "simple fluctuations" in codfw traffic ^ [08:14:18] (03PS1) 10Giuseppe Lavagetto: envoyproxy: create a python 3.4 version of build_envoy_config [puppet] - 10https://gerrit.wikimedia.org/r/534345 [08:16:23] (03CR) 10jerkins-bot: [V: 04-1] envoyproxy: create a python 3.4 version of build_envoy_config [puppet] - 10https://gerrit.wikimedia.org/r/534345 (owner: 10Giuseppe Lavagetto) [08:19:46] (03PS1) 10Marostegui: mariadb: Promote db1135 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/534386 (https://phabricator.wikimedia.org/T231403) 
[08:20:29] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/534386 (https://phabricator.wikimedia.org/T231403) (owner: 10Marostegui) [08:22:02] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add the mediawiki.restart-appservers cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/527487 (owner: 10Giuseppe Lavagetto) [08:26:06] !log Reboot db1135 to pick up new kernel - T231403 [08:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:10] T231403: Switchover m1 primary master: db1063 to db1135: Tuesday 10th September at 16:00 UTC - https://phabricator.wikimedia.org/T231403 [08:29:46] (03CR) 10Alexandros Kosiaris: [C: 03+1] ganeti: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531227 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [08:30:19] (03CR) 10Filippo Giunchedi: "Probably a redundant rule, LGTM otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/533282 (https://phabricator.wikimedia.org/T230570) (owner: 10Herron) [08:32:17] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: deploy prometheus-ipsec-exporter to all sites [puppet] - 10https://gerrit.wikimedia.org/r/534210 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [08:32:52] (03CR) 10Giuseppe Lavagetto: Add the mediawiki.restart-appservers cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/527487 (owner: 10Giuseppe Lavagetto) [08:33:23] (03PS4) 10Giuseppe Lavagetto: Add the mediawiki.restart-appservers cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/527487 [08:34:05] (03CR) 10Giuseppe Lavagetto: Add the mediawiki.restart-appservers cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/527487 (owner: 10Giuseppe Lavagetto) [08:37:11] !log upgrading API canaries in eqiad to 7.2.22 [08:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:02] (03PS1) 10Ema: varnish: reload 
service upon -common separate VCL changes [puppet] - 10https://gerrit.wikimedia.org/r/534387 [08:41:03] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 49.19 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:42:09] !log Stop HAproxy on dbproxy1005 - T231967 [08:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:12] T231967: Decommission dbproxy1005.eqiad.wmnet - https://phabricator.wikimedia.org/T231967 [08:47:40] (03CR) 10Ema: "pcc here https://puppet-compiler.wmflabs.org/compiler1002/18168/" [puppet] - 10https://gerrit.wikimedia.org/r/534387 (owner: 10Ema) [08:48:43] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [08:50:14] (03PS5) 10Giuseppe Lavagetto: Add the mediawiki.restart-appservers cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/527487 [08:52:28] (03CR) 10jerkins-bot: [V: 04-1] Add the mediawiki.restart-appservers cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/527487 (owner: 10Giuseppe Lavagetto) [08:56:26] !log upgrading mw1238-mw1258 to PHP 7.2.22 [08:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:24] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 95.15 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:04:07] <_joe_> oh the period in the docstring ofc [09:06:27] 10Operations, 10DBA, 10Patch-For-Review, 10User-notice: Switchover m1 primary master: db1063 to db1135: Tuesday 10th September at 16:00 UTC - https://phabricator.wikimedia.org/T231403 (10Trizek-WMF) Added for Tech News, since Etherpad service is quite used, and 16:00 UTC is a 
common meetings hour. [09:08:14] 10Operations, 10DBA, 10Patch-For-Review, 10User-notice: Switchover m1 primary master: db1063 to db1135: Tuesday 10th September at 16:00 UTC - https://phabricator.wikimedia.org/T231403 (10Marostegui) >>! In T231403#5464235, @Trizek-WMF wrote: > Added for Tech News, since Etherpad service is quite used, and... [09:10:51] (03PS6) 10Giuseppe Lavagetto: Add the mediawiki.restart-appservers cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/527487 [09:12:45] (03PS2) 10Ema: varnish: reload service upon -common separate VCL changes [puppet] - 10https://gerrit.wikimedia.org/r/534387 (https://phabricator.wikimedia.org/T230772) [09:14:02] 10Operations, 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, and 3 others: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10ema) >>! In T230772#5461321, @Nuria wrote: > I can see the cache-control: max-age=604800 , I think @ema needs to change something on his end so varnish /A... [09:15:12] (03PS1) 10Marostegui: realm.pp: Remove filejournal table from private list [puppet] - 10https://gerrit.wikimedia.org/r/534392 (https://phabricator.wikimedia.org/T51195) [09:15:31] (03CR) 10Ema: [C: 03+2] varnish: reload service upon -common separate VCL changes [puppet] - 10https://gerrit.wikimedia.org/r/534387 (https://phabricator.wikimedia.org/T230772) (owner: 10Ema) [09:16:36] (03PS2) 10Giuseppe Lavagetto: envoyproxy: create a python 3.4 version of build_envoy_config [puppet] - 10https://gerrit.wikimedia.org/r/534345 [09:17:32] (03PS2) 10Marostegui: realm.pp: Remove filejournal table from private list [puppet] - 10https://gerrit.wikimedia.org/r/534392 (https://phabricator.wikimedia.org/T51195) [09:18:18] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:19:14] ^ uh? 
[09:19:16] <_joe_> !log uploaded envoyproxy to buster [09:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:28] <_joe_> marostegui: uhm no idea [09:19:35] we have lost a link [09:19:36] I am checking the documentation [09:19:54] BFD is a protocol that pings a connection rapidly so that a link failure is detected sooner [09:20:20] There is planned work from Telia in Buffalo [09:20:26] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:20:27] <_joe_> both that and OSPF report something down [09:20:37] <_joe_> yes, seems related [09:22:13] (03CR) 10Vgutierrez: [C: 03+2] Release 8.0.5-1wm5 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/534118 (https://phabricator.wikimedia.org/T231859) (owner: 10Vgutierrez) [09:22:15] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18169/" [puppet] - 10https://gerrit.wikimedia.org/r/534345 (owner: 10Giuseppe Lavagetto) [09:22:23] (03PS3) 10Giuseppe Lavagetto: envoyproxy: create a python 3.4 version of build_envoy_config [puppet] - 10https://gerrit.wikimedia.org/r/534345 [09:24:27] <_joe_> sigh jenkins [09:30:03] in other Icinga interface checks it shows the circuit ID which we can then match against the maint-announce mail, but not on this one [09:33:30] !log upgrading mw servers in codfw to 7.2.22 [09:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:09] (03PS3) 10Mathew.onipe: elasticsearch: add syslog logging option [puppet] - 10https://gerrit.wikimedia.org/r/533928 (https://phabricator.wikimedia.org/T225125) [09:57:11] (03PS1) 10Mathew.onipe: elasticsearch: logging.yml template is ensure=absent [puppet] - 10https://gerrit.wikimedia.org/r/534398 [09:57:13] (03PS1) 10Mathew.onipe: elasticsearch: switch elasticsearch logging to syslog [puppet] - 10https://gerrit.wikimedia.org/r/534399 (https://phabricator.wikimedia.org/T225125) 
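For readers following the BFD discussion at 09:19, here is a minimal sketch of how BFD-style failure detection works in general. It is not taken from the routers' configuration: the class name `BfdSession` and the 300 ms interval and multiplier of 3 are illustrative assumptions. The one relationship it relies on is the standard one, that the detection time is the negotiated interval multiplied by the detection multiplier.

```python
from dataclasses import dataclass


@dataclass
class BfdSession:
    """Hypothetical model of one BFD session; values are illustrative only."""

    interval_ms: int   # assumed negotiated interval between control packets
    detect_mult: int   # assumed detection multiplier

    @property
    def detection_time_ms(self) -> int:
        # Detection time = interval x multiplier.
        return self.interval_ms * self.detect_mult

    def is_down(self, ms_since_last_packet: int) -> bool:
        # The session is declared down once a full detection time has
        # elapsed without receiving a valid packet from the peer.
        return ms_since_last_packet >= self.detection_time_ms


session = BfdSession(interval_ms=300, detect_mult=3)
print(session.detection_time_ms)
print(session.is_down(500), session.is_down(1000))
```

With these assumed values a dead link would be noticed in under a second, which is why a BFD alert can fire alongside, or ahead of, the OSPF status alert for the same link.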
[09:57:30] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 58.86 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:01:18] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 70.75 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:04:09] <_joe_> codfw doesn't have enough traffic at this time of the day for that metric to make sense [10:07:24] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add the mediawiki.restart-appservers cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/527487 (owner: 10Giuseppe Lavagetto) [10:07:44] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add the mediawiki.restart-appservers cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/527487 (owner: 10Giuseppe Lavagetto) [10:09:11] !log uploaded trafficserver 8.0.5-1wm5 to apt.wikimedia.org (stretch) - T231533 T231859 [10:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:15] T231533: Improve ATS prometheus metrics - https://phabricator.wikimedia.org/T231533 [10:09:16] T231859: ATS-tls isn't enforcing the same list of curves as nginx during TLS handshake - https://phabricator.wikimedia.org/T231859 [10:11:01] !log Stop MySQL on db1115 - T231769 [10:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:04] T231769: Investigate possible memory leak on db1115 - https://phabricator.wikimedia.org/T231769 [10:11:35] !log Tendril/dbtree will be unavailable for a few minutes T231769 [10:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:49] !log @ helmfile [STAGING] Ran 'sync' command on namespace 'restrouter' for release 'staging' . 
[10:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:33] !log upgrading ATS to 8.0.5-1wm5 on cp5001 - T231859 [10:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:40] !log Stop MySQL on db1115 without the event scheduler - T231769 [10:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:38] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: site=codfw https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:14:08] (03CR) 10Vgutierrez: [C: 03+2] ATS: Configure a list of curves to be offered during the TLS handshake [puppet] - 10https://gerrit.wikimedia.org/r/534123 (https://phabricator.wikimedia.org/T231859) (owner: 10Vgutierrez) [10:14:17] (03PS2) 10Vgutierrez: ATS: Configure a list of curves to be offered during the TLS handshake [puppet] - 10https://gerrit.wikimedia.org/r/534123 (https://phabricator.wikimedia.org/T231859) [10:16:33] (03PS1) 10Giuseppe Lavagetto: envoyproxy: fix settings for jessie [puppet] - 10https://gerrit.wikimedia.org/r/534401 [10:18:42] (03CR) 10jerkins-bot: [V: 04-1] envoyproxy: fix settings for jessie [puppet] - 10https://gerrit.wikimedia.org/r/534401 (owner: 10Giuseppe Lavagetto) [10:20:37] !log Start MySQL on db1115 without the event scheduler - T231769 [10:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:40] T231769: Investigate possible memory leak on db1115 - https://phabricator.wikimedia.org/T231769 [10:21:49] 10Operations, 10Traffic, 10Patch-For-Review: ATS-tls isn't enforcing the same list of curves as nginx during TLS handshake - https://phabricator.wikimedia.org/T231859 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez Solved, now ATS has the same behaviour as nginx: ` vgutierrez@cp5001:~$ openssl s_client... 
[10:23:01] !log upgrading ATS to 8.0.5-1wm5 on cp2002 - T231859 [10:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:04] T231859: ATS-tls isn't enforcing the same list of curves as nginx during TLS handshake - https://phabricator.wikimedia.org/T231859 [10:25:31] (03PS1) 10Dzahn: add discovery CNAME for releases [dns] - 10https://gerrit.wikimedia.org/r/534402 [10:26:01] (03CR) 10jerkins-bot: [V: 04-1] add discovery CNAME for releases [dns] - 10https://gerrit.wikimedia.org/r/534402 (owner: 10Dzahn) [10:28:53] (03PS1) 10Muehlenhoff: Enable puppetdb1002/2002 as puppetdb hosts [puppet] - 10https://gerrit.wikimedia.org/r/534403 [10:30:19] (03PS1) 10Giuseppe Lavagetto: mediawiki::php: do not run GC on APCu [puppet] - 10https://gerrit.wikimedia.org/r/534404 [10:30:21] (03PS1) 10Giuseppe Lavagetto: mediawiki::php: only enable tideways/mongodb where needed [puppet] - 10https://gerrit.wikimedia.org/r/534405 [10:35:32] (03PS1) 10Dzahn: add certificate for releases.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/534406 [10:37:32] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 51.28 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:38:29] (03PS1) 10Dzahn: add fake SSL key for releases.discovery.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/534408 [10:41:52] !log Start event scheduler on db1115 T231769 [10:42:05] (03PS1) 10Dzahn: add fake SSL key for webperf.discovery.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/534409 [10:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:12] T231769: Investigate possible memory leak on db1115 - https://phabricator.wikimedia.org/T231769 [10:43:28] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:45:08] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 81.95 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:48:49] (03PS4) 10Mathew.onipe: elasticsearch: add syslog logging option [puppet] - 10https://gerrit.wikimedia.org/r/533928 (https://phabricator.wikimedia.org/T225125) [10:48:51] (03PS2) 10Mathew.onipe: elasticsearch: switch elasticsearch logging to syslog [puppet] - 10https://gerrit.wikimedia.org/r/534399 (https://phabricator.wikimedia.org/T225125) [10:50:57] <_joe_> moritzm: puppet is failing on puppetdb2001 [10:51:23] <_joe_> with a pretty strange error [10:51:31] <_joe_> so probably not linked to your changes [10:51:58] (03CR) 10Mathew.onipe: "changes are expected according to PCC: https://puppet-compiler.wmflabs.org/compiler1001/18171/" [puppet] - 10https://gerrit.wikimedia.org/r/533928 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe) [10:52:41] unlikely, given that I haven't merged yet :-) [10:52:48] <_joe_> heheh [10:52:49] gonna have a look at 2001 in a bit [10:59:29] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::php: do not run GC on APCu [puppet] - 10https://gerrit.wikimedia.org/r/534404 (owner: 10Giuseppe Lavagetto) [11:00:04] Amir1, Lucas_WMDE, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190904T1100). [11:00:04] odder, dcausse, and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
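The Varnish traffic drop alerts above print a value next to the thresholds "(C)60 le (W)70". A minimal sketch of that comparison, assuming the value is current traffic as a percentage of traffic 30 minutes earlier and that "le" means "less than or equal". The function name and constants are illustrative and are not taken from the actual Icinga check:

```python
# Thresholds as printed in the alert text: (C)60 le (W)70
CRITICAL_PCT = 60.0
WARNING_PCT = 70.0


def classify_traffic(pct_of_earlier: float) -> str:
    """Map a 'percent of traffic 30 minutes ago' value to an alert state."""
    if pct_of_earlier <= CRITICAL_PCT:
        return "CRITICAL"
    if pct_of_earlier <= WARNING_PCT:
        return "WARNING"
    return "OK"


# Values reported by the alerts in this log.
for value in (58.86, 70.75, 51.28, 81.95):
    print(value, classify_traffic(value))
```

As noted at 10:04, a ratio like this gets noisy when absolute traffic is low, which is why codfw flaps at this time of day.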
[11:00:29] o/ [11:01:18] o/ [11:02:08] o/ [11:02:16] o/ [11:02:41] +2ing my patch on CirrusSearch so that CI can run while config patches are deployed [11:02:52] cool [11:02:56] I can do the SWAT I guess [11:03:09] I count 8 patches in the window [11:03:12] that’s a lot [11:04:25] odder: two of your patches are failing in the CI [11:04:31] we can't merge them unless it's fixed [11:05:01] (03PS3) 10Odder: Add high-density logos for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534152 (https://phabricator.wikimedia.org/T230120) [11:05:06] https://integration.wikimedia.org/ci/job/operations-mw-config-hhvm-composer-test-docker/4194/console [11:05:08] (03PS2) 10Odder: Add high-density logos for the Incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534198 (https://phabricator.wikimedia.org/T230122) [11:05:16] Your patches should be merged together [11:05:26] that would fix the ci [11:05:50] Amir1: I assume they’re split to ensure that the files are synced in the right order [11:06:00] but the second half of each should at least be based on the first one, I guess [11:06:00] (03CR) 10jerkins-bot: [V: 04-1] Add high-density logos for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534152 (https://phabricator.wikimedia.org/T230120) (owner: 10Odder) [11:06:06] (03CR) 10jerkins-bot: [V: 04-1] Add high-density logos for the Incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534198 (https://phabricator.wikimedia.org/T230122) (owner: 10Odder) [11:06:17] That's weird, that did not happen last time I did this [11:06:19] Lucas_WMDE: yeah [11:07:05] (03PS4) 10Lucas Werkmeister (WMDE): Add high-density logos for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534152 (https://phabricator.wikimedia.org/T230120) (owner: 10Odder) [11:07:14] (03PS3) 10Lucas Werkmeister (WMDE): Add high-density logos for the Incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534198 
(https://phabricator.wikimedia.org/T230122) (owner: 10Odder) [11:07:18] I rebased them [11:07:25] let’s see if that makes CI happy [11:07:30] (03PS1) 10Dzahn: add certificate for webperf.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/534412 [11:07:39] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534151 (https://phabricator.wikimedia.org/T230120) (owner: 10Odder) [11:07:55] Lucas_WMDE: lol, you must be new around here [11:08:12] LOL [11:08:19] … [11:08:26] …thanks? [11:08:56] (03Merged) 10jenkins-bot: Add high-density logos for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534151 (https://phabricator.wikimedia.org/T230120) (owner: 10Odder) [11:09:28] CI is never happy [11:09:31] At best, it's placated [11:09:40] (03CR) 10jenkins-bot: Add high-density logos for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534151 (https://phabricator.wikimedia.org/T230120) (owner: 10Odder) [11:10:00] sorry I haven’t been here for a decade and a half already [11:10:04] I’ll try better next time [11:11:04] dcausse: your line in the deployment calendar links the same Gerrit change twice btw, is that intentional? 
[11:11:20] oops, no [11:12:19] fixed [11:12:39] !log ladsgroup@deploy1001 Synchronized static/images/project-logos/wikidatawiki-1.5x.png: SWAT: [[gerrit:534151|Add high-density logos for Wikidata (T230120)]] Part I (duration: 00m 56s) [11:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:42] T230120: Create HIDPI logo for Wikidata - https://phabricator.wikimedia.org/T230120 [11:13:59] !log ladsgroup@deploy1001 Synchronized static/images/project-logos/wikidatawiki-2x.png: SWAT: [[gerrit:534151|Add high-density logos for Wikidata (T230120)]] Part II (duration: 00m 56s) [11:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:03] odder: take a look please [11:14:32] https://wikidata.org/static/images/project-logos/wikidatawiki-2x.png looks okay to me [11:14:50] (03PS1) 10Dzahn: add discovery CNAME for webperf [dns] - 10https://gerrit.wikimedia.org/r/534414 [11:14:57] (03PS5) 10Ladsgroup: Add high-density logos for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534152 (https://phabricator.wikimedia.org/T230120) (owner: 10Odder) [11:15:26] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534152 (https://phabricator.wikimedia.org/T230120) (owner: 10Odder) [11:16:27] Lucas_WMDE: I think Reedy was pointing out that you have muscle memory for Wikimedia CI ideosyncracies. 
Thank you for your sacrifice ;-) [11:16:30] (03Merged) 10jenkins-bot: Add high-density logos for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534152 (https://phabricator.wikimedia.org/T230120) (owner: 10Odder) [11:16:46] Amir1: it does look okay ;) [11:16:47] (03CR) 10jenkins-bot: Add high-density logos for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534152 (https://phabricator.wikimedia.org/T230120) (owner: 10Odder) [11:18:06] odder: syncing IS.php [11:18:31] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534197 (https://phabricator.wikimedia.org/T230122) (owner: 10Odder) [11:18:42] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:534152|Add high-density logos for Wikidata (T230120)]] (duration: 00m 55s) [11:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:45] T230120: Create HIDPI logo for Wikidata - https://phabricator.wikimedia.org/T230120 [11:19:26] (03Merged) 10jenkins-bot: Add high-density logos for the Incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534197 (https://phabricator.wikimedia.org/T230122) (owner: 10Odder) [11:19:34] dcausse: does your config patch has to wait for the backport or they are unrelated [11:19:44] (03CR) 10jenkins-bot: Add high-density logos for the Incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534197 (https://phabricator.wikimedia.org/T230122) (owner: 10Odder) [11:20:16] Amir1: the 1.5x file seems to be missing? 
[11:20:28] Amir1: I can ship them after yours, np [11:20:36] at 150%-190% zoom I see a blank space where the logo should be [11:20:46] https://en.wikipedia.org/static/images/project-logos/wikidatawiki-1.5x.png [11:20:49] and https://www.wikidata.org/static/images/project-logos/wikidatawiki-1.5x.png returns 404 [11:21:01] varnsih [11:21:41] 404s are being cached in varnish for five or ten minutes [11:21:55] I tried it when I was testing [11:23:03] 10Operations, 10netops: Check router ACLs for early install SSH access from puppet masters/cumin hosts - https://phabricator.wikimedia.org/T231811 (10MoritzMuehlenhoff) >>! In T231811#5462395, @ayounsi wrote: > Hosts in the `cloud-hosts1-b-eqiad` vlan are behind the `labs-in4` firewall filter (applied on traff... [11:23:23] purgeList.php (as recommended at https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/Caching#Purging_a_single_page_(URL) ) doesn’t seem to work either [11:23:33] yup. Just tried [11:23:40] ok [11:24:12] the debug mode works though [11:24:18] Maybe I forgot to sync it [11:24:22] ah, no, got it I think [11:24:25] unless it just expired by itself [11:24:31] !log lucaswerkmeister-wmde@mwmaint1002:~$ printf '%s\n' 'https://en.wikipedia.org/static/images/project-logos/wikidatawiki-1.5x.png' | mwscript purgeList.php wikidatawiki # T230120 [11:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:36] T230120: Create HIDPI logo for Wikidata - https://phabricator.wikimedia.org/T230120 [11:24:47] needed en.wikipedia.org per https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging#One-off_purge [11:25:16] oh [11:26:58] !log ladsgroup@deploy1001 Synchronized static/images/project-logos/incubatorwiki-1.5x.png: SWAT: [[gerrit:534197|Add high-density logos for the Incubator (T230122)]] Part I (duration: 00m 52s) [11:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:02] T230122: Create HIDPI logo for Incubator - 
https://phabricator.wikimedia.org/T230122 [11:27:35] (03PS4) 10Ladsgroup: Add high-density logos for the Incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534198 (https://phabricator.wikimedia.org/T230122) (owner: 10Odder) [11:28:12] !log ladsgroup@deploy1001 Synchronized static/images/project-logos/incubatorwiki-2x.png: SWAT: [[gerrit:534197|Add high-density logos for the Incubator (T230122)]] Part II (duration: 00m 54s) [11:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:41] (03CR) 10Ladsgroup: [C: 03+2] Add high-density logos for the Incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534198 (https://phabricator.wikimedia.org/T230122) (owner: 10Odder) [11:29:50] (03Merged) 10jenkins-bot: Add high-density logos for the Incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534198 (https://phabricator.wikimedia.org/T230122) (owner: 10Odder) [11:30:08] (03CR) 10jenkins-bot: Add high-density logos for the Incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534198 (https://phabricator.wikimedia.org/T230122) (owner: 10Odder) [11:31:50] (03PS2) 10Ladsgroup: Set item terms migration stage for Wikidata on WRITE_BOTH up to Q2m [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534183 (https://phabricator.wikimedia.org/T225055) [11:32:23] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:534197|Add high-density logos for the Incubator (T230122)]] (duration: 00m 56s) [11:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:28] T230122: Create HIDPI logo for Incubator - https://phabricator.wikimedia.org/T230122 [11:32:29] (03CR) 10Ladsgroup: [C: 03+2] Set item terms migration stage for Wikidata on WRITE_BOTH up to Q2m [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534183 (https://phabricator.wikimedia.org/T225055) (owner: 10Ladsgroup) [11:33:38] (03Merged) 10jenkins-bot: Set item terms migration stage for 
Wikidata on WRITE_BOTH up to Q2m [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534183 (https://phabricator.wikimedia.org/T225055) (owner: 10Ladsgroup) [11:33:53] (03CR) 10jenkins-bot: Set item terms migration stage for Wikidata on WRITE_BOTH up to Q2m [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534183 (https://phabricator.wikimedia.org/T225055) (owner: 10Ladsgroup) [11:33:55] Amir1: Looks sweet for both projects, many thanks :) [11:35:00] (03PS1) 10Dzahn: tlsproxy/envoy: limit connections on 443 to cache servers [puppet] - 10https://gerrit.wikimedia.org/r/534421 [11:37:19] marostegui: jynus this is going live ^ increases the write on s8 for wb_terms replacement stuff [11:37:46] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:534183|Set item terms migration stage for Wikidata on WRITE_BOTH up to Q2m (T225055)]] (duration: 00m 55s) [11:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:50] T225055: Switch `tmpItemTermsMigrationStages` to MIGRATION_WRITE_BOTH - https://phabricator.wikimedia.org/T225055 [11:38:26] !log upgrading mw1339-mw1348 to PHP 7.2.22 [11:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:48] dcausse: SWAT is yours [11:38:58] Amir1: thanks [11:39:05] thanks for the heads up, Amir1 [11:39:17] dcausse: Thank you for waiting :) [11:39:24] np :) [11:39:42] (03CR) 10DCausse: [C: 03+2] [cirrus] Reenable sanity checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533842 (https://phabricator.wikimedia.org/T231194) (owner: 10DCausse) [11:40:41] (03Merged) 10jenkins-bot: [cirrus] Reenable sanity checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533842 (https://phabricator.wikimedia.org/T231194) (owner: 10DCausse) [11:40:43] (03PS2) 10Dzahn: tlsproxy/envoy: limit connections on 443 to cache servers [puppet] - 10https://gerrit.wikimedia.org/r/534421 [11:40:57] (03CR) 10jenkins-bot: [cirrus] Reenable sanity 
checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533842 (https://phabricator.wikimedia.org/T231194) (owner: 10DCausse) [11:41:33] (03CR) 10Phamhi: [C: 03+2] admin: convert maintain_kubeusers to systemd timer type [puppet] - 10https://gerrit.wikimedia.org/r/533606 (owner: 10Phamhi) [11:42:27] (03PS4) 10Phamhi: admin: convert maintain_kubeusers to systemd timer type [puppet] - 10https://gerrit.wikimedia.org/r/533606 [11:43:25] (03CR) 10Dzahn: tlsproxy/envoy: limit connections on 443 to cache servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/534421 (owner: 10Dzahn) [11:44:50] (03PS8) 10Alexandros Kosiaris: First version of the wikifeeds chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/526679 (https://phabricator.wikimedia.org/T229287) (owner: 10MSantos) [11:46:47] !log start of ladsgroup@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki=wikidatawiki --to-id 2000000 --sleep 2 > ~/rebuildItemTerms.out 2> rebuildItemTerms.err (T225056). This is going to take a while. 
On screen [11:46:52] !log dcausse@deploy1001 Synchronized php-1.34.0-wmf.21/extensions/CirrusSearch/: T159321: Add morelikethis a non-greedy version of the morelike keyword (duration: 00m 57s) [11:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:05] T225056: Run Item Terms Rebuild script - https://phabricator.wikimedia.org/T225056 [11:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:08] T159321: [Bug] Unpredictable behavior with the order of Special:Search parameters - https://phabricator.wikimedia.org/T159321 [11:49:31] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:49:34] !log dcausse@deploy1001 Synchronized wmf-config/CirrusSearch-production.php: T231194: [cirrus] Reenable sanity checks (duration: 00m 56s) [11:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:38] T231194: Increase concurrency of the cirrusCheckerJob - https://phabricator.wikimedia.org/T231194 [11:52:06] !log EU SWAT done [11:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:42] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Fixed a small comment issue in the prometheus-statsd.conf and now merging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/526679 (https://phabricator.wikimedia.org/T229287) (owner: 10MSantos) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190904T1200) [12:04:53] (03PS1) 10KartikMistry: Update cxserver to 2019-09-04-065911-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/534427 (https://phabricator.wikimedia.org/T213255) [12:05:34] (03PS4) 10KartikMistry: Move ContentTranslation out of Beta in jvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533172 
(https://phabricator.wikimedia.org/T231207) [12:10:57] (03PS1) 10Alexandros Kosiaris: WIP: LVS: Setup port 7233 for restbase-backend [puppet] - 10https://gerrit.wikimedia.org/r/534430 (https://phabricator.wikimedia.org/T223953) [12:18:57] just a head up [12:19:17] I can't do the train in 40 minutes, so moved it to later tonight during the american slot (19:00 UTC) [12:21:59] (03PS1) 10Muehlenhoff: Add DNS entries for new Buster-based LDAP/corp replicas [dns] - 10https://gerrit.wikimedia.org/r/534432 (https://phabricator.wikimedia.org/T231015) [12:22:45] 10Operations, 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, and 2 others: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10Nuria) I think things look good now: [nuriaruiz@nurieta][~/tips]$ curl -v https://piwik.wikimedia.org/piwik.js > piwik * Trying 91.198.174.192... % T... [12:23:14] hashar, good luck! [12:24:55] ;) [12:25:41] (03CR) 10MSantos: "> Patch Set 8: Verified+2 Code-Review+2" [deployment-charts] - 10https://gerrit.wikimedia.org/r/526679 (https://phabricator.wikimedia.org/T229287) (owner: 10MSantos) [12:36:45] !log restart kartotherian on maps1001 - T231964 [12:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:48] T231964: Empty Kartotherian maps because of password authentication failure - https://phabricator.wikimedia.org/T231964 [12:48:16] (03PS1) 10CDanis: db-codfw: remove obsoleted DB config data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534436 (https://phabricator.wikimedia.org/T231642) [12:49:00] !log reset kartotherian password on maps slaves - T231964 [12:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:03] T231964: Empty Kartotherian maps because of password authentication failure - https://phabricator.wikimedia.org/T231964 [12:49:16] (03CR) 10jerkins-bot: [V: 04-1] db-codfw: remove obsoleted DB config data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534436 
(https://phabricator.wikimedia.org/T231642) (owner: 10CDanis) [12:49:19] (03CR) 10Marostegui: db-codfw: remove obsoleted DB config data (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534436 (https://phabricator.wikimedia.org/T231642) (owner: 10CDanis) [12:52:04] (03CR) 10CDanis: db-codfw: remove obsoleted DB config data (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534436 (https://phabricator.wikimedia.org/T231642) (owner: 10CDanis) [12:52:16] (03PS2) 10CDanis: db-codfw: remove obsoleted DB config data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534436 (https://phabricator.wikimedia.org/T231642) [12:54:40] (03CR) 10Marostegui: db-codfw: remove obsoleted DB config data (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534436 (https://phabricator.wikimedia.org/T231642) (owner: 10CDanis) [12:55:26] (03CR) 10CDanis: db-codfw: remove obsoleted DB config data (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534436 (https://phabricator.wikimedia.org/T231642) (owner: 10CDanis) [12:56:54] !log manually testing I1bc6d1603 on mwdebug2002 [12:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:12] cdanis: I will check too [13:00:40] I haven't found anything wrong yet [13:02:05] cdanis: so far so good for me [13:02:08] let me try some other wikis [13:02:12] (enwiki works fine) [13:02:16] so far I've tried enwiki and meta [13:02:16] let me try one wiki per section [13:04:26] cdanis: can you try ru.wikipedia.org or ja.wikipedia.org? [13:04:35] fr.wikipedia.org isn't working for me (that is s6) [13:04:48] ru and ja are on the same section (s6) [13:04:53] ru loaded for me after a 30 seconds delay [13:04:56] aside from checking different urls/site, did you check weights seem to work as intended? 
[13:05:04] not sure if it's just mwdebug2002 being slow or what [13:05:13] cdanis: frwiki timed out for me [13:05:22] let now it seems to be working [13:05:28] fr loaded for me [13:05:36] cdanis: some delay is common on cold servers [13:05:50] but not sure if 30 seconds/timeout is normal [13:06:16] jynus: how would you verify weights? [13:06:44] cdanis: not so much verify, but sanity checks: doing a couple of requests and seeing if they end up on the right servers [13:06:59] e.g. on api and on browser urls [13:09:28] cdanis: I have tested one wiki per section, and also verified recentchanges for each wiki going to the rc slave and all that [13:09:50] marostegui: that was what I meant, should be good enough [13:10:09] dumb question, how do you see what db server was used in a request? [13:10:25] cdanis: I have basically been checking their processlist [13:10:46] not dumb, cache (even at app layer) can be hiding many requests [13:11:03] ah ok, i was imagining some logging or tracing output from MW [13:11:08] (03CR) 10Marostegui: [C: 03+1] db-codfw: remove obsoleted DB config data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534436 (https://phabricator.wikimedia.org/T231642) (owner: 10CDanis) [13:11:10] we could SELECT @@hostname [13:11:14] or enable general log [13:11:24] but manuel's option is good enough for this [13:11:45] not good enough, the right way [13:12:10] (03CR) 10Ottomata: [C: 03+1] Remove EventBusRCFeedEngine eventServiceName.
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/534236 (https://phabricator.wikimedia.org/T229863) (owner: 10Ppchelko) [13:12:10] cdanis: for db-eqiad I would suggest we: 1) test on mwdebug, 2) deploy only to canaries, leave it there for like 10 minutes or so and then go ahead [13:12:30] when you finish, I have question for cdanis regarding dbctl output [13:12:38] <_joe_> sadly scap is not thought for doing that [13:12:44] cdanis: I can do all the canaries stuff, I have it fresh from when I did it for pc [13:13:00] marostegui: if we are letting it sit a while, I will book us a deploy window [13:13:11] cdanis: we can do it tomorrow too if you like [13:13:14] (eqiad) [13:13:15] today is good [13:13:21] sure [13:13:42] 14:00 UTC work for you? [13:13:44] for eqiad [13:13:46] sure [13:13:48] for codfw I will push in a minute [13:13:48] what is the name of progressive deploy with slow rampup technique? [13:14:57] rolling deployment? [13:15:31] that's a fine one [13:16:13] I've see it named also as rolling-update, incremental deployment or ramped deployment [13:16:49] (03CR) 10CDanis: [C: 03+2] "Tested manually on mwdebug2002 and looked fine." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534436 (https://phabricator.wikimedia.org/T231642) (owner: 10CDanis) [13:17:35] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. 
- https://phabricator.wikimedia.org/T225128 (10Jclark-ctr) @Cmjohnson removed 2nd dac cable yes it plugged into d7 xe-7/0/2 [13:17:42] !log oblivian@cumin1001 START - Cookbook sre.mediawiki.restart-appservers [13:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:46] (03Merged) 10jenkins-bot: db-codfw: remove obsoleted DB config data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534436 (https://phabricator.wikimedia.org/T231642) (owner: 10CDanis) [13:17:46] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.mediawiki.restart-appservers (exit_code=99) [13:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:53] !log oblivian@cumin1001 START - Cookbook sre.mediawiki.restart-appservers [13:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:05] (03CR) 10jenkins-bot: db-codfw: remove obsoleted DB config data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534436 (https://phabricator.wikimedia.org/T231642) (owner: 10CDanis) [13:20:09] !log oblivian@cumin1001 END (PASS) - Cookbook sre.mediawiki.restart-appservers (exit_code=0) [13:20:10] !log cdanis@deploy1001 Synchronized wmf-config/db-codfw.php: a8dc4c4a0 db-codfw: remove obsoleted DB config T231642 (duration: 00m 55s) [13:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:25] T231642: Empty db-eqiad.php, db-codfw.php s1-s8+wikitech lines - https://phabricator.wikimedia.org/T231642 [13:25:05] (03PS1) 10CDanis: db-eqiad: remove obsoleted DB config data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534441 (https://phabricator.wikimedia.org/T231642) [13:32:45] (03CR) 10Marostegui: [C: 03+1] db-eqiad: remove obsoleted DB config data (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534441 (https://phabricator.wikimedia.org/T231642) (owner: 10CDanis) 
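The staged plan proposed at 13:12 for db-eqiad (test on mwdebug, then canaries only, let it sit for about 10 minutes, then go ahead) can be sketched roughly as below. This is a hypothetical illustration only: `deploy_to` and `healthy` are placeholder callables supplied by the caller, not scap or cookbook APIs, and "rest-of-fleet" stands in for the remaining hosts.

```python
import time
from typing import Callable, Iterable, List, Sequence


def staged_rollout(
    stages: Sequence[Iterable[str]],
    deploy_to: Callable[[str], None],
    healthy: Callable[[str], bool],
    soak_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Deploy stage by stage, soaking between stages and stopping on failure."""
    deployed: List[str] = []
    for index, stage in enumerate(stages):
        hosts = list(stage)
        for host in hosts:
            deploy_to(host)
            deployed.append(host)
        # Let the change sit before widening the blast radius;
        # no soak is needed after the final stage.
        if index < len(stages) - 1:
            sleep(soak_seconds)
        failed = [host for host in hosts if not healthy(host)]
        if failed:
            raise RuntimeError(f"stage {index} unhealthy: {failed}")
    return deployed


if __name__ == "__main__":
    # Hypothetical stages: debug host, a few canaries, then everything else.
    stages = [["mwdebug1001"], ["mw1261", "mw1262"], ["rest-of-fleet"]]
    done = staged_rollout(
        stages,
        deploy_to=lambda host: print("deploying to", host),
        healthy=lambda host: True,
        soak_seconds=0,  # skip the real wait in this demo
    )
    print(done)
```

Raising on the first unhealthy stage keeps a bad config from reaching later stages, which is the point of deploying to canaries first.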
[13:33:32] (03CR) 10Gehel: [C: 04-1] elasticsearch: logging.yml template is ensure=absent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/534398 (owner: 10Mathew.onipe) [13:35:27] (03PS1) 10Ema: envoyproxy: do not hardcode 443 in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/534442 [13:36:42] (03CR) 10Marostegui: db-eqiad: remove obsoleted DB config data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534441 (https://phabricator.wikimedia.org/T231642) (owner: 10CDanis) [13:40:24] (03PS2) 10CDanis: db-eqiad: remove obsoleted DB config data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534441 (https://phabricator.wikimedia.org/T231642) [13:40:37] (03CR) 10CDanis: db-eqiad: remove obsoleted DB config data (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534441 (https://phabricator.wikimedia.org/T231642) (owner: 10CDanis) [13:40:47] (03CR) 10Gehel: [C: 04-1] "see comments inline." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/533928 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe) [13:42:53] (03CR) 10Marostegui: [C: 03+1] db-eqiad: remove obsoleted DB config data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534441 (https://phabricator.wikimedia.org/T231642) (owner: 10CDanis) [13:42:55] (03CR) 10Gehel: "We should test this change first on relforge, by activating the option via hiera. Once we are confident everything is working as expected," (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/534399 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe) [13:43:31] PROBLEM - Check systemd state on notebook1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:44:53] 10Operations, 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, and 2 others: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10Gilles) While the caching header is correctly served, when the request is in the context of the foundation website, Varnish is doing a pass: {F30222372,... [13:44:54] (03PS1) 10Giuseppe Lavagetto: restart-appservers: fix to the cli args, some other cosmetic changes [cookbooks] - 10https://gerrit.wikimedia.org/r/534445 [13:44:58] (03CR) 10Gehel: [C: 04-1] "> Patch Set 2:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/534399 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe) [13:47:44] <_joe_> !log restarting php7.2-fpm across the fleet to pick up the apc.ttl removal [13:47:44] (03CR) 10Gehel: [C: 04-1] "Should this be abandoned in favor of https://gerrit.wikimedia.org/r/c/operations/puppet/+/533928 ?" [puppet] - 10https://gerrit.wikimedia.org/r/531922 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe) [13:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:55] (03CR) 10Gehel: Add maps reboot cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe) [13:50:00] (03Abandoned) 10Mathew.onipe: elasticsearch: ship logs to local syslog server [puppet] - 10https://gerrit.wikimedia.org/r/531922 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe) [13:50:39] 10Operations, 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, and 2 others: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10Nuria) Indeed!
I see server-timing: cache;desc="pass" on reg chrome window as well, thanks @Gilles for catching that [13:51:51] (03CR) 10Mathew.onipe: Add maps reboot cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe) [13:52:15] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:42] cdanis: so, let's deploy to mwdebug1001 and then I will deploy to the canary hosts? [13:52:46] in 8 minutes I mean [13:52:57] yeah sounds good [13:53:21] ok, I will deploy to: mw1279.eqiad.wmnet,mw1276.eqiad.wmnet,mw1261.eqiad.wmnet,mw1264.eqiad.wmnet,mwdebug1002.eqiad.wmnet,mwdebug1001.eqiad.wmnet,mw1263.eqiad.wmnet,mw1262.eqiad.wmnet,mw1278.eqiad.wmnet,mw1277.eqiad.wmnet,mw1265.eqiad.wmnet [13:53:38] 10Operations, 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, and 2 others: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10ema) >>! In T230772#5464917, @Gilles wrote: > I'm guessing it might be coming from the cookies? Which the Chrome developer tools weren't showing. We've ha... [13:54:41] I'll merge in gerrit just before the start of the window [13:54:49] sure [13:55:41] 10Operations, 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, and 2 others: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10Gilles) I think the issue is that misc_recv_pass is applied to every site. You want it to apply to wikis (where people can log in), but not on non-wiki we... [13:56:40] 10Operations, 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, and 2 others: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10Gilles) Or I guess, to be conservative, add Wikimedia wiki login cookies to the filter only if you're in the context of a non-wiki site. 
[13:57:12] 10Operations, 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, and 2 others: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10Gilles) This should probably be its own task, though, it's not specific to piwik.js [13:58:50] (03CR) 10CDanis: [C: 03+2] db-eqiad: remove obsoleted DB config data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534441 (https://phabricator.wikimedia.org/T231642) (owner: 10CDanis) [13:59:14] !log manually testing If0dd79604 on mwdebug1001 [13:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:26] !log installing nghttp2 security updates [13:59:27] cdanis: I have hold a lock on scap, to make sure no one deploys [13:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:58] (03Merged) 10jenkins-bot: db-eqiad: remove obsoleted DB config data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534441 (https://phabricator.wikimedia.org/T231642) (owner: 10CDanis) [14:00:04] cdanis and marostegui: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for old DB config removal . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190904T1400). [14:00:15] hello jouncebot [14:00:19] (03CR) 10jenkins-bot: db-eqiad: remove obsoleted DB config data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534441 (https://phabricator.wikimedia.org/T231642) (owner: 10CDanis) [14:00:24] ok marostegui change is live on mwdebug1001 [14:00:29] cdanis: testing! 
[14:01:07] en fr it ru look good [14:01:37] ok, commons,wikidata,es, meta, wikitech, look good [14:01:39] (03PS3) 10Gehel: maps: cleanup unused template [puppet] - 10https://gerrit.wikimedia.org/r/533974 [14:02:04] pl also looking good [14:02:18] so that means all the sections have been tested [14:02:31] also tested a few small wikis from s3 [14:02:40] great [14:02:47] (03CR) 10Gehel: [C: 03+2] maps: cleanup unused template [puppet] - 10https://gerrit.wikimedia.org/r/533974 (owner: 10Gehel) [14:02:50] cdanis: I am going to deploy to canary hosts [14:03:24] cdanis: done, let's monitor their traffic [14:04:02] (03PS1) 10Giuseppe Lavagetto: safe-service-restart: correctly parse pybal responses [puppet] - 10https://gerrit.wikimedia.org/r/534447 [14:04:07] (03PS22) 10Mathew.onipe: Add maps reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) [14:04:32] 10Operations, 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, and 2 others: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10Nuria) Right this is an issue for all resources served for https://stats.wikimedia.org/v2 and, I am guessing, other non wiki domains. Can we consider b... [14:04:54] !log If0dd79604 deployed to eqiad MW canaries T231642 [14:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:56] T231642: Empty db-eqiad.php, db-codfw.php s1-s8+wikitech lines - https://phabricator.wikimedia.org/T231642 [14:05:20] <_joe_> jenkins is painfully slow again.
[14:05:44] cdanis: mw1264 looks normal: https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-node=mw1264 [14:06:23] (03CR) 10Giuseppe Lavagetto: [C: 03+2] safe-service-restart: correctly parse pybal responses [puppet] - 10https://gerrit.wikimedia.org/r/534447 (owner: 10Giuseppe Lavagetto) [14:06:32] mw1261 as well: https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-node=mw1261 [14:06:51] I don't see anything too unusual on logstash [14:07:05] (03PS1) 10Mathew.onipe: wdqs: remove redundancy from cookbook names [cookbooks] - 10https://gerrit.wikimedia.org/r/534448 [14:07:30] cdanis: mmm, wait a sec, I think my deployment there didn't work [14:07:52] (03PS1) 10Muehlenhoff: Add library hint for nghttp2 [puppet] - 10https://gerrit.wikimedia.org/r/534449 [14:07:59] cdanis: Now the config is live, before it wasn't [14:08:14] !log If0dd79604 actually live on canaries now [14:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:21] sorry :) [14:08:31] I forgot to rebase on deploy1001 :) [14:08:38] np! [14:09:46] mw1261 looking good, no traffic changes [14:11:11] (03PS2) 10Muehlenhoff: Add library hint for nghttp2 [puppet] - 10https://gerrit.wikimedia.org/r/534449 [14:12:50] cdanis: I don't see anything strange [14:12:55] should we deploy then? 
[14:13:13] +1 [14:13:16] ok [14:14:37] deploying [14:14:40] (03CR) 10Gehel: [C: 03+2] wdqs: remove redundancy from cookbook names [cookbooks] - 10https://gerrit.wikimedia.org/r/534448 (owner: 10Mathew.onipe) [14:15:29] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove obsoleted DB config from db-eqiad.php T231642 (duration: 00m 57s) [14:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:32] T231642: Empty db-eqiad.php, db-codfw.php s1-s8+wikitech lines - https://phabricator.wikimedia.org/T231642 [14:15:33] cdanis: ^ [14:15:47] I can still browse the site :) [14:16:06] and I can see queries arriving to the slaves [14:16:06] I just loaded the watchlist on my volunteer account :) [14:16:31] I can login fine too [14:18:43] \o/ [14:22:27] cdanis: everything looking good https://grafana.wikimedia.org/d/000000278/mysql-aggregated?panelId=1&fullscreen&orgId=1 [14:22:41] I think we can move on! [14:22:51] 👍 [14:24:56] 10Operations, 10netops: Check router ACLs for early install SSH access from puppet masters/cumin hosts - https://phabricator.wikimedia.org/T231811 (10Andrew) Moving the iron exception to cloudnet1003 works for me -- presuming we mean adding it to 'wmcs::openstack::eqiad1::net'. One thing I'm not clear on is h... 
[14:25:06] (03PS3) 10Herron: prometheus: aggregate systemd failed metrics [puppet] - 10https://gerrit.wikimedia.org/r/533282 (https://phabricator.wikimedia.org/T230570) [14:25:35] (03PS4) 10Herron: prometheus: aggregate systemd failed metrics [puppet] - 10https://gerrit.wikimedia.org/r/533282 (https://phabricator.wikimedia.org/T230570) [14:27:05] (03PS1) 10Phamhi: toollabs: convert heredoc portion from systemd timer to variable type so that string interpolation can occur [puppet] - 10https://gerrit.wikimedia.org/r/534459 [14:27:31] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Andrew) a:05Cmjohnson→03Andrew I'll see if I can make it crash again! [14:27:45] (03CR) 10jerkins-bot: [V: 04-1] toollabs: convert heredoc portion from systemd timer to variable type so that string interpolation can occur [puppet] - 10https://gerrit.wikimedia.org/r/534459 (owner: 10Phamhi) [14:28:49] 10Operations, 10netops: Check router ACLs for early install SSH access from puppet masters/cumin hosts - https://phabricator.wikimedia.org/T231811 (10MoritzMuehlenhoff) >>! In T231811#5465056, @Andrew wrote: > Moving the iron exception to cloudnet1003 works for me -- presuming we mean adding it to 'wmcs::opens... [14:29:14] (03CR) 10Ema: "pcc seems fine: https://puppet-compiler.wmflabs.org/compiler1001/18173/phab1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/534442 (owner: 10Ema) [14:31:11] (03CR) 10Herron: [C: 03+2] prometheus: aggregate systemd failed metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/533282 (https://phabricator.wikimedia.org/T230570) (owner: 10Herron) [14:36:14] <_joe_> herron, godog are you trying to remove the 'systemd unit failed' alert from icinga? 
[14:36:26] <_joe_> it's pretty important even for single hosts [14:36:44] <_joe_> sometimes it's the only way we can tell something is wrong on a server [14:37:32] <_joe_> it could be more informative, sure, but I definitely wouldn't suppress it unless we ask everyone to check their monitoring [14:38:21] (03PS1) 10Muehlenhoff: Pick a new canary for elastic [puppet] - 10https://gerrit.wikimedia.org/r/534461 [14:38:59] _joe_: trying to move it to check_prometheus for alerting, but not remove entirely. I can add you to the patches for any changes to alerts [14:39:02] <_joe_> also - I think the whole idea of demoting things to WARNING is the wrong way to tackle the problem - we should be able to decide which criticals go to irc, and which don't [14:39:16] <_joe_> herron: I will comment on the task [14:39:29] <_joe_> I'm sorry I didn't see that specific one until today [14:40:09] ok sounds good [14:41:19] 10Operations, 10netops: Check router ACLs for early install SSH access from puppet masters/cumin hosts - https://phabricator.wikimedia.org/T231811 (10Andrew) >>! In T231811#5465070, @MoritzMuehlenhoff wrote: > > On the system/ferm level there's fleet-wide Ferm rule which grants SSH access from Cumin masters.... [14:41:20] (03PS2) 10Herron: prometheus: deploy prometheus-ipsec-exporter to all sites [puppet] - 10https://gerrit.wikimedia.org/r/534210 (https://phabricator.wikimedia.org/T230236) [14:42:01] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:42:38] (03PS2) 10Phamhi: toollabs: convert heredoc portion from systemd timer to variable type so that string interpolation can occur [puppet] - 10https://gerrit.wikimedia.org/r/534459 [14:42:43] (03CR) 10Ayounsi: "This replaces port 7231 with 7233 in Ferm, it doesn't seem to allow both ports simultaneously."
[puppet] - 10https://gerrit.wikimedia.org/r/532382 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [14:42:48] _joe_: thanks, yeah the task sounds good [14:43:19] (03CR) 10jerkins-bot: [V: 04-1] toollabs: convert heredoc portion from systemd timer to variable type so that string interpolation can occur [puppet] - 10https://gerrit.wikimedia.org/r/534459 (owner: 10Phamhi) [14:43:37] (03CR) 10Herron: [C: 03+2] prometheus: deploy prometheus-ipsec-exporter to all sites [puppet] - 10https://gerrit.wikimedia.org/r/534210 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [14:43:41] 10Operations, 10netops: Check router ACLs for early install SSH access from puppet masters/cumin hosts - https://phabricator.wikimedia.org/T231811 (10MoritzMuehlenhoff) >>! In T231811#5465102, @Andrew wrote: >>>! In T231811#5465070, @MoritzMuehlenhoff wrote: >> >> On the system/ferm level there's fleet-wide F... [14:43:47] (03CR) 10Ayounsi: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/532382 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [14:44:51] (03PS3) 10Phamhi: toollabs: convert heredoc to variable type for interpolation [puppet] - 10https://gerrit.wikimedia.org/r/534459 [14:45:29] (03CR) 10jerkins-bot: [V: 04-1] toollabs: convert heredoc to variable type for interpolation [puppet] - 10https://gerrit.wikimedia.org/r/534459 (owner: 10Phamhi) [14:47:14] (03PS1) 10Ema: lvs: add restbase-ssl [puppet] - 10https://gerrit.wikimedia.org/r/534462 (https://phabricator.wikimedia.org/T210411) [14:49:36] (03PS4) 10Phamhi: toollabs: convert heredoc to variable type for interpolation [puppet] - 10https://gerrit.wikimedia.org/r/534459 [14:50:20] (03CR) 10jerkins-bot: [V: 04-1] toollabs: convert heredoc to variable type for interpolation [puppet] - 10https://gerrit.wikimedia.org/r/534459 (owner: 10Phamhi) [14:51:57] !log reimaging cloudvirt1015 for T220853 [14:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:52:01] T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 [14:52:57] (03CR) 10Ema: "PCC output here: https://puppet-compiler.wmflabs.org/compiler1002/18174/" [puppet] - 10https://gerrit.wikimedia.org/r/534462 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [14:55:08] (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoyproxy: do not hardcode 443 in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/534442 (owner: 10Ema) [14:56:08] (03PS2) 10Ema: envoyproxy: do not hardcode 443 in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/534442 [14:57:13] (03CR) 10Alexandros Kosiaris: [C: 04-1] lvs: add restbase-ssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/534462 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [14:59:03] (03CR) 10Ema: [C: 03+2] envoyproxy: do not hardcode 443 in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/534442 (owner: 10Ema) [14:59:07] (03PS1) 10CRusnov: netbox::postgres: Fix conditional on rule inclusion [puppet] - 10https://gerrit.wikimedia.org/r/534464 [14:59:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] RESTBase: Temporarily allow access to port 7233 as well [puppet] - 10https://gerrit.wikimedia.org/r/532382 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [14:59:25] (03PS3) 10Alexandros Kosiaris: RESTBase: Temporarily allow access to port 7233 as well [puppet] - 10https://gerrit.wikimedia.org/r/532382 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [15:05:50] (03PS1) 10Jhedden: icinga: check number of parent nova-compute procs [puppet] - 10https://gerrit.wikimedia.org/r/534465 (https://phabricator.wikimedia.org/T231999) [15:06:24] (03PS2) 10Ema: lvs: add restbase-ssl [puppet] - 10https://gerrit.wikimedia.org/r/534462 (https://phabricator.wikimedia.org/T210411) [15:06:43] (03CR) 10Ema: lvs: add restbase-ssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/534462 (https://phabricator.wikimedia.org/T210411) 
(owner: 10Ema) [15:06:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Andrew) btw, @Cmjohnson, did you restore BIOS settings after replacing the board? [15:07:31] (03PS5) 10Phamhi: toollabs: convert heredoc to variable type for interpolation [puppet] - 10https://gerrit.wikimedia.org/r/534459 [15:08:03] (03CR) 10Ayounsi: [C: 03+1] netbox::postgres: Fix conditional on rule inclusion [puppet] - 10https://gerrit.wikimedia.org/r/534464 (owner: 10CRusnov) [15:08:13] (03CR) 10CRusnov: [C: 03+2] netbox::postgres: Fix conditional on rule inclusion [puppet] - 10https://gerrit.wikimedia.org/r/534464 (owner: 10CRusnov) [15:08:15] (03CR) 10Phamhi: [C: 03+2] icinga: check number of parent nova-compute procs [puppet] - 10https://gerrit.wikimedia.org/r/534465 (https://phabricator.wikimedia.org/T231999) (owner: 10Jhedden) [15:08:24] (03PS2) 10CRusnov: netbox::postgres: Fix conditional on rule inclusion [puppet] - 10https://gerrit.wikimedia.org/r/534464 [15:08:42] (03CR) 10Andrew Bogott: [C: 03+1] "Thank you! This flap has been bugging me." 
[puppet] - 10https://gerrit.wikimedia.org/r/534465 (https://phabricator.wikimedia.org/T231999) (owner: 10Jhedden) [15:09:19] (03CR) 10Jhedden: [C: 03+2] icinga: check number of parent nova-compute procs [puppet] - 10https://gerrit.wikimedia.org/r/534465 (https://phabricator.wikimedia.org/T231999) (owner: 10Jhedden) [15:09:30] (03PS2) 10Jhedden: icinga: check number of parent nova-compute procs [puppet] - 10https://gerrit.wikimedia.org/r/534465 (https://phabricator.wikimedia.org/T231999) [15:11:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Andrew) (I just now enabled virtualization in the bios) [15:12:23] (03PS3) 10CRusnov: netbox::postgres: Fix conditional on rule inclusion [puppet] - 10https://gerrit.wikimedia.org/r/534464 [15:15:50] (03CR) 10jerkins-bot: [V: 04-1] netbox::postgres: Fix conditional on rule inclusion [puppet] - 10https://gerrit.wikimedia.org/r/534464 (owner: 10CRusnov) [15:16:13] (03CR) 10Jhedden: [C: 03+1] toollabs: convert heredoc to variable type for interpolation [puppet] - 10https://gerrit.wikimedia.org/r/534459 (owner: 10Phamhi) [15:17:04] (03CR) 10Phamhi: [C: 03+2] toollabs: convert heredoc to variable type for interpolation [puppet] - 10https://gerrit.wikimedia.org/r/534459 (owner: 10Phamhi) [15:17:24] (03CR) 10Phamhi: [C: 03+2] "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/18176/console" [puppet] - 10https://gerrit.wikimedia.org/r/534459 (owner: 10Phamhi) [15:17:26] (03CR) 10Bstorm: [C: 03+1] re-add my (Dzahn) root key for cloud VPS [labs/private] - 10https://gerrit.wikimedia.org/r/534275 (owner: 10Dzahn) [15:17:59] (03PS6) 10Phamhi: toollabs: convert heredoc to variable type for interpolation [puppet] - 10https://gerrit.wikimedia.org/r/534459 [15:18:29] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] re-add my (Dzahn) root key for cloud VPS [labs/private] - 
10https://gerrit.wikimedia.org/r/534275 (owner: 10Dzahn) [15:20:54] (03PS4) 10CRusnov: netbox::postgres: Fix conditional on rule inclusion [puppet] - 10https://gerrit.wikimedia.org/r/534464 [15:22:48] PROBLEM - Host mw2231.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:24:20] (03CR) 10jerkins-bot: [V: 04-1] netbox::postgres: Fix conditional on rule inclusion [puppet] - 10https://gerrit.wikimedia.org/r/534464 (owner: 10CRusnov) [15:24:46] what are you even doing jenkins [15:24:48] (03CR) 10Andrew Bogott: [C: 03+1] "My only reservation about this is that we might want to do it in base rather than just for these two servers." [puppet] - 10https://gerrit.wikimedia.org/r/533727 (https://phabricator.wikimedia.org/T224828) (owner: 10Bstorm) [15:28:04] RECOVERY - Host mw2231.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.05 ms [15:29:48] (03CR) 10Ayounsi: "Network devices use NTP, please add $NETWORK_INFRA and $MGMT_NETWORKS" [puppet] - 10https://gerrit.wikimedia.org/r/531808 (owner: 10Muehlenhoff) [15:30:00] (03CR) 10Ayounsi: [C: 04-1] Restrict NTP servers to production networks (including frack) [puppet] - 10https://gerrit.wikimedia.org/r/531808 (owner: 10Muehlenhoff) [15:30:17] (03PS2) 10Bstorm: openstack-pdns: don't run the mdadm check where the database runs [puppet] - 10https://gerrit.wikimedia.org/r/533727 (https://phabricator.wikimedia.org/T224828) [15:31:08] (03CR) 10CRusnov: [C: 03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/534464 (owner: 10CRusnov) [15:34:28] 10Operations, 10netops: Check router ACLs for early install SSH access from puppet masters/cumin hosts - https://phabricator.wikimedia.org/T231811 (10Andrew) install_console works just fine from cumin1001. So, no need for a special case here, we can just go ahead and decom iron. Thanks all! 
[15:34:35] (03CR) 10Bstorm: [C: 03+2] openstack-pdns: don't run the mdadm check where the database runs [puppet] - 10https://gerrit.wikimedia.org/r/533727 (https://phabricator.wikimedia.org/T224828) (owner: 10Bstorm) [15:35:22] PROBLEM - Host mw2231.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:35:42] 10Operations, 10netops: Check router ACLs for early install SSH access from puppet masters/cumin hosts - https://phabricator.wikimedia.org/T231811 (10MoritzMuehlenhoff) 05Open→03Declined Great, thanks. Closing the task, will proceed with iron decom. [15:36:21] !log upgrade grafana to 5.4.5 on labmon [15:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:09] (03PS4) 10Herron: eventgate-main: add new kafka-main brokers to broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/529428 (https://phabricator.wikimedia.org/T225005) [15:38:02] 10Operations, 10ops-codfw, 10DC-Ops: mw2231 is down and unable to reboot - https://phabricator.wikimedia.org/T231192 (10Papaul) @wiki_willy this system is out of warranty since May 2019 which is like 4 months and we do not have spare . Did some tests again on the system today. - Swapped CPU 1 with CPU2 err... 
[15:38:33] (03PS5) 10CRusnov: netbox::postgres: Fix conditional on rule inclusion [puppet] - 10https://gerrit.wikimedia.org/r/534464 [15:40:34] (03CR) 10Herron: [C: 03+2] eventgate-main: add new kafka-main brokers to broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/529428 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [15:40:40] (03PS5) 10Herron: eventgate-main: add new kafka-main brokers to broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/529428 (https://phabricator.wikimedia.org/T225005) [15:41:08] (03CR) 10Herron: [V: 03+2 C: 03+2] eventgate-main: add new kafka-main brokers to broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/529428 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [15:44:45] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar), 10User-CDanis: Upgrade grafana to 5.x - https://phabricator.wikimedia.org/T210416 (10fgiunchedi) [15:46:50] RECOVERY - Host mw2231.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.85 ms [15:52:31] (03PS1) 10Zoranzoki21: Set noindex for user and user_talk on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534471 (https://phabricator.wikimedia.org/T231982) [15:54:11] (03PS2) 10Zoranzoki21: Set noindex for user and user_talk on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534471 (https://phabricator.wikimedia.org/T231982) [15:55:30] !log joal@deploy1001 Started deploy [analytics/refinery@2322f10]: Fix for yesterday regular analytics deploy [15:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:24] (03PS1) 10Herron: eventgate-main: add new brokers to staging broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/534472 (https://phabricator.wikimedia.org/T225005) [15:57:01] 10Operations, 10ops-codfw, 10decommission, 10observability, 10Patch-For-Review: Decom graphite2001/WMF6160 - https://phabricator.wikimedia.org/T200209 (10Papaul) [15:57:54] 
(03CR) 10Ottomata: [C: 03+1] eventgate-main: add new brokers to staging broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/534472 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [15:58:09] (03CR) 10Herron: [V: 03+2 C: 03+2] eventgate-main: add new brokers to staging broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/534472 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [16:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190904T1600). [16:00:05] kart_ and Daimona: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:14] o/ [16:00:19] here. [16:00:19] o/ [16:00:30] Lucas_WMDE: helping SWAT? :) [16:00:46] I could, if needed :) [16:01:16] I can go with my patch. I need to do more testing of that, so if you can help Daimona after that, that will be great. [16:01:21] but there’s only two patches in the window so I’ll give the real swatters a few more minutes to o/ :) [16:01:32] sure [16:01:50] Please ping me when you're ready :) [16:02:22] kart_: go ahead, I guess :) [16:02:32] OK! [16:02:36] (03CR) 10KartikMistry: [C: 03+2] "SWAT." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/533172 (https://phabricator.wikimedia.org/T231207) (owner: 10KartikMistry) [16:03:34] (03Merged) 10jenkins-bot: Move ContentTranslation out of Beta in jvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533172 (https://phabricator.wikimedia.org/T231207) (owner: 10KartikMistry) [16:03:51] (03CR) 10jenkins-bot: Move ContentTranslation out of Beta in jvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533172 (https://phabricator.wikimedia.org/T231207) (owner: 10KartikMistry) [16:04:30] kart_: Congratulations to you and the whole Language Engineering team. A major moment. :-) [16:06:00] Thanks James_F [16:06:08] Starting with one wiki as of now :) [16:06:34] (03PS1) 10Phamhi: toollabs: fix maintain-kubeusers-timer command arguments [puppet] - 10https://gerrit.wikimedia.org/r/534475 [16:06:42] Sure. :-) [16:06:47] (03CR) 10Alexandros Kosiaris: [C: 03+1] lvs: add restbase-ssl [puppet] - 10https://gerrit.wikimedia.org/r/534462 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [16:08:15] (03CR) 10Phamhi: [C: 03+2] toollabs: fix maintain-kubeusers-timer command arguments [puppet] - 10https://gerrit.wikimedia.org/r/534475 (owner: 10Phamhi) [16:08:40] (03CR) 10Phamhi: [C: 03+2] "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/18177/console" [puppet] - 10https://gerrit.wikimedia.org/r/534475 (owner: 10Phamhi) [16:09:41] Deploying. mwdebug looks OK. [16:09:47] (03PS4) 10Ppchelko: Switch all non-low-traffic jobs to eventgate. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/534225 (https://phabricator.wikimedia.org/T228705) [16:09:57] then I’ll already start gate-and-submit on the backport, to speed things up a bit [16:10:20] (03CR) 10Jhedden: [C: 03+1] toollabs: fix maintain-kubeusers-timer command arguments [puppet] - 10https://gerrit.wikimedia.org/r/534475 (owner: 10Phamhi) [16:11:57] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit|533172|Move ContentTranslation out of Beta in jvwiki (T231207)]] (duration: 00m 56s) [16:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:00] T231207: Enable Content Translation in Javanese Wikipedia as a default tool - https://phabricator.wikimedia.org/T231207 [16:12:38] 10Operations, 10ops-codfw, 10DC-Ops: mw2231 is down and unable to reboot - https://phabricator.wikimedia.org/T231192 (10Papaul) @MoritzMuehlenhoff @Joe I talked to @wiki_willy On IRC on what needs to be done for this system. The sames has : 2x500GB 2.5" SATA disks E5 2650 V3 @2,3 GHz 2x32 GB of memory w... [16:13:04] Lucas_WMDE: done. [16:13:45] cool, thanks! (and congrats) [16:14:03] Daimona: currently waiting for jenkins (ca. 15-20 minutes) [16:14:13] Nice, thanks [16:14:17] It's gonna take ages :) [16:14:44] turn that smile upside down ;) [16:17:04] Let's be optimistic [16:17:22] 10Operations, 10ops-codfw, 10DC-Ops: mw2231 is down and unable to reboot - https://phabricator.wikimedia.org/T231192 (10Joe) @Papaul that looks fine - I don't think we //need// to swap out the SSDs, so just do it if we have a better use of those disks (they're pretty useless on an appserver). [16:21:12] (03PS1) 10CRusnov: netbox::postgres: Finally fix firewall [puppet] - 10https://gerrit.wikimedia.org/r/534482 [16:21:43] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "We have to make sure this works correctly under jessie, as some of the restbase hosts are still on jessie. 
Given we're using an unprivileg" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/533028 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [16:23:16] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Also add the new pool to the settings of profile::lvs::realserver for role restbase, having envoy as a dependent service" [puppet] - 10https://gerrit.wikimedia.org/r/534462 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [16:24:13] jouncebot: now [16:24:14] For the next 0 hour(s) and 35 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190904T1600) [16:24:38] Daimona: great to see your AbuseFilter patch being added :] [16:24:49] Yay [16:24:58] Re selenium: We'd really need it in these situations [16:32:47] (03PS2) 10CRusnov: netbox::postgres: Finally fix firewall [puppet] - 10https://gerrit.wikimedia.org/r/534482 [16:33:09] Daimona: it was merged! [16:33:13] (sorry, I was briefly busy with other things) [16:33:17] Cool! [16:33:20] I’ll go ahead now [16:33:23] it can be tested, right? [16:33:24] Sure [16:33:29] Yeah, easy to test [16:34:01] easy for an admin, at least ;) [16:34:17] (03CR) 10Ottomata: [C: 03+1] Switch all non-low-traffic jobs to eventgate. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534225 (https://phabricator.wikimedia.org/T228705) (owner: 10Ppchelko) [16:34:34] Yes, well [16:35:17] it’s on mwdebug1002 [16:36:05] Testing [16:36:09] 10Operations, 10ops-codfw, 10DC-Ops: mw2231 is down and unable to reboot - https://phabricator.wikimedia.org/T231192 (10Papaul) @joe I am afraid i did get the comment on the SSD's Do you to use the SSD's or keep the SATA? 
mw2231 has SATA 2.5" disk graphite2002 has 1.6TB SSD's [16:36:28] seems to be working on testwikidata [16:37:22] (03PS2) 10Bstorm: powerdns: correct some database variables in my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/533308 (https://phabricator.wikimedia.org/T224828) [16:37:22] Yeah works like a charm [16:37:23] Thanks [16:37:39] ok deploying [16:38:46] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.34.0-wmf.21/extensions/AbuseFilter: SWAT: [[gerrit:534429|Fix filter validation in ViewEdit (T231985)]] (duration: 00m 58s) [16:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:49] T231985: Cannot create new abuse filters - https://phabricator.wikimedia.org/T231985 [16:39:50] ok, I think we’re done :) [16:40:06] Noice, thanks! [16:40:12] !log Morning SWAT done [16:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:18] (03CR) 10Andrew Bogott: [C: 03+1] "fine w/me!" [puppet] - 10https://gerrit.wikimedia.org/r/533308 (https://phabricator.wikimedia.org/T224828) (owner: 10Bstorm) [16:43:02] (03CR) 10CRusnov: [C: 03+2] netbox::postgres: Finally fix firewall [puppet] - 10https://gerrit.wikimedia.org/r/534482 (owner: 10CRusnov) [16:46:45] (03PS3) 10Bstorm: powerdns: correct some database variables in my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/533308 (https://phabricator.wikimedia.org/T224828) [16:48:46] !log joal@deploy1001 Finished deploy [analytics/refinery@2322f10]: Fix for yesterday regular analytics deploy (duration: 53m 16s) [16:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:23] (03CR) 10Bstorm: [C: 03+2] powerdns: correct some database variables in my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/533308 (https://phabricator.wikimedia.org/T224828) (owner: 10Bstorm) [16:50:18] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'main' . 
[16:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:12] (03PS1) 10CRusnov: netbox: Add profile::base::firewall to module [puppet] - 10https://gerrit.wikimedia.org/r/534490 [17:14:25] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-main' for release 'main' . [17:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:41] I'm grabbing the conch. [17:28:16] 10Operations, 10Analytics, 10Core Platform Team Legacy (Watching / External), 10Patch-For-Review, and 2 others: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10herron) The eventgate-main config now includes the new brokers i... [17:32:16] (03CR) 10Ottomata: [C: 03+2] Switch all non-low-traffic jobs to eventgate. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534225 (https://phabricator.wikimedia.org/T228705) (owner: 10Ppchelko) [17:32:33] !log Switch all non-low-traffic jobs to eventgate - T228705 [17:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:36] T228705: Migrate JobQueue to eventgate - https://phabricator.wikimedia.org/T228705 [17:32:42] (03CR) 10jenkins-bot: Switch all non-low-traffic jobs to eventgate. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/534225 (https://phabricator.wikimedia.org/T228705) (owner: 10Ppchelko) [17:34:08] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Switch all non-low-traffic jobs to eventgate - T228705 (duration: 00m 56s) [17:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:22] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Switch all non-low-traffic jobs to eventgate - T228705 - take 2 (duration: 00m 55s) [17:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:24] T228705: Migrate JobQueue to eventgate - https://phabricator.wikimedia.org/T228705 [17:45:23] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.21/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.Target.js: T150418 Fix HTML blacklist inheritance to avoid copy-pasted read s again (duration: 00m 56s) [17:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:26] T150418: References pasted from read mode should be dropped until we can support them properly - https://phabricator.wikimedia.org/T150418 [17:47:33] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.20/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.Target.js: T150418 Fix HTML blacklist inheritance to avoid copy-pasted read s again (duration: 00m 57s) [17:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:22] (03PS1) 10Andrew Bogott: designate: create pdns zone records with both designate servers as 'master' [puppet] - 10https://gerrit.wikimedia.org/r/534496 [17:56:07] (03CR) 10Andrew Bogott: [C: 03+2] designate: create pdns zone records with both designate servers as 'master' [puppet] - 10https://gerrit.wikimedia.org/r/534496 (owner: 10Andrew Bogott) [17:59:49] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.21/extensions/GrowthExperiments/modules/homepage/: T229271 Homepage: Unbreak question 
dialogs on mobile (duration: 00m 56s) [17:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190904T1800) [18:00:09] T229271: [wmf.15-mobile] Homepage: 'Ask mentor'/'Ask help desk' flashes 'undefined' - https://phabricator.wikimedia.org/T229271 [18:02:18] 10Operations, 10ops-codfw, 10decommission, 10observability, 10Patch-For-Review: Decom graphite2001/WMF6160 - https://phabricator.wikimedia.org/T200209 (10Papaul) [18:02:33] 10Operations, 10ops-codfw, 10decommission, 10observability, 10Patch-For-Review: Decom graphite2001/WMF6160 - https://phabricator.wikimedia.org/T200209 (10Papaul) ` papaul@asw-b-codfw# run show interfaces ge-5/0/1 descriptions Interface Admin Link Description ge-5/0/1 down down DISABLED [18:08:06] (03PS1) 10Papaul: DNS: Remove mgmt and production DNS for graphite2001 [dns] - 10https://gerrit.wikimedia.org/r/534500 [18:09:04] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt and production DNS for graphite2001 [dns] - 10https://gerrit.wikimedia.org/r/534500 (owner: 10Papaul) [18:09:57] 10Operations, 10ops-codfw, 10decommission, 10observability, 10Patch-For-Review: Decom graphite2001/WMF6160 - https://phabricator.wikimedia.org/T200209 (10Papaul) [18:31:28] (03PS1) 10Ppchelko: Switch all events to eventgate. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/534506 (https://phabricator.wikimedia.org/T228705) [18:36:24] (03PS1) 10Ayounsi: Homer deploy repo init [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/534507 [18:40:57] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:41:43] 10Operations, 10fundraising-tech-ops: Authentication for grafana - https://phabricator.wikimedia.org/T198648 (10Jgreen) Currently implemented as client certificate authentication is sufficient for no-login read only access, and optional grafana accounts for editors/admins backending in sqlite db. We need to sy... [18:41:44] (03CR) 10Ayounsi: "Instead of having the submodule to a specific homer branch, we could rely on pypi and add homer to the requirement.txt file." [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/534507 (owner: 10Ayounsi) [18:43:33] 10Operations, 10fundraising-tech-ops: Authentication for grafana - https://phabricator.wikimedia.org/T198648 (10Jgreen) 05Open→03Resolved a:03Jgreen [18:43:36] 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: Rack/setup frmon1001 - https://phabricator.wikimedia.org/T186073 (10Jgreen) [18:50:26] jouncebot: now [18:50:26] For the next 0 hour(s) and 9 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190904T1800) [18:50:30] o/ [18:50:35] still 10 minutes hehe [18:51:51] hashar: You can go for it. 
;-) [18:52:44] I guess so yeah [18:54:18] (03PS1) 10Hashar: group1 wikis to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534509 [18:54:20] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534509 (owner: 10Hashar) [18:55:32] 10Operations, 10WMF-Communications: Updating DNS records - https://phabricator.wikimedia.org/T231387 (10Varnent) @BBlack - any updates? [18:56:08] (03Merged) 10jenkins-bot: group1 wikis to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534509 (owner: 10Hashar) [18:56:35] (03CR) 10jenkins-bot: group1 wikis to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534509 (owner: 10Hashar) [18:59:24] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.34.0-wmf.21 [18:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:05] hashar: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190904T1900). 
[19:00:19] !log hashar@deploy1001 Synchronized php: group1 wikis to 1.34.0-wmf.21 (duration: 00m 54s) [19:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:27] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:02:25] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [19:05:31] well hmm [19:05:42] not much happening beside that termbox alarm :-\ [19:05:54] whatever that one could be [19:07:57] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [19:12:30] hmm that does not sound related [19:12:35] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [19:13:13] timeout of 3000ms exceeded [19:13:14] :\ [19:13:30] Could be us. [19:13:38] wikidata.org is in group1, after all. [19:13:46] yeah that started with the train [19:13:48] But no real traffic goes to codfw? [19:14:10] Wikibase\View\Termbox\Renderer\TermboxRemoteRenderer: Problem requesting from the remote server [19:14:19] really [19:14:22] Ah, cross-repo issues, perhaps? [19:14:24] I should deploy wikidata in my morning [19:14:30] Do we mirror traffic? [19:14:35] Well, normally you would have. :-) [19:14:43] Want to pull back Wikidata only? 
[19:15:21] (03CR) 10Eevans: [V: 03+2 C: 03+2] sessionstore: Bump limits and requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/533922 (https://phabricator.wikimedia.org/T229697) (owner: 10Alexandros Kosiaris) [19:15:31] (03PS2) 10Eevans: sessionstore: Bump limits and requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/533922 (https://phabricator.wikimedia.org/T229697) (owner: 10Alexandros Kosiaris) [19:15:42] (03CR) 10Eevans: [V: 03+2 C: 03+2] sessionstore: Bump limits and requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/533922 (https://phabricator.wikimedia.org/T229697) (owner: 10Alexandros Kosiaris) [19:16:35] hashar: Or just ACK it? The site seems to be working… [19:16:38] (Famous last words.) [19:17:01] well something throws a 500 [19:17:06] will look at the icinga probe [19:17:10] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'sessionstore' for release 'staging' . [19:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:29] bah check_wmf_service again ;D [19:21:53] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [19:23:59] 10Operations, 10ops-codfw, 10DC-Ops: mw2231 is down and unable to reboot - https://phabricator.wikimedia.org/T231192 (10MoritzMuehlenhoff) What Giuseppe meant: For the app server use case it doesn't matter whether we use SSD or SATA, they do very little I/O. If you have other use for the SSDs (e.g. because w... 
[19:28:40] https://phabricator.wikimedia.org/T232035
[19:28:40] :D
[19:30:15] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[19:30:45] (PS1) Hashar: Rollback wikidata to 1.34.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/534514 (https://phabricator.wikimedia.org/T232035)
[19:30:50] (PS1) Eevans: staging/sessionstore: bump memory back to 100Mi in response to errors [deployment-charts] - https://gerrit.wikimedia.org/r/534515 (https://phabricator.wikimedia.org/T229697)
[19:31:29] (CR) Hashar: [C: +2] Rollback wikidata to 1.34.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/534514 (https://phabricator.wikimedia.org/T232035) (owner: Hashar)
[19:31:31] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
[19:32:12] (CR) Eevans: [V: +2 C: +2] staging/sessionstore: bump memory back to 100Mi in response to errors [deployment-charts] - https://gerrit.wikimedia.org/r/534515 (https://phabricator.wikimedia.org/T229697) (owner: Eevans)
[19:32:44] (Merged) jenkins-bot: Rollback wikidata to 1.34.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/534514 (https://phabricator.wikimedia.org/T232035) (owner: Hashar)
[19:33:16] (CR) jenkins-bot: Rollback wikidata to 1.34.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/534514 (https://phabricator.wikimedia.org/T232035) (owner: Hashar)
[19:33:48] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'sessionstore' for release 'staging' .
[19:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:50] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: rollback wikidatawiki to 1.34.0-wmf.20 for T232035 - T220746 [19:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:59] T220746: 1.34.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T220746 [19:36:00] T232035: 1.34.0-wmf.21 cause termbox to emit: Test get rendered termbox returned the unexpected status 500 - https://phabricator.wikimedia.org/T232035 [19:44:38] 10Operations, 10ops-codfw, 10DC-Ops: mw2231 is down and unable to reboot - https://phabricator.wikimedia.org/T231192 (10Papaul) @MoritzMuehlenhoff thanks. [19:47:01] 10Operations, 10WMF-Legal, 10serviceops, 10Patch-For-Review: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10APalmer_WMF) Hi all, just wanted to see if there was any further info or clarification needed from Legal. We really appreciate... 
[19:50:31] :-(
[19:51:27] so
[19:51:44] I have no idea who manages the termbox service beside wmde
[19:52:00] so I guess I add them + serviceops
[19:52:07] (PS1) Andrew Bogott: Designate: fix syntax for domain masters [puppet] - https://gerrit.wikimedia.org/r/534518
[19:53:32] (CR) Andrew Bogott: [C: +2] Designate: fix syntax for domain masters [puppet] - https://gerrit.wikimedia.org/r/534518 (owner: Andrew Bogott)
[19:56:49] (PS1) Eevans: staging/sessionstore: restbase-dev1006 is back online [deployment-charts] - https://gerrit.wikimedia.org/r/534519 (https://phabricator.wikimedia.org/T224260)
[19:58:27] (CR) Eevans: [V: +2 C: +2] staging/sessionstore: restbase-dev1006 is back online [deployment-charts] - https://gerrit.wikimedia.org/r/534519 (https://phabricator.wikimedia.org/T224260) (owner: Eevans)
[20:00:04] cscott, arlolra, subbu, bearND, halfak, and accraze: (Dis)respected human, time to deploy Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190904T2000). Please do the needful.
[20:00:05] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'sessionstore' for release 'staging' .
[20:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:02:34] ok train looks quiet
[20:09:47] I am off!
[20:12:11] Operations, Cassandra, RESTBase, RESTBase-Cassandra, and 2 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (Eevans)
[20:13:13] Operations, Cassandra, RESTBase, Core Platform Team Workboards (Clinic Duty Team), User-Eevans: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (Eevans)
[20:14:48] !log decommission restbase-dev1004-a (Cassandra) -- T224554
[20:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:54] T224554: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554
[20:22:24] (CR) CDanis: [C: +2] "+2'ing because this is needed to make the schemas in the conftool tree match what's needed to make production work / what is deployed in p" [software/conftool] - https://gerrit.wikimedia.org/r/534153 (owner: CDanis)
[20:25:25] Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (Andrew) Resolved→Open a:Andrew→wiki_willy I can still make this crash -- my process is scheduling 80 VMs o...
[20:25:51] (03Merged) 10jenkins-bot: dbctl: schema: allow wikitech in readOnlyBySection [software/conftool] - 10https://gerrit.wikimedia.org/r/534153 (owner: 10CDanis) [20:46:55] (03PS1) 10Andrew Bogott: designate: make the pdns pools treat both designate hosts as masters [puppet] - 10https://gerrit.wikimedia.org/r/534528 [20:47:59] (03CR) 10Andrew Bogott: [C: 03+2] designate: make the pdns pools treat both designate hosts as masters [puppet] - 10https://gerrit.wikimedia.org/r/534528 (owner: 10Andrew Bogott) [20:54:04] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [20:54:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10wiki_willy) Hi @Andrew - I mentioned the ongoing issues with this machine to our Dell account rep last week, since we've b... [20:54:18] (03PS4) 10Andrew Bogott: openstack configs: forward some mitaka updates to newton [puppet] - 10https://gerrit.wikimedia.org/r/533549 [20:55:13] (03CR) 10Andrew Bogott: [C: 03+2] openstack configs: forward some mitaka updates to newton [puppet] - 10https://gerrit.wikimedia.org/r/533549 (owner: 10Andrew Bogott) [20:55:39] (03PS3) 10Andrew Bogott: glance: add Newton config files [puppet] - 10https://gerrit.wikimedia.org/r/533923 (https://phabricator.wikimedia.org/T212302) [20:56:30] (03CR) 10Andrew Bogott: [C: 03+2] glance: add Newton config files [puppet] - 10https://gerrit.wikimedia.org/r/533923 (https://phabricator.wikimedia.org/T212302) (owner: 10Andrew Bogott) [20:56:48] (03CR) 10Arlolra: [C: 03+1] Enable loading Parsoid/PHP as an extension on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534215 (https://phabricator.wikimedia.org/T231569) (owner: 10Subramanya Sastry) [20:58:49] (03PS3) 10Andrew Bogott: keystone: forward mitaka config to newton [puppet] - 10https://gerrit.wikimedia.org/r/533924 
(https://phabricator.wikimedia.org/T212302) [20:58:51] (03PS3) 10Andrew Bogott: keystone: update policy.json for Newton [puppet] - 10https://gerrit.wikimedia.org/r/533925 (https://phabricator.wikimedia.org/T212302) [20:58:53] (03PS3) 10Andrew Bogott: Openstack Neutron: added config files and templates for version Newton [puppet] - 10https://gerrit.wikimedia.org/r/533927 (https://phabricator.wikimedia.org/T212302) [20:58:55] (03PS3) 10Andrew Bogott: Designate: add Newton config files and resources [puppet] - 10https://gerrit.wikimedia.org/r/533926 (https://phabricator.wikimedia.org/T212302) [20:59:24] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.17 ms [21:01:55] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T230575 (10Andrew) [21:03:29] !log crusnov@deploy1001 Started deploy [netbox/deploy@367ca84]: (no justification provided) [21:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:13] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T216004 (10Andrew) [21:05:36] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Andrew) @wiki_willy, the parent task of this task is the procurement for four identical systems: cloudvirt1015, 1016. 101... 
[21:08:40] (03CR) 10Arlolra: [C: 03+1] ATS/varnish/parsoid-testing: remove directors for parsoid-vd/parsoid-rt [puppet] - 10https://gerrit.wikimedia.org/r/534271 (https://phabricator.wikimedia.org/T229356) (owner: 10Dzahn) [21:10:22] (03CR) 10Andrew Bogott: [C: 03+2] keystone: forward mitaka config to newton [puppet] - 10https://gerrit.wikimedia.org/r/533924 (https://phabricator.wikimedia.org/T212302) (owner: 10Andrew Bogott) [21:10:42] (03CR) 10Andrew Bogott: [C: 03+2] keystone: update policy.json for Newton [puppet] - 10https://gerrit.wikimedia.org/r/533925 (https://phabricator.wikimedia.org/T212302) (owner: 10Andrew Bogott) [21:12:23] !log crusnov@deploy1001 Finished deploy [netbox/deploy@367ca84]: (no justification provided) (duration: 08m 55s) [21:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10wiki_willy) Thanks @Andrew - I'll reach out to our Account Rep, to see if something else can be done. [21:17:57] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T230575 (10wiki_willy) a:05wiki_willy→03Bstorm Assigning to @Bstorm to follow up on the previous comment. [21:19:16] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: site=eqsin https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [21:45:28] so if STOP SLAVE doesn't stop replication, what am I doing wrong? 
[21:45:33] (local docker, i'm not messing with production) [21:46:25] (03PS5) 10Holger Knust: WIP: Add cassandra-table-properties tool to Cassandra deployments [puppet] - 10https://gerrit.wikimedia.org/r/529074 (https://phabricator.wikimedia.org/T226553) [21:47:39] (03PS1) 10Ayounsi: Deploy homer [puppet] - 10https://gerrit.wikimedia.org/r/534538 [21:49:56] PROBLEM - Host cumin1001 is DOWN: PING CRITICAL - Packet loss = 100% [21:53:09] is that expected ^? [21:53:15] (03CR) 10Jforrester: Retry "CommonSettings: Factor out variant config generation into MWConfigCacheGenerator" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525698 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [21:54:13] paladox: I don't know, I was just wondering that [21:54:23] I guess it's not user-facing in any case [21:54:26] ah, ok [21:56:45] huh, console on cumin1001 is in the system config [21:56:53] but if someone else booted it there, why can I access the console? [21:58:14] RECOVERY - Host cumin1001 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [21:58:22] !log attached to console on cumin1001, found it in bios 'system settings', exited, allowed boot to continue. No idea how it got there — spontaneous reboot? [21:58:23] (03PS5) 10Jforrester: Retry "CommonSettings: Factor out variant config generation into MWConfigCacheGenerator" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525698 (https://phabricator.wikimedia.org/T223602) [21:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:15] (03PS6) 10Jforrester: CommonSettings: Factor out variant config generation into MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525698 (https://phabricator.wikimedia.org/T223602) [22:02:44] Krinkle: Can I get your blessing to land ^^ and the next few changes (should be a no-op)? [22:05:22] James_F: reviewing now. [22:05:28] * James_F nods. [22:05:32] Thank you. 
[22:07:37] (03CR) 10Krinkle: [C: 04-1] CommonSettings: Factor out variant config generation into MWConfigCacheGenerator (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525698 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [22:08:21] (03CR) 10Krinkle: [C: 04-1] CommonSettings: Factor out variant config generation into MWConfigCacheGenerator (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525698 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [22:11:10] PROBLEM - HHVM rendering on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [22:11:46] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [22:12:30] 10Operations, 10ops-eqiad, 10vm-requests: rack/setup/install ganeti10([09]|1[0-8[).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10Jclark-ctr) @Cmjohnson Idrac and bios settings finished [22:12:36] RECOVERY - HHVM rendering on mw1316 is OK: HTTP OK: HTTP/1.1 200 OK - 76029 bytes in 0.761 second response time https://wikitech.wikimedia.org/wiki/Application_servers [22:13:21] (03CR) 10Jforrester: CommonSettings: Factor out variant config generation into MWConfigCacheGenerator (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525698 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [22:14:44] PROBLEM - Keyholder SSH agent on cumin1001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder [22:17:04] (03PS2) 10CRusnov: netbox: Fix multiple lingering issues. 
[puppet] - https://gerrit.wikimedia.org/r/534490
[22:21:26] (CR) Subramanya Sastry: [C: +1] ATS/varnish/parsoid-testing: remove directors for parsoid-vd/parsoid-rt [puppet] - https://gerrit.wikimedia.org/r/534271 (https://phabricator.wikimedia.org/T229356) (owner: Dzahn)
[22:27:48] (PS7) Jforrester: CommonSettings: Factor out variant config generation into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/525698 (https://phabricator.wikimedia.org/T223602)
[22:33:13] Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (wiki_willy) Emailed our Dell account rep, who responded that they will look into what our options are and get back to us....
[22:57:56] (CR) Krinkle: CommonSettings: Factor out variant config generation into MWConfigCacheGenerator (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/525698 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[22:58:09] (CR) Krinkle: [C: +1] "Probably file, but can't confirm right now." [mediawiki-config] - https://gerrit.wikimedia.org/r/525698 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[22:59:06] (CR) Jforrester: CommonSettings: Factor out variant config generation into MWConfigCacheGenerator (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/525698 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190904T2300). Please do the needful.
[23:00:04] RoanKattouw: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:51] James already deployed this patch earlier today, so there's nothing to do in this SWAT window
[23:05:27] !log decommission restbase-dev1004-b (Cassandra) -- T224554
[23:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:05:30] T224554: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554
[23:17:50] (PS8) Jforrester: CommonSettings: Factor out variant config generation into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/525698 (https://phabricator.wikimedia.org/T223602)
[23:24:32] I'm taking the conch.
[23:25:56] (CR) Krinkle: CommonSettings: Factor out variant config generation into MWConfigCacheGenerator (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/525698 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[23:25:59] (CR) Krinkle: [C: +1] CommonSettings: Factor out variant config generation into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/525698 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[23:26:21] (CR) Krinkle: [C: +1] "LGTM. Best to watch Logstash/mwdebug1002 very carefully when rolling out though." [mediawiki-config] - https://gerrit.wikimedia.org/r/525698 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[23:26:27] ok, signing off now for real
[23:26:28] o/
[23:26:31] See you.
[23:30:16] (CR) Jforrester: [C: +2] CommonSettings: Factor out variant config generation into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/525698 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[23:31:26] (Merged) jenkins-bot: CommonSettings: Factor out variant config generation into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/525698 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[23:31:41] (CR) jenkins-bot: CommonSettings: Factor out variant config generation into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/525698 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[23:33:19] !log jforrester@deploy1001 Synchronized multiversion/MWConfigCacheGenerator.php: CommonSettings: Factor out variant config generation into MWConfigCacheGenerator, part 1 (duration: 00m 54s)
[23:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:13] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: CommonSettings: Factor out variant config generation into MWConfigCacheGenerator, part 2 (duration: 00m 55s)
[23:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:37:50] (PS3) Jforrester: cirrusTest:Use shared dblist array; fix a couple of old loads [mediawiki-config] - https://gerrit.wikimedia.org/r/533584
[23:38:15] (CR) Jforrester: [C: +2] cirrusTest:Use shared dblist array; fix a couple of old loads [mediawiki-config] - https://gerrit.wikimedia.org/r/533584 (owner: Jforrester)
[23:38:56] (CR) Ladsgroup: CommonSettings: Factor out variant config generation into MWConfigCacheGenerator (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/525698 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[23:39:15] (Merged) jenkins-bot: cirrusTest:Use shared dblist array; fix a couple of old loads [mediawiki-config] - https://gerrit.wikimedia.org/r/533584 (owner: Jforrester)
[23:39:54] (CR) jenkins-bot: cirrusTest:Use shared dblist array; fix a couple of old loads [mediawiki-config] - https://gerrit.wikimedia.org/r/533584 (owner: Jforrester)
[23:40:25] (PS1) Jforrester: tests: Update local copy of SiteConfiguration.php to master [mediawiki-config] - https://gerrit.wikimedia.org/r/534546
[23:41:15] (CR) jerkins-bot: [V: -1] tests: Update local copy of SiteConfiguration.php to master [mediawiki-config] - https://gerrit.wikimedia.org/r/534546 (owner: Jforrester)
[23:50:34] (PS6) Jforrester: CommonSettings: Factor out load of variant config into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/507727
[23:51:42] (CR) jerkins-bot: [V: -1] CommonSettings: Factor out load of variant config into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/507727 (owner: Jforrester)
[23:52:14] (PS7) Jforrester: CommonSettings: Factor out load of variant config into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/507727
[23:53:03] (PS8) Jforrester: CommonSettings: Factor out load of variant config into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/507727
[23:53:05] (CR) jerkins-bot: [V: -1] CommonSettings: Factor out load of variant config into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/507727 (owner: Jforrester)
[23:58:21] (CR) Jforrester: [C: +2] CommonSettings: Factor out load of variant config into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/507727 (owner: Jforrester)
[23:59:16] (Merged) jenkins-bot: CommonSettings: Factor out load of variant config into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/507727 (owner: Jforrester)
[23:59:35] (CR) jenkins-bot: CommonSettings: Factor out load of variant config into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/507727 (owner: Jforrester)