[01:09:59] PROBLEM - Memory correctable errors -EDAC- on mw1248 is CRITICAL: 10 ge 4 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1248&var-datasource=eqiad+prometheus/ops
[01:16:18] (CR) DannyS712: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/566581 (owner: Reedy)
[02:07:11] PROBLEM - puppet last run on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:08:43] PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[02:08:49] PROBLEM - Check systemd state on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:09:33] PROBLEM - Check whether ferm is active by checking the default input chain on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[02:09:41] PROBLEM - dhclient process on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[02:09:59] PROBLEM - Disk space on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops
[02:10:07] PROBLEM - configured eth on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[02:10:29] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[02:10:35] PROBLEM - DPKG on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[02:10:47] PROBLEM - Check size of conntrack table on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[02:11:31] PROBLEM - Check the last execution of search-drop-query-clicks on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:17:11] RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[02:17:23] RECOVERY - Check size of conntrack table on stat1007 is OK: OK: nf_conntrack is 6 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[02:17:33] RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[02:17:37] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:18:23] RECOVERY - Check whether ferm is active by checking the default input chain on stat1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[02:18:31] RECOVERY - dhclient process on stat1007 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[02:18:47] RECOVERY - Disk space on stat1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops
[02:18:57] RECOVERY - configured eth on stat1007 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[02:19:35] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:22:41] RECOVERY - Check the last execution of search-drop-query-clicks on stat1007 is OK: OK: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:41:43] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1007 is OK: OK: synced at Mon 2020-03-02 02:41:41 UTC. https://wikitech.wikimedia.org/wiki/NTP
[02:45:53] (PS1) Krinkle: Remove dead code from 'ChangeAuthenticationDataAudit' hook handler [mediawiki-config] - https://gerrit.wikimedia.org/r/575815
[02:46:11] (CR) Krinkle: Bring up password change logging to the same standards as login logging (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/467110 (owner: Gergő Tisza)
[02:59:48] (PS1) KartikMistry: Update cxserver to 2020-02-28-043702-production [deployment-charts] - https://gerrit.wikimedia.org/r/575817 (https://phabricator.wikimedia.org/T246319)
[06:04:38] !log Re-add db1111 to s8 in tendril and zarcillo - T246447
[06:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:48] T246447: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447
[06:18:31] (PS1) Marostegui: db1111: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/575825 (https://phabricator.wikimedia.org/T246447)
[06:20:55] (CR) Marostegui: [C: +2] db1111: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/575825 (https://phabricator.wikimedia.org/T246447) (owner: Marostegui)
[06:24:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1111 to s8 with minimal weight to check grants and any other issues T246447', diff saved to https://phabricator.wikimedia.org/P10564 and previous config saved to /var/cache/conftool/dbconfig/20200302-062435-marostegui.json
[06:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:41] T246447: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447
[06:29:40] (Abandoned) Zoranzoki21: Equalization of wgPopupsReferencePreviews [mediawiki-config] - https://gerrit.wikimedia.org/r/552643 (owner: Zoranzoki21)
[06:32:57] (PS1) Vgutierrez: lvs: Add lvs2008 as a high-traffic2 load balancer [puppet] - https://gerrit.wikimedia.org/r/575826 (https://phabricator.wikimedia.org/T196560)
[06:42:03] !log Enable events on db1111 T246447
[06:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:08] T246447: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447
[06:43:03] (CR) Vgutierrez: [C: +2] "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1003/21176/" [puppet] - https://gerrit.wikimedia.org/r/575826 (https://phabricator.wikimedia.org/T196560) (owner: Vgutierrez)
[06:45:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase weight from 1 to 10 on db1111 T246447', diff saved to https://phabricator.wikimedia.org/P10565 and previous config saved to /var/cache/conftool/dbconfig/20200302-064522-marostegui.json
[06:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:15] (CR) Giuseppe Lavagetto: [C: -1] "LGTM overall but the messages need to be reduced/corrected." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/575598 (owner: Herron)
[06:50:25] Operations, ops-codfw, Traffic, Patch-For-Review: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs2008.codfw.wmnet ` The log can be found in...
[06:59:47] (PS2) Giuseppe Lavagetto: prometheus::ops: collect envoy stats from all servers [puppet] - https://gerrit.wikimedia.org/r/575504
[07:04:08] PROBLEM - Host an-worker1083 is DOWN: PING CRITICAL - Packet loss = 100%
[07:04:13] (CR) Giuseppe Lavagetto: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/573760 (owner: RLazarus)
[07:05:50] PROBLEM - Categories update lag on wdqs1009 is CRITICAL: CRITICAL - Categories lag: 13 days, 2:05:46.979889 https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[07:08:28] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[07:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:39] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:09] Operations, ops-codfw, Traffic: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs2008.codfw.wmnet'] ` and were **ALL** successful.
[07:21:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase weight from 10 to 30 on db1111 T246447', diff saved to https://phabricator.wikimedia.org/P10566 and previous config saved to /var/cache/conftool/dbconfig/20200302-072118-marostegui.json
[07:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:23] T246447: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447
[07:22:57] !log upgrading NICs FW on lvs2008 - T196560 T203194
[07:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:02] T196560: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560
[07:23:03] T203194: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194
[07:30:11] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[07:31:47] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 1576 days) https://wikitech.wikimedia.org/wiki/Logs
[07:39:07] Operations, ops-codfw, Traffic: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (Vgutierrez) @Papaul I had to upgrade the NIC FW on lvs2008 `name=before vgutierrez@lvs2008:~$ sudo -i ethtool -i ens2f0np0 driver: bnxt_en version: 1.9.2 firmware-version: 20.6....
[07:58:57] Operations, Graphite: graphite clustering plan - https://phabricator.wikimedia.org/T86316 (fgiunchedi) Open→Declined Graphite is on its way out, declining
[07:59:01] Operations, WMDE-Analytics-Engineering, Core Platform Team Legacy (Watching / External), Graphite, and 3 others: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451 (fgiunchedi)
[08:00:33] Operations, observability, User-fgiunchedi: Monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552 (fgiunchedi) >>! In T86552#5930238, @Aklapper wrote: > @fgiunchedi: Hi, all related patches in Gerrit have been merged or abandoned. Is there more to do in this task? Asking as...
[08:00:44] (PS1) Muehlenhoff: Remove access for jsamra [puppet] - https://gerrit.wikimedia.org/r/575844
[08:02:03] Operations, RESTBase-Cassandra, Core Platform Team Legacy (Watching / External), Patch-For-Review, Services (watching): setup an alertable threshold for Cassandra heap dumps - https://phabricator.wikimedia.org/T106346 (fgiunchedi) a:fgiunchedi→None I never got around to deploying it a...
[08:07:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase weight from 30 to 50 on db1111 T246447', diff saved to https://phabricator.wikimedia.org/P10567 and previous config saved to /var/cache/conftool/dbconfig/20200302-080721-marostegui.json
[08:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:27] T246447: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447
[08:11:34] (CR) Muehlenhoff: [C: +2] Remove access for jsamra [puppet] - https://gerrit.wikimedia.org/r/575844 (owner: Muehlenhoff)
[08:14:29] !log resume item term table rebuild script (from Q54 mill) T219123
[08:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:34] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[08:19:38] (PS1) Muehlenhoff: Remove LDAP access for jaufrecht [puppet] - https://gerrit.wikimedia.org/r/575971
[08:23:56] (CR) Muehlenhoff: [C: +2] Remove LDAP access for jaufrecht [puppet] - https://gerrit.wikimedia.org/r/575971 (owner: Muehlenhoff)
[08:32:07] (PS1) KartikMistry: Add URL campaign for Wiki for WikiGap [mediawiki-config] - https://gerrit.wikimedia.org/r/575974 (https://phabricator.wikimedia.org/T246335)
[08:32:31] Operations, observability: prometheus1003/prometheus1004 /srv/prometheus/ops disk space warning - https://phabricator.wikimedia.org/T245361 (fgiunchedi) Open→Resolved Utilization growth has stabilized around Feb 20th and is now back to organic growth, resolving
[08:33:24] !log warm cache for db1111 for Q0-6 million T219123 T246447
[08:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:30] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[08:33:30] T246447: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447
[08:35:53] (PS2) Muehlenhoff: Remove role::prometheus::k8s in favour of including the profile [puppet] - https://gerrit.wikimedia.org/r/575491
[08:40:02] (PS1) Brian Wolff: Log csp and csp-report-only channels in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/575976
[08:40:43] (PS1) Filippo Giunchedi: toil: probe rsyslog tls listener every 5 min [puppet] - https://gerrit.wikimedia.org/r/575977
[08:41:02] (CR) Brian Wolff: "Alternatively, maybe the config should just not start with a '-', so that it does not override production. It seems like it was changed (m" [mediawiki-config] - https://gerrit.wikimedia.org/r/575976 (owner: Brian Wolff)
[08:41:35] (CR) Muehlenhoff: [C: +2] Remove role::prometheus::k8s in favour of including the profile [puppet] - https://gerrit.wikimedia.org/r/575491 (owner: Muehlenhoff)
[08:41:50] (PS2) KartikMistry: ContentTranslation: Add URL campaign for WikiGap [mediawiki-config] - https://gerrit.wikimedia.org/r/575974 (https://phabricator.wikimedia.org/T246335)
[08:44:01] !log installing openssh updates for stretch
[08:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:47] (CR) CDanis: [C: +1] toil: probe rsyslog tls listener every 5 min [puppet] - https://gerrit.wikimedia.org/r/575977 (owner: Filippo Giunchedi)
[08:50:11] (CR) Filippo Giunchedi: [C: +2] toil: probe rsyslog tls listener every 5 min [puppet] - https://gerrit.wikimedia.org/r/575977 (owner: Filippo Giunchedi)
[08:54:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase weight from 50 to 80 on db1111 T246447', diff saved to https://phabricator.wikimedia.org/P10568 and previous config saved to /var/cache/conftool/dbconfig/20200302-085420-marostegui.json
[08:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:26] T246447: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447
[09:02:21] (CR) Filippo Giunchedi: [C: +2] Fix system role names for restbase [puppet] - https://gerrit.wikimedia.org/r/575494 (owner: Muehlenhoff)
[09:12:53] !log warm cache for db1111 for Q0-6 million T219123 T246447 (pass 2)
[09:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:59] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[09:12:59] T246447: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447
[09:19:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1119 T239791', diff saved to https://phabricator.wikimedia.org/P10569 and previous config saved to /var/cache/conftool/dbconfig/20200302-091947-marostegui.json
[09:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:54] T239791: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791
[09:20:48] (CR) Giuseppe Lavagetto: prometheus::ops: collect envoy stats from all servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/575504 (owner: Giuseppe Lavagetto)
[09:27:04] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[09:27:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1119 after upgrade T239791', diff saved to https://phabricator.wikimedia.org/P10570 and previous config saved to /var/cache/conftool/dbconfig/20200302-092743-marostegui.json
[09:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:49] T239791: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791
[09:32:47] (CR) Filippo Giunchedi: prometheus::ops: collect envoy stats from all servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/575504 (owner: Giuseppe Lavagetto)
[09:34:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase weight from 80 to 100 on db1111 T246447', diff saved to https://phabricator.wikimedia.org/P10571 and previous config saved to /var/cache/conftool/dbconfig/20200302-093449-marostegui.json
[09:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:54] T246447: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447
[09:37:48] Operations: Integrate Stretch 9.10/9.11 point updates - https://phabricator.wikimedia.org/T232308 (MoritzMuehlenhoff)
[09:38:10] (CR) Volans: [C: +1] "LGTM, just be aware of this upstream issue:" [software/netbox-deploy] - https://gerrit.wikimedia.org/r/575580 (owner: CRusnov)
[09:38:18] !log installing openssh updates for jessie
[09:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1119 after upgrade T239791', diff saved to https://phabricator.wikimedia.org/P10572 and previous config saved to /var/cache/conftool/dbconfig/20200302-093848-marostegui.json
[09:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:53] T239791: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791
[09:40:14] Operations, observability, Epic, User-fgiunchedi: Monitor and alarm on SMART attributes [tracking] - https://phabricator.wikimedia.org/T86552 (Aklapper)
[09:40:23] Operations, observability, Epic, User-fgiunchedi: Monitor and alarm on SMART attributes [tracking] - https://phabricator.wikimedia.org/T86552 (Aklapper) Ah, thanks. Let's call it an epic. :P
[09:41:35] ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9200 on cloudelastic1001 is CRITICAL: CRITICAL - ebernhardson_deleteme[6](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[6](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[2](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[3](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[1](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[7](2020-02-27T17:24:54.480Z), ebernhardson_de
[09:41:35] -27T17:24:54.480Z), ebernhardson_deleteme[4](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[4](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[5](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[5](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[0](2020-02-27T17:24:54.480Z) Gehel On going test, this index will be cleaned up https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:41:35] ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9200 on cloudelastic1002 is CRITICAL: CRITICAL - ebernhardson_deleteme[2](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[1](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[4](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[4](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[7](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[7](2020-02-27T17:24:54.480Z), ebernhardson_de
[09:41:35] -27T17:24:54.480Z), ebernhardson_deleteme[6](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[6](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[5](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[5](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[0](2020-02-27T17:24:54.480Z) Gehel On going test, this index will be cleaned up https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:41:35] ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9200 on cloudelastic1003 is CRITICAL: CRITICAL - ebernhardson_deleteme[1](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[4](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[4](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[3](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[7](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[7](2020-02-27T17:24:54.480Z), ebernhardson_de
[09:41:35] -27T17:24:54.480Z), ebernhardson_deleteme[5](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[6](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[6](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[2](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[0](2020-02-27T17:24:54.480Z) Gehel On going test, this index will be cleaned up https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:41:35] ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9200 on cloudelastic1004 is CRITICAL: CRITICAL - ebernhardson_deleteme[6](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[6](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[4](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[4](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[2](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[5](2020-02-27T17:24:54.480Z), ebernhardson_de
[09:41:36] -27T17:24:54.480Z), ebernhardson_deleteme[3](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[1](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[7](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[7](2020-02-27T17:24:54.480Z), ebernhardson_deleteme[0](2020-02-27T17:24:54.480Z) Gehel On going test, this index will be cleaned up https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:46:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1119 after upgrade T239791', diff saved to https://phabricator.wikimedia.org/P10573 and previous config saved to /var/cache/conftool/dbconfig/20200302-094633-marostegui.json
[09:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:38] T239791: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791
[09:50:47] !log powercycle an-worker1083 (no ssh, mgmt console available but tty not really usable, CPU soft lockups reported)
[09:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:08] RECOVERY - Host an-worker1083 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[09:51:37] !log START warm cache for db1111 & db1126 for Q6-8 million T219123 (pass 1)
[09:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:41] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[09:52:12] !log installing remaining curl security updates
[09:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:24] (CR) Volans: [C: -1] "Couple of details to improve, the rest seems reasonable." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/575603 (https://phabricator.wikimedia.org/T243927) (owner: CRusnov)
[09:58:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1119 after upgrade T239791', diff saved to https://phabricator.wikimedia.org/P10574 and previous config saved to /var/cache/conftool/dbconfig/20200302-095841-marostegui.json
[09:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:48] T239791: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791
[09:59:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase weight from 100 to 150 on db1111 T246447', diff saved to https://phabricator.wikimedia.org/P10575 and previous config saved to /var/cache/conftool/dbconfig/20200302-095921-marostegui.json
[09:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:26] T246447: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447
[10:16:38] PROBLEM - Disk space on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops
[10:16:50] PROBLEM - DPKG on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[10:17:00] PROBLEM - dhclient process on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[10:17:04] PROBLEM - Check systemd state on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:17:12] PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[10:18:09] sigh big processes ongoing
[10:18:48] RECOVERY - Disk space on stat1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops
[10:19:00] RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[10:19:10] RECOVERY - dhclient process on stat1007 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[10:19:16] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:19:24] RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[10:22:36] (PS1) Vgutierrez: install_server,lvs: Reimage lvs5002 with buster [puppet] - https://gerrit.wikimedia.org/r/575990 (https://phabricator.wikimedia.org/T245984)
[10:22:42] !log START warm cache for db1111 & db1126 for Q6-8 million T219123 (pass 2)
[10:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:48] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[10:26:57] (CR) Vgutierrez: [C: +2] "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1003/21177/" [puppet] - https://gerrit.wikimedia.org/r/575990 (https://phabricator.wikimedia.org/T245984) (owner: Vgutierrez)
[10:33:53] Is the "Ops Clinic Duty" info in the channel topic correct?
[10:34:36] (PS1) Marostegui: install_server: Allow manual reimage db109[6-9] [puppet] - https://gerrit.wikimedia.org/r/575991 (https://phabricator.wikimedia.org/T246604)
[10:34:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase weight from 150 to 200 on db1111 T246447', diff saved to https://phabricator.wikimedia.org/P10576 and previous config saved to /var/cache/conftool/dbconfig/20200302-103445-marostegui.json
[10:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:51] T246447: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447
[10:35:02] (PS1) Addshore: Read from the new term store up to Q8 million for clients [mediawiki-config] - https://gerrit.wikimedia.org/r/575992 (https://phabricator.wikimedia.org/T219123)
[10:35:44] !log reimage lvs5002 with buster - T245984
[10:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:54] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984
[10:36:31] (PS2) Marostegui: install_server: Allow manual reimage db109[6-9] [puppet] - https://gerrit.wikimedia.org/r/575991 (https://phabricator.wikimedia.org/T246604)
[10:36:36] (CR) Marostegui: "jcrespo I would appreciate a review of this just in case. I want to reimage db1096 (and the rest later) without formatting /srv/" [puppet] - https://gerrit.wikimedia.org/r/575991 (https://phabricator.wikimedia.org/T246604) (owner: Marostegui)
[10:36:47] Operations, Traffic, Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs5002.eqsin.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20...
[10:36:49] Operations, Citoid, Wikimedia-Logstash, observability, and 3 others: Move citoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219919 (fgiunchedi) Reminder/ping as we (SRE Observability) would like to deprecate all non-kafka inputs by end of Q4 FY19/20. If the service is mo...
[10:37:21] (CR) Filippo Giunchedi: [C: +1] cxserver: Remove logstash logging [deployment-charts] - https://gerrit.wikimedia.org/r/573240 (https://phabricator.wikimedia.org/T219921) (owner: Alexandros Kosiaris)
[10:37:32] jouncebot: now
[10:37:32] No deployments scheduled for the next 0 hour(s) and 52 minute(s)
[10:37:58] Operations, Analytics, User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (MoritzMuehlenhoff) Looks good to me, what does the "Set the above two groups as admin groups for the stat100x roles." refer to? There are three groups...
[10:38:05] Operations, Proton, Wikimedia-Logstash, observability, and 4 others: Move proton logging to new logging pipeline - https://phabricator.wikimedia.org/T219925 (fgiunchedi) Reminder/ping as we (SRE Observability) would like to deprecate all non-kafka inputs by end of Q4 FY19/20. If the service is mo...
[10:38:23] Operations, Maps, Wikimedia-Logstash, observability, and 4 others: Move kartotherian/tilerator logging to new logging pipeline - https://phabricator.wikimedia.org/T222377 (fgiunchedi) Reminder/ping as we (SRE Observability) would like to deprecate all non-kafka inputs by end of Q4 FY19/20. If the...
[10:39:00] Operations, Analytics, User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (elukey) >>! In T246578#5932003, @MoritzMuehlenhoff wrote: > Looks good to me, what does the "Set the above two groups as admin groups for the stat100x...
[10:39:10] Operations, Wikifeeds, Wikimedia-Logstash, observability: Move wikifeeds to the logging pipeline - https://phabricator.wikimedia.org/T245604 (fgiunchedi) Reminder/ping as we (SRE Observability) would like to deprecate all non-kafka inputs by end of Q4 FY19/20. If the service is moving (or has mov...
[10:39:15] Operations, Analytics, User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (elukey)
[10:39:33] Operations, Release-Engineering-Team, serviceops: mcrouter proxies and scap proxies - https://phabricator.wikimedia.org/T245841 (Joe) >>! In T245841#5922561, @jijiki wrote: >>>! In T245841#5919699, @Joe wrote: >> >> What would having all scap proxies also be mcrouter proxies change in terms of the s...
[10:46:05] 10Operations, 10Wikimedia-Logstash, 10observability: Move iegreview from udp2log to syslog - https://phabricator.wikimedia.org/T215497 (10fgiunchedi) In terms of implementation, iegreview uses monolog, so an approach similar/equal to https://gerrit.wikimedia.org/r/c/mediawiki/core/+/477791 would work [10:46:14] 10Operations, 10Wikimedia-Logstash, 10observability: Move wikimania-scholarships from udp2log to syslog - https://phabricator.wikimedia.org/T215499 (10fgiunchedi) In terms of implementation, wikimania-scholarships uses monolog, so an approach similar/equal to https://gerrit.wikimedia.org/r/c/mediawiki/core/+... [10:52:08] (03CR) 10Addshore: [C: 03+2] Read from the new term store up to Q8 million for clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575992 (https://phabricator.wikimedia.org/T219123) (owner: 10Addshore) [10:52:14] * addshore will deploy ^^ shortly [10:52:26] * addshore continues moving the lever [10:53:13] (03Merged) 10jenkins-bot: Read from the new term store up to Q8 million for clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575992 (https://phabricator.wikimedia.org/T219123) (owner: 10Addshore) [10:53:23] 10Operations, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Rename multimedia-team to structured-data-team - https://phabricator.wikimedia.org/T235550 (10Aklapper) @jbond: My guess is you're on clinic duty this week (not sure if the IRC channel topic is correct though)? (A bit hard to find out, see T244266#... [10:55:00] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me. 
One thing that is currently also not puppetised is the repository key used to sign apt.wikimedia.org; see "gpg --list-s" [puppet] - 10https://gerrit.wikimedia.org/r/575638 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [10:55:41] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Reading up to Q8M for the new term store for clients (was Q6M) + warm db1126 & db1111 caches (T219123) (duration: 00m 58s) [10:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:46] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [11:00:02] addshore: done with deployment? I plan to update cxserver.. [11:00:11] kart_: yup, go for it! [11:00:28] cool. [11:00:51] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2020-02-28-043702-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/575817 (https://phabricator.wikimedia.org/T246319) (owner: 10KartikMistry) [11:01:11] (03Merged) 10jenkins-bot: Update cxserver to 2020-02-28-043702-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/575817 (https://phabricator.wikimedia.org/T246319) (owner: 10KartikMistry) [11:01:37] !log START warm cache for db1111 & db1126 for Q8-10 million T219123 (pass 1) [11:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:46] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [11:02:21] (03PS1) 10Jbond: javascript headers: the presence of this file causes in valid script loads [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575995 [11:02:38] !log kartik@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . 
[11:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:22] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [11:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:28] !log kartik@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' . [11:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:40] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:27] (03PS3) 10Giuseppe Lavagetto: prometheus::ops: collect envoy stats from all servers [puppet] - 10https://gerrit.wikimedia.org/r/575504 [11:07:28] !log kartik@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' . [11:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:08] !log Update cxserver to 2020-02-28-043702-production (T246319) [11:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:12] T246319: Enable Google Translate support in Content Translation for Kinyarwanda, Odia, Tatar, Turkmen and Uyghur - https://phabricator.wikimedia.org/T246319 [11:17:31] (03CR) 10Jbond: [V: 03+2 C: 03+2] javascript headers: the presence of this file causes in valid script loads [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/575995 (owner: 10Jbond) [11:17:43] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs5002.eqsin.wmnet'] ` and were **ALL** successful. 
[11:19:27] (03PS1) 10Vgutierrez: lvs: Re-enable BGP in lvs5002 [puppet] - 10https://gerrit.wikimedia.org/r/575997 (https://phabricator.wikimedia.org/T245984) [11:19:37] (03PS1) 10Elukey: Allow analytics client nodes to set user.slice global limits [puppet] - 10https://gerrit.wikimedia.org/r/575998 [11:21:20] (03PS1) 10Muehlenhoff: Add root's home to the backup for apt* [puppet] - 10https://gerrit.wikimedia.org/r/575999 [11:30:04] jan_drewniak: My dear minions, it's time we take the moon! Just kidding. Time for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200302T1130). [11:31:54] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576004 (https://phabricator.wikimedia.org/T128546) [11:34:33] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576004 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:36:28] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576004 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:38:32] (03CR) 10Jbond: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/575220 (https://phabricator.wikimedia.org/T246327) (owner: 10Jbond) [11:38:37] (03CR) 10Jbond: [C: 03+2] puppetmaster: enable strict_hostname_checking[1] [puppet] - 10https://gerrit.wikimedia.org/r/575220 (https://phabricator.wikimedia.org/T246327) (owner: 10Jbond) [11:39:11] !log enable strict_hostname_checking on the puppet masters https://gerrit.wikimedia.org/r/c/operations/puppet/+/575220 [11:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:07] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:576004| Bumping portals to master (563985)]] (duration: 00m 57s) [11:40:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:05] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:576004| Bumping portals to master (563985)]] (duration: 00m 57s) [11:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:24] (03PS2) 10Giuseppe Lavagetto: ProductionServices: switch search to use envoy instead of nginx [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575268 (https://phabricator.wikimedia.org/T244843) [11:43:26] (03PS2) 10Giuseppe Lavagetto: ProductionServices: use local http proxy for parsoid, parsoidphp [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575269 (https://phabricator.wikimedia.org/T244843) [11:43:27] (03PS2) 10Giuseppe Lavagetto: ProductionServices: use the local proxy for sessionstore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575270 (https://phabricator.wikimedia.org/T244843) [11:43:29] (03PS1) 10Giuseppe Lavagetto: ProductionServices: use envoy to connect to mathoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576007 (https://phabricator.wikimedia.org/T244843) [11:43:32] (03PS1) 10Giuseppe Lavagetto: ProductionServices:switch eventgate-analytics to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576008 (https://phabricator.wikimedia.org/T244843) [11:43:34] (03PS1) 10Giuseppe Lavagetto: ProductionServices: switch eventgate-main to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576009 (https://phabricator.wikimedia.org/T244843) [11:44:31] !log START warm cache for db1111 & db1126 for Q8-10 million T219123 (pass 2) [11:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:37] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [11:46:01] (03CR) 10Volans: "> Patch Set 1: Verified+2" [dns] - 10https://gerrit.wikimedia.org/r/575650 (owner: 10CRusnov) [11:46:26] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki: 
install phpdbg on mwdebg hosts. [puppet] - 10https://gerrit.wikimedia.org/r/573333 (https://phabricator.wikimedia.org/T244549) (owner: 10Hnowlan) [11:48:13] (03CR) 10Vgutierrez: [C: 03+2] lvs: Re-enable BGP in lvs5002 [puppet] - 10https://gerrit.wikimedia.org/r/575997 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [11:49:40] 10Operations, 10Puppet, 10Patch-For-Review: Enable strict_hostname_checking on our Puppet nodes - https://phabricator.wikimedia.org/T246327 (10jbond) 05Open→03Resolved This setting has been enabled [11:52:31] jouncebot: next [11:52:31] In 0 hour(s) and 7 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200302T1200) [11:53:00] (03CR) 10Jbond: "lgtm but also an optional nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/575536 (owner: 10Muehlenhoff) [11:54:05] !log enable BGP in lvs5002 - T245984 [11:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:10] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [11:54:58] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10Vgutierrez) [11:56:23] (03CR) 10Muehlenhoff: Add logstash-next IDP service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/575536 (owner: 10Muehlenhoff) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200302T1200). [12:00:04] tarrow and _joe_: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[12:00:18] (03PS2) 10Muehlenhoff: Amend logstash IDP service for logstash-next [puppet] - 10https://gerrit.wikimedia.org/r/575536 [12:00:19] o/ [12:00:22] * Urbanecm here [12:00:26] * _joe_ here [12:00:28] hey [12:01:12] Wanna do _joe_'s first? I'm currently wondering if we can put off my backport [12:01:22] _joe_: want to self-deploy? [12:01:35] <_joe_> Urbanecm: yes [12:01:44] so go ahead, please 🙂 [12:01:48] <_joe_> happy to, I have to do some rather lenghty verifications too [12:01:55] great! [12:02:12] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575268 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [12:02:22] tarrow: unsure what "put off" means - are you saying it shouldn't be deployed now? [12:03:13] (03Merged) 10jenkins-bot: ProductionServices: switch search to use envoy instead of nginx [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575268 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [12:03:25] Urbanecm: yep; had a last minute thought that perhaps it would be better left for the train [12:03:25] <_joe_> wow that was fast jenkins [12:03:54] tarrow: okay, remve it from the calendar then please :-) [12:04:19] <_joe_> testing on mwdebug1001 [12:04:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] nova: remove the custom scheduler pool filter [puppet] - 10https://gerrit.wikimedia.org/r/575541 (https://phabricator.wikimedia.org/T226731) (owner: 10Andrew Bogott) [12:04:51] Urbanecm: I think I shall; might SWAT it this evening if it turns out to be more urgent [12:05:00] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] codfw1dev: Register our bastion hosts [puppet] - 10https://gerrit.wikimedia.org/r/575737 (owner: 10Alex Monk) [12:05:01] ack [12:06:08] its basically a patch with a two file dependent change and I'm worried about the possible flurry of errors when syncing. 
Not really sure if can be well split though [12:06:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] nova: remove some obsolete config files [puppet] - 10https://gerrit.wikimedia.org/r/574566 (owner: 10Andrew Bogott) [12:06:30] (03PS2) 10Jbond: Fix some incorrect uses of the lookup function [puppet] - 10https://gerrit.wikimedia.org/r/575718 (owner: 10Alex Monk) [12:06:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Openstack: add 'queens' config files [puppet] - 10https://gerrit.wikimedia.org/r/575083 (owner: 10Andrew Bogott) [12:07:04] <_joe_> ok, deploying I guess [12:09:05] !log oblivian@deploy1001 Synchronized wmf-config/ProductionServices.php: Switch search to use envoy as a proxy (duration: 00m 56s) [12:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:16] (03PS1) 10Urbanecm: Throttle rule for Czech Wikigap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576012 (https://phabricator.wikimedia.org/T246356) [12:09:28] (03PS2) 10Jbond: Move more eqiad1.yaml hieradata to cloud.yaml [puppet] - 10https://gerrit.wikimedia.org/r/575719 (https://phabricator.wikimedia.org/T242607) (owner: 10Alex Monk) [12:10:08] (03CR) 10Jbond: [C: 03+2] "Thanks alex" [puppet] - 10https://gerrit.wikimedia.org/r/575718 (owner: 10Alex Monk) [12:10:35] (03CR) 10jerkins-bot: [V: 04-1] Throttle rule for Czech Wikigap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576012 (https://phabricator.wikimedia.org/T246356) (owner: 10Urbanecm) [12:11:29] (03PS2) 10Urbanecm: Throttle rule for Czech Wikigap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576012 (https://phabricator.wikimedia.org/T246356) [12:11:44] _joe_: are you still deploying? 
[12:12:00] <_joe_> Urbanecm: no I'm done [12:12:07] <_joe_> search seems to work on the wikis [12:12:12] <_joe_> so I guess I didn't break them [12:12:15] thanks [12:12:36] (03PS3) 10Urbanecm: Set cswiki and cywiki to use custom minerva logo again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575692 (https://phabricator.wikimedia.org/T246535) [12:12:46] (03CR) 10Urbanecm: [C: 03+2] Set cswiki and cywiki to use custom minerva logo again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575692 (https://phabricator.wikimedia.org/T246535) (owner: 10Urbanecm) [12:13:44] (03Merged) 10jenkins-bot: Set cswiki and cywiki to use custom minerva logo again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575692 (https://phabricator.wikimedia.org/T246535) (owner: 10Urbanecm) [12:14:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. Small comment inlined." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/575271 (https://phabricator.wikimedia.org/T245174) (owner: 10Andrew Bogott) [12:14:34] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/575536 (owner: 10Muehlenhoff) [12:15:18] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 8280f81: Set cswiki and cywiki to use custom minerva logo again (T246535) (duration: 00m 58s) [12:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:22] T246535: Minerva logo at cswiki was deleted - https://phabricator.wikimedia.org/T246535 [12:16:23] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 8280f81: Set cswiki and cywiki to use custom minerva logo again (T246535): take II (duration: 00m 57s) [12:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:12] 10Operations, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Rename multimedia-team to structured-data-team - https://phabricator.wikimedia.org/T235550 (10jbond) @Aklapper i'm unsure who was on clinic duty last 
week (it wasn't me and the irc topic was not updated) however i think Daniel tackled a lot of the... [12:24:24] (03CR) 10Muehlenhoff: [C: 03+2] Amend logstash IDP service for logstash-next [puppet] - 10https://gerrit.wikimedia.org/r/575536 (owner: 10Muehlenhoff) [12:28:52] (03CR) 10Arturo Borrero Gonzalez: "LGTM in general, small comment inlined." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/575744 (https://phabricator.wikimedia.org/T242607) (owner: 10Alex Monk) [12:29:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Move more eqiad1.yaml hieradata to cloud.yaml [puppet] - 10https://gerrit.wikimedia.org/r/575719 (https://phabricator.wikimedia.org/T242607) (owner: 10Alex Monk) [12:31:02] (03PS1) 10Addshore: Read from the new term store up to Q10 million for clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576020 (https://phabricator.wikimedia.org/T219123) [12:33:33] Urbanecm: is that swat done? :) [12:33:36] (03PS5) 10Matěj Suchánek: Synchronize and fix DisableQueryPageUpdate for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573969 [12:33:38] (03CR) 10Volans: "LGTM for the current approach, couple of nits and a big question inline." (033 comments) [dns] - 10https://gerrit.wikimedia.org/r/569340 (https://phabricator.wikimedia.org/T243362) (owner: 10CRusnov) [12:34:01] addshore: go ahead if you have sth [12:34:21] (03CR) 10Volans: "> Patch Set 1:" [dns] - 10https://gerrit.wikimedia.org/r/575650 (owner: 10CRusnov) [12:35:17] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:35:29] 10Operations, 10cloud-services-team: Move labtestpuppetmaster2001 to Puppet 5 - https://phabricator.wikimedia.org/T246655 (10MoritzMuehlenhoff) [12:37:27] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:38:09] addshore: are you working on something, or might i do a deploy? [12:40:01] Urbanecm: you go ahead :) [12:40:06] thank you [12:40:11] sorry, stepped away for a moment, but I have one for after you! :) [12:40:22] (03PS5) 10Hnowlan: mediawiki: install phpdbg on mwdebg hosts. [puppet] - 10https://gerrit.wikimedia.org/r/573333 (https://phabricator.wikimedia.org/T244549) [12:40:37] 10Operations, 10cloud-services-team (Kanban): Migrate Cloud VPS to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10MoritzMuehlenhoff) >>! In T241719#5929960, @Krenair wrote: > Maybe I was too quick to merge this and instead this is `Migrate all Cloud VPS puppetmasters to Puppet 5 / facter 3`... [12:40:57] okay [12:41:58] _joe_: https://wikitech.wikimedia.org/w/index.php?search=Perform+security+fixes&title=Special%3ASearch&go=Go&ns0=1&ns12=1&ns116=1&ns498=1, might that be related? [12:42:02] (to what you deployed) [12:42:10] `An error has occurred while searching: We could not complete your search due to a temporary problem. 
Please try again later.` [12:42:20] <_joe_> Urbanecm: oh meh [12:42:23] <_joe_> damn wikitech [12:42:38] <_joe_> Urbanecm: yes, I was convinced wikitech had its own unicorn settings [12:42:47] <_joe_> I'll fix wikitech [12:42:57] <_joe_> no reason to rollback imho [12:43:12] thanks [12:48:19] 10Operations, 10cloud-services-team: Move labtestpuppetmaster2001 to Puppet 5 - https://phabricator.wikimedia.org/T246655 (10Krenair) We should just complete {T242607} instead [12:49:41] (03PS2) 10Cmjohnson: Adding mgmt dns for mw185-1413 [dns] - 10https://gerrit.wikimedia.org/r/574785 (https://phabricator.wikimedia.org/T241849) [12:50:04] (03CR) 10jerkins-bot: [V: 04-1] Adding mgmt dns for mw185-1413 [dns] - 10https://gerrit.wikimedia.org/r/574785 (https://phabricator.wikimedia.org/T241849) (owner: 10Cmjohnson) [12:50:38] 10Operations, 10cloud-services-team: Move labtestpuppetmaster2001 to Puppet 5 - https://phabricator.wikimedia.org/T246655 (10MoritzMuehlenhoff) Ah nice, I wasn't aware of that task. [12:50:53] Am I okay to do my change? 8 mill to 10 mill terms reads? :) [12:51:27] not yet please [12:51:31] ack! 
[12:51:37] (03PS3) 10Cmjohnson: Adding mgmt dns for mw185-1413 [dns] - 10https://gerrit.wikimedia.org/r/574785 (https://phabricator.wikimedia.org/T241849) [12:51:58] (03PS1) 10Giuseppe Lavagetto: profile::services_proxy: enable envoy everywhere [puppet] - 10https://gerrit.wikimedia.org/r/576024 [12:52:00] (03CR) 10jerkins-bot: [V: 04-1] Adding mgmt dns for mw185-1413 [dns] - 10https://gerrit.wikimedia.org/r/574785 (https://phabricator.wikimedia.org/T241849) (owner: 10Cmjohnson) [12:52:31] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::services_proxy: enable envoy everywhere [puppet] - 10https://gerrit.wikimedia.org/r/576024 (owner: 10Giuseppe Lavagetto) [12:53:23] (03PS4) 10Cmjohnson: Adding mgmt dns for mw185-1413 [dns] - 10https://gerrit.wikimedia.org/r/574785 (https://phabricator.wikimedia.org/T241849) [12:53:48] (03CR) 10jerkins-bot: [V: 04-1] Adding mgmt dns for mw185-1413 [dns] - 10https://gerrit.wikimedia.org/r/574785 (https://phabricator.wikimedia.org/T241849) (owner: 10Cmjohnson) [12:53:58] <_joe_> come on jenkins [12:55:08] <_joe_> ok enough, I'm fixing a problem [12:55:12] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] profile::services_proxy: enable envoy everywhere [puppet] - 10https://gerrit.wikimedia.org/r/576024 (owner: 10Giuseppe Lavagetto) [12:55:18] (03PS5) 10Cmjohnson: Adding mgmt dns for mw185-1413 [dns] - 10https://gerrit.wikimedia.org/r/574785 (https://phabricator.wikimedia.org/T241849) [12:55:45] (03CR) 10jerkins-bot: [V: 04-1] Adding mgmt dns for mw185-1413 [dns] - 10https://gerrit.wikimedia.org/r/574785 (https://phabricator.wikimedia.org/T241849) (owner: 10Cmjohnson) [12:56:05] (03CR) 10Jbond: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [12:57:05] (03CR) 10Jbond: "I think we can move forward with this" [puppet] - 10https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [12:58:38] !log Deploy security fix for T229731 [13:01:39] (03PS6) 10Cmjohnson: Adding mgmt dns for mw185-1413 [dns] - 10https://gerrit.wikimedia.org/r/574785 (https://phabricator.wikimedia.org/T241849) [13:02:02] <_joe_> Urbanecm: wikitech search is back [13:02:07] addshore: thanks [13:02:10] (03PS7) 10Cmjohnson: Adding mgmt dns for mw185-1413 [dns] - 10https://gerrit.wikimedia.org/r/574785 (https://phabricator.wikimedia.org/T241849) [13:02:19] Urbanecm: my go? :) [13:03:30] (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns for mw185-1413 [dns] - 10https://gerrit.wikimedia.org/r/574785 (https://phabricator.wikimedia.org/T241849) (owner: 10Cmjohnson) [13:03:56] (03CR) 10Addshore: [C: 03+2] Read from the new term store up to Q10 million for clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576020 (https://phabricator.wikimedia.org/T219123) (owner: 10Addshore) [13:05:02] (03Merged) 10jenkins-bot: Read from the new term store up to Q10 million for clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576020 (https://phabricator.wikimedia.org/T219123) (owner: 10Addshore) [13:05:11] addshore: last sync :( [13:05:18] ack, np :P [13:05:41] done for good [13:05:52] sweet! 
*starts his* [13:06:25] syncing [13:07:18] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Reading up to Q10M for the new term store for clients (was Q8M) + warm db1126 & db1111 caches (T219123) (duration: 00m 56s) [13:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:23] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [13:08:22] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Reading up to Q10M for the new term store for clients (was Q8M) + warm db1126 & db1111 caches (T219123) cache bust (duration: 00m 55s) [13:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:38] sweet [13:11:20] !log START warm cache for db1111 & db1126 for Q10-12 million T219123 (pass 1) [13:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:29] (03PS1) 10KartikMistry: Update cxserver to 2020-03-02-115344-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/576027 [13:18:25] !log roll restart Hadoop master daemons on an-master100[1,2] for openjdk upgrades [13:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:16] (03CR) 10Muehlenhoff: Adapt cross-validate-accounts for system users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/575141 (https://phabricator.wikimedia.org/T235161) (owner: 10Muehlenhoff) [13:20:42] (03PS2) 10Muehlenhoff: Adapt cross-validate-accounts for system users [puppet] - 10https://gerrit.wikimedia.org/r/575141 (https://phabricator.wikimedia.org/T235161) [13:22:53] SWAT done Urbanecm ? [13:23:04] 10Operations, 10cloud-services-team (Kanban): Migrate Cloud VPS to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10Krenair) Sure, let's make this the tracking task? [13:23:06] addshore: also, let me know when I can again deploy :) [13:23:18] kart_: go for it! 
[13:23:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "hope this works for you! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/575998 (owner: 10Elukey) [13:23:29] cool. Need to revert some broken stuff. [13:24:26] (03PS2) 10KartikMistry: Update cxserver to 2020-03-02-115344-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/576027 [13:24:38] (03CR) 10Arturo Borrero Gonzalez: Openstack: add apt definitions for version 'queens' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/575084 (https://phabricator.wikimedia.org/T246287) (owner: 10Andrew Bogott) [13:25:30] (03CR) 10KartikMistry: [C: 03+2] "Revert of 46a4498 (cxserver) as it is not yet deployed in Cloud API." [deployment-charts] - 10https://gerrit.wikimedia.org/r/576027 (owner: 10KartikMistry) [13:25:48] (03Merged) 10jenkins-bot: Update cxserver to 2020-03-02-115344-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/576027 (owner: 10KartikMistry) [13:26:49] !log kartik@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [13:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:11] !log kartik@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' . [13:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:45] !log kartik@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' . 
[13:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:58] (03PS1) 10Muehlenhoff: Adapt offboarding script for system users [puppet] - 10https://gerrit.wikimedia.org/r/576030 (https://phabricator.wikimedia.org/T235161) [13:33:28] (03PS1) 10Vgutierrez: install_server,lvs: Reimage lvs5001 with buster [puppet] - 10https://gerrit.wikimedia.org/r/576031 (https://phabricator.wikimedia.org/T245984) [13:33:33] !log Update cxserver to 2020-03-02-115344-production: Reverting T246319 [13:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:38] T246319: Enable Google Translate support in Content Translation for Kinyarwanda, Odia, Tatar, Turkmen and Uyghur - https://phabricator.wikimedia.org/T246319 [13:35:16] 10Operations, 10netbox: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10jbond) > There may be issues with groups in this scenario. @jbond Presumably the apache module sets group headers? Can you fill us in a bit on that, so that we can ask upstream to support it. Currently CAS sets a... 
[13:38:36] (03CR) 10Vgutierrez: [C: 03+2] "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1001/21180/" [puppet] - 10https://gerrit.wikimedia.org/r/576031 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [13:41:30] (03CR) 10Jbond: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/575141 (https://phabricator.wikimedia.org/T235161) (owner: 10Muehlenhoff) [13:42:40] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/576030 (https://phabricator.wikimedia.org/T235161) (owner: 10Muehlenhoff) [13:44:22] (03PS2) 10Muehlenhoff: Adapt offboarding script for system users [puppet] - 10https://gerrit.wikimedia.org/r/576030 (https://phabricator.wikimedia.org/T235161) [13:46:04] (03CR) 10Muehlenhoff: [C: 03+2] Adapt offboarding script for system users [puppet] - 10https://gerrit.wikimedia.org/r/576030 (https://phabricator.wikimedia.org/T235161) (owner: 10Muehlenhoff) [13:48:15] !log reimage lvs5001 with buster - T245984 [13:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:24] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [13:48:34] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs5001.eqsin.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20... 
[13:48:57] (03CR) 10Muehlenhoff: Adapt cross-validate-accounts for system users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/575141 (https://phabricator.wikimedia.org/T235161) (owner: 10Muehlenhoff) [13:51:17] (03CR) 10Vgutierrez: [C: 03+2] ATS: Switch unified cert vendor to Let's Encrypt on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/575305 (https://phabricator.wikimedia.org/T230687) (owner: 10Vgutierrez) [13:51:36] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/575141 (https://phabricator.wikimedia.org/T235161) (owner: 10Muehlenhoff) [13:53:33] !log Switch from globalsign to LE as unified cert vendor on cp4026 - T230687 [13:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:38] T230687: Decide/document criteria needed to serve acme-chief LE issued unified certificate to end users - https://phabricator.wikimedia.org/T230687 [13:53:49] (03CR) 10Muehlenhoff: [C: 03+2] Adapt cross-validate-accounts for system users [puppet] - 10https://gerrit.wikimedia.org/r/575141 (https://phabricator.wikimedia.org/T235161) (owner: 10Muehlenhoff) [13:53:55] 10Operations, 10Wikidata, 10Wikidata-Query-Service: WDQS Categories update lag alert - https://phabricator.wikimedia.org/T246497 (10dcausse) Running the check manually the alert seems gone, unfortunately checking the graph I cannot see any stale data that's still there that may have caused this alert. Next t... [13:55:38] !log Switch from globalsign to LE as unified cert vendor on ulsfo - T230687 [13:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:27] 10Operations, 10netbox: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10Volans) Yeah I agree `HTTP_X_CAS_MEMBEROF` has all the info needed. >>! In T244849#5932495, @jbond wrote: > Who are the non-SRE consumers? Currently anyone in WMF has read only access to Netbox. As for usual cons... 
[13:58:48] !log START warm cache for db1111 & db1126 for Q10-12 million T219123 (pass 2) [13:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:53] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [14:01:15] (03PS1) 10Vgutierrez: lvs: Add missing mapping of lvs2008 as high-traffic2 [puppet] - 10https://gerrit.wikimedia.org/r/576043 (https://phabricator.wikimedia.org/T196560) [14:03:15] RECOVERY - Categories update lag on wdqs1009 is OK: OK - Categories lag: 9:03:13.835921 https://wikitech.wikimedia.org/wiki/Wikidata_query_service [14:04:29] 10Operations, 10Acme-chief, 10Traffic, 10Patch-For-Review: Decide/document criteria needed to serve acme-chief LE issued unified certificate to end users - https://phabricator.wikimedia.org/T230687 (10Vgutierrez) ` $ openssl s_client -connect upload-lb.ulsfo.wikimedia.org:443 2>&1 < /dev/null |openssl x509... [14:05:07] !log update puppet compiler facts [14:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:23] 10Operations, 10Analytics, 10User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10nshahquinn-wmf) Ah, sorry for commenting on an outdated version! >>! In T243934#5932193, @elukey wrote: >>>! In T243934#5932167, @nshahquinn-wmf wrote... 
[14:12:53] (03CR) 10Vgutierrez: [C: 03+2] "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1001/21182/" [puppet] - 10https://gerrit.wikimedia.org/r/576043 (https://phabricator.wikimedia.org/T196560) (owner: 10Vgutierrez) [14:15:07] PROBLEM - PyBal connections to etcd on lvs2008 is CRITICAL: CRITICAL: 0 connections established with conf2001.codfw.wmnet:2379 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [14:15:11] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [14:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:32] !log START warm cache for db1111 & db1126 for Q10-12 million T219123 (pass 3) [14:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:37] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [14:17:46] jouncebot: next [14:17:46] In 3 hour(s) and 42 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200302T1800) [14:19:02] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:34] (03PS1) 10Addshore: Read from the new term store up to Q12 mill everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576044 (https://phabricator.wikimedia.org/T219123) [14:20:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase weight from 200 to 250 on db1111 T246447', diff saved to https://phabricator.wikimedia.org/P10577 and previous config saved to /var/cache/conftool/dbconfig/20200302-142017-marostegui.json [14:20:21] RECOVERY - PyBal connections to etcd on lvs2008 is OK: OK: 8 connections established with conf2001.codfw.wmnet:2379 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [14:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:22] T246447: Move db1111 from test-s4 to s8 - 
https://phabricator.wikimedia.org/T246447 [14:20:23] 10Operations, 10Analytics, 10User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10elukey) >>! In T246578#5932587, @nshahquinn-wmf wrote: > Ah, sorry for commenting on an outdated version! > >>>! In T243934#5932193, @elukey wrote: >>... [14:22:14] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Abban Dunne to the ldap/wmde group - https://phabricator.wikimedia.org/T246664 (10AbbanWMDE) [14:23:33] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={swagger_check_citoid_cluster_eqiad,swagger_check_cxserver_cluster_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:25:53] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:26:26] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2005.codfw.wmnet - https://phabricator.wikimedia.org/T246666 (10Vgutierrez) [14:27:23] (03PS1) 10Vgutierrez: install_server,lvs: Decommission lvs2005.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/576046 (https://phabricator.wikimedia.org/T246666) [14:27:46] (03CR) 10jerkins-bot: [V: 04-1] install_server,lvs: Decommission lvs2005.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/576046 (https://phabricator.wikimedia.org/T246666) (owner: 10Vgutierrez) [14:29:16] (03PS2) 10Vgutierrez: install_server,lvs: Decommission lvs2005.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/576046 (https://phabricator.wikimedia.org/T246666) [14:30:08] 10Operations, 10Traffic: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` 
['lvs5001.eqsin.wmnet'] ` and were **ALL** successful. [14:31:15] (03PS5) 10Jbond: realm: rename labs realm to cloud [puppet] - 10https://gerrit.wikimedia.org/r/572626 (https://phabricator.wikimedia.org/T244222) (owner: 10Arturo Borrero Gonzalez) [14:33:53] Right, time for Q12 million [14:34:01] (03CR) 10Vgutierrez: [C: 03+2] "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1003/21183/" [puppet] - 10https://gerrit.wikimedia.org/r/576046 (https://phabricator.wikimedia.org/T246666) (owner: 10Vgutierrez) [14:34:06] (03CR) 10Addshore: [C: 03+2] Read from the new term store up to Q12 mill everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576044 (https://phabricator.wikimedia.org/T219123) (owner: 10Addshore) [14:35:47] (03Merged) 10jenkins-bot: Read from the new term store up to Q12 mill everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576044 (https://phabricator.wikimedia.org/T219123) (owner: 10Addshore) [14:35:53] (03PS6) 10Jbond: realm: rename labs realm to cloud [puppet] - 10https://gerrit.wikimedia.org/r/572626 (https://phabricator.wikimedia.org/T244222) (owner: 10Arturo Borrero Gonzalez) [14:36:36] !log running the decommission cookbook against lvs2005 - T246666 [14:36:40] !log vgutierrez@cumin2001 START - Cookbook sre.hosts.decommission [14:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:41] T246666: decommission lvs2005.codfw.wmnet - https://phabricator.wikimedia.org/T246666 [14:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:58] (03CR) 10Filippo Giunchedi: [C: 03+1] "Looks like this is about 1/10th of all metrics, still quite a lot but let's try." 
[puppet] - 10https://gerrit.wikimedia.org/r/575504 (owner: 10Giuseppe Lavagetto) [14:37:16] !log vgutierrez@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [14:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:22] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission lvs2005.codfw.wmnet - https://phabricator.wikimedia.org/T246666 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin2001 for hosts: `lvs2005.codfw.wmnet` - lvs2005.codfw.wmnet (**PASS**) - Downtime... [14:37:27] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Reading up to Q12M for the new term store everywhere (was Q10M) + warm db1126 & db1111 caches (T219123) (duration: 00m 58s) [14:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:32] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [14:38:18] (03CR) 10Jbond: "Have added a few templates that were missing and fixed the rspec tests.
Have added comments with questions" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/572626 (https://phabricator.wikimedia.org/T244222) (owner: 10Arturo Borrero Gonzalez) [14:38:33] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Reading up to Q12M for the new term store everywhere (was Q10M) + warm db1126 & db1111 caches (T219123) cache bust (duration: 00m 56s) [14:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:38] (03PS1) 10Vgutierrez: lvs: Re-enable BGP in lvs5001 [puppet] - 10https://gerrit.wikimedia.org/r/576047 (https://phabricator.wikimedia.org/T245984) [14:39:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give weight to es4 and es5 unused codfw slaves T246072', diff saved to https://phabricator.wikimedia.org/P10578 and previous config saved to /var/cache/conftool/dbconfig/20200302-143915-marostegui.json [14:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:19] T246072: Enable es4 and es5 as writable new external store sections - https://phabricator.wikimedia.org/T246072 [14:39:57] (03CR) 10Vgutierrez: [C: 03+2] lvs: Re-enable BGP in lvs5001 [puppet] - 10https://gerrit.wikimedia.org/r/576047 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [14:40:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give weight to es4 and es5 unused eqiad slaves T246072', diff saved to https://phabricator.wikimedia.org/P10579 and previous config saved to /var/cache/conftool/dbconfig/20200302-144033-marostegui.json [14:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:43] !log Re-enable BGP in lvs5001 - T245984 [14:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:48] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [14:44:20] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - 
https://phabricator.wikimedia.org/T245984 (10Vgutierrez) [14:45:55] (03PS1) 10Vgutierrez: Remove lvs2005 production entries [dns] - 10https://gerrit.wikimedia.org/r/576051 (https://phabricator.wikimedia.org/T246666) [14:46:54] (03CR) 10Vgutierrez: [C: 03+2] Remove lvs2005 production entries [dns] - 10https://gerrit.wikimedia.org/r/576051 (https://phabricator.wikimedia.org/T246666) (owner: 10Vgutierrez) [14:48:02] (03CR) 10Marostegui: [C: 04-2] "We will be pushing https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/574696/ and later this tomorrow to enable es4." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575016 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [14:49:35] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:50:39] (03PS1) 10Vgutierrez: install_server: Reimage lvs3007 with buster [puppet] - 10https://gerrit.wikimedia.org/r/576052 (https://phabricator.wikimedia.org/T245984) [14:51:00] (03CR) 10Muehlenhoff: [C: 03+1] "Worth a try" [puppet] - 10https://gerrit.wikimedia.org/r/575998 (owner: 10Elukey) [14:51:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase weight from 250 to 300 on db1111 T246447', diff saved to https://phabricator.wikimedia.org/P10581 and previous config saved to /var/cache/conftool/dbconfig/20200302-145130-marostegui.json [14:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:36] T246447: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447 [14:51:47] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:53:37] (03CR) 10Vgutierrez: [C: 03+2] install_server: Reimage lvs3007 with buster [puppet] - 10https://gerrit.wikimedia.org/r/576052 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [14:54:13] (03PS1) 10Muehlenhoff: Update partman recipes for lvs* [puppet] - 10https://gerrit.wikimedia.org/r/576054 [14:55:16] (03CR) 10Ottomata: [C: 03+1] kafka-dev: Updated API endpoint and added required selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/575612 (https://phabricator.wikimedia.org/T246501) (owner: 10Holger Knust) [14:55:23] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:58:05] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2004.codfw.wmnet - https://phabricator.wikimedia.org/T246669 (10Vgutierrez) [14:59:07] (03PS1) 10Vgutierrez: install_server,lvs: Decommission lvs2004.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/576056 (https://phabricator.wikimedia.org/T246669) [14:59:46] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission lvs2005.codfw.wmnet - https://phabricator.wikimedia.org/T246666 (10Vgutierrez) a:05Vgutierrez→03Papaul [15:00:55] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:02:48] (03CR) 10Filippo Giunchedi: [C: 03+1] Update partman recipes for lvs* [puppet] - 10https://gerrit.wikimedia.org/r/576054 (owner: 10Muehlenhoff) [15:03:44] (03PS1) 10Andrew Bogott: Openstack Queens: use mirrors.wikimedia.org for the Queens backport [puppet] - 10https://gerrit.wikimedia.org/r/576059 (https://phabricator.wikimedia.org/T246287) [15:04:25] (03PS2) 10Ottomata: Use new LVS port for eventgate-analytics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573307 (https://phabricator.wikimedia.org/T245203) [15:04:34] (03PS1) 10Krinkle: MWConfigCacheGenerator: Remove unused 'docRoot' wgConf placeholder variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576060 (https://phabricator.wikimedia.org/T223602) [15:06:03] (03Abandoned) 10Andrew Bogott: Revert "grafana: prevent Anonymous viewers from editing user settings" [puppet] - 10https://gerrit.wikimedia.org/r/575756 (owner: 10Andrew Bogott) [15:06:48] (03PS2) 10Krinkle: MWConfigCacheGenerator: Remove unused 'docRoot' wgConf placeholder variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576060 (https://phabricator.wikimedia.org/T223602) [15:07:18] (03CR) 10Vgutierrez: [C: 03+1] lvs: kibana-next: promote from "service_setup" to "lvs_setup" [puppet] - 10https://gerrit.wikimedia.org/r/575631 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [15:08:43] (03CR) 10Hnowlan: [C: 03+2] mediawiki: install phpdbg on mwdebg hosts. [puppet] - 10https://gerrit.wikimedia.org/r/573333 (https://phabricator.wikimedia.org/T244549) (owner: 10Hnowlan) [15:09:05] (03CR) 10Ottomata: [C: 03+2] Use new LVS port for eventgate-analytics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573307 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [15:09:16] (03PS6) 10Hnowlan: mediawiki: install phpdbg on mwdebg hosts. 
[puppet] - 10https://gerrit.wikimedia.org/r/573333 (https://phabricator.wikimedia.org/T244549) [15:10:38] (03CR) 10Muehlenhoff: [C: 03+2] Update partman recipes for lvs* [puppet] - 10https://gerrit.wikimedia.org/r/576054 (owner: 10Muehlenhoff) [15:10:43] PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:10:59] PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [15:11:17] PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [15:11:23] PROBLEM - Check size of conntrack table on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [15:11:29] PROBLEM - Check whether ferm is active by checking the default input chain on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:11:45] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:11:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase weight from 300 to 350 on db1111 T246447', diff saved to https://phabricator.wikimedia.org/P10582 and previous config saved to /var/cache/conftool/dbconfig/20200302-151149-marostegui.json [15:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:55] T246447: Move db1111 from test-s4 to s8 - 
https://phabricator.wikimedia.org/T246447 [15:12:01] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:33] PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [15:12:39] sigh [15:13:40] (03CR) 10Vgutierrez: [C: 03+2] "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1002/21184/" [puppet] - 10https://gerrit.wikimedia.org/r/576056 (https://phabricator.wikimedia.org/T246669) (owner: 10Vgutierrez) [15:14:07] (03CR) 10Herron: [C: 03+2] lvs: kibana-next: promote from "service_setup" to "lvs_setup" [puppet] - 10https://gerrit.wikimedia.org/r/575631 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [15:15:59] PROBLEM - puppet last run on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:16:39] RECOVERY - DPKG on notebook1003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:17:01] RECOVERY - Disk space on notebook1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [15:17:19] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [15:17:25] RECOVERY - Check size of conntrack table on notebook1003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [15:17:31] RECOVERY - Check whether ferm is active by checking the default input chain on notebook1003 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:17:47] 
RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:18:03] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:39] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [15:19:37] (03PS2) 10Vgutierrez: install_server,lvs: Decommission lvs2004.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/576056 (https://phabricator.wikimedia.org/T246669) [15:20:11] !log Use new LVS port for EventBus+monolog for eventgate-analytics - T245203 [15:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:16] T245203: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 [15:20:57] !log otto@deploy1001 Synchronized wmf-config/ProductionServices.php: Use new LVS port for EventBus+monolog for eventgate-analytics - T245203 (duration: 00m 56s) [15:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:12] 10Operations, 10ops-codfw, 10Core Platform Team Workboards (Clinic Duty Team): Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000 (10Papaul) @Eevans hey since this is complete can we close it? 
Thanks [15:22:09] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:22:37] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 68 connections established with conf1004.eqiad.wmnet:4001 (min=69) https://wikitech.wikimedia.org/wiki/PyBal [15:22:39] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 57 connections established with conf2001.codfw.wmnet:2379 (min=58) https://wikitech.wikimedia.org/wiki/PyBal [15:22:41] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.48:443]) https://wikitech.wikimedia.org/wiki/PyBal [15:22:45] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.48:443]) https://wikitech.wikimedia.org/wiki/PyBal [15:22:54] ^^ that's expected (herron) [15:23:01] phew [15:23:12] ack, thanks [15:23:19] (scary looking) [15:23:25] (03CR) 10Elukey: [C: 03+2] Allow analytics client nodes to set user.slice global limits [puppet] - 10https://gerrit.wikimedia.org/r/575998 (owner: 10Elukey) [15:23:27] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 100 connections established with conf1004.eqiad.wmnet:4001 (min=101) https://wikitech.wikimedia.org/wiki/PyBal [15:23:35] PROBLEM - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 77 connections established with conf2001.codfw.wmnet:2379 (min=78) https://wikitech.wikimedia.org/wiki/PyBal [15:23:40] (03CR) 10Hnowlan: [C: 03+1] kafka-dev: Updated API endpoint and added required selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/575612 (https://phabricator.wikimedia.org/T246501) (owner: 10Holger Knust) [15:23:43] herron: those will go away as soon as you restart pybal on the affected LVSs [15:24:13] (03CR) 10Vgutierrez: [C: 03+2] install_server,lvs: Decommission 
lvs2004.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/576056 (https://phabricator.wikimedia.org/T246669) (owner: 10Vgutierrez) [15:24:56] elukey: Elukey: Allow analytics client nodes to set user.slice global limits (b306892f76), may I merge that one? [15:25:05] vgutierrez: +1 <3 [15:26:23] !log running the decommission cookbook against lvs2004 - T246669 [15:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:28] T246669: decommission lvs2004.codfw.wmnet - https://phabricator.wikimedia.org/T246669 [15:26:36] !log vgutierrez@cumin2001 START - Cookbook sre.hosts.decommission [15:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:11] !log vgutierrez@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [15:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:18] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission lvs2004.codfw.wmnet - https://phabricator.wikimedia.org/T246669 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin2001 for hosts: `lvs2004.codfw.wmnet` - lvs2004.codfw.wmnet (**PASS**) - Downtime... [15:27:47] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 101 connections established with conf1004.eqiad.wmnet:4001 (min=101) https://wikitech.wikimedia.org/wiki/PyBal [15:27:47] RECOVERY - PyBal connections to etcd on lvs2010 is OK: OK: 78 connections established with conf2001.codfw.wmnet:2379 (min=78) https://wikitech.wikimedia.org/wiki/PyBal [15:29:33] (03PS1) 10Gehel: wdqs: added link to runbook entry for categories update lag. 
[puppet] - 10https://gerrit.wikimedia.org/r/576064 (https://phabricator.wikimedia.org/T246497) [15:30:42] !log reimage lvs3007 with buster - T245984 [15:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:49] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [15:30:57] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs3007.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20... [15:31:54] (03PS1) 10Vgutierrez: Remove lvs2004 production entries [dns] - 10https://gerrit.wikimedia.org/r/576065 (https://phabricator.wikimedia.org/T246669) [15:31:57] 10Operations, 10ops-codfw, 10Traffic: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) [15:32:45] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 69 connections established with conf1004.eqiad.wmnet:4001 (min=69) https://wikitech.wikimedia.org/wiki/PyBal [15:32:45] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 58 connections established with conf2001.codfw.wmnet:2379 (min=58) https://wikitech.wikimedia.org/wiki/PyBal [15:32:47] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:32:47] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:33:03] herron: ^^ nice :) [15:33:47] success! 
[15:34:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase weight from 350 to 400 on db1111 T246447', diff saved to https://phabricator.wikimedia.org/P10583 and previous config saved to /var/cache/conftool/dbconfig/20200302-153416-marostegui.json [15:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:21] T246447: Move db1111 from test-s4 to s8 - https://phabricator.wikimedia.org/T246447 [15:34:45] (03CR) 10Vgutierrez: [C: 03+2] Remove lvs2004 production entries [dns] - 10https://gerrit.wikimedia.org/r/576065 (https://phabricator.wikimedia.org/T246669) (owner: 10Vgutierrez) [15:35:41] (03PS1) 10Ottomata: Use new LVS port for eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576066 (https://phabricator.wikimedia.org/T245203) [15:36:29] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2004.codfw.wmnet - https://phabricator.wikimedia.org/T246669 (10Vgutierrez) a:05Vgutierrez→03Papaul [15:37:06] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10Vgutierrez) [15:38:34] (03PS3) 10KartikMistry: ContentTranslation: Add URL campaign for WikiGapFinder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575974 (https://phabricator.wikimedia.org/T246335) [15:40:05] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Abban Dunne to the ldap/wmde group - https://phabricator.wikimedia.org/T246664 (10Tobi_WMDE_SW) I can confirm that @AbbanWMDE is working as a software developer for Wikimedia Germany in one of the teams I'm responsible for. 
[15:43:40] (03PS1) 10Giuseppe Lavagetto: mediawiki: stop installing the nginx proxy on the canaries [puppet] - 10https://gerrit.wikimedia.org/r/576067 (https://phabricator.wikimedia.org/T244843) [15:46:03] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:46:50] (03CR) 10Andrew Bogott: [C: 03+2] nova: remove the custom scheduler pool filter [puppet] - 10https://gerrit.wikimedia.org/r/575541 (https://phabricator.wikimedia.org/T226731) (owner: 10Andrew Bogott) [15:48:06] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: stop installing the nginx proxy on the canaries [puppet] - 10https://gerrit.wikimedia.org/r/576067 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [15:49:46] (03PS2) 10Andrew Bogott: nova: remove some obsolete config files [puppet] - 10https://gerrit.wikimedia.org/r/574566 [15:49:48] (03PS2) 10Andrew Bogott: Openstack Queens: use mirrors.wikimedia.org for the Queens backport [puppet] - 10https://gerrit.wikimedia.org/r/576059 (https://phabricator.wikimedia.org/T246287) [15:50:18] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:51:54] (03PS1) 10Ssingh: First release of the cescout project [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/576070 [15:52:31] (03CR) 10jerkins-bot: [V: 04-1] First release of the cescout project [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/576070 (owner: 10Ssingh) [15:53:37] (03CR) 10Andrew Bogott: [C: 03+2] nova: remove some obsolete config files [puppet] - 10https://gerrit.wikimedia.org/r/574566 (owner: 10Andrew Bogott) [15:54:13] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - 
https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs3007.esams.wmnet'] ` Of which those **FAILED**: ` ['lvs3007.esams.wmnet'] ` [15:54:29] (03CR) 10RLazarus: [C: 03+2] Convert all the apache-fast-test URLs to httpbb tests. [puppet] - 10https://gerrit.wikimedia.org/r/573760 (owner: 10RLazarus) [15:54:53] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [15:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:14] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:25] (03PS1) 10Muehlenhoff: Bump max CAS session life time to a day [puppet] - 10https://gerrit.wikimedia.org/r/576072 [15:59:53] (03PS2) 10Ssingh: First release of the cescout project [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/576070 [15:59:58] (03PS1) 10Cmjohnson: Adding production dns for new mw1385-1413 [dns] - 10https://gerrit.wikimedia.org/r/576073 (https://phabricator.wikimedia.org/T241849) [16:01:00] (03CR) 10jerkins-bot: [V: 04-1] First release of the cescout project [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/576070 (owner: 10Ssingh) [16:01:20] 10Operations, 10ops-codfw, 10Traffic, 10netops: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` lvs2007.codfw.wmnet ` The log can be found in `/var/log/wmf-auto... 
[16:03:13] (03PS1) 10Krinkle: Move Defines.php mock from tests/ to src/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576074 [16:05:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:06:15] I really need to figure out how to match my tox configuration with the one on jenkins [16:06:45] sukhe: context? [16:07:27] (03PS1) 10Vgutierrez: lvs: rename lvs3007 NIC interfaces [puppet] - 10https://gerrit.wikimedia.org/r/576075 (https://phabricator.wikimedia.org/T245984) [16:07:41] FFS buster && predictable network interface NICs [16:07:51] (03PS2) 10Muehlenhoff: Bump max CAS session life time to a day [puppet] - 10https://gerrit.wikimedia.org/r/576072 [16:08:32] * Krinkle testing on mwdebug1001 [16:09:38] 10Operations, 10ops-codfw, 10Traffic, 10netops: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946 (10Papaul) 05Open→03Resolved Complete [16:09:41] 10Operations, 10ops-codfw, 10Traffic: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) [16:10:08] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:10:15] 10Operations, 10ops-codfw, 10Traffic: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) [16:10:18] PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref 
HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:10:43] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/21187/" [puppet] - 10https://gerrit.wikimedia.org/r/576072 (owner: 10Muehlenhoff) [16:11:04] 10Operations, 10ops-codfw, 10Traffic: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) [16:11:54] 10Operations, 10ops-codfw, 10Traffic: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) [16:12:12] PROBLEM - puppet last run on mw2246 is CRITICAL: CRITICAL: Puppet last ran 4 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:13:04] PROBLEM - puppet last run on mw2156 is CRITICAL: CRITICAL: Puppet last ran 4 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:13:14] (03PS3) 10Ssingh: First release of the cescout project [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/576070 [16:13:16] PROBLEM - puppet last run on mw2152 is CRITICAL: CRITICAL: Puppet last ran 4 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:13:16] PROBLEM - puppet last run on mw2154 is CRITICAL: CRITICAL: Puppet last ran 4 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:13:32] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:13:50] (03CR) 10jerkins-bot: [V: 04-1] First release of the cescout project [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/576070 (owner: 10Ssingh) [16:15:17] volans: when you have a chance, I added skipping_interpreters to force the 3.7 run ^ [16:15:36] PROBLEM - puppet last run on mw2248 is CRITICAL: CRITICAL: Puppet last ran 4 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:15:46] 10Operations, 10Analytics, 10User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10Ottomata) > Reduce the number of POSIX groups to: analtyics, analytics-wmde-users and analytics-privatedata Do you mean `analytics-users` and `analyti... [16:16:02] PROBLEM - puppet last run on mw2162 is CRITICAL: CRITICAL: Puppet last ran 4 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:16:44] PROBLEM - puppet last run on mw2280 is CRITICAL: CRITICAL: Puppet last ran 4 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:17:02] PROBLEM - puppet last run on mw2281 is CRITICAL: CRITICAL: Puppet last ran 4 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:17:18] PROBLEM - puppet last run on mw2265 is CRITICAL: CRITICAL: Puppet last ran 4 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:17:32] RECOVERY - Unmerged changes on repository puppet on labtestpuppetmaster2001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:17:34] (03PS3) 10Ppchelko: Add EventBus Run Job API permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572994 (https://phabricator.wikimedia.org/T244770) (owner: 10Clarakosi) [16:18:12] RECOVERY - puppet last run on mw2246 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:18:31] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [16:18:33] sukhe: ahh, I see what the issue is [16:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:47] the other envs apart from py37 don't have the base python set to 3.7 [16:18:56] so CI uses the default one, which is 3.5 on those VMs [16:19:00] 10Operations, 10Analytics, 10User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10elukey) >>! In T246578#5933122, @Ottomata wrote: >> Reduce the number of POSIX groups to: analtyics, analytics-wmde-users and analytics-privatedata >...
[16:19:00] :( [16:19:02] RECOVERY - puppet last run on mw2156 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:19:16] RECOVERY - puppet last run on mw2152 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:19:16] RECOVERY - puppet last run on mw2154 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:19:26] 10Operations, 10Analytics, 10User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10elukey) [16:19:41] (03CR) 10Vgutierrez: [C: 03+2] lvs: rename lvs3007 NIC interfaces [puppet] - 10https://gerrit.wikimedia.org/r/576075 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [16:19:56] RECOVERY - ElasticSearch unassigned shard check - 9200 on cloudelastic1001 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [16:19:56] RECOVERY - ElasticSearch unassigned shard check - 9200 on cloudelastic1002 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [16:19:56] RECOVERY - ElasticSearch unassigned shard check - 9200 on cloudelastic1004 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [16:19:56] RECOVERY - ElasticSearch unassigned shard check - 9200 on cloudelastic1003 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [16:19:57] (03PS2) 10Vgutierrez: lvs: rename lvs3007 NIC interfaces [puppet] - 10https://gerrit.wikimedia.org/r/576075 (https://phabricator.wikimedia.org/T245984) [16:20:29] (03PS3) 10Marostegui: install_server: Allow manual reimage db109[6-9] [puppet] - 10https://gerrit.wikimedia.org/r/575991 (https://phabricator.wikimedia.org/T246604) [16:20:32] !log 
installing netty-3.9 security updates [16:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:50] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:38] RECOVERY - puppet last run on mw2248 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:22:07] 10Operations, 10ops-codfw, 10Traffic, 10netops: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs2007.codfw.wmnet'] ` Of which those **FAILED**: ` ['lvs2007.codfw.wmnet'] ` [16:22:08] RECOVERY - puppet last run on mw2162 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:22:54] RECOVERY - puppet last run on mw2280 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:23:10] RECOVERY - puppet last run on mw2281 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:23:23] (03PS2) 10Ottomata: Use new LVS port for eventgate-main in group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576066 (https://phabricator.wikimedia.org/T245203) [16:23:28] RECOVERY - puppet last run on mw2265 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:23:34] (03PS1) 10Giuseppe Lavagetto: mediawiki: stop installing the nginx-based proxy [puppet] - 10https://gerrit.wikimedia.org/r/576078 (https://phabricator.wikimedia.org/T244843) [16:23:55] (03PS6) 10SBassett: Deployment group audit [puppet] 
- 10https://gerrit.wikimedia.org/r/574869 (https://phabricator.wikimedia.org/T237696) [16:24:40] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: stop installing the nginx-based proxy [puppet] - 10https://gerrit.wikimedia.org/r/576078 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [16:24:45] (03CR) 10jerkins-bot: [V: 04-1] Use new LVS port for eventgate-main in group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576066 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [16:25:29] (03PS2) 10Giuseppe Lavagetto: mediawiki: stop installing the nginx-based proxy [puppet] - 10https://gerrit.wikimedia.org/r/576078 (https://phabricator.wikimedia.org/T244843) [16:25:33] (03CR) 10Dzahn: [C: 03+2] Add root's home to the backup for apt* [puppet] - 10https://gerrit.wikimedia.org/r/575999 (owner: 10Muehlenhoff) [16:27:35] (03CR) 10Ppchelko: [C: 03+1] "LGTM, will leave to @Alex to make the final cut." [deployment-charts] - 10https://gerrit.wikimedia.org/r/575612 (https://phabricator.wikimedia.org/T246501) (owner: 10Holger Knust) [16:29:28] (03PS3) 10Ottomata: Use new LVS port for eventgate-main in group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576066 (https://phabricator.wikimedia.org/T245203) [16:29:59] (03CR) 10Mholloway: [C: 04-1] Enabling depicts count (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575611 (owner: 10Sharvaniharan) [16:31:13] (03CR) 10Ppchelko: [C: 03+1] Use new LVS port for eventgate-main in group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576066 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [16:33:05] phab is being slow [16:33:08] (UK) [16:33:51] (03PS4) 10Ssingh: First release of the cescout project [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/576070 [16:33:58] volans: here goes! 
[16:34:03] I will add codecov instead later [16:34:05] Reports of site slowing down [16:34:16] (03CR) 10jerkins-bot: [V: 04-1] First release of the cescout project [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/576070 (owner: 10Ssingh) [16:34:17] sukhe: k [16:35:06] ping for phab is (RhinosF1) min: 7.072ms, max: 7.487ms, average: 7.199ms, range: 0.147ms, count: 5 [16:35:12] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={icinga,pybal,swagger_check_restbase_esams} site={codfw,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:35:21] and enwp min: 7.146ms, max: 8.052ms, average: 7.753ms, range: 0.322ms, count: 5 [16:35:21] seems like I need it for flake8 as well. my understanding was that it would be taken care of in the [tox] section [16:35:46] no, because when the name of the env is py37-foo tox knows to use python 3.7 [16:35:53] otherwise it will use the default one [16:36:06] but for flake8 you don't need the package itself to be installed [16:36:18] 10Operations, 10ops-codfw, 10Traffic, 10netops: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` lvs2007.codfw.wmnet ` The log can be found in `/var/log/wmf-auto...
[16:36:22] 10Operations, 10ops-codfw, 10Traffic, 10netops: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs2007.codfw.wmnet'] ` Of which those **FAILED**: ` ['lvs2007.codfw.wmnet'] ` [16:36:24] PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:36:25] at the same time it could be even quicker to use the same venv for both envs [16:36:46] FFS... at the same time that I'm reimaging lvs3007? [16:36:46] what's up? [16:36:47] I also can’t connect to Wikimedia sites, including Grafana [16:37:00] <_joe_> esams seems to be down [16:37:08] PROBLEM - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:37:11] yeah I see that [16:37:27] Lucas_WMDE: same here: wikitech, phab, enwp down. (That's UK) 2 users reported [16:37:28] <_joe_> ok, I'll prepare a depool patch in the meanwhile [16:37:49] yep wikis unreachable for me [16:37:49] not that down.. I can reach the servers without problem from here :/ [16:37:52] traceroute if anyone wants it: https://paste.gnome.org/pjemvhplk [16:37:54] * jbond42 here [16:37:54] PROBLEM - LVS HTTPS IPv4 #page on ncredir-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:38:06] oh it just loaded. hmm. but very slow [16:38:07] high-traffic1 seems to be suffering... [16:38:13] DNS issues again?
[16:38:16] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 25.68 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:38:34] something is pretty wrong with lvs3005 [16:38:37] RECOVERY - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 14946 bytes in 2.623 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:38:50] (03PS5) 10Ssingh: First release of the cescout project [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/576070 [16:38:53] https://usercontent.irccloud-cdn.com/file/JzsLhhRH/Screen%20Shot%202020-03-02%20at%2011.38.42%20AM.png [16:39:09] enwp's just loaded here, wikitech + phab still no [16:39:16] RECOVERY - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 563 bytes in 0.327 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:39:32] wikitech + phab just loaded [16:40:01] I'm seeing ping times directly of about 16ms [16:40:10] RECOVERY - LVS HTTPS IPv4 #page on ncredir-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 233 bytes in 8.117 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:40:17] <_joe_> hi all, we're investigating the issue, which is still ongoing [16:40:18] So I think the issue isn't the connection? 
[16:42:20] 10Operations, 10Analytics, 10User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10Milimetric) p:05Triage→03High [16:42:48] PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:42:58] (03PS1) 10Volans: Revert "esams/knams: remove prepending and tcp-mss clamping" [homer/public] - 10https://gerrit.wikimedia.org/r/576080 [16:43:12] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received: /api/rest_v1/page/g [16:43:12] {revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api [16:43:12] erences/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before [16:43:12] ceived: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: 
/api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [16:43:14] PROBLEM - PyBal backends health check on lvs3005 is CRITICAL: PYBAL CRITICAL - CRITICAL - ncredirlb_443: Servers ncredir3001.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:43:15] (03CR) 10BBlack: [C: 03+1] Revert "esams/knams: remove prepending and tcp-mss clamping" [homer/public] - 10https://gerrit.wikimedia.org/r/576080 (owner: 10Volans) [16:43:28] (03CR) 10Volans: [C: 03+2] Revert "esams/knams: remove prepending and tcp-mss clamping" [homer/public] - 10https://gerrit.wikimedia.org/r/576080 (owner: 10Volans) [16:43:46] (03Merged) 10jenkins-bot: Revert "esams/knams: remove prepending and tcp-mss clamping" [homer/public] - 10https://gerrit.wikimedia.org/r/576080 (owner: 10Volans) [16:44:06] PROBLEM - configured eth on lvs3005 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.20.0.15: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [16:44:52] RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 14933 bytes in 2.187 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:45:00] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [16:45:00] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [16:45:06] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [16:45:14] 
RECOVERY - PyBal backends health check on lvs3005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:46:18] RECOVERY - configured eth on lvs3005 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [16:46:26] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:46:37] wikis work for me again, thank you [16:48:32] Lucas_WDE: are not working for me [16:48:48] I’m seeing issues again, spoke too soon… [16:49:06] yes, there will be a few more ripples, probably [16:49:48] nowiki loads awfully slow from Norway [16:50:14] <_joe_> jeblad: known, we're working on it [16:50:28] thanks! [16:51:00] (03PS1) 10Tchanders: Remove $wgEnablePartialBlocks config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576086 (https://phabricator.wikimedia.org/T242912) [16:55:32] 10Operations, 10ops-codfw, 10fundraising-tech-ops: new payments2003 bonded ethernet network error/warning - https://phabricator.wikimedia.org/T246492 (10Jgreen) Just finished imaging payments2001 and it is exhibiting the same behavior. We tested bond0 failover by unplugging eno1 on payments2003 and it fails... [16:57:58] jeblad: how are things looking for you now? 
[16:59:11] (03PS1) 10Krinkle: multiversion: Optimise readDbListFile() function by 40% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576091 [17:00:10] (03PS2) 10Krinkle: multiversion: Optimise readDbListFile() function by 40% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576091 (https://phabricator.wikimedia.org/T169821) [17:00:45] apwergos: Seems normal [17:02:24] good [17:02:45] things should be more or less stable at this point [17:03:39] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10Vgutierrez) [17:03:55] 10Operations, 10ops-codfw, 10Traffic, 10netops: device asw-d-codfw.mgmt.codfw.wmnet - Inbound interface errors - https://phabricator.wikimedia.org/T246686 (10Papaul) [17:04:30] !log Stopping pybal on lvs2009 to let lvs2010 get its traffic - T246686 [17:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:35] T246686: device asw-d-codfw.mgmt.codfw.wmnet - Inbound interface errors - https://phabricator.wikimedia.org/T246686 [17:05:57] 10Operations, 10Analytics, 10Research, 10Traffic, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10Milimetric) p:05Triage→03Medium We should have a meeting about this towards the end of this quarter / beginning of next.... [17:14:14] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [17:15:43] 10Operations, 10Security-Team, 10Stewards-and-global-tools, 10Security, 10User-revi: Security Issue Access Request for 2020 Stewards - https://phabricator.wikimedia.org/T246449 (10chasemp) >>! In T246449#5933519, @chasemp wrote: > I'll ask folks on our end about duplicate entry thinking (also worth askin... 
[17:15:51] (03PS1) 10Krinkle: multiversion: Remove support for passing file path to readDbListFile() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576097 [17:16:22] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 153.7 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:16:24] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [17:16:36] (03CR) 10Krinkle: [C: 04-1] "Let's wait for the SecurePoll fix just in case, although I don't think those scripts aren't used anymore." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576097 (owner: 10Krinkle) [17:17:16] (03CR) 10jerkins-bot: [V: 04-1] multiversion: Remove support for passing file path to readDbListFile() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576097 (owner: 10Krinkle) [17:17:22] (03CR) 10Hnowlan: [C: 03+2] kafka-dev: Updated API endpoint and added required selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/575612 (https://phabricator.wikimedia.org/T246501) (owner: 10Holger Knust) [17:19:17] 10Operations, 10Security-Team, 10Stewards-and-global-tools, 10Security, 10User-revi: Security Issue Access Request for 2020 Stewards - https://phabricator.wikimedia.org/T246449 (10chasemp) 05Open→03Stalled p:05Medium→03Lowest On my cal for the 9th. 
[17:20:16] (03CR) 10Krinkle: [C: 04-1] multiversion: Remove support for passing file path to readDbListFile() (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576097 (owner: 10Krinkle) [17:20:24] (03PS2) 10Krinkle: multiversion: Remove support for passing file path to readDbListFile() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576097 [17:27:00] (03PS1) 10Elukey: cdh::hive: remove DBTokenStore from hive-site.xml config [puppet] - 10https://gerrit.wikimedia.org/r/576099 (https://phabricator.wikimedia.org/T244499) [17:28:11] (03PS2) 10Sharvaniharan: Enabling depicts count [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575611 [17:29:22] (03CR) 10Sharvaniharan: "That is correct @Mholloway. Changes are done." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575611 (owner: 10Sharvaniharan) [17:30:48] (03CR) 10Ppchelko: "This has to be rebased now" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/575620 (https://phabricator.wikimedia.org/T245803) (owner: 10Holger Knust) [17:32:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Openstack Queens: use mirrors.wikimedia.org for the Queens backport [puppet] - 10https://gerrit.wikimedia.org/r/576059 (https://phabricator.wikimedia.org/T246287) (owner: 10Andrew Bogott) [17:32:54] PROBLEM - puppet last run on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:35:04] PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [17:35:08] PROBLEM - Check size of conntrack table on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [17:35:16] PROBLEM - Check whether ferm is active by checking the default input chain on notebook1003 is CRITICAL: connect to address 10.64.21.109 
port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:35:34] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:35:42] PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [17:35:54] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:36:23] (03PS3) 10Sharvaniharan: Enabling depicts count [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575611 [17:36:32] PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [17:36:40] PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [17:40:04] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [17:40:06] !log notebook1003 systemctl restart nagios-nrpe-server [17:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:16] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:40:52] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient 
https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [17:41:02] RECOVERY - Disk space on notebook1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [17:41:05] !log notebook1003 df: /mnt/hdfs: Input/output error | systemctl restart nagios-nrpe-server (T224682) [17:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:09] T224682: notebook1003:/srv/ 2% disk space left - https://phabricator.wikimedia.org/T224682 [17:41:10] (03PS6) 10Hnowlan: changeprop: Add Kafka subchart [deployment-charts] - 10https://gerrit.wikimedia.org/r/575620 (https://phabricator.wikimedia.org/T245803) (owner: 10Holger Knust) [17:41:18] (03CR) 10jerkins-bot: [V: 04-1] changeprop: Add Kafka subchart [deployment-charts] - 10https://gerrit.wikimedia.org/r/575620 (https://phabricator.wikimedia.org/T245803) (owner: 10Holger Knust) [17:41:36] RECOVERY - DPKG on notebook1003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [17:41:37] mutante: thanks :) [17:41:40] RECOVERY - Check size of conntrack table on notebook1003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [17:41:48] RECOVERY - Check whether ferm is active by checking the default input chain on notebook1003 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:42:03] (03CR) 10Ppchelko: "I think due to the tgz file is should be manually rebased." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/575620 (https://phabricator.wikimedia.org/T245803) (owner: 10Holger Knust) [17:42:46] 10Operations, 10ops-codfw, 10Traffic, 10netops: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` lvs2007.codfw.wmnet ` The log can be found in `/var/log/wmf-auto... [17:48:00] 10Operations, 10ops-codfw, 10Traffic, 10netops: device asw-d-codfw.mgmt.codfw.wmnet - Inbound interface errors - https://phabricator.wikimedia.org/T246686 (10Papaul) @Vgutierrez looks like re-seating the transceiver fix the problem. you can get the traffic back if we see the error again we will replace the... [17:48:49] elukey: yw. i'm afraid /mnt/hdfs is borked [17:48:53] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:49:20] (03PS4) 10Ppchelko: kafka-dev: Updated API endpoint and added required selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/575612 (https://phabricator.wikimedia.org/T246501) (owner: 10Holger Knust) [17:49:44] mutante: I'll double check! 
[17:49:46] (03CR) 10Ppchelko: [C: 03+2] kafka-dev: Updated API endpoint and added required selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/575612 (https://phabricator.wikimedia.org/T246501) (owner: 10Holger Knust) [17:50:17] (03Merged) 10jenkins-bot: kafka-dev: Updated API endpoint and added required selector [deployment-charts] - 10https://gerrit.wikimedia.org/r/575612 (https://phabricator.wikimedia.org/T246501) (owner: 10Holger Knust) [17:50:19] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:50:34] !log starting pybal on lvs2009 - T246686 [17:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:38] T246686: device asw-d-codfw.mgmt.codfw.wmnet - Inbound interface errors - https://phabricator.wikimedia.org/T246686 [17:51:57] (03CR) 10Ppchelko: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/575620 (https://phabricator.wikimedia.org/T245803) (owner: 10Holger Knust) [17:52:32] (03CR) 10Ppchelko: "yeah, for sure. Manual rebase needed." [deployment-charts] - 10https://gerrit.wikimedia.org/r/575620 (https://phabricator.wikimedia.org/T245803) (owner: 10Holger Knust) [17:59:47] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [17:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] gehel and onimisionipe: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200302T1800). 
[18:00:40] mutante: ah wait if you are talking about input/output errors, it is probably because you are not logged in with Kerberos [18:00:47] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1007.eqiad.wmnet, wdqs1006.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1006.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1004.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1007.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:01:10] * gehel is having a look at wdqs [18:02:05] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:11] elukey: oooh, ok. just because i ran df -h and was looking for full disk. alright [18:03:16] PROBLEM - LVS HTTPS IPv4 #page on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:03:32] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1007.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [18:03:45] what's going on? [18:04:02] wdqs seems overloaded [18:04:18] marostegui: do you think it is related to what you mentioned in the meeting? [18:04:29] public cluster only, so probably more related to the query load than the edit load [18:04:40] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1006.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [18:04:53] is there a mitigation needed, e.g. reduce allowed rates temporarily? 
[18:04:57] 10Operations, 10ops-codfw, 10Traffic, 10netops: device asw-d-codfw.mgmt.codfw.wmnet - Inbound interface errors - https://phabricator.wikimedia.org/T246686 (10Papaul) @Vgutierrez after you got the traffic back the error came back again so we will have to replace the transceiver tomorrow. [18:05:24] RECOVERY - LVS HTTPS IPv4 #page on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 483 bytes in 1.028 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:05:33] I can query it ok ATM [18:05:44] (at least simple queries) [18:06:01] load is already going down [18:06:13] did you do anything? [18:06:19] nope [18:06:28] 10Operations, 10ops-codfw, 10Traffic, 10netops: device asw-d-codfw.mgmt.codfw.wmnet - Inbound interface errors - https://phabricator.wikimedia.org/T246686 (10Papaul) p:05Triage→03Medium [18:06:45] our throttling (application side) did not kick in, so probably queries from different IPs [18:07:44] 10Operations, 10ops-codfw, 10Traffic, 10netops: switch port configuration for lvs200[7-10] - https://phabricator.wikimedia.org/T196946 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs2007.codfw.wmnet'] ` and were **ALL** successful. [18:08:19] did any class of queries stand out or are things not set up to show that information? [18:08:20] 10Operations, 10ops-codfw, 10Traffic: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) [18:08:23] (03CR) 10Ottomata: [C: 03+1] "Huh ok...." [puppet] - 10https://gerrit.wikimedia.org/r/576099 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey) [18:08:44] 10Operations, 10ops-codfw, 10Traffic: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) 05Open→03Resolved @Vgutierrez lvs2007 is ready for service. 
[18:08:46] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:09:02] apergos: we don't have a good way to get aggregated query info in real time [18:09:16] right [18:10:10] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:11:31] (03PS1) 10Jbond: ferm: Add status check [puppet] - 10https://gerrit.wikimedia.org/r/576101 (https://phabricator.wikimedia.org/T206951) [18:11:36] (03PS1) 10Jbond: ferm: enable ferm status script [puppet] - 10https://gerrit.wikimedia.org/r/576102 (https://phabricator.wikimedia.org/T206951) [18:12:35] (03PS2) 10Jbond: ferm: Add status check [puppet] - 10https://gerrit.wikimedia.org/r/576101 (https://phabricator.wikimedia.org/T206951) [18:12:48] (03PS2) 10Jbond: ferm: enable ferm status script [puppet] - 10https://gerrit.wikimedia.org/r/576102 (https://phabricator.wikimedia.org/T206951) [18:13:01] (03PS3) 10Jbond: ferm: enable ferm status script [puppet] - 10https://gerrit.wikimedia.org/r/576102 (https://phabricator.wikimedia.org/T206951) [18:15:28] (03CR) 10jerkins-bot: [V: 04-1] ferm: Add status check [puppet] - 10https://gerrit.wikimedia.org/r/576101 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [18:16:29] (03CR) 10Jforrester: [C: 03+1] "Nice." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/576060 (https://phabricator.wikimedia.org/T223602) (owner: 10Krinkle) [18:17:22] 10Operations, 10ops-codfw, 10fundraising-tech-ops: (Need by: TBD) rack/setup/install frpm2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242269 (10Papaul) [18:19:35] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:21:00] (03PS3) 10Jbond: ferm: Add status check [puppet] - 10https://gerrit.wikimedia.org/r/576101 (https://phabricator.wikimedia.org/T206951) [18:21:24] 10Operations, 10Analytics, 10User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10nshahquinn-wmf) >>! In T246578#5932606, @elukey wrote: > There are other use cases for people using the stat boxes, that often don't involve private da... [18:22:14] (03CR) 10Herron: [C: 03+2] logstash: filter-syslog: match http_method on DATA [puppet] - 10https://gerrit.wikimedia.org/r/575614 (owner: 10Herron) [18:23:04] (03CR) 10Jforrester: [C: 03+1] "Let's hope we never find ourselves needing to set MW_VERSION in defines.php, given that config is non-variant." (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576074 (owner: 10Krinkle) [18:24:58] (03PS2) 10Jbond: ferm: add a very basic status check [puppet] - 10https://gerrit.wikimedia.org/r/573335 (https://phabricator.wikimedia.org/T206951) [18:25:22] (03CR) 10Jforrester: "Very nice." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/576091 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [18:25:28] (03CR) 10Jforrester: [C: 03+1] multiversion: Remove support for passing file path to readDbListFile() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576097 (owner: 10Krinkle) [18:27:37] (03PS4) 10Jbond: ferm: Add status check [puppet] - 10https://gerrit.wikimedia.org/r/576101 (https://phabricator.wikimedia.org/T206951) [18:30:25] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/576101 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [18:32:50] (03CR) 10Elukey: [C: 03+2] cdh::hive: remove DBTokenStore from hive-site.xml config [puppet] - 10https://gerrit.wikimedia.org/r/576099 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey) [18:33:36] (03CR) 10Ottomata: [C: 03+2] Use new LVS port for eventgate-main in group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576066 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [18:35:11] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:35:28] !log add BGP to lvs2008 on cr1/2-codfw [18:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:59] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:37:52] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@8195b6f]: Bump python to 3.7, python-kafka to 1.4.7 [18:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:03] !log using new eventgate-main LVS ports for eventbus on group0 wikis - T245203 [18:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:07] T245203: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 [18:38:49] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Queens: use mirrors.wikimedia.org for the Queens backport [puppet] - 10https://gerrit.wikimedia.org/r/576059 (https://phabricator.wikimedia.org/T246287) (owner: 10Andrew Bogott) [18:39:13] !log otto@deploy1001 Synchronized wmf-config/LabsServices.php: Use new LVS port for EventBus for eventgate-main on group0 wikis - T245203 (duration: 00m 58s) [18:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:33] !log otto@deploy1001 Synchronized wmf-config/ProductionServices.php: Use new LVS port for EventBus for eventgate-main on group0 wikis - T245203 (duration: 00m 57s) [18:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:56] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@8195b6f]: Bump python to 3.7, python-kafka to 1.4.7 (duration: 04m 04s) [18:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:59] !log remove BGP to lvs2004/5/6 on cr1/2-codfw [18:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:15] !log otto@deploy1001 Synchronized wmf-config/CommonSettings.php: Use new LVS port for EventBus for eventgate-main on group0 wikis - T245203 (duration: 00m 56s) [18:43:18] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:19] T245203: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 [18:45:12] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Use new LVS port for EventBus for eventgate-main on group0 wikis - T245203 (duration: 00m 57s) [18:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:00] (03PS1) 10Ottomata: Use new LVS port for eventgate-main in group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576115 (https://phabricator.wikimedia.org/T245203) [18:48:35] 10Operations, 10netops: Netbox has incorrect email address for GTT - https://phabricator.wikimedia.org/T246564 (10ayounsi) a:03faidon Seems like they changed things. From https://www.gtt.net/us-en/support/ "Finding your NOC contacts on the dashboard in EtherVision" @faidon do you have portal access? If so... [18:51:28] 10Operations, 10netops: Netbox has incorrect email address for GTT - https://phabricator.wikimedia.org/T246564 (10ayounsi) Until then I updated the Netbox page to match their PeeringDB NOC contact: https://www.peeringdb.com/net/14 [18:51:33] (03CR) 10Ottomata: [C: 03+2] Use new LVS port for eventgate-main in group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576115 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [18:52:42] 10Operations, 10ops-codfw, 10fundraising-tech-ops: new payments2001 and payments2003 bonded ethernet network error/warning - https://phabricator.wikimedia.org/T246492 (10Jgreen) [18:53:11] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Use new LVS port for EventBus for eventgate-main on group1 wikis - T245203 (duration: 00m 57s) [18:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:16] T245203: Create production and canary releases for existent eventgate helmfile services - 
https://phabricator.wikimedia.org/T245203 [18:53:27] 10Operations, 10ops-codfw, 10fundraising-tech-ops: (Need by: TBD) codfw: rack/setup/install 3 new payments server for frack - https://phabricator.wikimedia.org/T244169 (10Jgreen) [18:54:11] (03PS2) 10Herron: check_confd_template: glob fixup and add detail to alerts [puppet] - 10https://gerrit.wikimedia.org/r/575598 [18:54:13] (03CR) 10Herron: check_confd_template: glob fixup and add detail to alerts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/575598 (owner: 10Herron) [18:54:40] (03PS1) 10EBernhardson: mjolnir: Install python3.7 to older debian versions [puppet] - 10https://gerrit.wikimedia.org/r/576116 [18:55:12] (03PS1) 10Ottomata: Use new LVS port for eventgate-main in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576117 (https://phabricator.wikimedia.org/T245203) [18:56:55] (03CR) 10Ottomata: [C: 03+2] Use new LVS port for eventgate-main in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576117 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [18:57:41] 10Operations, 10DC-Ops, 10decommission: decommission WMF6141 (old payments2001.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T246697 (10Jgreen) [18:58:55] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Use new LVS port for EventBus for eventgate-main on all wikis - T245203 (duration: 00m 56s) [18:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:00] T245203: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 [18:59:15] (03PS4) 10Ppchelko: Add EventBus Run Job API permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572994 (https://phabricator.wikimedia.org/T244770) (owner: 10Clarakosi) [18:59:52] 10Operations, 10DC-Ops, 10decommission: decommission WMF6143 (old payments2002.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T246698 (10Jgreen) 
[19:00:04] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Morning SWAT(Max 6 patches) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200302T1900). [19:00:04] Pchelolo: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:18] ottomata: r u done with yours? [19:00:34] (03CR) 10jerkins-bot: [V: 04-1] Add EventBus Run Job API permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572994 (https://phabricator.wikimedia.org/T244770) (owner: 10Clarakosi) [19:00:35] got a few sync files sorry! just for cleanup [19:00:36] doing now [19:00:40] meant to finish before swat [19:00:47] np, ping me when done [19:01:55] !log otto@deploy1001 Synchronized wmf-config/CommonSettings.php: Use new LVS port for EventBus for eventgate-main on all wikis - T245203 (duration: 00m 57s) [19:01:56] 10Operations, 10DC-Ops, 10decommission: decommission WMF6142 (old payments2003.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T246699 (10Jgreen) [19:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:35] (03PS7) 10Rush: Deployment group audit [puppet] - 10https://gerrit.wikimedia.org/r/574869 (https://phabricator.wikimedia.org/T237696) (owner: 10SBassett) [19:02:52] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/576072 (owner: 10Muehlenhoff) [19:03:06] (03CR) 10EBernhardson: [C: 04-1] "fails to compile on stat1007" [puppet] - 10https://gerrit.wikimedia.org/r/576116 (owner: 10EBernhardson) [19:03:06] !log otto@deploy1001 Synchronized wmf-config/LabsServices.php: Use new LVS port for EventBus for eventgate-main on all wikis - T245203 (duration: 00m 56s) [19:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:27] !log otto@deploy1001 
Synchronized wmf-config/ProductionServices.php: Use new LVS port for EventBus for eventgate-main on all wikis - T245203 (duration: 00m 56s) [19:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:31] T245203: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 [19:05:20] all done Pchelolo [19:05:33] ok, I'll do mine now [19:05:49] (03CR) 10Rush: [C: 03+2] Deployment group audit [puppet] - 10https://gerrit.wikimedia.org/r/574869 (https://phabricator.wikimedia.org/T237696) (owner: 10SBassett) [19:06:12] (03CR) 10Ppchelko: [C: 03+1] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572994 (https://phabricator.wikimedia.org/T244770) (owner: 10Clarakosi) [19:07:10] (03PS3) 10Ottomata: EventStreamConfig - allow eventgate to produce error events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572979 (https://phabricator.wikimedia.org/T233629) [19:07:38] (03CR) 10Ppchelko: [C: 03+2] Add EventBus Run Job API permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572994 (https://phabricator.wikimedia.org/T244770) (owner: 10Clarakosi) [19:08:23] Pchelolo: let me know when you are done i will have a few more for other things now too :) [19:08:31] ok ottomata [19:08:43] (03Merged) 10jenkins-bot: Add EventBus Run Job API permissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572994 (https://phabricator.wikimedia.org/T244770) (owner: 10Clarakosi) [19:15:36] !log ppchelko@deploy1001 Synchronized wmf-config/CommonSettings-labs.php: Enable REST run jobs endpoint on jobrunners T244770 (duration: 00m 56s) [19:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:41] T244770: Enable RunSingleJobHandler endpoint on Job Runner Cluster - https://phabricator.wikimedia.org/T244770 [19:17:09] !log ppchelko@deploy1001 Synchronized wmf-config/CommonSettings.php: Enable REST run jobs endpoint on jobrunners 
T244770 (duration: 00m 56s) [19:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:41] 10Operations, 10Security-Team, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10chasemp) a:03HMarcus >>! In T244792#5925692, @MoritzMuehlenhoff wrote: >>>! In T244792#5925452, @HMarcus wrote: >> Yes, we will plan on prov... [19:17:43] 10Operations, 10ops-codfw, 10fundraising-tech-ops: (Need by: TBD) codfw: rack/setup/install 3 new payments server for frack - https://phabricator.wikimedia.org/T244169 (10Jgreen) 05Open→03Resolved [19:18:36] 10Operations, 10ops-codfw, 10fundraising-tech-ops: (Need by: TBD) codfw: rack/setup/install 3 new payments server for frack - https://phabricator.wikimedia.org/T244169 (10Jgreen) [19:18:57] ottomata: done, all yours [19:21:43] (03CR) 10Jforrester: [C: 03+1] Remove dead code from 'ChangeAuthenticationDataAudit' hook handler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575815 (owner: 10Krinkle) [19:25:54] 10Operations, 10DC-Ops, 10decommission: decommission WMF6142 (old payments2003.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T246699 (10Jgreen) [19:25:56] 10Operations, 10DC-Ops, 10decommission: decommission WMF6143 (old payments2002.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T246698 (10Jgreen) [19:25:58] 10Operations, 10DC-Ops, 10decommission: decommission WMF6141 (old payments2001.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T246697 (10Jgreen) [19:29:55] ty [19:30:07] (03PS2) 10Cmjohnson: Adding production dns for new mw1385-1413 [dns] - 10https://gerrit.wikimedia.org/r/576073 (https://phabricator.wikimedia.org/T241849) [19:30:11] (03CR) 10Ottomata: [C: 03+2] EventStreamConfig - allow eventgate to produce error events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572979 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [19:30:36] (03PS7) 10Ppchelko: changeprop: 
Add Kafka subchart [deployment-charts] - 10https://gerrit.wikimedia.org/r/575620 (https://phabricator.wikimedia.org/T245803) (owner: 10Holger Knust) [19:30:51] (03CR) 10Herron: [C: 03+1] monitoring: remove hostname from mgmt definitions (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/526165 (owner: 10Cwhite) [19:32:35] (03CR) 10Cmjohnson: [C: 03+2] Adding production dns for new mw1385-1413 [dns] - 10https://gerrit.wikimedia.org/r/576073 (https://phabricator.wikimedia.org/T241849) (owner: 10Cmjohnson) [19:33:09] (03CR) 10Ppchelko: "PS7 is a manual rebase" [deployment-charts] - 10https://gerrit.wikimedia.org/r/575620 (https://phabricator.wikimedia.org/T245803) (owner: 10Holger Knust) [19:35:13] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: EventStreamConfig - allow eventgate-analytics-external to produce error events - T233629 (duration: 00m 56s) [19:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:18] T233629: Public EventGate instance and endpoint for analytics event intake: eventgate-analytics-external - https://phabricator.wikimedia.org/T233629 [19:35:58] hmmm [19:36:19] (03PS2) 10EBernhardson: mjolnir: Install python3.7 to older debian versions [puppet] - 10https://gerrit.wikimedia.org/r/576116 [19:37:09] oh cancel hmm, just cached response [19:37:12] :) [19:38:02] (03CR) 10Ppchelko: [C: 03+2] changeprop: Add Kafka subchart [deployment-charts] - 10https://gerrit.wikimedia.org/r/575620 (https://phabricator.wikimedia.org/T245803) (owner: 10Holger Knust) [19:38:43] (03Merged) 10jenkins-bot: changeprop: Add Kafka subchart [deployment-charts] - 10https://gerrit.wikimedia.org/r/575620 (https://phabricator.wikimedia.org/T245803) (owner: 10Holger Knust) [19:39:15] (03PS1) 10Ottomata: Enable client side error logging in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576120 (https://phabricator.wikimedia.org/T246030) [19:39:19] 10Operations, 10ChangeProp, 10Release Pipeline, 
10Release-Engineering-Team-TODO, and 5 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10Pchelolo) [19:39:30] (03CR) 10EBernhardson: [C: 04-1] "puppet compiler looks reasonable: https://puppet-compiler.wmflabs.org/compiler1002/21190/" [puppet] - 10https://gerrit.wikimedia.org/r/576116 (owner: 10EBernhardson) [19:40:06] (03PS8) 10Ppchelko: changeprop: New helmfiles for deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: 10Holger Knust) [19:40:30] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (Need by: ASAP) rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (10Jgreen) p:05High→03Medium [19:40:45] (03CR) 10Ppchelko: [C: 03+1] "I am unfamiliar with the deploy process from now on, but I think it's good to go?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: 10Holger Knust) [19:41:32] (03CR) 10Ottomata: [C: 03+2] Enable client side error logging in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576120 (https://phabricator.wikimedia.org/T246030) (owner: 10Ottomata) [19:42:14] (03CR) 10CRusnov: tox: Support DNS_INCLUDE_DIR and generated DNS (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/569340 (https://phabricator.wikimedia.org/T243362) (owner: 10CRusnov) [19:43:10] (03PS1) 10Ottomata: Fix wgWMEClientErrorIntakeURL in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576126 (https://phabricator.wikimedia.org/T246030) [19:43:55] (03CR) 10jerkins-bot: [V: 04-1] Fix wgWMEClientErrorIntakeURL in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576126 (https://phabricator.wikimedia.org/T246030) (owner: 10Ottomata) [19:49:12] (03PS2) 10Ottomata: Fix wgWMEClientErrorIntakeURL in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576126 (https://phabricator.wikimedia.org/T246030) [19:49:38] (03PS1) 10Cmjohnson: 
Adding mac addresses for mw1385-1413 [puppet] - 10https://gerrit.wikimedia.org/r/576127 (https://phabricator.wikimedia.org/T241849) [19:51:46] (03CR) 10Cmjohnson: [C: 03+2] Adding mac addresses for mw1385-1413 [puppet] - 10https://gerrit.wikimedia.org/r/576127 (https://phabricator.wikimedia.org/T241849) (owner: 10Cmjohnson) [19:58:27] (03PS5) 10Dzahn: aptrepo: puppetize REPREPRO_BASE_DIR env variable [puppet] - 10https://gerrit.wikimedia.org/r/575638 (https://phabricator.wikimedia.org/T224576) [19:59:06] (03CR) 10Ottomata: [C: 03+2] Fix wgWMEClientErrorIntakeURL in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576126 (https://phabricator.wikimedia.org/T246030) (owner: 10Ottomata) [19:59:34] (03CR) 10jerkins-bot: [V: 04-1] aptrepo: puppetize REPREPRO_BASE_DIR env variable [puppet] - 10https://gerrit.wikimedia.org/r/575638 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [20:01:24] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Enable Mediawiki client side error logging in beta - T246030 (duration: 00m 56s) [20:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:29] T246030: Enable client side error logging in prod for small wiki - https://phabricator.wikimedia.org/T246030 [20:02:26] (03CR) 10Krinkle: [C: 03+2] MWConfigCacheGenerator: Remove unused 'docRoot' wgConf placeholder variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576060 (https://phabricator.wikimedia.org/T223602) (owner: 10Krinkle) [20:03:03] (03PS1) 10Ottomata: Enable client side (browser) error logging for group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576130 (https://phabricator.wikimedia.org/T246030) [20:03:18] (03PS2) 10Krinkle: Move Defines.php mock from tests/ to src/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576074 [20:03:42] ottomata: still deploying? 
[20:03:52] Krinkle: i can pause if need be [20:04:01] don't have anything dirty atm [20:04:17] ottomata: I landed a no-op I'd like to roll out. But I can wait as well. [20:04:21] Krinkle: proceed [20:04:22] i'll wait [20:04:23] ok [20:04:25] (03Merged) 10jenkins-bot: MWConfigCacheGenerator: Remove unused 'docRoot' wgConf placeholder variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576060 (https://phabricator.wikimedia.org/T223602) (owner: 10Krinkle) [20:07:33] * Krinkle staging on mwdebug1002 [20:10:13] 10Operations, 10ops-codfw, 10fundraising-tech-ops: (Need by: TBD) rack/setup/install frpm2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242269 (10Papaul) [20:10:35] (03CR) 10Andrew Bogott: [C: 03+2] keystone hooks: create new default domains for new projects [puppet] - 10https://gerrit.wikimedia.org/r/575271 (https://phabricator.wikimedia.org/T245174) (owner: 10Andrew Bogott) [20:12:26] !log krinkle@deploy1001 Synchronized multiversion/MWConfigCacheGenerator.php: I1ef0589 (duration: 00m 58s) [20:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:37] ottomata: all yours [20:15:53] ty [20:16:14] (03CR) 10Ottomata: [C: 03+2] Enable client side (browser) error logging for group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576130 (https://phabricator.wikimedia.org/T246030) (owner: 10Ottomata) [20:16:18] (03PS2) 10Ottomata: Enable client side (browser) error logging for group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576130 (https://phabricator.wikimedia.org/T246030) [20:17:29] 10Operations, 10ops-codfw, 10fundraising-tech-ops: (Need by: TBD) codfw:fundraising single-cpu misc servers frpig2001,civi2001.pay-lvs200[1-2] - https://phabricator.wikimedia.org/T244950 (10Papaul) [20:18:40] (03CR) 10Krinkle: "Yeah, if both this and SiteConfiguration::getAll() can be fast enough, then we won't need to cache it." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/576091 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [20:18:57] 10Operations, 10ops-codfw, 10fundraising-tech-ops: (Need by: TBD) codfw:fundraising single-cpu misc servers frpig2001,civi2001.pay-lvs200[1-2] - https://phabricator.wikimedia.org/T244950 (10Jgreen) [20:22:02] (03PS1) 10Papaul: DNS: Add mgmt DNS for frpm2001 [dns] - 10https://gerrit.wikimedia.org/r/576132 [20:22:47] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Enable Mediawiki client side error logging on group0 wikis - T246030 (duration: 00m 57s) [20:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:52] T246030: Enable client side error logging in prod for small wiki - https://phabricator.wikimedia.org/T246030 [20:23:50] (03CR) 10Dzahn: [C: 03+1] DNS: Add mgmt DNS for frpm2001 [dns] - 10https://gerrit.wikimedia.org/r/576132 (owner: 10Papaul) [20:25:18] (03CR) 10Jgreen: [C: 03+1] DNS: Add mgmt DNS for frpm2001 [dns] - 10https://gerrit.wikimedia.org/r/576132 (owner: 10Papaul) [20:25:45] (03CR) 10Dzahn: [C: 03+2] DNS: Add mgmt DNS for frpm2001 [dns] - 10https://gerrit.wikimedia.org/r/576132 (owner: 10Papaul) [20:27:26] (03PS1) 10Andrew Bogott: keystone hoooks: don't create .wmflabs.org domains for codfw1dev projects [puppet] - 10https://gerrit.wikimedia.org/r/576133 (https://phabricator.wikimedia.org/T245174) [20:27:37] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: (Need by: TBD) rack/setup/install frpm2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242269 (10Papaul) [20:27:45] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: (Need by: TBD) rack/setup/install frpm2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242269 (10Papaul) ` [edit interfaces interface-range disabled] - member "ge-[0-1]/0/12"; [edit interfaces interface-range vlan-administ... 
[20:28:28] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: (Need by: TBD) rack/setup/install frpm2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242269 (10Papaul) a:05Papaul→03Jgreen @Jgreen All yours [20:28:50] (03CR) 10Andrew Bogott: [C: 03+2] keystone hoooks: don't create .wmflabs.org domains for codfw1dev projects [puppet] - 10https://gerrit.wikimedia.org/r/576133 (https://phabricator.wikimedia.org/T245174) (owner: 10Andrew Bogott) [20:32:56] 10Operations, 10ops-codfw, 10fundraising-tech-ops: (Need by: TBD) codfw:fundraising single-cpu misc servers frpig2001,civi2001.pay-lvs200[1-2] - https://phabricator.wikimedia.org/T244950 (10Papaul) 05Open→03Declined Declinning this task since it is been track in separate tasks T242270 T242266 T242267 T2... [20:34:56] (03CR) 10Dzahn: [V: 03+2 C: 03+2] aptrepo: puppetize REPREPRO_BASE_DIR env variable [puppet] - 10https://gerrit.wikimedia.org/r/575638 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [20:36:44] (03CR) 10Dzahn: "no issue in production, despite what jerkins said before" [puppet] - 10https://gerrit.wikimedia.org/r/575638 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [20:38:58] 10Operations, 10ORES, 10Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (10Halfak) @akosiaris, I've run a couple of tests now using threads rather than process-based workers and the beh... [20:45:10] (03CR) 10Krinkle: [C: 04-2] "Blocked pending investigation of elevated memory use issue in Beta Cluster. 
I'll enable this on a canary server for a while separately fir" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575709 (https://phabricator.wikimedia.org/T99740) (owner: 10Krinkle) [20:45:35] (03PS1) 10Andrew Bogott: keystone: allow setting the bastion project id via heira [puppet] - 10https://gerrit.wikimedia.org/r/576137 [20:48:16] (03PS2) 10Andrew Bogott: keystone: allow setting the bastion project id via heira [puppet] - 10https://gerrit.wikimedia.org/r/576137 [20:51:06] (03CR) 10Ayounsi: [C: 03+1] wdqs: added link to runbook entry for categories update lag. [puppet] - 10https://gerrit.wikimedia.org/r/576064 (https://phabricator.wikimedia.org/T246497) (owner: 10Gehel) [20:54:16] (03PS3) 10Andrew Bogott: keystone: allow setting the bastion project id via heira [puppet] - 10https://gerrit.wikimedia.org/r/576137 [20:57:24] (03CR) 10Andrew Bogott: [C: 03+2] keystone: allow setting the bastion project id via heira [puppet] - 10https://gerrit.wikimedia.org/r/576137 (owner: 10Andrew Bogott) [21:00:04] cscott, arlolra, subbu, halfak, and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / Parsoid / Citoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200302T2100). [21:03:43] 10Operations, 10ops-eqiad, 10DC-Ops: audit/rebalance power in a5-eqiad - https://phabricator.wikimedia.org/T245655 (10ayounsi) The way LibreNMS works is that it will either: * get the alerting value from the device if it supports (exposes) it. I don't think that's the case for PDUs * guess an alerting value... 
[21:06:42] (03CR) 10Dzahn: "> Patch Set 4: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/575638 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [21:07:07] (03PS1) 10Andrew Bogott: keystone: include ::profile::openstack::codfw1dev::clientpackages on keystone service host [puppet] - 10https://gerrit.wikimedia.org/r/576141 [21:09:55] (03CR) 10jerkins-bot: [V: 04-1] keystone: include ::profile::openstack::codfw1dev::clientpackages on keystone service host [puppet] - 10https://gerrit.wikimedia.org/r/576141 (owner: 10Andrew Bogott) [21:10:04] (03PS1) 10Dzahn: releases: remove superfluous lint-ignore line [puppet] - 10https://gerrit.wikimedia.org/r/576142 [21:10:58] !log re-number AMS-IX peer 64271 [21:11:00] (03PS2) 10Andrew Bogott: codfw1dev: include clientpackages on keystone service host [puppet] - 10https://gerrit.wikimedia.org/r/576141 [21:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:49] 10Operations, 10Analytics, 10Research, 10Traffic, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10LilyOfTheWest) @Milimetric that is a good point. @Miriam I suggest replacing "highly anonymized" in the task description w... 
[21:13:13] (03PS1) 10Dzahn: aptrepo: puppetize gpg sec and pub keys for apt.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/576143 (https://phabricator.wikimedia.org/T224576) [21:14:07] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: include clientpackages on keystone service host [puppet] - 10https://gerrit.wikimedia.org/r/576141 (owner: 10Andrew Bogott) [21:15:17] (03CR) 10Dzahn: [C: 03+2] releases: remove superfluous lint-ignore line [puppet] - 10https://gerrit.wikimedia.org/r/576142 (owner: 10Dzahn) [21:16:15] 10Operations, 10Analytics, 10Research, 10Traffic, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10Milimetric) The internal use cases would be nice to support, and I think we can discuss that separately from how much we tru... [21:17:07] 10Operations, 10netops: Add graceful-restart to cr2-esams - https://phabricator.wikimedia.org/T246338 (10ayounsi) Note that enabling `graceful-restart` will cause all BGP sessions to flap. [21:19:37] 10Operations, 10serviceops-radar, 10Wikimedia-maintenance-script-run: special pages has not been updated since November 2019 in jawiki and several other projects - https://phabricator.wikimedia.org/T243599 (10Mbch331) [21:19:45] 10Operations, 10Analytics, 10Research, 10Traffic, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10leila) @Milimetric that is a good point. @Miriam I suggest replacing "highly anonymized" in the task description with "suf... [21:27:57] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. It's strange that the entire deploy the key logic was around, but simply not used? Did that get dropped accidentally before." 
[puppet] - 10https://gerrit.wikimedia.org/r/576143 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [21:34:28] 10Operations, 10serviceops-radar, 10Wikimedia-maintenance-script-run: special pages has not been updated since November 2019 in jawiki and several other projects - https://phabricator.wikimedia.org/T243599 (10Ciell) Hi all, Because Dutch Wikipedia will be hitting a mile stone somewhere this month, I'd like... [21:34:51] 10Operations, 10Analytics, 10Research, 10Traffic, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10Nuria) What is the number of users this potential system would serve? 10/100? [21:37:00] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (Need by: ASAP) rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (10Jgreen) I'm not able to access the management IP, can you check it out? jgreen@frbast1001:~$ dig frnetmon1001.mgmt.frack.eqiad.wmnet ; <<>> D... [21:39:13] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (Need by: ASAP) rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (10Jgreen) a:05Jgreen→03None [21:39:53] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:40:27] (03CR) 10BBlack: tox: Support DNS_INCLUDE_DIR and generated DNS (034 comments) [dns] - 10https://gerrit.wikimedia.org/r/569340 (https://phabricator.wikimedia.org/T243362) (owner: 10CRusnov) [21:40:35] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (Need by: ASAP) rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (10Jgreen) @cmjohnson @Jclark-ctr can someone take a look at the management interface situation? It's not accessible via network as far as I can tell. 
[21:41:10] 10Operations, 10Analytics, 10Research, 10Traffic, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10leila) @Nuria can you help me understand in what sense the answer to this question is important? Is it about RAM and Storage... [21:41:13] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 74, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:42:56] ^ is the eqord transport link [21:44:39] 10Operations, 10serviceops-radar, 10Wikimedia-maintenance-script-run: special pages has not been updated since November 2019 in jawiki and several other projects - https://phabricator.wikimedia.org/T243599 (10Koavf) p:05Medium→03High [21:45:10] 10Operations, 10serviceops-radar, 10Wikimedia-maintenance-script-run: special pages has not been updated since November 2019 in jawiki and several other projects - https://phabricator.wikimedia.org/T243599 (10Koavf) This is not urgent/break now but it's pretty important for maintenance on several wikis and h... [21:45:11] (03PS1) 10Herron: dns: add logstash-next.wikimedia.org record [dns] - 10https://gerrit.wikimedia.org/r/576152 (https://phabricator.wikimedia.org/T234854) [21:45:16] (03PS1) 10Herron: cache: map logstash-next.wikimedia.org to kibana-next lvs [puppet] - 10https://gerrit.wikimedia.org/r/576151 (https://phabricator.wikimedia.org/T234854) [21:46:54] 10Operations, 10Traffic, 10netops: BGP: Investigate isolating codfw and eqiad - https://phabricator.wikimedia.org/T246721 (10ayounsi) [21:49:54] 10Operations, 10Analytics, 10Research, 10Traffic, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10Nuria) This ask, in terms of infrastructure is a significant one and we would like to e how many users are benefiting from i... 
[21:50:21] " We currently have a network disturbance near Lawrence, Kansas. Investigation is ongoing. We will provide further details as soon as possible. " [21:50:39] (03Abandoned) 10Andrew Bogott: Openstack: add 'queens' config files [puppet] - 10https://gerrit.wikimedia.org/r/575083 (owner: 10Andrew Bogott) [21:53:17] (03PS1) 10Andrew Bogott: wmcs-novastats-dnsleaks: make safe to run in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/576154 (https://phabricator.wikimedia.org/T246551) [21:57:34] (03CR) 10Alex Monk: "I think I'm missing the codfw1dev bit?" [puppet] - 10https://gerrit.wikimedia.org/r/576154 (https://phabricator.wikimedia.org/T246551) (owner: 10Andrew Bogott) [21:59:40] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/21195/" [puppet] - 10https://gerrit.wikimedia.org/r/576151 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [22:00:04] Reedy and sbassett: I, the Bot under the Fountain, allow thee, The Deployer, to do Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200302T2200). 
[22:01:12] (03PS2) 10Andrew Bogott: wmcs-novastats-dnsleaks: run through 2to3 and black [puppet] - 10https://gerrit.wikimedia.org/r/576154 (https://phabricator.wikimedia.org/T246551) [22:01:15] (03PS1) 10Andrew Bogott: wmcs-novastats-dnsleaks: make safe to run in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/576155 [22:02:39] (03CR) 10jerkins-bot: [V: 04-1] wmcs-novastats-dnsleaks: make safe to run in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/576155 (owner: 10Andrew Bogott) [22:04:06] (03PS2) 10Andrew Bogott: wmcs-novastats-dnsleaks: make safe to run in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/576155 [22:05:20] (03CR) 10jerkins-bot: [V: 04-1] wmcs-novastats-dnsleaks: make safe to run in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/576155 (owner: 10Andrew Bogott) [22:07:23] (03PS3) 10Andrew Bogott: wmcs-novastats-dnsleaks: make safe to run in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/576155 (https://phabricator.wikimedia.org/T246551) [22:09:53] (03CR) 10Dzahn: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/576143 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [22:12:33] (03PS3) 10Alex Monk: profile::mariadb::cloudinfra: Allow overriding of hostnames [puppet] - 10https://gerrit.wikimedia.org/r/575744 (https://phabricator.wikimedia.org/T242607) [22:12:38] (03PS1) 10Jdlrobson: Restore beta cluster logo on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576157 (https://phabricator.wikimedia.org/T232140) [22:12:59] (03CR) 10jerkins-bot: [V: 04-1] profile::mariadb::cloudinfra: Allow overriding of hostnames [puppet] - 10https://gerrit.wikimedia.org/r/575744 (https://phabricator.wikimedia.org/T242607) (owner: 10Alex Monk) [22:13:41] (03CR) 10Jdlrobson: "https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/576157 Restore beta cluster logo on mobile will restore the correct icon" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570379 
(https://phabricator.wikimedia.org/T232140) (owner: 10Jforrester) [22:14:16] (03CR) 10Jdlrobson: [C: 04-1] "if possible would prefer a solution that keeps wordmark the same for beta cluster as production." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576157 (https://phabricator.wikimedia.org/T232140) (owner: 10Jdlrobson) [22:14:30] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10colewhite) [22:15:12] (03PS4) 10Alex Monk: profile::mariadb::cloudinfra: Allow overriding of hostnames [puppet] - 10https://gerrit.wikimedia.org/r/575744 (https://phabricator.wikimedia.org/T242607) [22:17:45] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-novastats-dnsleaks: run through 2to3 and black [puppet] - 10https://gerrit.wikimedia.org/r/576154 (https://phabricator.wikimedia.org/T246551) (owner: 10Andrew Bogott) [22:17:56] (03CR) 10Volans: [C: 04-1] "LGTM but there is a styling issue (I know sometimes are painful to follow), see inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/576116 (owner: 10EBernhardson) [22:18:59] (03CR) 10Andrew Bogott: [C: 03+2] profile::wmcs::nfs::backup::primary::base: include a standard firewall [puppet] - 10https://gerrit.wikimedia.org/r/575594 (https://phabricator.wikimedia.org/T245808) (owner: 10Andrew Bogott) [22:19:53] 10Operations, 10Analytics, 10Research, 10Traffic, 10WMF-Legal: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10leila) >>! In T245833#5934960, @Nuria wrote: > This ask, in terms of infrastructure is a significant one and we would like t... 
[22:20:26] (03PS4) 10Andrew Bogott: wmcs-novastats-dnsleaks: make safe to run in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/576155 (https://phabricator.wikimedia.org/T246551) [22:21:46] (03PS2) 10Andrew Bogott: cloudmetrics: clean up the grafana-labs-admin endpoint [puppet] - 10https://gerrit.wikimedia.org/r/575758 (https://phabricator.wikimedia.org/T246508) [22:22:47] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Ferm rules for cloudbackup2001/2001 - https://phabricator.wikimedia.org/T245808 (10Andrew) 05Open→03Resolved a:03Andrew [22:22:50] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Andrew) [22:23:01] (03PS1) 10Dzahn: add fake GPG keys for apt.wikimedia.org [labs/private] - 10https://gerrit.wikimedia.org/r/576158 [22:23:37] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake GPG keys for apt.wikimedia.org [labs/private] - 10https://gerrit.wikimedia.org/r/576158 (owner: 10Dzahn) [22:23:39] (03PS1) 10RLazarus: Allow regular expressions in assert_headers values. [software/httpbb] - 10https://gerrit.wikimedia.org/r/576159 (https://phabricator.wikimedia.org/T236699) [22:25:05] (03CR) 10Andrew Bogott: [C: 03+2] cloudmetrics: clean up the grafana-labs-admin endpoint [puppet] - 10https://gerrit.wikimedia.org/r/575758 (https://phabricator.wikimedia.org/T246508) (owner: 10Andrew Bogott) [22:25:49] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/21197/" [puppet] - 10https://gerrit.wikimedia.org/r/576143 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [22:26:18] (03PS2) 10Dzahn: aptrepo: puppetize gpg sec and pub keys for apt.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/576143 (https://phabricator.wikimedia.org/T224576) [22:31:00] (03CR) 10Andrew Bogott: [C: 03+1] "I don't know much about this but it lgtm. Note that grafana-labs-admin doesn't exist anymore, as of yesterday." 
[puppet] - 10https://gerrit.wikimedia.org/r/572381 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [22:46:19] (03CR) 10Mholloway: [C: 04-1] Enabling depicts count (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575611 (owner: 10Sharvaniharan) [22:53:51] (03PS1) 10Dzahn: installserver/apt_repo: add homedir parameter, move dirs to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/576160 (https://phabricator.wikimedia.org/T224576) [22:57:58] (03CR) 10Dzahn: "this added the keys but in /var/lib/reprepro .. -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/576160" [puppet] - 10https://gerrit.wikimedia.org/r/576143 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [22:58:23] security-deploying patch for T246602 now... [23:02:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:03:30] thanks sbassett, watching that spike. it happens during deployment [23:06:13] expecting recovery [23:07:38] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:08:41] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10Jclark-ctr) [23:09:36] 10Operations, 10ops-esams, 10Traffic: cp3057 crash (was: network down) - https://phabricator.wikimedia.org/T244127 (10RobH) [23:09:38] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) [23:10:00] 10Operations, 10ops-esams, 10Traffic: cp3057 crash (was: network down) - https://phabricator.wikimedia.org/T244127 (10RobH) Please note this may be fixed by T243167, which I'm doing (as time and esams condition permits.) [23:15:10] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:18:20] sbassett: is deployment still ongoing? [23:18:36] mutante: I don't think so. Scap completed successfully for me. [23:19:24] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:19:25] i was about to say exceptions are raised.. but right when i said it now seems to be over [23:19:39] sbassett: alright, recovered [23:20:06] mutante: Ok. Would be a bit surprised, since the patch added one line of sanitization to a js lib. 
[23:20:16] PROBLEM - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 108.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1 [23:20:35] I try to keep an eye on FatalMonitor whenever I deploy. [23:21:11] (03PS3) 10EBernhardson: mjolnir: Install python3.7 to older debian versions [puppet] - 10https://gerrit.wikimedia.org/r/576116 [23:21:20] sbassett: unfortunately it seems to happen on each deploy lately. just that there was more than one spike this time [23:21:26] (03PS2) 10Krinkle: Remove dead code from 'ChangeAuthenticationDataAudit' hook handler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575815 [23:21:33] (03PS3) 10Krinkle: Move Defines.php mock from tests/ to src/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576074 [23:22:01] (03PS3) 10Krinkle: multiversion: Optimise readDbListFile() function by 40% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576091 (https://phabricator.wikimedia.org/T169821) [23:22:09] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/21200/" [puppet] - 10https://gerrit.wikimedia.org/r/576160 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [23:22:12] (03PS3) 10Krinkle: multiversion: Remove support for passing file path to readDbListFile() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576097 [23:22:54] (03PS2) 10Dzahn: installserver/apt_repo: add homedir parameter, move dirs to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/576160 (https://phabricator.wikimedia.org/T224576) [23:24:01] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/576116 (owner: 10EBernhardson) [23:24:59] (03CR) 10EBernhardson: "updated PCC, still looks reasonable: https://puppet-compiler.wmflabs.org/compiler1001/21201/" [puppet] - 10https://gerrit.wikimedia.org/r/576116 (owner: 10EBernhardson) 
[23:28:13] (03CR) 10Krinkle: [C: 04-1] Enable lead paragraph in user namespace on nlwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562486 (https://phabricator.wikimedia.org/T242030) (owner: 10Ammarpad) [23:29:14] (03CR) 10Krinkle: [C: 03+2] Remove dead code from 'ChangeAuthenticationDataAudit' hook handler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575815 (owner: 10Krinkle) [23:30:17] (03PS1) 10RobH: adding new shipping sku [software] - 10https://gerrit.wikimedia.org/r/576163 [23:30:19] (03Merged) 10jenkins-bot: Remove dead code from 'ChangeAuthenticationDataAudit' hook handler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575815 (owner: 10Krinkle) [23:30:35] (03CR) 10Cwhite: [C: 03+1] "Nothing jumps out at me as potentially problematic." [puppet] - 10https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [23:30:40] (03CR) 10RobH: [C: 03+2] adding new shipping sku [software] - 10https://gerrit.wikimedia.org/r/576163 (owner: 10RobH) [23:33:20] (03PS7) 10Jdlrobson: Enable lead paragraph in user namespace on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562486 (https://phabricator.wikimedia.org/T242030) (owner: 10Ammarpad) [23:35:06] (03CR) 10Krinkle: [C: 03+2] "Verified on mwdebug1002 that it still works and still logs entries the same way to the same channel." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575815 (owner: 10Krinkle) [23:35:54] (03CR) 10Krinkle: Enable lead paragraph in user namespace on nlwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562486 (https://phabricator.wikimedia.org/T242030) (owner: 10Ammarpad) [23:35:56] Krinkle: Can I take over prod now, or do you have more to do? Want to switch the Parsoid cluster over to use the vendor.git checkout. 
[23:36:15] (03CR) 10Krinkle: [C: 03+2] Move Defines.php mock from tests/ to src/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576074 (owner: 10Krinkle) [23:36:30] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: I1bdc5e359476 (duration: 00m 56s) [23:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:46] James_F: Was gonna do a few more no-ops until Jon's swat. [23:36:50] (03PS8) 10Jdlrobson: Enable lead paragraph in user namespace on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562486 (https://phabricator.wikimedia.org/T242030) (owner: 10Ammarpad) [23:36:59] Ah, I wanted to get this done ahead of the SWAT. [23:37:10] I'll yield after this one then [23:37:15] Thank you! [23:37:18] (CC cscott.) [23:37:33] (03Merged) 10jenkins-bot: Move Defines.php mock from tests/ to src/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576074 (owner: 10Krinkle) [23:38:37] (03CR) 10Dzahn: "secring file is the same on install and apt servers after the change:" [puppet] - 10https://gerrit.wikimedia.org/r/576160 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [23:39:23] (03CR) 10Jdlrobson: [C: 03+1] Enable lead paragraph in user namespace on nlwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562486 (https://phabricator.wikimedia.org/T242030) (owner: 10Ammarpad) [23:39:35] (03PS7) 10Jforrester: Parsoid: Use the version of Parsoid in $IP/vendor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575336 (https://phabricator.wikimedia.org/T240055) [23:40:44] James_F: it's three syncs FYI [23:40:51] (03CR) 10Thcipriani: "> This "works", but it doesn't change the output from just doing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446507 (https://phabricator.wikimedia.org/T198844) (owner: 10Thcipriani) [23:41:28] !log krinkle@deploy1001 Synchronized src/: Idc26716abef5bff (duration: 00m 56s) [23:41:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:42] Krinkle: Plus touch+re-sync for IS. [23:41:49] I'm referring to mine. [23:42:01] Oh, yeah, yours doesn't touch IS this time. [23:42:39] !log krinkle@deploy1001 Synchronized multiversion/: Idc26716abef5bff (duration: 00m 57s) [23:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:56] !log krinkle@deploy1001 Synchronized docroot/noc/: Idc26716abef5bff (duration: 00m 56s) [23:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:50] James_F: all yours [23:45:08] Thanks. [23:45:19] (03CR) 10Jforrester: [C: 03+2] Parsoid: Use the version of Parsoid in $IP/vendor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575336 (https://phabricator.wikimedia.org/T240055) (owner: 10Jforrester) [23:46:16] (03Merged) 10jenkins-bot: Parsoid: Use the version of Parsoid in $IP/vendor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575336 (https://phabricator.wikimedia.org/T240055) (owner: 10Jforrester) [23:46:28] (03CR) 10Jforrester: Merge $wgLogo and $wgLogoHD into $wgLogos (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) (owner: 10Jforrester) [23:46:53] Okie-dokie. [23:46:59] cscott: I'm pulling to deployment now. [23:47:10] ok. [23:47:48] cscott: OK, can you test this on a parsoid cluster box? [23:49:26] is it live? or am i pulling to a specific canary? [23:49:53] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10Jclark-ctr) 05Open→03Resolved a:05Jclark-ctr→03Cmjohnson host name, Rack , Switch port wdqs1011: B5 ,34 wdqs1012: A5, 19 wdqs1013:C5 , 39 [23:50:22] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10Jclark-ctr) [23:50:29] cscott: I was suggesting pulling to a canary.
[23:50:43] (And checking that said canary still worked.) [23:50:50] ok, let me open up a couple of tabs to watch the canaries [23:51:08] It's "live" on mwdebug1001 and has precisely no impact, as expected. [23:51:24] we could also start by pulling to scandium, but i think we did that already [23:51:41] Well, yes, but we expect scandium to not work, right? [23:51:46] So it's not a great test. [23:52:42] i think scandium "doesn't work" in the sense that it won't run our code-under-test anymore, because it will start running the production code instead [23:52:58] so it's still a decent test of production, just not a good test for parsoid-next, until we re-hack it [23:53:02] No no. The patch changes the load path to one you've not created yet. [23:53:15] Oh, right, I see what you mean. [23:53:23] But it won't load the fake extension.json file. [23:53:52] we could also depool a node from the parsoid cluster, and use that as our test guinea pig [23:54:15] Reverting is quick. But sure. [23:54:23] the pool/depool command is in https://wikitech.wikimedia.org/wiki/Parsoid#Misc_stuff, but i've never done that before personally so i don't know if there are any other effects [23:55:47] (03PS1) 10Dzahn: installserver/apt: allow setting gpg_user different from reprepro user [puppet] - 10https://gerrit.wikimedia.org/r/576164 (https://phabricator.wikimedia.org/T224576) [23:55:56] i can depool a wtp host for you if you want [23:56:09] mutante: That'd be great. [23:56:13] cscott: ^^ [23:56:53] yeah. if you could check the command given in https://wikitech.wikimedia.org/wiki/Parsoid#Misc_stuff for accuracy while you're at it, that would be nice [23:57:01] wtp1025 ok ? [23:57:07] that's the first one [23:57:29] mutante: works for me, let me double check i have login perms there [23:57:52] i mean, i can scap there, i guess.
[23:57:59] sorry, still new to this [23:58:10] [cumin1001:~] $ sudo -i confctl select name=wtp1025.eqiad.wmnet get [23:58:13] {"wtp1025.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"} [23:58:16] {"wtp1025.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid-php"} [23:58:17] anyway, i've got grafana queued up watching wtp1025, so that should work fine. [23:58:19] that's the part to check if it's pooled. that works [23:58:28] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=wtp1025.eqiad.wmnet [23:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:48] {"wtp1025.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid"} [23:58:51] {"wtp1025.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=parsoid,service=parsoid-php"} [23:59:03] cscott: So just a simple `scap pull` in `/srv/mediawiki/` should work to get it up to date (do a `git log` to check). [23:59:31] (03CR) 10jerkins-bot: [V: 04-1] installserver/apt: allow setting gpg_user different from reprepro user [puppet] - 10https://gerrit.wikimedia.org/r/576164 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [23:59:38] cscott: you could probably just literally type "depool" on the host itself as well [23:59:53] but this is via cumin from cumin1001