[00:21:08] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 52.88 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[01:08:50] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 54 probes of 574 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:14:42] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 46 probes of 574 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:23:51] (PS2) Tim Starling: Explicitly set SwiftFileBackend timeouts [mediawiki-config] - https://gerrit.wikimedia.org/r/596538 (https://phabricator.wikimedia.org/T245170)
[01:24:01] (CR) Tim Starling: [C: +2] Explicitly set SwiftFileBackend timeouts [mediawiki-config] - https://gerrit.wikimedia.org/r/596538 (https://phabricator.wikimedia.org/T245170) (owner: Tim Starling)
[01:24:50] (Merged) jenkins-bot: Explicitly set SwiftFileBackend timeouts [mediawiki-config] - https://gerrit.wikimedia.org/r/596538 (https://phabricator.wikimedia.org/T245170) (owner: Tim Starling)
[01:29:52] (CR) Ammarpad: [C: +1] [eswiki] Normalize talk namespaces for Anexo, Portal and Wikiproyecto [mediawiki-config] - https://gerrit.wikimedia.org/r/600979 (https://phabricator.wikimedia.org/T254077) (owner: MarcoAurelio)
[02:11:04] PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:19:33] Operations, WMF-Design, Design: Create URL for Design blog - https://phabricator.wikimedia.org/T254118 (Prtksxna)
[03:20:24] Operations, WMF-Design, Design: Create sub-directory URL for Design blog (https://design.wikimedia.org/blog) - https://phabricator.wikimedia.org/T254118 (Prtksxna)
[03:47:57] Operations, Puppet, Traffic, Patch-For-Review, and 3 others: Deprecate `base::service_unit` in puppet - https://phabricator.wikimedia.org/T194724 (bd808)
[04:01:29] Puppet, Cloud-Services: mysql logs to unrotated file - https://phabricator.wikimedia.org/T97531 (bd808) Open→Declined 5 years of inactivity. Pretty sure this got fixed by someone at some point
[04:41:45] (PS1) Marostegui: dbproxy1018: Depool db1141 to take a binary dump [puppet] - https://gerrit.wikimedia.org/r/601138 (https://phabricator.wikimedia.org/T249188)
[04:43:39] (CR) Marostegui: [C: +2] dbproxy1018: Depool db1141 to take a binary dump [puppet] - https://gerrit.wikimedia.org/r/601138 (https://phabricator.wikimedia.org/T249188) (owner: Marostegui)
[04:44:08] !log Depool db1141 from Analytics role - T249188
[04:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:12] T249188: Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188
[04:54:35] !log Drop testreduce_0715 from m5 master T245408
[04:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:54:40] T245408: testreduce_vd database in m5 still in use?
- https://phabricator.wikimedia.org/T245408
[05:03:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool enwiki db2071 slave to test new index - T238966', diff saved to https://phabricator.wikimedia.org/P11338 and previous config saved to /var/cache/conftool/dbconfig/20200601-050354-marostegui.json
[05:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:59] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966
[05:04:26] PROBLEM - Check systemd state on labstore1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:05:38] PROBLEM - Check systemd state on labstore1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:08:00] Operations, SRE-Access-Requests: Give access to the Analytics Cluster to Research Intern (@YiJuLu) - https://phabricator.wikimedia.org/T254120 (diego)
[05:13:48] PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:14:57] Operations, Jupyter-Hub, LDAP-Access-Requests: Give access to the JupyterHub (SWAP) notebooks to (@YiJuLu ) - https://phabricator.wikimedia.org/T254121 (diego)
[06:21:48] (PS1) ArielGlenn: sdc dumps: add placeholder entry in dcat setup to avoid syntax errors [puppet] - https://gerrit.wikimedia.org/r/601162 (https://phabricator.wikimedia.org/T221917)
[06:22:43] Operations, Analytics: Increase memory available for an-launcher1001 - https://phabricator.wikimedia.org/T254125 (elukey) p:Triage→High
[06:22:46] Operations, Analytics: Increase memory available for an-launcher1001 - https://phabricator.wikimedia.org/T254125 (elukey) In theory it should be very simple: https://wikitech.wikimedia.org/wiki/Ganeti#Increase/Decrease_CPU/RAM
[06:29:09] (CR) ArielGlenn: [C: +2] sdc dumps: add placeholder entry in dcat setup to avoid syntax errors [puppet] - https://gerrit.wikimedia.org/r/601162 (https://phabricator.wikimedia.org/T221917) (owner: ArielGlenn)
[06:31:37] Operations, Phabricator, Security: peek is incorrectly configured to run every minute every 1st of the month, creating large amounts of cronspam - https://phabricator.wikimedia.org/T254127 (jcrespo)
[06:31:55] (PS1) Jcrespo: peek: Disable cron execution [puppet] - https://gerrit.wikimedia.org/r/601170 (https://phabricator.wikimedia.org/T254127)
[06:32:48] (CR) jerkins-bot: [V: -1] peek: Disable cron execution [puppet] - https://gerrit.wikimedia.org/r/601170 (https://phabricator.wikimedia.org/T254127) (owner: Jcrespo)
[06:33:42] Operations, Phabricator, Patch-For-Review, Security: peek is incorrectly configured to run every minute every 1st of the month, creating large amounts of cronspam - https://phabricator.wikimedia.org/T254127 (jcrespo)
[06:33:45] Operations, Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324
(jcrespo)
[06:35:20] (PS2) Jcrespo: peek: Disable cron execution [puppet] - https://gerrit.wikimedia.org/r/601170 (https://phabricator.wikimedia.org/T254127)
[06:41:46] thanks jy nus
[06:42:14] apergos: I need a +1
[06:42:34] checking
[06:43:18] (CR) Elukey: [C: +1] peek: Disable cron execution [puppet] - https://gerrit.wikimedia.org/r/601170 (https://phabricator.wikimedia.org/T254127) (owner: Jcrespo)
[06:43:20] (CR) ArielGlenn: [C: +1] "good enough for now" [puppet] - https://gerrit.wikimedia.org/r/601170 (https://phabricator.wikimedia.org/T254127) (owner: Jcrespo)
[06:43:31] heh
[06:43:54] (CR) Jcrespo: [C: +2] peek: Disable cron execution [puppet] - https://gerrit.wikimedia.org/r/601170 (https://phabricator.wikimedia.org/T254127) (owner: Jcrespo)
[06:44:27] Operations, Phabricator, Security: peek is incorrectly configured to run every minute every 1st of the month, creating large amounts of cronspam - https://phabricator.wikimedia.org/T254127 (DannyS712)
[06:53:18] (PS1) Jcrespo: peek: Reenable cron with the correct configuration [puppet] - https://gerrit.wikimedia.org/r/601173 (https://phabricator.wikimedia.org/T254127)
[06:55:31] Operations, Phabricator, Patch-For-Review, Security: peek is incorrectly configured to run every minute every 1st of the month, creating large amounts of cronspam - https://phabricator.wikimedia.org/T254127 (jcrespo) p:Triage→High 812 reports were sent today 1st June. The scheduling seem...
[06:56:24] (PS1) KartikMistry: Create URL campaign for African languages for COVID-19 translation project [mediawiki-config] - https://gerrit.wikimedia.org/r/601174 (https://phabricator.wikimedia.org/T253305)
[06:56:49] Operations, SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (YiJuLu)
[06:57:15] (CR) Jcrespo: [C: -1] "Needs more work than just a plain revert" [puppet] - https://gerrit.wikimedia.org/r/601173 (https://phabricator.wikimedia.org/T254127) (owner: Jcrespo)
[07:05:08] (PS1) Marostegui: mariadb: Productionize db1147 [puppet] - https://gerrit.wikimedia.org/r/601175 (https://phabricator.wikimedia.org/T252512)
[07:05:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1142 to clone db1147 T252512', diff saved to https://phabricator.wikimedia.org/P11339 and previous config saved to /var/cache/conftool/dbconfig/20200601-070519-marostegui.json
[07:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:23] T252512: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512
[07:06:02] (PS2) Marostegui: mariadb: Productionize db1147 [puppet] - https://gerrit.wikimedia.org/r/601175 (https://phabricator.wikimedia.org/T252512)
[07:10:56] (CR) Marostegui: [C: +2] mariadb: Productionize db1147 [puppet] - https://gerrit.wikimedia.org/r/601175 (https://phabricator.wikimedia.org/T252512) (owner: Marostegui)
[07:22:13] Operations, netops: cr1-codfw:fpc0 failure - https://phabricator.wikimedia.org/T254110 (ayounsi) > I checked the log messages and could see “I2c slave read back errors” on this FPC: > May 31 20:00:08 re0.cr1-codfw chassisd[4838]: CHASSISD_I2CS_READBACK_ERROR: Readback error from I2C slave forFPC 0 ([...
[07:31:41] (PS3) Elukey: web::fetches::analytics::job: do not rsync mediawiki if missing source [puppet] - https://gerrit.wikimedia.org/r/594773 (https://phabricator.wikimedia.org/T251858) (owner: Mforns)
[07:33:26] Operations, SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (Aklapper) Open→Stalled a:diego→None Hi @YiJuLu, thanks for taking the time to report this and welcome to Wikimedia Phabricator! (If you have any questions abou...
[07:36:08] Operations, SRE-Access-Requests: Give access to the Analytics Cluster to Research Intern (@YiJuLu) - https://phabricator.wikimedia.org/T254120 (Aklapper) @diego: Is this a duplicate of {T254130}?
[07:37:05] thanks for handling the peek cron fix. was looking and surprised the cron was already gone
[07:44:52] (CR) Dzahn: [C: +2] stdlib: fix "double quoted string" lint warnings [puppet] - https://gerrit.wikimedia.org/r/599689 (owner: Dzahn)
[07:47:09] Operations, ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T253715 (Dzahn) Resolved→Open
[07:48:24] Operations, ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T253715 (Dzahn) {F31849799}
[07:49:59] ACKNOWLEDGEMENT - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.54 and port 9042: Connection refused daniel_zahn https://phabricator.wikimedia.org/T253715 https://phabricator.wikimedia.org/T93886
[07:49:59] ACKNOWLEDGEMENT - cassandra-a SSL 10.192.48.54:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused daniel_zahn https://phabricator.wikimedia.org/T253715 https://phabricator.wikimedia.org/T120662
[07:49:59] ACKNOWLEDGEMENT - cassandra-a service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive daniel_zahn https://phabricator.wikimedia.org/T253715
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:49:59] ACKNOWLEDGEMENT - cassandra-b CQL 10.192.48.55:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.55 and port 9042: Connection refused daniel_zahn https://phabricator.wikimedia.org/T253715 https://phabricator.wikimedia.org/T93886
[07:49:59] ACKNOWLEDGEMENT - cassandra-b SSL 10.192.48.55:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused daniel_zahn https://phabricator.wikimedia.org/T253715 https://phabricator.wikimedia.org/T120662
[07:49:59] ACKNOWLEDGEMENT - cassandra-b service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive daniel_zahn https://phabricator.wikimedia.org/T253715 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:49:59] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.56 and port 9042: Connection refused daniel_zahn https://phabricator.wikimedia.org/T253715 https://phabricator.wikimedia.org/T93886
[07:50:00] ACKNOWLEDGEMENT - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused daniel_zahn https://phabricator.wikimedia.org/T253715 https://phabricator.wikimedia.org/T120662
[07:50:00] ACKNOWLEDGEMENT - cassandra-c service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive daniel_zahn https://phabricator.wikimedia.org/T253715 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:55:50] (PS4) Elukey: web::fetches::analytics::job: do not rsync mediawiki if missing source [puppet] - https://gerrit.wikimedia.org/r/594773 (https://phabricator.wikimedia.org/T251858) (owner: Mforns)
[07:57:09] (CR) jerkins-bot: [V: -1] web::fetches::analytics::job: do not rsync mediawiki if missing source [puppet] - https://gerrit.wikimedia.org/r/594773
(https://phabricator.wikimedia.org/T251858) (owner: Mforns)
[07:57:23] (PS5) Elukey: web::fetches::analytics::job: do not rsync mediawiki if missing source [puppet] - https://gerrit.wikimedia.org/r/594773 (https://phabricator.wikimedia.org/T251858) (owner: Mforns)
[07:58:40] (CR) jerkins-bot: [V: -1] web::fetches::analytics::job: do not rsync mediawiki if missing source [puppet] - https://gerrit.wikimedia.org/r/594773 (https://phabricator.wikimedia.org/T251858) (owner: Mforns)
[07:59:45] (PS6) Elukey: web::fetches::analytics::job: do not rsync mediawiki if missing source [puppet] - https://gerrit.wikimedia.org/r/594773 (https://phabricator.wikimedia.org/T251858) (owner: Mforns)
[08:00:30] Operations, SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (YiJuLu) Stalled→Open
[08:02:15] Operations, SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (YiJuLu)
[08:02:40] Operations, SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (YiJuLu)
[08:10:08] Operations, ops-eqiad, DBA, DC-Ops: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (jcrespo) @Jclark-ctr One last thing- this was not an issue for me because as I had remote login so I could fix it myself, but may be interesting for you: remote IPMI was disabled, so I...
[08:10:46] Operations, ops-eqiad, DBA, DC-Ops: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['db1140.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20200...
[08:12:45] Puppet, Toolforge, Patch-For-Review: role::puppetmaster::standalone clones Git repositories as gitpuppet, git-sync-upstream overwrites them as root - https://phabricator.wikimedia.org/T152059 (Aklapper) a:scfc→None Assignee has not been active since 2018 hence resetting task assignee.
[08:16:46] (CR) Marostegui: [C: +1] transfer.py: Use xtrabackup on path now that all packages are 10.1.43+ [puppet] - https://gerrit.wikimedia.org/r/599596 (owner: Jcrespo)
[08:19:11] !log disabling puppet on all db/es/pc hosts for deploy of gerrit:599596
[08:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:30] (PS9) Jcrespo: transfer.py: Use xtrabackup on path now that all packages are 10.1.43+ [puppet] - https://gerrit.wikimedia.org/r/599596
[08:22:36] !log mw1331 re-enabled puppet (SAL told me about an experiment a little while ago)
[08:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:10] (PS7) Elukey: web::fetches::analytics::job: do not rsync mediawiki if missing source [puppet] - https://gerrit.wikimedia.org/r/594773 (https://phabricator.wikimedia.org/T251858) (owner: Mforns)
[08:23:44] (CR) Jcrespo: [C: +2] transfer.py: Use xtrabackup on path now that all packages are 10.1.43+ [puppet] - https://gerrit.wikimedia.org/r/599596 (owner: Jcrespo)
[08:33:09] !log restart cr1-codfw:fpc0 - T254110
[08:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:15] T254110: cr1-codfw:fpc0 failure - https://phabricator.wikimedia.org/T254110
[08:39:33] Operations, netops: cr1-codfw:fpc0 failure - https://phabricator.wikimedia.org/T254110 (ayounsi) a:ayounsi→Papaul JTAC doesn't want to RMA it without a restart/reseat. Restart didn't help. @papaul can you re-seat cr1-codfw:fpc0 asap?
[08:40:13] (PS8) Elukey: web::fetches::analytics::job: do not rsync mediawiki if missing source [puppet] - https://gerrit.wikimedia.org/r/594773 (https://phabricator.wikimedia.org/T251858) (owner: Mforns)
[08:41:55] !log deneb - apt-get remove python3-debconf (package was in status "ri" causing DPKG icinga alert. ri means it should be removed but is not)
[08:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:37] !log deneb - apt-get remove --purge apt-listchanges (packages was in status "rc" causing DPKG alert, should be removed but config was not purged)
[08:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:59] RECOVERY - DPKG on deneb is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:50:45] (PS9) Elukey: web::fetches::analytics::job: do not rsync mediawiki if missing source [puppet] - https://gerrit.wikimedia.org/r/594773 (https://phabricator.wikimedia.org/T251858) (owner: Mforns)
[08:51:21] (PS1) Jbond: profile::idp: split memcached function to its on profile [puppet] - https://gerrit.wikimedia.org/r/601301 (https://phabricator.wikimedia.org/T233933)
[08:52:26] Operations, SRE-Access-Requests: Give access to the Analytics Cluster to Research Intern (@YiJuLu) - https://phabricator.wikimedia.org/T254120 (Aklapper)
[08:52:29] Operations, SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (Aklapper)
[08:53:52] Operations, ops-eqiad, DBA, DC-Ops: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (jcrespo) a:jcrespo→Jclark-ctr @Jclark-ctr Serial port redirection doesn't work. This is a blocker because I cannot read the console output on restart, and understand why it is n...
[08:54:39] (PS2) Jbond: profile::idp: split memcached function to its on profile [puppet] - https://gerrit.wikimedia.org/r/601301 (https://phabricator.wikimedia.org/T233933)
[08:56:44] (PS3) Jbond: profile::idp: split memcached function to its on profile [puppet] - https://gerrit.wikimedia.org/r/601301 (https://phabricator.wikimedia.org/T233933)
[08:57:11] Operations, SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (Aklapper) Open→Stalled @YiJuLu: This still lacks your public SSH key, and lacks acknowledgement that you signed L3. Please see the task description and my previous com...
[08:58:22] !log prometheus eqiad lvextend --resizefs --size +100G vg-ssd/prometheus-ops
[08:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:15] (CR) Jbond: [C: +2] profile::idp: split memcached function to its on profile [puppet] - https://gerrit.wikimedia.org/r/601301 (https://phabricator.wikimedia.org/T233933) (owner: Jbond)
[09:00:58] Operations, netops: cr1-codfw:fpc0 failure - https://phabricator.wikimedia.org/T254110 (ayounsi) a:Papaul→ayounsi Opened remote hands request instead T254136.
[09:02:33] PROBLEM - puppet last run on prometheus1003 is CRITICAL: CRITICAL: Puppet last ran 2 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:03:10] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission
[09:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:13] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[09:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:06] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission
[09:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:23] !log offline cr1-codfw:fpc0 - T254110
[09:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:28] T254110: cr1-codfw:fpc0 failure - https://phabricator.wikimedia.org/T254110
[09:05:39] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[09:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:45] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission
[09:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:18] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[09:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:43] (CR) Jbond: [C: +2] mcrouter: update ssl options if running on buster [puppet] - https://gerrit.wikimedia.org/r/599896 (owner: Jbond)
[09:08:21] RECOVERY - puppet last run on prometheus1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:08:55] (PS1) Filippo Giunchedi: Decom ms-be101[678] [puppet] - https://gerrit.wikimedia.org/r/601303 (https://phabricator.wikimedia.org/T252008)
[09:10:20] (PS1) Filippo Giunchedi: Decom ms-be101[678] [dns] -
https://gerrit.wikimedia.org/r/601304 (https://phabricator.wikimedia.org/T252008)
[09:11:13] (CR) Filippo Giunchedi: [C: +2] Decom ms-be101[678] [puppet] - https://gerrit.wikimedia.org/r/601303 (https://phabricator.wikimedia.org/T252008) (owner: Filippo Giunchedi)
[09:11:26] (CR) Filippo Giunchedi: [C: +2] Decom ms-be101[678] [dns] - https://gerrit.wikimedia.org/r/601304 (https://phabricator.wikimedia.org/T252008) (owner: Filippo Giunchedi)
[09:11:31] (PS2) Filippo Giunchedi: Decom ms-be101[678] [dns] - https://gerrit.wikimedia.org/r/601304 (https://phabricator.wikimedia.org/T252008)
[09:13:16] RECOVERY - Check systemd state on idp-test2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:16:37] Operations, ops-eqiad, SRE-swift-storage, decommission: decommission ms-be101[678] - https://phabricator.wikimedia.org/T252008 (fgiunchedi) a:Cmjohnson @Cmjohnson (or @Jclark-ctr) these hosts are ready for decom, thanks!
[09:16:50] godog: do you have other hosts to decom by any chance?
[09:17:02] I was looking for test-hosts for a patch I have to merge since last week...
[09:18:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1147 to dbctl, depooled T252512', diff saved to https://phabricator.wikimedia.org/P11341 and previous config saved to /var/cache/conftool/dbconfig/20200601-091809-marostegui.json
[09:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:14] T252512: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512
[09:18:35] volans: not at the moment no, those three are all I had
[09:18:37] Operations, ops-codfw, DBA: db2075 failed to boot kernel 2/3 tries, please upgrade firmware/BIOS to mitigate - https://phabricator.wikimedia.org/T254139 (jcrespo)
[09:18:45] ok thanks
[09:18:53] np
[09:19:29] (PS1) Marostegui: db1147: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/601305 (https://phabricator.wikimedia.org/T252512)
[09:21:03] (CR) Marostegui: [C: +2] db1147: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/601305 (https://phabricator.wikimedia.org/T252512) (owner: Marostegui)
[09:22:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1142, db1147 T252512', diff saved to https://phabricator.wikimedia.org/P11342 and previous config saved to /var/cache/conftool/dbconfig/20200601-092220-marostegui.json
[09:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:35] Operations, SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (YiJuLu) Stalled→Open
[09:24:27] (PS1) Phedenskog: icinga: Add alerts for synthetic testing for search pages [puppet] - https://gerrit.wikimedia.org/r/601307 (https://phabricator.wikimedia.org/T246416)
[09:26:00] (CR) Elukey: [C: +1] "thanks!"
[dns] - https://gerrit.wikimedia.org/r/599779 (https://phabricator.wikimedia.org/T253173) (owner: Jbond)
[09:26:27] RECOVERY - Memcached on idp-test2001 is OK: TCP OK - 0.037 second response time on 208.80.153.25 port 11000 https://wikitech.wikimedia.org/wiki/Memcached
[09:26:29] !log reenabling puppet on all db/es/pc hosts after deploy of gerrit:599596
[09:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:22] Operations, SRE-Access-Requests: Requesting access to Analytics Cluster for Yi-Ju,Lu - https://phabricator.wikimedia.org/T254130 (YiJuLu)
[09:28:41] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/22904/labstore1006.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/594773 (https://phabricator.wikimedia.org/T251858) (owner: Mforns)
[09:28:51] (CR) Filippo Giunchedi: [C: +2] prometheus: move analytics to profile [puppet] - https://gerrit.wikimedia.org/r/599342 (https://phabricator.wikimedia.org/T252186) (owner: Filippo Giunchedi)
[09:29:04] (CR) Volans: [C: +2] "Tested with prod Netbox (RO), the generation looks fine, merging. I'll address any followup comment in a separate CR."
[software/netbox-extras] - https://gerrit.wikimedia.org/r/599948 (https://phabricator.wikimedia.org/T233183) (owner: Volans)
[09:30:02] elukey: merging your change too
[09:30:05] <3
[09:30:30] !log volans@cumin1001 START - Cookbook sre.dns.netbox
[09:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:40] Operations, ops-codfw, DBA: db2075 failed to boot kernel 2/3 tries, please upgrade firmware/BIOS to mitigate - https://phabricator.wikimedia.org/T254139 (Marostegui) p:Triage→Medium a:Papaul
[09:37:13] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:23] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 54 probes of 574 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:40:44] (PS1) Jbond: profile::idp: use mcrouter port [puppet] - https://gerrit.wikimedia.org/r/601315 (https://phabricator.wikimedia.org/T233933)
[09:43:01] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 47 probes of 574 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:43:23] (CR) Jbond: [C: +1] "LGTM" [software/homer] - https://gerrit.wikimedia.org/r/599897 (https://phabricator.wikimedia.org/T253795) (owner: Volans)
[09:44:01] (CR) Jbond: [C: +2] profile::idp: use mcrouter port [puppet] - https://gerrit.wikimedia.org/r/601315 (https://phabricator.wikimedia.org/T233933) (owner: Jbond)
[09:45:12] RECOVERY - Check systemd state on labstore1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:46:31] (PS1) Elukey: dumos::web::fetches::analytics::job: fix rsync bash script [puppet] - https://gerrit.wikimedia.org/r/601316 (https://phabricator.wikimedia.org/T251858)
[09:47:49] (PS2) Elukey: dumps::web::fetches::analytics::job: fix rsync bash script [puppet] - https://gerrit.wikimedia.org/r/601316 (https://phabricator.wikimedia.org/T251858)
[09:48:00] (CR) Jbond: [C: +2] AAAA: for flerovium and furud [dns] - https://gerrit.wikimedia.org/r/599779 (https://phabricator.wikimedia.org/T253173) (owner: Jbond)
[09:48:14] (PS3) Jbond: AAAA: for flerovium and furud [dns] - https://gerrit.wikimedia.org/r/599779 (https://phabricator.wikimedia.org/T253173)
[09:49:07] (CR) Elukey: [C: +2] dumps::web::fetches::analytics::job: fix rsync bash script [puppet] - https://gerrit.wikimedia.org/r/601316 (https://phabricator.wikimedia.org/T251858) (owner: Elukey)
[09:51:18] (CR) Jbond: [C: +2] profile/manifests/dumps: enable ipv6 drop ferm rule for 443 [puppet] - https://gerrit.wikimedia.org/r/599783 (https://phabricator.wikimedia.org/T253173) (owner: Jbond)
[09:56:06] (PS1) Filippo Giunchedi: role: use_remote_address: true for Envoy on Thanos [puppet] - https://gerrit.wikimedia.org/r/601320 (https://phabricator.wikimedia.org/T233956)
[09:58:21] Operations, observability, User-jbond: Write ulogd logs to a dedicated logfile - https://phabricator.wikimedia.org/T238414 (Aklapper)
[09:58:54] Operations, Phabricator, Security-Team, Security, User-jbond: Adjust onboarding/offboarding logic to accommodate changes to #security (now acl*security) - https://phabricator.wikimedia.org/T245771 (Aklapper)
[09:59:57] (PS1) Jbond: IPv6: add AAAA records for francium & htmldumper1001 [dns] - https://gerrit.wikimedia.org/r/601321 (https://phabricator.wikimedia.org/T233933)
[10:01:29] (CR) Ayounsi: "Tested with a config load error for a diff and commit action. Works as expected."
[software/homer] - https://gerrit.wikimedia.org/r/599897 (https://phabricator.wikimedia.org/T253795) (owner: Volans)
[10:04:07] (CR) Filippo Giunchedi: [C: +2] role: use_remote_address: true for Envoy on Thanos [puppet] - https://gerrit.wikimedia.org/r/601320 (https://phabricator.wikimedia.org/T233956) (owner: Filippo Giunchedi)
[10:09:50] Operations, Analytics: Increase memory available for an-launcher1001 - https://phabricator.wikimedia.org/T254125 (elukey) This is the current status for eqiad: ` elukey@ganeti1003:~$ sudo gnt-node list Node DTotal DFree MTotal MNode MFree Pinst Sinst ganeti1001.eqiad.wmnet 707.4G 9....
[10:10:44] (PS3) Volans: mgmt: use netbox-generated data for ulsfo [dns] - https://gerrit.wikimedia.org/r/585545 (https://phabricator.wikimedia.org/T233183)
[10:18:41] (PS1) Jbond: profile::idp::memcached: mcrouter also needs to own the cert [puppet] - https://gerrit.wikimedia.org/r/601325 (https://phabricator.wikimedia.org/T233933)
[10:21:23] (CR) Jbond: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/598065 (owner: Volans)
[10:21:42] (CR) Jbond: [C: +2] profile::idp::memcached: mcrouter also needs to own the cert [puppet] - https://gerrit.wikimedia.org/r/601325 (https://phabricator.wikimedia.org/T233933) (owner: Jbond)
[10:23:08] (CR) Volans: [C: +2] Improve error catching [software/homer] - https://gerrit.wikimedia.org/r/599897 (https://phabricator.wikimedia.org/T253795) (owner: Volans)
[10:26:08] (Merged) jenkins-bot: Improve error catching [software/homer] - https://gerrit.wikimedia.org/r/599897 (https://phabricator.wikimedia.org/T253795) (owner: Volans)
[10:27:23] (PS1) Filippo Giunchedi: prometheus: enable Thanos upload for analytics [puppet] - https://gerrit.wikimedia.org/r/601326 (https://phabricator.wikimedia.org/T252186)
[10:29:28] (PS1) Jbond: role:idp* add arguments for default memcached port [puppet] - https://gerrit.wikimedia.org/r/601327
(https://phabricator.wikimedia.org/T233933) [10:30:04] jan_drewniak: Time to snap out of that daydream and deploy Wikimedia Portals Update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200601T1030). [10:31:11] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601328 (https://phabricator.wikimedia.org/T128546) [10:31:41] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/22905/" [puppet] - 10https://gerrit.wikimedia.org/r/601326 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [10:34:01] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601328 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:34:51] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601328 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:38:14] hello anybody here, I have an un-deployed change on mediawiki config. https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/596538/ Is it ok to go ahead with my update? [10:39:44] @TimStarling ^ is T245170 [10:39:45] T245170: Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 [10:41:27] it's 8:45pm for Tim at the moment, probably not online [10:41:49] Sorry, don't know people's time zones [10:42:14] I share the tz, which is why I know [10:44:33] OK, well the portals deployment only ever touches the portals directory, so I don't think the previous change will have any effect.
[10:46:38] but I guess something to know about during SWAT [10:48:11] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:601328| Bumping portals to master (601328)]] (duration: 01m 03s) [10:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:10] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:601328| Bumping portals to master (601328)]] (duration: 00m 59s) [10:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:45] (03CR) 10Jbond: [C: 03+2] role:idp* add arguments for default memcached port [puppet] - 10https://gerrit.wikimedia.org/r/601327 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [10:57:30] 10Operations, 10Phabricator, 10Security-Team, 10Security, 10User-jbond: Adjust onboarding/offboarding logic to accommodate changes to #security (now acl*security) - https://phabricator.wikimedia.org/T245771 (10jbond) 05Open→03Resolved a:03jbond Resolve as per Moritz comment [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200601T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. 
[11:00:48] (03CR) 10Jbond: [C: 03+1] "LGTM would be great to have this ported to the new API and some spec test" [puppet] - 10https://gerrit.wikimedia.org/r/600928 (owner: 10Cwhite) [11:13:28] (03PS1) 10Volans: scripts: refactor management interface [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/601334 [11:13:52] (03CR) 10jerkins-bot: [V: 04-1] scripts: refactor management interface [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/601334 (owner: 10Volans) [11:13:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge-k8s: proposing removing hostkey checking for the upgrades [puppet] - 10https://gerrit.wikimedia.org/r/599472 (https://phabricator.wikimedia.org/T246122) (owner: 10Bstorm) [11:15:36] (03PS2) 10Volans: scripts: refactor management interface [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/601334 [11:22:11] 10Operations, 10Phabricator, 10Security-Team, 10Patch-For-Review, 10Security: peek is incorrectly configured to run every minute every 1st of the month, creating large amounts of cronspam - https://phabricator.wikimedia.org/T254127 (10chasemp) a:03chasemp Thanks for disabling @jcrespo, sorry for the av... 
[11:30:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1142, db1147 T252512', diff saved to https://phabricator.wikimedia.org/P11343 and previous config saved to /var/cache/conftool/dbconfig/20200601-113032-marostegui.json [11:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:38] T252512: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 [11:43:15] (03CR) 10Dzahn: [C: 03+1] "yep, confirmed these are already on the interface and the right rows" [dns] - 10https://gerrit.wikimedia.org/r/601321 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [11:44:26] (03CR) 10Jbond: [C: 03+2] IPv6: add AAAA records for francium & htmldumper1001 [dns] - 10https://gerrit.wikimedia.org/r/601321 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [11:44:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1142, db1147 T252512', diff saved to https://phabricator.wikimedia.org/P11344 and previous config saved to /var/cache/conftool/dbconfig/20200601-114440-marostegui.json [11:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:44] T252512: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 [11:45:49] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/551882 (owner: 10Dzahn) [11:50:06] (03CR) 10Dzahn: "issue is now because "notes_url" contains 2 URLs and not just one... ''https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link' " [puppet] - 10https://gerrit.wikimedia.org/r/551882 (owner: 10Dzahn) [11:50:51] 10Operations, 10ops-esams, 10netops: Amsterdam maintenance (June 2020) - https://phabricator.wikimedia.org/T254021 (10ayounsi) Scheduling it for this Wednesday June 3rd at 6am UTC, 2h window for a 1h work. 
[11:54:28] (03PS10) 10Dzahn: monitoring: add data types to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/551882 [11:56:31] (03CR) 10jerkins-bot: [V: 04-1] monitoring: add data types to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/551882 (owner: 10Dzahn) [11:59:50] (03CR) 10Ayounsi: [C: 03+1] "LGTM!" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/601334 (owner: 10Volans) [12:00:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1142, db1147 T252512', diff saved to https://phabricator.wikimedia.org/P11345 and previous config saved to /var/cache/conftool/dbconfig/20200601-120000-marostegui.json [12:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] T252512: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 [12:04:44] (03PS11) 10Dzahn: monitoring: add data types to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/551882 [12:06:47] (03CR) 10jerkins-bot: [V: 04-1] monitoring: add data types to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/551882 (owner: 10Dzahn) [12:11:25] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: Add alerts for synthetic testing for search pages [puppet] - 10https://gerrit.wikimedia.org/r/601307 (https://phabricator.wikimedia.org/T246416) (owner: 10Phedenskog) [12:15:14] (03PS12) 10Dzahn: monitoring: add data types to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/551882 [12:17:17] (03CR) 10jerkins-bot: [V: 04-1] monitoring: add data types to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/551882 (owner: 10Dzahn) [12:19:12] (03PS13) 10Dzahn: monitoring: add data types to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/551882 [12:26:40] (03CR) 10Volans: [C: 03+2] scripts: refactor management interface [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/601334 (owner: 10Volans) [12:43:24] (03PS1) 10QChris: profile,gerrit: Add backup of 
Gerrit's LFS files [puppet] - 10https://gerrit.wikimedia.org/r/601341 [12:46:09] (03PS14) 10Dzahn: monitoring: add data types to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/551882 [12:46:15] (03CR) 10QChris: [C: 04-1] profile,gerrit: Add backup of Gerrit's LFS files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/601341 (owner: 10QChris) [12:47:31] (03CR) 10Dzahn: profile,gerrit: Add backup of Gerrit's LFS files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/601341 (owner: 10QChris) [12:50:23] (03CR) 10Dzahn: monitoring: add data types to monitoring::service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/551882 (owner: 10Dzahn) [12:56:34] (03PS1) 10Dzahn: site: add new POP install servers with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/601342 (https://phabricator.wikimedia.org/T252526) [12:57:58] 10Operations, 10WMF-Design, 10Design: Create sub-directory URL for Design blog (https://design.wikimedia.org/blog) - https://phabricator.wikimedia.org/T254118 (10Dzahn) a:03Dzahn [12:58:24] 10Operations, 10ops-eqiad, 10DBA, 10Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (10Marostegui) @Jclark-ctr could you confirm if you want to do this maintenance today Monday 1st June or tomorrow Tuesday 2nd June? [13:04:41] 10Operations, 10WMF-Design, 10Design: Create sub-directory URL for Design blog (https://design.wikimedia.org/blog) - https://phabricator.wikimedia.org/T254118 (10Dzahn) Hi @Prtksxna I just cloned https://gerrit.wikimedia.org/r/design/blog to my laptop. Is that the actual content repo that should be copied t... 
[13:07:27] (03PS1) 10Dzahn: microsites::design: add blog subdirectory [puppet] - 10https://gerrit.wikimedia.org/r/601344 (https://phabricator.wikimedia.org/T254118) [13:10:54] 10Operations, 10ops-eqiad, 10DBA, 10Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (10Marostegui) John confirmed via IRC that the maintenance will be done on Tuesday - thank you! [13:11:00] (03PS2) 10QChris: profile,gerrit: Add backup of Gerrit's LFS files [puppet] - 10https://gerrit.wikimedia.org/r/601341 (https://phabricator.wikimedia.org/T254155) [13:14:14] (03CR) 10QChris: "I just did a replace in files to find `srv-gerrit-git` and the" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/601341 (https://phabricator.wikimedia.org/T254155) (owner: 10QChris) [13:17:48] (03PS4) 10Dzahn: add IPs for installservers in POPs [dns] - 10https://gerrit.wikimedia.org/r/599883 (https://phabricator.wikimedia.org/T252526) [13:18:56] (03CR) 10Dzahn: [C: 03+2] profile,gerrit: Add backup of Gerrit's LFS files [puppet] - 10https://gerrit.wikimedia.org/r/601341 (https://phabricator.wikimedia.org/T254155) (owner: 10QChris) [13:22:04] (03CR) 10Dzahn: "> I just did a replace in files to find `srv-gerrit-git` and the" [puppet] - 10https://gerrit.wikimedia.org/r/601341 (https://phabricator.wikimedia.org/T254155) (owner: 10QChris) [13:23:08] (03PS1) 10Filippo Giunchedi: swift: refactor stats_reporter into a profile [puppet] - 10https://gerrit.wikimedia.org/r/601346 (https://phabricator.wikimedia.org/T252186) [13:24:03] (03CR) 10Dzahn: "@backup1001:/etc/bacula/jobs.d# cat gerrit1001.wikimedia.org-gerrit-repo-data-Hourly-Sun-production.conf" [puppet] - 10https://gerrit.wikimedia.org/r/601341 (https://phabricator.wikimedia.org/T254155) (owner: 10QChris) [13:27:15] (03Abandoned) 10Paladox: phabricator: Install the php zip extension [puppet] - 10https://gerrit.wikimedia.org/r/594157 (owner: 10Paladox) [13:27:33] sorry about that jan_drewniak 
[13:27:48] (03CR) 10Filippo Giunchedi: "PCC as expected, on hosts where the stats_reporter isn't active now (e.g. ms-fe2006.codfw.wmnet) new resources are defined but absented" [puppet] - 10https://gerrit.wikimedia.org/r/601346 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [13:28:12] I was going to push that out but got distracted [13:28:29] either way, it should have no effect but is not urgent [13:34:37] 10Operations, 10Traffic, 10observability, 10Performance-Team (Radar), 10Sustainability (Incident Prevention): Document and/or improve navigation of the various HTTP frontend Grafana dashboards - https://phabricator.wikimedia.org/T253655 (10fgiunchedi) Indeed, thanks for filing this @Krinkle ! My two cent... [13:36:02] 10Operations, 10Jupyter-Hub, 10LDAP-Access-Requests: Give access to the JupyterHub (SWAP) notebooks to (@YiJuLu ) - https://phabricator.wikimedia.org/T254121 (10Dzahn) Hi @KFrancis Could you confirm whether @YiJuLu is already covered by an NDA like it was the case for research interns in T252129#6131756 an... [13:37:32] (03CR) 10Alexandros Kosiaris: "This might make upgrading stdlib in the future more difficult. We 've followed such an approach in the past for a while, and we needed a n" [puppet] - 10https://gerrit.wikimedia.org/r/599689 (owner: 10Dzahn) [13:38:03] 10Operations, 10Analytics, 10Product-Analytics, 10SRE-Access-Requests: Hive access for Sam Patton - https://phabricator.wikimedia.org/T248097 (10Dzahn) Hi, gentle prod. Any updates on this? It has been open for a while now. 
[13:40:34] (03PS1) 10Dzahn: Revert "stdlib: fix "double quoted string" lint warnings" [puppet] - 10https://gerrit.wikimedia.org/r/601349 [13:41:33] (03Abandoned) 10Paladox: Phabricator: support undef as a value to ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/594107 (owner: 10Paladox) [13:42:08] (03PS8) 10Paladox: Gerrit: Update soy templates for gerrit 2.16 [puppet] - 10https://gerrit.wikimedia.org/r/473264 [13:42:33] (03PS14) 10Paladox: Gerrit: Convert CoC and Privacy links to use the new PolyGerrit extension point [puppet] - 10https://gerrit.wikimedia.org/r/520295 [13:42:45] (03PS7) 10Paladox: Gerrit: Migrate theme to support Polymer 2 [puppet] - 10https://gerrit.wikimedia.org/r/539180 (https://phabricator.wikimedia.org/T227509) [13:43:15] (03CR) 10QChris: "Seeing" [puppet] - 10https://gerrit.wikimedia.org/r/601341 (https://phabricator.wikimedia.org/T254155) (owner: 10QChris) [13:44:53] (03CR) 10Dzahn: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/601341 (https://phabricator.wikimedia.org/T254155) (owner: 10QChris) [13:45:41] (03CR) 10QChris: "Nice! Thanks. You're awesome!" 
[puppet] - 10https://gerrit.wikimedia.org/r/601341 (https://phabricator.wikimedia.org/T254155) (owner: 10QChris) [13:46:50] (03PS7) 10Privacybatm: Write documentation using Sphinx [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) [13:47:41] (03CR) 10Jcrespo: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/601341 (https://phabricator.wikimedia.org/T254155) (owner: 10QChris) [13:47:43] (03Abandoned) 10Gergő Tisza: [WIP] Add sentry-phabricator package [software/sentry] - 10https://gerrit.wikimedia.org/r/227931 (https://phabricator.wikimedia.org/T97136) (owner: 10Gergő Tisza) [13:47:50] (03Abandoned) 10Gergő Tisza: LDAP support [software/sentry] - 10https://gerrit.wikimedia.org/r/240949 (https://phabricator.wikimedia.org/T97133) (owner: 10Gergő Tisza) [13:48:10] (03Abandoned) 10Gergő Tisza: [WIP] Configure LDAP plugin [puppet] - 10https://gerrit.wikimedia.org/r/241578 (owner: 10Gergő Tisza) [13:48:41] (03CR) 10QChris: "> Note that won't mean full backups have run- those will not happen" [puppet] - 10https://gerrit.wikimedia.org/r/601341 (https://phabricator.wikimedia.org/T254155) (owner: 10QChris) [13:49:32] (03CR) 10Jcrespo: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/601341 (https://phabricator.wikimedia.org/T254155) (owner: 10QChris) [13:49:37] (03CR) 10Filippo Giunchedi: [C: 03+1] "FWIW, having the swift-object-expirer running in the cluster would be good before deploying this: https://phabricator.wikimedia.org/T22958" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles) [13:49:42] (03CR) 10Privacybatm: "I have resolved the issues with this patch. Now `tox -e py37-sphinx` would generate the HTML docs inside the transferpy/doc/.build folder." 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598295 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [13:50:03] (03CR) 10jerkins-bot: [V: 04-1] Set expiry headers on thumbnails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles) [13:52:20] 10Operations, 10Analytics, 10Traffic: missing wmf_netflow data, 18:30-19:00 May 31 - https://phabricator.wikimedia.org/T254161 (10CDanis) [13:56:17] (03CR) 10Jbond: "lgtm, few nits thanks" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/551882 (owner: 10Dzahn) [14:00:00] 10Operations: Why do we have 2 sets of squid proxies? - https://phabricator.wikimedia.org/T254011 (10akosiaris) I don't have an "official" answer, but I always treated them as follows: * url-downloader => the proxy for applications/services to reach resources on the public internet * webproxy => the proxies the... [14:35:47] 10Operations, 10Analytics, 10Event-Platform, 10Services (watching): Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (10Ottomata) FYI, we've recently added a 'general.yaml' values support to our helm charts repo. This allows us to render values from puppet. I'd like to accom... [14:36:38] 10Operations, 10Analytics: Increase memory available for an-launcher1001 - https://phabricator.wikimedia.org/T254125 (10akosiaris) >>! In T254125#6181321, @elukey wrote: > This is the current status for eqiad: > > ` > elukey@ganeti1003:~$ sudo gnt-node list > Node DTotal DFree MTotal MNode... 
[14:36:48] (03PS2) 10Ottomata: Remove ganglia configs from cdh and jmxtrans modules [puppet] - 10https://gerrit.wikimedia.org/r/599415 (https://phabricator.wikimedia.org/T253555) [14:37:26] (03CR) 10Ottomata: "> Patch Set 6: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/599389 (owner: 10Elukey) [14:37:49] (03CR) 10Ottomata: "Oh I see" [puppet] - 10https://gerrit.wikimedia.org/r/599389 (owner: 10Elukey) [14:38:45] 10Operations, 10Analytics, 10Event-Platform, 10Services (watching): Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (10Pchelolo) cc @hnowlan let's see what @Ottomata gets to and see if we can incorporate it into changeprop [14:42:50] (03CR) 10Volans: [C: 03+2] "Already reviewed in PS2, PS3 just changes the commit message." [dns] - 10https://gerrit.wikimedia.org/r/585545 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [14:43:00] (03PS4) 10Volans: mgmt: use netbox-generated data for ulsfo [dns] - 10https://gerrit.wikimedia.org/r/585545 (https://phabricator.wikimedia.org/T233183) [14:44:42] !log deploying ulsfo mgmt DNS records automatically generated by Netbox ( operations/dns/+/585545/ ) - T233183 [14:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:46] T233183: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 [14:44:50] 10Operations, 10Privacy Engineering, 10Research, 10Security-Team, and 2 others: wikiworkshop.org has Facebook button, external statcounter, https to http redirect - https://phabricator.wikimedia.org/T251732 (10JFishback_WMF) [14:46:02] (03CR) 10Alexandros Kosiaris: [C: 03+1] add loki 1.5.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597317 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [14:47:41] (03CR) 10Ottomata: [C: 03+2] Remove ganglia configs from cdh and jmxtrans modules [puppet] - 10https://gerrit.wikimedia.org/r/599415 
(https://phabricator.wikimedia.org/T253555) (owner: 10Ottomata) [14:51:46] (03CR) 10Herron: [C: 03+1] "LGTM overall, couple of small things" [puppet] - 10https://gerrit.wikimedia.org/r/601346 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [14:52:47] (03CR) 10Alexandros Kosiaris: [C: 03+1] changeprop: Migrate to common_templates 0.2 tls_helper [deployment-charts] - 10https://gerrit.wikimedia.org/r/600862 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [14:53:44] 10Operations, 10Analytics: Increase memory available for an-launcher1001 - https://phabricator.wikimedia.org/T254125 (10elukey) ` elukey@ganeti1003:~$ sudo gnt-instance modify -B memory=12g an-launcher1001.eqiad.wmnet Modified instance an-launcher1001.eqiad.wmnet - be/memory -> 12288 Please don't forget that... [14:53:57] !log ganeti: increase memory available for an-launcher1001 from 8g to 12g - T254125 [14:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:01] T254125: Increase memory available for an-launcher1001 - https://phabricator.wikimedia.org/T254125 [14:58:12] (03CR) 10Cwhite: [C: 03+2] hiera: install mtail 3.0.0~rc35 from component in ulsfo and codfw [puppet] - 10https://gerrit.wikimedia.org/r/599473 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [14:59:16] (03PS2) 10Filippo Giunchedi: swift: refactor stats_reporter into a profile [puppet] - 10https://gerrit.wikimedia.org/r/601346 (https://phabricator.wikimedia.org/T252186) [15:01:19] (03PS1) 10Cmjohnson: Adding all dns entries for relforge1003/1004. 
removed old asset tag entry [dns] - 10https://gerrit.wikimedia.org/r/601356 (https://phabricator.wikimedia.org/T241791) [15:01:21] (03CR) 10Herron: "I'm in support of this, but not experienced enough here to weigh in on the technical details" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597317 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [15:03:24] (03CR) 10Cmjohnson: [C: 03+2] Adding all dns entries for relforge1003/1004. removed old asset tag entry [dns] - 10https://gerrit.wikimedia.org/r/601356 (https://phabricator.wikimedia.org/T241791) (owner: 10Cmjohnson) [15:05:59] 10Operations, 10WMF-Design, 10Design, 10Patch-For-Review: Create sub-directory URL for Design blog (https://design.wikimedia.org/blog) - https://phabricator.wikimedia.org/T254118 (10Iniquity) I think you also need to add a link to the main page :) [15:06:13] 10Operations, 10Phabricator, 10Security-Team, 10Patch-For-Review, 10Security: peek is incorrectly configured to run every minute every 1st of the month, creating large amounts of cronspam - https://phabricator.wikimedia.org/T254127 (10chasemp) p:05High→03Medium [15:09:38] 10Operations, 10Peek, 10Phabricator, 10Security-Team, 10Patch-For-Review: peek is incorrectly configured to run every minute every 1st of the month, creating large amounts of cronspam - https://phabricator.wikimedia.org/T254127 (10chasemp) [15:10:10] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: TBD) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10Cmjohnson) [15:10:26] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: TBD) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10Cmjohnson) [15:11:15] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): (Need by: TBD) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10Cmjohnson) Network switches have 
been updated but I disabled the ports until after the bios is set up and ready for imaging. [15:18:07] (03PS1) 10Ottomata: Fix schema uri for SearchSatisfaction on testwiki for wgEventLoggingSchemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601360 (https://phabricator.wikimedia.org/T249261) [15:23:00] (03PS1) 10CDanis: limit per-user Special:Contributions concurrency to 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601361 (https://phabricator.wikimedia.org/T234450) [15:36:16] (03PS2) 10Ottomata: [WIP] Add eventlogging_legacy job to refine EventLogging events from EventGate [puppet] - 10https://gerrit.wikimedia.org/r/593610 (https://phabricator.wikimedia.org/T249261) [15:41:38] (03PS1) 10Cmjohnson: Adding all dns entries for thanos-fe100[123] [dns] - 10https://gerrit.wikimedia.org/r/601363 (https://phabricator.wikimedia.org/T251620) [15:43:59] (03CR) 10Cmjohnson: [C: 03+2] Adding all dns entries for thanos-fe100[123] [dns] - 10https://gerrit.wikimedia.org/r/601363 (https://phabricator.wikimedia.org/T251620) (owner: 10Cmjohnson) [15:53:52] (03CR) 10EBernhardson: "Could this also be causing T254058 ?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601360 (https://phabricator.wikimedia.org/T249261) (owner: 10Ottomata) [16:01:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (NEED BY: ASAP) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (10Cmjohnson) [16:03:11] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (NEED BY: ASAP) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (10Cmjohnson) Added to network switches and put in disabled vlan until bios is set up and ready for imaging. 
[16:12:40] 10Operations, 10Gerrit: Initial backup run for Gerrit LFS data - https://phabricator.wikimedia.org/T254162 (10jcrespo) a:03jcrespo [16:12:47] 10Operations, 10Gerrit: Initial backup run for Gerrit LFS data - https://phabricator.wikimedia.org/T254162 (10jcrespo) p:05Triage→03Medium [16:16:50] (03PS2) 10CDanis: esams-offline: route heavy EU bytes users to codfw [dns] - 10https://gerrit.wikimedia.org/r/574493 [16:18:36] 10Operations, 10Gerrit: Initial backup run for Gerrit LFS data - https://phabricator.wikimedia.org/T254162 (10jcrespo) Running: ` 232157 Back Full 0 0 gerrit1001.wikimedia.org-Hourly-Sun-production-gerrit-repo-data is running ` [16:20:49] 10Operations, 10Gerrit: Initial backup run for Gerrit LFS data - https://phabricator.wikimedia.org/T254162 (10jcrespo) I have a few things I would like to test/ask about this. * First, I would like to test a recovery, if possible to make sure the backup worked as intended. * I also have a few questions regar... [16:27:20] 10Operations, 10Gerrit: Initial backup run for Gerrit LFS data - https://phabricator.wikimedia.org/T254162 (10greg) >>! In T254162#6182368, @jcrespo wrote: > * I also have a few questions regarding recovery of gerrit- in terms of consistency and disaster recovery model, but not sure if this task would be the r... 
[16:28:18] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Idle - Telia, AS1299/IPv6: Idle - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:28:54] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:29:05] (03PS2) 10Ppchelko: changeprop-jobqueue: enable all high-traffic jobs, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/599383 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [16:39:47] !log running view updates on db1141 T252219 [16:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:51] T252219: Drop MCR-obsoleted fields from the wiki replicas - https://phabricator.wikimedia.org/T252219 [16:40:00] 10Operations, 10Gerrit: Initial backup run for Gerrit LFS data - https://phabricator.wikimedia.org/T254162 (10jcrespo) ` *llist jobid=232157 JobId: 232,157 Job: gerrit1001.wikimedia.org-Hourly-Sun-production-gerrit-repo-data.2020-06-01_16.14.26_23 Name: gerrit1001.wikime... [16:43:02] 10Operations, 10DC-Ops, 10Traffic: Fix recdns config on various hardware devices - https://phabricator.wikimedia.org/T254178 (10BBlack) [16:43:27] (03PS6) 10Ppchelko: Session Store: Switch group2 to kask-transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570395 (https://phabricator.wikimedia.org/T243106) [16:46:52] (03CR) 10Ottomata: [C: 03+2] "Def related, but not quite this. Looking into it, thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/601360 (https://phabricator.wikimedia.org/T249261) (owner: 10Ottomata) [16:48:28] !log otto@deploy1001 sync-file aborted: EventLogging - fix searchsatisfaction schema URI - testwiki only - T249261 (duration: 00m 02s) [16:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:32] T249261: Vertical: Migrate SearchSatisfaction EventLogging event stream to Event Platform - https://phabricator.wikimedia.org/T249261 [16:49:32] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: EventLogging - fix searchsatisfaction schema URI - testwiki only - T249261 (duration: 00m 59s) [16:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:40] (03PS1) 10Alexandros Kosiaris: Initial debianization [debs/kubeyaml] - 10https://gerrit.wikimedia.org/r/601372 [16:58:00] 10Operations, 10ORES, 10Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (10Halfak) @Gilles and @elukey, you've both offered to help us out here. I'm really excited that we may be able... [16:59:55] (03PS1) 10Bstorm: dumps-distribution: don't monitory systemd directly for paging [puppet] - 10https://gerrit.wikimedia.org/r/601374 [17:00:04] gehel and onimisionipe: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200601T1700). 
[17:01:31] (03PS1) 10Alexandros Kosiaris: ci: Add kubeyaml [puppet] - 10https://gerrit.wikimedia.org/r/601376 [17:02:16] (03PS2) 10Bstorm: dumps-distribution: don't monitory systemd directly for paging [puppet] - 10https://gerrit.wikimedia.org/r/601374 [17:04:02] (03CR) 10jerkins-bot: [V: 04-1] ci: Add kubeyaml [puppet] - 10https://gerrit.wikimedia.org/r/601376 (owner: 10Alexandros Kosiaris) [17:10:33] (03CR) 10Dzahn: [C: 03+2] Revert "stdlib: fix "double quoted string" lint warnings" [puppet] - 10https://gerrit.wikimedia.org/r/601349 (owner: 10Dzahn) [17:11:33] (03CR) 10Dzahn: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/599689 (owner: 10Dzahn) [17:12:04] (03CR) 10Jforrester: "recheck" [debs/kubeyaml] - 10https://gerrit.wikimedia.org/r/601372 (owner: 10Alexandros Kosiaris) [17:12:21] (03CR) 10jerkins-bot: [V: 04-1] Initial debianization [debs/kubeyaml] - 10https://gerrit.wikimedia.org/r/601372 (owner: 10Alexandros Kosiaris) [17:15:30] (03PS3) 10Bstorm: dumps-distribution: don't monitor systemd directly for paging [puppet] - 10https://gerrit.wikimedia.org/r/601374 [17:21:49] (03CR) 10Dzahn: "@qchris I hear this will be a requirement for the upgrade." [puppet] - 10https://gerrit.wikimedia.org/r/539180 (https://phabricator.wikimedia.org/T227509) (owner: 10Paladox) [17:22:30] (03CR) 10Dzahn: "@qchris This one needs to be merged right after the upgrade." [puppet] - 10https://gerrit.wikimedia.org/r/473264 (owner: 10Paladox) [17:23:18] 10Operations, 10ops-eqiad: Degraded RAID on restbase-dev1004 - https://phabricator.wikimedia.org/T253607 (10Cmjohnson) the AHS log has been uploaded to HP per their request. Looks like we're going to be okay with the warranty thing [17:30:35] 10Operations, 10WMF-Design, 10Design, 10Patch-For-Review: Create sub-directory URL for Design blog (https://design.wikimedia.org/blog) - https://phabricator.wikimedia.org/T254118 (10Dzahn) @Iniquity What would we link to ? 
[17:31:25] !log backup1001 - queued job 42 - gerrit backup after renaming of the file set and addition of LFS data (T254155, T254162) it is incremental, the full one already ran [17:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:30] T254155: Make sure we backup Gerrit's LFS data - https://phabricator.wikimedia.org/T254155 [17:31:30] T254162: Initial backup run for Gerrit LFS data - https://phabricator.wikimedia.org/T254162 [17:41:29] (03CR) 10Dzahn: monitoring: add data types to monitoring::service (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/551882 (owner: 10Dzahn) [17:42:04] (03PS15) 10Dzahn: monitoring: add data types to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/551882 [17:46:26] !log update mtail in ulsfo caching hosts. restarting atsmtail and varnishmtail [17:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:54] (03PS1) 10Cmjohnson: Adding all dns for thanos-be100[1-4] [dns] - 10https://gerrit.wikimedia.org/r/601383 (https://phabricator.wikimedia.org/T251618) [17:47:07] !log turn online cr1-codfw:fpc0 - T254110 [17:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:11] T254110: cr1-codfw:fpc0 failure - https://phabricator.wikimedia.org/T254110 [17:48:46] XioNoX: did smart hands get around to it already? [17:48:58] cdanis: yeah, so quick! 
[17:49:02] not bad 👍 [17:49:08] (03CR) 10Cmjohnson: [C: 03+2] Adding all dns for thanos-be100[1-4] [dns] - 10https://gerrit.wikimedia.org/r/601383 (https://phabricator.wikimedia.org/T251618) (owner: 10Cmjohnson) [17:51:14] (03CR) 10Herron: [C: 03+1] swift: refactor stats_reporter into a profile [puppet] - 10https://gerrit.wikimedia.org/r/601346 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [17:56:32] (03PS2) 10Alexandros Kosiaris: Initial debianization [debs/kubeyaml] - 10https://gerrit.wikimedia.org/r/601372 [17:56:46] (03CR) 10jerkins-bot: [V: 04-1] Initial debianization [debs/kubeyaml] - 10https://gerrit.wikimedia.org/r/601372 (owner: 10Alexandros Kosiaris) [17:57:07] 10Operations, 10WMF-Design, 10Design, 10Patch-For-Review: Create sub-directory URL for Design blog (https://design.wikimedia.org/blog) - https://phabricator.wikimedia.org/T254118 (10Iniquity) No matter, I missed. Sorry :) [17:59:26] 10Operations, 10netops: cr1-codfw:fpc0 failure - https://phabricator.wikimedia.org/T254110 (10ayounsi) The linecard went through those states: `0 Present Testing` `0 Offline ---Unresponsive---` `0 Present Absent` `0 Offline ---Unresponsive---` And seems to be flapping betw... [17:59:41] !log offline cr1-codfw:fpc0 - T254110 [17:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:45] T254110: cr1-codfw:fpc0 failure - https://phabricator.wikimedia.org/T254110 [18:00:04] RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Morning SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200601T1800). [18:00:04] bpirkle and Pchelolo: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:01:08] We're gonna be doing a pretty big config change, if anyone needs to use SWAT window please go ahead of us [18:01:12] Pchelolo: /me around :-) [18:01:20] cool. [18:01:29] akosiaris: I'll keep you posted on our progress [18:02:22] interestingly, mediawiki latencies are increasing right now. [18:02:29] started 7 mins ago [18:02:35] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-30m&to=now&refresh=1m [18:02:51] p95s have jumped almost 3fold ... sigh [18:03:10] 10Operations, 10netops: cr1-codfw:fpc0 failure - https://phabricator.wikimedia.org/T254110 (10ayounsi) Followed up with Juniper and requested a RMA. [18:03:28] akosiaris: oh.. does this consitute an abort in our plans? [18:03:38] no, not yet, this isn't unheard of [18:04:21] ok, we'll be prepping all dashboards etc in the meantime [18:04:55] databases look fine, but mcrouter get(s) are up from ~250k to 400k... what on earth [18:09:02] (03PS7) 10Ppchelko: Session Store: Switch group2 to kask-transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570395 (https://phabricator.wikimedia.org/T243106) [18:09:04] (03PS1) 10Bstorm: toolsdb: add another temprorary filter [puppet] - 10https://gerrit.wikimedia.org/r/601385 (https://phabricator.wikimedia.org/T253738) [18:09:50] akosiaris: we're ready around here. do you think we can/should proceed? [18:10:09] (03CR) 10Bstorm: "The reason for this one is the log:" [puppet] - 10https://gerrit.wikimedia.org/r/601385 (https://phabricator.wikimedia.org/T253738) (owner: 10Bstorm) [18:10:24] (03PS3) 10Alexandros Kosiaris: Initial debianization [debs/kubeyaml] - 10https://gerrit.wikimedia.org/r/601372 [18:10:49] (03CR) 10Dzahn: [C: 04-1] "> Currently we have auth setup for it, so even to get the most simple resource, like 'https://ganeti01.svc.eqiad.wmnet:5080/v2/' we need t" [puppet] - 10https://gerrit.wikimedia.org/r/589608 (owner: 10Dzahn) [18:10:50] Pchelolo: yes, it looks like it. 
We are slightly increased on latency but it's dying down [18:10:56] ok. [18:11:04] (03CR) 10Ppchelko: [C: 03+2] Session Store: Switch group2 to kask-transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570395 (https://phabricator.wikimedia.org/T243106) (owner: 10Ppchelko) [18:11:06] (03CR) 10jerkins-bot: [V: 04-1] Initial debianization [debs/kubeyaml] - 10https://gerrit.wikimedia.org/r/601372 (owner: 10Alexandros Kosiaris) [18:11:47] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) rack/setup/install thanos-be100[123] - https://phabricator.wikimedia.org/T251618 (10Cmjohnson) [18:11:52] (03Merged) 10jenkins-bot: Session Store: Switch group2 to kask-transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570395 (https://phabricator.wikimedia.org/T243106) (owner: 10Ppchelko) [18:12:12] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) rack/setup/install thanos-be100[123] - https://phabricator.wikimedia.org/T251618 (10Cmjohnson) network switches are updated but they are disabled until I set up bios and image [18:15:17] 10Operations, 10netops: cr1-codfw:fpc0 failure - https://phabricator.wikimedia.org/T254110 (10ayounsi) @RobH to be ahead of Juniper, they will need the following to issue the RMA and know where to ship the part. Note that the task is public, feel free to make it SRE only or email me the info. Linecard replacem... [18:18:27] akosiaris: got the change on mwdebug1002, still logged in, looks good. 
proceeding [18:18:38] cool [18:20:01] (03PS4) 10Alexandros Kosiaris: Initial debianization [debs/kubeyaml] - 10https://gerrit.wikimedia.org/r/601372 [18:20:43] (03CR) 10jerkins-bot: [V: 04-1] Initial debianization [debs/kubeyaml] - 10https://gerrit.wikimedia.org/r/601372 (owner: 10Alexandros Kosiaris) [18:21:03] (03CR) 10Bstorm: [C: 03+2] toolsdb: add another temprorary filter [puppet] - 10https://gerrit.wikimedia.org/r/601385 (https://phabricator.wikimedia.org/T253738) (owner: 10Bstorm) [18:21:57] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:570395|Enable kask-transition for all wikis]] (duration: 01m 00s) [18:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:22] akosiaris: it's out [18:23:40] (03PS3) 10Dzahn: ganeti: add monitoring for ganeti RAPI [puppet] - 10https://gerrit.wikimedia.org/r/589608 [18:23:41] Pchelolo: cool, I can see the increased rps [18:23:49] nothing yet on the mediawiki side, which is good [18:23:57] yup, same here [18:24:33] kask memory and CPU usage are increasing, up to now as expected [18:24:35] akosiaris: https://grafana.wikimedia.org/d/000001590/sessionstore?panelId=30&fullscreen&orgId=1 - that's pretty high utilization [18:25:32] Pchelolo: limit is at 2.5 [18:25:38] oh ok... [18:25:47] hehe :) I thought it was out of 1 [18:27:08] right now about 0.6 is user, 0.5 is system (syscalls, interrupts etc) [18:27:51] so we maxed out at ~15K rps... [18:28:10] slighly more, 16.5K, but nice [18:28:25] so, why do we have reqs in codfw? https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&var-dc=codfw%20prometheus%2Fk8s&var-service=sessionstore [18:28:35] not a lot, but I haven't expected that [18:29:28] that's.... 
werid [18:29:30] weird* [18:30:59] Pchelolo: fwiw, mediawiki seems to have been pretty ignorant of all of this during this whole time [18:31:14] akosiaris: yeah, that's what we've hoped for right :) [18:31:14] heck, latency is dropping during the deploy [18:31:20] 10Operations, 10Jupyter-Hub, 10LDAP-Access-Requests: Give access to the JupyterHub (SWAP) notebooks to (@YiJuLu ) - https://phabricator.wikimedia.org/T254121 (10KFrancis) @Dzahn As @YiJulu was hired as an intern through Upwork, their NDA would be covered by the fully executed intern agreement. [18:31:49] THere's a 1h expiration TTL, so things will move into kask over the hour gradually [18:31:56] as, the traffic will not grow [18:32:09] but the traffic to redis should steadily drop [18:32:53] \o/, finally.. [18:33:15] so, those requests to codfw sessionstore? all seem to be resulting to 404s, so not logged in users [18:34:16] do we have some code path that is active/active? and we did not know about it? what on earth... [18:35:19] about 55 for appservers and 45 for apiservers [18:35:21] weird [18:35:58] Pchelolo: I have a feeling that we are going to have a "TIL" moment today or one of the next ones [18:36:27] anyway, deploy seems to have gone pretty well [18:37:02] yeah. [18:37:12] the TTLs should be expiring gradually [18:37:22] at least if I understand things correctly [18:37:28] so there shouldn't be a ttl moment [18:37:40] I'll be keeping an eye on things [18:37:53] me too. 
https://grafana.wikimedia.org/d/000001590/sessionstore?panelId=25&fullscreen&orgId=1&from=now-30m&to=now&refresh=1m is mildly worrying [18:38:52] we have heardroom, https://grafana.wikimedia.org/d/000001590/sessionstore?panelId=68&fullscreen&orgId=1&from=now-30m&to=now&refresh=1m puts it at 180Mi with the limit at 300Mi [18:38:59] headroom* [18:39:06] but I better keep an eye out [18:40:01] 10Operations, 10Jupyter-Hub, 10LDAP-Access-Requests: Give access to the JupyterHub (SWAP) notebooks to (@YiJuLu ) - https://phabricator.wikimedia.org/T254121 (10Dzahn) Alright, thanks @KFrancis [18:42:05] 10Operations, 10Jupyter-Hub, 10LDAP-Access-Requests: Give access to the JupyterHub (SWAP) notebooks to (@YiJuLu ) - https://phabricator.wikimedia.org/T254121 (10Dzahn) @YiJuLu Hi, do you already have a developer account at wikitech wiki? If not, can you please create one at https://wikitech.wikimedia.org/w/i... [18:43:50] 10Operations, 10netops: cr1-codfw:fpc0 failure - https://phabricator.wikimedia.org/T254110 (10RobH) https://netbox.wikimedia.org/dcim/sites/esams/ point of contact is just 'iron mountain shipping' and the generic contact number. That has all the info I think you are requesting? We will need to open in inbo... [18:48:08] (03PS1) 10Jforrester: Drop scap plugins, moved into scap proper [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601388 (https://phabricator.wikimedia.org/T248490) [18:48:23] (03CR) 10Jforrester: [C: 04-2] Drop scap plugins, moved into scap proper [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601388 (https://phabricator.wikimedia.org/T248490) (owner: 10Jforrester) [18:49:14] I think memory usage stabilized [18:49:34] * akosiaris gonna give it another 10m [18:49:37] 10Operations, 10netops: cr1-codfw:fpc0 failure - https://phabricator.wikimedia.org/T254110 (10RobH) IRC update: codfw not esams, heh.. https://netbox.wikimedia.org/dcim/sites/codfw/ I would list the generic info for their NOC though, which I'll append to that entry now. 
[18:49:43] (03CR) 10Ppchelko: [C: 04-2] "until dependency is deployed at least on group0 and we're sure we're not reverting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599150 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [18:53:37] 10Operations, 10ops-eqiad, 10DC-Ops: decom cobalt - https://phabricator.wikimedia.org/T236187 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson This was out there lingering without the property tags. I found it and will finish the decom process. [18:54:19] 10Operations, 10SRE-Access-Requests: Adding Italian Wikinews to Google Search Console to add it to Google News - https://phabricator.wikimedia.org/T253988 (10Dzahn) Hi @Ferdi2005 the entire wikinews.org already has a google-site-verification TXT record in DNS. So that part is not needed. Once i logged in with... [18:54:52] (03CR) 10CRusnov: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/589608 (owner: 10Dzahn) [18:59:46] (03CR) 10RLazarus: [C: 03+1] remove production IPs of mw2163 through mw2172 [dns] - 10https://gerrit.wikimedia.org/r/599610 (https://phabricator.wikimedia.org/T247018) (owner: 10Dzahn) [19:00:39] (03CR) 10RLazarus: [C: 03+1] remove production IPs of mw2150 through mw2162 [dns] - 10https://gerrit.wikimedia.org/r/599614 (https://phabricator.wikimedia.org/T247018) (owner: 10Dzahn) [19:01:27] Pchelolo: I call it a full success. I 'll monitor it another 10m, but kudos. Excellent work! [19:01:55] thank you. 
We have one more step to do - clean up the config and remove redis fallback [19:02:02] we will do that on Wednesday [19:02:07] sounds good to me [19:02:07] same SWAT window [19:05:47] (03PS5) 10Alexandros Kosiaris: Initial debianization [debs/kubeyaml] - 10https://gerrit.wikimedia.org/r/601372 [19:06:50] (03CR) 10jerkins-bot: [V: 04-1] Initial debianization [debs/kubeyaml] - 10https://gerrit.wikimedia.org/r/601372 (owner: 10Alexandros Kosiaris) [19:08:16] 10Operations, 10Analytics, 10Product-Analytics, 10SRE-Access-Requests: Hive access for Sam Patton - https://phabricator.wikimedia.org/T248097 (10mpopov) I've reached out to @spatton on Slack about this. [19:15:23] (03CR) 10RLazarus: [C: 03+1] site: add new appservers mw2335 through mw2339 [puppet] - 10https://gerrit.wikimedia.org/r/599749 (https://phabricator.wikimedia.org/T241852) (owner: 10Dzahn) [19:23:00] (03CR) 10RLazarus: [C: 03+1] site: decom mw2180 through mw2186 [puppet] - 10https://gerrit.wikimedia.org/r/599606 (https://phabricator.wikimedia.org/T247018) (owner: 10Dzahn) [19:25:21] (03CR) 10QChris: [C: 04-1] "> @qchris This one needs to be merged right after the upgrade." [puppet] - 10https://gerrit.wikimedia.org/r/473264 (owner: 10Paladox) [19:27:44] (03CR) 10RLazarus: [C: 03+1] site: decom mw2173 through mw2179 [puppet] - 10https://gerrit.wikimedia.org/r/599603 (https://phabricator.wikimedia.org/T247018) (owner: 10Dzahn) [19:35:56] (03CR) 10RLazarus: [C: 03+1] "The httpbb change LGTM, thanks for updating it." [puppet] - 10https://gerrit.wikimedia.org/r/524088 (https://phabricator.wikimedia.org/T187716) (owner: 10MaxSem) [19:37:32] 10Operations, 10Gerrit: Initial backup run for Gerrit LFS data - https://phabricator.wikimedia.org/T254162 (10QChris) @jcrespo Thanks for the initial run! > We should try restoring it, give me a path you have access to to do so. gerrit1002 (gerrit-test) is a bit scarce on disk space, so what about gerrit2001... 
[19:41:09] 10Operations, 10vm-requests, 10Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (10Peachey88) [19:44:34] !log restart atsmtail in eqsin [19:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:03] (03PS7) 10Jdlrobson: Use AddFooterLink hook for code of conduct and contact links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596277 (https://phabricator.wikimedia.org/T251817) [19:52:50] (03CR) 10Jdlrobson: [C: 03+1] "This can now be deployed now that 1.35.0-wmf.34 is live." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596277 (https://phabricator.wikimedia.org/T251817) (owner: 10Jdlrobson) [19:55:07] !log fail vrrp over cr3-ulsfo - T237575 [19:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:01] !log disable IX4/6 on cr4-ulsfo - T237575 [19:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:05] halfak and accraze: Time to snap out of that daydream and deploy Services – Graphoid / Citoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200601T2000). 
[20:01:00] (03PS1) 10Cwhite: Revert "hiera: install mtail 3.0.0~rc35 from component in ulsfo and codfw" [puppet] - 10https://gerrit.wikimedia.org/r/601407 [20:02:17] on a call with equinix to turn the new ulsfo peering port/circuit up [20:02:23] should be no impact [20:03:41] (03CR) 10Cwhite: [C: 03+2] Revert "hiera: install mtail 3.0.0~rc35 from component in ulsfo and codfw" [puppet] - 10https://gerrit.wikimedia.org/r/601407 (owner: 10Cwhite) [20:04:42] (03PS1) 10Gergő Tisza: Enable GrowthExperiments guidance everywhere behind feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601409 (https://phabricator.wikimedia.org/T253794) [20:12:29] !log enable IX4/6 on cr4-ulsfo - T237575 [20:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:13] !log downgrade mtail to rc5 in ulsfo -- T254192 [20:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:17] T254192: mtail rc35 stops incrementing atsmtail counters - https://phabricator.wikimedia.org/T254192 [20:19:36] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:32:06] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.0 200 OK - 23561 bytes in 0.272 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:32:07] (03CR) 10Dave Pifke: Set expiry headers on thumbnails (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles) [20:53:38] (03PS1) 10Dave Pifke: Swift object servers: enable object-expirer [puppet] - 10https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584) [20:54:20] 10Operations, 10Analytics, 10Discovery, 10Recommendation-API, 10Patch-For-Review: Run 
swift-object-expirer as part of the swift cluster - https://phabricator.wikimedia.org/T229584 (10dpifke) a:03dpifke [20:56:43] XioNoX: cool [20:57:22] 1.1Gbit already [20:57:56] (03PS1) 10Cwhite: varnish: add support for additional mtail args and set disable_fsnotify in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/601430 (https://phabricator.wikimedia.org/T254192) [21:00:04] Reedy and sbassett: Time to snap out of that daydream and deploy Weekly Security deployment window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200601T2100). [21:04:48] (03PS1) 10Volans: mgmt: use netbox-generated data for eqsin mgmt [dns] - 10https://gerrit.wikimedia.org/r/601434 (https://phabricator.wikimedia.org/T233183) [21:14:23] (03CR) 10Catrope: "Needs to wait until 1.35.0-wmf.35 is deployed everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601409 (https://phabricator.wikimedia.org/T253794) (owner: 10Gergő Tisza) [21:14:30] (03CR) 10Catrope: [C: 03+1] Enable GrowthExperiments guidance everywhere behind feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/601409 (https://phabricator.wikimedia.org/T253794) (owner: 10Gergő Tisza) [21:15:03] (03PS1) 10Cwhite: profile: add additional mtail args support and set disable_fsnotify in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/601436 (https://phabricator.wikimedia.org/T254192) [21:24:37] (03PS1) 10Cwhite: profile: add additional flags support for atsmtail and disable_fsnotify in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/601440 (https://phabricator.wikimedia.org/T251466) [21:54:19] 10Operations, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10AntiCompositeNumber) Is there still work that needs to be done to prepare for this, or is it just waiting for someone to get around to handling the necessary configuration changes,... 
[22:01:54] (03PS6) 10EBernhardson: Remove duplication and improve clarity in role::wdqs [puppet] - 10https://gerrit.wikimedia.org/r/598884 [22:01:56] (03PS3) 10EBernhardson: query_service: Move shared config into common file [puppet] - 10https://gerrit.wikimedia.org/r/599145 [22:01:58] (03PS5) 10EBernhardson: Consolidate query_service profile duplication [puppet] - 10https://gerrit.wikimedia.org/r/599146 [22:02:00] (03PS13) 10EBernhardson: Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) [22:02:41] (03CR) 10Cwhite: "PCC checks out: https://puppet-compiler.wmflabs.org/compiler1003/22909/" [puppet] - 10https://gerrit.wikimedia.org/r/601440 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [22:05:14] (03CR) 10jerkins-bot: [V: 04-1] Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) (owner: 10EBernhardson) [22:08:12] (03CR) 10Cwhite: "PCC checks out: https://puppet-compiler.wmflabs.org/compiler1001/22912/" [puppet] - 10https://gerrit.wikimedia.org/r/601436 (https://phabricator.wikimedia.org/T254192) (owner: 10Cwhite) [22:10:31] (03PS2) 10Cwhite: varnish: add support for additional mtail args and set disable_fsnotify in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/601430 (https://phabricator.wikimedia.org/T254192) [22:25:10] (03CR) 10Cwhite: "PCC checks out, although seems to not show the bin/varnishmtail utility updates: https://puppet-compiler.wmflabs.org/compiler1001/22913/" [puppet] - 10https://gerrit.wikimedia.org/r/601430 (https://phabricator.wikimedia.org/T254192) (owner: 10Cwhite) [22:25:38] 10Operations, 10observability, 10Patch-For-Review: mtail rc35 stops incrementing atsmtail counters - https://phabricator.wikimedia.org/T254192 (10colewhite) [22:29:45] (03PS1) 10CDanis: textfile exporter for php-fpm worker pool status [puppet] - 10https://gerrit.wikimedia.org/r/601454 (https://phabricator.wikimedia.org/T252605) 
[22:30:14] (03PS2) 10CDanis: textfile exporter for php-fpm worker pool status [puppet] - 10https://gerrit.wikimedia.org/r/601454 (https://phabricator.wikimedia.org/T252605) [22:31:04] (03PS7) 10EBernhardson: Remove duplication and improve clarity in role::wdqs [puppet] - 10https://gerrit.wikimedia.org/r/598884 [22:34:27] (03CR) 10EBernhardson: "pcc looks reasonable: https://puppet-compiler.wmflabs.org/compiler1001/22915/" [puppet] - 10https://gerrit.wikimedia.org/r/598884 (owner: 10EBernhardson) [22:34:35] (03PS3) 10CDanis: textfile exporter for php-fpm worker pool status [puppet] - 10https://gerrit.wikimedia.org/r/601454 (https://phabricator.wikimedia.org/T252605) [22:35:19] (03Abandoned) 10EBernhardson: [DNM] Demonstrate problem with wdqs hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/598886 (owner: 10EBernhardson) [22:37:03] (03PS4) 10EBernhardson: query_service: Move shared config into common file [puppet] - 10https://gerrit.wikimedia.org/r/599145 [22:40:44] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:44:17] (03CR) 10EBernhardson: "pcc looks reasonable: https://puppet-compiler.wmflabs.org/compiler1001/22917/" [puppet] - 10https://gerrit.wikimedia.org/r/599145 (owner: 10EBernhardson) [22:44:22] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:44:57] (03PS6) 10EBernhardson: Consolidate query_service profile duplication [puppet] - 10https://gerrit.wikimedia.org/r/599146 [22:47:18] (03PS4) 10CDanis: textfile exporter for php-fpm worker pool status [puppet] - 10https://gerrit.wikimedia.org/r/601454 (https://phabricator.wikimedia.org/T252605) [22:47:20] (03PS1) 10CDanis: Systemd::Servicename: make it reflect reality e.g. php7.2-fpm [puppet] - 10https://gerrit.wikimedia.org/r/601460 [22:48:54] jouncebot: next [22:48:54] In 0 hour(s) and 11 minute(s): Evening SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200601T2300) [23:00:05] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200601T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. 
[23:01:33] (03PS9) 10Jeena Huneidi: Automate deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/597653 (https://phabricator.wikimedia.org/T253264) [23:02:01] (03PS10) 10Jeena Huneidi: Automate deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/597653 (https://phabricator.wikimedia.org/T253264) [23:02:39] (03PS2) 10Cwhite: profile: add additional flags support for atsmtail and disable_fsnotify in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/601440 (https://phabricator.wikimedia.org/T254192) [23:03:32] (03CR) 10Cwhite: [V: 03+2 C: 03+2] add loki 1.5.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597317 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [23:03:45] (03PS9) 10Cwhite: add loki 1.5.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597317 (https://phabricator.wikimedia.org/T222826) [23:03:54] (03CR) 10Cwhite: [V: 03+2 C: 03+2] add loki 1.5.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597317 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [23:03:58] (03CR) 10CDanis: "PCC looks reasonable https://puppet-compiler.wmflabs.org/compiler1002/22919/" [puppet] - 10https://gerrit.wikimedia.org/r/601454 (https://phabricator.wikimedia.org/T252605) (owner: 10CDanis) [23:04:00] (03CR) 10EBernhardson: "pcc: https://puppet-compiler.wmflabs.org/compiler1003/22918/" [puppet] - 10https://gerrit.wikimedia.org/r/599146 (owner: 10EBernhardson) [23:07:24] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/601346 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [23:09:12] (03PS5) 10Cwhite: wmflib: add systemd.timer onCalendar support to cron_splay [puppet] - 10https://gerrit.wikimedia.org/r/600928 (https://phabricator.wikimedia.org/T210818) [23:10:13] (03CR) 10Ryan Kemper: [C: 03+2] [wdqs] fix DCAT-AP reload and load it to the categories endpoint [puppet] - 10https://gerrit.wikimedia.org/r/596655 
(owner: 10DCausse) [23:12:03] (03Abandoned) 10Ryan Kemper: sre.wdqs.data-transfer: fix syntax, simplify rule [cookbooks] - 10https://gerrit.wikimedia.org/r/595649 (https://phabricator.wikimedia.org/T206951) (owner: 10Ryan Kemper) [23:16:07] 10Operations, 10observability, 10Patch-For-Review: mtail rc35 stops incrementing atsmtail counters - https://phabricator.wikimedia.org/T254192 (10colewhite) Rolled back codfw and ulsfo, but left eqsin for testing the `-disable_fsnotify` flag. There doesn't appear to be any user-facing impact aside from metr... [23:19:29] (03PS14) 10EBernhardson: Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) [23:21:40] (03CR) 10jerkins-bot: [V: 04-1] Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) (owner: 10EBernhardson) [23:24:32] (03PS7) 10EBernhardson: Consolidate query_service profile duplication [puppet] - 10https://gerrit.wikimedia.org/r/599146 [23:27:32] (03PS15) 10EBernhardson: Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) [23:29:49] (03CR) 10jerkins-bot: [V: 04-1] Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) (owner: 10EBernhardson) [23:29:56] PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [23:52:05] 10Operations, 10ops-ulsfo: remove cr4-ulsfo:xe-0/1/1 - https://phabricator.wikimedia.org/T254206 (10RobH) p:05Triage→03Medium [23:52:14] 10Operations, 10ops-ulsfo: remove cr4-ulsfo:xe-0/1/1 - https://phabricator.wikimedia.org/T254206 (10RobH) [23:56:37] (03PS3) 10Krinkle: Enable "coalesceKeys"="non-global" for WANCache on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598852 (owner: 10Aaron Schulz) [23:59:09] (03CR) 10Krinkle: [C: 03+2] Enable "coalesceKeys"="non-global" for WANCache on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598852 (owner: 10Aaron Schulz) [23:59:54] (03Merged) 10jenkins-bot: Enable "coalesceKeys"="non-global" for WANCache on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598852 (owner: 10Aaron Schulz)