[00:03:43] PROBLEM - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [00:06:20] 10Operations, 10Performance-Team, 10SRE-swift-storage, 10Traffic, and 2 others: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10dpifke) Yes, this is totally something I can tackle. It meshes well with the work I'm doing on converting ArcLamp to use Swift fo... [00:15:23] PROBLEM - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 107.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [00:20:03] PROBLEM - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [00:59:05] RECOVERY - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [01:11:17] RECOVERY - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is OK: (C)100 gt (W)80 gt 68.76 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs 
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [01:25:25] PROBLEM - MariaDB Slave Lag: s3 on db1095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1420.50 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [01:31:33] 10Operations, 10DBA, 10Privacy Engineering, 10Traffic, and 4 others: dbtree loads third party resources (from google.com/jsapi) - https://phabricator.wikimedia.org/T96499 (10Krinkle) [02:26:39] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1372.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [02:31:05] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [02:32:19] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [02:49:41] RECOVERY - MariaDB Slave Lag: s3 on db1095 is OK: OK slave_sql_lag Replication lag: 0.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [03:10:55] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [03:12:09] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [03:49:19] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [03:51:33] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [03:54:26] 10Operations, 10Analytics: Installing package graph-tool on one stat-machine - https://phabricator.wikimedia.org/T247266 (10Zoranzoki21) >>! In T247266#5954803, @Aklapper wrote: >> Therefore, I would like to inquire about the possibility to use graph-tool on one of the stat-machines (e.g. stat1005) via any of... [03:55:43] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [03:56:55] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:01:37] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:40:01] 10Operations, 10Analytics, 10ContentTranslation, 10SRE-Access-Requests, 10Language-Team (Language-2020-January-March): Request for access for stats machines for Santhosh - https://phabricator.wikimedia.org/T247246 (10santhosh) [04:40:26] 10Operations, 10Analytics, 10ContentTranslation, 10SRE-Access-Requests, 10Language-Team (Language-2020-January-March): Request for access for stats machines for Santhosh - https://phabricator.wikimedia.org/T247246 (10santhosh) >>! 
In T247246#5953952, @Nuria wrote: > @santhosh What is your LDAP user? Sa... [04:41:35] 10Operations, 10Analytics, 10ContentTranslation, 10SRE-Access-Requests, 10Language-Team (Language-2020-January-March): Request for access for stats machines for Santhosh - https://phabricator.wikimedia.org/T247246 (10Nuria) FYI that @santhosh is a WMF employee. Approved on my end [04:53:05] PROBLEM - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 106.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [04:58:19] PROBLEM - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 105.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [05:14:16] (03PS1) 10BryanDavis: toolforge: support canonical redirects in urlproxy [puppet] - 10https://gerrit.wikimedia.org/r/578406 (https://phabricator.wikimedia.org/T234617) [05:26:19] (03PS9) 10BryanDavis: kubernetes: Set php7.3 as the default type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/496564 [05:26:21] (03PS15) 10BryanDavis: Make Kubernetes the default backend and warn when guessing [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504) (owner: 10Nehajha) [05:26:23] (03PS1) 10BryanDavis: Bump manifest version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578407 [05:26:25] (03PS1) 10BryanDavis: Remove temporary code from 2020 Kubernetes migration [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578408 (https://phabricator.wikimedia.org/T246689) [05:26:27] (03PS1) 10BryanDavis: Refactor argparse setup 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578409 [05:26:29] (03PS1) 10BryanDavis: Reuse toolforge.common.tool.PROJECT in KubernetesBackend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578410 [05:26:31] (03PS1) 10BryanDavis: Introduce command "template" feature [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578411 [05:26:33] (03PS1) 10BryanDavis: Add support for Kubernetes replica scaling [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578412 [05:26:35] (03PS1) 10BryanDavis: Add support for redirecting to toolforge.org [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578413 (https://phabricator.wikimedia.org/T234617) [05:28:19] (03CR) 10jerkins-bot: [V: 04-1] Add support for redirecting to toolforge.org [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578413 (https://phabricator.wikimedia.org/T234617) (owner: 10BryanDavis) [05:29:24] (03CR) 10BryanDavis: "Probably should rebase on I5f06ff9fe682f782cbfd7f722fd44406a842e2bd" [puppet] - 10https://gerrit.wikimedia.org/r/566491 (https://phabricator.wikimedia.org/T218427) (owner: 10Legoktm) [05:30:59] (03PS2) 10BryanDavis: Add support for redirecting to toolforge.org [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578413 (https://phabricator.wikimedia.org/T234617) [05:36:57] !log restart ats-be on cp4032 to clean up the restart alert - T247232 [05:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:37:03] T247232: lua related crash on ats-be @ cp4032 - https://phabricator.wikimedia.org/T247232 [05:39:41] RECOVERY - traffic_server backend process restarted on cp4032 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=ulsfo+prometheus/ops&var-instance=cp4032&var-layer=backend [05:54:35] RECOVERY - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on 
cloudelastic1002 is OK: (C)100 gt (W)80 gt 73.22 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [05:56:20] (03PS3) 10Vgutierrez: ATS: Turn on TLS Session tickets on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/578327 (https://phabricator.wikimedia.org/T245616) [05:56:23] RECOVERY - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 78.04 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad [06:02:57] (03PS4) 10Vgutierrez: ATS: Turn on TLS Session tickets on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/578327 (https://phabricator.wikimedia.org/T245616) [06:29:02] (03CR) 10Vgutierrez: "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1002/21361/" [puppet] - 10https://gerrit.wikimedia.org/r/578327 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [06:35:57] (03CR) 10KartikMistry: "recheck" [debs/contenttranslation/apertium-pol-szl] - 10https://gerrit.wikimedia.org/r/576628 (https://phabricator.wikimedia.org/T202276) (owner: 10KartikMistry) [06:44:48] (03PS5) 10KartikMistry: Add apertium-pol-szl package [debs/contenttranslation/apertium-pol-szl] - 10https://gerrit.wikimedia.org/r/576628 (https://phabricator.wikimedia.org/T202276) [06:45:44] (03PS2) 10KartikMistry: apertium-fra-cat: Updated to upstream release 1.7.0 [debs/contenttranslation/apertium-fra-cat] - 10https://gerrit.wikimedia.org/r/577216 (https://phabricator.wikimedia.org/T233700) [06:48:41] (03CR) 10jerkins-bot: [V: 04-1] apertium-fra-cat: Updated to upstream release 1.7.0 [debs/contenttranslation/apertium-fra-cat] - 10https://gerrit.wikimedia.org/r/577216 
(https://phabricator.wikimedia.org/T233700) (owner: 10KartikMistry) [06:49:57] (03PS2) 10KartikMistry: apertium-spa-cat: Update to new upstream release 2.2.0 [debs/contenttranslation/apertium-spa-cat] - 10https://gerrit.wikimedia.org/r/577243 (https://phabricator.wikimedia.org/T233700) [06:51:25] (03PS3) 10KartikMistry: Add apertium-oci-fra package [debs/contenttranslation/apertium-oci-fra] - 10https://gerrit.wikimedia.org/r/577047 (https://phabricator.wikimedia.org/T202360) [06:57:33] (03CR) 10jerkins-bot: [V: 04-1] Add apertium-oci-fra package [debs/contenttranslation/apertium-oci-fra] - 10https://gerrit.wikimedia.org/r/577047 (https://phabricator.wikimedia.org/T202360) (owner: 10KartikMistry) [07:27:09] 10Operations, 10DBA, 10OTRS, 10Recommendation-API, 10Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (10Marostegui) Thanks @leila! @akosiaris does Tuesday 17th at 09:00 AM UTC work? [07:28:04] 10Operations, 10DBA, 10OTRS, 10Recommendation-API, 10Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (10akosiaris) >>! In T246098#5955323, @Marostegui wrote: > Thanks @leila! > @akosiaris does Tuesday 17th at 09:00 AM UTC work? Fine by me. [07:36:13] 10Operations, 10DBA, 10OTRS, 10Recommendation-API, 10Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (10Marostegui) Excellent - going to send calendar invite and block that time on the deployment page. 
[07:37:01] 10Operations, 10DBA, 10OTRS, 10Recommendation-API, 10Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (10Marostegui) [07:38:22] (03PS1) 10KartikMistry: Apertium: Install apertium-dev [puppet] - 10https://gerrit.wikimedia.org/r/578465 [07:45:57] (03PS3) 10Marostegui: db-eqiad,db-codfw.php: Add es5 as new ES, for initial testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577185 (https://phabricator.wikimedia.org/T246072) [07:46:45] (03PS1) 10Marostegui: Revert "install_server: Allow reimage of db2125" [puppet] - 10https://gerrit.wikimedia.org/r/578471 [07:47:51] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:49:11] (03CR) 10Marostegui: [C: 03+2] Revert "install_server: Allow reimage of db2125" [puppet] - 10https://gerrit.wikimedia.org/r/578471 (owner: 10Marostegui) [07:52:27] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:58:03] (03PS1) 10Marostegui: install_server: Allow reimage db2121 [puppet] - 10https://gerrit.wikimedia.org/r/578472 (https://phabricator.wikimedia.org/T246604) [08:01:28] (03CR) 10Marostegui: [C: 03+2] install_server: Allow reimage db2121 [puppet] - 10https://gerrit.wikimedia.org/r/578472 (https://phabricator.wikimedia.org/T246604) (owner: 10Marostegui) [08:05:13] (03PS1) 10Marostegui: install_server: Reimage db2121 to buster [puppet] - 10https://gerrit.wikimedia.org/r/578473 (https://phabricator.wikimedia.org/T246604) [08:14:15] (03PS5) 10Jcrespo: mariadb-backups: Increase snapshot frequency and retain those on bacula [puppet] - 10https://gerrit.wikimedia.org/r/577462 (https://phabricator.wikimedia.org/T138562) [08:14:30] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Increase snapshot frequency and retain those on bacula [puppet] - 10https://gerrit.wikimedia.org/r/577462 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [08:19:10] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db2121 to buster [puppet] - 10https://gerrit.wikimedia.org/r/578473 (https://phabricator.wikimedia.org/T246604) (owner: 10Marostegui) [08:24:58] (03CR) 10Marostegui: db-eqiad,db-codfw.php: Add es5 as new ES, for initial testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577185 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [08:25:09] (03CR) 10Marostegui: db-eqiad,db-codfw.php: Set es5 as writable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577189 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [08:25:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1012', diff saved to https://phabricator.wikimedia.org/P10670 and previous config saved to /var/cache/conftool/dbconfig/20200310-082525-marostegui.json [08:25:30] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1012 back to es1 master, this is a NOOP T239791', diff saved to https://phabricator.wikimedia.org/P10671 and previous config saved to /var/cache/conftool/dbconfig/20200310-082552-marostegui.json [08:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:59] T239791: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 [08:37:54] (03PS6) 10Muehlenhoff: Enable ssoSessions endpoint [puppet] - 10https://gerrit.wikimedia.org/r/578319 [08:40:05] (03CR) 10jerkins-bot: [V: 04-1] Enable ssoSessions endpoint [puppet] - 10https://gerrit.wikimedia.org/r/578319 (owner: 10Muehlenhoff) [08:49:00] (03PS7) 10Muehlenhoff: Enable ssoSessions endpoint [puppet] - 10https://gerrit.wikimedia.org/r/578319 [08:50:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool es1012', diff saved to https://phabricator.wikimedia.org/P10673 and previous config saved to /var/cache/conftool/dbconfig/20200310-085001-marostegui.json [08:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:13] (03CR) 10jerkins-bot: [V: 04-1] Enable ssoSessions endpoint [puppet] - 10https://gerrit.wikimedia.org/r/578319 (owner: 10Muehlenhoff) [09:00:04] marostegui and jynus: Your horoscope predicts another unfortunate es5 database deployment deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200310T0900). [09:00:09] o/ [09:00:10] let's go? 
[09:00:17] I am here [09:00:25] cool, merging [09:00:30] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Add es5 as new ES, for initial testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577185 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [09:00:45] !log Start es5 deployment window T246072 [09:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:50] T246072: Enable es4 and es5 as writable new external store sections and set es2 and es3 as read only - https://phabricator.wikimedia.org/T246072 [09:01:34] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Add es5 as new ES, for initial testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577185 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [09:02:09] going to deploy the first change [09:02:14] should be a noop [09:02:17] ok [09:03:00] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Add es5 to the available es sections, not in use yet - T246072 (duration: 01m 01s) [09:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:13] deploying eqiad [09:03:52] s3 metadata showing one host with bad perf [09:04:00] since 8:57 [09:04:07] probably unrelated, but checking [09:04:08] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Add es5 to the available es sections, not in use yet - T246072 (duration: 00m 59s) [09:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:14] thanks, should be unrelated yeah [09:04:16] checking shell.php now [09:04:36] (03PS8) 10Muehlenhoff: Enable ssoSessions endpoint [puppet] - 10https://gerrit.wikimedia.org/r/578319 [09:05:23] another fetchblob test?
[09:05:35] yep, now it succeeded [09:05:39] i did 2 [09:05:45] I reloaded shell.php and now it worked [09:05:47] strange, it gave the same error [09:05:50] so there should be 2 failures [09:05:51] ah, I see [09:05:54] yes [09:06:34] (03PS2) 10Marostegui: db-eqiad,db-codfw.php: Set es5 as writable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577189 (https://phabricator.wikimedia.org/T246072) [09:06:37] I see no errors [09:06:44] can we pause for a second? [09:06:48] sure [09:06:49] I see some lag on s3 [09:06:54] woot [09:06:55] checking [09:07:04] db1075 [09:07:26] something weird happened, unrelated [09:07:27] (03PS1) 10Alexandros Kosiaris: tls: Supply sane default values for resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/578478 (https://phabricator.wikimedia.org/T244843) [09:07:37] but I don't want to interfere with this deployment [09:07:53] it seems it got fixed [09:07:57] BBU is ok [09:08:23] but there was higher s3 load from 57 to 06 [09:08:26] my first reaction was to check BBU, as that host is one of the ones that might have BBU issues soon [09:08:33] is that the vslow host?
[09:09:12] not sure [09:09:27] nothing on HW logs: ie, BBU learn cycle or something [09:09:41] db1078 is dump [09:09:47] ok [09:09:48] and vslow [09:10:20] but I don't see dumps running at the moment [09:10:30] (03CR) 10Giuseppe Lavagetto: [C: 03+1] tls: Supply sane default values for resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/578478 (https://phabricator.wikimedia.org/T244843) (owner: 10Alexandros Kosiaris) [09:10:53] db1075 doesn't show anything on the system's graphs [09:10:56] like CPU, memory etc [09:11:03] well, this is strange: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1078&var-port=9104&fullscreen&panelId=3&from=1583820653335&to=1583831453335 [09:11:19] lots of temporary tables from what I can see [09:11:41] there is something happening there for sure, not sure why or if it is worrying [09:12:21] what's on the section in question? [09:12:25] apergos: s3 [09:12:34] jynus: looks like the query killer also didn't kick in [09:12:37] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!. I 'll package and deploy the new charts for all those." [deployment-charts] - 10https://gerrit.wikimedia.org/r/578478 (https://phabricator.wikimedia.org/T244843) (owner: 10Alexandros Kosiaris) [09:12:50] aka default [09:12:56] (03Merged) 10jenkins-bot: tls: Supply sane default values for resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/578478 (https://phabricator.wikimedia.org/T244843) (owner: 10Alexandros Kosiaris) [09:12:58] but I don't see dumps running [09:13:11] yeah, and also not the same host [09:13:44] maybe dumps stopped at 7:30? [09:13:53] but it is a different host, from what you said [09:13:55] and I am just being paranoid [09:13:57] db1075 isn't vslow, no?
[09:14:01] no [09:14:13] I was going to say there could be some very fleeting queries but only for a couple wikis anyways [09:14:17] you said db1075 is the one having issues and db1078 is the vslow one [09:14:27] I saw issues on s3 [09:14:48] and then I saw a strange change in traffic on db1078, but it may be a false connection [09:14:58] (03PS1) 10Alexandros Kosiaris: Package charts that support the new resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/578480 (https://phabricator.wikimedia.org/T244843) [09:14:59] Ah, I got confused as you mentioned db1075 first [09:15:02] maybe it was just backups [09:16:01] I think we should just continue [09:16:08] there are no immediate concerns [09:16:13] cool [09:16:20] and we can just research at a later time [09:16:22] then going to test writes from mwdebug1001 [09:16:25] not related [09:16:32] (03CR) 10Alexandros Kosiaris: [C: 03+2] Package charts that support the new resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/578480 (https://phabricator.wikimedia.org/T244843) (owner: 10Alexandros Kosiaris) [09:16:35] I just thought there were ongoing issues, there aren't [09:16:50] (03Merged) 10jenkins-bot: Package charts that support the new resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/578480 (https://phabricator.wikimedia.org/T244843) (owner: 10Alexandros Kosiaris) [09:16:59] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/21363/" [puppet] - 10https://gerrit.wikimedia.org/r/578319 (owner: 10Muehlenhoff) [09:17:15] going to do some writes on my page on enwiki [09:17:17] from mwdebug [09:17:19] (03PS9) 10Muehlenhoff: Enable ssoSessions endpoint [puppet] - 10https://gerrit.wikimedia.org/r/578319 [09:17:19] there is some intermittent lag, but nothing actionable atm [09:17:29] please do [09:17:54] (03PS1) 10Elukey: statistics::compute: deploy mysql credentials only when needed [puppet] - 10https://gerrit.wikimedia.org/r/578481
(https://phabricator.wikimedia.org/T243934) [09:18:20] I see some writes already on enwiki on the new cluster [09:18:22] from my changes [09:18:27] it was db1078 and db1087, both dump/slow [09:18:40] which it would be "normal" so low priority [09:18:40] oh that's something [09:18:51] yeah, so probably higher load than usual [09:18:51] going to try some writes to eswiki now [09:19:05] let me check logs and server status [09:19:14] so far my enwiki writes showed up on es5 [09:19:27] I can see it [09:19:33] replication ok? [09:19:41] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' . [09:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:05] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'citoid' for release 'staging' . [09:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:23] 0 errors on log related to dbs in the past 15 minutes [09:20:49] yeah, replication works fine [09:20:53] I see my changes on all the hosts [09:21:12] !log update blubberoid, cxserver, citoid to push the TLS resources changes [09:21:13] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [09:21:14] cool, anything else to check before full deployment? 
[09:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:28] !log update blubberoid, cxserver, citoid to push the TLS resources changes T244843 [09:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:33] T244843: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 [09:21:49] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/21364/" [puppet] - 10https://gerrit.wikimedia.org/r/578481 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [09:21:51] checking mysql metrics for the new master [09:21:51] (03CR) 10Elukey: [C: 03+2] statistics::compute: deploy mysql credentials only when needed [puppet] - 10https://gerrit.wikimedia.org/r/578481 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [09:22:05] jynus: I was reviewing that all hosts are green on icinga [09:22:09] they are [09:22:21] Let's deploy then? 
[09:22:25] I use icinga too to check latency sometimes [09:22:35] +1 to full sync [09:22:40] ok [09:22:48] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Set es5 as writable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577189 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [09:23:44] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Set es5 as writable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577189 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [09:25:02] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Enable es5 as new writable external store section - T246072 (duration: 00m 59s) [09:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:07] T246072: Enable es4 and es5 as writable new external store sections and set es2 and es3 as read only - https://phabricator.wikimedia.org/T246072 [09:25:15] no errors that I can see so far [09:25:17] I didn't rebase :) [09:25:21] ah [09:25:33] Thanks to dbctl I have already forgotten how to deploy with scap [09:26:10] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Enable es5 as new writable external store section - T246072 (duration: 00m 58s) [09:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:15] ok, going for eqiad now then [09:27:15] (03CR) 10ArielGlenn: [C: 03+1] "looks good for the snapshots." [puppet] - 10https://gerrit.wikimedia.org/r/578356 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [09:27:20] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Enable es5 as new writable external store section - T246072 (duration: 00m 57s) [09:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:30] ok, writes happening on the master [09:28:03] (03CR) 10Alexandros Kosiaris: [C: 03+2] "@ottomata. 
eventgate and eventstreams don't use the shared _tls_helpers and as such can't benefit from this but rather the changes have to" [deployment-charts] - 10https://gerrit.wikimedia.org/r/578478 (https://phabricator.wikimedia.org/T244843) (owner: 10Alexandros Kosiaris) [09:28:07] I can see increase of traffic [09:28:16] checking save timing/edit rate [09:28:35] binlog looking good with writes [09:28:57] 10Operations, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10akosiaris) Copying from the last comment of https://gerrit.wikimedia.org/r/... [09:29:22] edit rate will need more time to catch up metrics [09:29:26] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [09:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:10] Still no errors [09:30:15] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'citoid' for release 'production' . [09:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:18] (03PS1) 10Elukey: statistics::mysql_credentials: use require instead of defined [puppet] - 10https://gerrit.wikimedia.org/r/578483 (https://phabricator.wikimedia.org/T243934) [09:31:27] (03CR) 10Muehlenhoff: [C: 03+2] Enable ssoSessions endpoint [puppet] - 10https://gerrit.wikimedia.org/r/578319 (owner: 10Muehlenhoff) [09:31:45] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'cxserver' for release 'production' . 
[09:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:20] !log es5 deployment window finished T246072 [09:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:24] T246072: Enable es4 and es5 as writable new external store sections and set es2 and es3 as read only - https://phabricator.wikimedia.org/T246072 [09:34:29] (03CR) 10Muehlenhoff: "Looks good. You can also go ahead and remove lvm-ext-srv, the only other user (oresrdb2001) won't get reimaged again (these will be folded" [puppet] - 10https://gerrit.wikimedia.org/r/578356 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [09:35:29] (03CR) 10Marostegui: [C: 03+1] "es5 is deployed. We still have to "close" es3, this might happen either today or tomorrow. I will update this once that's done." [software/spicerack] - 10https://gerrit.wikimedia.org/r/576297 (https://phabricator.wikimedia.org/T226704) (owner: 10Volans) [09:36:24] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'blubberoid' for release 'production' . [09:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:02] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'citoid' for release 'production' . [09:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:26] (03CR) 10Elukey: [C: 03+2] statistics::mysql_credentials: use require instead of defined [puppet] - 10https://gerrit.wikimedia.org/r/578483 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [09:37:34] INSERT INTO masters values ('es5', 'eqiad', 'es1023'), ('es5', 'codfw', 'es20230); -- marostegui ? 
[09:37:43] ugh, extra 0 [09:37:56] INSERT INTO masters values ('es5', 'eqiad', 'es1023'), ('es5', 'codfw', 'es2023'); [09:38:00] jynus: INSERT INTO masters values ('es5', 'eqiad', 'es1023'), ('es5', 'codfw', 'es2023'); [09:38:01] yeah [09:38:30] (03CR) 10Alexandros Kosiaris: [C: 04-1] "We don't install -dev packages in production unless there's a very clear reason as to why and it's well justified." [puppet] - 10https://gerrit.wikimedia.org/r/578465 (owner: 10KartikMistry) [09:38:38] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' . [09:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:56] will run script on prometheus to check it updating [09:39:59] thanks [09:40:04] I am checking the graphs [09:40:14] so far so good [09:40:58] the script should fix https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=es5&var-role=master&from=1583811648663&to=1583833248664 [09:41:07] being empty to having 1 host there [09:42:21] last updated: Mar 10 09:41 mysql-core_eqiad.yaml [09:43:15] nice [09:43:17] thanks [09:43:27] but I don't see it yet on grafana [09:44:53] I think it is there now?
[09:45:08] those are aggregated metrics that I think don't switch automatically [09:45:26] yep, works now [09:47:02] yeah, I can see it [09:49:03] (03PS2) 10Filippo Giunchedi: install_server: switch snapshot and sodium to standard partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/578356 (https://phabricator.wikimedia.org/T156955) [09:56:10] (03PS1) 10Elukey: Introduce profile::statistics::eventlogging_rsync [puppet] - 10https://gerrit.wikimedia.org/r/578484 (https://phabricator.wikimedia.org/T243934) [10:01:45] (03CR) 10Vgutierrez: [C: 03+2] "looking good in labs as well :)" [puppet] - 10https://gerrit.wikimedia.org/r/532348 (owner: 10Vgutierrez) [10:04:35] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/21367/" [puppet] - 10https://gerrit.wikimedia.org/r/578484 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [10:05:08] * elukey runs to puppetmaster to beat vgutierrez [10:05:21] elukey: err mine is already merged sorry :* [10:05:31] ahhahaha [10:05:33] <3 [10:09:33] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/578356 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [10:17:54] (03CR) 10Arturo Borrero Gonzalez: Add support for redirecting to toolforge.org (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578413 (https://phabricator.wikimedia.org/T234617) (owner: 10BryanDavis) [10:21:00] (03PS1) 10Gehel: wdqs: decompress dumps with bzcat [cookbooks] - 10https://gerrit.wikimedia.org/r/578486 [10:23:12] (03CR) 10DCausse: [C: 03+1] wdqs: decompress dumps with bzcat [cookbooks] - 10https://gerrit.wikimedia.org/r/578486 (owner: 10Gehel) [10:27:17] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (dbprov2001), No backups: 1 (dbprov2002), Fresh: 97 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [10:27:53] (03CR) 10Gehel: [C: 03+2] wdqs: decompress dumps with bzcat [cookbooks] - 
10https://gerrit.wikimedia.org/r/578486 (owner: 10Gehel) [10:28:58] (03CR) 10Volans: wdqs: decompress dumps with bzcat (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/578486 (owner: 10Gehel) [10:31:46] (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: switch snapshot and sodium to standard partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/578356 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [10:32:10] (03PS3) 10Filippo Giunchedi: install_server: switch snapshot and sodium to standard partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/578356 (https://phabricator.wikimedia.org/T156955) [10:33:18] (03PS1) 10Elukey: admin: deprecate statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/578488 (https://phabricator.wikimedia.org/T246578) [10:36:47] (03PS2) 10Vgutierrez: Release 8.0.6-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/577569 (https://phabricator.wikimedia.org/T245616) [10:37:21] 10Operations, 10Analytics, 10ContentTranslation, 10SRE-Access-Requests, 10Language-Team (Language-2020-January-March): Request for access for stats machines for Santhosh - https://phabricator.wikimedia.org/T247246 (10elukey) [10:41:07] (03CR) 10Ema: [V: 03+2 C: 03+2] Use confluent-kafka-go instead of segmentio/kafka-go [software/atskafka] - 10https://gerrit.wikimedia.org/r/578328 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [10:47:13] PROBLEM - Check systemd state on db2084 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:01] PROBLEM - Check systemd state on db2085 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:30] ^ me [10:49:33] RECOVERY - Check systemd state on db2084 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:50:21] RECOVERY - Check systemd state on db2085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200310T1100). [11:00:04] MatmaRex: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:32] hello [11:00:58] (03PS1) 10Jcrespo: bacula: Increase max total size of backups to 35 TB [puppet] - 10https://gerrit.wikimedia.org/r/578489 (https://phabricator.wikimedia.org/T238048) [11:02:22] (03PS1) 10Ema: Tidy up go.sum [software/atskafka] - 10https://gerrit.wikimedia.org/r/578490 [11:02:24] (03PS1) 10Ema: Debianization [software/atskafka] - 10https://gerrit.wikimedia.org/r/578491 (https://phabricator.wikimedia.org/T237993) [11:02:35] o/ [11:02:37] anyone deploying? [11:02:41] oh, hi! [11:02:46] I can do it :) [11:02:49] * Lucas_WMDE looks at changes [11:02:57] (03CR) 10Jcrespo: "@akosiaris please let me know if you see something wrong with this (aside from remembering to apply the change with update, etc." 
[puppet] - 10https://gerrit.wikimedia.org/r/578489 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [11:03:41] (03CR) 10Ema: [C: 03+1] Release 8.0.6-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/577569 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [11:03:59] (03CR) 10Vgutierrez: [C: 03+2] Release 8.0.6-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/577569 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [11:04:03] 10Operations, 10Traffic, 10observability: prometheus2004 not scraping lvs2007 & lvs2008 - https://phabricator.wikimedia.org/T246860 (10elukey) The issue went away by itself after mentioning it to Chris and Filippo :) [11:05:20] (03PS1) 10Giuseppe Lavagetto: Use Envoy to talk to echostore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578492 (https://phabricator.wikimedia.org/T244843) [11:05:22] (03PS1) 10Giuseppe Lavagetto: Move Termbox to ProductionServices, use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578493 (https://phabricator.wikimedia.org/T244843) [11:05:24] (03PS1) 10Giuseppe Lavagetto: Add ores, wdqs to ProductionServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578494 (https://phabricator.wikimedia.org/T244843) [11:05:28] (03PS1) 10Giuseppe Lavagetto: wdqs-internal: switch to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578495 (https://phabricator.wikimedia.org/T244843) [11:05:30] (03PS1) 10Giuseppe Lavagetto: Switch ores to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578496 (https://phabricator.wikimedia.org/T244843) [11:05:32] (03PS1) 10Giuseppe Lavagetto: Switch restbase to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578497 (https://phabricator.wikimedia.org/T244843) [11:05:55] 10Operations, 10DBA, 10OTRS, 10Recommendation-API, 10Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (10Marostegui) Mail sent to 
wikitech-l: https://lists.wikimedia.org/pipermail/wikitech-l/2020-March/093175.html Deployments calendar window ad... [11:05:59] 10Operations, 10Performance-Team, 10SRE-swift-storage, 10Traffic, and 2 others: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Gilles) Note that we're not running the Swift daemon that consumes that header and deletes objects yet (swift-object-expirer), afa... [11:06:09] ugh, CI failing for the first backport due to T246763 [11:06:09] T246763: Jenkins job failing intermittently due to Gerrit HTTP 502 errors when cloning repos - https://phabricator.wikimedia.org/T246763 [11:06:31] let’s try the second one then [11:06:46] (03CR) 10jerkins-bot: [V: 04-1] Use Envoy to talk to echostore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578492 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [11:06:58] <_joe_> uh? [11:07:20] (03CR) 10jerkins-bot: [V: 04-1] Move Termbox to ProductionServices, use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578493 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [11:07:24] (03CR) 10jerkins-bot: [V: 04-1] Add ores, wdqs to ProductionServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578494 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [11:07:26] (03CR) 10jerkins-bot: [V: 04-1] wdqs-internal: switch to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578495 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [11:07:38] <_joe_> hah whitespace in a commment, ffs [11:08:11] (03CR) 10jerkins-bot: [V: 04-1] Switch ores to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578496 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [11:08:20] (03CR) 10jerkins-bot: [V: 04-1] Switch restbase to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578497 
(https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [11:08:34] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM, I was wondering when we do start doing that. Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/578489 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [11:10:39] (03CR) 10Jcrespo: [C: 03+2] bacula: Increase max total size of backups to 35 TB [puppet] - 10https://gerrit.wikimedia.org/r/578489 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [11:11:45] (03CR) 10Jcrespo: "I will monitor usage (it may take weeks to see the results) and adjust this and other parameters accordingly." [puppet] - 10https://gerrit.wikimedia.org/r/578489 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [11:12:44] (03PS1) 10Elukey: admin: deprecate statistics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/578500 (https://phabricator.wikimedia.org/T246578) [11:14:21] (03PS1) 10Giuseppe Lavagetto: Bump up memory limits for echostore [deployment-charts] - 10https://gerrit.wikimedia.org/r/578503 (https://phabricator.wikimedia.org/T244843) [11:16:27] (03CR) 10Alexandros Kosiaris: [C: 03+1] Bump up memory limits for echostore [deployment-charts] - 10https://gerrit.wikimedia.org/r/578503 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [11:16:27] Lucas_WMDE: (it's merged) [11:16:48] sorry, got distracted [11:16:50] one sec [11:18:25] np [11:18:26] (03CR) 10Jcrespo: "This seems applied as expected:" [puppet] - 10https://gerrit.wikimedia.org/r/578489 (https://phabricator.wikimedia.org/T238048) (owner: 10Jcrespo) [11:18:50] should be on mwdebug1001 now [11:18:57] MatmaRex: can you test it there? 
[11:19:16] yeah [11:20:01] Lucas_WMDE: seems ok [11:20:05] ok [11:21:04] syncing [11:21:50] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.35.0-wmf.22/extensions/DiscussionTools/: SWAT: [[gerrit:578364|controller: apply ve.fixBase to the parsed Parsoid response (T245781)]] (duration: 00m 59s) [11:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:00] T245781: Reply tool usually doesn't work in Safari - https://phabricator.wikimedia.org/T245781 [11:22:15] the other change sounds like it would be harder to test…? [11:22:17] (03PS2) 10Giuseppe Lavagetto: Use Envoy to talk to echostore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578492 (https://phabricator.wikimedia.org/T244843) [11:22:19] (03PS2) 10Giuseppe Lavagetto: Move Termbox to ProductionServices, use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578493 (https://phabricator.wikimedia.org/T244843) [11:22:21] (03PS2) 10Giuseppe Lavagetto: Add ores, wdqs to ProductionServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578494 (https://phabricator.wikimedia.org/T244843) [11:22:24] (03PS2) 10Giuseppe Lavagetto: wdqs-internal: switch to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578495 (https://phabricator.wikimedia.org/T244843) [11:22:26] (03PS2) 10Giuseppe Lavagetto: Switch ores to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578496 (https://phabricator.wikimedia.org/T244843) [11:22:28] (03PS2) 10Giuseppe Lavagetto: Switch restbase to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578497 (https://phabricator.wikimedia.org/T244843) [11:23:00] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Bump up memory limits for echostore [deployment-charts] - 10https://gerrit.wikimedia.org/r/578503 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [11:23:59] (03CR) 10jerkins-bot: [V: 04-1] Move Termbox to ProductionServices, use envoy [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/578493 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [11:24:16] (03PS4) 10KartikMistry: Add apertium-oci-fra package [debs/contenttranslation/apertium-oci-fra] - 10https://gerrit.wikimedia.org/r/577047 (https://phabricator.wikimedia.org/T202360) [11:24:25] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add apertium-pol-szl package [debs/contenttranslation/apertium-pol-szl] - 10https://gerrit.wikimedia.org/r/576628 (https://phabricator.wikimedia.org/T202276) (owner: 10KartikMistry) [11:24:27] (03CR) 10jerkins-bot: [V: 04-1] Add ores, wdqs to ProductionServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578494 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [11:24:45] akosiaris: re recycling and expiration- the increase of retention may take a full cycle to kick in, because I belive bacula track the "expiration time", not the "write time", so only new backups will have the larger retention :-( [11:24:49] (03CR) 10jerkins-bot: [V: 04-1] wdqs-internal: switch to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578495 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [11:25:25] (03Merged) 10jenkins-bot: Bump up memory limits for echostore [deployment-charts] - 10https://gerrit.wikimedia.org/r/578503 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [11:25:32] (03CR) 10jerkins-bot: [V: 04-1] Switch ores to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578496 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [11:26:01] (03CR) 10jerkins-bot: [V: 04-1] Switch restbase to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578497 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [11:26:07] !log Restart mysqld exporter on db2125 to see if the collection errors decrease from 30 T247290 [11:26:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:12] T247290: mysql_exporter_last_scrape_error flag on prometheus-mysqld-exporter increases after 10.4 upgrade - https://phabricator.wikimedia.org/T247290 [11:26:30] jynus: I think not. IIRC it will autocreate new volumes the moment it needs them [11:27:06] it is not happening at the moment- it prefers recycling rather than creating new ones, unless something else is bad [11:27:07] what I don't remember if it will take into account volumes that can be purged or not. But it doesn't have to do with the job retention at all. This is all about volumes [11:27:28] !log oblivian@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'echostore' for release 'production' . [11:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:45] we'll see how it behaves from now on [11:27:49] sure. [11:28:39] I have noted down the total number of volumes and which ones is being written to to compare in a few weeks [11:28:54] (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium-separable: Update to new upstream release 0.3.3 [debs/contenttranslation/apertium-separable] - 10https://gerrit.wikimedia.org/r/577046 (https://phabricator.wikimedia.org/T234182) (owner: 10KartikMistry) [11:29:20] it may be also my fault for not doing some of the updates last time [11:30:05] !log oblivian@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'echostore' for release 'production' . 
[11:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:26] (as I can see production volume with the desired retention, but not Databases) [11:30:53] (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium-spa-cat: Update to new upstream release 2.2.0 [debs/contenttranslation/apertium-spa-cat] - 10https://gerrit.wikimedia.org/r/577243 (https://phabricator.wikimedia.org/T233700) (owner: 10KartikMistry) [11:33:08] (03CR) 10jerkins-bot: [V: 04-1] Add apertium-oci-fra package [debs/contenttranslation/apertium-oci-fra] - 10https://gerrit.wikimedia.org/r/577047 (https://phabricator.wikimedia.org/T202360) (owner: 10KartikMistry) [11:35:08] second backport was merged \o/ [11:35:57] MatmaRex: EventLogging backport should be on mwdebug1001 now, can you test it? [11:36:23] (03PS5) 10KartikMistry: Add apertium-oci-fra package [debs/contenttranslation/apertium-oci-fra] - 10https://gerrit.wikimedia.org/r/577047 (https://phabricator.wikimedia.org/T202360) [11:36:41] yeah, looking [11:39:03] Lucas_WMDE: thanks, looks good [11:39:07] ok [11:40:35] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.35.0-wmf.22/extensions/EventLogging/: SWAT: [[gerrit:578317|Make BackgroundQueue more aware of page unload flow (T246382, T244874)]] (duration: 00m 58s) [11:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:41] T246382: New EventLogging queue doesn't log events in window.unload - https://phabricator.wikimedia.org/T246382 [11:40:42] T244874: QA replying workflow (v1.0) instrumentation - https://phabricator.wikimedia.org/T244874 [11:40:56] ok, that should be it [11:41:01] !log EU SWAT done [11:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:13] Lucas_WMDE: thank you! 
[11:41:19] no problem :) [11:43:24] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash: (Need by: 2020-03-06) rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` logstash1027.eqiad.wmnet ` The log... [11:43:52] 10Operations, 10Traffic, 10observability: prometheus2004 not scraping lvs2007 & lvs2008 - https://phabricator.wikimedia.org/T246860 (10fgiunchedi) I took a look at this on both prometheus200[34] for `up{instance=~"elastic2055.*9108"}` and the metric appears yesterday on 2004 at 9:44 and 2003 at 17:48. Wherea... [11:44:45] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash: (Need by: 2020-03-06) rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` logstash1028.eqiad.wmnet ` The log... [11:45:35] (03Abandoned) 10KartikMistry: Apertium: Install apertium-dev [puppet] - 10https://gerrit.wikimedia.org/r/578465 (owner: 10KartikMistry) [11:46:20] (03CR) 10KartikMistry: "recheck" [debs/contenttranslation/apertium-fra-cat] - 10https://gerrit.wikimedia.org/r/577216 (https://phabricator.wikimedia.org/T233700) (owner: 10KartikMistry) [11:46:45] ACKNOWLEDGEMENT - Backup freshness on backup1001 is CRITICAL: All failures: 1 (dbprov2001), No backups: 1 (dbprov2002), Fresh: 97 jobs Jcrespo Backups running right now for snapshots - The acknowledgement expires at: 2020-03-11 11:46:11. 
https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [11:47:17] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add apertium-oci-fra package [debs/contenttranslation/apertium-oci-fra] - 10https://gerrit.wikimedia.org/r/577047 (https://phabricator.wikimedia.org/T202360) (owner: 10KartikMistry) [11:48:22] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash: (Need by: 2020-03-06) rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` logstash1029.eqiad.wmnet ` The log... [11:52:01] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10Cmjohnson) [11:53:50] (03PS3) 10KartikMistry: apertium-fra-cat: Updated to upstream release 1.7.0 [debs/contenttranslation/apertium-fra-cat] - 10https://gerrit.wikimedia.org/r/577216 (https://phabricator.wikimedia.org/T233700) [11:55:57] (03PS1) 10Volans: dns::auth: add DNS snippets generated from Netbox [puppet] - 10https://gerrit.wikimedia.org/r/578506 (https://phabricator.wikimedia.org/T233183) [11:56:10] !log upload trafficserver 8.0.6-1wm2 to apt.wm.o (buster) - T245616 [11:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:22] T245616: Provide a simple and automated SSL Ticket key generation system for ATS - https://phabricator.wikimedia.org/T245616 [11:57:30] (03PS1) 10Ema: Add basic testing [software/atskafka] - 10https://gerrit.wikimedia.org/r/578507 (https://phabricator.wikimedia.org/T237993) [11:58:08] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10Cmjohnson) [12:00:22] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [12:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:38] !log cmjohnson@cumin1001 START - Cookbook 
sre.hosts.downtime [12:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:50] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:39] (03CR) 10Ema: [C: 03+1] ATS: Turn on TLS Session tickets on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/578327 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [12:04:57] (03PS1) 10Cmjohnson: Add production dns for wdqs10[123] [dns] - 10https://gerrit.wikimedia.org/r/578509 (https://phabricator.wikimedia.org/T246352) [12:05:17] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [12:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:15] (03CR) 10Cmjohnson: [C: 03+2] Add production dns for wdqs10[123] [dns] - 10https://gerrit.wikimedia.org/r/578509 (https://phabricator.wikimedia.org/T246352) (owner: 10Cmjohnson) [12:09:14] (03PS1) 10Jbond: check_agent_run: ensure warning and critical are int's [puppet] - 10https://gerrit.wikimedia.org/r/578511 [12:10:30] (03CR) 10Jbond: [C: 03+1] "LGTM, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/578488 (https://phabricator.wikimedia.org/T246578) (owner: 10Elukey) [12:21:19] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/578488 (https://phabricator.wikimedia.org/T246578) (owner: 10Elukey) [12:22:18] (03PS1) 10Cmjohnson: updating dhcpd file with new wdqs10[123] [puppet] - 10https://gerrit.wikimedia.org/r/578514 (https://phabricator.wikimedia.org/T246352) [12:23:05] (03PS2) 10Cmjohnson: updating dhcpd file with new wdqs10[123] [puppet] - 10https://gerrit.wikimedia.org/r/578514 (https://phabricator.wikimedia.org/T246352) [12:24:17] (03PS2) 10Volans: dns::auth: add DNS snippets generated from Netbox [puppet] - 10https://gerrit.wikimedia.org/r/578506 (https://phabricator.wikimedia.org/T233183) [12:24:47] 10Operations, 10Performance 
Issue: Investigate CAS performance - https://phabricator.wikimedia.org/T246010 (10MoritzMuehlenhoff) The second delay happens within Spring: 2020-03-10 12:13:53,660 DEBUG [org.springframework.binding.mapping.impl.DefaultMapper] - (03CR) 10Jbond: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/578500 (https://phabricator.wikimedia.org/T246578) (owner: 10Elukey) [12:26:08] (03CR) 10Jbond: [C: 03+2] check_agent_run: ensure warning and critical are int's [puppet] - 10https://gerrit.wikimedia.org/r/578511 (owner: 10Jbond) [12:28:23] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash: (Need by: 2020-03-06) rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (10Cmjohnson) [12:29:13] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10Cmjohnson) [12:29:28] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/578500 (https://phabricator.wikimedia.org/T246578) (owner: 10Elukey) [12:31:09] (03CR) 10Volans: "Compiler results available here:" [puppet] - 10https://gerrit.wikimedia.org/r/578506 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [12:32:09] (03PS1) 10Cmjohnson: Adding wdqs101[1-3] to site.pp role:spare [puppet] - 10https://gerrit.wikimedia.org/r/578515 (https://phabricator.wikimedia.org/T246352) [12:32:57] (03CR) 10Cmjohnson: [C: 03+2] updating dhcpd file with new wdqs10[123] [puppet] - 10https://gerrit.wikimedia.org/r/578514 (https://phabricator.wikimedia.org/T246352) (owner: 10Cmjohnson) [12:34:41] (03PS2) 10Cmjohnson: Adding wdqs101[1-3] to site.pp role:spare [puppet] - 10https://gerrit.wikimedia.org/r/578515 (https://phabricator.wikimedia.org/T246352) [12:36:34] (03CR) 10Cmjohnson: [C: 03+2] Adding wdqs101[1-3] to site.pp role:spare [puppet] - 10https://gerrit.wikimedia.org/r/578515 (https://phabricator.wikimedia.org/T246352) (owner: 
10Cmjohnson) [12:37:59] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10Cmjohnson) [12:39:34] (03PS1) 10Jbond: check_puppet_run_changes: Ensure we exit correctly with critical [puppet] - 10https://gerrit.wikimedia.org/r/578517 [12:42:31] 10Operations, 10Security-Team, 10Stewards-and-global-tools, 10Security, 10User-revi: Security Issue Access Request for 2020 Stewards - https://phabricator.wikimedia.org/T246449 (10chasemp) >>! In T246449#5954198, @revi wrote: > @chasemp Feel free to consider current list final for the time being. One ste... [12:52:20] (03PS1) 10Cmjohnson: update dns mgmt/production for stat1008 [dns] - 10https://gerrit.wikimedia.org/r/578518 (https://phabricator.wikimedia.org/T246472) [12:52:43] (03CR) 10jerkins-bot: [V: 04-1] update dns mgmt/production for stat1008 [dns] - 10https://gerrit.wikimedia.org/r/578518 (https://phabricator.wikimedia.org/T246472) (owner: 10Cmjohnson) [12:52:51] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash: (Need by: 2020-03-06) rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` logstash1026.eqiad.wmnet ` The log... 
[12:55:16] (03PS2) 10Cmjohnson: update dns mgmt/production for stat1008 [dns] - 10https://gerrit.wikimedia.org/r/578518 (https://phabricator.wikimedia.org/T246472) [12:55:54] (03CR) 10Cmjohnson: [C: 03+2] update dns mgmt/production for stat1008 [dns] - 10https://gerrit.wikimedia.org/r/578518 (https://phabricator.wikimedia.org/T246472) (owner: 10Cmjohnson) [12:56:29] 10Operations, 10Analytics, 10Wikidata, 10Wikidata-Query-Service: Deployment strategy and hardware requirement for new Flink based WDQS updater - https://phabricator.wikimedia.org/T247058 (10Ottomata) While not a google doc, the parent ticket's description describes it pretty well: {T244590} [12:56:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10Cmjohnson) [12:57:39] (03PS1) 10WMDE-Fisch: Don't use TwoColConflict as beta feature on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578520 (https://phabricator.wikimedia.org/T247292) [12:58:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10Cmjohnson) [13:00:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` wdqs1011.eqiad.wmnet ` The log can b... [13:01:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` wdqs1012.eqiad.wmnet ` The log can b... 
[13:02:16] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` wdqs1013.eqiad.wmnet ` The log can b... [13:08:36] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/578517 (owner: 10Jbond) [13:08:55] (03PS1) 10Andrew Bogott: Revert "neutron: update l3_agent hacks for Queens" [puppet] - 10https://gerrit.wikimedia.org/r/578522 [13:08:57] (03PS1) 10Andrew Bogott: Neutron l3: update with files from Queens [puppet] - 10https://gerrit.wikimedia.org/r/578523 [13:08:59] (03PS1) 10Andrew Bogott: neutron: apply l3_agent hacks for Queens [puppet] - 10https://gerrit.wikimedia.org/r/578524 [13:09:41] (03PS2) 10Jbond: check_puppet_run_changes: Ensure we exit correctly with critical [puppet] - 10https://gerrit.wikimedia.org/r/578517 [13:10:17] (03CR) 10Jbond: "thx" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/578517 (owner: 10Jbond) [13:10:25] 10Operations, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Ottomata) @akosiaris Hm, yes, let's try! We are going to have issues with... 
[13:10:47] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [13:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:16] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:18] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [13:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:44] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash: (Need by: 2020-03-06) rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (10Cmjohnson) [13:16:18] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [13:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:39] !log upgrade ATS on ulsfo to 8.0.6-1wm2 - T245616 [13:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:44] T245616: Provide a simple and automated SSL Ticket key generation system for ATS - https://phabricator.wikimedia.org/T245616 [13:17:04] 10Operations, 10Wikimedia-Logstash: (Need by: 2020-03-06) rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (10Cmjohnson) a:05Jclark-ctr→03herron @herron these servers are all yours, I have already added them to role spare in site.pp. I am removing the ops-eqiad ta... 
[13:17:10] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [13:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:30] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [13:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:58] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:10] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [13:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:44] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-reload [13:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:48] !log gehel@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97) [13:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:38] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wdqs1012.eqiad.wmnet'] ` and were **ALL** successful. [13:25:10] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-reload [13:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:31] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog2001.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [13:26:40] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wdqs1013.eqiad.wmnet'] ` and were **ALL** successful. 
[13:26:49] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [13:27:11] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10Cmjohnson) [13:27:54] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10Cmjohnson) 05Open→03Resolved @gehel I am resolving this task, if there are any issues please re-open and ping me. [13:27:56] (03CR) 10Ottomata: [C: 03+2] Enable client side error logging on haw.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577260 (https://phabricator.wikimedia.org/T246030) (owner: 10Ottomata) [13:28:00] (03PS2) 10Ottomata: Enable client side error logging on haw.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577260 (https://phabricator.wikimedia.org/T246030) [13:29:37] !log T202360 upload apertium-oci-fra_0.3.0-1+wmf1_amd64.changes to apt.wikimedia.org/jessie-wikimedia main [13:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:42] T202360: Package apertium-oci-fra (Occitan-French) - https://phabricator.wikimedia.org/T202360 [13:31:15] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable Mediawiki client side error logging on hawwiki - T246030 (duration: 00m 58s) [13:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:19] T246030: Enable client side error logging in prod for small wiki - https://phabricator.wikimedia.org/T246030 [13:34:44] !log akosiaris@cumin1001 conftool action : set/weight=2; selector: dc=eqiad,service=eventstreams,name=kubernetes.* [13:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:56] 
!log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,service=eventstreams,name=kubernetes.* [13:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:24] !log pool all kubernetes hosts in eqiad for eventstreams. weight=2 which means ~20% of requests are going to be served by kubernetes [13:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:39] <_joe_> jouncebot: next [13:37:39] In 2 hour(s) and 22 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200310T1600) [13:37:45] <_joe_> oook good [13:37:53] <_joe_> I can switch echostore then [13:38:16] <_joe_> akosiaris: FYI, I'm proceeding with echostore [13:39:08] ok [13:40:21] (03CR) 10Jbond: [C: 03+2] check_puppet_run_changes: Ensure we exit correctly with critical [puppet] - 10https://gerrit.wikimedia.org/r/578517 (owner: 10Jbond) [13:40:26] !log bump eventstreams on scb1003 to force users to reconnect, hoping more connections will make it to kubernetes hosts [13:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:47] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_8092: Servers kubernetes1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:40:47] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Use Envoy to talk to echostore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578492 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [13:40:58] <_joe_> uh akosiaris ^^ [13:41:13] <_joe_> that didn't go well it seems [13:41:44] (03Merged) 10jenkins-bot: Use Envoy to talk to echostore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578492 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [13:41:48] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable Mediawiki client side error
logging on hawwiki (take 2) - T246030 (duration: 00m 57s) [13:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:53] T246030: Enable client side error logging in prod for small wiki - https://phabricator.wikimedia.org/T246030 [13:41:53] <_joe_> Mar 10 13:41:39 lvs1015 pybal[4104]: [eventstreams_8092 ProxyFetch] WARN: kubernetes1004.eqiad.wmnet (enabled/down/not pooled): Fetch failed (http://localhost/_info), 30.000 s [13:42:08] <_joe_> akosiaris: we need to rollback [13:42:14] ??? [13:42:27] yes [13:42:34] http vs https probably? [13:42:38] <_joe_> all kube nodes report eventstreams down [13:42:53] es only exposes the tls port [13:43:23] ahhhh yeah akosiaris [13:43:25] it isn't so simple eh? [13:43:41] <_joe_> ottomata: so what's the solution? [13:43:52] we need to make a new LVS entry [13:43:55] with the new port [13:44:03] we can't just add the k8s nodes to the existing one [13:44:30] <_joe_> ok so the solution is to remove them? [13:44:50] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: dc=eqiad,service=eventstreams,name=kubernetes.* [13:44:50] from scb backends? I think so. [13:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:01] I've rolled back [13:45:11] <_joe_> akosiaris: heh thanks [13:45:21] <_joe_> akosiaris: put them pooled=inactive though [13:45:27] <_joe_> so that pybal stops checking them [13:45:35] !log akosiaris@cumin1001 conftool action : set/pooled=inactive; selector: dc=eqiad,service=eventstreams,name=kubernetes.* [13:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:08] <_joe_> ok are we back? if so, I'll go on with my deployment [13:46:22] 10Operations, 10Traffic, 10observability: some Prometheis not scraping the full set of targets - https://phabricator.wikimedia.org/T246860 (10CDanis) [13:46:23] we were probably never in trouble? [13:46:30] I mean pybal probably depooled all those?
[13:46:36] ah wait, the threshold [13:46:39] it might not have ... [13:46:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2121 for reimage to buster - T246604', diff saved to https://phabricator.wikimedia.org/P10676 and previous config saved to /var/cache/conftool/dbconfig/20200310-134648-marostegui.json [13:46:56] 10Operations, 10Traffic, 10observability: some Prometheis not scraping the full set of targets - https://phabricator.wikimedia.org/T246860 (10CDanis) I think the following shell one-liner could be turned into an Icinga check for this condition pretty easily: {P10675} [13:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:59] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 [13:47:11] (03CR) 10Vgutierrez: [C: 03+2] ATS: Turn on TLS Session tickets on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/578327 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [13:49:27] so, we can't really migrate the traffic slowly then [13:49:30] 10Operations, 10Performance Issue: Investigate CAS performance - https://phabricator.wikimedia.org/T246010 (10MoritzMuehlenhoff) Code in Spring Webflow for the above: ` public MappingResults map(Object source, Object target) { if (logger.isDebugEnabled()) { logg... 
[13:51:16] 10Operations, 10Page Content Service, 10Product-Infrastructure-Team-Backlog, 10Wikimedia-Logstash, and 4 others: Move mobileapps logging to new logging pipeline - https://phabricator.wikimedia.org/T219924 (10Mholloway) a:05Mholloway→03None [13:51:57] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Proton, 10Wikimedia-Logstash, and 4 others: Move proton logging to new logging pipeline - https://phabricator.wikimedia.org/T219925 (10Mholloway) [13:51:58] <_joe_> ok going on with my change then [13:52:16] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:52:31] !log Stop mysql on db2121 for reimage to buster T246604 [13:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:36] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 [13:59:19] (03PS6) 10Ssingh: First release of the cescout project [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/576070 (https://phabricator.wikimedia.org/T247273) [14:00:22] !log oblivian@deploy1001 Synchronized wmf-config/ProductionServices.php: switch echotore to use envoy (duration: 00m 57s) [14:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:33] !log reboot cp4026 - T245616 [14:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:38] T245616: Provide a simple and automated SSL Ticket key generation system for ATS - https://phabricator.wikimedia.org/T245616 [14:00:44] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3064 is OK: HTTP OK: HTTP/1.0 200 OK - 22090 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:01:05] hmmm that timeout is real or is just an eqiad<-->esams comm issue? 
[14:02:04] the exporter on cp3064 has a 4weeks uptime :/ [14:02:06] PROBLEM - Check to ensure host are not preforming a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (13) node(s) change every puppet run: elastic2060.codfw.wmnet, logstash1029.eqiad.wmnet, logstash1027.eqiad.wmnet, elastic2056.codfw.wmnet, elastic2057.codfw.wmnet, elastic2059.codfw.wmnet, elastic2058.codfw.wmnet, elastic2055.codfw.wmnet, wdqs1011.eqiad.wmnet, cloudvirt2003-dev.codfw.wmnet, logstash [14:02:06] logstash1026.eqiad.wmnet, wdqs1013.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [14:03:57] (03PS1) 10Ottomata: Add new evenstreams TLS LVS for k8s, rename existing one to eventstreams-scb [puppet] - 10https://gerrit.wikimedia.org/r/578525 (https://phabricator.wikimedia.org/T238658) [14:04:48] (03PS2) 10Ottomata: Add new evenstreams TLS LVS for k8s, rename existing one to eventstreams-scb [puppet] - 10https://gerrit.wikimedia.org/r/578525 (https://phabricator.wikimedia.org/T238658) [14:09:16] (03CR) 10Elukey: admin: deprecate statistics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/578500 (https://phabricator.wikimedia.org/T246578) (owner: 10Elukey) [14:10:57] (03PS3) 10Ottomata: Add new evenstreams TLS LVS for k8s, rename existing one to eventstreams-scb [puppet] - 10https://gerrit.wikimedia.org/r/578525 (https://phabricator.wikimedia.org/T238658) [14:11:26] (03CR) 10Alexandros Kosiaris: Add new evenstreams TLS LVS for k8s, rename existing one to eventstreams-scb (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/578525 (https://phabricator.wikimedia.org/T238658) (owner: 10Ottomata) [14:11:29] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add new evenstreams TLS LVS for k8s, rename existing one to eventstreams-scb [puppet] - 10https://gerrit.wikimedia.org/r/578525 (https://phabricator.wikimedia.org/T238658) (owner: 10Ottomata) [14:12:34] !log marostegui@cumin1001 START - Cookbook 
sre.hosts.downtime [14:12:35] !log Switch to TLS session tickets on ulsfo - T245616 [14:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:43] T245616: Provide a simple and automated SSL Ticket key generation system for ATS - https://phabricator.wikimedia.org/T245616 [14:13:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium-fra-cat: Updated to upstream release 1.7.0 [debs/contenttranslation/apertium-fra-cat] - 10https://gerrit.wikimedia.org/r/577216 (https://phabricator.wikimedia.org/T233700) (owner: 10KartikMistry) [14:14:23] (03PS2) 10Elukey: admin: deprecate statistics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/578500 (https://phabricator.wikimedia.org/T246578) [14:15:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:13] (03CR) 10Ottomata: Add new evenstreams TLS LVS for k8s, rename existing one to eventstreams-scb (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/578525 (https://phabricator.wikimedia.org/T238658) (owner: 10Ottomata) [14:15:33] (03PS1) 10Hnowlan: changeprop: configure redis servers for staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/578526 (https://phabricator.wikimedia.org/T213193) [14:15:43] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_8092: Servers kubernetes1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:15:54] (03PS4) 10Ottomata: Add new evenstreams TLS LVS for k8s, rename existing one to eventstreams-scb [puppet] - 10https://gerrit.wikimedia.org/r/578525 (https://phabricator.wikimedia.org/T238658) [14:16:02] (03PS3) 10Elukey: admin: deprecate statistics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/578500 (https://phabricator.wikimedia.org/T246578) [14:20:43] (03CR) 10Alexandros Kosiaris: [C: 03+2] "PCC happy at https://puppet-compiler.wmflabs.org/compiler1003/21372/, merging" [puppet] - 10https://gerrit.wikimedia.org/r/578525 (https://phabricator.wikimedia.org/T238658) (owner: 10Ottomata) [14:21:30] (03PS1) 10Jforrester: [Beta Cluster] Point to deployment-parsoid11 for Parsoid services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578529 (https://phabricator.wikimedia.org/T246833) [14:25:19] PROBLEM - LVS HTTP IPv4 on eventstreams.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.34 and port 8092: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:25:28] <_joe_> uh [14:25:33] uh oh... [14:25:34] <_joe_> what's happening? ^^ [14:25:38] * akosiaris looking [14:25:41] <_joe_> people... 
[14:25:51] <_joe_> you readded k8s to the same pool [14:25:53] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/eventstreams on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/eventstreams is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:26:19] PROBLEM - Confd template for /srv/config-master/pybal/codfw/eventstreams on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/eventstreams is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:26:26] (03PS3) 10Volans: dns: add the Netbox driven DNS zonefile snippets [puppet] - 10https://gerrit.wikimedia.org/r/578506 (https://phabricator.wikimedia.org/T233183) [14:26:34] (03CR) 10Elukey: [C: 03+2] admin: deprecate statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/578488 (https://phabricator.wikimedia.org/T246578) (owner: 10Elukey) [14:26:47] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/eventstreams on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/eventstreams is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:26:51] pybal hasn't been restarted so we are fine [14:26:54] (03PS4) 10Elukey: admin: deprecate statistics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/578500 (https://phabricator.wikimedia.org/T246578) [14:26:59] traffic hasn't been harmed, just to be clear [14:28:26] <_joe_> akosiaris: and why the page then? [14:28:46] sigh it's the different cluster/svc [14:29:02] <_joe_> ack [14:29:03] at least we caught it early enough [14:29:06] <_joe_> probably a wrong port [14:29:09] (03CR) 10Elukey: [C: 03+2] admin: deprecate statistics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/578500 (https://phabricator.wikimedia.org/T246578) (owner: 10Elukey) [14:29:22] there was a page? [14:29:31] no and there shouldn't have been one? 
[14:29:40] I don't think there was one [14:29:47] (03CR) 10Volans: "New compiler results at:" [puppet] - 10https://gerrit.wikimedia.org/r/578506 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [14:29:55] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.34:8092]) https://wikitech.wikimedia.org/wiki/PyBal [14:30:06] traffic is still fine as far as I know [14:30:26] (03CR) 10Elukey: [C: 03+2] admin: deprecate statistics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/578500 (https://phabricator.wikimedia.org/T246578) (owner: 10Elukey) [14:30:41] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.34:8092]) https://wikitech.wikimedia.org/wiki/PyBal [14:30:51] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.34:8092]) https://wikitech.wikimedia.org/wiki/PyBal [14:30:53] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.34:8092]) https://wikitech.wikimedia.org/wiki/PyBal [14:30:57] PROBLEM - Check if active EventStreams endpoint is delivering messages. on icinga1001 is CRITICAL: CRITICAL: No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. 
https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [14:32:59] (03PS1) 10Alexandros Kosiaris: Revert "Add new evenstreams TLS LVS for k8s, rename existing one to eventstreams-scb" [puppet] - 10https://gerrit.wikimedia.org/r/578530 [14:33:04] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Revert "Add new evenstreams TLS LVS for k8s, rename existing one to eventstreams-scb" [puppet] - 10https://gerrit.wikimedia.org/r/578530 (owner: 10Alexandros Kosiaris) [14:33:41] (03PS1) 10Volans: sre.dns.netbox: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/578531 (https://phabricator.wikimedia.org/T233183) [14:34:49] PROBLEM - LVS HTTP IPv4 on eventstreams.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.34 and port 8092: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:34:57] !log akosiaris@cumin1001 conftool action : set/weight=8; selector: dc=eqiad,service=eventstreams,name=scb.* [14:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:04] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,service=eventstreams,name=scb.* [14:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:11] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=eventstreams,name=scb.* [14:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:18] !log akosiaris@cumin1001 conftool action : set/weight=10; selector: dc=codfw,service=eventstreams,name=scb.* [14:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:25] (03CR) 10jerkins-bot: [V: 04-1] sre.dns.netbox: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/578531 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [14:35:33] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [14:35:35] I take it back, there was service interruption. Pybal removed all backends after all [14:35:49] RECOVERY - LVS HTTP IPv4 on eventstreams.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1043 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:35:51] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:36:39] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:36:41] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:36:47] RECOVERY - LVS HTTP IPv4 on eventstreams.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1043 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:36:49] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:36:51] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:37:09] (03PS1) 10Ema: Use json.Marshal [software/atskafka] - 10https://gerrit.wikimedia.org/r/578532 (https://phabricator.wikimedia.org/T237993) [14:38:16] (03PS2) 10Volans: sre.dns.netbox: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/578531 (https://phabricator.wikimedia.org/T233183) [14:40:42] (03CR) 10Vgutierrez: Use json.Marshal (031 comment) [software/atskafka] - 10https://gerrit.wikimedia.org/r/578532 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [14:42:04] !logT233700 upload apertium-fra-cat_1.7.0-1+wmf1_amd64.changes to apt.wikimedia.org/jessie-wikimedia.org main [14:42:07] !log T233700 upload apertium-fra-cat_1.7.0-1+wmf1_amd64.changes to 
apt.wikimedia.org/jessie-wikimedia.org main [14:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:19] T233700: Update French-Catalan, Spanish-Catalan and English-Catalan Apertium MT pairs - https://phabricator.wikimedia.org/T233700 [14:44:15] (03CR) 10Hnowlan: [C: 03+1] cpjobqueue: Add jobrunner_host & videoscaler_host to deployment vars [puppet] - 10https://gerrit.wikimedia.org/r/577677 (https://phabricator.wikimedia.org/T246371) (owner: 10Clarakosi) [14:44:48] (03CR) 10Alexandros Kosiaris: [C: 04-1] changeprop: configure redis servers for staging. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/578526 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [14:47:07] (03PS2) 10Ema: Use json.Marshal [software/atskafka] - 10https://gerrit.wikimedia.org/r/578532 (https://phabricator.wikimedia.org/T237993) [14:47:33] (03CR) 10Ema: Use json.Marshal (031 comment) [software/atskafka] - 10https://gerrit.wikimedia.org/r/578532 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [14:48:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2121 - T246604', diff saved to https://phabricator.wikimedia.org/P10677 and previous config saved to /var/cache/conftool/dbconfig/20200310-144817-root.json [14:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:22] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 [14:48:55] (03PS3) 10Giuseppe Lavagetto: Move Termbox to ProductionServices, use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578493 (https://phabricator.wikimedia.org/T244843) [14:48:57] (03PS3) 10Giuseppe Lavagetto: wdqs-internal: switch to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578495 (https://phabricator.wikimedia.org/T244843) [14:48:59] (03PS3) 10Giuseppe Lavagetto: Switch ores to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578496 
(https://phabricator.wikimedia.org/T244843) [14:49:01] (03PS3) 10Giuseppe Lavagetto: Switch restbase to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578497 (https://phabricator.wikimedia.org/T244843) [14:50:13] (03CR) 10jerkins-bot: [V: 04-1] Move Termbox to ProductionServices, use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578493 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [14:50:21] (03CR) 10jerkins-bot: [V: 04-1] wdqs-internal: switch to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578495 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [14:51:43] (03PS1) 10Elukey: Move stat1006 to role::statistics::explorer [puppet] - 10https://gerrit.wikimedia.org/r/578535 (https://phabricator.wikimedia.org/T243934) [14:51:53] <_joe_> uh, why oh why [14:54:42] (03PS1) 10Jforrester: tests: Check that variant URLs are localhost-only, where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578536 [14:54:59] (03PS1) 10Ottomata: eventstreams - use evenstreams _tls_helpers.tpl and set envoy resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/578537 (https://phabricator.wikimedia.org/T238658) [14:55:08] _joe_: ^^ Will stop IS getting more non-localhost URLs in it. [14:56:15] <_joe_> James_F: ohhh that's amazing thanks [14:56:30] _joe_: I mean, it's not my finest work, but… ;-) [14:56:42] (03PS2) 10Ottomata: eventstreams - use evenstreams _tls_helpers.tpl and set envoy resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/578537 (https://phabricator.wikimedia.org/T238658) [14:57:11] (03PS1) 10Alexandros Kosiaris: WIP: Add new evenstreams TLS LVS for k8s, rename existing one to eventstreams-scb. [puppet] - 10https://gerrit.wikimedia.org/r/578538 [14:57:11] <_joe_> James_F: I loved the comment "just wow" :D [14:57:17] Yeah. [14:57:26] UseX is meant to be a boolean. [14:57:32] I might make a test for that next. 
[14:57:46] (03CR) 10Jforrester: [C: 03+2] [Beta Cluster] Point to deployment-parsoid11 for Parsoid services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578529 (https://phabricator.wikimedia.org/T246833) (owner: 10Jforrester) [14:58:21] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add new evenstreams TLS LVS for k8s, rename existing one to eventstreams-scb. [puppet] - 10https://gerrit.wikimedia.org/r/578538 (owner: 10Alexandros Kosiaris) [14:58:42] (03Merged) 10jenkins-bot: [Beta Cluster] Point to deployment-parsoid11 for Parsoid services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578529 (https://phabricator.wikimedia.org/T246833) (owner: 10Jforrester) [14:59:39] (03CR) 10Ema: [V: 03+2 C: 03+2] Tidy up go.sum [software/atskafka] - 10https://gerrit.wikimedia.org/r/578490 (owner: 10Ema) [15:00:19] ok, I think I know what happened now [15:00:48] (03CR) 10Ema: "recheck" [software/atskafka] - 10https://gerrit.wikimedia.org/r/578491 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [15:00:49] the etcd lvs cluster group was renamed and the old one ended up being empty [15:01:15] (03CR) 10jerkins-bot: [V: 04-1] Debianization [software/atskafka] - 10https://gerrit.wikimedia.org/r/578491 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [15:01:33] RECOVERY - Check if active EventStreams endpoint is delivering messages. on icinga1001 is OK: OK: An EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. 
https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [15:04:14] (03PS2) 10Alexandros Kosiaris: Add new evenstreams TLS LVS [puppet] - 10https://gerrit.wikimedia.org/r/578538 [15:04:51] (03CR) 10Elukey: [C: 03+2] Move stat1006 to role::statistics::explorer [puppet] - 10https://gerrit.wikimedia.org/r/578535 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [15:07:25] (03PS3) 10Ottomata: eventstreams - use evenstreams _tls_helpers.tpl and set envoy resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/578537 (https://phabricator.wikimedia.org/T238658) [15:07:31] (03CR) 10jerkins-bot: [V: 04-1] eventstreams - use evenstreams _tls_helpers.tpl and set envoy resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/578537 (https://phabricator.wikimedia.org/T238658) (owner: 10Ottomata) [15:09:06] (03PS1) 10Jforrester: tests: Assert that wgUse* flags are boolean [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578540 [15:09:20] (03CR) 10Jforrester: [C: 03+2] tests: Assert that wgUse* flags are boolean [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578540 (owner: 10Jforrester) [15:09:25] (03CR) 10Jforrester: [C: 03+2] tests: Check that variant URLs are localhost-only, where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578536 (owner: 10Jforrester) [15:10:24] (03Merged) 10jenkins-bot: tests: Check that variant URLs are localhost-only, where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578536 (owner: 10Jforrester) [15:10:29] (03PS4) 10Ottomata: eventstreams - use evenstreams _tls_helpers.tpl and set envoy resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/578537 (https://phabricator.wikimedia.org/T238658) [15:10:33] (03Merged) 10jenkins-bot: tests: Assert that wgUse* flags are boolean [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578540 (owner: 10Jforrester) [15:10:56] (03CR) 10Ema: "recheck" [software/atskafka] - 
10https://gerrit.wikimedia.org/r/578491 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [15:13:00] (03PS1) 10Elukey: role::statistics::explorer: remove analytics keytab [puppet] - 10https://gerrit.wikimedia.org/r/578541 (https://phabricator.wikimedia.org/T243934) [15:13:02] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add new evenstreams TLS LVS [puppet] - 10https://gerrit.wikimedia.org/r/578538 (owner: 10Alexandros Kosiaris) [15:14:11] (03PS2) 10Hnowlan: changeprop: configure redis servers for staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/578526 (https://phabricator.wikimedia.org/T213193) [15:14:55] (03CR) 10Vgutierrez: [C: 03+1] Use json.Marshal [software/atskafka] - 10https://gerrit.wikimedia.org/r/578532 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [15:18:48] (03CR) 10Elukey: [C: 03+2] role::statistics::explorer: remove analytics keytab [puppet] - 10https://gerrit.wikimedia.org/r/578541 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [15:20:58] (03PS1) 10Alexandros Kosiaris: eventstreams-tls: Switch state to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/578542 (https://phabricator.wikimedia.org/T238658) [15:21:00] (03PS4) 10Jcrespo: prometheus-mysqld-exporter: Add es3 to the list of standalone sections [puppet] - 10https://gerrit.wikimedia.org/r/576655 (https://phabricator.wikimedia.org/T246072) [15:24:31] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic: debian-glue-backports not enabling backports on buster - https://phabricator.wikimedia.org/T247316 (10ema) [15:26:57] (03PS1) 10Ema: package_builder: assume backports exist [puppet] - 10https://gerrit.wikimedia.org/r/578543 (https://phabricator.wikimedia.org/T247316) [15:27:23] (03PS1) 10Vgutierrez: ATS: Re-enable session ID based cache on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/578544 (https://phabricator.wikimedia.org/T245616) [15:27:50] (03CR) 10jerkins-bot: [V: 04-1] ATS: Re-enable session ID based cache on ulsfo [puppet] 
- 10https://gerrit.wikimedia.org/r/578544 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [15:28:52] (03PS2) 10Vgutierrez: ATS: Re-enable session ID based cache on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/578544 (https://phabricator.wikimedia.org/T245616) [15:29:57] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 10Patch-For-Review: debian-glue-backports not enabling backports on buster - https://phabricator.wikimedia.org/T247316 (10ema) p:05Triage→03Medium [15:30:45] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/eventstreams on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:30:49] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/eventstreams on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:31:48] (03CR) 10Vgutierrez: "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1003/21379/" [puppet] - 10https://gerrit.wikimedia.org/r/578544 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [15:32:28] (03Abandoned) 10Giuseppe Lavagetto: Add ores, wdqs to ProductionServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578494 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [15:34:34] (03PS4) 10Giuseppe Lavagetto: Move Termbox to ProductionServices, use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578493 (https://phabricator.wikimedia.org/T244843) [15:34:36] (03PS4) 10Giuseppe Lavagetto: wdqs-internal: switch to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578495 (https://phabricator.wikimedia.org/T244843) [15:34:38] (03PS4) 10Giuseppe Lavagetto: Switch ores to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578496 (https://phabricator.wikimedia.org/T244843) [15:34:40] (03PS4) 10Giuseppe Lavagetto: Switch restbase to use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578497 
(https://phabricator.wikimedia.org/T244843) [15:36:55] (03CR) 10Ema: ATS: Re-enable session ID based cache on ulsfo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/578544 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [15:38:02] (03CR) 10Vgutierrez: ATS: Re-enable session ID based cache on ulsfo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/578544 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [15:38:22] (03CR) 10Jbond: "had a scan through and added a few comments, looks good" (033 comments) [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/576070 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [15:39:57] (03CR) 10Jforrester: [C: 03+1] Fix incorrect name of safe_service_restart parameter [puppet] - 10https://gerrit.wikimedia.org/r/577725 (https://phabricator.wikimedia.org/T247151) (owner: 10Alex Monk) [15:40:51] (03PS1) 10Jhedden: prometheus: fix int format in node_directory_size [puppet] - 10https://gerrit.wikimedia.org/r/578545 [15:42:07] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, this check was added when Buster was fresh and no backports suite was available there." 
[puppet] - 10https://gerrit.wikimedia.org/r/578543 (https://phabricator.wikimedia.org/T247316) (owner: 10Ema) [15:43:02] (03CR) 10Jhedden: "Cwhite, since you're in this file's log history I was hoping you could help review this" [puppet] - 10https://gerrit.wikimedia.org/r/578545 (owner: 10Jhedden) [15:43:07] RECOVERY - Confd template for /srv/config-master/pybal/codfw/eventstreams on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:44:14] (03CR) 10Ema: [C: 03+1] ATS: Re-enable session ID based cache on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/578544 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [15:44:23] (03CR) 10Vgutierrez: [C: 03+2] ATS: Re-enable session ID based cache on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/578544 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [15:44:50] (03CR) 10Ema: [C: 03+2] package_builder: assume backports exist [puppet] - 10https://gerrit.wikimedia.org/r/578543 (https://phabricator.wikimedia.org/T247316) (owner: 10Ema) [15:47:52] (03CR) 10CRusnov: [C: 03+2] reports/coherence.py: Add check for Juniper inventory item descriptions [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576499 (https://phabricator.wikimedia.org/T241289) (owner: 10CRusnov) [15:48:21] !log re-enabling session id based caching on ulsfo (along with tls session tickets) - T245616 [15:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:27] T245616: Provide a simple and automated SSL Ticket key generation system for ATS - https://phabricator.wikimedia.org/T245616 [15:48:41] <_joe_> jouncebot: next [15:48:41] In 0 hour(s) and 11 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200310T1600) [15:48:51] <_joe_> ok, I can go [15:51:32] (03CR) 10Jhedden: "Example of the error and patch at https://phabricator.wikimedia.org/P10678" [puppet] - 
10https://gerrit.wikimedia.org/r/578545 (owner: 10Jhedden) [15:53:24] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Move Termbox to ProductionServices, use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578493 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [15:54:24] (03Merged) 10jenkins-bot: Move Termbox to ProductionServices, use envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578493 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [15:54:32] <_joe_> ok let's go [15:56:38] (03CR) 10Ema: "recheck" [software/atskafka] - 10https://gerrit.wikimedia.org/r/578491 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [15:57:51] (03PS2) 10CRusnov: reports/coherence.py: Add test for racked devices with no position [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576391 (https://phabricator.wikimedia.org/T239244) [15:59:57] PROBLEM - Host backup2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:00:04] godog and _joe_: It is that lovely time of the day again! You are hereby commanded to deploy Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200310T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:54] (03CR) 10CRusnov: [C: 03+2] reports/coherence.py: Add test for racked devices with no position [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576391 (https://phabricator.wikimedia.org/T239244) (owner: 10CRusnov) [16:02:32] jynus: backup2001 is known? ^ [16:02:46] It is only the mgmt from what I can see [16:03:58] not expected [16:04:15] but the host was known to crash in the past [16:05:13] (03CR) 10CRusnov: [C: 03+1] "Looks pretty straight forward!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/577672 (owner: 10Volans) [16:05:17] !log oblivian@deploy1001 Synchronized wmf-config/ProductionServices.php: switch termbox to use envoy (duration: 00m 59s) [16:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:51] I will handle it soon, as it is for now not causing service loss [16:06:05] when dc op is around [16:06:53] (03CR) 10CRusnov: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/577529 (owner: 10Volans) [16:09:03] (03CR) 10Volans: [C: 03+2] spicerack: allow to cache the Ipmi instance [software/spicerack] - 10https://gerrit.wikimedia.org/r/577672 (owner: 10Volans) [16:09:26] (03CR) 10CRusnov: [C: 03+1] "This looks along the lines of what we've discussed." [puppet] - 10https://gerrit.wikimedia.org/r/578506 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [16:11:07] (03CR) 10Volans: [C: 04-1] "Point to discuss more inline" (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/577529 (owner: 10Volans) [16:14:18] (03PS1) 10Jcrespo: prometheus-mysqld-exporter: Detect special "standalone replicas" from the db [puppet] - 10https://gerrit.wikimedia.org/r/578547 (https://phabricator.wikimedia.org/T246072) [16:14:33] (03Merged) 10jenkins-bot: spicerack: allow to cache the Ipmi instance [software/spicerack] - 10https://gerrit.wikimedia.org/r/577672 (owner: 10Volans) [16:14:38] (03CR) 10BryanDavis: Add support for redirecting to toolforge.org (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578413 (https://phabricator.wikimedia.org/T234617) (owner: 10BryanDavis) [16:15:01] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@d182ca7]: Build airflow venvs from stat1007 [16:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:18] (03CR) 10CRusnov: "THank you for the reviews!" 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/576459 (https://phabricator.wikimedia.org/T243927) (owner: 10CRusnov) [16:15:46] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@d182ca7]: Build airflow venvs from stat1007 (duration: 00m 45s) [16:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:23] (03CR) 10BryanDavis: Add support for redirecting to toolforge.org (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578413 (https://phabricator.wikimedia.org/T234617) (owner: 10BryanDavis) [16:16:26] (03PS2) 10Jcrespo: prometheus-mysqld-exporter: Detect special "standalone replicas" from the db [puppet] - 10https://gerrit.wikimedia.org/r/578547 (https://phabricator.wikimedia.org/T246072) [16:17:52] (03PS3) 10Jcrespo: prometheus-mysqld-exporter: Detect special "standalone replicas" from the db [puppet] - 10https://gerrit.wikimedia.org/r/578547 (https://phabricator.wikimedia.org/T246072) [16:19:14] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 60, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:19:56] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 74, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:20:08] (03PS4) 10Jcrespo: prometheus-mysqld-exporter: Detect special "standalone replicas" from the db [puppet] - 10https://gerrit.wikimedia.org/r/578547 (https://phabricator.wikimedia.org/T246072) [16:21:55] !log volker-e@deploy1001 Started deploy [design/style-guide@14bb669]: Deploy design/style-guide: [16:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:04] !log volker-e@deploy1001 Finished deploy [design/style-guide@14bb669]: Deploy design/style-guide: (duration: 00m 08s) 
[16:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:51] (03CR) 10jerkins-bot: [V: 04-1] prometheus-mysqld-exporter: Detect special "standalone replicas" from the db [puppet] - 10https://gerrit.wikimedia.org/r/578547 (https://phabricator.wikimedia.org/T246072) (owner: 10Jcrespo) [16:25:48] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 76, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:26:16] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:29:37] (03PS1) 10Ema: Handle rdkafka statistics [software/atskafka] - 10https://gerrit.wikimedia.org/r/578549 (https://phabricator.wikimedia.org/T237993) [16:31:27] (03CR) 10Ema: "recheck" [software/atskafka] - 10https://gerrit.wikimedia.org/r/578491 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [16:31:48] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventstreams-tls: Switch state to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/578542 (https://phabricator.wikimedia.org/T238658) (owner: 10Alexandros Kosiaris) [16:32:28] 10Operations, 10DC-Ops, 10Wikimedia-Incident, 10cloud-services-team (Kanban): Increase visibility of auto-generated tasks for RAID errors - https://phabricator.wikimedia.org/T216133 (10Andrew) 05Open→03Resolved a:03Andrew We seem to be getting these alerts now. 
[16:32:46] (03PS5) 10Jcrespo: prometheus-mysqld-exporter: Detect special "standalone replicas" from the db [puppet] - 10https://gerrit.wikimedia.org/r/578547 (https://phabricator.wikimedia.org/T246072) [16:33:06] 10Operations, 10Performance Issue: Investigate CAS performance - https://phabricator.wikimedia.org/T246010 (10MoritzMuehlenhoff) Others are affected by this as well: https://groups.google.com/a/apereo.org/forum/#!topic/cas-user/iMwglmoMBPc [16:34:42] (03CR) 10Jforrester: [C: 03+1] mediawiki: Change php-wmerrors channel from "fatal" to as "exception" [puppet] - 10https://gerrit.wikimedia.org/r/577645 (https://phabricator.wikimedia.org/T247113) (owner: 10Krinkle) [16:35:11] 10Operations, 10DC-Ops, 10Wikimedia-Incident, 10cloud-services-team (Kanban): Increase visibility of auto-generated tasks for RAID errors - https://phabricator.wikimedia.org/T216133 (10JHedden) Icinga alerts for these notifications on cloud/labs hosts were changed to notify the WMCS team in T246130 [16:37:11] (03CR) 10Hnowlan: "> let's see what the apache conf differences and the modules changes imply" [puppet] - 10https://gerrit.wikimedia.org/r/576913 (https://phabricator.wikimedia.org/T246389) (owner: 10Hnowlan) [16:38:47] (03PS1) 10CRusnov: netbox (hiera): Add coherence.Rack to alerted reports [puppet] - 10https://gerrit.wikimedia.org/r/578551 (https://phabricator.wikimedia.org/T239244) [16:44:08] brennen: When are you cutting the branch this week? [16:44:46] Reedy: was planning to start in about 5 minutes. [16:45:05] (03PS2) 10Jhedden: prometheus: fix int format in node_directory_size [puppet] - 10https://gerrit.wikimedia.org/r/578545 (https://phabricator.wikimedia.org/T218925) [16:45:48] Reedy: need me to hold a bit? [16:46:20] Not particularly. We were wondering about getting one of the security patches into master to remove it as a local patch [16:46:45] No need to hold off on that. 
We can backport and clean up after it's in master :) [16:46:59] (03CR) 10Jcrespo: [C: 03+1] "Diffs from production after patch applied on eqiad:" [puppet] - 10https://gerrit.wikimedia.org/r/578547 (https://phabricator.wikimedia.org/T246072) (owner: 10Jcrespo) [16:46:59] kk. [16:50:04] !log starting branch cut for wmf/1.35.0-wmf.23 - T233871 [16:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:09] T233871: 1.35.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T233871 [16:51:40] (03CR) 10Alexandros Kosiaris: [C: 03+1] eventstreams - use evenstreams _tls_helpers.tpl and set envoy resource limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/578537 (https://phabricator.wikimedia.org/T238658) (owner: 10Ottomata) [16:53:04] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/578551 (https://phabricator.wikimedia.org/T239244) (owner: 10CRusnov) [16:57:38] (03PS1) 10Alexandros Kosiaris: eventstreams: Switch to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/578555 (https://phabricator.wikimedia.org/T238658) [16:59:09] (03CR) 10Alexandros Kosiaris: [C: 03+1] changeprop: configure redis servers for staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/578526 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [17:00:04] halfak and accraze: (Dis)respected human, time to deploy Services – Graphoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200310T1700). Please do the needful. [17:00:38] (I'll be doing deployment training this hour.) [17:01:52] 10Operations, 10netbox, 10Patch-For-Review: Netbox report check for no position set in rack - https://phabricator.wikimedia.org/T239244 (10RobH) wmf5801 is the old srx device in ulsfo. It was not sold off with decoms, as we kept all network hardware. I somehow didn't update its entry when we moved things ar... 
[17:02:52] (03CR) 10Ottomata: eventstreams - use evenstreams _tls_helpers.tpl and set envoy resource limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/578537 (https://phabricator.wikimedia.org/T238658) (owner: 10Ottomata) [17:03:44] 10Operations, 10ops-codfw, 10DBA: backup2001.mgmt interface down - https://phabricator.wikimedia.org/T247324 (10jcrespo) [17:04:39] ACKNOWLEDGEMENT - Host backup2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Jcrespo Down, unknown reason https://phabricator.wikimedia.org/T247324 [17:05:17] 10Puppet, 10Toolforge, 10cloud-services-team (Kanban): Switch Toolforge project hosts to the future parser - https://phabricator.wikimedia.org/T177298 (10Andrew) [17:06:59] (03CR) 10Jcrespo: [C: 03+2] prometheus-mysqld-exporter: Detect special "standalone replicas" from the db [puppet] - 10https://gerrit.wikimedia.org/r/578547 (https://phabricator.wikimedia.org/T246072) (owner: 10Jcrespo) [17:08:43] (03CR) 10Ottomata: [C: 03+2] eventstreams - use evenstreams _tls_helpers.tpl and set envoy resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/578537 (https://phabricator.wikimedia.org/T238658) (owner: 10Ottomata) [17:09:41] (03CR) 10Hnowlan: [C: 03+2] changeprop: configure redis servers for staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/578526 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [17:13:48] Going to deploy mobileapps soon. [17:16:17] (03Abandoned) 10Jcrespo: prometheus-mysqld-exporter: Add es3 to the list of standalone sections [puppet] - 10https://gerrit.wikimedia.org/r/576655 (https://phabricator.wikimedia.org/T246072) (owner: 10Jcrespo) [17:16:31] (03PS3) 10Hnowlan: changeprop: configure redis servers for staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/578526 (https://phabricator.wikimedia.org/T213193) [17:16:54] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] changeprop: configure redis servers for staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/578526 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [17:16:59] (03PS2) 10Dwisehaupt: Add frpm2001 to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/577006 (https://phabricator.wikimedia.org/T242269) [17:17:14] (03PS1) 10Jforrester: [nlwiki] Enable WikiLove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578563 (https://phabricator.wikimedia.org/T247286) [17:17:17] (03Merged) 10jenkins-bot: changeprop: configure redis servers for staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/578526 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [17:18:33] !log 1.35.0-wmf.23 was branched at 8e3738cc2f0665d19c1ff758a1f16eebae0039dd for T233871 [17:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:41] T233871: 1.35.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T233871 [17:19:33] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@6c2ee13]: Update mobileapps to 304fb43 [17:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:43] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [17:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:04] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [17:20:04] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . 
[17:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:12] (03PS1) 10Dwisehaupt: Set up frdb2001 in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/578564 (https://phabricator.wikimedia.org/T246045) [17:22:58] (03CR) 10Jforrester: [C: 03+2] [nlwiki] Enable WikiLove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578563 (https://phabricator.wikimedia.org/T247286) (owner: 10Jforrester) [17:23:22] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [17:23:22] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [17:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:00] (03Merged) 10jenkins-bot: [nlwiki] Enable WikiLove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578563 (https://phabricator.wikimedia.org/T247286) (owner: 10Jforrester) [17:24:43] !log Ran mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=nlwiki wikilove for T247286 [17:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:48] T247286: Enable Wikilove on Dutch Wikipedia - https://phabricator.wikimedia.org/T247286 [17:25:53] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [17:25:53] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . 
[17:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:40] (03CR) 10Aaron Schulz: [C: 03+1] mediawiki: Change php-wmerrors channel from "fatal" to as "exception" [puppet] - 10https://gerrit.wikimedia.org/r/577645 (https://phabricator.wikimedia.org/T247113) (owner: 10Krinkle) [17:27:43] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@6c2ee13]: Update mobileapps to 304fb43 (duration: 08m 09s) [17:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:08] PROBLEM - Host mc-gp2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:31:50] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [nlwiki] Enable WikiLove T247286 (duration: 00m 59s) [17:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:56] T247286: Enable Wikilove on Dutch Wikipedia - https://phabricator.wikimedia.org/T247286 [17:33:09] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 01m 00s) [17:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:21] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@88b3e14]: Update predictions dag with new cli parameters [17:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:38] 10Operations, 10Security-Team, 10Stewards-and-global-tools, 10Security, 10User-revi: Security Issue Access Request for 2020 Stewards - https://phabricator.wikimedia.org/T246449 (10chasemp) 05Stalled→03Open [17:34:07] mc-gp2001 was me [17:34:22] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@88b3e14]: Update predictions dag with new cli parameters (duration: 01m 00s) [17:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[17:35:36] RECOVERY - Host mc-gp2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.36 ms [17:36:30] 10Operations, 10Security-Team, 10Stewards-and-global-tools, 10Security, 10User-revi: Security Issue Access Request for 2020 Stewards - https://phabricator.wikimedia.org/T246449 (10chasemp) [17:36:55] 10Operations, 10Security-Team, 10Stewards-and-global-tools, 10Security, 10User-revi: Security Issue Access Request for 2020 Stewards - https://phabricator.wikimedia.org/T246449 (10chasemp) @Krd if you would like to be included here please add MFA to your phab account. Thanks. [17:37:43] 10Operations, 10Security-Team, 10Stewards-and-global-tools, 10Security, 10User-revi: Security Issue Access Request for 2020 Stewards - https://phabricator.wikimedia.org/T246449 (10chasemp) Looks like https://phabricator.wikimedia.org/project/members/2849/ is already updated otherwise so cheers and thank... [17:38:06] 10Operations, 10Security-Team, 10Stewards-and-global-tools, 10Security, 10User-revi: Security Issue Access Request for 2020 Stewards - https://phabricator.wikimedia.org/T246449 (10chasemp) 05Open→03Stalled >>! In T246449#5957171, @chasemp wrote: > @Krd if you would like to be included here please add... [17:38:11] (03CR) 10Ssingh: "Thanks for the feedback and the review. For the remote connection, I will address that in a future release." 
(033 comments) [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/576070 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [17:38:20] 10Operations, 10Security-Team, 10Stewards-and-global-tools, 10Security, 10User-revi: Security Issue Access Request for 2020 Stewards - https://phabricator.wikimedia.org/T246449 (10chasemp) a:05chasemp→03None [17:39:09] (03PS7) 10Ssingh: First release of the cescout project [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/576070 (https://phabricator.wikimedia.org/T247273) [17:40:11] (03PS1) 10Hnowlan: changeprop: enable nutcracker [deployment-charts] - 10https://gerrit.wikimedia.org/r/578567 (https://phabricator.wikimedia.org/T213193) [17:43:13] 10Operations, 10ops-codfw, 10DBA: backup2001.mgmt interface down - https://phabricator.wikimedia.org/T247324 (10Papaul) 05Open→03Resolved a:03Papaul Was cleaning up some old cables and accidentally disconnected the mgmt cable. All back up now [17:43:46] RECOVERY - Host backup2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.63 ms [17:44:46] 10Operations, 10ops-codfw, 10DBA: backup2001.mgmt interface down - https://phabricator.wikimedia.org/T247324 (10jcrespo) We love when it is causes are simple as this! (vs a complex to debug issue). Do not worry at all! Thanks! 
[17:45:27] (03PS1) 10Ottomata: eventgate - use eventgate _tls_helpers.tpl and set envoy resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/578569 (https://phabricator.wikimedia.org/T244843) [17:45:41] (03CR) 10jerkins-bot: [V: 04-1] eventgate - use eventgate _tls_helpers.tpl and set envoy resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/578569 (https://phabricator.wikimedia.org/T244843) (owner: 10Ottomata) [17:49:16] (03CR) 10Jgreen: [C: 03+2] Set up frdb2001 in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/578564 (https://phabricator.wikimedia.org/T246045) (owner: 10Dwisehaupt) [17:49:20] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 99 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [17:49:49] (03PS3) 10Jgreen: Add frpm2001 to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/577006 (https://phabricator.wikimedia.org/T242269) (owner: 10Dwisehaupt) [17:50:12] (03PS2) 10Ottomata: eventgate - use eventgate _tls_helpers.tpl and set envoy resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/578569 (https://phabricator.wikimedia.org/T244843) [17:52:48] (03PS1) 10Brennen Bearnes: Group0 to 1.35.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578570 [17:53:36] (03CR) 10Jgreen: [C: 03+2] Add frpm2001 to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/577006 (https://phabricator.wikimedia.org/T242269) (owner: 10Dwisehaupt) [17:55:01] (03CR) 10Dzahn: [C: 03+2] hiera/aptrepo: rename install_server variables to aptrepo_server [puppet] - 10https://gerrit.wikimedia.org/r/577701 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [17:55:45] !log brennen@deploy1001 Started scap: testwiki to php-1.35.0-wmf.23 and rebuild l10n cache [17:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:58] (03CR) 10Dzahn: "noop" [puppet] - 10https://gerrit.wikimedia.org/r/577701 (https://phabricator.wikimedia.org/T224576) (owner: 
10Dzahn) [17:59:00] (03CR) 10Ottomata: [C: 03+2] eventgate - use eventgate _tls_helpers.tpl and set envoy resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/578569 (https://phabricator.wikimedia.org/T244843) (owner: 10Ottomata) [17:59:03] (03PS3) 10Ottomata: eventgate - use eventgate _tls_helpers.tpl and set envoy resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/578569 (https://phabricator.wikimedia.org/T244843) [17:59:09] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate - use eventgate _tls_helpers.tpl and set envoy resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/578569 (https://phabricator.wikimedia.org/T244843) (owner: 10Ottomata) [17:59:27] (03Merged) 10jenkins-bot: eventgate - use eventgate _tls_helpers.tpl and set envoy resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/578569 (https://phabricator.wikimedia.org/T244843) (owner: 10Ottomata) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200310T1800) [18:04:43] (03PS1) 10Ottomata: eventgate - bump chart version to 0.1.6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/578571 [18:06:26] (03CR) 10Ottomata: [C: 03+2] eventgate - bump chart version to 0.1.6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/578571 (owner: 10Ottomata) [18:06:54] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10Cmjohnson) [18:08:19] (03CR) 10Dzahn: "> Patch Set 7:" [dns] - 10https://gerrit.wikimedia.org/r/569680 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [18:11:15] (03PS1) 10Cmjohnson: Add netboot.cfg and dhcpd file for stat1008 [puppet] - 10https://gerrit.wikimedia.org/r/578573 (https://phabricator.wikimedia.org/T246472) [18:13:35] (03CR) 10Dzahn: [C: 03+2] "Yea, conftool::scripts::safe_service_restart expects a parameter just called $services. 
Also cherry-picked in beta and confirmed noop in p" [puppet] - 10https://gerrit.wikimedia.org/r/577725 (https://phabricator.wikimedia.org/T247151) (owner: 10Alex Monk) [18:15:07] (03CR) 10Cmjohnson: [C: 03+2] Add netboot.cfg and dhcpd file for stat1008 [puppet] - 10https://gerrit.wikimedia.org/r/578573 (https://phabricator.wikimedia.org/T246472) (owner: 10Cmjohnson) [18:16:32] cmjohnson1: thanks a lot for stat1008! [18:17:01] 10Operations, 10Performance-Team, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), 10Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (10Krinkle) I've gone through dozens of very old diffs from unpopular pages (hoping for a cache mi... [18:17:13] (03PS1) 10Cmjohnson: Add stat1008 to site.pp role spare [puppet] - 10https://gerrit.wikimedia.org/r/578574 (https://phabricator.wikimedia.org/T246472) [18:17:24] elukey ^ check that please [18:17:49] 10Operations, 10Performance-Team, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), 10Wikimedia-Incident: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10Krinkle) [18:18:03] (03PS1) 10QEDK: Fix typo, accesible -> accessible [puppet] - 10https://gerrit.wikimedia.org/r/578575 (https://phabricator.wikimedia.org/T201491) [18:18:10] 10Operations, 10Analytics, 10User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10elukey) Next steps: * decide what to do with the `researchers` posix group (fold it in `analytics-privatedata-users`, etc..) 
[18:18:20] 10Operations, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), 10Performance-Team (Radar), 10Wikimedia-Incident: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10Krinkle) a:05Krinkle→03None [18:18:24] 10Operations, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), 10Performance-Team (Radar), 10Wikimedia-Incident: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10Krinkle) [18:19:21] (03PS1) 10Ottomata: eventgate & eventstreams - use main_app.name for tls container name [deployment-charts] - 10https://gerrit.wikimedia.org/r/578576 [18:20:28] (03CR) 10Ottomata: [C: 03+2] eventgate & eventstreams - use main_app.name for tls container name [deployment-charts] - 10https://gerrit.wikimedia.org/r/578576 (owner: 10Ottomata) [18:23:13] (03CR) 10Elukey: [C: 03+1] Add stat1008 to site.pp role spare [puppet] - 10https://gerrit.wikimedia.org/r/578574 (https://phabricator.wikimedia.org/T246472) (owner: 10Cmjohnson) [18:23:25] cmjohnson1: +1ed --^ [18:24:09] (03CR) 10Cmjohnson: [C: 03+2] Add stat1008 to site.pp role spare [puppet] - 10https://gerrit.wikimedia.org/r/578574 (https://phabricator.wikimedia.org/T246472) (owner: 10Cmjohnson) [18:26:38] (03PS1) 10Ottomata: eventgate & eventstreams - use main_app.name for SERVICE_NAME env [deployment-charts] - 10https://gerrit.wikimedia.org/r/578577 [18:26:56] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` stat1008.eqiad.wmnet ` The log can be found in `/va... 
[18:27:31] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10Cmjohnson) [18:27:42] (03PS2) 10Ottomata: eventgate & eventstreams - use main_app.name for SERVICE_NAME env [deployment-charts] - 10https://gerrit.wikimedia.org/r/578577 [18:28:14] (03CR) 10Ottomata: [C: 03+2] eventgate & eventstreams - use main_app.name for SERVICE_NAME env [deployment-charts] - 10https://gerrit.wikimedia.org/r/578577 (owner: 10Ottomata) [18:33:41] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' . [18:33:41] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' . [18:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:24] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' . [18:36:24] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' . [18:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:58] (03PS1) 10Elukey: admin: add user santhosh to analytics-users and gpu-testers [puppet] - 10https://gerrit.wikimedia.org/r/578579 (https://phabricator.wikimedia.org/T247246) [18:37:45] (03CR) 10Elukey: [C: 03+2] "Nuria already approved in the task, no sudo involved, merging :)" [puppet] - 10https://gerrit.wikimedia.org/r/578579 (https://phabricator.wikimedia.org/T247246) (owner: 10Elukey) [18:39:10] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' . 
[18:39:10] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' . [18:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:17] (03CR) 10Bstorm: toolforge-clush: correct the classifications and remove legacy k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/577279 (https://phabricator.wikimedia.org/T246689) (owner: 10Bstorm) [18:39:31] 10Operations, 10ops-codfw, 10Traffic: (Need by: TBD) rack/setup/install cp202[7-0], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10Papaul) [18:39:35] (03CR) 10Bstorm: [C: 03+2] toolforge-clush: correct the classifications and remove legacy k8s [puppet] - 10https://gerrit.wikimedia.org/r/577279 (https://phabricator.wikimedia.org/T246689) (owner: 10Bstorm) [18:40:08] 10Operations, 10ops-codfw, 10Traffic: (Need by: TBD) rack/setup/install cp202[7-0], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10Papaul) p:05Triage→03Medium [18:40:43] bstorm_: ok to merge ? 
[18:40:52] Yup [18:40:53] Plz do [18:42:43] 10Operations, 10Analytics, 10ContentTranslation, 10SRE-Access-Requests, and 2 others: Request for access for stats machines for Santhosh - https://phabricator.wikimedia.org/T247246 (10elukey) 05Open→03Resolved a:03elukey @santhosh you have now access to stat100[4-7], and on 1005 we have a AMD GPU :) [18:47:53] (03PS1) 10Ottomata: eventgate-main - remove unused 'main' release from staging helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/578581 (https://phabricator.wikimedia.org/T245203) [18:51:15] (03CR) 10Ottomata: [C: 03+2] eventgate-main - remove unused 'main' release from staging helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/578581 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [18:55:19] (03PS1) 10Cmjohnson: Add mgmt dns for fran1001 [dns] - 10https://gerrit.wikimedia.org/r/578583 (https://phabricator.wikimedia.org/T245554) [18:56:03] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'production' . [18:56:03] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' . [18:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] brennen and hashar: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Mediawiki train - American+European Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200310T1900). [19:00:38] ^ currently still running `scap sync "testwiki to php-1.35.0-wmf.23 and rebuild l10n cache"` [19:00:45] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' . [19:00:45] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[19:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:08] been grinding for a while, but i'm not sure if it's inordinately slower than usual or not. [19:04:46] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' . [19:04:46] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' . [19:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:08] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: TBD) rack/setup/install wdqs101[123].eqiad.wmnet - https://phabricator.wikimedia.org/T246352 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` stat1008.eqiad.wmnet ` The log can be found in `/var/log/w... [19:08:53] (03CR) 10CRusnov: [C: 03+1] "Looks good for phase 1" [cookbooks] - 10https://gerrit.wikimedia.org/r/578531 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [19:09:34] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' . [19:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:46] (03CR) 10DannyS712: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/578575 (https://phabricator.wikimedia.org/T201491) (owner: 10QEDK) [19:10:52] (03CR) 10CRusnov: [C: 03+1] "LGTM ignorable annoyance inline" (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/577528 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [19:12:16] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' . 
[19:12:16] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' . [19:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:57] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['stat1008.eqiad.wmnet'] ` Of which those **FAILED**: ` ['stat1008.eqiad.wmnet'] ` [19:13:24] elukey: what raid h/w raid setup do you need? I have a raid10 but it's failing during the install [19:14:52] (03PS9) 10CRusnov: tox: Support DNS_INCLUDE_DIR and generated DNS [dns] - 10https://gerrit.wikimedia.org/r/569340 (https://phabricator.wikimedia.org/T243362) [19:15:14] 10Operations, 10SRE-OnFire, 10Wikimedia-Incident: Investigate whether we can automatically share incident status docs with WMDE - https://phabricator.wikimedia.org/T244395 (10RLazarus) [19:15:21] (03CR) 10Volans: "reply inline" (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/577528 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [19:15:58] (03CR) 10CRusnov: "ping @BBlack Riccardo would like a final sign-off that nothing ridiculous will happen." [dns] - 10https://gerrit.wikimedia.org/r/569340 (https://phabricator.wikimedia.org/T243362) (owner: 10CRusnov) [19:18:34] (03PS10) 10CRusnov: tox: Support DNS_INCLUDE_DIR and generated DNS [dns] - 10https://gerrit.wikimedia.org/r/569340 (https://phabricator.wikimedia.org/T243362) [19:19:17] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' . [19:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:25] (03CR) 10Bstorm: [C: 03+1] "Looks good!" 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/577818 (owner: 10BryanDavis) [19:19:26] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need by: ASAP) rack/setup/install fran1001 - https://phabricator.wikimedia.org/T245554 (10Cmjohnson) [19:19:58] (03CR) 10Volans: [C: 03+1] "LGTM, just to be on the safe side it would be great if bblack could also have a look to check that we'll not have any weird behaviour in t" [dns] - 10https://gerrit.wikimedia.org/r/569340 (https://phabricator.wikimedia.org/T243362) (owner: 10CRusnov) [19:20:31] (03CR) 10Bstorm: [C: 03+1] prometheus: fix int format in node_directory_size [puppet] - 10https://gerrit.wikimedia.org/r/578545 (https://phabricator.wikimedia.org/T218925) (owner: 10Jhedden) [19:22:05] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'production' . [19:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:23] (03CR) 10Bstorm: [C: 03+1] Add xmldumps to stat100[4,5] [puppet] - 10https://gerrit.wikimedia.org/r/577278 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [19:25:20] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need by: ASAP) rack/setup/install fran1001 - https://phabricator.wikimedia.org/T245554 (10Cmjohnson) a:05Jclark-ctr→03Jgreen All DC work has been completed and handing this off to @Jgreen for the final image. Jeff, I would appreciate it if you c... [19:26:12] (03CR) 10Jhedden: [C: 03+2] prometheus: fix int format in node_directory_size [puppet] - 10https://gerrit.wikimedia.org/r/578545 (https://phabricator.wikimedia.org/T218925) (owner: 10Jhedden) [19:26:25] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' . 
[19:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:29] (03CR) 10Nuria: [C: 03+1] admin: add user santhosh to analytics-users and gpu-testers [puppet] - 10https://gerrit.wikimedia.org/r/578579 (https://phabricator.wikimedia.org/T247246) (owner: 10Elukey) [19:29:30] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' . [19:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:58] !log scap-cdb-rebuild currently at 29%; at present rate wmf.23 will roll to group0 a bit after the official window [19:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:16] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' . [19:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:40] (03CR) 10Cmjohnson: [C: 03+2] Add mgmt dns for fran1001 [dns] - 10https://gerrit.wikimedia.org/r/578583 (https://phabricator.wikimedia.org/T245554) (owner: 10Cmjohnson) [19:32:43] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' . [19:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:30] !log volker-e@deploy1001 Started deploy [design/style-guide@62bf7c6]: Deploy design/style-guide: [19:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:37] !log volker-e@deploy1001 Finished deploy [design/style-guide@62bf7c6]: Deploy design/style-guide: (duration: 00m 06s) [19:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:40] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' . 
[19:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:24] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' . [19:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:27] !log gerrit1001 - /var/log/syslog empty and 2 rsyslogd procs running, killing one of them, stopping the other, letting puppet run [19:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:36] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' . [19:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:54] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-main' for release 'production' . [19:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:57] (03PS1) 10Cmjohnson: Add mgmt dns for htmldumper1001 [dns] - 10https://gerrit.wikimedia.org/r/578590 (https://phabricator.wikimedia.org/T245567) [19:43:16] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need by: 2020-03-01) rack/setup/install htmldumper1001.eqiad.wmnet. - https://phabricator.wikimedia.org/T245567 (10Cmjohnson) [19:48:10] 10Operations, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Ottomata) @Joe @akosiaris all deployments of eventgate and eventstreams hav... [19:51:46] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission WMF6147 (old frpig2001.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T246824 (10Jgreen) >>! In T246824#5947140, @Papaul wrote: > @jgreen while trying to remove the old frpig2001 mgmt DNS, i am getting the error below > > ` >... 
[19:52:06] (03PS1) 10Cmjohnson: update mgmt hosts to reflect new host name sretest100[12] [dns] - 10https://gerrit.wikimedia.org/r/578591 (https://phabricator.wikimedia.org/T245754) [20:18:46] (03PS3) 10Dzahn: add mw2350-2376 as API and appservers, codfw rack C6 [puppet] - 10https://gerrit.wikimedia.org/r/577409 (https://phabricator.wikimedia.org/T247021) [20:21:05] (03PS4) 10Dzahn: add mw2350-2376 as API and appservers, codfw rack C6 [puppet] - 10https://gerrit.wikimedia.org/r/577409 (https://phabricator.wikimedia.org/T247021) [20:23:15] (03CR) 10RLazarus: [C: 03+1] add mw2350-2376 as API and appservers, codfw rack C6 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/577409 (https://phabricator.wikimedia.org/T247021) (owner: 10Dzahn) [20:29:21] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [20:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:33] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:39] 10Operations, 10serviceops, 10Patch-For-Review: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (10ops-monitoring-bot) Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 27 host(s) and their services with reason: new_install ` mw[235... [20:31:40] noting here that i'm still on scap-cdb-rebuild (86%) for wmf.23. 
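The two scap-cdb-rebuild progress notes above (29% at 19:30:58, 86% at 20:31:40) are enough for a rough finish estimate. A minimal sketch, assuming progress is linear; `estimate_finish` is a hypothetical helper and not part of scap:

```python
from datetime import datetime, timedelta


def estimate_finish(t1, pct1, t2, pct2):
    """Linearly extrapolate the time at which progress reaches 100%."""
    if pct2 <= pct1:
        raise ValueError("progress must increase between observations")
    # Seconds needed per percentage point, measured between the two notes.
    seconds_per_point = (t2 - t1).total_seconds() / (pct2 - pct1)
    return t2 + timedelta(seconds=seconds_per_point * (100 - pct2))


# Timestamps and percentages taken from the two log messages.
first = datetime(2020, 3, 10, 19, 30, 58)
second = datetime(2020, 3, 10, 20, 31, 40)
eta = estimate_finish(first, 29, second, 86)
print(eta.strftime("%H:%M"))
```

The estimate covers only the cdb rebuild step and will drift if the rate changes.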
[20:32:31] (03CR) 10Dzahn: [C: 03+2] add mw2350-2376 as API and appservers, codfw rack C6 [puppet] - 10https://gerrit.wikimedia.org/r/577409 (https://phabricator.wikimedia.org/T247021) (owner: 10Dzahn) [20:34:00] jouncebot: next [20:34:01] In 2 hour(s) and 25 minute(s): Evening SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200310T2300) [20:34:06] jouncebot: now [20:34:06] No deployments scheduled for the next 2 hour(s) and 25 minute(s) [20:38:39] (03PS3) 10Herron: logstash: add ES 7 compatible logstash template [puppet] - 10https://gerrit.wikimedia.org/r/571622 (https://phabricator.wikimedia.org/T247014) [20:39:14] (03CR) 10jerkins-bot: [V: 04-1] logstash: add ES 7 compatible logstash template [puppet] - 10https://gerrit.wikimedia.org/r/571622 (https://phabricator.wikimedia.org/T247014) (owner: 10Herron) [20:39:22] !log brennen@deploy1001 Finished scap: testwiki to php-1.35.0-wmf.23 and rebuild l10n cache (duration: 163m 37s) [20:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:34] (03PS4) 10Herron: logstash: add ES 7 compatible logstash template [puppet] - 10https://gerrit.wikimedia.org/r/571622 (https://phabricator.wikimedia.org/T247014) [20:42:36] (03CR) 10Brennen Bearnes: [C: 03+2] Group0 to 1.35.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578570 (owner: 10Brennen Bearnes) [20:43:22] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2002.codfw.wmnet - https://phabricator.wikimedia.org/T246756 (10Papaul) ` [edit interfaces interface-range disabled] member xe-2/0/47 { ... } + member xe-2/0/46; [edit interfaces] - xe-2/0/46 { - description l... 
[20:43:37] (03Merged) 10jenkins-bot: Group0 to 1.35.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578570 (owner: 10Brennen Bearnes) [20:43:52] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2002.codfw.wmnet - https://phabricator.wikimedia.org/T246756 (10Papaul) [20:45:01] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2003.codfw.wmnet - https://phabricator.wikimedia.org/T246334 (10Papaul) [20:49:05] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.35.0-wmf.23 [20:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:30] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2004.codfw.wmnet - https://phabricator.wikimedia.org/T246669 (10Papaul) ` [edit interfaces interface-range disabled] member xe-2/0/46 { ... } + member xe-7/0/47; [edit interfaces] - xe-7/0/47 { - description l... [20:52:26] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2004.codfw.wmnet - https://phabricator.wikimedia.org/T246669 (10Papaul) [20:57:49] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1009: Device not healthy -SMART- - https://phabricator.wikimedia.org/T244986 (10Jclark-ctr) [20:57:53] (03PS1) 10QEDK: Fix typos, add OS tempfiles to gitignore [puppet] - 10https://gerrit.wikimedia.org/r/578603 (https://phabricator.wikimedia.org/T201491) [21:01:57] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2005.codfw.wmnet - https://phabricator.wikimedia.org/T246666 (10Papaul) ` [edit interfaces interface-range disabled] member xe-7/0/47 { ... } + member xe-7/0/46; [edit interfaces] - xe-7/0/46 { - description l... 
[21:02:16] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2005.codfw.wmnet - https://phabricator.wikimedia.org/T246666 (10Papaul) [21:05:08] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [21:13:22] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install fran1001 - https://phabricator.wikimedia.org/T245554 (10Jgreen) a:05Jgreen→03Jclark-ctr Assigning back to DC-Ops because I'm blocked on a weird error from the iDRAC trying to pxeboot the server. racadm>>racadm serveraction powerup... [21:23:11] 10Operations, 10MediaWiki-General, 10TechCom-RFC (TechCom-RFC-Closed): Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538 (10Krinkle) [21:23:30] (03PS1) 10CDanis: typo fixes: s/ preform/ perform/ [puppet] - 10https://gerrit.wikimedia.org/r/578607 [21:24:26] (03PS2) 10CDanis: typo fixes: s/ preform/ perform/ [puppet] - 10https://gerrit.wikimedia.org/r/578607 [21:26:02] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: ELK7 shards failed errors when loading saved objects, e.g. "field expansion matches too many fields, limit: 1024, got: 1726" - https://phabricator.wikimedia.org/T247014 (10herron) @EBernhardson thanks! I've updated the elasticsearch template in the pat... 
[21:26:45] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [21:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:31] (03CR) 10CDanis: [C: 03+2] typo fixes: s/ preform/ perform/ [puppet] - 10https://gerrit.wikimedia.org/r/578607 (owner: 10CDanis) [21:29:25] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:30] 10Operations, 10serviceops: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (10ops-monitoring-bot) Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 27 host(s) and their services with reason: new_install ` mw[2350-2376].codfw.wmnet ` [21:32:46] 273 pending icinga checks. but they should all recover before the downtime expires, i hope :) [21:32:56] adding 27 codfw servers [21:33:05] after that removing 15 old ones [21:34:20] (03CR) 10EBernhardson: logstash: add ES 7 compatible logstash template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/571622 (https://phabricator.wikimedia.org/T247014) (owner: 10Herron) [21:35:02] mutante: the icinga eventloop process is pretty consistently spinning at 100% cpu on icinga1001, it's a bit concerning [21:35:39] cdanis: it should calm down after a while i hope [21:35:49] the number of pending checks is going down at least [21:35:51] mutante: AFAIK it's been the case for a while now [21:35:55] it makes progress, but [21:36:11] I suspect just spending lots of time forking and reaping and doing things [21:36:12] (03PS1) 10QEDK: Fix typos, add OS tempfiles to gitignore [puppet] - 10https://gerrit.wikimedia.org/r/578609 (https://phabricator.wikimedia.org/T201491) [21:36:58] (03CR) 10Muehlenhoff: "FYI, we also have a typos file in the top level directory of puppet.git to catch similar cases in the future." 
[puppet] - 10https://gerrit.wikimedia.org/r/578607 (owner: 10CDanis) [21:37:05] (03CR) 10EBernhardson: logstash: add ES 7 compatible logstash template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/571622 (https://phabricator.wikimedia.org/T247014) (owner: 10Herron) [21:37:34] (03Abandoned) 10QEDK: Fix typos, add OS tempfiles to gitignore [puppet] - 10https://gerrit.wikimedia.org/r/578603 (https://phabricator.wikimedia.org/T201491) (owner: 10QEDK) [21:37:38] moritzm: yeah I need to look at how that works, as there's an existing 'preformattedHTML' that I want to make sure it doesn't match [21:38:37] !log volker-e@deploy1001 Started deploy [design/style-guide@8eb1daf]: Deploy design/style-guide: [21:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:44] !log volker-e@deploy1001 Finished deploy [design/style-guide@8eb1daf]: Deploy design/style-guide: (duration: 00m 07s) [21:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:50] cdanis: system("git grep -I -n -P -f typos -- #{shell_files}") [21:39:00] (03PS5) 10Sharvaniharan: Enabling depicts count [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575611 [21:43:07] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (9) node(s) change every puppet run: elastic2060.codfw.wmnet, logstash1029.eqiad.wmnet, elastic2056.codfw.wmnet, elastic2057.codfw.wmnet, elastic2059.codfw.wmnet, elastic2058.codfw.wmnet, elastic2055.codfw.wmnet, cloudvirt2003-dev.codfw.wmnet, an-tool1006.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppe [21:43:07] run_changes [21:43:10] (03PS1) 10QEDK: Fix documentation and typos [puppet] - 10https://gerrit.wikimedia.org/r/578611 (https://phabricator.wikimedia.org/T201491) [21:46:44] cdanis: ah, good catch wrt preformattedHTML :-) [21:48:38] ^ new elastic hosts alert .. 
see -sre [21:50:37] out of 28 unhandled alerts 20 are simply disabled notifications [21:51:13] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@dda3d28]: re-sync latest version to trigger scap scripts on new elastic nodes in codfw [21:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:21] disabling notifications does not make alerts handled [21:51:36] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@dda3d28]: re-sync latest version to trigger scap scripts on new elastic nodes in codfw (duration: 00m 23s) [21:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:53] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is OK: HTTP OK: HTTP/1.0 200 OK - 22075 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [21:54:48] 10Operations, 10Wikimedia-Logstash: (Need by: 2020-03-06) rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (10Dzahn) logstash1029 seems to be different from other servers and has problems. Icinga alerts and can't ssh to it. 
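The 'preform' vs 'preformattedHTML' concern raised above comes down to a pattern that matches the typo but not the longer identifier. A minimal sketch using Python's `re` instead of the `git grep -P` call quoted above; the pattern is an assumption and not the actual contents of the typos file:

```python
import re

# Hypothetical pattern: "preform" at a word start, unless followed by "at",
# so "preformat", "preformatted" and "preformattedHTML" are left alone.
PREFORM_TYPO = re.compile(r"\bpreform(?!at)")


def find_typos(lines):
    """Return (line_number, line) pairs that contain the typo."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if PREFORM_TYPO.search(line)
    ]


sample = [
    "this script will preform a backup",
    "render the preformattedHTML block",
    "we preform this step twice",
]
hits = find_typos(sample)
```

Negative lookahead is also valid PCRE syntax, so a similar pattern should be usable with `git grep -P`, though that was not checked against the repository here.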
[21:56:30] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@dda3d28]: re-sync latest version to trigger scap scripts on new elastic nodes in codfw [21:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:45] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@dda3d28]: re-sync latest version to trigger scap scripts on new elastic nodes in codfw (duration: 02m 15s) [21:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:24] 10Operations, 10cloud-services-team (Kanban): rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10Dzahn) currently we have the following alerts but nobody gets notifications about them because those are disabled cloudcontrol2001-dev - systemd state cl... [22:09:39] !log dzahn@cumin1001 conftool action : set/weight=15; selector: name=mw235[0-9].codfw.wmnet [22:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:54] !log mw2359 sudo systemctl start php7.2-fpm_check_restart [22:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:14] (03PS1) 10CDanis: typos: add preform (but not preformat) [puppet] - 10https://gerrit.wikimedia.org/r/578621 [22:12:36] !log dzahn@cumin1001 conftool action : set/weight=15; selector: name=mw236[0-9].codfw.wmnet [22:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:55] !log dzahn@cumin1001 conftool action : set/weight=15; selector: name=mw237[0-4].codfw.wmnet [22:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:44] (03PS1) 10Guozr.im: CuminExecution: Capture Exception cumin.transports.WorkerError [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) [22:15:46] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! 
:) To learn how to get your code changes reviewed faster and more likely to get" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: 10Guozr.im) [22:28:55] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw235[0-9].codfw.wmnet [22:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:26] (03PS1) 10EBernhardson: mjolnir: Ensure python3.7 is available before initializing repo [puppet] - 10https://gerrit.wikimedia.org/r/578628 (https://phabricator.wikimedia.org/T247362) [22:30:42] (03PS1) 10CDanis: add a getter for _spicerack_config_dir [software/spicerack] - 10https://gerrit.wikimedia.org/r/578629 [22:31:32] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw236[0-5].codfw.wmnet [22:31:35] (03PS1) 10Dzahn: site: remove duplicate regex for mw2366-mw2376 [puppet] - 10https://gerrit.wikimedia.org/r/578630 (https://phabricator.wikimedia.org/T247021) [22:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:51] 10Puppet, 10SRE-tools: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (10crusnov) [22:33:39] 10Puppet, 10SRE-tools: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (10crusnov) [22:34:40] 10Puppet, 10SRE-tools: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (10MoritzMuehlenhoff) Note that there's a number of scripts blocked by OS runtime dependencies, e.g. various LDAP scripts are blocked until mwmaint* and cumin* are reimaged to Buster (n... [22:35:19] 10Puppet, 10SRE-tools: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (10crusnov) A quick survey: ` $ git grep '#!.*python' |grep -v python3 15:31:01 modules/admin/data... 
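The survey command quoted above, `git grep '#!.*python' | grep -v python3`, can be approximated in Python for lines in git grep's `path:line` form. A rough sketch; `legacy_python_shebangs` is a hypothetical helper and the sample paths are invented for illustration:

```python
import re

SHEBANG = re.compile(r"#!.*python")


def legacy_python_shebangs(grep_lines):
    """Keep lines with a python shebang that do not mention python3."""
    return [
        line
        for line in grep_lines
        if SHEBANG.search(line) and "python3" not in line
    ]


# Made-up input in the "path:matched line" shape that git grep prints.
sample = [
    "tools/a.py:#!/usr/bin/python",
    "tools/b.py:#!/usr/bin/env python3",
    "tools/c.py:#!/usr/bin/env python2",
    "tools/d.sh:#!/bin/bash",
]
old = legacy_python_shebangs(sample)
```

As with the shell pipeline, a path that itself contains "python3" would be filtered out too.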
[22:38:31] (03CR) 10jerkins-bot: [V: 04-1] add a getter for _spicerack_config_dir [software/spicerack] - 10https://gerrit.wikimedia.org/r/578629 (owner: 10CDanis) [22:38:52] (03PS2) 10Dzahn: site: fix duplicate regex and row for mw2366-mw2376 [puppet] - 10https://gerrit.wikimedia.org/r/578630 (https://phabricator.wikimedia.org/T247021) [22:45:12] (03PS5) 10Krinkle: multiversion: Introduce MWMultiVersion::SUFFIXES constant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577366 [22:46:21] (03CR) 10Krinkle: [C: 03+2] multiversion: Introduce MWMultiVersion::SUFFIXES constant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577366 (owner: 10Krinkle) [22:47:56] (03PS2) 10CDanis: add a getter for _spicerack_config_dir [software/spicerack] - 10https://gerrit.wikimedia.org/r/578629 [22:47:58] (03Merged) 10jenkins-bot: multiversion: Introduce MWMultiVersion::SUFFIXES constant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577366 (owner: 10Krinkle) [22:48:06] 10Puppet, 10SRE-tools: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (10crusnov) p:05Triage→03Medium [22:49:22] * Krinkle testing on mwdebug1002 [22:49:56] (03PS3) 10CDanis: add a getter for _spicerack_config_dir [software/spicerack] - 10https://gerrit.wikimedia.org/r/578629 [22:49:56] I've pulled down the patch on deploy1001 and then used 'git reset HEAD^ && git add -p && git checkout .' to only stage a small part of it, and then 'scap pull' that part to mwdebug1002 [22:51:39] then 'git reset && git checkout .' to clear it, and then repeat the same cycle again after git pull, [22:51:48] jouncebot: now [22:51:48] No deployments scheduled for the next 0 hour(s) and 8 minute(s) [22:52:22] Krinkle: does that mean running "scap pull" on new servers might not do the right thing? [22:54:08] mutante: The technique I described is a way to both stage, and sync a patch one file at a time.
Normally after 'git pull' the entire patch is pulled down, then 'scap pull' on mwdebug will also pull down the entire patch, and then 'scap sync-file' goes one file at a time, which means the test on mwdebug wasn't meaningful. [22:54:39] using 'scap pull' on new or outdated servers is fine afaik [22:54:52] Krinkle: ok, thanks. *nod* [22:55:43] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/578629 (owner: 10CDanis) [22:55:58] (03CR) 10CDanis: [C: 03+2] add a getter for _spicerack_config_dir [software/spicerack] - 10https://gerrit.wikimedia.org/r/578629 (owner: 10CDanis) [22:58:43] !log krinkle@deploy1001 Synchronized multiversion/MWMultiVersion.php: Ib5473af6 (duration: 01m 07s) [22:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:36] (03PS3) 10Dzahn: site: fix duplicate regex and row for mw2366-mw2376 [puppet] - 10https://gerrit.wikimedia.org/r/578630 (https://phabricator.wikimedia.org/T247021) [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200310T2300). [23:00:04] ebernhardson: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
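The partial-staging cycle Krinkle describes above ('git reset HEAD^ && git add -p && git checkout .', then 'git reset && git checkout .') can be illustrated with a toy model of git's three states. This is a simplified sketch, not how git or scap is implemented: it tracks whole files rather than hunks (so `add` stands in for `git add -p`), the `ToyRepo` class is hypothetical, and the file contents are placeholders:

```python
class ToyRepo:
    """Tiny model of git state: HEAD commit, index, and working tree."""

    def __init__(self, parent, patched):
        self.parent = dict(parent)    # tree of HEAD^ (before the patch)
        self.head = dict(patched)     # tree of HEAD (the pulled patch)
        self.index = dict(patched)
        self.worktree = dict(patched)

    def reset_to_parent(self):
        # `git reset HEAD^`: move HEAD and index back, keep the working tree.
        self.head = dict(self.parent)
        self.index = dict(self.parent)

    def add(self, path):
        # File-level stand-in for `git add -p`: stage one chosen change.
        self.index[path] = self.worktree[path]

    def reset(self):
        # `git reset`: make the index match HEAD again.
        self.index = dict(self.head)

    def checkout_all(self):
        # `git checkout .`: overwrite tracked files with the index version.
        for path, content in self.index.items():
            self.worktree[path] = content


parent = {"MWMultiVersion.php": "old", "wgConf.php": "old"}
patched = {"MWMultiVersion.php": "new", "wgConf.php": "new"}
repo = ToyRepo(parent, patched)

# Stage and keep only one file of the patch.
repo.reset_to_parent()
repo.add("MWMultiVersion.php")
repo.checkout_all()
staged_only = dict(repo.worktree)

# Clear it again before the next cycle.
repo.reset()
repo.checkout_all()
cleared = dict(repo.worktree)
```

In this model `staged_only` corresponds to the state that, per the description above, would be synced to mwdebug1002 with 'scap pull', and `cleared` to the tree before the next 'git pull'.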
[23:01:59] (Merged) jenkins-bot: add a getter for _spicerack_config_dir [software/spicerack] - https://gerrit.wikimedia.org/r/578629 (owner: CDanis) [23:02:50] !log krinkle@deploy1001 Synchronized multiversion/MWConfigCacheGenerator.php: Ib5473af6 (duration: 01m 07s) [23:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:58] i can ship it [23:03:11] ebernhardson: done in a minute or two [23:03:20] (CR) Dzahn: [V: +1 C: +1] "https://puppet-compiler.wmflabs.org/compiler1003/21383/" [puppet] - https://gerrit.wikimedia.org/r/578630 (https://phabricator.wikimedia.org/T247021) (owner: Dzahn) [23:03:35] Krinkle: gerrit will need half an hour to merge these anyways :) [23:03:45] :/ [23:04:01] Always looking at the bright side of life, aye :D [23:04:05] i uploaded them ~25 minutes ago and the test-wmf is still running. Yea :/ [23:04:21] (PS5) Herron: logstash: add ES 7 compatible logstash template [puppet] - https://gerrit.wikimedia.org/r/571622 (https://phabricator.wikimedia.org/T247014) [23:04:34] (PS4) Krinkle: tests: Move MWWikiversionsTest out of dblistTest.php [mediawiki-config] - https://gerrit.wikimedia.org/r/577375 [23:04:37] (CR) Krinkle: [C: +2] tests: Move MWWikiversionsTest out of dblistTest.php [mediawiki-config] - https://gerrit.wikimedia.org/r/577375 (owner: Krinkle) [23:04:54] (CR) jerkins-bot: [V: -1] logstash: add ES 7 compatible logstash template [puppet] - https://gerrit.wikimedia.org/r/571622 (https://phabricator.wikimedia.org/T247014) (owner: Herron) [23:05:09] (CR) Cmjohnson: [C: +2] Add mgmt dns for htmldumper1001 [dns] - https://gerrit.wikimedia.org/r/578590 (https://phabricator.wikimedia.org/T245567) (owner: Cmjohnson) [23:05:13] (PS6) Herron: logstash: add ES 7 compatible logstash template [puppet] - https://gerrit.wikimedia.org/r/571622 (https://phabricator.wikimedia.org/T247014) [23:05:19] (CR) Volans: [C: +2] "As agreed on IRC it's 
ok for now as is." [software/netbox-extras] - https://gerrit.wikimedia.org/r/577528 (https://phabricator.wikimedia.org/T233183) (owner: Volans) [23:05:28] (PS4) Dzahn: site: fix duplicate regex and row for mw2366-mw2376 [puppet] - https://gerrit.wikimedia.org/r/578630 (https://phabricator.wikimedia.org/T247021) [23:05:30] !log krinkle@deploy1001 Synchronized wmf-config/wgConf.php: Ib5473af6 (duration: 01m 07s) [23:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:37] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw236[579].codfw.wmnet [23:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:51] (Merged) jenkins-bot: tests: Move MWWikiversionsTest out of dblistTest.php [mediawiki-config] - https://gerrit.wikimedia.org/r/577375 (owner: Krinkle) [23:06:45] (CR) Cmjohnson: [C: +2] update mgmt hosts to reflect new host name sretest100[12] [dns] - https://gerrit.wikimedia.org/r/578591 (https://phabricator.wikimedia.org/T245754) (owner: Cmjohnson) [23:07:19] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw237[135].codfw.wmnet [23:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:43] (CR) Dzahn: [C: +2] site: fix duplicate regex and row for mw2366-mw2376 [puppet] - https://gerrit.wikimedia.org/r/578630 (https://phabricator.wikimedia.org/T247021) (owner: Dzahn) [23:08:52] Operations, ops-eqiad, DC-Ops, Patch-For-Review: (Need by: 2020-03-01) rack/setup/install htmldumper1001.eqiad.wmnet. 
- https://phabricator.wikimedia.org/T245567 (Cmjohnson) [23:11:30] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [23:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:33] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:38] Operations, serviceops: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (ops-monitoring-bot) Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 6 host(s) and their services with reason: new_install ` mw[2366,2368,2370,2372,2374,2... [23:12:57] ebernhardson: all yours [23:15:44] (CR) Herron: logstash: add ES 7 compatible logstash template (1 comment) [puppet] - https://gerrit.wikimedia.org/r/571622 (https://phabricator.wikimedia.org/T247014) (owner: Herron) [23:17:53] (PS1) Cmjohnson: Add production dns for htmldumper1001 [dns] - https://gerrit.wikimedia.org/r/578640 (https://phabricator.wikimedia.org/T245567) [23:18:35] (CR) Cmjohnson: [C: +2] Add production dns for htmldumper1001 [dns] - https://gerrit.wikimedia.org/r/578640 (https://phabricator.wikimedia.org/T245567) (owner: Cmjohnson) [23:19:50] Operations, ops-eqiad, DC-Ops, Patch-For-Review: (Need by: 2020-03-01) rack/setup/install htmldumper1001.eqiad.wmnet. - https://phabricator.wikimedia.org/T245567 (Cmjohnson) [23:23:02] Operations, Wikimedia-Logstash: (Need by: 2020-03-06) rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` logstash1029.eqiad.wmnet ` The log can be found in... 
[23:23:39] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [23:33:05] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1083 is OK: HTTP OK: HTTP/1.0 200 OK - 22037 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [23:37:32] !log mw2366 - systemctl start nutcracker [23:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:34] !log ebernhardson@deploy1001 Synchronized php-1.35.0-wmf.23/extensions/CirrusSearch/includes/Maintenance/Reindexer.php: cirrus: Wait around after a refresh before counting docs (duration: 01m 08s) [23:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:56] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [23:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:29] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:29] !log ebernhardson@deploy1001 Synchronized php-1.35.0-wmf.22/extensions/CirrusSearch/includes/Maintenance/Reindexer.php: (no justification provided) (duration: 01m 07s) [23:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:03] !log start in-place reindex procedure on kowiki against eqiad and codfw [23:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:34] !log dzahn@cumin1001 conftool action : set/weight=15; selector: name=mw2376.codfw.wmnet [23:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:08] !log mw2376 - systemctl start apache2 [23:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:15] 
Operations, Wikimedia-Logstash: (Need by: 2020-03-06) rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['logstash1029.eqiad.wmnet'] ` and were **ALL** successful. [23:53:02] !log volker-e@deploy1001 Started deploy [design/style-guide@8eb1daf]: Deploy design/style-guide: [23:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:08] !log volker-e@deploy1001 Finished deploy [design/style-guide@8eb1daf]: Deploy design/style-guide: (duration: 00m 05s) [23:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:48] Operations, Wikimedia-Logstash: (Need by: 2020-03-06) rack/setup/install logstash102[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T240881 (Cmjohnson) I did a reimage of logstash1029, everything appears normal now