[00:06:05] RECOVERY - Check systemd state on cloudweb2001-dev is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:06:33] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:10:49] PROBLEM - Check systemd state on cloudweb2001-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:14:31] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [00:16:03] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [00:16:07] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [00:20:49] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [00:23:09] 08Warning [01:12:31] (03PS1) 10Alex Monk: deployment-prep: Fix purge_host_regex [puppet] - 10https://gerrit.wikimedia.org/r/536794 [01:12:58] (03CR) 10Alex Monk: "(not cherry-picked exactly, instead I just set the correct hiera data on the instance)" [puppet] - 10https://gerrit.wikimedia.org/r/536794 (owner: 10Alex Monk) [02:23:09] 08Warning [02:25:51] PROBLEM - Postgres Replication Lag on maps2002 is 
CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 60929624 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:29:03] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 230688 and 58 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:37:23] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:45:13] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:23:09] 08Warning [03:55:03] RECOVERY - Check systemd state on cloudweb2001-dev is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:59:45] PROBLEM - Check systemd state on cloudweb2001-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:06:03] RECOVERY - Check systemd state on cloudweb2001-dev is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:10:47] PROBLEM - Check systemd state on cloudweb2001-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:23:05] (03PS4) 104nn1l2: Add support for some languages on Commons and stop support for nys on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536764 (https://phabricator.wikimedia.org/T230480) [05:01:14] (03PS2) 10Marostegui: wmnet: Change s2 CNAME to db1122 [dns] - 10https://gerrit.wikimedia.org/r/535842 (https://phabricator.wikimedia.org/T230785) [05:01:16] (03PS2) 10Marostegui: mariadb: Promote db1122 as s2 primary master [puppet] - 10https://gerrit.wikimedia.org/r/535839 (https://phabricator.wikimedia.org/T230785) [05:09:52] 10Operations, 10DBA, 10Patch-For-Review: Switchover s2 primary database master db1066 -> db1122 - 17th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230785 (10Marostegui) Reserved window on the Deployments page: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1837698&ol... [05:10:39] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:23:10] 08Warning [05:34:28] (03PS1) 10AndyRussG: Disable FundraiserLandingPage extension on test.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536796 [05:37:25] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:38:08] (03PS2) 10AndyRussG: Disable FundraiserLandingPage extension on test.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536796 (https://phabricator.wikimedia.org/T203020) [05:43:54] 10Operations, 10DBA: Decommission db2054.codfw.wmnet - https://phabricator.wikimedia.org/T232969 (10Marostegui) [05:47:57] (03PS1) 10Marostegui: mariadb: Decommission db2054 [puppet] - 10https://gerrit.wikimedia.org/r/536797 (https://phabricator.wikimedia.org/T232969) [05:48:29] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db2054 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536798 (https://phabricator.wikimedia.org/T232969) [05:54:00] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Remove db2054 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536798 (https://phabricator.wikimedia.org/T232969) (owner: 10Marostegui) [05:54:51] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2054 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536798 (https://phabricator.wikimedia.org/T232969) (owner: 10Marostegui) [05:56:25] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2054 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536798 (https://phabricator.wikimedia.org/T232969) (owner: 10Marostegui) [05:58:05] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2054 from config T232969 (duration: 01m 05s) [05:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:08] T232969: Decommission db2054.codfw.wmnet - 
https://phabricator.wikimedia.org/T232969 [05:59:22] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2054 from config T232969 (duration: 01m 03s) [05:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:43] 10Operations, 10DBA, 10Patch-For-Review: Decommission db2054.codfw.wmnet - https://phabricator.wikimedia.org/T232969 (10Marostegui) p:05Triage→03Normal [06:01:53] !log Remove db2054 from tendril and zarcillo T232969 [06:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:51] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2054 [puppet] - 10https://gerrit.wikimedia.org/r/536797 (https://phabricator.wikimedia.org/T232969) (owner: 10Marostegui) [06:03:43] 10Operations, 10DBA, 10Patch-For-Review: Decommission db2054.codfw.wmnet - https://phabricator.wikimedia.org/T232969 (10Marostegui) [06:03:59] !log Stop MySQL on db2054 for decommissioning T232969 [06:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:01] T232969: Decommission db2054.codfw.wmnet - https://phabricator.wikimedia.org/T232969 [06:04:45] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2054.codfw.wmnet - https://phabricator.wikimedia.org/T232969 (10Marostegui) a:05Marostegui→03RobH [06:04:59] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2054.codfw.wmnet - https://phabricator.wikimedia.org/T232969 (10Marostegui) This host is ready for #dc-ops to decommission [06:05:18] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [06:13:42] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` clo... 
[06:13:46] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirtan1001.eqiad.wmnet'] ` Of which those **F... [06:15:28] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` clo... [06:15:32] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirtan1001.eqiad.wmnet'] ` Of which those **F... [06:15:39] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` clo... [06:23:09] 08Warning [06:24:56] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-presto1001.eqiad.wmnet'] ` Of which those **FAI... [06:37:27] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:45:19] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:48:29] !log Stop MySQL on db1114 to upgrade it to 10.3 [06:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:52] (03CR) 10Marostegui: control-mariadb-10.3*: Upgrade version [software] - 10https://gerrit.wikimedia.org/r/536005 (owner: 10Marostegui) [07:09:55] (03CR) 10Marostegui: [C: 03+2] control-mariadb-10.3*: Upgrade version [software] - 10https://gerrit.wikimedia.org/r/536005 (owner: 10Marostegui) [07:10:23] (03Merged) 10jenkins-bot: control-mariadb-10.3*: Upgrade version [software] - 10https://gerrit.wikimedia.org/r/536005 (owner: 10Marostegui) [07:10:55] (03PS1) 10Elukey: Set Debian Buster fro an-presto100* nodes [puppet] - 10https://gerrit.wikimedia.org/r/536804 (https://phabricator.wikimedia.org/T225128) [07:12:29] (03CR) 10Elukey: [C: 03+2] Set Debian Buster fro an-presto100* nodes [puppet] - 10https://gerrit.wikimedia.org/r/536804 (https://phabricator.wikimedia.org/T225128) (owner: 10Elukey) [07:15:48] !log repooling restbase2009 [07:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:03] 10Operations, 10netops: librenms doesn't print alert text on irc anymore - https://phabricator.wikimedia.org/T232977 (10fgiunchedi) [07:25:14] (03PS1) 10Muehlenhoff: Remove access for nirzar [puppet] - 10https://gerrit.wikimedia.org/r/536809 [07:26:16] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus-ipsec-exporter: subscribe service to /etc/ipsec.conf [puppet] - 10https://gerrit.wikimedia.org/r/536671 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [07:27:41] (03PS2) 10Muehlenhoff: Remove access for nirzar [puppet] - 10https://gerrit.wikimedia.org/r/536809 [07:29:05] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: make statsd.relay-address toggle-able [puppet] - 10https://gerrit.wikimedia.org/r/536365 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [07:31:43] (03CR) 10Muehlenhoff: [C: 03+2] Remove 
access for nirzar [puppet] - 10https://gerrit.wikimedia.org/r/536809 (owner: 10Muehlenhoff) [07:35:16] (03CR) 10Filippo Giunchedi: profile: use prometheus for logstash alerting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536358 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [07:38:00] !log installing faad2 security updates [07:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:23] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:49:23] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/536591 (https://phabricator.wikimedia.org/T232303) (owner: 10Filippo Giunchedi) [07:52:45] (03PS1) 10Marostegui: mariadb: Productionize dbproxy1021 [puppet] - 10https://gerrit.wikimedia.org/r/536812 (https://phabricator.wikimedia.org/T202367) [07:53:12] (03CR) 10Marostegui: [C: 03+1] mariadb::ferm_misc: allow connections from gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/535966 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [07:53:48] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize dbproxy1021 [puppet] - 10https://gerrit.wikimedia.org/r/536812 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [07:54:19] (03PS1) 10Elukey: partman: add more preseed configs for an-presto's d-i [puppet] - 10https://gerrit.wikimedia.org/r/536835 (https://phabricator.wikimedia.org/T225128) [07:54:44] (03PS2) 10Elukey: partman: add more preseed configs for an-presto's d-i [puppet] - 10https://gerrit.wikimedia.org/r/536835 (https://phabricator.wikimedia.org/T225128) [07:57:12] (03CR) 10Elukey: [C: 03+2] partman: add more preseed configs for an-presto's d-i [puppet] - 10https://gerrit.wikimedia.org/r/536835 (https://phabricator.wikimedia.org/T225128) (owner: 10Elukey) [08:01:57] (03CR) 10Muehlenhoff: gerrit: 
get LDAP servers from ldap_config, use ro server, simplify (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/536714 (owner: 10Dzahn) [08:09:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [08:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:38] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:14:00] (03PS3) 10Filippo Giunchedi: swift: remove statsite [puppet] - 10https://gerrit.wikimedia.org/r/536146 (https://phabricator.wikimedia.org/T205870) [08:14:21] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [08:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:34] (03PS4) 10Filippo Giunchedi: statsite: support for ensure/removal [puppet] - 10https://gerrit.wikimedia.org/r/536136 [08:16:02] (03CR) 10Filippo Giunchedi: [C: 03+2] statsite: support for ensure/removal [puppet] - 10https://gerrit.wikimedia.org/r/536136 (owner: 10Filippo Giunchedi) [08:16:32] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:28] (03PS4) 10Filippo Giunchedi: swift: remove statsite [puppet] - 10https://gerrit.wikimedia.org/r/536146 (https://phabricator.wikimedia.org/T205870) [08:20:38] (03Abandoned) 10Filippo Giunchedi: secret: add thumbor ban lists [labs/private] - 10https://gerrit.wikimedia.org/r/512485 (owner: 10Filippo Giunchedi) [08:21:32] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[08:22:18] (03PS1) 10Marostegui: production-m5.sql.erb: Add dbproxy1021 grants [puppet] - 10https://gerrit.wikimedia.org/r/536961 (https://phabricator.wikimedia.org/T202367) [08:23:06] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: remove statsite [puppet] - 10https://gerrit.wikimedia.org/r/536146 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [08:23:10] 08Warning [08:23:46] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:24:15] (03CR) 10Marostegui: "This also requires manual grants creation on the m5 databases" [puppet] - 10https://gerrit.wikimedia.org/r/536961 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [08:29:19] (03PS1) 10Filippo Giunchedi: statsite: set ensure for instance config file [puppet] - 10https://gerrit.wikimedia.org/r/536963 [08:29:48] (03CR) 10Filippo Giunchedi: [C: 03+1] statsite: set ensure for instance config file [puppet] - 10https://gerrit.wikimedia.org/r/536963 (owner: 10Filippo Giunchedi) [08:29:55] (03CR) 10jerkins-bot: [V: 04-1] statsite: set ensure for instance config file [puppet] - 10https://gerrit.wikimedia.org/r/536963 (owner: 10Filippo Giunchedi) [08:31:24] (03PS2) 10Filippo Giunchedi: statsite: set ensure for instance config file [puppet] - 10https://gerrit.wikimedia.org/r/536963 [08:32:08] (03CR) 10Filippo Giunchedi: [C: 03+2] statsite: set ensure for instance config file [puppet] - 10https://gerrit.wikimedia.org/r/536963 (owner: 10Filippo Giunchedi) [08:35:22] (03CR) 10Vgutierrez: [C: 03+2] ATS: Fix check for do_ocsp [puppet] - 10https://gerrit.wikimedia.org/r/536747 (owner: 10Alex Monk) [08:35:32] (03PS2) 10Vgutierrez: ATS: Fix check for do_ocsp [puppet] - 10https://gerrit.wikimedia.org/r/536747 (owner: 10Alex Monk) [08:36:13] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Enable source wiki editing for testwiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/536590 (https://phabricator.wikimedia.org/T228851) (owner: 10Awight) [08:44:29] (03PS2) 10Marostegui: production-m5.sql.erb: Add dbproxy1021 grants [puppet] - 10https://gerrit.wikimedia.org/r/536961 (https://phabricator.wikimedia.org/T202367) [08:45:48] (03PS1) 10Effie Mouzeli: profile::mediawiki::php: Add rlimit_core php.ini variable [puppet] - 10https://gerrit.wikimedia.org/r/536966 [08:45:58] (03PS2) 10Mobrovac: Enable loading Parsoid/PHP as an extension on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534215 (https://phabricator.wikimedia.org/T231569) (owner: 10Subramanya Sastry) [08:50:08] !log Apply grants for dbproxy1021 on db1133 (m5 master) with replication - T202367 [08:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:11] T202367: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367 [08:50:12] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp3034 [puppet] - 10https://gerrit.wikimedia.org/r/536967 (https://phabricator.wikimedia.org/T231433) [08:50:14] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp3034 [puppet] - 10https://gerrit.wikimedia.org/r/536968 (https://phabricator.wikimedia.org/T231433) [08:52:03] (03PS3) 10Marostegui: production-m5.sql.erb: Add dbproxy1021 grants [puppet] - 10https://gerrit.wikimedia.org/r/536961 (https://phabricator.wikimedia.org/T202367) [08:52:08] * mobrovac taking over deploy1001 for a quick mw-config deploy [08:52:22] (03CR) 10Mobrovac: [C: 03+2] Enable loading Parsoid/PHP as an extension on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534215 (https://phabricator.wikimedia.org/T231569) (owner: 10Subramanya Sastry) [08:53:56] 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (10MoritzMuehlenhoff) [08:54:28] 
(03Merged) 10jenkins-bot: Enable loading Parsoid/PHP as an extension on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534215 (https://phabricator.wikimedia.org/T231569) (owner: 10Subramanya Sastry) [08:54:45] (03CR) 10Marostegui: [C: 03+2] production-m5.sql.erb: Add dbproxy1021 grants [puppet] - 10https://gerrit.wikimedia.org/r/536961 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [08:56:30] !log mobrovac@deploy1001 Synchronized wmf-config/CommonSettings.php: Beta: enable the Parsoid extension - T231569 (duration: 01m 01s) [08:56:41] (03CR) 10jenkins-bot: Enable loading Parsoid/PHP as an extension on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534215 (https://phabricator.wikimedia.org/T231569) (owner: 10Subramanya Sastry) [08:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:50] T231569: Deploy Parsoid-PHP (integrated with Mediawiki) to the beta cluster - https://phabricator.wikimedia.org/T231569 [08:56:52] * mobrovac is done [08:57:44] !log replacing nginx with ATS in cp3034 (upload cluster) - T231433 [08:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:48] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 [08:58:01] (03PS3) 10Filippo Giunchedi: thumbor: stop relaying to statsd/statsite [puppet] - 10https://gerrit.wikimedia.org/r/535591 (https://phabricator.wikimedia.org/T205870) [08:58:12] 10Operations, 10Wikimedia-Mailing-lists: Official support for upgrade from existing Mailman 2.1 lists to Mailman 3 - https://phabricator.wikimedia.org/T130554 (10MarcoAurelio) It looks that the link provided above by the task author no longer reads what was wrote. Instead I've found https://docs.mailman3.org/e... 
[08:58:35] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to port 4443 on cp3034 [puppet] - 10https://gerrit.wikimedia.org/r/536967 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [08:58:45] (03PS2) 10Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp3034 [puppet] - 10https://gerrit.wikimedia.org/r/536967 (https://phabricator.wikimedia.org/T231433) [08:59:52] (03CR) 10Filippo Giunchedi: "Let me know how this looks, the Prometheus-based Thumbor dashboard is here https://grafana.wikimedia.org/d/fKaieTcZk/filippo-thumbor-prome" [puppet] - 10https://gerrit.wikimedia.org/r/535591 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [09:02:47] (03PS1) 10Marostegui: db1074: Disable notifications also in puppet [puppet] - 10https://gerrit.wikimedia.org/r/536970 (https://phabricator.wikimedia.org/T231638) [09:03:37] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to port 443 on cp3034 [puppet] - 10https://gerrit.wikimedia.org/r/536968 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [09:03:46] (03PS2) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp3034 [puppet] - 10https://gerrit.wikimedia.org/r/536968 (https://phabricator.wikimedia.org/T231433) [09:05:17] (03PS2) 10Marostegui: db1074: Disable notifications also in puppet [puppet] - 10https://gerrit.wikimedia.org/r/536970 (https://phabricator.wikimedia.org/T231638) [09:06:01] PROBLEM - HTTPS Unified ECDSA on cp3034 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [09:06:07] PROBLEM - HTTPS Unified RSA on cp3034 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [09:06:17] ^^ that's expected [09:06:28] (03CR) 10Marostegui: [C: 03+2] db1074: Disable notifications also in puppet [puppet] - 10https://gerrit.wikimedia.org/r/536970 
(https://phabricator.wikimedia.org/T231638) (owner: 10Marostegui) [09:07:27] RECOVERY - HTTPS Unified ECDSA on cp3034 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345595 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 66 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:07:31] RECOVERY - HTTPS Unified RSA on cp3034 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345591 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 66 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:10:28] (03PS1) 10Awight: Enable File Importer source wiki edits on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536974 (https://phabricator.wikimedia.org/T228851) [09:15:22] (03PS1) 10Mobrovac: Parsoid: Change the beta port to 8001 to avoid conflict with PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/536975 (https://phabricator.wikimedia.org/T231569) [09:15:45] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "I was wondering if /var/lib/systemd/coredump is where I like the core dumps to be on the filesystem." [puppet] - 10https://gerrit.wikimedia.org/r/536558 (owner: 10Effie Mouzeli) [09:20:13] (03CR) 10Mobrovac: "PCC ok - https://puppet-compiler.wmflabs.org/compiler1001/18308/" [puppet] - 10https://gerrit.wikimedia.org/r/536975 (https://phabricator.wikimedia.org/T231569) (owner: 10Mobrovac) [09:23:09] 08Warning [09:23:23] Warning? 
[09:24:20] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [09:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:24] jynus: Filippo already opened a task about it, it seems due to the recent librenms update [09:26:09] https://phabricator.wikimedia.org/T232977 [09:26:36] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:23] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: site=codfw https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [09:28:07] ^errors on ms-be CC godog [09:30:33] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10elukey) * an-presto1001 ` NIC in Slot 4 Port 1: QLogic 577xx/578xx 10 Gb Ethernet BCM57810 - F4:E9:D4:... 
[09:33:19] uughh thanks jynus I'm taking a look [09:33:33] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+1] Add wikifeeds helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/535533 (https://phabricator.wikimedia.org/T170455) (owner: 10Alexandros Kosiaris) [09:34:00] (03PS3) 10Alexandros Kosiaris: Enable coredns based cluster DNS in codfw [puppet] - 10https://gerrit.wikimedia.org/r/536593 [09:37:49] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:39:45] (03PS1) 10Effie Mouzeli: profile::mediawiki::common Enable systemd-coredump on all mw servers [puppet] - 10https://gerrit.wikimedia.org/r/536979 [09:40:23] (03CR) 10Alexandros Kosiaris: [C: 03+2] Enable coredns based cluster DNS in codfw [puppet] - 10https://gerrit.wikimedia.org/r/536593 (owner: 10Alexandros Kosiaris) [09:41:53] (03Abandoned) 10Effie Mouzeli: profile::mediawiki::api: Setup systemd-coredump on api servers [puppet] - 10https://gerrit.wikimedia.org/r/536582 (https://phabricator.wikimedia.org/T232613) (owner: 10Effie Mouzeli) [09:41:58] (03CR) 10jerkins-bot: [V: 04-1] profile::mediawiki::common Enable systemd-coredump on all mw servers [puppet] - 10https://gerrit.wikimedia.org/r/536979 (owner: 10Effie Mouzeli) [09:43:13] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:44:49] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:04] (03PS1) 10Filippo Giunchedi: statsite: don't enable service when statsite is absent [puppet] - 10https://gerrit.wikimedia.org/r/536981 [09:46:28] (03PS2) 10Muehlenhoff: Adapt path for puppet cert used by nginx/puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/536233 [09:46:39] (03CR) 10jerkins-bot: [V: 04-1] statsite: don't enable service when statsite is absent [puppet] - 10https://gerrit.wikimedia.org/r/536981 (owner: 10Filippo Giunchedi) [09:48:09] (03CR) 10Muehlenhoff: [C: 03+2] Adapt path for puppet cert used by nginx/puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/536233 (owner: 10Muehlenhoff) [09:49:17] (03PS1) 10Matthias Geisler: bridge: enable EditTags for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536982 [09:49:58] (03PS2) 10Matthias Geisler: bridge: enable EditTags for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536982 (https://phabricator.wikimedia.org/T232582) [09:51:39] (03PS2) 10Filippo Giunchedi: statsite: don't enable service when statsite is absent [puppet] - 10https://gerrit.wikimedia.org/r/536981 [09:53:22] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:53:25] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/18310/" [puppet] - 10https://gerrit.wikimedia.org/r/536981 (owner: 10Filippo Giunchedi) [09:53:39] (03PS3) 10Filippo Giunchedi: statsite: don't enable service when statsite is absent [puppet] - 10https://gerrit.wikimedia.org/r/536981 [09:53:51] (03CR) 10jerkins-bot: [V: 04-1] bridge: enable EditTags for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536982 (https://phabricator.wikimedia.org/T232582) (owner: 10Matthias Geisler) [09:53:55] (03PS2) 10Effie Mouzeli: profile::mediawiki::php: Add rlimit_core php.ini variable [puppet] - 10https://gerrit.wikimedia.org/r/536966 [09:55:38] RECOVERY - Check systemd state on puppetdb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:56:02] (03PS1) 10Mathew.onipe: query_service: rename wdqs module to query_service [puppet] - 10https://gerrit.wikimedia.org/r/537008 (https://phabricator.wikimedia.org/T232297) [09:56:39] (03PS2) 10Effie Mouzeli: profile::mediawiki::common Enable systemd-coredump on all mw servers [puppet] - 10https://gerrit.wikimedia.org/r/536979 [09:56:54] (03CR) 10jerkins-bot: [V: 04-1] query_service: rename wdqs module to query_service [puppet] - 10https://gerrit.wikimedia.org/r/537008 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [09:57:56] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:00:48] (03PS3) 10Matthias Geisler: bridge: enable EditTags for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536982 (https://phabricator.wikimedia.org/T232582) [10:01:08] (03PS7) 10Effie Mouzeli: systemd: Add support for coredump.conf [puppet] - 
10https://gerrit.wikimedia.org/r/536558 [10:01:44] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:01:59] (03PS3) 10Effie Mouzeli: profile::mediawiki::common Enable systemd-coredump on all mw servers [puppet] - 10https://gerrit.wikimedia.org/r/536979 [10:03:37] (03CR) 10jerkins-bot: [V: 04-1] bridge: enable EditTags for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536982 (https://phabricator.wikimedia.org/T232582) (owner: 10Matthias Geisler) [10:05:17] !log restarting acmechief servers to get latest kernel upgrades [10:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:18] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:08:24] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:09:35] (03PS4) 10Matthias Geisler: bridge: enable EditTags for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536982 (https://phabricator.wikimedia.org/T232582) [10:16:04] !log upload libtrapperkeeper-webserver-jetty9-clojure 1.7.0-2+wmf1 to buster-wikimedia [10:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:36] !log upgrade acme-chief production servers to acme-chief 0.21 - T219765 [10:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:39] T219765: Implement server-side OCSP stapling - https://phabricator.wikimedia.org/T219765 [10:21:55] This is naughty, but I'm going to merge backports in advance of the SWAT window, so they land some time today... 
[10:21:55] (03PS4) 10Effie Mouzeli: profile::mediawiki::common Enable systemd-coredump on all mw servers [puppet] - 10https://gerrit.wikimedia.org/r/536979 [10:24:59] 10Operations, 10Wikimedia-Mailing-lists: Reset inactive admin of offline-l mailing list - https://phabricator.wikimedia.org/T232609 (10jbond) 05Open→03Resolved a:03jbond Hello All, Sorry for the delay. I have now updated the list admins to stephane and kelson and reset the admin password (which should... [10:29:06] <_joe_> !log installing a patched php-memcached on mw1347 T232613 [10:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:11] T232613: LBFactoryMulti.php: PHP Notice: Undefined index: - https://phabricator.wikimedia.org/T232613 [10:30:04] jan_drewniak: #bothumor I � Unicode. All rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190916T1030). [10:34:49] (03CR) 10Effie Mouzeli: "NOOP as expected https://puppet-compiler.wmflabs.org/compiler1001/18313/" [puppet] - 10https://gerrit.wikimedia.org/r/536558 (owner: 10Effie Mouzeli) [10:35:48] (03CR) 10Effie Mouzeli: [C: 03+2] systemd: Add support for coredump.conf [puppet] - 10https://gerrit.wikimedia.org/r/536558 (owner: 10Effie Mouzeli) [10:37:19] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537060 (https://phabricator.wikimedia.org/T128546) [10:41:47] (03CR) 10Effie Mouzeli: "LGTM https://puppet-compiler.wmflabs.org/compiler1002/18314/" [puppet] - 10https://gerrit.wikimedia.org/r/536966 (owner: 10Effie Mouzeli) [10:42:02] (03PS1) 10Alexandros Kosiaris: coredns: Keep traffic node local if possible [deployment-charts] - 10https://gerrit.wikimedia.org/r/537061 [10:42:07] (03PS5) 10Effie Mouzeli: profile::mediawiki::common Enable systemd-coredump on all mw servers [puppet] - 10https://gerrit.wikimedia.org/r/536979 [10:42:21] (03PS1) 10Vgutierrez: ncredir: Enable OCSP [puppet] - 
10https://gerrit.wikimedia.org/r/537062 (https://phabricator.wikimedia.org/T232988) [10:43:04] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537060 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:44:13] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/530096 (https://phabricator.wikimedia.org/T225297) (owner: 10Elukey) [10:44:29] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+1] "Looks as expected https://puppet-compiler.wmflabs.org/compiler1001/18315/thumbor1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/535591 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [10:45:08] !log Enabling OCSP prefetched responses for the non-canonical redirect service - T232988 [10:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:12] T232988: Use acme-chief provided OCSP stapling responses - https://phabricator.wikimedia.org/T232988 [10:45:38] (03CR) 10Giuseppe Lavagetto: [C: 03+1] coredns: Keep traffic node local if possible [deployment-charts] - 10https://gerrit.wikimedia.org/r/537061 (owner: 10Alexandros Kosiaris) [10:46:58] (03CR) 10Vgutierrez: [C: 03+2] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/18317/" [puppet] - 10https://gerrit.wikimedia.org/r/537062 (https://phabricator.wikimedia.org/T232988) (owner: 10Vgutierrez) [10:47:08] (03PS2) 10Vgutierrez: ncredir: Enable OCSP [puppet] - 10https://gerrit.wikimedia.org/r/537062 (https://phabricator.wikimedia.org/T232988) [10:47:23] (03CR) 10Jdrewniak: [V: 03+2 C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537060 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:47:49] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537060 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:48:12] 
PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:49:50] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:537060| Bumping portals to master (T128546)]] (duration: 01m 04s) [10:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:52] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:50:53] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:537060| Bumping portals to master (T128546)]] (duration: 01m 03s) [10:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:03] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=ncredir2002.codfw.wmnet [10:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:51] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ncredir2002.codfw.wmnet [10:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:57] (03CR) 10Effie Mouzeli: "As expected, https://puppet-compiler.wmflabs.org/compiler1001/18316/," [puppet] - 10https://gerrit.wikimedia.org/r/536979 (owner: 10Effie Mouzeli) [10:55:47] (03CR) 10Volans: [C: 03+2] config: enforce positional vs. 
keyword args [software/homer] - 10https://gerrit.wikimedia.org/r/533623 (owner: 10Volans) [10:57:10] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] bridge: enable EditTags for beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536982 (https://phabricator.wikimedia.org/T232582) (owner: 10Matthias Geisler) [10:58:03] (03PS1) 10Vgutierrez: nagios_common: Provide a HTTPS check for LE with OCSP [puppet] - 10https://gerrit.wikimedia.org/r/537066 (https://phabricator.wikimedia.org/T232988) [10:58:05] (03PS1) 10Vgutierrez: ncredir: Check OCSP stapling on HTTPS icinga check [puppet] - 10https://gerrit.wikimedia.org/r/537067 (https://phabricator.wikimedia.org/T232988) [10:59:21] (03Merged) 10jenkins-bot: config: enforce positional vs. keyword args [software/homer] - 10https://gerrit.wikimedia.org/r/533623 (owner: 10Volans) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European Mid-day SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190916T1100). [11:00:05] awight: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:16] (03CR) 10Vgutierrez: [C: 03+2] nagios_common: Provide a HTTPS check for LE with OCSP [puppet] - 10https://gerrit.wikimedia.org/r/537066 (https://phabricator.wikimedia.org/T232988) (owner: 10Vgutierrez) [11:00:25] (03PS2) 10Vgutierrez: nagios_common: Provide a HTTPS check for LE with OCSP [puppet] - 10https://gerrit.wikimedia.org/r/537066 (https://phabricator.wikimedia.org/T232988) [11:00:27] :D thank you jouncebot, I'll take it from here. 
[11:02:03] (03PS2) 10Awight: Enable source wiki editing for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536590 (https://phabricator.wikimedia.org/T228851) [11:02:10] (03CR) 10Awight: "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536590 (https://phabricator.wikimedia.org/T228851) (owner: 10Awight) [11:02:17] (03CR) 10Awight: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536590 (https://phabricator.wikimedia.org/T228851) (owner: 10Awight) [11:02:26] (03CR) 10Awight: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536974 (https://phabricator.wikimedia.org/T228851) (owner: 10Awight) [11:02:31] (03PS2) 10Awight: Enable File Importer source wiki edits on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536974 (https://phabricator.wikimedia.org/T228851) [11:03:26] awight: please hand over to me once you're done [11:04:43] (it's config patches btw) [11:04:57] Urbanecm: cool, will do. [11:05:22] thx [11:05:56] (03CR) 10Vgutierrez: [C: 03+2] ncredir: Check OCSP stapling on HTTPS icinga check [puppet] - 10https://gerrit.wikimedia.org/r/537067 (https://phabricator.wikimedia.org/T232988) (owner: 10Vgutierrez) [11:06:01] Undeployed patches in extensions/CheckUser. 
[11:06:05] (03PS2) 10Vgutierrez: ncredir: Check OCSP stapling on HTTPS icinga check [puppet] - 10https://gerrit.wikimedia.org/r/537067 (https://phabricator.wikimedia.org/T232988) [11:06:33] <_joe_> !log uploaded php-memcached_3.0.1+2.2.0-1~wmf3 to component/php72 for stretch T232613 [11:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:36] T232613: LBFactoryMulti.php: PHP Notice: Undefined index: - https://phabricator.wikimedia.org/T232613 [11:06:57] (03Merged) 10jenkins-bot: Enable source wiki editing for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536590 (https://phabricator.wikimedia.org/T228851) (owner: 10Awight) [11:07:13] (03CR) 10jenkins-bot: Enable source wiki editing for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536590 (https://phabricator.wikimedia.org/T228851) (owner: 10Awight) [11:09:43] awight: that's expected, it's security patch [11:09:47] (and it should be deployed) [11:09:54] Urbanecm: ty! [11:10:01] yw [11:10:09] !log awight@deploy1001 Synchronized php-1.34.0-wmf.22/extensions/FileImporter: SWAT: [[gerrit:536980|Add debug logging for remote API failures (T228851)]] (duration: 01m 05s) [11:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:12] T228851: Source wiki editing and deletion always fails - https://phabricator.wikimedia.org/T228851 [11:11:41] (03Abandoned) 1020after4: Configure phabricator database cluster settings [puppet] - 10https://gerrit.wikimedia.org/r/300494 (https://phabricator.wikimedia.org/T112776) (owner: 1020after4) [11:12:12] awight: I don't see https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/536974 in https://integration.wikimedia.org/zuul/ [11:13:58] Urbanecm: +1 looks like I need to kick it [11:14:15] (03CR) 10Awight: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536974 (https://phabricator.wikimedia.org/T228851) (owner: 10Awight) [11:14:21] re+2'ing should help :) 
[11:14:27] yup, I see it now [11:14:30] thanks! [11:14:37] !log awight@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:536590|Enable source wiki editing for testwiki (T228851)]] (duration: 01m 02s) [11:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:44] (03PS1) 10Vgutierrez: ncredir: Monitor TLS handshake + OCSP stapling for every configured cert [puppet] - 10https://gerrit.wikimedia.org/r/537075 (https://phabricator.wikimedia.org/T232988) [11:19:14] (03Merged) 10jenkins-bot: Enable File Importer source wiki edits on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536974 (https://phabricator.wikimedia.org/T228851) (owner: 10Awight) [11:20:08] (03CR) 10jenkins-bot: Enable File Importer source wiki edits on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536974 (https://phabricator.wikimedia.org/T228851) (owner: 10Awight) [11:21:22] !log awight@deploy1001 Synchronized wmf-config/CommonSettings-labs.php: SWAT: [[gerrit:536974|Enable File Importer source wiki edits on beta cluster (T228851)]] (duration: 01m 03s) [11:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:25] T228851: Source wiki editing and deletion always fails - https://phabricator.wikimedia.org/T228851 [11:22:04] (03CR) 10jenkins-bot: config: enforce positional vs. keyword args [software/homer] - 10https://gerrit.wikimedia.org/r/533623 (owner: 10Volans) [11:23:09] 08Warning [11:23:42] Urbanecm: I'm just waiting for one more patch, this backport: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/FileImporter/+/537053/ [11:24:01] awight: okay, can I deploy a throttle patch then? [11:24:02] It's taking a minute to merge to the core release branch. [11:24:04] XioNoX: FYI empty warning from librenms ^^^ [11:24:21] Urbanecm: Sure! I'm done with -config now; [11:24:28] awight: cool! 
[11:24:33] (03CR) 10Urbanecm: [C: 03+2] Lift IP cap on 2019-10-02 for Senior Citizen Write Wikipedia course - cs.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536740 (https://phabricator.wikimedia.org/T232831) (owner: 10Urbanecm) [11:24:43] (03PS2) 10Vgutierrez: ncredir: Monitor TLS handshake + OCSP stapling for every configured cert [puppet] - 10https://gerrit.wikimedia.org/r/537075 (https://phabricator.wikimedia.org/T232988) [11:25:29] (03Merged) 10jenkins-bot: Lift IP cap on 2019-10-02 for Senior Citizen Write Wikipedia course - cs.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536740 (https://phabricator.wikimedia.org/T232831) (owner: 10Urbanecm) [11:25:51] (03PS3) 10Vgutierrez: ncredir: Monitor TLS handshake + OCSP stapling for every configured cert [puppet] - 10https://gerrit.wikimedia.org/r/537075 (https://phabricator.wikimedia.org/T232988) [11:27:36] Urbanecm: fyi, I see my patch now. Currently pulling wmf22 to mwdebug1002 [11:27:42] (03CR) 10jenkins-bot: Lift IP cap on 2019-10-02 for Senior Citizen Write Wikipedia course - cs.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536740 (https://phabricator.wikimedia.org/T232831) (owner: 10Urbanecm) [11:27:48] awight: acknowledged [11:28:22] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: SWAT: 869b56f: Lift IP cap on 2019-10-02 for Senior Citizen Write Wikipedia course - cs.wikipedia (T232831) (duration: 01m 02s) [11:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:25] T232831: Lift IP cap on 2019-10-02 for Senior Citizen Write Wikipedia course - cs.wikipedia - https://phabricator.wikimedia.org/T232831 [11:28:56] (03CR) 10Vgutierrez: [C: 03+2] "PCC is happy: https://puppet-compiler.wmflabs.org/compiler1001/18320/" [puppet] - 10https://gerrit.wikimedia.org/r/537075 (https://phabricator.wikimedia.org/T232988) (owner: 10Vgutierrez) [11:28:58] deploying. 
[11:29:01] ack [11:29:06] ty [11:29:09] (03PS4) 10Vgutierrez: ncredir: Monitor TLS handshake + OCSP stapling for every configured cert [puppet] - 10https://gerrit.wikimedia.org/r/537075 (https://phabricator.wikimedia.org/T232988) [11:29:57] (03PS4) 10Urbanecm: Increase move rate-limit on Commons for all autopatrolled users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535944 (https://phabricator.wikimedia.org/T232657) (owner: 104nn1l2) [11:30:07] !log awight@deploy1001 Synchronized php-1.34.0-wmf.22/extensions/FileImporter: SWAT: [[gerrit:537053|Send a User-Agent with remote API requests (T232840)]] (duration: 01m 02s) [11:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:10] T232840: Include "FileImporter" in user-agent for internal API requests - https://phabricator.wikimedia.org/T232840 [11:30:38] Urbanecm: I'm done. Shall we declare SWAT complete, or do you want to wait for the throttling config to settle? [11:30:40] !log Upgrading php-memcached to 3.0.1+2.2.0-1~wmf3 [11:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:01] awight: I'm going to deploy few more things, I officially take the SWAT responsibility :-) [11:31:14] Urbanecm: Thanks, good luck! [11:31:19] thanks! 
[11:31:49] (03CR) 10Urbanecm: [C: 03+2] Increase move rate-limit on Commons for all autopatrolled users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535944 (https://phabricator.wikimedia.org/T232657) (owner: 104nn1l2) [11:32:09] (03PS2) 10Urbanecm: Remove expired throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536679 (owner: 10Zoranzoki21) [11:32:16] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536679 (owner: 10Zoranzoki21) [11:33:10] !log update peer address of AS28598 [11:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:18] Urbanecm: just for bookkeeping, you might want to add these patches to the deployment calendar, once the smoke clears of course. [11:33:46] willco :) [11:34:11] (03Merged) 10jenkins-bot: Increase move rate-limit on Commons for all autopatrolled users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535944 (https://phabricator.wikimedia.org/T232657) (owner: 104nn1l2) [11:34:15] (03Merged) 10jenkins-bot: Remove expired throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536679 (owner: 10Zoranzoki21) [11:36:03] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 313e3d9: Increase move rate-limit on Commons for all autopatrolled users (T232657) (duration: 01m 05s) [11:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:06] T232657: Increase the rate limit for moves on Commons for autopatrollers to 32/minute - https://phabricator.wikimedia.org/T232657 [11:36:31] (03CR) 10jenkins-bot: Increase move rate-limit on Commons for all autopatrolled users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535944 (https://phabricator.wikimedia.org/T232657) (owner: 104nn1l2) [11:38:04] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: SWAT: e37aed2: Remove expired throttle rules (duration: 01m 03s) [11:38:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:26] !log EU SWAT is done [11:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:36] <_joe_> !log rolling restart of php-fpm in codfw to pick up the new memcached extension T232613 [11:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:40] T232613: LBFactoryMulti.php: PHP Notice: Undefined index: - https://phabricator.wikimedia.org/T232613 [11:54:06] (03PS1) 10Marostegui: dbproxy1021: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/537085 (https://phabricator.wikimedia.org/T202367) [11:54:44] and I guess then I will do the train :) [11:56:25] <_joe_> !log rolling restart of php-fpm in eqiad to pick up the new memcached extension T232613 [11:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:34] T232613: LBFactoryMulti.php: PHP Notice: Undefined index: - https://phabricator.wikimedia.org/T232613 [11:56:54] (03CR) 10Marostegui: [C: 03+2] dbproxy1021: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/537085 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [11:59:43] (03CR) 10Elukey: Add sre.hadoop.reboot-workers.py (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/530096 (https://phabricator.wikimedia.org/T225297) (owner: 10Elukey) [12:05:24] (03PS1) 10Jbond: profile::mediawiki::webserver: ensure cron_splay is called correctly [puppet] - 10https://gerrit.wikimedia.org/r/537090 [12:07:05] (03PS1) 10Kosta Harlan: Enable EditorJourney for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537092 (https://phabricator.wikimedia.org/T232061) [12:07:11] (03CR) 10Elukey: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/537090 (owner: 10Jbond) [12:07:20] <_joe_> !log rolling restart ended on eqiad T232613 [12:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:23] T232613: LBFactoryMulti.php: PHP Notice: Undefined index: - https://phabricator.wikimedia.org/T232613 [12:08:33] (03PS5) 10Elukey: Add sre.hadoop.reboot-workers.py [cookbooks] - 10https://gerrit.wikimedia.org/r/530096 (https://phabricator.wikimedia.org/T225297) [12:10:51] (03PS2) 10Jbond: profile::mediawiki::webserver: ensure cron_splay is called correctly [puppet] - 10https://gerrit.wikimedia.org/r/537090 [12:15:02] (03CR) 10Elukey: [C: 03+2] Add sre.hadoop.reboot-workers.py [cookbooks] - 10https://gerrit.wikimedia.org/r/530096 (https://phabricator.wikimedia.org/T225297) (owner: 10Elukey) [12:16:11] (03PS2) 10Kosta Harlan: WIP: Enable GrowthExperiments for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534789 (https://phabricator.wikimedia.org/T232060) [12:17:55] !log elukey@cumin1001 START - Cookbook sre.hadoop.reboot-workers [12:17:56] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) [12:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:10] (03CR) 10Jbond: [C: 03+2] profile::mediawiki::webserver: ensure cron_splay is called correctly [puppet] - 10https://gerrit.wikimedia.org/r/537090 (owner: 10Jbond) [12:20:07] (03PS3) 10Ladsgroup: mediawiki: Start rebuildItermTerms for wikidatawiki [puppet] - 10https://gerrit.wikimedia.org/r/535526 (https://phabricator.wikimedia.org/T225056) [12:20:33] 10Operations, 10netbox: postgres::slave module type for includes parameter in inconsistent. - https://phabricator.wikimedia.org/T232358 (10Aklapper) No reply to previous comment, hence adding some project tags. Feel free to correct. 
[12:21:03] (03PS1) 10Elukey: sre.hadoop.reboot-workers.py: add reason to puppet enable/disable [cookbooks] - 10https://gerrit.wikimedia.org/r/537097 (https://phabricator.wikimedia.org/T225297) [12:21:26] effie: re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/535591 does the dashboard also look good to you? [12:22:15] yeah [12:22:31] (03CR) 10Marostegui: [C: 03+2] mediawiki: Start rebuildItermTerms for wikidatawiki [puppet] - 10https://gerrit.wikimedia.org/r/535526 (https://phabricator.wikimedia.org/T225056) (owner: 10Ladsgroup) [12:22:34] I can merge the patch tomorrow if you want, I just need to leave in a bit [12:22:50] (03CR) 10Elukey: [C: 03+2] sre.hadoop.reboot-workers.py: add reason to puppet enable/disable [cookbooks] - 10https://gerrit.wikimedia.org/r/537097 (https://phabricator.wikimedia.org/T225297) (owner: 10Elukey) [12:23:09] 08Warning [12:25:09] effie: nah no worries, I'll take care of merging, thanks for the review! [12:25:25] tx for the graphs [12:28:17] (03PS4) 10Filippo Giunchedi: thumbor: stop relaying to statsd/statsite [puppet] - 10https://gerrit.wikimedia.org/r/535591 (https://phabricator.wikimedia.org/T205870) [12:31:09] (03CR) 10Filippo Giunchedi: [C: 03+2] thumbor: stop relaying to statsd/statsite [puppet] - 10https://gerrit.wikimedia.org/r/535591 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [12:31:16] (03PS5) 10Filippo Giunchedi: thumbor: stop relaying to statsd/statsite [puppet] - 10https://gerrit.wikimedia.org/r/535591 (https://phabricator.wikimedia.org/T205870) [12:31:42] (03PS1) 10Elukey: sre.hadoop.reboot-workers.py: fix parameter in reboot_hadoop_workers [cookbooks] - 10https://gerrit.wikimedia.org/r/537100 (https://phabricator.wikimedia.org/T225297) [12:38:03] (03CR) 10Elukey: [C: 03+2] sre.hadoop.reboot-workers.py: fix parameter in reboot_hadoop_workers [cookbooks] - 10https://gerrit.wikimedia.org/r/537100 (https://phabricator.wikimedia.org/T225297) (owner: 10Elukey) [12:38:30] RECOVERY - Check 
systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:40:19] !log elukey@cumin1001 START - Cookbook sre.hadoop.reboot-workers [12:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:31] this is the testing cluster --^ [12:40:41] let's see if the cookbook works :) [12:44:03] !log stop thumbor traffic to statsd/graphite, use Prometheus only and replace Thumbor dashboard - T205870 [12:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:06] T205870: Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 [12:45:45] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (10fgiunchedi) [12:56:48] 10Operations, 10DBA: Switchover s3 primary database master db1075 -> db1078 - 24th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230783 (10Marostegui) Reserved window on the deployments calendar: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1837750&oldid=1837737 [12:57:04] 10Operations, 10DBA: Switchover s4 (commonswiki) primary database master db1081 -> db1138 - 26th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230784 (10Marostegui) Reserved window on the deployments calendar: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1837750&oldid=... 
[12:57:08] I am going to run the train now that the blocker has been fixed [12:57:51] (03PS3) 10Herron: prometheus-ipsec-exporter: subscribe service to /etc/ipsec.conf [puppet] - 10https://gerrit.wikimedia.org/r/536671 (https://phabricator.wikimedia.org/T230236) [12:58:52] !log decommissioning Cassandra, restbase2010-a -- T224553 [12:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:55] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [12:59:51] !log installing qemu security updates on stretch [12:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:11] (03PS1) 10Elukey: sre.hadoop.reboot-workers.py: reboot hosts in a batch in parallel [cookbooks] - 10https://gerrit.wikimedia.org/r/537105 (https://phabricator.wikimedia.org/T225297) [13:00:57] (03PS1) 10Hashar: all wikis to 1.34.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537106 [13:00:59] (03CR) 10Hashar: [C: 03+2] all wikis to 1.34.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537106 (owner: 10Hashar) [13:02:22] (03Merged) 10jenkins-bot: all wikis to 1.34.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537106 (owner: 10Hashar) [13:02:34] (03CR) 10Eevans: [C: 03+1] restbase2010: Add JBOD hiera config for upcoming reimage [puppet] - 10https://gerrit.wikimedia.org/r/536565 (owner: 10Muehlenhoff) [13:02:59] (03CR) 10Herron: [C: 03+2] prometheus-ipsec-exporter: subscribe service to /etc/ipsec.conf [puppet] - 10https://gerrit.wikimedia.org/r/536671 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [13:03:28] (03CR) 10Eevans: [C: 03+1] restbase2011: Add JBOD hiera config for upcoming reimage [puppet] - 10https://gerrit.wikimedia.org/r/536566 (owner: 10Muehlenhoff) [13:04:16] (03CR) 10Eevans: [C: 03+1] restbase2012: Add JBOD hiera config for upcoming reimage [puppet] - 10https://gerrit.wikimedia.org/r/536567 (owner: 10Muehlenhoff) 
[13:04:41] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.22 [13:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:48] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [13:07:06] PROBLEM - High average GET latency for mw requests on appserver in codfw on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method= [13:08:18] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [13:08:34] RECOVERY - High average GET latency for mw requests on appserver in codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [13:09:43] (03PS5) 10Matthias Geisler: bridge: enable EditTags for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536982 (https://phabricator.wikimedia.org/T232582) [13:09:43] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:45] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:36] !log rebooting failoid2001 for kernel update/pick up new qemu [13:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:21] (03PS2) 10Alexandros Kosiaris: Enable coredns based cluster DNS in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/536617 [13:18:35] (03CR) 10Alexandros Kosiaris: [C: 03+2] Enable coredns based cluster DNS in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/536617 (owner: 10Alexandros Kosiaris) [13:19:19] (03PS2) 10Alexandros Kosiaris: calico: Enabled felix prometheus endpoint [puppet] - 10https://gerrit.wikimedia.org/r/534126 [13:19:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] calico: Enabled felix prometheus endpoint [puppet] - 10https://gerrit.wikimedia.org/r/534126 (owner: 10Alexandros Kosiaris) [13:19:37] 10Operations, 10Analytics, 10Core Platform Team Legacy (Watching / External), 10Patch-For-Review, and 2 others: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10Ottomata) @herron FYI last week we decommissioned eventlogging-s... 
[13:20:44] 10Operations, 10Analytics, 10Core Platform Team Legacy (Watching / External), 10Patch-For-Review, and 2 others: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10herron) @Ottomata excellent thx for the heads up! [13:25:38] (03PS2) 10Herron: dns: add mail records for pr.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/536318 (https://phabricator.wikimedia.org/T231387) [13:27:22] (03CR) 10Herron: [C: 03+2] dns: add mail records for pr.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/536318 (https://phabricator.wikimedia.org/T231387) (owner: 10Herron) [13:28:45] !log hashar@deploy1001 Synchronized php-1.34.0-wmf.22/extensions/FlaggedRevs/frontend/specialpages/reports/ValidationStatistics.php: Add missing "use" to getTopReviewers() - T232618 (duration: 00m 55s) [13:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:49] T232618: FlaggedRevs: PHP Notice: Undefined variable: fname - https://phabricator.wikimedia.org/T232618 [13:29:03] (03PS1) 10Andrew Bogott: Move labpuppetmaster1001 and 1002 to role::spare [puppet] - 10https://gerrit.wikimedia.org/r/537111 (https://phabricator.wikimedia.org/T171188) [13:30:14] (03CR) 10Ottomata: [C: 03+1] eventlogging: move comment about python-kafka version [puppet] - 10https://gerrit.wikimedia.org/r/536561 (https://phabricator.wikimedia.org/T222941) (owner: 10Elukey) [13:30:50] (03CR) 10Ottomata: "Don't want buster; These will talk to Hadoop and Hive, and may also be Hadoop nodes in the future." [puppet] - 10https://gerrit.wikimedia.org/r/536804 (https://phabricator.wikimedia.org/T225128) (owner: 10Elukey) [13:30:55] (03PS2) 10Herron: exim: add pr.wikimedia.org to wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/536313 (https://phabricator.wikimedia.org/T231387) [13:31:46] ^ /me wonders what happens when Puerto Rico wants a chapter wiki... 
[13:31:59] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) Thanks for doing this Luca! BTW we don't want buster...I mean I guess we could try it but I'd exp... [13:32:04] Reedy: lol [13:32:08] (03CR) 10Andrew Bogott: [C: 03+2] Move labpuppetmaster1001 and 1002 to role::spare [puppet] - 10https://gerrit.wikimedia.org/r/537111 (https://phabricator.wikimedia.org/T171188) (owner: 10Andrew Bogott) [13:32:16] (03CR) 10Muehlenhoff: "There's also a ferm service in profile::mariadb::ferm_wmcs which can be removed alongside." [puppet] - 10https://gerrit.wikimedia.org/r/537111 (https://phabricator.wikimedia.org/T171188) (owner: 10Andrew Bogott) [13:32:21] (03CR) 10Elukey: [C: 03+2] "> Don't want buster; These will talk to Hadoop and Hive, and may also" [puppet] - 10https://gerrit.wikimedia.org/r/536804 (https://phabricator.wikimedia.org/T225128) (owner: 10Elukey) [13:32:21] Reedy: I guess it would be a PR issue [13:32:22] * vgutierrez hides [13:32:25] rofl [13:44:30] (03CR) 10Ottomata: "Ah right Java 8 package...Was still assuming we couldn't use this. OK then great!" [puppet] - 10https://gerrit.wikimedia.org/r/536804 (https://phabricator.wikimedia.org/T225128) (owner: 10Elukey) [13:44:54] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) Ah, Luca just noted that we will have a Java 8 package for buster. I'm ok with Buster then! 
[13:47:02] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [13:47:17] uh [13:48:40] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) [13:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:21] (03PS8) 10Vgutierrez: ATS: Provide websocket support [puppet] - 10https://gerrit.wikimedia.org/r/531885 (https://phabricator.wikimedia.org/T221594) [13:50:49] (03PS2) 10Vgutierrez: ATS: Add known websocket endpoints to the TLS instance mapping rules [puppet] - 10https://gerrit.wikimedia.org/r/533379 (https://phabricator.wikimedia.org/T231627) [13:53:30] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received: /v1/page/{language}/{title}{/revision} (Fetch enwiki protected page) timed out before a response was received: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translatio [13:53:30] est Suggest a source title to use for translation returned the unexpected status 504 (expecting: 200): /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Translate enwiki protected page) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [14:01:19] (03CR) 10Mholloway: [C: 03+2] Add wikifeeds helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/535533 (https://phabricator.wikimedia.org/T170455) (owner: 10Alexandros Kosiaris) [14:01:24] (03PS5) 10Mholloway: Add wikifeeds helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/535533 (https://phabricator.wikimedia.org/T170455) (owner: 10Alexandros Kosiaris) [14:02:25] (03CR) 10Mholloway: [V: 03+2 C: 03+2] Add wikifeeds helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/535533 (https://phabricator.wikimedia.org/T170455) (owner: 10Alexandros Kosiaris) [14:05:08] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' . [14:05:08] * akosiaris looking into the citoid, cxserver issues [14:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:15] (03CR) 10Subramanya Sastry: [C: 03+1] Parsoid: Change the beta port to 8001 to avoid conflict with PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/536975 (https://phabricator.wikimedia.org/T231569) (owner: 10Mobrovac) [14:09:36] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [14:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:03] mdholloway: don't do eqiad btw, I am debugging a DNS issue. 
[14:10:20] akosiaris: ah, ok, thanks [14:10:34] mdholloway: is everything looking ok on staging/codfw btw? [14:10:39] i had actually tried, but hit a timeout doing `helmfile diff` [14:11:26] akosiaris: i am actually wondering where to log in now to test this [14:12:23] the deployments seemed to exit successfully for stating and codfw [14:12:31] lemme figure out what is going on with DNS in eqiad and I 'll help [14:12:31] *staging [14:12:36] k, thanks [14:22:43] (03PS4) 10Andrew Bogott: mediawiki: Disable loadExitNodes.php maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/533571 (https://phabricator.wikimedia.org/T229736) (owner: 10Krinkle) [14:23:09] 08Warning [14:24:57] (03CR) 10Andrew Bogott: [C: 03+2] mediawiki: Disable loadExitNodes.php maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/533571 (https://phabricator.wikimedia.org/T229736) (owner: 10Krinkle) [14:25:01] !log akosiaris@ helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[14:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:50] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [14:30:04] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [14:32:25] (03PS2) 10Alexandros Kosiaris: coredns: Keep traffic node local if possible [deployment-charts] - 10https://gerrit.wikimedia.org/r/537061 [14:32:26] (03PS1) 10Alexandros Kosiaris: calico: Fix eqiad configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/537121 [14:32:48] (03PS2) 10Alexandros Kosiaris: calico: Fix eqiad configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/537121 [14:32:50] (03PS3) 10Alexandros Kosiaris: coredns: Keep traffic node local if possible [deployment-charts] - 10https://gerrit.wikimedia.org/r/537061 [14:34:31] mdholloway: ok, crisis averted [14:34:38] so [14:34:40] akosiaris@deploy1001:~$ curl kubernetes2001.codfw.wmnet:10715/_info [14:34:40] \o/ [14:34:44] (03PS3) 10Herron: exim: add pr.wikimedia.org to wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/536313 (https://phabricator.wikimedia.org/T231387) [14:35:09] now that it is deployed we can fix that last part which is route an LVS IP to it [14:35:31] so that curl will become curl wikifeeds.svc.codfw.wmnet and respectively for eqiad [14:35:38] plus the port ofc [14:36:21] mdholloway: you should also be able to deploy eqiad now [14:36:37] (03PS3) 10Alexandros Kosiaris: Introduce wikifeeds LVS IPs [dns] - 10https://gerrit.wikimedia.org/r/535522 (https://phabricator.wikimedia.org/T170455) [14:36:40] excellent. 
retrying now [14:36:55] (03CR) 10Alexandros Kosiaris: [C: 03+2] Introduce wikifeeds LVS IPs [dns] - 10https://gerrit.wikimedia.org/r/535522 (https://phabricator.wikimedia.org/T170455) (owner: 10Alexandros Kosiaris) [14:37:11] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [14:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:18] done! [14:38:49] cool. I 'll merge the LVS and DNS patches then now. service should be fully up in ~30m [14:39:21] (03PS1) 10Ottomata: Release debian version 0.8.0+uap0.6.9~1-1 [debs/python-ua-parser] (debian) - 10https://gerrit.wikimedia.org/r/537123 [14:39:31] thanks for all of your help, akosiaris! [14:40:30] mdholloway: thanks as well [14:41:20] (03CR) 10Thcipriani: [C: 03+1] mariadb::ferm_misc: allow connections from gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/535966 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [14:42:06] RECOVERY - Check systemd state on cloudweb2001-dev is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:42:35] (03CR) 10Thcipriani: gerrit: make scap user configurable in Hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536704 (owner: 10Dzahn) [14:45:13] (03PS1) 10Andrew Bogott: cloudweb2001-dev: remove wikitech profiles [puppet] - 10https://gerrit.wikimedia.org/r/537127 (https://phabricator.wikimedia.org/T229441) [14:45:41] (03CR) 10Paladox: gerrit: make scap user configurable in Hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536704 (owner: 10Dzahn) [14:45:47] (03PS2) 10Andrew Bogott: cloudweb2001-dev: remove wikitech profiles [puppet] - 10https://gerrit.wikimedia.org/r/537127 (https://phabricator.wikimedia.org/T229441) [14:46:51] (03PS4) 10Alexandros Kosiaris: coredns: Keep traffic node local if possible [deployment-charts] - 10https://gerrit.wikimedia.org/r/537061 [14:46:53] (03PS1) 10Alexandros 
Kosiaris: wikifeeds: Define nodePorts in all envs [deployment-charts] - 10https://gerrit.wikimedia.org/r/537128 (https://phabricator.wikimedia.org/T170455) [14:48:05] (03CR) 10Andrew Bogott: [C: 03+2] cloudweb2001-dev: remove wikitech profiles [puppet] - 10https://gerrit.wikimedia.org/r/537127 (https://phabricator.wikimedia.org/T229441) (owner: 10Andrew Bogott) [14:48:32] (03PS3) 10Kosta Harlan: WIP: Enable GrowthExperiments for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534789 (https://phabricator.wikimedia.org/T232060) [14:51:46] (03PS1) 10Muehlenhoff: Remove ferm rules for labpuppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/537129 [14:52:58] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, but remove the trailing whitespace." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536979 (owner: 10Effie Mouzeli) [14:53:36] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/536966 (owner: 10Effie Mouzeli) [14:54:26] !log decommissioning Cassandra, restbase2010-b -- T224553 [14:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:30] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [14:54:51] (03CR) 10Ejegg: [C: 03+1] Disable FundraiserLandingPage extension on test.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536796 (https://phabricator.wikimedia.org/T203020) (owner: 10AndyRussG) [14:55:48] (03PS4) 10Jcrespo: monitoring: Enable persistent journal storage for logs on test db hosts [puppet] - 10https://gerrit.wikimedia.org/r/535818 [14:55:50] (03PS1) 10Jcrespo: backup: Reinstall backup1001 and backup2001 as buster [puppet] - 10https://gerrit.wikimedia.org/r/537130 (https://phabricator.wikimedia.org/T229209) [14:57:03] (03PS2) 10Jcrespo: backup: Reinstall backup1001 and backup2001 as buster [puppet] - 10https://gerrit.wikimedia.org/r/537130 (https://phabricator.wikimedia.org/T229209) [14:57:14] 
(03PS1) 10Muehlenhoff: Remove ferm rules for labpuppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/537132 [14:58:21] (03PS1) 10Jforrester: mediawiki-cache-warmup: Update CI tools for the past two years [puppet] - 10https://gerrit.wikimedia.org/r/537133 [14:58:53] (03CR) 10Andrew Bogott: [C: 03+1] Remove ferm rules for labpuppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/537132 (owner: 10Muehlenhoff) [14:59:05] (03CR) 10Jcrespo: [C: 03+2] "Implied +1 on IRC." [puppet] - 10https://gerrit.wikimedia.org/r/537130 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [14:59:12] (03CR) 10Mholloway: [C: 03+1] wikifeeds: Define nodePorts in all envs [deployment-charts] - 10https://gerrit.wikimedia.org/r/537128 (https://phabricator.wikimedia.org/T170455) (owner: 10Alexandros Kosiaris) [15:01:52] mdholloway: btw, mind filling in information about wikifeeds in https://wikitech.wikimedia.org/wiki/Wikifeeds ? [15:02:12] akosiaris: sure [15:02:33] thanks for starting the page [15:04:44] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] wikifeeds: Define nodePorts in all envs [deployment-charts] - 10https://gerrit.wikimedia.org/r/537128 (https://phabricator.wikimedia.org/T170455) (owner: 10Alexandros Kosiaris) [15:04:54] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] calico: Fix eqiad configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/537121 (owner: 10Alexandros Kosiaris) [15:04:56] (03PS1) 10Alexandros Kosiaris: Add LVS for wikifeeds [puppet] - 10https://gerrit.wikimedia.org/r/537134 (https://phabricator.wikimedia.org/T170455) [15:06:03] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Vgutierrez) Please note that the docker-registry certificate is missing the public hostname: `docker-registry.wikimedia.org` [15:06:18] !log @ helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . 
[15:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:12] (03CR) 10jerkins-bot: [V: 04-1] Add LVS for wikifeeds [puppet] - 10https://gerrit.wikimedia.org/r/537134 (https://phabricator.wikimedia.org/T170455) (owner: 10Alexandros Kosiaris) [15:07:18] 10Operations, 10ops-eqiad: backup1001 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T232882 (10jcrespo) [15:07:24] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10jcrespo) [15:07:27] (03PS1) 10Alexandros Kosiaris: wikifeeds: Enabling paging [puppet] - 10https://gerrit.wikimedia.org/r/537135 (https://phabricator.wikimedia.org/T170455) [15:07:40] !log @ helmfile [CODFW] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [15:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:48] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts: ` ['backup2001.codfw.wmnet'] ` The log can be fo... [15:08:16] (03CR) 10jerkins-bot: [V: 04-1] wikifeeds: Enabling paging [puppet] - 10https://gerrit.wikimedia.org/r/537135 (https://phabricator.wikimedia.org/T170455) (owner: 10Alexandros Kosiaris) [15:09:47] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (CI & Testing services), 10Release-Engineering-Team-TODO (201909), 10Zuul: Upload zuul_2.5.1-wmf10 to apt.wikimedia.org - https://phabricator.wikimedia.org/T233025 (10hashar) [15:10:01] !log @ helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . 
[15:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:31] (03CR) 10Lucas Werkmeister (WMDE): bridge: enable EditTags for beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536982 (https://phabricator.wikimedia.org/T232582) (owner: 10Matthias Geisler) [15:13:12] How can I regenerate 'docker' image after CI build failure at, https://gerrit.wikimedia.org/r/#/c/mediawiki/services/cxserver/+/517057/ [15:13:28] akosiaris: ^ asking here for broader scope :) [15:14:49] 10Operations, 10DBA, 10serviceops, 10Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts: ` ['backup2001.codfw.wmnet'] ` The log can be found in `/var/log/wmf-a... [15:16:03] 10Operations, 10DBA, 10serviceops, 10Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10jcrespo) Got stuck at kernel boot, could it be the same issue as T216240 ? [15:17:08] hashar, thcipriani ^: See above for the CI question by kart. 
I am wondering too btw [15:17:48] (03PS1) 10Mathew.onipe: query_service: modify module to enable reusbility [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) [15:18:55] (03CR) 10jerkins-bot: [V: 04-1] query_service: modify module to enable reusbility [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [15:19:00] (03PS6) 10Matthias Geisler: bridge: enable EditTags for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536982 (https://phabricator.wikimedia.org/T232582) [15:19:12] * thcipriani looks [15:21:44] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] bridge: enable EditTags for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536982 (https://phabricator.wikimedia.org/T232582) (owner: 10Matthias Geisler) [15:23:09] 08Warning [15:23:11] akosiaris: kart_ currently there's not a way to rebuild failed post-merge job unless you have access to do so through jenkins. This is something we'll need to make a task about, since adding this feature will take some consideration. For now, to generate a docker image, you can push a tag for that repo, submit a new patch, or I can restart that build manually -- should I do that? [15:23:53] I noticed the blubberoid 404s this morning is that explained by...something? or was it a glitch in the matrix? [15:23:57] thcipriani: please restart build manually as of now. [15:24:01] * thcipriani does [15:24:18] thcipriani: thanks. and you are right ofc that we should make a task about treating this [15:24:19] thcipriani: seems working fine now. Still not figured out what happened. [15:24:21] not sure how ofc [15:27:36] akosiaris: offhand, I think adding a comment to the patchset might be the easiest first thing to implement, like a "recheck" (i.e., "rebuild""); however, more optimally, likely a gerrit plugin to surface this in the UI based on permissions, etc. 
would be ideal. Of course that will take longer to implement. [15:28:16] makes sense to me [15:28:21] (03PS2) 10Alexandros Kosiaris: Add LVS for wikifeeds [puppet] - 10https://gerrit.wikimedia.org/r/537134 (https://phabricator.wikimedia.org/T170455) [15:28:23] (03PS2) 10Alexandros Kosiaris: wikifeeds: Enabling paging [puppet] - 10https://gerrit.wikimedia.org/r/537135 (https://phabricator.wikimedia.org/T170455) [15:28:40] akosiaris: kart_ is the rebuild in progress, FYI: https://integration.wikimedia.org/ci/job/service-pipeline-test-and-publish/611/console [15:28:49] Thanks! [15:28:49] a "rebuild" would stick to the current status quo and would be fast (I assume?) which are 2 plus already [15:28:51] 10Operations, 10DBA, 10serviceops, 10Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10Marostegui) >>! In T229209#5496239, @jcrespo wrote: > Got stuck at kernel boot, could it be the same issue as T216240 ? Maybe, even if it is not, it wouldn't hurt to get t... [15:29:06] rebuild would be nice given +2 is done. [15:29:19] hi mutante [15:29:19] (ie addition of rebuild command!) [15:31:31] 10Operations, 10DBA, 10serviceops, 10Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10MoritzMuehlenhoff) >>! In T229209#5496266, @Marostegui wrote: >>>! In T229209#5496239, @jcrespo wrote: >> Got stuck at kernel boot, could it be the same issue as T216240 ?... [15:33:00] PROBLEM - Memory correctable errors -EDAC- on elastic1018 is CRITICAL: 11 ge 4 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=elastic1018&var-datasource=eqiad+prometheus/ops [15:36:12] jouncebot: now [15:36:12] No deployments scheduled for the next 1 hour(s) and 23 minute(s) [15:36:25] OK, I'm going to put the write-JSON config patch live. 
[15:37:48] (03PS8) 10Jforrester: Variant configuration: Write JSON config for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535257 [15:38:10] (03CR) 10Jforrester: [C: 03+2] Variant configuration: Write JSON config for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535257 (owner: 10Jforrester) [15:39:05] 10Operations, 10DBA, 10serviceops, 10Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10jcrespo) In other order of things, the RAID controller I think now has a random device id, so the boot installer failed. I am not sure we will be able to install it without... [15:40:03] (03Merged) 10jenkins-bot: Variant configuration: Write JSON config for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535257 (owner: 10Jforrester) [15:41:04] 10Operations, 10Cloud-Services, 10SRE-Access-Requests, 10Developer-Advocacy (Jul-Sep 2019): Membership in "researchers" group for Srishti Sethi - https://phabricator.wikimedia.org/T232664 (10bd808) >>! In T232664#5493638, @Nuria wrote: > Hardly any data can be found in mysql, we are deprecating that storag... [15:41:21] !log jforrester@deploy1001 Started scap: wmf-config/CommonSettings.php Variant configuration: Write JSON config for all wikis T223602 [15:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:25] T223602: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 [15:41:29] !log jforrester@deploy1001 sync aborted: wmf-config/CommonSettings.php Variant configuration: Write JSON config for all wikis T223602 (duration: 00m 08s) [15:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:37] Oops. [15:41:56] * James_F tries again. 
[15:42:48] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Variant configuration: Write JSON config for all wikis T223602 (duration: 00m 56s) [15:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:58] 10Operations, 10DBA, 10serviceops, 10Goal: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10jcrespo) Sadly, I cannot setup the RAID remotelly, because the server no longer boots and mgmt interface says: ` Unified Server Configurator does not support console redir... [15:49:03] (03CR) 10Mholloway: [C: 03+1] Add discovery RRs for wikifeeds [dns] - 10https://gerrit.wikimedia.org/r/535523 (https://phabricator.wikimedia.org/T170455) (owner: 10Alexandros Kosiaris) [15:51:24] 10Operations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183 (10jcrespo) Buster failed to install on my md install for 2552d12fe15ec1 with "grub install sda sdb failed, cannot install on sda". I cannot be sure it is that change because buster, and because there is weird hybri... [15:53:40] (03PS4) 10Halfak: Adds git::lfs class and requirement to ores web/worker/misc [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) [15:55:43] (03CR) 10jerkins-bot: [V: 04-1] Adds git::lfs class and requirement to ores web/worker/misc [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [15:56:47] 10Puppet, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Require git-lfs in ORES hosts - https://phabricator.wikimedia.org/T232494 (10Halfak) [15:57:07] 10Puppet, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Require git-lfs in ORES hosts - https://phabricator.wikimedia.org/T232494 (10Halfak) I just pushed another patchset that adds a new role for "ores::misc" (e.g. ores-misc-01 where we build models) that includes git::lfs. I also added... 
[15:58:47] (03PS6) 10Effie Mouzeli: profile::mediawiki::common Enable systemd-coredump on all mw servers [puppet] - 10https://gerrit.wikimedia.org/r/536979 [16:03:09] 08̶W̶a̶r̶n̶i̶n̶g [16:05:38] (03PS4) 10Jforrester: Remove dependency on wgConf from wmf-config/InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536345 (https://phabricator.wikimedia.org/T223602) (owner: 10Krinkle) [16:08:21] (03CR) 10Jforrester: [C: 03+2] Remove dependency on wgConf from wmf-config/InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536345 (https://phabricator.wikimedia.org/T223602) (owner: 10Krinkle) [16:09:16] (03Merged) 10jenkins-bot: Remove dependency on wgConf from wmf-config/InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536345 (https://phabricator.wikimedia.org/T223602) (owner: 10Krinkle) [16:12:17] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T223602 Inject config object into InitialiseSettings-labs rather than use wgConf global (duration: 00m 55s) [16:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:20] T223602: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 [16:16:22] (03PS12) 10Jforrester: Migrate from InitialiseSettings to VariantSettings, a static array for testability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 [16:21:08] (03PS1) 10KartikMistry: Update cxserver to 2019-09-16-152511-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/537145 (https://phabricator.wikimedia.org/T224721) [16:21:42] akosiaris: i'm still not able to access the wikifeeds service at wikifeeds.svc.eqiad.wmnet:8889; i'm guessing that's because this hasn't yet landed? 
https://gerrit.wikimedia.org/r/#/c/operations/dns/+/535523/ [16:23:22] mdholloway: It's actually https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/537134/ [16:23:35] but yeah, did not manage to merge it yet [16:25:46] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [16:26:15] Hi, does anyone have any internet explorers stats on users accessing gerrit.w.org please? [16:28:41] 10Operations, 10ops-eqiad: backup1001 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T232882 (10wiki_willy) Thanks @Jclark-ctr , can you have the drive replaced this week? Also, you might need to coordinate with @jcrespo via IRC to get a couple other things completed to get backup1001 up and... [16:32:28] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:33:07] (03CR) 10Krinkle: "-global $wmfUdp2logDest, $wmfDatacenter, $wmfRealm, $wmfConfigDir, $wgConf," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 (owner: 10Jforrester) [16:34:04] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:34:15] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mass AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (10Effeietsanders) These bounces are happening occasionally now. 
[16:36:21] (03CR) 10Krinkle: Migrate from InitialiseSettings to VariantSettings, a static array for testability (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 (owner: 10Jforrester) [16:37:21] (03PS7) 10Matthias Geisler: bridge: enable EditTags for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536982 (https://phabricator.wikimedia.org/T232582) [16:39:47] (03PS13) 10Jforrester: Migrate from InitialiseSettings to VariantSettings, a static array for testability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 [16:40:18] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:40:41] (03CR) 10jerkins-bot: [V: 04-1] Migrate from InitialiseSettings to VariantSettings, a static array for testability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 (owner: 10Jforrester) [16:41:05] (03CR) 10Jforrester: Migrate from InitialiseSettings to VariantSettings, a static array for testability (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 (owner: 10Jforrester) [16:41:51] (03CR) 10Jforrester: "> Patch Set 12:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 (owner: 10Jforrester) [16:41:56] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:42:50] (03PS14) 10Jforrester: Migrate from InitialiseSettings to VariantSettings, a static array for testability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 [16:46:52] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} 
(article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:48:47] (03CR) 10Jforrester: [C: 03+2] Migrate from InitialiseSettings to VariantSettings, a static array for testability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 (owner: 10Jforrester) [16:49:41] (03Merged) 10jenkins-bot: Migrate from InitialiseSettings to VariantSettings, a static array for testability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 (owner: 10Jforrester) [16:49:59] (03CR) 10Krinkle: Migrate from InitialiseSettings to VariantSettings, a static array for testability (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 (owner: 10Jforrester) [16:50:09] (03PS3) 10Jforrester: tests: Migrate tests from InitialiseSettingsTest to StaticSettingsTest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535980 [16:51:26] !log ban elastic1027 from production-search-eqiad-chi [16:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:34] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:53:18] (03PS1) 10Jforrester: tests: Actually assign $variantSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537150 [16:53:23] 10Operations, 10Cloud-Services, 10SRE-Access-Requests, 10Developer-Advocacy (Jul-Sep 2019): Membership in "researchers" group for Srishti Sethi - https://phabricator.wikimedia.org/T232664 (10Nuria) I see, this data also exists in hadoop not only on the analytics replicas. .The replicas are a cluster where... 
[16:55:01] !log jforrester@deploy1001 Synchronized wmf-config/VariantSettings.php: Establish VariantSettings.php everywhere (duration: 00m 56s) [16:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:23] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add LVS for wikifeeds [puppet] - 10https://gerrit.wikimedia.org/r/537134 (https://phabricator.wikimedia.org/T170455) (owner: 10Alexandros Kosiaris) [16:58:33] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Make InitialiseSettings use values from VariantSettings (duration: 00m 54s) [16:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:45] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Make CommonSettings use mtime from VariantSettings (duration: 00m 55s) [16:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:55] (03CR) 10Jcrespo: [C: 03+1] "Ok with this, but those are rules applied on production dbs, Andrew should be sure that if new hosts should be added here (even on separat" [puppet] - 10https://gerrit.wikimedia.org/r/537132 (owner: 10Muehlenhoff) [17:00:04] gehel and onimisionipe: #bothumor My software never has bugs. It just develops random features. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190916T1700). 
[17:00:46] 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (10Bstorm) [17:01:01] jouncebot: no deploy today [17:01:36] (03PS5) 10Vgutierrez: Release 8.0.5-1wm7 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/536302 (https://phabricator.wikimedia.org/T232298) [17:01:38] 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (10wiki_willy) a:03Cmjohnson @Cmjohnson - good to go for tomorrow's PDU upgrade, but please confirm with @Marostegui before you start that DBs have been depooled. Thanks, Willy [17:03:23] 10Operations, 10ops-eqiad, 10DC-Ops: a8-eqiad pdu refresh (Date TBA) - https://phabricator.wikimedia.org/T227133 (10wiki_willy) [17:04:08] 10Operations, 10ops-eqiad, 10DC-Ops: a8-eqiad pdu refresh (Date TBA) - https://phabricator.wikimedia.org/T227133 (10wiki_willy) a:03Cmjohnson Originally scheduled for Thursday 9/19, but will reschedule for a later date, since this is a network rack. [17:04:10] PROBLEM - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.47:8889]) https://wikitech.wikimedia.org/wiki/PyBal [17:04:28] 10Operations, 10ops-eqiad: backup1001 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T232882 (10jcrespo) Just to be clear, there may be new stuff coming (RAID setup), but it is not set in stone yet for either the eqiad or codfw DCs. I will create one or two tickets when I have the specific configura... 
[17:05:55] 10Operations, 10ops-eqiad, 10DC-Ops: a8-eqiad pdu refresh (Date TBA) - https://phabricator.wikimedia.org/T227133 (10Bstorm) [17:06:14] (03CR) 10Krinkle: tests: Actually assign $variantSettings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537150 (owner: 10Jforrester) [17:06:17] James_F: https://3v4l.org/bfrCY [17:06:50] PROBLEM - PyBal connections to etcd on lvs2006 is CRITICAL: CRITICAL: 42 connections established with conf2001.codfw.wmnet:2379 (min=43) https://wikitech.wikimedia.org/wiki/PyBal [17:06:51] This "feature" exists to be able to create/change a global within a function. [17:07:01] So in order for an assignment to work, it first needs to exist [17:07:02] :( [17:07:13] but it means we can't know if it's using undefined globals anymore [17:07:21] Krinkle: Won't the require_once trigger the globals? [17:07:36] Yeah, it was tolerant previously as well [17:07:40] PROBLEM - PyBal connections to etcd on lvs2003 is CRITICAL: CRITICAL: 42 connections established with conf2001.codfw.wmnet:2379 (min=43) https://wikitech.wikimedia.org/wiki/PyBal [17:07:45] but it won't create the ones we need [17:07:52] there is no require_once in VariantSettings to assign the ones it needs [17:08:00] they're not being mocked anywhere [17:08:13] Presumably the NS_ constants are working by accident because cirrusTest runs first. [17:08:20] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.47:8889]) https://wikitech.wikimedia.org/wiki/PyBal [17:08:21] Possibly, yeah. [17:08:22] * James_F sighs. [17:08:38] adding a require_once for that in the setUp would help, but yeah, it's still missing those globals. 
[17:09:23] pybal complaining is me btw [17:09:27] James_F: I wonder if declare(strict_types=1); would help here [17:09:36] it's php 7.2+ only, but might be okay for the tests here [17:09:41] anyway, just an idea :) [17:09:48] !log jforrester@deploy1001 Synchronized docroot/noc/conf/VariantSettings.php.txt: New file for NOC (duration: 00m 54s) [17:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:53] that should in theory make it throw if we cast types [17:11:47] In StaticSettingsTest? [17:12:17] (03CR) 10Krinkle: tests: Actually assign $variantSettings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537150 (owner: 10Jforrester) [17:12:26] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.47:8889]) https://wikitech.wikimedia.org/wiki/PyBal [17:12:49] James_F: yeah, worth a shot maybe [17:13:40] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 51 connections established with conf1004.eqiad.wmnet:4001 (min=52) https://wikitech.wikimedia.org/wiki/PyBal [17:14:17] 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (10Marostegui) I will comment here once the DBs have been depooled tomorrow, I will do it a bit before the scheduled maintenance scheduled time. Thanks! [17:14:30] (03PS2) 10Jforrester: tests: Actually assign $variantSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537150 [17:15:23] (03CR) 10jerkins-bot: [V: 04-1] tests: Actually assign $variantSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537150 (owner: 10Jforrester) [17:16:33] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.47:8889]) https://wikitech.wikimedia.org/wiki/PyBal [17:17:01] are we decomming an lvs service? [17:17:17] or bringing one up? [17:17:28] Krinkle: Hmm. 
`$this->assertTrue( false, "Hello" );` doesn't trigger… I'm not convinced any of this is working. [17:18:15] oh I see, "wikifeeds" [17:19:13] (03PS8) 10Matthias Geisler: bridge: enable EditTags for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536982 (https://phabricator.wikimedia.org/T232582) [17:19:15] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [17:20:56] (03PS1) 10Alexandros Kosiaris: LVS: correct typo [puppet] - 10https://gerrit.wikimedia.org/r/537164 [17:21:06] (03CR) 10Alexandros Kosiaris: [C: 03+2] LVS: correct typo [puppet] - 10https://gerrit.wikimedia.org/r/537164 (owner: 10Alexandros Kosiaris) [17:21:25] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] LVS: correct typo [puppet] - 10https://gerrit.wikimedia.org/r/537164 (owner: 10Alexandros Kosiaris) [17:23:53] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:26:21] RECOVERY - PyBal connections to etcd on lvs2006 is OK: OK: 43 connections established with conf2001.codfw.wmnet:2379 (min=43) https://wikitech.wikimedia.org/wiki/PyBal [17:26:23] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:28:39] PROBLEM - Confd template for /srv/config-master/pybal/codfw/wikifeeds on puppetmaster2001 is CRITICAL: File not found: /srv/config-master/pybal/codfw/wikifeeds https://wikitech.wikimedia.org/wiki/Confd [17:28:53] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: 
Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: [17:28:53] most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for A [17:28:53] CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200): /{domain}/v [17:28:53] In the News content) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [17:29:35] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 52 connections established with conf1004.eqiad.wmnet:4001 (min=52) https://wikitech.wikimedia.org/wiki/PyBal [17:29:35] RECOVERY - PyBal connections to etcd on lvs2003 is OK: OK: 43 connections established with conf2001.codfw.wmnet:2379 (min=43) https://wikitech.wikimedia.org/wiki/PyBal [17:29:37] RECOVERY - PyBal IPVS diff check on lvs2003 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:29:39] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:32:15] (03CR) 10Joal: "comments about naming after some more thoughts" (033 
comments) [debs/python-ua-parser] (debian) - 10https://gerrit.wikimedia.org/r/537123 (owner: 10Ottomata) [17:32:37] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/wikifeeds on puppetmaster2001 is CRITICAL: File not found: /srv/config-master/pybal/eqiad/wikifeeds https://wikitech.wikimedia.org/wiki/Confd [17:32:41] mdholloway: akosiaris@deploy1001:~$ curl http://wikifeeds.svc.eqiad.wmnet:8889/_info works now. There's still one minor thing to be done, which is to enable the discovery records so that curl http://wikifeeds.discovery.wmnet:8889/_info will also work. I'll do that tomorrow, though. That is going to be the canonical endpoint that should be used btw in all clients, as it allows easy pooling/depooling of datacenters [17:34:11] 10Operations, 10Cloud-Services, 10SRE-Access-Requests, 10Developer-Advocacy (Jul-Sep 2019): Membership in "researchers" group for Srishti Sethi - https://phabricator.wikimedia.org/T232664 (10Nuria) #sre-access-requests please give @srishakatux ssh permits for the cluster if she does not have those already [17:35:06] Great! Thanks again, akosiaris. 
[17:35:23] RECOVERY - Confd template for /srv/config-master/pybal/codfw/wikifeeds on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [17:35:37] (03PS4) 10Herron: exim: add pr.wikimedia.org to wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/536313 (https://phabricator.wikimedia.org/T231387) [17:35:45] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/wikifeeds on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [17:37:55] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [17:39:11] (03CR) 10Herron: [C: 03+2] exim: add pr.wikimedia.org to wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/536313 (https://phabricator.wikimedia.org/T231387) (owner: 10Herron) [17:39:31] ACKNOWLEDGEMENT - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CR [17:39:31] ieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured artic [17:39:31] 2016) is CRITICAL: Test 
retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200): /{d [17:39:31] ws (get In the News content) timed out before a response was received alexandros kosiaris new service https://wikitech.wikimedia.org/wiki/Wikifeeds [17:39:37] 10Operations, 10Analytics, 10Traffic: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (10Nuria) @bblack Let me add more contex here, w... [17:42:49] 10Operations, 10ops-codfw: find horizontal PDUs in codfw - https://phabricator.wikimedia.org/T221153 (10Papaul) 3 of the CS-48VDY-L2130 11 OF the CS-36VYM411 [17:42:50] !log restart elasticsearch_6@production-search-eqiad on elastic1027 due to >1k orphan tasks [17:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:53] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: [17:42:53] most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned 
the unexpected status 504 (expecting: 200): /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for A [17:42:53] CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200): /{domain}/v [17:42:53] In the News content) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [17:42:56] !log decommissioning Cassandra, restbase2010-c -- T224553 [17:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:59] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [17:44:21] about those wikifeeds alerts, looks like the service config needs updating to point to internal API endpoints [17:44:26] ACKNOWLEDGEMENT - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CR [17:44:26] ieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured artic [17:44:26] 2016) is 
CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200): /{d [17:44:26] ws (get In the News content) timed out before a response was received alexandros kosiaris new service https://wikitech.wikimedia.org/wiki/Wikifeeds [17:44:55] (03PS1) 10Physikerwelt: Remove unused math config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537166 (https://phabricator.wikimedia.org/T228547) [17:46:14] 10Operations, 10ops-eqiad: Broken memory on elastic1029 - https://phabricator.wikimedia.org/T201991 (10Mathew.onipe) 05Resolved→03Open This has come up again. https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=elastic1029&service=Memory+correctable+errors+-EDAC- [17:46:49] hmm, the config has discovery.wmnet URLs [17:46:51] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on elastic1029 is CRITICAL: 4.001 ge 4 Mathew.onipe https://phabricator.wikimedia.org/T201991 - The acknowledgement expires at: 2019-09-20 17:46:31. 
https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=elastic1029&var-datasource=eqiad+prometheus/ops [17:47:08] and yet: [17:47:14] mholloway-shell@deploy1001:~$ curl wikifeeds.svc.eqiad.wmnet:8889/en.wikipedia.org/v1/page/most-read/2018/05/05 [17:47:14] {"status":504,"type":"internal_http_error","detail":"Error: ETIMEDOUT","method":"post","uri":"https://en.wikipedia.org/w/api.php"} [17:48:58] (03CR) 10Physikerwelt: Remove unused math config (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537166 (https://phabricator.wikimedia.org/T228547) (owner: 10Physikerwelt) [17:50:07] 10Operations, 10ops-eqiad: Broken memory on elastic1029 - https://phabricator.wikimedia.org/T201991 (10Mathew.onipe) 05Open→03Resolved [17:51:15] (03PS2) 10Herron: kafka-main: replace kafka1002 hardware with kafka-main1002 [puppet] - 10https://gerrit.wikimedia.org/r/536655 (https://phabricator.wikimedia.org/T225005) [17:51:30] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [17:53:53] (03PS3) 10Jforrester: tests: Make this test actually work, and avoid global messiness a little [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537150 [17:54:44] (03CR) 10jerkins-bot: [V: 04-1] tests: Make this test actually work, and avoid global messiness a little [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537150 (owner: 10Jforrester) [17:54:50] (03PS4) 10Jforrester: tests: Make this test actually work, and avoid global messiness a little [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537150 [17:55:40] !log jforrester@deploy1001 Synchronized docroot/noc/conf/VariantSettings.php.txt: 
New file for NOC (duration: 00m 55s) [17:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:14] (03PS1) 10Lucas Werkmeister (WMDE): Clean up globals in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537169 [17:56:37] (03CR) 10jerkins-bot: [V: 04-1] tests: Make this test actually work, and avoid global messiness a little [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537150 (owner: 10Jforrester) [17:57:00] (03CR) 10Krinkle: tests: Make this test actually work, and avoid global messiness a little (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537150 (owner: 10Jforrester) [17:57:28] (03PS1) 10Herron: kafka-main1002: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/537170 [17:57:33] jouncebot: refresh [17:57:34] I refreshed my knowledge about deployments. [17:57:38] jouncebot: reload [17:58:07] (03PS5) 10Jforrester: tests: Make this test actually work, and avoid global messiness a little [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537150 [17:58:19] (03CR) 10Krinkle: tests: Migrate tests from InitialiseSettingsTest to StaticSettingsTest (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535980 (owner: 10Jforrester) [17:58:27] (03CR) 10Herron: [C: 03+2] kafka-main1002: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/537170 (owner: 10Herron) [17:58:54] (03PS6) 10Jforrester: tests: Make this test actually work, and avoid global messiness a little [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537150 [18:00:05] MaxSem, RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Morning SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190916T1800). [18:00:05] Lucas_WMDE: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:09] o/ [18:01:22] James_F: can I SWAT a config change or are you rearranging mediawiki-config right now? [18:01:33] (my change Should™ be beta-only, but I should still sync it and test on mwdebug) [18:01:54] (03PS4) 10Jforrester: tests: Migrate tests from InitialiseSettingsTest to StaticSettingsTest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535980 [18:01:58] Lucas_WMDE: Sorry, go for it. [18:02:04] ok thanks [18:02:10] * James_F will hold back. [18:02:20] Note the new VariantSettings.php if you haven't already. [18:02:26] yup, just looked at it [18:02:31] But if you're Labs-only it shouldn't change anything for you. [18:02:32] but this change only touches IS-labs, which still looks intact [18:02:49] Yeah, go for it. [18:02:52] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536982 (https://phabricator.wikimedia.org/T232582) (owner: 10Matthias Geisler) [18:03:12] * Lucas_WMDE glares at zuul [18:05:23] James_F: CI might take a while, if you want to do something right now feel free to go ahead… [18:06:35] It should be near-instant? [18:07:10] (03CR) 10Jforrester: tests: Migrate tests from InitialiseSettingsTest to StaticSettingsTest (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535980 (owner: 10Jforrester) [18:07:19] once it starts, probably [18:07:26] but it looks like all the workers are tied up in long-running jobs? [18:07:34] Hmm, executor starvation from too many Wikibase jobs? [18:07:43] * James_F sighs. [18:08:13] was there a task for the “drop Wikibase Quibble jobs except php72” idea? [18:08:28] because it’s sounding more and more tempting [18:09:23] There is, but you (WMDE) objected on the grounds that it'd undermine third party use of Wikibase. [18:09:36] Note that "soon"™ we're dropping php70 and php71 anyway. [18:09:45] And slightly less soon, HHVM too. 
[18:09:46] !log registry2001 - restarting nginx [18:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:54] So that'll kill ~60% of the jobs. [18:10:10] (03Merged) 10jenkins-bot: bridge: enable EditTags for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536982 (https://phabricator.wikimedia.org/T232582) (owner: 10Matthias Geisler) [18:10:20] yay [18:10:22] Finally. :-) [18:10:28] ok, deploying [18:10:36] (also yay for dropping those jobs btw) [18:11:19] Yeah, now if only TechCom would send out their weekly update with their decision. ;-) [18:11:46] testing on mwdebug1002… [18:12:47] looks fine, syncing [18:12:58] !log migrating kafka1002 to kafka-main1002 T225005 [18:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:01] T225005: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 [18:14:38] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/: SWAT: [[gerrit:536982|bridge: enable EditTags for beta (T232582)]] (duration: 00m 58s) [18:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:57] T232582: configure tag on beta - https://phabricator.wikimedia.org/T232582 [18:14:59] anything else to SWAT? [18:15:45] (03PS3) 10Herron: kafka-main: replace kafka1002 hardware with kafka-main1002 [puppet] - 10https://gerrit.wikimedia.org/r/536655 (https://phabricator.wikimedia.org/T225005) [18:15:58] !log Morning SWAT done [18:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:14] James_F: you’re free to go :) [18:16:38] (03CR) 10Herron: [C: 03+2] kafka-main: replace kafka1002 hardware with kafka-main1002 [puppet] - 10https://gerrit.wikimedia.org/r/536655 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [18:16:54] Thanks. 
[18:16:59] (03CR) 10Jforrester: [C: 03+2] tests: Make this test actually work, and avoid global messiness a little [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537150 (owner: 10Jforrester) [18:17:09] (03CR) 10Jforrester: [C: 03+2] tests: Migrate tests from InitialiseSettingsTest to StaticSettingsTest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535980 (owner: 10Jforrester) [18:17:16] (03PS2) 10Ottomata: Release debian version 0.8.0+core0.6.9~1-1 [debs/python-ua-parser] (debian) - 10https://gerrit.wikimedia.org/r/537123 [18:17:56] (03Merged) 10jenkins-bot: tests: Make this test actually work, and avoid global messiness a little [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537150 (owner: 10Jforrester) [18:18:05] (03Merged) 10jenkins-bot: tests: Migrate tests from InitialiseSettingsTest to StaticSettingsTest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535980 (owner: 10Jforrester) [18:20:54] (03PS1) 10Cmjohnson: Adding mgmt dns for new ms-be servers [dns] - 10https://gerrit.wikimedia.org/r/537171 (https://phabricator.wikimedia.org/T232367) [18:21:38] !log jforrester@deploy1001 Synchronized wmf-config/VariantSettings.php: Remove globals declaration and use via GLOBALS for testability (duration: 00m 56s) [18:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:34] it looks like this CI blockage is also preventing beta updates? 
https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/ hasn’t run for 6 hours… [18:25:43] That looks like a different bug [18:26:00] (03CR) 10jenkins-bot: Remove expired throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536679 (owner: 10Zoranzoki21) [18:26:04] (03CR) 10jenkins-bot: all wikis to 1.34.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537106 (owner: 10Hashar) [18:26:06] (03CR) 10jenkins-bot: Variant configuration: Write JSON config for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535257 (owner: 10Jforrester) [18:26:08] (03CR) 10jenkins-bot: Remove dependency on wgConf from wmf-config/InitialiseSettings-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536345 (https://phabricator.wikimedia.org/T223602) (owner: 10Krinkle) [18:26:10] (03CR) 10jenkins-bot: Migrate from InitialiseSettings to VariantSettings, a static array for testability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 (owner: 10Jforrester) [18:26:12] just asked about it in #releng, I assume that’s the better channel [18:26:12] (03CR) 10jenkins-bot: bridge: enable EditTags for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536982 (https://phabricator.wikimedia.org/T232582) (owner: 10Matthias Geisler) [18:26:14] (03CR) 10jenkins-bot: tests: Make this test actually work, and avoid global messiness a little [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537150 (owner: 10Jforrester) [18:26:16] (03CR) 10jenkins-bot: tests: Migrate tests from InitialiseSettingsTest to StaticSettingsTest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535980 (owner: 10Jforrester) [18:33:09] (03PS3) 10Jforrester: WmfCluster: Use static VariantSettings instead of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535981 [18:34:48] James_F: Want to do the password patches? 
:P [18:34:59] (03CR) 10Jforrester: [C: 03+2] WmfCluster: Use static VariantSettings instead of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535981 (owner: 10Jforrester) [18:35:06] Reedy: Yeah, just a moment. [18:38:15] (03Merged) 10jenkins-bot: WmfCluster: Use static VariantSettings instead of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535981 (owner: 10Jforrester) [18:39:09] (03PS3) 10Jforrester: Drop PasswordCannotBePopular compatibility hack, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534706 [18:40:02] !log phab1001 - racadm racreset [18:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:35] !log jforrester@deploy1001 Synchronized src/WmfClusters.php: Use static VariantSettings instead of InitialiseSettings (noc-only change) (duration: 00m 55s) [18:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:38] * James_F sighs. [18:40:55] (03CR) 10Jforrester: "Oh, we also have this in CommonSettings-labs…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534706 (owner: 10Jforrester) [18:41:18] (03CR) 10jenkins-bot: WmfCluster: Use static VariantSettings instead of InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535981 (owner: 10Jforrester) [18:43:43] (03CR) 10Jforrester: [C: 03+2] Drop PasswordCannotBePopular compatibility hack, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534706 (owner: 10Jforrester) [18:43:45] (03CR) 10Jforrester: [C: 03+2] Set MinimumPasswordLengthToLogin to 10 for all prived groups, not just +staff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534707 (owner: 10Jforrester) [18:43:55] (03PS3) 10Jforrester: Set MinimumPasswordLengthToLogin to 10 for all prived groups, not just +staff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534707 [18:44:00] (03CR) 10Jforrester: [C: 03+2] "…" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/534707 (owner: 10Jforrester) [18:44:59] (03Merged) 10jenkins-bot: Drop PasswordCannotBePopular compatibility hack, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534706 (owner: 10Jforrester) [18:45:55] (03Merged) 10jenkins-bot: Set MinimumPasswordLengthToLogin to 10 for all prived groups, not just +staff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534707 (owner: 10Jforrester) [18:46:23] Reedy: Live on mwdebug1002; checking. [18:46:57] (03CR) 10jenkins-bot: Drop PasswordCannotBePopular compatibility hack, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534706 (owner: 10Jforrester) [18:47:41] All looks good. [18:48:30] (03PS5) 10Jforrester: Variant configuration: Read JSON config for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535963 [18:48:55] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Set MinimumPasswordLengthToLogin to 10 for all prived groups, not just +staff (duration: 00m 56s) [18:48:56] (03PS6) 10Jforrester: Variant configuration: Read JSON config for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535963 (https://phabricator.wikimedia.org/T223602) [18:49:02] (03CR) 10Jforrester: [C: 03+2] Variant configuration: Read JSON config for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535963 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [18:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:47] (03Merged) 10jenkins-bot: Variant configuration: Read JSON config for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535963 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [18:52:53] (03PS7) 10Jforrester: Variant configuration: Never write to serialised PHP, drop support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533594 (https://phabricator.wikimedia.org/T223602) [18:52:56] !log jforrester@deploy1001 
Synchronized wmf-config/CommonSettings.php: T223602 Variant configuration: Read JSON config for all wikis (duration: 00m 56s) [18:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:09] T223602: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 [18:54:31] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [18:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:51] (03CR) 10Jforrester: "Testing this on debug just in case." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537169 (owner: 10Lucas Werkmeister (WMDE)) [18:54:58] !log dzahn@cumin1001 Updating IPMI password on 1 hosts - dzahn@cumin1001 [18:54:59] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [18:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:34] (03CR) 10Jforrester: "Maybe split this patch into the CommonSettings/VariantSettings changes (which we're sure about) and the filebackend.php changes (which we'" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537166 (https://phabricator.wikimedia.org/T228547) (owner: 10Physikerwelt) [18:57:23] (03CR) 10Jforrester: [C: 03+2] Clean up globals in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537169 (owner: 10Lucas Werkmeister (WMDE)) [18:58:19] (03Merged) 10jenkins-bot: Clean up globals in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537169 (owner: 10Lucas Werkmeister (WMDE)) [18:58:53] (03PS2) 10Jforrester: Stop setting wgCookieSetOnAutoBlock and wgCookieSetOnIpBlock to the default; never varied [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534698 (https://phabricator.wikimedia.org/T191922) [18:59:03] (03CR) 10Jforrester: [C: 03+2] "wmf.22 is everywhere." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/534698 (https://phabricator.wikimedia.org/T191922) (owner: 10Jforrester) [18:59:38] (03Abandoned) 10Jforrester: Enforce a 10-byte password for privileged users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479570 (https://phabricator.wikimedia.org/T208246) (owner: 10Jforrester) [19:00:45] (03Merged) 10jenkins-bot: Stop setting wgCookieSetOnAutoBlock and wgCookieSetOnIpBlock to the default; never varied [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534698 (https://phabricator.wikimedia.org/T191922) (owner: 10Jforrester) [19:00:59] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [19:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:24] !log dzahn@cumin1001 Updating IPMI password on 0 hosts - dzahn@cumin1001 [19:01:24] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [19:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:16] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Clean up globals in InitialiseSettings.php (duration: 00m 56s) [19:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:51] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Stop setting wgCookieSetOnAutoBlock and wgCookieSetOnIpBlock to the default; never varied (duration: 00m 56s) [19:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:22] 10Operations, 10Cloud-Services, 10SRE-Access-Requests, 10Developer-Advocacy (Jul-Sep 2019): Membership in "researchers" group for Srishti Sethi - https://phabricator.wikimedia.org/T232664 (10Nuria) Also, question to @mforns and @JAllemandou it seems that since the #cloud-services-team just wants to see edi... 
[19:07:58] (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns for new ms-be servers [dns] - 10https://gerrit.wikimedia.org/r/537171 (https://phabricator.wikimedia.org/T232367) (owner: 10Cmjohnson) [19:08:44] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [19:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:16] !log dzahn@cumin1001 Updating IPMI password on 8 hosts - dzahn@cumin1001 [19:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:18] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [19:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:24] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [19:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:55] !log dzahn@cumin1001 Updating IPMI password on 8 hosts - dzahn@cumin1001 [19:13:56] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [19:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:58] (03CR) 10Krinkle: Variant configuration: Never write to serialised PHP, drop support (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533594 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [19:19:33] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [19:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:52] !log dzahn@cumin1001 Updating IPMI password on 2 hosts - dzahn@cumin1001 [19:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:55] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [19:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:55] 
10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10DC-Ops: analytics1045 - RAID failure and /var/lib/hadoop/data/j can't be mounted - https://phabricator.wikimedia.org/T232069 (10wiki_willy) a:03Cmjohnson [19:26:40] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [19:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:03] !log dzahn@cumin1001 Updating IPMI password on 1 hosts - dzahn@cumin1001 [19:27:04] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [19:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:36] (03PS1) 10Ayounsi: LibreNMS: fix more files permissions [puppet] - 10https://gerrit.wikimedia.org/r/537182 [19:27:52] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [19:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:13] (03PS1) 10Bstorm: password: rotate cloudwide root key for bstorm [labs/private] - 10https://gerrit.wikimedia.org/r/537183 [19:28:37] (03CR) 10Ayounsi: [C: 03+2] LibreNMS: fix more files permissions [puppet] - 10https://gerrit.wikimedia.org/r/537182 (owner: 10Ayounsi) [19:28:40] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10DC-Ops: analytics1045 - RAID failure and /var/lib/hadoop/data/j can't be mounted - https://phabricator.wikimedia.org/T232069 (10wiki_willy) Hi @Dzahn @jbond - looks like this host is out of warranty, and about 3/4 of a year away from a hardwa... 
[19:28:51] !log dzahn@cumin1001 Updating IPMI password on 12 hosts - dzahn@cumin1001 [19:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:53] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [19:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:43] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10DC-Ops: analytics1045 - RAID failure and /var/lib/hadoop/data/j can't be mounted - https://phabricator.wikimedia.org/T232069 (10Dzahn) That's a question for the analytics team, please. [19:30:03] 10Operations, 10netops: librenms doesn't print alert text on irc anymore - https://phabricator.wikimedia.org/T232977 (10ayounsi) Pushed https://gerrit.wikimedia.org/r/c/operations/puppet/+/537182 to fix some email alerting issues, not sure yet if it helps with the IRC ones. [19:34:21] (03PS1) 10Ayounsi: LibreNMS: use proper permission [puppet] - 10https://gerrit.wikimedia.org/r/537187 [19:34:43] (03CR) 10Bstorm: "Switching to using my yubikey" [labs/private] - 10https://gerrit.wikimedia.org/r/537183 (owner: 10Bstorm) [19:34:47] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [19:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:06] !log dzahn@cumin1001 Updating IPMI password on 12 hosts - dzahn@cumin1001 [19:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:10] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [19:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:12] (03CR) 10Ayounsi: [C: 03+2] LibreNMS: use proper permission [puppet] - 10https://gerrit.wikimedia.org/r/537187 (owner: 10Ayounsi) [19:51:09] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [19:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:28] !log 
dzahn@cumin1001 Updating IPMI password on 2 hosts - dzahn@cumin1001 [19:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:30] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [19:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:49] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [19:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:05] !log dzahn@cumin1001 Updating IPMI password on 2 hosts - dzahn@cumin1001 [19:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:07] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [19:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [19:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:12] !log dzahn@cumin1001 Updating IPMI password on 2 hosts - dzahn@cumin1001 [19:55:12] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [19:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:29] !log reboot scs-a8-eqiad (at 100% CPU) [19:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:41] 08Warning Alert for device scs-a8-eqiad.mgmt.eqiad.wmnet - Processor usage over 85% [19:59:50] yay, fixed! [20:00:05] cscott, arlolra, subbu, bearND, halfak, and accraze: Dear deployers, time to do the Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190916T2000). 
[20:00:12] no parsoid deploy today [20:01:34] 10Operations, 10netops: librenms doesn't print alert text on irc anymore - https://phabricator.wikimedia.org/T232977 (10ayounsi) 05Open→03Resolved a:03ayounsi Confirmed fixed! [20:03:08] 04Critical Alert for device scs-a8-eqiad.mgmt.eqiad.wmnet - Device rebooted [20:06:04] 10Operations, 10Core Platform Team: Apache mod_status aggregator - https://phabricator.wikimedia.org/T233047 (10ayounsi) [20:08:07] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [20:08:09] 04̶C̶r̶i̶t̶i̶c̶a̶l Device scs-a8-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted [20:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:22] !log dzahn@cumin1001 Updating IPMI password on 2 hosts - dzahn@cumin1001 [20:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:24] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [20:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:45] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [20:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:06] !log dzahn@cumin1001 Updating IPMI password on 2 hosts - dzahn@cumin1001 [20:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:07] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [20:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:57] (03CR) 10Ori.livneh: "I filed Tim's mod_status aggregator idea as https://phabricator.wikimedia.org/T233047. 
I still think the approach described there should b" [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [20:14:55] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [20:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:10] !log dzahn@cumin1001 Updating IPMI password on 8 hosts - dzahn@cumin1001 [20:15:10] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [20:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:20] 10Operations, 10Core Platform Team: Apache mod_status aggregator - https://phabricator.wikimedia.org/T233047 (10ori) [20:18:10] 08̶W̶a̶r̶n̶i̶n̶g Device scs-a8-eqiad.mgmt.eqiad.wmnet recovered from Processor usage over 85% [20:18:46] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [20:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:02] !log dzahn@cumin1001 Updating IPMI password on 1 hosts - dzahn@cumin1001 [20:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:05] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [20:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:11] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [20:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:24] !log dzahn@cumin1001 Updating IPMI password on 1 hosts - dzahn@cumin1001 [20:19:24] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [20:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:37] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset 
[20:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:50] !log dzahn@cumin1001 Updating IPMI password on 1 hosts - dzahn@cumin1001 [20:19:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [20:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:39] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [20:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:52] !log dzahn@cumin1001 Updating IPMI password on 1 hosts - dzahn@cumin1001 [20:20:54] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [20:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:46] (03PS1) 10Jforrester: InitialiseSettings: Re-use wmfConfgDir rather than assume __DIR__ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537192 [20:25:40] (03PS8) 10Jforrester: Variant configuration: Never write to serialised PHP, drop support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533594 (https://phabricator.wikimedia.org/T223602) [20:25:53] (03CR) 10Jforrester: Variant configuration: Never write to serialised PHP, drop support (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533594 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [20:26:39] (03CR) 10Jhedden: [C: 03+1] password: rotate cloudwide root key for bstorm [labs/private] - 10https://gerrit.wikimedia.org/r/537183 (owner: 10Bstorm) [20:28:13] (03CR) 10Dzahn: "broke puppet runs on mwmaint servers hours ago: Mediawiki::Periodic_job[mediawiki_tor_exit_node]: has no parameter named 'ensure'" [puppet] - 10https://gerrit.wikimedia.org/r/533571 (https://phabricator.wikimedia.org/T229736) (owner: 10Krinkle) 
[20:32:53] (03PS1) 10Dzahn: Revert "mediawiki: Disable loadExitNodes.php maintenance script" [puppet] - 10https://gerrit.wikimedia.org/r/537195 [20:35:27] (03PS2) 10Dzahn: Revert "mediawiki: Disable loadExitNodes.php maintenance script" [puppet] - 10https://gerrit.wikimedia.org/r/537195 [20:36:21] (03PS1) 10Herron: kafka-main: move kafka1002 to role spare system [puppet] - 10https://gerrit.wikimedia.org/r/537196 (https://phabricator.wikimedia.org/T225005) [20:36:51] (03CR) 10Catrope: [C: 03+1] Update ORES filter threshold configuration for new huwiki model [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536732 (https://phabricator.wikimedia.org/T230031) (owner: 10Gergő Tisza) [20:37:29] (03CR) 10Dzahn: [C: 03+2] Revert "mediawiki: Disable loadExitNodes.php maintenance script" [puppet] - 10https://gerrit.wikimedia.org/r/537195 (owner: 10Dzahn) [20:37:36] (03PS3) 10Dzahn: Revert "mediawiki: Disable loadExitNodes.php maintenance script" [puppet] - 10https://gerrit.wikimedia.org/r/537195 [20:39:00] 10Operations, 10ops-eqiad, 10DC-Ops: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC) - https://phabricator.wikimedia.org/T227541 (10wiki_willy) Checked with @Cmjohnson , who says he'll follow up to check the connections. [20:39:07] (03PS2) 10Herron: kafka-main: move kafka1002 to role spare system [puppet] - 10https://gerrit.wikimedia.org/r/537196 (https://phabricator.wikimedia.org/T225005) [20:40:13] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10DC-Ops: analytics1045 - RAID failure and /var/lib/hadoop/data/j can't be mounted - https://phabricator.wikimedia.org/T232069 (10Ottomata) Hello! In our FY2019-2020 hardware budgeting, we had planned to replace these nodes in Q4, when they ac... 
[20:42:59] (03CR) 10Bstorm: [V: 03+2 C: 03+2] password: rotate cloudwide root key for bstorm [labs/private] - 10https://gerrit.wikimedia.org/r/537183 (owner: 10Bstorm) [20:43:40] (03PS3) 10Herron: kafka-main: move kafka1002 to role spare system [puppet] - 10https://gerrit.wikimedia.org/r/537196 (https://phabricator.wikimedia.org/T225005) [20:51:10] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops, 10User-jijiki: (OoW) thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10Dzahn) server still alerting nowadays [20:51:36] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 4.001 ge 4 daniel_zahn https://phabricator.wikimedia.org/T215411 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [20:55:00] !log remove 2 sessions to AS12871 on cr2-esams - T232617 [20:55:01] 10Operations, 10netops: BGP sessions down on cr2-esams - https://phabricator.wikimedia.org/T232617 (10ayounsi) From AMS-IX ML: > Please remove your BGP sessions to the following IP’s: > AS12871 > IPv4: 80.249.208.64 > IPv6: 2001:7f8:1::a501:2871:1 [20:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:04] T232617: BGP sessions down on cr2-esams - https://phabricator.wikimedia.org/T232617 [20:55:11] (03PS3) 10Catrope: Update ORES filter threshold configuration for new huwiki model [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536732 (https://phabricator.wikimedia.org/T230031) (owner: 10Gergő Tisza) [20:55:50] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 416, down: 3, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:56:09] (03CR) 10Catrope: [C: 03+1] Update ORES filter threshold configuration for new huwiki model [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536732 
(https://phabricator.wikimedia.org/T230031) (owner: 10Gergő Tisza) [20:56:17] 10Operations, 10netops: BGP sessions down on cr2-esams - https://phabricator.wikimedia.org/T232617 (10ayounsi) 05Open→03Resolved a:03ayounsi @jbond solved AS28598 [20:57:39] (03CR) 10Catrope: [C: 03+1] Enable EditorJourney for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537092 (https://phabricator.wikimedia.org/T232061) (owner: 10Kosta Harlan) [20:58:32] (03PS1) 10Bstorm: ssh-keys: rotate ssh public key for bstorm [puppet] - 10https://gerrit.wikimedia.org/r/537201 [20:59:07] (03PS2) 10Catrope: Enable EditorJourney for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537092 (https://phabricator.wikimedia.org/T232061) (owner: 10Kosta Harlan) [21:00:04] Reedy and sbassett: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190916T2100). [21:00:39] (03CR) 10Catrope: [C: 03+1] Enable EditorJourney for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537092 (https://phabricator.wikimedia.org/T232061) (owner: 10Kosta Harlan) [21:07:13] (03CR) 10Herron: [C: 03+2] kafka-main: move kafka1002 to role spare system [puppet] - 10https://gerrit.wikimedia.org/r/537196 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [21:07:21] (03PS2) 10Bstorm: ssh-keys: rotate ssh public key for bstorm [puppet] - 10https://gerrit.wikimedia.org/r/537201 [21:10:31] ACKNOWLEDGEMENT - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Cas Rusnov This is being worked on. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:16:43] (03PS1) 10Jhedden: tools-prometheus: add toolsdb mariadb metrics [puppet] - 10https://gerrit.wikimedia.org/r/537203 (https://phabricator.wikimedia.org/T220530) [21:17:29] (03PS1) 10Dzahn: mediawiki::maintenance: remove TorExit node periodic job [puppet] - 10https://gerrit.wikimedia.org/r/537204 (https://phabricator.wikimedia.org/T229736) [21:18:28] (03PS1) 10Jforrester: Move global-dependent, invariant wgCopyUploadProxy to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537205 [21:18:30] (03PS1) 10Jforrester: Move global-dependent, invariant wmgRC2UDPAddress to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537206 [21:18:32] (03PS1) 10Jforrester: Move global-dependent, invariant wgUploadThumbnailRenderHttpCustom* to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537207 [21:20:47] (03CR) 10Dzahn: [C: 03+2] mediawiki::maintenance: remove TorExit node periodic job [puppet] - 10https://gerrit.wikimedia.org/r/537204 (https://phabricator.wikimedia.org/T229736) (owner: 10Dzahn) [21:23:46] (03PS1) 10Herron: Revert "kafka-main1002: disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/537209 [21:24:02] (03PS2) 10Herron: Revert "kafka-main1002: disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/537209 [21:25:06] (03CR) 10Herron: [C: 03+2] Revert "kafka-main1002: disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/537209 (owner: 10Herron) [21:26:05] (03PS2) 10Krinkle: mediawiki: Remove loadExitNodes.php maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/533572 (https://phabricator.wikimedia.org/T229736) [21:26:24] (03PS3) 10Krinkle: mediawiki: Remove loadExitNodes.php maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/533572 (https://phabricator.wikimedia.org/T229736) [21:26:30] (03PS4) 10Krinkle: mediawiki: Remove loadExitNodes.php maintenance script [puppet] - 
10https://gerrit.wikimedia.org/r/533572 (https://phabricator.wikimedia.org/T229736) [21:26:32] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: Remove loadExitNodes.php maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/533572 (https://phabricator.wikimedia.org/T229736) (owner: 10Krinkle) [21:26:45] (03PS5) 10Krinkle: mediawiki: Remove loadExitNodes.php maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/533572 (https://phabricator.wikimedia.org/T229736) [21:27:07] (03PS6) 10Krinkle: mediawiki: Remove loadExitNodes.php maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/533572 (https://phabricator.wikimedia.org/T229736) [21:27:26] (03PS1) 10Jforrester: Move global-dependent, barely variant wgDebugLogFile to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537211 [21:27:40] mutante: thanks [21:32:00] (03CR) 10Krinkle: "What about labs?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537211 (owner: 10Jforrester) [21:33:33] (03CR) 10Jhedden: [C: 03+1] ssh-keys: rotate ssh public key for bstorm [puppet] - 10https://gerrit.wikimedia.org/r/537201 (owner: 10Bstorm) [21:33:57] (03CR) 10Krinkle: Move global-dependent, invariant wgCopyUploadProxy to CommonSettings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537205 (owner: 10Jforrester) [21:34:12] (03CR) 10Jforrester: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537211 (owner: 10Jforrester) [21:34:31] (03CR) 10Bstorm: "To make this work between projects, I think we'll have to open the port to anything in the ip range via security groups, right? Why not s" [puppet] - 10https://gerrit.wikimedia.org/r/537203 (https://phabricator.wikimedia.org/T220530) (owner: 10Jhedden) [21:34:58] (03CR) 10Krinkle: Move global-dependent, invariant wmgRC2UDPAddress to CommonSettings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537206 (owner: 10Jforrester) [21:35:29] (03CR) 10Krinkle: "Right, I moved that last week. 
Forgot about that." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537211 (owner: 10Jforrester) [21:35:36] (03CR) 10Krinkle: "Should've done it for prod too :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537211 (owner: 10Jforrester) [21:35:58] (03CR) 10Jforrester: Move global-dependent, invariant wmgRC2UDPAddress to CommonSettings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537206 (owner: 10Jforrester) [21:37:23] (03PS2) 10Jforrester: Move global-dependent, invariant wgCopyUploadProxy to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537205 [21:37:52] (03CR) 10Jforrester: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537211 (owner: 10Jforrester) [21:39:48] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:41:08] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:45:06] (03CR) 10Bstorm: [C: 03+1] "I haven't checked the puppet compiler, but this looks good (unless you feel like modernizing the whole mess into a profile)." [puppet] - 10https://gerrit.wikimedia.org/r/537203 (https://phabricator.wikimedia.org/T220530) (owner: 10Jhedden) [21:48:02] !log unban elastic1027 from production-search-eqiad [21:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:21] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Development services): Reimage cobalt and gerrit2001 as buster - https://phabricator.wikimedia.org/T176774 (10Paladox) Cobalt is being replaced with Gerrit1001. 
[21:58:37] (03PS3) 10Bstorm: ssh-keys: rotate ssh public key for bstorm [puppet] - 10https://gerrit.wikimedia.org/r/537201 [22:10:14] (03CR) 10Bstorm: [C: 03+2] ssh-keys: rotate ssh public key for bstorm [puppet] - 10https://gerrit.wikimedia.org/r/537201 (owner: 10Bstorm) [22:13:46] 10Operations, 10ops-eqiad: backup1001 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T232882 (10Jclark-ctr) @Jclark-ctr @jcrespo SR# 997901435. DPS# 717467224, and it is set up to arrive during normal business hours on Wednesday. Disk will be replaced Wednesday [22:31:57] (03CR) 10Krinkle: InitialiseSettings: Re-use wmfConfgDir rather than assume __DIR__ (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537192 (owner: 10Jforrester) [22:35:14] (03CR) 10Jforrester: InitialiseSettings: Re-use wmfConfgDir rather than assume __DIR__ (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537192 (owner: 10Jforrester) [22:37:07] (03CR) 10Krinkle: [C: 03+1] "Confirmed it still passes when running fresh-node in the dir and running "npm install-test"" [puppet] - 10https://gerrit.wikimedia.org/r/537133 (owner: 10Jforrester) [22:37:42] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:37:45] (03PS2) 10Jforrester: InitialiseSettings: Use __DIR__ rather than global wmfConfgDir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537192 [22:38:35] (03CR) 10Krinkle: InitialiseSettings: Use __DIR__ rather than global wmfConfgDir (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537192 (owner: 10Jforrester) [22:38:40] (03CR) 10Krinkle: [C: 03+1] InitialiseSettings: Use __DIR__ rather than global wmfConfgDir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537192 (owner: 10Jforrester) [22:39:06] jouncebot: now [22:39:07] For the next 0 hour(s) and 20 minute(s): Weekly Security deployment window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190916T2100) [22:39:12] (03CR) 10Jforrester: [C: 03+2] InitialiseSettings: Use __DIR__ rather than global wmfConfgDir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537192 (owner: 10Jforrester) [22:39:45] (03CR) 10Bstorm: "The related change to this one was reverted because it broke puppet. I'm not sure of the details there." [puppet] - 10https://gerrit.wikimedia.org/r/533572 (https://phabricator.wikimedia.org/T229736) (owner: 10Krinkle) [22:42:11] (03Merged) 10jenkins-bot: InitialiseSettings: Use __DIR__ rather than global wmfConfgDir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537192 (owner: 10Jforrester) [22:44:35] (03CR) 10Krinkle: "I've rebased it on Daniel's revising of the reverted commit." [puppet] - 10https://gerrit.wikimedia.org/r/533572 (https://phabricator.wikimedia.org/T229736) (owner: 10Krinkle) [22:45:04] (03CR) 10Krinkle: "Once the ensure-absent has gone out, this can land to remove the dead code." [puppet] - 10https://gerrit.wikimedia.org/r/533572 (https://phabricator.wikimedia.org/T229736) (owner: 10Krinkle) [22:45:32] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:46:30] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Use __DIR__ rather than global wmfConfgDir (duration: 00m 55s) [22:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:41] (03PS3) 10Ejegg: Make banner-preview CSP match normal CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527183 (https://phabricator.wikimedia.org/T225261) [22:55:22] (03PS7) 10Dzahn: mediawiki: Remove loadExitNodes.php maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/533572 (https://phabricator.wikimedia.org/T229736) (owner: 10Krinkle) [22:55:35] (03CR) 10Dzahn: [C: 03+2] mediawiki: Remove loadExitNodes.php maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/533572 (https://phabricator.wikimedia.org/T229736) (owner: 10Krinkle) [22:55:50] (03PS1) 10Jforrester: CommonSettings: Factor out call to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537218 [22:55:52] (03PS1) 10Jforrester: CommonSettings: Move Beta-Cluster variant load into wmfLoadInitialiseSettings() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537219 [22:55:54] (03PS1) 10Jforrester: Move VariantSettings back to InitialiseSettings now that the migration is done [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537220 [22:57:00] (03CR) 10Ejegg: "WMF fundraising definitely still wants this patch to go out. 
The impact is limited to people using a special URL parameter to preview Cent" [mediawiki-config] - https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: Ejegg) [22:58:18] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [22:58:28] PROBLEM - puppet last run on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:58:42] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:58:42] PROBLEM - Check size of conntrack table on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [22:58:47] (PS6) Ejegg: CSP for banner preview: allow remind me later host [mediawiki-config] - https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) [22:58:48] PROBLEM - Check whether ferm is active by checking the default input chain on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [22:58:48] PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [22:58:52] PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [22:59:05] oh [22:59:08] PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [22:59:28] PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [22:59:54] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [22:59:54] !log restarted nagios-nrpe-server on notebook1003 [22:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190916T2300). [23:00:05] ejegg: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:00:10] "just" always the same issue when it runs out of RAM :( [23:00:14] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:00:16] RECOVERY - Check size of conntrack table on notebook1003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [23:00:17] (CR) Jforrester: [C: +2] Make banner-preview CSP match normal CSP [mediawiki-config] - https://gerrit.wikimedia.org/r/527183 (https://phabricator.wikimedia.org/T225261) (owner: Ejegg) [23:00:22] RECOVERY - Check whether ferm is active by checking the default input chain on notebook1003 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:00:22] RECOVERY - DPKG on notebook1003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [23:00:26] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [23:00:32] ejegg: Are you OK to test? [23:00:41] James_F: yep, I'm here standing by [23:00:42] RECOVERY - Disk space on notebook1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [23:00:45] Cool [23:01:02] gotta dig up an example of a banner that uses that remind me later form [23:01:04] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [23:01:07] (CR) jerkins-bot: [V: -1] Move VariantSettings back to InitialiseSettings now that the migration is done [mediawiki-config] - https://gerrit.wikimedia.org/r/537220 (owner: Jforrester) [23:01:08] * James_F grins. 
[23:01:30] ah, most of the english fundraising ones have that option [23:01:40] (Merged) jenkins-bot: Make banner-preview CSP match normal CSP [mediawiki-config] - https://gerrit.wikimedia.org/r/527183 (https://phabricator.wikimedia.org/T225261) (owner: Ejegg) [23:03:39] ejegg: Live on mwdebug1002. [23:03:45] great [23:04:04] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 16 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [23:04:04] (CR) Krinkle: CommonSettings: Factor out call to InitialiseSettings.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/537218 (owner: Jforrester) [23:04:35] (CR) Krinkle: CommonSettings: Move Beta-Cluster variant load into wmfLoadInitialiseSettings() (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/537219 (owner: Jforrester) [23:04:44] (CR) Krinkle: CommonSettings: Move Beta-Cluster variant load into wmfLoadInitialiseSettings() (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/537219 (owner: Jforrester) [23:05:15] (CR) Jforrester: CommonSettings: Move Beta-Cluster variant load into wmfLoadInitialiseSettings() (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/537219 (owner: Jforrester) [23:06:04] hmm, still seeing the old CSP - let's see if the FF debug extn is sending the headers [23:06:43] (CR) Krinkle: CommonSettings: Move Beta-Cluster variant load into wmfLoadInitialiseSettings() (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/537219 (owner: Jforrester) [23:07:07] ah, it wasn't [23:07:17] now sending the right headers, and I get the matching CSP [23:07:52] James_F: ok, the first patch one looks good. Can we do the other one too? [23:07:56] (CR) Krinkle: [C: -1] "Perhaps better if we remove use of that SiteConfiguration feature first." 
(1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/537218 (owner: Jforrester) [23:07:57] Sure. [23:08:01] Operations, Cassandra, RESTBase, RESTBase-Cassandra, and 2 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (Eevans) restbase2010 is ready to be reimaged. [23:08:09] (CR) Jforrester: [C: +2] CSP for banner preview: allow remind me later host [mediawiki-config] - https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: Ejegg) [23:09:48] (Merged) jenkins-bot: CSP for banner preview: allow remind me later host [mediawiki-config] - https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: Ejegg) [23:10:07] ejegg: Second patch should now be live, too. [23:10:15] looking [23:10:46] James_F: yep, i see it! [23:11:01] All good? [23:11:09] James_F: yep, looking correct [23:11:23] (CR) Dzahn: [C: +2] mariadb::ferm_misc: allow connections from gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/535966 (https://phabricator.wikimedia.org/T222391) (owner: Dzahn) [23:11:30] (PS2) Dzahn: mariadb::ferm_misc: allow connections from gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/535966 (https://phabricator.wikimedia.org/T222391) [23:12:32] Syncing. 
[23:13:23] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T225261 T194019 Adjust CentralNotice CSP for banner previews for FR-tech (duration: 00m 55s) [23:13:29] (PS3) Jforrester: Move global-dependent, invariant wgCopyUploadProxy to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/537205 [23:13:34] (CR) Jforrester: Move global-dependent, invariant wgCopyUploadProxy to CommonSettings (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/537205 (owner: Jforrester) [23:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:36] T225261: CentralNotice setting a surprising content security policy in production when using &banner= URL parameter - https://phabricator.wikimedia.org/T225261 [23:13:39] (CR) Jforrester: [C: +2] Move global-dependent, invariant wgCopyUploadProxy to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/537205 (owner: Jforrester) [23:15:45] James_F: great, it's looking good from non-debug backends too now [23:15:47] thanks! [23:15:56] ejegg: Any time. [23:19:06] (Merged) jenkins-bot: Move global-dependent, invariant wgCopyUploadProxy to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/537205 (owner: Jforrester) [23:21:45] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Set wgCopyUploadProxy in CS (duration: 00m 55s) [23:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:21] Operations, observability, serviceops, PHP 7.2 support, and 2 others: [Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm. - https://phabricator.wikimedia.org/T223336 (Krinkle) Open→Declined OK. I'm fine with this staying as it is. It's not really broken. I... 
[23:22:36] Operations, MediaWiki-Debug-Logger, Wikimedia-Logstash, observability, and 8 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (Krinkle) [23:23:31] (PS2) Jforrester: Move global-dependent, invariant wmgRC2UDPAddress to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/537206 [23:23:37] (CR) Jforrester: [C: +2] Move global-dependent, invariant wmgRC2UDPAddress to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/537206 (owner: Jforrester) [23:24:09] !log jforrester@deploy1001 Synchronized wmf-config/VariantSettings.php: Stop setting wgCopyUploadProxy in VS (duration: 00m 56s) [23:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:30] (PS4) Dzahn: gerrit: get LDAP servers from ldap_config, use ro server, simplify [puppet] - https://gerrit.wikimedia.org/r/536714 [23:25:35] (PS2) Jforrester: Move global-dependent, invariant wgUploadThumbnailRenderHttpCustom* to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/537207 [23:27:28] (CR) jerkins-bot: [V: -1] gerrit: get LDAP servers from ldap_config, use ro server, simplify [puppet] - https://gerrit.wikimedia.org/r/536714 (owner: Dzahn) [23:27:49] (Merged) jenkins-bot: Move global-dependent, invariant wmgRC2UDPAddress to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/537206 (owner: Jforrester) [23:28:47] (PS2) Jforrester: Move global-dependent, barely variant wgDebugLogFile to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/537211 [23:29:07] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Set wmgRC2UDPAddress in CS (duration: 00m 56s) [23:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:10] !log jforrester@deploy1001 Synchronized wmf-config/VariantSettings.php: Stop setting wmgRC2UDPAddress in VS (duration: 00m 55s) 
[23:30:41] jforrester@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [23:31:28] (CR) Jforrester: [C: +2] Move global-dependent, invariant wgUploadThumbnailRenderHttpCustom* to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/537207 (owner: Jforrester) [23:34:13] (PS5) Dzahn: gerrit: get LDAP servers from ldap_config, use ro server, simplify [puppet] - https://gerrit.wikimedia.org/r/536714 [23:36:14] (CR) jerkins-bot: [V: -1] gerrit: get LDAP servers from ldap_config, use ro server, simplify [puppet] - https://gerrit.wikimedia.org/r/536714 (owner: Dzahn) [23:39:53] (Merged) jenkins-bot: Move global-dependent, invariant wgUploadThumbnailRenderHttpCustom* to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/537207 (owner: Jforrester) [23:41:17] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Set wgUploadThumbnailRenderHttpCustom* in CS (duration: 00m 55s) [23:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:26] !log jforrester@deploy1001 Synchronized wmf-config/VariantSettings.php: Stop setting wgUploadThumbnailRenderHttpCustom* in VS (duration: 00m 54s) [23:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:14] (PS6) Dzahn: gerrit: get LDAP servers from ldap_config, use ro server, simplify [puppet] - https://gerrit.wikimedia.org/r/536714 [23:46:20] (CR) Jforrester: [C: +2] Move global-dependent, barely variant wgDebugLogFile to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/537211 (owner: Jforrester) [23:50:46] (Merged) jenkins-bot: Move global-dependent, barely variant wgDebugLogFile to CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/537211 (owner: Jforrester) [23:52:08] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Set wgDebugLogFile in CS (duration: 00m 55s) [23:52:18] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:16] !log jforrester@deploy1001 Synchronized wmf-config/VariantSettings.php: Stop setting wgDebugLogFile in VS (duration: 00m 55s) [23:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:14] (PS2) Jforrester: CommonSettings: Factor out call to InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/537218 [23:59:35] (CR) Jforrester: CommonSettings: Factor out call to InitialiseSettings.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/537218 (owner: Jforrester)