[00:00:35] (CR) Jforrester: CommonSettings: Move Beta-Cluster variant load into wmfLoadInitialiseSettings() (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/537219 (owner: Jforrester)
[00:07:52] (CR) Jforrester: "I plan to do this tomorrow." [mediawiki-config] - https://gerrit.wikimedia.org/r/533594 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[00:43:47] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:59:41] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/
[01:02:45] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 46 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/
[01:15:25] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/
[01:16:55] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 46 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/
[01:24:51] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 997.62 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[01:31:11] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/
[01:32:39] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 46 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/
[01:37:03] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:09] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:44:55] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:04:17] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/
[02:10:27] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 46 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/
[02:20:01] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/
[02:23:05] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 46 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/
[02:40:01] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:43:45] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/
[02:46:51] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 46 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/
[03:05:47] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[03:06:57] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:12:17] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/
[03:16:55] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 46 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/
[03:36:07] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/
[03:37:37] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 46 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/
[03:44:02] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/
[03:45:31] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 46 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/
[03:58:57] (PS1) 20after4: Set up scap target for deploying the phatality plugin into kibana [puppet] - https://gerrit.wikimedia.org/r/537240 (https://phabricator.wikimedia.org/T230752)
[03:59:33] (CR) jerkins-bot: [V: -1] Set up scap target for deploying the phatality plugin into kibana [puppet] - https://gerrit.wikimedia.org/r/537240 (https://phabricator.wikimedia.org/T230752) (owner: 20after4)
[03:59:56] (CR) 20after4: "This is one possible way to deploy the kibana plugin" [puppet] - https://gerrit.wikimedia.org/r/537240 (https://phabricator.wikimedia.org/T230752) (owner: 20after4)
[04:01:35] (PS2) 20after4: Set up scap target for deploying the phatality plugin into kibana [puppet] - https://gerrit.wikimedia.org/r/537240 (https://phabricator.wikimedia.org/T230752)
[04:02:11] (CR) jerkins-bot: [V: -1] Set up scap target for deploying the phatality plugin into kibana [puppet] - https://gerrit.wikimedia.org/r/537240 (https://phabricator.wikimedia.org/T230752) (owner: 20after4)
[04:03:21] (PS3) 20after4: Set up scap target for deploying the phatality plugin into kibana [puppet] - https://gerrit.wikimedia.org/r/537240 (https://phabricator.wikimedia.org/T230752)
[04:06:05] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/
[04:11:41] !log Start s2 pre-switchover steps T230785
[04:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:11:44] T230785: Switchover s2 primary database master db1066 -> db1122 - 17th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230785
[04:13:55] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 46 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/
[04:14:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1122 with weight 0 and depool it from API T230785', diff saved to https://phabricator.wikimedia.org/P9111 and previous config saved to /var/cache/conftool/dbconfig/20190917-041441-marostegui.json
[04:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:16:34] Operations, ops-eqiad, Thumbor, serviceops, User-jijiki: (OoW) thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (jijiki) I have disabled this check for now
[04:22:58] (PS2) Physikerwelt: Remove unused math config [mediawiki-config] - https://gerrit.wikimedia.org/r/537166 (https://phabricator.wikimedia.org/T228547)
[04:23:34] (PS3) Marostegui: mariadb: Promote db1122 as s2 primary master [puppet] - https://gerrit.wikimedia.org/r/535839 (https://phabricator.wikimedia.org/T230785)
[04:23:38] (CR) Marostegui: mariadb: Promote db1122 as s2 primary master [puppet] - https://gerrit.wikimedia.org/r/535839 (https://phabricator.wikimedia.org/T230785) (owner: Marostegui)
[04:24:36] (CR) Marostegui: [C: +2] mariadb: Promote db1122 as s2 primary master [puppet] - https://gerrit.wikimedia.org/r/535839 (https://phabricator.wikimedia.org/T230785) (owner: Marostegui)
[04:26:16] (PS1) Physikerwelt: Remove more unused math config [mediawiki-config] - https://gerrit.wikimedia.org/r/537241 (https://phabricator.wikimedia.org/T228547)
[04:28:24] (PS1) CRusnov: add customscripts directory for custom scripts [software/netbox-reports] - https://gerrit.wikimedia.org/r/537242
[04:29:09] (CR) jerkins-bot: [V: -1] add customscripts directory for custom scripts [software/netbox-reports] - https://gerrit.wikimedia.org/r/537242 (owner: CRusnov)
[04:31:48] (PS1) CRusnov: netbox: Add SCRIPTS_ROOT configuration [puppet] - https://gerrit.wikimedia.org/r/537243
[04:37:35] (CR) Marostegui: wmnet: Change s2 CNAME to db1122 [dns] - https://gerrit.wikimedia.org/r/535842 (https://phabricator.wikimedia.org/T230785) (owner: Marostegui)
[04:37:38] (PS3) Marostegui: wmnet: Change s2 CNAME to db1122 [dns] - https://gerrit.wikimedia.org/r/535842 (https://phabricator.wikimedia.org/T230785)
[04:43:09] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:53:39] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://phabricator.wikimedia.org/tag/wikimedia-blog/
[04:56:59] !log Downtiming HTTPS-blog on icing - T232412
[04:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:57:07] T232412: HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed - https://phabricator.wikimedia.org/T232412
[04:59:54] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (jcrespo)
[04:59:55] RECOVERY - HTTPS-blog on blog.wikimedia.org is OK: SSL OK - Certificate blog.wikimedia.org valid until 2019-11-02 07:45:52 +0000 (expires in 46 days) https://phabricator.wikimedia.org/tag/wikimedia-blog/
[05:00:04] marostegui and jynus: I, the Bot under the Fountain, allow thee, The Deployer, to do s2 database master failover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190917T0500).
[05:00:07] jynus: ready?
[05:00:11] yes
[05:00:14] !log Starting s2 failover from db1066 to db1122 - T230785
[05:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:17] T230785: Switchover s2 primary database master db1066 -> db1122 - 17th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230785
[05:00:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s2 as read-only for maintenance T230785', diff saved to https://phabricator.wikimedia.org/P9112 and previous config saved to /var/cache/conftool/dbconfig/20190917-050043-marostegui.json
[05:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:54] read only confirmed
[05:00:59] Warning: The database has been locked for maintenance
[05:01:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1122 to s2 master and remove read-only from s2 T230785', diff saved to https://phabricator.wikimedia.org/P9113 and previous config saved to /var/cache/conftool/dbconfig/20190917-050133-marostegui.json
[05:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:01:37] all done
[05:01:38] checking
[05:01:46] I can edit
[05:02:06] Utente:JCrespo (WMF)/Sandbox; 05:00
[05:02:10] Utente:JCrespo (WMF)/Sandbox; 05:01
[05:02:20] same here
[05:03:07] no errors on logs?
[05:03:12] I am checking
[05:03:24] I see some stuff with enwiktionary
[05:03:35] from jobrunner
[05:04:45] yeah, inject records from wikidata
[05:04:53] *injectrcrecords
[05:05:39] they have stopped already
[05:05:41] from what I can see
[05:06:26] yeah
[05:07:25] I think we are good
[05:08:03] (CR) Marostegui: [C: +2] wmnet: Change s2 CNAME to db1122 [dns] - https://gerrit.wikimedia.org/r/535842 (https://phabricator.wikimedia.org/T230785) (owner: Marostegui)
[05:10:44] Operations, DBA: Switchover s2 primary database master db1066 -> db1122 - 17th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230785 (Marostegui) This was done read-only start: 05:00:44 read-only stop: 05:01:34 Total read-only time: 50 seconds.
[05:16:50] Operations, DBA: Switchover s2 primary database master db1066 -> db1122 - 17th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230785 (Marostegui) Open→Resolved
[05:16:52] Operations, ops-eqiad, DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (Marostegui)
[05:16:55] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[05:18:10] Operations, DBA: Decommission db1066.eqiad.wmnet - https://phabricator.wikimedia.org/T233071 (Marostegui)
[05:18:46] Operations, DBA: Decommission db1066.eqiad.wmnet - https://phabricator.wikimedia.org/T233071 (Marostegui) p:Triage→Normal This host has been switchedover and it is not a master anymore, let's give it some days before decommissioning it.
[05:19:53] Operations, ops-eqiad, DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (Marostegui)
[05:20:43] Operations, ops-eqiad, DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (Marostegui)
[05:21:13] Operations, ops-eqiad, DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (Marostegui)
[05:22:12] Operations, ops-eqiad, DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (Marostegui) @elukey for labsdb1012 your Team would need to let us know if MySQL can be stopped for this maintenance (just in case there is powerloss, better to have MySQL stopped...
[05:23:02] !log disable puppet on mw* servers for 536979
[05:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:05] PROBLEM - Device not healthy -SMART- on db2067 is CRITICAL: cluster=mysql device=cciss,7 instance=db2067:9100 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2067&var-datasource=codfw+prometheus/ops
[05:25:43] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2067 is CRITICAL: cluster=mysql device=cciss,7 instance=db2067:9100 job=node site=codfw Marostegui Tracked at https://phabricator.wikimedia.org/T208323 https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2067&var-datasource=codfw+prometheus/ops
[05:26:34] (CR) Vgutierrez: [C: +2] Release 8.0.5-1wm7 [debs/trafficserver] - https://gerrit.wikimedia.org/r/536302 (https://phabricator.wikimedia.org/T232298) (owner: Vgutierrez)
[05:26:49] is db2067 on the list to be decommed?
[05:27:12] not yet
[05:27:34] that host will be decommissioned once we get the new codfw misc hosts that we'll accelerate in q2
[05:28:05] oh, so it is on the list, just not yet
[05:28:09] until 69
[05:28:31] yep, not yet
[05:28:33] according to T228258
[05:28:34] T228258: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258
[05:28:45] :)
[05:29:19] (CR) Effie Mouzeli: [C: +2] profile::mediawiki::common Enable systemd-coredump on all mw servers [puppet] - https://gerrit.wikimedia.org/r/536979 (owner: Effie Mouzeli)
[05:32:00] (PS3) Effie Mouzeli: profile::mediawiki::php: Add rlimit_core php.ini variable [puppet] - https://gerrit.wikimedia.org/r/536966
[05:32:50] thanks for the good work, manuel
[05:32:51] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:33:30] jynus: no, thank you for all the support and switchover.py!
[05:33:57] (PS4) Effie Mouzeli: profile::mediawiki::php: Add rlimit_core php.ini variable [puppet] - https://gerrit.wikimedia.org/r/536966
[05:34:21] (CR) Effie Mouzeli: "Merge parent first" [puppet] - https://gerrit.wikimedia.org/r/536979 (owner: Effie Mouzeli)
[05:34:55] (PS1) Marostegui: db1066: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/537246 (https://phabricator.wikimedia.org/T233071)
[05:36:19] (CR) Marostegui: [C: +2] db1066: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/537246 (https://phabricator.wikimedia.org/T233071) (owner: Marostegui)
[05:36:29] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:36:57] (CR) Effie Mouzeli: [C: +2] profile::mediawiki::php: Add rlimit_core php.ini variable [puppet] - https://gerrit.wikimedia.org/r/536966 (owner: Effie Mouzeli)
[05:37:06] (PS5) Effie Mouzeli: profile::mediawiki::php: Add rlimit_core php.ini variable [puppet] - https://gerrit.wikimedia.org/r/536966
[05:37:15] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:37:44] Operations, DBA, Patch-For-Review: Decommission db1066.eqiad.wmnet - https://phabricator.wikimedia.org/T233071 (Marostegui)
[05:46:39] (CR) Vgutierrez: "We need to resume this.. sadly my plans for ATS on text have been delayed for several reasons, so this looks like the easiest way to get r" [puppet] - https://gerrit.wikimedia.org/r/532973 (https://phabricator.wikimedia.org/T155359) (owner: Dzahn)
[05:47:38] (PS1) Marostegui: db2103: Candidate master for s1 [puppet] - https://gerrit.wikimedia.org/r/537248
[05:47:57] (PS2) Marostegui: db2103: Candidate master for s1 [puppet] - https://gerrit.wikimedia.org/r/537248
[05:49:05] (CR) Marostegui: [C: +2] db2103: Candidate master for s1 [puppet] - https://gerrit.wikimedia.org/r/537248 (owner: Marostegui)
[05:52:40] (PS1) Marostegui: report_users: Add dbproxy1021 IP [software] - https://gerrit.wikimedia.org/r/537249
[05:53:22] (CR) Marostegui: [C: +2] report_users: Add dbproxy1021 IP [software] - https://gerrit.wikimedia.org/r/537249 (owner: Marostegui)
[05:53:51] Operations, ops-eqiad, DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (elukey) >>! In T227142#5497901, @Marostegui wrote: > @elukey for labsdb1012 your Team would need to let us know if MySQL can be stopped for this maintenance (just in case there i...
[05:56:10] Operations, ops-eqiad, DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (elukey)
[05:56:22] !log uploaded trafficserver 8.0.5-1wm7 to apt.wikimedia.org (stretch) - T232298 T232724
[05:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:26] T232724: ATS SSL session cache doesn't work - https://phabricator.wikimedia.org/T232724
[05:56:27] T232298: Investigate segfaults on ats-tls running on cp5001 - https://phabricator.wikimedia.org/T232298
[05:57:01] Operations, ops-eqiad, DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (elukey) I also added the info about analytics hosts and flipped the requirement of depooling for memcached to "no", since we should do it only if things go on fire :)
[05:57:16] !log upgrading ATS to 8.0.5-1wm7 on cp2002 and cp4021 - T232724
[05:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:59] Operations, ops-eqiad, DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (Marostegui) >>! In T227142#5497928, @elukey wrote: >>>! In T227142#5497901, @Marostegui wrote: >> @elukey for labsdb1012 your Team would need to let us know if MySQL can be stopp...
[05:58:15] Operations, ops-eqiad, DC-Ops: b1-eqiad pdu refresh (Thursday 10/10 @11am UTC) - https://phabricator.wikimedia.org/T227536 (elukey)
[05:58:30] Operations, ops-eqiad, DC-Ops: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538 (elukey)
[05:58:38] Operations, ops-eqiad, DC-Ops: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538 (elukey)
[06:00:40] Operations, ops-eqiad, DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (elukey)
[06:01:13] Operations, ops-eqiad, DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (elukey)
[06:02:21] Operations, ops-eqiad, DC-Ops: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) - https://phabricator.wikimedia.org/T227540 (elukey)
[06:02:47] Operations, Traffic: Tune ATS SSL session cache - https://phabricator.wikimedia.org/T231849 (Vgutierrez)
[06:02:50] Operations, Traffic: ATS SSL session cache doesn't work - https://phabricator.wikimedia.org/T232724 (Vgutierrez) Open→Resolved Issue looks solved after upgrading to 8.0.5-1wm7: ` vgutierrez@cp4021:~$ openssl s_client -connect 127.0.0.1:443 -reconnect < /dev/null 2>1 |egrep -i "reconnect|reused" d...
[06:02:53] Operations, Traffic: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (Vgutierrez)
[06:02:55] Operations, ops-eqiad, DC-Ops: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC) - https://phabricator.wikimedia.org/T227541 (elukey)
[06:03:25] Operations, ops-eqiad, DC-Ops: b7-eqiad pdu refresh (Tuesday 11/5 @11am UTC) - https://phabricator.wikimedia.org/T227542 (elukey)
[06:04:04] Operations, ops-eqiad, DC-Ops: b8-eqiad pdu refresh (Thursday 10/31 @11am UTC) - https://phabricator.wikimedia.org/T227543 (elukey)
[06:04:18] (PS1) Effie Mouzeli: Revert "profile::mediawiki::php: Add rlimit_core php.ini variable" [puppet] - https://gerrit.wikimedia.org/r/537251
[06:04:18] spam ended :)
[06:06:23] (PS2) Effie Mouzeli: Revert "profile::mediawiki::php: Add rlimit_core php.ini variable" [puppet] - https://gerrit.wikimedia.org/r/537251
[06:08:23] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:08:47] (CR) Effie Mouzeli: [C: +2] Revert "profile::mediawiki::php: Add rlimit_core php.ini variable" [puppet] - https://gerrit.wikimedia.org/r/537251 (owner: Effie Mouzeli)
[06:10:25] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:14:09] (PS1) Effie Mouzeli: profile::mediawiki::php: Add rlimit_core php.ini variable [puppet] - https://gerrit.wikimedia.org/r/537254
[06:15:50] the OSPF alarm is related to the Telia transport circuit between eqiad and codfw, there is maintenance in the calendar
[06:15:59] should be ok
[06:19:22] Operations, Core Platform Team: Apache mod_status aggregator - https://phabricator.wikimedia.org/T233047 (Joe) @ori I'm not 100% sure I got what information you think would be useful to extract. At first glance it would seem like collecting those data in a structured manner on logstash would be useful, b...
[06:20:01] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:20:37] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:23:07] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:24:21] (CR) Effie Mouzeli: [V: +1] "https://puppet-compiler.wmflabs.org/compiler1002/18327/mw1222.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/537254 (owner: Effie Mouzeli)
[06:24:23] (PS1) Vgutierrez: Release 8.0.5-1wm8 [debs/trafficserver] - https://gerrit.wikimedia.org/r/537258 (https://phabricator.wikimedia.org/T231849)
[06:27:15] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:27:41] (PS7) Effie Mouzeli: profile::mediawiki::common Enable systemd-coredump on all mw servers [puppet] - https://gerrit.wikimedia.org/r/536979
[06:32:23] (CR) Effie Mouzeli: [C: +2] profile::mediawiki::common Enable systemd-coredump on all mw servers [puppet] - https://gerrit.wikimedia.org/r/536979 (owner: Effie Mouzeli)
[06:36:45] Operations, DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[06:38:28] (CR) Vgutierrez: [C: +2] Release 8.0.5-1wm8 [debs/trafficserver] - https://gerrit.wikimedia.org/r/537258 (https://phabricator.wikimedia.org/T231849) (owner: Vgutierrez)
[06:42:47] (03CR) 10Vgutierrez: "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1002/18326/" [puppet] - 10https://gerrit.wikimedia.org/r/532973 (https://phabricator.wikimedia.org/T155359) (owner: 10Dzahn) [06:43:08] (03CR) 10Giuseppe Lavagetto: [C: 03+1] profile::mediawiki::php: Add rlimit_core php.ini variable [puppet] - 10https://gerrit.wikimedia.org/r/537254 (owner: 10Effie Mouzeli) [06:49:33] !log reimage restbase2010 to Stretch T224553 [06:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:41] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [06:53:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1066 T233071', diff saved to https://phabricator.wikimedia.org/P9114 and previous config saved to /var/cache/conftool/dbconfig/20190917-065342-marostegui.json [06:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:46] T233071: Decommission db1066.eqiad.wmnet - https://phabricator.wikimedia.org/T233071 [06:55:15] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:55:27] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:55:38] !log uploaded trafficserver 8.0.5-1wm8 to apt.wikimedia.org (stretch) - T231849 [06:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:41] T231849: Tune ATS SSL session cache - https://phabricator.wikimedia.org/T231849 [06:55:45] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:56:03] !log upgrading ATS to 8.0.5-1wm8 on cp2002, cp4021 and cp5001 - T231849 [06:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:49] RECOVERY - OSPF status on 
cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:57:01] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:57:19] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:59:24] 10Operations, 10ops-eqiad, 10DC-Ops: b8-eqiad pdu refresh (Thursday 10/31 @11am UTC) - https://phabricator.wikimedia.org/T227543 (10akosiaris) [07:07:44] (03Abandoned) 10Muehlenhoff: Remove ferm rules for labpuppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/537129 (owner: 10Muehlenhoff) [07:09:41] (03CR) 10Muehlenhoff: "The labpuppetmaster* hosts are not being replaced with hardware in production, the puppet masters for Cloud VPS are now running inside Clo" [puppet] - 10https://gerrit.wikimedia.org/r/537132 (owner: 10Muehlenhoff) [07:10:45] !log getting rid of wikibase TLS certificate & nginx configuration on the text cache cluster - T99531 [07:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:48] T99531: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 [07:14:58] 10Operations, 10netops: Review firewall rules for labpuppetmaster1001/labpuppetmaster1002 removal - https://phabricator.wikimedia.org/T233075 (10MoritzMuehlenhoff) [07:15:01] (03CR) 10Vgutierrez: [C: 03+2] ATS/acme_chief/varnish: remove wikiba.se [puppet] - 10https://gerrit.wikimedia.org/r/532973 (https://phabricator.wikimedia.org/T155359) (owner: 10Dzahn) [07:15:10] (03PS2) 10Vgutierrez: ATS/acme_chief/varnish: remove wikiba.se [puppet] - 10https://gerrit.wikimedia.org/r/532973 (https://phabricator.wikimedia.org/T155359) (owner: 10Dzahn) [07:19:53] !log depooling cp5007 to ensure that wikibase removal goes as expected - T99531 [07:19:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:56] T99531: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 [07:20:58] (03PS2) 10Muehlenhoff: restbase2010: Add JBOD hiera config for upcoming reimage [puppet] - 10https://gerrit.wikimedia.org/r/536565 [07:21:07] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [07:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:02] (03CR) 10Muehlenhoff: [C: 03+2] restbase2010: Add JBOD hiera config for upcoming reimage [puppet] - 10https://gerrit.wikimedia.org/r/536565 (owner: 10Muehlenhoff) [07:23:06] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:54] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+2] profile::mediawiki::php: Add rlimit_core php.ini variable [puppet] - 10https://gerrit.wikimedia.org/r/537254 (owner: 10Effie Mouzeli) [07:26:02] (03PS2) 10Effie Mouzeli: profile::mediawiki::php: Add rlimit_core php.ini variable [puppet] - 10https://gerrit.wikimedia.org/r/537254 [07:29:36] !log repooling cp5007 without wikibase configuration - T99531 [07:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:39] T99531: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 [07:38:12] (03PS1) 10Marostegui: mariadb: Decommission db1063 [puppet] - 10https://gerrit.wikimedia.org/r/537316 (https://phabricator.wikimedia.org/T232564) [07:39:26] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db1063 [puppet] - 10https://gerrit.wikimedia.org/r/537316 (https://phabricator.wikimedia.org/T232564) (owner: 10Marostegui) [07:39:59] 10Operations, 10DBA, 10Patch-For-Review: Decommission db1063.eqiad.wmnet - https://phabricator.wikimedia.org/T232564 (10Marostegui) [07:40:15] !log Remove db1063 from puppet and zarcillo T232564 [07:40:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:18] T232564: Decommission db1063.eqiad.wmnet - https://phabricator.wikimedia.org/T232564 [07:41:11] !log Stop mysql on db1063 for decommissioning T232564 [07:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:43] (03CR) 10Mobrovac: [C: 04-1] "LGTM, but can't go out before I0aa7b624b645c3773b0d758634d370a494d1028a is deployed everywhere in production." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537166 (https://phabricator.wikimedia.org/T228547) (owner: 10Physikerwelt) [07:42:39] !log reboot analytics-tool1004 (host running superset) for kernel updates [07:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:00] (03CR) 10Muehlenhoff: "The daily account check flagged this patch: You're now using the same key in Cloud VPS and production and that's insecure because Cloud VP" [puppet] - 10https://gerrit.wikimedia.org/r/537201 (owner: 10Bstorm) [07:43:10] PROBLEM - HTTPS wikibase RSA on cp5008 is CRITICAL: SSL CRITICAL - failed to verify wikiba.se against *.wikipedia.org, *.m.mediawiki.org, *.m.wikibooks.org, *.m.wikidata.org, *.m.wikimedia.org, *.m.wikinews.org, *.m.wikipedia.org, *.m.wikiquote.org, *.m.wikisource.org, *.m.wikiversity.org, *.m.wikivoyage.org, *.m.wiktionary.org, *.mediawiki.org, *.planet.wikimedia.org, *.wikibooks.org, *.wikidata.org, *.wikimedia.org, *.wikimedia [07:43:10] .wikinews.org, *.wikiquote.org, *.wikisource.org, *.wikiversity.org, *.wikivoyage.org, *.wiktionary.org, *.wmfusercontent.org, mediawiki.org, w.wiki, wikibooks.org, wikidata.org, wikimedia.org, wikimediafoundation.org, wikinews.org, wikiquote.org, wikisource.org, wikiversity.org, wikivoyage.org, wiktionary.org, wmfusercontent.org, wikipedia.org:Certificate *.wikipedia.org SAN wikiba.se not found in cert SAN list: *.wikipedia.org, [07:43:10] g, *.m.wikibooks.org, *.m.wikidata.org, *.m.wikimedia.org, *.m.wikinews.org, *.m.wikipedia.org, 
*.m.wikiquote.org, *.m.wikisource.org, *.m.wikiversity.org, *.m.wikivoyage.org, *.m.wiktionary.org, *.mediawiki.org, *.planet.wikimedia.org, *.wikibooks.org, *.wikidata.org, *.wikimedia.org, *.wikimediafoundation.org, *.wikinews.org, *.wikiquote.org, *.wikisource.org, *.wikiversity.org, *.wikivoyage.org, *.wiktionary.org, *.wmfusercont [07:43:10] i.org, w.wiki, wikibooks.org, wikidata.org, wikimedia.org, wikimediafoundation.org, wikinews.org, wikiquote.org, wikisource.org, wikiversity.org, wikivoyage.org, wiktionary.org, wmfusercontent.org, wikipedia.org:Certificate *.wikipedia.org SAN www.wikiba.se not found in cert SAN list: *.wikipedia.org, *.m.mediawiki.org, *.m.wikibooks.org, *.m.wikidata.org, *.m.wikimedia.org, *.m.wikinews.org, *.m.wikipedia.org, *.m.wikiquote.org, [07:43:10] rg, *.m.wikiversity.org, *.m.wikivoyage.org, *.m.wiktionary.org, *.mediawiki.org, *.planet.wikimedia.org, *.wikibooks.org, *.wikidata.org, *.wikimedia.org, *.wikimediafoundation.org, *.wikinews.org, *.wikiquote.org, *.wikisource.org, *.wikiversity.org, *.wikivoyage.org, *.wiktionary.org, *.wmfusercontent.org, mediawiki.org, w.wiki, wikibooks.org, wikidata.org, wikimedia.org, wikimediafoundation.org, wikinews.org, wikiquote.org, w [07:43:11] kiversity.org, wikivoyage.org, wiktionary.org, wmfusercontent.org, wikipedia.org https://wikitech.wikimedia.org/wiki/HTTPS [07:43:23] vgutierrez: is that you? 
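[editor's note] The alert above runs across several messages because the check output lists every SAN on the certificate. As a rough illustration of the shortening proposed later in this log (07:47:50, "failed to verify wikiba.se against *.wikipedia.org and 202 other domains"), here is a minimal Python sketch. It is not the check_ssl implementation; the function name `summarize_sans` and its `shown` parameter are hypothetical and exist only for this example.

```python
def summarize_sans(target, sans, shown=1):
    """Build a short failure message that names only the first `shown` SANs.

    Hypothetical helper for illustration; not part of check_ssl.
    """
    if not sans:
        return f"failed to verify {target}: certificate lists no SANs"
    head = ", ".join(sans[:shown])
    rest = len(sans) - shown
    if rest <= 0:
        # Everything fits, so there is nothing to count.
        return f"failed to verify {target} against {head}"
    noun = "domain" if rest == 1 else "domains"
    return f"failed to verify {target} against {head} and {rest} other {noun}"


if __name__ == "__main__":
    # A shortened SAN list taken from the alert text; the real one is much longer.
    sans = ["*.wikipedia.org", "*.m.mediawiki.org", "*.m.wikibooks.org", "wikipedia.org"]
    print(summarize_sans("wikiba.se", sans))
```

With the four-entry list above this prints "failed to verify wikiba.se against *.wikipedia.org and 3 other domains", which fits in a single IRC message. The full SAN list would still be available from the certificate itself.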
[07:43:24] ^ [07:43:33] :) that's me [07:43:37] roger [07:43:40] PROBLEM - HTTPS wikibase ECDSA on cp5008 is CRITICAL: SSL CRITICAL - failed to verify wikiba.se against *.wikipedia.org, wikimedia.org, mediawiki.org, wikibooks.org, wikidata.org, wikinews.org, wikiquote.org, wikisource.org, wikiversity.org, wikivoyage.org, wiktionary.org, wikimediafoundation.org, w.wiki, wmfusercontent.org, *.m.wikipedia.org, *.wikimedia.org, *.m.wikimedia.org, *.planet.wikimedia.org, *.mediawiki.org, *.m.mediaw [07:43:40] oks.org, *.m.wikibooks.org, *.wikidata.org, *.m.wikidata.org, *.wikinews.org, *.m.wikinews.org, *.wikiquote.org, *.m.wikiquote.org, *.wikisource.org, *.m.wikisource.org, *.wikiversity.org, *.m.wikiversity.org, *.wikivoyage.org, *.m.wikivoyage.org, *.wiktionary.org, *.m.wiktionary.org, *.wikimediafoundation.org, *.wmfusercontent.org, wikipedia.org:Certificate *.wikipedia.org SAN wikiba.se not found in cert SAN list: *.wikipedia.or [07:43:40] mediawiki.org, wikibooks.org, wikidata.org, wikinews.org, wikiquote.org, wikisource.org, wikiversity.org, wikivoyage.org, wiktionary.org, wikimediafoundation.org, w.wiki, wmfusercontent.org, *.m.wikipedia.org, *.wikimedia.org, *.m.wikimedia.org, *.planet.wikimedia.org, *.mediawiki.org, *.m.mediawiki.org, *.wikibooks.org, *.m.wikibooks.org, *.wikidata.org, *.m.wikidata.org, *.wikinews.org, *.m.wikinews.org, *.wikiquote.org, *.m.w [07:43:40] ikisource.org, *.m.wikisource.org, *.wikiversity.org, *.m.wikiversity.org, *.wikivoyage.org, *.m.wikivoyage.org, *.wiktionary.org, *.m.wiktionary.org, *.wikimediafoundation.org, *.wmfusercontent.org, wikipedia.org:Certificate *.wikipedia.org SAN www.wikiba.se not found in cert SAN list: *.wikipedia.org, wikimedia.org, mediawiki.org, wikibooks.org, wikidata.org, wikinews.org, wikiquote.org, wikisource.org, wikiversity.org, wikivoy [07:43:40] ry.org, wikimediafoundation.org, w.wiki, wmfusercontent.org, *.m.wikipedia.org, *.wikimedia.org, *.m.wikimedia.org, *.planet.wikimedia.org, 
*.mediawiki.org, *.m.mediawiki.org, *.wikibooks.org, *.m.wikibooks.org, *.wikidata.org, *.m.wikidata.org, *.wikinews.org, *.m.wikinews.org, *.wikiquote.org, *.m.wikiquote.org, *.wikisource.org, *.m.wikisource.org, *.wikiversity.org, *.m.wikiversity.org, *.wikivoyage.org, *.m.wikivoyage.org, * [07:43:41] *.m.wiktionary.org, *.wikimediafoundation.org, *.wmfusercontent.org, wikipedia.org https://wikitech.wikimedia.org/wiki/HTTPS [07:44:02] and it's expected [07:44:11] let me downtime the HTTPS wikibase check everywhere [07:45:30] hahaha I love it, coming christmas we should totally color the asterisks :D [07:46:11] 10Operations, 10DBA, 10Patch-For-Review: Decommission db1063.eqiad.wmnet - https://phabricator.wikimedia.org/T232564 (10Marostegui) [07:46:20] LOL :) [07:47:50] assuming it is easy, maybe consider truncating/just counting the list of domains on the alert [07:48:11] !log Enable puppet on mw* [07:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission db1063.eqiad.wmnet - https://phabricator.wikimedia.org/T232564 (10Marostegui) a:05Marostegui→03RobH [07:48:24] eg. 
"failed to verify wikiba.se against *.wikipedia.org and 202 other domains" [07:48:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission db1063.eqiad.wmnet - https://phabricator.wikimedia.org/T232564 (10Marostegui) Host ready for #dc-ops to decommission [07:48:49] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [07:50:20] (03CR) 10Filippo Giunchedi: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/535818 (owner: 10Jcrespo) [07:51:48] jynus: that's check_ssl output AFAIK [07:53:20] (03CR) 10Filippo Giunchedi: "+ Cole and Keith to see what they think as well" [puppet] - 10https://gerrit.wikimedia.org/r/537240 (https://phabricator.wikimedia.org/T230752) (owner: 1020after4) [07:58:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1074 with just 50 to keep its warmness level just in case T231638', diff saved to https://phabricator.wikimedia.org/P9115 and previous config saved to /var/cache/conftool/dbconfig/20190917-075807-marostegui.json [07:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:11] T231638: db1074 crashed: Broken BBU - https://phabricator.wikimedia.org/T231638 [07:59:51] 10Operations, 10ops-eqiad, 10DBA: db1074 crashed: Broken BBU - https://phabricator.wikimedia.org/T231638 (10Marostegui) This host original weight was 200 in main traffic and 1 in API. I have only pooled it with weight 50 on main traffic, just to get it to do something. 
[08:02:47] (03PS1) 10Marostegui: db1074: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/537319 (https://phabricator.wikimedia.org/T231638) [08:03:20] 10Operations, 10decommission: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (10elukey) [08:03:29] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1074 crashed: Broken BBU - https://phabricator.wikimedia.org/T231638 (10Marostegui) The BBU showed up again (usual behaviour with a broken BBU) ` root@db1074:~# hpssacli controller all show status Smart Array P840 in Slot 1 Controller Status: OK... [08:04:27] (03CR) 10Marostegui: [C: 03+2] db1074: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/537319 (https://phabricator.wikimedia.org/T231638) (owner: 10Marostegui) [08:04:52] 10Operations, 10decommission: Decommission analytics1032 - https://phabricator.wikimedia.org/T233080 (10elukey) [08:08:20] (03PS1) 10Elukey: Prepare analytics1032 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/537321 (https://phabricator.wikimedia.org/T233080) [08:09:11] (03PS3) 10Mathew.onipe: wdqs: setup new logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/535345 (https://phabricator.wikimedia.org/T232184) [08:09:13] (03PS1) 10Mathew.onipe: wdqs: switch test cluster to new logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/537322 (https://phabricator.wikimedia.org/T232184) [08:09:43] (03CR) 10Gergő Tisza: "Thanks for the rebase!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/536732 (https://phabricator.wikimedia.org/T230031) (owner: 10Gergő Tisza) [08:11:12] (03CR) 10Mathew.onipe: wdqs: setup new logging pipeline (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/535345 (https://phabricator.wikimedia.org/T232184) (owner: 10Mathew.onipe) [08:11:29] (03PS1) 10Jcrespo: Add partman recipe equal to raid1-lvm-ext4-srv but with an additional hwraid [puppet] - 10https://gerrit.wikimedia.org/r/537325 (https://phabricator.wikimedia.org/T229209) [08:11:32] (03CR) 10jerkins-bot: [V: 04-1] wdqs: setup new logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/535345 (https://phabricator.wikimedia.org/T232184) (owner: 10Mathew.onipe) [08:12:44] (03PS1) 10Effie Mouzeli: 100% of anonymous users via PHP7.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537328 (https://phabricator.wikimedia.org/T219150) [08:13:05] (03CR) 10Elukey: "Looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/537321 (https://phabricator.wikimedia.org/T233080) (owner: 10Elukey) [08:15:19] (03PS4) 10Mathew.onipe: wdqs: setup new logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/535345 (https://phabricator.wikimedia.org/T232184) [08:15:21] (03PS2) 10Mathew.onipe: wdqs: switch test cluster to new logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/537322 (https://phabricator.wikimedia.org/T232184) [08:15:58] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add partman recipe equal to raid1-lvm-ext4-srv but with an additional hwraid [puppet] - 10https://gerrit.wikimedia.org/r/537325 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [08:16:02] (03CR) 10jenkins-bot: Set MinimumPasswordLengthToLogin to 10 for all prived groups, not just +staff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534707 (owner: 10Jforrester) [08:16:04] (03CR) 10jenkins-bot: Variant configuration: Read JSON config for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535963 
(https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [08:16:06] (03CR) 10jenkins-bot: Clean up globals in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537169 (owner: 10Lucas Werkmeister (WMDE)) [08:16:08] (03CR) 10jenkins-bot: Stop setting wgCookieSetOnAutoBlock and wgCookieSetOnIpBlock to the default; never varied [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534698 (https://phabricator.wikimedia.org/T191922) (owner: 10Jforrester) [08:16:10] (03CR) 10jenkins-bot: InitialiseSettings: Use __DIR__ rather than global wmfConfgDir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537192 (owner: 10Jforrester) [08:16:12] (03CR) 10jenkins-bot: Make banner-preview CSP match normal CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527183 (https://phabricator.wikimedia.org/T225261) (owner: 10Ejegg) [08:16:14] (03CR) 10jenkins-bot: CSP for banner preview: allow remind me later host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: 10Ejegg) [08:16:25] (03CR) 10jenkins-bot: Move global-dependent, invariant wgCopyUploadProxy to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537205 (owner: 10Jforrester) [08:18:23] (03CR) 10jenkins-bot: Move global-dependent, invariant wmgRC2UDPAddress to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537206 (owner: 10Jforrester) [08:20:18] (03CR) 10Mathew.onipe: "PCC is happy: https://puppet-compiler.wmflabs.org/compiler1001/18330/" [puppet] - 10https://gerrit.wikimedia.org/r/535345 (https://phabricator.wikimedia.org/T232184) (owner: 10Mathew.onipe) [08:22:58] (03CR) 10Mathew.onipe: "PCC is still happy: https://puppet-compiler.wmflabs.org/compiler1001/18331/wdqs1010.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/537322 (https://phabricator.wikimedia.org/T232184) (owner: 10Mathew.onipe) [08:23:51] (03CR) 10jenkins-bot: Move global-dependent, invariant 
wgUploadThumbnailRenderHttpCustom* to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537207 (owner: 10Jforrester) [08:23:53] (03CR) 10jenkins-bot: Move global-dependent, barely variant wgDebugLogFile to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537211 (owner: 10Jforrester) [08:26:02] 10Operations, 10ops-eqiad, 10DC-Ops: b8-eqiad pdu refresh (Thursday 10/31 @11am UTC) - https://phabricator.wikimedia.org/T227543 (10ArielGlenn) [08:26:32] (03PS2) 10Jcrespo: Add partman recipe equal to raid1-lvm-ext4-srv but with an additional hwraid [puppet] - 10https://gerrit.wikimedia.org/r/537325 (https://phabricator.wikimedia.org/T229209) [08:28:17] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (10MoritzMuehlenhoff) [08:28:34] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (10MoritzMuehlenhoff) restbase2010 has been reimaged and is ready to be bootstrapped in Cassandra. [08:28:37] (03Abandoned) 10Gehel: maps: block 9db.jp from maps [puppet] - 10https://gerrit.wikimedia.org/r/536568 (owner: 10Jbond) [08:30:24] (03CR) 10Gehel: [C: 04-1] query_service: rename wdqs module to query_service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/537008 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [08:31:34] (03CR) 10Jcrespo: [C: 03+2] "Implied alex ok on IRC." [puppet] - 10https://gerrit.wikimedia.org/r/537325 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [08:35:50] (03CR) 10Gehel: "This looks mostly good. Was this tested somewhere already?" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535345 (https://phabricator.wikimedia.org/T232184) (owner: 10Mathew.onipe) [08:37:09] !log upgrading ATS to 8.0.5-1wm8 on cp3034 - T231849 T232724 [08:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:14] T232724: ATS SSL session cache doesn't work - https://phabricator.wikimedia.org/T232724 [08:37:14] T231849: Tune ATS SSL session cache - https://phabricator.wikimedia.org/T231849 [08:37:45] (03CR) 10Effie Mouzeli: [C: 04-1] "Just a minor change" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/536975 (https://phabricator.wikimedia.org/T231569) (owner: 10Mobrovac) [08:44:44] (03PS2) 10Mobrovac: Parsoid: Change the beta port to 8001 to avoid conflict with PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/536975 (https://phabricator.wikimedia.org/T231569) [08:45:42] (03CR) 10Mobrovac: Parsoid: Change the beta port to 8001 to avoid conflict with PHP7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536975 (https://phabricator.wikimedia.org/T231569) (owner: 10Mobrovac) [08:46:09] effie: updated ^ [08:46:15] :D [08:46:28] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "let's go!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537328 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [08:47:34] 10Operations, 10Release-Engineering-Team, 10observability: Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089 (10fgiunchedi) [08:48:44] (03CR) 10Effie Mouzeli: [C: 03+2] "LGTM https://puppet-compiler.wmflabs.org/compiler1002/18333/wtp1026.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/536975 (https://phabricator.wikimedia.org/T231569) (owner: 10Mobrovac) [08:52:51] 10Operations, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10Ladsgroup) >>! 
In T219150#5498222, @gerritbot wrote: > Change 537328 had a related pat... [08:53:15] (03CR) 10Effie Mouzeli: [C: 03+2] Ad-hoc Cassandra clusters for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/496192 (owner: 10Eevans) [08:53:31] (03PS6) 10Effie Mouzeli: Ad-hoc Cassandra clusters for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/496192 (owner: 10Eevans) [08:53:49] (03CR) 10Jcrespo: "Sep 17 08:48:46 debconf: --> SUBST grub-installer/progress/step_install_loader BOOTDEV /dev/sda /dev/sdb" [puppet] - 10https://gerrit.wikimedia.org/r/537325 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [08:54:11] 10Operations, 10Release-Engineering-Team, 10observability: Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089 (10fgiunchedi) [08:59:37] (03CR) 10Effie Mouzeli: [C: 03+2] 100% of anonymous users via PHP7.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537328 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [09:00:32] (03Merged) 10jenkins-bot: 100% of anonymous users via PHP7.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537328 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [09:00:47] (03CR) 10jenkins-bot: 100% of anonymous users via PHP7.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537328 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [09:01:00] <_joe_> effie: now that I think of it, it's 100% of users [09:01:05] <_joe_> including non-anonymous ones [09:01:22] minor typo [09:01:26] <_joe_> but oh well :P [09:01:36] <_joe_> yeah don't get stuck on this [09:01:55] haha [09:01:56] Kudos for the 100%!!! 
[09:02:08] it is not in prod yet :) [09:02:10] (03PS1) 10Jcrespo: install_server: Update comment on partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/537336 (https://phabricator.wikimedia.org/T229209) [09:02:20] <_joe_> Daimona: this is 100% of clients that accept a cookie so about 50% of the traffic [09:02:43] Isn't it an important milestone anyway? :) [09:02:56] <_joe_> oh don't get me wrong, it is [09:03:07] <_joe_> I was just trying to clean up misconceptions :) [09:03:14] \o/ [09:03:58] Sure :) And thanks for your work! [09:04:02] <_joe_> next step is to send people to php7 by default, including people without cookies. I expect worse surprises when we get there. [09:04:20] <_joe_> but that's going to be next week I guess :P [09:05:23] !log jiji@deploy1001 Synchronized wmf-config/CommonSettings.php: Push PHP7 traffic to 100% of users who accept cookies - T219150 (duration: 00m 57s) [09:05:37] _joe_ ^ there, fixed it [09:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:38] T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 [09:05:40] :p [09:05:49] <_joe_> effie: :D [09:06:14] heh [09:08:17] (03CR) 10Joal: [C: 03+1] "Thanks ottomata :)" [debs/python-ua-parser] (debian) - 10https://gerrit.wikimedia.org/r/537123 (owner: 10Ottomata) [09:15:59] !log jynus@cumin2001 START - Cookbook sre.hosts.downtime [09:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:24] !log bootstrap restbase2010-a - T224553 [09:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:27] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [09:18:39] !log jynus@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:00] does that mean we might drop 
HHVM support in time for the 1.34.0 release? [09:23:42] PROBLEM - cassandra-b CQL 10.192.16.187:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.187 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [09:23:52] PROBLEM - cassandra-a CQL 10.192.16.186:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.186 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [09:23:56] PROBLEM - cassandra-b service on restbase2010 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:23:58] PROBLEM - cassandra-c SSL 10.192.16.188:7001 on restbase2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [09:24:31] mobrovac: do you have access to downtime those alerts? [09:24:34] PROBLEM - cassandra-b SSL 10.192.16.187:7001 on restbase2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [09:26:28] PROBLEM - cassandra-c CQL 10.192.16.188:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.188 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [09:26:36] that's from the reimage, silencing [09:26:57] yeah [09:26:59] the host gets recreated by Icinga and this removes the previous downtime [09:27:22] that sounds like a capital crime [09:27:35] :p [09:28:27] (03CR) 10Elukey: [C: 03+2] Prepare analytics1032 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/537321 (https://phabricator.wikimedia.org/T233080) (owner: 10Elukey) [09:28:35] (03PS2) 10Elukey: Prepare analytics1032 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/537321 (https://phabricator.wikimedia.org/T233080) [09:29:02] !log Downtime db1073 db1130 db1104 db1085 db1086 for the PDU maintenance T227539 [09:29:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:05] T227539: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 [09:29:25] ACKNOWLEDGEMENT - cassandra-a CQL 10.192.16.186:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.186 and port 9042: Connection refused Muehlenhoff reimage / bootstrap https://phabricator.wikimedia.org/T93886 [09:29:25] ACKNOWLEDGEMENT - cassandra-b CQL 10.192.16.187:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.187 and port 9042: Connection refused Muehlenhoff reimage / bootstrap https://phabricator.wikimedia.org/T93886 [09:29:25] ACKNOWLEDGEMENT - cassandra-b SSL 10.192.16.187:7001 on restbase2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Muehlenhoff reimage / bootstrap https://phabricator.wikimedia.org/T120662 [09:29:25] ACKNOWLEDGEMENT - cassandra-b service on restbase2010 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive Muehlenhoff reimage / bootstrap https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:29:25] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.16.188:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.188 and port 9042: Connection refused Muehlenhoff reimage / bootstrap https://phabricator.wikimedia.org/T93886 [09:29:25] (03Abandoned) 10Jbond: maps: block goeshape [puppet] - 10https://gerrit.wikimedia.org/r/536570 (owner: 10Jbond) [09:29:26] ACKNOWLEDGEMENT - cassandra-c SSL 10.192.16.188:7001 on restbase2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Muehlenhoff reimage / bootstrap https://phabricator.wikimedia.org/T120662 [09:29:26] ACKNOWLEDGEMENT - cassandra-c service on restbase2010 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive Muehlenhoff reimage / bootstrap https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:30:10] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 
10https://gerrit.wikimedia.org/r/531808 (owner: 10Muehlenhoff) [09:30:13] !log Restarting CI jenkins [09:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:59] 10Operations, 10Release Pipeline, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 3 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10akosiaris) [09:38:20] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/537243 (owner: 10CRusnov) [09:38:34] (03PS5) 10Mathew.onipe: wdqs: setup new logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/535345 (https://phabricator.wikimedia.org/T232184) [09:38:36] (03PS3) 10Mathew.onipe: wdqs: switch test cluster to new logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/537322 (https://phabricator.wikimedia.org/T232184) [09:41:36] (03CR) 10Mathew.onipe: "PCC is Ok: https://puppet-compiler.wmflabs.org/compiler1002/18334/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535345 (https://phabricator.wikimedia.org/T232184) (owner: 10Mathew.onipe) [09:46:09] !log Depool and stop replication on db1130 db1104 db1085 db1086 (lag will appear on s6 on labsdb) for PDU maintenance - T227539 [09:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:12] T227539: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 [09:48:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool and stop replication on db1130 db1104 db1085 db1086 (lag will appear on s6 on labsdb) for PDU maintenance - T227539', diff saved to https://phabricator.wikimedia.org/P9116 and previous config saved to /var/cache/conftool/dbconfig/20190917-094827-marostegui.json [09:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:32] 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - 
https://phabricator.wikimedia.org/T227539 (10Marostegui) All the DBs have been downtimed, depooled and replication has been stopped. From the DBAs point of view, this maintenance is good to go. [09:56:14] 10Operations, 10decommission, 10Patch-For-Review: Decommission analytics1032 - https://phabricator.wikimedia.org/T233080 (10elukey) [09:56:23] 10Operations, 10decommission, 10Patch-For-Review: Decommission analytics1032 - https://phabricator.wikimedia.org/T233080 (10elukey) p:05Triage→03Normal a:03RobH [10:05:39] PROBLEM - MariaDB Slave Lag: s6 on db1125 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 749.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [10:06:55] ^ that is me [10:06:57] downtiming it [10:19:28] (03PS1) 10Jbond: elnath: add AAAA records [dns] - 10https://gerrit.wikimedia.org/r/537354 [10:19:53] (03CR) 10jerkins-bot: [V: 04-1] elnath: add AAAA records [dns] - 10https://gerrit.wikimedia.org/r/537354 (owner: 10Jbond) [10:22:44] (03PS2) 10Jbond: elnath: add AAAA records [dns] - 10https://gerrit.wikimedia.org/r/537354 [10:23:44] 10Operations, 10Dumps-Generation, 10hardware-requests: Get a third dumpsdata server - https://phabricator.wikimedia.org/T219768 (10ArielGlenn) I guess by the closure of the subtask that the server has arrived? What's the outlook for getting it racked? 
[10:24:51] (03CR) 10Jbond: [C: 03+2] elnath: add AAAA records [dns] - 10https://gerrit.wikimedia.org/r/537354 (owner: 10Jbond) [10:37:59] (03PS1) 10Filippo Giunchedi: ci: add statsd_exporter for zuul/gerrit [puppet] - 10https://gerrit.wikimedia.org/r/537362 (https://phabricator.wikimedia.org/T233089) [10:39:09] (03PS1) 10Abijeet Patro: Add channels for the Translate and TranslationsNotification extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537363 (https://phabricator.wikimedia.org/T221119) [10:39:23] (03PS1) 10Volans: puppetdb: fix postgres user on new hosts [puppet] - 10https://gerrit.wikimedia.org/r/537364 (https://phabricator.wikimedia.org/T230609) [10:40:06] (03PS2) 10Abijeet Patro: Add channels for the Translate and TranslationsNotification extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537363 (https://phabricator.wikimedia.org/T221119) [10:42:14] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/18335/" [puppet] - 10https://gerrit.wikimedia.org/r/537362 (https://phabricator.wikimedia.org/T233089) (owner: 10Filippo Giunchedi) [10:42:42] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp1076 [puppet] - 10https://gerrit.wikimedia.org/r/537365 (https://phabricator.wikimedia.org/T231433) [10:42:46] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to 4443 on cp1076 [puppet] - 10https://gerrit.wikimedia.org/r/537366 (https://phabricator.wikimedia.org/T231433) [10:43:22] (03PS2) 10Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp1076 [puppet] - 10https://gerrit.wikimedia.org/r/537366 (https://phabricator.wikimedia.org/T231433) [10:44:13] !log replacing nginx with ATS in cp1076 (upload cluster) - T231433 [10:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:16] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 [10:45:03] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 
443 to 4443 on cp1076 [puppet] - 10https://gerrit.wikimedia.org/r/537365 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [10:45:11] (03PS2) 10Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp1076 [puppet] - 10https://gerrit.wikimedia.org/r/537365 (https://phabricator.wikimedia.org/T231433) [10:45:15] (03CR) 10Filippo Giunchedi: "Please let me know what you think! I've written the mappings by looking at metrics emitted by zuul but should be correct. Also note that t" [puppet] - 10https://gerrit.wikimedia.org/r/537362 (https://phabricator.wikimedia.org/T233089) (owner: 10Filippo Giunchedi) [10:45:50] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: eqiad: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (10elukey) @RobH let me know if I can help with the host repurpose (also with the codfw one), I can take care of the DNS/puppet/DHCP/etc.. steps :) [10:50:47] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to 443 on cp1076 [puppet] - 10https://gerrit.wikimedia.org/r/537366 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [10:50:55] (03PS3) 10Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp1076 [puppet] - 10https://gerrit.wikimedia.org/r/537366 (https://phabricator.wikimedia.org/T231433) [10:53:57] PROBLEM - HTTPS Unified RSA on cp1076 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [10:54:09] ^^ expected [10:55:07] RECOVERY - HTTPS Unified RSA on cp1076 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345563 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 65 days) https://wikitech.wikimedia.org/wiki/HTTPS [10:58:24] !log bootstrap restbase2010-b - T224553 [10:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:27] 
T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: How many deployers does it take to do European Mid-day SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190917T1100). [11:00:04] kostajh, tgr, and awight: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:13] hi jouncebot [11:00:40] (03PS5) 10Alexandros Kosiaris: Adds git::lfs class and include it respectively [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [11:01:14] o/ I can deploy my patch when the time comes. I'll start the backport merge now, if there are no objections? [11:02:26] o/ [11:03:25] 10Operations, 10ops-eqiad, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10elukey) Today I tested `Redirection after boot` set to enabled on an-coord1001's bios but I didn't resolve the problem, the mgmt console is not available. G... [11:06:16] is anyone SWATting? [11:06:18] awight: do you want to deploy the others as well? [11:06:27] sure! 
[11:06:32] good timing Urbanecm :) [11:07:01] * Urbanecm waves to tgr [11:07:26] (03CR) 10Awight: [C: 03+2] Enable EditorJourney for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537092 (https://phabricator.wikimedia.org/T232061) (owner: 10Kosta Harlan) [11:07:37] (03PS3) 10Awight: Enable EditorJourney for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537092 (https://phabricator.wikimedia.org/T232061) (owner: 10Kosta Harlan) [11:07:51] (03CR) 10Awight: [C: 03+2] Enable EditorJourney for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537092 (https://phabricator.wikimedia.org/T232061) (owner: 10Kosta Harlan) [11:08:24] RECOVERY - cassandra-a CQL 10.192.16.186:9042 on restbase2010 is OK: TCP OK - 0.036 second response time on 10.192.16.186 port 9042 https://phabricator.wikimedia.org/T93886 [11:08:29] (03CR) 10Muehlenhoff: "Thanks, I've reset the pg_hba.conf on puppetdb[12]002, giving this a shot" [puppet] - 10https://gerrit.wikimedia.org/r/537364 (https://phabricator.wikimedia.org/T230609) (owner: 10Volans) [11:08:35] (03PS2) 10Muehlenhoff: puppetdb: fix postgres user on new hosts [puppet] - 10https://gerrit.wikimedia.org/r/537364 (https://phabricator.wikimedia.org/T230609) (owner: 10Volans) [11:08:46] (03Merged) 10jenkins-bot: Enable EditorJourney for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537092 (https://phabricator.wikimedia.org/T232061) (owner: 10Kosta Harlan) [11:09:08] awight: I can verify the patch before it goes live, just let me know [11:09:18] (03CR) 10jenkins-bot: Enable EditorJourney for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537092 (https://phabricator.wikimedia.org/T232061) (owner: 10Kosta Harlan) [11:10:41] (03PS3) 10Urbanecm: Add channels for the Translate and TranslationsNotification extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537363 (https://phabricator.wikimedia.org/T221119) (owner: 10Abijeet Patro) [11:11:07] (03CR) 10Muehlenhoff: [C: 03+2] 
puppetdb: fix postgres user on new hosts [puppet] - 10https://gerrit.wikimedia.org/r/537364 (https://phabricator.wikimedia.org/T230609) (owner: 10Volans) [11:12:15] awight: as yesterday, once you're done, please let me know, I'll add my patches :) [11:13:15] 10Puppet, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Require git-lfs in ORES hosts - https://phabricator.wikimedia.org/T232494 (10akosiaris) @Halfak, I 've uploaded a different proposal in a new PS. Already makes jenkins happier. [11:13:18] kostajh: The editor journey config is ready to test on mwdebug1002 [11:13:26] * kostajh looks [11:13:50] !log Run mwscript emptyUserGroup.php --wiki=aawiki 'inactive' (T150538) [11:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:53] T150538: Cleanup 'inactive' usergroup on aawiki - https://phabricator.wikimedia.org/T150538 [11:16:08] awight: looks good [11:16:17] great, deploying [11:17:02] (03PS4) 10Awight: Update ORES filter threshold configuration for new huwiki model [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536732 (https://phabricator.wikimedia.org/T230031) (owner: 10Gergő Tisza) [11:17:16] !log awight@deploy1001 Synchronized wmf-config/VariantSettings.php: SWAT: [[gerrit:537092|Enable EditorJourney for euwiki (T232061)]] (duration: 00m 56s) [11:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:22] (03CR) 10Awight: [C: 03+2] Update ORES filter threshold configuration for new huwiki model [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536732 (https://phabricator.wikimedia.org/T230031) (owner: 10Gergő Tisza) [11:17:24] T232061: Deploy EditorJourney to Basque Wikipedia - https://phabricator.wikimedia.org/T232061 [11:18:20] (03Merged) 10jenkins-bot: Update ORES filter threshold configuration for new huwiki model [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536732 (https://phabricator.wikimedia.org/T230031) (owner: 10Gergő Tisza) [11:18:22] RECOVERY 
- cassandra-b SSL 10.192.16.187:7001 on restbase2010 is OK: SSL OK - Certificate restbase2010-b valid until 2020-06-24 13:01:56 +0000 (expires in 281 days) https://phabricator.wikimedia.org/T120662 [11:19:21] (03CR) 10jenkins-bot: Update ORES filter threshold configuration for new huwiki model [mediawiki-config] - 10https://gerrit.wikimedia.org/r/536732 (https://phabricator.wikimedia.org/T230031) (owner: 10Gergő Tisza) [11:19:24] oops! we lost tgr [11:19:34] hey good timing, again! [11:19:46] tgr: huwiki ORES config is live on 1002 [11:20:16] thanks! sorry, my IRC server has become a bit unstable [11:20:18] RECOVERY - cassandra-b service on restbase2010 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:21:00] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: site=esams https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:21:24] awight: works [11:21:30] ack [11:22:36] !log awight@deploy1001 Synchronized wmf-config/VariantSettings.php: SWAT: [[gerrit:536732|Update ORES filter threshold configuration for new huwiki model (T230031)]] (duration: 00m 55s) [11:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:39] T230031: Update ORES filter thresholds for huwiki - https://phabricator.wikimedia.org/T230031 [11:23:20] awight: is config safe to touch now, or should I wait? [11:23:42] thanks! [11:24:00] !log commencing pdu swap rack b3 eqiad T227539 [11:24:01] Urbanecm: It's safe. 
My last act is to push a backport for extensions/FileImporter [11:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:03] T227539: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 [11:24:10] ack, thanks awight [11:24:48] (03CR) 10Urbanecm: [C: 03+2] Add channels for the Translate and TranslationsNotification extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537363 (https://phabricator.wikimedia.org/T221119) (owner: 10Abijeet Patro) [11:26:13] (03PS4) 10Urbanecm: Add channels for the Translate and TranslationsNotification extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537363 (https://phabricator.wikimedia.org/T221119) (owner: 10Abijeet Patro) [11:26:19] (03CR) 10Urbanecm: [C: 03+2] Add channels for the Translate and TranslationsNotification extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537363 (https://phabricator.wikimedia.org/T221119) (owner: 10Abijeet Patro) [11:27:22] (03Merged) 10jenkins-bot: Add channels for the Translate and TranslationsNotification extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537363 (https://phabricator.wikimedia.org/T221119) (owner: 10Abijeet Patro) [11:27:23] !log awight@deploy1001 Synchronized php-1.34.0-wmf.22/extensions/FileImporter: SWAT: [[gerrit:537345|Use https rather than protcol-relative remote API URLs (T228851)]] (duration: 00m 58s) [11:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:26] T228851: Source wiki editing and deletion always fails - https://phabricator.wikimedia.org/T228851 [11:27:38] (03CR) 10jenkins-bot: Add channels for the Translate and TranslationsNotification extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537363 (https://phabricator.wikimedia.org/T221119) (owner: 10Abijeet Patro) [11:27:57] awight: what is the untracked file on deploy1001? [11:27:59] is that yours? [11:28:35] noo [11:28:50] ok. 
It shouldn't affect my deployment anyway, so I'm going to sync [11:29:52] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [11:29:53] !log dzahn@cumin1001 END (ERROR) - Cookbook sre.hosts.ipmi-password-reset (exit_code=97) [11:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:11] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [11:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:45] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.ipmi-password-reset (exit_code=99) [11:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:59] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [11:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:17] !log urbanecm@deploy1001 Synchronized wmf-config/VariantSettings.php: SWAT: 290e207: Add channels for the Translate and TranslationsNotification extension (T221119, T144780, T143073) (duration: 00m 56s) [11:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:23] T143073: Fatal error: Argument 1 passed to MessageHandle::__construct() must be an instance of Title, null given - https://phabricator.wikimedia.org/T143073 [11:31:24] T221119: Translate error - "This namespace is reserved for content page translations" - https://phabricator.wikimedia.org/T221119 [11:31:24] PROBLEM - Host ps1-b3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [11:31:24] T144780: Translation Notification Bot sending the same message multiple times to every translator - https://phabricator.wikimedia.org/T144780 [11:31:24] PROBLEM - Host elastic1037.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:31:42] RECOVERY - Host ps1-b3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.14 ms [11:31:53] Urbanecm: I'm done with SWAT, thanks! 
[11:31:58] !log dzahn@cumin1001 Updating IPMI password on 8 hosts - dzahn@cumin1001 [11:31:59] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [11:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:00] PROBLEM - Host ores.wmflabs.org is DOWN: CRITICAL - Host Unreachable (ores.wmflabs.org) [11:32:00] thanks awight [11:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:36] PROBLEM - Juniper alarms on asw2-b-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [11:32:49] !log EU SWAT is done [11:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:16] PROBLEM - Host cloudvirt1027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:34:10] RECOVERY - Juniper alarms on asw2-b-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [11:36:28] RECOVERY - Host elastic1037.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.97 ms [11:44:48] RECOVERY - Host cloudvirt1027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [11:46:00] RECOVERY - Host ores.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 3.00 ms [11:46:04] PROBLEM - snapshot of s4 in codfw on db1115 is CRITICAL: snapshot for s4 at codfw taken more than 4 days ago: Most recent backup 2019-09-13 11:38:04 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [11:51:18] PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [11:51:51] did someone touch icinga config?
[11:52:02] I will handle the backup alert after lunch [11:54:00] PROBLEM - Host ps1-b3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [11:54:32] PROBLEM - Juniper alarms on asw2-b-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [11:54:44] PROBLEM - IPMI Sensor Status on stat1007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:56:39] I'm re-running a few checks to minimise alerts [11:56:55] there is also the work on pdus ongoing [11:58:32] PROBLEM - IPMI Sensor Status on elastic1036 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 2 = Critical, Power Supplies = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:59:16] (03PS1) 10Awight: NowCommons test & test2wiki configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537375 (https://phabricator.wikimedia.org/T228851) [12:04:34] PROBLEM - Host ms-be1023 is DOWN: PING CRITICAL - Packet loss = 100% [12:07:00] RECOVERY - Host ms-be1023 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [12:07:04] RECOVERY - Juniper alarms on asw2-b-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [12:07:26] 1023 is unrelated to the rack with the PDU maintenance, having a look [12:08:50] mortizm it is in B3 [12:09:32] that was us...the power was not seated correctly and went down. 
[12:12:01] ack [12:14:36] RECOVERY - cassandra-b CQL 10.192.16.187:9042 on restbase2010 is OK: TCP OK - 3.081 second response time on 10.192.16.187 port 9042 https://phabricator.wikimedia.org/T93886 [12:15:17] 10Operations, 10Release Pipeline, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10mobrovac) [12:25:04] 10Operations, 10Mail, 10WMF-Communications: Updating DNS records - https://phabricator.wikimedia.org/T231387 (10mark) What's the status of this? Is this done and working? [12:27:24] RECOVERY - IPMI Sensor Status on stat1007 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [12:27:28] 10Operations, 10ops-eqiad, 10decommission, 10User-fgiunchedi: Return graphite100[13] to spares pool (or decom) - https://phabricator.wikimedia.org/T209357 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr please wipe these especially 1001 to make some space for ms-be servers [12:29:04] RECOVERY - IPMI Sensor Status on elastic1036 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [12:30:41] !log bootstrap restbase2010-c - T224553 [12:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:45] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [12:31:26] RECOVERY - cassandra-c SSL 10.192.16.188:7001 on restbase2010 is OK: SSL OK - Certificate restbase2010-c valid until 2020-06-24 13:01:57 +0000 (expires in 281 days) https://phabricator.wikimedia.org/T120662 [12:40:58] mhhh a bunch of UNKNOWN for puppet failed runs too, I'm taking a look [12:49:20] got an unexpected puppet diff on icinga1001 but config seems ok now [12:49:26] e.g. 
[12:49:27] - check_command check_graphite_threshold!http://graphite-in.eqiad.wmnet!10!sumSeries(transformNull(perSecond(carbon.relays.graphite*_local.destinations.*.dropped)))!25!100!5minutes!0min!80!--over [12:49:31] + check_command check_graphite_threshold!https://graphite.wikimedia.org!10!sumSeries(transformNull(perSecond(carbon.relays.graphite*_local.destinations.*.dropped)))!25!100!5minutes!0min!80!--over [12:49:35] - parents asw-c-eqiad [12:49:36] etc [12:49:39] + parents asw2-c-eqiad [12:53:49] still invalid config though [12:53:50] Error: Could not find any hostgroup matching 'asw-b-eqiad' (config file '/etc/icinga/objects/puppet_hosts.cfg', starting on line 23958) [12:54:20] godog: related to the PDU maintenance somehow? [12:55:06] marostegui: yeah I think that's likely, it might be due to missing puppet runs [12:56:45] or I'm not so sure anymore actually [13:00:55] mhhh puppet is flapping configs, which might indicate puppet master and/or puppetdb discrepancies [13:01:21] !log The PDU swap in rack B3 eqiad is finished. [13:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:41] jbond42: in the offchance ^ is related to your puppetmaster1003 testing ? [13:02:55] !log Start replication on db1130 db1104 db1085 db1086 after PDU maintenance is completed - T227539 [13:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:58] T227539: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 [13:03:08] looking [13:04:44] jbond42: I'm shooting a little from the hip now though :) might not be that [13:06:48] godog: puppet1003 was temporarily broken so anything pointing at that would have failed its puppet run. i have fixed that now and ran puppet on icinga and everything looks healthy [13:06:56] is there anything else that may still be broken?
[13:07:23] icinga config on icinga1001 still fails to validate for me [13:08:10] I guess it might fix itself once hosts have completed running puppet again [13:08:44] RECOVERY - MariaDB Slave Lag: s6 on db1125 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [13:09:51] e.g. wtp1034 now fails because references asw-b-eqiad not asw2-b-eqiad, I'm manually running puppet in there and see if that fixes it [13:10:01] Is it OK to deploy cxserver (just join here, and saw something is down). @godog ? [13:10:46] kart_: thanks for checking, please hold for a little while unless urgent [13:11:01] oh did a dependency get changed? in that case i think we may need to wait 30 mins for all exported resources to get updated [13:11:25] godog: OK. [13:11:35] godog: not urgent, can be done tomorrow too. [13:12:07] jbond42: yeah, the hostgroups in this case [13:12:24] jbond42: what was the testing on puppetmaster1003 btw? please remember to !log [13:14:04] godog: running octocatalog-diff from elnath. 
a patch to auth.conf was denying genuine hosts [13:14:05] it's missing hostgroup matching 'asw-b-eqiad' [13:14:19] !log currently running octocatalog-diff for all hosts from elnath [13:14:20] re: icinga config [13:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:04] ack jbond42 [13:15:48] kart_: you can go ahead btw [13:15:57] but yeah in max half an hour we should be back [13:16:27] I'm tempted to force run puppet in eqiad tho [13:17:45] !log force-run puppet in eqiad to update exported resources [13:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1130 db1104 db1085 db1086 after PDU maintenance - T227539', diff saved to https://phabricator.wikimedia.org/P9117 and previous config saved to /var/cache/conftool/dbconfig/20190917-132102-marostegui.json [13:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:12] T227539: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 [13:23:17] (03PS2) 10Jhedden: tools-prometheus: add toolsdb mariadb metrics [puppet] - 10https://gerrit.wikimedia.org/r/537203 (https://phabricator.wikimedia.org/T220530) [13:25:52] godog: thanks!
[13:26:23] (03CR) 10Jhedden: [C: 03+2] tools-prometheus: add toolsdb mariadb metrics [puppet] - 10https://gerrit.wikimedia.org/r/537203 (https://phabricator.wikimedia.org/T220530) (owner: 10Jhedden) [13:26:36] (03PS2) 10KartikMistry: Update cxserver to 2019-09-16-152511-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/537145 (https://phabricator.wikimedia.org/T224721) [13:29:42] (03CR) 10KartikMistry: [V: 03+2 C: 03+2] Update cxserver to 2019-09-16-152511-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/537145 (https://phabricator.wikimedia.org/T224721) (owner: 10KartikMistry) [13:30:02] (03CR) 10Halfak: "I don't see how this is going to work for our misc nodes. What role would you expect us to apply?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [13:32:30] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:34:35] (03Abandoned) 10BBlack: Add wikiba.se to HSTS regex [puppet] - 10https://gerrit.wikimedia.org/r/500472 (https://phabricator.wikimedia.org/T213705) (owner: 10BBlack) [13:34:40] (03Abandoned) 10BBlack: Add wikiba.se to HTTPS redirect regex [puppet] - 10https://gerrit.wikimedia.org/r/500473 (https://phabricator.wikimedia.org/T213705) (owner: 10BBlack) [13:35:08] (03PS8) 10BBlack: anycast recdns: enable globally [puppet] - 10https://gerrit.wikimedia.org/r/528525 (https://phabricator.wikimedia.org/T228190) [13:35:22] (03CR) 10Muehlenhoff: [C: 04-1] "Approach is fine, but see error inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/536714 (owner: 10Dzahn) [13:36:13] (03CR) 10Giuseppe Lavagetto: [C: 03+1] LVS for MW: Remove RunCommand checks [puppet] - 10https://gerrit.wikimedia.org/r/536581 (https://phabricator.wikimedia.org/T111899) (owner: 10BBlack) [13:38:20] (03PS1) 10Jhedden: tools-prometheus: Update toolsdb target file extension [puppet] - 10https://gerrit.wikimedia.org/r/537418 (https://phabricator.wikimedia.org/T220530) [13:38:45] (03PS2) 10BBlack: LVS for MW: Remove RunCommand checks [puppet] - 10https://gerrit.wikimedia.org/r/536581 (https://phabricator.wikimedia.org/T111899) [13:41:58] (03PS1) 10Halfak: Switches ores.wmflabs monitoring to use new ores-web-(04,05,06) [puppet] - 10https://gerrit.wikimedia.org/r/537420 [13:42:52] RECOVERY - cassandra-c CQL 10.192.16.188:9042 on restbase2010 is OK: TCP OK - 3.079 second response time on 10.192.16.188 port 9042 https://phabricator.wikimedia.org/T93886 [13:43:53] (03PS2) 10Halfak: Switches ores.wmflabs monitoring to use new ores-web-(04,05,06) [puppet] - 10https://gerrit.wikimedia.org/r/537420 (https://phabricator.wikimedia.org/T232228) [13:44:59] (03PS2) 10Elukey: sre.hadoop.reboot-workers.py: reboot hosts in a batch in parallel [cookbooks] - 
10https://gerrit.wikimedia.org/r/537105 (https://phabricator.wikimedia.org/T225297) [13:45:43] !log repooling restbase2010 after reimage/completed bootstrap [13:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:04] (03CR) 10BBlack: [C: 03+2] LVS for MW: Remove RunCommand checks [puppet] - 10https://gerrit.wikimedia.org/r/536581 (https://phabricator.wikimedia.org/T111899) (owner: 10BBlack) [13:46:51] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [13:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:11] (03CR) 10Elukey: [C: 03+2] sre.hadoop.reboot-workers.py: reboot hosts in a batch in parallel [cookbooks] - 10https://gerrit.wikimedia.org/r/537105 (https://phabricator.wikimedia.org/T225297) (owner: 10Elukey) [13:50:20] (03Merged) 10jenkins-bot: sre.hadoop.reboot-workers.py: reboot hosts in a batch in parallel [cookbooks] - 10https://gerrit.wikimedia.org/r/537105 (https://phabricator.wikimedia.org/T225297) (owner: 10Elukey) [13:51:06] (03PS1) 10Herron: kafka-main1003: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/537424 [13:52:02] !log lvs2006 - restart pybal to remove runcommands - https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/536581/ [13:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:05] (03CR) 10Awight: Adds git::lfs class and include it respectively (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [13:52:18] !log lvs1016 - restart pybal to remove runcommands - https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/536581/ [13:52:20] (03CR) 10Jhedden: [C: 03+2] tools-prometheus: Update toolsdb target file extension [puppet] - 10https://gerrit.wikimedia.org/r/537418 (https://phabricator.wikimedia.org/T220530) (owner: 10Jhedden) [13:52:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:28] (03PS2) 10Jhedden: tools-prometheus: Update toolsdb target file extension [puppet] - 10https://gerrit.wikimedia.org/r/537418 (https://phabricator.wikimedia.org/T220530) [13:54:48] (03PS1) 10Herron: kafka-main: replace kafka1003 hardware with kafka-main1003 [puppet] - 10https://gerrit.wikimedia.org/r/537428 (https://phabricator.wikimedia.org/T225005) [13:56:39] (03PS2) 10Herron: kafka-main1003: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/537424 [13:57:56] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' . [13:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:48] (03CR) 10Herron: [C: 03+2] kafka-main1003: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/537424 (owner: 10Herron) [13:59:43] !log lvs2003 - restart pybal to remove runcommands - https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/536581/ [13:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:29] !log lvs1015 - restart pybal to remove runcommands - https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/536581/ [14:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:56] !log forcing puppet run [14:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:43] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10netops: Numerous people reporting issues saving edits and viewing previews/diffs - https://phabricator.wikimedia.org/T232491 (10Steel1943) Pretty sure this is resolved now... 
[14:03:27] !log migrating kafka1003 to kafka-main1003 T225005 [14:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:30] T225005: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 [14:05:56] (03PS2) 10Herron: kafka-main: replace kafka1003 hardware with kafka-main1003 [puppet] - 10https://gerrit.wikimedia.org/r/537428 (https://phabricator.wikimedia.org/T225005) [14:06:47] (03CR) 10Herron: [C: 03+2] kafka-main: replace kafka1003 hardware with kafka-main1003 [puppet] - 10https://gerrit.wikimedia.org/r/537428 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [14:07:50] (03PS1) 10KartikMistry: Add templatemapping to cxserver config [deployment-charts] - 10https://gerrit.wikimedia.org/r/537432 (https://phabricator.wikimedia.org/T224721) [14:08:47] (03PS1) 10Awight: FileImporter: limited default deployment (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537433 (https://phabricator.wikimedia.org/T232539) [14:08:49] (03PS1) 10Awight: FileImporter: limited default deployment (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537434 (https://phabricator.wikimedia.org/T232539) [14:09:54] (03PS2) 10KartikMistry: Add templatemapping to cxserver config [deployment-charts] - 10https://gerrit.wikimedia.org/r/537432 (https://phabricator.wikimedia.org/T224721) [14:10:10] (03CR) 10jerkins-bot: [V: 04-1] FileImporter: limited default deployment (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537433 (https://phabricator.wikimedia.org/T232539) (owner: 10Awight) [14:10:43] (03PS1) 10Elukey: sre.hadoop.reboot-workers.py: add RemoteError exception handling [cookbooks] - 10https://gerrit.wikimedia.org/r/537437 (https://phabricator.wikimedia.org/T225297) [14:10:59] (03CR) 10KartikMistry: [V: 03+2 C: 03+2] Add templatemapping to cxserver config [deployment-charts] - 10https://gerrit.wikimedia.org/r/537432 
(https://phabricator.wikimedia.org/T224721) (owner: 10KartikMistry) [14:11:11] (03PS3) 10KartikMistry: Add templatemapping to cxserver config [deployment-charts] - 10https://gerrit.wikimedia.org/r/537432 (https://phabricator.wikimedia.org/T224721) [14:11:20] (03CR) 10KartikMistry: [V: 03+2 C: 03+2] Add templatemapping to cxserver config [deployment-charts] - 10https://gerrit.wikimedia.org/r/537432 (https://phabricator.wikimedia.org/T224721) (owner: 10KartikMistry) [14:13:40] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [14:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:16] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' . [14:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:43] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' . [14:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:49] (03PS9) 10BBlack: anycast recdns: enable globally [puppet] - 10https://gerrit.wikimedia.org/r/528525 (https://phabricator.wikimedia.org/T228190) [14:18:24] (03CR) 10BBlack: [C: 03+2] anycast recdns: enable globally [puppet] - 10https://gerrit.wikimedia.org/r/528525 (https://phabricator.wikimedia.org/T228190) (owner: 10BBlack) [14:20:06] RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [14:21:44] wikibugs: come back! 
:) [14:22:03] bblack: I'm not sure it works that way :-P [14:22:13] (03CR) 10Elukey: [C: 03+2] sre.hadoop.reboot-workers.py: add RemoteError exception handling [cookbooks] - 10https://gerrit.wikimedia.org/r/537437 (https://phabricator.wikimedia.org/T225297) (owner: 10Elukey) [14:22:54] 10Operations, 10Traffic: Deprecate pybal SSH health checks - https://phabricator.wikimedia.org/T111899 (10BBlack) Still TODO here before resolving: remove the ferm puppetization on the MW hosts that was allowing LVS ssh access [14:23:14] akosiaris: If you see this message, see if I've not broke config of cxserver :) [14:23:34] Specially, https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/537432/ [14:23:43] continuous human integration :) [14:23:54] (03Merged) 10jenkins-bot: sre.hadoop.reboot-workers.py: add RemoteError exception handling [cookbooks] - 10https://gerrit.wikimedia.org/r/537437 (https://phabricator.wikimedia.org/T225297) (owner: 10Elukey) [14:26:54] PROBLEM - Check if active EventStreams endpoint is delivering messages. on icinga1001 is CRITICAL: CRITICAL: No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [14:27:25] 10Operations, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10Jdforrester-WMF) [14:28:09] herron: o/ - just received an alert of eventstreams not processing messages [14:28:23] ah yes it is also in here [14:28:34] I am wondering if it is due to the kafka config after the migration of kafka1003 [14:28:37] was it changed? [14:29:03] which kafka config? [14:29:24] the one that eventstreams uses [14:29:34] IIRC something happened a while ago for codfw as well [14:29:35] elukey: see mw-sec, is this possibly puppet lag? 
[14:30:05] ah you guys are already on it [14:30:46] (03PS1) 10Muehlenhoff: Remove ferm rules for Pybal health checks [puppet] - 10https://gerrit.wikimedia.org/r/537446 (https://phabricator.wikimedia.org/T111899) [14:34:07] (03PS1) 10BBlack: dbproxy1019: remove custom nameserver config [puppet] - 10https://gerrit.wikimedia.org/r/537448 (https://phabricator.wikimedia.org/T228190) [14:35:50] !log bouncing eventstreams service on scb hosts [14:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:01] (03CR) 10BBlack: [C: 03+2] dbproxy1019: remove custom nameserver config [puppet] - 10https://gerrit.wikimedia.org/r/537448 (https://phabricator.wikimedia.org/T228190) (owner: 10BBlack) [14:37:09] (03CR) 10Jcrespo: "-2 this is not the right way to setup mariadb metrics." [puppet] - 10https://gerrit.wikimedia.org/r/537203 (https://phabricator.wikimedia.org/T220530) (owner: 10Jhedden) [14:37:37] (03PS1) 10Jcrespo: Revert "tools-prometheus: add toolsdb mariadb metrics" [puppet] - 10https://gerrit.wikimedia.org/r/537450 [14:38:02] (03CR) 10jerkins-bot: [V: 04-1] Revert "tools-prometheus: add toolsdb mariadb metrics" [puppet] - 10https://gerrit.wikimedia.org/r/537450 (owner: 10Jcrespo) [14:39:33] !log anomie@deploy1001 Synchronized php-1.34.0-wmf.22/includes/MergeHistory.php: Backport MergeHistory fix for T232464 [[gerrit:537436]] (duration: 00m 54s) [14:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:38] T232464: Getting InvalidArgumentException when running a query on the API - https://phabricator.wikimedia.org/T232464 [14:40:02] (03CR) 10BBlack: [C: 03+1] Remove ferm rules for Pybal health checks [puppet] - 10https://gerrit.wikimedia.org/r/537446 (https://phabricator.wikimedia.org/T111899) (owner: 10Muehlenhoff) [14:47:18] (03PS1) 10Jbond: IPMI: maintain current password during resets [cookbooks] - 10https://gerrit.wikimedia.org/r/537455 (https://phabricator.wikimedia.org/T147074) [14:48:29] !log 
anomie@mwmaint1002 Running cleanupRevActorPage.php on test wikis and mediawikiwiki for T232464 [14:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:33] T232464: Getting InvalidArgumentException when running a query on the API - https://phabricator.wikimedia.org/T232464 [14:49:03] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring - https://phabricator.wikimedia.org/T148541 (10fgiunchedi) Thanks a lot @RobH for the explanation! Please let me know if I can help with progressing this further [14:49:19] (03CR) 10Jcrespo: [C: 04-2] "I belive I confused wikitech (production hosts) with clouddb ones." [puppet] - 10https://gerrit.wikimedia.org/r/537450 (owner: 10Jcrespo) [14:49:27] (03Abandoned) 10Jcrespo: Revert "tools-prometheus: add toolsdb mariadb metrics" [puppet] - 10https://gerrit.wikimedia.org/r/537450 (owner: 10Jcrespo) [14:50:30] !log anomie@mwmaint1002 Running cleanupRevActorPage.php on section 1 wikis for T232464 [14:50:31] !log anomie@mwmaint1002 Running cleanupRevActorPage.php on section 2 wikis for T232464 [14:50:31] !log anomie@mwmaint1002 Running cleanupRevActorPage.php on remaining section 3 wikis for T232464 [14:50:31] !log anomie@mwmaint1002 Running cleanupRevActorPage.php on section 4 wikis for T232464 [14:50:31] !log anomie@mwmaint1002 Running cleanupRevActorPage.php on section 5 wikis for T232464 [14:50:31] !log anomie@mwmaint1002 Running cleanupRevActorPage.php on section 6 wikis for T232464 [14:50:31] !log anomie@mwmaint1002 Running cleanupRevActorPage.php on section 7 wikis for T232464 [14:50:32] !log anomie@mwmaint1002 Running cleanupRevActorPage.php on section 8 wikis for T232464 [14:50:32] !log anomie@mwmaint1002 Running cleanupRevActorPage.php on wikitech for T232464 [14:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:48] RECOVERY - Check if active EventStreams endpoint is delivering messages. on icinga1001 is OK: OK: An EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [14:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:13] !log elukey@cumin1001 START - Cookbook sre.hadoop.reboot-workers [14:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:18] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) [14:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:24] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] FileImporter: limited default deployment (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537433 (https://phabricator.wikimedia.org/T232539) (owner: 10Awight) [14:52:29] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] FileImporter: limited default deployment (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537434 (https://phabricator.wikimedia.org/T232539) (owner: 10Awight) [14:52:36] (03PS1) 10Jhedden: tools-prometheus: remove toolsdb mariadb target [puppet] - 10https://gerrit.wikimedia.org/r/537456 (https://phabricator.wikimedia.org/T220530) [14:53:24] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] NowCommons test & test2wiki 
configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537375 (https://phabricator.wikimedia.org/T228851) (owner: 10Awight) [14:55:16] (03CR) 10Jhedden: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/537203 (https://phabricator.wikimedia.org/T220530) (owner: 10Jhedden) [14:55:35] (03CR) 10Jcrespo: "I mean, it is not great, but as long as this is in cloud network, the previous commit it is ok with me." [puppet] - 10https://gerrit.wikimedia.org/r/537456 (https://phabricator.wikimedia.org/T220530) (owner: 10Jhedden) [14:55:52] (03CR) 10Jcrespo: [C: 04-1] tools-prometheus: remove toolsdb mariadb target [puppet] - 10https://gerrit.wikimedia.org/r/537456 (https://phabricator.wikimedia.org/T220530) (owner: 10Jhedden) [14:58:41] (03Abandoned) 10Jhedden: tools-prometheus: remove toolsdb mariadb target [puppet] - 10https://gerrit.wikimedia.org/r/537456 (https://phabricator.wikimedia.org/T220530) (owner: 10Jhedden) [14:59:44] (03CR) 10Jcrespo: "Apologies for the confusion and the extra work caused." [puppet] - 10https://gerrit.wikimedia.org/r/537456 (https://phabricator.wikimedia.org/T220530) (owner: 10Jhedden) [15:00:44] (03CR) 10Jhedden: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/537456 (https://phabricator.wikimedia.org/T220530) (owner: 10Jhedden) [15:07:17] (03PS1) 10Elukey: sre.hadoop.reboot-workers.py: wrap more commands in try/except [cookbooks] - 10https://gerrit.wikimedia.org/r/537460 (https://phabricator.wikimedia.org/T225297) [15:09:16] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring - https://phabricator.wikimedia.org/T148541 (10ayounsi) Note that the data is in LibreNMS as well, but with some limitations: * 5min granularity * Not possible to stack or... 
[15:09:34] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 64.29% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:16:40] !log Stop MySQL on db2127 and shut the host down for onsite maintenance [15:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Host down for on-site maintenance', diff saved to https://phabricator.wikimedia.org/P9120 and previous config saved to /var/cache/conftool/dbconfig/20190917-151714-marostegui.json [15:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:02] (03CR) 10Elukey: [C: 03+2] sre.hadoop.reboot-workers.py: wrap more commands in try/except [cookbooks] - 10https://gerrit.wikimedia.org/r/537460 (https://phabricator.wikimedia.org/T225297) (owner: 10Elukey) [15:24:05] (03PS2) 10Giuseppe Lavagetto: Add envoy image with TLS termination. 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/399640 [15:24:07] (03PS1) 10Giuseppe Lavagetto: envoy: sync base container with conventions used in production [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/537466 [15:24:35] (03CR) 10Elukey: sre.hadoop.reboot-workers.py: wrap more commands in try/except [cookbooks] - 10https://gerrit.wikimedia.org/r/537460 (https://phabricator.wikimedia.org/T225297) (owner: 10Elukey) [15:29:06] (03PS2) 10Elukey: sre.hadoop.reboot-workers.py: wrap more commands in try/except [cookbooks] - 10https://gerrit.wikimedia.org/r/537460 (https://phabricator.wikimedia.org/T225297) [15:30:31] (03PS5) 10Cwhite: profile: use prometheus for logstash alerting [puppet] - 10https://gerrit.wikimedia.org/r/536358 (https://phabricator.wikimedia.org/T205870) [15:35:43] (03CR) 10Elukey: [C: 03+2] sre.hadoop.reboot-workers.py: wrap more commands in try/except [cookbooks] - 10https://gerrit.wikimedia.org/r/537460 (https://phabricator.wikimedia.org/T225297) (owner: 10Elukey) [15:35:52] (03PS2) 10Jcrespo: install_server: Update partman recipe to set / on last disks [puppet] - 10https://gerrit.wikimedia.org/r/537336 (https://phabricator.wikimedia.org/T229209) [15:37:29] (03PS1) 10Jbond: ipmi: use run instead of checkouput [software/spicerack] - 10https://gerrit.wikimedia.org/r/537468 [15:37:38] (03Merged) 10jenkins-bot: sre.hadoop.reboot-workers.py: wrap more commands in try/except [cookbooks] - 10https://gerrit.wikimedia.org/r/537460 (https://phabricator.wikimedia.org/T225297) (owner: 10Elukey) [15:38:05] !log decommissioning Cassandra, restbase2011-a -- T224553 [15:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:08] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [15:39:34] (03PS1) 10Jhedden: tools-prometheus: add clouddb100[12] node targets [puppet] - 10https://gerrit.wikimedia.org/r/537472 
(https://phabricator.wikimedia.org/T220530) [15:39:54] !log elukey@cumin1001 START - Cookbook sre.hadoop.reboot-workers [15:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:13] (03CR) 10Bstorm: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/537201 (owner: 10Bstorm) [15:41:55] (03CR) 10Jhedden: [C: 03+2] tools-prometheus: add clouddb100[12] node targets [puppet] - 10https://gerrit.wikimedia.org/r/537472 (https://phabricator.wikimedia.org/T220530) (owner: 10Jhedden) [15:46:09] kart_: broken how ? [15:48:22] (03CR) 10jerkins-bot: [V: 04-1] ipmi: use run instead of checkouput [software/spicerack] - 10https://gerrit.wikimedia.org/r/537468 (owner: 10Jbond) [15:49:06] (03PS1) 10Bstorm: accountcheck: add bstorm to whitelist for wmcs ssh key check [puppet] - 10https://gerrit.wikimedia.org/r/537473 [15:49:14] (03PS1) 10Reedy: Revert "Set MinimumPasswordLengthToLogin to 10 for all prived groups, not just +staff" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537474 [15:49:19] (03PS2) 10Reedy: Revert "Set MinimumPasswordLengthToLogin to 10 for all prived groups, not just +staff" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537474 [15:49:43] jouncebot: now [15:49:43] No deployments scheduled for the next 0 hour(s) and 10 minute(s) [15:49:45] jouncebot: next [15:49:45] In 0 hour(s) and 10 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190917T1600) [15:49:53] (03CR) 10Reedy: [C: 03+2] Revert "Set MinimumPasswordLengthToLogin to 10 for all prived groups, not just +staff" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537474 (owner: 10Reedy) [15:50:43] (03CR) 10Bstorm: "Since I was tripping alarms about switching my keys, adding my account to the whitelist since I'm now using a yubikey (and was already usi" [puppet] - 10https://gerrit.wikimedia.org/r/537473 (owner: 10Bstorm) [15:50:56] Oh, bah. My local revert didn't push. 
[15:51:03] * James_F glares at gerrit. [15:51:14] (03Merged) 10jenkins-bot: Revert "Set MinimumPasswordLengthToLogin to 10 for all prived groups, not just +staff" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537474 (owner: 10Reedy) [15:51:31] (03CR) 10jenkins-bot: Revert "Set MinimumPasswordLengthToLogin to 10 for all prived groups, not just +staff" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537474 (owner: 10Reedy) [15:51:34] (03PS2) 10CRusnov: add customscripts directory for custom scripts [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/537242 [15:53:28] !log reedy@deploy1001 sync-file aborted: (no justification provided) (duration: 00m 00s) [15:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:35] !log reedy@deploy1001 sync-file aborted: (no justification provided) (duration: 00m 01s) [15:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:43] * Reedy slaps his clipboard [15:55:01] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: Revert Set MinimumPasswordLengthToLogin to 10 for all prived groups, not just +staff (duration: 00m 55s) [15:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:31] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring - https://phabricator.wikimedia.org/T148541 (10fgiunchedi) a:05fgiunchedi→03RobH Following up from irc with @RobH, what would be needed is the list of metrics from abov... 
[15:55:32] (03CR) 10Muehlenhoff: [C: 03+1] "No need to switch, let's just merge this :-)" [puppet] - 10https://gerrit.wikimedia.org/r/537473 (owner: 10Bstorm) [15:56:26] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:56:38] (03CR) 10Alexandros Kosiaris: "ores-misc-01 already has the role::labs::ores::staging role applied from what I see, so no need to apply new roles." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [15:57:28] (03CR) 10Bstorm: [C: 03+2] accountcheck: add bstorm to whitelist for wmcs ssh key check [puppet] - 10https://gerrit.wikimedia.org/r/537473 (owner: 10Bstorm) [16:00:04] godog and _joe_: How many deployers does it take to do Puppet SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190917T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
[16:04:43] !log run octocatalog-diff from elnath with current facts [16:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:07] (03PS3) 10CRusnov: Initial support for custom scripts [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/537242 [16:09:14] 10Operations, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10Jdforrester-WMF) [16:14:57] (03PS6) 10Alexandros Kosiaris: Adds git::lfs class and include it respectively [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [16:19:29] (03PS3) 10Alexandros Kosiaris: Switches ores.wmflabs monitoring to use new ores-web-(04,05,06) [puppet] - 10https://gerrit.wikimedia.org/r/537420 (https://phabricator.wikimedia.org/T232228) (owner: 10Halfak) [16:19:56] (03CR) 10Alexandros Kosiaris: [C: 03+2] Switches ores.wmflabs monitoring to use new ores-web-(04,05,06) [puppet] - 10https://gerrit.wikimedia.org/r/537420 (https://phabricator.wikimedia.org/T232228) (owner: 10Halfak) [16:21:26] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.reboot-workers (exit_code=0) [16:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:24] 10Operations, 10ops-eqiad, 10DC-Ops: update puppet for new PDU models - https://phabricator.wikimedia.org/T233129 (10RobH) [16:36:25] (03PS1) 10Elukey: site.pp: rework the regex pattern for the Hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/537484 [16:39:41] (03CR) 10Elukey: [C: 03+2] site.pp: rework the regex pattern for the Hadoop test cluster [puppet] - 10https://gerrit.wikimedia.org/r/537484 (owner: 10Elukey) [16:40:17] clear example of problem between keyboard and computer [16:46:11] (03PS1) 10Ayounsi: Add cookbook to update Sentry PDUs passwords [cookbooks] - 
10https://gerrit.wikimedia.org/r/537486 [16:47:48] (03CR) 10jerkins-bot: [V: 04-1] Add cookbook to update Sentry PDUs passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/537486 (owner: 10Ayounsi) [16:55:05] (03PS2) 10Ayounsi: Add cookbook to update Sentry PDUs passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/537486 [16:59:45] (03CR) 10jerkins-bot: [V: 04-1] Add cookbook to update Sentry PDUs passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/537486 (owner: 10Ayounsi) [16:59:58] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [17:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] cscott, arlolra, subbu, halfak, and accraze: Dear deployers, time to do the Services – Graphoid / Parsoid / Citoid / ORES deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190917T1700). [17:00:36] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . 
[17:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:31] (03PS1) 10Elukey: Exclude another Hadoop datanode directory from analytics1037 [puppet] - 10https://gerrit.wikimedia.org/r/537487 [17:05:27] (03CR) 10Elukey: [C: 03+2] Exclude another Hadoop datanode directory from analytics1037 [puppet] - 10https://gerrit.wikimedia.org/r/537487 (owner: 10Elukey) [17:06:24] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frqueue2001 - https://phabricator.wikimedia.org/T232630 (10Papaul) [17:08:13] !log elukey@cumin1001 START - Cookbook sre.hadoop.reboot-workers [17:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:05] !log decommissioning Cassandra, restbase2011-b -- T224553 [17:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:08] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [17:11:28] 10Operations, 10ops-eqiad, 10DC-Ops: update puppet for new PDU models - https://phabricator.wikimedia.org/T233129 (10ayounsi) Nah, Sentry Smart PDU Version 8.0n are still Sentry 4. I think the gap is at v7 = Sentry 3, v8 = Sentry 4 [17:21:00] RECOVERY - snapshot of s4 in codfw on db1115 is OK: snapshot for s4 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-09-17 16:01:41 from db2099.codfw.wmnet:3314 (1078 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [17:22:28] (03PS1) 10Herron: kafka-main: move kafka1003 to role spare system [puppet] - 10https://gerrit.wikimedia.org/r/537490 (https://phabricator.wikimedia.org/T225005) [17:23:52] (03CR) 10Volans: "Did a first pass, some comments inline." 
(0310 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/537486 (owner: 10Ayounsi) [17:24:45] 10Operations, 10ops-eqiad, 10DC-Ops: update puppet for new PDU models - https://phabricator.wikimedia.org/T233129 (10RobH) [17:24:55] (03PS1) 10Jdrewniak: beta: enable desktop watchlist for mobile AMC users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537492 (https://phabricator.wikimedia.org/T225127) [17:25:01] 10Operations, 10ops-eqiad, 10DC-Ops: update puppet for new PDU models - https://phabricator.wikimedia.org/T233129 (10RobH) If that is the case, then no puppet updates are required, as sentry4 is already listed for all the PDU models in eqiad. [17:25:37] (03CR) 10Herron: [C: 03+2] kafka-main: move kafka1003 to role spare system [puppet] - 10https://gerrit.wikimedia.org/r/537490 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [17:27:16] (03Abandoned) 10Jdrewniak: beta: enable desktop watchlist for mobile AMC users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537492 (https://phabricator.wikimedia.org/T225127) (owner: 10Jdrewniak) [17:29:36] (03PS1) 10Herron: Revert "kafka-main1003: disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/537493 [17:29:43] (03PS2) 10Herron: Revert "kafka-main1003: disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/537493 [17:31:12] (03CR) 10Herron: [C: 03+2] Revert "kafka-main1003: disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/537493 (owner: 10Herron) [17:32:15] (03PS1) 10Jdrewniak: beta: enable desktop watchlist for mobile AMC users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537495 (https://phabricator.wikimedia.org/T225127) [17:33:56] 10Operations, 10Analytics, 10Core Platform Team Legacy (Watching / External), 10Patch-For-Review, and 2 others: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10herron) [17:44:10] (03CR) 10Pmiazga: [C: 04-1] beta: enable desktop 
watchlist for mobile AMC users (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537495 (https://phabricator.wikimedia.org/T225127) (owner: 10Jdrewniak) [17:44:16] (03PS4) 10CRusnov: Initial support for custom scripts [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/537242 (https://phabricator.wikimedia.org/T230449) [17:54:59] (03CR) 10jerkins-bot: [V: 04-1] Initial support for custom scripts [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/537242 (https://phabricator.wikimedia.org/T230449) (owner: 10CRusnov) [17:55:31] (03CR) 10Volans: [C: 04-1] "Found some issue, not sure if those are the cause of the CI failures. Let's fix them first and see what remains to be fixed ;)" (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/537468 (owner: 10Jbond) [17:56:34] (03CR) 10Volans: "LGTM, one nit to be slightly DRYer" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/537455 (https://phabricator.wikimedia.org/T147074) (owner: 10Jbond) [17:57:15] (03PS5) 10CRusnov: Initial support for custom scripts [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/537242 (https://phabricator.wikimedia.org/T230449) [17:59:48] 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (10Jclark-ctr) a:05Cmjohnson→03RobH [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190917T1800) [18:00:22] 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (10Jclark-ctr) Finished swapping pdu reassigned to @RobH [18:01:50] 10Operations, 10ops-codfw: refresh/replace scs-c1-codfw - https://phabricator.wikimedia.org/T231687 (10Papaul) [18:06:16] (03PS2) 10Jdrewniak: beta: enable desktop watchlist for mobile AMC users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537495 
(https://phabricator.wikimedia.org/T225127) [18:06:29] !log upgrading firmware on scs1-a1-codfw [18:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:08] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to analytics cluster for Martin Gerlach - https://phabricator.wikimedia.org/T232707 (10herron) [18:13:00] 10Operations, 10ops-codfw: refresh/replace scs-c1-codfw - https://phabricator.wikimedia.org/T231687 (10Papaul) Upgrade firmware from Firmware: 3.16.6u4 to Firmware: 3.16.6u5 [18:13:29] 10Operations, 10ops-codfw: refresh/replace scs-c1-codfw - https://phabricator.wikimedia.org/T231687 (10Papaul) [18:17:22] RECOVERY - Host ps1-b3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.57 ms [18:19:41] 10Operations: Apache mod_status aggregator - https://phabricator.wikimedia.org/T233047 (10eprodromou) It sounds like you've got this under control, and Tim is directly tagged, so I'm going to untag CPT. [18:19:49] (03PS1) 10Herron: admin: add mgerlach to analytics-privatedata-users, researchers [puppet] - 10https://gerrit.wikimedia.org/r/537508 (https://phabricator.wikimedia.org/T232707) [18:20:38] PROBLEM - ps1-b3-eqiad-infeed-load-tower-A-phase-Y on ps1-b3-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:20:38] PROBLEM - ps1-b3-eqiad-infeed-load-tower-A-phase-X on ps1-b3-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:20:38] PROBLEM - ps1-b3-eqiad-infeed-load-tower-A-phase-Z on ps1-b3-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:20:38] PROBLEM - ps1-b3-eqiad-infeed-load-tower-B-phase-Z on ps1-b3-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:20:38] PROBLEM - ps1-b3-eqiad-infeed-load-tower-B-phase-X on ps1-b3-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:20:39] PROBLEM - ps1-b3-eqiad-infeed-load-tower-B-phase-Y on ps1-b3-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:20:52] hi all, I'm running the octocatalog differ again. The previous issue was caused because I copied some old facts files from puppet2001 over the top of the new ones from puppet1001. I have now ensured that the facts files in use are recent (anything older than a day is gone). Things have been running now for about 30 mins and everything seems stable, but please ping me if you see anything [18:20:58] strange with exported resources [18:21:16] 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (10RobH) 05Open→03Resolved I've gone ahead and setup remote access and settings identical to the other new PDUs. It now is online/ping/ssh/syslog accessible. 
[18:21:18] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [18:21:34] ^^ old message from buffer, still valid but the differ has been running for a few hours now [18:25:52] (03CR) 10Ottomata: [C: 03+1] admin: add mgerlach to analytics-privatedata-users, researchers [puppet] - 10https://gerrit.wikimedia.org/r/537508 (https://phabricator.wikimedia.org/T232707) (owner: 10Herron) [18:28:08] 04Critical Alert for device ps1-b3-eqiad.mgmt.eqiad.wmnet - Device rebooted [18:28:14] 04Critical Alert for device ps1-b6-eqiad.mgmt.eqiad.wmnet - Device rebooted [18:33:08] 10Operations, 10Analytics, 10User-Elukey: setup/install eqiad kerbos node WMF5173 - https://phabricator.wikimedia.org/T233141 (10RobH) [18:33:08] 04̶C̶r̶i̶t̶i̶c̶a̶l Device ps1-b3-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted [18:33:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device ps1-b6-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted [18:33:33] 10Operations, 10Analytics, 10User-Elukey: setup/install codfw kerbos node WMF6577 - https://phabricator.wikimedia.org/T233142 (10RobH) [18:37:34] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:37:49] (03PS2) 10Jbond: IPMI: maintain current password during resets [cookbooks] - 10https://gerrit.wikimedia.org/r/537455 (https://phabricator.wikimedia.org/T147074) [18:37:59] (03CR) 10Jbond: IPMI: maintain current password during resets (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/537455 (https://phabricator.wikimedia.org/T147074) (owner: 10Jbond) [18:38:23] 10Operations, 10Analytics, 10User-Elukey: setup/install codfw kerbos node WMF6577 - https://phabricator.wikimedia.org/T233142 (10RobH) a:05RobH→03elukey @elukey, Please note that both T233141 (eqiad) and T233142 (codfw) are nearly identical. 
We need the following info to setup these hosts: * Hostnames... [18:38:28] 10Operations, 10Analytics, 10User-Elukey: setup/install eqiad kerbos node WMF5173 - https://phabricator.wikimedia.org/T233141 (10RobH) a:05RobH→03elukey @elukey, Please note that both T233141 (eqiad) and T233142 (codfw) are nearly identical. We need the following info to setup these hosts: * Hostnames... [18:40:26] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: eqiad: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (10RobH) 05Open→03Resolved T233141 created for setup. resolving this request task! [18:40:37] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: codfw: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227425 (10RobH) 05Open→03Resolved a:05RobH→03None T233142 created for setup, resolving this request task! [18:43:20] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:44:39] (03PS3) 10Jdrewniak: beta: enable desktop watchlist for mobile AMC users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537495 (https://phabricator.wikimedia.org/T225127) [18:49:32] (03CR) 10Herron: [C: 03+2] admin: add mgerlach to analytics-privatedata-users, researchers [puppet] - 10https://gerrit.wikimedia.org/r/537508 (https://phabricator.wikimedia.org/T232707) (owner: 10Herron) [18:49:36] (03PS3) 10Dzahn: delete wikiba.se [dns] - 10https://gerrit.wikimedia.org/r/534276 (https://phabricator.wikimedia.org/T99531) [18:50:23] (03PS3) 10Ayounsi: Add cookbook to update Sentry PDUs passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/537486 [18:51:30] (03CR) 10Ayounsi: "Thanks! one question, everything addressed." 
(0310 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/537486 (owner: 10Ayounsi) [18:55:21] (03CR) 10Dzahn: [C: 03+2] delete wikiba.se [dns] - 10https://gerrit.wikimedia.org/r/534276 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn) [18:59:16] (03PS4) 10Ayounsi: Add cookbook to update Sentry PDUs passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/537486 [19:00:04] twentyafterfour: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - American version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190917T1900). [19:03:59] (03PS1) 10Herron: admin: add srishakatux to researchers [puppet] - 10https://gerrit.wikimedia.org/r/537516 (https://phabricator.wikimedia.org/T232664) [19:04:11] (03CR) 10Dzahn: [C: 03+2] IPMI: maintain current password during resets [cookbooks] - 10https://gerrit.wikimedia.org/r/537455 (https://phabricator.wikimedia.org/T147074) (owner: 10Jbond) [19:04:15] (03PS3) 10Dzahn: IPMI: maintain current password during resets [cookbooks] - 10https://gerrit.wikimedia.org/r/537455 (https://phabricator.wikimedia.org/T147074) (owner: 10Jbond) [19:05:56] !log decommissioning Cassandra, restbase2011-c -- T224553 [19:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:59] T224553: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 [19:08:34] !log Branch cut is in progress for 1.34.0-wmf.23 [19:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:21] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [19:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:38] !log dzahn@cumin1001 Updating IPMI password on 8 hosts - dzahn@cumin1001 [19:14:39] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [19:14:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:27] * raynor is going to deploy a config change for beta cluster [19:15:50] any objections? [19:18:17] (03CR) 10Pmiazga: [C: 03+2] beta: enable desktop watchlist for mobile AMC users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537495 (https://phabricator.wikimedia.org/T225127) (owner: 10Jdrewniak) [19:18:23] (03PS1) 10Herron: admin: add phedenskog to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/537519 (https://phabricator.wikimedia.org/T232489) [19:19:36] (03Merged) 10jenkins-bot: beta: enable desktop watchlist for mobile AMC users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537495 (https://phabricator.wikimedia.org/T225127) (owner: 10Jdrewniak) [19:20:14] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [19:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:28] !log dzahn@cumin1001 Updating IPMI password on 2 hosts - dzahn@cumin1001 [19:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:33] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [19:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:47] 10Operations, 10Performance-Team, 10SRE-Access-Requests, 10Patch-For-Review: Request access to 'deployment' user group for phedenskog - https://phabricator.wikimedia.org/T232489 (10herron) Uploaded a patch for this. Once we have approval documented we should be able to move forward with it. 
[19:21:35] (03CR) 10jenkins-bot: beta: enable desktop watchlist for mobile AMC users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537495 (https://phabricator.wikimedia.org/T225127) (owner: 10Jdrewniak) [19:22:03] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [19:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:22] !log dzahn@cumin1001 Updating IPMI password on 8 hosts - dzahn@cumin1001 [19:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:34] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [19:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:57] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [19:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:13] !log dzahn@cumin1001 Updating IPMI password on 543 hosts - dzahn@cumin1001 [19:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:09] (03PS1) 10Herron: admin: add urbanecm to researchers, analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/537524 (https://phabricator.wikimedia.org/T231616) [19:32:28] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Request access to Analytics cluster for Urbanecm - https://phabricator.wikimedia.org/T231616 (10herron) Uploaded a patch for this. But need approval from @Nuria before moving forward with it. [19:34:52] (03PS1) 10AndyRussG: Turn on EventLogging at 100% for DonateWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537525 (https://phabricator.wikimedia.org/T233145) [19:36:49] 10Operations, 10ops-eqiad, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10wiki_willy) @Jclark-ctr - since Chris had to use a sick day, can one of you guys take a look at this for Luca? 
Thanks, Willy [19:43:43] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to analytics cluster for Martin Gerlach - https://phabricator.wikimedia.org/T232707 (10herron) [19:44:33] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access to analytics cluster for Martin Gerlach - https://phabricator.wikimedia.org/T232707 (10herron) 05Open→03Resolved a:03herron Hi Martin, this access is in place now. If any follow up is needed please don't hesitate to re-open. Thanks! [19:48:45] 10Operations, 10Traffic, 10Wikidata, 10serviceops, and 3 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10Dzahn) I think we are done with the cleanup in production now and only cloud VPS instance is left. @Addshore Does anyone use that to test cha... [19:48:52] (03PS2) 10Jbond: ipmi: use run instead of checkouput [software/spicerack] - 10https://gerrit.wikimedia.org/r/537468 [19:53:22] (03CR) 10jerkins-bot: [V: 04-1] ipmi: use run instead of checkouput [software/spicerack] - 10https://gerrit.wikimedia.org/r/537468 (owner: 10Jbond) [19:53:25] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [19:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:58] (03PS1) 10Zoranzoki21: Add suppressredirect right to filemovers on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537534 (https://phabricator.wikimedia.org/T233137) [19:57:13] (03PS1) 10Jforrester: docroot: Commit VariantSettings.php.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537535 [19:58:21] (03CR) 10Zoranzoki21: "This is here? https://noc.wikimedia.org/conf/VariantSettings.php.txt" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537535 (owner: 10Jforrester) [19:58:58] (03CR) 10Jforrester: "Yeah, I created it on the deployment server and synced it but forgot to commit it." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/537535 (owner: 10Jforrester) [19:59:43] (03PS1) 10Dzahn: cumin: adjust labpuppetmaster alias [puppet] - 10https://gerrit.wikimedia.org/r/537536 [19:59:45] (03CR) 10Jforrester: [C: 03+2] docroot: Commit VariantSettings.php.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537535 (owner: 10Jforrester) [20:01:00] (03Merged) 10jenkins-bot: docroot: Commit VariantSettings.php.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537535 (owner: 10Jforrester) [20:01:17] (03CR) 10jenkins-bot: docroot: Commit VariantSettings.php.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537535 (owner: 10Jforrester) [20:02:56] (03PS3) 10Jbond: ipmi: use run instead of checkouput [software/spicerack] - 10https://gerrit.wikimedia.org/r/537468 [20:04:35] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [20:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:54] !log dzahn@cumin1001 Updating IPMI password on 29 hosts - dzahn@cumin1001 [20:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:39] (03PS2) 10Jhedden: openstack: configure apache wsgi for keystone api [puppet] - 10https://gerrit.wikimedia.org/r/536664 (https://phabricator.wikimedia.org/T223907) [20:06:59] (03CR) 10jerkins-bot: [V: 04-1] ipmi: use run instead of checkouput [software/spicerack] - 10https://gerrit.wikimedia.org/r/537468 (owner: 10Jbond) [20:07:49] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [20:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:55] 10Operations, 10ops-eqiad: backup1001 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T232882 (10Jclark-ctr) @jcrespo Drive arrived early Replaced failed drive [20:08:16] 10Operations, 10ops-eqiad: backup1001 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T232882 (10Jclark-ctr) 
05Open→03Resolved [20:08:21] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10Jclark-ctr) [20:11:43] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [20:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:00] !log dzahn@cumin1001 Updating IPMI password on 18 hosts - dzahn@cumin1001 [20:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:42] (03PS4) 10Jbond: ipmi: use run instead of checkouput [software/spicerack] - 10https://gerrit.wikimedia.org/r/537468 [20:14:17] (03CR) 10Jbond: "thanks but still failing, i think it is possibly to do with the order the mocking occurs" (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/537468 (owner: 10Jbond) [20:14:35] (03PS1) 1020after4: testwikis wikis to 1.34.0-wmf.23 refs T220746 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537538 [20:14:40] (03CR) 1020after4: [C: 03+2] testwikis wikis to 1.34.0-wmf.23 refs T220746 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537538 (owner: 1020after4) [20:15:17] 10Operations, 10hardware-requests: eqiad+codfw: 6x hardware request for swift backend (each site) - https://phabricator.wikimedia.org/T227314 (10Jclark-ctr) [20:15:28] !log changing email for User:Olag [20:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:57] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [20:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:01] (03Merged) 10jenkins-bot: testwikis wikis to 1.34.0-wmf.23 refs T220746 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537538 (owner: 1020after4) [20:17:16] (03CR) 10jenkins-bot: testwikis wikis to 1.34.0-wmf.23 refs T220746 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537538 (owner: 1020after4) [20:18:39] 
twentyafterfour: I'm still in CI btw [20:18:48] (03CR) 10jerkins-bot: [V: 04-1] ipmi: use run instead of checkouput [software/spicerack] - 10https://gerrit.wikimedia.org/r/537468 (owner: 10Jbond) [20:18:50] (03PS3) 10Jhedden: openstack: configure apache wsgi for keystone api [puppet] - 10https://gerrit.wikimedia.org/r/536664 (https://phabricator.wikimedia.org/T223907) [20:18:58] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [20:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:13] !log dzahn@cumin1001 Updating IPMI password on 21 hosts - dzahn@cumin1001 [20:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:03] Krinkle: ok I haven't synced anything yet [20:20:13] just let me know when it's clear [20:20:17] ETA 9min [20:20:19] will do :) [20:20:36] twentyafterfour, do you have a minute to spare, I pushed a config change for beta cluster, but somehow it doesn't work [20:20:44] I'm trying to understand whats going on [20:21:03] raynor: ok, which change? [20:21:04] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/537495/3/wmf-config/InitialiseSettings-labs.php [20:21:48] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [20:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:18] I know that we have `shell.php` script, but I'm totally lost on how to ssh to beta cluster ;/ [20:24:36] (03PS5) 10Ayounsi: Add cookbook to update Sentry PDUs passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/537486 (https://phabricator.wikimedia.org/T233053) [20:24:39] ssh -a your-labs-shell@deployment-deploy01.eqiad.wmflabs.org [20:24:56] let me see my .config just to be sure [20:25:16] That's a direct connect... 
don't really need a config if it's web accessible [20:25:18] ssh deployment-deploy01.deployment-prep.eqiad.wmflabs [20:25:27] yup [20:25:29] that one [20:25:36] though I use ProxyCommand [20:25:42] permission denied :/, probably I need some perms [20:25:50] so I ssh to a bastion, then to deploy01 [20:26:07] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Request access to Analytics cluster for Urbanecm - https://phabricator.wikimedia.org/T231616 (10Nuria) @MMiller_WMF and @Urbanecm I want to clarify that we do not grant access to private data to volunteers (we have limits as to what amount of users w... [20:26:15] (03CR) 10Ayounsi: Add cookbook to update Sentry PDUs passwords (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/537486 (https://phabricator.wikimedia.org/T233053) (owner: 10Ayounsi) [20:26:26] * Krinkle staging on mwdebug1002 [20:27:18] raynor: what is your LDAP username? [20:27:38] ok, sorry, works. typo [20:27:41] heh [20:27:58] (03PS10) 10Jeena Huneidi: Add restbase chart (port from local-charts) [deployment-charts] - 10https://gerrit.wikimedia.org/r/517557 (https://phabricator.wikimedia.org/T224935) [20:28:02] so I'm in deployment-deploy01. Sorry for stupid question but how can I execute shell.php with beta env ? [20:28:09] I need to check config vars for beta env [20:28:11] mwscript shell.php ? [20:28:38] (question was for https://tools.wmflabs.org/ldap/group/project-deployment-prep - you can see who has access there :-) ) [20:28:49] yeah it should be in beta env simply because it's in the beta instance [20:28:56] Reedy will know better. I've never had to use shell.php [20:29:07] ok, and it will load beta cluster conf? neat. 
I thought I need to login to some specific server [20:29:08] raynor: same way as in prod [20:29:18] mwscript scriptname.php dbname [20:30:06] deployment-deploy01 is for the same purpose as deploy1001 in prod [20:30:24] (03PS4) 10Jhedden: openstack: configure apache wsgi for keystone api [puppet] - 10https://gerrit.wikimedia.org/r/536664 (https://phabricator.wikimedia.org/T223907) [20:30:27] (03CR) 10Jeena Huneidi: "ready for review" [deployment-charts] - 10https://gerrit.wikimedia.org/r/517557 (https://phabricator.wikimedia.org/T224935) (owner: 10Jeena Huneidi) [20:30:36] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.22/resources/src/mediawiki.Title/Title.js: 8372dcdcdfe02261 (duration: 02m 08s) [20:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:58] * hauskatze delete from user where user_name like '%Reedy%'; :P [20:30:58] That's 1/2 files [20:31:01] almost done [20:31:46] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.22/resources/src/mediawiki.Title/phpCharToUpper.json: 8372dcdcdfe02261 (duration: 00m 56s) [20:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:49] And that's 2/2. Done, twentyafterfour - all yours. [20:31:55] twentyafterfour: note, I've also pulled it into wmf.23 [20:32:02] but not synced [20:32:15] Ha, yes, trying to sync wmf.23 wouldn't go well. 
[20:32:20] I don't know if that was synced at all yet, if so, I'll do that too, otherwise leaving to you [20:32:27] like if the directory existing was synced yet [20:33:22] (03PS1) 10Alexandros Kosiaris: wikifeeds: fix iteration issue on command and args [deployment-charts] - 10https://gerrit.wikimedia.org/r/537541 (https://phabricator.wikimedia.org/T233076) [20:33:24] (03PS1) 10Alexandros Kosiaris: scaffold: Fix bug with concatenation of args/command [deployment-charts] - 10https://gerrit.wikimedia.org/r/537542 [20:33:40] (03PS11) 10Jeena Huneidi: Add restbase chart (port from local-charts) [deployment-charts] - 10https://gerrit.wikimedia.org/r/517557 (https://phabricator.wikimedia.org/T224935) [20:35:58] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] wikifeeds: fix iteration issue on command and args [deployment-charts] - 10https://gerrit.wikimedia.org/r/537541 (https://phabricator.wikimedia.org/T233076) (owner: 10Alexandros Kosiaris) [20:35:59] 10Operations, 10Cloud-Services, 10SRE-Access-Requests, 10Developer-Advocacy (Jul-Sep 2019), 10Patch-For-Review: Membership in "researchers" group for Srishti Sethi - https://phabricator.wikimedia.org/T232664 (10Nuria) @srishakatux do sync up with @bd808 about steps to go forward to have a public dashboar... [20:36:05] 10Operations, 10Cloud-Services, 10SRE-Access-Requests, 10Developer-Advocacy (Jul-Sep 2019), 10Patch-For-Review: Membership in "researchers" group for Srishti Sethi - https://phabricator.wikimedia.org/T232664 (10Nuria) 05Open→03Resolved [20:36:37] Krinkle: thanks! [20:37:07] Krinkle: wmf.23 has not been sync'd yet I was waiting [20:37:12] syncing it now [20:38:14] (03PS5) 10Jhedden: openstack: configure apache wsgi for keystone api [puppet] - 10https://gerrit.wikimedia.org/r/536664 (https://phabricator.wikimedia.org/T223907) [20:39:17] ok, so after quick check (thx Reedy for guidance), my patch is on staging env. 
`git log` shows the commit I merged, the `wmf-config/InitializeSettings-labs.php` has my change [20:39:21] !log @ helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [20:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:40] raynor: Like I mentioned in PM... I bet it's how extension registration is handling the arrays [20:39:48] but when I do `mwscript shell.php --wiki=enwiki`, and then `var_dump( $wgMFUseDesktopSpecialWatchlistPage );` it shows old value [20:39:53] $numberOfDevsBittenByExtensionRegistrationsConfigArrayHandling++; [20:40:00] that value is not defined in InitializeSettings, it's new config [20:40:04] 10Operations, 10ops-eqiad, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10Jclark-ctr) @elukey Will be on site tomorrow morning 7:30 et questions regarding host in concern an-coord1001 or an-conf1001. Sent message on IRC to foll... [20:40:44] Yeah, plausibly. But it's a direct scalar? [20:40:52] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/537495/3/wmf-config/InitialiseSettings-labs.php [20:41:05] Should just replace. [20:41:10] if it was defined in InitializeSettings, then yes, I would have to do the trick with - to override it (if I remember right) [20:41:32] + to extend, - to replace. 
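The prefix rule discussed above ("-" to replace, and no difference between "-" and no prefix when the setting is absent from the production file) can be sketched roughly as follows. This is only an illustration of the behaviour described in the channel, not the actual `wmfApplyLabsOverrideSettings()` code: the function name `apply_labs_overrides`, the setting names, and the choice to merge per-wiki values for unprefixed keys that already exist are all assumptions, and "+" handling is left out.

```python
def apply_labs_overrides(prod, labs):
    """Return a copy of `prod` with beta-cluster overrides from `labs` applied.

    Both arguments map a setting name to a {wiki: value} dict (hypothetical shape).
    """
    merged = {name: dict(per_wiki) for name, per_wiki in prod.items()}
    for key, per_wiki in labs.items():
        if key.startswith("-"):
            # "-" prefix: discard the production entry and use the labs one.
            merged[key[1:]] = dict(per_wiki)
        elif key in merged:
            # Unprefixed and already set in production: merge per-wiki values
            # (an assumption made for this sketch).
            merged[key].update(per_wiki)
        else:
            # Not set in production: same outcome as the "-" case.
            merged[key] = dict(per_wiki)
    return merged


# Hypothetical settings, used only to exercise the three branches.
prod = {"wgExample": {"default": False, "enwiki": True}}
labs = {
    "-wgExample": {"default": True},
    "wgNewLabsSetting": {"default": True},
}
result = apply_labs_overrides(prod, labs)
merged_unprefixed = apply_labs_overrides(prod, {"wgExample": {"testwiki": True}})
```

Under these assumptions a brand-new setting defined only in the labs file ends up with exactly the labs value, which is why a stale value in `shell.php` would point at something other than the prefix handling.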
[20:41:55] indeed, if not set in prod IS.php, then - or no prefix is the same [20:42:00] per wmfApplyLabsOverrideSettings() logic [20:42:00] but it's not in IS [20:42:02] (03CR) 10Jhedden: "PCC results: https://puppet-compiler.wmflabs.org/compiler1002/18342/" [puppet] - 10https://gerrit.wikimedia.org/r/536664 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [20:42:06] !log twentyafterfour@deploy1001 Started scap: testwikis to 1.34.0-wmf.23 refs T220748 [20:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:09] T220748: 1.34.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T220748 [20:42:15] Well, nothing's in IS right now. ;-) But yeah, it's not in VS> [20:42:25] it is not in the IS/VS [20:42:25] * Krinkle AFK for ~ 1hr [20:44:49] James_F, any idea whats wrong? [20:45:16] it's not urgent, I'm just worried I missed something when merging this patch [20:46:16] Not, right now. [20:46:21] Will poke it. [20:46:40] ok, thx, let me know. I'm trying to wrap up my day and finally get some rest [20:46:59] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [20:46:59] I'll ping you tomorrow [20:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:02] Thanks for your time everyone! [20:47:14] !log dzahn@cumin1001 Updating IPMI password on 660 hosts - dzahn@cumin1001 [20:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:50] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Request access to Analytics cluster for Urbanecm - https://phabricator.wikimedia.org/T231616 (10Nuria) Also, the ticket referenced is already been closed, right? [20:49:37] (03CR) 10Nuria: [C: 04-1] "Let's hold on on this, please." 
[puppet] - 10https://gerrit.wikimedia.org/r/537524 (https://phabricator.wikimedia.org/T231616) (owner: 10Herron) [20:50:11] (03PS1) 10ArielGlenn: one-off for generating some page meta history files [dumps] - 10https://gerrit.wikimedia.org/r/537546 [20:52:06] (03PS2) 10Alexandros Kosiaris: scaffold: Fix bug with concatenation of args/command [deployment-charts] - 10https://gerrit.wikimedia.org/r/537542 [20:52:08] (03PS1) 10Alexandros Kosiaris: wikifeeds: Interpolate correctly the port value [deployment-charts] - 10https://gerrit.wikimedia.org/r/537547 (https://phabricator.wikimedia.org/T233076) [20:57:12] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [20:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:41] 10Operations, 10Cloud-Services: clarification of cloud terms of use regarding LDAP servers - https://phabricator.wikimedia.org/T233158 (10Dzahn) [20:58:43] 10Operations, 10Cloud-Services: clarification of cloud terms of use regarding LDAP servers - https://phabricator.wikimedia.org/T233158 (10Paladox) [21:01:21] !log enable interface damping on primary eqiad-esams link (eqiad side) - T196432 [21:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:24] T196432: Configure interface damping on primary links - https://phabricator.wikimedia.org/T196432 [21:02:39] (03PS2) 10Alexandros Kosiaris: wikifeeds: Interpolate correctly the port value [deployment-charts] - 10https://gerrit.wikimedia.org/r/537547 (https://phabricator.wikimedia.org/T233076) [21:02:41] (03PS3) 10Alexandros Kosiaris: scaffold: Fix bug with concatenation of args/command [deployment-charts] - 10https://gerrit.wikimedia.org/r/537542 [21:07:01] !log twentyafterfour@deploy1001 Finished scap: testwikis to 1.34.0-wmf.23 refs T220748 (duration: 24m 55s) [21:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:04] T220748: 1.34.0-wmf.23 deployment blockers - 
https://phabricator.wikimedia.org/T220748 [21:07:23] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] wikifeeds: Interpolate correctly the port value [deployment-charts] - 10https://gerrit.wikimedia.org/r/537547 (https://phabricator.wikimedia.org/T233076) (owner: 10Alexandros Kosiaris) [21:08:44] !log @ helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [21:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:40] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [21:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:54] !log dzahn@cumin1001 Updating IPMI password on 1 hosts - dzahn@cumin1001 [21:10:55] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [21:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:07] !log delete AS13335 91.198.174.0/24 RPKI/ROA [21:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:53] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (10Eevans) restbase2011 is fully decommissioned and ready to be reimaged. 
[21:24:25] (03CR) 10Aaron Schulz: [C: 03+1] Remove more unused math config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537241 (https://phabricator.wikimedia.org/T228547) (owner: 10Physikerwelt) [21:30:02] (03PS1) 10Eevans: sessionstore: configure cassandra for `local_dc` [deployment-charts] - 10https://gerrit.wikimedia.org/r/537552 (https://phabricator.wikimedia.org/T229697) [21:31:37] (03PS1) 1020after4: group0 wikis to 1.34.0-wmf.23 refs T220748 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537553 [21:31:39] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.34.0-wmf.23 refs T220748 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537553 (owner: 1020after4) [21:39:52] (03Merged) 10jenkins-bot: group0 wikis to 1.34.0-wmf.23 refs T220748 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537553 (owner: 1020after4) [21:40:10] (03CR) 10jenkins-bot: group0 wikis to 1.34.0-wmf.23 refs T220748 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537553 (owner: 1020after4) [21:41:13] (03PS1) 10Reedy: Add comment about MinimumPasswordLengthToLogin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537555 (https://phabricator.wikimedia.org/T233119) [21:42:29] (03PS2) 10Reedy: Add comment about MinimumPasswordLengthToLogin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537555 (https://phabricator.wikimedia.org/T233119) [21:43:29] James_F: ^ for some further clarification and reminding [21:45:10] !log removed one file for legal compliance [21:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:19] Reedy: Do you want to set forceChange for the prived policy? [21:53:31] For which? [21:54:12] For both PasswordNotInLargeBlacklist and MinimalPasswordLength [21:54:25] It's set to suggestChangeOnLogin right now only, right? [21:54:34] It is... [21:54:38] That was what we were trying to fix, but used the wrong config. [21:54:53] "Suggest" means "skippable thing you can ignore forever". 
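The difference between the two flags mentioned above can be pictured with a small sketch. It is not MediaWiki's implementation: `check_password` and the return labels are hypothetical, and only the key names `MinimalPasswordLength`, `suggestChangeOnLogin` and `forceChange` come from the discussion. The length is counted in bytes on the assumption that the "10-byte" wording of the related patch is meant literally.

```python
def check_password(password, policy):
    """Classify a login attempt as 'ok', 'suggest_change' or 'force_change'."""
    rule = policy["MinimalPasswordLength"]
    if len(password.encode("utf-8")) >= rule["value"]:
        return "ok"
    if rule.get("forceChange", False):
        # The user has to pick a new password before continuing.
        return "force_change"
    if rule.get("suggestChangeOnLogin", False):
        # The user sees a prompt that can be skipped indefinitely.
        return "suggest_change"
    # Assumed for this sketch: with neither flag set, login is not interrupted.
    return "ok"


suggest_only = {"MinimalPasswordLength": {"value": 10, "suggestChangeOnLogin": True}}
forced = {
    "MinimalPasswordLength": {
        "value": 10,
        "suggestChangeOnLogin": True,
        "forceChange": True,
    }
}
outcomes = (
    check_password("short", suggest_only),
    check_password("short", forced),
    check_password("long enough password", forced),
)
```

With only the suggest flag set, a too-short password never blocks the login, which matches the "skippable thing you can ignore forever" description.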
[21:55:40] Which, I note, is actually what I did in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/479570 which we didn't merge. Whoops. :-) [21:58:40] (03PS3) 10Jforrester: Add comment about MinimumPasswordLengthToLogin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537555 (https://phabricator.wikimedia.org/T233119) (owner: 10Reedy) [21:58:50] (03CR) 10Jforrester: [C: 03+1] Add comment about MinimumPasswordLengthToLogin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537555 (https://phabricator.wikimedia.org/T233119) (owner: 10Reedy) [21:58:54] (03Restored) 10Jforrester: Enforce a 10-byte password for privileged users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479570 (https://phabricator.wikimedia.org/T208246) (owner: 10Jforrester) [22:02:07] (03PS5) 10Jforrester: Enforce a 10-byte password for privileged users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479570 (https://phabricator.wikimedia.org/T208246) [22:03:23] Reedy: Like ^^ [22:03:41] Yeah, I think that looks more like what we probably wanted :P [22:04:14] Want to try this one? :-) [22:04:25] (03CR) 10Jforrester: [C: 03+2] Add comment about MinimumPasswordLengthToLogin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537555 (https://phabricator.wikimedia.org/T233119) (owner: 10Reedy) [22:05:16] (03CR) 10Jforrester: "Use tabs, not spaces. 
:-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537433 (https://phabricator.wikimedia.org/T232539) (owner: 10Awight) [22:06:34] (03Merged) 10jenkins-bot: Add comment about MinimumPasswordLengthToLogin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537555 (https://phabricator.wikimedia.org/T233119) (owner: 10Reedy) [22:07:56] (03CR) 10jenkins-bot: Add comment about MinimumPasswordLengthToLogin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537555 (https://phabricator.wikimedia.org/T233119) (owner: 10Reedy) [22:08:21] 10Operations, 10netops: Review firewall rules for labpuppetmaster1001/labpuppetmaster1002 removal - https://phabricator.wikimedia.org/T233075 (10ayounsi) 05Open→03Resolved No mention of those two hosts (or their IPs) in Rancid (network devices). [22:09:21] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Add comment about MinimumPasswordLengthToLogin (duration: 01m 03s) [22:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:35] (03CR) 10Gergő Tisza: "MinimumPasswordLengthToLogin probably shouldn't be removed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479570 (https://phabricator.wikimedia.org/T208246) (owner: 10Jforrester) [22:15:14] (03CR) 10Jforrester: "> Patch Set 5:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479570 (https://phabricator.wikimedia.org/T208246) (owner: 10Jforrester) [22:15:53] (03CR) 10Reedy: [C: 04-1] Enforce a 10-byte password for privileged users (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479570 (https://phabricator.wikimedia.org/T208246) (owner: 10Jforrester) [22:16:09] 10Operations, 10ops-eqiad, 10DC-Ops: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC) - https://phabricator.wikimedia.org/T227541 (10RobH) 15:15 <@robh> : So, I can confirm in librenms it sees both towers 15:15 <@robh> : so, this seems to me to be an icinga issue 15:15 <@robh> : Does this seem reasonable?... 
[22:16:54] Reedy: That's the +staff-specific over-ride for MinimumPasswordLengthToLogin which we've just discovered is toxic and shouldn't ever be used, though? [22:17:05] Well, except it's been set to 10 for ages and is fine? [22:17:08] for staff [22:17:13] "Ages" as in a few months. [22:17:50] It's extra kludge that runs on every request forever. [22:18:21] Ideally, at some point we raise it to 10 for every priv account... And remove those hacks [22:18:40] Moving it to force should start to getting people to change passwords [22:19:02] Eh. [22:19:03] Fine. [22:19:17] deployment clear? [22:19:22] Krinkle: No. [22:19:22] rolling out https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/VisualEditor/+/537523/ next [22:19:31] OK. waiting then :) [22:19:39] Krinkle: group0 to wmf.23 is merged but not deployed; I pinged twentyafterfour. [22:20:17] https://test.wikipedia.org/wiki/Special:Version [22:20:19] it is deployed? [22:20:31] To test wikis but not e.g. mediawiki.org [22:20:37] "21:07 twentyafterfour@deploy1001: Finished scap: testwikis to 1.34.0-wmf.23 refs T220748 (duration: 24m 55s) [22:20:37] " [22:20:37] T220748: 1.34.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T220748 [22:20:58] Right [22:20:59] b8b93d5d366a74fcdb79ca1affa46f233cf20292 [22:21:04] is merged but not deployed [22:21:21] I saw another wmf-config deploy since then though [22:21:30] (yours) [22:22:10] anyhow, will give it another 30min [22:22:14] (03PS6) 10Jforrester: Enforce a 10-byte password for privileged users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479570 (https://phabricator.wikimedia.org/T208246) [22:22:25] Yes, I found that it wasn't deployed when I tried to do mine and pulled a lot more than I anticipated. [22:22:32] Krinkle: Just deploy already. 
[22:25:21] 10Operations, 10ops-eqiad, 10DC-Ops: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC) - https://phabricator.wikimedia.org/T227541 (10RobH) Please note this is an issue that is happening on ALL the new PDUs. I'll update the parent task. [22:25:27] 10Operations, 10MediaWiki-extensions-OATHAuth: Cannot enable 2FA on testwiki - https://phabricator.wikimedia.org/T233146 (10Reedy) p:05Triage→03High And as T231786 was only on testwiki... That makes it all the more suspect Not necessarily a bug in OATHAuth... But maybe something to do with servers/caching... [22:26:44] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) It seems that when the new PDU goes into place, it fails the icinga checks for: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=ps1-a7-eqiad... [22:27:15] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [22:27:17] 10Operations, 10ops-eqiad, 10DC-Ops: ps1 eqiad Icinga UNKNOWNs - https://phabricator.wikimedia.org/T229328 (10RobH) [22:28:25] 10Operations, 10MediaWiki-extensions-OATHAuth: Cannot enable 2FA on testwiki - https://phabricator.wikimedia.org/T233146 (10Reedy) If we look at the original error in T231786... ` [XWzSlgpAMFgAAJHPyUAAAABG] /w/index.php?title=Special:Manage_Two-factor_authentication&action=enable&module=totp BadMethodCallExce... 
[22:29:31] (CR) Cwhite: [C: +2] prometheus: make statsd.relay-address toggle-able [puppet] - https://gerrit.wikimedia.org/r/536365 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[22:29:40] (PS3) Cwhite: prometheus: make statsd.relay-address toggle-able [puppet] - https://gerrit.wikimedia.org/r/536365 (https://phabricator.wikimedia.org/T205870)
[22:30:34] (CR) Cwhite: profile: use prometheus for logstash alerting (1 comment) [puppet] - https://gerrit.wikimedia.org/r/536358 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[22:30:36] (CR) Dzahn: "Should this be adjusted to the codfw roles or just be deleted entirely?" [puppet] - https://gerrit.wikimedia.org/r/537536 (owner: Dzahn)
[22:33:57] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[22:42:55] (PS1) Cwhite: hiera: disable statsd relay_address on logstash nodes [puppet] - https://gerrit.wikimedia.org/r/537561 (https://phabricator.wikimedia.org/T205870)
[22:43:19] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[22:43:36] !log dzahn@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset
[22:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:52] !log dzahn@cumin1001 Updating IPMI password on 6 hosts - dzahn@cumin1001
[22:43:52] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0)
[22:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:53:26] Operations, Cloud-Services: clarification of cloud terms of use regarding LDAP servers - https://phabricator.wikimedia.org/T233158 (Krenair) I don't see how having read-only replicas changes the real problems involved, maybe we should just add the missing 's' to 'servers'? Labs instances should never *se...
[22:53:47] * Krinkle staging on mwdebug1002
[23:00:05] MaxSem, RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190917T2300).
[23:00:05] No GERRIT patches in the queue for this window AFAICS.
[23:01:34] (PS2) Cwhite: profile: disable statsd relay_address on logstash nodes [puppet] - https://gerrit.wikimedia.org/r/537561 (https://phabricator.wikimedia.org/T205870)
[23:01:53] Operations, MediaWiki-extensions-OATHAuth: Cannot enable 2FA on testwiki - https://phabricator.wikimedia.org/T233146 (Reedy) I'm getting an early night (for me!) and will look at this again sometime tomorrow. I'll steal a mwdebug server and have a look at what said array apparently includes
[23:10:02] (PS3) Cwhite: hiera: disable statsd relay_address on logstash nodes [puppet] - https://gerrit.wikimedia.org/r/537561 (https://phabricator.wikimedia.org/T205870)
[23:12:20] (CR) Gergő Tisza: [C: +1] Enforce a 10-byte password for privileged users [mediawiki-config] - https://gerrit.wikimedia.org/r/479570 (https://phabricator.wikimedia.org/T208246) (owner: Jforrester)
[23:12:45] (PS4) Cwhite: hiera: disable statsd relay_address on logstash nodes [puppet] - https://gerrit.wikimedia.org/r/537561 (https://phabricator.wikimedia.org/T205870)
[23:14:23] (PS5) Cwhite: hiera: disable statsd relay_address on logstash nodes [puppet] - https://gerrit.wikimedia.org/r/537561 (https://phabricator.wikimedia.org/T205870)
[23:17:30] * Krinkle staging on mwdebug1002
[23:18:44] (PS6) Cwhite: hiera: disable statsd relay_address on logstash nodes [puppet] - https://gerrit.wikimedia.org/r/537561 (https://phabricator.wikimedia.org/T205870)
[23:19:24] syncing now
[23:20:23] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.22/extensions/VisualEditor/extension.json: aae62a87be3c954378b07dfb881f79a4f73c5def (duration: 01m 05s)
[23:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:20:33] (PS7) Cwhite: hiera: disable statsd relay_address on logstash nodes [puppet] - https://gerrit.wikimedia.org/r/537561 (https://phabricator.wikimedia.org/T205870)
[23:22:03] twentyafterfour: done
[23:22:36] ok I'll sync group0 now
[23:23:10] (PS8) Cwhite: hiera: disable statsd relay_address on logstash nodes [puppet] - https://gerrit.wikimedia.org/r/537561 (https://phabricator.wikimedia.org/T205870)
[23:25:32] (PS1) Nuria: Adding config for friendly values on netflow dataset [puppet] - https://gerrit.wikimedia.org/r/537564 (https://phabricator.wikimedia.org/T229682)
[23:25:57] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.34.0-wmf.23 refs T220748
[23:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:26:02] T220748: 1.34.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T220748
[23:28:57] Operations, VisualEditor: Something went wrong HTTP 404 when using Visual Editor - https://phabricator.wikimedia.org/T224384 (matmarex)
[23:35:49] Operations, Patch-For-Review, User-Ladsgroup, User-Urbanecm, Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (Ladsgroup) a:Ladsgroup The fix got merged, I will try this later tomorrow.
[23:37:06] (CR) Ladsgroup: [C: -1] "This needs moving some configs from IS.php to VC.php." [mediawiki-config] - https://gerrit.wikimedia.org/r/498317 (https://phabricator.wikimedia.org/T218155) (owner: Urbanecm)
[23:40:18] (PS9) Cwhite: hiera: disable statsd relay_address on logstash nodes [puppet] - https://gerrit.wikimedia.org/r/537561 (https://phabricator.wikimedia.org/T205870)
[23:42:25] (PS10) Cwhite: hiera: disable statsd relay_address on logstash nodes [puppet] - https://gerrit.wikimedia.org/r/537561 (https://phabricator.wikimedia.org/T205870)
[23:44:44] (PS11) Cwhite: hiera: disable statsd relay_address on logstash nodes [puppet] - https://gerrit.wikimedia.org/r/537561 (https://phabricator.wikimedia.org/T205870)
[23:47:05] (PS12) Cwhite: hiera: disable statsd relay_address on logstash nodes [puppet] - https://gerrit.wikimedia.org/r/537561 (https://phabricator.wikimedia.org/T205870)