[00:23:48] (PS1) Ayounsi: Netflow: install and configure Samplicator [puppet] - https://gerrit.wikimedia.org/r/529841 (https://phabricator.wikimedia.org/T226810)
[00:24:43] (CR) jerkins-bot: [V: -1] Netflow: install and configure Samplicator [puppet] - https://gerrit.wikimedia.org/r/529841 (https://phabricator.wikimedia.org/T226810) (owner: Ayounsi)
[00:45:13] (PS2) Ayounsi: Netflow: install and configure Samplicator [puppet] - https://gerrit.wikimedia.org/r/529841 (https://phabricator.wikimedia.org/T226810)
[00:50:10] (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1001/17867/netflow1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/529841 (https://phabricator.wikimedia.org/T226810) (owner: Ayounsi)
[00:58:30] XioNoX: is there a reason https://phabricator.wikimedia.org/T226810 is private?
[00:59:47] legoktm: yup, it's DDoS related work and discussions, so a bit sensitive
[01:02:01] ok, just a bit weird that there's a public patch with a private task
[01:02:18] and it's not even WMF-NDA or something less restrictive
[01:05:10] yeah the public changes are what we agreed are okay to be public in the task :)
[01:05:37] it could probably be less restrictive than SRE, I'll ask and update it if fine
[02:35:09] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 176571272 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:39:57] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 71880 and 81 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:34:43] (PS3) Vgutierrez: x509: Expose the OCSP URI of a Certificate as a property [software/acme-chief] - https://gerrit.wikimedia.org/r/516604 (https://phabricator.wikimedia.org/T219765)
[03:34:45] (PS3) Vgutierrez: ocsp: Provide basic functionality to perform OCSP requests [software/acme-chief] - https://gerrit.wikimedia.org/r/529202 (https://phabricator.wikimedia.org/T219765)
[03:35:16] (CR) Vgutierrez: "thx for the review :)" (1 comment) [software/acme-chief] - https://gerrit.wikimedia.org/r/516604 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[03:47:22] (CR) Vgutierrez: "> Patch Set 2:" (2 comments) [software/acme-chief] - https://gerrit.wikimedia.org/r/529202 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[03:48:38] (CR) Vgutierrez: "Brief example (wp.crt is the unified cert issued by globalsign):" [software/acme-chief] - https://gerrit.wikimedia.org/r/529202 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[04:02:29] (PS4) Vgutierrez: ocsp: Provide basic functionality to perform OCSP requests [software/acme-chief] - https://gerrit.wikimedia.org/r/529202 (https://phabricator.wikimedia.org/T219765)
[04:52:20] (PS1) Marostegui: db2121: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/529846 (https://phabricator.wikimedia.org/T228969)
[04:53:34] (CR) Marostegui: [C: +2] db2121: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/529846 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[04:59:55] (PS1) Marostegui: wmnet: Point m3-master codfw to dbproxy2003 [dns] - https://gerrit.wikimedia.org/r/529847 (https://phabricator.wikimedia.org/T202367)
[05:06:18] (PS1) Marostegui: db-eqiad,db-codfw.php: Provision db2122 into s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/529848 (https://phabricator.wikimedia.org/T228969)
[05:09:08] (CR) Vgutierrez: [C: +1] db-eqiad,db-codfw.php: Provision db2122 into s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/529848 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[05:09:29] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Provision db2122 into s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/529848 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[05:10:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Provision db2122 into s7 T228969', diff saved to https://phabricator.wikimedia.org/P8903 and previous config saved to /var/cache/conftool/dbconfig/20190813-051019-marostegui.json
[05:10:25] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Provision db2122 into s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/529848 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[05:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:10:30] T228969: Productionize db21[21-31} - https://phabricator.wikimedia.org/T228969
[05:10:41] (CR) jenkins-bot: db-eqiad,db-codfw.php: Provision db2122 into s7 [mediawiki-config] - https://gerrit.wikimedia.org/r/529848 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[05:11:31] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Provision db2122 into s7 T228969 (duration: 00m 49s)
[05:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:12:33] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Provision db2122 into s7 T228969 (duration: 00m 47s)
[05:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:37] Operations, DBA, decommission: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (Marostegui)
[05:27:54] Operations, DBA, decommission: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (Marostegui) p:Triage→Normal
[05:29:04] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[05:29:10] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db2050 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/529849 (https://phabricator.wikimedia.org/T230391)
[05:29:34] Operations, DBA, decommission, Patch-For-Review: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (Marostegui)
[05:30:46] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Remove db2050 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/529849 (https://phabricator.wikimedia.org/T230391) (owner: Marostegui)
[05:31:38] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db2050 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/529849 (https://phabricator.wikimedia.org/T230391) (owner: Marostegui)
[05:31:52] (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db2050 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/529849 (https://phabricator.wikimedia.org/T230391) (owner: Marostegui)
[05:32:51] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2050 from config T230391 (duration: 00m 48s)
[05:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:32:59] T230391: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391
[05:33:44] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2050 from config T230391 (duration: 00m 48s)
[05:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2050 from config, host will be decommissioned T230391', diff saved to https://phabricator.wikimedia.org/P8904 and previous config saved to /var/cache/conftool/dbconfig/20190813-053514-marostegui.json
[05:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:38:08] (PS1) Marostegui: mariadb: Decommissioned db2050 [puppet] - https://gerrit.wikimedia.org/r/529850 (https://phabricator.wikimedia.org/T230391)
[05:39:35] Operations, DBA, decommission, Patch-For-Review: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (Marostegui)
[05:39:41] (PS2) Marostegui: mariadb: Decommission db2050 [puppet] - https://gerrit.wikimedia.org/r/529850 (https://phabricator.wikimedia.org/T230391)
[05:40:18] !log Remove db2050 from tendril and zarcillo T230391
[05:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:26] T230391: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391
[05:40:52] (CR) Marostegui: [C: +2] mariadb: Decommission db2050 [puppet] - https://gerrit.wikimedia.org/r/529850 (https://phabricator.wikimedia.org/T230391) (owner: Marostegui)
[05:43:59] (PS1) Marostegui: install_server: Remove db2043 [puppet] - https://gerrit.wikimedia.org/r/529851 (https://phabricator.wikimedia.org/T230311)
[05:44:39] (CR) Marostegui: [C: +2] install_server: Remove db2043 [puppet] - https://gerrit.wikimedia.org/r/529851 (https://phabricator.wikimedia.org/T230311) (owner: Marostegui)
[05:47:30] !log Stop mysql on db2050 - T230391
[05:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:39] T230391: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391
[05:48:08] Operations, ops-codfw, DC-Ops, decommission: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (Marostegui) a:Marostegui→RobH
[05:48:22] Operations, ops-codfw, DC-Ops, decommission: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (Marostegui) This host is ready for #dc-ops to decommission
[05:48:43] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[05:50:03] Operations, ops-codfw, DC-Ops, decommission: Decommission db2069.codfw.wmnet - https://phabricator.wikimedia.org/T230107 (Marostegui)
[05:57:04] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1009 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[05:57:06] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:57:36] PROBLEM - WDQS HTTP Port on wdqs1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[05:58:00] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:01:24] ACKNOWLEDGEMENT - Blazegraph Port for wdqs-blazegraph on wdqs1009 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused Stas Malychev Running some tests for dumplicate terms. - The acknowledgement expires at: 2019-08-14 06:00:46. https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:01:24] ACKNOWLEDGEMENT - Blazegraph process -wdqs-blazegraph- on wdqs1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war Stas Malychev Running some tests for dumplicate terms. - The acknowledgement expires at: 2019-08-14 06:00:46. https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:01:24] ACKNOWLEDGEMENT - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Stas Malychev Running some tests for dumplicate terms. - The acknowledgement expires at: 2019-08-14 06:00:46. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:01:24] ACKNOWLEDGEMENT - WDQS HTTP Port on wdqs1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time Stas Malychev Running some tests for dumplicate terms. - The acknowledgement expires at: 2019-08-14 06:00:46. https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[06:03:19] (PS3) Ema: ATS: unset Accept-Encoding [puppet] - https://gerrit.wikimedia.org/r/529810 (https://phabricator.wikimedia.org/T227432)
[06:05:12] (CR) Vgutierrez: [C: +1] ATS: unset Accept-Encoding [puppet] - https://gerrit.wikimedia.org/r/529810 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[06:05:29] (CR) Ema: [C: +2] ATS: unset Accept-Encoding [puppet] - https://gerrit.wikimedia.org/r/529810 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[06:09:10] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs1009 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:09:48] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:09:50] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:11:15] !log Upgrading ATS to 8.0.3-1wm3 in cp2002, cp1076, cp3034 and cp4021 - T221594
[06:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:23] T221594: Puppetize ATS TLS configuration for incoming traffic - https://phabricator.wikimedia.org/T221594
[06:11:42] PROBLEM - snapshot of s3 in codfw on db1115 is CRITICAL: snapshot for s3 at codfw taken more than 4 days ago: Most recent backup 2019-08-09 06:05:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[06:14:38] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:16:10] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1009 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:38:18] !log upgrading fifo-log-demux to version 0.5 in cache@upload
[06:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:24] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:39:30] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:49:36] !log Rolling restart of fifo-log-demux and atsmtail services across cache@upload
[06:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:20] !log upgrading spicerack to 0.0.26 on cumin2001
[06:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:01] (PS1) Ema: ATS: deployment-ms-fe02 renamed to fe03 [puppet] - https://gerrit.wikimedia.org/r/529866
[07:24:38] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[07:31:16] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[07:32:04] Operations, DBA, decommission: Decommission db2057.codfw.wmnet - https://phabricator.wikimedia.org/T230394 (Marostegui)
[07:35:30] Operations, DBA, decommission: Decommission db2057.codfw.wmnet - https://phabricator.wikimedia.org/T230394 (Marostegui) p:Triage→Normal
[07:37:53] (CR) Ema: [C: +2] ATS: deployment-ms-fe02 renamed to fe03 [puppet] - https://gerrit.wikimedia.org/r/529866 (owner: Ema)
[07:39:42] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db2057 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/529912 (https://phabricator.wikimedia.org/T230394)
[07:40:38] RECOVERY - WDQS HTTP Port on wdqs1009 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.086 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[07:49:57] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Remove db2057 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/529912 (https://phabricator.wikimedia.org/T230394) (owner: Marostegui)
[07:50:56] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db2057 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/529912 (https://phabricator.wikimedia.org/T230394) (owner: Marostegui)
[07:51:09] (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db2057 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/529912 (https://phabricator.wikimedia.org/T230394) (owner: Marostegui)
[07:54:08] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2057 from config T230394 (duration: 00m 48s)
[07:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:18] T230394: Decommission db2057.codfw.wmnet - https://phabricator.wikimedia.org/T230394
[07:54:56] Operations, DBA, decommission, Patch-For-Review: Decommission db2057.codfw.wmnet - https://phabricator.wikimedia.org/T230394 (Marostegui)
[07:55:03] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2057 from config T230394 (duration: 00m 47s)
[07:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:03] (PS1) Filippo Giunchedi: prometheus: fetch active netmon server from hiera [puppet] - https://gerrit.wikimedia.org/r/529914 (https://phabricator.wikimedia.org/T148541)
[08:33:27] (CR) jerkins-bot: [V: -1] prometheus: fetch active netmon server from hiera [puppet] - https://gerrit.wikimedia.org/r/529914 (https://phabricator.wikimedia.org/T148541) (owner: Filippo Giunchedi)
[08:37:30] (PS2) Filippo Giunchedi: prometheus: fetch active netmon server from hiera [puppet] - https://gerrit.wikimedia.org/r/529914 (https://phabricator.wikimedia.org/T148541)
[08:41:48] (CR) Alex Monk: [C: +2] x509: Expose the OCSP URI of a Certificate as a property [software/acme-chief] - https://gerrit.wikimedia.org/r/516604 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[08:44:29] (Merged) jenkins-bot: x509: Expose the OCSP URI of a Certificate as a property [software/acme-chief] - https://gerrit.wikimedia.org/r/516604 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[08:44:55] (CR) Alex Monk: [C: +2] "We should add some tests but as this is new code that's not used by anything yet this is harmless and good to go, we can followup later" [software/acme-chief] - https://gerrit.wikimedia.org/r/529202 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[08:45:17] (PS1) Marostegui: mariadb: Decommission db2057 [puppet] - https://gerrit.wikimedia.org/r/529915 (https://phabricator.wikimedia.org/T230394)
[08:46:28] (CR) Marostegui: [C: +2] mariadb: Decommission db2057 [puppet] - https://gerrit.wikimedia.org/r/529915 (https://phabricator.wikimedia.org/T230394) (owner: Marostegui)
[08:46:43] Operations, DBA, decommission, Patch-For-Review: Decommission db2057.codfw.wmnet - https://phabricator.wikimedia.org/T230394 (Marostegui)
[08:47:13] (CR) jenkins-bot: x509: Expose the OCSP URI of a Certificate as a property [software/acme-chief] - https://gerrit.wikimedia.org/r/516604 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[08:47:40] (Merged) jenkins-bot: ocsp: Provide basic functionality to perform OCSP requests [software/acme-chief] - https://gerrit.wikimedia.org/r/529202 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[08:48:45] !log Remove db2057 from tendril and zarcillo T230394
[08:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:53] T230394: Decommission db2057.codfw.wmnet - https://phabricator.wikimedia.org/T230394
[08:49:40] !log Stop MySQL on db2057 - T230394
[08:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:23] (CR) jenkins-bot: ocsp: Provide basic functionality to perform OCSP requests [software/acme-chief] - https://gerrit.wikimedia.org/r/529202 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[08:50:25] Operations, ops-codfw, decommission: Decommission db2057.codfw.wmnet - https://phabricator.wikimedia.org/T230394 (Marostegui) a:Marostegui→RobH
[08:50:38] Operations, ops-codfw, DC-Ops, decommission: Decommission db2057.codfw.wmnet - https://phabricator.wikimedia.org/T230394 (Marostegui) This host is ready for #dc-ops to decommission
[08:51:12] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[08:54:24] (CR) Filippo Giunchedi: [C: +2] prometheus: fetch active netmon server from hiera [puppet] - https://gerrit.wikimedia.org/r/529914 (https://phabricator.wikimedia.org/T148541) (owner: Filippo Giunchedi)
[08:54:32] (PS3) Filippo Giunchedi: prometheus: fetch active netmon server from hiera [puppet] - https://gerrit.wikimedia.org/r/529914 (https://phabricator.wikimedia.org/T148541)
[09:12:22] (PS1) Giuseppe Lavagetto: envoyproxy: support debian jessie [puppet] - https://gerrit.wikimedia.org/r/529919
[09:12:23] (PS1) Giuseppe Lavagetto: envoyproxy: use the hot restarter [puppet] - https://gerrit.wikimedia.org/r/529920
[09:13:23] (CR) jerkins-bot: [V: -1] envoyproxy: support debian jessie [puppet] - https://gerrit.wikimedia.org/r/529919 (owner: Giuseppe Lavagetto)
[09:13:35] (CR) jerkins-bot: [V: -1] envoyproxy: use the hot restarter [puppet] - https://gerrit.wikimedia.org/r/529920 (owner: Giuseppe Lavagetto)
[09:37:39] (PS1) Effie Mouzeli: mw-maintenance and scap: Revert changes for PHP7 transition [puppet] - https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392)
[09:42:44] (CR) Jbond: [C: -1] "LGTM: most commets are optional nitpicks but please use lookup() instead of hiera()" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/529841 (https://phabricator.wikimedia.org/T226810) (owner: Ayounsi)
[09:43:07] (CR) Effie Mouzeli: [C: -1] "commit message should include which commits we revert" [puppet] - https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392) (owner: Effie Mouzeli)
[09:43:22] (PS1) Gehel: elasticsearch: wait for write queue to be empty after cluster operation [cookbooks] - https://gerrit.wikimedia.org/r/529922
[09:45:35] (PS1) Filippo Giunchedi: mediawiki: add cluster latency alerts [puppet] - https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396)
[09:45:37] (CR) jerkins-bot: [V: -1] elasticsearch: wait for write queue to be empty after cluster operation [cookbooks] - https://gerrit.wikimedia.org/r/529922 (owner: Gehel)
[09:46:32] (CR) jerkins-bot: [V: -1] mediawiki: add cluster latency alerts [puppet] - https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) (owner: Filippo Giunchedi)
[09:48:18] (PS2) Gehel: elasticsearch: wait for write queue to be empty after cluster operation [cookbooks] - https://gerrit.wikimedia.org/r/529922
[09:50:03] (CR) jerkins-bot: [V: -1] elasticsearch: wait for write queue to be empty after cluster operation [cookbooks] - https://gerrit.wikimedia.org/r/529922 (owner: Gehel)
[09:51:20] (PS3) Gehel: elasticsearch: wait for write queue to be empty after cluster operation [cookbooks] - https://gerrit.wikimedia.org/r/529922
[09:51:59] (PS1) Effie Mouzeli: Send 33.3% of anonymous users to PHP7.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/529924 (https://phabricator.wikimedia.org/T219150)
[09:52:57] (CR) jerkins-bot: [V: -1] elasticsearch: wait for write queue to be empty after cluster operation [cookbooks] - https://gerrit.wikimedia.org/r/529922 (owner: Gehel)
[09:53:23] Operations, DNS, Traffic, Wikimedia-Apache-configuration, Patch-For-Review: Sundown aliases `minnan` and `zh-cfr` for `nan`/`zh-min-nan` - https://phabricator.wikimedia.org/T230382 (Peachey88) Has there been any checks on the usage of these aliases yet?
[09:53:39] (PS2) Filippo Giunchedi: mediawiki: add cluster latency alerts [puppet] - https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396)
[09:55:51] (PS4) Gehel: elasticsearch: wait for write queue to be empty after cluster operation [cookbooks] - https://gerrit.wikimedia.org/r/529922
[09:56:37] (PS1) Pmiazga: Undeploy editor gender surveys [mediawiki-config] - https://gerrit.wikimedia.org/r/529925 (https://phabricator.wikimedia.org/T227793)
[09:56:54] (PS2) Pmiazga: Undeploy editor gender surveys [mediawiki-config] - https://gerrit.wikimedia.org/r/529925 (https://phabricator.wikimedia.org/T227793)
[09:58:24] !log upgrading the rest of cache@upload to 8.0.3-1wm3 - T221594
[09:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:33] T221594: Puppetize ATS TLS configuration for incoming traffic - https://phabricator.wikimedia.org/T221594
[09:59:45] (PS1) Fsero: Introducing podsecpolicies,calico and coredns in eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/529926
[10:00:29] (PS2) Fsero: Introducing podsecpolicies,calico and coredns in eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/529926 (https://phabricator.wikimedia.org/T228836)
[10:02:13] (CR) Fsero: [V: +2 C: +2] Introducing podsecpolicies,calico and coredns in eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/529926 (https://phabricator.wikimedia.org/T228836) (owner: Fsero)
[10:06:44] (PS3) Fsero: prometheus, k8s: enabling services prometheus service discovery [puppet] - https://gerrit.wikimedia.org/r/529789
[10:06:46] (PS1) Fsero: caching,k8s: depool eqiad services exposed to cache for cluster recreation. [puppet] - https://gerrit.wikimedia.org/r/529927 (https://phabricator.wikimedia.org/T228836)
[10:07:22] (PS2) Fsero: caching,k8s: depool eqiad services exposed to cache for cluster recreation. [puppet] - https://gerrit.wikimedia.org/r/529927 (https://phabricator.wikimedia.org/T228836)
[10:07:35] (CR) Filippo Giunchedi: "Thresholds will likely need tuning but should be good as a first step, let me know what you think!" [puppet] - https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) (owner: Filippo Giunchedi)
[10:10:21] !log creating tiller in kube-system for helmfile T228836
[10:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:39] T228836: recreate eqiad cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] - https://phabricator.wikimedia.org/T228836
[10:10:45] !log initialize_cluster.sh kube-system kubemaster.svc.eqiad.wmnet 6443 - T228836
[10:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:20] (CR) Giuseppe Lavagetto: [C: +1] caching,k8s: depool eqiad services exposed to cache for cluster recreation. [puppet] - https://gerrit.wikimedia.org/r/529927 (https://phabricator.wikimedia.org/T228836) (owner: Fsero)
[10:17:56] (CR) Fsero: [C: +2] caching,k8s: depool eqiad services exposed to cache for cluster recreation. [puppet] - https://gerrit.wikimedia.org/r/529927 (https://phabricator.wikimedia.org/T228836) (owner: Fsero)
[10:19:34] (CR) Ema: [C: +1] caching,k8s: depool eqiad services exposed to cache for cluster recreation. [puppet] - https://gerrit.wikimedia.org/r/529927 (https://phabricator.wikimedia.org/T228836) (owner: Fsero)
[10:23:23] !log fsero@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=sessionstore|citoid|cxserver|eventgate-analytics|eventgate-main|termbox|blubberoid|mathoid|zotero,name=eqiad
[10:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:32] !log rolling update of ghostscript
[10:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:21] <_joe_> !log deleting calico deploy and configmap in kubernetes in eqiad, recreating with helmfile
[10:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:46] !log oblivian@ helmfile [EQIAD] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' .
[10:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:12] <_joe_> !log recreating rbac roles via helmfile
[10:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:32] !log oblivian@ helmfile [EQIAD] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
[10:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:11] !log oblivian@ helmfile [EQIAD] Ran 'apply' command on namespace 'kube-system' for release 'coredns' .
[10:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:59] (PS1) Fsero: helmfile, eqiad: bug: ammending coredns values [deployment-charts] - https://gerrit.wikimedia.org/r/529931
[10:46:33] (CR) Fsero: [V: +2 C: +2] helmfile, eqiad: bug: ammending coredns values [deployment-charts] - https://gerrit.wikimedia.org/r/529931 (owner: Fsero)
[10:47:07] PROBLEM - Check size of conntrack table on kubernetes2006 is CRITICAL: CRITICAL: nf_conntrack is 92 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[10:49:37] !log oblivian@ helmfile [EQIAD] Ran 'apply' command on namespace 'kube-system' for release 'coredns' .
[10:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:26] !log oblivian@cumin1001 START - Cookbook sre.hosts.downtime
[10:56:26] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[10:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:11] !log oblivian@cumin1001 START - Cookbook sre.hosts.downtime
[10:57:11] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[10:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:06] <_joe_> !log [eqiad] downtiming zotero on icinga for 10 minutes while recreating the deployment with helmfile
[10:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190813T1100).
[11:00:04] raynor: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:20] o/
[11:00:59] I have SWAT rights and my patch is the only one, so I can deploy it myself.
[11:01:50] (PS3) Elukey: admin: add analytics-admins and ops to gpu-users [puppet] - https://gerrit.wikimedia.org/r/529101
[11:02:15] jbond42: o/ - time for a quick review? --^
[11:02:28] looking
[11:03:01] PROBLEM - Check size of conntrack table on kubernetes2006 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[11:03:05] (CR) Pmiazga: [C: +2] Undeploy editor gender surveys [mediawiki-config] - https://gerrit.wikimedia.org/r/529925 (https://phabricator.wikimedia.org/T227793) (owner: Pmiazga)
[11:04:08] (Merged) jenkins-bot: Undeploy editor gender surveys [mediawiki-config] - https://gerrit.wikimedia.org/r/529925 (https://phabricator.wikimedia.org/T227793) (owner: Pmiazga)
[11:04:11] !log resetting net.netfilter.nf_conntrack_tcp_timeout_time_wait to 65 in kubernetes2006
[11:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:24] (CR) jenkins-bot: Undeploy editor gender surveys [mediawiki-config] - https://gerrit.wikimedia.org/r/529925 (https://phabricator.wikimedia.org/T227793) (owner: Pmiazga)
[11:05:33] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - zotero_1969: Servers kubernetes1002.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:05:39] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - zotero_1969: Servers kubernetes1002.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:06:02] this is expected
[11:06:03] can I proceed with SWAT?
[11:06:13] RECOVERY - Check size of conntrack table on kubernetes2006 is OK: OK: nf_conntrack is 18 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[11:06:15] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/529101 (owner: Elukey)
[11:06:28] !log oblivian@ helmfile [EQIAD] Ran 'apply' command on namespace 'zotero' for release 'production' .
[11:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:38] (CR) Elukey: [C: +2] admin: add analytics-admins and ops to gpu-users [puppet] - https://gerrit.wikimedia.org/r/529101 (owner: Elukey)
[11:07:56] <_joe_> raynor: sure, go on
[11:08:13] thx _joe_
[11:10:19] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:10:25] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:13:28] (PS1) Jbond: apereo_cas: add cname for idp to idp1001 [dns] - https://gerrit.wikimedia.org/r/529934
[11:13:43] !log recreating termbox namespace - T228836
[11:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:51] T228836: recreate eqiad cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] - https://phabricator.wikimedia.org/T228836
[11:15:08] !log pmiazga@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:529925|Undeploy editor gender surveys (T227793)]] (duration: 00m 48s)
[11:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:16] T227793: First round editor gender surveys - https://phabricator.wikimedia.org/T227793
[11:16:09] ok, I'm done; does anyone want to push something more?
[11:18:17] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - termbox_3030: Servers kubernetes1001.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:18:21] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - termbox_3030: Servers kubernetes1001.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:18:43] !log EU SWAT finished [11:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:44] !log fsero@ helmfile [EQIAD] Ran 'apply' command on namespace 'termbox' for release 'production' . [11:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:36] (03CR) 10Jbond: [C: 03+2] apereo_cas: add cname for idp to idp1001 [dns] - 10https://gerrit.wikimedia.org/r/529934 (owner: 10Jbond) [11:21:48] !log recreating citoid eventgate-analytics eventgate-main mathoid namespace - T228836 [11:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:56] T228836: recreate eqiad cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] - https://phabricator.wikimedia.org/T228836 [11:25:08] !log fsero@ helmfile [EQIAD] Ran 'apply' command on namespace 'citoid' for release 'production' . [11:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:01] !log fsero@ helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics' for release 'analytics' . 
[11:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:09] PROBLEM - High lag on wdqs1007 is CRITICAL: 3615 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:29:43] PROBLEM - High lag on wdqs1008 is CRITICAL: 3649 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:29:49] PROBLEM - High lag on wdqs1003 is CRITICAL: 3656 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:29:55] PROBLEM - High lag on wdqs2004 is CRITICAL: 3660 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:29:57] PROBLEM - High lag on wdqs1009 is CRITICAL: 3663 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:29:59] PROBLEM - High lag on wdqs1010 is CRITICAL: 3665 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:30:09] PROBLEM - High lag on wdqs2006 is CRITICAL: 3675 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:30:22] gehel onimisionipe ^ [11:30:26] !log fsero@ helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-main' for release 'main' . 
[11:30:27] PROBLEM - High lag on wdqs2005 is CRITICAL: 3694 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:40] Looking [11:30:46] thanks :) [11:31:54] wdqs updater looks healthy [11:32:23] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:32:30] there was a slight increase in writes, but not much [11:32:58] increased number of banned requests, but that might be a consequence of a slowdown [11:34:39] !log restart wdqs-updater on wdqs2001 [11:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:55] actually, looks like the updater is processing data but not finding any change to apply [11:35:46] !log restart wdqs-blazegraph on wdqs2001 [11:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:20] this looks wrong! [11:36:50] reads seem to be OK, so no user impact, except for the updater lag [11:39:20] !log fsero@ helmfile [EQIAD] Ran 'apply' command on namespace 'mathoid' for release 'production' . [11:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:19] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:41:34] ACKNOWLEDGEMENT - High lag on wdqs1003 is CRITICAL: 4325 ge 3600 Gehel under investigation at https://phabricator.wikimedia.org/T230410 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:41:34] ACKNOWLEDGEMENT - High lag on wdqs1007 is CRITICAL: 4285 ge 3600 Gehel under investigation at https://phabricator.wikimedia.org/T230410 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:41:34] ACKNOWLEDGEMENT - High lag on wdqs1008 is CRITICAL: 4319 ge 3600 Gehel under investigation at https://phabricator.wikimedia.org/T230410 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:41:35] ACKNOWLEDGEMENT - High lag on wdqs1009 is CRITICAL: 4333 ge 3600 Gehel under investigation at https://phabricator.wikimedia.org/T230410 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:41:36] ACKNOWLEDGEMENT - High lag on wdqs1010 is CRITICAL: 4333 ge 3600 Gehel under investigation at https://phabricator.wikimedia.org/T230410 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:41:37] ACKNOWLEDGEMENT - High lag on wdqs2004 is CRITICAL: 4330 ge 3600 Gehel under investigation at https://phabricator.wikimedia.org/T230410 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag 
https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:41:38] ACKNOWLEDGEMENT - High lag on wdqs2005 is CRITICAL: 4266 ge 3600 Gehel under investigation at https://phabricator.wikimedia.org/T230410 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:41:39] ACKNOWLEDGEMENT - High lag on wdqs2006 is CRITICAL: 4344 ge 3600 Gehel under investigation at https://phabricator.wikimedia.org/T230410 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:42:19] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:42:55] gehel: might be related to my eventgate redeploy? [11:43:02] timing matches [11:43:47] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:44:21] !log recreating cxserver blubber and sessionstore namespace - T228836 [11:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:29] T228836: recreate eqiad cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] - https://phabricator.wikimedia.org/T228836 [11:45:05] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 33.94, 34.14, 32.24 https://wikitech.wikimedia.org/wiki/Application_servers [11:45:23] fsero: the timing is suspicious. Do you know what that redeploy actually did? [11:46:03] well just redeploying eventgate analytics and main [11:46:16] maybe some events were reprocessed? [11:47:03] losing the kafka sequence numbers? [11:47:37] or is there a chance that events were requeued?
[11:49:31] i dont know much about the internals, but i guess there is such chance [11:49:42] maybe otto or elukey knows more [11:49:52] !log fsero@ helmfile [EQIAD] Ran 'apply' command on namespace 'blubberoid' for release 'production' . [11:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:19] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - blubberoid_8748: Servers kubernetes1002.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:50:37] nothing is actually burning, I'll go finish lunch and I'll be back [11:51:47] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - blubberoid_8748: Servers kubernetes1001.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:56:32] !log fsero@ helmfile [EQIAD] Ran 'apply' command on namespace 'blubberoid' for release 'production' . [11:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:50] !log fsero@ helmfile [EQIAD] Ran 'apply' command on namespace 'blubberoid' for release 'production' . [11:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:27] <_joe_> gehel: why would wdqs still be connected to eventgate-eqiad? we did depool it ± 1 hour ago [11:59:24] _joe_: I'm assuming event-gate is the input to kafka, but looks like I'm wrong here... [12:00:38] wdqs isn't connected to event gate, it is consuming a kafka queue [12:00:46] !log fsero@ helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' . 
[12:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:51] (03PS1) 10Fsero: helmfile, eqiad,codfw: bug: ammending blubberoid namespace quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/529940 [12:02:49] (03CR) 10Fsero: [V: 03+2 C: 03+2] helmfile, eqiad,codfw: bug: ammending blubberoid namespace quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/529940 (owner: 10Fsero) [12:03:39] !log fsero@ helmfile [EQIAD] Ran 'apply' command on namespace 'sessionstore' for release 'production' . [12:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:04] <_joe_> !log restarted php-fpm on mw1221 [12:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:27] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:09:57] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 32.57, 32.78, 32.03 https://wikitech.wikimedia.org/wiki/Application_servers [12:10:57] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:15:24] (03PS5) 10Jbond: Initial stub role for the IDP (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/528487 (owner: 10Muehlenhoff) [12:17:47] !log fsero@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=sessionstore|citoid|cxserver|eventgate-analytics|eventgate-main|termbox|blubberoid|mathoid|zotero,name=eqiad [12:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:17] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 33.41 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [12:19:25] RECOVERY - High lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 14.94 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [12:19:37] RECOVERY - High lag on wdqs2006 is OK: (C)3600 ge (W)1200 ge 46.89 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [12:19:45] <_joe_> uhm ok so eventgate when switched over to the other datacenter doesn't work as expected it seems [12:19:48] (03CR) 10Jbond: [C: 03+2] Initial stub role for the IDP (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/528487 (owner: 10Muehlenhoff) [12:19:55] RECOVERY - High lag on wdqs2005 is OK: (C)3600 ge (W)1200 ge 28.11 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [12:20:15] RECOVERY - High lag on wdqs1007 is OK: (C)3600 ge (W)1200 ge 35.1 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [12:20:47] RECOVERY - High lag on wdqs1008 is OK: (C)3600 ge (W)1200 ge 28.55 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [12:21:01] RECOVERY - High lag on wdqs2004 is OK: (C)3600 ge (W)1200 ge 41.03 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [12:21:03] RECOVERY - High lag on wdqs1009 is OK: (C)3600 ge (W)1200 ge 109.1 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [12:30:41] RECOVERY - High CPU load on API 
appserver on mw1222 is OK: OK - load average: 12.44, 15.01, 23.43 https://wikitech.wikimedia.org/wiki/Application_servers [12:32:57] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 9.72, 13.59, 23.26 https://wikitech.wikimedia.org/wiki/Application_servers [12:36:43] elukey: around? have time for a chat about event gate? I don't understand what happened [12:39:01] gehel: hey! I am very ignorant about eventgate, and Andrew O. is on holidays.. I think that jijiki is going to follow up with Pchelolo later on (not sure if it is for the same issue). He knows a lot about eventgate [12:39:22] gehel: take a number in line :p [12:39:25] hahaha [12:39:26] I can try to answer basic questions [12:39:33] i can take a look too [12:39:35] but no idea about the internals [12:39:36] lemme try [12:39:48] gehel: I have pinged petr on -services [12:39:57] maybe we could have a discussion together [12:40:31] from what I see on the wdqs side, it looks like starting ~10:30 UTC, we were receiving events that were already processed [12:40:41] I've opened T230410 to track it [12:40:41] gehel: for the time being joe believes that something is up on codfw eventgate [12:40:41] T230410: wdqs updater processing events but not finding anything useful - https://phabricator.wikimedia.org/T230410 [12:41:11] as the wdqs issue along with high load on some mw API servers [12:41:22] happened after we depooled eventgate on eqiad [12:42:07] what events are processed by eventgate? changes from mediawiki? including wikidata? [12:42:43] gehel: 10:30 is when we depooled eqiad eventgate [12:43:24] does that mean that wdqs should not have been receiving any events at all? or were events routed to codfw?
[12:43:25] PROBLEM - High CPU load on API appserver on mw1346 is CRITICAL: CRITICAL - load average: 74.18, 40.34, 26.48 https://wikitech.wikimedia.org/wiki/Application_servers [12:43:31] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 72.78, 39.17, 25.81 https://wikitech.wikimedia.org/wiki/Application_servers [12:43:47] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 70.74, 33.71, 19.22 https://wikitech.wikimedia.org/wiki/Application_servers [12:43:51] PROBLEM - High CPU load on API appserver on mw1314 is CRITICAL: CRITICAL - load average: 99.65, 52.22, 31.28 https://wikitech.wikimedia.org/wiki/Application_servers [12:43:59] PROBLEM - High CPU load on API appserver on mw1317 is CRITICAL: CRITICAL - load average: 74.22, 40.55, 26.76 https://wikitech.wikimedia.org/wiki/Application_servers [12:44:51] (03PS1) 10Fsero: helmfile, eqiad,codfw: bug: several ammends [deployment-charts] - 10https://gerrit.wikimedia.org/r/529941 [12:45:07] RECOVERY - High CPU load on API appserver on mw1341 is OK: OK - load average: 38.01, 36.79, 26.30 https://wikitech.wikimedia.org/wiki/Application_servers [12:45:19] PROBLEM - High CPU load on API appserver on mw1229 is CRITICAL: CRITICAL - load average: 53.48, 26.85, 16.65 https://wikitech.wikimedia.org/wiki/Application_servers [12:45:19] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 57.72, 30.62, 18.43 https://wikitech.wikimedia.org/wiki/Application_servers [12:45:39] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:45:48] (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: wait for write queue to be empty after cluster operation (031 comment) [cookbooks] - 
10https://gerrit.wikimedia.org/r/529922 (owner: 10Gehel) [12:46:17] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:46:17] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:46:21] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:46:51] (03Abandoned) 10Fsero: helmfile, eqiad,codfw: bug: several ammends [deployment-charts] - 10https://gerrit.wikimedia.org/r/529941 (owner: 10Fsero) [12:46:53] gehel: afaik eventgate-main processes all the events except the jobqueue ones, that are currently being ported [12:47:13] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:47:51] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:47:53] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:47:56] what's strange from my point of view, is that wdqs was still reading events from kafka, but did not find anything worth applying [12:48:15] PROBLEM - High CPU load on API appserver on mw1340 is CRITICAL: CRITICAL - load average: 84.14, 53.45, 
34.36 https://wikitech.wikimedia.org/wiki/Application_servers [12:48:15] PROBLEM - High CPU load on API appserver on mw1339 is CRITICAL: CRITICAL - load average: 82.16, 52.03, 33.45 https://wikitech.wikimedia.org/wiki/Application_servers [12:48:21] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 80.94, 53.65, 34.46 https://wikitech.wikimedia.org/wiki/Application_servers [12:48:21] PROBLEM - High CPU load on API appserver on mw1345 is CRITICAL: CRITICAL - load average: 81.22, 54.20, 34.60 https://wikitech.wikimedia.org/wiki/Application_servers [12:48:24] I need to dig a bit more into the updater code, see what the logic for discarding events is. [12:48:27] ok that is different [12:48:29] PROBLEM - High CPU load on API appserver on mw1343 is CRITICAL: CRITICAL - load average: 80.83, 52.57, 33.68 https://wikitech.wikimedia.org/wiki/Application_servers [12:49:16] gehel: from what kafka topic? And you mean jumbo right? [12:49:59] elukey: `--kafka kafka-main2001.codfw.wmnet:9092,kafka-main2002.codfw.wmnet:9092,kafka-main2003.codfw.wmnet:9092 --consumer wdqs2001` [12:50:05] (03PS1) 10Fsero: helmfile, eqiad,codfw: bug: ammending sessionstore namespace quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/529942 [12:50:15] that looks like kafka-main to me [12:50:30] gehel: ah right now it pulls directly from kafka-main [12:50:37] so definitely eventgate-main's events [12:50:58] the topic is hardcoded, lemme check in the code [12:51:30] (03PS2) 10Fsero: helmfile, eqiad,codfw: bug: ammending sessionstore namespace quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/529942 [12:51:41] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:52:07] PROBLEM - recommendation_api endpoints health on
scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:52:21] PROBLEM - High CPU load on API appserver on mw1312 is CRITICAL: CRITICAL - load average: 86.82, 56.41, 35.67 https://wikitech.wikimedia.org/wiki/Application_servers [12:52:41] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:52:44] (03CR) 10Fsero: [V: 03+2 C: 03+2] helmfile, eqiad,codfw: bug: ammending sessionstore namespace quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/529942 (owner: 10Fsero) [12:52:47] elukey: topics: mediawiki.revision-create, mediawiki.page-delete, mediawiki.page-undelete [12:52:47] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:53:03] PROBLEM - High CPU load on API appserver on mw1340 is CRITICAL: CRITICAL - load average: 76.64, 56.70, 40.52 https://wikitech.wikimedia.org/wiki/Application_servers [12:53:03] PROBLEM - High CPU load on API appserver on mw1339 is CRITICAL: CRITICAL - load average: 61.39, 53.70, 39.32 https://wikitech.wikimedia.org/wiki/Application_servers [12:53:07] (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: wait for write queue to be empty after cluster operation (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/529922 (owner: 10Gehel) [12:53:09] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 62.75, 55.04, 40.41 https://wikitech.wikimedia.org/wiki/Application_servers [12:53:09] PROBLEM - High CPU load on API appserver on mw1345 is CRITICAL: CRITICAL - load average: 
66.66, 55.09, 40.21 https://wikitech.wikimedia.org/wiki/Application_servers [12:53:13] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:53:17] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:53:19] RECOVERY - High CPU load on API appserver on mw1229 is OK: OK - load average: 16.38, 23.14, 20.32 https://wikitech.wikimedia.org/wiki/Application_servers [12:53:55] PROBLEM - High CPU load on API appserver on mw1344 is CRITICAL: CRITICAL - load average: 77.47, 52.62, 37.93 https://wikitech.wikimedia.org/wiki/Application_servers [12:54:39] PROBLEM - High CPU load on API appserver on mw1346 is CRITICAL: CRITICAL - load average: 80.98, 57.85, 41.04 https://wikitech.wikimedia.org/wiki/Application_servers [12:54:49] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:54:53] PROBLEM - High CPU load on API appserver on mw1343 is CRITICAL: CRITICAL - load average: 82.57, 61.51, 43.23 https://wikitech.wikimedia.org/wiki/Application_servers [12:54:58] <_joe_> jijiki: ^^ this is hhvm having issues [12:55:26] I know [12:55:33] I am trying to understand [12:55:39] after 12:40 [12:55:42] <_joe_> I'd just restart it [12:55:54] <_joe_> only hhvm [12:55:55] we started having increased load [12:56:05] <_joe_> on the worst affected servers [12:56:12] it is only api again [12:56:34] <_joe_> anyways, I'm off for ~ 1:30 hour [12:56:42] (03PS1) 10Fsero: helmfile, eqiad: bug: ammending sessionstore namespace quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/529943 [12:57:09] PROBLEM - High CPU load on API appserver on
mw1232 is CRITICAL: CRITICAL - load average: 58.61, 32.86, 21.72 https://wikitech.wikimedia.org/wiki/Application_servers [12:57:12] (03CR) 10Fsero: [V: 03+2 C: 03+2] helmfile, eqiad: bug: ammending sessionstore namespace quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/529943 (owner: 10Fsero) [12:57:13] PROBLEM - High CPU load on API appserver on mw1235 is CRITICAL: CRITICAL - load average: 48.25, 25.58, 16.83 https://wikitech.wikimedia.org/wiki/Application_servers [12:57:36] !log Restart hhvm on mw1235 [12:57:37] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:57:41] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:57:43] PROBLEM - High CPU load on API appserver on mw1315 is CRITICAL: CRITICAL - load average: 74.99, 46.60, 33.28 https://wikitech.wikimedia.org/wiki/Application_servers [12:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:45] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:58:31] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:58:47] RECOVERY - High CPU load on API appserver on mw1312 is OK: OK - load average: 22.97, 37.04, 34.00 
https://wikitech.wikimedia.org/wiki/Application_servers [12:58:49] RECOVERY - High CPU load on API appserver on mw1235 is OK: OK - load average: 21.22, 24.14, 17.32 https://wikitech.wikimedia.org/wiki/Application_servers [12:59:09] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:59:11] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:59:13] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:59:17] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:59:29] PROBLEM - High CPU load on API appserver on mw1339 is CRITICAL: CRITICAL - load average: 56.87, 53.92, 44.44 https://wikitech.wikimedia.org/wiki/Application_servers [13:01:57] RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 11.29, 22.27, 20.66 https://wikitech.wikimedia.org/wiki/Application_servers [13:02:29] RECOVERY - High CPU load on API appserver on mw1315 is OK: OK - load average: 22.28, 33.90, 31.84 https://wikitech.wikimedia.org/wiki/Application_servers [13:03:31] RECOVERY - High CPU load on API appserver on mw1344 is OK: OK - load average: 23.04, 31.19, 34.95 https://wikitech.wikimedia.org/wiki/Application_servers [13:04:03] it looks like they are recovering [13:04:52] do we know why? 
[13:05:53] PROBLEM - High CPU load on API appserver on mw1339 is CRITICAL: CRITICAL - load average: 78.52, 51.52, 44.28 https://wikitech.wikimedia.org/wiki/Application_servers [13:05:59] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 77.41, 50.17, 43.71 https://wikitech.wikimedia.org/wiki/Application_servers [13:06:01] PROBLEM - High CPU load on API appserver on mw1345 is CRITICAL: CRITICAL - load average: 73.89, 47.80, 42.88 https://wikitech.wikimedia.org/wiki/Application_servers [13:07:01] !log rolling restart hhvm on api servers in eqiad [13:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:58] +1 jijiki [13:08:01] 10Operations, 10Graphite: scale statsd reporting/aggregation (plan) - https://phabricator.wikimedia.org/T89857 (10fgiunchedi) 05Stalled→03Invalid We're sunsetting Graphite (e.g. {T228380}) so resolving as invalid [13:08:03] 10Operations, 10RESTBase: Investigate apparent restbase request rate under-reporting in graphite: statsd issue? - https://phabricator.wikimedia.org/T89846 (10fgiunchedi) [13:08:08] 10Operations, 10WMDE-Analytics-Engineering, 10Core Platform Team Legacy (Watching / External), 10Graphite, and 3 others: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451 (10fgiunchedi) [13:08:27] 10Operations, 10WMDE-Analytics-Engineering, 10Core Platform Team Legacy (Watching / External), 10Graphite, and 3 others: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451 (10fgiunchedi) 05Open→03Invalid We're sunsetting Graphite (e.g. {T228380}) so resolving as invalid [13:12:04] jijiki: elukey gehel did something changed yesterday at 18:00 UTC https://logstash.wikimedia.org/goto/0c0260fdbc435f182b6392c3a46ea455 ? 
[13:12:13] PROBLEM - High CPU load on API appserver on mw1339 is CRITICAL: CRITICAL - load average: 73.84, 57.97, 49.53 https://wikitech.wikimedia.org/wiki/Application_servers [13:12:19] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (10fgiunchedi) [13:12:37] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (10fgiunchedi) [13:12:39] PROBLEM - High CPU load on API appserver on mw1313 is CRITICAL: CRITICAL - load average: 76.85, 46.21, 34.24 https://wikitech.wikimedia.org/wiki/Application_servers [13:12:41] PROBLEM - High CPU load on API appserver on mw1342 is CRITICAL: CRITICAL - load average: 76.17, 53.43, 42.92 https://wikitech.wikimedia.org/wiki/Application_servers [13:12:42] 10Operations, 10observability, 10User-fgiunchedi: Expand modern metrics infrastructure coverage (2018-19 Q2 goal) - https://phabricator.wikimedia.org/T205862 (10fgiunchedi) [13:13:00] tx fsero [13:14:03] RECOVERY - High CPU load on API appserver on mw1343 is OK: OK - load average: 7.69, 25.49, 33.34 https://wikitech.wikimedia.org/wiki/Application_servers [13:14:29] fsero: not that I know, but that looks like the cirrus checker that starts about that time weekly [13:15:10] nothing changed as far as I know, but yes it seems confined to mediawiki.job.cirrusSearchCheckerJob [13:15:15] (the topic I mean) [13:15:25] RECOVERY - High CPU load on API appserver on mw1340 is OK: OK - load average: 22.52, 27.27, 34.87 https://wikitech.wikimedia.org/wiki/Application_servers [13:15:25] RECOVERY - High CPU load on API appserver on mw1346 is OK: OK - load average: 8.33, 27.35, 34.99 https://wikitech.wikimedia.org/wiki/Application_servers [13:15:51] RECOVERY - High CPU load on API appserver on mw1313 is OK: OK - load average: 25.95, 35.83, 32.41 
https://wikitech.wikimedia.org/wiki/Application_servers [13:15:57] RECOVERY - High CPU load on API appserver on mw1317 is OK: OK - load average: 24.10, 26.85, 35.92 https://wikitech.wikimedia.org/wiki/Application_servers [13:17:15] PROBLEM - High CPU load on API appserver on mw1343 is CRITICAL: CRITICAL - load average: 80.57, 45.60, 39.19 https://wikitech.wikimedia.org/wiki/Application_servers [13:17:29] RECOVERY - High CPU load on API appserver on mw1342 is OK: OK - load average: 6.32, 29.09, 35.67 https://wikitech.wikimedia.org/wiki/Application_servers [13:17:38] (03PS5) 10Gehel: elasticsearch: wait for write queue to be empty after cluster operation [cookbooks] - 10https://gerrit.wikimedia.org/r/529922 [13:18:37] PROBLEM - High CPU load on API appserver on mw1340 is CRITICAL: CRITICAL - load average: 85.86, 48.79, 41.32 https://wikitech.wikimedia.org/wiki/Application_servers [13:18:37] PROBLEM - High CPU load on API appserver on mw1346 is CRITICAL: CRITICAL - load average: 73.76, 44.20, 39.69 https://wikitech.wikimedia.org/wiki/Application_servers [13:18:43] RECOVERY - High CPU load on API appserver on mw1345 is OK: OK - load average: 15.69, 29.09, 35.82 https://wikitech.wikimedia.org/wiki/Application_servers [13:18:50] (03CR) 10Gehel: elasticsearch: wait for write queue to be empty after cluster operation (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/529922 (owner: 10Gehel) [13:19:20] hhvm restarts are at 50% [13:19:24] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: wait for write queue to be empty after cluster operation [cookbooks] - 10https://gerrit.wikimedia.org/r/529922 (owner: 10Gehel) [13:20:12] !log rolling update of postgresql-9.6 [13:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:54] (03PS6) 10Gehel: elasticsearch: wait for write queue to be empty after cluster operation [cookbooks] - 10https://gerrit.wikimedia.org/r/529922 [13:21:33] (03PS1) 10Ema: ATS: set minimum-content-length for compress 
plugin [puppet] - 10https://gerrit.wikimedia.org/r/529944 (https://phabricator.wikimedia.org/T227432) [13:21:35] (03PS1) 10Ema: ATS: enable compress plugin on cp5002 [puppet] - 10https://gerrit.wikimedia.org/r/529945 (https://phabricator.wikimedia.org/T227432) [13:21:55] RECOVERY - High CPU load on API appserver on mw1341 is OK: OK - load average: 21.05, 27.63, 35.67 https://wikitech.wikimedia.org/wiki/Application_servers [13:25:03] RECOVERY - High CPU load on API appserver on mw1339 is OK: OK - load average: 6.19, 20.94, 35.71 https://wikitech.wikimedia.org/wiki/Application_servers [13:26:05] (03CR) 10Mathew.onipe: [C: 03+1] "If it works then active_dc should be added as a required param?" [cookbooks] - 10https://gerrit.wikimedia.org/r/529922 (owner: 10Gehel) [13:26:37] RECOVERY - High CPU load on API appserver on mw1340 is OK: OK - load average: 16.87, 27.96, 35.34 https://wikitech.wikimedia.org/wiki/Application_servers [13:26:39] RECOVERY - High CPU load on API appserver on mw1346 is OK: OK - load average: 19.23, 28.48, 35.19 https://wikitech.wikimedia.org/wiki/Application_servers [13:26:55] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 10.57, 14.50, 23.09 https://wikitech.wikimedia.org/wiki/Application_servers [13:27:39] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITI [13:27:39] view mobile HTML for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:28:29] RECOVERY - High CPU load on API appserver on mw1343 is OK: OK - load average: 16.48, 25.51, 34.37 
https://wikitech.wikimedia.org/wiki/Application_servers [13:29:15] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:30:24] elukey: it looks ok fow now [13:30:27] for* [13:32:19] (03CR) 10Ema: [C: 03+2] ATS: set minimum-content-length for compress plugin [puppet] - 10https://gerrit.wikimedia.org/r/529944 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [13:34:05] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:35:43] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:36:39] RECOVERY - High CPU load on API appserver on mw1314 is OK: OK - load average: 19.69, 19.93, 35.60 https://wikitech.wikimedia.org/wiki/Application_servers [13:42:59] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 10.99, 11.27, 22.69 https://wikitech.wikimedia.org/wiki/Application_servers [13:45:32] (03CR) 10Effie Mouzeli: mediawiki: add cluster latency alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [13:46:37] (03PS1) 10Jbond: idp: enable idp role on idp1001 [puppet] - 10https://gerrit.wikimedia.org/r/529946 [13:54:53] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:55:38] (03CR) 10Jbond: [C: 03+2] idp: enable idp role on idp1001 [puppet] - 10https://gerrit.wikimedia.org/r/529946 (owner: 10Jbond) [13:59:02] (03CR) 10Gehel: "> Patch Set 6: Code-Review+1" [cookbooks] - 10https://gerrit.wikimedia.org/r/529922 (owner: 10Gehel) [13:59:43] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:02:36] jakob_WMDE and I are going to smoke test the termbox service on eqiad. We'll be keeping below 5 req/s so this should be totally negligible to the api appservers; just putting load on our service [14:02:48] (03CR) 10CDanis: mediawiki: add cluster latency alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [14:03:15] (03PS1) 10Jbond: idp: ensure we include the passwords class [puppet] - 10https://gerrit.wikimedia.org/r/529948 [14:03:41] (03CR) 10Gehel: [C: 03+2] elasticsearch: wait for write queue to be empty after cluster operation [cookbooks] - 10https://gerrit.wikimedia.org/r/529922 (owner: 10Gehel) [14:04:40] !log volans@cumin2001 START - Cookbook sre.hosts.decommission [14:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:47] !log volans@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [14:04:51] 10Operations, 10serviceops: Migrate pool counters to Buster - https://phabricator.wikimedia.org/T224572 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by volans@cumin2001 for hosts: `poolcounter2001.codfw.wmnet` - poolcounter2001.codfw.wmnet - Removed from Puppet master and PuppetDB - Do... 
[14:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:37] (03CR) 10CDanis: dbctl: add note & candidate_master fields (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/529396 (https://phabricator.wikimedia.org/T229677) (owner: 10CDanis) [14:07:05] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:07:27] paravoid: ^^^ :) [14:08:07] :) [14:14:05] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:14:46] !log disable all peering and transit on cr2-eqdfw [14:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:59] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:18:34] !log reboot cr2-eqdfw for software upgrade - T227886 [14:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:30] (03CR) 10Jbond: [C: 03+2] idp: ensure we include the passwords class [puppet] - 10https://gerrit.wikimedia.org/r/529948 (owner: 10Jbond) [14:21:57] (03CR) 10Filippo Giunchedi: mediawiki: add cluster latency alerts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [14:22:17] (03PS3) 10Filippo Giunchedi: mediawiki: add cluster latency alerts [puppet] - 10https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) [14:24:53] (03CR) 10Filippo Giunchedi: mediawiki: add cluster latency alerts (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [14:25:46] (03CR) 10Effie Mouzeli: mediawiki: add cluster latency alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [14:28:54] cr2-eqdfw is back up [14:29:55] !log rollback: disable all peering and transit on cr2-eqdfw [14:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:09] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot [14:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:28] !log gehel@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=99) [14:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:51] (03PS4) 10Mathew.onipe: remote: make RemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/529802 [14:33:54] (03PS1) 10Jbond: idp: include the correct password class [puppet] - 10https://gerrit.wikimedia.org/r/529955 [14:36:37] 10Operations, 10ops-eqiad, 10vm-requests: rack/setup/install ganeti10([09]|1[0-8[).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10Cmjohnson) [14:36:44] (03CR) 10jerkins-bot: [V: 04-1] remote: make RemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/529802 (owner: 10Mathew.onipe) [14:38:48] (03CR) 10Volans: [C: 03+1] "LGTM for now, for logged in people will be the same" [software/netbox] - 10https://gerrit.wikimedia.org/r/529171 (owner: 10CRusnov) [14:39:36] (03CR) 10CRusnov: [V: 03+2 C: 03+2] switch swagger to nonpublic mode [software/netbox] - 10https://gerrit.wikimedia.org/r/529171 (owner: 10CRusnov) [14:40:05] (03Abandoned) 10CRusnov: netbox: redirect swagger doc requests to official docs [puppet] - 10https://gerrit.wikimedia.org/r/528531 (owner: 10CRusnov) [14:40:18] (03CR) 10Jbond: [C: 03+2] idp: include the 
correct password class [puppet] - 10https://gerrit.wikimedia.org/r/529955 (owner: 10Jbond) [14:41:59] 10Operations, 10ops-eqiad, 10vm-requests: rack/setup/install ganeti10([09]|1[0-8[).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10Cmjohnson) a:05akosiaris→03Jclark-ctr Please rack, label and cable these servers with the racking locations above. Add them to netbox, be sure to make sure status... [14:42:16] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot [14:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:36] !log gehel@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=99) [14:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:00] (03PS5) 10Mathew.onipe: remote: make RemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/529802 [14:44:06] (03PS4) 10Jhedden: openstack: initial haproxy profile [puppet] - 10https://gerrit.wikimedia.org/r/529436 (https://phabricator.wikimedia.org/T223907) [14:44:35] (03CR) 10Mathew.onipe: remote: make RemoteHosts iterable (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/529802 (owner: 10Mathew.onipe) [14:44:37] (03CR) 10Jhedden: "Thanks for the review. I was using the base profile for common options, and the deployment profiles for local options." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/529436 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [14:45:06] (03CR) 10jerkins-bot: [V: 04-1] openstack: initial haproxy profile [puppet] - 10https://gerrit.wikimedia.org/r/529436 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [14:48:07] (03PS5) 10Jhedden: openstack: initial haproxy profile [puppet] - 10https://gerrit.wikimedia.org/r/529436 (https://phabricator.wikimedia.org/T223907) [14:55:43] (03PS1) 10CRusnov: make netbox tokens available to hosts that need em [labs/private] - 10https://gerrit.wikimedia.org/r/529958 [14:57:10] (03CR) 10CRusnov: [V: 03+2 C: 03+2] make netbox tokens available to hosts that need em [labs/private] - 10https://gerrit.wikimedia.org/r/529958 (owner: 10CRusnov) [14:57:28] !log cp5002: reboot for kernel upgrade [14:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:58] (03PS1) 10CRusnov: netbox: Add configuration for Netbox spicerack backend [puppet] - 10https://gerrit.wikimedia.org/r/529959 [15:00:12] (03PS30) 10Daimona Eaytoy: Update AbuseFilter config to keep the status quo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 [15:01:19] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen - https://phabricator.wikimedia.org/T228447 (10cchen) 05Resolved→03Open Thanks @colewhite ! 
[15:01:53] !log increase ospf cost of cr2-eqord<->cr2-eqiad link (+1000) [15:01:57] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen - https://phabricator.wikimedia.org/T228447 (10cchen) 05Open→03Resolved [15:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:43] (03PS3) 10Ayounsi: Netflow: install and configure Samplicator [puppet] - 10https://gerrit.wikimedia.org/r/529841 (https://phabricator.wikimedia.org/T226810) [15:08:18] (03CR) 10CDanis: dbctl: add note & candidate_master fields (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/529396 (https://phabricator.wikimedia.org/T229677) (owner: 10CDanis) [15:08:48] (03PS1) 10Jbond: idp: update the apereo_cas module to accept a content param for keystore [puppet] - 10https://gerrit.wikimedia.org/r/529960 [15:09:56] (03CR) 10Volans: "Some comments inline, looks almost ready to me." (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/529802 (owner: 10Mathew.onipe) [15:11:24] (03CR) 10Volans: [C: 04-1] "One missing nit, looks good otherwise." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/529959 (owner: 10CRusnov) [15:11:41] (03CR) 10Jbond: [C: 03+1] "LGTM one optional nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/529841 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [15:11:51] (03PS4) 10Jdlrobson: Update wgSkipSkins to experiment with not showing skins to users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511321 (https://phabricator.wikimedia.org/T223824) [15:12:13] !log disable all peering and transit on cr2-eqord [15:12:16] (03CR) 10Volans: [C: 03+1] "Ship it!" 
(031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/529396 (https://phabricator.wikimedia.org/T229677) (owner: 10CDanis) [15:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:14] (03CR) 10Jbond: [C: 03+2] idp: update the apereo_cas module to accept a content param for keystore [puppet] - 10https://gerrit.wikimedia.org/r/529960 (owner: 10Jbond) [15:13:35] PROBLEM - Host elastic2054.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:13:39] (03PS2) 10CRusnov: netbox: Add configuration for Netbox spicerack backend [puppet] - 10https://gerrit.wikimedia.org/r/529959 [15:14:09] (03PS2) 10Bstorm: labstore: restore original sense of the load alert with prometheus [puppet] - 10https://gerrit.wikimedia.org/r/528965 (https://phabricator.wikimedia.org/T229884) [15:15:24] (03CR) 10CRusnov: netbox: Add configuration for Netbox spicerack backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/529959 (owner: 10CRusnov) [15:15:46] (03PS4) 10Ayounsi: Netflow: install and configure Samplicator [puppet] - 10https://gerrit.wikimedia.org/r/529841 (https://phabricator.wikimedia.org/T226810) [15:16:17] (03CR) 10Ayounsi: "Thanks! Addressed!" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/529841 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [15:16:29] RECOVERY - Host elastic2054 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [15:16:50] (03CR) 10Ayounsi: Netflow: install and configure Samplicator (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/529841 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [15:18:40] RECOVERY - Host elastic2054.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.99 ms [15:19:13] !log restart cr2-eqord - T227886 [15:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:26] (03CR) 10Jhedden: [C: 04-1] "Looks good, just a minor change needed." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528965 (https://phabricator.wikimedia.org/T229884) (owner: 10Bstorm) [15:20:33] 10Operations, 10Discovery: elastic2054 unresponsive - https://phabricator.wikimedia.org/T227298 (10Papaul) 05Open→03Resolved DIMM A2 replaced and log cleared . Closing this task for now . [15:21:42] (03CR) 10Bstorm: labstore: restore original sense of the load alert with prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528965 (https://phabricator.wikimedia.org/T229884) (owner: 10Bstorm) [15:23:29] (03PS1) 10Gehel: elasticsearch: depool servers just before actual operation [cookbooks] - 10https://gerrit.wikimedia.org/r/529963 [15:23:31] (03PS3) 10Bstorm: labstore: restore original sense of the load alert with prometheus [puppet] - 10https://gerrit.wikimedia.org/r/528965 (https://phabricator.wikimedia.org/T229884) [15:24:18] (03CR) 10Bstorm: labstore: restore original sense of the load alert with prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528965 (https://phabricator.wikimedia.org/T229884) (owner: 10Bstorm) [15:24:52] (03CR) 10Jhedden: [C: 03+2] labstore: restore original sense of the load alert with prometheus [puppet] - 10https://gerrit.wikimedia.org/r/528965 (https://phabricator.wikimedia.org/T229884) (owner: 10Bstorm) [15:25:24] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T229156 (10Cmjohnson) 05Open→03Resolved Disks replaced, please re-open an ping me if the disk fails [15:25:28] 10Operations, 10Discovery: elastic2054 unresponsive - https://phabricator.wikimedia.org/T227298 (10Papaul) Return information {F30018725} [15:25:36] (03PS4) 10Bstorm: labstore: restore original sense of the load alert with prometheus [puppet] - 10https://gerrit.wikimedia.org/r/528965 (https://phabricator.wikimedia.org/T229884) [15:28:09] (03CR) 10Jhedden: [C: 03+2] openstack: initial haproxy profile [puppet] - 
10https://gerrit.wikimedia.org/r/529436 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [15:28:17] (03PS6) 10Jhedden: openstack: initial haproxy profile [puppet] - 10https://gerrit.wikimedia.org/r/529436 (https://phabricator.wikimedia.org/T223907) [15:28:27] annnnnd it's back [15:28:52] 10Operations, 10ops-codfw: (OoW) wtp2019 shows error messages in the racadm getsel's output - https://phabricator.wikimedia.org/T221572 (10Papaul) 05Open→03Resolved I checked the server this morning no errors showing in log. closing the task [15:29:09] 10Operations, 10ops-codfw, 10media-storage: Degraded RAID on ms-be2021 - https://phabricator.wikimedia.org/T230275 (10Papaul) p:05Triage→03Normal [15:30:06] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/529959 (owner: 10CRusnov) [15:30:08] (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: depool servers just before actual operation [cookbooks] - 10https://gerrit.wikimedia.org/r/529963 (owner: 10Gehel) [15:30:11] !log rollback ospf + bgp changes on cr2-eqord [15:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:44] (03CR) 10Gehel: [C: 03+2] elasticsearch: depool servers just before actual operation [cookbooks] - 10https://gerrit.wikimedia.org/r/529963 (owner: 10Gehel) [15:30:50] (03CR) 10CRusnov: [C: 03+2] netbox: Add configuration for Netbox spicerack backend [puppet] - 10https://gerrit.wikimedia.org/r/529959 (owner: 10CRusnov) [15:31:09] (03PS3) 10CRusnov: netbox: Add configuration for Netbox spicerack backend [puppet] - 10https://gerrit.wikimedia.org/r/529959 [15:31:24] (03PS3) 10BBlack: lvs: isolate cloudelastic icinga check [puppet] - 10https://gerrit.wikimedia.org/r/528885 (https://phabricator.wikimedia.org/T229621) (owner: 10Mathew.onipe) [15:32:10] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot [15:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:50] !log disabled pupped on lvs1014, 
lvs1016, icinga1001 ahead of deploying https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/528885/ - T229621 [15:32:58] puppet even :P [15:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:59] T229621: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 [15:33:11] (03CR) 10BBlack: [C: 03+2] lvs: isolate cloudelastic icinga check [puppet] - 10https://gerrit.wikimedia.org/r/528885 (https://phabricator.wikimedia.org/T229621) (owner: 10Mathew.onipe) [15:34:16] (03PS1) 10Ayounsi: Depool eqsin for router software upgrade [dns] - 10https://gerrit.wikimedia.org/r/529966 (https://phabricator.wikimedia.org/T227886) [15:35:08] (03PS4) 10CRusnov: netbox: Add configuration for Netbox spicerack backend [puppet] - 10https://gerrit.wikimedia.org/r/529959 [15:35:34] (03CR) 10Ayounsi: [C: 03+2] Depool eqsin for router software upgrade [dns] - 10https://gerrit.wikimedia.org/r/529966 (https://phabricator.wikimedia.org/T227886) (owner: 10Ayounsi) [15:35:52] 10Operations, 10ops-codfw, 10media-storage: Degraded RAID on ms-be2021 - https://phabricator.wikimedia.org/T230275 (10Papaul) a:05Papaul→03fgiunchedi @fgiunchedi Disk replaced. [15:35:58] !log depool eqsin for cr2-eqsin upgrade [15:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:50] PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [15:37:48] ^ that's me, check too sensitive, looking [15:38:03] RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [15:38:55] RECOVERY - HP RAID on ms-be2021 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:39:27] !log puppet re-enabled on lvs1014, lvs1016, icinga1001 [15:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:06] !log gehel@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=99) [15:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:34] (03PS1) 10Gehel: elasticsearch: reduce sensitivity of master eligible check [puppet] - 10https://gerrit.wikimedia.org/r/529967 [15:46:40] !log fail vrrp master to cr1-eqsin [15:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:28] (03PS1) 10Gehel: elasticsearch: fix str to int conversion [cookbooks] - 10https://gerrit.wikimedia.org/r/529968 [15:49:05] !log ppchelko@deploy1001 Started deploy [restbase/deploy@8fca708]: Expose transform/wikitext/to/mobile-html endpoint T211026 [15:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:13] T211026: mobile-html: ability to preview an edited page or section with the same transforms and styles as mobile-html - https://phabricator.wikimedia.org/T211026 [15:49:17] RECOVERY - Device not healthy -SMART- on ms-be2021 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2021&var-datasource=codfw+prometheus/ops [15:50:11] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 10.08 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:50:27] (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: reduce sensitivity of master eligible check [puppet] - 10https://gerrit.wikimedia.org/r/529967 (owner: 10Gehel) [15:50:36] (03CR) 10Gehel: [C: 03+2] elasticsearch: reduce sensitivity of master eligible check [puppet] - 10https://gerrit.wikimedia.org/r/529967 (owner: 10Gehel) [15:50:50] (03CR) 10Milimetric: Add Cache-Control response header for Wikistats V2's index.html (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/529795 (https://phabricator.wikimedia.org/T230136) (owner: 10Elukey) [15:51:59] (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: fix str to int conversion [cookbooks] - 10https://gerrit.wikimedia.org/r/529968 (owner: 10Gehel) [15:52:08] (03CR) 10Gehel: [C: 03+2] elasticsearch: fix str to int conversion [cookbooks] - 10https://gerrit.wikimedia.org/r/529968 (owner: 10Gehel) [15:56:40] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@8fca708]: Expose transform/wikitext/to/mobile-html endpoint T211026 (duration: 07m 35s) [15:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:49] T211026: mobile-html: ability to preview an edited page or section with the same transforms and styles as mobile-html - https://phabricator.wikimedia.org/T211026 [15:56:58] !log ppchelko@deploy1001 Started deploy [restbase/deploy@8fca708]: Expose transform/wikitext/to/mobile-html endpoint T211026, take 2 [15:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:18] (03PS1) 10Bstorm: monitoring: 
Change the showmount check from toolforge to be email-only [puppet] - 10https://gerrit.wikimedia.org/r/529970 (https://phabricator.wikimedia.org/T229884) [16:00:04] godog and _joe_: That opportune time is upon us again. Time for a Puppet SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190813T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:27] (03CR) 10Jhedden: [C: 03+2] monitoring: Change the showmount check from toolforge to be email-only [puppet] - 10https://gerrit.wikimedia.org/r/529970 (https://phabricator.wikimedia.org/T229884) (owner: 10Bstorm) [16:05:45] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:05:51] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:07:01] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Proxy Error - 619 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/Netbox [16:07:10] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@8fca708]: Expose transform/wikitext/to/mobile-html endpoint T211026, take 2 (duration: 10m 12s) [16:07:10] ^ that's me, one sec [16:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:17] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:07:17] T211026: mobile-html: ability to preview an edited page or section with the same transforms and styles as mobile-html - https://phabricator.wikimedia.org/T211026 [16:08:19] PROBLEM - Check the last execution of 
netbox_ganeti_eqiad_sync on netmon1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:09:05] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:10:33] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:11:41] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 348 bytes in 0.694 second response time https://wikitech.wikimedia.org/wiki/Netbox [16:13:39] PROBLEM - IPMI Sensor Status on db1129 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:20:14] 10Operations, 10ops-ulsfo: refresh/replace scs-ulsfo - https://phabricator.wikimedia.org/T230077 (10RobH) The cables have arrived for this. I'll go onsite on Wednesday, August 14th to swap out the scs-ulsfo console server. I'll email the SRE team list to ensure the department is aware of the change/downtime.... [16:21:52] 10Operations, 10MediaWiki-API, 10Traffic, 10Wikidata, and 2 others: wikidata.org handles GET MWAPI requests, but silently fails on POST - https://phabricator.wikimedia.org/T230051 (10Anomie) There's nothing to do with #MediaWiki-API here, or with MediaWiki at all. Any request to wikidata.org is served a 30... 
[16:22:17] 10Operations, 10ops-ulsfo: refresh/replace scs-ulsfo - https://phabricator.wikimedia.org/T230077 (10RobH) [16:25:27] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [16:25:27] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:25:32] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [16:25:33] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:44] (03CR) 10CDanis: [C: 03+1] mediawiki: add cluster latency alerts [puppet] - 10https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [16:28:36] RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netmon1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:30:06] RECOVERY - Disk space on ms-be2021 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2021&var-datasource=codfw+prometheus/ops [16:30:52] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:33:26] RECOVERY - puppet last run on ms-be2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:34:03] (03CR) 10CDanis: [C: 03+2] dbctl: add note & candidate_master fields [software/conftool] - 
10https://gerrit.wikimedia.org/r/529396 (https://phabricator.wikimedia.org/T229677) (owner: 10CDanis) [16:37:13] 10Operations, 10ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10Cmjohnson) @Jclark-ctr Please rack 4 of the servers from the same ganeti stack in row D and label them as ganeti1019-1022. Please update netbox, and provide access switch port info. [16:37:14] (03Merged) 10jenkins-bot: dbctl: add note & candidate_master fields [software/conftool] - 10https://gerrit.wikimedia.org/r/529396 (https://phabricator.wikimedia.org/T229677) (owner: 10CDanis) [16:37:25] 10Operations, 10ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10Cmjohnson) a:05akosiaris→03Jclark-ctr [16:39:22] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:40:02] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 70.04 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:43:42] 10Operations, 10ops-codfw, 10media-storage: Degraded RAID on ms-be2021 - https://phabricator.wikimedia.org/T230275 (10fgiunchedi) 05Open→03Resolved Disk is rebuilding, thanks @Papaul [16:44:34] 10Operations, 10ops-eqiad, 10cloud-services-team: (OoW) cloudvirt1006 - RAID battery failed - https://phabricator.wikimedia.org/T222950 (10Cmjohnson) [16:46:11] !log disable all peering and transit on cr2-eqsin [16:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:46] 10Operations, 10LDAP-Access-Requests: Membership to 'wmf' LDAP group request for Connie Chen - https://phabricator.wikimedia.org/T230242 (10cchen) 05Resolved→03Open Thanks again for your help @colewhite ! 
I get the following internal server error when i trying to login with my LDAP account. ` 500 - Inter... [16:57:00] PROBLEM - Check systemd state on elastic1046 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:57:54] !log reboot cr2-eqsin [16:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:44] (03PS1) 10Giuseppe Lavagetto: remote: pass dry_run down to children remote hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/529975 [16:58:46] (03PS1) 10Giuseppe Lavagetto: Move splitting of a RemoteHosts to a method [software/spicerack] - 10https://gerrit.wikimedia.org/r/529976 [16:59:13] <_joe_> uhm something's not right I forgot to squash those two [16:59:56] (03PS2) 10Giuseppe Lavagetto: Move splitting of a RemoteHosts to a method [software/spicerack] - 10https://gerrit.wikimedia.org/r/529976 [17:00:04] cscott, arlolra, subbu, halfak, and accraze: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190813T1700). [17:00:20] (03Abandoned) 10Giuseppe Lavagetto: remote: pass dry_run down to children remote hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/529975 (owner: 10Giuseppe Lavagetto) [17:01:53] PROBLEM - LVS HTTPS IPv6 #page on upload-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:02:04] PROBLEM - OSPF status on mr1-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:02:12] XioNoX: everything ok? 
[17:02:17] * volans around [17:02:25] * jbond42 around [17:02:26] * _joe_ too [17:02:36] site is depooled [17:02:42] <_joe_> oh ok [17:02:44] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:02:59] <_joe_> maybe then we should downtime the services there [17:03:03] but that should not have happened [17:03:04] o/ [17:03:10] ah, it is depooled [17:03:10] <_joe_> ok [17:03:16] * marostegui back to the sofa [17:03:24] RECOVERY - LVS HTTPS IPv6 #page on upload-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 848 bytes in 0.933 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:03:56] (PS2) Bstorm: monitoring: Change the showmount check from toolforge to be email-only [puppet] - https://gerrit.wikimedia.org/r/529970 (https://phabricator.wikimedia.org/T229884) [17:04:02] yeah, no impact, sorry for the page, it should not have happened [17:04:03] * ema goes back to the sofa as well [17:04:26] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:05:04] cr2-eqsin is back [17:05:12] PROBLEM - PyBal BGP sessions are established on lvs5003 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=eqsin+prometheus/ops [17:05:20] RECOVERY - OSPF status on mr1-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:06:15] !log rollback: disable all peering and transit on cr2-eqsin [17:06:17] (PS1) CRusnov: Add script to import management DNS entries [software/netbox-deploy] - https://gerrit.wikimedia.org/r/529977 (https://phabricator.wikimedia.org/T228670) [17:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:48] RECOVERY - PyBal BGP sessions are established
on lvs5003 is OK: (C)0 le (W)0 le 1 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=eqsin+prometheus/ops [17:08:29] (03CR) 10CRusnov: "Note: This is meant to be run once or twice ever. We should include it in the repository for informational and deployment purposes. It is " [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/529977 (https://phabricator.wikimedia.org/T228670) (owner: 10CRusnov) [17:08:37] ah, now I know, forgot to depool ^ before the maintenance and it didn't failover fast enough [17:08:41] (03PS6) 10Mathew.onipe: remote: make RemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/529802 [17:09:04] (03CR) 10Mathew.onipe: remote: make RemoteHosts iterable (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/529802 (owner: 10Mathew.onipe) [17:10:40] (03PS1) 10Ayounsi: Revert "Depool eqsin for router software upgrade" [dns] - 10https://gerrit.wikimedia.org/r/529979 [17:11:32] (03CR) 10Ayounsi: [C: 03+2] Revert "Depool eqsin for router software upgrade" [dns] - 10https://gerrit.wikimedia.org/r/529979 (owner: 10Ayounsi) [17:11:41] 10Operations, 10LDAP-Access-Requests: Membership to 'wmf' LDAP group request for Connie Chen - https://phabricator.wikimedia.org/T230242 (10elukey) 05Open→03Resolved Added the user to superset! :) [17:11:58] !log repool eqsin [17:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:31] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T230289 (10JHedden) There are no workloads on this host now. We're good to have this replaced anytime. Thanks! 
[17:14:58] (03PS5) 10CRusnov: netbox: Add configuration for Netbox spicerack backend [puppet] - 10https://gerrit.wikimedia.org/r/529959 [17:16:17] (03CR) 10CRusnov: [C: 03+2] netbox: Add configuration and timers for csv dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/521313 (owner: 10CRusnov) [17:18:29] (03PS1) 10CRusnov: dumpbackup.py: minor fix [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/529980 [17:19:01] (03CR) 10CRusnov: [V: 03+2 C: 03+2] dumpbackup.py: minor fix [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/529980 (owner: 10CRusnov) [17:22:45] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 59.79 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:27:30] (03CR) 10Volans: "So, between Ic7877d23c423fb27e01486c353f4ab1f000c4102 and this one, we have similar behaviours." [software/spicerack] - 10https://gerrit.wikimedia.org/r/529802 (owner: 10Mathew.onipe) [17:39:53] (03CR) 10Ayounsi: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/17874/netflow1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/529841 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [17:39:59] 10Operations, 10LDAP-Access-Requests: Membership to 'wmf' LDAP group request for Connie Chen - https://phabricator.wikimedia.org/T230242 (10cchen) Thanks @elukey! 
[17:40:01] (03PS5) 10Ayounsi: Netflow: install and configure Samplicator [puppet] - 10https://gerrit.wikimedia.org/r/529841 (https://phabricator.wikimedia.org/T226810) [17:40:55] (03PS1) 10CRusnov: Bump to v2.6.1-wmf3 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/529985 [17:44:23] !log set target netflow port to 2000 in eqiad [17:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:18] (03CR) 10Mathew.onipe: "> Patch Set 6:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/529802 (owner: 10Mathew.onipe) [17:53:00] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: (C)60 le (W)70 le 72.44 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:53:38] 10Operations, 10Wikimedia-Mailing-lists: Set up mailing list for Santali Wikipedia - https://phabricator.wikimedia.org/T230435 (10Manik87) [18:01:21] (03PS2) 10CRusnov: Add script to import management DNS entries [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/529977 (https://phabricator.wikimedia.org/T228670) [18:03:43] (03CR) 10Volans: [C: 03+1] "LGTM, I didn't download the artifacts archive" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/529985 (owner: 10CRusnov) [18:07:24] (03CR) 10CRusnov: [V: 03+2 C: 03+2] "LGTM after reviewing. Hopefully good. :)" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/529985 (owner: 10CRusnov) [18:08:20] RECOVERY - Check systemd state on elastic1046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:13:00] PROBLEM - Check systemd state on elastic1046 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:17:07] (03CR) 10Volans: "Some nit inline" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/527487 (owner: 10Giuseppe Lavagetto) [18:22:10] (03PS2) 10Mholloway: Add MachineVision to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526541 (https://phabricator.wikimedia.org/T227348) [18:22:19] (03PS2) 10Mholloway: Add wmgUseMachineVision default setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526542 (https://phabricator.wikimedia.org/T227348) [18:22:39] (03PS3) 10Mholloway: Enable MachineVision on (beta) commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526543 (https://phabricator.wikimedia.org/T227348) [18:22:46] (03PS4) 10Mholloway: Load MachineVision extension if enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526544 (https://phabricator.wikimedia.org/T227348) [18:24:32] (03CR) 10Mholloway: [C: 03+2] Add MachineVision to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526541 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway) [18:25:41] (03Merged) 10jenkins-bot: Add MachineVision to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526541 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway) [18:27:40] !log mholloway-shell@deploy1001 Synchronized wmf-config/extension-list: Enable MachineVision on Beta (1/4) (duration: 00m 48s) [18:27:50] (03CR) 10Mholloway: [C: 03+2] Add wmgUseMachineVision default setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526542 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway) [18:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:52] (03CR) 10jenkins-bot: Add MachineVision to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526541 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway) [18:28:08] 10Operations, 10MediaWiki-Debug-Logger, 
10Wikimedia-Logstash, 10serviceops-radar, and 8 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10WDoranWMF) What's the current status of this task? Are there needs from CPT? [18:28:58] (03Merged) 10jenkins-bot: Add wmgUseMachineVision default setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526542 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway) [18:30:55] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable MachineVision on Beta (2/4) (duration: 00m 48s) [18:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:02] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T230442 (10ops-monitoring-bot) [18:31:17] (03CR) 10Mholloway: [C: 03+2] Enable MachineVision on (beta) commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526543 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway) [18:31:47] !log crusnov@deploy1001 Started deploy [netbox/deploy@367ca84]: Update Netbox to v2.6.1-wmf3 affects: T223292 [18:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:55] T223292: Netbox: generate CSV backups - https://phabricator.wikimedia.org/T223292 [18:32:23] !log crusnov@deploy1001 Finished deploy [netbox/deploy@367ca84]: Update Netbox to v2.6.1-wmf3 affects: T223292 (duration: 00m 36s) [18:32:25] !log crusnov@deploy1001 Started deploy [netbox/deploy@367ca84]: Update Netbox to v2.6.1-wmf3 affects: T223292 [18:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:37] (03Merged) 10jenkins-bot: Enable MachineVision on (beta) commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526543 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway) [18:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:09] !log crusnov@deploy1001 Finished deploy 
[netbox/deploy@367ca84]: Update Netbox to v2.6.1-wmf3 affects: T223292 (duration: 00m 43s) [18:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:39] !log crusnov@deploy1001 Started deploy [netbox/deploy@367ca84]: Update Netbox to v2.6.1-wmf3 affects: T223292 (fix perms) [18:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:49] !log crusnov@deploy1001 Finished deploy [netbox/deploy@367ca84]: Update Netbox to v2.6.1-wmf3 affects: T223292 (fix perms) (duration: 00m 09s) [18:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:39] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Enable MachineVision on Beta (3/4) (duration: 00m 47s) [18:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:00] (03CR) 10Mholloway: [C: 03+2] Load MachineVision extension if enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526544 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway) [18:35:59] (03CR) 10jenkins-bot: Add wmgUseMachineVision default setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526542 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway) [18:36:02] (03Merged) 10jenkins-bot: Load MachineVision extension if enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526544 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway) [18:36:03] (03CR) 10jenkins-bot: Enable MachineVision on (beta) commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526543 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway) [18:36:16] (03CR) 10jenkins-bot: Load MachineVision extension if enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526544 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway) [18:38:04] !log mholloway-shell@deploy1001 Synchronized wmf-config/CommonSettings.php: Enable MachineVision on Beta 
(4/4) (duration: 00m 48s) [18:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:22] (03PS7) 10CRusnov: netbox: Add configuration and timers for csv dumps [puppet] - 10https://gerrit.wikimedia.org/r/521313 [18:41:16] (03PS1) 10Eevans: sessionstore: Upgrade staging to Kask v1.0.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/529991 (https://phabricator.wikimedia.org/T229697) [18:41:32] !log set cpufreq scaling_governor to performance on cloudelastic100[1-4] to test any changes to indexing performance [18:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:34] (03PS1) 10Gehel: elasticsearch: wait for write queue to empty for 1h instead of 10 min [cookbooks] - 10https://gerrit.wikimedia.org/r/529992 [18:49:27] (03CR) 10Gehel: [C: 03+2] elasticsearch: wait for write queue to empty for 1h instead of 10 min [cookbooks] - 10https://gerrit.wikimedia.org/r/529992 (owner: 10Gehel) [18:50:42] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot [18:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:04] (03CR) 10Eevans: [C: 03+2] "Self-merging staging deployment." [deployment-charts] - 10https://gerrit.wikimedia.org/r/529991 (https://phabricator.wikimedia.org/T229697) (owner: 10Eevans) [18:54:06] (03CR) 10Eevans: [V: 03+2 C: 03+2] sessionstore: Upgrade staging to Kask v1.0.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/529991 (https://phabricator.wikimedia.org/T229697) (owner: 10Eevans) [18:56:23] 10Operations, 10SRE-Access-Requests: Requesting access to LogStash for Abijeet Patro - https://phabricator.wikimedia.org/T230104 (10colewhite) a:05colewhite→03None [18:58:46] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T230442 (10JHedden) This host also has a bad disk in slot number 8. 
T230289 [18:58:51] 10Operations, 10SRE-Access-Requests: Requesting access to LogStash for Abijeet Patro - https://phabricator.wikimedia.org/T230104 (10CDanis) a:03CDanis [18:59:31] 10Operations, 10SRE-Access-Requests, 10Traffic: SRE Onboarding for Sukhbir Singh - https://phabricator.wikimedia.org/T229860 (10CDanis) Is this done? [19:00:20] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Mwmaint1002, Stat1007 for Abijeet Patro - https://phabricator.wikimedia.org/T230020 (10CDanis) a:03RStallman-legalteam @RStallman-legalteam can you confirm NDA on file? Thanks! [19:03:35] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'sessionstore' for release 'staging' . [19:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:19] 10Operations, 10SRE-Access-Requests, 10Traffic: SRE Onboarding for Sukhbir Singh - https://phabricator.wikimedia.org/T229860 (10BBlack) 05Open→03Resolved Looks like it to me :) [19:15:46] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@f1a562e]: Increase cirrusSearchLinksUpdatePrioritized concurrency 150 -> 200 T220625 [19:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:58] T220625: Initialize CirrusSearch on cloudelastic - https://phabricator.wikimedia.org/T220625 [19:17:17] !log ppchelko@deploy1001 deploy aborted: Increase cirrusSearchLinksUpdatePrioritized concurrency 150 -> 200 T220625 (duration: 01m 30s) [19:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:45] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@f1a562e]: Revert on canary [19:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:03] !log ppchelko@deploy1001 deploy aborted: Revert on canary (duration: 00m 18s) [19:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:40] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@3882ddb]: 
Increase cirrusSearchLinksUpdatePrioritized concurrency 150 -> 200 T220625 [19:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:48] T220625: Initialize CirrusSearch on cloudelastic - https://phabricator.wikimedia.org/T220625 [19:23:38] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@3882ddb]: Increase cirrusSearchLinksUpdatePrioritized concurrency 150 -> 200 T220625 (duration: 00m 58s) [19:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:37] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=0) [19:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:27] 10Operations, 10Traffic, 10netops: Aug 28th: turn off knams lasers & stop advertising prefixes in advance of Relined PA-988002 maintenance - https://phabricator.wikimedia.org/T230448 (10CDanis) [20:16:07] (03PS1) 10Mholloway: Fix MachineVision provider config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529998 [20:17:01] (03CR) 10jerkins-bot: [V: 04-1] Fix MachineVision provider config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529998 (owner: 10Mholloway) [20:18:10] (03PS2) 10Mholloway: Fix MachineVision provider config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529998 [20:18:59] (03CR) 10jerkins-bot: [V: 04-1] Fix MachineVision provider config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529998 (owner: 10Mholloway) [20:19:12] (03PS3) 10Mholloway: Fix MachineVision provider config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529998 [20:26:21] PROBLEM - Disk space on ms-be2021 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2021&var-datasource=codfw+prometheus/ops [20:28:29] (03CR) 10Mholloway: [C: 03+2] Fix MachineVision provider config [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/529998 (owner: 10Mholloway) [20:29:24] (03Merged) 10jenkins-bot: Fix MachineVision provider config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529998 (owner: 10Mholloway) [20:29:39] (03CR) 10jenkins-bot: Fix MachineVision provider config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529998 (owner: 10Mholloway) [20:32:56] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Fix MachineVision provider config (duration: 00m 47s) [20:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:13] (03PS3) 10Bstorm: docker: add support for "stable" and "testing" tags [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T229058) [20:38:50] (03PS4) 10Bstorm: docker: add support for "testing" tags [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T229058) [20:41:05] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10Performance-Team (Radar): Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Ladsgroup) [20:41:30] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Mwmaint1002, Stat1007 for Abijeet Patro - https://phabricator.wikimedia.org/T230020 (10CDanis) a:05RStallman-legalteam→03CDanis Oh, sorry @Arrbee I missed your update. I'll move forward with this tomorrow. [20:41:32] 10Operations, 10Traffic, 10netops: Aug 28th: turn off knams lasers & stop advertising prefixes in advance of Relined PA-988002 maintenance - https://phabricator.wikimedia.org/T230448 (10ayounsi) That seems to actually be one circuit terminating in two ports on each sides: cr2-knams:xe-0/0/3 to asw-esams:xe-0... 
[20:46:29] 10Operations, 10Traffic, 10netops: Aug 28th: turn off 1/3 esams-knams lasers in advance of Relined PA-988002 maintenance - https://phabricator.wikimedia.org/T230448 (10ayounsi) [20:55:51] (03PS5) 10Bstorm: docker: add support for "testing" tags [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T224558) [20:58:33] (03PS6) 10Bstorm: docker: add support for "testing" tags [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T224558) [21:03:11] (03CR) 10Bstorm: "Ok, so at this point, I'm stripping out the stable tag for now in this and setting "testing" to the default tag. Basically, this should a" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T224558) (owner: 10Bstorm) [21:14:05] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T230442 (10wiki_willy) a:03Cmjohnson [21:23:27] 10Operations, 10Traffic, 10netops: Aug 28th: turn off 1/3 esams-knams lasers in advance of Relined PA-988002 maintenance - https://phabricator.wikimedia.org/T230448 (10CDanis) [21:27:35] (03PS34) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [21:28:37] (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov) [21:43:34] (03PS35) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [21:44:32] (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: Reorganize for splitting front and back-end. 
[puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov) [21:44:42] (03PS36) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [21:45:43] (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov) [21:50:31] (03PS37) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [21:51:31] (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov) [21:58:33] (03PS1) 10CRusnov: netbox: Add fake secrets for reorg [labs/private] - 10https://gerrit.wikimedia.org/r/530008 [21:58:51] (03CR) 10CRusnov: [V: 03+2 C: 03+2] netbox: Add fake secrets for reorg [labs/private] - 10https://gerrit.wikimedia.org/r/530008 (owner: 10CRusnov) [22:03:02] (03PS38) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [22:07:11] (03PS39) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [22:14:26] (03CR) 10CRusnov: "Almost there!" 
[puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov) [22:17:51] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [22:19:39] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [22:50:42] (03PS1) 10Thcipriani: scap: set flag for check-and-restart-php [puppet] - 10https://gerrit.wikimedia.org/r/530014 (https://phabricator.wikimedia.org/T224857) [23:00:04] MaxSem, RoanKattouw, and Niharika: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190813T2300). [23:00:04] Jdlrobson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:00:11] \o [23:06:32] i'm guessing there's a small possibility everyone is too busy wikimaning to help with swat [23:42:47] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [23:43:59] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [23:59:46] (PS1) DannyS712: Fix addition of Hubblesite.org and Spacetelescope.org to commons wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/530015 (https://phabricator.wikimedia.org/T230083)