[00:23:48] <wikibugs>	 (03PS1) 10Ayounsi: Netflow: install and configure Samplicator [puppet] - 10https://gerrit.wikimedia.org/r/529841 (https://phabricator.wikimedia.org/T226810)
[00:24:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Netflow: install and configure Samplicator [puppet] - 10https://gerrit.wikimedia.org/r/529841 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi)
[00:45:13] <wikibugs>	 (03PS2) 10Ayounsi: Netflow: install and configure Samplicator [puppet] - 10https://gerrit.wikimedia.org/r/529841 (https://phabricator.wikimedia.org/T226810)
[00:50:10] <wikibugs>	 (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1001/17867/netflow1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/529841 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi)
[00:58:30] <legoktm>	 XioNoX: is there a reason https://phabricator.wikimedia.org/T226810 is private?
[00:59:47] <XioNoX>	 legoktm: yup, it's DDoS related work and discussions, so a bit sensitive
[01:02:01] <legoktm>	 ok, just a bit weird that there's a public patch with a private task
[01:02:18] <legoktm>	 and it's not even WMF-NDA or something less restrictive
[01:05:10] <XioNoX>	 yeah, the public changes are what we agreed in the task are okay to be public :)
[01:05:37] <XioNoX>	 it could probably be less restrictive than SRE, I'll ask and update it if fine
[02:35:09] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 176571272 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:39:57] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 71880 and 81 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:34:43] <wikibugs>	 (03PS3) 10Vgutierrez: x509: Expose the OCSP URI of a Certificate as a property [software/acme-chief] - 10https://gerrit.wikimedia.org/r/516604 (https://phabricator.wikimedia.org/T219765)
[03:34:45] <wikibugs>	 (03PS3) 10Vgutierrez: ocsp: Provide basic functionality to perform OCSP requests [software/acme-chief] - 10https://gerrit.wikimedia.org/r/529202 (https://phabricator.wikimedia.org/T219765)
[03:35:16] <wikibugs>	 (03CR) 10Vgutierrez: "thx for the review :)" (031 comment) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/516604 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez)
[03:47:22] <wikibugs>	 (03CR) 10Vgutierrez: "> Patch Set 2:" (032 comments) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/529202 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez)
[03:48:38] <wikibugs>	 (03CR) 10Vgutierrez: "Brief example (wp.crt is the unified cert issued by globalsign):" [software/acme-chief] - 10https://gerrit.wikimedia.org/r/529202 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez)
[04:02:29] <wikibugs>	 (03PS4) 10Vgutierrez: ocsp: Provide basic functionality to perform OCSP requests [software/acme-chief] - 10https://gerrit.wikimedia.org/r/529202 (https://phabricator.wikimedia.org/T219765)
[04:52:20] <wikibugs>	 (03PS1) 10Marostegui: db2121: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/529846 (https://phabricator.wikimedia.org/T228969)
[04:53:34] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db2121: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/529846 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui)
[04:59:55] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Point m3-master codfw to dbproxy2003 [dns] - 10https://gerrit.wikimedia.org/r/529847 (https://phabricator.wikimedia.org/T202367)
[05:06:18] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Provision db2122 into s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529848 (https://phabricator.wikimedia.org/T228969)
[05:09:08] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] db-eqiad,db-codfw.php: Provision db2122 into s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529848 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui)
[05:09:29] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Provision db2122 into s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529848 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui)
[05:10:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Provision db2122 into s7 T228969', diff saved to https://phabricator.wikimedia.org/P8903 and previous config saved to /var/cache/conftool/dbconfig/20190813-051019-marostegui.json
[05:10:25] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Provision db2122 into s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529848 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui)
[05:10:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:10:30] <stashbot>	 T228969: Productionize db21[21-31} - https://phabricator.wikimedia.org/T228969
[05:10:41] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Provision db2122 into s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529848 (https://phabricator.wikimedia.org/T228969) (owner: 10Marostegui)
[05:11:31] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Provision db2122 into s7 T228969 (duration: 00m 49s)
[05:11:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:12:33] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Provision db2122 into s7 T228969 (duration: 00m 47s)
[05:12:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:37] <wikibugs>	 10Operations, 10DBA, 10decommission: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (10Marostegui)
[05:27:54] <wikibugs>	 10Operations, 10DBA, 10decommission: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (10Marostegui) p:05Triage→03Normal
[05:29:04] <wikibugs>	 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui)
[05:29:10] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db2050 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529849 (https://phabricator.wikimedia.org/T230391)
[05:29:34] <wikibugs>	 10Operations, 10DBA, 10decommission, 10Patch-For-Review: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (10Marostegui)
[05:30:46] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Remove db2050 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529849 (https://phabricator.wikimedia.org/T230391) (owner: 10Marostegui)
[05:31:38] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2050 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529849 (https://phabricator.wikimedia.org/T230391) (owner: 10Marostegui)
[05:31:52] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2050 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529849 (https://phabricator.wikimedia.org/T230391) (owner: 10Marostegui)
[05:32:51] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2050 from config T230391 (duration: 00m 48s)
[05:32:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:32:59] <stashbot>	 T230391: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391
[05:33:44] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2050 from config T230391 (duration: 00m 48s)
[05:33:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2050 from config, host will be decommissioned T230391', diff saved to https://phabricator.wikimedia.org/P8904 and previous config saved to /var/cache/conftool/dbconfig/20190813-053514-marostegui.json
[05:35:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:38:08] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Decommissioned db2050 [puppet] - 10https://gerrit.wikimedia.org/r/529850 (https://phabricator.wikimedia.org/T230391)
[05:39:35] <wikibugs>	 10Operations, 10DBA, 10decommission, 10Patch-For-Review: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (10Marostegui)
[05:39:41] <wikibugs>	 (03PS2) 10Marostegui: mariadb: Decommission db2050 [puppet] - 10https://gerrit.wikimedia.org/r/529850 (https://phabricator.wikimedia.org/T230391)
[05:40:18] <marostegui>	 !log Remove db2050 from tendril and zarcillo T230391
[05:40:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:26] <stashbot>	 T230391: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391
[05:40:52] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2050 [puppet] - 10https://gerrit.wikimedia.org/r/529850 (https://phabricator.wikimedia.org/T230391) (owner: 10Marostegui)
[05:43:59] <wikibugs>	 (03PS1) 10Marostegui: install_server: Remove db2043 [puppet] - 10https://gerrit.wikimedia.org/r/529851 (https://phabricator.wikimedia.org/T230311)
[05:44:39] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] install_server: Remove db2043 [puppet] - 10https://gerrit.wikimedia.org/r/529851 (https://phabricator.wikimedia.org/T230311) (owner: 10Marostegui)
[05:47:30] <marostegui>	 !log Stop mysql on db2050 - T230391
[05:47:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:39] <stashbot>	 T230391: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391
[05:48:08] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (10Marostegui) a:05Marostegui→03RobH
[05:48:22] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (10Marostegui) This host is ready for #dc-ops  to decommission
[05:48:43] <wikibugs>	 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui)
[05:50:03] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2069.codfw.wmnet - https://phabricator.wikimedia.org/T230107 (10Marostegui)
[05:57:04] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1009 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[05:57:06] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:57:36] <icinga-wm>	 PROBLEM - WDQS HTTP Port on wdqs1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[05:58:00] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:01:24] <icinga-wm>	 ACKNOWLEDGEMENT - Blazegraph Port for wdqs-blazegraph on wdqs1009 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused Stas Malychev Running some tests for dumplicate terms. - The acknowledgement expires at: 2019-08-14 06:00:46. https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:01:24] <icinga-wm>	 ACKNOWLEDGEMENT - Blazegraph process -wdqs-blazegraph- on wdqs1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war Stas Malychev Running some tests for dumplicate terms. - The acknowledgement expires at: 2019-08-14 06:00:46. https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:01:24] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Stas Malychev Running some tests for dumplicate terms. - The acknowledgement expires at: 2019-08-14 06:00:46. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:01:24] <icinga-wm>	 ACKNOWLEDGEMENT - WDQS HTTP Port on wdqs1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time Stas Malychev Running some tests for dumplicate terms. - The acknowledgement expires at: 2019-08-14 06:00:46. https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[06:03:19] <wikibugs>	 (03PS3) 10Ema: ATS: unset Accept-Encoding [puppet] - 10https://gerrit.wikimedia.org/r/529810 (https://phabricator.wikimedia.org/T227432)
[06:05:12] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] ATS: unset Accept-Encoding [puppet] - 10https://gerrit.wikimedia.org/r/529810 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema)
[06:05:29] <wikibugs>	 (03CR) 10Ema: [C: 03+2] ATS: unset Accept-Encoding [puppet] - 10https://gerrit.wikimedia.org/r/529810 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema)
[06:09:10] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs1009 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:09:48] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:09:50] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:11:15] <vgutierrez>	 !log Upgrading ATS to 8.0.3-1wm3 in cp2002, cp1076, cp3034 and cp4021 - T221594
[06:11:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:23] <stashbot>	 T221594: Puppetize ATS TLS configuration for incoming traffic - https://phabricator.wikimedia.org/T221594
[06:11:42] <icinga-wm>	 PROBLEM - snapshot of s3 in codfw on db1115 is CRITICAL: snapshot for s3 at codfw taken more than 4 days ago: Most recent backup 2019-08-09 06:05:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[06:14:38] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:16:10] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1009 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:38:18] <vgutierrez>	 !log upgrading fifo-log-demux to version 0.5 in cache@upload
[06:38:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:24] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[06:39:30] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:49:36] <vgutierrez>	 !log Rolling restart of fifo-log-demux and atsmtail services across cache@upload
[06:49:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:20] <volans>	 !log upgrading spicerack to 0.0.26 on cumin2001
[06:59:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:01] <wikibugs>	 (03PS1) 10Ema: ATS: deployment-ms-fe02 renamed to fe03 [puppet] - 10https://gerrit.wikimedia.org/r/529866
[07:24:38] <wikibugs>	 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui)
[07:31:16] <wikibugs>	 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui)
[07:32:04] <wikibugs>	 10Operations, 10DBA, 10decommission: Decommission db2057.codfw.wmnet - https://phabricator.wikimedia.org/T230394 (10Marostegui)
[07:35:30] <wikibugs>	 10Operations, 10DBA, 10decommission: Decommission db2057.codfw.wmnet - https://phabricator.wikimedia.org/T230394 (10Marostegui) p:05Triage→03Normal
[07:37:53] <wikibugs>	 (03CR) 10Ema: [C: 03+2] ATS: deployment-ms-fe02 renamed to fe03 [puppet] - 10https://gerrit.wikimedia.org/r/529866 (owner: 10Ema)
[07:39:42] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db2057 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529912 (https://phabricator.wikimedia.org/T230394)
[07:40:38] <icinga-wm>	 RECOVERY - WDQS HTTP Port on wdqs1009 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.086 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[07:49:57] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Remove db2057 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529912 (https://phabricator.wikimedia.org/T230394) (owner: 10Marostegui)
[07:50:56] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2057 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529912 (https://phabricator.wikimedia.org/T230394) (owner: 10Marostegui)
[07:51:09] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2057 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529912 (https://phabricator.wikimedia.org/T230394) (owner: 10Marostegui)
[07:54:08] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2057 from config T230394 (duration: 00m 48s)
[07:54:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:18] <stashbot>	 T230394: Decommission db2057.codfw.wmnet - https://phabricator.wikimedia.org/T230394
[07:54:56] <wikibugs>	 10Operations, 10DBA, 10decommission, 10Patch-For-Review: Decommission db2057.codfw.wmnet - https://phabricator.wikimedia.org/T230394 (10Marostegui)
[07:55:03] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2057 from config T230394 (duration: 00m 47s)
[07:55:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:03] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: fetch active netmon server from hiera [puppet] - 10https://gerrit.wikimedia.org/r/529914 (https://phabricator.wikimedia.org/T148541)
[08:33:27] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] prometheus: fetch active netmon server from hiera [puppet] - 10https://gerrit.wikimedia.org/r/529914 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi)
[08:37:30] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: fetch active netmon server from hiera [puppet] - 10https://gerrit.wikimedia.org/r/529914 (https://phabricator.wikimedia.org/T148541)
[08:41:48] <wikibugs>	 (03CR) 10Alex Monk: [C: 03+2] x509: Expose the OCSP URI of a Certificate as a property [software/acme-chief] - 10https://gerrit.wikimedia.org/r/516604 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez)
[08:44:29] <wikibugs>	 (03Merged) 10jenkins-bot: x509: Expose the OCSP URI of a Certificate as a property [software/acme-chief] - 10https://gerrit.wikimedia.org/r/516604 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez)
[08:44:55] <wikibugs>	 (03CR) 10Alex Monk: [C: 03+2] "We should add some tests but as this is new code that's not used by anything yet this is harmless and good to go, we can followup later" [software/acme-chief] - 10https://gerrit.wikimedia.org/r/529202 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez)
[08:45:17] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Decommission db2057 [puppet] - 10https://gerrit.wikimedia.org/r/529915 (https://phabricator.wikimedia.org/T230394)
[08:46:28] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2057 [puppet] - 10https://gerrit.wikimedia.org/r/529915 (https://phabricator.wikimedia.org/T230394) (owner: 10Marostegui)
[08:46:43] <wikibugs>	 10Operations, 10DBA, 10decommission, 10Patch-For-Review: Decommission db2057.codfw.wmnet - https://phabricator.wikimedia.org/T230394 (10Marostegui)
[08:47:13] <wikibugs>	 (03CR) 10jenkins-bot: x509: Expose the OCSP URI of a Certificate as a property [software/acme-chief] - 10https://gerrit.wikimedia.org/r/516604 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez)
[08:47:40] <wikibugs>	 (03Merged) 10jenkins-bot: ocsp: Provide basic functionality to perform OCSP requests [software/acme-chief] - 10https://gerrit.wikimedia.org/r/529202 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez)
[08:48:45] <marostegui>	 !log Remove db2057 from tendril and zarcillo T230394 
[08:48:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:53] <stashbot>	 T230394: Decommission db2057.codfw.wmnet - https://phabricator.wikimedia.org/T230394
[08:49:40] <marostegui>	 !log Stop MySQL on db2057 - T230394
[08:49:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:23] <wikibugs>	 (03CR) 10jenkins-bot: ocsp: Provide basic functionality to perform OCSP requests [software/acme-chief] - 10https://gerrit.wikimedia.org/r/529202 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez)
[08:50:25] <wikibugs>	 10Operations, 10ops-codfw, 10decommission: Decommission db2057.codfw.wmnet - https://phabricator.wikimedia.org/T230394 (10Marostegui) a:05Marostegui→03RobH
[08:50:38] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2057.codfw.wmnet - https://phabricator.wikimedia.org/T230394 (10Marostegui) This host is ready for #dc-ops to decommission
[08:51:12] <wikibugs>	 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui)
[08:54:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: fetch active netmon server from hiera [puppet] - 10https://gerrit.wikimedia.org/r/529914 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi)
[08:54:32] <wikibugs>	 (03PS3) 10Filippo Giunchedi: prometheus: fetch active netmon server from hiera [puppet] - 10https://gerrit.wikimedia.org/r/529914 (https://phabricator.wikimedia.org/T148541)
[09:12:22] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: envoyproxy: support debian jessie [puppet] - 10https://gerrit.wikimedia.org/r/529919
[09:12:23] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: envoyproxy: use the hot restarter [puppet] - 10https://gerrit.wikimedia.org/r/529920
[09:13:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] envoyproxy: support debian jessie [puppet] - 10https://gerrit.wikimedia.org/r/529919 (owner: 10Giuseppe Lavagetto)
[09:13:35] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] envoyproxy: use the hot restarter [puppet] - 10https://gerrit.wikimedia.org/r/529920 (owner: 10Giuseppe Lavagetto)
[09:37:39] <wikibugs>	 (03PS1) 10Effie Mouzeli: mw-maintenance and scap: Revert changes for PHP7 transition [puppet] - 10https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392)
[09:42:44] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "LGTM: most commets are optional nitpicks but please use lookup() instead of hiera()" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/529841 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi)
[09:43:07] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 04-1] "commit message should include which commits we revert" [puppet] - 10https://gerrit.wikimedia.org/r/529921 (https://phabricator.wikimedia.org/T195392) (owner: 10Effie Mouzeli)
[09:43:22] <wikibugs>	 (03PS1) 10Gehel: elasticsearch: wait for write queue to be empty after cluster operation [cookbooks] - 10https://gerrit.wikimedia.org/r/529922
[09:45:35] <wikibugs>	 (03PS1) 10Filippo Giunchedi: mediawiki: add cluster latency alerts [puppet] - 10https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396)
[09:45:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: wait for write queue to be empty after cluster operation [cookbooks] - 10https://gerrit.wikimedia.org/r/529922 (owner: 10Gehel)
[09:46:32] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mediawiki: add cluster latency alerts [puppet] - 10https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi)
[09:48:18] <wikibugs>	 (03PS2) 10Gehel: elasticsearch: wait for write queue to be empty after cluster operation [cookbooks] - 10https://gerrit.wikimedia.org/r/529922
[09:50:03] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: wait for write queue to be empty after cluster operation [cookbooks] - 10https://gerrit.wikimedia.org/r/529922 (owner: 10Gehel)
[09:51:20] <wikibugs>	 (03PS3) 10Gehel: elasticsearch: wait for write queue to be empty after cluster operation [cookbooks] - 10https://gerrit.wikimedia.org/r/529922
[09:51:59] <wikibugs>	 (03PS1) 10Effie Mouzeli: Send 33.3% of anonymous users to PHP7.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529924 (https://phabricator.wikimedia.org/T219150)
[09:52:57] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: wait for write queue to be empty after cluster operation [cookbooks] - 10https://gerrit.wikimedia.org/r/529922 (owner: 10Gehel)
[09:53:23] <wikibugs>	 10Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Sundown aliases `minnan` and `zh-cfr` for `nan`/`zh-min-nan` - https://phabricator.wikimedia.org/T230382 (10Peachey88) Has there been any checks on the usage of these aliases yet?
[09:53:39] <wikibugs>	 (03PS2) 10Filippo Giunchedi: mediawiki: add cluster latency alerts [puppet] - 10https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396)
[09:55:51] <wikibugs>	 (03PS4) 10Gehel: elasticsearch: wait for write queue to be empty after cluster operation [cookbooks] - 10https://gerrit.wikimedia.org/r/529922
[09:56:37] <wikibugs>	 (03PS1) 10Pmiazga: Undeploy editor gender surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529925 (https://phabricator.wikimedia.org/T227793)
[09:56:54] <wikibugs>	 (03PS2) 10Pmiazga: Undeploy editor gender surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529925 (https://phabricator.wikimedia.org/T227793)
[09:58:24] <vgutierrez>	 !log upgrading the rest of cache@upload to 8.0.3-1wm3 - T221594
[09:58:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:33] <stashbot>	 T221594: Puppetize ATS TLS configuration for incoming traffic - https://phabricator.wikimedia.org/T221594
[09:59:45] <wikibugs>	 (03PS1) 10Fsero: Introducing podsecpolicies,calico and coredns in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/529926
[10:00:29] <wikibugs>	 (03PS2) 10Fsero: Introducing podsecpolicies,calico and coredns in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/529926 (https://phabricator.wikimedia.org/T228836)
[10:02:13] <wikibugs>	 (03CR) 10Fsero: [V: 03+2 C: 03+2] Introducing podsecpolicies,calico and coredns in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/529926 (https://phabricator.wikimedia.org/T228836) (owner: 10Fsero)
[10:06:44] <wikibugs>	 (03PS3) 10Fsero: prometheus, k8s: enabling services prometheus service discovery [puppet] - 10https://gerrit.wikimedia.org/r/529789
[10:06:46] <wikibugs>	 (03PS1) 10Fsero: caching,k8s: depool eqiad services exposed to cache for cluster recreation. [puppet] - 10https://gerrit.wikimedia.org/r/529927 (https://phabricator.wikimedia.org/T228836)
[10:07:22] <wikibugs>	 (03PS2) 10Fsero: caching,k8s: depool eqiad services exposed to cache for cluster recreation. [puppet] - 10https://gerrit.wikimedia.org/r/529927 (https://phabricator.wikimedia.org/T228836)
[10:07:35] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Thresholds will likely need tuning but should be good as a first step, let me know what you think!" [puppet] - 10https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi)
[10:10:21] <fsero>	 !log creating tiller in kube-system for helmfile T228836
[10:10:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:39] <stashbot>	 T228836: recreate eqiad cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] - https://phabricator.wikimedia.org/T228836
[10:10:45] <fsero>	 !log initialize_cluster.sh kube-system kubemaster.svc.eqiad.wmnet 6443 - T228836
[10:10:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:20] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] caching,k8s: depool eqiad services exposed to cache for cluster recreation. [puppet] - 10https://gerrit.wikimedia.org/r/529927 (https://phabricator.wikimedia.org/T228836) (owner: 10Fsero)
[10:17:56] <wikibugs>	 (03CR) 10Fsero: [C: 03+2] caching,k8s: depool eqiad services exposed to cache for cluster recreation. [puppet] - 10https://gerrit.wikimedia.org/r/529927 (https://phabricator.wikimedia.org/T228836) (owner: 10Fsero)
[10:19:34] <wikibugs>	 (03CR) 10Ema: [C: 03+1] caching,k8s: depool eqiad services exposed to cache for cluster recreation. [puppet] - 10https://gerrit.wikimedia.org/r/529927 (https://phabricator.wikimedia.org/T228836) (owner: 10Fsero)
[10:23:23] <logmsgbot>	 !log fsero@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=sessionstore|citoid|cxserver|eventgate-analytics|eventgate-main|termbox|blubberoid|mathoid|zotero,name=eqiad
[10:23:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:32] <jbond42>	 !log rolling update of ghostscript
[10:25:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:21] <_joe_>	 !log deleting calico deploy and configmap in kubernetes in eqiad, recreating with helmfile
[10:29:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:46] <logmsgbot>	 !log oblivian@ helmfile [EQIAD] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' .
[10:32:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:12] <_joe_>	 !log recreating rbac roles via helmfile
[10:39:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:32] <logmsgbot>	 !log oblivian@ helmfile [EQIAD] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
[10:39:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:11] <logmsgbot>	 !log oblivian@ helmfile [EQIAD] Ran 'apply' command on namespace 'kube-system' for release 'coredns' .
[10:44:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:59] <wikibugs>	 (03PS1) 10Fsero: helmfile, eqiad: bug: ammending coredns values [deployment-charts] - 10https://gerrit.wikimedia.org/r/529931
[10:46:33] <wikibugs>	 (03CR) 10Fsero: [V: 03+2 C: 03+2] helmfile, eqiad: bug: ammending coredns values [deployment-charts] - 10https://gerrit.wikimedia.org/r/529931 (owner: 10Fsero)
[10:47:07] <icinga-wm>	 PROBLEM - Check size of conntrack table on kubernetes2006 is CRITICAL: CRITICAL: nf_conntrack is 92 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[10:49:37] <logmsgbot>	 !log oblivian@ helmfile [EQIAD] Ran 'apply' command on namespace 'kube-system' for release 'coredns' .
[10:49:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:26] <logmsgbot>	 !log oblivian@cumin1001 START - Cookbook sre.hosts.downtime
[10:56:26] <logmsgbot>	 !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[10:56:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:11] <logmsgbot>	 !log oblivian@cumin1001 START - Cookbook sre.hosts.downtime
[10:57:11] <logmsgbot>	 !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[10:57:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:06] <_joe_>	 !log [eqiad] downtiming zotero on icinga for 10 minutes while recreating the deployment with helmfile
[10:59:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190813T1100).
[11:00:04] <jouncebot>	 raynor: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:20] <raynor>	 o/
[11:00:59] <raynor>	 I have SWAT rights, and my patch is the only one, I can deploy by myself.
[11:01:50] <wikibugs>	 (03PS3) 10Elukey: admin: add analytics-admins and ops to gpu-users [puppet] - 10https://gerrit.wikimedia.org/r/529101
[11:02:15] <elukey>	 jbond42: o/ - time for a quick review? --^ 
[11:02:28] <jbond42>	 looking
[11:03:01] <icinga-wm>	 PROBLEM - Check size of conntrack table on kubernetes2006 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[11:03:05] <wikibugs>	 (03CR) 10Pmiazga: [C: 03+2] Undeploy editor gender surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529925 (https://phabricator.wikimedia.org/T227793) (owner: 10Pmiazga)
[11:04:08] <wikibugs>	 (03Merged) 10jenkins-bot: Undeploy editor gender surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529925 (https://phabricator.wikimedia.org/T227793) (owner: 10Pmiazga)
[11:04:11] <fsero>	 !log resetting net.netfilter.nf_conntrack_tcp_timeout_time_wait to 65 in kubernetes2006
[11:04:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:24] <wikibugs>	 (03CR) 10jenkins-bot: Undeploy editor gender surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529925 (https://phabricator.wikimedia.org/T227793) (owner: 10Pmiazga)
[11:05:33] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - zotero_1969: Servers kubernetes1002.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:05:39] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - zotero_1969: Servers kubernetes1002.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:06:02] <fsero>	 this is expected
[11:06:03] <raynor>	 can I proceed with SWAT?
[11:06:13] <icinga-wm>	 RECOVERY - Check size of conntrack table on kubernetes2006 is OK: OK: nf_conntrack is 18 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[11:06:15] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/529101 (owner: 10Elukey)
[11:06:28] <logmsgbot>	 !log oblivian@ helmfile [EQIAD] Ran 'apply' command on namespace 'zotero' for release 'production' .
[11:06:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:38] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin: add analytics-admins and ops to gpu-users [puppet] - 10https://gerrit.wikimedia.org/r/529101 (owner: 10Elukey)
[11:07:56] <_joe_>	 raynor: sure, go on
[11:08:13] <raynor>	 thx _joe_ 
[11:10:19] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:10:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:13:28] <wikibugs>	 (03PS1) 10Jbond: apereo_cas: add cname for idp to idp1001 [dns] - 10https://gerrit.wikimedia.org/r/529934
[11:13:43] <fsero>	 !log recreating termbox namespace - T228836
[11:13:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:51] <stashbot>	 T228836: recreate eqiad cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] - https://phabricator.wikimedia.org/T228836
[11:15:08] <logmsgbot>	 !log pmiazga@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:529925|Undeploy editor gender surveys (T227793)]] (duration: 00m 48s)
[11:15:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:16] <stashbot>	 T227793: First round editor gender surveys - https://phabricator.wikimedia.org/T227793
[11:16:09] <raynor>	 ok, I'm done, does anyone want to push something more?
[11:18:17] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - termbox_3030: Servers kubernetes1001.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:18:21] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - termbox_3030: Servers kubernetes1001.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:18:43] <raynor>	 !log EU SWAT finished
[11:18:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:44] <logmsgbot>	 !log fsero@ helmfile [EQIAD] Ran 'apply' command on namespace 'termbox' for release 'production' .
[11:20:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:36] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] apereo_cas: add cname for idp to idp1001 [dns] - 10https://gerrit.wikimedia.org/r/529934 (owner: 10Jbond)
[11:21:48] <fsero>	 !log recreating citoid eventgate-analytics eventgate-main mathoid namespace - T228836
[11:21:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:56] <stashbot>	 T228836: recreate eqiad cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] - https://phabricator.wikimedia.org/T228836
[11:25:08] <logmsgbot>	 !log fsero@ helmfile [EQIAD] Ran 'apply' command on namespace 'citoid' for release 'production' .
[11:25:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:01] <logmsgbot>	 !log fsero@ helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics' for release 'analytics' .
[11:29:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:09] <icinga-wm>	 PROBLEM - High lag on wdqs1007 is CRITICAL: 3615 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[11:29:43] <icinga-wm>	 PROBLEM - High lag on wdqs1008 is CRITICAL: 3649 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[11:29:49] <icinga-wm>	 PROBLEM - High lag on wdqs1003 is CRITICAL: 3656 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[11:29:55] <icinga-wm>	 PROBLEM - High lag on wdqs2004 is CRITICAL: 3660 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[11:29:57] <icinga-wm>	 PROBLEM - High lag on wdqs1009 is CRITICAL: 3663 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[11:29:59] <icinga-wm>	 PROBLEM - High lag on wdqs1010 is CRITICAL: 3665 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[11:30:09] <icinga-wm>	 PROBLEM - High lag on wdqs2006 is CRITICAL: 3675 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[11:30:22] <marostegui>	 gehel onimisionipe ^
[11:30:26] <logmsgbot>	 !log fsero@ helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-main' for release 'main' .
[11:30:27] <icinga-wm>	 PROBLEM - High lag on wdqs2005 is CRITICAL: 3694 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[11:30:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:40] <gehel>	 Looking
[11:30:46] <marostegui>	 thanks :)
[11:31:54] <gehel>	 wdqs updater looks healthy
[11:32:23] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[11:32:30] <gehel>	 there was a slight increase in writes, but not much
[11:32:58] <gehel>	 increased number of banned requests, but that might be a consequence of a slow down
[11:34:39] <gehel>	 !log restart wdqs-updater on wdqs2001
[11:34:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:55] <gehel>	 actually, looks like the updater is processing data but not finding any change to apply
[11:35:46] <gehel>	 !log restart wdqs-blazegraph on wdqs2001
[11:35:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:20] <gehel>	 this looks wrong!
[11:36:50] <gehel>	 reads seem to be OK, so no user impact, except for the updater lag
[11:39:20] <logmsgbot>	 !log fsero@ helmfile [EQIAD] Ran 'apply' command on namespace 'mathoid' for release 'production' .
[11:39:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:19] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[11:41:34] <icinga-wm>	 ACKNOWLEDGEMENT - High lag on wdqs1003 is CRITICAL: 4325 ge 3600 Gehel under investigation at https://phabricator.wikimedia.org/T230410 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[11:41:34] <icinga-wm>	 ACKNOWLEDGEMENT - High lag on wdqs1007 is CRITICAL: 4285 ge 3600 Gehel under investigation at https://phabricator.wikimedia.org/T230410 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[11:41:34] <icinga-wm>	 ACKNOWLEDGEMENT - High lag on wdqs1008 is CRITICAL: 4319 ge 3600 Gehel under investigation at https://phabricator.wikimedia.org/T230410 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[11:41:35] <icinga-wm>	 ACKNOWLEDGEMENT - High lag on wdqs1009 is CRITICAL: 4333 ge 3600 Gehel under investigation at https://phabricator.wikimedia.org/T230410 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[11:41:36] <icinga-wm>	 ACKNOWLEDGEMENT - High lag on wdqs1010 is CRITICAL: 4333 ge 3600 Gehel under investigation at https://phabricator.wikimedia.org/T230410 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[11:41:37] <icinga-wm>	 ACKNOWLEDGEMENT - High lag on wdqs2004 is CRITICAL: 4330 ge 3600 Gehel under investigation at https://phabricator.wikimedia.org/T230410 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[11:41:38] <icinga-wm>	 ACKNOWLEDGEMENT - High lag on wdqs2005 is CRITICAL: 4266 ge 3600 Gehel under investigation at https://phabricator.wikimedia.org/T230410 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[11:41:39] <icinga-wm>	 ACKNOWLEDGEMENT - High lag on wdqs2006 is CRITICAL: 4344 ge 3600 Gehel under investigation at https://phabricator.wikimedia.org/T230410 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[11:42:19] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:42:55] <fsero>	 gehel: might be related to my eventgate redeploy?
[11:43:02] <fsero>	 timing matches
[11:43:47] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:44:21] <fsero>	 !log recreating cxserver blubber and sessionstore namespace - T228836
[11:44:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:29] <stashbot>	 T228836: recreate eqiad cluster state from code stored in deployment-charts with helmfile [MIGHT CAUSE DOWNTIME] - https://phabricator.wikimedia.org/T228836
[11:45:05] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 33.94, 34.14, 32.24 https://wikitech.wikimedia.org/wiki/Application_servers
[11:45:23] <gehel>	 fsero: the timing is suspicious. Do you know what that redeploy actually did?
[11:46:03] <fsero>	 well just redeploying eventgate analytics and main
[11:46:16] <fsero>	 maybe some events were reprocessed?
[11:47:03] <gehel>	 losing the kafka sequence numbers?
[11:47:37] <gehel>	 or is there a chance that events were requeued?
[11:49:31] <fsero>	 I don't know much about the internals, but I guess there is such a chance
[11:49:42] <fsero>	 maybe otto or elukey knows more
[11:49:52] <logmsgbot>	 !log fsero@ helmfile [EQIAD] Ran 'apply' command on namespace 'blubberoid' for release 'production' .
[11:49:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:19] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - blubberoid_8748: Servers kubernetes1002.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:50:37] <gehel>	 nothing is actually burning, I'll go finish lunch and I'll be back
[11:51:47] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - blubberoid_8748: Servers kubernetes1001.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:56:32] <logmsgbot>	 !log fsero@ helmfile [EQIAD] Ran 'apply' command on namespace 'blubberoid' for release 'production' .
[11:56:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:50] <logmsgbot>	 !log fsero@ helmfile [EQIAD] Ran 'apply' command on namespace 'blubberoid' for release 'production' .
[11:56:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:27] <_joe_>	 gehel: why would wdqs still be connected to eventgate-eqiad? we did depool it ± 1 hour ago
[11:59:24] <gehel>	 _joe_: I'm assuming event-gate is the input to kafka, but looks like I'm wrong here...
[12:00:38] <gehel>	 wdqs isn't connected to event gate, it is consuming a kafka queue
[12:00:46] <logmsgbot>	 !log fsero@ helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' .
[12:00:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:51] <wikibugs>	 (03PS1) 10Fsero: helmfile, eqiad,codfw: bug: ammending blubberoid namespace quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/529940
[12:02:49] <wikibugs>	 (03CR) 10Fsero: [V: 03+2 C: 03+2] helmfile, eqiad,codfw: bug: ammending blubberoid namespace quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/529940 (owner: 10Fsero)
[12:03:39] <logmsgbot>	 !log fsero@ helmfile [EQIAD] Ran 'apply' command on namespace 'sessionstore' for release 'production' .
[12:03:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:04] <_joe_>	 !log restarted php-fpm on mw1221
[12:08:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:27] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:09:57] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 32.57, 32.78, 32.03 https://wikitech.wikimedia.org/wiki/Application_servers
[12:10:57] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:15:24] <wikibugs>	 (03PS5) 10Jbond: Initial stub role for the IDP (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/528487 (owner: 10Muehlenhoff)
[12:17:47] <logmsgbot>	 !log fsero@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=sessionstore|citoid|cxserver|eventgate-analytics|eventgate-main|termbox|blubberoid|mathoid|zotero,name=eqiad
[12:17:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:17] <icinga-wm>	 RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 33.41 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[12:19:25] <icinga-wm>	 RECOVERY - High lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 14.94 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[12:19:37] <icinga-wm>	 RECOVERY - High lag on wdqs2006 is OK: (C)3600 ge (W)1200 ge 46.89 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[12:19:45] <_joe_>	 uhm ok so eventgate when switched over to the other datacenter doesn't work as expected it seems
[12:19:48] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Initial stub role for the IDP (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/528487 (owner: 10Muehlenhoff)
[12:19:55] <icinga-wm>	 RECOVERY - High lag on wdqs2005 is OK: (C)3600 ge (W)1200 ge 28.11 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[12:20:15] <icinga-wm>	 RECOVERY - High lag on wdqs1007 is OK: (C)3600 ge (W)1200 ge 35.1 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[12:20:47] <icinga-wm>	 RECOVERY - High lag on wdqs1008 is OK: (C)3600 ge (W)1200 ge 28.55 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[12:21:01] <icinga-wm>	 RECOVERY - High lag on wdqs2004 is OK: (C)3600 ge (W)1200 ge 41.03 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[12:21:03] <icinga-wm>	 RECOVERY - High lag on wdqs1009 is OK: (C)3600 ge (W)1200 ge 109.1 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[12:30:41] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 12.44, 15.01, 23.43 https://wikitech.wikimedia.org/wiki/Application_servers
[12:32:57] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 9.72, 13.59, 23.26 https://wikitech.wikimedia.org/wiki/Application_servers
[12:36:43] <gehel>	 elukey: around? have time for a chat about event gate? I don't understand what happened 
[12:39:01] <elukey>	 gehel: hey! I am very ignorant about eventgate, and Andrew O. is on holidays.. I think that jijiki is going to follow up with Pchelolo later on (not sure if it is for the same issue). He knows a lot about eventgate
[12:39:22] <jijiki>	 gehel: take a number in line :p
[12:39:25] <jijiki>	 hahaha
[12:39:26] <elukey>	 I can try to answer basic questions
[12:39:33] <fsero>	 i can take a look too
[12:39:35] <elukey>	 but no idea about the internals
[12:39:36] <fsero>	 lemme try
[12:39:48] <jijiki>	 gehel: I have pinged petr on -services 
[12:39:57] <jijiki>	 maybe we could have a discussion together
[12:40:31] <gehel>	 from what I see on the wdqs side, it looks like starting ~10:30 UTC, we were receiving events that were already processed
[12:40:41] <gehel>	 I've opened T230410 to track it
[12:40:41] <jijiki>	 gehel: for the time being joe believes that something is up  on codfw eventgate 
[12:40:41] <stashbot>	 T230410: wdqs updater processing events but not finding anything useful - https://phabricator.wikimedia.org/T230410
[12:41:11] <jijiki>	 as the wdqs issue along with high load on some mw API servers
[12:41:22] <jijiki>	 happened after we depooled eventgate on eqiad
[12:42:07] <gehel>	 what events are processed by eventgate? changes from mediawiki? including wikidata?
[12:42:43] <fsero>	 gehel:  10:30 is when we depooled eqiad eventgate
[12:43:24] <gehel>	 does that mean that wdqs should not have been receiving any events at all? or were events routed to codfw?
[12:43:25] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1346 is CRITICAL: CRITICAL - load average: 74.18, 40.34, 26.48 https://wikitech.wikimedia.org/wiki/Application_servers
[12:43:31] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 72.78, 39.17, 25.81 https://wikitech.wikimedia.org/wiki/Application_servers
[12:43:47] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 70.74, 33.71, 19.22 https://wikitech.wikimedia.org/wiki/Application_servers
[12:43:51] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1314 is CRITICAL: CRITICAL - load average: 99.65, 52.22, 31.28 https://wikitech.wikimedia.org/wiki/Application_servers
[12:43:59] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1317 is CRITICAL: CRITICAL - load average: 74.22, 40.55, 26.76 https://wikitech.wikimedia.org/wiki/Application_servers
[12:44:51] <wikibugs>	 (03PS1) 10Fsero: helmfile, eqiad,codfw: bug: several ammends [deployment-charts] - 10https://gerrit.wikimedia.org/r/529941
[12:45:07] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1341 is OK: OK - load average: 38.01, 36.79, 26.30 https://wikitech.wikimedia.org/wiki/Application_servers
[12:45:19] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1229 is CRITICAL: CRITICAL - load average: 53.48, 26.85, 16.65 https://wikitech.wikimedia.org/wiki/Application_servers
[12:45:19] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 57.72, 30.62, 18.43 https://wikitech.wikimedia.org/wiki/Application_servers
[12:45:39] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:45:48] <wikibugs>	 (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: wait for write queue to be empty after cluster operation (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/529922 (owner: 10Gehel)
[12:46:17] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:46:17] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:46:21] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:46:51] <wikibugs>	 (03Abandoned) 10Fsero: helmfile, eqiad,codfw: bug: several ammends [deployment-charts] - 10https://gerrit.wikimedia.org/r/529941 (owner: 10Fsero)
[12:46:53] <elukey>	 gehel: afaik eventgate-main processes all the events except the jobqueue ones, that are currently being ported
[12:47:13] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:47:51] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:47:53] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:47:56] <gehel>	 what's strange from my point of view, is that wdqs was still reading events from kafka, but did not find anything worth applying
[12:48:15] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1340 is CRITICAL: CRITICAL - load average: 84.14, 53.45, 34.36 https://wikitech.wikimedia.org/wiki/Application_servers
[12:48:15] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1339 is CRITICAL: CRITICAL - load average: 82.16, 52.03, 33.45 https://wikitech.wikimedia.org/wiki/Application_servers
[12:48:21] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 80.94, 53.65, 34.46 https://wikitech.wikimedia.org/wiki/Application_servers
[12:48:21] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1345 is CRITICAL: CRITICAL - load average: 81.22, 54.20, 34.60 https://wikitech.wikimedia.org/wiki/Application_servers
[12:48:24] <gehel>	 I need to dig a bit more into the updater code, to see what the logic for discarding events is.
[12:48:27] <jijiki>	 ok that is different 
[12:48:29] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1343 is CRITICAL: CRITICAL - load average: 80.83, 52.57, 33.68 https://wikitech.wikimedia.org/wiki/Application_servers
[12:49:16] <elukey>	 gehel: from what kafka topic? And you mean jumbo right?
[12:49:59] <gehel>	 elukey: `--kafka kafka-main2001.codfw.wmnet:9092,kafka-main2002.codfw.wmnet:9092,kafka-main2003.codfw.wmnet:9092 --consumer wdqs2001`
[12:50:05] <wikibugs>	 (03PS1) 10Fsero: helmfile, eqiad,codfw: bug: ammending sessionstore namespace quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/529942
[12:50:15] <gehel>	 that looks like kafka-main to me
[12:50:30] <elukey>	 gehel: ah right now it pulls directly from kafka-main
[12:50:37] <elukey>	 so definitely eventgate-main's events
[12:50:58] <gehel>	 the topic is hardcoded, lemme check in the code
[12:51:30] <wikibugs>	 (03PS2) 10Fsero: helmfile, eqiad,codfw: bug: ammending sessionstore namespace quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/529942
[12:51:41] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:52:07] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:52:21] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1312 is CRITICAL: CRITICAL - load average: 86.82, 56.41, 35.67 https://wikitech.wikimedia.org/wiki/Application_servers
[12:52:41] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:52:44] <wikibugs>	 (03CR) 10Fsero: [V: 03+2 C: 03+2] helmfile, eqiad,codfw: bug: ammending sessionstore namespace quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/529942 (owner: 10Fsero)
[12:52:47] <gehel>	 elukey: topics: mediawiki.revision-create, mediawiki.page-delete, mediawiki.page-undelete
[12:52:47] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:53:03] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1340 is CRITICAL: CRITICAL - load average: 76.64, 56.70, 40.52 https://wikitech.wikimedia.org/wiki/Application_servers
[12:53:03] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1339 is CRITICAL: CRITICAL - load average: 61.39, 53.70, 39.32 https://wikitech.wikimedia.org/wiki/Application_servers
[12:53:07] <wikibugs>	 (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: wait for write queue to be empty after cluster operation (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/529922 (owner: 10Gehel)
[12:53:09] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 62.75, 55.04, 40.41 https://wikitech.wikimedia.org/wiki/Application_servers
[12:53:09] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1345 is CRITICAL: CRITICAL - load average: 66.66, 55.09, 40.21 https://wikitech.wikimedia.org/wiki/Application_servers
[12:53:13] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:53:17] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:53:19] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1229 is OK: OK - load average: 16.38, 23.14, 20.32 https://wikitech.wikimedia.org/wiki/Application_servers
[12:53:55] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1344 is CRITICAL: CRITICAL - load average: 77.47, 52.62, 37.93 https://wikitech.wikimedia.org/wiki/Application_servers
[12:54:39] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1346 is CRITICAL: CRITICAL - load average: 80.98, 57.85, 41.04 https://wikitech.wikimedia.org/wiki/Application_servers
[12:54:49] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:54:53] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1343 is CRITICAL: CRITICAL - load average: 82.57, 61.51, 43.23 https://wikitech.wikimedia.org/wiki/Application_servers
[12:54:58] <_joe_>	 jijiki: ^^ this is hhvm having issues
[12:55:26] <jijiki>	 I know 
[12:55:33] <jijiki>	 I am trying to understand
[12:55:39] <jijiki>	 after 12:40 
[12:55:42] <_joe_>	 I'd just restart it
[12:55:54] <_joe_>	 only hhvm
[12:55:55] <jijiki>	 we started having increased load
[12:56:05] <_joe_>	 on the worst affected servers
[12:56:12] <jijiki>	 it is only api again 
[12:56:34] <_joe_>	 anyway, I'm off for ~1h30
[12:56:42] <wikibugs>	 (03PS1) 10Fsero: helmfile, eqiad: bug: ammending sessionstore namespace quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/529943
[12:57:09] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 58.61, 32.86, 21.72 https://wikitech.wikimedia.org/wiki/Application_servers
[12:57:12] <wikibugs>	 (03CR) 10Fsero: [V: 03+2 C: 03+2] helmfile, eqiad: bug: ammending sessionstore namespace quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/529943 (owner: 10Fsero)
[12:57:13] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1235 is CRITICAL: CRITICAL - load average: 48.25, 25.58, 16.83 https://wikitech.wikimedia.org/wiki/Application_servers
[12:57:36] <jijiki>	 !log Restart hhvm on mw1235
[12:57:37] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:57:41] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:57:43] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1315 is CRITICAL: CRITICAL - load average: 74.99, 46.60, 33.28 https://wikitech.wikimedia.org/wiki/Application_servers
[12:57:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:45] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:58:31] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:58:47] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1312 is OK: OK - load average: 22.97, 37.04, 34.00 https://wikitech.wikimedia.org/wiki/Application_servers
[12:58:49] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1235 is OK: OK - load average: 21.22, 24.14, 17.32 https://wikitech.wikimedia.org/wiki/Application_servers
[12:59:09] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:59:11] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:59:13] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:59:17] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:59:29] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1339 is CRITICAL: CRITICAL - load average: 56.87, 53.92, 44.44 https://wikitech.wikimedia.org/wiki/Application_servers
[13:01:57] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 11.29, 22.27, 20.66 https://wikitech.wikimedia.org/wiki/Application_servers
[13:02:29] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1315 is OK: OK - load average: 22.28, 33.90, 31.84 https://wikitech.wikimedia.org/wiki/Application_servers
[13:03:31] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1344 is OK: OK - load average: 23.04, 31.19, 34.95 https://wikitech.wikimedia.org/wiki/Application_servers
[13:04:03] <jijiki>	 it looks like they are recovering 
[13:04:52] <elukey>	 do we know why?
[13:05:53] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1339 is CRITICAL: CRITICAL - load average: 78.52, 51.52, 44.28 https://wikitech.wikimedia.org/wiki/Application_servers
[13:05:59] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 77.41, 50.17, 43.71 https://wikitech.wikimedia.org/wiki/Application_servers
[13:06:01] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1345 is CRITICAL: CRITICAL - load average: 73.89, 47.80, 42.88 https://wikitech.wikimedia.org/wiki/Application_servers
[13:07:01] <jijiki>	 !log rolling restart hhvm on api servers in eqiad
[13:07:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:58] <elukey>	 +1 jijiki 
[13:08:01] <wikibugs>	 10Operations, 10Graphite: scale statsd reporting/aggregation (plan) - https://phabricator.wikimedia.org/T89857 (10fgiunchedi) 05Stalled→03Invalid We're sunsetting Graphite (e.g. {T228380}) so resolving as invalid
[13:08:03] <wikibugs>	 10Operations, 10RESTBase: Investigate apparent restbase request rate under-reporting in graphite: statsd issue? - https://phabricator.wikimedia.org/T89846 (10fgiunchedi)
[13:08:08] <wikibugs>	 10Operations, 10WMDE-Analytics-Engineering, 10Core Platform Team Legacy (Watching / External), 10Graphite, and 3 others: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451 (10fgiunchedi)
[13:08:27] <wikibugs>	 10Operations, 10WMDE-Analytics-Engineering, 10Core Platform Team Legacy (Watching / External), 10Graphite, and 3 others: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451 (10fgiunchedi) 05Open→03Invalid We're sunsetting Graphite (e.g. {T228380}) so resolving as invalid
[13:12:04] <fsero>	 jijiki: elukey gehel did something change yesterday at 18:00 UTC https://logstash.wikimedia.org/goto/0c0260fdbc435f182b6392c3a46ea455 ?
[13:12:13] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1339 is CRITICAL: CRITICAL - load average: 73.84, 57.97, 49.53 https://wikitech.wikimedia.org/wiki/Application_servers
[13:12:19] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (10fgiunchedi)
[13:12:37] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (10fgiunchedi)
[13:12:39] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1313 is CRITICAL: CRITICAL - load average: 76.85, 46.21, 34.24 https://wikitech.wikimedia.org/wiki/Application_servers
[13:12:41] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1342 is CRITICAL: CRITICAL - load average: 76.17, 53.43, 42.92 https://wikitech.wikimedia.org/wiki/Application_servers
[13:12:42] <wikibugs>	 10Operations, 10observability, 10User-fgiunchedi: Expand modern metrics infrastructure coverage (2018-19 Q2 goal) - https://phabricator.wikimedia.org/T205862 (10fgiunchedi)
[13:13:00] <jijiki>	 tx fsero
[13:14:03] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1343 is OK: OK - load average: 7.69, 25.49, 33.34 https://wikitech.wikimedia.org/wiki/Application_servers
[13:14:29] <gehel>	 fsero: not that I know, but that looks like the cirrus checker that starts about that time weekly
[13:15:10] <elukey>	 nothing changed as far as I know, but yes it seems confined to mediawiki.job.cirrusSearchCheckerJob
[13:15:15] <elukey>	 (the topic I mean)
[13:15:25] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1340 is OK: OK - load average: 22.52, 27.27, 34.87 https://wikitech.wikimedia.org/wiki/Application_servers
[13:15:25] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1346 is OK: OK - load average: 8.33, 27.35, 34.99 https://wikitech.wikimedia.org/wiki/Application_servers
[13:15:51] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1313 is OK: OK - load average: 25.95, 35.83, 32.41 https://wikitech.wikimedia.org/wiki/Application_servers
[13:15:57] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1317 is OK: OK - load average: 24.10, 26.85, 35.92 https://wikitech.wikimedia.org/wiki/Application_servers
[13:17:15] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1343 is CRITICAL: CRITICAL - load average: 80.57, 45.60, 39.19 https://wikitech.wikimedia.org/wiki/Application_servers
[13:17:29] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1342 is OK: OK - load average: 6.32, 29.09, 35.67 https://wikitech.wikimedia.org/wiki/Application_servers
[13:17:38] <wikibugs>	 (03PS5) 10Gehel: elasticsearch: wait for write queue to be empty after cluster operation [cookbooks] - 10https://gerrit.wikimedia.org/r/529922
[13:18:37] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1340 is CRITICAL: CRITICAL - load average: 85.86, 48.79, 41.32 https://wikitech.wikimedia.org/wiki/Application_servers
[13:18:37] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1346 is CRITICAL: CRITICAL - load average: 73.76, 44.20, 39.69 https://wikitech.wikimedia.org/wiki/Application_servers
[13:18:43] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1345 is OK: OK - load average: 15.69, 29.09, 35.82 https://wikitech.wikimedia.org/wiki/Application_servers
[13:18:50] <wikibugs>	 (03CR) 10Gehel: elasticsearch: wait for write queue to be empty after cluster operation (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/529922 (owner: 10Gehel)
[13:19:20] <jijiki>	 hhvm restarts are at 50%
[13:19:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: wait for write queue to be empty after cluster operation [cookbooks] - 10https://gerrit.wikimedia.org/r/529922 (owner: 10Gehel)
[13:20:12] <jbond42>	 !log rolling update of postgresql-9.6
[13:20:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:54] <wikibugs>	 (03PS6) 10Gehel: elasticsearch: wait for write queue to be empty after cluster operation [cookbooks] - 10https://gerrit.wikimedia.org/r/529922
[13:21:33] <wikibugs>	 (03PS1) 10Ema: ATS: set minimum-content-length for compress plugin [puppet] - 10https://gerrit.wikimedia.org/r/529944 (https://phabricator.wikimedia.org/T227432)
[13:21:35] <wikibugs>	 (03PS1) 10Ema: ATS: enable compress plugin on cp5002 [puppet] - 10https://gerrit.wikimedia.org/r/529945 (https://phabricator.wikimedia.org/T227432)
[13:21:55] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1341 is OK: OK - load average: 21.05, 27.63, 35.67 https://wikitech.wikimedia.org/wiki/Application_servers
[13:25:03] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1339 is OK: OK - load average: 6.19, 20.94, 35.71 https://wikitech.wikimedia.org/wiki/Application_servers
[13:26:05] <wikibugs>	 (03CR) 10Mathew.onipe: [C: 03+1] "If it works then active_dc should be added as a required param?" [cookbooks] - 10https://gerrit.wikimedia.org/r/529922 (owner: 10Gehel)
[13:26:37] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1340 is OK: OK - load average: 16.87, 27.96, 35.34 https://wikitech.wikimedia.org/wiki/Application_servers
[13:26:39] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1346 is OK: OK - load average: 19.23, 28.48, 35.19 https://wikitech.wikimedia.org/wiki/Application_servers
[13:26:55] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 10.57, 14.50, 23.09 https://wikitech.wikimedia.org/wiki/Application_servers
[13:27:39] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITI
[13:27:39] <icinga-wm>	 view mobile HTML for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[13:28:29] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1343 is OK: OK - load average: 16.48, 25.51, 34.37 https://wikitech.wikimedia.org/wiki/Application_servers
[13:29:15] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[13:30:24] <jijiki>	 elukey: it looks ok for now
[13:32:19] <wikibugs>	 (03CR) 10Ema: [C: 03+2] ATS: set minimum-content-length for compress plugin [puppet] - 10https://gerrit.wikimedia.org/r/529944 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema)
[13:34:05] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[13:35:43] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[13:36:39] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1314 is OK: OK - load average: 19.69, 19.93, 35.60 https://wikitech.wikimedia.org/wiki/Application_servers
[13:42:59] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 10.99, 11.27, 22.69 https://wikitech.wikimedia.org/wiki/Application_servers
[13:45:32] <wikibugs>	 (03CR) 10Effie Mouzeli: mediawiki: add cluster latency alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi)
[13:46:37] <wikibugs>	 (03PS1) 10Jbond: idp: enable idp role on idp1001 [puppet] - 10https://gerrit.wikimedia.org/r/529946
[13:54:53] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[13:55:38] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] idp: enable idp role on idp1001 [puppet] - 10https://gerrit.wikimedia.org/r/529946 (owner: 10Jbond)
[13:59:02] <wikibugs>	 (03CR) 10Gehel: "> Patch Set 6: Code-Review+1" [cookbooks] - 10https://gerrit.wikimedia.org/r/529922 (owner: 10Gehel)
[13:59:43] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:02:36] <tarrow>	 jakob_WMDE and I are going to smoke test the termbox service on eqiad. We'll be keeping below 5 req/s so this should be totally negligible to the api appservers; just putting load on our service
[14:02:48] <wikibugs>	 (03CR) 10CDanis: mediawiki: add cluster latency alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi)
[14:03:15] <wikibugs>	 (03PS1) 10Jbond: idp: ensure we include the passwords class [puppet] - 10https://gerrit.wikimedia.org/r/529948
[14:03:41] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] elasticsearch: wait for write queue to be empty after cluster operation [cookbooks] - 10https://gerrit.wikimedia.org/r/529922 (owner: 10Gehel)
[14:04:40] <logmsgbot>	 !log volans@cumin2001 START - Cookbook sre.hosts.decommission
[14:04:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:47] <logmsgbot>	 !log volans@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[14:04:51] <wikibugs>	 10Operations, 10serviceops: Migrate pool counters to Buster - https://phabricator.wikimedia.org/T224572 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by volans@cumin2001 for hosts: `poolcounter2001.codfw.wmnet` -  poolcounter2001.codfw.wmnet   - Removed from Puppet master and PuppetDB   - Do...
[14:04:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:37] <wikibugs>	 (03CR) 10CDanis: dbctl: add note & candidate_master fields (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/529396 (https://phabricator.wikimedia.org/T229677) (owner: 10CDanis)
[14:07:05] <icinga-wm>	 RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[14:07:27] <volans>	 paravoid: ^^^ :)
[14:08:07] <paravoid>	 :)
[14:14:05] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:14:46] <XioNoX>	 !log disable all peering and transit on cr2-eqdfw
[14:14:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:59] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:18:34] <XioNoX>	 !log reboot cr2-eqdfw for software upgrade - T227886
[14:18:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:30] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] idp: ensure we include the passwords class [puppet] - 10https://gerrit.wikimedia.org/r/529948 (owner: 10Jbond)
[14:21:57] <wikibugs>	 (03CR) 10Filippo Giunchedi: mediawiki: add cluster latency alerts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi)
[14:22:17] <wikibugs>	 (03PS3) 10Filippo Giunchedi: mediawiki: add cluster latency alerts [puppet] - 10https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396)
[14:24:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: mediawiki: add cluster latency alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi)
[14:25:46] <wikibugs>	 (03CR) 10Effie Mouzeli: mediawiki: add cluster latency alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi)
[14:28:54] <XioNoX>	 cr2-eqdfw is back up
[14:29:55] <XioNoX>	 !log rollback: disable all peering and transit on cr2-eqdfw
[14:30:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:09] <logmsgbot>	 !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot
[14:31:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:28] <logmsgbot>	 !log gehel@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=99)
[14:31:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:51] <wikibugs>	 (03PS4) 10Mathew.onipe: remote: make RemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/529802
[14:33:54] <wikibugs>	 (03PS1) 10Jbond: idp: include the correct password class [puppet] - 10https://gerrit.wikimedia.org/r/529955
[14:36:37] <wikibugs>	 10Operations, 10ops-eqiad, 10vm-requests: rack/setup/install ganeti10([09]|1[0-8[).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10Cmjohnson)
[14:36:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] remote: make RemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/529802 (owner: 10Mathew.onipe)
[14:38:48] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM for now, for logged in people will be the same" [software/netbox] - 10https://gerrit.wikimedia.org/r/529171 (owner: 10CRusnov)
[14:39:36] <wikibugs>	 (03CR) 10CRusnov: [V: 03+2 C: 03+2] switch swagger to nonpublic mode [software/netbox] - 10https://gerrit.wikimedia.org/r/529171 (owner: 10CRusnov)
[14:40:05] <wikibugs>	 (03Abandoned) 10CRusnov: netbox: redirect swagger doc requests to official docs [puppet] - 10https://gerrit.wikimedia.org/r/528531 (owner: 10CRusnov)
[14:40:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] idp: include the correct password class [puppet] - 10https://gerrit.wikimedia.org/r/529955 (owner: 10Jbond)
[14:41:59] <wikibugs>	 10Operations, 10ops-eqiad, 10vm-requests: rack/setup/install ganeti10([09]|1[0-8[).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10Cmjohnson) a:05akosiaris→03Jclark-ctr Please rack, label and cable these servers with the racking locations above. Add them to netbox, be sure to make sure status...
[14:42:16] <logmsgbot>	 !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot
[14:42:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:36] <logmsgbot>	 !log gehel@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=99)
[14:42:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:00] <wikibugs>	 (03PS5) 10Mathew.onipe: remote: make RemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/529802
[14:44:06] <wikibugs>	 (03PS4) 10Jhedden: openstack: initial haproxy profile [puppet] - 10https://gerrit.wikimedia.org/r/529436 (https://phabricator.wikimedia.org/T223907)
[14:44:35] <wikibugs>	 (03CR) 10Mathew.onipe: remote: make RemoteHosts iterable (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/529802 (owner: 10Mathew.onipe)
[14:44:37] <wikibugs>	 (03CR) 10Jhedden: "Thanks for the review. I was using the base profile for common options, and the deployment profiles for local options." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/529436 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden)
[14:45:06] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] openstack: initial haproxy profile [puppet] - 10https://gerrit.wikimedia.org/r/529436 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden)
[14:48:07] <wikibugs>	 (03PS5) 10Jhedden: openstack: initial haproxy profile [puppet] - 10https://gerrit.wikimedia.org/r/529436 (https://phabricator.wikimedia.org/T223907)
[14:55:43] <wikibugs>	 (03PS1) 10CRusnov: make netbox tokens available to hosts that need em [labs/private] - 10https://gerrit.wikimedia.org/r/529958
[14:57:10] <wikibugs>	 (03CR) 10CRusnov: [V: 03+2 C: 03+2] make netbox tokens available to hosts that need em [labs/private] - 10https://gerrit.wikimedia.org/r/529958 (owner: 10CRusnov)
[14:57:28] <ema>	 !log cp5002: reboot for kernel upgrade
[14:57:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:58] <wikibugs>	 (03PS1) 10CRusnov: netbox: Add configuration for Netbox spicerack backend [puppet] - 10https://gerrit.wikimedia.org/r/529959
[15:00:12] <wikibugs>	 (03PS30) 10Daimona Eaytoy: Update AbuseFilter config to keep the status quo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772
[15:01:19] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen - https://phabricator.wikimedia.org/T228447 (10cchen) 05Resolved→03Open Thanks  @colewhite !
[15:01:53] <XioNoX>	 !log increase ospf cost of cr2-eqord<->cr2-eqiad link (+1000)
[15:01:57] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen - https://phabricator.wikimedia.org/T228447 (10cchen) 05Open→03Resolved
[15:01:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:43] <wikibugs>	 (03PS3) 10Ayounsi: Netflow: install and configure Samplicator [puppet] - 10https://gerrit.wikimedia.org/r/529841 (https://phabricator.wikimedia.org/T226810)
[15:08:18] <wikibugs>	 (03CR) 10CDanis: dbctl: add note & candidate_master fields (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/529396 (https://phabricator.wikimedia.org/T229677) (owner: 10CDanis)
[15:08:48] <wikibugs>	 (03PS1) 10Jbond: idp: update the apereo_cas module to accept a content param for keystore [puppet] - 10https://gerrit.wikimedia.org/r/529960
[15:09:56] <wikibugs>	 (03CR) 10Volans: "Some comments inline, looks almost ready to me." (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/529802 (owner: 10Mathew.onipe)
[15:11:24] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "One missing nit, looks good otherwise." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/529959 (owner: 10CRusnov)
[15:11:41] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM one optional nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/529841 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi)
[15:11:51] <wikibugs>	 (03PS4) 10Jdlrobson: Update wgSkipSkins to experiment with not showing skins to users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511321 (https://phabricator.wikimedia.org/T223824)
[15:12:13] <XioNoX>	 !log disable all peering and transit on cr2-eqord
[15:12:16] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Ship it!" (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/529396 (https://phabricator.wikimedia.org/T229677) (owner: 10CDanis)
[15:12:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:14] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] idp: update the apereo_cas module to accept a content param for keystore [puppet] - 10https://gerrit.wikimedia.org/r/529960 (owner: 10Jbond)
[15:13:35] <icinga-wm>	 PROBLEM - Host elastic2054.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:13:39] <wikibugs>	 (03PS2) 10CRusnov: netbox: Add configuration for Netbox spicerack backend [puppet] - 10https://gerrit.wikimedia.org/r/529959
[15:14:09] <wikibugs>	 (03PS2) 10Bstorm: labstore: restore original sense of the load alert with prometheus [puppet] - 10https://gerrit.wikimedia.org/r/528965 (https://phabricator.wikimedia.org/T229884)
[15:15:24] <wikibugs>	 (03CR) 10CRusnov: netbox: Add configuration for Netbox spicerack backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/529959 (owner: 10CRusnov)
[15:15:46] <wikibugs>	 (03PS4) 10Ayounsi: Netflow: install and configure Samplicator [puppet] - 10https://gerrit.wikimedia.org/r/529841 (https://phabricator.wikimedia.org/T226810)
[15:16:17] <wikibugs>	 (03CR) 10Ayounsi: "Thanks! Addressed!" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/529841 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi)
[15:16:29] <icinga-wm>	 RECOVERY - Host elastic2054 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms
[15:16:50] <wikibugs>	 (03CR) 10Ayounsi: Netflow: install and configure Samplicator (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/529841 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi)
[15:18:40] <icinga-wm>	 RECOVERY - Host elastic2054.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.99 ms
[15:19:13] <XioNoX>	 !log restart cr2-eqord - T227886
[15:19:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:26] <wikibugs>	 (03CR) 10Jhedden: [C: 04-1] "Looks good, just a minor change needed." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528965 (https://phabricator.wikimedia.org/T229884) (owner: 10Bstorm)
[15:20:33] <wikibugs>	 10Operations, 10Discovery: elastic2054 unresponsive - https://phabricator.wikimedia.org/T227298 (10Papaul) 05Open→03Resolved DIMM A2 replaced and log cleared . Closing this task for now .
[15:21:42] <wikibugs>	 (03CR) 10Bstorm: labstore: restore original sense of the load alert with prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528965 (https://phabricator.wikimedia.org/T229884) (owner: 10Bstorm)
[15:23:29] <wikibugs>	 (03PS1) 10Gehel: elasticsearch: depool servers just before actual operation [cookbooks] - 10https://gerrit.wikimedia.org/r/529963
[15:23:31] <wikibugs>	 (03PS3) 10Bstorm: labstore: restore original sense of the load alert with prometheus [puppet] - 10https://gerrit.wikimedia.org/r/528965 (https://phabricator.wikimedia.org/T229884)
[15:24:18] <wikibugs>	 (03CR) 10Bstorm: labstore: restore original sense of the load alert with prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528965 (https://phabricator.wikimedia.org/T229884) (owner: 10Bstorm)
[15:24:52] <wikibugs>	 (03CR) 10Jhedden: [C: 03+2] labstore: restore original sense of the load alert with prometheus [puppet] - 10https://gerrit.wikimedia.org/r/528965 (https://phabricator.wikimedia.org/T229884) (owner: 10Bstorm)
[15:25:24] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T229156 (10Cmjohnson) 05Open→03Resolved Disks replaced, please re-open and ping me if the disk fails
[15:25:28] <wikibugs>	 10Operations, 10Discovery: elastic2054 unresponsive - https://phabricator.wikimedia.org/T227298 (10Papaul) Return information  {F30018725}
[15:25:36] <wikibugs>	 (03PS4) 10Bstorm: labstore: restore original sense of the load alert with prometheus [puppet] - 10https://gerrit.wikimedia.org/r/528965 (https://phabricator.wikimedia.org/T229884)
[15:28:09] <wikibugs>	 (03CR) 10Jhedden: [C: 03+2] openstack: initial haproxy profile [puppet] - 10https://gerrit.wikimedia.org/r/529436 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden)
[15:28:17] <wikibugs>	 (03PS6) 10Jhedden: openstack: initial haproxy profile [puppet] - 10https://gerrit.wikimedia.org/r/529436 (https://phabricator.wikimedia.org/T223907)
[15:28:27] <XioNoX>	 annnnnd it's back
[15:28:52] <wikibugs>	 10Operations, 10ops-codfw: (OoW) wtp2019 shows error messages in the racadm getsel's output - https://phabricator.wikimedia.org/T221572 (10Papaul) 05Open→03Resolved I checked the server this morning; no errors showing in the log. Closing the task
[15:29:09] <wikibugs>	 10Operations, 10ops-codfw, 10media-storage: Degraded RAID on ms-be2021 - https://phabricator.wikimedia.org/T230275 (10Papaul) p:05Triage→03Normal
[15:30:06] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/529959 (owner: 10CRusnov)
[15:30:08] <wikibugs>	 (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: depool servers just before actual operation [cookbooks] - 10https://gerrit.wikimedia.org/r/529963 (owner: 10Gehel)
[15:30:11] <XioNoX>	 !log rollback ospf + bgp changes on cr2-eqord
[15:30:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:44] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] elasticsearch: depool servers just before actual operation [cookbooks] - 10https://gerrit.wikimedia.org/r/529963 (owner: 10Gehel)
[15:30:50] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] netbox: Add configuration for Netbox spicerack backend [puppet] - 10https://gerrit.wikimedia.org/r/529959 (owner: 10CRusnov)
[15:31:09] <wikibugs>	 (03PS3) 10CRusnov: netbox: Add configuration for Netbox spicerack backend [puppet] - 10https://gerrit.wikimedia.org/r/529959
[15:31:24] <wikibugs>	 (03PS3) 10BBlack: lvs: isolate cloudelastic icinga check [puppet] - 10https://gerrit.wikimedia.org/r/528885 (https://phabricator.wikimedia.org/T229621) (owner: 10Mathew.onipe)
[15:32:10] <logmsgbot>	 !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot
[15:32:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:50] <bblack>	 !log disabled pupped on lvs1014, lvs1016, icinga1001 ahead of deploying https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/528885/ - T229621
[15:32:58] <bblack>	 puppet even :P
[15:32:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:59] <stashbot>	 T229621: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621
[15:33:11] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] lvs: isolate cloudelastic icinga check [puppet] - 10https://gerrit.wikimedia.org/r/528885 (https://phabricator.wikimedia.org/T229621) (owner: 10Mathew.onipe)
[15:34:16] <wikibugs>	 (03PS1) 10Ayounsi: Depool eqsin for router software upgrade [dns] - 10https://gerrit.wikimedia.org/r/529966 (https://phabricator.wikimedia.org/T227886)
[15:35:08] <wikibugs>	 (03PS4) 10CRusnov: netbox: Add configuration for Netbox spicerack backend [puppet] - 10https://gerrit.wikimedia.org/r/529959
[15:35:34] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Depool eqsin for router software upgrade [dns] - 10https://gerrit.wikimedia.org/r/529966 (https://phabricator.wikimedia.org/T227886) (owner: 10Ayounsi)
[15:35:52] <wikibugs>	 10Operations, 10ops-codfw, 10media-storage: Degraded RAID on ms-be2021 - https://phabricator.wikimedia.org/T230275 (10Papaul) a:05Papaul→03fgiunchedi @fgiunchedi  Disk replaced.
[15:35:58] <XioNoX>	 !log depool eqsin for cr2-eqsin upgrade
[15:36:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:50] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[15:37:48] <gehel>	 ^ that's me, check too sensitive, looking
[15:38:03] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[15:38:55] <icinga-wm>	 RECOVERY - HP RAID on ms-be2021 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[15:39:27] <bblack>	 !log puppet re-enabled on lvs1014, lvs1016, icinga1001
[15:39:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:06] <logmsgbot>	 !log gehel@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=99)
[15:42:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:34] <wikibugs>	 (03PS1) 10Gehel: elasticsearch: reduce sensitivity of master eligible check [puppet] - 10https://gerrit.wikimedia.org/r/529967
[15:46:40] <XioNoX>	 !log fail vrrp master to cr1-eqsin
[15:46:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:28] <wikibugs>	 (03PS1) 10Gehel: elasticsearch: fix str to int conversion [cookbooks] - 10https://gerrit.wikimedia.org/r/529968
[15:49:05] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@8fca708]: Expose transform/wikitext/to/mobile-html endpoint T211026
[15:49:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:13] <stashbot>	 T211026: mobile-html: ability to preview an edited page or section with the same transforms and styles as mobile-html - https://phabricator.wikimedia.org/T211026
[15:49:17] <icinga-wm>	 RECOVERY - Device not healthy -SMART- on ms-be2021 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2021&var-datasource=codfw+prometheus/ops
[15:50:11] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 10.08 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[15:50:27] <wikibugs>	 (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: reduce sensitivity of master eligible check [puppet] - 10https://gerrit.wikimedia.org/r/529967 (owner: 10Gehel)
[15:50:36] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] elasticsearch: reduce sensitivity of master eligible check [puppet] - 10https://gerrit.wikimedia.org/r/529967 (owner: 10Gehel)
[15:50:50] <wikibugs>	 (03CR) 10Milimetric: Add Cache-Control response header for Wikistats V2's index.html (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/529795 (https://phabricator.wikimedia.org/T230136) (owner: 10Elukey)
[15:51:59] <wikibugs>	 (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: fix str to int conversion [cookbooks] - 10https://gerrit.wikimedia.org/r/529968 (owner: 10Gehel)
[15:52:08] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] elasticsearch: fix str to int conversion [cookbooks] - 10https://gerrit.wikimedia.org/r/529968 (owner: 10Gehel)
[15:56:40] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@8fca708]: Expose transform/wikitext/to/mobile-html endpoint T211026 (duration: 07m 35s)
[15:56:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:49] <stashbot>	 T211026: mobile-html: ability to preview an edited page or section with the same transforms and styles as mobile-html - https://phabricator.wikimedia.org/T211026
[15:56:58] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@8fca708]: Expose transform/wikitext/to/mobile-html endpoint T211026, take 2
[15:57:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:18] <wikibugs>	 (03PS1) 10Bstorm: monitoring: Change the showmount check from toolforge to be email-only [puppet] - 10https://gerrit.wikimedia.org/r/529970 (https://phabricator.wikimedia.org/T229884)
[16:00:04] <jouncebot>	 godog and _joe_: That opportune time is upon us again. Time for a Puppet SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190813T1600).
[16:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[16:00:27] <wikibugs>	 (03CR) 10Jhedden: [C: 03+2] monitoring: Change the showmount check from toolforge to be email-only [puppet] - 10https://gerrit.wikimedia.org/r/529970 (https://phabricator.wikimedia.org/T229884) (owner: 10Bstorm)
[16:05:45] <icinga-wm>	 PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:05:51] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[16:07:01] <icinga-wm>	 PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Proxy Error - 619 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/Netbox
[16:07:10] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@8fca708]: Expose transform/wikitext/to/mobile-html endpoint T211026, take 2 (duration: 10m 12s)
[16:07:10] <chaomodus>	 ^ that's me, one sec
[16:07:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:17] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[16:07:17] <stashbot>	 T211026: mobile-html: ability to preview an edited page or section with the same transforms and styles as mobile-html - https://phabricator.wikimedia.org/T211026
[16:08:19] <icinga-wm>	 PROBLEM - Check the last execution of netbox_ganeti_eqiad_sync on netmon1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:09:05] <icinga-wm>	 PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:10:33] <icinga-wm>	 PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:11:41] <icinga-wm>	 RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 348 bytes in 0.694 second response time https://wikitech.wikimedia.org/wiki/Netbox
[16:13:39] <icinga-wm>	 PROBLEM - IPMI Sensor Status on db1129 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[16:20:14] <wikibugs>	 10Operations, 10ops-ulsfo: refresh/replace scs-ulsfo - https://phabricator.wikimedia.org/T230077 (10RobH) The cables have arrived for this.  I'll go onsite on Wednesday, August 14th to swap out the scs-ulsfo console server.  I'll email the SRE team list to ensure the department is aware of the change/downtime....
[16:21:52] <wikibugs>	 10Operations, 10MediaWiki-API, 10Traffic, 10Wikidata, and 2 others: wikidata.org handles GET MWAPI requests, but silently fails on POST - https://phabricator.wikimedia.org/T230051 (10Anomie) There's nothing to do with #MediaWiki-API here, or with MediaWiki at all. Any request to wikidata.org is served a 30...
[16:22:17] <wikibugs>	 10Operations, 10ops-ulsfo: refresh/replace scs-ulsfo - https://phabricator.wikimedia.org/T230077 (10RobH)
[16:25:27] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[16:25:27] <logmsgbot>	 !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[16:25:32] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[16:25:33] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:25:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:44] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] mediawiki: add cluster latency alerts [puppet] - 10https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi)
[16:28:36] <icinga-wm>	 RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netmon1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:30:06] <icinga-wm>	 RECOVERY - Disk space on ms-be2021 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2021&var-datasource=codfw+prometheus/ops
[16:30:52] <icinga-wm>	 RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:33:26] <icinga-wm>	 RECOVERY - puppet last run on ms-be2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:34:03] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] dbctl: add note & candidate_master fields [software/conftool] - 10https://gerrit.wikimedia.org/r/529396 (https://phabricator.wikimedia.org/T229677) (owner: 10CDanis)
[16:37:13] <wikibugs>	 10Operations, 10ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10Cmjohnson) @Jclark-ctr Please rack 4 of the servers from the same ganeti stack in row D and label them as ganeti1019-1022. Please update netbox, and provide access switch port info.
[16:37:14] <wikibugs>	 (03Merged) 10jenkins-bot: dbctl: add note & candidate_master fields [software/conftool] - 10https://gerrit.wikimedia.org/r/529396 (https://phabricator.wikimedia.org/T229677) (owner: 10CDanis)
[16:37:25] <wikibugs>	 10Operations, 10ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10Cmjohnson) a:05akosiaris→03Jclark-ctr
[16:39:22] <icinga-wm>	 RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:40:02] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 70.04 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[16:43:42] <wikibugs>	 10Operations, 10ops-codfw, 10media-storage: Degraded RAID on ms-be2021 - https://phabricator.wikimedia.org/T230275 (10fgiunchedi) 05Open→03Resolved Disk is rebuilding, thanks @Papaul
[16:44:34] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team: (OoW) cloudvirt1006 - RAID battery failed - https://phabricator.wikimedia.org/T222950 (10Cmjohnson)
[16:46:11] <XioNoX>	 !log disable all peering and transit on cr2-eqsin
[16:46:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:46] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Membership to 'wmf' LDAP group request for Connie Chen - https://phabricator.wikimedia.org/T230242 (10cchen) 05Resolved→03Open Thanks again for your help @colewhite !  I get the following internal server error when I try to log in with my LDAP account.  ` 500 - Inter...
[16:57:00] <icinga-wm>	 PROBLEM - Check systemd state on elastic1046 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:57:54] <XioNoX>	 !log reboot cr2-eqsin
[16:58:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:44] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: remote: pass dry_run down to children remote hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/529975
[16:58:46] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Move splitting of a RemoteHosts to a method [software/spicerack] - 10https://gerrit.wikimedia.org/r/529976
[16:59:13] <_joe_>	 uhm something's not right I forgot to squash those two
[16:59:56] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Move splitting of a RemoteHosts to a method [software/spicerack] - 10https://gerrit.wikimedia.org/r/529976
[17:00:04] <jouncebot>	 cscott, arlolra, subbu, halfak, and accraze: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190813T1700).
[17:00:20] <wikibugs>	 (03Abandoned) 10Giuseppe Lavagetto: remote: pass dry_run down to children remote hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/529975 (owner: 10Giuseppe Lavagetto)
[17:01:53] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 #page on upload-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:02:04] <icinga-wm>	 PROBLEM - OSPF status on mr1-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:02:12] <cdanis>	 XioNoX: everything ok?
[17:02:17] * volans around
[17:02:25] * jbond42 around
[17:02:26] * _joe_ too
[17:02:36] <XioNoX>	 site is depooled
[17:02:42] <_joe_>	 oh ok
[17:02:44] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:02:59] <_joe_>	 maybe then we should downtime the services there
[17:03:03] <XioNoX>	 but that should not have happened
[17:03:04] <marostegui>	 o/
[17:03:10] <marostegui>	 ah, It is depooled
[17:03:10] <_joe_>	 ok
[17:03:16] * marostegui back to the sofa
[17:03:24] <icinga-wm>	 RECOVERY - LVS HTTPS IPv6 #page on upload-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 848 bytes in 0.933 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:03:56] <wikibugs>	 (03PS2) 10Bstorm: monitoring: Change the showmount check from toolforge to be email-only [puppet] - 10https://gerrit.wikimedia.org/r/529970 (https://phabricator.wikimedia.org/T229884)
[17:04:02] <XioNoX>	 yeah, no impact, sorry for the page, it should not have happened
[17:04:03] * ema goes back to the sofa as well
[17:04:26] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:05:04] <XioNoX>	 cr2-eqsin is back
[17:05:12] <icinga-wm>	 PROBLEM - PyBal BGP sessions are established on lvs5003 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=eqsin+prometheus/ops
[17:05:20] <icinga-wm>	 RECOVERY - OSPF status on mr1-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:06:15] <XioNoX>	 !log rollback: disable all peering and transit on cr2-eqsin
[17:06:17] <wikibugs>	 (03PS1) 10CRusnov: Add script to import management DNS entries [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/529977 (https://phabricator.wikimedia.org/T228670)
[17:06:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:48] <icinga-wm>	 RECOVERY - PyBal BGP sessions are established on lvs5003 is OK: (C)0 le (W)0 le 1 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=eqsin+prometheus/ops
[17:08:29] <wikibugs>	 (03CR) 10CRusnov: "Note: This is meant to be run once or twice ever. We should include it in the repository for informational and deployment purposes. It is " [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/529977 (https://phabricator.wikimedia.org/T228670) (owner: 10CRusnov)
[17:08:37] <XioNoX>	 ah, now I know, forgot to depool ^ before the maintenance and it didn't failover fast enough
[17:08:41] <wikibugs>	 (03PS6) 10Mathew.onipe: remote: make RemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/529802
[17:09:04] <wikibugs>	 (03CR) 10Mathew.onipe: remote: make RemoteHosts iterable (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/529802 (owner: 10Mathew.onipe)
[17:10:40] <wikibugs>	 (03PS1) 10Ayounsi: Revert "Depool eqsin for router software upgrade" [dns] - 10https://gerrit.wikimedia.org/r/529979
[17:11:32] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Revert "Depool eqsin for router software upgrade" [dns] - 10https://gerrit.wikimedia.org/r/529979 (owner: 10Ayounsi)
[17:11:41] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Membership to 'wmf' LDAP group request for Connie Chen - https://phabricator.wikimedia.org/T230242 (10elukey) 05Open→03Resolved Added the user to superset! :)
[17:11:58] <XioNoX>	 !log repool eqsin
[17:12:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:31] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T230289 (10JHedden) There are no workloads on this host now. We're good to have this replaced anytime. Thanks!
[17:14:58] <wikibugs>	 (03PS5) 10CRusnov: netbox: Add configuration for Netbox spicerack backend [puppet] - 10https://gerrit.wikimedia.org/r/529959
[17:16:17] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] netbox: Add configuration and timers for csv dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/521313 (owner: 10CRusnov)
[17:18:29] <wikibugs>	 (03PS1) 10CRusnov: dumpbackup.py: minor fix [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/529980
[17:19:01] <wikibugs>	 (03CR) 10CRusnov: [V: 03+2 C: 03+2] dumpbackup.py: minor fix [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/529980 (owner: 10CRusnov)
[17:22:45] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 59.79 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:27:30] <wikibugs>	 (03CR) 10Volans: "So, between Ic7877d23c423fb27e01486c353f4ab1f000c4102 and this one, we have similar behaviours." [software/spicerack] - 10https://gerrit.wikimedia.org/r/529802 (owner: 10Mathew.onipe)
[17:39:53] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/17874/netflow1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/529841 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi)
[17:39:59] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Membership to 'wmf' LDAP group request for Connie Chen - https://phabricator.wikimedia.org/T230242 (10cchen) Thanks @elukey!
[17:40:01] <wikibugs>	 (03PS5) 10Ayounsi: Netflow: install and configure Samplicator [puppet] - 10https://gerrit.wikimedia.org/r/529841 (https://phabricator.wikimedia.org/T226810)
[17:40:55] <wikibugs>	 (03PS1) 10CRusnov: Bump to v2.6.1-wmf3 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/529985
[17:44:23] <XioNoX>	 !log set target netflow port to 2000 in eqiad
[17:44:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:18] <wikibugs>	 (03CR) 10Mathew.onipe: "> Patch Set 6:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/529802 (owner: 10Mathew.onipe)
[17:53:00] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: (C)60 le (W)70 le 72.44 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:53:38] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Set up mailing list for Santali Wikipedia - https://phabricator.wikimedia.org/T230435 (10Manik87)
[18:01:21] <wikibugs>	 (03PS2) 10CRusnov: Add script to import management DNS entries [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/529977 (https://phabricator.wikimedia.org/T228670)
[18:03:43] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, I didn't download the artifacts archive" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/529985 (owner: 10CRusnov)
[18:07:24] <wikibugs>	 (03CR) 10CRusnov: [V: 03+2 C: 03+2] "LGTM after reviewing. Hopefully good. :)" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/529985 (owner: 10CRusnov)
[18:08:20] <icinga-wm>	 RECOVERY - Check systemd state on elastic1046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:13:00] <icinga-wm>	 PROBLEM - Check systemd state on elastic1046 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:17:07] <wikibugs>	 (03CR) 10Volans: "Some nit inline" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/527487 (owner: 10Giuseppe Lavagetto)
[18:22:10] <wikibugs>	 (03PS2) 10Mholloway: Add MachineVision to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526541 (https://phabricator.wikimedia.org/T227348)
[18:22:19] <wikibugs>	 (03PS2) 10Mholloway: Add wmgUseMachineVision default setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526542 (https://phabricator.wikimedia.org/T227348)
[18:22:39] <wikibugs>	 (03PS3) 10Mholloway: Enable MachineVision on (beta) commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526543 (https://phabricator.wikimedia.org/T227348)
[18:22:46] <wikibugs>	 (03PS4) 10Mholloway: Load MachineVision extension if enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526544 (https://phabricator.wikimedia.org/T227348)
[18:24:32] <wikibugs>	 (03CR) 10Mholloway: [C: 03+2] Add MachineVision to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526541 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway)
[18:25:41] <wikibugs>	 (03Merged) 10jenkins-bot: Add MachineVision to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526541 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway)
[18:27:40] <logmsgbot>	 !log mholloway-shell@deploy1001 Synchronized wmf-config/extension-list: Enable MachineVision on Beta (1/4) (duration: 00m 48s)
[18:27:50] <wikibugs>	 (03CR) 10Mholloway: [C: 03+2] Add wmgUseMachineVision default setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526542 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway)
[18:27:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:52] <wikibugs>	 (03CR) 10jenkins-bot: Add MachineVision to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526541 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway)
[18:28:08] <wikibugs>	 10Operations, 10MediaWiki-Debug-Logger, 10Wikimedia-Logstash, 10serviceops-radar, and 8 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10WDoranWMF) What's the current status of this task? Are there needs from CPT?
[18:28:58] <wikibugs>	 (03Merged) 10jenkins-bot: Add wmgUseMachineVision default setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526542 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway)
[18:30:55] <logmsgbot>	 !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable MachineVision on Beta (2/4) (duration: 00m 48s)
[18:31:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:31:02] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T230442 (10ops-monitoring-bot)
[18:31:17] <wikibugs>	 (03CR) 10Mholloway: [C: 03+2] Enable MachineVision on (beta) commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526543 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway)
[18:31:47] <logmsgbot>	 !log crusnov@deploy1001 Started deploy [netbox/deploy@367ca84]: Update Netbox to v2.6.1-wmf3 affects: T223292
[18:31:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:31:55] <stashbot>	 T223292: Netbox: generate CSV backups - https://phabricator.wikimedia.org/T223292
[18:32:23] <logmsgbot>	 !log crusnov@deploy1001 Finished deploy [netbox/deploy@367ca84]: Update Netbox to v2.6.1-wmf3 affects: T223292 (duration: 00m 36s)
[18:32:25] <logmsgbot>	 !log crusnov@deploy1001 Started deploy [netbox/deploy@367ca84]: Update Netbox to v2.6.1-wmf3 affects: T223292
[18:32:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:37] <wikibugs>	 (03Merged) 10jenkins-bot: Enable MachineVision on (beta) commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526543 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway)
[18:32:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:09] <logmsgbot>	 !log crusnov@deploy1001 Finished deploy [netbox/deploy@367ca84]: Update Netbox to v2.6.1-wmf3 affects: T223292 (duration: 00m 43s)
[18:33:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:39] <logmsgbot>	 !log crusnov@deploy1001 Started deploy [netbox/deploy@367ca84]: Update Netbox to v2.6.1-wmf3 affects: T223292 (fix perms)
[18:33:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:49] <logmsgbot>	 !log crusnov@deploy1001 Finished deploy [netbox/deploy@367ca84]: Update Netbox to v2.6.1-wmf3 affects: T223292 (fix perms) (duration: 00m 09s)
[18:33:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:39] <logmsgbot>	 !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Enable MachineVision on Beta (3/4) (duration: 00m 47s)
[18:34:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:35:00] <wikibugs>	 (03CR) 10Mholloway: [C: 03+2] Load MachineVision extension if enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526544 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway)
[18:35:59] <wikibugs>	 (03CR) 10jenkins-bot: Add wmgUseMachineVision default setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526542 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway)
[18:36:02] <wikibugs>	 (03Merged) 10jenkins-bot: Load MachineVision extension if enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526544 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway)
[18:36:03] <wikibugs>	 (03CR) 10jenkins-bot: Enable MachineVision on (beta) commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526543 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway)
[18:36:16] <wikibugs>	 (03CR) 10jenkins-bot: Load MachineVision extension if enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526544 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway)
[18:38:04] <logmsgbot>	 !log mholloway-shell@deploy1001 Synchronized wmf-config/CommonSettings.php: Enable MachineVision on Beta (4/4) (duration: 00m 48s)
[18:38:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:22] <wikibugs>	 (03PS7) 10CRusnov: netbox: Add configuration and timers for csv dumps [puppet] - 10https://gerrit.wikimedia.org/r/521313
[18:41:16] <wikibugs>	 (03PS1) 10Eevans: sessionstore: Upgrade staging to Kask v1.0.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/529991 (https://phabricator.wikimedia.org/T229697)
[18:41:32] <ebernhardson>	 !log set cpufreq scaling_governor to performance on cloudelastic100[1-4] to test any changes to indexing performance
[18:41:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:34] <wikibugs>	 (03PS1) 10Gehel: elasticsearch: wait for write queue to empty for 1h instead of 10 min [cookbooks] - 10https://gerrit.wikimedia.org/r/529992
[18:49:27] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] elasticsearch: wait for write queue to empty for 1h instead of 10 min [cookbooks] - 10https://gerrit.wikimedia.org/r/529992 (owner: 10Gehel)
[18:50:42] <logmsgbot>	 !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot
[18:50:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:04] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] "Self-merging staging deployment." [deployment-charts] - 10https://gerrit.wikimedia.org/r/529991 (https://phabricator.wikimedia.org/T229697) (owner: 10Eevans)
[18:54:06] <wikibugs>	 (03CR) 10Eevans: [V: 03+2 C: 03+2] sessionstore: Upgrade staging to Kask v1.0.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/529991 (https://phabricator.wikimedia.org/T229697) (owner: 10Eevans)
[18:56:23] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to LogStash for Abijeet Patro - https://phabricator.wikimedia.org/T230104 (10colewhite) a:05colewhite→03None
[18:58:46] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T230442 (10JHedden) This host also has a bad disk in slot number 8. T230289
[18:58:51] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to LogStash for Abijeet Patro - https://phabricator.wikimedia.org/T230104 (10CDanis) a:03CDanis
[18:59:31] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Traffic: SRE Onboarding for Sukhbir Singh - https://phabricator.wikimedia.org/T229860 (10CDanis) Is this done?
[19:00:20] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Mwmaint1002, Stat1007 for Abijeet Patro - https://phabricator.wikimedia.org/T230020 (10CDanis) a:03RStallman-legalteam @RStallman-legalteam can you confirm NDA on file?  Thanks!
[19:03:35] <logmsgbot>	 !log @ helmfile [STAGING] Ran 'apply' command on namespace 'sessionstore' for release 'staging' .
[19:03:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:19] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Traffic: SRE Onboarding for Sukhbir Singh - https://phabricator.wikimedia.org/T229860 (10BBlack) 05Open→03Resolved Looks like it to me :)
[19:15:46] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@f1a562e]: Increase cirrusSearchLinksUpdatePrioritized concurrency 150 -> 200 T220625
[19:15:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:58] <stashbot>	 T220625: Initialize CirrusSearch on cloudelastic - https://phabricator.wikimedia.org/T220625
[19:17:17] <logmsgbot>	 !log ppchelko@deploy1001 deploy aborted: Increase cirrusSearchLinksUpdatePrioritized concurrency 150 -> 200 T220625 (duration: 01m 30s)
[19:17:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:45] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@f1a562e]: Revert on canary
[19:18:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:03] <logmsgbot>	 !log ppchelko@deploy1001 deploy aborted: Revert on canary (duration: 00m 18s)
[19:19:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:40] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@3882ddb]: Increase cirrusSearchLinksUpdatePrioritized concurrency 150 -> 200 T220625
[19:22:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:48] <stashbot>	 T220625: Initialize CirrusSearch on cloudelastic - https://phabricator.wikimedia.org/T220625
[19:23:38] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@3882ddb]: Increase cirrusSearchLinksUpdatePrioritized concurrency 150 -> 200 T220625 (duration: 00m 58s)
[19:23:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:48:37] <logmsgbot>	 !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=0)
[19:48:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:27] <wikibugs>	 10Operations, 10Traffic, 10netops: Aug 28th: turn off knams lasers & stop advertising prefixes in advance of Relined PA-988002 maintenance - https://phabricator.wikimedia.org/T230448 (10CDanis)
[20:16:07] <wikibugs>	 (03PS1) 10Mholloway: Fix MachineVision provider config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529998
[20:17:01] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Fix MachineVision provider config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529998 (owner: 10Mholloway)
[20:18:10] <wikibugs>	 (03PS2) 10Mholloway: Fix MachineVision provider config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529998
[20:18:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Fix MachineVision provider config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529998 (owner: 10Mholloway)
[20:19:12] <wikibugs>	 (03PS3) 10Mholloway: Fix MachineVision provider config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529998
[20:26:21] <icinga-wm>	 PROBLEM - Disk space on ms-be2021 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2021&var-datasource=codfw+prometheus/ops
[20:28:29] <wikibugs>	 (03CR) 10Mholloway: [C: 03+2] Fix MachineVision provider config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529998 (owner: 10Mholloway)
[20:29:24] <wikibugs>	 (03Merged) 10jenkins-bot: Fix MachineVision provider config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529998 (owner: 10Mholloway)
[20:29:39] <wikibugs>	 (03CR) 10jenkins-bot: Fix MachineVision provider config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529998 (owner: 10Mholloway)
[20:32:56] <logmsgbot>	 !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Fix MachineVision provider config (duration: 00m 47s)
[20:33:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:13] <wikibugs>	 (03PS3) 10Bstorm: docker: add support for "stable" and "testing" tags [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T229058)
[20:38:50] <wikibugs>	 (03PS4) 10Bstorm: docker: add support for "testing" tags [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T229058)
[20:41:05] <wikibugs>	 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10Performance-Team (Radar): Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Ladsgroup)
[20:41:30] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Mwmaint1002, Stat1007 for Abijeet Patro - https://phabricator.wikimedia.org/T230020 (10CDanis) a:05RStallman-legalteam→03CDanis Oh, sorry @Arrbee I missed your update.  I'll move forward with this tomorrow.
[20:41:32] <wikibugs>	 10Operations, 10Traffic, 10netops: Aug 28th: turn off knams lasers & stop advertising prefixes in advance of Relined PA-988002 maintenance - https://phabricator.wikimedia.org/T230448 (10ayounsi) That seems to actually be one circuit terminating in two ports on each sides: cr2-knams:xe-0/0/3 to asw-esams:xe-0...
[20:46:29] <wikibugs>	 10Operations, 10Traffic, 10netops: Aug 28th: turn off 1/3 esams-knams lasers in advance of Relined PA-988002 maintenance - https://phabricator.wikimedia.org/T230448 (10ayounsi)
[20:55:51] <wikibugs>	 (03PS5) 10Bstorm: docker: add support for "testing" tags [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T224558)
[20:58:33] <wikibugs>	 (03PS6) 10Bstorm: docker: add support for "testing" tags [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T224558)
[21:03:11] <wikibugs>	 (03CR) 10Bstorm: "Ok, so at this point, I'm stripping out the stable tag for now in this and setting "testing" to the default tag.  Basically, this should a" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T224558) (owner: 10Bstorm)
[21:14:05] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T230442 (10wiki_willy) a:03Cmjohnson
[21:23:27] <wikibugs>	 10Operations, 10Traffic, 10netops: Aug 28th: turn off 1/3 esams-knams lasers in advance of Relined PA-988002 maintenance - https://phabricator.wikimedia.org/T230448 (10CDanis)
[21:27:35] <wikibugs>	 (03PS34) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291)
[21:28:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov)
[21:43:34] <wikibugs>	 (03PS35) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291)
[21:44:32] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov)
[21:44:42] <wikibugs>	 (03PS36) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291)
[21:45:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov)
[21:50:31] <wikibugs>	 (03PS37) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291)
[21:51:31] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov)
[21:58:33] <wikibugs>	 (03PS1) 10CRusnov: netbox: Add fake secrets for reorg [labs/private] - 10https://gerrit.wikimedia.org/r/530008
[21:58:51] <wikibugs>	 (03CR) 10CRusnov: [V: 03+2 C: 03+2] netbox: Add fake secrets for reorg [labs/private] - 10https://gerrit.wikimedia.org/r/530008 (owner: 10CRusnov)
[22:03:02] <wikibugs>	 (03PS38) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291)
[22:07:11] <wikibugs>	 (03PS39) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291)
[22:14:26] <wikibugs>	 (03CR) 10CRusnov: "Almost there!" [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov)
[22:17:51] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[22:19:39] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[22:50:42] <wikibugs>	 (03PS1) 10Thcipriani: scap: set flag for check-and-restart-php [puppet] - 10https://gerrit.wikimedia.org/r/530014 (https://phabricator.wikimedia.org/T224857)
[23:00:04] <jouncebot>	 MaxSem, RoanKattouw, and Niharika: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190813T2300).
[23:00:04] <jouncebot>	 Jdlrobson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:11] <jdlrobson>	 \o
[23:06:32] <jdlrobson>	 i'm guessing there's a small possibility everyone is too busy wikimaning to help with swat
[23:42:47] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[23:43:59] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[23:59:46] <wikibugs>	 (03PS1) 10DannyS712: Fix addition of Hubblesite.org and Spacetelescope.org to commons wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530015 (https://phabricator.wikimedia.org/T230083)