[00:00:04] <jouncebot>	 twentyafterfour: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190905T0000).
[00:02:42] <logmsgbot>	 !log jforrester@deploy1001 Synchronized multiversion/MWConfigCacheGenerator.php: CommonSettings: Factor out load of variant config into MWConfigCacheGenerator, part 1 (duration: 00m 55s)
[00:02:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:04:00] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: CommonSettings: Factor out load of variant config into MWConfigCacheGenerator, part 2 (duration: 00m 55s)
[00:04:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:17] <wikibugs>	 (PS6) Jforrester: CommonSettings: Factor out write of variant config into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/507728
[00:11:23] <wikibugs>	 (CR) Jforrester: [C: +2] CommonSettings: Factor out write of variant config into MWConfigCacheGenerator (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/507728 (owner: Jforrester)
[00:12:03] <wikibugs>	 (Merged) jenkins-bot: CommonSettings: Factor out write of variant config into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/507728 (owner: Jforrester)
[00:12:10] <wikibugs>	 (CR) jenkins-bot: CommonSettings: Factor out write of variant config into MWConfigCacheGenerator [mediawiki-config] - https://gerrit.wikimedia.org/r/507728 (owner: Jforrester)
[00:19:58] <wikibugs>	 (PS2) Jforrester: Variant configuration: Write to static (JSON) as well as serialised cache [mediawiki-config] - https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602)
[00:21:03] <wikibugs>	 (CR) jerkins-bot: [V: -1] Variant configuration: Write to static (JSON) as well as serialised cache [mediawiki-config] - https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[00:21:36] <wikibugs>	 (PS1) Jforrester: CommonSettings: Push back the mtime/globals muxing into CommonSettings (ahead of deployment) [mediawiki-config] - https://gerrit.wikimedia.org/r/534549
[00:22:29] <wikibugs>	 (CR) jerkins-bot: [V: -1] CommonSettings: Push back the mtime/globals muxing into CommonSettings (ahead of deployment) [mediawiki-config] - https://gerrit.wikimedia.org/r/534549 (owner: Jforrester)
[00:23:58] <wikibugs>	 (PS2) Jforrester: CommonSettings: Push back the mtime/globals muxing into CommonSettings (ahead of deployment) [mediawiki-config] - https://gerrit.wikimedia.org/r/534549
[00:25:36] <wikibugs>	 (CR) Jforrester: [C: +2] CommonSettings: Push back the mtime/globals muxing into CommonSettings (ahead of deployment) [mediawiki-config] - https://gerrit.wikimedia.org/r/534549 (owner: Jforrester)
[00:25:55] <wikibugs>	 (CR) Jforrester: [C: +2] "Parent patch was never deployed so this won't mess up production caches." [mediawiki-config] - https://gerrit.wikimedia.org/r/534549 (owner: Jforrester)
[00:26:36] <wikibugs>	 (Merged) jenkins-bot: CommonSettings: Push back the mtime/globals muxing into CommonSettings (ahead of deployment) [mediawiki-config] - https://gerrit.wikimedia.org/r/534549 (owner: Jforrester)
[00:27:30] <wikibugs>	 (CR) jenkins-bot: CommonSettings: Push back the mtime/globals muxing into CommonSettings (ahead of deployment) [mediawiki-config] - https://gerrit.wikimedia.org/r/534549 (owner: Jforrester)
[00:39:29] <wikibugs>	 (PS1) Jforrester: CommonSettings: Fix path vs. filename difference [mediawiki-config] - https://gerrit.wikimedia.org/r/534550
[00:39:37] <wikibugs>	 (CR) Jforrester: [C: +2] CommonSettings: Fix path vs. filename difference [mediawiki-config] - https://gerrit.wikimedia.org/r/534550 (owner: Jforrester)
[00:40:22] <wikibugs>	 (CR) jerkins-bot: [V: -1] CommonSettings: Fix path vs. filename difference [mediawiki-config] - https://gerrit.wikimedia.org/r/534550 (owner: Jforrester)
[00:40:33] <wikibugs>	 (CR) jerkins-bot: [V: -1] CommonSettings: Fix path vs. filename difference [mediawiki-config] - https://gerrit.wikimedia.org/r/534550 (owner: Jforrester)
[00:43:23] <wikibugs>	 (PS2) Jforrester: CommonSettings: Fix path vs. filename difference [mediawiki-config] - https://gerrit.wikimedia.org/r/534550
[00:44:14] <wikibugs>	 (CR) jerkins-bot: [V: -1] CommonSettings: Fix path vs. filename difference [mediawiki-config] - https://gerrit.wikimedia.org/r/534550 (owner: Jforrester)
[00:45:31] <wikibugs>	 (PS3) Jforrester: CommonSettings: Fix path vs. filename difference [mediawiki-config] - https://gerrit.wikimedia.org/r/534550
[00:48:36] <wikibugs>	 (CR) Jforrester: [C: +2] CommonSettings: Fix path vs. filename difference [mediawiki-config] - https://gerrit.wikimedia.org/r/534550 (owner: Jforrester)
[00:49:39] <wikibugs>	 (Merged) jenkins-bot: CommonSettings: Fix path vs. filename difference [mediawiki-config] - https://gerrit.wikimedia.org/r/534550 (owner: Jforrester)
[00:49:55] <wikibugs>	 (CR) jenkins-bot: CommonSettings: Fix path vs. filename difference [mediawiki-config] - https://gerrit.wikimedia.org/r/534550 (owner: Jforrester)
[00:51:35] <wikibugs>	 (CR) Jforrester: CommonSettings: Factor out write of variant config into MWConfigCacheGenerator (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/507728 (owner: Jforrester)
[00:54:48] <logmsgbot>	 !log jforrester@deploy1001 Synchronized multiversion/MWConfigCacheGenerator.php: CommonSettings: Factor out write of variant config into MWConfigCacheGenerator, part 1 (duration: 00m 56s)
[00:54:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:56] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: CommonSettings: Factor out write of variant config into MWConfigCacheGenerator, part 2 (duration: 00m 53s)
[00:55:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:03] <James_F>	 Phew.
[00:56:09] <James_F>	 OK, conch returned.
[01:04:05] <wikibugs>	 (PS3) Jforrester: Variant configuration: Write to static (JSON) as well as serialised cache [mediawiki-config] - https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602)
[01:04:30] <wikibugs>	 (CR) Jforrester: [C: -1] "Let's do another chat before this." [mediawiki-config] - https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[01:24:05] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/base (Get base CSS) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:25:11] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[02:10:31] <wikibugs>	 Operations, Cassandra, RESTBase, Core Platform Team Workboards (Clinic Duty Team), User-Eevans: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (Eevans) restbase-dev1004 has been decommissioned and can come down for a re-image at any time.  /cc @MoritzMuehl...
[02:36:15] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 100544568 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:36:33] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 878137008 and 49 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:45:23] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 48256 and 64 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:45:43] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 4648 and 84 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:55:15] <wikibugs>	 (PS2) Dzahn: ATS/varnish/parsoid-testing: remove directors for parsoid-vd/parsoid-rt [puppet] - https://gerrit.wikimedia.org/r/534271 (https://phabricator.wikimedia.org/T229356)
[03:57:32] <vgutierrez>	 !log switching cp1076 from nginx to ats-tls - T231433
[03:57:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:57:35] <stashbot>	 T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[03:57:53] <wikibugs>	 (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to 4443 on cp1076 [puppet] - https://gerrit.wikimedia.org/r/532987 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[03:58:02] <wikibugs>	 (PS2) Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp1076 [puppet] - https://gerrit.wikimedia.org/r/532987 (https://phabricator.wikimedia.org/T231433)
[04:00:33] <wikibugs>	 (CR) Dzahn: [C: +2] "Arlolra, sorry for the delay with this. i was on vacation and forgot about this until i saw your mail again the other day." [puppet] - https://gerrit.wikimedia.org/r/534271 (https://phabricator.wikimedia.org/T229356) (owner: Dzahn)
[04:01:04] <wikibugs>	 (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to 4443 on cp1076 [puppet] - https://gerrit.wikimedia.org/r/532988 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[04:01:09] <wikibugs>	 (PS3) Dzahn: ATS/varnish/parsoid-testing: remove directors for parsoid-vd/parsoid-rt [puppet] - https://gerrit.wikimedia.org/r/534271 (https://phabricator.wikimedia.org/T229356)
[04:01:21] <wikibugs>	 (PS2) Vgutierrez: hiera: Move ats-tls from port 8443 to 4443 on cp1076 [puppet] - https://gerrit.wikimedia.org/r/532988 (https://phabricator.wikimedia.org/T231433)
[04:02:26] <vgutierrez>	 !log upgrading ATS to 8.0.5-1wm5 on cp1076 - T231433
[04:02:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:04:16] <icinga-wm>	 PROBLEM - HTTPS Unified RSA on cp1076 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[04:04:52] <vgutierrez>	 ^ that's expected
[04:05:04] <mutante>	 yep, saw the log line matching it :)
[04:05:18] <icinga-wm>	 PROBLEM - HTTPS Unified ECDSA on cp1076 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[04:05:20] <wikibugs>	 (PS4) Dzahn: ATS/varnish/parsoid-testing: remove directors for parsoid-vd/parsoid-rt [puppet] - https://gerrit.wikimedia.org/r/534271 (https://phabricator.wikimedia.org/T229356)
[04:05:42] <icinga-wm>	 RECOVERY - HTTPS Unified RSA on cp1076 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345577 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 78 days) https://wikitech.wikimedia.org/wiki/HTTPS
[04:06:42] <icinga-wm>	 RECOVERY - HTTPS Unified ECDSA on cp1076 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345515 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 78 days) https://wikitech.wikimedia.org/wiki/HTTPS
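The figures in the two cp1076 recovery messages above can be cross-checked with a little date arithmetic. This is only a sketch: it assumes the log timestamps are UTC and that the log date is 2019-09-05 (taken from the deploy calendar item linked in the first line of the log). The variable names are illustrative and not part of any Icinga check.

```python
from datetime import datetime, timezone

# Assumption: the RSA recovery was reported at 04:05:42 UTC on 2019-09-05.
checked_at = datetime(2019, 9, 5, 4, 5, 42, tzinfo=timezone.utc)

# "valid until 2019-11-22 07:59:59 +0000" from the check output.
not_after = datetime(2019, 11, 22, 7, 59, 59, tzinfo=timezone.utc)

# Whole days remaining, which is how the check output words it.
days_left = (not_after - checked_at).days

# "OCSP staple validity ... has 345577 seconds left", converted to days.
ocsp_seconds_left = 345577
ocsp_days_left = ocsp_seconds_left / 86400

print(days_left)
print(round(ocsp_days_left, 1))
```

Under those assumptions the certificate lifetime matches the reported "expires in 78 days", and the OCSP staple has roughly four days of validity left.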
[04:08:52] <icinga-wm>	 PROBLEM - Ensure traffic_manager binds on 8443 and responds to HTTP requests on cp1076 is CRITICAL: connect to address 10.64.0.131 and port 8443: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[04:13:26] <wikibugs>	 (CR) Dzahn: tlsproxy/envoy: limit connections on 443 to cache servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/534421 (owner: Dzahn)
[04:13:55] <wikibugs>	 (PS2) Dzahn: add fake SSL key for releases.discovery.wmnet [labs/private] - https://gerrit.wikimedia.org/r/534408
[04:14:06] <wikibugs>	 (CR) Dzahn: [V: +2 C: +2] add fake SSL key for releases.discovery.wmnet [labs/private] - https://gerrit.wikimedia.org/r/534408 (owner: Dzahn)
[04:14:19] <wikibugs>	 (PS1) Dzahn: remove parsoid-vd/parsoid-rt.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/534554 (https://phabricator.wikimedia.org/T229356)
[04:14:52] <wikibugs>	 (PS2) Dzahn: add fake SSL key for webperf.discovery.wmnet [labs/private] - https://gerrit.wikimedia.org/r/534409
[04:15:48] <wikibugs>	 (CR) Dzahn: [V: +2 C: +2] add fake SSL key for webperf.discovery.wmnet [labs/private] - https://gerrit.wikimedia.org/r/534409 (owner: Dzahn)
[04:17:24] <mutante>	 the puppet-merge just took under 10 seconds. mentioning that because i saw comments how yesterday it was 1 minute
[04:18:34] <vgutierrez>	 under 10 secs?
[04:18:59] <mutante>	 yea, really quick to me
[04:19:01] <vgutierrez>	 last one for me felt way slower than that
[04:19:08] <vgutierrez>	 I'll time the next one
[04:19:14] <mutante>	 my change was in labs/private though
[04:20:11] <vgutierrez>	 !log switching cp3034 from nginx to ats-tls - T231433
[04:20:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:20:14] <stashbot>	 T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[04:20:21] <wikibugs>	 (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to port 4443 on cp3034 [puppet] - https://gerrit.wikimedia.org/r/532989 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[04:20:30] <wikibugs>	 (PS2) Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp3034 [puppet] - https://gerrit.wikimedia.org/r/532989 (https://phabricator.wikimedia.org/T231433)
[04:22:49] <vgutierrez>	 mutante: real    1m8.243s
[04:23:43] <wikibugs>	 (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to port 443 on cp3034 [puppet] - https://gerrit.wikimedia.org/r/532990 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[04:23:46] <wikibugs>	 (CR) Dzahn: [C: +2] add certificate for webperf.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/534412 (owner: Dzahn)
[04:23:56] <wikibugs>	 (PS2) Dzahn: add certificate for webperf.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/534412
[04:24:42] <wikibugs>	 (PS2) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp3034 [puppet] - https://gerrit.wikimedia.org/r/532990 (https://phabricator.wikimedia.org/T231433)
[04:25:56] <vgutierrez>	 sigh...
[04:26:12] <wikibugs>	 (PS3) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp3034 [puppet] - https://gerrit.wikimedia.org/r/532990 (https://phabricator.wikimedia.org/T231433)
[04:26:22] <mutante>	 real	1m4.667s
[04:26:30] <mutante>	 hmm.. ok.. because i said something :p
[04:26:46] <vgutierrez>	 private VS puppet repo
[04:27:06] <icinga-wm>	 PROBLEM - HTTPS Unified RSA on cp3034 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[04:27:08] <mutante>	 yea, it's smarter than i thought and only syncs what is needed i guess
[04:27:16] <icinga-wm>	 PROBLEM - HTTPS Unified ECDSA on cp3034 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[04:28:22] <vgutierrez>	 ^^ expected as well
[04:31:06] <icinga-wm>	 RECOVERY - HTTPS Unified RSA on cp3034 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345558 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 78 days) https://wikitech.wikimedia.org/wiki/HTTPS
[04:31:06] <wikibugs>	 (CR) Dzahn: [C: +2] add discovery CNAME for webperf [dns] - https://gerrit.wikimedia.org/r/534414 (owner: Dzahn)
[04:31:12] <wikibugs>	 (PS2) Dzahn: add discovery CNAME for webperf [dns] - https://gerrit.wikimedia.org/r/534414
[04:31:16] <icinga-wm>	 RECOVERY - HTTPS Unified ECDSA on cp3034 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345549 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 78 days) https://wikitech.wikimedia.org/wiki/HTTPS
[04:31:56] <vgutierrez>	 !log upgrading ATS to 8.0.5-1wm5 on cp3034 - T231433
[04:31:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:31:59] <stashbot>	 T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[04:37:08] <mutante>	 wtf.. i merged a DNS change and authdns-update shows stuff that is not in my gerrit change
[04:37:10] <wikibugs>	 (CR) Vgutierrez: [C: +2] hiera: Move nginx to port 4443 on cp4021 [puppet] - https://gerrit.wikimedia.org/r/532991 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[04:37:15] <mutante>	 not merging that but now stuck
[04:37:19] <wikibugs>	 (PS2) Vgutierrez: hiera: Move nginx to port 4443 on cp4021 [puppet] - https://gerrit.wikimedia.org/r/532991 (https://phabricator.wikimedia.org/T231433)
[04:37:43] <vgutierrez>	 !log switching cp4021 from nginx to ats-tls - T231433
[04:37:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:37:46] <stashbot>	 T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[04:38:56] <wikibugs>	 (CR) Dzahn: "please don't merge without also running authdns-update. that caused my great confusion when i got unexpected changes after the next merge" [dns] - https://gerrit.wikimedia.org/r/534500 (owner: Papaul)
[04:40:03] <mutante>	 DNS merge does not have the "warning. type multiple" thing when there is more than one change
[04:40:22] <vgutierrez>	 nope
[04:40:48] <mutante>	 that was pretty unexpected to see unrelated stuff but i confirmed it's the last merge before mine
[04:41:15] <vgutierrez>	 somebody merged something and didn't run authdns-update?
[04:41:21] <mutante>	 yep
[04:41:53] <vgutierrez>	 maybe we should have a check for unmerged changes, like in puppet
[04:42:29] <wikibugs>	 (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to port 443 on cp4021 [puppet] - https://gerrit.wikimedia.org/r/532992 (https://phabricator.wikimedia.org/T231433) (owner: Vgutierrez)
[04:42:31] <mutante>	 yea, that would have been safer
[04:42:44] <vgutierrez>	 !log upgrading ATS to 8.0.5-1wm5 on cp4021 - T231433
[04:42:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:42:47] <stashbot>	 T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[04:42:50] <mutante>	 it's a permission issue :(
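The "check for unmerged changes" idea from the DNS discussion above could look roughly like the following. This is a hypothetical sketch, not how authdns-update or puppet-merge actually work: the function names and the "deployed-base" marker are invented for illustration, and only the change numbers 534500 and 534414 come from the log.

```python
def pending_changes(merged_history, deployed_head):
    """Return the changes merged after deployed_head, oldest first.

    merged_history is ordered oldest to newest; deployed_head is the last
    change that was actually pushed out to the servers.
    """
    if deployed_head not in merged_history:
        raise ValueError("deployed head not found in merged history")
    index = merged_history.index(deployed_head)
    return merged_history[index + 1:]


def unexpected_changes(merged_history, deployed_head, expected):
    """Return pending changes other than the one the operator expects."""
    pending = pending_changes(merged_history, deployed_head)
    return [change for change in pending if change != expected]


# 534500 was merged earlier without a deploy; 534414 is the operator's own change.
history = ["deployed-base", "534500", "534414"]
surprises = unexpected_changes(history, "deployed-base", expected="534414")
if surprises:
    print("warning: also deploying", ", ".join(surprises))
```

A deploy wrapper built on this would stop and ask for confirmation whenever the list of surprises is non-empty, similar in spirit to the multiple-change warning mentioned for puppet.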
[04:43:15] <wikibugs>	 (PS2) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp4021 [puppet] - https://gerrit.wikimedia.org/r/532992 (https://phabricator.wikimedia.org/T231433)
[04:44:19] <icinga-wm>	 PROBLEM - HTTPS Unified RSA on cp4021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[04:44:42] <vgutierrez>	 ^^expected :)
[04:45:25] <icinga-wm>	 PROBLEM - HTTPS Unified ECDSA on cp4021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[04:50:45] <icinga-wm>	 RECOVERY - HTTPS Unified ECDSA on cp4021 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345551 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 78 days) https://wikitech.wikimedia.org/wiki/HTTPS
[04:52:08] <icinga-wm>	 RECOVERY - HTTPS Unified RSA on cp4021 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345470 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 78 days) https://wikitech.wikimedia.org/wiki/HTTPS
[04:53:06] <wikibugs>	 (PS2) Dzahn: add certificate for releases.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/534406
[04:56:42] <wikibugs>	 (CR) Vgutierrez: "+1 to the acme-chief change but IMHO this should be split in two different commits" [puppet] - https://gerrit.wikimedia.org/r/534490 (owner: CRusnov)
[04:57:19] <vgutierrez>	 !log rearming keyholder on cumin1001
[04:57:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:57:40] <icinga-wm>	 RECOVERY - Keyholder SSH agent on cumin1001 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder
[05:01:03] <wikibugs>	 (CR) Dzahn: [C: +2] add certificate for releases.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/534406 (owner: Dzahn)
[05:12:01] <wikibugs>	 (CR) Marostegui: [C: +2] realm.pp: Remove filejournal table from private list [puppet] - https://gerrit.wikimedia.org/r/534392 (https://phabricator.wikimedia.org/T51195) (owner: Marostegui)
[05:12:05] <mutante>	 ganeti2005.mgmt is behaving weirdly
[05:12:08] <wikibugs>	 (PS3) Marostegui: realm.pp: Remove filejournal table from private list [puppet] - https://gerrit.wikimedia.org/r/534392 (https://phabricator.wikimedia.org/T51195)
[05:14:55] <marostegui>	 !log Restart MySQL on codfw sanitariums (db2094 and db2095) to pick up new filters - T51195
[05:14:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:14:58] <stashbot>	 T51195: Drop filejournal table from WMF - https://phabricator.wikimedia.org/T51195
[05:19:37] <mutante>	 !log ganeti2005 - reset DRAC via local IPMI since mgmt stopped responding
[05:19:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:22] <mutante>	 !log ganeti2005 - DRAC reset fails - ipmi_cmd_cold_reset: bad completion code
[05:21:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:33] <marostegui>	 !log Restart wikibugs
[05:29:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:58] <marostegui>	 !log Restart MySQL on eqiad sanitariums (db1124 and db1125) to pick up new filters - T51195
[05:32:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:32:01] <stashbot>	 T51195: Drop filejournal table from WMF - https://phabricator.wikimedia.org/T51195
[05:37:06] <icinga-wm>	 ACKNOWLEDGEMENT - SSH ganeti2005.mgmt on ganeti2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T232067 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:38:36] <wikibugs>	 (PS1) Vgutierrez: ATS: Disable SSL Session tickets [puppet] - https://gerrit.wikimedia.org/r/534558 (https://phabricator.wikimedia.org/T231849)
[05:41:06] <wikibugs>	 (PS1) Marostegui: report_users: Add dbproxy1017 IP [software] - https://gerrit.wikimedia.org/r/534559
[05:41:39] <wikibugs>	 (CR) Marostegui: [C: +2] report_users: Add dbproxy1017 IP [software] - https://gerrit.wikimedia.org/r/534559 (owner: Marostegui)
[05:41:54] <wikibugs>	 (PS5) Marostegui: mariadb: Promote db1109 to s8 master [puppet] - https://gerrit.wikimedia.org/r/531189 (https://phabricator.wikimedia.org/T230762)
[05:42:01] <wikibugs>	 (PS4) Marostegui: wmnet: Update s8-master record [dns] - https://gerrit.wikimedia.org/r/531455 (https://phabricator.wikimedia.org/T230762)
[05:42:16] <wikibugs>	 (PS2) Vgutierrez: ATS: Disable SSL Session tickets [puppet] - https://gerrit.wikimedia.org/r/534558 (https://phabricator.wikimedia.org/T231849)
[05:42:56] <marostegui>	 !log Remove grants for dbproxy1005 T231280 T231967
[05:43:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:01] <stashbot>	 T231967: Decommission dbproxy1005.eqiad.wmnet - https://phabricator.wikimedia.org/T231967
[05:43:02] <stashbot>	 T231280: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280
[05:44:16] <wikibugs>	 (CR) Vgutierrez: [C: +2] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/18181/" [puppet] - https://gerrit.wikimedia.org/r/534558 (https://phabricator.wikimedia.org/T231849) (owner: Vgutierrez)
[05:44:55] <wikibugs>	 Operations, Analytics, Analytics-Cluster: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (Dzahn)
[05:45:50] <wikibugs>	 (CR) Urbanecm: [C: +1] "Lgtm" [mediawiki-config] - https://gerrit.wikimedia.org/r/534471 (https://phabricator.wikimedia.org/T231982) (owner: Zoranzoki21)
[05:45:52] <icinga-wm>	 ACKNOWLEDGEMENT - Disk space on notebook1004 is CRITICAL: DISK CRITICAL - free space: /srv 1044 MB (0% inode=81%): daniel_zahn https://phabricator.wikimedia.org/T232068 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops
[05:52:00] <wikibugs>	 Operations, Analytics, Analytics-Cluster: analytics1045 - RAID failure and /var/lib/hadoop/data/j can't be mounted - https://phabricator.wikimedia.org/T232069 (Dzahn)
[05:52:37] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on analytics1045 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T232069 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:52:37] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on analytics1045 is CRITICAL: CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough daniel_zahn https://phabricator.wikimedia.org/T232069 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:53:27] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on ununpentium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn envoy fails to start - WIP (dzahn) https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:58:07] <wikibugs>	 (PS1) Marostegui: report_users: Remove dbproxy1005 IP [software] - https://gerrit.wikimedia.org/r/534562
[05:59:16] <wikibugs>	 (CR) Marostegui: [C: +2] report_users: Remove dbproxy1005 IP [software] - https://gerrit.wikimedia.org/r/534562 (owner: Marostegui)
[06:03:35] <wikibugs>	 Operations, DBA: Decommission dbproxy1005.eqiad.wmnet - https://phabricator.wikimedia.org/T231967 (Marostegui)
[06:05:06] <wikibugs>	 (PS2) Dzahn: add discovery CNAME for releases [dns] - https://gerrit.wikimedia.org/r/534402
[06:05:28] <wikibugs>	 (CR) jerkins-bot: [V: -1] add discovery CNAME for releases [dns] - https://gerrit.wikimedia.org/r/534402 (owner: Dzahn)
[06:08:36] <wikibugs>	 (PS1) Marostegui: mariadb: Decommission dbproxy1005 [puppet] - https://gerrit.wikimedia.org/r/534563 (https://phabricator.wikimedia.org/T231967)
[06:09:02] <wikibugs>	 (PS3) Dzahn: add discovery CNAME for releases [dns] - https://gerrit.wikimedia.org/r/534402
[06:10:37] <wikibugs>	 (PS1) Dzahn: requesttracker: use unprivileged port 1443 for envoy [puppet] - https://gerrit.wikimedia.org/r/534564
[06:14:50] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 39 probes of 453 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[06:14:53] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Decommission dbproxy1005 [puppet] - https://gerrit.wikimedia.org/r/534563 (https://phabricator.wikimedia.org/T231967) (owner: Marostegui)
[06:18:15] <wikibugs>	 (PS1) Marostegui: mariadb: Fix typo on dbproxy1005 spare role [puppet] - https://gerrit.wikimedia.org/r/534565
[06:19:14] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Fix typo on dbproxy1005 spare role [puppet] - https://gerrit.wikimedia.org/r/534565 (owner: Marostegui)
[06:19:52] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 24 probes of 453 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
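The RIPE Atlas alert above flips between CRITICAL and OK depending on how many probes failed relative to the "alerts on" figure. A minimal sketch of that comparison follows. It assumes the check goes CRITICAL when strictly more probes fail than the threshold, which fits both messages in the log but is not confirmed by them; the regular expression and the function name are illustrative only.

```python
import re

# Matches the "failed N probes of M (alerts on T)" part of the check output.
PROBE_PATTERN = re.compile(r"failed (\d+) probes of (\d+) \(alerts on (\d+)\)")


def classify(message):
    """Return (state, failed_ratio) for an Atlas-style check message."""
    match = PROBE_PATTERN.search(message)
    if match is None:
        raise ValueError("unrecognised check output")
    failed, total, threshold = (int(group) for group in match.groups())
    # Assumption: strictly more failures than the threshold means CRITICAL.
    state = "CRITICAL" if failed > threshold else "OK"
    return state, failed / total


problem = classify("CRITICAL - failed 39 probes of 453 (alerts on 35)")
recovery = classify("OK - failed 24 probes of 453 (alerts on 35)")
print(problem[0], round(problem[1], 3))
print(recovery[0], round(recovery[1], 3))
```

In both messages only a small fraction of the 453 probes failed, so the alert reflects a modest change in the failure count around the threshold rather than a broad outage.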
[06:30:52] <wikibugs>	 (PS2) KartikMistry: Update cxserver to 2019-09-04-065911-production [deployment-charts] - https://gerrit.wikimedia.org/r/534427 (https://phabricator.wikimedia.org/T213255)
[06:31:14] <wikibugs>	 (CR) KartikMistry: [V: +2 C: +2] Update cxserver to 2019-09-04-065911-production [deployment-charts] - https://gerrit.wikimedia.org/r/534427 (https://phabricator.wikimedia.org/T213255) (owner: KartikMistry)
[06:37:36] <icinga-wm>	 PROBLEM - Disk space on restbase-dev1005 is CRITICAL: DISK CRITICAL - free space: /srv 107908 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase-dev1005&var-datasource=eqiad+prometheus/ops
[06:38:00] <logmsgbot>	 !log @ helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' .
[06:38:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:15] <wikibugs>	 Operations, DBA, Patch-For-Review: Decommission dbproxy1005.eqiad.wmnet - https://phabricator.wikimedia.org/T231967 (Marostegui)
[06:39:35] <logmsgbot>	 !log @ helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' .
[06:39:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:11] <wikibugs>	 Operations, ops-eqiad, decommission: Decommission dbproxy1005.eqiad.wmnet - https://phabricator.wikimedia.org/T231967 (Marostegui) a:Marostegui→RobH
[06:40:32] <wikibugs>	 Operations, ops-eqiad, DC-Ops, decommission: Decommission dbproxy1005.eqiad.wmnet - https://phabricator.wikimedia.org/T231967 (Marostegui) This host is ready for #dc-ops to get it decommissioned
[06:41:48] <logmsgbot>	 !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' .
[06:41:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:05] <kart_>	 !log Updated cxserver to 2019-09-04-065911-production (T213255, T206310)
[06:44:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:08] <stashbot>	 T213255: CX2: Doesn't handle correctly ISBN, should not put nowiki tags around them - https://phabricator.wikimedia.org/T213255
[06:44:09] <stashbot>	 T206310: CX2: Highlight references with a template that is missing mandatory parameters after being adapted - https://phabricator.wikimedia.org/T206310
[06:51:34] <wikibugs>	 (CR) Dzahn: [C: +2] add discovery CNAME for releases [dns] - https://gerrit.wikimedia.org/r/534402 (owner: Dzahn)
[06:55:15] <wikibugs>	 (CR) Tim Starling: [C: +1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/534405 (owner: Giuseppe Lavagetto)
[06:58:43] <wikibugs>	 (CR) Dzahn: [C: +2] requesttracker: use unprivileged port 1443 for envoy [puppet] - https://gerrit.wikimedia.org/r/534564 (owner: Dzahn)
[06:58:53] <wikibugs>	 (PS2) Dzahn: requesttracker: use unprivileged port 1443 for envoy [puppet] - https://gerrit.wikimedia.org/r/534564
[06:59:46] <wikibugs>	 (PS1) Muehlenhoff: Switch remaining restbase servers to Stretch [puppet] - https://gerrit.wikimedia.org/r/534572
[07:01:29] <wikibugs>	 (PS2) Muehlenhoff: Switch remaining restbase servers to Stretch [puppet] - https://gerrit.wikimedia.org/r/534572
[07:01:45] <wikibugs>	 (CR) Nikerabbit: Add Draft and Draft_talk aliases for wikis that define draft namespace (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/510780 (https://phabricator.wikimedia.org/T223472) (owner: Petar.petkovic)
[07:03:02] <icinga-wm>	 RECOVERY - Check systemd state on ununpentium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:03:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch remaining restbase servers to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/534572 (owner: 10Muehlenhoff)
[07:03:46] <wikibugs>	 (03PS3) 10Muehlenhoff: Add library hint for nghttp2 [puppet] - 10https://gerrit.wikimedia.org/r/534449
[07:05:46] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Replace dbproxy1010 with dbproxy1018 [dns] - 10https://gerrit.wikimedia.org/r/534573 (https://phabricator.wikimedia.org/T231520)
[07:07:14] <icinga-wm>	 PROBLEM - Check systemd state on ununpentium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:07:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for nghttp2 [puppet] - 10https://gerrit.wikimedia.org/r/534449 (owner: 10Muehlenhoff)
[07:07:27] <mutante>	 !log ununpentium - manually delete /etc/envoy/listeners.d/00-tls_terminator_443.yaml after changing port to 1443 - puppet does not remove it
[07:07:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:23] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on ununpentium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn switching envoy config / port (dzahn) https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:11:39] <wikibugs>	 (03PS2) 10Muehlenhoff: Add DNS entries for new Buster-based LDAP/corp replicas [dns] - 10https://gerrit.wikimedia.org/r/534432 (https://phabricator.wikimedia.org/T231015)
[07:11:57] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I don't think this is the solution to your problem, but this class will be used for tls termination between internal services as well. So " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/534421 (owner: 10Dzahn)
[07:17:51] <wikibugs>	 (03CR) 10Jcrespo: "This is ok, but you are aware this is a noop, right?" [dns] - 10https://gerrit.wikimedia.org/r/534573 (https://phabricator.wikimedia.org/T231520) (owner: 10Marostegui)
[07:19:44] <icinga-wm>	 PROBLEM - Disk space on restbase-dev1005 is CRITICAL: DISK CRITICAL - free space: /srv 107812 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase-dev1005&var-datasource=eqiad+prometheus/ops
[07:24:10] <wikibugs>	 10Operations, 10DBA, 10Data-Services, 10Patch-For-Review: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Marostegui) I have been talking to Jaime about this, and we might better wait for https://gerrit.wikimedia.org/r/#/c/operations/puppe...
[07:24:36] <wikibugs>	 (03CR) 10Marostegui: "> This is ok, but you are aware this is a noop, right?" [dns] - 10https://gerrit.wikimedia.org/r/534573 (https://phabricator.wikimedia.org/T231520) (owner: 10Marostegui)
[07:31:36] <mutante>	 !log ununpentium - removed /etc/envoy/envoy.yaml; ran /usr/local/sbin/build-envoy-config -c /etc/envoy to regenerate config without 443 listener; ran puppet; envoy now running on jessie
[07:31:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:13] <moritzm>	 !log upgrading mw1293-mw1296, mw1299-mw1306 to PHP 7.2.22
[07:32:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:15] <wikibugs>	 (03CR) 10Dzahn: tlsproxy/envoy: limit connections on 443 to cache servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/534421 (owner: 10Dzahn)
[07:35:30] <wikibugs>	 10Operations, 10Traffic: varnishreqstats sends truncated statsd traffic - https://phabricator.wikimedia.org/T212310 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Varnish doesn't send statsd any longer per {T220116}, resolving
[07:39:38] <wikibugs>	 (03PS1) 10Marostegui: production-m1: Remove puppet grants [puppet] - 10https://gerrit.wikimedia.org/r/534576 (https://phabricator.wikimedia.org/T231539)
[07:40:04] <icinga-wm>	 PROBLEM - Disk space on restbase-dev1005 is CRITICAL: DISK CRITICAL - free space: /srv 107368 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase-dev1005&var-datasource=eqiad+prometheus/ops
[07:40:10] <wikibugs>	 10Operations, 10ops-codfw: ganeti2005 - mgmt interface stopped responding and reset fails - https://phabricator.wikimedia.org/T232067 (10Dzahn) p:05Triage→03Normal
[07:41:46] <icinga-wm>	 RECOVERY - Check systemd state on ununpentium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:41:57] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] production-m1: Remove puppet grants [puppet] - 10https://gerrit.wikimedia.org/r/534576 (https://phabricator.wikimedia.org/T231539) (owner: 10Marostegui)
[07:45:10] <marostegui>	 !log Remove puppet grants from m1 for the following IPs: 10.64.0.165 10.64.16.159 10.64.16.18 T231539
[07:45:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:14] <stashbot>	 T231539: Drop puppet database from m1 - https://phabricator.wikimedia.org/T231539
[07:47:16] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Drop puppet database from m1 - https://phabricator.wikimedia.org/T231539 (10Marostegui) The following grants have been dropped: ` root@db1135.eqiad.wmnet[(none)]> drop user if exists 'puppet'@'10.64.0.165'; Query OK, 0 rows affected (0.00 sec)  root@db1135.eqiad.wmnet...
[07:53:50] <marostegui>	 !log Remove old backups for db2037 and db2042 from dbprov2001 
[07:53:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:13] <hashar>	 !log Switching "wikidatawiki" on mwdebug1001 to 1.34.0-wmf.21 by editing /srv/mediawiki/wikiversions.php # T232035
[07:56:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:21] <stashbot>	 T232035: 1.34.0-wmf.21 cause termbox to emit: Test get rendered termbox returned the unexpected status 500 - https://phabricator.wikimedia.org/T232035
[07:56:29] <wikibugs>	 10Operations, 10vm-requests: vm request for RT to replace ununpentium - https://phabricator.wikimedia.org/T232077 (10Dzahn)
[07:56:56] <_joe_>	 !log uploading scap 3.12.1 to reprepro on all distros 224857
[07:56:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:44] <_joe_>	 !log upgrading scap on mwdebug1001
[07:57:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:51] <wikibugs>	 (03PS1) 10Marostegui: wikireplica_dns: Replace dbproxy1010 with dbproxy1018 [puppet] - 10https://gerrit.wikimedia.org/r/534577 (https://phabricator.wikimedia.org/T231520)
[07:58:31] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "Wait for T231520#5467304" [puppet] - 10https://gerrit.wikimedia.org/r/534577 (https://phabricator.wikimedia.org/T231520) (owner: 10Marostegui)
[08:02:58] <wikibugs>	 10Operations, 10vm-requests: vm request for RT to replace ununpentium - https://phabricator.wikimedia.org/T232077 (10MoritzMuehlenhoff) Looks good
[08:07:09] <wikibugs>	 10Operations, 10DBA, 10Data-Services, 10Patch-For-Review: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Marostegui) a:05Marostegui→03None
[08:09:40] <vgutierrez>	 !log depooling cp3034 due to intermittent network issues
[08:09:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:17] <icinga-wm>	 PROBLEM - Disk space on restbase-dev1005 is CRITICAL: DISK CRITICAL - free space: /srv 107544 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase-dev1005&var-datasource=eqiad+prometheus/ops
[08:13:32] <_joe_>	 !log upgrading scap on deploy1001
[08:13:33] <wikibugs>	 (03PS1) 10Hashar: Promote wikidata to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534578 (https://phabricator.wikimedia.org/T232035)
[08:13:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:44] <hashar>	 going to upgrade wikidatawiki to 1.34.0-wmf.21 for termbox issue
[08:13:46] <hashar>	 which is a blocker
[08:14:05] <_joe_>	 hashar: can you please hold? see what I wrote elsewhere
[08:14:20] * hashar looks "elsewhere"
[08:14:22] <hashar>	 :D
[08:15:29] <hashar>	 _joe_: just let me sync wikiversions ;) that is fast enough!?
[08:15:46] <_joe_>	 hashar: actually
[08:15:54] <_joe_>	 wait 1 minute, we can test scap with that
[08:16:00] <hashar>	 sure
[08:16:14] <_joe_>	 the only thing that could fail would fail on the mwdebugs and has no impact on the deploy
[08:16:17] <hashar>	 we tried to reproduce the termbox / wikidata api query timeout  using mwdebug1001 but that is a dead end :-\
[08:16:45] <moritzm>	 !log reimage restbase-dev1004 to Stretch T224554
[08:16:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:48] <stashbot>	 T224554: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554
[08:17:01] <_joe_>	 hashar: I'll tell you when to proceed
[08:17:29] <_joe_>	 hashar: go on!
[08:17:30] <hashar>	 ok
[08:17:32] <hashar>	 merging merging
[08:17:35] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] Promote wikidata to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534578 (https://phabricator.wikimedia.org/T232035) (owner: 10Hashar)
[08:18:41] <wikibugs>	 (03Merged) 10jenkins-bot: Promote wikidata to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534578 (https://phabricator.wikimedia.org/T232035) (owner: 10Hashar)
[08:18:57] <wikibugs>	 (03CR) 10jenkins-bot: Promote wikidata to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534578 (https://phabricator.wikimedia.org/T232035) (owner: 10Hashar)
[08:19:50] <hashar>	 _joe_: tarrow: and I am now promoting wikidatawiki again
[08:20:00] <_joe_>	 ok
[08:20:03] <tarrow>	 right ho
[08:20:08] <_joe_>	 it seems like the right test to do
[08:20:24] <_joe_>	 tarrow: did you time your requests to the api before the promotion?
[08:20:37] <_joe_>	 having some benchmark can be interesting
[08:20:58] <tarrow>	 we timed with curl (but from the deployment hosts)
[08:21:35] <logmsgbot>	 !log hashar@deploy1001 rebuilt and synchronized wikiversions files: Promote wikidatawiki to 1.34.0-wmf.21 for T232035 - T220746
[08:21:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:38] <stashbot>	 T220746: 1.34.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T220746
[08:21:39] <stashbot>	 T232035: 1.34.0-wmf.21 cause termbox to emit: Test get rendered termbox returned the unexpected status 500 - https://phabricator.wikimedia.org/T232035
[08:22:05] <_joe_>	 hashar: did scap complete?
[08:22:25] <hashar>	 yeah
[08:22:43] <_joe_>	 any errors?
[08:22:49] <_joe_>	 if not, that's great
[08:23:01] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] Add maps reboot cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe)
[08:23:25] <tarrow>	 seem to be some 500's again
[08:23:49] <_joe_>	 tarrow: can you see what's the url called on the api?
[08:23:49] <vgutierrez>	 !log repooling cp3034
[08:23:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:08] <_joe_>	 that times out
[08:24:34] <tarrow>	 yep
[08:24:38] <tarrow>	 it's in each logstash entry
[08:24:44] <_joe_>	 heh ok
[08:24:50] <wikibugs>	 (03PS2) 10Gehel: Pick a new canary for elastic [puppet] - 10https://gerrit.wikimedia.org/r/534461 (owner: 10Muehlenhoff)
[08:24:54] <_joe_>	 did you try to call it now from deploy2001?
[08:25:20] <tarrow>	  /termbox?preferredLanguages=de%7Cen&entity=Q1&editLink=%2Fedit%2FQ1347&language=de&revision=103
[08:26:21] <_joe_>	 this is the termbox url or the mw api url?
[08:26:44] <hashar>	 so yeah that is slow again from time to time
[08:26:50] <hashar>	 https://phabricator.wikimedia.org/T232035#5467373
[08:26:57] <hashar>	 750ms - 1000 ms
[08:27:10] <hashar>	 and sometimes there is a timeout of some sort and that service checker is flagged critical with 3600ms run time
[08:28:06] <tarrow>	 yeah
[08:28:07] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] Pick a new canary for elastic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/534461 (owner: 10Muehlenhoff)
[08:28:14] <_joe_>	 hashar: I would see what changed in either
[08:28:19] <tarrow>	 https://www.irccloud.com/pastebin/tzv3Qlcg/
[08:28:20] <_joe_>	 the mediawiki api response times
[08:28:26] <tarrow>	 6s to respond there
[08:28:27] <_joe_>	 or in the content of the response
[08:28:42] <_joe_>	 given termbox didn't change
[08:28:54] <_joe_>	 something  must have changed in terms of the backend calls it makes
[08:29:11] <_joe_>	 the only way to properly debug the problem is to go look at the mw api :)
[08:29:13] <tarrow>	 odd how it isn't consistent though
[08:29:16] <wikibugs>	 10Operations, 10netops: Check router ACLs for early install SSH access from puppet masters/cumin hosts - https://phabricator.wikimedia.org/T231811 (10MoritzMuehlenhoff) 05Declined→03Open @ayounsi ; Could you doublecheck whether we have any remaining router/firewall rules for the IPs used by iron.wikimedia....
[08:29:35] <tarrow>	 I suspect it depends which server it is hitting
[08:29:49] <tarrow>	 0.08s or >6s
[08:29:51] <hashar>	 pure speculation, but termbox queries the mediawiki API with a 3-second timeout, doesn't it?
[08:29:52] <_joe_>	 tarrow: could be for a number of reasons. Can you give me the exact url that is called on the mediawiki api?
[08:29:55] <hashar>	 so eventually it dies out
[08:30:05] <_joe_>	 hashar: yes, that's what's happening
[08:30:05] <tarrow>	 curl -w "@curl-format.txt" -H 'Host: www.wikidata.org' 'http://api-ro.discovery.wmnet/w/index.php?title=Special:EntityData&format=json&id=Q1&revision=103'
[08:30:08] <hashar>	 but maybe the query is still going on on the mediawiki API servers and we might see some timeout there
[08:30:24] <tarrow>	 ignore the curl-format bit for logging the times if you like
[08:30:24] <_joe_>	 tarrow: what's that curl-format.txt file?
[08:30:27] <hashar>	 tarrow: do you get logstash access? 
[08:30:31] <_joe_>	 ahah ok
[08:30:36] <vgutierrez>	 !log rebooting cp3034
[08:30:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:50] <tarrow>	 https://www.irccloud.com/pastebin/jvGcwqwu/
[08:30:57] <wikibugs>	 (03PS3) 10Gehel: Pick a new canary for elastic [puppet] - 10https://gerrit.wikimedia.org/r/534461 (owner: 10Muehlenhoff)
[08:30:58] <hashar>	 https://logstash.wikimedia.org/goto/af5df8a95de6a72c184ac1072a4fdb78  is errors for wikidatawiki
[08:31:01] <tarrow>	 is the contents if you want to break down the slow bits
[08:31:29] <hashar>	 but that does not have much clues :-\
[08:31:35] <wikibugs>	 (03PS3) 10Muehlenhoff: Decom iron [puppet] - 10https://gerrit.wikimedia.org/r/531867 (https://phabricator.wikimedia.org/T220505)
[08:32:06] <tarrow>	 _joe_: is there some easy way to aim at different api servers?
[08:32:07] <akosiaris>	 !log depool restbase1022 T232007
[08:32:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:10] <stashbot>	 T232007: Restbase: significant increase of outbound dropped packets - https://phabricator.wikimedia.org/T232007
[08:32:15] <tarrow>	 e.g. see if it is the php7 ones that are slow
[08:32:53] <_joe_>	 tarrow: responses to that request are super fast
[08:32:57] <_joe_>	 on any server
[08:33:15] <tarrow>	 well not from deploy2002
[08:33:21] <_joe_>	 tarrow: yes, you can s/api-ro.discovery/mw1347.eqiad/
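The substitution _joe_ suggests (s/api-ro.discovery/mw1347.eqiad/) pins the same request to a single backend instead of the discovery name. A minimal sketch of that rewrite, assuming the URL tarrow pasted at 08:30:05; the helper name `pin_to_host` is hypothetical and not part of any tool mentioned in the channel. The `Host: www.wikidata.org` header would stay unchanged on the actual request.

```python
from urllib.parse import urlsplit, urlunsplit


def pin_to_host(url: str, host: str) -> str:
    """Return url with its network location replaced by host (hypothetical helper)."""
    parts = urlsplit(url)
    # Only the netloc changes; scheme, path and query string are kept as-is.
    return urlunsplit(parts._replace(netloc=host))


# URL taken from the curl command quoted in the log.
url = (
    "http://api-ro.discovery.wmnet/w/index.php"
    "?title=Special:EntityData&format=json&id=Q1&revision=103"
)
pinned = pin_to_host(url, "mw1347.eqiad.wmnet")
print(pinned)
```

Running the timing check against a few individual hosts this way is one way to tell whether slowness follows a particular server or the discovery record itself.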
[08:33:28] <_joe_>	 tarrow: uh interesting
[08:33:38] <tarrow>	 yeah, sometimes >6s
[08:33:42] <wikibugs>	 (03PS4) 10Muehlenhoff: Decom iron [puppet] - 10https://gerrit.wikimedia.org/r/531867 (https://phabricator.wikimedia.org/T220505)
[08:33:44] <tarrow>	 other times 0.08
[08:34:10] <_joe_>	 tarrow: uhm never happened to me
[08:34:12] <_joe_>	 but lemme check
[08:34:41] <icinga-wm>	 PROBLEM - Host cp3034 is DOWN: PING CRITICAL - Packet loss = 100%
[08:34:46] <tarrow>	 I would say between 10 and 20% of requests
[08:34:50] <tarrow>	 are slow
[08:35:00] <_joe_>	 tarrow: not my experience, which is weird
[08:35:06] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.hosts.downtime
[08:35:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:10] <tarrow>	 https://www.irccloud.com/pastebin/Nl7tAnFf/
[08:35:20] <tarrow>	 just run now
[08:35:25] <_joe_>	 so it's clearly slower on php7
[08:35:29] <_joe_>	 takes 0.8 seconds
[08:35:57] <_joe_>	 tarrow: can you please log the Server header too?
[08:36:27] <wikibugs>	 (03CR) 10Muehlenhoff: "Ack, sounds good." [puppet] - 10https://gerrit.wikimedia.org/r/534461 (owner: 10Muehlenhoff)
[08:36:28] <_joe_>	 uh wait a sec
[08:36:36] <_joe_>	 why is api-ro going to codfw FFS
[08:36:37] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Pick a new canary for elastic [puppet] - 10https://gerrit.wikimedia.org/r/534461 (owner: 10Muehlenhoff)
[08:36:53] <wikibugs>	 10Operations, 10observability, 10Performance-Team (Radar): Set up a statsv-like endpoint for Prometheus - https://phabricator.wikimedia.org/T180105 (10fgiunchedi) I was re-reading this task and {T175087} in the context of progressively moving away from statsd/graphite (T205870), some technical thoughts on ho...
[08:36:54] <_joe_>	 tarrow: argh, I might know what's up
[08:36:54] <tarrow>	 https://www.irccloud.com/pastebin/QM790zsh/
[08:37:01] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:37:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:11] <_joe_>	 indeed...
[08:37:32] <tarrow>	 :)
[08:37:43] * tarrow is not sure what he's looking at
[08:37:53] <tarrow>	 are we active active for api-ro?
[08:37:58] * hashar watches black magic going on
[08:38:03] <_joe_>	 tarrow: we shouldn't be
[08:38:12] <_joe_>	 so wait a couple minutes
[08:38:47] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=a.*-ro,name=codfw
[08:38:48] <tarrow>	 okay :)
[08:38:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:57] <hashar>	 how have you managed to add extra logs (time_connect, time_appconnect etc)  to the query output?
[08:39:33] <tarrow>	 https://stackoverflow.com/questions/18215389/how-do-i-measure-request-and-response-times-at-once-using-curl
[08:39:41] <tarrow>	 using that
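The technique tarrow links to relies on curl's `-w "@file"` option, which prints write-out variables such as `time_connect` and `time_total` after each request. A rough sketch of how such a format file and command line could be assembled follows. The selection and layout of fields is an assumption (the real curl-format.txt is only in the pastebin), and the `-s -o /dev/null` flags are additions for illustration, not part of the command in the log.

```python
import shlex

# Write-out variable names documented in the curl man page. Which ones
# tarrow's file actually contains is not known; this list is an assumption.
TIMING_FIELDS = [
    "time_namelookup",
    "time_connect",
    "time_appconnect",
    "time_pretransfer",
    "time_starttransfer",
    "time_total",
]


def write_out_format(fields=TIMING_FIELDS) -> str:
    """Build the text of a curl -w format file, one 'name: value' line per field."""
    # The trailing backslash-n is left literal; curl expands it to a newline.
    return "".join(f"{name}: %{{{name}}}s\\n" for name in fields)


def curl_command(url: str, host_header: str, format_path: str = "curl-format.txt"):
    """Return the argv for a timed request; the format file must exist at format_path."""
    return [
        "curl", "-s", "-o", "/dev/null",  # assumption: discard the body, show timings only
        "-w", f"@{format_path}",
        "-H", f"Host: {host_header}",
        url,
    ]


cmd = curl_command(
    "http://api-ro.discovery.wmnet/w/index.php"
    "?title=Special:EntityData&format=json&id=Q1&revision=103",
    "www.wikidata.org",
)
print(write_out_format())
print(shlex.join(cmd))
```

Saving the printed format text as curl-format.txt and running the printed command several times in a row should make the split between fast and slow responses visible per phase (DNS, connect, first byte, total).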
[08:40:07] <_joe_>	 ok tarrow try now?
[08:40:14] <_joe_>	 I bet you won't see issues now
[08:40:35] <_joe_>	 in terms of response times
[08:40:55] <_joe_>	 but still, why termbox is consistently alerting can't be explained by just "some requests are slow"
[08:41:00] <hashar>	 tarrow: nice hack :]
[08:41:22] <_joe_>	 anyways if you can confirm you don't see those timeouts anymore
[08:41:55] <hashar>	 seems it is constantly taking 750ms now
[08:42:06] <hashar>	 (well 750 - 820ms)
[08:42:10] <_joe_>	 hashar: what are you doing?
[08:42:14] <hashar>	 I can't tell how fast it was before
[08:42:20] <hashar>	 deploy1001:~$ time service-checker-swagger -t 15 termbox.svc.codfw.wmnet http://termbox.svc.codfw.wmnet:3030
[08:42:26] <_joe_>	 oh ok
[08:42:31] <_joe_>	 so not the single curl
[08:42:51] <hashar>	 so what black magic happened?
[08:43:01] <tarrow>	 seems better to me :)
[08:43:09] <_joe_>	 I made requests to api-ro go to the live dc
[08:43:18] <_joe_>	 instead of to the inactive one
[08:43:27] <hashar>	 so what black magic happened?
[08:43:29] <hashar>	 sorry
[08:43:31] <_joe_>	 which is how they were incorrectly configured until just now
[08:43:40] <hashar>	 so potentially we got timeout because some kind of caches are cold on the inactive DC?
[08:43:55] <_joe_>	 the php caches and the db caches and memcached, yes
[08:43:57] <_joe_>	 all of them
[08:43:59] <hashar>	 not sure why it worked fine with 1.34.0-wmf.20 though!:^/
[08:44:13] <_joe_>	 that is something I don't know either
[08:44:32] <icinga-wm>	 RECOVERY - Disk space on restbase-dev1005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase-dev1005&var-datasource=eqiad+prometheus/ops
[08:44:32] <tarrow>	 did we slowly warm the cache for that request
[08:44:34] <hashar>	 maybe due to some changes to the mediawiki/core cache system
[08:44:35] <_joe_>	 and I'd urge you to rollback and retest the speed of that specific endpoint if you want to verify if there was an issue
[08:44:41] <hashar>	 yeah
[08:44:47] <hashar>	 that is what I had in mind
[08:44:48] <tarrow>	 and then bumping meant it was cooled
[08:44:52] <_joe_>	 of perf degradation
[08:45:06] <_joe_>	 tarrow: nope, what happened is I made you do your requests cross-dc
[08:45:28] <tarrow>	 I mean why 20 worked fine
[08:45:33] <_joe_>	 instead of going to the damn same dc, which shouldn't have happened, but we screwed up
[08:45:51] <_joe_>	 tarrow: it would've been a problem when we switched to .20 as well though
[08:45:56] <_joe_>	 the cold caches
[08:46:02] <_joe_>	 so something definitely changed
[08:46:06] <tarrow>	 maybe it was and no-one noticed??
[08:46:12] <_joe_>	 if it's not a perf degradation, it's ok
[08:46:18] <_joe_>	 tarrow: oh that is very possible too
[08:46:22] <icinga-wm>	 RECOVERY - Host cp3034 is UP: PING OK - Packet loss = 0%, RTA = 83.41 ms
[08:46:22] <hashar>	 so rollback to 1.34.0-wmf.20 , measure
[08:46:29] <hashar>	 and then back to .21 right?
[08:46:51] <hashar>	 (possibly we can also look at the Icinga probe for termbox last week and see whether it alarmed)
[08:47:42] <wikibugs>	 (03PS1) 10Hashar: Rollback wikidata to 1.34.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534583 (https://phabricator.wikimedia.org/T232035)
[08:47:46] <hashar>	 tarrow: _joe_ ^^?
[08:48:06] <_joe_>	 hashar: +1 from me
[08:48:09] <tarrow>	 Er... yes... I guess
[08:48:23] <tarrow>	 but just to be clear, *what* exactly are we measuring?
[08:48:36] <tarrow>	 the time for that response?
[08:48:44] <hashar>	 the service response using service-checker-swagger
[08:48:49] <tarrow>	 got it :)
[08:48:52] <hashar>	 that is the only thing that I noticed slowing down
[08:49:00] <hashar>	 (since the queries to index.php worked just fine apparently)
[08:49:41] <hashar>	 hmm
[08:49:54] <hashar>	 https://logstash.wikimedia.org/goto/336ca058ad55c63297324fd699fd7b83
[08:50:02] <hashar>	 Icinga alerts for termbox over 15 days
[08:50:52] <tarrow>	 looks like it times with the train
[08:51:00] <_joe_>	 yeah looks like it
[08:51:05] <_joe_>	 but please confirm
[08:51:19] <_joe_>	 if that's the case, we have a satisfying explanation of the problems we've seen
[08:52:37] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] "Move fingers doing black magic." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534583 (https://phabricator.wikimedia.org/T232035) (owner: 10Hashar)
[08:52:48] <hashar>	 _joe_: tarrow: huge thank you to both of you for the help :]
[08:53:31] <hashar>	 after that I will have to escape
[08:53:32] <wikibugs>	 (03Merged) 10jenkins-bot: Rollback wikidata to 1.34.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534583 (https://phabricator.wikimedia.org/T232035) (owner: 10Hashar)
[08:53:36] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: scap: restart php-fpm if needed when doing a full deploy [puppet] - 10https://gerrit.wikimedia.org/r/534584 (https://phabricator.wikimedia.org/T224857)
[08:53:43] <tarrow>	 Thank you!
[08:53:48] <wikibugs>	 (03CR) 10jenkins-bot: Rollback wikidata to 1.34.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534583 (https://phabricator.wikimedia.org/T232035) (owner: 10Hashar)
[08:54:33] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] scap: restart php-fpm if needed when doing a full deploy [puppet] - 10https://gerrit.wikimedia.org/r/534584 (https://phabricator.wikimedia.org/T224857) (owner: 10Giuseppe Lavagetto)
[08:55:07] <hashar>	 and not sure whether it is useful, but https://grafana.wikimedia.org/d/AJf0z_7Wz/termbox?refresh=1m&orgId=1  might need some metrics about query latency
[08:55:25] <logmsgbot>	 !log hashar@deploy1001 rebuilt and synchronized wikiversions files: Rollback wikidatawiki to 1.34.0-wmf.20 for T232035
[08:55:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:29] <hashar>	 also
[08:55:36] <stashbot>	 T232035: 1.34.0-wmf.21 cause termbox to emit: Test get rendered termbox returned the unexpected status 500 - https://phabricator.wikimedia.org/T232035
[08:55:38] <hashar>	 I don't get why there are like 140 requests per second
[08:55:41] <hashar>	 but barely any errors reported
[08:55:52] <_joe_>	 eqiad doesn't have errors?
[08:55:54] <tarrow>	 hashar: latency in more detail than what is there?
[08:56:02] <_joe_>	 and termbox is called by mediawiki which is active in eqiad
[08:56:07] <hashar>	 OH
[08:56:13] <_joe_>	 so all requests are going to termbox/eqiad
[08:56:25] <_joe_>	 tarrow: he means telemetry for calls to the backend
[08:56:42] <_joe_>	 hashar: telemetry for backend calls and full tracing are coming(TM)
[08:56:43] <hashar>	 there is the Datacenter selector at the top of the graph bah
[08:56:48] <hashar>	 so yeah on codfw that shows some errors
[08:58:21] <hashar>	 ok back to 1.34.0-wmf.21
[08:59:27] <wikibugs>	 (03PS1) 10Hashar: Wikidata back to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534585 (https://phabricator.wikimedia.org/T232035)
[08:59:48] <hashar>	 sometimes I have the feeling I am just wasting my (and everyone else's) time :-\
[08:59:55] <wikibugs>	 10Operations, 10vm-requests: vm request for RT to replace ununpentium - https://phabricator.wikimedia.org/T232077 (10Dzahn)
[08:59:56] <tarrow>	 Did you collect the stats you wanted?
[08:59:57] <wikibugs>	 10Operations: reinstall RT server with private IP and Buster - https://phabricator.wikimedia.org/T180641 (10Dzahn)
[09:00:00] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] Wikidata back to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534585 (https://phabricator.wikimedia.org/T232035) (owner: 10Hashar)
[09:00:07] <tarrow>	 hashar: that is 100% not true
[09:00:20] <hashar>	 doesn't change my feeling about it hehe
[09:00:21] <hashar>	 ;-]
[09:00:29] <tarrow>	 or rather you might have that feeling but you aren't wasting anyone's time
[09:00:38] <tarrow>	 :P
[09:00:43] <hashar>	 I never know whether I am just too obsessive / too strict
[09:00:47] <hashar>	 cool
[09:00:49] <hashar>	 thank you !
[09:00:56] <hashar>	 so claiming it as fixed
[09:00:57] <wikibugs>	 (03Merged) 10jenkins-bot: Wikidata back to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534585 (https://phabricator.wikimedia.org/T232035) (owner: 10Hashar)
[09:01:01] <tarrow>	 woo
[09:01:02] <hashar>	 blaming cold caches in codfw 
[09:01:14] <wikibugs>	 (03CR) 10jenkins-bot: Wikidata back to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534585 (https://phabricator.wikimedia.org/T232035) (owner: 10Hashar)
[09:01:16] <hashar>	 with the root cause undetermined but blaming mediawiki/core change of behavior
[09:01:47] <_joe_>	 hashar: I think it happened at each train release on wikidata
[09:01:53] <_joe_>	 you were just the first one to notice
[09:02:24] <hashar>	 ah
[09:04:01] <vgutierrez>	 !log rolling back from ats-tls to nginx on cp3034 - T231433
[09:04:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:04] <stashbot>	 T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[09:04:37] <jynus>	 hashar: sorry, were you talking about mw changetags before?
[09:04:59] <jynus>	 I think I read someone mentioning them, but cannot remember who
[09:05:08] <_joe_>	 vgutierrez: I see issues on upload in esams, transient. Was that you doing things on cp3034?
[09:05:22] <_joe_>	 jynus: no I think it was tim
[09:05:23] <wikibugs>	 (03PS1) 10Vgutierrez: Revert "hiera: Move ats-tls from port 8443 to port 443 on cp3034" [puppet] - 10https://gerrit.wikimedia.org/r/534586
[09:05:26] <_joe_>	 related to php7?
[09:05:27] <hashar>	 tarrow: _joe_ I have marked the termbox issue fixed. Thank you very much
[09:05:39] <hashar>	 jynus: wasn't me sorry :-^
[09:05:52] <vgutierrez>	 _joe_: yeah, I've migrated this morning from nginx to ats-tls, I'm about to rollback
[09:05:54] <logmsgbot>	 !log hashar@deploy1001 rebuilt and synchronized wikiversions files: Promote wikidatawiki to 1.34.0-wmf.21 for T232035 - T220746
[09:05:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:58] <stashbot>	 T220746: 1.34.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T220746
[09:05:58] <stashbot>	 T232035: 1.34.0-wmf.21 cause termbox to emit: Test get rendered termbox returned the unexpected status 500 - https://phabricator.wikimedia.org/T232035
[09:07:11] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] Revert "hiera: Move ats-tls from port 8443 to port 443 on cp3034" [puppet] - 10https://gerrit.wikimedia.org/r/534586 (owner: 10Vgutierrez)
[09:07:24] <wikibugs>	 (03PS2) 10Vgutierrez: Revert "hiera: Move ats-tls from port 8443 to port 443 on cp3034" [puppet] - 10https://gerrit.wikimedia.org/r/534586
[09:08:38] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: scap: restart php-fpm if needed when doing a full deploy [puppet] - 10https://gerrit.wikimedia.org/r/534584 (https://phabricator.wikimedia.org/T224857)
[09:08:52] <hashar>	 I am off for now, be back this afternoon
[09:11:37] <wikibugs>	 (03PS1) 10Vgutierrez: Revert "hiera: Move nginx from port 443 to port 4443 on cp3034" [puppet] - 10https://gerrit.wikimedia.org/r/534587
[09:11:55] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] Revert "hiera: Move nginx from port 443 to port 4443 on cp3034" [puppet] - 10https://gerrit.wikimedia.org/r/534587 (owner: 10Vgutierrez)
[09:12:37] <wikibugs>	 (03PS2) 10Vgutierrez: Revert "hiera: Move nginx from port 443 to port 4443 on cp3034" [puppet] - 10https://gerrit.wikimedia.org/r/534587
[09:14:36] <icinga-wm>	 PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3034 is CRITICAL: connect to address 10.20.0.169 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[09:14:47] <vgutierrez>	 that's expected
[09:14:54] <icinga-wm>	 PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3034 is CRITICAL: connect to address 10.20.0.169 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[09:15:02] <icinga-wm>	 PROBLEM - Check systemd state on cp3034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:15:12] <icinga-wm>	 PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp3034 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:8443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[09:15:34] <icinga-wm>	 PROBLEM - HTTPS Unified RSA on cp3034 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[09:15:34] <icinga-wm>	 PROBLEM - HTTPS Unified ECDSA on cp3034 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[09:16:08] <wikibugs>	 (03PS1) 10Dzahn: add moscovium.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/534588 (https://phabricator.wikimedia.org/T232077)
[09:16:30] <icinga-wm>	 RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3034 is OK: HTTP OK: HTTP/1.0 200 OK - 19048 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[09:16:33] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] add moscovium.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/534588 (https://phabricator.wikimedia.org/T232077) (owner: 10Dzahn)
[09:16:38] <icinga-wm>	 RECOVERY - Check systemd state on cp3034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:16:48] <icinga-wm>	 RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp3034 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:8443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[09:17:00] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[09:17:01] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:17:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:33] <icinga-wm>	 RECOVERY - HTTPS Unified ECDSA on cp3034 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345098 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 77 days) https://wikitech.wikimedia.org/wiki/HTTPS
[09:19:33] <icinga-wm>	 RECOVERY - HTTPS Unified RSA on cp3034 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345098 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 77 days) https://wikitech.wikimedia.org/wiki/HTTPS
[09:23:29] <wikibugs>	 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team Workboards (Clinic Duty Team), 10User-Eevans: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (10MoritzMuehlenhoff) restbase-dev1004 has been reinstalled as Stretch. @Eevans, you can bootstrap 1004 in Cassandr...
[09:25:28] <wikibugs>	 (03PS2) 10Dzahn: add moscovium.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/534588 (https://phabricator.wikimedia.org/T232077)
[09:25:30] <wikibugs>	 (03PS1) 10Vgutierrez: Revert "hiera: Move ats-tls from port 8443 to 4443 on cp1076" [puppet] - 10https://gerrit.wikimedia.org/r/534589
[09:25:44] <wikibugs>	 (03PS2) 10Vgutierrez: Revert "hiera: Move ats-tls from port 8443 to 4443 on cp1076" [puppet] - 10https://gerrit.wikimedia.org/r/534589
[09:25:52] <wikibugs>	 10Operations, 10vm-requests, 10Patch-For-Review: vm request for RT to replace ununpentium - https://phabricator.wikimedia.org/T232077 (10Dzahn) p:05Triage→03High
[09:25:53] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] Revert "hiera: Move ats-tls from port 8443 to 4443 on cp1076" [puppet] - 10https://gerrit.wikimedia.org/r/534589 (owner: 10Vgutierrez)
[09:25:57] <wikibugs>	 10Operations, 10vm-requests, 10Patch-For-Review: vm request for RT to replace ununpentium - https://phabricator.wikimedia.org/T232077 (10Dzahn) a:03Dzahn
[09:26:31] <vgutierrez>	 !log rolling back from ats-tls to nginx on cp1076 - T231433
[09:26:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:34] <stashbot>	 T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433
[09:29:21] <mutante>	 check the Phabricator and Gerrit contributions of a user - new tool https://tools.wmflabs.org/wikicontrib/
[09:30:02] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (10fgiunchedi)
[09:32:15] <wikibugs>	 (03PS1) 10Vgutierrez: Revert "hiera: Move nginx from port 443 to 4443 on cp1076" [puppet] - 10https://gerrit.wikimedia.org/r/534590
[09:32:25] <wikibugs>	 (03PS2) 10Vgutierrez: Revert "hiera: Move nginx from port 443 to 4443 on cp1076" [puppet] - 10https://gerrit.wikimedia.org/r/534590
[09:32:34] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] Revert "hiera: Move nginx from port 443 to 4443 on cp1076" [puppet] - 10https://gerrit.wikimedia.org/r/534590 (owner: 10Vgutierrez)
[09:33:24] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] add moscovium.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/534588 (https://phabricator.wikimedia.org/T232077) (owner: 10Dzahn)
[09:39:25] <mutante>	 !log ganeti1001 - creating VM moscovium (T232077)
[09:39:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:28] <stashbot>	 T232077: vm request for RT to replace ununpentium - https://phabricator.wikimedia.org/T232077
[09:40:08] <wikibugs>	 10Operations: Conffile handling for PHP 7.2 packages - https://phabricator.wikimedia.org/T231881 (10MoritzMuehlenhoff) I tracked this down: Our way of integrating the puppetised php.ini files is working fine and as expected. The current round of conffile prompts is triggered by an upstream change between 7.2.16...
[09:43:19] <wikibugs>	 (03CR) 10Dzahn: "thanks !:)" [labs/private] - 10https://gerrit.wikimedia.org/r/534275 (owner: 10Dzahn)
[09:44:16] <wikibugs>	 10Operations, 10ops-codfw: codfw humidity too high - https://phabricator.wikimedia.org/T225137 (10Marostegui) Is this good to be closed?
[09:45:58] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove neodymium/sarin from MySQL root clients [puppet] - 10https://gerrit.wikimedia.org/r/534591 (https://phabricator.wikimedia.org/T220503)
[09:47:49] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] Remove neodymium/sarin from MySQL root clients [puppet] - 10https://gerrit.wikimedia.org/r/534591 (https://phabricator.wikimedia.org/T220503) (owner: 10Muehlenhoff)
[09:54:04] <wikibugs>	 (03PS1) 10Dzahn: releases: add envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/534594 (https://phabricator.wikimedia.org/T210411)
[09:59:29] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove neodymium/sarin from MySQL root clients [puppet] - 10https://gerrit.wikimedia.org/r/534591 (https://phabricator.wikimedia.org/T220503)
[10:01:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove neodymium/sarin from MySQL root clients [puppet] - 10https://gerrit.wikimedia.org/r/534591 (https://phabricator.wikimedia.org/T220503) (owner: 10Muehlenhoff)
[10:03:18] <wikibugs>	 (03PS2) 10Dzahn: releases: add envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/534594 (https://phabricator.wikimedia.org/T210411)
[10:03:20] <wikibugs>	 (03PS1) 10Dzahn: install_server: add moscovium to DHCP and netboot [puppet] - 10https://gerrit.wikimedia.org/r/534595 (https://phabricator.wikimedia.org/T232077)
[10:07:36] <wikibugs>	 (03PS2) 10Dzahn: install_server: add moscovium to DHCP and netboot [puppet] - 10https://gerrit.wikimedia.org/r/534595 (https://phabricator.wikimedia.org/T232077)
[10:07:52] <wikibugs>	 (03CR) 10Dzahn: [C: 04-2] install_server: add moscovium to DHCP and netboot [puppet] - 10https://gerrit.wikimedia.org/r/534595 (https://phabricator.wikimedia.org/T232077) (owner: 10Dzahn)
[10:14:32] <wikibugs>	 (03PS1) 10Dzahn: webperf: add envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/534597 (https://phabricator.wikimedia.org/T210411)
[10:18:03] <wikibugs>	 (03PS1) 10Muehlenhoff: Decomission sarin [puppet] - 10https://gerrit.wikimedia.org/r/534598 (https://phabricator.wikimedia.org/T220504)
[10:20:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Decomission sarin [puppet] - 10https://gerrit.wikimedia.org/r/534598 (https://phabricator.wikimedia.org/T220504) (owner: 10Muehlenhoff)
[10:21:50] <wikibugs>	 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10MoritzMuehlenhoff)
[10:22:48] <wikibugs>	 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03RobH This is ready to be decommissioned now.
[10:25:36] <moritzm>	 !log upgrading mw1238-mw1258 to PHP 7.2.22
[10:25:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:04] <_joe_>	 !log upgrading scap across the fleet T224857
[10:28:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:07] <stashbot>	 T224857: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857
[10:31:08] <moritzm>	 !log upgrading mw1319-mw1333 to PHP 7.2.22
[10:31:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:02] <_joe_>	 moritzm: can you make debdeploy use a special script to perform a service restart?
[10:32:13] <_joe_>	 or it defaults to systemctl restart?
[10:33:39] <moritzm>	 it doesn't handle restarts (apart from the restarts triggered automatically by postinst scripts); all restarts need to be done with Cumin or cookbooks
[10:33:48] <_joe_>	 ok
[10:34:11] <_joe_>	 I was confused by the "non-daemon update, no service restart needed"
[10:34:39] <_joe_>	 can I update scap across the clusters now?
[10:34:42] <moritzm>	 the initial debdeploy version based on salt had restart support, but it turned out to be better handled outside of debdeploy, so wasn't ported over to the cumin version
[10:34:54] <_joe_>	 ack!
[10:35:08] <_joe_>	 we need to amend the service restarts page btw
[10:35:31] <moritzm>	 for envoy?
[10:35:38] <_joe_>	 not only
[10:35:49] <_joe_>	 but for the work I've done on safe restarts
[10:36:00] <moritzm>	 ah yes
[10:37:43] <moritzm>	 BTW, one thing we still might miss in the cookbooks for mw restarts is waiting for ffmpeg on video scalers; some of the video scaler jobs can take quite a while, so for reboots I've always double-checked with Cumin that no ffmpeg processes remain
[10:38:13] <moritzm>	 OTOH if the tmh code has retry logic we can probably also simply rely on that, not sure
[10:38:14] <_joe_>	 heh that's going to be a huge problem with php7 btw
[10:38:30] <_joe_>	 given it is restarted regularly for opcache reasons
[10:38:38] <_joe_>	 but yeah, it has a retry logic
[10:38:38] <moritzm>	 yeah, I had been wondering about that :-)
[10:38:51] <_joe_>	 and we should *really* have a separate service for videoscaling
[10:39:06] <moritzm>	 agreed
[10:40:35] <_joe_>	 does any FLOSS video encoding service exist that we could use to this end?
[10:44:12] <moritzm>	 not sure if anything like that exists, Brion probably knows best
[10:46:52] <moritzm>	 !log upgrading mw1221-mw1335 to PHP 7.2.22
[10:47:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:36] <godog>	 FYI I'm going to delete a bunch of puppetdb spammy metrics from prometheus eqiad, T228395
[10:48:37] <stashbot>	 T228395: puppetdb prometheus metrics per-host metrics - https://phabricator.wikimedia.org/T228395
[10:49:25] <logmsgbot>	 !log filippo@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=prometheus1004.eqiad.wmnet
[10:49:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:42] <_joe_>	 heh about spammy metrics, I guess we'll need to create filters when we start collecting envoy metrics
[10:50:38] <godog>	 does envoy spam metrics by default?
[10:50:45] <godog>	 or rather, emit spammy metrics?
[10:53:34] <godog>	 !log temporarily enable prometheus admin web api in prometheus@ops in eqiad to delete spammy metrics - T228395
[10:53:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:05] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190905T1100).
[11:00:05] <jouncebot>	 kostajh: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:26] <kostajh>	 \o
[11:03:14] <icinga-wm>	 PROBLEM - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops
[11:03:22] <godog>	 expected ^
[11:06:43] <wikibugs>	 (03PS3) 10Ema: lvs: add restbase-ssl [puppet] - 10https://gerrit.wikimedia.org/r/534462 (https://phabricator.wikimedia.org/T210411)
[11:11:56] <icinga-wm>	 PROBLEM - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global
[11:12:08] <icinga-wm>	 PROBLEM - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops
[11:12:18] <kostajh>	 Anyone around for SWAT? The two patches will take a while to get through CI
[11:16:55] <kostajh>	 Amir1 / Urbanecm ^
[11:18:29] <dcausse>	 kostajh: I can SWAT
[11:18:38] <kostajh>	 Thx dcausse 
[11:31:22] <wikibugs>	 (03PS1) 10Muehlenhoff: Decommission neodymium [puppet] - 10https://gerrit.wikimedia.org/r/534600
[11:32:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Decommission neodymium [puppet] - 10https://gerrit.wikimedia.org/r/534600 (owner: 10Muehlenhoff)
[11:38:54] <icinga-wm>	 RECOVERY - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops
[11:39:06] <wikibugs>	 (03PS2) 10Muehlenhoff: Decommission neodymium [puppet] - 10https://gerrit.wikimedia.org/r/534600 (https://phabricator.wikimedia.org/T220503)
[11:39:16] <icinga-wm>	 RECOVERY - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global
[11:48:31] <logmsgbot>	 !log dcausse@deploy1001 Synchronized php-1.34.0-wmf.20/extensions/CirrusSearch/: T159321: Add morelikethis a non-greedy version of the morelike keyword (duration: 00m 59s)
[11:48:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:35] <stashbot>	 T159321: [Bug] Unpredictable behavior with the order of Special:Search parameters - https://phabricator.wikimedia.org/T159321
[11:48:49] <dcausse>	 kostajh: it's live ^
[11:49:25] <Amir1>	 kostajh: sorry, at meetings:(
[11:49:27] <kostajh>	 dcausse: lovely, thanks
[11:50:28] <dcausse>	 !log EU swat done
[11:50:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Decommission neodymium [puppet] - 10https://gerrit.wikimedia.org/r/534600 (https://phabricator.wikimedia.org/T220503) (owner: 10Muehlenhoff)
[11:52:39] <wikibugs>	 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10MoritzMuehlenhoff)
[11:53:00] <wikibugs>	 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03RobH This is ready for decom
[11:56:38] <wikibugs>	 (03PS2) 10Mathew.onipe: elasticsearch: logging.yml template is ensure=absent [puppet] - 10https://gerrit.wikimedia.org/r/534398
[11:56:41] <wikibugs>	 (03PS5) 10Mathew.onipe: elasticsearch: add syslog logging option [puppet] - 10https://gerrit.wikimedia.org/r/533928 (https://phabricator.wikimedia.org/T225125)
[11:56:43] <wikibugs>	 (03PS3) 10Mathew.onipe: elasticsearch: switch relforge to new logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/534399 (https://phabricator.wikimedia.org/T225125)
[11:57:07] <wikibugs>	 (03CR) 10Mathew.onipe: elasticsearch: switch relforge to new logging pipeline (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/534399 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe)
[11:57:29] <moritzm>	 !log upgrading remaining job runners to PHP 7.2.22
[11:57:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:52] <wikibugs>	 (03CR) 10Mathew.onipe: elasticsearch: logging.yml template is ensure=absent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/534398 (owner: 10Mathew.onipe)
[11:58:52] <wikibugs>	 (03CR) 10Mathew.onipe: elasticsearch: add syslog logging option (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/533928 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe)
[12:02:05] <logmsgbot>	 !log @ helmfile [STAGING] Ran 'sync' command on namespace 'sessionstore' for release 'staging' .
[12:02:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:48] <wikibugs>	 (03CR) 10Mathew.onipe: "PCC output is expected: https://puppet-compiler.wmflabs.org/compiler1002/18186/" [puppet] - 10https://gerrit.wikimedia.org/r/533928 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe)
[12:13:56] <moritzm>	 !log upgrading mw1284-mw1290 to PHP 7.2.22
[12:13:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:43] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] lvs: add restbase-ssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/534462 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema)
[12:37:23] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mediawiki::php: only enable tideways/mongodb where needed [puppet] - 10https://gerrit.wikimedia.org/r/534405
[12:42:14] <wikibugs>	 (03CR) 10Marostegui: "Moritzm, keep in mind that the proxies are accessed by all the tools that want to connect to the services labsdb services (web and analyti" [puppet] - 10https://gerrit.wikimedia.org/r/531670 (owner: 10Muehlenhoff)
[12:47:37] <logmsgbot>	 !log @ helmfile [STAGING] Ran 'sync' command on namespace 'sessionstore' for release 'staging' .
[12:47:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:17] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::php: only enable tideways/mongodb where needed [puppet] - 10https://gerrit.wikimedia.org/r/534405 (owner: 10Giuseppe Lavagetto)
[12:59:42] <hashar>	 o/
[13:00:04] <jouncebot>	 hashar: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - European version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190905T1300).
[13:00:56] <wikibugs>	 (03PS1) 10Hashar: all wikis to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534604
[13:00:58] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] all wikis to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534604 (owner: 10Hashar)
[13:00:58] <hashar>	 wish me luck
[13:01:12] <Lucas_WMDE>	 good luck…
[13:02:37] <wikibugs>	 (03Merged) 10jenkins-bot: all wikis to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534604 (owner: 10Hashar)
[13:02:55] <wikibugs>	 (03CR) 10jenkins-bot: all wikis to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534604 (owner: 10Hashar)
[13:03:43] <hashar>	 apaches syncing
[13:04:25] <logmsgbot>	 !log hashar@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.21
[13:04:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:36] <hashar>	 hmm uneventful
[13:15:32] <wikibugs>	 (03PS3) 10Muehlenhoff: Add DNS entries for new Buster-based LDAP/corp replicas [dns] - 10https://gerrit.wikimedia.org/r/534432 (https://phabricator.wikimedia.org/T231015)
[13:15:36] <wikibugs>	 10Operations, 10Puppet, 10User-fgiunchedi: puppetdb prometheus metrics per-host metrics - https://phabricator.wikimedia.org/T228395 (10fgiunchedi) prometheus1004 completed, with this process:  ` # depool # stop puppet # add  --web.enable-admin-api to /lib/systemd/system/prometheus@ops.service systemctl daemo...
[13:15:49] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Cluster: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10Ottomata) ping @Groceryheist  I don't know ryanmax's phab id, so I will email him.
[13:17:41] <logmsgbot>	 !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=prometheus1004.eqiad.wmnet
[13:17:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:00] <icinga-wm>	 PROBLEM - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global
[13:19:20] <icinga-wm>	 PROBLEM - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops
[13:21:12] <godog>	 that's me ^
[13:21:20] <godog>	 there will be others for prometheus1003 shortly
[13:21:47] <logmsgbot>	 !log filippo@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=prometheus1003.eqiad.wmnet
[13:21:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:51] <_joe_>	 and those alerts are better not acknowledged AIUI
[13:22:24] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add DNS entries for new Buster-based LDAP/corp replicas [dns] - 10https://gerrit.wikimedia.org/r/534432 (https://phabricator.wikimedia.org/T231015) (owner: 10Muehlenhoff)
[13:23:29] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Cluster: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10Nuria) Given that this is likely to impact other users, can we temporarily compress that directory (/home/ryanmax) to free up space?
[13:23:36] <godog>	 yeah, also they'll auto resolve 
[13:26:51] <wikibugs>	 (03PS1) 10Muehlenhoff: Add partman config for ldap-corp* [puppet] - 10https://gerrit.wikimedia.org/r/534609
[13:30:17] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Make one user out of 3 use php7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534610 (https://phabricator.wikimedia.org/T219150)
[13:30:58] <_joe_>	 Reedy: ^^ :P
[13:31:15] <Reedy>	 :D
[13:31:24] <wikibugs>	 (03CR) 10Reedy: [C: 03+1] Make one user out of 3 use php7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534610 (https://phabricator.wikimedia.org/T219150) (owner: 10Giuseppe Lavagetto)
[13:33:06] <icinga-wm>	 PROBLEM - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops
[13:34:00] <icinga-wm>	 PROBLEM - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global
[13:34:22] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Cluster: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10Ottomata) I deleted a little bit from my home dir, so we have a little bit of room for a bit. I'll give them a little time to respond.
[13:35:05] <wikibugs>	 (03CR) 10Arlolra: "> Patch Set 2: Code-Review+2" [puppet] - 10https://gerrit.wikimedia.org/r/534271 (https://phabricator.wikimedia.org/T229356) (owner: 10Dzahn)
[13:35:51] <hashar>	 train looks fine to me so far
[13:37:09] <logmsgbot>	 !log @ helmfile [STAGING] Ran 'apply' command on namespace 'sessionstore' for release 'staging' .
[13:37:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:04] <icinga-wm>	 RECOVERY - Disk space on notebook1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops
[13:39:52] <moritzm>	 !log upgrading remaining API servers to PHP 7.2.22
[13:39:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:58] <icinga-wm>	 RECOVERY - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global
[13:41:28] <icinga-wm>	 RECOVERY - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops
[13:45:07] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Cluster: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10RyanSteinberg) I just deleted some files and I'm compressing others. I didn't realize space was so tight ... my apologies.
[13:49:15] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Cluster: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10Ottomata) The Jupyter Notebook servers are meant mostly to be a GUI/CLI interface to Hadoop-based systems.  If you can, please consider storing data in HDFS.
[13:51:27] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review, 10User-notice: Switchover m1 primary master: db1063 to db1135: Tuesday 10th September at 16:00 UTC - https://phabricator.wikimedia.org/T231403 (10Agusbou2015) Will enwiki be the only wiki affected by this failover?
[13:52:18] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: sessionstore: Bump again memory limits/requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/534613 (https://phabricator.wikimedia.org/T229697)
[13:52:37] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review, 10User-notice: Switchover m1 primary master: db1063 to db1135: Tuesday 10th September at 16:00 UTC - https://phabricator.wikimedia.org/T231403 (10Marostegui) >>! In T231403#5468025, @Agusbou2015 wrote: > Will enwiki be the only wiki affected by this failover?  enwiki w...
[13:52:46] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] sessionstore: Bump again memory limits/requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/534613 (https://phabricator.wikimedia.org/T229697) (owner: 10Alexandros Kosiaris)
[13:54:31] <logmsgbot>	 !log @ helmfile [CODFW] Ran 'apply' command on namespace 'sessionstore' for release 'production' .
[13:54:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:38] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 53.85% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[13:54:40] <icinga-wm>	 RECOVERY - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global
[13:55:24] <icinga-wm>	 RECOVERY - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops
[14:01:48] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Cluster: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10Nuria) @RyanSteinberg +1 to Andrew's suggestion. Data should not be kept on notebook servers; rather, you can keep it in your user database in Hadoop. This is due to space concerns in no...
[14:11:02] <cdanis>	 !log restarted swiftrepl on ms-fe1005 T231110
[14:11:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:06] <stashbot>	 T231110: bring swiftrepl back to life - https://phabricator.wikimedia.org/T231110
[14:13:48] <wikibugs>	 (03PS1) 10CDanis: swiftrepl: fix missing local variable assignment [software] - 10https://gerrit.wikimedia.org/r/534621
[14:14:00] <logmsgbot>	 !log @ helmfile [CODFW] Ran 'sync' command on namespace 'sessionstore' for release 'production' .
[14:14:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:12] <wikibugs>	 (03PS4) 10Ema: lvs: add restbase-ssl [puppet] - 10https://gerrit.wikimedia.org/r/534462 (https://phabricator.wikimedia.org/T210411)
[14:15:08] <logmsgbot>	 !log @ helmfile [EQIAD] Ran 'sync' command on namespace 'sessionstore' for release 'production' .
[14:15:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:15] <wikibugs>	 (03CR) 10Ayounsi: "I'd recommend putting `include ::profile::base::firewall` in the role instead of in the profile." [puppet] - 10https://gerrit.wikimedia.org/r/534490 (owner: 10CRusnov)
[14:16:35] <wikibugs>	 (03CR) 10Ema: lvs: add restbase-ssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/534462 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema)
[14:19:56] <wikibugs>	 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team Workboards (Clinic Duty Team), 10User-Eevans: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (10Eevans) >>! In T224554#5467470, @MoritzMuehlenhoff wrote: > restbase-dev1004 has been reinstalled as Stretch. @E...
[14:21:29] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:21:29] <icinga-wm>	 PROBLEM - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops
[14:24:53] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:25:01] <icinga-wm>	 PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:25:43] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: restart-appservers: fix to the cli args, some other cosmetic changes [cookbooks] - 10https://gerrit.wikimedia.org/r/534445
[14:30:59] <logmsgbot>	 !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=prometheus1003.eqiad.wmnet
[14:31:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:06] <wikibugs>	 10Operations, 10WMF-Communications: Updating DNS records - https://phabricator.wikimedia.org/T231387 (10mark) Hi Anusha, Greg,  Looking into this. Unfortunately it seems the way this is being implemented, we would effectively be signing away complete control of our email security settings/policy for //wikimedi...
[14:32:25] <wikibugs>	 10Operations, 10WMF-Communications: Updating DNS records - https://phabricator.wikimedia.org/T231387 (10mark) 05Stalled→03Open
[14:32:57] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[14:35:08] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] swiftrepl: fix missing local variable assignment [software] - 10https://gerrit.wikimedia.org/r/534621 (owner: 10CDanis)
[14:35:15] <icinga-wm>	 RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:35:37] <wikibugs>	 (03Merged) 10jenkins-bot: swiftrepl: fix missing local variable assignment [software] - 10https://gerrit.wikimedia.org/r/534621 (owner: 10CDanis)
[14:39:39] <icinga-wm>	 PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:42:55] <XioNoX>	 !log remove iron from mr* routers - T231811
[14:42:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:58] <stashbot>	 T231811: Check router ACLs for early install SSH access from puppet masters/cumin hosts - https://phabricator.wikimedia.org/T231811
[14:43:36] <wikibugs>	 (03PS5) 10Ema: restbase: TLS termination with envoy on port 7443 [puppet] - 10https://gerrit.wikimedia.org/r/533028 (https://phabricator.wikimedia.org/T210411)
[14:44:03] <icinga-wm>	 RECOVERY - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops
[14:45:43] <wikibugs>	 10Operations, 10netops: Check router ACLs for early install SSH access from puppet masters/cumin hosts - https://phabricator.wikimedia.org/T231811 (10ayounsi) 05Open→03Resolved Done!
[14:46:00] <wikibugs>	 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team Workboards (Clinic Duty Team), 10User-Eevans: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (10Eevans) >>! In T224554#5467470, @MoritzMuehlenhoff wrote: > restbase-dev1004 has been reinstalled as Stretch. @E...
[14:48:10] <wikibugs>	 (03CR) 10Ema: [C: 03+2] restbase: TLS termination with envoy on port 7443 [puppet] - 10https://gerrit.wikimedia.org/r/533028 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema)
[14:50:50] <ema>	 !log restbase2009: depool and add TLS termination w/ envoy -- https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/533028/ T210411
[14:50:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:57] <wikibugs>	 (03PS1) 10Herron: kafka-main1001: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/534625
[14:50:57] <stashbot>	 T210411: Applayer services without TLS - https://phabricator.wikimedia.org/T210411
[14:50:59] <wikibugs>	 (03PS3) 10CRusnov: netbox: Add netbox* hosts to acmechief. [puppet] - 10https://gerrit.wikimedia.org/r/534490
[14:52:55] <wikibugs>	 (03CR) 10CRusnov: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/534490 (owner: 10CRusnov)
[14:53:03] <wikibugs>	 (03PS4) 10CRusnov: netbox: Add netbox* hosts to acmechief. [puppet] - 10https://gerrit.wikimedia.org/r/534490
[14:53:05] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "lgtm, don't forget to clean up the old ones when they're no longer necessary." [puppet] - 10https://gerrit.wikimedia.org/r/534490 (owner: 10CRusnov)
[14:53:52] <wikibugs>	 (03PS2) 10Herron: kafka-main1001: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/534625
[14:53:54] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] netbox: Add netbox* hosts to acmechief. [puppet] - 10https://gerrit.wikimedia.org/r/534490 (owner: 10CRusnov)
[14:54:27] <ema>	 !log restbase2009: repool after successful envoy deployment T210411
[14:54:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:02] <wikibugs>	 (03PS3) 10Herron: kafka-main1001: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/534625
[14:55:59] <wikibugs>	 (03CR) 10Herron: [C: 03+2] kafka-main1001: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/534625 (owner: 10Herron)
[14:57:30] <ema>	 akosiaris: restbase1022 has had puppet disabled for a few hours, is that intentional or can we re-enable?
[14:57:49] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[15:02:36] <wikibugs>	 10Operations, 10Puppet, 10User-fgiunchedi: puppetdb prometheus metrics per-host metrics - https://phabricator.wikimedia.org/T228395 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Both prometheus1003 and prometheus1004 have been cleaned and repooled, resolving. @EBernhardson please give the web ui anot...
[15:20:27] <wikibugs>	 10Operations, 10Traffic: Track TLS related ATS metrics in prometheus - https://phabricator.wikimedia.org/T231286 (10ema) p:05Triage→03Normal
[15:20:44] <wikibugs>	 10Operations, 10Acme-chief, 10Traffic: Decide/document criteria needed to serve acme-chief LE issued unified certificate to end users - https://phabricator.wikimedia.org/T230687 (10ema) p:05Triage→03Normal
[15:22:41] <wikibugs>	 10Operations, 10ops-codfw: codfw humidity too high - https://phabricator.wikimedia.org/T225137 (10RobH) 05Open→03Resolved I went ahead and pulled the CyrusOne report for this month, and humidity seems to be in the 50% range.  It started high, but it seems CyrusOne rebalanced and now it's back to normal.  {F302...
[15:23:13] <herron>	 !log beginning replacement of kafka1001 with kafka-main1001 T225005
[15:23:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:17] <stashbot>	 T225005: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005
[15:25:32] <wikibugs>	 (03PS2) 10Herron: kafka-main: replace kafka1001 hardware with kafka-main1001 [puppet] - 10https://gerrit.wikimedia.org/r/528271 (https://phabricator.wikimedia.org/T225005)
[15:25:32] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. eevans T224554 - The acknowledgement expires at: 2019-09-09 15:24:59. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:25:40] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1340 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[15:26:02] <wikibugs>	 (03PS2) 10Ottomata: Switch all events to eventgate. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534506 (https://phabricator.wikimedia.org/T228705) (owner: 10Ppchelko)
[15:26:56] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1340 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 591 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[15:27:00] <wikibugs>	 (03CR) 10Herron: [C: 03+2] kafka-main: replace kafka1001 hardware with kafka-main1001 [puppet] - 10https://gerrit.wikimedia.org/r/528271 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron)
[15:54:33] <logmsgbot>	 !log jynus@deploy1001 Synchronized private/PrivateSettings.php: updating cli password (duration: 00m 47s)
[15:54:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:21] <jynus>	 !log restarting batch processes on mwmaint1002 T232106
[15:59:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:05] <jouncebot>	 godog and _joe_: #bothumor My software never has bugs. It just develops random features. Rise for Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190905T1600).
[16:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[16:01:09] <icinga-wm>	 PROBLEM - traffic_server tls process restarted on cp5001 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin+prometheus/ops&var-instance=cp5001&var-layer=tls
[16:02:40] <wikibugs>	 (03PS1) 10Herron: kafka-main: move kafka1001 to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/534634 (https://phabricator.wikimedia.org/T225005)
[16:04:03] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Switch all events to eventgate. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534506 (https://phabricator.wikimedia.org/T228705) (owner: 10Ppchelko)
[16:04:34] <ottomata>	 !log switching remaining job queue events (and all remaining events) to eventgate - T228705
[16:04:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:42] <stashbot>	 T228705: Migrate JobQueue to eventgate - https://phabricator.wikimedia.org/T228705
[16:05:45] <logmsgbot>	 !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Switch all events to eventgate - T228705 (duration: 00m 48s)
[16:06:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:43] <wikibugs>	 (03CR) 10jenkins-bot: Switch all events to eventgate. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534506 (https://phabricator.wikimedia.org/T228705) (owner: 10Ppchelko)
[16:07:29] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[16:09:17] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) Next steps?
[16:22:34] <logmsgbot>	 !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Switch all events to eventgate - T228705 - take 2 (duration: 00m 49s)
[16:22:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:46] <stashbot>	 T228705: Migrate JobQueue to eventgate - https://phabricator.wikimedia.org/T228705
[16:28:21] <wikibugs>	 (03PS1) 10Ppchelko: Remove references to eventlogging-service. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534637 (https://phabricator.wikimedia.org/T211248)
[16:29:30] <wikibugs>	 (03PS2) 10Ppchelko: Remove EventBusRCFeedEngine eventServiceName. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534236 (https://phabricator.wikimedia.org/T229863)
[16:33:08] <wikibugs>	 (03PS2) 10Ppchelko: Remove references to eventlogging-service. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534637 (https://phabricator.wikimedia.org/T232122)
[16:33:24] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[16:38:49] <wikibugs>	 (03PS1) 10CRusnov: netbox: Add dhparam [puppet] - 10https://gerrit.wikimedia.org/r/534640
[16:42:48] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] "Uncontroversial change." [puppet] - 10https://gerrit.wikimedia.org/r/534640 (owner: 10CRusnov)
[16:47:07] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[16:53:26] <wikibugs>	 (03PS3) 10Gehel: elasticsearch: logging.yml template is ensure=absent [puppet] - 10https://gerrit.wikimedia.org/r/534398 (owner: 10Mathew.onipe)
[16:54:16] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] elasticsearch: logging.yml template is ensure=absent [puppet] - 10https://gerrit.wikimedia.org/r/534398 (owner: 10Mathew.onipe)
[16:57:46] <wikibugs>	 (03PS6) 10Gehel: elasticsearch: add syslog logging option [puppet] - 10https://gerrit.wikimedia.org/r/533928 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe)
[17:00:04] <jouncebot>	 cscott, arlolra, subbu, halfak, and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190905T1700).
[17:00:07] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] elasticsearch: add syslog logging option [puppet] - 10https://gerrit.wikimedia.org/r/533928 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe)
[17:12:29] <wikibugs>	 (03PS3) 10Bstorm: toolforge: add CORS header to docker-registry [puppet] - 10https://gerrit.wikimedia.org/r/528617 (owner: 10BryanDavis)
[17:12:44] <wikibugs>	 10Operations, 10ops-codfw: Decommission old mw2231/WMF6435 - https://phabricator.wikimedia.org/T232126 (10Papaul)
[17:14:47] <wikibugs>	 10Operations, 10ops-codfw, 10decommission, 10observability, 10Patch-For-Review: Decom graphite2001/WMF6160 replaced with WMF6403 - https://phabricator.wikimedia.org/T200209 (10Papaul)
[17:15:48] <wikibugs>	 10Operations, 10ops-codfw, 10decommission, 10observability, 10Patch-For-Review: Decom graphite2001/WMF6160  - https://phabricator.wikimedia.org/T200209 (10Papaul)
[17:16:07] <wikibugs>	 10Operations, 10ops-codfw: Decommission old mw2231/WMF6435 replaced with WMF6403 - https://phabricator.wikimedia.org/T232126 (10Papaul) p:05Triage→03Normal
[17:16:37] <wikibugs>	 (03PS4) 10Mathew.onipe: elasticsearch: switch relforge to new logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/534399 (https://phabricator.wikimedia.org/T225125)
[17:16:39] <wikibugs>	 (03PS1) 10Mathew.onipe: elasticsearch: fix syntax error in logging config [puppet] - 10https://gerrit.wikimedia.org/r/534645
[17:16:46] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] toolforge: add CORS header to docker-registry [puppet] - 10https://gerrit.wikimedia.org/r/528617 (owner: 10BryanDavis)
[17:18:00] <wikibugs>	 (03PS2) 10Gehel: elasticsearch: fix syntax error in logging config [puppet] - 10https://gerrit.wikimedia.org/r/534645 (owner: 10Mathew.onipe)
[17:19:12] <wikibugs>	 (03CR) 10Herron: [C: 03+2] kafka-main: move kafka1001 to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/534634 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron)
[17:19:22] <wikibugs>	 (03PS2) 10Herron: kafka-main: move kafka1001 to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/534634 (https://phabricator.wikimedia.org/T225005)
[17:19:26] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] elasticsearch: fix syntax error in logging config [puppet] - 10https://gerrit.wikimedia.org/r/534645 (owner: 10Mathew.onipe)
[17:20:33] <wikibugs>	 (03PS3) 10Herron: kafka-main: move kafka1001 to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/534634 (https://phabricator.wikimedia.org/T225005)
[17:21:10] <wikibugs>	 (03CR) 10Mathew.onipe: "change is only applied on relforge: https://puppet-compiler.wmflabs.org/compiler1002/18189/" [puppet] - 10https://gerrit.wikimedia.org/r/534399 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe)
[17:21:16] <wikibugs>	 10Operations, 10ops-codfw: ganeti2005 - mgmt interface stopped responding and reset fails - https://phabricator.wikimedia.org/T232067 (10Papaul) a:03Papaul
[17:29:45] <wikibugs>	 (03PS1) 10Herron: Revert "kafka-main1001: disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/534646
[17:30:02] <wikibugs>	 (03PS2) 10Herron: Revert "kafka-main1001: disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/534646
[17:31:29] <logmsgbot>	 !log crusnov@deploy1001 Started deploy [netbox/deploy@367ca84]: test deploy for netbox split
[17:31:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:28] <wikibugs>	 (03CR) 10Herron: [C: 03+2] Revert "kafka-main1001: disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/534646 (owner: 10Herron)
[17:33:27] <wikibugs>	 10Operations, 10Analytics, 10Core Platform Team Legacy (Watching / External), 10Patch-For-Review, and 2 others: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10herron)
[18:00:05] <jouncebot>	 MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190905T1800).
[18:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[18:10:08] <logmsgbot>	 !log crusnov@deploy1001 Finished deploy [netbox/deploy@367ca84]: test deploy for netbox split (duration: 38m 39s)
[18:10:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:36] <wikibugs>	 (03PS17) 10Holger Knust: table-properties: Initial commit [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246)
[18:16:39] <wikibugs>	 (03PS1) 10Bstorm: Revert "toolforge: add CORS header to docker-registry" [puppet] - 10https://gerrit.wikimedia.org/r/534648
[18:17:52] <wikibugs>	 (03PS2) 10Bstorm: Revert "toolforge: add CORS header to docker-registry" [puppet] - 10https://gerrit.wikimedia.org/r/534648
[18:18:04] <wikibugs>	 (03CR) 10Bstorm: [V: 03+2 C: 03+2] Revert "toolforge: add CORS header to docker-registry" [puppet] - 10https://gerrit.wikimedia.org/r/534648 (owner: 10Bstorm)
[18:21:46] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install new eqiad netsec server - https://phabricator.wikimedia.org/T232137 (10RobH) p:05Triage→03Normal
[18:21:59] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install new eqiad netsec server - https://phabricator.wikimedia.org/T232137 (10RobH)
[18:22:51] <wikibugs>	 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install frnetmon1001 - https://phabricator.wikimedia.org/T232137 (10RobH)
[18:24:54] <wikibugs>	 10Operations, 10observability, 10Discovery-Search (Current work): Alert when a jvm hits more than 100 old gc ops/hour - https://phabricator.wikimedia.org/T231516 (10debt) 05Open→03Resolved
[18:26:31] <wikibugs>	 10Operations, 10Discovery-Search (Current work): Run jstack / jmap / etc... with PrivateTmp=true - https://phabricator.wikimedia.org/T230774 (10debt) 05Open→03Resolved
[18:32:58] <wikibugs>	 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Icinga reports read time out error for some checks on cloudelastic cluster - https://phabricator.wikimedia.org/T230366 (10debt) 05Open→03Resolved a:03debt
[18:33:32] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-upload site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:34:56] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:36:17] <wikibugs>	 10Operations, 10Analytics, 10Discovery, 10Research-Backlog, 10Patch-For-Review: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10debt)
[18:37:15] <wikibugs>	 (03PS1) 10Andrew Bogott: codf1dev: move the puppetmaster enc database to cloudb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/534657 (https://phabricator.wikimedia.org/T229441)
[18:39:24] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery, 10Discovery-Search (Current work): Memory correctable errors -EDAC- elastic1029 - https://phabricator.wikimedia.org/T214283 (10debt) 05Open→03Resolved >>! In T214283#5451622, @RobH wrote: > Also, in the future, please open a new task for hardware trou...
[18:40:49] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] codf1dev: move the puppetmaster enc database to cloudb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/534657 (https://phabricator.wikimedia.org/T229441) (owner: 10Andrew Bogott)
[18:49:27] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1021 with 10G interfaces - https://phabricator.wikimedia.org/T229873 (10Cmjohnson)
[18:49:48] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1021 with 10G interfaces - https://phabricator.wikimedia.org/T229873 (10Cmjohnson) a:05Cmjohnson→03Andrew @andrew this is ready for you to re-image
[18:50:13] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1022 with 10G interfaces - https://phabricator.wikimedia.org/T229872 (10Cmjohnson)
[18:50:33] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1022 with 10G interfaces - https://phabricator.wikimedia.org/T229872 (10Cmjohnson) a:05Cmjohnson→03Andrew @andrew this is ready for you to re-image
[18:53:54] <wikibugs>	 (03PS1) 10CRusnov: netbox: fix includes of ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/534658
[18:59:29] <wikibugs>	 (03PS5) 10Krinkle: Remove $wgSiteStatsAsyncFactor setting which had the same effect as the default (disabled) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521004 (owner: 10Aaron Schulz)
[18:59:53] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] netbox: fix includes of ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/534658 (owner: 10CRusnov)
[19:00:22] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] netbox: fix includes of ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/534658 (owner: 10CRusnov)
[19:00:27] * Krinkle deploying https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaMaintenance/+/534660/
[19:01:09] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (10Cmjohnson) B0:26:28:29:6A:E0
[19:06:51] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[19:15:47] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (10Cmjohnson)
[19:16:56] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (10Cmjohnson) a:05Cmjohnson→03Andrew @andrew the new mac is in an earlier update.  The server is moved, connected to the new port...
[19:21:53] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.21/extensions/WikimediaMaintenance/blameStartupRegistry.php: 7adf466614d (duration: 00m 48s)
[19:21:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:08] <wikibugs>	 (03CR) 10Krinkle: [C: 03+2] Remove $wgSiteStatsAsyncFactor setting which had the same effect as the default (disabled) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521004 (owner: 10Aaron Schulz)
[19:23:33] <wikibugs>	 (03Merged) 10jenkins-bot: Remove $wgSiteStatsAsyncFactor setting which had the same effect as the default (disabled) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521004 (owner: 10Aaron Schulz)
[19:23:51] <wikibugs>	 (03CR) 10jenkins-bot: Remove $wgSiteStatsAsyncFactor setting which had the same effect as the default (disabled) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521004 (owner: 10Aaron Schulz)
[19:24:41] * Krinkle staging on mwdebug1002
[19:28:29] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: c7678f0e3d638 (duration: 00m 47s)
[19:28:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:33] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Cmjohnson) @ottomata the on-site work is done, They will need updated production DNS but all are moved and c...
[19:32:05] <wikibugs>	 (03PS1) 10Andrew Bogott: Cloudvirt1023: move to 10G nic [puppet] - 10https://gerrit.wikimedia.org/r/534663 (https://phabricator.wikimedia.org/T229871)
[19:32:40] <wikibugs>	 (03PS2) 10Andrew Bogott: Cloudvirt1023: move to 10G nic [puppet] - 10https://gerrit.wikimedia.org/r/534663 (https://phabricator.wikimedia.org/T229871)
[19:33:39] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Cloudvirt1023: move to 10G nic [puppet] - 10https://gerrit.wikimedia.org/r/534663 (https://phabricator.wikimedia.org/T229871) (owner: 10Andrew Bogott)
[19:36:10] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[19:46:46] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] Make one user out of 3 use php7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534610 (https://phabricator.wikimedia.org/T219150) (owner: 10Giuseppe Lavagetto)
[19:53:24] <wikibugs>	 (03PS1) 10Andrew Bogott: update nic names for cloudvirt1021 and cloudvirt1022 [puppet] - 10https://gerrit.wikimedia.org/r/534669 (https://phabricator.wikimedia.org/T229873)
[19:54:02] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] update nic names for cloudvirt1021 and cloudvirt1022 [puppet] - 10https://gerrit.wikimedia.org/r/534669 (https://phabricator.wikimedia.org/T229873) (owner: 10Andrew Bogott)
[20:05:58] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1266 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:07:22] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1266 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 591 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:11:44] <wikibugs>	 (PS1) Jhedden: openstack: Add codfw1dev glance API to haproxy [puppet] - https://gerrit.wikimedia.org/r/534680 (https://phabricator.wikimedia.org/T223907)
[20:18:48] <wikibugs>	 (CR) Jhedden: "PCC results: https://puppet-compiler.wmflabs.org/compiler1001/18191/" [puppet] - https://gerrit.wikimedia.org/r/534680 (https://phabricator.wikimedia.org/T223907) (owner: Jhedden)
[20:21:35] <wikibugs>	 (PS1) Andrew Bogott: openstack scheduler: update comments for cloudvirts [puppet] - https://gerrit.wikimedia.org/r/534681 (https://phabricator.wikimedia.org/T229873)
[20:26:58] <wikibugs>	 Operations, Code-Stewardship-Reviews, Graphoid, Core Platform Team Legacy (Watching / External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (dr0ptp4kt) We're mulling this over still.
[20:28:15] <wikibugs>	 Operations, Commons, MediaWiki-File-management, Multimedia, and 4 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (Krinkle)
[20:29:39] <wikibugs>	 Operations, Commons, MediaWiki-File-management, Multimedia, and 6 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (Krinkle)
[20:29:50] <wikibugs>	 Operations, Commons, MediaWiki-File-management, Multimedia, and 6 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (Krinkle) Tagging Multimedia for possible CR of <https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/532096/>.
[20:47:18] <wikibugs>	 (PS2) Andrew Bogott: openstack scheduler: update comments for cloudvirts [puppet] - https://gerrit.wikimedia.org/r/534681 (https://phabricator.wikimedia.org/T229873)
[20:47:20] <wikibugs>	 (PS1) Andrew Bogott: cloudvirt1023: rename network interfaces [puppet] - https://gerrit.wikimedia.org/r/534682 (https://phabricator.wikimedia.org/T229871)
[20:48:14] <wikibugs>	 (CR) Andrew Bogott: [C: +2] cloudvirt1023: rename network interfaces [puppet] - https://gerrit.wikimedia.org/r/534682 (https://phabricator.wikimedia.org/T229871) (owner: Andrew Bogott)
[21:08:16] <wikibugs>	 (PS3) Andrew Bogott: openstack scheduler: update comments for cloudvirts [puppet] - https://gerrit.wikimedia.org/r/534681 (https://phabricator.wikimedia.org/T229873)
[21:08:18] <wikibugs>	 (PS1) Andrew Bogott: cloudvirt1023: rename interfaces, again [puppet] - https://gerrit.wikimedia.org/r/534684 (https://phabricator.wikimedia.org/T229871)
[21:09:20] <wikibugs>	 (CR) Andrew Bogott: [C: +2] cloudvirt1023: rename interfaces, again [puppet] - https://gerrit.wikimedia.org/r/534684 (https://phabricator.wikimedia.org/T229871) (owner: Andrew Bogott)
[21:12:01] <James_F>	 jouncebot: next
[21:12:01] <jouncebot>	 In 1 hour(s) and 47 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190905T2300)
[21:12:14] <wikibugs>	 (PS1) CRusnov: netbox: fix role includes for really reals [puppet] - https://gerrit.wikimedia.org/r/534685
[21:12:38] <James_F>	 Krinkle: How do you feel about me pushing the write-JSON change out to prod? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/533592
[21:21:33] <Krinkle>	 James_F: checking
[21:21:44] <James_F>	 Thanks
[21:25:28] <wikibugs>	 (CR) Ayounsi: [C: +1] "change and compiler look good." [puppet] - https://gerrit.wikimedia.org/r/534685 (owner: CRusnov)
[21:25:53] <wikibugs>	 (CR) CRusnov: [C: +2] netbox: fix role includes for really reals [puppet] - https://gerrit.wikimedia.org/r/534685 (owner: CRusnov)
[21:26:07] <wikibugs>	 (PS2) CRusnov: netbox: fix role includes for really reals [puppet] - https://gerrit.wikimedia.org/r/534685
[21:29:40] <wikibugs>	 (CR) Jforrester: Variant configuration: Read from JSON, not serialised PHP (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/533593 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[21:30:07] <wikibugs>	 (CR) Krinkle: Variant configuration: Write to static (JSON) as well as serialised cache (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[21:33:23] <Krinkle>	 James_F: Wanna do IS => array, as a first step?
[21:34:46] <Krinkle>	 also curious whether we'd be able to actually disuse '+foo'. Seems doable, but I don't know if there are cases where we really need it within wgConf vs doing it in CommonSettings.php afterwards.
[21:34:55] <logmsgbot>	 !log crusnov@deploy1001 Started deploy [netbox/deploy@367ca84]: test deploy for netbox split - again
[21:35:07] <logmsgbot>	 !log crusnov@deploy1001 Finished deploy [netbox/deploy@367ca84]: test deploy for netbox split - again (duration: 00m 12s)
[21:35:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:49] <James_F>	 Krinkle: I worry about converting IS to an array too early.
[21:36:01] <wikibugs>	 (CR) Jforrester: [C: -1] "> Patch Set 3:" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[21:36:26] <James_F>	 Krinkle: Sadly we're running HHVM and PHP72 not PHP73 so we can't use JSON_THROW_ON_ERROR
[21:37:50] <Krinkle>	 James_F: yeah, I tested that on 3v4l.org before I submitted my comment to see if it would make a difference. It means it'll throw instead of returning false for invalid utf8, but it still doesn't communicate in any way about invalid values like functions or non-std class instances
[21:38:08] <James_F>	 Yes, but we don't use functions or class instances.
[21:38:22] <James_F>	 And in the medium term it'll be impossible to try, as it'll be configured in YAML.
[21:38:32] <Krinkle>	 Right
[21:38:49] <Krinkle>	 So authoring in YAML or JSON would be great. But I conflated those with what the expanded format is.
[21:38:58] <Krinkle>	 I mixed them up in my mind.
[21:39:04] <Krinkle>	 Why switch to .json for the compiled format?
[21:39:08] <James_F>	 Authoring in YAML, converting to JSON.
[21:39:33] <James_F>	 Because the compiled format will be committed in git, and this way it'll (a) not vary by PHP run time and (b) be manually inspectable as to the outcome.
[21:40:07] <James_F>	 Essentially, a poor man's T220775.
[21:40:07] <stashbot>	 T220775: Consider creating a puppet-compiler equivalent for mediawiki-config.git - https://phabricator.wikimedia.org/T220775
[21:40:21] <Krinkle>	 serialised php doesn't vary by PHP run time. We only do that now because we allow changing config itself by HHVM e.g. in Setup.php and extension hooks.
[21:40:29] <Krinkle>	 But yes, human readable expansion matters.
[21:40:41] <Krinkle>	 we can use static arrays for that, like we're doing for interwiki, wikiversions and (soon) localisation cache.
[21:40:48] <James_F>	 Yes.
[21:41:04] <Krinkle>	 would parse quicker than JSON, and forgoes the need for APC
[21:41:08] <Krinkle>	 because it'll be in opcache
[21:41:13] <James_F>	 Eh.
[21:41:23] <James_F>	 "Quicker" in terms of nanoseconds.
[21:42:17] <logmsgbot>	 !log crusnov@deploy1001 Started deploy [netbox/deploy@367ca84]: deploy for netbox split T223291
[21:42:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:20] <logmsgbot>	 !log crusnov@deploy1001 Finished deploy [netbox/deploy@367ca84]: deploy for netbox split T223291 (duration: 00m 03s)
[21:42:20] <stashbot>	 T223291: Netbox: move it to dedicated Ganeti VMs - https://phabricator.wikimedia.org/T223291
[21:42:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:24] <James_F>	 Essentially, this is the reverse of wikiversions.json vs. wikiversions.php.
[21:43:18] <Krinkle>	 difference between unserialize and json_decode was 0.1 ms, not much indeed. The file read is about 1 ms, which would also be skipped.
[21:43:23] <Krinkle>	 but yeah, not much either way.
[21:43:33] <James_F>	 But we already do the file read, right?
[21:43:40] <James_F>	 It's not currently opcached.
[21:43:53] <Krinkle>	 HHVM has an elaborate file stat cache, which we will soon lose.
[21:43:57] <James_F>	 So the marginal difference for now is small.
[21:44:01] <James_F>	 Oh, yes, true.
[21:44:19] <Krinkle>	 There is a task about fixing ExtensionProcessor to not read JSON and make mtime stat calls as much.
[21:44:22] <Krinkle>	 I need to fix that.
[21:44:33] <Krinkle>	 it doesn't scale currently for N extensions.
[21:44:37] <Krinkle>	 1 config file is fine though.
[21:45:02] <wikibugs>	 (PS4) Jforrester: Variant configuration: Write to static (JSON) as well as serialised cache for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602)
[21:45:14] <James_F>	 Oh, the "enable wgStoreMTime" or whatever task?
[21:45:38] <Krinkle>	 but yeah, we could go from ~0 file reads on HHVM with stat cache to 1 file read on PHP72 + JSON parse (+1 ms, +0.1 ms), or go to 0 file reads and also skip the ~0.1 ms for unserialize/json_decode with a static array file in opcache.
[21:45:54] <Krinkle>	 or we can go to PHP72 with no json_decode or file read if we use APCu and an mtime check only
[21:46:16] <James_F>	 T187154
[21:46:17] <stashbot>	 T187154: Consider enabling wgExtensionInfoMTime in wmf-production - https://phabricator.wikimedia.org/T187154
[21:46:22] <Krinkle>	 Yeah, that one.
[21:46:39] <Krinkle>	 I can see that portion growing in the flame graph when we got more PHP72 traffic
[21:46:53] <Krinkle>	 initially a bit random on excimer due to small sampling
[21:46:55] <Krinkle>	 more obvious now
[21:46:57] <James_F>	 Well, we're about to go to 1/3rd PHP72.
[21:47:01] <James_F>	 So…
[21:47:09] <James_F>	 It's going to get worse quite quickly.
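Editor's sketch, not part of the log: the last option Krinkle lists at 21:45 (keep the parsed config in a cache and do only an mtime check per request) can be illustrated roughly as below. It is written in Python rather than PHP, a plain dict stands in for APCu, and the names `load_config`, `_cache` and `stats` are hypothetical. It shows the caching pattern only and makes no claim about how mediawiki-config implements it.

```python
import json
import os

# Hypothetical stand-in for APCu: path -> (mtime in ns, parsed config).
_cache = {}
# Counts how often the file was actually read and parsed.
stats = {"parses": 0}


def load_config(path):
    """Return the parsed JSON config, re-reading only when the mtime changes."""
    mtime = os.stat(path).st_mtime_ns  # the only filesystem call on a cache hit
    hit = _cache.get(path)
    if hit is not None and hit[0] == mtime:
        return hit[1]
    # Cache miss or stale entry: pay for the file read and the JSON parse once.
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    stats["parses"] += 1
    _cache[path] = (mtime, data)
    return data
```

On a cache hit the only filesystem work is the single `os.stat` call, which is the trade-off described in the discussion: no file read and no JSON parse, at the cost of one stat per lookup.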
[21:49:00] <wikibugs>	 (PS5) Jforrester: Variant configuration: Write to static (JSON) as well as serialised cache for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602)
[21:52:35] <wikibugs>	 (PS3) Zoranzoki21: Set noindex for user and user_talk on zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/534471 (https://phabricator.wikimedia.org/T231982)
[21:52:44] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:04:40] <icinga-wm>	 PROBLEM - netbox HTTPS on netbox1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 312 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Netbox
[22:05:00] <chaomodus>	 expected, downtiming
[22:05:14] <icinga-wm>	 PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:07:10] <icinga-wm>	 PROBLEM - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:10:28] <wikibugs>	 (PS1) CRusnov: netbox: add netbox hosts to dsh host list. [puppet] - https://gerrit.wikimedia.org/r/534692
[22:10:54] <Krinkle>	 James_F: oh my
[22:10:54] <Krinkle>	 https://performance.wikimedia.org/arclamp/svgs/daily/2019-09-04.excimer.load.svgz
[22:11:06] <Krinkle>	 13% (!) is spent in ExtensionRegistry::loadFromQueue
[22:11:06] <wikibugs>	 (CR) jerkins-bot: [V: -1] netbox: add netbox hosts to dsh host list. [puppet] - https://gerrit.wikimedia.org/r/534692 (owner: CRusnov)
[22:11:14] <Krinkle>	 That's php72 only
[22:11:43] <James_F>	 Krinkle: That is definitely not great.
[22:11:43] <Krinkle>	 It's ~0% on HHVM (not sampled at all over 24 hours, so very tiny)
[22:12:33] <James_F>	 Right.
[22:22:40] <icinga-wm>	 PROBLEM - Check systemd state on netboxdb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:25:37] <wikibugs>	 (PS2) CRusnov: netbox: add netbox hosts to dsh host list. [puppet] - https://gerrit.wikimedia.org/r/534692
[22:38:11] <wikibugs>	 (CR) Ayounsi: [C: +1] "code and PCC looks good." [puppet] - https://gerrit.wikimedia.org/r/534692 (owner: CRusnov)
[22:38:15] <wikibugs>	 (PS1) Jforrester: Stop setting wgCookieSetOnAutoBlock and wgCookieSetOnIpBlock to the default; never varied [mediawiki-config] - https://gerrit.wikimedia.org/r/534698 (https://phabricator.wikimedia.org/T191922)
[22:38:34] <wikibugs>	 (CR) CRusnov: [C: +2] netbox: add netbox hosts to dsh host list. [puppet] - https://gerrit.wikimedia.org/r/534692 (owner: CRusnov)
[22:57:36] <wikibugs>	 (PS5) Reedy: Require that passwords are not in the most common 100k list for all users [mediawiki-config] - https://gerrit.wikimedia.org/r/479574 (https://phabricator.wikimedia.org/T151425) (owner: Jforrester)
[22:59:43] <wikibugs>	 (PS1) CRusnov: netbox: Undo some mistakes in the netbox user [puppet] - https://gerrit.wikimedia.org/r/534703
[23:00:04] <jouncebot>	 MaxSem, RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190905T2300).
[23:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[23:00:41] <wikibugs>	 (CR) CRusnov: [C: +2] netbox: Undo some mistakes in the netbox user [puppet] - https://gerrit.wikimedia.org/r/534703 (owner: CRusnov)
[23:00:49] <wikibugs>	 (CR) Jforrester: "Oh, right, we said we'd do this today. Let's roll?" [mediawiki-config] - https://gerrit.wikimedia.org/r/479574 (https://phabricator.wikimedia.org/T151425) (owner: Jforrester)
[23:01:08] <Reedy>	 James_F: If you want to we can
[23:01:09] * Reedy grins
[23:01:19] <Reedy>	 Might as well get it done when we said we would
[23:01:46] <wikibugs>	 (CR) Jforrester: [C: +2] "Service, with a smile." [mediawiki-config] - https://gerrit.wikimedia.org/r/479574 (https://phabricator.wikimedia.org/T151425) (owner: Jforrester)
[23:02:45] <wikibugs>	 (Merged) jenkins-bot: Require that passwords are not in the most common 100k list for all users [mediawiki-config] - https://gerrit.wikimedia.org/r/479574 (https://phabricator.wikimedia.org/T151425) (owner: Jforrester)
[23:03:06] <paladox>	 lol ^
[23:03:57] <wikibugs>	 (CR) jenkins-bot: Require that passwords are not in the most common 100k list for all users [mediawiki-config] - https://gerrit.wikimedia.org/r/479574 (https://phabricator.wikimedia.org/T151425) (owner: Jforrester)
[23:04:34] <James_F>	 Reedy: Live on mwdebug1002 if you want to test?
[23:05:00] <Reedy>	 I don't see much point testing it...
[23:07:53] <James_F>	 Well, I can definitely log out and log back in both of my prod accounts.
[23:08:29] <James_F>	 Let's go.
[23:09:13] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T151425 Require that passwords are not in the most common 100k list for all users (duration: 00m 48s)
[23:09:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:09:32] <stashbot>	 T151425: Enlarge Popular Password File to 100,000 entries and enforce the new minimum in the config - https://phabricator.wikimedia.org/T151425
[23:09:54] <Reedy>	 <3
[23:12:47] <logmsgbot>	 !log ayounsi@deploy1001 Started deploy [netbox/deploy@367ca84]: test
[23:12:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:29] <logmsgbot>	 !log ayounsi@deploy1001 Finished deploy [netbox/deploy@367ca84]: test (duration: 00m 42s)
[23:13:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:14:53] <wikibugs>	 (PS1) Bstorm: sssd: Add some new images to test sssd in containers [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/534704 (https://phabricator.wikimedia.org/T229058)
[23:15:31] <wikibugs>	 (CR) jerkins-bot: [V: -1] sssd: Add some new images to test sssd in containers [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/534704 (https://phabricator.wikimedia.org/T229058) (owner: Bstorm)
[23:18:12] <wikibugs>	 (PS6) Jforrester: Variant configuration: Write to static (JSON) as well as serialised cache for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602)
[23:21:22] <James_F>	 Reedy: Do we want to set MinimumPasswordLengthToLogin to 10 for priv groups (right now it's just +staff)?
[23:21:55] <Reedy>	 I think we do at some point, for sure
[23:22:06] <Reedy>	 Do we need some communications for that first? Likely
[23:22:07] <wikibugs>	 (PS1) Jforrester: Drop PasswordCannotBePopular compatibility hack, no longer needed [mediawiki-config] - https://gerrit.wikimedia.org/r/534706
[23:22:08] <James_F>	 But not right now?
[23:22:18] <James_F>	 Eh. You're such a goody-goody.
[23:22:22] <Reedy>	 Heh
[23:22:25] <Reedy>	 I mean
[23:22:34] <Reedy>	 Part of me would love to see how many people it affected...
[23:23:04] <Reedy>	 It depends how you read https://meta.wikimedia.org/wiki/Password_policy for example
[23:23:09] <Reedy>	 Password requirements for privileged users:
[23:23:09] <Reedy>	 Must be at least 10 characters
[23:23:20] <Reedy>	 I'd see must... As in, MW will make you
[23:23:35] <Reedy>	 So, in some regards, it's literally following the policy... So nothing to actually announce?
[23:23:51] <wikibugs>	 (PS1) Jforrester: Set MinimumPasswordLengthToLogin to 10 for all prived groups, not just +staff [mediawiki-config] - https://gerrit.wikimedia.org/r/534707
[23:24:08] <James_F>	 We can announce and wait another week?
[23:24:20] <James_F>	 MinimumPasswordLengthToLogin is a bit aggressive.
[23:24:39] <James_F>	 I'm not sure how well it works on API login, e.g. the apps.
[23:24:40] <wikibugs>	 (CR) Reedy: [C: +1] "As per https://meta.wikimedia.org/wiki/Password_policy" [mediawiki-config] - https://gerrit.wikimedia.org/r/534707 (owner: Jforrester)
[23:24:45] <Reedy>	 That seems reasonable
[23:25:03] <Reedy>	 Helps remove more cruft and edge cases from CS
[23:28:04] <James_F>	 https://meta.wikimedia.org/w/index.php?diff=19355540&oldid=19355349&title=Tech/News/2019/37&diffmode=visual
[23:50:16] <wikibugs>	 (PS2) Bstorm: sssd: Add some new images to test sssd in containers [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/534704 (https://phabricator.wikimedia.org/T229058)
[23:50:35] <wikibugs>	 (CR) Krinkle: Variant configuration: Write to static (JSON) as well as serialised cache for testwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[23:51:59] <wikibugs>	 (CR) Jforrester: Variant configuration: Write to static (JSON) as well as serialised cache for testwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)